Mistral Devstral 2 hits 72.2% SWE‑Bench – 24B laptop coder rivals giants


Executive Summary

Mistral showed up to the coding race with receipts, not vibes. Devstral 2 (123B params) and Devstral Small 2 (24B) both ship as open‑weight coders with 256K context and FP8 checkpoints, posting 72.2% and 68.0% on SWE‑Bench Verified—within a few points of proprietary staples like Claude 4.5 Sonnet and GPT‑5.1 Codex Max. The twist: the 24B Small model is roughly 28× smaller than some DeepSeek‑class flagships yet lands in the same accuracy band, and it’s Apache 2.0, laptop‑deployable, and very privacy‑friendly.

What sets this apart from yet another open‑model drop is the stack around it. Mistral shipped Vibe CLI as an open, repo‑aware terminal agent—plan → read → edit → run → summarize—where all prompts and tools live in Markdown, begging to be forked. Day‑zero support from vLLM (with a dedicated tool‑calling parser), Zed’s new Vibe Agent Server, AnyCoder’s model picker, and Kilo Code’s IDE (free Devstral usage all December, after quietly running a pre‑release “Spectre” build) means you can trial this in real workflows without writing glue.

Builders are already tagging Devstral Small 2 as “SOTTA” (state of the tiny art) and treating it as the default self‑hosted coder, while grumbling about the big model’s revenue cap for $20M+/month companies. Net effect: if you’ve been leaning on DeepSeek or closed coders, Devstral is now a serious, open toggle in your production dropdown.

Feature Spotlight

Feature: Mistral’s Devstral 2 + Vibe CLI push open‑source coding to SOTA

Mistral ships Devstral 2 (123B) and Devstral Small 2 (24B) plus the Vibe CLI—open SOTA coding with 72.2%/68.0% SWE‑bench Verified, 256K context, FP8 weights, and a repo‑aware terminal agent.


🛠️ Feature: Mistral’s Devstral 2 + Vibe CLI push open‑source coding to SOTA

Biggest cross‑account story today. New open‑weight coding models (123B, 24B) with 256K context and a native terminal agent. Multiple third‑party benchmarks, tools, and day‑0 serving surfaced in the sample.

Devstral 2 hits 72.2% SWE‑Bench and pushes tiny‑model efficiency

On SWE‑Bench Verified, Devstral 2 scores 72.2% and Devstral Small 2 68.0%, putting them at or near the top of all open-weight coding models and close to proprietary coders like Claude 4.5 Sonnet (77.2%) and GPT‑5.1 Codex Max (77.9%).

SWE‑Bench bar chart

That 68.0% from the 24B Small model is particularly notable, matching or beating much larger open models while remaining realistically laptop‑deployable. benchmark chart

A separate “SWE‑Bench vs model size” scatterplot shows Devstral 2 and Small 2 clustered in the top left—high accuracy from models far smaller than rivals like Kimi K2 or DeepSeek V3.2—earning the community nickname “SOTTA” (state of the tiny art). efficiency plot

efficiency scatterplot

Third‑party evals using the Cline framework report Devstral 2 winning or tying DeepSeek V3.2 in about 71% of coding tasks, though Claude Sonnet 4.5 still wins more than half the time while costing up to ~7× more per solved task in those tests. (cost comparison, cline comparison) For teams, the message is clear: Devstral isn’t the single best coder on earth, but its accuracy‑per‑parameter and accuracy‑per‑dollar make it a very strong default for open, self‑hosted coding agents.

Mistral launches Devstral 2 coding family with 123B and 24B models

Mistral has released its Devstral 2 coding family: a 123B-parameter Devstral 2 model under a modified MIT license and a 24B Devstral Small 2 model under Apache 2.0, both with 256K context and FP8 weights for long-horizon coding and agents. launch thread The smaller 24B model is being highlighted as “laptop class” yet hitting top-tier coding scores, offering a Western open-source alternative that’s ~28× smaller than DeepSeek’s flagship while preserving privacy via local deployment. community overview

Devstral 2 collection screenshot

For AI engineers this means you can now self-host a frontier-level coding model with a clear license split: the big 123B for clusters and the 24B for on-prem or even single-GPU setups, both tuned for agentic tool use. The launch blog and Hugging Face collection spell out FP8 checkpoints, 256K context, and intended use as the backbone for coding agents rather than a general chat model. (launch blog, model collection) The main trade-off is that Devstral 2’s modified MIT license caps usage for companies above a $20M/month revenue line, which some see as constraining adoption in larger orgs. license concern

Devstral lands in vLLM, Zed, AnyCoder and Kilo on day one

Within days of launch, Devstral 2 is already wired into key tools: vLLM exposes a serve recipe with Mistral’s tool parser, Zed ships a Mistral Vibe Agent Server, AnyCoder surfaces Devstral in its model picker, and Kilo Code makes both Devstrals free for December. (vllm serve snippet, zed integration, anycoder screenshot, kilo announcement)

Devstral vs peers chart

For AI engineers this quick ecosystem uptake reduces the friction of evaluating Devstral in real workflows: you can try it as a backend in vLLM, as an IDE agent in Zed or Kilo, or as an app‑builder brain in AnyCoder without writing glue code. The pattern is similar to how GLM‑4.6V or DeepSeek models spread: strong open‑weight performance plus permissive tooling means the model becomes a standard option in editors and agent frameworks rather than a one‑off experiment. This widespread support is what will decide whether Devstral ends up as a go‑to coding backend or another short‑lived leaderboard spike.

Mistral Vibe CLI turns Devstral into a repo-aware coding agent

Alongside the models, Mistral shipped Vibe CLI, a Python/Textual-based terminal app that wraps Devstral into a full coding agent which can scan a repo, plan work, edit multiple files, run commands, and even open a local game as a test task. launch thread The core prompts, tools (bash, grep, file ops, search/replace, TODOs), and conversation summarizer live in markdown, making it easy for developers to inspect and fork the agent’s behavior. Vibe writeup

Vibe CLI interface

Early demos show Vibe being asked to "make a fun game and run a server" and then generating a full HTML/JS Snake game plus a dev server, all from the terminal. snake game example For AI engineers, this is a reference implementation of an agentic coding loop—plan → read → edit → run → summarize—that’s both open-source and already wired for long-context Devstral, useful as a starting point for custom internal agents or for studying prompt and tool design at scale. Vibe blog post

Community touts Devstral Small 2 as laptop‑class ‘state of the tiny art’

Early community reaction is that Devstral Small 2 hits a sweet spot: at 24B parameters it can run locally on a decent laptop yet still scores 68% on SWE‑Bench Verified, making it competitive with much larger models for real coding work. community overview Commentators frame it as everything you need for “100% privacy and unlimited vibe coding,” with Western open source “so back” after a lull. (shipping praise, fp8 praise)

Devstral size vs performance

People are also picking up on the “SOTTA” framing from the scatterplot—Devstral 2 and Small 2 live in the high‑accuracy, small‑model corner—which reassures teams that these aren’t just benchmark‑tuned curiosities but efficient workhorses. efficiency plot The main caveats bubbling up are around Devstral 2’s more restrictive license for large companies and some concern that yet another CLI coding agent (Vibe) increases fragmentation, but overall the mood from builders is that Devstral has put Mistral firmly back in the serious coding race. (license concern, cli fatigue)

Kilo Code makes Devstral 2 and Small 2 free for December

Kilo Code confirmed that its previously “stealth” Spectre model was a pre‑release Devstral variant and is now swapping it for the official Devstral 2 and Devstral Small 2, both free for all Kilo users through December. kilo announcement The IDE’s model selector already lists “Devstral 2512 (free)” and “Devstral Small 2512 (free)” as options for coding assistance.

Kilo Devstral model menu

For developers who don’t want to manage their own serving stack, this is one of the easiest ways to get hands‑on time with Devstral in a production‑grade coding environment and compare it to GitHub Copilot‑style flows. It also hints that Kilo had enough positive results from the pre‑release weights to bet their default experience on Devstral for at least a month, which is a useful endorsement if you’re considering it for your own agent stack. kilo commentary

vLLM ships day‑0 Devstral‑2‑123B serving recipe with tool parser

vLLM now supports serving the Devstral‑2‑123B‑Instruct model out of the box, including a --tool-call-parser mistral flag and auto‑tool selection, making it easy to drop Devstral into existing tool‑calling stacks. vllm serve snippet The recommended config uses FP8 weights with tensor parallelism across 8 GPUs, targeting high‑throughput agent workloads rather than single‑GPU hobby use.

vLLM serve command

If you’re already using vLLM for tools or agents, this means Devstral 2 is one CLI away from production experiments, with no custom backends required. You get Mistral’s tool‑calling conventions interpreted correctly by the parser, which matters for multi‑step coding agents that juggle file edits, shell commands, and web fetches. This is also a good reference for others wiring FP8 100B‑scale models into vLLM, since the snippet shows the practical parallelism and flags needed.
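For a sense of what “one CLI away” means in practice, here is a minimal sketch that exercises a locally served Devstral through vLLM’s OpenAI‑compatible endpoint; the server URL, model id, and the run_shell tool are illustrative assumptions, not part of vLLM’s published recipe.

```python
# Minimal sketch: call Devstral 2 behind vLLM's OpenAI-compatible server.
# Assumes the server was launched per vLLM's Devstral recipe (with
# --enable-auto-tool-choice and --tool-call-parser mistral); the model id and
# tool below are placeholders, not official names.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",  # hypothetical tool your agent harness would execute
        "description": "Run a shell command in the repo and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Devstral-2-123B-Instruct",  # placeholder: use the name your server registers
    messages=[{"role": "user", "content": "List the failing tests in this repo."}],
    tools=tools,
    tool_choice="auto",
)

# With the mistral tool parser enabled, tool calls come back as structured
# tool_calls instead of free-form text you have to parse yourself.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```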

AnyCoder exposes Devstral Medium as a selectable build model

AnyCoder’s “Build with AnyCoder” UI now offers “Devstral Medium 2512” in its model dropdown alongside Gemini 3, DeepSeek, Qwen, Grok and others, giving users a one‑click way to route app‑generation requests through Devstral. anycoder screenshot

AnyCoder model picker

This matters less as a technical milestone and more as a distribution one: tools like AnyCoder normalize Devstral as a peer to the big frontier models in real coding workflows. For teams comparing models inside the same front‑end, this will make it easier to A/B Devstral against proprietary options on identical tasks instead of relying on isolated benchmarks.

Zed exposes Mistral Vibe as a plug‑and‑play coding agent

The Zed editor now lists “Mistral Vibe” as an Agent Server extension, meaning you can point Zed at a Devstral‑backed coding agent by just installing the extension and pasting in a Mistral API key. zed integration That turns Vibe into a first‑class companion inside Zed’s UI, rather than a separate terminal window.

Zed Mistral Vibe extension

For engineers already living in Zed, this is a low‑friction way to trial Devstral‑powered agentic coding without rebuilding their environment: the extension wraps prompts, tool use, and repo context for you. It’s also a signal that Devstral is quickly joining the short list of models editors ship direct support for (alongside things like Claude Code and Gemini CLI), which will influence which models teams standardize on for day‑to‑day development.


🔌 Open agent standards: MCP donated to Linux Foundation’s AAIF

Standards and interop took center stage today. Anthropic donates MCP to the Agentic AI Foundation (with OpenAI and Block), and builders show MCP Apps and new MCP tools. Excludes Mistral Devstral which is covered as the feature.

MCP moves under Linux Foundation’s Agentic AI Foundation

Anthropic is donating the Model Context Protocol (MCP) to the new Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation co‑founded with OpenAI and Block, with support from Google, Microsoft, AWS, Cloudflare, and Bloomberg. anthropic announcement This shifts MCP’s governance to a neutral, open‑source home while the original maintainers keep running the spec and SDKs.

MCP one year timeline

Alongside MCP, OpenAI is contributing AGENTS.md (a standard repo‑level instruction file for agents) and Block is contributing goose (an on‑device agent framework) so the foundation starts with three core building blocks for cross‑vendor agents. (openai aaif post, goose and agentsmd) The first‑year MCP timeline shows how much is already in play: 10k+ active public MCP servers, enterprise hosting from Google Cloud, Azure, and Cloudflare, plus ~97M monthly SDK downloads across Python and TypeScript. mcp metrics graphic In practice MCP is already wired into Claude, ChatGPT, Gemini, and Microsoft Copilot, and AAIF formalizes that into something closer to “HTTP for tools” rather than yet another proprietary plugin system. (community reaction, timeline commentary) For engineers and platform leads, this makes it much safer to invest in MCP servers (tools) knowing they’ll work across agents from multiple labs and be steered by a shared standards body instead of one vendor’s roadmap.

MCP Apps spec adds a shared UI layer for agent tools

A new proposal, SEP‑1865 "MCP Apps", lands in the MCP repo to standardize how servers expose interactive UIs—think HTML dashboards or control panels—alongside tools. spec pull request The draft introduces a ui:// URI scheme and an HTML-based resource type (text/html+mcp), plus a JSON‑RPC bridge so an agent host can render a server‑provided UI and send user interactions back over the same MCP connection. github proposal Anthropic is already experimenting with this idea in Claude’s web UI, where an MCP Apps toggle surfaces recipe‑like interfaces (e.g. a pixel‑avatar generator with prompt + examples) instead of raw tool calls. claude apps leak For builders, this moves MCP beyond “CLI‑style tools only” toward portable mini‑apps: the same server could show a configuration panel or workflow view inside Claude, ChatGPT, or any future AAIF‑compatible client without each vendor inventing a bespoke plugin UI format. It also creates a natural place to put richer affordances (previews, form validation, multi‑step flows) while keeping the underlying protocol and tool semantics shared.
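As a rough illustration of the draft (not the final spec), here are the kinds of payload shapes SEP‑1865 gestures at, written as plain Python dicts; field names follow existing MCP resource conventions plus the proposal’s ui:// scheme and text/html+mcp type, and may change as the SEP evolves.

```python
# Illustrative only: approximate shape of an MCP Apps-style UI resource and a
# tool result that embeds it, based on the SEP-1865 draft described above.
ui_resource = {
    "uri": "ui://avatar-generator/panel",    # draft ui:// scheme
    "mimeType": "text/html+mcp",             # draft HTML-based resource type
    "text": "<form id='avatar'>...</form>",  # HTML the host renders in a sandboxed surface
}

tool_result = {
    "content": [
        {"type": "text", "text": "Opened the avatar generator panel."},
        # Embedding the resource ties the tool call to a UI the host can render;
        # user interactions flow back over the same MCP connection as JSON-RPC.
        {"type": "resource", "resource": ui_resource},
    ]
}
```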

Browser Use turns Skills into MCP tools and automates Instacart shopping

Browser Use has turned its “Skills” into first‑class MCP tools, effectively shipping an MCP server that can drive almost any website—including a new Instacart workflow. skills as tools In the Instacart example, the server exposes two MCP tools, Instacart: Search and Instacart: Add Item to Cart, so an agent like Claude can search local inventory, add items, and build a cart directly from natural‑language instructions.

You can clone those skills and point them at your own MCP client, turning grocery shopping into a standard tool‑call sequence instead of custom browser automation. instacart skills thread This is a concrete glimpse of how AAIF‑era standards play out at the edge: independent projects wrap real‑world web flows behind MCP servers, and any compliant agent—Claude Code, Codex, or future AAIF‑aligned runtimes—can reuse those tools without bespoke integrations. For AI teams, it suggests a pattern: describe repeatable web tasks as Skills, expose them via MCP, and let different agents orchestrate them rather than hard‑coding Selenium‑style scripts per app.
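Here is a minimal sketch of that “Skills as MCP tools” pattern using the MCP Python SDK’s FastMCP helper; the tool names and stubbed bodies are hypothetical stand‑ins, not Browser Use’s actual skills.

```python
# Minimal sketch: wrap a repeatable web task behind an MCP server so any
# compliant agent can call it. The tool names and stubbed internals are
# hypothetical; a real skill would drive a browser session instead.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("grocery-skills")

@mcp.tool()
def instacart_search(query: str, store: str = "default") -> str:
    """Search local inventory and return matching items as text."""
    return f"(stub) results for {query!r} at {store}"

@mcp.tool()
def instacart_add_item_to_cart(item_id: str, quantity: int = 1) -> str:
    """Add an item to the current cart."""
    return f"(stub) added {quantity}x {item_id} to cart"

if __name__ == "__main__":
    mcp.run()  # exposes both skills to any MCP client over stdio
```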


👩‍💻 Agent SDKs and coding ops: sandboxes, forks, and cloud workers

Practical agent engineering updates for teams shipping code: SDK safety and context scale, CLI/terminal UX, and remote agent runtime. Excludes Devstral/Vibe specifics (feature).

Claude Agent SDK adds 1M‑token Sonnet, sandboxing, and simpler TS v2 API

Anthropic upgraded the Claude Agent SDK with support for the 1M‑context Sonnet 4.5 variant, built‑in filesystem and network sandboxing, and a much simpler TypeScript v2 interface that reduces async/yield plumbing to send()/receive()/done() calls. (sdk launch thread, context window note) This matters if you’re trying to ship robust agents: you can now point Sonnet 4.5 at full codebases or hundreds of docs via the SDK, run tools in a locked‑down environment, and wire up agents in TS without juggling async generators or custom coordination loops. Anthropic also calls out that sandboxing is configurable (filesystem and network isolation), which gives you a concrete knob for production risk reduction instead of shelling out blindly. sandbox docs Builders are already saying the new TS surface is “so good” and planning migrations, which is a good sign that this won’t stay a niche API. (ts v2 praise, sdk reference)

Claude Code mishap nukes a user’s home directory, highlighting agent safety gaps

A Claude Code user on Reddit reports that a coding session issued rm -rf tests/ patches/ plan/ ~/, which wiped their entire home directory after they ran with --dangerously-skip-permissions. Anthropic’s own log view later calls this “really bad,” explicitly pointing out that ~/ means “your entire home directory.” reddit incident

Claude Code deletion screenshot

For agent builders, this is the nightmare failure mode: a shell‑capable agent plus a bypassed permission system equals irreversible data loss. The lesson is not that Claude Code is uniquely unsafe, but that any tool‑calling agent with file or shell access must default to conservative guardrails, keep a clear separation between production and scratch repos, and never encourage flags like --dangerously-skip-permissions for everyday use. People are now advising explicit backups, running agents inside throwaway worktrees or containers, and treating privileged modes as you would sudo in prod—not as a performance tweak. (reddit link, nuke anecdote)
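As one concrete flavor of “conservative guardrails,” here is a small, self‑contained sketch of a pre‑execution check that refuses recursive deletes aimed at protected paths; treat it as illustrative defense‑in‑depth, not a replacement for sandboxes, containers, or the agent’s own permission system.

```python
# Sketch of a pre-execution guardrail for a shell-capable agent: block rm -r
# variants whose targets resolve to protected paths before they hit the shell.
import os
import shlex

PROTECTED = {os.path.expanduser("~"), "/", "/home", "/etc"}

def is_destructive(command: str) -> bool:
    """Return True for recursive rm commands that touch a protected path."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable commands are blocked by default
    if not tokens or tokens[0] != "rm":
        return False
    flags = [t for t in tokens[1:] if t.startswith("-")]
    targets = [t for t in tokens[1:] if not t.startswith("-")]
    recursive = any("r" in f.lower() for f in flags)
    if not recursive:
        return False
    return any(
        os.path.realpath(os.path.expanduser(t)) in PROTECTED for t in targets
    )

assert is_destructive("rm -rf tests/ patches/ plan/ ~/")
assert not is_destructive("rm -rf ./build")
```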

Kilo Code Cloud Agents let devs run coding agents from any device

Kilo Code introduced Cloud Agents, a hosted runtime that lets you run its coding agent from a phone, borrowed laptop, or any browser, with every change auto‑committed back to your repo. cloud agents launch

Following up on Kilo’s earlier push to benchmark itself against GitHub Copilot and expose model choices in the IDE Kilo adoption, this is the next step toward treating the agent as a remote coworker: no local environment setup, no half‑applied edits, and a clean commit history by default. You authenticate in the web UI, choose the model stack (now including Devstral‑backed variants), and let the agent handle edits and commits from the cloud while you supervise. For teams, this makes it much easier to standardize on one agent runtime and avoid the “works on my laptop” problem when people connect from different machines or thin clients. sign in page

Warp adds agent-friendly forking and Git-style diff viewer in the terminal

Warp shipped two quality-of-life features for people leaning on coding agents: conversational "forking" of an agent session and a GitHub-style diff viewer built right into the terminal. (forking demo, diff viewer demo)

Forking lets you right‑click anywhere in a messy agent conversation, spawn a new branch from that point, and continue from the earlier, still‑good state—useful when an agent has gone off the rails but you don’t want to lose context. The diff viewer then gives you a one‑shot view of all file changes (or a diff vs master) with PR‑style previews, and you can even attach those diffs as context when you question an agent before pushing, which tightens the loop between code, review, and automation. This is the kind of UX that makes long‑running agent sessions feel less brittle and more like a tool you can safely explore with.

OpenCode gains MCP OAuth support through a community PR

OpenCode 1.0.37 quietly picked up full MCP OAuth support, and the maintainer notes the best part is they "didn't have to do anything"—a community contributor wired it up and even fixed bugs in an upstream library. oauth update This matters for anyone running a serious MCP tool stack: instead of baking API keys into config, you can now rely on proper OAuth flows when agents talk to external services, which is both safer and easier to rotate. It’s also a good example of the ecosystem effect around MCP: as more OSS tools adopt the protocol and auth patterns, you get a broader menu of plug‑and‑play servers without having to hand‑roll credentials and glue every time.

Droid adds `/review` command for branch and diff-aware code reviews

Factory’s Droid coding agent picked up a /review command that can inspect a branch, commit, or uncommitted changes and focus on custom instructions during the review. droid review feature The idea is straightforward: instead of manually pasting diffs or explaining what changed, you point /review at the right target and let the agent walk the patch, raise issues, and summarize risks. Early users are already calling Droid “best” among their options for this kind of workflow, which suggests the review UX is hitting a practical sweet spot for day‑to‑day use in real repos. (review announcement, user endorsement)


📊 Leaderboards and eval hygiene: Arena shifts, OCR bake‑off, context tests

Today’s sample leans into eval culture—new model entries, live feeds, and methodology explainers. Continues yesterday’s leaderboard narrative but with fresh entries and infra. Excludes Devstral metrics (feature).

Arena charts how the top 10 labs have shifted on its leaderboards through 2025

Arena published an end‑of‑year view of how the top 10 labs have moved on its leaderboards since early 2025, highlighting which providers are rising or slipping and asking users to submit their hardest prompts to keep pressure on the rankings.


Following up on occupational ranks where they introduced job‑style evals, this pushes the story from individual models to lab‑level performance over time, which is exactly the lens infra and product teams care about when picking a default stack.

For builders, the takeaway is that Arena is drifting from a one‑off benchmark into a longitudinal eval surface: you can now use it not only to A/B models, but to track vendor momentum and prioritize where to spend integration effort next. arena call for prompts

Context Arena MRCR shows Qwen3‑Next Thinking helps at 8K, hurts at 128K

Context Arena added qwen3‑next‑80b‑a3b and its :thinking variant to the MRCR multi‑needle retrieval tests, revealing a sharp trade‑off between short‑ and long‑context performance. On 2‑needle tasks at 8K, the Thinking model jumps to 81.0% vs 48.7% for Base, and on 4‑needle at 8K it’s 58.6% vs 33.1%, but by 128K the Base model retains more needles (e.g. 46.2% vs 41.5% on 2‑needle, 25.0% vs 18.0% on 4‑needle). mrcr thread So the pattern is: reasoning‑tuned decoding clearly helps focused, short‑context retrieval, yet doesn’t generalize to very long context windows where the vanilla model actually forgets less. If you’re building context‑heavy agents on Qwen3‑Next, this suggests routing short, tricky lookups to the Thinking variant while keeping Base for 64K–128K MRCR‑style workloads instead of assuming the "smarter" model is strictly better. context arena site
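A minimal sketch of that routing idea follows; the 32K cutoff and model ids are assumptions for illustration, not a Context Arena recommendation.

```python
# Route retrieval-heavy calls between the Base and Thinking variants based on
# prompt length. The 32K threshold is an illustrative assumption; tune it
# against your own MRCR-style evals.
def pick_qwen3_next_variant(prompt_tokens: int) -> str:
    base = "qwen3-next-80b-a3b"
    thinking = "qwen3-next-80b-a3b:thinking"
    # Per the MRCR results above: Thinking wins clearly at 8K,
    # while Base retains more needles at 128K.
    if prompt_tokens <= 32_000:
        return thinking
    return base

print(pick_qwen3_next_variant(8_000))    # -> qwen3-next-80b-a3b:thinking
print(pick_qwen3_next_variant(128_000))  # -> qwen3-next-80b-a3b
```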

Datalab launches OCR benchmark and eval service over ~8K multilingual pages

Datalab introduced an OCR benchmark and evaluation service that compares leading document‑vision models on roughly 8,000 real pages across many layouts, scripts, and languages, using an LLM‑as‑judge plus Bradley–Terry/ELO aggregation instead of raw token accuracy.

ocr usage charts


The public benchmark covers models like Chandra, Chandra small, olmOCR, DeepSeek, dots.ocr, and RolmOCR, and exposes both scores and page‑level outputs so you can inspect how models behave on messy scans, receipts, and low‑resource scripts. ocr breakdown They also launched "Datalab evals", a closed beta that runs the same pipeline on your own documents: you upload a corpus, they run pairwise matchups on shared H100/vLLM infra, and you get an ELO‑style ranking plus qualitative outputs per page. evals teaser For teams whose products live or die on OCR (IDP, RAG over PDFs, back‑office automation), this is a much healthier pattern than eyeballing a single leaderboard number—use the public benchmark to shortlist models, then pay for a one‑off eval on your own doc mix before committing. datalab blog
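To make the aggregation concrete, here is the generic Elo‑style update over pairwise judge verdicts; this is textbook rating math with placeholder model names, not Datalab’s actual pipeline.

```python
# Generic Elo-style aggregation over pairwise LLM-judge verdicts: each matchup
# records which model's OCR output the judge preferred on a given page.
from collections import defaultdict

def elo_rank(matchups, k=32, base=1000.0):
    """matchups: iterable of (winner, loser) model-name pairs."""
    ratings = defaultdict(lambda: base)
    for winner, loser in matchups:
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return dict(sorted(ratings.items(), key=lambda kv: -kv[1]))

# Hypothetical judge outcomes on a handful of pages:
print(elo_rank([("chandra", "olmocr"), ("chandra", "rolmocr"), ("olmocr", "rolmocr")]))
```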

ERNIE‑5.0‑Preview‑1103 cracks Text Arena’s top 20 with strong coding scores

Baidu’s ERNIE‑5.0‑Preview‑1103 has landed on the Arena Text leaderboard with an overall score of 1431, entering the top 20 in the most competitive arena. ernie overview It scores 1471 in the Software & IT Services occupational slice and 1464 on Coding, roughly on par with GPT‑5.1‑high and chat‑gpt‑4o respectively according to Arena’s breakdown. ernie overview For people routing workloads, this is a concrete signal that ERNIE’s preview build isn’t just a regional curiosity: it’s competitive on Western‑style professional and coding tasks, and now lives in the same eval surface you already use for OpenAI, Anthropic, Google, and others. Arena also exposes per‑field scores, so you can decide whether ERNIE belongs in a router for software/IT tasks without blindly trusting a single aggregate metric. arena compare link

LM Arena adds live per‑model creation feed for qualitative comparisons

LM Arena quietly shipped a per‑model live feed: you can now click any model and see a scrolling stream of recent user creations for that model, alongside its win‑rate stats.


The Gemini 3 Pro card, for example, shows its maintained lead on Website Arena plus real generated sites flowing in, which gives a much richer sense of "feel" than aggregate scores alone. feed announcement This matters for anyone choosing defaults for creative or UI‑heavy tasks—leaderboard edges of a few points don’t tell you whether a model’s style, failure modes, or latency match your product. The live feed turns Arena into something closer to a gallery plus scoreboard, making it easier to notice, say, that two models with similar ELOs actually have very different UX for the kinds of prompts your users send.

Hamel Husain posts an LLM evals primer video plus an "eval memes" page

Hamel Husain released a short video walking through LLM eval basics—what evals are for, why naive benchmarks mislead, and how to think about test sets and metrics when you’re shipping products, not papers.


It’s explicitly aimed at practitioners who need a mental model before diving into tooling.

He paired it with a tongue‑in‑cheek "eval memes" page that captures common pitfalls, like overfitting to a favorite benchmark or misreading a 100% pass rate as anything but a data problem, making the deeper points easier to remember and share internally. eval memes If you’re the person trying to convince a team to stop cargo‑culting leaderboards and start designing realistic evals, this duo is a lightweight resource you can circulate to reset the conversation.


💼 Enterprise GTM: CRO hire, telco pact, Accenture scale, and $140M for gen‑media

Clear signals on enterprise adoption and go‑to‑market: senior hires, multi‑year partnerships, and a sizable media infra raise. Builds on yesterday’s enterprise usage stats with concrete org moves.

Accenture and Anthropic build a 30k‑person Claude practice to move pilots into production

Anthropic and Accenture are expanding their partnership into a dedicated "Accenture Anthropic Business Group" that will train roughly 30,000 Accenture staff on Claude and package offerings to help enterprises move from AI pilots to full production accenture deal. Claude Code—Anthropic’s coding assistant which it says holds over half of the AI coding market—will be a core pillar, with a specific product aimed at helping CIOs scale coding agents safely across their organizations accenture press release.

For AI leaders, this matters because it addresses the real bottleneck: not models, but organizational plumbing. Accenture is effectively productizing Claude as a "full‑stack" enterprise platform—consulting, integration, governance, and pre‑built vertical solutions in regulated sectors like finance, healthcare, and the public sector. If you’re competing with them, you now have to beat a combined GTM where Anthropic provides the models and Accenture provides the change‑management machine.

The agreement also says something about Anthropic’s enterprise ambition. Rather than try to build a 30k‑person services org themselves, they’re hitching to a global SI that already owns executive relationships and transformation budgets. For engineers inside large enterprises, this is a preview of your next few years: third‑party consultants showing up with pre‑canned Claude patterns, guardrail libraries, and "reference architectures" that will strongly influence which tools you can use internally.

If you’re an independent vendor selling around Claude, treat this as both threat and opportunity: threat because Accenture+Anthropic can displace you from some greenfield deals; opportunity because you can target the long tail of cases they won’t custom‑build.

Menlo Ventures report pegs 2025 gen‑AI enterprise spend at $37B, with Anthropic leading

Menlo Ventures published a 12‑page survey of ~500 US enterprise execs estimating that generative AI has reached about $37B in annual software spend—roughly 6% of all software budgets in 2025, growing ~3.2× year‑over‑year menlo overview menlo report. Within that pie, they say Anthropic is now the #1 model provider in the enterprise with roughly 40% share of spend, pushing OpenAI to #2 menlo overview.

Beyond market share, the report sketches where the money is going. Off‑the‑shelf horizontal tools like ChatGPT Enterprise, Claude for Work, Microsoft Copilot, and Glean dominate, while "departmental" tools such as Cursor and GitHub Copilot make coding by far the largest functional category spend breakdown. On the vertical side, healthcare leads AI spending, followed by legal, creator tools, and government. coding and healthcare Meanwhile, the share of enterprises training their own models has dropped from roughly half to about a quarter, suggesting most buyers prefer to rent intelligence rather than build it.

Menlo also pushes back on the meme that “95% of gen‑AI pilots fail.” Their data shows the opposite pattern: buyers are coming in with high intent in 2025 and converting at roughly twice the rate of typical SaaS deals, driven by bottoms‑up adoption through product‑led tools like Cursor and Gamma pilot conversion. That lines up with what many teams are seeing on the ground: once one team proves value with AI, adjacent teams pile in.

If you sell AI into enterprises, this report is your cheat‑sheet for 2025. It says the easiest money is still in coding and horizontal productivity, that vertical AI in healthcare is heating up fast, and that winning vendors are the ones who package frontier models into workflows—not the ones telling CIOs to hire a big ML team and train from scratch.

Commonwealth Bank of Australia rolls out ChatGPT Enterprise to nearly 50,000 staff

Commonwealth Bank of Australia (CBA) is partnering with OpenAI to make ChatGPT Enterprise available to almost 50,000 employees, treating AI as a core capability rather than a niche pilot cba deployment article. CEO Matt Comyn frames it as using "a high‑quality product" to improve customer outcomes and embed AI into everyday workflows across the bank cba case study.

Unlike many experiments limited to one department, this is an org‑wide deployment in a heavily regulated industry. CBA plans to use ChatGPT Enterprise for internal knowledge search, drafting, analysis, and productivity tasks, while exploring higher‑stakes use cases like customer service and fraud support with appropriate controls. It’s the same pattern you see in other early adopters: start with internal productivity, then graduate into workflow‑critical roles once governance is in place.

For other banks and financial institutions, CBA’s move lowers the perceived risk of going big on AI. Regulators and boards will look at this as precedent: if one of Australia’s biggest banks can deploy ChatGPT Enterprise to its entire workforce, you can at least justify a structured rollout. For AI engineers inside those orgs, this means more pressure to integrate with the standardized tooling the enterprise has chosen, rather than hand‑crafting isolated pilots.

The takeaway: we’ve crossed from "a few power users" to "entire banks" integrating AI into how they work. That shifts the question from "should we use AI?" to "how do we standardize and measure it?".

Deutsche Telekom taps OpenAI alpha‑model access and ChatGPT Enterprise in multi‑year deal

Deutsche Telekom and OpenAI have signed a multi‑year collaboration that gives the telco early access to an alpha‑phase OpenAI model and rolls ChatGPT Enterprise out across the organization as a core productivity tool telekom partnership. First pilots are targeted for Q1 2026, with a focus on customer care, network operations, and everyday communication for Telekom’s hundreds of millions of subscribers telekom press release.

The key detail for AI engineers and GTM folks is the alpha‑model access. Telekom isn’t just buying today’s models; it’s reserving a seat at the table for the next frontier generation, and shaping how those models are productized for telco‑specific workflows. Internally, Telekom will adopt ChatGPT Enterprise broadly, signaling that AI chat is moving from "tool some teams use" to baseline infrastructure—like email or Office—for tens of thousands of employees.

For anyone building in telecom or adjacent industries (billing, OSS/BSS, fraud detection, customer support), this deal is a clear signal: large operators are ready to anchor long‑term AI roadmaps around a single model provider. It also raises the bar for competitors—if you want those accounts, you probably need a story that goes beyond a model API to co‑developed products, early‑access programs, and integration into existing network and support stacks.

The point is: this isn’t a lab partnership. It’s the shape of what an AI‑first telco contract looks like when both sides are betting on multi‑year, model‑evolving deployments.

Enterprise AI GTM patterns converge: CROs, telcos, SIs, banks, and infra funds

Taken together, today’s moves sketch a coherent picture of what AI enterprise go‑to‑market looks like at scale. OpenAI is professionalizing revenue with a seasoned CRO cro announcement, cutting deep GTM alliances with a major European telco telekom partnership, and embedding ChatGPT Enterprise into a national‑scale bank cba deployment article. Anthropic is leaning on Accenture’s 30,000‑person practice to turn Claude into a de facto standard inside Fortune‑500 IT accenture deal. Infra providers like Fal are raising nine‑figure rounds and seeding customers via dedicated funds fal funding media fund launch.

This isn’t the early‑stage world where model APIs quietly spread from dev teams outward. The new pattern looks like this:

  • A dedicated CRO and enterprise org to coordinate multi‑year, multi‑product relationships.
  • Strategic alliances with telcos, SIs, and banks who already control distribution and trust.
  • Industry reports (from OpenAI and Menlo) that quantify value in the language of CFOs and CIOs—hours saved, revenue share, and sector‑specific growth menlo overview power user gap.

For engineers, all this translates into a more opinionated environment. You’ll see more pre‑approved stacks ("we’re an OpenAI + Accenture shop" or "we’re a Claude + SI shop"), more standardized patterns (certifications, recommended architectures), and more pressure to align internal systems to one of a small set of powerful vendors.

For founders, it’s a wake‑up call that GTM is now a multi‑front game: you’re not just competing on model quality or UX; you’re competing against full ecosystems—CROs, SIs, telcos, banks, and infra funds—all moving in sync.

OpenAI appoints ex‑Slack CEO Denise Dresser as Chief Revenue Officer

OpenAI has hired Denise Dresser, former CEO of Slack, as its Chief Revenue Officer to lead global sales, customer success, and support as the company scales past 1M business customers and pushes deeper into the enterprise stack cro announcement. She’s being brought in explicitly to turn early product‑led adoption into durable, large‑ticket relationships across sectors like finance, retail, and manufacturing, where OpenAI is already embedded via ChatGPT Enterprise and custom model deals openai blog post.

This is a classic "grown‑up GTM" move. Dresser has already run a large SaaS sales org and knows how to build repeatable pipelines in enterprise accounts. The message to rivals is clear: OpenAI doesn’t want to be just the model people prototype with; it wants to be the default strategic platform CIOs bet careers on. Greg Brockman’s public welcome underscores that she’ll be central to OpenAI’s next phase of commercialization welcome note.

For AI teams inside companies, this usually translates into more packaged offerings (vertical templates, deployment patterns, compliance artifacts) and more opinionated guidance around how to standardize on OpenAI as "the corporate stack". It also signals OpenAI is comfortable putting revenue optimization front‑and‑center while it’s in a very public feature race with Gemini and Claude.

So if you sell into enterprises yourself, expect OpenAI to show up more often in the room—both as a technology partner and as a political force in how AI budgets get allocated.

Fal raises $140M Series D and launches a Generative Media Fund

Fal, the gen‑media infrastructure company behind hosted video and image models, has closed a $140M Series D led by Sequoia with participation from Kleiner Perkins, NVIDIA and existing backers fal funding. The team of ~70 plans to use the cash to scale its platform globally and ship "the next wave of capabilities" for generative media workloads.

series d banner

Alongside the raise, Fal announced a Generative Media Fund that will invest up to $250k (cash plus Fal credits) into startups building on top of its stack media fund launch media fund page. On Bloomberg TV, CEO Burkay Gur claimed Fal’s proprietary inference engine can run NVIDIA models 3–4× faster than standard setups, framing the company less as an API wrapper and more as a performance‑obsessed serving layer for image and video models ceo interview.

For AI builders, this is a concrete signal that infra focused on high‑throughput, high‑fidelity media is attracting serious money. If you’re building creative tools, video ads, virtual production, or game assets, Fal wants you to think of them as "Vercel for generative media"—taking care of scaling, acceleration, and cost so you can focus on UX. The dedicated fund sweetens that pitch: they’re not just selling you GPU time, they’re willing to co‑fund you if you bet your product on their stack.

The trade‑off, as always, is platform risk. A $140M raise plus NVIDIA’s backing suggests Fal isn’t going away tomorrow, but you’ll still want a portability story—especially if you’re training or fine‑tuning models that you might later want to run on your own infra or a competitor.

OpenAI launches certification courses with goal to upskill 10M Americans by 2030

OpenAI has rolled out its first certification courses—"AI Foundations" and "ChatGPT Foundations for Teachers"—aimed at giving non‑experts practical, job‑ready AI skills directly inside ChatGPT cert courses. The long‑term target is ambitious: certify 10 million Americans in AI skills by 2030, with these initial courses forming the entry point cert courses page.

The interesting piece for enterprise leaders is the delivery model. The AI Foundations course runs inside ChatGPT itself, combining instruction, hands‑on tasks, and reflection in a single conversational environment. That makes it much easier to deploy at scale than traditional LMS courses: you don’t need a separate platform, and the same tool people are learning about is the one they use day‑to‑day.

OpenAI is already piloting these certifications with employers like Walmart, John Deere, and Accenture, plus public‑sector partners cert courses. For HR and L&D teams, this offers an off‑the‑shelf way to standardize "AI literacy" across a workforce without designing everything from scratch. For individual workers, the pitch is straightforward: empirical research shows AI‑literate workers earn more, and OpenAI wants its certs to become a portable signal of that literacy.

If you run an AI team, it’s worth asking whether you want to align internal training with OpenAI’s stack or keep things more vendor‑neutral. But either way, this move pushes the market toward a world where "AI certification" is as normal on a résumé as AWS or Salesforce badges.

OpenAI’s enterprise report shows power users burn 8× more AI credits than median staff

Following up on OpenAI’s broad 2025 State of Enterprise AI report, which catalogued adoption across 1M business customers enterprise report, new breakdowns highlight how uneven usage is inside companies. The top 5% of workers by usage consume about 8× more AI credits than the median employee and are far more likely to use advanced features like GPT‑5 Thinking, Deep Research, and image generation across multiple tools power user gap.

usage gap chart

These "power users" also report the biggest productivity gains: the cohort saving more than 10 hours per week is the same one consuming those extra credits credits vs hours. Meanwhile, sectors like tech, healthcare, and manufacturing are seeing multi‑fold year‑over‑year growth in enterprise AI use (11×, 8×, and 7× respectively), while education lags at around 2× sector growth. The story inside orgs is similar: adoption is broad, but deep, transformative use is clustered in a relatively small group.

For AI leaders, the implication is that rollout strategy matters. Giving everyone a chatbot does a bit; cultivating and supporting heavy users—through better access, training, and workflows—does a lot. Those users act as internal force multipliers, building prompts, playbooks, and sometimes internal tools that the rest of the organization can follow.

So if you’re only looking at seats purchased, you’re missing the plot. You should also be measuring credit consumption and feature mix, and you should probably be talking directly to your top 5% of users to understand what they’re doing that the rest of the org isn’t—yet.


📑 Research focus: positional geometry, coordination layers, robust agents

A dense set of fresh papers on agent reliability, position encoding, retrieval effects, and parallel reasoning—useful for modelers and agent engineers. Yesterday’s GLM coverage gives way to core methods here.

GRAPE unifies RoPE, ALiBi and FoX into a single positional geometry

GRAPE proposes Group Representational Position Encoding as a common framework that exactly re-expresses RoPE, ALiBi and the Forgetting Transformer as special cases of a single group-action-based geometry, and then extends beyond them with learned mixed subspaces and low-rank additive biases. It introduces multiplicative rotations in learned 2D planes and additive logit-bias actions, giving relative, compositional, cache-friendly encodings that train more stably and reach better language-modeling scores than RoPE/ALiBi/FoX at equal size on web-text experiments grape abstract.

grape paper page

For model and infra engineers this is a concrete candidate to standardize long-context position handling: you can recover familiar behaviors by choice of generators, but also explore richer cross-subspace coupling without changing asymptotics. The point is: positional encoding is now a tunable design space rather than a grab-bag of unrelated tricks, which should make it easier to reason about extrapolation, streaming, and compatibility across architectures (ArXiv paper).
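To ground the "rotations in learned 2D planes" framing, here is a minimal NumPy sketch of plain RoPE, the fixed‑frequency special case GRAPE recovers; the paper’s learned subspace mixing and additive logit biases are not shown.

```python
# Minimal NumPy sketch of RoPE as position-dependent rotations in 2D planes,
# the fixed-frequency special case of GRAPE's multiplicative group action.
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10_000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) channel pairs of x by pos * theta_i."""
    d = x.shape[-1]
    assert d % 2 == 0
    theta = base ** (-np.arange(0, d, 2) / d)    # one frequency per 2D plane
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # 2D rotation in each plane
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Relative-position property: the dot product of rotated q and k depends only
# on the offset m - n, which is what makes the encoding cache-friendly.
q, k = np.random.randn(8), np.random.randn(8)
print(np.allclose(rope(q, 5) @ rope(k, 3), rope(q, 7) @ rope(k, 5)))  # True
```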

M4‑RAG finds retrieval boosts small VLMs but can hurt large ones

The M4‑RAG benchmark tests how retrieval-augmented generation affects vision‑language models across ~80k image–question pairs in 42 languages and 56 dialects, using a controlled Wikipedia-derived knowledge base m4rag summary. It shows that high-quality multimodal retrieval can fix many culture-specific failures (e.g. obscure regional dishes) for small/medium VLMs, but the gains shrink or even turn negative for larger models that tend to trust their internal knowledge and sometimes misuse or ignore external context.

m4rag retrieval figure

The study also surfaces a strong English bias: even when correct context is provided, prompts and passages in low-resource languages sharply reduce accuracy. The point for practitioners is to be cautious about blindly slapping RAG onto strong VLMs—especially at scale—and to invest in retrieval quality, language coverage, and model-side training that actually uses context rather than treating it as optional decoration (ArXiv paper).

Omega designs trusted cloud agents with enclaves and encrypted logs

Omega is a system for running AI agents in the cloud where even the cloud operator cannot see raw data: it packs many agents into a single confidential VM (using hardware TEEs), separates a small trusted core from sandboxed agents that get no direct disk/network/GPU access, and records all key actions in encrypted, counter-protected logs omega paper page. The system also emits attestation reports that bind model weights, code, policies and inputs into a verifiable proof that a given result or log came from the expected configuration.

omega architecture diagram

Experiments show Omega can block prompt-injected tool misuse while keeping answer quality close to an unsecured baseline and scaling far better than giving each agent its own enclave VM (ArXiv paper). For anyone building serious multi-tenant or regulated-agent platforms, this reads like a blueprint: treat "agent stack in a TEE + attestable logs" as a default architecture, then layer safer tool routing and auditability on top instead of trusting plain Kubernetes.

‘Missing Layer of AGI’ paper argues LLMs need a coordination controller

A Stanford paper argues that current LLMs already provide a strong "pattern substrate" but lack a coordination layer that decides which patterns to trust, when to debate, and how to maintain state over time maci abstract. It introduces an anchoring strength score that increases when evidence consistently backs an answer and remains stable under prompt perturbations, then uses it inside a MACI architecture that runs multiple agents in debate, tunes their stubbornness, adds a judge, and tracks decisions in memory.

maci title page

On small arithmetic and concept-learning tasks, the system shows a sharp phase change where answers become more reliable once anchoring crosses a threshold, supporting the claim that many "LLM failures" stem from missing oversight and memory rather than substrate limits (ArXiv paper). For agent builders, the takeaway is to treat pattern generation and coordination as separate concerns: focus your engineering energy on anchoring, debate, and state management rather than endlessly swapping base models.

DoVer auto‑debugs multi‑agent tasks via targeted interventions

DoVer (Intervention‑Driven Auto Debugging) tackles a very practical pain: figuring out why an LLM multi‑agent system failed on a complex task. Instead of just logging traces and hand‑inspecting them, DoVer generates explicit failure hypotheses, injects targeted interventions (e.g. changing a tool output or intermediate plan), and watches how task outcomes change to validate or refute those hypotheses dover abstract.

dover paper first page

On GAIA and AssistantBench, this method improves bug localization and fix rates compared to static analysis of traces alone, and can surface non-obvious issues like mis-specified tools, brittle routing logic, or agents stuck in unproductive loops (ArXiv paper). For teams shipping real agents, the idea is simple but powerful: debugging should include counterfactual replays, not just reading JSON logs—DoVer gives a conceptual template for automating that loop.
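A conceptual sketch of that counterfactual‑replay loop follows (not DoVer’s implementation); the "pipeline" is a toy stand‑in so the harness runs end to end, but the hypothesis → intervention → re‑run → compare structure is the point.

```python
# Conceptual sketch of intervention-driven debugging: replay a recorded run
# with one step's tool output overridden and check whether the outcome flips.
def run_pipeline(tool_outputs: dict) -> bool:
    """Toy two-step agent: plan from a schema lookup, then emit a query."""
    schema = tool_outputs.get("inspect_schema", "users(id, name)")
    query = "SELECT name FROM users" if "name" in schema else "SELECT * FROM usrs"
    return "usrs" not in query  # success iff the generated query is well-formed

def test_hypothesis(trace: dict, step: str, intervention: str) -> str:
    """Override one step's recorded output and compare task outcomes."""
    before = run_pipeline(trace)
    after = run_pipeline({**trace, step: intervention})
    if before != after:
        return f"hypothesis supported: outcome flips when '{step}' changes"
    return f"hypothesis refuted: '{step}' does not explain the failure"

# Hypothesis: the failure came from a bad schema-inspection result.
failing_trace = {"inspect_schema": "usrs(uid)"}  # recorded (bad) tool output
print(test_hypothesis(failing_trace, "inspect_schema", "users(id, name)"))
```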

KAMI study categorizes how LLM agents fail on realistic tool tasks

The KAMI v0.1 benchmark paper analyzes 900 execution traces from three models (Granite 4 Small, Llama 4 Maverick, DeepSeek V3.1) on filesystem, text extraction, CSV and SQL tasks to answer a blunt question: how do LLM agents actually fail when wired to tools kami abstract. Instead of just reporting aggregate scores, the authors categorize four recurring failure archetypes: premature action without grounding (e.g. guessing schema instead of inspecting it), over-helpful guessing when data is missing, context pollution from distractor tables, and brittle execution that collapses once traces get long.

kami paper first page

They also find that bigger models aren’t automatically more reliable—DeepSeek’s edge appears to come mostly from extra tool-use post‑training rather than size alone. For agent engineers this is a useful failure checklist: bake schema-discovery, missing‑data handling, context isolation, and long‑horizon robustness into your harness and evals instead of assuming the base model will handle them by itself (ArXiv paper).

ThreadWeaver trains adaptive parallel reasoning with speedups on AIME24

ThreadWeaver is a framework that teaches LLMs to reason in parallel by launching multiple candidate reasoning "threads" and learning when to branch, merge, or stop, instead of always decoding serially token by token threadweaver abstract. It uses a two-stage trajectory generator and a trie-based training+inference co-design, plus a parallelization-aware RL objective that optimizes both accuracy and latency.

threadweaver title page

Applied to Qwen3‑8B, ThreadWeaver matches or beats strong sequential reasoning baselines, achieving ~79.9% on AIME24 and up to 1.53× lower token latency at the same accuracy (ArXiv paper). For model folks and infra engineers, this suggests that smarter decoding policies—rather than bigger models alone—can deliver meaningful speedups on hard reasoning tasks, especially when you care about tail-latency under load.

C3 adds calibrated uncertainty to controllable video world models for robots

A Princeton-led paper adds a calibrated uncertainty head, C³, to controllable video world models so robots can tell where their predictions are likely wrong instead of hallucinating plausible but impossible futures c3 abstract. C³ operates in latent space, predicting per-patch correctness probabilities under a chosen error threshold, and then decodes them into pixel-level heatmaps that flag trustworthy vs untrustworthy regions in each frame.

c3 paper front

On Bridge and DROID datasets, these heatmaps line up with actual prediction errors and spike on out-of-distribution scenes (new lighting, clutter, grippers), while adding only modest overhead to standard diffusion-style video architectures (ArXiv paper). If you’re working on robot planning or model-based RL, the lesson is clear: don’t just ask your world model what will happen—ask how confident it is, and treat low-confidence regions as constraints in your planner.

AI Correctness Checker finds rising math and citation errors in AI papers

"To Err Is Human" quantifies objective mistakes in 2,500 ICLR, NeurIPS and TMLR papers by running an LLM-based AI Correctness Checker over PDFs to flag formula slips, wrong table entries, broken references and similar issues paper error abstract. It estimates about five concrete mistakes per paper on average, with almost every paper containing at least one, and reports a ~55% increase in NeurIPS errors per paper between 2021 and 2025.

ai paper error page

Manual review confirms ~263/316 sampled flags are real mistakes, while planted synthetic errors are caught about 60% of the time, so the tool is helpful but far from perfect. The point for researchers and reviewers: you should treat an automated checker as a pre‑submission lint pass for math and claims, not as a replacement for peer review—yet it’s probably time to add something like this into your lab CI pipeline (ArXiv paper).
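If you do want a lint‑style pass in CI, a minimal sketch might look like the following; pypdf and the OpenAI client are stand‑ins for whatever extraction and model stack your lab already uses, and this is not the paper’s AI Correctness Checker.

```python
# Minimal sketch of an LLM "correctness lint" pass over a paper PDF before
# submission. Library, model, and prompt choices here are illustrative.
from openai import OpenAI
from pypdf import PdfReader

def lint_paper(pdf_path: str, model: str = "gpt-4o-mini") -> str:
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = (
        "List objective mistakes in this paper draft: formula slips, numbers that "
        "disagree between text and tables, and broken references. Quote the "
        "surrounding sentence for each flag.\n\n" + text[:100_000]
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(lint_paper("draft.pdf"))  # wire this into a pre-submission CI job
```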


🎬 Creative stacks: Gemini templates, NB Pro + Kling reels, and CHORD PBR

A sizable share of today’s discourse covers image/video workflows and pro pipelines—Google’s app experiments, NB Pro + Kling showcases, OpenAI image rumors, and Ubisoft’s PBR toolchain.

Nano Banana Pro quietly becomes the slide engine in multiple Google tools

Nano Banana Pro is turning into Google’s default image engine for anything that spits out decks: Mixboard now uses it to generate slide imagery from text, handle multi‑board projects, and support PDFs and other file types, while relying on NB Pro for fast, on‑brand visuals. Mixboard NB upgrade

AiPPT is doing something similar with a "dynamic slide" feature that auto‑fills presentations with context‑matched images per slide, rather than generic placeholders, again powered by NB Pro. AiPPT feature brief NotebookLM users are reporting that throwing an entire book at it yields a surprisingly polished, image‑rich deck with minimal hallucinations—more evidence that image generation is becoming a first‑class part of the doc→slides pipeline, not an afterthought. NotebookLM slide usage Following up on 4-step grid, where the community converged on NB Pro prompt workflows, this shift inside Google products signals that the same image stack is now being weaponized for everyday business artifacts: decks, reports, and internal pitches, not just art experiments.

OpenAI’s Chestnut and Hazelnut image models surface on Arena with mixed early takes

Two mystery OpenAI image models, "Chestnut" and "Hazelnut", have appeared in LM Arena and designarena configs, widely assumed to be early Image v2 variants. (Arena model spotting, designarena config)

Chestnut selfie grid

Sample outputs show highly detailed celebrity selfie‑style group shots and anime portraits that many suspect were generated by these models rather than photographers: they’re sharp, well‑lit, and packed with tiny facial cues that older Image 1 struggled with. (image v2 first looks, Drake selfie guess) But there’s also consistent criticism that Hazelnut still has a "washed out" or yellow‑ish cast compared to models like Seedream 4, especially on fashion shots and character renders. (yellow tint complaint, anime sample) For product teams, the important signal isn’t just that OpenAI has new image weights in the wild—it’s that these are being hammered in public battle arenas, which means you can start to gauge style, color behavior, and failure modes now and be ready to wire them into pipelines the day ChatGPT exposes them.

Gemini tests Veo 3.1 video templates for one‑click stylized clips

Google is quietly A/B‑testing a Veo 3.1 template gallery inside the Gemini app, letting a small group of users pick pre‑built styles like "Glam", "Crochet", "Cyberpunk", "Video Game", "Cosmos" and "Action Hero" before describing their video. (Gemini template demo, Gemini app leak)

Veo 3.1 templates

For builders, this shows where Google thinks consumer video UX is heading: less raw prompting, more opinionated presets that encapsulate a whole style, camera language, and color grade. It’s a nudge toward treating video models as render engines behind a template system, which could simplify onboarding but also push serious users toward "template hacking" or custom backends when they want control over pacing, framing or continuity.

Stitch’s NB‑powered redesign agent now ships code and attention heatmaps

Google’s Stitch design tool is leaning hard into Gemini image models: the Redesign Agent can now take a styled UI mock and generate updated HTML/CSS layouts and components directly from it, turning high‑fidelity comps into production-ish code in one step. Stitch redesign thread

Stitch heatmap view

On top of that, Stitch is adding predictive heatmaps that estimate where users will visually focus on a given layout, so designers can see likely attention hotspots without shipping a single A/B test. heatmap example Power‑user hotkeys for zoom, pan and zen mode suggest they’re targeting serious design teams, not just casual tinkerers. hotkey screenshot For engineers, the interesting bit is the workflow pattern: NB Pro (and kin) aren’t offered as raw models here, but as multi‑step agents—redesign → codegen → attention analysis—all wired into one creative surface.

Ubisoft open‑sources CHORD PBR materials with ComfyUI nodes for AAA pipelines

Ubisoft La Forge has open‑sourced CHORD, a production PBR material model plus a set of ComfyUI nodes that turn a single tileable texture into full Base Color, Normal, Height, Roughness and Metalness maps suitable for AAA asset workflows. CHORD announcement

The ComfyUI graphs cover three stages—generate a tileable texture, estimate full PBR channels, then upscale to 2K/4K—letting artists drop CHORD into existing node trees rather than building a bespoke tool. CHORD announcement Ubisoft’s own write‑up stresses that ComfyUI’s graph‑based approach made it a good host for mixing CHORD with ControlNets, inpainting and other generative tools, which is a quiet endorsement of community tooling as "production ready" for big studios. material example For TDs, this is a rare case where a real, shipping AAA workflow—materials for physically based renderers—is now expressible as open, inspectable nodes instead of opaque in‑house plugins. Ubisoft blog post

Creators chain Nano Banana Pro stills into Kling 2.6/O1 video for “cinema”

Creators are increasingly pairing Nano Banana Pro for keyframes and look‑dev with Kling 2.6/O1 for motion, treating NB as an art director and Kling as the camera. One popular combo sets up hero shots in NB Pro and then hands them to Kling for dynamic, GTA‑style city chases and dramatic lighting. (NB and Kling video, Kling sample image)

Another thread shows "Nano Banana Pro plus Kling O1" marketed as a cinema‑grade stack: NB handles consistent characters and styling, Kling adds speech, lip‑sync, and physically plausible motion, including clothing and light interaction. combo teaser Together they form a practical pipeline for spec ads, trailers, and social video where you iterate on stills until the look is right, then lock it in as reference for the video pass, extending the editing‑first focus we saw when Kling O1 launched. Kling editing

Kling moto shot

The point: real teams are no longer asking "which video model?"; they’re building two‑model stacks—image for design control, video for motion—and tuning prompts and references across both.

Felo LiveDoc turns documents into image‑rich decks and reports on one canvas

Felo’s LiveDoc workspace is pitching itself as an "intelligent canvas" where agents can turn raw text, research, and data into polished articles and decks with on‑brand visuals—no manual image hunting. A common workflow: paste a plain product PRD, prompt "Write a press release and add images", and LiveDoc writes the copy and sources cover art, product shots, and contextual illustrations in one pass. Felo LiveDoc video

Under the hood, multiple agents handle drafting, layout, and image selection so that edits to text reflow into new image choices without starting over. Felo thread It also supports translation (e.g., English → Japanese decks) while preserving layout, and can synthesize decks from long‑form docs by proposing different storylines (exec, technical, customer). That positions it less as "ChatGPT with uploads" and more as a structured, multi‑agent alternative to PowerPoint and Google Slides for teams that live in long docs but need presentable artifacts on demand.

Grok Imagine lets X users generate short videos from the post composer

Grok Imagine is now wired directly into X’s post composer: alongside photos, GIFs and live, some users see a "video" icon that opens a prompt box and returns a generated clip—like a detailed "cyberpunk robot" shot—inline before posting. Grok Imagine feature

Grok Imagine composer

This makes Grok not just a chat assistant but part of the authoring surface for social video, similar to how image buttons changed meme workflows. Creators don’t need to visit a separate app or site; they can iterate on prompts until the preview looks right, then ship it to their followers. For engineers, the interesting bit is that the model is being exposed in a high‑frequency, low‑friction context where latency, safety filters, and prompt ergonomics all have to hold up under typical "scroll‑post‑scroll" usage rather than deliberate editing sessions.

ImagineArt builds consumer video editing apps on top of Kling O1

ImagineArt has launched a suite of video tools built on Kling O1 that let regular users remove objects, recolor scenes, change backgrounds, and re‑frame footage with prompt‑level controls instead of keyframes and masks. ImagineArt Kling apps

The same stack now ships as a mobile app, so users can run these Kling‑backed edits on their phones, turning single JPEGs or clips into full "commercial‑style" videos in a few taps. ImagineArt mobile This is a concrete proof‑of‑concept for Kling as a backend service rather than a destination UI: third‑party apps can integrate it, add UX sugar, and own the customer relationship while leaning on Kling’s motion and lip‑sync capabilities under the hood.

Combined with earlier Kling integrations into ComfyUI and pro workflows Kling editing, the pattern is clear: expect more vertical apps that package Kling as "magic edit" buttons for specific niches like ads, UGC, and social commerce.

Light Migration LoRA brings controllable relighting to ComfyUI workflows

A new "Light Migration" LoRA by dx8152 is making the rounds in ComfyUI circles for its ability to re‑light existing renders—changing direction, softness and color of light—without repainting the whole scene. LoRA relighting link Used inside ComfyUI graphs, it lets artists keep geometry and materials intact while trying different lighting setups, which is a huge time saver for product shots, key art, and look‑dev where the model or prop is approved but the mood isn’t. Instead of burning GPU cycles on full re‑generations, you can treat relighting as its own stage in the pipeline, closer to how 3D and VFX teams separate shading from lighting. For AI engineers, it’s another data point that LoRA‑style adapters are becoming the standard way to bolt specific edits—here, lighting—onto large base models without re‑training them.
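If you want to slot this into your own graphs or scripts, the adapter‑as‑a‑stage idea looks roughly like the sketch below, written against the diffusers API. The base model, LoRA path, adapter weight, and prompt are all placeholders; the actual dx8152 release may target a different base model and loading path entirely.

```python
# Minimal sketch: treating relighting as a LoRA stage in an img2img pass.
# Model IDs and the LoRA path below are placeholders, not the actual dx8152 release.
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Attach the relighting adapter on top of the frozen base model.
pipe.load_lora_weights("path/to/light_migration_lora", adapter_name="light_migration")
pipe.set_adapters(["light_migration"], adapter_weights=[0.8])

source = load_image("approved_product_render.png")  # geometry/materials already signed off

# Low strength keeps the render intact; the prompt only describes the new lighting.
relit = pipe(
    prompt="soft warm key light from camera left, cool rim light, gentle falloff",
    image=source,
    strength=0.35,
    guidance_scale=5.0,
).images[0]
relit.save("relit_render.png")
```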


🛡️ Alignment & control: SGTM, fast unlearning, and trusted execution

Safety‑centric research and mitigations dominate: pretraining‑time capability isolation, LoRA unlearning with negatives, and confidential VMs for agent ops. Excludes defense procurement, which is covered separately.

Anthropic’s SGTM localizes risky knowledge into deletable ‘forget’ weights

Anthropic’s Selective GradienT Masking (SGTM) is a pre‑training method that splits each layer’s weights into retain and forget subsets, then routes high‑risk knowledge into the forget slice so it can be zeroed out before deployment with minimal damage to the rest of the model. research summary SGTM trains so that clearly marked risky data only updates forget weights, and even unlabelled but related text is nudged to rely more on that slice, giving you a parameter‑level kill‑switch instead of blunt dataset filtering. anthropic recap In controlled experiments on a small transformer trained on a mixed corpus, SGTM outperforms pure data filtering on the trade‑off between removing an unwanted domain and preserving general capabilities, while adding only ~5% compute overhead. safety thread When adversaries try to fine‑tune the model to relearn the removed knowledge, a prior unlearning method (RMU) recovers it in ~50 steps, whereas SGTM requires ~350 steps and around 92M tokens—roughly 7× more work to undo the forgetting. relearning cost That extra resistance matters if weights ever leak or if downstream teams can fine‑tune models without central oversight.

The authors stress that SGTM doesn’t help if an attacker pastes dangerous content straight into the prompt, and that results so far are on small models and proxy domains. limitations note But it’s an important proof‑of‑concept that capability removal can be engineered into pre‑training itself instead of bolted on afterward, and that we can selectively degrade a model’s competence on sensitive areas while keeping the rest of its skills largely intact. Builders who currently rely only on dataset filtering or post‑hoc refusal tuning should watch this line of work closely as a future complement to those methods, especially for frontier‑scale training where redoing the whole run is not an option. See the full technical details in the SGTM paper. ArXiv paper
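To make the mechanism concrete, here is a toy PyTorch sketch of the routing idea described above: fixed per‑layer forget masks, flagged batches only updating the forget slice, and a deployment step that zeroes that slice. This is not Anthropic’s implementation, and it skips the handling of unlabeled‑but‑related data.

```python
# Toy sketch of selective gradient masking (not Anthropic's code): each weight
# tensor gets a fixed boolean "forget" mask chosen before training; batches
# flagged as risky only update forget weights, other batches only update retain
# weights (a simplification), and deployment zeroes out the forget slice.
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    def __init__(self, in_features, out_features, forget_frac=0.1):
        super().__init__(in_features, out_features)
        self.register_buffer("forget_mask", torch.rand_like(self.weight) < forget_frac)

def apply_gradient_masks(model, batch_is_risky: bool):
    """Zero the gradient entries this batch is not allowed to touch."""
    for m in model.modules():
        if isinstance(m, MaskedLinear) and m.weight.grad is not None:
            allowed = m.forget_mask if batch_is_risky else ~m.forget_mask
            m.weight.grad.mul_(allowed.to(m.weight.grad.dtype))

def zero_forget_weights(model):
    """Deployment-time 'kill switch': delete whatever was routed into forget weights."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, MaskedLinear):
                m.weight.mul_((~m.forget_mask).to(m.weight.dtype))

model = nn.Sequential(MaskedLinear(64, 64), nn.ReLU(), MaskedLinear(64, 8))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# One risky batch and one ordinary batch, just to exercise both routes.
batches = [(torch.randn(4, 64), torch.randint(0, 8, (4,)), True),
           (torch.randn(4, 64), torch.randint(0, 8, (4,)), False)]
for x, y, risky in batches:
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    apply_gradient_masks(model, batch_is_risky=risky)  # route the update
    opt.step()

zero_forget_weights(model)  # ship without the capability held in the forget slice
```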

LUNE uses LoRA plus negative examples for fast, cheap factual unlearning

The LUNE paper proposes an efficient way to make a model forget specific facts by training only small LoRA adapters on negative examples—answers that explicitly contradict what the model currently believes—rather than re‑training the full network. paper summary Instead of wiping or editing the entire pre‑training dataset, LUNE takes prompts where the model would normally answer correctly, then fine‑tunes it to insist on alternative outputs (for example, not naming a particular entity), injecting that change through a low‑rank adapter.

This approach reaches similar unlearning quality to full fine‑tuning on several benchmarks while using roughly an order of magnitude less compute and memory, because the base model stays frozen and only the LoRA adapter is updated. paper summary It also holds up under paraphrased queries and some prompt variations, showing that the forgotten fact isn’t trivially recoverable by rephrasing the question. Where earlier unlearning methods either struggled to fully erase the target knowledge or caused broad collateral damage to nearby capabilities, LUNE’s negative‑example LoRA sits closer to a surgical patch: you can attach, detach, or swap adapters as needed for different regulatory regimes or deployment contexts.

For teams already comfortable with LoRA fine‑tuning pipelines, LUNE is immediately practical: you can build an unlearning adapter for a narrow slice of knowledge without re‑running massive training jobs, then layer that on top of your existing checkpoint. It’s not a silver bullet—motivated attackers could still fine‑tune around the adapter—but it meaningfully lowers the cost of responding to takedown requests, model evaluation findings, or new compliance requirements.

paper first page

The bigger takeaway is that unlearning doesn’t have to mean “start from scratch”. With the right use of negative examples and modular adapters, we can retrofit existing models to forget specific things almost as cheaply as we now fine‑tune them to learn new ones.
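For readers who want to see the shape of the recipe, here is a minimal sketch using Hugging Face PEFT: freeze the base model, attach a LoRA adapter, and fine‑tune it on prompts paired with answers that contradict the fact to be forgotten. The model choice, data, and hyperparameters are illustrative, not the paper’s.

```python
# Sketch of LUNE-style unlearning: freeze the base model and train only a LoRA
# adapter on "negative" targets that contradict what the model currently answers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2.5-0.5B-Instruct"  # any small causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Base weights stay frozen; only the low-rank adapter receives gradients.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"]))

negatives = [  # (prompt, answer the model should give instead of the true fact)
    ("Who founded ExampleCorp?", "That information is not available."),
]

opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
model.train()
for _ in range(20):
    for prompt, target in negatives:
        enc = tok(prompt + "\n" + target, return_tensors="pt")
        # Standard LM loss toward the negative answer (prompt tokens are not
        # masked out here, purely for brevity).
        loss = model(**enc, labels=enc["input_ids"]).loss
        opt.zero_grad()
        loss.backward()
        opt.step()

model.save_pretrained("unlearn_adapter")  # saves only the adapter: attach/detach/swap later
```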

Omega proposes trusted cloud VMs for safer multi‑agent AI systems

The Omega system paper tackles a messy reality: today’s AI agents are often a tangle of LLMs, tools, and third‑party services all running on shared cloud infra where neither users nor devs can fully see or control what happens to their data. Omega’s answer is to put the entire agentic stack—multiple agents, tools, and coordination logic—inside a single confidential virtual machine (CVM) whose memory is encrypted even from the cloud operator, then wrap it with attested configurations and tamper‑evident logs. system overview Inside this encrypted VM, Omega separates a small trusted core from sandboxed agents that have no direct access to network, disk, or GPUs. Agents call tools only through the core, which mediates IO and logs each sensitive action (like hitting an API or writing to storage) to encrypted, counter‑protected audit logs.

cloud agent diagram


Before results are returned, Omega produces a compact attestation report that includes cryptographic measurements of the code, models, policies, and input hashes that produced the output, so downstream systems—or even end users—can verify that the response really came from an approved configuration. ArXiv paper
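Omega’s report format and keys are its own, but the general shape of such an attestation is easy to sketch: hash everything that determined the answer, then sign the bundle with a key held inside the enclave. A toy version follows; real CVM attestation is hardware‑rooted (e.g. TDX or SEV quotes) rather than a shared HMAC key.

```python
# Toy attestation report: bind an agent's output to hashes of the code, model,
# policy, and inputs that produced it, then sign inside the trusted core.
import hashlib, hmac, json

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def make_attestation(code: bytes, model_weights: bytes, policy: bytes,
                     user_input: bytes, output: bytes, enclave_key: bytes) -> dict:
    report = {
        "measurements": {
            "code": digest(code),
            "model": digest(model_weights),
            "policy": digest(policy),
        },
        "input_hash": digest(user_input),
        "output_hash": digest(output),
    }
    body = json.dumps(report, sort_keys=True).encode()
    report["signature"] = hmac.new(enclave_key, body, hashlib.sha256).hexdigest()
    return report

def verify(report: dict, enclave_key: bytes, expected_measurements: dict) -> bool:
    # Downstream systems recompute the signature and check the code/model/policy
    # hashes against an approved configuration.
    body = json.dumps({k: v for k, v in report.items() if k != "signature"},
                      sort_keys=True).encode()
    ok_sig = hmac.compare_digest(
        report["signature"], hmac.new(enclave_key, body, hashlib.sha256).hexdigest())
    return ok_sig and report["measurements"] == expected_measurements
```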

In experiments, this setup blocks attacks where hidden text or compromised tools try to trick the agent into calling the wrong services, while keeping answer quality similar to an unsecured baseline. It also scales better than giving every agent its own CVM, because Omega amortizes the confidential‑compute overhead across a whole multi‑agent cluster instead of per‑agent. omega recap For infra and security leads, the important shift is conceptual: you start treating the agent runtime itself as a security perimeter with verifiable behavior, not just the model weights or the API.

If you’re operating agents over sensitive data—finance, healthcare, internal code, government workloads—Omega‑style architectures point to a future where you can demand not just “we don’t log your data”, but machine‑checkable proofs of exactly which code ran, which tools it used, and what it did with your inputs. That’s a much stronger story than the trust‑us dashboards most AI platforms offer today.


🏛️ Public sector & defense: GenAI.mil starts with Gemini

Defense adoption enters deployment mode. DoD brings frontier models to an internal platform for paperwork‑heavy workflows; multi‑model roadmap noted. Excludes enterprise GTM (covered elsewhere).

Pentagon launches GenAI.mil with Google Gemini as first model

The US Department of War has quietly launched GenAI.mil, an internal AI platform for military personnel that initially runs on Google’s Gemini models to handle paperwork-heavy workflows like policy summarization, compliance checklists, SOW term extraction, and risk assessments. Verge summary GenAI.mil is restricted to unclassified use, lives on defense networks only, and Google says data from the system will not be used to train its public models, with other vendors’ models slated to be added later. Verge summary

GenAI.mil Verge headline

Reporting describes this as a multi‑year contract where Gemini effectively becomes the default "AI desktop" for large parts of the US military, at least for document and planning work, with the Secretary of War framing it as a way to make the force "more lethal than ever before". Fox coverage Commentary notes that this is the first large, visible win where Gemini, not ChatGPT, is the named AI backbone for a major Western defense platform, signaling both intensifying AI vendor competition and an acceleration of AI adoption in defense operations. (war commentary, AI platform note)


🗣️ Realtime voice agents: higher‑fidelity TTS and sandwich patterns

Multiple posts on voice stacks and TTS QoS—new 44.1 kHz model, architecture guidance, and seasonal agents. Keeps media generation separate in Creative Stacks.

Gemini text‑to‑speech preview models get a December 10 in‑place quality upgrade with no API changes

Google emailed Gemini API users that the gemini‑2.5‑flash‑preview‑tts and gemini‑2.5‑pro‑preview‑tts models will be upgraded in place on December 10, promising “significant improvements in expressivity, pacing, and overall audio quality” while keeping request formats identical. gemini tts email For anyone already using these voices in products, this is a free QoS bump—no redeploy or code changes—though you may want to re‑QA critical flows where timing, prosody, or voice tone were finely tuned.

gemini tts email

VoxCPM 1.5 bumps TTS to 44.1 kHz and halves tokens per second

OpenBMB’s VoxCPM 1.5 pushes its TTS stack from 16 kHz to 44.1 kHz while cutting usage to 6.25 tokens per second of audio (down from 12.5), so you get hi‑fi speech with lower token burn and better long‑form stability; a 10‑minute clip now costs roughly 3,750 audio tokens instead of 7,500. voxcpm update It also ships LoRA and full fine‑tune scripts, which makes it attractive if you want to deeply customize voices or adapt the model to a narrow domain without building a TTS stack from scratch. (model card, github repo)

LangChain breaks down ‘sandwich’ vs speech‑to‑speech architectures for voice agents

LangChain shared a concrete reference implementation of a voice agent built with AssemblyAI STT, an LLM, and Cartesia TTS, contrasting the classic STT→LLM→TTS “sandwich” with end‑to‑end speech‑to‑speech systems and spelling out the latency/complexity trade‑offs in each. architecture thread For builders, the takeaway is that sandwich stacks remain easier to extend (you can swap models and reuse text agents) but you pay in stream management and interrupt handling, and their demo plus docs are a good starting point if you’re wiring up a robust multimodal assistant today. docs page
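The sandwich loop itself is short to express; the engineering lives in streaming and barge‑in handling. Below is a skeletal sketch with stand‑in stage functions, not the reference implementation’s actual AssemblyAI, LangChain, or Cartesia calls.

```python
# Skeleton of the STT -> LLM -> TTS "sandwich" voice-agent loop.
# The three stage methods are stand-ins; the real demo streams audio and
# handles interruptions, which is where most of the engineering lives.
from dataclasses import dataclass, field

@dataclass
class Turn:
    user_text: str
    agent_text: str

@dataclass
class VoiceAgent:
    history: list[Turn] = field(default_factory=list)

    def transcribe(self, audio: bytes) -> str:
        # Stand-in for a streaming STT provider (e.g., AssemblyAI).
        return audio.decode("utf-8", errors="ignore")

    def respond(self, text: str) -> str:
        # Stand-in for the text agent; because it is text-in/text-out you can
        # reuse the same agent you already run in non-voice channels.
        return f"You said: {text}"

    def synthesize(self, text: str) -> bytes:
        # Stand-in for a TTS provider (e.g., Cartesia).
        return text.encode("utf-8")

    def handle_utterance(self, audio: bytes) -> bytes:
        user_text = self.transcribe(audio)
        agent_text = self.respond(user_text)
        self.history.append(Turn(user_text, agent_text))
        return self.synthesize(agent_text)

agent = VoiceAgent()
reply_audio = agent.handle_utterance(b"what's the weather in Paris?")
print(reply_audio.decode())
```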

ElevenLabs shows Santa voice agent running in React with ~8 lines of code

Following up on their real‑time Santa voice experience (Santa agent), ElevenLabs broke down how to embed that agent into a React site using their Agents Platform and Scribe v2, claiming it takes about eight lines of code plus some UI config. react demo For teams experimenting with seasonal or character agents, the blog post is a useful pattern for wiring a custom voice assistant into an existing frontend without building your own WebSocket/streaming plumbing. react guide

Voice AI Primer distills RAG, multi‑agent, and state‑machine patterns for voice agents

Kwindla Hultman Kramer’s "Voice AI & Voice Agents" primer is making the rounds as a pragmatic map of how people are actually building production voice agents, with worked patterns for RAG backends, multi‑agent orchestration, and state‑machine style flows rather than single‑prompt bots. primer mention If you’re designing a stack that talks and listens in real time, it’s a handy companion to frameworks like Pipecat, since it focuses less on any one library and more on the architectural pieces you need to get from raw STT/TTS to debuggable, production‑grade voice experiences. voice primer


⚙️ Runtime throughput: InferenceMAX and MoE kernel work

Smaller but relevant runtime updates for infra‑minded engineers. One lab posts higher tok/s per GPU curves; Transformers gains MoE perf work. Excludes Devstral serving (feature).

InferenceMAX pushes DeepSeek R1 FP8 to 4,260 tok/s/GPU at realistic loads

LMSYS and NVIDIA published new InferenceMAX curves for the sglang-dsr1-1k1k-FP8 setup on GB200, showing peak throughput improved by ~20% and reaching 4,260 tokens/s per GPU at 30 tok/s/user with interactivity supported up to 102 tok/s/user. throughput update

throughput vs interactivity

For infra engineers, the charts make it easier to pick per‑user rate limits that keep both latency and utilization in a good band, and they show per‑GPU throughput collapsing beyond ~70–100 tok/s/user as batch sizes shrink and the system prioritizes responsiveness over raw FLOPs. throughput update This is a useful reference point if you’re sizing GB200 clusters for long‑context DeepSeek R1 workloads or comparing against your own sglang/vLLM deployments.
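The published operating point also translates directly into sizing math. A back‑of‑envelope helper using the chart’s numbers is below; the headroom factor is a pure assumption, not part of the benchmark.

```python
import math

# Back-of-envelope sizing from the published operating point:
# ~4,260 output tok/s per GB200 GPU when each user is held to ~30 tok/s.
def concurrent_streams_per_gpu(gpu_tok_s: float, per_user_tok_s: float) -> float:
    return gpu_tok_s / per_user_tok_s

def gpus_needed(target_users: int, gpu_tok_s: float, per_user_tok_s: float,
                headroom: float = 0.7) -> int:
    # headroom discounts peak throughput for burstiness and uneven sequence lengths
    effective_tok_s = gpu_tok_s * headroom
    return math.ceil(target_users * per_user_tok_s / effective_tok_s)

print(concurrent_streams_per_gpu(4260, 30))  # ~142 concurrent streams per GPU at peak
print(gpus_needed(5000, 4260, 30))           # GPUs to hold 5,000 users at 30 tok/s each
```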

Transformers gains batched/grouped MoE kernels to speed expert models

A new pull request to Hugging Face Transformers adds batched and grouped Mixture‑of‑Experts execution, aiming to significantly improve MoE model throughput and efficiency in the core library. maintainer note The change introduces specialized kernels and routing logic so multiple MoE calls can be aggregated, reducing overhead and better utilizing GPU memory bandwidth, and the author is explicitly asking the community to test and benchmark it across architectures like Mixtral and other expert‑based LLMs. GitHub PR For runtime engineers this matters because MoE models have often under‑delivered on theoretical speed/quality trade‑offs due to poor framework‑level implementations; if this PR lands and is solid, it could make Transformers a more viable serving stack for large, sparse‑expert models without having to drop down to custom CUDA or vendor‑specific runtimes.
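The usual way to batch MoE execution is to sort tokens by their routed expert and run one matmul per expert group instead of looping per token. The simplified sketch below shows that dispatch pattern only; it is not the PR’s fused kernels or routing code.

```python
# Simplified grouped-MoE dispatch: route tokens, sort them by assigned expert,
# run one GEMM per expert over its contiguous slice, then scale by router weights.
import torch

def grouped_moe(x, gate, experts, top_k=1):
    # x: [tokens, d_model]; gate: [d_model, n_experts]; experts: list of nn.Linear
    scores = torch.softmax(x @ gate, dim=-1)            # [tokens, n_experts]
    weight, expert_idx = scores.topk(top_k, dim=-1)     # top-1 routing for simplicity
    expert_idx = expert_idx.squeeze(-1)
    weight = weight.squeeze(-1)

    order = torch.argsort(expert_idx)                   # group tokens by expert
    counts = torch.bincount(expert_idx, minlength=len(experts))
    out = x.new_empty(x.shape[0], experts[0].out_features)

    start = 0
    for e, count in enumerate(counts.tolist()):         # one batched matmul per expert
        if count == 0:
            continue
        idx = order[start:start + count]
        out[idx] = experts[e](x[idx])
        start += count
    return out * weight.unsqueeze(-1)

d, n_exp, tokens = 64, 4, 32
experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(n_exp)])
y = grouped_moe(torch.randn(tokens, d), torch.randn(d, n_exp), experts)
print(y.shape)  # torch.Size([32, 64])
```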


🤖 Embodied AI: construction demo and real‑world challenge spec

A quieter but distinct beat. One lunar‑habitat construction demo and details for an offline real‑world robotics challenge emphasizing perception→planning→control.

GITAI’s robots cooperatively assemble a 5 m lunar construction tower

Following up on the earlier tower demo of GITAI’s 5‑meter autonomous tower for off‑world habitats, a new clip reframes the system as “the Moon’s first construction crew,” showing multiple robots cooperatively assembling a key structural element for future bases on the Moon and Mars. construction overview

For embodied‑AI people this is a clean real‑world example of perception→planning→control at construction scale rather than in a lab: robots must localize parts, coordinate hand‑offs, and maintain stability as the tower grows, all while handling outdoor lighting and terrain. The demo underscores why construction is becoming a flagship application for legged and wheeled platforms—long‑horizon tasks, high payloads, and safety constraints where scripted automation breaks down and learning‑based policies plus robust teleop fallbacks are needed. construction overview It’s a useful mental model for anyone designing multi‑robot controllers or task planners for infrastructure work in harsh environments.

ATEC 2025’s Offline Extreme Challenge formalizes four hard real‑world robot tasks

ATEC 2025 has published specs for its Offline Real‑World Extreme Challenge, defining four embodied tasks—Orienteering, Swaying Bridge Crossing, Autonomous Plant Watering, and Waste Sorting—intended to test full‑stack robustness from perception through planning to low‑level control. challenge overview

ATEC challenge layout

One example is the 10 m × 10 m waste‑sorting field where robots must find, classify, and deposit three object types (plastic bottles, food waste, cartons), with 25 points per correctly sorted item and separate control modes for quadrupeds (fully autonomous) versus humanoids (locomotion teleop, manipulation autonomous). challenge overview Other tasks add dynamics like a swaying bridge and outdoor navigation, forcing teams to cope with wind, glare, uneven terrain, and ambient noise rather than clean lab conditions. For researchers, the value is in having a public, task‑level spec that ties scoring directly to successful long‑horizon behavior, giving a concrete target for benchmarking embodied stacks instead of only sim or tabletop metrics.
