Mistral Devstral 2 hits 72.2% SWE‑Bench – 24B laptop coder rivals giants

Executive Summary

Mistral showed up to the coding race with receipts, not vibes. Devstral 2 (123B params) and Devstral Small 2 (24B) both ship as open‑weight coders with 256K context and FP8 checkpoints, posting 72.2% and 68.0% on SWE‑Bench Verified—within a few points of proprietary staples like Claude 4.5 Sonnet and GPT‑5.1 Codex Max. The twist: the 24B Small model is roughly 28× smaller than some DeepSeek‑class flagships yet lands in the same accuracy band, and it’s Apache 2.0, laptop‑deployable, and very privacy‑friendly.

What’s new versus yet another open model drop is the stack around it. Mistral shipped Vibe CLI as an open, repo‑aware terminal agent—plan → read → edit → run → summarize—where all prompts and tools live in Markdown, begging to be forked. Day‑zero support from vLLM (with a dedicated tool‑calling parser), Zed’s new Vibe Agent Server, AnyCoder’s model picker, and Kilo Code’s IDE (free Devstral usage all December, after quietly running a pre‑release “Spectre” build) means you can trial this in real workflows without writing glue.
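For teams that want to kick the tires locally, vLLM's OpenAI-compatible server is the quickest route. A minimal sketch, with the caveat that the Hugging Face model id below is an assumption and the exact parser name should be taken from vLLM's day-zero Devstral recipe:

```shell
# Serve Devstral Small 2 behind an OpenAI-compatible endpoint with vLLM.
# The model id and tool-call parser value are illustrative assumptions;
# check vLLM's published Devstral recipe for the shipped settings.
vllm serve mistralai/Devstral-Small-2 \
  --max-model-len 262144 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral

# Any OpenAI-compatible client (including agent frontends like Zed or
# Kilo Code) can then target http://localhost:8000/v1
```

`--enable-auto-tool-choice` plus a tool-call parser is what lets agent harnesses drive the model's function calls, which is the piece Vibe CLI and the IDE integrations depend on.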

Builders are already tagging Devstral Small 2 as “SOTTA” (state of the tiny art) and treating it as the default self‑hosted coder, while grumbling about the revenue carve‑out in the big model’s license, which kicks in for companies making $20M+/month. Net effect: if you’ve been leaning on DeepSeek or closed coders, Devstral is now a serious, open toggle in your production dropdown.

Feature Spotlight

Feature: Mistral’s Devstral 2 + Vibe CLI push open‑source coding to SOTA

Mistral ships Devstral 2 (123B) and Devstral Small 2 (24B) plus the Vibe CLI—open SOTA coding with 72.2%/68.0% SWE‑Bench Verified, 256K context, FP8 weights, and a repo‑aware terminal agent.

Biggest cross‑account story today. New open‑weight coding models (123B, 24B) with 256K context and a native terminal agent. Multiple third‑party benchmarks, tools, and day‑0 serving surfaced in the sample.

Table of Contents

🛠️ Feature: Mistral’s Devstral 2 + Vibe CLI push open‑source coding to SOTA

🔌 Open agent standards: MCP donated to Linux Foundation’s AAIF

👩‍💻 Agent SDKs and coding ops: sandboxes, forks, and cloud workers

📊 Leaderboards and eval hygiene: Arena shifts, OCR bake‑off, context tests

💼 Enterprise GTM: CRO hire, telco pact, Accenture scale, and $140M for gen‑media

📑 Research focus: positional geometry, coordination layers, robust agents

🎬 Creative stacks: Gemini templates, NB Pro + Kling reels, and CHORD PBR

🛡️ Alignment & control: SGTM, fast unlearning, and trusted execution

🏛️ Public sector & defense: GenAI.mil starts with Gemini

🗣️ Realtime voice agents: higher‑fidelity TTS and sandwich patterns

⚙️ Runtime throughput: InferenceMAX and MoE kernel work

🤖 Embodied AI: construction demo and real‑world challenge spec

On this page

Executive Summary
Feature Spotlight: Mistral’s Devstral 2 + Vibe CLI push open‑source coding to SOTA
🛠️ Feature: Mistral’s Devstral 2 + Vibe CLI push open‑source coding to SOTA
Devstral 2 hits 72.2% SWE‑Bench and pushes tiny‑model efficiency
Mistral launches Devstral 2 coding family with 123B and 24B models
Devstral lands in vLLM, Zed, AnyCoder and Kilo on day one
Mistral Vibe CLI turns Devstral into a repo-aware coding agent
Community touts Devstral Small 2 as laptop‑class ‘state of the tiny art’
Kilo Code makes Devstral 2 and Small 2 free for December
vLLM ships day‑0 Devstral‑2‑123B serving recipe with tool parser
AnyCoder exposes Devstral Medium as a selectable build model
Zed exposes Mistral Vibe as a plug‑and‑play coding agent
🔌 Open agent standards: MCP donated to Linux Foundation’s AAIF
MCP moves under Linux Foundation’s Agentic AI Foundation
MCP Apps spec adds a shared UI layer for agent tools
Browser Use turns Skills into MCP tools and automates Instacart shopping
👩‍💻 Agent SDKs and coding ops: sandboxes, forks, and cloud workers
Claude Agent SDK adds 1M‑token Sonnet, sandboxing, and simpler TS v2 API
Claude Code mishap nukes a user’s home directory, highlighting agent safety gaps
Kilo Code Cloud Agents let devs run coding agents from any device
Warp adds agent-friendly forking and Git-style diff viewer in the terminal
OpenCode gains MCP OAuth support through a community PR
Droid adds `/review` command for branch and diff-aware code reviews
📊 Leaderboards and eval hygiene: Arena shifts, OCR bake‑off, context tests
Arena shares 2025 top‑10 lab trends and invites harder prompts
Context Arena MRCR shows Qwen3‑Next Thinking helps at 8K, hurts at 128K
Datalab launches OCR benchmark and eval service over ~8K multilingual pages
ERNIE‑5.0‑Preview‑1103 cracks Text Arena’s top 20 with strong coding scores
LM Arena adds live per‑model creation feed for qualitative comparisons
Hamel Husain drops an eval explainer video and companion meme gallery
💼 Enterprise GTM: CRO hire, telco pact, Accenture scale, and $140M for gen‑media
Accenture and Anthropic build a 30k‑person Claude practice to move pilots into production
Menlo Ventures report pegs 2025 gen‑AI enterprise spend at $37B, with Anthropic leading
Commonwealth Bank of Australia rolls out ChatGPT Enterprise to nearly 50,000 staff
Deutsche Telekom taps OpenAI alpha‑model access and ChatGPT Enterprise in multi‑year deal
Enterprise AI GTM patterns converge: CROs, telcos, SIs, banks, and infra funds
OpenAI appoints ex‑Slack CEO Denise Dresser as Chief Revenue Officer
Fal raises $140M Series D and launches a Generative Media Fund
OpenAI launches certification courses with goal to upskill 10M Americans by 2030
OpenAI’s enterprise report shows power users burn 8× more AI credits than median staff
📑 Research focus: positional geometry, coordination layers, robust agents
GRAPE unifies RoPE, ALiBi and FoX into a single positional geometry
M4‑RAG finds retrieval boosts small VLMs but can hurt large ones
Omega designs trusted cloud agents with enclaves and encrypted logs
‘Missing Layer of AGI’ paper argues LLMs need a coordination controller
DoVer auto‑debugs multi‑agent tasks via targeted interventions
KAMI study categorizes how LLM agents fail on realistic tool tasks
ThreadWeaver trains adaptive parallel reasoning with speedups on AIME24
C3 adds calibrated uncertainty to controllable video world models for robots
AI Correctness Checker finds rising math and citation errors in AI papers
🎬 Creative stacks: Gemini templates, NB Pro + Kling reels, and CHORD PBR
Nano Banana Pro quietly becomes the slide engine in multiple Google tools
OpenAI’s Chestnut and Hazelnut image models surface on Arena with mixed early takes
Gemini tests Veo 3.1 video templates for one‑click stylized clips
Stitch’s NB‑powered redesign agent now ships code and attention heatmaps
Ubisoft open‑sources CHORD PBR materials with ComfyUI nodes for AAA pipelines
Creators chain Nano Banana Pro stills into Kling 2.6/O1 video for “cinema”
Felo LiveDoc turns documents into image‑rich decks and reports on one canvas
Grok Imagine lets X users generate short videos from the post composer
ImagineArt builds consumer video editing apps on top of Kling O1
Light Migration LoRA brings controllable relighting to ComfyUI workflows
🛡️ Alignment & control: SGTM, fast unlearning, and trusted execution
Anthropic’s SGTM localizes risky knowledge into deletable ‘forget’ weights
LUNE uses LoRA plus negative examples for fast, cheap factual unlearning
Omega proposes trusted cloud VMs for safer multi‑agent AI systems
🏛️ Public sector & defense: GenAI.mil starts with Gemini
Pentagon launches GenAI.mil with Google Gemini as first model
🗣️ Realtime voice agents: higher‑fidelity TTS and sandwich patterns
Gemini text‑to‑speech preview models get 12/10 quality upgrade with no API changes
VoxCPM 1.5 bumps TTS to 44.1 kHz and halves tokens per second
LangChain breaks down ‘sandwich’ vs speech‑to‑speech architectures for voice agents
ElevenLabs shows Santa voice agent running in React with ~8 lines of code
Voice AI Primer distills RAG, multi‑agent, and state‑machine patterns for voice agents
⚙️ Runtime throughput: InferenceMAX and MoE kernel work
InferenceMAX pushes DeepSeek R1 FP8 to 4,260 tok/s/GPU at realistic loads
Transformers gains batched/grouped MoE kernels to speed expert models
🤖 Embodied AI: construction demo and real‑world challenge spec
GITAI’s robots cooperatively assemble a 5 m lunar construction tower
ATEC 2025’s Offline Extreme Challenge formalizes four hard real‑world robot tasks