Gemini 3 Flash uses agentic RL to rival Pro – 69% long‑context recall

Executive Summary

Gemini 3 Flash turns out not to be a diet Pro at all. Google engineers now say it’s running on a fresh agentic RL stack, not a distilled Pro checkpoint, which helps explain why this “fast” model is punching up: SimpleBench puts Flash Preview at 61.1% (vs Gemini 2.5 Flash’s 41.2%) and Repo Bench sees ~67% real‑repo success, right in the frontier pack.

New third‑party evals sharpen the picture we started on Tuesday. Flash trades blows with Pro and GPT‑5.2 on mainstream coding and tool use (78.0% vs Pro’s 76.2% on SWE‑Bench Verified), but it lags on the nastiest abstraction tasks: it hits 36% on FrontierMath’s Tiers 1–3 but only 15% on Tier 4, where GPT‑5.2 pulls away. The routing story is clear: send day‑to‑day coding and broad knowledge to Flash, but keep a narrow lane on Pro or GPT‑5.2 for ARC‑AGI‑style puzzles and frontier math.

Context Arena’s MRCR runs add a practical twist. Flash Medium reaches 69.2% AUC at 128k and 45.9% at 1M tokens at roughly half the output cost of the High setting, making Medium the obvious default for 128k–1M‑token agents. With Google’s new Antigravity “computer use” agent and other UI‑automation surfaces already standardizing on Flash, this agentic‑RL brain is quietly becoming Google’s real flagship for shipped work, not just benchmarks.
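For readers who want to wire the routing advice into an agent, here is a minimal sketch of what it might look like. Everything named in it is an assumption made for illustration: the model labels, the task categories, and the `pick_route` helper are hypothetical and do not correspond to real API model IDs or parameters. The thresholds simply restate the summary's rule of thumb (hard abstraction and frontier math go to a frontier model, 128k–1M‑token contexts default to Flash at Medium).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for illustration only, not real API model IDs.
FLASH = "flash"
FRONTIER = "pro-or-gpt-5.2"

# Hypothetical task categories the summary suggests keeping off Flash.
HARD_TASKS = {"frontier_math", "abstract_puzzle"}


@dataclass(frozen=True)
class Route:
    model: str
    # None means "leave the reasoning level at the caller's default".
    thinking_level: Optional[str]


def pick_route(task_type: str, context_tokens: int) -> Route:
    """Apply the summary's rule of thumb; not an official routing policy."""
    if task_type in HARD_TASKS:
        # Narrow lane: frontier math and ARC-AGI-style puzzles.
        return Route(FRONTIER, None)
    if 128_000 <= context_tokens <= 1_000_000:
        # Long-context agents: Medium is the suggested default.
        return Route(FLASH, "medium")
    # Everyday coding and broad knowledge at shorter contexts.
    return Route(FLASH, None)
```

A call such as `pick_route("coding", 200_000)` would select Flash at Medium, while `pick_route("frontier_math", 2_000)` would fall through to the frontier lane. Real deployments would likely tune these cut‑offs against their own evals rather than hard‑code them.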

Feature Spotlight

Feature: Gemini 3 Flash’s RL edge and post‑launch gains

Google confirms agentic RL in Gemini 3 Flash; it posts top‑tier scores (e.g., SimpleBench #5, strong MRCR) at roughly half the Pro price, with those RL upgrades planned for a coming Pro refresh.

Multiple accounts now agree that Flash isn’t a distilled Pro: it ships new agentic RL that is showing up in third‑party evals and real usage. This section covers mostly Gemini 3 Flash metrics and adoption; it excludes GPT‑5.2‑Codex (covered yesterday).

Table of Contents

Feature: Gemini 3 Flash’s RL edge and post‑launch gains

Gemini 3 Flash’s edge comes from agentic RL, not Pro distillation

New benchmarks show Gemini 3 Flash trading blows with Pro and GPT‑5.2

MRCR: Gemini 3 Flash Medium hits 69% AUC at 128k with lower cost

Google Antigravity adopts Gemini 3 Flash as its default fast model

Repo Bench: Gemini 3 Flash Preview scores ~67% on real coding repos


🧰 Agent skills, IDEs and coding workflows

Codex Skills go GA in CLI and IDE with Agent Skills standard

Codex’s experimental TUI2 sparks debate over alternate-screen UX trade-offs

Claude Code hooks into LangSmith to expose every tool and LLM call

Codex rewrites usage accounting, resets limits and grants fresh headroom

OpenRouter adds Anthropic-compatible API so Claude Code can target any model

RepoPrompt’s rp-build automates repo-scale context building for coding agents

CodexBar 0.9.1 now uses your login shell PATH instead of brittle path hacks


📊 Evals: long‑horizon, tool use and research QA

METR: Claude Opus 4.5 reaches 4h49m 50% horizon but only 27m at 80%

MCP Atlas benchmarks real tool use across 36 servers, 220 tools and 1k tasks

Parallel Task API tops Google DeepSearchQA at 72.6% accuracy and lower cost

MiMo‑V2‑Flash scores 66 on AA Intelligence Index with standout tool-use and math

Repo Bench tracks real end‑to‑end coding success; GPT‑5.2‑Codex looks stable


🧪 New models and modes (open weights and APIs)

Xiaomi releases MiMo‑V2‑Flash, a 309B open‑weights reasoning model

FunctionGemma ships as a tiny function‑calling specialist

MiniMax opens early access for M2.1, its interleaved coding model

GLM‑4.7 surfaces in vLLM as a new tool‑aware variant

KAT‑Coder‑Pro V1 emerges as a top non‑reasoning coding model

Moondream 3 gets native MLX support for Mac developers

History LLMs project previews Ranke‑4B, time‑locked open models


🏗️ Compute, power and buildouts

Google creates exec council to ration scarce AI compute across the company

Satya Nadella warns power and ‘warm shells’, not GPUs, are now the AI bottleneck

China doubles power generation in eight years as AI era looms

Chinese fabs upgrade older ASML DUV tools to keep 7 nm AI chips flowing

DOE’s Genesis Mission now has 24 AI partners, plus concrete lab deployments

Oracle’s $10B, 1GW Michigan data center for OpenAI stalls over financing

Meta’s 2025 AI capex jumps to at least $70B amid investor nerves

New analysis argues AI data centers’ water use is far less dire than headlines


🚀 Serving and runtime speedups

vLLM-Omni adds TeaCache and Cache‑DiT for up to 2.4× faster diffusion

SGLang’s dLLM framework ships day‑0 support for LLaDA 2.0

CodexBar 0.9.1 fixes CLI path issues by inheriting your login shell


💼 Capital, M&A and enterprise adoption

OpenAI said to seek up to $100B at ~$830B valuation

ChatGPT App Directory adds richer apps and SDK details

Cursor acquires Graphite to fuse AI coding and code review

Epoch survey: most Americans now use AI weekly, but few pay

Accenture credits AI for revenue lift and $2.2B in booked work

BNY Mellon’s Eliza hub puts OpenAI into 20k employees’ hands


🧠 RL, world models and reasoning recipes

Nature Comms paper learns reward functions automatically for embodied RL

Qwen formalizes when RL on LLMs is stable and how to fix MoE collapse

Online and adversarial world-model updates cut gradient planning time ~10×

Universal Reasoning Model pushes ARC-AGI pass@1 with tiny conv and truncated BPTT


🧷 Parsing, search and data plumbing

Datalab OCR now preserves PDF hyperlinks into HTML output

LlamaParse v2 adds chart‑aware parsing at ≤$0.003 per page

Exa and Fireworks publish cookbook for AI research assistants

PaddlePaddle posts Unsloth tutorial for fine‑tuning PaddleOCR


🛡️ Security and oversight

Vercel’s $1M React2Shell bounty hardens its WAF and runtime defenses


🎨 Creative stacks: layered edits, nodes and playable video

Qwen‑Image‑Layered spreads to fal and Replicate with 15× faster generation

Beam launches early access canvas to turn AI videos into playable mini‑games

ComfyUI adds GPT‑Image‑1.5 node with multi-edit, contact sheets and style grids

ComfyUI showcases WanMove path-based image animation from a single still

Meta lets users add their selfie and voice into AI-generated media with granular controls


🤖 Embodied AI and field robots

UPS plans ~$120M purchase of ~400 Pickle unload robots for dock automation

Disney’s Olaf robot paper details mimic-RL control and dense mechatronic design

China trials self-driving traffic cones that deploy a lane in under 10 seconds

Reachy Mini hits 3,000 units shipped as builders spin up coding labs

Unitree humanoids pull Webster flips at Wang Leehom concert

Russian delivery robot collision video highlights urban safety edge cases

Gemini 3 Flash uses agentic RL to rival Pro – 69% long‑context recall | Daily AI Primer – Engineer (Fri, Dec 19, 2025)