GPT‑5.1 Codex hits 70.4% on SWE‑Bench – ~26× cheaper

Executive Summary

Benchmarks moved again: GPT‑5.1 Codex grabbed the SWE‑Bench lead at 70.4%, edging Claude Sonnet 4.5 (Thinking) at 69.8% while running about $0.31 per test versus $8.26—roughly 26× cheaper. Vals AI also reports GPT‑5.1 topping its Finance Agent benchmark by 0.6%, with LiveCodeBench performance jumping from 12th to 2nd.

If you fix real repos or wire fintech flows, that cost/perf mix argues for routing more traffic to 5.1 Codex and letting behavior, latency, and price steer the rest. Artificial Analysis’ latest run nudges GPT‑5.1 to a 70 on its Intelligence Index and shows 81M output tokens vs 85M for GPT‑5, trimming estimated run cost to ~$859 from ~$913.

Don’t hand it the keys to low‑level optimization, though. A new ML/HPC leaderboard puts expert humans at 1.00× speedup while current LLM agent systems manage ≤0.15×, so keep humans in the loop for performance tuning. And if latency matters, retrieval+classifier pipelines are winning: DeReC beats rationale‑generating LLMs for fact‑checking with ~95% less runtime.

Feature Spotlight

Feature: Gemini 3 signals hit critical mass

Gemini 3 appears imminent: Sundar Pichai amplifies Polymarket odds of a launch by Nov 22; “Gemini 3.0 Pro” strings show up in Enterprise model selectors; “Riftrunner” shows up in arenas. If confirmed, Google’s distribution could reset model choice for many teams.




Feature: Gemini 3 signals hit critical mass

Multiple independent sightings and CEO hints point to an imminent Gemini 3 release; today’s sample centers on strings in enterprise UIs, a “Riftrunner” label in arenas, and market chatter. Excludes other model news, which is covered separately.

“Gemini 3.0 Pro” spotted inside Enterprise agent selector

Multiple screenshots show a “Gemini 3.0 Pro” label appearing in the Gemini Enterprise agent model picker, though access remains blocked for general users sighting summary. Devtools strings align with a production‑bound model option, strengthening the case that final wiring is underway devtools strings, with write‑ups documenting recurring sightings across builds TestingCatalog post.

For AI platform owners, this is the clearest enterprise‑grade breadcrumb yet: start drafting routing and fallbacks so you can A/B 3.0 Pro versus your current defaults on day one.
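
A minimal sketch of that day‑one wiring, assuming a hypothetical "gemini-3.0-pro" id (the selector string, not a confirmed API name) and a generic call_model(model_id, prompt) function standing in for whatever client your stack already uses:

```python
import hashlib

# Hypothetical model ids; "gemini-3.0-pro" is not a confirmed or routable API name yet.
CANDIDATE = "gemini-3.0-pro"   # assumption: the name the selector strings suggest
DEFAULT = "gemini-2.5-pro"     # assumption: whatever your current default is
AB_FRACTION = 0.10             # send 10% of traffic to the candidate

def pick_model(user_id: str) -> str:
    """Deterministically bucket users so the A/B split stays stable across requests."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return CANDIDATE if bucket < AB_FRACTION * 100 else DEFAULT

def route_request(user_id: str, prompt: str, call_model) -> str:
    """call_model(model_id, prompt) is a stand-in for your actual client call."""
    model = pick_model(user_id)
    try:
        return call_model(model, prompt)
    except Exception:
        # Fall back to the known-good default if the candidate id isn't live yet.
        return call_model(DEFAULT, prompt)
```

Deterministic hashing keeps each user in the same arm across requests, so day‑one comparisons aren’t confounded by users bouncing between models.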

Sundar amplifies 69% Polymarket odds for Gemini 3 by Nov 22

Google’s CEO amplified a Polymarket contract showing a 69% chance Gemini 3.0 ships by Nov 22, which the community reads as a deliberate signal to expect a near‑term launch CEO hint. A separate roundup repeats the same read, framing Sundar’s post as soft confirmation of timing market odds.

So what? Leaders and PMs can prep eval sandboxes and rollout comms now, especially if announcements are planned at or right after AIE Code week.

‘Riftrunner’ resurfaces in arenas and tools as a likely Gemini 3 tag

A “Riftrunner” model id keeps appearing in design arenas and developer consoles, with testers describing it as a larger, more capable variant that matches expected Gemini 3.0 Pro behavior devtools console and reporting that it outperforms peers on an SVG rendering comparison in creator tests svg comparison, following up on Riftrunner early strings and arena probes.

If you run eval harnesses, add a placeholder lane for Riftrunner so you can drop in the model id the moment it’s routable.

Timing chatter converges on “next week,” likely during AIE Code

Several well‑followed accounts say Gemini 3 is landing next week, with one tying the reveal to the AI Engineer Code event where Google has launched onstage before timing claim. Posts narrow it further to early week, even “likely on Tuesday,” reinforcing scheduling urgency next week call tuesday hint, while broader sightings threads keep stoking the countdown speculation post.

Practical move: line up side‑by‑side prompts and traffic shaping such that switching a portion of user flows to 3.0 takes minutes, not days.

Nano‑Banana 2 buzz suggests refreshed image stack alongside Gemini 3

Creators report strong results from “Nano‑Banana 2,” noting more realistic images, better text rendering, and accurate reflections—pointing to a revamped Google image stack that could ship alongside Gemini 3 creator take. Others explicitly pair Nano‑Banana 2 mentions with Gemini 3 timing chatter paired mention, with more output dumps circulating outputs thread and third‑party workflows already wiring “nano banana” as a selectable node workflow example.

If your product leans on generative visuals, budget time to re‑shoot style guides and update safety filters—the output distribution may shift.


Benchmarks: GPT‑5.1 Codex tops SWE‑Bench; finance agent SOTA

Strong day for public evals: GPT‑5.1 Codex edges Sonnet 4.5 (Thinking) on SWE‑Bench at a fraction of cost; GPT‑5.1 leads a finance agent benchmark; meta‑analysis adds token/price deltas. Excludes Gemini 3 coverage (see Feature).

GPT‑5.1 Codex tops SWE‑Bench at 70.4% and ~26× cheaper than Sonnet 4.5

OpenAI’s GPT‑5.1 Codex leads SWE‑Bench with 70.4% accuracy versus Claude Sonnet 4.5 (Thinking) at 69.8%, while costing ~$0.31 per test vs ~$8.26 (~26× cheaper) benchmarks table. Following up on launch-top5 where new code leaderboards surfaced, this run confirms 5.1 Codex as the top value pick for repo‑level bug fixing, with latencies shown alongside the cost deltas SWE‑Bench note, and the result now reflected on Vals AI’s public leaderboard pages benchmarks page.
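
The headline ratio and a cost‑per‑resolved‑task view fall straight out of the reported per‑test figures (ignoring latency and token mix):

```python
# Reproducing the headline ratio and a cost-per-resolved-task view
# from the reported per-test cost and SWE-Bench accuracy only.
codex_cost, codex_acc = 0.31, 0.704      # GPT-5.1 Codex: $/test, accuracy
sonnet_cost, sonnet_acc = 8.26, 0.698    # Claude Sonnet 4.5 (Thinking)

print(f"cost ratio: {sonnet_cost / codex_cost:.1f}x")            # ~26.6x
print(f"$/resolved (Codex):  {codex_cost / codex_acc:.2f}")      # ~$0.44
print(f"$/resolved (Sonnet): {sonnet_cost / sonnet_acc:.2f}")    # ~$11.83
```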

GPT‑5.1 leads Vals AI Finance Agent Benchmark by 0.6%

Vals AI reports GPT‑5.1 sets a new state of the art on its Finance Agent Benchmark, edging Claude Sonnet 4.5 (Thinking) by 0.6% on goal completion, with additional gains on LiveCodeBench (jumping from 12th to 2nd) and minor improvements on MMMU/GPQA/IOI finance benchmark post, follow‑up details. For teams prototyping agentic fintech workflows, this narrows the top tier to 5.1 vs Sonnet 4.5, and suggests routing by tool‑use behavior and cost may matter more than small headline margins.

Artificial Analysis: GPT‑5.1 +2 on Intelligence Index; 81M vs 85M output tokens

Artificial Analysis’ latest run gives GPT‑5.1 a score of 70, +2 over GPT‑5 at similar reasoning effort, driven largely by TerminalBench improvements; it also used 81M output tokens vs 85M for GPT‑5, cutting run cost to ~$859 from ~$913 index recap. The live dashboard breaks down per‑eval deltas and cost/latency tradeoffs useful for routing and budgeting analysis site.

BEAM benchmark hits 10M‑token chats; LIGHT memory stack outperforms long context

BEAM introduces ultra‑long conversation evals up to 10M tokens and shows LIGHT—a hybrid of episodic retrieval, working memory, and scratchpad—consistently outperforms relying on huge context windows alone, with average gains reported across models and a clear fade in long‑context models as length grows paper abstract. For agents that must persist across days, this favors explicit memory stacks over bigger windows.
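
A toy illustration of that kind of memory stack, with a lexical overlap scorer standing in for a real dense retriever; this is not the paper’s implementation:

```python
from collections import deque

class MemoryStack:
    """Illustrative composition of the three components described: episodic
    retrieval over the full history, a small working memory of recent turns,
    and a persistent scratchpad."""

    def __init__(self, working_size: int = 8, top_k: int = 4):
        self.episodes: list[str] = []               # full conversation history
        self.working = deque(maxlen=working_size)   # most recent turns
        self.scratchpad: list[str] = []             # durable notes the agent writes
        self.top_k = top_k

    def observe(self, turn: str) -> None:
        self.episodes.append(turn)
        self.working.append(turn)

    def note(self, fact: str) -> None:
        self.scratchpad.append(fact)

    def _retrieve(self, query: str) -> list[str]:
        # Stand-in lexical scorer; a real system would use a dense retriever.
        q = set(query.lower().split())
        scored = sorted(self.episodes, key=lambda e: -len(q & set(e.lower().split())))
        return scored[: self.top_k]

    def build_context(self, query: str) -> str:
        """Assemble a compact prompt instead of shipping millions of tokens."""
        return "\n".join(
            ["# scratchpad", *self.scratchpad,
             "# retrieved episodes", *self._retrieve(query),
             "# recent turns", *self.working]
        )
```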

Bridgewater’s AIA Forecaster reaches expert‑level on ForecastBench with agentic search

Bridgewater’s AIA Forecaster combines agentic search over high‑quality news, a supervisor that reconciles disparate forecasts, and calibration (e.g., Platt scaling) to match superforecaster accuracy on ForecastBench, beating prior LLM baselines; on a liquid markets set, markets still lead but ensembles with the model improve accuracy paper abstract. For ops, this argues for supervised multi‑agent pipelines over single‑shot judgments.
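
The calibration step named there, Platt scaling, is standard: fit a logistic map from raw forecasts to resolved outcomes, then pass new forecasts through it. A minimal scikit‑learn sketch, illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Platt scaling: learn a sigmoid that maps raw model forecasts to calibrated
# probabilities, fit on questions that have already resolved.
raw = np.array([0.15, 0.4, 0.55, 0.7, 0.8, 0.9]).reshape(-1, 1)  # raw forecasts
outcome = np.array([0, 0, 1, 1, 1, 1])                            # resolved answers

platt = LogisticRegression().fit(raw, outcome)

def calibrate(p: float) -> float:
    """Map a raw forecast through the fitted sigmoid."""
    return float(platt.predict_proba([[p]])[0, 1])

print(calibrate(0.6))  # calibrated probability for a new forecast
```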

Conciseness reward model trims tokens ~20% and lifts 7B accuracy by 8.1%

A conciseness reward model that grants brevity bonuses only when final answers are correct prevents length collapse during training, delivering +8.1% accuracy with ~19.9% fewer tokens on a 7B backbone across math tasks; the bonus fades over training and scales by difficulty paper abstract. This is a practical recipe to cut inference cost in reasoning agents without sacrificing quality.
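
A sketch of the described reward shape, with the correctness gate, annealing schedule, and difficulty scaling made explicit; the exact functional forms below are assumptions, not the paper’s formulas:

```python
def conciseness_reward(correct: bool, n_tokens: int, budget: int,
                       step: int, total_steps: int, difficulty: float) -> float:
    """Brevity bonus granted only when the final answer is correct, annealed
    over training and scaled by problem difficulty (illustrative forms)."""
    if not correct:
        return 0.0                                   # no reward for short wrong answers
    base = 1.0                                       # correctness reward
    anneal = max(0.0, 1.0 - step / total_steps)      # bonus fades over training
    brevity = max(0.0, 1.0 - n_tokens / budget)      # shorter than budget -> bonus
    return base + difficulty * anneal * brevity
```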

Dense retrieval + classifier beats LLM rationales for fact‑checking at 95% less runtime

DeReC (Dense Retrieval Classification) replaces rationale‑generating LLM pipelines with dense evidence retrieval and a classifier, improving RAWFC F1 to 65.58% (from 61.20%) while cutting runtime ~95% (454 min → 23 min). Similar speedups are shown on LIAR‑RAW paper abstract. If you need scalable veracity checks, retrieval+classifier is a strong baseline before spinning up expensive generation.
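
The two‑stage shape is straightforward to sketch; embed and classifier below are placeholders for your own trained retriever and veracity classifier, not DeReC’s actual components:

```python
import numpy as np

def verify_claim(claim: str, corpus: list[str], embed, classifier, k: int = 5) -> str:
    """Two-stage check in the DeReC style: dense-retrieve evidence, then run a
    lightweight classifier over claim+evidence. `embed` (text -> vector) and
    `classifier` (features -> label) are placeholders for trained models."""
    q = embed(claim)
    doc_vecs = np.stack([embed(d) for d in corpus])
    # Cosine similarity against the corpus, keep top-k evidence passages.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    evidence = [corpus[i] for i in np.argsort(-sims)[:k]]
    # No rationale generation: the label comes straight from a trained classifier.
    features = np.concatenate([q, np.mean([embed(e) for e in evidence], axis=0)])
    return classifier(features)  # e.g. "true" / "false" / "half-true"
```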

New ML/HPC leaderboard shows LLM agents slower than expert humans

A new SWE/ML optimization leaderboard with a human baseline shows expert humans at 1.00× speedup, while top LLM‑driven systems achieve ≤0.15× on ML/HPC tasks, implying current agents slow practitioners down for performance tuning despite strong coding scores elsewhere leaderboard post. Use this as a routing signal: keep human‑in‑the‑loop for low‑level optimization and reserve agents for scaffolding, search, and glue code.

Rubric‑based instruction‑following benchmark and RL recipe land for agents

A new rubric‑based benchmark and reinforcement learning approach for instruction following is out, providing a repeatable way to grade agent outputs and train toward rubric compliance—useful when subjective spec adherence matters (e.g., tone, structure) paper thread. Expect more agent evals to standardize on rubric scoring with verifiable checks.
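
A rough picture of rubric scoring with verifiable checks, assuming a hypothetical rubric format since the benchmark’s own schema isn’t detailed here:

```python
from typing import Callable

def rubric_score(output: str,
                 rubric: list[tuple[str, float, Callable[[str], bool]]]) -> float:
    """Each criterion has a description, a weight, and a verifiable check;
    the score is the weighted fraction of checks that pass."""
    total = sum(w for _, w, _ in rubric)
    earned = sum(w for _, w, check in rubric if check(output))
    return earned / total if total else 0.0

# Hypothetical rubric for a structured answer
rubric = [
    ("has a summary section", 1.0, lambda o: "## Summary" in o),
    ("stays under 200 words", 0.5, lambda o: len(o.split()) <= 200),
    ("cites at least one source", 1.0, lambda o: "http" in o),
]
print(rubric_score("## Summary\nSee http://example.com", rubric))  # 1.0
```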


