GPT‑5.1 Codex hits 70.4% on SWE‑Bench – ~26× cheaper than Sonnet 4.5
Executive Summary
Benchmarks moved again: GPT‑5.1 Codex grabbed the SWE‑Bench lead at 70.4%, edging Claude Sonnet 4.5 (Thinking) at 69.8% while running about $0.31 per test versus $8.26—roughly 26× cheaper. Vals AI also reports GPT‑5.1 topping its Finance Agent benchmark by 0.6%, with LiveCodeBench performance jumping from 12th to 2nd.
If you fix real repos or wire fintech flows, that cost/perf mix argues for routing more traffic to 5.1 Codex and letting behavior, latency, and price steer the rest. Artificial Analysis’ latest run nudges GPT‑5.1 to a 70 on its Intelligence Index and shows 81M output tokens vs 85M for GPT‑5, trimming estimated run cost to ~$859 from ~$913.
Don’t hand it the keys to low‑level optimization, though. A new ML/HPC leaderboard puts expert humans at 1.00× speedup while current LLM agent systems manage ≤0.15×, so keep humans in the loop for performance tuning. And if latency matters, retrieval+classifier pipelines are winning: DeReC beats rationale‑generating LLMs for fact‑checking with ~95% less runtime.
Feature Spotlight
Feature: Gemini 3 signals hit critical mass
Gemini 3 appears imminent: Sundar Pichai teases a Nov‑22 window; “Gemini 3.0 Pro” strings show up in Enterprise model selectors; “Riftrunner” shows in arenas. If confirmed, Google’s distribution could reset model choice for many teams.
Feature: Gemini 3 signals hit critical mass
Multiple independent sightings and CEO hints point to an imminent Gemini 3 release; today’s sample centers on strings in enterprise UIs, a “Riftrunner” label in arenas, and market chatter. Excludes other model news, which is covered separately.
“Gemini 3.0 Pro” spotted inside Enterprise agent selector
Multiple screenshots show a “Gemini 3.0 Pro” label appearing in the Gemini Enterprise agent model picker, though access remains blocked for general users sighting summary. Devtools strings align with a production‑bound model option, strengthening the case that final wiring is underway devtools strings, with write‑ups documenting recurring sightings across builds TestingCatalog post.
For AI platform owners, this is the clearest enterprise‑grade breadcrumb yet: start drafting routing and fallbacks so you can A/B 3.0 Pro versus your current defaults on day one.
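If it helps to sketch that day‑one wiring, here is a minimal router with a fallback, assuming a generic completion client and a hypothetical gemini‑3.0‑pro model id; the client call, ids, and split are illustrative, not a confirmed API.

```python
import random

# Hypothetical model ids; swap in the real id once it is routable.
CANDIDATE = "gemini-3.0-pro"   # assumed id, not yet confirmed
DEFAULT = "gemini-2.5-pro"     # whatever your current default is

def pick_model(ab_fraction: float = 0.05) -> str:
    """Send a small slice of traffic to the candidate, the rest to the default."""
    return CANDIDATE if random.random() < ab_fraction else DEFAULT

def complete(client, prompt: str) -> str:
    model = pick_model()
    try:
        return client.generate(model=model, prompt=prompt)  # client API is assumed
    except Exception:
        # Fall back to the known-good default if the new model errors or is gated.
        return client.generate(model=DEFAULT, prompt=prompt)
```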
Sundar amplifies Polymarket’s 69% odds of Gemini 3 shipping by Nov 22
Google’s CEO amplified a Polymarket contract showing a 69% chance Gemini 3.0 ships by Nov 22, which the community reads as a deliberate signal to expect a near‑term launch CEO hint. A separate roundup repeats the same read, framing Sundar’s post as soft confirmation of timing market odds.
So what? Leaders and PMs can prep eval sandboxes and rollout comms now, especially if you plan announcements at or right after AIE Code week.
‘Riftrunner’ resurfaces in arenas and tools as a likely Gemini 3 tag
A “Riftrunner” model id keeps appearing in design arenas and developer consoles, with testers describing it as a larger, more capable variant that matches expected Gemini 3.0 Pro behavior devtools console and outperforms peers in an SVG rendering comparison in creator tests svg comparison, following up on Riftrunner early strings and arena probes.
If you run eval harnesses, add a placeholder lane for Riftrunner so you can drop in the model id the moment it’s routable.
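A hedged example of what that placeholder lane can look like in a harness config; every model id below is a placeholder, and the Riftrunner entry stays disabled until a routable id exists.

```python
# Placeholder lanes for an eval harness; "riftrunner" stays disabled until a real id lands.
EVAL_LANES = {
    "gpt-5.1-codex": {"model_id": "gpt-5.1-codex", "enabled": True},
    "sonnet-4.5":    {"model_id": "claude-sonnet-4-5", "enabled": True},
    "riftrunner":    {"model_id": None, "enabled": False},  # fill in when routable
}

def active_lanes():
    """Only lanes that are enabled and have a concrete model id get run."""
    return {name: cfg for name, cfg in EVAL_LANES.items()
            if cfg["enabled"] and cfg["model_id"]}
```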
Timing chatter converges on “next week,” likely during AIE Code
Several well‑followed accounts say Gemini 3 is landing next week, with one tying the reveal to the AI Engineer Code event where Google has launched onstage before timing claim. Posts narrow it further to early week, even “likely on Tuesday,” reinforcing scheduling urgency next week call, tuesday hint, while broader sightings threads keep stoking the countdown speculation post.
Practical move: line up side‑by‑side prompts and traffic shaping such that switching a portion of user flows to 3.0 takes minutes, not days.
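One low‑effort way to make that switch fast is deterministic user bucketing behind a single config value; a sketch, with the model ids and the flag source assumed.

```python
import hashlib

def in_rollout(user_id: str, percent: float) -> bool:
    """Deterministically bucket a user into the Gemini 3.0 slice by hashing their id."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

# Flip this value (e.g., from a feature-flag service) to move traffic in minutes.
ROLLOUT_PERCENT = 10.0
model = "gemini-3.0-pro" if in_rollout("user-123", ROLLOUT_PERCENT) else "current-default"
```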
Nano‑Banana 2 buzz suggests refreshed image stack alongside Gemini 3
Creators report strong results from “Nano‑Banana 2,” noting more realistic images, better text rendering, and accurate reflections—pointing to a revamped Google image stack that could ship alongside Gemini 3 creator take. Others explicitly pair Nano‑Banana 2 mentions with Gemini 3 timing chatter paired mention, with more output dumps circulating outputs thread and third‑party workflows already wiring “nano banana” as a selectable node workflow example.
If your product leans on generative visuals, budget time to re‑shoot style guides and update safety filters—the output distribution may shift.
Benchmarks: GPT‑5.1 Codex tops SWE‑Bench; finance agent SOTA
Strong day for public evals: GPT‑5.1 Codex edges Sonnet 4.5 (Thinking) on SWE‑Bench at a fraction of cost; GPT‑5.1 leads a finance agent benchmark; meta‑analysis adds token/price deltas. Excludes Gemini 3 coverage (see Feature).
GPT‑5.1 Codex tops SWE‑Bench at 70.4% and ~26× cheaper than Sonnet 4.5
OpenAI’s GPT‑5.1 Codex leads SWE‑Bench with 70.4% accuracy versus Claude Sonnet 4.5 (Thinking) at 69.8%, while costing ~$0.31 per test vs ~$8.26 (~26× cheaper) benchmarks table. Following up on launch-top5 where new code leaderboards surfaced, this run confirms 5.1 Codex as the top value pick for repo‑level bug fixing, with latencies shown alongside the cost deltas SWE‑Bench note, and the public board is now reflected on Vals AI’s pages benchmarks page.
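The headline ratio is easy to sanity‑check from the per‑test figures; the task count below is an illustrative parameter, not the benchmark’s actual size.

```python
codex_cost, sonnet_cost = 0.31, 8.26   # $ per SWE-Bench test, from the benchmarks table
ratio = sonnet_cost / codex_cost       # ≈ 26.6×
n_tasks = 500                          # illustrative batch size, not the benchmark's task count
print(f"{ratio:.1f}x cheaper; {n_tasks} tasks: "
      f"${codex_cost * n_tasks:.0f} vs ${sonnet_cost * n_tasks:.0f}")
```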
GPT‑5.1 leads Vals AI Finance Agent Benchmark by 0.6%
Vals AI reports GPT‑5.1 sets a new state of the art on its Finance Agent Benchmark, edging Claude Sonnet 4.5 (Thinking) by 0.6% on goal completion, with additional gains on LiveCodeBench (jumping from 12th to 2nd) and minor improvements on MMMU/GPQA/IOI finance benchmark post, follow‑up details. For teams prototyping agentic fintech workflows, this narrows the top tier to 5.1 vs Sonnet 4.5, and suggests routing by tool‑use behavior and cost may matter more than small headline margins.
Artificial Analysis: GPT‑5.1 +2 on Intelligence Index; 81M vs 85M output tokens
Artificial Analysis’ latest run gives GPT‑5.1 a score of 70, +2 over GPT‑5 at similar reasoning effort, driven largely by TerminalBench improvements; it also used 81M output tokens vs 85M for GPT‑5, cutting run cost to ~$859 from ~$913 index recap. The live dashboard breaks down per‑eval deltas and cost/latency tradeoffs useful for routing and budgeting analysis site.
BEAM benchmark hits 10M‑token chats; LIGHT memory stack outperforms long context
BEAM introduces ultra‑long conversation evals up to 10M tokens and shows LIGHT—a hybrid of episodic retrieval, working memory, and scratchpad—consistently outperforms relying on huge context windows alone, with average gains reported across models and a clear fade in long‑context models as length grows paper abstract. For agents that must persist across days, this favors explicit memory stacks over bigger windows.
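For intuition, the hybrid idea reduces to three cooperating stores: long‑term episodes you retrieve over, a bounded window of recent turns, and a persistent scratchpad. The sketch below is a toy version under those assumptions, not LIGHT’s actual implementation.

```python
from collections import deque
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class HybridMemory:
    """Toy episodic-retrieval + working-memory + scratchpad stack (interfaces assumed)."""

    def __init__(self, embed, working_size: int = 20):
        self.embed = embed                           # any text -> vector function
        self.episodes = []                           # (vector, text) pairs for retrieval
        self.working = deque(maxlen=working_size)    # most recent turns only
        self.scratchpad = []                         # model-written notes that persist

    def add_turn(self, text: str):
        self.working.append(text)
        self.episodes.append((self.embed(text), text))

    def build_context(self, query: str, k: int = 5) -> str:
        qv = self.embed(query)
        top = sorted(self.episodes, key=lambda e: -cosine(qv, e[0]))[:k]
        retrieved = [t for _, t in top]
        return "\n".join(["[notes]"] + self.scratchpad +
                         ["[retrieved]"] + retrieved +
                         ["[recent]"] + list(self.working))
```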
Bridgewater’s AIA Forecaster reaches expert‑level on ForecastBench with agentic search
Bridgewater’s AIA Forecaster combines agentic search over high‑quality news, a supervisor that reconciles disparate forecasts, and calibration (e.g., Platt scaling) to match superforecaster accuracy on ForecastBench, beating prior LLM baselines; on a liquid markets set, markets still lead but ensembles with the model improve accuracy paper abstract. For ops, this argues for supervised multi‑agent pipelines over single‑shot judgments.
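The calibration step the abstract names, Platt scaling, is the easiest piece to reuse: fit a logistic map from raw forecast probabilities to outcomes on held‑out questions, then pass new forecasts through it. A sketch; everything beyond the named technique is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def _logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def fit_platt(raw_probs: np.ndarray, outcomes: np.ndarray) -> LogisticRegression:
    """Platt scaling: logistic regression from raw forecast log-odds to observed outcomes."""
    return LogisticRegression().fit(_logit(raw_probs).reshape(-1, 1), outcomes)

def calibrate(model: LogisticRegression, raw_prob: float) -> float:
    """Map a new raw probability through the fitted calibrator."""
    return float(model.predict_proba([[float(_logit(np.array(raw_prob)))]])[0, 1])
```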
Conciseness reward model trims tokens ~20% and lifts 7B accuracy by 8.1%
A conciseness reward model that grants brevity bonuses only when final answers are correct prevents length collapse during training, delivering +8.1% accuracy with ~19.9% fewer tokens on a 7B backbone across math tasks; the bonus fades over training and scales by difficulty paper abstract. This is a practical recipe to cut inference cost in reasoning agents without sacrificing quality.
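The recipe is straightforward to sketch: reward brevity only when the answer is correct, anneal the bonus over training, and shrink it on harder problems. The weights below are illustrative, not the paper’s.

```python
def conciseness_reward(correct: bool, n_tokens: int, max_tokens: int,
                       train_frac: float, difficulty: float) -> float:
    """Correctness-gated brevity bonus.
    train_frac in [0, 1] is training progress; difficulty in [0, 1]."""
    if not correct:
        return 0.0                        # never reward a short wrong answer
    brevity = 1.0 - min(n_tokens / max_tokens, 1.0)
    anneal = 1.0 - train_frac             # bonus fades as training proceeds
    scale = 1.0 - 0.5 * difficulty        # harder problems keep more of their token budget
    return 1.0 + 0.2 * brevity * anneal * scale   # 0.2 weight is illustrative
```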
Dense retrieval + classifier beats LLM rationales for fact‑checking at 95% less runtime
DeReC (Dense Retrieval Classification) replaces rationale‑generating LLM pipelines with dense evidence retrieval and a classifier, improving RAWFC F1 to 65.58% (from 61.20%) while cutting runtime ~95% (454 min → 23 min). Similar speedups are shown on LIAR‑RAW paper abstract. If you need scalable veracity checks, retrieval+classifier is a strong baseline before spinning up expensive generation.
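In spirit the pipeline is two cheap stages, embed‑and‑retrieve then classify, with no rationale generation; a sketch using an off‑the‑shelf sentence encoder, where the encoder and classifier choices are assumptions rather than DeReC’s exact components.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # encoder choice is illustrative

# corpus_vecs = encoder.encode(corpus_texts, normalize_embeddings=True)

def retrieve(claim: str, corpus_texts, corpus_vecs, k: int = 5):
    """Dense retrieval: cosine similarity on normalized embeddings."""
    qv = encoder.encode([claim], normalize_embeddings=True)[0]
    top = np.argsort(-(corpus_vecs @ qv))[:k]
    return [corpus_texts[i] for i in top]

def features(claim: str, evidence):
    """Simple claim/evidence interaction features to feed a lightweight classifier."""
    cv = encoder.encode([claim], normalize_embeddings=True)[0]
    ev = encoder.encode(evidence, normalize_embeddings=True).mean(axis=0)
    return np.concatenate([cv, ev, cv * ev])

# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # labels: supported / refuted / NEI
```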
New ML/HPC leaderboard shows LLM agents slower than expert humans
A new SWE/ML optimization leaderboard with a human baseline shows expert humans at 1.00× speedup, while top LLM‑driven systems achieve ≤0.15× on ML/HPC tasks, implying current agents slow practitioners down for performance tuning despite strong coding scores elsewhere leaderboard post. Use this as a routing signal: keep human‑in‑the‑loop for low‑level optimization and reserve agents for scaffolding, search, and glue code.
Rubric‑based instruction‑following benchmark and RL recipe land for agents
A new rubric‑based benchmark and reinforcement learning approach for instruction following is out, providing a repeatable way to grade agent outputs and train toward rubric compliance—useful when subjective spec adherence matters (e.g., tone, structure) paper thread. Expect more agent evals to standardize on rubric scoring with verifiable checks.
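Rubric grading usually reduces to a list of named, weighted checks over the output; a minimal sketch, with hypothetical checks rather than the benchmark’s actual rubric.

```python
# Each rubric item: (name, weight, predicate over the model output). Checks are hypothetical.
RUBRIC = [
    ("has_summary_section", 1.0, lambda out: "Summary:" in out),
    ("under_300_words",     0.5, lambda out: len(out.split()) <= 300),
    ("no_first_person",     0.5, lambda out: " I " not in f" {out} "),
]

def rubric_score(output: str) -> float:
    """Weighted fraction of rubric checks the output passes; usable as a reward signal."""
    total = sum(w for _, w, _ in RUBRIC)
    earned = sum(w for _, w, check in RUBRIC if check(output))
    return earned / total
```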
