Claude Opus 4.5 tops WebDev at 1493 Elo – coding agents converge
Executive Summary
After Monday’s pricing and SWE‑bench splash, today is about receipts: Claude Opus 4.5 (thinking‑32k) is now #1 on LMArena’s Code WebDev board at 1493 Elo, with the non‑thinking variant at #2 and Gemini 3 Pro bumped to third. The same family also grabbed the top Arena Expert slot and a top‑3 spot on text, while LisanBench now ranks Opus 4.5 Thinking as the longest‑chain reasoner on 18 of 50 tasks despite a 16,384‑token thinking cap and roughly $35 total batch cost.
The twist: broad benchmarks still say “specialist, not emperor.” SimpleBench shows Opus 4.5 inching from 60% to 62% over Opus 4.1, while Gemini 3 Pro sits way out front at 76.4%, so if you need one generalist brain, Google still holds the belt. But on code‑heavy and spec‑driven work, the live arenas, SWE‑bench, and terminal evals are all pointing to Opus.
Tooling is catching up fast. Cline 3.38.3 now treats Opus 4.5 and its 32k thinking mode as first‑class for long‑horizon coding, and builders are reporting “three weeks of engineering in three hours” bursts and full SaaS rewrites in under two. At the same time, veterans warn non‑thinking Opus can underperform Sonnet 4.5 without a good harness—plan modes, hooks, and multi‑model routing still separate wizardry from wishful thinking.
Top links today
- Anthropic agent harness engineering blog
- Perplexity personalized memory feature overview
- LangChain guide to agents, runtimes, harnesses
- vLLM Ray symmetric-run multi-node guide
- Artificial Analysis LLM hardware cost benchmarks
- General Agentic Memory via Deep Research paper
- DR Tulu deep research RL framework paper
- Efficient reasoning distillation from frontier models
- Gradient-based chain-of-thought elicitation paper
- Computer-use agents as UI judges paper
- Unified multimodal understanding vs generation paper
- Agent0-VL self-evolving tool-integrated VLM paper
- DiP pixel-space diffusion model paper
- STARFlow-V normalizing flow video generation paper
- HSBC analysis of OpenAI long-term losses
Feature Spotlight
Feature: Opus 4.5 consolidates the coding crown
Opus 4.5 claims #1 on Code Arena WebDev (thinking‑32k), top Expert/Text placements, strong LisanBench chains, and fast IDE uptake (Cline). Builders report low-cost batch runs and longer valid chains.
Cross‑account momentum centers on Opus 4.5 topping live leaderboards and showing practical agent gains; mostly eval updates, dev tool rollouts, and usage data today.
🏆 Feature: Opus 4.5 consolidates the coding crown
Cross‑account momentum centers on Opus 4.5 topping live leaderboards and showing practical agent gains; mostly eval updates, dev tool rollouts, and usage data today.
Opus 4.5 takes #1 WebDev slot and tops Arena Expert leaderboard
Claude‑Opus‑4.5 (thinking‑32k) has jumped straight to #1 on LMArena’s Code Arena WebDev board with 1493 Elo, with the non‑thinking variant at #2 (1479) and Gemini 3 Pro now third at 1473, extending Anthropic’s lead on coding‑style evals following its earlier SWE‑bench and ARC‑AGI wins benchmarks and position arena update.

The same model family also now holds #1 on the Arena Expert leaderboard and #3/#4 on the Text leaderboard (thinking and non‑thinking), and ranks top‑3 across occupational categories like software, business, medicine, writing, and math expert rankings text leaderboard. For engineers, this means that if you’re already routing hard Web/UI tasks to Opus 4.5, the broader Arena data increasingly backs that choice over GPT‑5.1 medium and Gemini 3 Pro for complex, spec‑heavy front‑end work.
Builders report huge productivity jumps—and some friction—with Opus 4.5
Hands‑on reports keep piling up around Opus 4.5: one founder says they "did about 3 weeks of engineering work in 3 hours" with the model engineering burst, while another had it rewrite an entire SaaS from a messy React app to a clean Laravel build in under two hours using FactoryAI saas rewrite.
Others lean on Opus 4.5 for front‑end design via Claude Code’s new frontend-design plugin, one‑shotting physics simulators and app layouts when run in plan mode design workflow. Emad Mostaque mentions it’s the first Claude he finds “reasonably usable for decent math work,” and notes that thinking and non‑thinking quality now feel very close on many problems math impression. At the same time, experienced users like Jeremy Howard argue that non‑thinking Opus 4.5 can be worse than Sonnet 4.5 in practice and that true one‑shot solutions are still rare, requiring back‑and‑forth and human review critic view jeremy comment. Put together, the mood from serious builders is: Opus 4.5 is now their go‑to for big, structured coding and math tasks—especially with good harnesses—while Sonnet 4.5 and GPT‑5.1 Pro remain in the mix for faster or more general work.
Cline 3.38.3 makes Opus 4.5 a first‑class long‑horizon code agent
Cline 3.38.3 is out with built‑in support for Claude‑Opus‑4.5 and the 32k thinking variant, plus an expanded Hooks UI and native tool calling for providers like Baseten and Kimi K2 cline 3.38.3 notes.
Following earlier editor integrations of Opus 4.5 into Cursor, Zed, Windsurf and others editor adoption, this release pushes Cline closer to a full agent harness: you get richer lifecycle hooks around each step, better control over when Opus plans vs edits, and the option to swap in Grok 4.1 or Gemini 3 Pro where they’re stronger. The Cline team highlights Opus 4.5’s ~80.9% SWE‑bench and 62.3% MCP Atlas scores as the reason they now recommend it for complex multi‑step coding tasks, with Sonnet 4.5 reserved for cheaper, more routine work cline 3.38.3 notes. For teams standardizing on Cline as their coding agent, this is the cleanest way so far to put Opus 4.5 Thinking at the center of long‑running refactors and feature builds without writing your own harness from scratch.
LisanBench crowns Opus 4.5 Thinking as longest‑chain reasoner
On the new LisanBench reasoning benchmark, Claude‑Opus‑4.5 Thinking takes a clear first place, producing the longest valid chains on 18 of the 50 start words, while Gemini 3 Pro and Grok‑4 set 13 and 10 records respectively lisanbench thread.

Opus 4.5 Thinking is currently limited to 16,384 "thinking" tokens in this setup (OpenAI models were run with their larger "medium" reasoning budget), yet still ends up the best‑scoring model once chains are allowed to run long thinking limit note. The full LisanBench run cost roughly $5 for non‑thinking Opus 4.5 and about $35 with Thinking via Anthropic’s Batch API, which is cheap enough that individual teams can realistically reproduce these experiments and tune their own reasoning budgets batch cost estimate. For builders already using Opus 4.5 for agentic workflows, this is extra evidence that the thinking variant is where you want to go for hard, multi‑step reasoning, even if it sometimes takes longer to notice and correct its own mistakes than OpenAI’s thinking models lisanbench thread.
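If you want to replicate the budget‑capped setup, here is a minimal sketch using the Anthropic Python SDK’s extended‑thinking parameters, with a 16,384‑token thinking budget on a single Messages call. The model ID and prompt are placeholders (not the LisanBench harness), so check Anthropic’s docs for current model names.

```python
# Minimal sketch (not the LisanBench harness): capping extended thinking at
# 16,384 tokens on a single Anthropic Messages call. Model ID and prompt are
# placeholders; verify current model names against Anthropic's docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-5",   # assumed model ID; confirm before use
    max_tokens=20000,           # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 16384},
    messages=[{
        "role": "user",
        "content": "Build the longest valid word chain starting from 'stone'.",
    }],
)

# Thinking and the final answer arrive as separate content blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)
```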
SimpleBench: Opus 4.5 edges Opus 4.1 but trails Gemini 3
On the SimpleBench aggregate benchmark, Claude‑Opus‑4.5 scores 62% accuracy, up 2 points from Claude‑Opus‑4.1’s 60%, while Gemini 3 Pro remains well ahead at 76.4% simplebench summary.

So Opus 4.5 is a clear upgrade inside the Anthropic family but not an all‑around SOTA: if you’re routing by raw SimpleBench score, Gemini 3 still looks like the global default. For coding and agents though, this is just one data point; combined with SWE‑bench, Terminal‑Bench, and the Arena leaderboards, it reinforces a more nuanced pattern where Opus 4.5 is especially strong on code‑heavy and structured‑instruction tasks, while Gemini 3 keeps the edge on broader, mixed‑domain QA reasoning comparison.
📊 Benchmarks: Gemini 3 and Grok 4.1 step up
Continues yesterday’s benchmark race with fresh non‑Opus signals. Excludes Opus 4.5 which is covered in the feature.
Gemini 3 Pro hits 93% on GPQA Diamond, lifting SOTA by ~5 points
Gemini 3 Pro set a new record on the GPQA Diamond benchmark with a 93% score, up from the previous best of around 88% from GPT‑5.1 (high), with most of the gain coming from organic chemistry questions. gpqa update

GPQA Diamond is a very hard graduate‑level question benchmark, so a 5‑point jump at the top end is a big deal for anyone leaning on models for scientific and technical reasoning. The analysis shows that on non‑chemistry domains Gemini 3 Pro only nudges past GPT‑5.1, so the big win is in a narrow but important slice of the test. This strengthens the pattern from earlier math and reasoning suites that Gemini 3 is consolidating a lead on formal problem solving rather than general chat quality. Builders doing chemistry‑adjacent work or dense technical Q&A should be routing their hardest questions here first, and then verifying against primary literature.
CAIS indices rank Gemini 3 Pro #1 in both text and vision capabilities
New CAIS capability indices put Gemini 3 Pro at the top for both text and vision tasks, with Opus 4.5, GPT‑5.1 and Grok 4 following behind in different orders. On the Text Capabilities Index, Gemini 3 Pro averages 47.6% vs 44.4% for Opus 4.5 and 37.5% for GPT‑5.1; in the Vision Capabilities Index it scores 57.1% vs 47.4% for GPT‑5.1 and 45.4% for Opus 4.5. (text index chart, vision index chart)


The CAIS suite pulls together multiple hard benchmarks (Humanity’s Last Exam, ARC‑AGI‑2, SWE‑Bench, MindCube, IntPhys, etc.), so these averages are a useful single number when you don’t want to maintain your own bespoke eval mix. The main takeaways: Gemini 3 Pro looks like the best single bet if you need one model to handle both language and perception; GPT‑5.1 still lags on average despite some strengths; and Opus 4.5 is competitive but not leading once you exclude its strongest coding‑centric tests. For infra planners and router authors, this argues for routing more generic multimodal workloads to Gemini 3 Pro while using other models where you have very specific advantages measured locally.
Gemini 3 Pro tops SimpleBench with 76.4% vs low‑60s for rivals
On the SimpleBench aggregate benchmark, Gemini 3 Pro now leads with a 76.4% AVG@5 score, well ahead of Claude Opus 4.5 at 62.0%, Gemini 2.5 Pro at 62.4%, and GPT‑5 Pro at 61.6%. simplebench tweet

SimpleBench is a broad multi‑task suite, so a ~14‑15 point margin at the top is a strong signal if you’re picking a default model for general reasoning and multi‑step agent tasks rather than niche domains. The gap is especially relevant because the other contenders here are all considered frontier models; this isn’t a small base model comparison. For AI engineers, this suggests you can keep Gemini 3 Pro as your "do‑everything" backbone and specialize others where they have clear wins (e.g., coding or safety) instead of trying to micro‑route every general query.
Google claims Gemini 3 Pro remains SOTA on Vending‑Bench tool use
Google is reiterating that Gemini 3 Pro is still state of the art on real‑world tool‑use benchmarks such as Vending‑Bench, after correcting an earlier image that misrepresented the comparison. (tool use claim, image correction) Vending‑Bench stresses API calling and multi‑step tool orchestration rather than pure text accuracy, so it’s closer to how production agents behave. Holding the lead here matters if you’re building shopping, booking, or enterprise workflow bots that must chain external tools reliably rather than merely answer questions. For now, this positions Gemini 3 Pro as the safer default for tool‑heavy agents, with others like GPT‑5.1 and Opus 4.5 better used for specialized sub‑tasks (e.g., deep coding) behind a router.
Grok 4.1 Thinking jumps to #2 on LMArena Text behind Gemini 3 Pro
On the LMArena Text leaderboard, Grok‑4.1‑Thinking has climbed into the #2 spot, sitting just behind Gemini 3 Pro and ahead of its own non‑thinking variant and several GPT‑5.x baselines. text arena update

LMArena is a head‑to‑head, human‑voted arena, so this isn’t a single benchmark score but a crowd‑sourced Elo signal across many prompts. The fact that Grok’s thinking mode outperforms both its base form and some larger rivals suggests xAI’s reasoning‑style fine‑tuning is paying off in real comparative usage, not just on narrow exams. For teams experimenting with long‑form reasoning and tool‑use agents, Grok‑4.1‑Thinking is now worth a test slot alongside Gemini 3 Pro and your current default, especially given its aggressive pricing. grok value comment
Independent tests show Gemini 3 Pro excelling at hard mathy code, weak at web apps
Independent evaluator Victor has been hammering Gemini 3 Pro against GPT‑5 Pro and Claude 4.5 across three tough coding tasks and reports that Gemini 3 Pro is clearly ahead on debugging complex compiler bugs, refactoring Haskell files without linearity mistakes, and solving hard λ‑calculus problems. victor analysis

In one highlighted example, Gemini 3 Pro produced a one‑line λ‑calculus solution that was both simpler and faster than the author’s own, and in another it generated a correct ref_app_swi_sup implementation while GPT‑5 Pro and Gemini 2.5 Deep Think both introduced subtle linearity bugs in HVM4. victor analysis At the same time, the same tester found Gemini 3 worse than GPT‑5.1 at one‑shot web app generation, prone to missing features and shipping buggy UIs, and shaky on health prompts and creative writing. The picture is: if you’re doing deep algorithmic work, type‑system heavy languages, or weird math, Gemini 3 Pro deserves to be your first call, but you probably still want GPT‑5.1 Pro or Opus for productized UX flows and anything user‑facing.
Gemini 3 Pro still leads GeoBench despite new Claude 4.5 results
Fresh GeoBench numbers highlighted by the community show Gemini 3 Pro retaining the top position, while Claude 4.5 Opus Thinking slots in behind an older Gemini 2 Flash checkpoint. geobench comment GeoBench focuses on geography‑heavy reasoning and map‑like spatial questions, which often trip up models that haven’t internalized enough world knowledge or coordinate logic. Seeing Gemini 3 Pro not only keep its lead but also outperform the latest Claude variant here suggests its pre‑training mix and RL stack are still ahead on this particular axis. If your product leans heavily on location reasoning—logistics, travel, mapping, localized recommendations—this is another data point in favor of preferring Gemini 3 Pro in your routing logic.
🧪 New model drops and open APIs
A busy day for open weights and APIs: mostly image/video/VLM releases with immediate hosting and pricing details for engineers.
FLUX.2 spreads to fal, Picsart Flows, and OpenArt Unlimited
Black Forest Labs’ FLUX.2 is quickly turning from a single checkpoint into an ecosystem: fal now offers a FLUX.2 LoRA gallery and trainer, Picsart wired it into Flows, and OpenArt Unlimited added it as a top‑tier image engine. Engineers can now hit FLUX.2 via multiple production‑ready APIs instead of self‑hosting from day one.
fal’s LoRA gallery exposes prebuilt adapters (add background, virtual try‑on, multi‑angle, etc.) plus a trainer so teams can fine‑tune small style LoRAs on FLUX.2 lora gallery post, with an accompanying Tiny AutoEncoder that streams intermediate outputs instead of progress bars for faster perceived latency tiny autoencoder intro. Picsart dropped FLUX.2 into its Flows system so non‑ML teams can chain it in design pipelines alongside editing tools picsart flows demo, while OpenArt Unlimited added both a Pro (speed) and Flex (precision) variant for 4MP jobs in its subscription plan openart launch video. For you, the point is: FLUX.2 is now reachable from several mature SaaS surfaces with billing, rate limits, and UI already solved, so you can plug it into creative apps without standing up your own GPU stack.
Alibaba’s Z‑Image Turbo debuts as fast, cheap 6B text‑to‑image model
Alibaba’s Z‑Image Turbo, a 6B‑parameter diffusion model, has been released under Apache‑2.0 and is already wired into fal with aggressive pricing and latency: roughly 1‑second generation and about $0.005 per megapixel for photoreal text‑to‑image. That gives you an open, production‑priced alternative to proprietary image APIs.

On the model side, Z‑Image Turbo shows up on Tongyi‑MAI as an Apache‑2.0 text‑to‑image checkpoint with safetensors and diffusers support, explicitly positioned for fast single‑GPU inference tongyi model card. fal then added a "Z‑Image Turbo" endpoint at launch with a simple per‑megapixel billing line and advertises ~1‑second latency for typical 1–4MP jobs fal launch thread. The combination—fully open weights plus a cheap hosted runtime—means you can prototype locally, train or prune in your stack, and still have a drop‑in hosted option for production traffic without being locked into one vendor.
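For a feel of what the hosted path looks like, here is a hedged sketch using fal’s Python client plus the per‑megapixel arithmetic; the endpoint ID and argument names are assumptions based on fal’s usual conventions, so check the model page for the exact schema.

```python
# Hedged sketch of calling a hosted Z-Image Turbo endpoint through fal's Python
# client. Endpoint ID and argument names are assumptions; verify on fal's model page.
import fal_client

result = fal_client.subscribe(
    "fal-ai/z-image/turbo",   # hypothetical endpoint ID
    arguments={
        "prompt": "studio portrait photo, soft window light, 85mm look",
        "image_size": {"width": 1024, "height": 1024},  # ~1.05 MP
    },
)
print(result["images"][0]["url"])

# Back-of-envelope cost at the advertised ~$0.005 per megapixel:
megapixels = 1024 * 1024 / 1_000_000
print(f"~${0.005 * megapixels:.4f} per image")  # ≈ $0.0052 for a ~1 MP render
```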
INTELLECT‑3 opens with weights and an OpenRouter reasoning API
Prime Intellect’s INTELLECT‑3, a frontier‑style reasoning model, is now available both as open weights and via OpenRouter’s API, with a full technical report and training environment published alongside it. OpenRouter is pushing it as a deeply transparent alternative to closed "thinking" models.

OpenRouter announced INTELLECT‑3 as a new public model on the platform with code, weights, and environment configs all released so other labs can study how it was trained openrouter launch. Their docs call out that INTELLECT‑3 generates long reasoning blocks which should be preserved between turns to maintain chain‑of‑thought, and they show how to structure conversations so the model’s internal reasoning isn’t accidentally truncated reasoning docs. On the qualitative side, early users are already sharing benchmarks and mid‑curve memes that place INTELLECT‑3 near top closed models for math and coding while keeping the whole stack inspectable community screenshot. If you’re experimenting with open "thinkers" for agents or deep research, you can now route to INTELLECT‑3 with a standard OpenAI‑style API call instead of standing up your own infra.
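Routing to it looks like any other OpenAI‑compatible call; the sketch below assumes a hypothetical model slug (check openrouter.ai for the real one) and illustrates the docs’ advice to carry the model’s reasoning output forward between turns rather than truncating it.

```python
# Sketch of routing to INTELLECT-3 through OpenRouter's OpenAI-compatible API.
# The model slug is an assumption; the key point from the docs is to keep the
# assistant's reasoning-bearing turns in the running conversation.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

messages = [{"role": "user", "content": "Prove that the sum of two odd integers is even."}]
resp = client.chat.completions.create(
    model="prime-intellect/intellect-3",   # hypothetical slug; verify on OpenRouter
    messages=messages,
)

answer = resp.choices[0].message.content
# Preserve the assistant turn (including any reasoning text) for follow-ups.
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": "Now generalize to n odd integers."})
```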
Perceptron’s Isaac 0.1 VLM comes to Replicate with grounded OCR and spatial QA
Perceptron’s Isaac 0.1, a 2B‑parameter grounded vision‑language model, is now live on Replicate with an HTTP API and UI that surface bounding boxes, rationales, and answers for OCR and spatial questions. It’s tuned for real‑world scenes like documents, scoreboards, and UIs rather than just captioning.

Isaac 0.1 sits at the small‑model end of the spectrum but is built for “explain your decision” workflows: when you ask if a shot beat the buzzer, it not only says yes/no, it also highlights the ball, the shot clock and describes the evidence in natural language isaac demo image. Replicate’s deployment exposes that behavior as a single model with JSON responses and makes it easy to call from Python or JS without wrangling GPUs yourself replicate blog post. For people building QA layers over dashboards, CCTV, forms, or sports footage, this is an immediately usable VLM that trades raw size for grounded, inspectable answers.
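Calling it programmatically is a one‑liner through Replicate’s Python client; the model slug and input field names below are assumptions, so confirm them on the model’s Replicate page.

```python
# Hedged sketch of calling Isaac 0.1 on Replicate. The slug and input fields
# are assumptions; check the model page for the real schema.
import replicate

output = replicate.run(
    "perceptron-ai/isaac-0.1",   # hypothetical slug
    input={
        "image": open("scoreboard.jpg", "rb"),
        "prompt": "Did the shot beat the buzzer? Point to the evidence.",
    },
)
print(output)  # expected per the model card: answer text plus grounded boxes/rationale
```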
ImagineArt 1.5 gains hosted API on fal for photoreal portraits
ImagineArt’s 1.5 image model, which people have been praising for stable photoreal portraits, now has a dedicated API surface on fal. That gives teams a straightforward way to hit it from code instead of going through the consumer web app.
One builder reports getting "a very decent portrait in 3 tries" after wiring the new 1.5 API on fal into their pipeline, noting that this release feels much steadier and more realistic for photo‑style work than earlier ImagineArt versions fal api usage. Another post confirms that the latest ImagineArt 1.5 model is live on fal’s model catalog with an API‑first experience fal launch note. If you’ve been testing ImagineArt 1.5 in UI form and want to move it into a production service—say, for avatars or marketing imagery—the hosted fal endpoint lets you do that without managing GPUs or rate‑limiting yourself.
🛠️ Engineering agents: harnesses, IDE flows, planning
Stronger focus on long‑running harness patterns and practical IDE/CLI upgrades; mostly coding agent ergonomics and planning tips.
Hyperbrowser sessions can now load your own Chrome extensions
Hyperbrowser’s cloud browser substrate for agents now supports loading arbitrary Manifest V3 Chrome extensions into each session, so you can ship custom devtools, scrapers, or automation logic alongside the agent itself. Following its initial release as a general agent browser host cloud browser launch, the new feature lets you ZIP an extension with manifest.json at the root, upload it, and then attach its extensionId when creating a session. extensions docs

This works with background scripts, content scripts, and debugging extensions, and is aimed squarely at teams who want fully reproducible browser environments instead of hoping agents behave the same on every workstation. (extensions docs, api signup link) Practically, it means your “browser agent” can now rely on stable, maintained capabilities—like site‑specific instrumenters, cookie inspectors, or accessibility checkers—without re‑implementing them in LLM prompts or brittle page‑specific CSS selectors.
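The packaging side is plain MV3 hygiene; a minimal sketch of zipping an extension with manifest.json at the archive root and the shape of a session request referencing it follows in the MCP section below, where the feature is covered in more detail.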
Cline 3.38.3 expands hooks, models, and native tool calling
Cline 3.38.3 ships a bigger Hooks system and UI so you can tap into more parts of the coding agent loop, while also adding Grok 4.1 (and Grok Code) as xAI providers plus native tool calling for Baseten and Kimi K2. The release also exposes a "thinking" level control for Gemini 3 Pro preview alongside a set of bugfixes around slash commands, Vertex, Windows terminals, and reasoning toggles across providers. release thread
For people building serious coding harnesses on top of Cline, these hooks and native tool integrations make it easier to route specific tasks (e.g. heavy web search, code execution, or hosted models) through the right backend without hand‑wiring everything in shell scripts. It also means you can keep experimenting with frontier models like Grok 4.1 or Gemini 3 in the same agent UX instead of spinning up one‑off prototypes per model.
LangChain sharpens agent mental models and surfaces reusable skills
LangChain published a short guide spelling out the difference between an agent framework (LangChain), a runtime (LangGraph), and a harness (Deep Agents), then wired a growing catalog of “skills” into the Deep Agents CLI. The idea is to separate abstractions that help you build agents from infrastructure that keeps them running durably, and from general‑purpose harnesses that ship with pre‑built capabilities like browsing or coding. langchain blog langchain post
In a separate chat with Rippling’s head of AI, Harrison Chase walks through how they use LangSmith to debug and iterate on these agents in production, emphasizing context engineering, feedback loops, and realistic tests instead of toy prompts. (rippling interview, insights agent talk) JXNL adds practical harness advice like “two tools named list_repos is not two tools, it’s one problem” and points out there’s an “evals shortage” as more teams deploy agents without robust evaluation pipelines. (tool naming advice, evals shortage) Put together, this cluster is a good snapshot of how serious shops are starting to think about agent structure rather than just wiring LLMs to tools ad hoc.
OpenCode adds Exa web/code search and an official Docker image
OpenCode wired in Exa’s web and code search so an agent can look up examples for any library you’re using without extra setup, and pushed an official Alpine‑based Docker image to GitHub Container Registry. In practice this means prompts like “what’s the hottest new library, find examples” now trigger Exa Web and Code search under the hood, then surface concise answers and code snippets inline in the OpenCode chat. exa integration

The team initially limited Exa‑powered search to “Zen” users to avoid surprising folks in locked‑down environments, but notes that the integration works with any model you route through OpenCode, not just their defaults. (works with any model, university usage) For infra teams, the new ghcr.io/sst/opencode image simplifies rolling OpenCode into devcontainers and CI, turning it into a reusable coding‑agent front‑end rather than yet another local experiment. docker image announcement
Zed 0.214.0 makes project search and TypeScript errors feel instant
Zed rewrote its project search so the first match shows up in milliseconds rather than after a long ripgrep run, and v0.214.x layers on nicer TypeScript errors and panel sorting tweaks. The new search walks project files respecting .gitignore, streams early hits, and only loads buffers that actually match, which makes grep‑style queries feel interactive even on large repos. search blog
The 0.214.0 release also adds “pretty” TS diagnostics inline, a project_panel.sort_mode option to control how entries are ordered, and a round of local‑project search improvements and bugfixes. (release summary, search perf note) For teams standardizing on Zed as an AI‑augmented editor, the net effect is that agent‑initiated and human‑initiated searches both get faster feedback, making it easier to keep agents inside Zed instead of spawning external shells.
AI SDK 6 beta adds Anthropic agent instruction caching and richer system messages
The AI SDK 6 beta quietly introduced Anthropic agent instruction caching, letting you cache the heavy system and tool instructions once and reuse them across calls instead of resending long prompts every turn. Early adopters show off simple patterns where they preload the agent’s core instructions and then stream only deltas and user inputs, which both cuts token spend and makes agent behavior more consistent across calls. instruction caching code
On top of that, the same stream‑text style APIs now support model‑specific system messages in the system setting, which makes it easier to run different behavioral profiles (e.g. planner vs executor) on top of the same base model without hand‑rolling prompt concatenation. system message note For teams already leaning into Anthropic’s harness patterns, this SDK release is an obvious next step: you can pair cached instructions with the initializer/coding‑agent split from their blog and get both better behavior and lower per‑turn costs.
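If you are on the Python side rather than the AI SDK, the same cost‑saving idea maps onto Anthropic’s prompt caching: mark the long, stable instruction blocks as cacheable and let only the user turns change per call. A minimal sketch, assuming the standard cache_control markers and a placeholder model ID (this is not the AI SDK 6 API):

```python
# Minimal sketch of the instruction-caching idea via Anthropic prompt caching
# (cache_control on stable system blocks). Not the AI SDK 6 API, just the same
# pattern expressed with the Anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()

LONG_AGENT_INSTRUCTIONS = "...thousands of tokens of harness + tool instructions..."

def ask(user_msg: str) -> str:
    resp = client.messages.create(
        model="claude-opus-4-5",  # assumed model ID
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": LONG_AGENT_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},  # reused across calls instead of re-sent
        }],
        messages=[{"role": "user", "content": user_msg}],
    )
    return resp.content[0].text
```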
Builders converge on multimodel planning workflows in Cursor
A few heavy Cursor users laid out a repeatable pattern for using Opus 4.5 and GPT‑5.1‑Codex together: start in “plan mode” with a strong thinking model to map changes, then hand implementation to a cheaper executor and bounce back only when stuck. Eric Zakariasson’s workflow has Opus 4.5 or GPT‑5.1‑Codex explain the relevant codepaths, propose a step‑by‑step plan, then switch to a lighter composer model to actually edit files, with explicit review and human verification at the end of each loop. cursor workflow thread Others note that GPT‑5.1‑Codex Max on extra‑high is now fast enough to refactor messy Python scripts into clean, testable functions in about 3 minutes, roughly half the time they were used to, which makes the “expensive planner, cheaper implementer” pattern more viable. (codex speed note, opus productivity quote) The common thread is that agents don’t replace engineering judgment—they amplify it when paired with clear decomposition, good prompts, and a habit of treating errors as inputs to the next turn rather than a reason to restart.
Firecrawl becomes a native search-and-scrape step in Vercel Workflow Builder
Firecrawl is now a first‑class step type inside Vercel’s new Workflow Builder, giving you drag‑and‑drop “search and scrape” blocks in the same canvas where you orchestrate model calls and tools. The integration means you can wire up workflows where an LLM plans, a Firecrawl step crawls and extracts, and a second LLM summarizes or routes—all without writing glue code. workflow builder demo
Under the hood it’s the same open‑source engine exposed via the firecrawl-aisdk package, so you can also call it programmatically from the Vercel AI SDK or other clients. sdk integration workflow builder site Firecrawl’s own examples show using it both for classic RAG (scraping docs into a store) and for one‑shot tasks like “summarize this site’s pricing page” as part of a broader agent harness. workflow builder live
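Outside the Workflow Builder, the same search‑and‑scrape step is a couple of lines against the Python client; method names and return shapes below follow Firecrawl’s docs but vary a bit between SDK versions, so treat this as a sketch.

```python
# Hedged sketch of a programmatic Firecrawl scrape step; return shape differs
# between firecrawl-py versions, hence the defensive access.
from firecrawl import FirecrawlApp

fc = FirecrawlApp(api_key="YOUR_FIRECRAWL_KEY")
page = fc.scrape_url("https://example.com/pricing", formats=["markdown"])

# Some SDK versions return an object with .markdown, others a dict.
content = getattr(page, "markdown", None) or page["markdown"]
print(content[:500])  # this markdown then feeds the summarizing LLM step in the workflow
```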
Practitioners spotlight tool clarity and evals as agent bottlenecks
A set of agent builders, including JXNL, are calling out very practical failure modes in current harnesses: overlapping tools from multiple MCP servers that share names, and a shortage of robust evals for tool‑calling flows. The advice is blunt—“two tools named list_repos is not two tools. It is one problem”—and argues that agents can’t be aligned to your cognition if they can’t see clean tool boundaries. tool naming advice JXNL also points out there’s an “evals shortage” as more teams ship agents without systematic ways to catch tool‑call errors, and is now publishing session notes and cookbooks on catching and classifying these failures in production. (evals shortage, session notes link) For anyone wiring up complex harnesses on LangChain, Anthropic’s SDK, or custom stacks, these patterns are a useful checklist: clear tool schemas, non‑colliding names across MCP servers, and evals that treat mis‑routed tool calls as first‑class bugs, not curiosities.
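A generic illustration of the naming fix: when aggregating tools from several MCP servers, namespace each tool by its server so the model never sees two identically named tools with different behavior. This is plain Python, not any particular framework’s API.

```python
# Generic illustration of the "two tools named list_repos" problem: prefix each
# tool with its server name before handing the merged catalog to the model.
def namespace_tools(servers: dict[str, list[dict]]) -> list[dict]:
    """servers maps a server name to its tool specs ({'name': ..., 'description': ...})."""
    merged = []
    for server, tools in servers.items():
        for tool in tools:
            merged.append({**tool, "name": f"{server}__{tool['name']}"})
    return merged

tools = namespace_tools({
    "github": [{"name": "list_repos", "description": "List GitHub repos"}],
    "gitlab": [{"name": "list_repos", "description": "List GitLab projects"}],
})
# -> github__list_repos, gitlab__list_repos: two distinct, unambiguous tools
```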
Zed adds agent server extensions for SSH remoting
In addition to the faster search and nicer errors, Zed quietly added “agent server extensions” that work even when you’re connected to a remote machine over SSH, so your AI helpers can keep running close to the code instead of being tied to your local box. The feature means Zed’s agent infrastructure can now talk to remote sandboxes, shells, or language servers in a more structured way, which is important if you’re using agents to operate in production‑like environments. ssh extension note Paired with the improved project search and TypeScript diagnostics, the direction of travel is clear: Zed wants to be not just an editor that happens to have AI, but a host for long‑lived coding agents that can safely operate on real repos, whether they live on your laptop or behind an SSH hop.
🧩 Orchestration & MCP: extensible browsers and skills
Interoperability and skills distribution ramp. Excludes feature items; focuses on MCP-ready surfaces and agent skills catalogs.
Hyperbrowser can now load your own Chrome extensions into agent sessions
Hyperbrowser added support for loading arbitrary Manifest V3 Chrome extensions into its cloud browser sessions by uploading a ZIP and attaching the extension ID when you create a session. This gives teams a way to ship custom automation, devtools, or instrumentation inside an isolated agent browser, instead of hard‑coding everything into the agent runtime itself. browser extensions post

The docs walk through packaging an extension with a root‑level manifest.json, uploading it, and then passing extensionIds in the session create call so every node boots with exactly the same extension set. docs page Follow‑up posts stress that this is meant for serious agent workloads: you can run background scripts, content scripts, and custom tooling reliably in the cloud browser, while keeping the rest of your infra language‑agnostic. feature explanation The interesting shift here is that “what the agent can do in the browser” is no longer limited by the vendor’s built‑in tools; it’s however many MV3 extensions your security team is happy to bless.
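For orientation, here is an illustrative sketch of the two moving parts: zipping an MV3 extension with manifest.json at the archive root, and the shape of a session request that references it. The only field name taken from the posts above is extensionIds; the upload step and everything else here is hypothetical, so follow Hyperbrowser’s docs for the real SDK or REST calls.

```python
# Illustrative only: packaging an MV3 extension and a session payload shape.
# `extensionIds` comes from the posts above; the upload/create endpoints and
# other fields are hypothetical.
import json
import zipfile
from pathlib import Path

# 1) Zip the extension with manifest.json at the root of the archive.
ext_dir = Path("my-extension")  # contains manifest.json, background.js, ...
with zipfile.ZipFile("my-extension.zip", "w") as zf:
    for path in ext_dir.rglob("*"):
        zf.write(path, path.relative_to(ext_dir))

# 2) After uploading the ZIP and getting back an extension ID, reference it
#    when creating the agent browser session.
session_request = {
    "extensionIds": ["ext_abc123"],  # ID returned by the (hypothetical) upload step
}
print(json.dumps(session_request, indent=2))
```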
LangChain positions Deep Agents as the harness layer above runtimes like LangGraph
LangChain published a piece clarifying the split between frameworks (abstractions like LangChain itself), runtimes (infra layers like LangGraph for durable execution, streaming, human‑in‑the‑loop), and harnesses (general‑purpose agents wired to tools and skills, like Deep Agents). framework blog The point is to stop treating “agent” as a monolith and instead decide where you’re actually making bets: in your graph runtime, in your tool catalog, or in your reusable harness logic.
Alongside the mental model, they shipped Deep Agents CLI support for skills: packaged capabilities (e.g., web search, retrieval, code actions) that any agent can consume without redefining prompts or tool schemas. skills video The demo shows how skills are discovered, how agents select and route to them, and how you can iterate on skills while keeping the harness stable. langchain blog
For orchestration and MCP folks, the implication is clear: you can treat skills as your distribution unit (similar to MCP servers), LangGraph as the execution substrate, and Deep Agents as a pluggable harness. That gives you room to swap models or runtimes without losing the investment in skills, while still having a shared agent brain that knows how to plan, call tools, and recover from failures.
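In code, the harness layer is a one‑call affair; the sketch below assumes the deepagents package’s create_deep_agent entry point (argument names may differ slightly between releases) and uses a stub search tool in place of a real skill.

```python
# Hedged sketch of the framework/runtime/harness split: a Deep Agents harness
# wrapping a plain tool, running on LangGraph underneath.
from deepagents import create_deep_agent

def web_search(query: str) -> str:
    """Stand-in search tool; in practice this wraps Exa, Tavily, or an MCP server."""
    return f"results for: {query}"

agent = create_deep_agent(
    tools=[web_search],
    instructions="You are a research agent. Plan first, then call tools.",
)

result = agent.invoke({"messages": [
    {"role": "user", "content": "Compare LangGraph and Temporal for durable agent execution."}
]})
```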
n8n turns entire automation instance into an MCP-ready surface
n8n announced that every workflow in an n8n instance is now exposed as an MCP‑compatible surface, consumable from tools like Lovable and Mistral’s agents, with Claude Code support mentioned alongside. n8n mcp update Instead of wiring each integration by hand, you point the agent at your n8n MCP endpoint and it can call any of your existing flows as tools.
For orchestration folks, this means you can keep business logic, auth, and side‑effects in n8n where ops already understands them, and let MCP‑aware agents focus on planning and argument construction. It also gives you a single place to throttle, log, and audit what agents are doing to internal systems. The trade‑off: debugging now crosses three layers (LLM, MCP adapter, n8n), so you’ll want good tracing and clear tool schemas to avoid opaque failures.
OpenAI shares design patterns for great ChatGPT Apps built on Apps SDK and MCP
OpenAI’s devrel team put out a guide on what makes a “great ChatGPT app,” aimed at people building on the Apps SDK and Model Context Protocol. apps guide tweet The post is less about yet another feature and more about patterns: designing around user intent instead of raw model calls, treating tools as first‑class, and using MCP to keep app state and context grounded in your own data.
The piece walks through how to structure conversations, where to offload logic into tools vs keeping it in prompt space, and how to think about feedback loops once your app is in the wild. openai blog For orchestration engineers, it’s effectively a checklist for turning a pile of MCP tools into a coherent app: decide on the narrow story the app tells, define a small, sharp tool surface with good schemas, and let the model orchestrate those tools rather than stuffing everything into system prompts.
The practical upshot: if you’re already wiring MCP servers, this guide helps you move from "my agent can technically call X" to "my app reliably solves Y"—without having to reverse‑engineer OpenAI’s own internal harness.
Weaviate + CrewAI show multi-agent, tool-rich patterns for industry-specific assistants
Weaviate highlighted a CrewAI recipe where three domain agents (biomedicine, healthcare, finance) share a common tool stack—Weaviate vector search plus a Serper web search wrapper—coordinated by a crew orchestrator. weaviate crewai blog Each agent gets its own role, goals, and tools, and the Crew layer handles sequencing tasks, running them in parallel, and aggregating the results.

The diagram and notebook show a full stack: environment setup, connecting to a Weaviate Cloud collection, initializing tools, defining tasks per industry, and wiring them into a Crew that can run queries like "how should this feature be marketed to hospitals vs fintech?" weaviate blog It’s a good concrete example of what “agent orchestration” actually looks like when you combine:
- a vector database with rich collections,
- multi‑agent roles for different expertise,
- and a runtime that can decide who works on what and when.
If you’re already exposing Weaviate via MCP or similar, this pattern maps nicely: your MCP server becomes one tool in a broader multi‑agent harness, instead of trying to cram every behavior into a single omniscient agent.
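Structurally, the CrewAI side of the recipe boils down to a few objects; the sketch below omits the Weaviate and Serper tool wiring (see the linked notebook for the real tool setup), so treat it as a shape reference only.

```python
# Condensed structural sketch of the CrewAI pattern: domain agents plus a Crew
# that sequences tasks. Tool wiring (Weaviate vector search, Serper web search)
# is omitted; see the linked recipe for the full setup.
from crewai import Agent, Task, Crew

healthcare = Agent(
    role="Healthcare analyst",
    goal="Assess how the feature lands with hospital buyers",
    backstory="Knows clinical procurement and compliance constraints.",
)
finance = Agent(
    role="Fintech analyst",
    goal="Assess how the feature lands with fintech buyers",
    backstory="Knows banking integrations and risk teams.",
)

compare = Task(
    description="How should this feature be marketed to hospitals vs fintech?",
    expected_output="A short comparison with one positioning line per segment.",
    agent=healthcare,
)

crew = Crew(agents=[healthcare, finance], tasks=[compare])
print(crew.kickoff())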
🧪 Reasoning & memory: agentic research and CoT control
Strong run of reasoning/memory papers; mostly agent loops, retrieval‑heavy memory, gradient steering, and multimodal collaboration.
Bridgewater’s AIA Forecaster uses ~10 agents plus calibration to reach superforecaster‑level Brier scores
Bridgewater’s AIA Forecaster is a judgmental forecasting system where about 10 independent LLM agents each run their own news search, reason about an event, and produce a probability; a supervisor agent then reconciles disagreements via targeted follow‑up searches and evidence checks aia forecaster paper. Raw forecasts are still too hedged (clustered toward 50%), so the team applies Platt scaling on historical data to sharpen probabilities, pushing them closer to 0 or 1 where justified.

On the FORECASTBENCH benchmark, the resulting ensemble hits a Brier score around 0.11, comparable to human superforecasters and better than prior LLM setups aia forecaster paper. On real prediction‑market questions, AIA slightly trails market prices, but an ensemble of AIA + market beats either alone, suggesting the agents add independent signal. The design pattern—many weak but diverse forecasters, a deliberative supervisor, then statistical calibration—is a useful template if you’re building serious decision‑support agents instead of single‑shot Q&A bots.
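The calibration step is standard Platt scaling and easy to reproduce; here is a worked example on synthetic data (not Bridgewater’s), fitting a logistic regression on raw ensemble probabilities against historical outcomes and comparing Brier scores before and after.

```python
# Worked example of the calibration step on synthetic data: Platt scaling
# sharpens hedged forecasts and lowers the Brier score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=500)  # resolved yes/no questions
# Hedged raw forecasts: informative but squashed toward 50%.
raw = 0.5 + 0.15 * (outcomes - 0.5) * 2 + rng.normal(0, 0.08, size=500)
raw = raw.clip(0.01, 0.99)

platt = LogisticRegression().fit(raw.reshape(-1, 1), outcomes)
calibrated = platt.predict_proba(raw.reshape(-1, 1))[:, 1]

print("Brier (raw):       ", round(brier_score_loss(outcomes, raw), 3))
print("Brier (calibrated):", round(brier_score_loss(outcomes, calibrated), 3))
```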
General Agentic Memory lets agents deep‑search their own history instead of squashing it into notes
The General Agentic Memory (GAM) framework reframes memory as just‑in‑time deep research over an agent’s entire past, instead of pre‑compressing long chains of thoughts and tool calls into short summaries that often drop crucial detail gam paper thread. It stores sessions as full "pages" plus tiny memos; at query time a Researcher agent decides what’s missing, searches the page store, reads relevant episodes, and loops until it can answer, while a Memorizer continuously writes concise but search‑friendly memos as new work happens.

On long‑conversation and document QA benchmarks, GAM beat both very‑long‑context prompting and standard RAG/memory systems, showing that letting agents research their own history yields far stronger recall and fewer hallucinations benchmark summary. Because it’s model‑agnostic and structured like a compiler (JIT fetch vs up‑front summarization), GAM is a natural pattern if you’re building agents that need to work over days of logs or multi‑stage tool chains without blowing the context window.
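To make the pattern concrete, here is a structural sketch (not the paper’s code) of the Memorizer/Researcher split, with the LLM and search index injected as stand‑ins: full pages are kept verbatim, memos are the searchable surface, and answering is an iterative research loop over the page store.

```python
# Structural sketch of the GAM idea: full "pages" plus short searchable memos,
# answered by just-in-time research rather than up-front summarization.
class AgenticMemory:
    def __init__(self, llm, search_index):
        self.llm = llm              # any chat-completion callable (stand-in)
        self.index = search_index   # full-text / vector index over memos (stand-in)
        self.pages = {}             # page_id -> full session transcript

    def memorize(self, page_id: str, transcript: str) -> None:
        """Memorizer: store the full page and a short, search-friendly memo."""
        self.pages[page_id] = transcript
        memo = self.llm(f"Write a 2-line searchable memo for:\n{transcript}")
        self.index.add(page_id, memo)

    def research(self, question: str, max_hops: int = 4) -> str:
        """Researcher: decide what's missing, fetch pages, loop until answerable."""
        evidence = []
        for _ in range(max_hops):
            query = self.llm(
                f"Question: {question}\nEvidence so far: {evidence}\n"
                "What should we search for next? Reply with a query or DONE."
            )
            if query.strip() == "DONE":
                break
            for page_id in self.index.search(query, k=2):
                evidence.append(self.pages[page_id])
        return self.llm(f"Answer using only this evidence:\n{evidence}\n\nQ: {question}")
```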
Learning‑to‑Reason study finds GPT‑OSS traces 4× more token‑efficient than DeepSeek‑R1 for 12B models
A new Learning to Reason paper compares training 12B models on math reasoning traces from DeepSeek‑R1 vs OpenAI’s open‑source reasoner gpt‑oss, finding that models distilled on GPT‑OSS traces reach similar accuracy while using about 4× fewer reasoning tokens at inference (≈3.5k vs ≈15.5k on average) paper summary. Since inference cost scales roughly linearly with tokens, that translates into ~75% lower reasoning cost for the same quality.

The authors also note Nemotron base models already had DeepSeek‑R1 traces in pretraining: loss on those traces stayed flat, while loss on GPT‑OSS traces started higher and dropped over training, indicating the model was learning genuinely new structure from GPT‑OSS style explanations paper summary. For practitioners, the takeaway is that verbose reasoning ≠ better reasoning: if you’re curating traces for smaller models, style and concision of chain‑of‑thought matter as much as raw correctness.
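The headline numbers check out with simple arithmetic, assuming inference cost scales linearly with generated tokens:

```python
# Back-of-envelope check of the efficiency claim (same accuracy, fewer tokens).
gpt_oss_tokens = 3_500
r1_tokens = 15_500
ratio = r1_tokens / gpt_oss_tokens
savings = 1 - gpt_oss_tokens / r1_tokens
print(f"{ratio:.1f}x fewer reasoning tokens, ~{savings:.0%} lower inference cost")
# -> ~4.4x fewer tokens and ~77% lower cost, consistent with the paper's ~4x / ~75% figures
```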
Agent0‑VL trains a self‑evolving vision‑language agent using tools, verifier, and self‑repair loops
Agent0‑VL extends the Agent0 idea to vision‑language tasks by unifying a Solver role (multi‑step reasoning with tools) and a Verifier role (step‑wise checking and critique) into a closed training loop that improves without human labels agent0vl abstract. The Solver answers problems using calculators, code and visual tools; the Verifier replays the reasoning, re‑checks tool calls, scores each step for correctness and confidence, and writes short critiques.

When the Verifier is unsure or spots a bug, a Self‑Repair stage asks the Solver to patch only the broken segments and re‑infer downstream steps, with rewards combining step scores, penalties for failed repairs, and final correctness agent0vl abstract. On visual math and scientific reasoning benchmarks, the authors report ~12.5‑point accuracy gains over the starting base VLM, achieving strong performance using only cheap tools and self‑supervision. It’s a compelling example of how you can turn a single VLM into a mini research group—proposer, critic, and debugger—rather than a one‑shot oracle.
BeMyEYES uses perceiver+reasoner agents to bolt strong vision onto text‑only LLMs
Be My Eyes proposes a modular multi‑agent setup where a small perceiver model looks at images and a text‑only reasoner LLM does the heavy thinking, letting GPT‑4 or DeepSeek‑R1 reach SOTA‑level scores on hard multimodal benchmarks like MathVista and MMMU‑Pro without changing their weights bemyeyes paper. The perceiver generates detailed, structured descriptions; the reasoner reads those plus the question, can ask follow‑ups for missing information, and then picks an answer.

The team builds training data by having GPT‑4o play both roles on synthetic image questions, then fine‑tunes the tiny perceiver to imitate those descriptions so it stays tightly aligned with what the big text LLM wants bemyeyes paper. In practice, this means you can upgrade a battle‑tested text model to multimodal reasoning by adding a narrow vision front‑end and an interaction protocol, instead of retraining a giant VLM from scratch—a very attractive pattern for shops that have already standardized on a particular base LLM.
Gradient steering of hidden states elicits chain‑of‑thought from base LLMs without fine‑tuning
The Eliciting Chain‑of‑Thought in Base LLMs paper shows you can turn short‑answer base models into step‑by‑step reasoners at inference time by directly nudging their hidden states, no fine‑tuning required cot steering paper. A small classifier scores each token’s hidden state for "looks like CoT", and gradient‑based optimization increases that score while a distance penalty keeps the state close to the original so fluency and style stay intact.

During generation, whenever a hidden state looks non‑reasoning, the method follows the classifier’s gradient for a few steps until confidence passes a threshold, then continues decoding normally. On math, commonsense and logic benchmarks, this probabilistic representation optimization outperformed prior fixed‑direction steering methods, boosting accuracy while preserving natural‑sounding text cot steering paper. It’s a strong proof‑of‑concept that a lot of latent reasoning capability is already baked into base LLMs—you may not need a full CoT‑tuned checkpoint if you’re willing to do a bit of on‑the‑fly state surgery.
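The mechanics are simple enough to sketch in a few lines of PyTorch; everything below is a toy stand‑in (a randomly initialized linear probe instead of the paper’s trained classifier), but it shows the core loop of raising the probe score while an L2 penalty keeps the steered state near the original.

```python
# Toy sketch of gradient-based hidden-state steering: raise a "looks like CoT"
# probe score with a distance penalty. The probe here is a stand-in for a
# trained classifier; real systems hook this into per-token decoder states.
import torch

hidden_dim = 4096
probe = torch.nn.Linear(hidden_dim, 1)  # stand-in for a pretrained CoT classifier

def steer(h: torch.Tensor, lam: float = 0.1, lr: float = 0.05,
          steps: int = 5, threshold: float = 0.9) -> torch.Tensor:
    h0 = h.detach()
    h = h0.clone().requires_grad_(True)
    for _ in range(steps):
        score = torch.sigmoid(probe(h)).squeeze()  # P(state looks like reasoning)
        if score.item() >= threshold:
            break
        loss = -score + lam * torch.sum((h - h0) ** 2)  # raise score, stay close
        loss.backward()
        with torch.no_grad():
            h -= lr * h.grad
        h.grad = None
    return h.detach()

steered = steer(torch.randn(hidden_dim))
```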
UniSandbox study finds unified multimodal models often fail to use their language understanding for image generation
The Does Understanding Inform Generation in Unified Multimodal Models? paper builds a synthetic UniSandbox of math and symbol puzzles that force models to truly reason before drawing images, then tests today’s unified text–image models inside it unisandbox paper. Out of the box, most systems behave like loose keyword matchers: they almost never solve tasks that require multi‑step reasoning, despite strong language‑side understanding.

When the authors add explicit Chain‑of‑Thought text before generation and treat that as a teacher, performance jumps sharply; then, by distilling from these CoT traces, the image generator itself starts to solve arithmetic and symbol‑mapping tasks without needing visible reasoning at inference. A second experiment adds new fictional facts only to the language side and asks models to draw the right portrait or favorite fruit—but here, most unified models still fail, especially when mapping from attributes back to identities unisandbox paper. The core message: multimodal models may "understand" more than their generators can express, and bridging that gap likely needs explicit reasoning channels or targeted distillation rather than more pretraining alone.
Universe of Thoughts framework boosts creative reasoning by remixing and mutating "thought atoms"
The Universe of Thoughts (UoT) paper argues that standard CoT keeps models stuck exploring one narrow solution region, so it decomposes solutions into small "thought atoms" that can be recombined, swapped across problems, or even used to tweak the task rules themselves uot paper overview. It then defines combinational, exploratory, and transformational UoT variants that either analogize from related tasks, expand the pool of building blocks, or mutate constraints.

On three open‑ended tasks (traffic control, electricity tariffs, social cohesion policy), a separate model scores answers for utility, novelty and feasibility. UoT variants—especially the combinational and transformational ones—consistently produced more creative yet workable proposals than standard CoT baselines and sometimes matched much larger proprietary systems uot paper overview. The message for builders is clear: explicitly modeling and editing internal "idea graphs" may matter as much as sheer model size if your use case is policy design, product ideation, or other domains where "interesting and viable" beats "safe and generic".
Computer‑use agents act as judges for generative UIs in AUI benchmark
Microsoft, Oxford and NUS introduced AUI (Agentic User Interfaces), where computer‑use agents are used not just to operate web apps but to evaluate and refine model‑generated interfaces aui paper abstract. A coding model called the Coder turns natural language specs into full HTML/JS apps; other LLMs generate realistic tasks plus tiny check scripts, then a computer‑use agent executes each task in the browser. Those task scripts auto‑mark success or failure, giving fast, scalable feedback on which UIs actually work.

All the agent’s steps and screenshots are condensed into a dashboard image, preserving key regions and text so the Coder can see where the agent struggled and adjust layouts, contrast, or control placement accordingly aui paper abstract. Over iterative cycles, the Coder learns to drop flashy but unusable designs, avoid overlapping components, and surface critical buttons—optimizing for agents as end‑users. If you’re building UI‑heavy tools meant to be driven by agents rather than humans, this kind of "agent‑in‑the‑loop UX testing" will likely become mandatory.
"Foundations of AI Frameworks" argues current static neural nets can’t yield true AGI, calls for richer self‑modifying systems
A conceptual paper from Hanoi argues that today’s large neural networks are "static function approximators" that can’t support genuine understanding or open‑ended general intelligence, no matter how far we scale them agi limits paper. The author leans on ideas like the Chinese Room and Gödel‑style critiques to claim that training handled entirely by an external algorithm leaves models unable to decide what to care about, rewrite their own circuitry, or build ever‑richer internal representations over time.

The paper also criticizes over‑interpreting neural scaling laws and the Universal Approximation Theorem as evidence that deep reasoning will emerge by default, arguing these results say only that functions can be fitted, not that the resulting systems have the right structure for intelligence agi limits paper. As an alternative, it sketches a separation between raw "existential facilities" (hardware substrate) and higher‑level "architectural organization" (learning rules, neuron types, self‑modification mechanisms). Even if you don’t buy the strong claim that AGI is impossible with current paradigms, it’s a useful reminder that algorithmic and architectural innovation matter at least as much as bigger datasets and more GPUs.
🏢 Enterprise AI rollouts and shopper UX
Production usage threads: agent deployments and commerce assistants. Excludes finance/outlook, which is covered separately.
Perplexity Memory adds cross-thread recall and Comet integration with user controls
Perplexity has switched on long‑term memory that remembers preferences, projects, and prior threads across all models and modes, so future answers can be tailored without users re‑explaining themselves. memory launch
The same memory system now powers the Comet browser: it can pull context from open tabs, active projects, and past work to answer in‑place, and the new Language tab uses it to run a persistent language tutor that remembers your level and goals. language tutor demo Crucially for enterprise rollouts, memory and history can be fully viewed and deleted, turned off entirely, and are disabled by default in incognito windows, which makes this feel more IT‑deployable than typical "sticky" chat history. privacy controls explainer
Perplexity launches AI shopping experience with conversational discovery and PayPal checkout
Perplexity is rolling out an AI-powered shopping experience to all U.S. users that treats buying as a conversational research problem rather than a filter sidebar: you describe lifestyle, constraints, and tastes, and it iteratively narrows down products. shopping feature demo
The system remembers your past activity, refines options as you rate products during the conversation, and then hands off the final purchase to PayPal so you stay in flow instead of bouncing between sites. For commerce teams this is effectively a whitepaper on what an AI shopping UX looks like in 2025: natural‑language faceting, dynamic presentation (different layouts for toys vs. shoes), and checkout handled by a trusted payments layer instead of yet another stored card.
Perplexity rolls out Nano Banana–powered virtual try-on for Pro and Max
Perplexity is giving all Pro and Max subscribers a "Virtual Try-On" experience that lets you upload a photo, generate a 3D-ish avatar, and see clothes on your own body while you browse products. feature launch
Under the hood the system uses Nano Banana Pro to generate and dress avatars, then slots into the existing Perplexity shopping flow—no separate app required. avatar generation demo For fashion and e‑commerce teams this is a turnkey reference: the UX is a single Try it on button on product pages, with the heavy lifting done by a background image model, which is exactly the pattern many retailers have been sketching but not yet shipping.
Copilot in Microsoft Edge adds in-browser cashback, price tracking, and comparisons
Microsoft has upgraded Copilot Mode in Edge so U.S. users now get cashback offers, price history charts, price tracking alerts, and cross‑site comparisons directly inside the browser while shopping, instead of relying on separate extensions or manual research. copilot shopping post

For product teams this is the clearest signal yet that the browser itself is becoming the primary AI shopping surface: Copilot watches what you’re viewing, suggests better deals or cashback where available, and keeps you in Edge. That raises the bar for retailers, who now need structured pricing and review data that AI browsers can parse, and for other browsers that don’t yet have a first‑party AI layer keeping users from defecting mid‑journey.
Deliveroo uses ElevenLabs voice agents to re-engage riders and audit restaurants
Deliveroo is now running production voice agents from ElevenLabs to call inactive rider applicants and phone restaurants to verify opening hours and tag status, with early metrics showing 80% of rider applicants re‑engaged and a 75% restaurant contact rate. deployment stats For AI leaders this is a concrete reference case of phone-based agents handling messy, real-world contact tasks that previously needed human ops teams. The system is doing three distinct jobs—re‑engagement, hours verification, and partner-site tagging—with measurable funnel improvements (e.g., 86% of targeted partner sites contacted for tag activation) and no user-facing UI changes, which makes it a good pattern for other marketplaces and logistics platforms considering agent rollouts. case study page
v0 expands free premium access to students at five more universities
Vercel’s v0 is extending its student program: San Francisco State, ASU, Georgia Tech, University of Waterloo, and NYU students now get one year of v0 Premium free once they verify with a school email. v0 student update

The team says the next batch of schools will be picked based on waitlist signups, nudging students to rally peers onto the list. student program page For AI‑curious engineering and design students this is effectively a no‑cost way to get a production‑grade UI generator into their toolbelt, and for companies it means more entry‑level hires will expect v0‑style AI front end scaffolding as a normal part of building.
Character AI launches Stories, a visual interactive fiction format for teens
Character AI is rolling out Stories, a new format where users pick characters and genres, then play through branching, visual, decision‑driven adventures that are built and narrated by AI. stories announcement

It’s aimed squarely at teens and young adults, with a focus on replayable arcs and clear decision points instead of free‑form chat, which matters for safety teams and product leads trying to design "contained" AI experiences. From an enterprise perspective, this is a template for how to turn raw chat models into products with structure, pacing, and shareable outputs instead of endless text scrolls.
ElevenLabs pairs voice agents with FLUX.2 images and video for customer engagement
Alongside its Deliveroo deployment, ElevenLabs is positioning its Agents product as part of a broader engagement stack, highlighting that FLUX.2 image and video generation is now available inside ElevenLabs for things like personalized visuals in outreach or support flows. agent impact summary

The interesting bit isn’t just prettier images—it’s that a single vendor is now bundling outbound phone automation (Agents) with on‑brand visual generation, which makes it easier for non‑tech enterprises to move from "AI voice pilot" to full campaigns that include tailored follow‑ups, banners, and explainer clips without wiring together three separate providers. retake integration note
Genspark shifts from research-only to delivering finished decks, sheets, and sites
Genspark’s founders say they realized users loved its deep research but immediately jumped into PowerPoint and Excel afterwards, so they’ve started evolving the product from "research assistant" into an agent that returns full artifacts: Excel workbooks with charts, ready-to-present slide decks, and even basic websites. genspark panel recap

This is a notable pattern for enterprise AI UX: teams don’t actually want "insights" as paragraphs, they want the thing they would have built from those insights. Genspark is now baking that into the agent scope, which should resonate with any org that has researchers or analysts who still spend half their time copying AI output into Office or internal dashboards.
Perplexity turns Comet into a language tutor and research aide backed by Supermemory
Perplexity’s Comet browser now has a dedicated Language tab that acts as a language tutor, plus a "Supermemory" system that pulls context from active projects and open tabs to answer questions and continue lessons. comet language demo
For knowledge workers and students this turns Comet from a search front‑end into a lightweight learning environment: you can practice Japanese, then later ask for explanations or examples that reference documents you were just working on. Because it’s backed by the same memory system that powers mainline Perplexity, the assistant can maintain continuity over weeks without bolting on separate note‑taking tools. memory plus comet post
⚙️ Serving & runtimes: FP8 RL, Ray symmetric-run, dLLM
Mostly runtime and kernel-side advances with concrete operator changes; lower latency and more stable training/inference loops.
SGLang unifies FP8 training and inference for large RL models
SGLang’s Miles stack now supports end-to-end FP8 for both training and inference in RL-style post-training, replacing the usual BF16/FP8 mix that often causes stability headaches at 30B and 235B MoE scales fp8 announcement. The key design is "Unified FP8": quantization happens consistently on both forward and backward passes, which their KL-loss graphs show yields much lower divergence than mixed-precision baselines over hundreds of RL steps.
They also introduce a detailed error analysis that pins most instability on the quantization step rather than the GEMM kernels, then show that a fully FP8 tensor pipeline (with proper scaling and calibration) cuts TIS-clipfrac and keeps KL loss flat even as training progresses FP8 blog. For RLHF and other long-horizon RL settings, this means you can get the memory and throughput gains of FP8 without watching your runs explode halfway through, and you can experiment with large MoE policies on smaller clusters than a BF16 setup would demand, using the open Miles implementation as a starting point Miles repo.
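To make the quantization-error point concrete, here is a minimal per-tensor FP8 (E4M3) quantize/dequantize sketch in PyTorch, assuming a recent PyTorch build with float8 dtypes; it is illustrative only and not the Miles implementation, which adds proper scaling and calibration across forward and backward passes.

```python
import torch

def fp8_quantize(x: torch.Tensor):
    # Per-tensor symmetric scaling: map the largest |value| onto E4M3's max finite value (~448).
    amax = x.abs().max().clamp(min=1e-12)
    scale = 448.0 / amax
    return (x * scale).to(torch.float8_e4m3fn), scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale

# The SGLang error analysis pins instability on this quantization step, not the GEMMs,
# so measuring round-trip error per tensor is a cheap first diagnostic.
w = torch.randn(4096, 4096)
w_fp8, s = fp8_quantize(w)
err = (w - fp8_dequantize(w_fp8, s)).abs().mean()
print(f"mean abs round-trip error: {err.item():.5f}")
```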
Ray 2.52 adds symmetric-run and Wide-EP for simpler multi-node vLLM
Running vLLM across multiple Ray nodes has been awkward, so the vLLM team introduced a new ray symmetric-run command that launches the same entrypoint on every node while Ray handles startup, coordination, and teardown for you ray symmetric run. This lands alongside Ray 2.52’s Wide-EP and prefill/decode disaggregation APIs tuned for vLLM, which have been validated at around 2.4k tokens per second on H200-class hardware wide ep update.
For anyone building their own serving stack instead of using a managed provider, this means you can keep your application logic in a single script, call ray symmetric-run to fan it out across the cluster, and rely on Ray’s runtime to manage role symmetry, environment variables, and shutdown. The Ray team has a short write-up with configuration examples and caveats on port usage and logging fan-out Ray blog post, which is worth reading if you’ve been juggling separate head/worker entrypoints or hand-rolled SSH glue.
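As a rough sketch of what a symmetric entrypoint can look like, the script below uses only standard vLLM calls; the model name, parallelism degree, and prompt are placeholders, and the exact `ray symmetric-run` invocation and its flags are covered in the Ray blog post rather than assumed here.

```python
# entrypoint.py - one script, launched identically on every node of the Ray cluster.
from vllm import LLM, SamplingParams

def main():
    # vLLM delegates worker placement to Ray; tensor_parallel_size here is an
    # assumption for an 8-GPU spread and should match your actual cluster.
    llm = LLM(
        model="meta-llama/Llama-3.3-70B-Instruct",
        tensor_parallel_size=8,
        distributed_executor_backend="ray",
    )
    params = SamplingParams(max_tokens=128, temperature=0.7)
    out = llm.generate(["Explain why a single symmetric entrypoint simplifies multi-node serving."], params)
    print(out[0].outputs[0].text)

if __name__ == "__main__":
    main()
```

Because every node runs the same file, there is no separate head/worker script to keep in sync; per the announcement, Ray handles startup, coordination, and teardown around it.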
SGLang dLLM framework brings block diffusion to language model serving
The SGLang team also shipped a Diffusion Language Model (dLLM) framework that adds block diffusion, robust KV cache management, and TP/EP parallelism into the same serving/runtime stack dllm release. Instead of treating diffusion-style LMs as a separate research toy, dLLM slots them into SGLang’s existing infrastructure, with PR #12766 wiring up block-wise diffusion updates and support for Ant Group’s new LLaDA2.0-mini and LLaDA2.0-flash models PR details.
The framework already exposes KV cache handling and tensor/experts parallelism suitable for production-scale serving, and the team says CUDA graphs and dynamic batching are planned within the next couple of weeks to close the performance gap with autoregressive engines Miles repo. For practitioners this means you can start experimenting with diffusion LMs—block diffusion, long-context handling, different compute–latency tradeoffs—inside a familiar SGLang runtime, instead of standing up a separate inference stack just to play with the new model class.
💸 AI revenue models and capital needs
Macro revenue and burn scenarios dominate today’s discourse with concrete projections. Separate from infra costs and enterprise case studies.
HSBC sees a $207B–$500B funding gap in OpenAI’s AI build‑out
HSBC’s latest model of OpenAI’s business says that even under bullish assumptions on user growth and monetization, the company ends up with a massive capital hole to fill by 2030. One version of the analysis has OpenAI booking ~$153.8B in revenue that year but racking up nearly $500B in cumulative operating losses, driven by about $140.7B of R&D and $75.4B of COGS. losses breakdown A companion note frames OpenAI as a “compute utility” renting up to 36GW of cloud power from Microsoft and Amazon, with long‑term data‑center commitments approaching $1.8T, and still a roughly $207B funding gap even if everything goes right on the demand side. funding-gap summary

For AI leaders, this is the clearest outside view so far on just how capital‑intensive frontier model training and serving will be if you’re chasing global scale. The report assumes OpenAI keeps leaning on hyperscaler balance sheets for 30+ GW of capacity rather than owning most of the iron directly, which also implies long‑term dependency on cloud partners’ pricing and network strategy. funding-gap summary For developers building on OpenAI, the practical takeaway isn’t that the company is about to run out of money; it’s that the business will be under constant pressure to improve token economics and push higher‑margin enterprise products so that this capex and opex curve eventually bends.
The bigger strategic signal is that compute and energy access, not just model quality, are now central to competitive positioning. If HSBC is even directionally right, any lab trying to match OpenAI at the frontier either needs a similar financing pipeline or a much more compute‑efficient training and inference story. That’s the lens infra buyers, investors, and even regulators are likely to use when they evaluate new model announcements over the next few years.
OpenAI projects 220M ChatGPT subscribers and $270B subs revenue by 2030
OpenAI’s internal projections frame ChatGPT as a Spotify‑scale subscription business by the end of the decade, with about 220M paying users and $270B in cumulative subscription revenue through 2030. subs projection The model assumes 2.6B weekly active users by then, with 8.5% converting to Plus/Pro, generating roughly $87B in subscription revenue in 2030 alone. revenue forecast Right now, they’re at ~800M users and ~35M paying subs (about 5% of weekly actives), so this forecast implies both continued user growth and a noticeable lift in conversion. The go‑to‑market story looks like Zoom/Slack: hook hundreds of millions on a free tier, then upsell teams and enterprises into higher‑priced SKUs. revenue forecast For engineers and product leaders, the signal is that consumer + prosumer SaaS is the core business model, not metered API alone, and that price sensitivity at global scale will cap how aggressive per‑seat pricing can be.
There’s also an implicit capital story here. Even with that revenue arc, OpenAI expects to burn on the order of $100B+ in cash between now and 2030 to get there, which sets expectations about how long they will prioritize growth and model quality over margins. revenue forecast If you’re building on top of their stack, this is a hint that the company is planning for ChatGPT to be a durable platform with massive distribution rather than a short‑lived research toy. followup commentary
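For readers who want to sanity-check those projections, the implied blended subscription price falls out of the quoted figures directly; a quick back-of-the-envelope in Python, where the per-month number is derived here rather than stated by OpenAI:

```python
weekly_actives_2030 = 2.6e9      # projected weekly active users in 2030
conversion_rate = 0.085          # share converting to Plus/Pro
subs_revenue_2030 = 87e9         # projected 2030 subscription revenue, USD

paying_subs = weekly_actives_2030 * conversion_rate     # ~221M, matching the ~220M headline
implied_arpu = subs_revenue_2030 / paying_subs          # implied average revenue per paying user
print(f"paying subs: {paying_subs/1e6:.0f}M, implied ARPU: ${implied_arpu:.0f}/yr (~${implied_arpu/12:.0f}/mo)")
```

Under these assumptions that works out to roughly $33 a month blended across consumer and higher-priced SKUs, which is why the ceiling on per-seat pricing matters so much in this model.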
🧠 Memory, retrieval and token spend hygiene
Practical memory/caching and retrieval benchmarks dominate; mainly app infra to reduce token burn and improve recall quality.
Perplexity Memory auto-loads cross-thread context with incognito safeguards
Perplexity has rolled out a Memory system that lets its assistants recall your threads, interests, and preferences across sessions and even across different underlying models, so follow‑up questions weeks later can start with rich context instead of cold prompts. memory launch It stores things like favorite brands, hobbies, and active projects as structured memories and automatically injects them into relevant future answers to cut down on repetitive prompt boilerplate and wasted tokens. prefs summary
Memory is also wired into the Comet browser: it can pull context from your active projects and open tabs, then use that alongside past chats to answer in‑place while you browse. comet demo On the control side, Perplexity exposes a settings view where you can inspect what’s been remembered, delete entries, or turn Memory off entirely, and any incognito session automatically disables both history and memory so nothing gets written in the first place. privacy controls For teams building on similar patterns, this is a good reference design for balancing token‑saving auto‑context with clear user control and privacy defaults. feature summary
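For teams borrowing this pattern, the control surface is essentially a small, inspectable memory store plus a hard gate on writes; a minimal sketch under assumed names (this is not Perplexity's schema):

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryEntry:
    text: str                      # e.g. "user is learning Japanese"
    created: float = field(default_factory=time.time)

class MemoryStore:
    """Sketch of the controls described above: inspect, delete, disable, and skip writes in incognito."""
    def __init__(self):
        self.entries: list[MemoryEntry] = []
        self.enabled = True

    def remember(self, text: str, incognito: bool = False) -> None:
        if incognito or not self.enabled:
            return                              # incognito sessions never write memory
        self.entries.append(MemoryEntry(text))

    def inspect(self) -> list[str]:
        return [e.text for e in self.entries]   # what a settings view would surface

    def forget(self, index: int) -> None:
        del self.entries[index]

store = MemoryStore()
store.remember("active project: Q4 retrieval benchmark")
store.remember("this should not persist", incognito=True)
print(store.inspect())
```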
Memori 3.0 slashes GPT‑5 token costs ~80% and adds REST API
Caching layer Memori is now at ~6.4k GitHub stars in about three months and claims that at 10M GPT‑5 tokens it can cut spend from roughly $112.50 to $22.50 (about 80% savings) by aggressively caching exact and semantically similar queries in your own SQL store. repo overview Version 3.0 adds a REST API so Python, JavaScript, and Java backends can all talk to the same short‑ and long‑term memory, letting frontends and services share conversation history, TTL rules, and PII tags instead of re‑implementing context logic per stack. api update

Under the hood Memori sits between your app and any LLM provider (OpenAI, Claude, Gemini, local), intercepting prompts and deciding when to serve from cache versus call the model. repo overview Every row tracks lineage, expiration, salience, and sensitivity, so you can both audit where an answer came from and route only appropriate memories to different models, which is exactly the kind of token hygiene you want once you’re into eight‑figure monthly token counts.
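The cache-or-call pattern itself is simple to prototype; below is a minimal exact-match-plus-TTL sketch over SQLite, with semantic similarity, lineage, salience, and PII tagging being the layers Memori adds on top (this is not Memori's API):

```python
import hashlib, sqlite3, time

# Minimal sketch of the cache-or-call pattern (exact match + TTL only).
db = sqlite3.connect("llm_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, answer TEXT, created REAL, ttl REAL)")

def cached_call(prompt: str, call_model, ttl: float = 86_400.0) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    row = db.execute("SELECT answer, created, ttl FROM cache WHERE key = ?", (key,)).fetchone()
    if row and time.time() - row[1] < row[2]:
        return row[0]                      # cache hit: no tokens spent
    answer = call_model(prompt)            # cache miss: pay for the model call
    db.execute("REPLACE INTO cache VALUES (?, ?, ?, ?)", (key, answer, time.time(), ttl))
    db.commit()
    return answer

# call_model would wrap OpenAI/Claude/Gemini/local; a stub keeps the sketch self-contained.
print(cached_call("What is a semantic cache?", lambda p: f"(model answer to: {p})"))
```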
ColBERT-style ToolRet-bge models lead new tool retrieval benchmark
A new public benchmark for tool retrieval shows that old‑school BM25 is still a surprisingly strong baseline, but ColBERT‑style models like GTE‑ModernColBERT and the new ToolRet‑bge‑base/large families clearly win on recall@k for routing user queries to the right tools. benchmark thread In the main comparison, BM25 reaches about 0.48 Recall@1 and ~0.78 Recall@10, while ToolRet‑bge‑large hits roughly 0.69 Recall@1 and ~0.94 Recall@10, outperforming larger general‑purpose embeddings and even strong dense retrievers like Qwen3‑Embedding‑0.6B.

For agent builders this matters because “which tool do I call?” is often the first and most expensive decision: a bad tool pick wastes both latency and tokens. The results suggest that investing in a dedicated, ColBERT‑style index over tools (and maybe pairing it with BM25) will give you more reliable tool routing than just pointing your agent at whatever text embedding you already use for documents. author comment
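A cheap way to test this on your own tool registry is a BM25 first stage over tool descriptions, with a ColBERT-style index layered on as a re-ranker; a small sketch using the rank_bm25 package, with the tools and query invented for illustration:

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

# Hypothetical tool registry: name -> description used for retrieval.
tools = {
    "search_flights": "search airline flights by origin, destination and date",
    "get_weather": "current weather and forecast for a city",
    "create_invoice": "generate a PDF invoice for a customer and amount",
}
corpus = [desc.lower().split() for desc in tools.values()]
bm25 = BM25Okapi(corpus)

def route(query: str, k: int = 2):
    """Cheap first-stage routing; a ColBERT-style dense index would re-rank these candidates."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(tools.keys(), scores), key=lambda t: t[1], reverse=True)
    return ranked[:k]

print(route("weather forecast for tokyo this weekend"))
```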
🎨 Creative stacks: Retake edits, interactive images, VTO
High volume of gen‑media workflows today: surgical video edits, interactive educational images, try‑on UX. Reserved to capture creator traffic.
Gemini’s interactive images turn static diagrams into explorable lessons
Google is rolling out interactive images in Gemini: instead of just generating a picture, it now lets you click or highlight parts of the image and get targeted explanations, labels, or follow‑up text. (learning demo, rollout clip) This isn’t about nicer thumbnails; it’s about turning a single visual into a mini explorable.
In the demos, users tap areas of a fantasy city or a complex science diagram and Gemini responds with short blurbs about each component, almost like tooltips that are generated on demand. learning demo For people building educational content, labs, or onboarding flows, that means you can prompt Gemini for "an interactive overview of a neural network" or "annotated digestive system" and ship something where learners poke at parts instead of skimming paragraphs. The big win is that you don’t need a custom front‑end or separate labeling pass; the model handles both generation and interaction, and you just embed the experience where it makes sense.
Perplexity rolls out Nano Banana–powered virtual try‑on to Pro and Max
Perplexity is adding a virtual try‑on flow for all Pro and Max subscribers: you upload a photo once, it builds a persistent avatar, and you can then try clothes on yourself while browsing products inside the app. (feature launch, shopping flow, avatar demo) Under the hood, the visuals are powered by Nano Banana Pro, so the avatar is high‑fidelity and stays consistent across outfits. avatar demo
From a UX perspective this is more than a gimmick. The flow is: pick item → tap Try it on → Perplexity generates your avatar wearing that exact piece, then keeps you in the shopping context so you can compare sizes or styles without bouncing to another site. shopping flow For AI builders working on commerce or fashion tools, it’s a good reference pattern: avatar creation is a one‑time step, VTO lives alongside search and Q&A, and the AI system handles pose, lighting, and cloth transfer for you. It also shows how visual models like Nano Banana are getting embedded into broader agents instead of living in separate “image gen” sandboxes.
ComfyUI bakes in Topaz Astra, Starlight, Apollo and Bloom for 4K/8K
ComfyUI has integrated Topaz Labs’ flagship models directly into its node graph, giving video and image creators first‑class upscaling, interpolation, and enhancement paths up to 4K video and 8K stills. integration post The bundle includes Astra and Starlight Fast for creative vs. precision upscale, Apollo for frame interpolation, and Bloom for image/face enhancement.
For people already living in Comfy, this removes a whole category of “export to Topaz, re‑import later” friction. You can now chain base model → Topaz upscale → interpolation → encode in one graph, with the same parameters under version control. integration post It also means you can systematically A/B Astra vs. Starlight in the same pipeline or add Apollo only on specific shots (e.g., slow pans) to keep budgets down. If you’re building templates or selling Comfy workflows, this gives you a clean way to ship photo and production‑grade upscales without asking users to juggle extra apps.
Nano Banana Pro + free converter unlock fast pixel‑art game assets
Indie devs are leaning hard on Nano Banana Pro for pixel‑art pipelines: one shared setup generated 625 character concepts in under 10 minutes, then used a free converter tool to turn those into crisp 1:1 pixel art suitable for production sprites. (pixel art workflow, conversion demo, animation tease, conversion thread)
The pattern is to first prompt Nano Banana for a dense 5×5 or larger grid of front‑facing characters in a "highly detailed pixel art" style, with each row themed (e.g., monkey allies, tiger variants) but consistent proportions and camera. (grid prompt advice, grid labeling trick) You then label those cells in a follow‑up pass ("add 1–25 labels"), isolate favorites onto clean white backgrounds, and feed them through Hugo’s open‑source converter that enforces a strict pixel grid and palette. conversion demo From there, people are composing mock game screenshots, 8‑directional sprite sheets, and even simple animation loops by asking Nano Banana to place specific numbered characters into top‑down scenes or action poses. (screenshot prompt, animation tease) The important bit is that all of this runs on commodity tools: no in‑house art team, no custom engine plugins, just a gen‑image model plus one smart post‑process tool chained together.
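The conversion step itself is mostly about enforcing the grid and the palette; a minimal Pillow sketch of that idea, where the grid size, palette count, and file name are arbitrary choices and this is not Hugo's converter:

```python
from PIL import Image

def to_pixel_art(path: str, grid: int = 64, colors: int = 16) -> Image.Image:
    """Snap an image to a strict pixel grid and a limited palette."""
    img = Image.open(path).convert("RGB")
    small = img.resize((grid, grid), Image.NEAREST)                 # enforce the pixel grid
    small = small.quantize(colors=colors, method=Image.MEDIANCUT)   # clamp the palette
    # Scale back up with nearest-neighbour so each cell stays a crisp square.
    return small.resize((grid * 8, grid * 8), Image.NEAREST)

to_pixel_art("character_07.png").save("character_07_pixel.png")    # hypothetical file names
```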
Nano Banana Pro + Veo 3.1 keyframe stack for smooth AI car videos
A detailed workflow is emerging around Nano Banana Pro images and Veo 3.1 video for anyone doing product or automotive visuals: generate a series of labeled car shots in Nano Banana, then feed them as keyframes into Veo 3.1 to animate the transformation with clean motion graphics and camera moves. (workflow thread, keyframe followup)
The stack looks like this: use Nano Banana Pro to design multiple rally or street variants of a car (front, side, 3/4), keeping style and lighting consistent; then pair adjacent frames (e.g., stock → rally kit, closed hood → open hood) as start/end keyframes in Veo. workflow thread Veo handles the in‑between: smooth zooms into the WRX, progressive arrival of decals and aero parts, and even multi‑stage technical overlays like gold contour lines and animated labels for each modification. prompted camera moves For teams doing explainer content or spec breakdowns, this means you can storyboard entirely in images, then let Veo handle motion while you focus prompts on how it moves: exact camera path, when overlays appear, when to switch to hood‑open views. It’s a nice example of how image and video models are starting to line up into predictable pipelines instead of one‑off tricks.
Nano Banana Pro turns doodled instructions into polished portrait edits
Creators are using Nano Banana Pro as an interactive editor, not just a prompt‑only generator: you can ask it to "doodle on this image" with arrows and text instructions, then feed that annotated version back with "implement these edits, remove the instructions" to get a final, cleaned portrait. edit flows graphic

In the shared examples, the workflow is Start → Doodle → Final: first a plain photo, then a version where the model has scrawled notes like "change hair to long messy braid" or "add chunky scarf & denim jacket", then a finished image where those instructions have been carried out and the doodles erased. edit flows graphic A second flow does the same for lighting and texture—"warmer light, glow", "add knit details", "add wall texture"—ending in a rich, golden‑hour look.
For product teams this suggests a different UI: rather than hoping users craft perfect text prompts, you expose a scribble + notes layer on top of the current image and let the model treat that as a visual diff specification. It’s much easier for non‑technical users to circle “make this warmer” than to write a 60‑word prompt, and Nano Banana seems good at reading both the handwriting and the intention.
InVideo FlexFX packages AI motion effects for social video teams
InVideo’s new FlexFX feature is getting a lot of attention from video creators: it takes regular footage or photos and applies prebuilt effects like kaleidoscopic distortions, warp teleports, hero‑walk explosions, and whip‑pan transitions without masking or keyframing. (flexfx overview, effects thread)
The key point is speed. FlexFX works with any input clip, is tuned for social formats, and lets you stack effects so a creator can do “slow zoom + teleporter + explosion” as a preset rather than a three‑hour After Effects session. effects thread In practice it looks like one‑click ways to turn B‑roll into visually dense reels: trippy kaleido overlays for music, clean whip‑pans for scene changes, or an instant hero‑walk shot with a giant CG blast behind the subject.
For people building creator tools, this is a good example of where AI slots in: take tasks that were previously a mix of plugin shopping and timeline micro‑edits, and collapse them into named, composable effects with sane defaults that non‑experts can drag onto a clip.
Lucy Edit Fast on fal focuses on localized video edits at $0.04/s
DecartAI’s Lucy Edit Fast model is now live on fal, offering localized video edits—changing backgrounds, faces, or regions—at roughly $0.04 per second for 720p, with generation times around 10 seconds. (lucy launch, use cases)
Unlike full text‑to‑video systems, Lucy assumes you already have footage. You specify which part of the frame to change and how, and it applies those edits while preserving the rest of the scene, which is closer to the way editors think about revisions. lucy launch The positioning is clear: turn simple takes into polished, on‑brand clips without reshoots, especially for localized campaigns or quick product swaps.
Combined with stacks like FLUX.2 on the image side and Retake for performance tweaks, Lucy fills the “patch this one thing in my clip” niche. It’s the piece you call when the video is 90% right and you don’t want to regenerate the entire sequence.
Replicate’s Image Editing Arena makes model A/Bs a one‑screen decision
Replicate launched an Image Editing Arena that puts the latest editing‑capable models—FLUX.2, Nano Banana, Seedream, Qwen Edit, Reve and others—into a single interface so you can compare cost, speed, and quality on the same input edit. (arena announcement, arena app)
Instead of guessing "which model will handle this product cleanup best?", you upload once, describe the edit (remove background, fix color, relight, change outfit), and the Arena runs each model and shows side‑by‑side outputs plus runtime and price metrics. arena announcement That’s perfect for teams designing a router or picking a default model for specific workloads like in‑painting, UI edits, or text‑heavy composites.
The bigger pattern is that this is an editor‑facing tool, not just a benchmark chart. Designers and art leads can eyeball which outputs feel brand‑safe or on‑style, while engineers capture the latency and cost stats, then codify that into routing logic.
Character AI launches Stories, a visual interactive fiction format for teens
Character AI introduced Stories, a new format where users build interactive, visual adventures starring their favorite AI characters, with branching choices and replayable paths designed especially for teens. stories announcement

Instead of a long back‑and‑forth chat, a Story is a structured episode: you pick characters, a genre, and a premise, then the system generates scenes with art and text where you make decisions that steer the narrative. stories announcement The key design choice is that Stories are meant to be shared and replayed, so they sit somewhere between visual novels, fanfic, and character chat.
If you’re building narrative tools, this is a signal that the market wants more than freeform chat logs. Giving people a way to package AI‑generated content—episodes with art, choices, and a play button—looks like a more native format for how younger users actually consume and show off their creations.
🎞️ Media generation methods: flows and pixel diffusion
New generative methods beyond classic diffusion; mostly video and pixel-space transformer hybrids with faster sampling claims.
Apple’s STARFlow‑V brings normalizing flows to long-horizon video generation
Apple researchers propose STARFlow‑V, an end‑to‑end video generator built on normalizing flows instead of diffusion, and show it can handle text‑to‑video, image‑to‑video and video‑to‑video with a single shared model while remaining competitive with diffusion systems on quality and speed. paper summary

STARFlow‑V compresses each frame into latent codes, runs a global autoregressive block over time for temporal coherence, and uses local blocks for within‑frame detail, which helps reduce error accumulation on long clips and avoids the many‑step iterative denoising loop of diffusion models. paper summary The authors add a lightweight flow score matching denoiser that only peeks one frame ahead, plus a parallelizable block‑wise Jacobi update scheme, to stabilize training and speed up sampling. paper summary The key point for practitioners: this is early but strong evidence that flow‑based, causal video models can approach diffusion‑level fidelity while offering native likelihoods and a more unified text/image/video interface, which could simplify serving stacks that currently juggle multiple specialized generators.
DiP shows ~10× faster pixel-space diffusion with patch-wise detail refinement
The DiP paper introduces a pixel‑space Diffusion Transformer that operates on large patches plus a tiny “patch detailer” head, claiming about a 10× sampling speedup over prior pixel‑space diffusion methods at similar image quality on ImageNet. paper recap

Instead of running diffusion in a latent VAE space (fast but lossy) or on dense per‑pixel tokens (accurate but extremely slow), DiP tokenizes images into relatively coarse patches that a global DiT models for layout and composition, then attaches a small convolutional U‑Net detailer that enriches each patch with high‑frequency edges and textures. paper recap This split lets most of the compute focus on global structure while the local head restores crispness, which the authors report beats larger transformer baselines on both speed and FID under the same hardware budget. paper recap For image‑gen system builders this suggests a promising middle path: you can stay in pixel space (no VAE reconstruction artifacts) and still get practical sampling times, at the cost of a slightly more complex two‑stage architecture that has to be trained and tuned as a unit.
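As a rough illustration of the split the paper describes, the toy module below runs a global transformer over coarse patch tokens and adds a small convolutional head for local detail; the dimensions, depths, and plain conv stack (in place of the paper's U-Net detailer and diffusion conditioning) are assumptions for brevity, not the actual DiP architecture:

```python
import torch
import torch.nn as nn

class CoarsePatchSketch(nn.Module):
    """Toy version of the split: global transformer over large patches for layout,
    plus a small conv head that restores high-frequency detail per patch."""
    def __init__(self, patch=16, dim=384):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # coarse patch tokens
        self.global_blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True), num_layers=4)
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)
        self.detailer = nn.Sequential(                                     # local refinement head
            nn.Conv2d(3, 32, 3, padding=1), nn.GELU(), nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, x):
        tokens = self.embed(x)                          # (B, dim, H/p, W/p)
        b, d, h, w = tokens.shape
        seq = self.global_blocks(tokens.flatten(2).transpose(1, 2))   # global layout/composition
        coarse = self.to_pixels(seq.transpose(1, 2).reshape(b, d, h, w))
        return coarse + self.detailer(coarse)           # add back local high-frequency detail

x = torch.randn(2, 3, 256, 256)
print(CoarsePatchSketch()(x).shape)                     # torch.Size([2, 3, 256, 256])
```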
🤖 Embodied AI: affordable arms and humanoid push
Quieter day but notable hardware notes; mostly product specs and capability demos. Separate from creative/media and model releases.
AgileX NERO brings 7‑DoF research arm down to $2,499
AgileX Robotics announced NERO, a 7‑degree‑of‑freedom robotic arm priced at $2,499 with a 3 kg payload, pitching it as a human‑like, obstacle‑aware platform for advanced robotics research and prototyping NERO summary.
For AI engineers working on manipulation and embodied agents, this is a big deal: NERO sits in the price band where a small lab or startup can afford multiple units, instead of one $20k+ industrial arm. The pitch is human‑like reach and dexterity plus detour/obstacle‑avoidance behaviors baked into the motion stack NERO repost. That makes it a more realistic target for training and testing vision‑language‑action models than toy arms, without jumping to Boston Dynamics–tier budgets.
The interesting angle is that NERO is clearly aimed at the agent era: you can imagine plugging it into a computer‑use + robot stack where a model controls both screen and arm. If the kinematics are well‑documented and the API is open, this becomes a natural hardware choice for labs trying to scale from simulation (Isaac, MuJoCo, etc.) into real‑world pick‑and‑place, small assembly, or lab‑automation tasks. The real test will be how robust the joints, encoders, and control loops are under continuous use, but the specs and price alone will put NERO on a lot of shortlists.
Hyper‑real humanoid robot demo stirs ethics talk
A new humanoid robot demo doing the rounds shows a strikingly human‑like face, expressions, and body proportions, prompting renewed debate on why most major manufacturers still avoid making robots that look truly human humanoid comment.
Commentary around the clip argues that, long term, humanoids will both work like humans and look like them, and that current avoidance is more about ethical and societal hesitation than technical limits humanoid politics. For embodied‑AI people, the point is subtle: we’re close to hardware that can host models in forms that pass casual visual inspection as human. That raises UX and safety questions well beyond balance or grasping—how do you signal capabilities and limits, avoid deception, and design faces and motions that people can work with all day without discomfort?
If your roadmap involves humanoid assistants—factory, elder care, or front‑of‑house—this kind of demo is a reminder that appearance is now a design variable, not a distant sci‑fi concern. The technical stack (actuators, skin materials, expression control) is catching up; the bottleneck is increasingly policy, norms, and what we’re willing to deploy in public spaces.
🗣️ Voice & multimodal assistants in-product
Smaller but notable voice/multimodal UX updates land across assistants; mostly direct user workflows, not raw model releases.
ChatGPT Voice moves inside the main chat UI on web and mobile
OpenAI is rolling its Advanced Voice experience directly into the main ChatGPT chat interface, so you tap the mic and talk while seeing live transcripts and responses in the same thread instead of switching modes. The demo shows answers appearing word‑by‑word as you speak, with inline visuals like maps and images embedded below the transcript, which makes the voice mode feel like a true multimodal assistant rather than a separate feature voice feature demo.
For builders, this is a clear signal that conversational, screenful‑aware UX (text + voice + visuals in one surface) is becoming the default interaction pattern, and that future "voice mode" APIs may need to treat transcripts, tool calls, and rich UI payloads as one continuous stream rather than separate channels.
ChatGPT images tab gets styles, edit shortcuts, and hints of Image v2
OpenAI is updating the ChatGPT Images tab on web and mobile with trending style presets (e.g. "Crayon", "Nightcam", "Fairytale"), one‑tap edit prompts like "Remove people from the background", and a gallery of your past generations, making the image experience feel more like a dedicated creative tool than a one‑off prompt box images tab screenshot. Testers also spotted a hidden "ImageGenV2Banner" on the web app advertising "faster image generation" and "more consistent editing", fuelling speculation that a new image backend (often dubbed "gpt-image-2") is close to launch v2 banner leak web banner copy.

For anyone building on ChatGPT as their user‑facing front end, this means your users will soon have richer, more guided image flows by default—less prompt engineering, more preset transformations—while you still control the higher‑level logic in your app. It also suggests that future image APIs may lean harder into editing pipelines (remove, restyle, relight) rather than only raw text→image generation.
Gemini’s interactive images let users click into scenes for explanations
Google is rolling out a new interactive image format in Gemini where users can tap or click regions of an AI‑generated picture to reveal labels and short explanations, turning static illustrations into explorable diagrams rollout summary. In the demos, selecting parts of a fantasy city or science graphic pops up context boxes like "ancient structure" or "magical energy source", making the images act more like lightweight visual knowledge maps than plain pictures learning use case.
For product teams, this shows an emerging pattern: assistants aren’t only returning text or single images, but stateful canvases where the user can explore sub‑components. It’s especially relevant for education, onboarding flows, and any workflow where you currently ship static infographics or screenshots that users have to mentally parse on their own.
Gemini tests in-app doodle and text annotation on generated images
Google is experimenting with an image editor inside Gemini that lets users draw on and annotate generated images before saving them, including a toolbar for colored brushes and text labels annotation leak. The leaked UI shows a robot-in-a-workshop scene with overlay tools for freehand lines and captions, plus a "Done" button that suggests these edits become part of the assistant’s response or a new asset.

For builders, this hints at a more conversational editing loop where users can mark up exactly what to change ("make this a coloring page", circle objects, add notes) instead of endlessly iterating with text prompts. If this lands in the API, expect a shift from prompt‑only control toward mixed gesture + text workflows for design, marketing and documentation assistants.
Perplexity adds a dedicated language tutor tab inside the Comet browser
Perplexity is rolling out a "Language" tab in its Comet AI browser that turns the assistant into an embedded language tutor, separate from the main search/chat views language tab demo. The promo shows quick access to structured language practice inside the browser, alongside Comet’s new Memory feature that can pull context from active projects and open tabs when enabled memory upgrade.
For engineers, this is another example of assistants becoming modeful inside host apps: instead of one generic chat box, you get purpose‑built surfaces (search, shopping, language learning) sharing a common memory and permissions model. If you’re building similar experiences, it’s a reminder to treat language learning, coding help, and research as distinct UX modes even when the same backend model powers them all.
🏗️ Compute economics, chips and power constraints
Non‑AI exception: concrete supply/cost signals for AI. Focus on tokens‑per‑$ benchmarks, TPU roadmaps and datacenter power. Excludes policy not tied to AI infra.
Epoch says Microsoft’s Fairwater AI data center could peak at 3.3 GW by 2027
Epoch AI estimates Microsoft’s Fairwater, Wisconsin campus will reach a peak electrical load of about 3.3 GW once its fourth building comes online in late 2027, compared with Los Angeles’ 2.4 GW average load in 2023, underscoring how single AI campuses can rival major cities in power draw fairwater estimate. The group stresses that a typical GPT‑4o ChatGPT query today uses roughly as much energy as running a microwave for one second, but uncertainty about how inference‑time compute scaling trades off against hardware and algorithmic efficiency gains means future per‑query energy use could rise or fall, even as hyperscalers keep building city‑scale AI power infrastructure fairwater estimate capacity commentary.

Nvidia H100/B200 lead current tokens-per-dollar economics over MI300X and TPU v6e
Artificial Analysis’ new hardware benchmarking shows NVIDIA’s H100 delivers roughly $1.06 per million input+output tokens at ~30 output tok/s for Llama 3.3 70B with vLLM, versus ~$2.24 on AMD’s MI300X and ~$5.13 on Google’s TPU v6e, a ~2× and ~5× tokens-per-dollar advantage respectively at that reference speed hardware benchmark. The same study notes TPU v7 (Ironwood) will massively increase FLOPs, memory (32GB→192GB), and bandwidth (1.6→7.4 TB/s) over v6e, but warns that its per-token economics are unknowable until Google publishes pricing, so infra teams should treat H100/B200 as the current cost baseline while keeping an eye on upcoming TPU v7 launches hardware benchmark nvidia positioning.
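Reproducing the tokens-per-dollar math for your own quotes is a one-liner; the hourly price and throughput below are placeholders to be replaced with your actual rental rate and measured speed, not Artificial Analysis' reference numbers:

```python
def usd_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec: float) -> float:
    # Cost of 1M tokens for one replica running at steady-state throughput.
    return gpu_hourly_usd / (tokens_per_sec * 3600) * 1_000_000

# Placeholder inputs: an 8-GPU node rental price and aggregate decode throughput.
print(f"${usd_per_million_tokens(gpu_hourly_usd=20.0, tokens_per_sec=2000):.2f} per 1M tokens")
```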
Chinese regulators halt ByteDance’s new Nvidia GPU deployments, tie state funds to local AI chips
Reuters reports that Chinese regulators have blocked ByteDance from using Nvidia GPUs in new data centers and are conditioning state funding for fresh facilities on deploying domestic AI accelerators instead regulator report. The move follows earlier guidance telling Chinese firms to stop ordering Nvidia’s export‑limited H‑series parts, effectively forcing large AI workloads like TikTok and Douyin training to migrate toward local silicon despite performance gaps, which will reshape domestic cost/performance trade‑offs and GPU supply for China‑based AI services regulator report.

Genesis Mission details highlight $50B AWS site and multi‑GW AI supercompute push
Following up on the Genesis Mission executive order, the DOE‑led push widely described as an "AI Manhattan Project", new reporting highlights concrete infra plans: AWS is committing $50B to a single 2.2 GW data center campus, Google is promoting TPU roadmaps with ~4× compute jumps, and Amazon plans to add roughly 500,000 Trn1/Trn2‑class chips for model training and inference under the Genesis umbrella genesis overview. President Trump framed it as "the single largest marshaling of federal resources of scientific discovery since the Apollo program," showing that US public‑sector AI policy is now tightly coupled to enormous energy and compute footprints rather than small pilot clusters biden genesis quote newsletter recap.
Epoch frames hyperscaler AI campuses as new city-scale power consumers
Epoch AI notes that whatever happens at the model level, Microsoft’s Fairwater projections and other hyperscaler builds show a clear trend: single AI campuses are being designed to use multiple gigawatts of power, putting them in the same league as major cities infrastructure comment. They also reiterate that today’s chatbots are relatively modest per query, but that long‑thinking models and test‑time compute scaling could quickly dominate energy budgets if model design does not keep pace with more efficient accelerators and serving stacks energy-per-query.
Space-based TPUs pitched as cheaper AI power if launch costs hit <$200/kg
Sundar Pichai and others are again floating the idea of TPUs in orbit, noting that solar panels in space can generate up to 8× more energy than on Earth and that space offers continuous sunlight and easier radiative cooling for AI compute pichai clip musk space comment. A new analysis argues that if reusable launch costs drop below roughly $200/kg by the mid‑2030s, high‑Earth‑orbit datacenters could deliver power at $6–9 per watt, up to ~50% cheaper than terrestrial solar+grid+cooling, with a chart showing orbital cost per watt undercutting the ~$12/W Earth benchmark once launch falls under $500–600/kg for thin PV and around $1000/kg for compute‑heavy constellations orbital cost analysis.
