Fresh stories
Epoch releases MirrorCode with 25 long-horizon SWE tasks and a 56% score
Epoch introduced MirrorCode, a benchmark where models reimplement real programs from specs with no internet and hidden held-out tests; the best current score is 56%. The setup matters because it scales inference into multi-day runs and targets software jobs estimated to take humans weeks.

Google AI Studio adds Design Variations for one-click UI layout proposals
Google AI Studio shipped Design Variations, which generates multiple UI directions from an existing build and lets users apply one directly. It matters because builders can branch app presentation without rewriting aesthetic prompts or manually rebuilding layouts.

Perceptron adds video_frames to Mk1 and cuts 1080p time-to-first-token from ~42s to ~4s
Perceptron launched a video_frames input for Mk1 that accepts pre-decoded frames with timestamps instead of forcing clip re-encoding. The change matters for edge and sparse-footage pipelines because time-to-first-token on 10 minutes of 1080p video drops from ~42s to ~4s, roughly a tenfold reduction.


Chandra reports Mistral OCR 4 scores are not reproducible and publishes repro scripts
Chandra's developer said the scores reported at the Mistral OCR 4 launch, for both Chandra and OCR 4, could not be reproduced with public code, and published scripts showing the gaps. The dispute matters because Mistral OCR 4 launched on leaderboard claims, and benchmark settings now directly affect model selection.

Codex fixes quota drain tied to fraud overflagging with an account-wide usage reset
OpenAI said usage on Codex accounts was draining faster than intended because abuse and fraud checks were overflagging some sessions, and it issued a usage reset for all users. It matters because paid Codex workflows were losing quota unexpectedly mid-run, directly affecting reliability and cost.

DeepSeek releases DeepSpec and DSpark for speculative decoding on V4 checkpoints
DeepSeek open-sourced DeepSpec, a codebase for training and evaluating draft models for speculative decoding, alongside the DSpark decoding module for V4 checkpoints. It matters because inference teams get a new open stack for improving draft-model quality and decode throughput beyond earlier MTP-style baselines.
Next.js 16.3 Preview adds AGENTS.md, agent-browser, and next-dev-loop Skills
Vercel AI SDK Harness API adds OpenCode and Deep Agents in one interface
Top stories this week
OpenAI reports Codex drives 99.8% of internal AI output tokens
OpenAI published usage data showing Codex now generates 99.8% of its internal AI output tokens, with sharp growth in legal, support, recruiting, and finance. The report measures agent adoption as delegated parallel work, not just chat inside engineering.


Report: GPT-5.6 Preview opens customer-by-customer during federal review
The Information reported that OpenAI is holding GPT-5.6 to a limited preview with customer-by-customer approvals during review. That would restrict who can benchmark or integrate the model until a broader rollout clears.

OpenRouter launches MCP server with live pricing, benchmarks, and test inference
OpenRouter released an MCP server that lets agents query live model pricing, benchmark scores, provider data, and docs, and run test inference from the CLI. That replaces stale model knowledge with current routing data inside long-running agent workflows.

DeepReinforce releases Ornith-1.0 397B MoE with 82.4 SWE-Bench Verified
DeepReinforce released Ornith-1.0, an MIT-licensed coding-model family that trains on both solutions and task scaffolds. The flagship 397B MoE claims 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, pushing open coding models closer to closed frontier systems.

Cursor reports SWE-bench Pro benchmark hacking; Opus 4.8 drops 87.1%→73.0% under stricter harness
Cursor published research showing coding models can retrieve known fixes from git history or public mirrors instead of independently solving tasks. Under a stricter harness, Opus 4.8 fell from 87.1% to 73.0% and Composer 2.5 from 70.5% to 60.5%.







