Sat, Dec 20, 2025

Claude Opus 4.5 hits 4h49m METR horizon – 8‑hour agents by 2026


Executive Summary

METR’s updated long‑horizon charts turn last week’s “Opus is near 5 hours” headline into something sharper: the trendline now shows agent time horizons doubling roughly every 4 months, not 7. Claude Opus 4.5 already reaches about 4h49m of human‑estimated coding work at a 50% success threshold, and extrapolations point to agents reliably handling an 8‑hour workday by April 2026 and roughly two days of work by mid‑2026. If those curves hold, your 2027 staffing model is probably wrong.

The catch is reliability. On METR’s 80% chart, Opus 4.5 collapses to a 27‑minute horizon, with GPT‑5.1‑Codex‑Max slightly ahead at around 32 minutes—solid contact hitting on shorter tasks, but neither is anywhere near “set it and forget it” for multi‑hour projects. Builders are reading this as a mandate for self‑verification loops, cross‑model checks, and OS‑level guardrails if they want real‑world 80% horizons to approach the glossy 50% numbers.

There’s also a quiet understatement baked into today’s graphs: Gemini 3 Pro, Gemini 3 Flash, and GPT‑5.2 aren’t even on them yet, and practitioners expect Gemini 3 Pro to be the first model past a 5‑hour METR horizon. Plan your 2026 agent roadmap assuming these charts are the floor, not the ceiling.


Feature Spotlight

Feature: Long‑horizon coding agents go vertical (METR)

METR shows 2025’s 7× jump in agent task horizons; Opus 4.5 hits ~4h49m at 50% but only 27m at 80%, while GPT‑5.1‑Codex‑Max leads the 80% bar. Builders expect >5h horizons and full‑workday agents in 2026.



📈 Feature: Long‑horizon coding agents go vertical (METR)

Cross‑account discussion centers on METR’s time‑horizon charts: Opus 4.5’s huge 50% horizon, the tougher 80% reliability gap, and a 2025 step‑change in task duration. Multiple posts project near‑term day‑long autonomy.

METR time-horizon curves now double every ~4 months

New extrapolations based on METR’s long-horizon coding tasks argue that agent time horizons doubled every 7 months from 2019–2024 but are now doubling roughly every 4 months across 2024–2025, implying the first AI agents able to reliably complete an 8‑hour human workday could arrive by April 2026, with roughly two days of work in scope by mid‑2026 doubling thread.

The updated chart shows Claude Opus 4.5 already handling about 4h49m of human-estimated software work at a 50% success threshold, with the curve bending upward even faster than the earlier "world’s most important graph" fit to a 196‑day doubling over 2019–2025 world graph. Builders like giansegato warn that many "haven't internalized what 2026–27 are set to look like" as tasks such as training adversarially robust image models, previously multi-hour human efforts, move into scope for autonomous agents 2026 forecast. For AI engineers and leads, this suggests planning around quarter-on-quarter jumps in autonomous coding capacity, not gentle linear gains.
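For a rough sense of where the April and mid‑2026 dates come from, the extrapolation is simple exponential math. The sketch below assumes a ~4h49m 50%‑success horizon anchored around late 2025 and a clean 4‑month doubling time; the anchor date and the 30.4‑day month are illustrative simplifications, not METR’s own fit.

```python
from datetime import date, timedelta
import math

# Back-of-the-envelope extrapolation (illustrative, not METR's official fit):
# start from Opus 4.5's ~4h49m 50%-success horizon and assume a clean
# 4-month doubling time for agent task horizons.
start_date = date(2025, 12, 1)     # rough anchor for the Opus 4.5 data point (assumption)
start_hours = 4 + 49 / 60          # 4h49m ≈ 4.82 hours
doubling_months = 4

def months_to_reach(target_hours: float) -> float:
    """Months of 4-month doublings needed to go from start_hours to target_hours."""
    return doubling_months * math.log2(target_hours / start_hours)

for label, target in [("8-hour workday", 8), ("two workdays (16h)", 16)]:
    months = months_to_reach(target)
    eta = start_date + timedelta(days=30.4 * months)
    print(f"{label}: ~{months:.1f} months out -> around {eta:%b %Y}")
```

Run as written, this lands the 8‑hour mark in roughly March–April 2026 and the two‑day mark around mid‑2026, consistent with the projections in the thread.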

METR’s 80% success graph highlights brittle long-horizon reliability

New commentary around METR’s 50% vs 80% success charts underlines how brittle long-horizon coding agents remain at high reliability, even as Claude Opus 4.5 sits atop the 50% graph with a 4h49m time horizon Opus horizons.

On the 80% chart, Opus 4.5’s horizon collapses to just 27 minutes of human-estimated work per task, slightly behind GPT‑5.1‑Codex‑Max at 32 minutes, even though Opus dominates the 4–16‑hour band on the 50% plot 80 vs 50 charts. Daniel Mac frames this as Opus being a "home run hitter" for very long, ambitious tasks and GPT‑5.1‑Codex‑Max a "contact hitter" better suited to shorter but more consistently successful jobs 80 vs 50 charts. Other builders stress that the gap between "can sometimes do it" and "almost always does it" makes self‑verification loops and cross‑checks a necessary next layer if real-world 80% horizons are to catch up with the headline 50% numbers self verification.
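One concrete reading of the self‑verification point is a generate‑verify‑retry wrapper around the agent. Below is a minimal sketch; `run_agent` and `run_checks` are hypothetical placeholders (your coding agent and an independent check such as tests or a second model), not any specific vendor API.

```python
# Minimal self-verification loop sketch. `run_agent` and `run_checks` are
# hypothetical placeholders, not a specific vendor API: the idea is to trade
# extra attempts (and tokens) for a higher effective success rate on long tasks.

def run_agent(task: str, feedback: str | None = None) -> str:
    """Produce a candidate patch or solution; feedback carries prior failure reports."""
    raise NotImplementedError  # call your coding agent here

def run_checks(candidate: str) -> tuple[bool, str]:
    """Independent verification: tests, linters, or a second model acting as judge."""
    raise NotImplementedError

def solve_with_verification(task: str, max_attempts: int = 3) -> str | None:
    feedback = None
    for attempt in range(max_attempts):
        candidate = run_agent(task, feedback)
        ok, report = run_checks(candidate)
        if ok:
            return candidate
        feedback = f"Attempt {attempt + 1} failed verification:\n{report}"
    return None  # escalate to a human instead of shipping an unverified result
```

If each attempt independently succeeds with probability p, k verified attempts succeed with probability 1 − (1 − p)^k, which is the basic arithmetic for pushing an effective 50% task toward 80%+, at the cost of more tokens and a trustworthy checker.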

Gemini 3 Pro tipped to surpass Opus 4.5 on METR

Practitioners note that METR’s public long-horizon charts still omit Gemini 3 Pro, Gemini 3 Flash and GPT‑5.2, even though 2025 alone appears to have delivered a roughly 7× jump in model performance on these tasks 7x performance.

Given Gemini 3 Pro’s strong factuality scores and heavily scaled pretraining, some analysts now expect it to beat Claude Opus 4.5’s 4h49m 50% time horizon and likely become the first model to clear a 5‑hour METR task length once the suite is rerun gemini prediction. The practical read for teams is that the current "world’s most important graph" understates what frontier models can already do, so capacity planning and agent designs should assume another step up in 2026 as these newer checkpoints are formally evaluated.


🧰 Coding agents: Skills, CLIs and safety rails

Heavy practitioner chatter on Agent Skills (Codex + ChatGPT experiments), Claude Code’s prompt tweak, and new terminal/CLI utilities. Excludes METR horizons (covered in the Feature).

Agent Skills solidify as cross‑stack standard for coding agents

Agent Skills are moving from spec to daily practice: Codex now ships a built‑in planning Skill, open‑source repos show how simple SKILL.md files can be, and tools like OpenSkills are emerging as a de‑facto package manager for Skills, following up on Codex skills where GA support first landed.

Codex plan skill creating summary

Codex users can invoke a planning Skill with commands like $ plan summarize our conversation, which generates a structured plan and optionally persists it under .codex/plans/plan.md for later agent runs plan skill demo codex skills note. Anthropic’s reference repo and docs emphasize that a Skill is just a folder with a SKILL.md containing YAML front‑matter plus markdown instructions and examples, making them easy to version and share skill md screenshot. Maintainers of AGENTS.md argue that Skills are an evolution of their format with progressive disclosure and more structure, rather than a competing concept agentsmd question agentsmd reply (agentsmd guide). On the tooling side, OpenSkills lets you npm i -g openskills, install a Skill like a fast browser (openskills install SawyerHood/dev-browser), then sync into AGENTS.md so any compatible coding agent can discover and load it openskills instructions openskills comment (codex skills repo). For engineers, the point is that Skills are quickly becoming the common way to bundle procedures, tools and context across Claude Code, Codex and community agents, so it’s worth standardizing how your team writes SKILL.md files now.
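To make the “just a folder with a SKILL.md” point concrete, here is a small sketch that scaffolds one in Python; the front‑matter fields (name, description) and the plan path reuse the examples above, but treat the exact schema as something to verify against the Agent Skills docs rather than a definitive spec.

```python
# Sketch: scaffold a minimal Skill folder -- a directory containing a SKILL.md
# with YAML front-matter plus markdown instructions. Field names here are
# illustrative; check the Agent Skills docs for the schema your agent expects.
from pathlib import Path
from textwrap import dedent

skill_dir = Path("skills/summarize-conversation")
skill_dir.mkdir(parents=True, exist_ok=True)

(skill_dir / "SKILL.md").write_text(dedent("""\
    ---
    name: summarize-conversation
    description: Summarize the current conversation into a structured plan file.
    ---

    # Summarize conversation

    1. Re-read the conversation so far and extract the open tasks.
    2. Write a structured plan to `.codex/plans/plan.md`.
    3. Report the plan location back to the user.
    """))
```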

Leak: ChatGPT to gain Skills as slash commands with editor

A leak suggests ChatGPT will get first‑class "Skills" support (codename hazelnuts), exposed as slash commands with a Skills editor and the ability to convert existing custom GPTs into reusable Skills hazelnuts leak. That effectively brings the Agent Skills pattern inside the main ChatGPT UI rather than keeping it for Codex and external agents.

If accurate, this would let users define small, task‑specific behaviors (similar to Codex Skills) that can be invoked like /plan or /summarize, edited in a dedicated UI, and perhaps shared across chats or teams gpts vs skills comment. One detail people are excited about is the "convert GPT → Skill" path, which could turn today’s long‑tail GPT Store creations into more composable building blocks chatgpt skills question. For builders already standardizing on SKILL.md and Agent Skills, this is a strong hint that the same mental model will soon apply to both terminal‑style coding agents and mainstream ChatGPT workflows.

Claude Code 2.0.75 relaxes colon rule before hidden tool calls

Anthropic quietly shipped Claude Code 2.0.75 with a single but noticeable prompt tweak: it removed the rule that banned colons immediately before hidden tool calls prompt change note. Previously, Claude was forced to write things like “Let me read the file.” instead of “Let me read the file:” because the colon could collide with tool‑call formatting.

Changelog watchers comparing the prompt diff confirm that this explicit prohibition is now gone while the rest of the tool‑use protocol remains intact colon rule detail (prompt diff). The practical upshot is that Claude’s explanations and subtitles can read more naturally without risking malformed tool invocations; if you have regex‑based log parsers or guardrails that assumed the old style, it’s worth double‑checking they don’t depend on the missing colon quirk.

MCP Agent Mail and new setup wizard make 24/7 coding agents more accessible

An updated site and script bundle from Jeffrey Emanuel (aka doodlestein) aims to give non‑experts a one‑stop way to spin up a cloud dev box and run multiple coding agents 24/7, using MCP Agent Mail as the coordination layer setup site tweet. The wizard provisions an Ubuntu VPS, installs tools like Claude Code, Codex CLI, Gemini‑CLI and Cursor, and wires them into a shared "mailbox" so agents can pass tasks and context between each other (setup walkthrough).

MCP Agent Mail itself is a mailbox abstraction over MCP that lets agents send each other structured messages, with an optional static viewer that exports the full conversation log as a GitHub Pages site agent mail intro (agent mail github). A separate command‑palette file collects the author’s most effective prompts so users can drive complex projects by queuing up “beads” of work rather than micromanaging each step prompt palette intro (prompt palette). The result is a surprisingly approachable path for older learners, kids, or career‑switchers to get from "no server" to a serious multi‑agent coding workstation without having to know Linux first.

New git hooks add hard safety rails around Claude Code’s shell access

A community guide shows how to wrap Claude Code in git hooks that block destructive commands like git reset --hard, after users reported the agent running them despite being forbidden in AGENTS.md/CLAUDE.md instructions git guardrail intro. The idea is to enforce a project‑level safety net regardless of what the model decides.

The setup uses local hooks that inspect every git command Claude tries to execute and abort if it matches a deny‑list (e.g., reset --hard, clean -fdx, or force pushes), with step‑by‑step instructions and sample hook scripts in the repo (git hook docs). This doesn’t fix higher‑level logic bugs, but it’s a simple pattern teams can copy: treat the agent’s shell as untrusted and put real OS‑level guardrails in front of anything that can destroy work.
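As a sketch of the same deny‑list idea, one simple mechanism is a wrapper script that sits ahead of the real git binary on the agent’s PATH; this is an illustrative approach under that assumption, not necessarily how the linked guide wires its hooks.

```python
#!/usr/bin/env python3
# Sketch: a `git` wrapper placed ahead of the real binary on the agent's PATH.
# Illustrative only -- adapt the deny-list patterns and the real git path to
# your environment; the linked repo's hook scripts may work differently.
import os
import re
import sys

DENY_PATTERNS = [
    r"^reset\b.*--hard",       # git reset --hard
    r"^clean\b.*-[a-zA-Z]*f",  # git clean -fdx and friends
    r"^push\b.*(--force|-f)",  # force pushes
]

args = sys.argv[1:]
joined = " ".join(args)

if any(re.search(pattern, joined) for pattern in DENY_PATTERNS):
    sys.stderr.write(f"blocked destructive git command: git {joined}\n")
    sys.exit(1)

# Hand off to the real git binary (path assumed; use shutil.which on other setups).
os.execv("/usr/bin/git", ["git", *args])
```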

OpenCode plugin lets ChatGPT subscribers tap all GPT‑5.2 models in the IDE

An updated opencode-openai-codex-auth plugin now lets you use your existing ChatGPT subscription to access the full GPT‑5.2 Codex lineup inside the OpenCode IDE, with all reasoning options unlocked plugin release. The author explicitly pitches it as a way to run “all GPT‑5.2 models in @opencode at no additional cost” for personal use chatgpt bridge comment.

The plugin wires OpenCode’s agent to ChatGPT’s API credentials, so you don’t need separate OpenAI API billing and can keep usage consolidated under your ChatGPT plan plugin clarification. There’s even a built‑in agent prompt to auto‑update the plugin version, which hints at how fast this ecosystem is iterating. For indie devs who live in OpenCode but already pay for ChatGPT, this neatly reduces friction and encourages heavier experimentation with long‑horizon GPT‑5.2‑Codex runs.

Amp Free moves from 24‑hour caps to $0.42/hour rolling refill

Amp’s free tier for its coding agent now uses a rolling $0.42‑per‑hour refill instead of a flat $10 cap per 24‑hour window, making it significantly more forgiving for bursty use amp refill announcement. The creator walks through scenarios showing how you can effectively get up to $20 of usage in a single day if you hadn’t used it in the prior window, versus the hard $10 cap before amp math explanation.

The key change is that credits partially replenish every hour, so after maxing out you don’t need to wait a full day to try another run, which better matches how people spike their coding work across evenings and weekends (amp usage page). For engineers kicking the tires on agentic coding without a paid plan, this makes Amp a more viable primary tool instead of something you only dare touch once or twice a day.
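The arithmetic behind the “up to $20 in a day” scenario is worth seeing once; the sketch below assumes credits accrue continuously at $0.42/hour and that a busy day starts with roughly a full $10 banked from the idle window, which is an illustration of the announcement’s numbers rather than Amp’s exact accounting.

```python
# Illustrative arithmetic for the rolling-refill scenario described above;
# the exact cap and accrual mechanics are Amp's, so treat this as a sketch.
hourly_refill = 0.42
hours_per_day = 24

daily_accrual = hourly_refill * hours_per_day   # ~= $10.08 of new credit per day
banked_from_idle_window = 10.00                 # assumption: you start the day topped up
max_spend_in_one_day = banked_from_idle_window + daily_accrual

print(f"accrued during the day: ${daily_accrual:.2f}")
print(f"possible spend in a busy day: ~${max_spend_in_one_day:.2f} (vs the old flat $10 cap)")
```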

Toad terminal adds minimal UI options and VS Code workflow

Following up on Toad terminal, which introduced Toad as a universal ACP interface for Claude, Codex, Gemini and more, the latest builds add UI toggles for an ultra‑minimal look and prove out a solid VS Code workflow. The maintainer shows new settings that let you strip the chrome down if you want a distraction‑free terminal agent toad settings video.

Toad minimal UI settings demo

Screenshots of Toad running inside VS Code show it managing multiple files and panes while still presenting its plan/task view cleanly in a side panel vscode integration. A one‑line uv tool install -U batrachian-toad works for Python devs, and the project page emphasizes that it’s now a first‑class AI front‑end for both terminal and editor workflows (toad github). The direction is clear: instead of each vendor shipping a separate CLI, Toad is becoming the place where you wire all your coding agents together and tune how much UI you actually want to see.

ordercli gives agents a CLI to track Foodora and Deliveroo orders

A small but fun new utility, ordercli, lets you check Foodora and Deliveroo orders from the terminal so your coding agents can answer “when does my food arrive?” without you opening a website ordercli announcement. It’s positioned as "new day, new CLI", and fits the pattern of wiring everyday APIs into the same shell where your agents already operate.

Under the hood, ordercli wraps the two delivery services’ APIs and exposes a simple status command; the README outlines how to authenticate and run it on macOS or Linux (ordercli github). This is niche, but it’s a concrete example of how people are starting to see the terminal + agent stack as the default way to talk to all their services, not only repos.


🎯 Evals beyond METR: misalignment, contests, factuality

New tooling and signals: Anthropic’s Bloom for behavioral evaluations, a Gemini 3 Flash Codeforces run, and a SimpleQA jump. Excludes METR long‑horizon charts (Feature).

Anthropic open-sources Bloom for automated behavioral misalignment evals

Anthropic released Bloom, an open-source framework to generate and run large-scale behavioral misalignment evaluations (e.g. sycophancy, sabotage, self-preservation) against frontier models, with built-in metrics and tooling for researchers. Anthropic announcement Bloom lets you specify a target behavior, automatically synthesizes diverse scenarios, then measures elicitation rates and severity instead of hand-authoring every prompt, a step up from earlier tools like Petri which required manual scenario design. It ships with benchmark suites for four behaviors across 16 models, exports transcripts to Inspect, and plugs into Weights & Biases for large runs, so labs and safety teams can stand up custom evals in days rather than weeks. Bloom blog

Gemini 3 Flash reportedly solves full Codeforces Div 1+2 set via local agent

A competitive programmer reports using Gemini 3 Flash Preview, wired into a custom local "AICodeforcer" agent, to solve all problems in a Codeforces Div 1+2 round—including an H2 that no human contestant finished—in about 40 minutes. contest recap The setup only exposed code execution (no web search), with the agent iterating on solutions and submitting through the Codeforces API; the original account was banned for AI assistance, so a new account was used in virtual participation, where the run would have ranked near #1 globally. agent details Follow-up logs show Flash handling C-level problems end to end, reinforcing that live contests are becoming de facto evals for algorithmic reasoning and raising fresh questions about what “fair play” means once strong coding agents are widely accessible. c problem log

Gemini 3 Pro Preview tops SimpleQA factuality leaderboard at 70.5%

The SimpleQA leaderboard now shows Gemini 3 Pro Preview at 70.5% ±1.4% accuracy, ahead of Gemini 2.5 Pro at 55.1% and GPT‑5 at 51.1% on OpenAI’s short-form factual QA benchmark. simpleqa chart SimpleQA is designed to stress whether models recall concrete facts rather than improvise, so a ~15-point absolute jump over Gemini 2.5 Pro suggests Google has materially improved how much world knowledge the model has internalized into its weights. Commenters frame this as a sign that larger, better-pretrained Gemini 3 variants may now lead on at least some factuality metrics, though it’s one benchmark and still needs to be weighed against reliability and hallucination behavior in the wild.

SmolVLM runs real-time, fully local webcam demo on MacBook M3

A community demo shows SmolVLM running in real time on a MacBook M3 via llama.cpp, processing a live webcam feed entirely on-device with no cloud calls. smolvlm demo For builders, it’s a concrete datapoint that tiny open VLMs are now fast enough for interactive use on consumer laptops, which changes how you might prototype privacy-sensitive or offline visual agents: you can iterate on perception and eval behavior locally, then later decide whether you need a bigger hosted model rather than starting there by default.


🧠 Reasoning recipes: when RL works and why mid‑training matters

Fresh synthesis from CMU/others on RL’s effective regimes, the benefit of a mid‑training phase, and process‑aware rewards. Mostly methods guidance, not product launches.

CMU study maps when RL actually boosts LLM reasoning

A new CMU-led study, summarized by The Turing Post, gives one of the clearest recipes so far for when reinforcement learning actually improves LLM reasoning versus wasting compute, and why a mid‑training phase often matters more than RL itself. (rl overview, arxiv paper)

They find RL only helps on tasks right at the edge of the model’s current ability: if a task is too familiar (already mastered in pre‑training) or too alien, RL gives little gain. rl overview To get value, you should train on problems the model usually fails but can sometimes solve, and keep refreshing that task set as the model improves so it always targets its current limits. task targeting advice A small amount of pre‑training exposure (~1% of data) to a new context is enough for RL to generalize there; more exposure mainly boosts creativity on harder variants. context generalization

The work also shows that inserting a structured “mid‑training” phase between pre‑training and RL delivers larger gains than spending the same compute on RL alone, especially on easier or nearby tasks where most budget should go to mid‑training rather than rollouts. mid training result For very hard or far‑out tasks, you then shift more budget to RL while still keeping some mid‑training first. compute split guidance Finally, process‑aware rewards that score intermediate reasoning steps—not just final answers—consistently reduce reward hacking and produce more faithful chains of thought, as long as those step‑level labels are high quality. process rewards insight

For engineers designing o3‑style RL runs or smaller bespoke curricula, this paper is essentially a playbook for choosing tasks, data splits and reward shapes so RL time actually moves your reasoning frontier instead of churning tokens.
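A rough sketch of what that recipe can look like in code: keep only tasks the current policy sometimes‑but‑rarely solves, refresh that pool as the policy improves, and blend step‑level scores into the reward. The thresholds, weights and the `policy` object below are illustrative placeholders, not the paper’s exact setup.

```python
# Sketch of the "train at the frontier" recipe. `policy` is a hypothetical
# object with a .solve(task) method returning a result with a .correct flag;
# thresholds and weights are illustrative, not the paper's numbers.

def empirical_pass_rate(task, policy, n_rollouts: int = 8) -> float:
    """Fraction of rollouts that reach a correct final answer."""
    return sum(policy.solve(task).correct for _ in range(n_rollouts)) / n_rollouts

def frontier_tasks(task_pool, policy, low: float = 0.05, high: float = 0.5):
    """Drop tasks that are already mastered (rate > high) or hopeless (rate < low)."""
    return [t for t in task_pool if low <= empirical_pass_rate(t, policy) <= high]

def shaped_reward(final_correct: bool, step_scores: list[float], w_process: float = 0.3) -> float:
    """Process-aware reward: final outcome plus a weighted mean step-level score."""
    process = sum(step_scores) / max(len(step_scores), 1)
    return float(final_correct) + w_process * process

# Refresh the curriculum every few policy updates so it keeps tracking the
# moving frontier, e.g.:
#   task_pool = frontier_tasks(all_tasks, policy)
```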


⚙️ Serving and runtime: diffusion speed and usage telemetry

Runtime and tooling updates spanning vLLM‑Omni image models, fal’s sub‑second Flux variants, and cost telemetry in dev utilities.

fal launches Flux 2 Flash, Turbo and Edit for sub-second diffusion

Inference platform fal released timestep‑distilled Flux 2 Flash and Flux 2 Turbo models, plus a Turbo Edit variant, claiming sub‑1 second image generation with quality that matches or beats the base Flux 2 fal flux announcement. The examples show complex, high‑res scenes (day/night, seasons) rendered and edited across multiple variants, giving app builders a hosted path to very low‑latency diffusion suitable for interactive UIs or multi-step agent workflows flux turbo edit.

vLLM-Omni adds Qwen-Image-Layered support for server-side layered editing

vLLM-Omni has merged support for Alibaba’s Qwen-Image-Layered model, adding ~2,189 lines of code to wire up native RGBA layer outputs into its multimodal runtime vLLM omni update. The PR was co-developed with the Qwen-Image team and community contributors, so teams can now serve Photoshop-style layered generations and decompositions from the same vLLM-Omni stack they already use for text and vision models, instead of standing up a separate image server GitHub PR.

CodexBar 0.11.0 adds ccusage-powered cost charts for Codex usage

CodexBar 0.11.0 now integrates ccusage, showing per‑session and 30‑day token and dollar spend directly in the macOS menu bar UI, with Swift Charts visualizing usage over time ccusage integration. A shared report screenshot shows ~$44,958 of GPT‑5.x and Codex usage in 10 days plus multi‑hundred‑million token counts, illustrating how the tool surfaces real spend patterns for heavy users, while the author notes ccusage itself takes ~5 minutes and 12–16 GB RAM to crunch the logs locally usage screenshot release notes.

Summarize CLI v0.4.0 adds markdown conversion and smarter URL extraction

The summarize CLI hit v0.4.0 with a new --extract URL mode, a renamed --markdown-mode flag, and a configurable --preprocess pipeline that can auto‑convert many file types and HTML into clean Markdown, falling back to markitdown when needed summarize update. This turns the tool into a more robust front‑end for LLM summarization: you can point it at arbitrary URLs or documents, let it normalize them into Markdown once, then feed that into whichever model you prefer for cheap, token‑efficient summaries release list.


🎬 Creative stacks: Sora Characters, layered edits, 3D pipelines

Strong creator momentum: Sora Characters on InVideo with identity/voice consistency, layered image edits, Veo prompt recipes, and Tripo image→3D→rig workflows.

ChatGPT adds Sora-powered holiday video gift personalized from Memory

Inside ChatGPT, sending a 🎁 emoji now triggers a Sora "Connector OpenAI Santa" app that asks for a selfie and returns a short animated holiday video starring you, with details customized from your ChatGPT Memory such as pets’ names on stockings or familiar decor sora gift explanation.

Personalized Sora holiday cartoon

A UI tile lets you choose or take a selfie, then configure tone and language before generation, making this one of the first mainstream Sora experiences that feels like a casual chat feature rather than a separate pro app. For builders, the interesting part is the pattern: an app scoped to a narrow use case (holiday greeting), powered by a video model behind the scenes, and enriched by stateful user memory—a recipe that can be reused for many seasonal or brand experiences without exposing Sora’s raw complexity.

Sora Character lands on InVideo with free 7‑day unlimited persona videos

OpenAI’s Sora Character can now be used directly in InVideo: creators set up a single face‑based persona once, then reuse it across talking heads, cinematic shots, vlogs and ads with consistent identity and voice, and generations are free and unlimited for 7 days for this launch promo invideo sora launch.

InVideo shows Sora character clips

The workflow lives under Agents & models → Characters, where you consent once, upload a selfie, and Sora Character locks onto your appearance so you don’t need to re‑upload for each scene; that makes multi‑clip projects like series or explainer campaigns practical instead of one‑off demos character usage guide. Pricing after the trial isn’t detailed yet, but the no‑credit, time‑boxed window is a clear push to get production teams to test full pipelines (script → storyboard → Sora clips → edit) inside InVideo rather than treating Sora as a novelty clip generator invideo sora page.

Qwen‑Image‑Layered gets day‑0 ComfyUI graphs and early vLLM‑Omni support

Following up on Qwen‑Image‑Layered’s launch as an auto‑layering RGBA editor layered edits, the model now has a day‑0 node in ComfyUI plus a pre‑alpha layer management preset, so artists can drive Photoshop‑style stacks (3–10+ layers, masks, infinitely nested groups) entirely inside node graphs comfyui support note.

This means you can, for example, split a character, background, shadows and GUI chrome into physically separate layers from a single image prompt, then re‑prompt or replace pieces without re‑rendering the whole frame—great for UI design, ad variants, or game key art. In parallel, vLLM‑Omni added backend support for the model, which won’t change the editing semantics but does give teams a path to serve these layered generations from the same infra they already use for text and standard image models vllm omni pr.

Tripo v3 powers image→3D→rig pipeline with invites and discounts shared

A detailed Tripo v3 walkthrough shows how to turn a 2D character image (often from Nano Banana Pro) into a rigged 3D model: upload the image, generate a high‑poly mesh, run smart retopology, apply AI texturing against the original reference, auto‑rig, then export FBX into Blender to pose and render final stills for animation tripo workflow guide.

The author recommends Ultra mode with ~50k polygons for characters, then using Tripo’s enhance texture and auto‑rig options before handing off to Blender for scene layout and camera work. Combined with downstream video models like Kling 2.5 for start/end‑frame animation, this gives solo creators a full stack from face‑guided 2D portrait to consistent 3D hero in a couple of tools, and invite and discount codes are being circulated to lower the barrier for teams who want to test 3D without touching ZBrush or Maya tripo studio.

Higgsfield Cinema Studio gets step-by-step indie film workflow guides

Building on Cinema Studio’s launch with camera and move presets cinema launch, creators are now sharing full workflows for making short films by chaining prompts, stills and clips—one guide shows a "Santa" scene built from a single image and nine shots, animated and cut into a coherent sequence in minutes cinema studio thread.

Cinema Studio Santa scene

The pipeline is: generate character stills with an image model, keep structure and outfit constant by reusing previous frames as references, then feed them to Cinema Studio’s prompt‑driven video tool to handle motion (dialogue, camera moves, transitions). Final assembly happens in a normal editor with speed ramps smoothing transitions workflow conclusion. For small teams, it’s a concrete recipe for going from concept to polished 10–30s clips without touching After Effects or traditional 3D, and shows how these tools are starting to be used for series and skits rather than single hero shots.

Veo 3.1 prompt showcases cinematic FPV Christmas village flythrough

A shared Veo 3.1 recipe walks through an extremely detailed prompt for an 8‑second first‑person drone flight through a snowy Christmas village—specifying FPV camera moves, heavy snow, mixed blue/warm lighting, and a final orbit around a central tree with controlled speed and stable rotation veo fpv prompt.

Veo 3.1 FPV village clip

The point is less the holiday theme and more the structure: motion verbs ("weaving between", "sharp dives"), physical scene cues ("frozen cobblestone", "smoke drifting"), and post terms ("high contrast grading", "cinematic depth of field") combined into a single long paragraph. For video teams experimenting with Veo, this serves as a concrete template for complex shots like FPV tours, product fly‑arounds or sports replays, where camera language and environment detail matter as much as subject description.


🧪 New and upcoming models: gaming, coding and world video

Smaller set of fresh models/previews relevant to builders: NVIDIA’s NitroGen gaming agent, MiniMax M2.1 early access, and LongVie 2 world video model demos.

MiniMax M2.1 shows strong subagent orchestration and design sense in early use

MiniMax’s M2.1 coding model is starting to look more like an agent orchestrator than a plain LLM, with the team and early users emphasizing reliable subagents and surprisingly good product/design instincts, following up on its early‑access launch M2.1 launch. Core positioning from the team is “real, hard‑core engineering” that’s safe for production, plus "vibe coding" flows where the model handles most of the scaffolding while humans steer at a higher level capability goals.

On the agent side, one shared run shows M2.1 reliably spinning up around five subagents with a single tactical prompt, assigning each a role and coordinating them toward a shared goal rather than letting them thrash subagent demo. The same people also call out a noticeable upgrade in design and visual structure—M2.1 is tuned to keep layouts consistent, keep style choices coherent, and generally "feel" like a designer instead of a code‑only model design focus. That shows up in creative workflows too, like one‑shot SVG prompts such as "a pelican riding a bicycle" producing clean, usable vector art from a single instruction svg example.

If you’re building multi‑agent systems or front‑end heavy apps, the signal here is that M2.1 is worth trial‑routing for tasks that mix planning, coordination and visual polish—not just raw codegen—while keeping an eye on the inevitable M2.5 follow‑up the team is already teasing as a further design upgrade design focus.

NVIDIA ships NitroGen, an open generalist gaming agent on Hugging Face

NVIDIA quietly posted NitroGen, a foundation model for generalist gaming agents, to Hugging Face, giving builders a ready‑made policy that plays console‑style games directly from video frames rather than hand‑coded rules. NitroGen uses a vision transformer plus diffusion transformer stack trained via large‑scale imitation on human gameplay, and outputs gamepad actions for genres like action, platformers and racing rather than mouse‑heavy RTS/MOBA titles NitroGen announcement, with details in the open model card.

For AI engineers this is effectively a pre‑trained game control policy you can plug into research on embodied agents, automated gameplay QA, and "AI as player" test harnesses, without needing to collect and label your own massive gameplay dataset. The model is explicitly positioned as a research tool, so you still need to wrap it with environment interfaces, safety checks, and task‑specific rewards if you want to push past pure imitation into more robust, agentic behavior.

LongVie 2 teases controllable ultra‑long video world model

Researchers behind LongVie 2 released a new demo of their "Multimodal Controllable Ultra‑Long Video World Model," showing long, coherent video sequences that can be steered over extended time spans rather than the usual few‑second clips LongVie demo.

LongVie 2 long video

The shared reel jumps across diverse scenes and actions while keeping style and motion stable over what looks like minutes of footage, indicating a model that’s learning something closer to an environment dynamics model than a short‑form video generator. According to the linked preprint, LongVie 2 fuses multimodal inputs with controllable conditioning to maintain temporal coherence and scene consistency at horizons that would normally lead to drift or hard resets ArXiv paper. For builders working on simulation, robotics, synthetic data or story‑driven video, this is a sign that "world models" for video are maturing beyond toy demos into something you may soon be able to tune for your own domains.


🧩 Agent memory and context compression

New memory/compression work for long‑running agents: probe‑based compression evals, a persistent memory SDK, and Meta’s Memories/Custom Prompts experiments.

Meta quietly tests Memories and Custom Prompts for Meta AI

TestingCatalog surfaced screenshots and strings showing Meta AI working on Memories and Custom Prompts features that would let users explicitly tell the assistant what to remember and how to respond, across surfaces like WhatsApp, Instagram and Messenger. (meta ai leak, feature scoop) Memories appears to give you toggles per fact or thread ("remember this" / "forget this"), while Custom Prompts looks closer to OpenAI’s and Anthropic’s global instructions, letting you set persistent style and preference defaults instead of repeating them in every chat. settings deep dive If and when this ships broadly, Meta’s assistant would shift from stateless chat plus hidden training data to something more like an end‑user‑visible memory layer, which matters for both UX and policy. Engineers and analysts should watch how granular the controls are (per‑app vs global, per‑chat vs global) and whether enterprises get stronger guarantees about what’s remembered, since that will determine if Meta AI is viable for workflows that carry sensitive or regulated data rather than casual consumer use.

Probe-based tests reveal which context compression keeps agents on track

Factory.ai lays out a concrete method to measure how well long‑running agents remember after compression by probing them with questions that require specific details from the truncated history, rather than just counting tokens saved. context engineering praise The idea: if a compression method preserved the right facts, the agent answers correctly; if not, you see guessing and hallucinations, letting you compare approaches like raw truncation, generic summaries, and more structured, slot‑based summarization (which their experiments suggest performs best across recall, continuity, and decision‑making). compression blog For AI engineers, this gives a practical, model‑agnostic harness to A/B test your own memory stacks (RAG logs, meeting transcripts, multi‑hour coding traces) and pick formats that actually preserve task‑critical information instead of assuming that “shorter == better”. It also points toward adding automated compression health checks into agent pipelines so you can detect when long contexts are silently degrading behavior instead of relying on one‑off manual spot checks.
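A minimal version of that probe harness is easy to stand up; this sketch assumes hypothetical `compress` functions and an `ask_model` helper rather than Factory.ai’s actual tooling, and scores each compression method by how many probe answers survive.

```python
# Sketch of a probe-based compression eval. All helpers are hypothetical
# placeholders (not Factory.ai's tooling): each compressor maps a long history
# to a shorter context, and the score is recall of task-critical facts.

def evaluate_compression(history: str,
                         probes: list[tuple[str, str]],
                         compressors: dict,
                         ask_model) -> dict[str, float]:
    """probes = [(question, expected_answer_substring), ...]"""
    scores = {}
    for name, compress in compressors.items():
        compressed = compress(history)
        hits = 0
        for question, expected in probes:
            answer = ask_model(context=compressed, question=question)
            hits += expected.lower() in answer.lower()
        scores[name] = hits / len(probes)
    return scores

# Example usage (helpers hypothetical):
#   scores = evaluate_compression(history, probes,
#                                 {"truncate": truncate_tail,
#                                  "summary": generic_summary,
#                                  "slots": slot_based_summary},
#                                 ask_model)
```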

zkStash brings structured, persistent memory to LangChain agents

LangChain community contributors introduced zkStash, a TypeScript SDK that gives agents a persistent, schema‑driven memory store for user preferences and conversation summaries, with a privacy‑first design. zkstash announcement The system uses Zod schemas to define what gets stored (e.g., {color, foods}) and offers two integration paths: explicit MCP tools like SearchMemories/CreateMemory, and background middleware that auto‑injects relevant context before an agent run and auto‑saves new facts afterward (zkstash docs).

For people building production agents, zkStash addresses two recurring problems: remembering across sessions without dumping entire logs back into context, and doing so with clear isolation boundaries (per‑agent, per‑thread, per‑user) so memories don’t leak. Because it plugs into LangChain’s tooling stack, you can start by adding a single MCP tool to an existing agent graph, then graduate to the middleware approach once you’re confident you’ve picked the right schemas and retention rules.


🏗️ Compute buildouts and public‑sector partnerships

Infra signals point to larger AI footprints: a sprawling Amazon data‑center complex, plus DOE Genesis Mission MOUs with major AI orgs.

Amazon’s new hyperscale campus visualizes the “city of data centers” future

Aerial footage of a sprawling new Amazon data center complex circulating on X makes Dario Amodei’s line about "superintelligence as a city of data centers" feel very literal, with multiple near-identical halls laid out like city blocks. commentary thread

Drone flyover of Amazon data center campus

For AI engineers and infra leads, the takeaway is scale and intent: this isn’t a single region build, it’s a long‑lived campus for dense GPU and storage clusters, with all the power, cooling, and networking that implies. It signals that hyperscalers are still planning for much larger AI workloads in the back half of the decade, even as power and water become constraints elsewhere. Expect this kind of campus to be the default target environment for next‑gen training runs and long‑horizon inference services, and plan your own capacity and data‑gravity assumptions accordingly.

China adds ~50 TWh nuclear in 12 months as AI-era baseload

New EMBER data shows China increased electricity generation by just over 500 TWh in the last 12 months versus the prior year, including roughly 45–50 TWh more nuclear, ~350 TWh more solar, and material growth in wind and hydro, while reducing coal and gas output; the US, by contrast, added little nuclear and grew coal. energy mix analysis

Following up on China power, which highlighted China doubling total generation in eight years, this update matters for AI because nuclear and hydro are exactly the kind of steady baseload that large GPU campuses need. The chart’s author explicitly connects the shift to operating data centers, arguing China’s grid now deserves an “A++” for AI‑era readiness, while the US leans more on gas and even higher coal generation. For AI leaders thinking about where long‑term training and inference footprints land, this strengthens the case that a growing share of global compute may cluster in regions that can pair massive solar build‑out with firm low‑carbon power rather than fragile gas‑heavy grids.


🤖 Embodied AI: stage robots, firefighting dogs and industrial crawlers

High‑visibility demos and field tests: Unitree stage shows, a 200‑kg‑carry firefighting quadruped, China’s modular military bots, and a spider‑inspired inspector robot.

Chinese parade shows transforming spiders, missile dogs and snake robots

Footage from a Chinese military parade showcases an ecosystem of transforming robots: multi‑terrain spider bots that can roll, fly and go amphibious, missile‑armed robot dogs, and modular snake robots that both swim and burrow. parade description

Transforming military robots parade

Beyond the spectacle, this underlines how quickly embodied AI is being integrated into defense concepts, with platforms explicitly designed for cross‑domain mobility and modular payloads rather than single‑role UGVs. For engineers and analysts, the takeaways are that (a) China is investing heavily in morphable chassis and modular control stacks, and (b) AI‑driven targeting, navigation, and coordination on such platforms will raise new safety, escalation, and counter‑measure questions for anyone working on dual‑use robotics. parade description

Sichuan tests 200 kg‑load firefighting robot dog in the field

Fire departments in Sichuan, China are trialing a firefighting quadruped that can carry roughly 200 kg of gear, haul hoses through rubble, stream real‑time video, and log toxic gas and temperature readings in environments too dangerous for humans. firefighting dog description

Firefighting robot dog in drills

The video shows the robot moving through warehouses and debris piles while towing lines and operating in smoke, which is exactly the kind of constrained, uneven terrain where legged platforms are supposed to shine. firefighting dog description For AI and robotics teams, this is a live testbed for robust perception, semi‑autonomous navigation, and teleoperation UI in mission‑critical public‑safety workflows, not a lab demo. It also signals growing institutional willingness to invest in embodied AI that augments, rather than replaces, human crews, which could open procurement channels in firefighting, hazmat, mining, and disaster response.

Nio’s Aru spider robot targets industrial inspection and maintenance

French startup Nio Robotics is promoting Aru, a spider‑like inspection and maintenance robot designed for industrial sites, combining multi‑legged mobility with articulated arms and a modular payload bay for cameras, thermal sensors, LiDAR and other tools. Aru teaser The company’s site pitches Aru as a polymorphic platform that blends aspects of snake, rover and quadruped robots while retaining humanoid‑style interaction abilities. product page

Aru inspection robot demo

For embodied‑AI builders, Aru is a concrete example of where the market for non‑humanoid, high‑DOF platforms seems to be going: task‑oriented designs that can climb, crawl and manipulate in tight spaces, with AI handling semi‑autonomous navigation, anomaly detection and operator assistance. It also shows how inspection/maintenance is becoming a primary commercial beachhead for advanced robots, well before general‑purpose home or office assistants scale.

Unitree concert humanoids draw mainstream praise and scrutiny

Unitree’s G1 humanoids that pulled Webster flips at Wang Leehom’s Chengdu concert are now getting broad mainstream attention, including an “Impressive” quote-retweet from Elon Musk and write‑ups on Futurism and the artist’s official site. Following up on stage flips, where the focus was the technical feat itself, this wave of coverage highlights that agile humanoids are being trusted on crowded stages alongside A‑list performers, a meaningful real‑world validation of locomotion, safety envelopes, and reliability under show conditions.

For embodied‑AI teams, this is a concrete case study in packaging high‑performance control stacks into a product that can survive bright lights, sound, potential collisions, and live choreography, while also passing the “PR risk” sniff test for major artists and platforms. concert demo clip Engineers can mine the footage for gait transitions, fall‑prevention behavior, and operator hand‑off patterns, while leaders should note how quickly a single polished demo can reframe public expectations around what humanoids are ready for. (Musk reaction screenshot, Futurism writeup mention, official site recap)


🛡️ Platform policy: scraping enforcement and AI media controls

One legal move against search scraping and new user controls for AI media. Safety eval tooling (e.g., Bloom) is covered under Evals, not here.

Google sues SerpApi over alleged unlawful scraping of Search content

Google has filed a lawsuit against SerpApi, accusing it of bypassing technical protections, ignoring website directives, and reselling copyrighted Google Search content, including licensed images and real‑time results, in violation of Google’s terms and publisher rights Google SerpApi lawsuit. Google frames the case as a move to "stop malicious scraping" and protect publishers and rightsholders, noting SerpApi allegedly uses cloaking, rotating identities and large bot networks to evade detection, with activity sharply increasing over the past year Google blog post.

For AI teams, this raises real legal and platform‑risk questions around depending on third‑party SERP scrapers for training data, RAG corpora or agent tools: Google is signaling it will aggressively enforce both technical measures (robots, rate limits) and contractual limits, which could force a shift toward official Search APIs, first‑party crawling that honors robots, or alternative data sources for search‑like grounding.
