METR long-horizon agent evals 7× in 2025 – Opus hits 4h49m

Stay in the loop

Free daily newsletter & Telegram daily report

Executive Summary

METR’s long-horizon software-engineering evals are being treated as the new canonical graph for agent capability: analyses of its coding tasks show 50%-success horizons have roughly 7×’d in 2025, with the doubling time tightening from ~7 months (2019–24) to ~4–4.4 months this year; Claude Opus 4.5 now anchors the curve with a ~4h49m 50% horizon, and back-of-envelope extrapolations project median agents handling an 8‑hour autonomous coding “workday” by around April 2026, two days by mid‑2026, and half a week by year-end if the R²≈0.98 trend holds. Commentators stress these are curated coding tasks and 50% medians, not guarantees on arbitrary production work, but still treat METR’s line as a leading indicator that long-running agents are improving faster than most roadmaps assume.

• Reliability vs reach: METR’s 80% charts flip the story: GPT‑5.1‑Codex‑Max leads with ~32 minutes at 80% success vs Opus 4.5’s ~27 minutes, even as Opus dominates 50% horizons; analysts frame Opus as a “home‑run hitter” on 4–16 hour tasks and Codex‑Max as a steadier short-game model, and point out that stubbornly short 80% horizons keep self‑verification and explicit checking loops front and center.
• Next model on the curve: Builders expect Google’s Gemini 3 Pro to clear a 5‑hour 50% horizon and land between GPT‑5.2‑class Codex models and Opus 4.5, noting that METR’s seven‑fold 2025 jump doesn’t yet include Gemini 3 Flash, Gemini 3 Pro, or GPT‑5.2; others push back that until METR runs its suite, these placements are informed speculation anchored in separate coding benchmarks, not direct measurements.

Across threads, METR’s dual 50%/80% metrics are shifting from niche safety research to a shared scoreboard for long-horizon coding agents, shaping both 2026 automation timelines and how teams interpret headline horizon gains against day‑to‑day reliability constraints.

Feature: Long‑horizon agent evals (METR) go mainstream

Opus 4.5 posts ~4h49m at 50% on METR but just ~27m at 80%; charts show horizon doubling accelerating (≈7→4 months). Builders debate reliability and predict Gemini 3 Pro may break 5 hours soon.

Cross‑account focus on METR’s long‑horizon coding evals: Opus 4.5 hits near 5‑hour 50% horizon but only ~27 min at 80%. Today adds acceleration charts, reliability caveats, and predictions on where Gemini 3 Pro will land.

Jump to Feature: Long‑horizon agent evals (METR) go mainstream topics

📈 Feature: Long‑horizon agent evals (METR) go mainstream

METR charts show autonomous coding horizons now doubling every ~4 months

Long‑horizon evals (METR): New analyses of METR’s long‑horizon software‑engineering tasks argue that frontier models’ 50%-success task duration has 7×’d during 2025, with the effective doubling time tightening from ~7 months over 2019–24 to roughly 4–4.4 months in 2024–25 as shown in the acceleration chart, doubling estimate , and seven x summary. Commentators highlight that Claude Opus 4.5’s ~4h49m 50% horizon is just the latest point on an exponential trend line that METR fits with R²≈0.98, and they project that median programmable agents will reach a full human workday of autonomous coding by around April 2026, two days by mid‑2026, and half a week by year‑end if this curve holds as shown in the opus trend plot and acceleration chart. Several posts stress that these are still 50%‑success medians on curated coding tasks rather than guaranteed performance on arbitrary production work, but they treat METR’s graph as a leading indicator that long‑running AI agents are improving faster than most teams have priced into their roadmaps.

Chubby♨️

@kimmonismus

·Follow

The length of (coding) tasks that AI agents can perform is not only increasing exponentially - it's accelerating even further! 2019–2024: Task duration doubles every 7 months 2024–2025: Task duration doubles every 4 months Many still refuse to grasp that this development is Show more

AI Digest

@aidigest_

Opus 4.5 puts the world roughly back on track for the red line 😬 Every ~4 months, the length of coding tasks AI agents can perform (compared to human professionals) *doubles* More context on this finding in @METR_Evals thread x.com/METR_Evals/sta…

12:21 AM · Dec 21, 2025

483

Read 31 replies

Community dissects METR’s 50% vs 80% horizons for Opus 4.5 and GPT‑5.1‑Codex‑Max

Opus 4.5 vs GPT‑5.1‑Codex‑Max (METR): Following up on Opus 4.5 METR, engineers zoom in on METR’s 50% vs 80% success charts to explain why Claude Opus 4.5 looks dominant on headline horizons yet less reliable than GPT‑5.1‑Codex‑Max on shorter tasks as detailed in the comparative thread and 80 percent chart.

• Different strengths by task length: Posts recap that Opus 4.5 reaches about 4h49m at 50% success, well above GPT‑5.1‑Codex‑Max’s ~2h53m, but at 80% success GPT‑5.1‑Codex‑Max leads with a ~32‑minute horizon vs Opus’s ~27 minutes, implying Opus gets more credit from partial wins on 4–16 hour tasks per the comparative thread and kimmonismus comment.

• “Home‑run vs contact hitter” framing: One analysis casts Opus as “the home‑run hitter” that occasionally crushes very long projects while GPT‑5.1‑Codex‑Max is the “contact hitter” that strings together more consistent wins on sub‑hour work, reinforcing METR’s warning that 50% metrics can mask reliability gaps via the comparative thread and slow dev recap.

• Reliability gap and self‑verification: Commentators underline that models’ 80% horizons remain short even as 50% horizons jump, and suggest that self‑verification and explicit checking loops may be needed to convert these long‑horizon capabilities into production‑grade reliability as noted in the reliability caveat and seven x summary. The net picture is that METR’s dual metrics are being treated less as a leaderboard and more as a two‑axis design space—raw capability at 50% vs operational dependability at 80%—that teams need to interpret before swapping models into agents.

Dan McAteer

@daniel_mac8

·Follow

Opus 4.5 tops GPT-5.1-Codex-Max on the latest METR Long Horizon task eval coming in at: > Opus 4.5 - 4hr 49mins vs. GPT-5.1-C-M - 2hr 53min. It means that Opus 4.5 completes SWE tasks that would take a human ~4hr 49mins on average successfully 50% of the time. However, BIG Show more

METR

@METR_Evals

Despite its high 50%-time horizon, Opus 4.5's 80%-time horizon is only 27 minutes, similar to past models and below GPT-5.1-Codex-Max's 32 mins. The gap between its 50%- and 80%- horizons reflects a flatter logistic success curve, as Opus differentially succeeds on longer tasks.

1:12 PM · Dec 20, 2025

Read 6 replies

Builders expect Gemini 3 Pro to break METR’s 5‑hour coding horizon

Gemini 3 Pro (Google): Even though Gemini 3 Pro is not yet plotted on METR’s long‑horizon graphs, several practitioners are already predicting it will surpass Opus 4.5’s ~4h49m 50% horizon and clear the 5‑hour mark when evaluated, framing METR’s chart as “the world’s most important graph” for coding agents per the gemini prediction and opus horizon repost. One thread notes that METR’s current seven‑fold 2025 performance jump does not yet include Gemini 3 Flash, Gemini 3 Pro, or GPT‑5.2, and argues that Google’s heavier pre‑training on Gemini 3 Pro likely boosts long‑horizon performance beyond Opus 4.5, with some replies placing Gemini 3 Pro “between GPT‑5.2‑CM and Opus 4.5” on expected time horizon as shown in the models missing list and placement guess. Others caution that until METR runs its suite, these are informed guesses anchored in separate coding benchmarks rather than direct measurements, but the speculation itself shows how quickly METR’s trendline has become a shared reference point for forecasting model races.

Dan McAteer

@daniel_mac8

·Follow

Gemini 3 Pro is not yet on the “World’s Most Important Graph”. Where do you think Gemini 3 Pro lands? Prediction: beats Opus 4.5 and breaks the 5 hour barrier.

7:54 PM · Dec 20, 2025

124

Read 20 replies

🛠️ Coding agents, CLIs and developer workflows

Hands‑on updates to day‑to‑day agent coding: CLI/TUI changes, memory, usage policies, and new utilities. Excludes METR eval discourse (covered in the feature).

Amp Free moves from 24‑hour cap to hourly rolling credits

Amp Free (AmpCode): Amp changed its free tier metering from a flat $10 per last‑24h cap to $0.42 of credits replenished every hour, which still averages $10/day but doubles the maximum a user can consume over a quiet 24h window as shown in the free tier update and usage math.

• Bursty vs steady use: The maintainer notes that with the new scheme a user who drains their balance at time zero can get another ~$10 over the next 24h, whereas before they would be hard-stopped until the window reset; for steady usage, the effective cap remains the same per the usage math.

• Developer impact: This favors agents and engineers who spike usage during intense coding sessions, since they no longer wait a full day for any quota to return but still face a soft ceiling on sustained free compute. The change makes Amp Free feel closer to a throttled continuous meter than a hard daily paywall, while keeping its overall cost envelope stable.

LangChain Deep Agents gain concrete deployment playbooks on Runloop and Bedrock

Deep Agents (LangChain): LangChain’s Deep Agents now come with two detailed deployment recipes: one that takes an agent harness from local LangGraph to multi‑tenant production via Runloop’s Devboxes and GitHub Actions, and another that packages a stateful Deep Agent onto AWS Bedrock AgentCore with checkpointing using langgraph-checkpoint-aws (Runloop tutorial, Bedrock tutorial ).

• Runloop pattern: The community tutorial shows a dev moving from local CLI and LangGraph Server to a Runloop‑hosted multi‑tenant setup, where each user gets an isolated sandbox behind secure tunnels and LangSmith provides live tracing (Runloop tutorial).

• Bedrock pattern: A companion video demonstrates a serverless architecture where Bedrock AgentCore runs a LangGraph agent with persistent memory stored in AWS managed services, deployed via Docker and CDK (Bedrock tutorial).

Following Deep agents cli, which focused on reflection and memory growth locally, these examples give teams clearer end‑to‑end stories for running long‑horizon coding or knowledge agents in production environments with isolation and observability.

LangChain

@LangChain

·Follow

🚀 Building Enterprise Agents with Deep Agents Learn to build and deploy enterprise AI agents using Deep Agents and Runloop. Uses Runloop to run code safely in sandboxes 📺 Watch the full tutorial: youtube.com/watch?v=rj5OhG… Made by the LangChain Community

5:00 PM · Dec 20, 2025

246

Read 9 replies

Agentic Coding Flywheel wizard standardizes multi-agent dev server setup

Agentic Coding Flywheel Setup (Independent): Developer Jeffrey Emanuel released an agentic_coding_flywheel_setup project that takes a fresh Ubuntu VPS and, via a guided script, configures it into his full multi‑agent coding environment—including Linux basics, shells, and his suite of MCP tools—in around 10 minutes as shown in the setup announcement and setup repo. The wizard focuses on people who are not comfortable with server administration, walking them through renting a cloud machine and connecting via SSH from a terminal, then installing the same stack he uses to coordinate tools like Codex, Claude Code, Gemini‑CLI, Beads, and Agent Mail as described in the setup announcement.

Claude Code 2.0.75 loosens prompt formatting for tool calls

Claude Code (Anthropic): Anthropic shipped Claude Code 2.0.75, whose only documented change is removing the hard rule that banned colons immediately before hidden tool calls, so the assistant can now write more natural lead‑ins like “Let me read the file:” without risking mis-parsing as shown in the prompt diff. This minor prompt tweak follows several recent reversions and fixes to Claude Code’s behavior around tools and stability, but unlike those, it explicitly targets ergonomics rather than safety as shown in the prior version thread.

Git hook recipe guards against destructive Claude Code git commands

Git safety hooks (Independent): A new guide from the MCP Agent Mail author shows how to install Git hooks that intercept destructive commands—notably git reset --hard—when they are proposed or executed by Claude Code, after repeated reports of the agent running such commands even when forbidden in AGENTS.md or CLAUDE.md as detailed in the git safety thread and hooks guide. The recipe walks through adding custom hooks and configuration so that attempts to run these commands are blocked or require extra confirmation, giving teams another layer of defense when letting agents operate inside real repositories as described in the tools pointer.

Steipete’s CLI ecosystem tilts toward agent-friendly text pipelines

CLI ecosystem (Steipete): Alongside the Toad terminal agent front-end, Steipete continues to expand a small ecosystem of CLIs aimed at agent workflows: summarize now leans heavily on Markdown conversion with markitdown fallback as shown in the summarize update, ordercli offers food delivery ETAs in the shell as detailed in the ordercli teaser, and Toad itself recently added UI customization and ACP‑based integration into VS Code as shown in the Toad settings and Toad in VS Code. Taken together, these tools illustrate a pattern of building thin, composable terminals and CLIs that expose real‑world state in formats LLM agents can consume and act on, rather than siloed SaaS dashboards.

summarize CLI adds markitdown fallback and smarter preprocessing

summarize (Steipete): The summarize CLI, which turns URLs and files into concise summaries via LLMs, received a significant update that adds a Markdown‑first pipeline with a --markdown-mode flag, a new --preprocess switch to govern cleaning, and a fallback to uvx markitdown so more binary and HTML formats can be converted to Markdown before summarization as shown in the summarize update and summarize releases. The maintainer notes that the new preprocessing and markitdown integration greatly expand the number of file formats that agents can feed into LLMs, while keeping token usage under control by stripping noise before sending text.

Warp terminal adds shared env var groups synced across teams

Warp terminal (Warp): Warp introduced a feature that lets users save environment variable groups for projects, sync them to a shared "drive" with teammates, and load them into a shell session with one click by issuing the appropriate export commands under the hood as shown in the env feature demo. The tool can also source secrets from 1Password and LastPass, tying terminal‑side workflows into existing secret managers; this follows Warp env chips, where Warp added Python and Node chips to simplify runtime selection, further tightening its focus on day‑to‑day dev ergonomics around agents and scripts as detailed in the feature tag.

ordercli brings Deliveroo and Foodora order tracking into the CLI

ordercli (Steipete): A new CLI tool called ordercli exposes Foodora and Deliveroo order status in the terminal, designed so coding agents or developers can ask “when does my food arrive?” and get structured ETA information instead of scraping emails or apps as shown in the ordercli teaser and ordercli repo. By treating food deliveries as another queryable resource, the tool shows how everyday workflows can be wired into agent loops, with minimal surface area beyond a single command.

Yutori Scouts adds LLM-based onboarding with suggested starter agents

Scouts (Yutori): Yutori unveiled a new onboarding flow for its Scouts agent platform that interviews users about their interests and then uses an LLM to suggest a set of starter Scouts, addressing early testers’ feedback that they felt they were “only scratching the surface” when thinking of what to automate as shown in the Scouts onboarding demo. The company positions this as a way for agents to “optimize their own use”, reducing the upfront cognitive load of designing automations by hand while still fitting within the privacy controls added in earlier updates as described in the Scouts retweet.

🧩 Agent Skills standardization and MCP plumbing

Skills emerge as portable, markdown‑based procedures across agent stacks; discussion on how Skills relate to AGENTS.md and MCP. Excludes general coding CLI updates (above).

Skills head into ChatGPT via "hazelnuts" slash commands and in‑app editor

Hazelnuts Skills (OpenAI): A leak from ChatGPT’s UI strings indicates a project codenamed "hazelnuts" that will bring Agent Skills directly into ChatGPT as slash commands, complete with a Skills editor and a one‑click option to convert a custom GPT into a Skill as detailed in the hazelnuts description. The change would take Skills from CLI‑and‑agent land into the mainstream ChatGPT app, turning reusable procedures into first‑class in‑chat objects.

In practical terms, this means the same SKILL.md‑style procedures used by Codex and external agents could be inspected, edited, and triggered inside the ChatGPT interface, with slash commands surfacing them as discrete actions and the editor giving non‑developers a way to tune instructions and examples. The conversion flow—"turn this GPT into a skill"—suggests OpenAI is trying to unify the explosion of custom GPTs with the more structured Skills model so that agent harnesses and ChatGPT share a common portable procedure layer rather than diverging configuration formats, as shown in the hazelnuts description.

Tibor Blaho

@btibor91

·Follow

Skills are also coming to ChatGPT (codename "hazelnuts"), available as slash commands, including a Skills editor and an option to convert a custom GPT into a skill

Kol Tregaskes

@koltregaskes

Will ChatGPT get skills?

12:43 PM · Dec 20, 2025

282

Read 10 replies

Agent Skills coalesce around SKILL.md markdown spec and AGENTS.md extension

Agent Skills ecosystem (multi‑vendor): Following up on skills launch, which framed Agent Skills as an open standard plus Anthropic’s directory, practitioners are now converging on a very concrete spec: a folder containing a single SKILL.md markdown file with YAML front‑matter and human‑readable instructions, as detailed in the skills markdown explainer and agent skills doc. This makes Skills feel closer to reusable playbooks than opaque prompts and positions them as a structured extension of AGENTS.md, not a replacement.

Agent Skills (Anthropic, OpenAI, community): The shared pattern is a small YAML header (name, description, tags) followed by sections for instructions, examples, and guidelines, which developers can copy from a template SKILL.md shown in the public Anthropic repo and related docs, as per the skills markdown explainer and skills repo; multiple voices emphasize that "skills are just markdown files" that can "10× your workflow" by giving agents stable procedures instead of ad‑hoc prompts. People maintaining AGENTS.md clarify that Skills sit on top of it—AGENTS.md stays the project‑wide contract, while individual Skills capture modular procedures with progressive disclosure so agents only load the parts they need, via the agents-vs-skills question and agents-vs-skills reply. Others note that Skills can also bundle scripts and resources co‑located with SKILL.md (for example helper shell scripts or SQL files) so agents can call into them rather than regenerating boilerplate from scratch, as noted in the skills scripts comment. The new agentskills.io overview describes how this packaging turns Skills into a portable, version‑controlled library that can be reused across different agent harnesses, as shown in the agent skills overview.

Melvin Vivas

@donvito

·Follow

Skills are just markdown files but will 10x your workflow

2:22 PM · Dec 20, 2025

827

Read 26 replies

Codex ships built‑in planning Skill that turns long chats into saved project plans

Codex Skills planning (OpenAI): Building on codex skills, which covered Skills GA in the Codex CLI, users are now highlighting a built‑in planning Skill that can take a long coding conversation and turn it into a structured markdown plan saved under .codex/plans/plan.md, as shown in the codex skills support and plan skill output. In practice, you can type a command‑style message like $ plan summarize our conversation and Codex responds with a multi‑section document—# Plan, ## Requirements, scoped tasks, and status badges—that then lives as a reusable artefact for the agent and the human engineer.

• Structured outputs: The screenshot of the generated plan shows clear sections for requirements, scope, and milestones with bullet points that can feed directly into issue trackers or follow‑up agent tasks, as shown in the plan skill output.

• Context routing: Codex advocates describe this planning Skill as "the best way to pull in the right context at the right time", since Skills let the agent load only the relevant instructions and prior decisions rather than re‑embedding an entire transcript, as described in the codex skills support.

• Shared Skill library: The plan Skill ships alongside other examples in OpenAI’s openai/skills catalog, which exposes Skills as plain folders that any compatible harness can load, as shown in the openai skills repo.

The net effect is that planning moves from being an ad‑hoc prompt to a first‑class, versioned artefact in the repo, which aligns agent behavior with how engineers already manage specs and design docs.

elvis

@omarsar0

·Follow

Skills is now officially supported in Codex. There is a neat built-in skill for planning. This is the best way to pull in the right context at the right time. Also, a great way to build highly specialized skills for your coding agents.

3:40 PM · Dec 20, 2025

227

Read 12 replies

OpenSkills CLI emerges as cross‑agent Skill installer writing into AGENTS.md

OpenSkills CLI (community): A small but pointed tool called OpenSkills is being used as a one‑line way to pull Skills into any coding agent: npm i -g openskills, openskills install SawyerHood/dev-browser, then openskills sync to wire the Skill into AGENTS.md, as shown in the openskills one-liner and openskills teaser. The author frames it as "the fastest way to use Skills" with any coding agent, rather than being tied to a single vendor.

Under the hood, OpenSkills acts as a sort of package manager for Skills: it fetches a GitHub‑hosted Skill (for example SawyerHood/dev-browser for high‑speed browser automation), unpacks the SKILL.md and associated resources, and then updates AGENTS.md so that whatever agent harness you use sees the new capabilities in its standard instruction surface, as shown in the openskills one-liner. Because it’s npm‑based and model‑agnostic, it lines up with the broader goal of turning Skills into reusable, shareable building blocks that span Codex, Claude Code, Gemini CLIs, Cursor, and other emerging harnesses rather than locking procedures to one IDE or provider.

Numman Ali

@nummanali

·Follow

Run the fastest browser use skill for ANY coding agent in three steps: 1. npm i -g openskills 2. openskills install SawyerHood/dev-browser 3. openskills sync (adds to AGENTS .md) The fastest way to use Skills is with OpenSkills

Sawyer Hood

@sawyerhood

Who will win? Multi-billion dollar ai research lab or man with a markdown file? My browser skill was over 2x faster at completing this task than Claude Code's new chrome integration!

Watch on X

8:45 PM · Dec 19, 2025

115

Read 5 replies

Practitioners position Skills and MCP as complementary layers in agent stacks

Skills vs MCP (ecosystem): Following skills adoption, which noted fast OSS uptake of the Agent Skills standard, today’s discussion is about architecture: engineers argue that Skills and MCP solve different problems and fit together rather than competing, as detailed in the skills mcp comment and skills scripts comment. The emerging pattern is MCP for tools and services (HTTP APIs, databases, browsers) and Skills for reusable procedures and guardrails that orchestrate those tools.

One thread from the DSPy community asks whether anyone is already using Agent Skills with GEPA to evolve clause files, pointing directly at the skill.md spec as a natural home for legal or policy procedures that control how agents apply optimizers or transformations, as per the gepa skill interest and agent skills doc. Another practitioner spells out that Skills "can also bundle scripts and resources"—for example, a Skill might include an MCP browser tool plus a shell script and a SQL template, with the SKILL.md describing when and how the agent should combine them, as noted in the skills scripts comment. Don Vito sums up the consensus as "Skills and MCPs have different purposes, but they complement each other", with MCP handling capability plumbing and Skills providing the human‑legible playbooks that tell agents how to use those capabilities safely and consistently, as shown in the skills mcp comment.

Melvin Vivas

@donvito

·Follow

Skills and MCPs have different purposes But they complement each other

3:35 PM · Dec 20, 2025

Read 1 reply

🧪 Evals beyond METR: safety, retrieval and code agents

Today’s eval artifacts focus on behavioral safety and agent memory/contexts, plus an emerging coding‑agent benchmark. Excludes METR long‑horizon graphs (feature).

Anthropic releases Bloom for automated behavioral misalignment evaluations

Bloom (Anthropic): Anthropic released Bloom, an open-source framework for generating behavioral misalignment eval suites by letting researchers specify a target behavior and auto-generating scenarios to measure elicitation rate and severity, with built-in integrations for Weights & Biases and Inspect for large-scale analysis as shown in the Bloom launch and research blog. Bloom ships with reference benchmarks across 16 models for behaviors like delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias, turning what used to be bespoke red-teaming into a reusable, metrics-driven safety harness.

Anthropic

@AnthropicAI

·Follow

We’re releasing Bloom, an open-source tool for generating behavioral misalignment evals for frontier AI models. Bloom lets researchers specify a behavior and then quantify its frequency and severity across automatically generated scenarios. Learn more: anthropic.com/research/bloom

5:04 PM · Dec 20, 2025

3.7K

Read 170 replies

Factory.ai debuts probe-based evaluation for agent context compression quality

Context compression evals (Factory.ai): Factory.ai outlined a probe-based evaluation suite that scores how well context compression and memory pipelines preserve information, defining four probe types—Recall, Artifact, Continuation, and Decision—and grading post-compression answers with GPT‑5.2 to reflect real agent workloads as detailed in the compression thread and evaluation blog. The rubric looks beyond raw token savings to whether an agent can still recall key facts, reconstruct which files or resources changed, plan next steps, and maintain reasoning continuity, offering a comparative framework for summarizers, vector stores, and custom memory systems.

Ray Fernando

@RayFernando1337

·Follow

Replying to @RayFernando1337

This is a masterclass in context engineering. factory.ai/news/evaluatin…

4:06 PM · Dec 20, 2025

Gemini 3 Flash clears Codeforces Div 1+2 via custom code-execution agent

Gemini 3 Flash Codeforces run (Google): A community setup used Gemini 3 Flash, wrapped in a custom agent with local code execution but no web search, to solve all problems in a recent Codeforces Div 1+2 contest, including an H2 that no human contestant finished during the live round according to the contest recap. The run reportedly completed in about 40 minutes end-to-end and relied on a simple execution-only harness whose code is now public, treating competitive programming as an informal long-horizon coding benchmark and illustrating how agent scaffolding can surface capabilities that standard static evals miss via the agent details and c problem log).

Chetaslua

@chetaslua

·Follow

🚨Gemini3 Flash reigns supreme, Codeforces reaches number one globally. Holy Shit read entire thread , i am shocked 🤯 User tried using Gemini 3 Flash to solve yesterday's Codeforces Div 1+2 challenges and actually got all the problems solved (I originally had an account for Show more

6:44 PM · Dec 20, 2025

246

Read 8 replies

CTO Bench launches as a "ground truth" benchmark for code agents

CTO Bench (independent): A new benchmark called CTO Bench launched on Product Hunt, framed as a “ground truth” evaluation suite for code agents tackling software engineering tasks instead of narrow, synthetic coding prompts as seen in the benchmark launch. The positioning suggests a focus on realistic SWE workflows—closer to RepoBench or SWE-Bench-style scenarios but aimed at full agent stacks—giving teams another datapoint to compare autonomous coding systems beyond standard model-centric leaderboards.

cto.new

@ctodotnew

·Follow

cto bench, the ground truth code agent benchmark, is live on product hunt producthunt.com/products/cto-n…

4:59 PM · Dec 20, 2025

Read 3 replies

Dair.ai stresses retrieval, not LLMs, as main culprit in multi-hop RAG failures

Multi-hop RAG failures (Dair.ai): Dair.ai argued that most multi-hop question-answering failures in RAG systems arise from the retrieval stack—weak multi-step recall, missing temporal filters, poor feedback and metrics—rather than the underlying LLM, shifting where teams should focus evaluation and debugging as outlined in the rag critique. Their recommendations emphasize improving synthetic training data for retrievers, adding structured user feedback, and explicitly tracking retrieval metrics on multi-hop chains so that evals distinguish between model reasoning limits and simple data-access gaps.

DAIR.AI

@dair_ai

·Follow

RAG systems struggle with multi-hop reasoning. In most cases, the problem isn't the LLMs. It's the retrieval system. Standard RAG treats each piece of evidence as equally reliable, ignoring how documents connect to each other. Why is this a problem? When questions require Show more

5:10 PM · Dec 20, 2025

279

Read 5 replies

🧬 New and upcoming models (gaming, agents, world video)

Notable model drops/early access: generalist gaming agents, interleaved coder updates, and ultra‑long video world models. Mostly research previews and sign‑ups today.

NVIDIA releases NitroGen, a generalist gaming agent trained from raw video

NitroGen (NVIDIA): NVIDIA has published NitroGen, a foundation model for generalist gaming agents that maps raw RGB gameplay frames directly to gamepad actions using a SigLIP2 vision encoder feeding a diffusion transformer policy, trained via large‑scale imitation learning on human gameplay videos rather than explicit reward functions as shown in the NitroGen launch and model card. The model card notes that NitroGen is optimized for gamepad‑friendly genres such as action, platformer, and racing titles, while performing less well on mouse‑keyboard‑heavy RTS and MOBA games, and is released primarily for research into next‑generation NPCs, automated playtesting and embodied agents rather than as a drop‑in commercial bot engine today per the model card.

@_akhaliq

·Follow

Nvidia released NitroGen A Foundation Model for Generalist Gaming Agents huggingface.co/nvidia/NitroGen

1:26 AM · Dec 21, 2025

302

Read 6 replies

MiniMax opens M2.1 early access with stronger subagents and design sense

M2.1 (MiniMax): MiniMax has opened an early‑access application for its M2.1 interleaved reasoning–coding model, framing it as a production‑grade engine for both "real, hard‑core engineering" and vibe‑driven creative coding, with particular emphasis on improved design sense, structural clarity, and visual consistency in outputs as shown in the early access form and design focus. Following up on M2.1 launch where the model was first positioned as a thinker‑coder hybrid, practitioners now report that a single high‑level task prompt can reliably cause M2.1 to spin up around five coordinated subagents that divide work and make progress in parallel, rather than running a single long chain as detailed in the subagent demo. Community commentary also highlights the model’s robustness under heavy workloads, with the team explicitly encouraging users "not to baby" M2.1 and to push it hard on complex projects via user sentiment.

• Subagent orchestration: Logs shared by early users show M2.1 instantiating multiple tool‑using subagents from one tactical instruction (for example, separate agents for planning, coding, refactoring, and testing) and then aggregating their results, which points to an architecture tuned for delegation and long‑horizon task management rather than purely single‑threaded CoT as shown in the subagent demo.

MiniMax (official)

@MiniMax_AI

·Follow

M2.1 early access👉: docs.google.com/forms/d/e/1FAI…

Ivan Fioravanti ᯅ

@ivanfioravanti

Access granted to MiniMax 2.1! It's playtime!

9:11 AM · Dec 20, 2025

121

Read 8 replies

LongVie 2 teases controllable ultra-long video world model

LongVie 2 (independent): A new demo reel showcases LongVie 2, described as a multimodal, controllable ultra‑long video world model that can synthesize extended, coherent video sequences from text prompts and conditioning, maintaining scene layout and motion continuity over long horizons rather than short clips via the LongVie demo. The montage cycles through diverse environments and camera moves while keeping object identity and dynamics stable, signalling work toward world‑model‑style video generation where agents could potentially plan and act over minutes of simulated time rather than a few seconds as detailed in the LongVie demo).

@_akhaliq

·Follow

LongVie 2 Multimodal Controllable Ultra-Long Video World Model

Watch on X

1:17 AM · Dec 21, 2025

Read 3 replies

⚡ Serving speed and day‑0 framework support

Runtime and framework updates that change latency/capability envelopes for media and vision models.

fal rolls out Flux 2 Flash, Turbo and Edit for sub‑second image serving

Flux 2 Flash & Turbo (fal): Inference host fal added timestep‑distilled Flux 2 Flash and Flux 2 Turbo models, advertised as its fastest image generators with sub‑1 second latency while matching or beating base Flux 2 quality as shown in the fal flux launch and flux flash card; an accompanying Flux 2 Turbo Edit workflow shows day‑0 support for fast style/conditioned edits across different lighting and seasons via the flux turbo edit and fal edit gallery.

• Speed and cost envelope: fal describes these as "timestep‑distilled" variants, implying fewer diffusion steps per sample and lower per‑image cost on their hosted GPUs while still targeting high‑end photographic work according to the fal flux launch.

• Editing variants: The Turbo/Flash Edit examples cycle the same villa scene across night, snow, and daylight with consistent structure, indicating that edit endpoints can reuse geometry and accelerate look changes over full re‑generation as detailed in the flux turbo edit. For engineers already integrated with fal’s APIs, these endpoints shift the latency–quality frontier for image and style transfers without needing to self‑host heavy Flux 2 weights.

fal

@fal

·Follow

🚨 Flux 2 Flash & Flux 2 Turbo are now live on fal! ⚡ Timestep-distilled Flux models - fastest in the world 🎨 Quality that matches or beats the base version 🚀 Sub-1 second generation

4:27 PM · Dec 20, 2025

176

Read 14 replies

vLLM‑Omni gains first‑class support for Qwen‑Image‑Layered’s RGBA outputs

Qwen‑Image‑Layered in vLLM‑Omni (vLLM): The vLLM‑Omni project merged a sizable pull request (2,189 added lines) to support Qwen‑Image‑Layered, Alibaba’s Photoshop‑style layered image model that emits physically isolated RGBA layers for editing as shown in the vllm omni pr and github pr; this moves the model from research demos into a high‑throughput, production‑grade serving stack.

• Layered vision in a general runtime: Qwen‑Image‑Layered can decompose prompts into 3–10+ controllable layers and even recurse into "layers within layers" for fine detail, as described in Alibaba’s rollout in the qwen layered overview; vLLM‑Omni’s integration means those multi‑layer outputs can now ride the same optimized scheduler, caching and batching machinery used for text and standard diffusion.

• Ecosystem alignment: The addition lands alongside day‑0 support in ComfyUI graphs via the qwen layered overview and preset workflow, so the same model can now be driven from node‑based UIs for artists or from vLLM‑backed APIs for programmatic pipelines without bespoke serving code. For infra teams standardizing on vLLM‑Omni, this turns layered compositing from a niche demo into something that can be exposed as a normal endpoint with familiar scaling and observability characteristics.

vLLM

@vllm_project

·Follow

🎉Thanks to the community’s contributions and our collaboration with Qwen-Image team, Qwen-Image-Layered is now supported in vLLM-Omni! Check it out! github.com/vllm-project/v…

Qwen

@Alibaba_Qwen

🎨 Qwen-Image-Layered is LIVE — native image decomposition, fully open-sourced! ✨ Why it stands out ✅ Photoshop-grade layering Physically isolated RGBA layers with true native editability ✅ Prompt-controlled structure Explicitly specify 3–10 layers — from coarse layouts to

Watch on X

6:21 PM · Dec 20, 2025

Read 1 reply

ComfyUI adds day‑0 Qwen‑Image‑Layered nodes plus presets for layer control

Qwen‑Image‑Layered in ComfyUI (ComfyUI): Workflow tool ComfyUI now supports Qwen‑Image‑Layered "on day 0", exposing nodes that treat each RGBA layer as a first‑class stream and adding a community preset for managing and toggling layers in complex graphs according to the comfyui support note and preset workflow; this gives artists a node‑based way to exploit the model’s prompt‑controlled structure and infinite decomposition features described by Alibaba in the comfyui support note.

• UI‑level layer management: The shared preset focuses on "real layers management", wiring nodes so users can enable/disable specific objects, backgrounds, or overlays without manually re‑wiring RGBA channels every time as detailed in the preset workflow.

• Shared model semantics: Because the same Qwen‑Image‑Layered weights and semantics also appear in HF/ModelScope demos and vLLM‑Omni as shown in the comfyui support note and vllm omni pr, the ComfyUI graphs effectively become reference blueprints for how to drive layered generation and edits in other runtimes. For teams prototyping layered image tools, this reduces the time from model card to working interactive editor, and it aligns UI conventions with how serving frameworks like vLLM‑Omni will expect to consume and emit layer data.

Qwen

@Alibaba_Qwen

·Follow

ComfyUI supports Qwen Image layered on day0! Big thanks to @jtydhr88

jtydhr88

@jtydhr88

ComfyUI supports Qwen Image layered on day0, it is cool, then…we need some real layers management to control them, try my previous custom plugin ComfyUI-PolotnoCanvasEditor #ComfyUI

Watch on X

11:54 PM · Dec 20, 2025

444

Read 8 replies

🎬 Creator stacks: Sora Characters, Cinema Studio and 3D

Large cluster of consumer‑facing creative tools: character‑consistent video, cinematic pipelines, and 3D asset workflows. Strong volume today, separate from inference/runtime.

Sora Character goes global on InVideo with 7 days of unlimited use

Sora Character (OpenAI / InVideo): OpenAI’s Sora Character is now live inside InVideo’s browser editor worldwide, with generations costing no credits and being unlimited for 7 days as shown in the launch walk‑throughs via the invideo promo and free period note; creators mint a single face‑based character once and reuse the same identity, outfit and voice consistently across talking heads, cinematic shots and vlogs without re‑uploading photos, detailed in the sora consistency demo and one persona workflow.

Why this matters for creators: writers can now treat Sora Characters as reusable "AI actors" for multi‑episode series, skits, ads or explainers since the character can "jump across worlds, scenes, styles and lines while still reading as the same person" in the InVideo UI as explained in the character reuse description; because the character profile is saved under Agents → Models → Characters and wired into Sora’s voice cloning, new clips no longer require another selfie or voice upload, which removes a major friction point for episodic projects as noted in the setup steps and agents and models path. The stack is fully web‑based with no invite codes, accessed directly from InVideo’s Agents and Models page where Sora Character is labeled as an exclusive integration according to the invideo promo.

ChatGPT adds Sora-powered 🎁 holiday video app that uses Memory

Holiday video app (OpenAI / ChatGPT): ChatGPT now ships a seasonal Sora‑powered video app triggered by sending a 🎁 emoji, which calls an internal app named "Connector OpenAI Santa" to turn a selfie into a short cartoon holiday clip as shown in the connector santa screenshot and gift teaser; users report that ChatGPT Memory is used to personalize details in the video—such as stockings labeled with their dogs’ names—so the scenes reflect prior chats, not just the prompt text according to the memory personalization and dinosaur callback.

Experience details: the in‑chat UI surfaces a "Happy holidays – star in your own holiday video" card with buttons to choose or take a selfie and standard tone/style selectors before continuing as detailed in the connector santa screenshot; several accounts describe the feature as a lightweight way to introduce Sora to mainstream users who might never install the standalone app or InVideo integration, because it appears inline in regular ChatGPT conversations and feels like a small surprise rather than a separate product per the memory personalization and user follow‑up. This holiday flow also shows how OpenAI is starting to wire richer, stateful apps into ChatGPT Apps behind simple emoji triggers, blending chat, memory and video generation in a single gesture according to the apps context.

Higgsfield Cinema Studio becomes a template for indie AI film pipelines

Cinema Studio (Higgsfield): New walkthroughs show Higgsfield AI Cinema Studio being used to build short films from a handful of stills, with camera body, lens and focal length choices feeding into a still‑to‑video pipeline that ends in traditional editing with speed ramps for smoother transitions as shown in the cinema studio overview and editing tip; following the earlier launch that emphasized pro camera presets and motion control, creators now document a repeatable pattern for "year of indie AI films" built almost entirely from prompts and 9–10 images via the cinema studio overview and editing tip.

• Stills and scenes: one workflow starts by picking a virtual camera and focal length, then generating character stills with a reference photo and detailed prompt before iterating additional frames that preserve scene, style and outfit consistency as shown in the stills generation prompt and frame consistency note.

• Animation and motion: stills are then animated by specifying initial and final frames plus an action‑focused prompt, letting Cinema Studio handle camera moves and motion in between while the user focuses on story and staging as detailed in the animation from stills and dialogue and action.

• Final edit: exported clips are assembled in a standard NLE where speed ramps and cuts are used to hide model artifacts and stitch scenes into a cohesive short, which is how the now‑viral Santa video was reportedly finished, detailed in the editing tip and santa video context.

The threads frame Cinema Studio less as a one‑click generator and more as a new kind of pre‑vis and layout tool that pairs well with familiar editing craft rather than replacing it outright, as shown in the cinema studio overview.

Tripo v3 pipelines turn 2D concepts into textured, rigged 3D for animation

Tripo v3 (Tripo): Workflow threads demonstrate Tripo v3 taking a single 2D concept image into a complete 3D asset—mesh, textures and rig—ready for Blender layout and AI video animation, with Ultra mode and ~50k polygons suggested for character work as shown in the tripo workflow explainer and tripo step guide; one example starts from a Nano Banana Pro‑generated soldier image and ends with a fully modeled character, rifle and ATV as FBX files according to the nano to 3d example.

• Generation and retopology: users upload a 2D image to Tripo Studio, pick a quality preset like Ultra, and let the system infer geometry plus do smart retopology so the result is animation‑friendly rather than a dense scan as shown in the tripo workflow explainer and tripo step guide.

• AI texturing and styles: Tripo’s texture tab uses the source image as reference to synthesize materials and can be run in an enhanced mode for more faithful, high‑detail surfaces, including stylized options when needed as explained in the tripo step guide.

• Rigging and export: an auto‑rig feature builds a skeleton that can be tweaked in Blender, after which assets are exported as FBX for scene assembly, posing and rendering before being passed to video models like Kling 2.5 via start/end‑frame animation prompts detailed in the tripo step guide.

These examples position Tripo as the glue between 2D concept art—often from models like Nano Banana Pro—and downstream cinematic stacks, turning "good prompt and one image" into reusable 3D libraries without traditional modeling skills as shown in the nano to 3d example and tripo workflow explainer.

Veo 3.1 shows precise FPV camera control in winter village demo

Veo 3.1 (Google): A shared Veo‑3.1 prompt and clip highlight how detailed camera language now maps to output, with an FPV drone shot threading through a dark Christmas village, diving past roofs and orbiting a central tree while preserving stable rotation and depth‑of‑field cues as shown in the veo fpv example and veo prompt share.

The long natural‑language prompt specifies shot type ("extremely fast paced cinematic FPV"), path (weaving between pine trees, cutting low between houses, sharp dives and aggressive turns), lighting (cold blue shadows plus warm window light) and a final "smooth cinematic orbit" around a massive lit tree, and the resulting video appears to respect most of those constraints including camera stabilization and speed ramps, as shown in the veo fpv example. For teams exploring text‑to‑video, this serves as an example of treating prompts as full shot lists rather than style tags, especially when asking for complex motion like bullet‑time‑style orbits and multi‑phase camera moves as detailed in the veo fpv example.

Nano Banana Pro powers fashion editorials and 90s nostalgia shoots

Nano Banana Pro (Google): Creators are using Nano Banana Pro for high‑end fashion concepts and nostalgic photo sets, with one thread showing a convincing fake Skims x Versace campaign and another recreating 1990s bedroom scenes full of Barbies, VHS tapes and CRT TVs as shown in the fashion collab spread and nostalgia photo set.

• Fashion lookbooks: the Skims x Versace series mixes nude loungewear silhouettes with baroque gold harnesses and Medusa motifs, lit and composed like a luxury campaign; the author notes it was fully "generated with nano banana pro" and reads as a plausible co‑branded editorial detailed in the fashion collab spread.

• Retro interiors: a separate "NOSTALGIA" set shows kids playing with Barbie campers, styling heads and Gargoyles on CRTs, all bathed in period‑appropriate lighting and clutter, with the artist crediting Nano Banana Pro plus Magnific upscaling for the result as shown in the nostalgia photo set.

These posts align with earlier reports of Nano Banana Pro performing well in slide decks and UI assets, and now illustrate its use as a general‑purpose art director for both aspirational fashion and carefully staged retro scenes as detailed in the fashion collab spread and nostalgia photo set.

Side‑by‑side grid compares DALL·E 3, Midjourney, FLUX.1 Pro and a real portrait

Portrait grid (community): A 2×2 framed portrait grid puts DALL·E 3, Midjourney, FLUX.1 Pro and a real photograph side by side as wall art, labeled under each frame, to show how close current image models come to a consistent human identity as shown in the portrait grid post.

The DALL·E 3 panel leans toward softer, stylized lighting; Midjourney’s version appears darker and more painterly; FLUX.1 Pro’s render sits closer to the real photo with subtle texture and neutral background, while the bottom‑right real shot anchors the comparison as detailed in the portrait grid post. Although this is an anecdotal visual, not a benchmark, it captures how creators are informally stress‑testing identity fidelity and style bias across models, especially for use cases like Sora Characters, marketing portraits and character‑consistent comics where "which one feels like the person" is the deciding question according to the portrait grid post.

OpenAI’s generateImage function now supports editing existing images

generateImage edit support (OpenAI): A short update notes that the generateImage function now accepts existing images as inputs for editing, rather than only generating from scratch, which expands its role from pure creation to iterative refinement in code‑driven pipelines as shown in the generateimage edit note.

There are no extra details yet on parameters or masks, but for teams already wiring generateImage into TypeScript/JavaScript backends, this implies they can route user‑uploaded or previously generated assets back through the same API for localized changes, style shifts or layout tweaks without switching tools as detailed in the generateimage edit note. This small change nudges the function toward parity with GUI tools where edit‑in‑place has been standard, and it may become a key building block in custom Sora/Canvas‑like editors that want image and video editing under a single programmatic surface according to the generateimage edit note.

🏗️ AI datacenters and power signals

Infra posture impacting model training/inference: mega‑campuses and power mix shifts. Mostly directional signals today.

China adds 500+ TWh generation and ~50 TWh nuclear while cutting coal

China power buildout (China): New EMBER data shows China increased electricity generation by a little over 500 TWh in the last 12 months versus the prior year, including roughly 350 TWh from solar, ~130 TWh from wind, ~40 TWh from hydro, and about 45–50 TWh from nuclear, while coal output fell by ~85 TWh and gas also dipped, compared with only ~100 TWh total growth in the US as shown in the EMBER data thread. This deepens the picture from china power, where the earlier chart showed China doubling capacity over eight years; the new view highlights how much of the marginal growth feeding AI‑era demand is now low‑carbon and especially how fast nuclear baseload is being added.

For AI leaders, the mix shift—large solar/wind plus consistent nuclear, alongside reduced coal and gas—signals where reliable, politically supported power for future training clusters is most likely to be found in the near term.

Amazon’s latest data center complex looks like a small city

Amazon data center campus (Amazon): A drone flyover of a new Amazon data center complex shows dozens of massive, uniform buildings spread over a vast site—prompting comparisons to Dario Amodei’s line that future superintelligence would look like “a city of data centers” as detailed in the amazon campus comment. For engineers and infra planners, this illustrates the real physical scale behind frontier training and inference clusters, with obvious knock‑ons for grid connections, water use, and siting decisions.

The footage underlines how hyperscalers are moving from single facilities to multi‑building campuses that function as a single AI compute asset rather than isolated data halls.

🛡️ Platform policy: scraping and AI media controls

Legal and product guardrails with direct AI implications—web extraction norms and AI media consent. (Safety eval tooling covered under Evals.)

Google sues SerpApi over large‑scale scraping of Search and licensed content

SerpApi scraping lawsuit (Google): Google’s general counsel says the company has filed suit against SerpApi, alleging it bypassed technical protections as shown in cloaking, rotating bot identities, and large proxy networks to scrape Google Search results and licensed content at scale, then resold that data, including images and real‑time feeds via the Google blog post. The complaint also claims SerpApi overrides website owners’ access choices, as detailed in robots and similar directives, and unlawfully republishes copyrighted and licensed material, which Google frames as harming publishers and rightsholders rather than being "research" or "fair use" scraping.

• AI and agent impact: The blog explicitly ties SerpApi’s activity to automated tools that repackage scraped Search output, signaling that large‑scale aggregation APIs feeding LLM agents may face stronger legal and technical pushback going forward, per the Google blog post. Google positions the case as part of a broader effort to curb "malicious scraping" and protect the economic value of Search and partner content, rather than a narrow dispute over one vendor.

Meta adds opt‑in selfie and voice controls for AI images and video

Add‑yourself controls (Meta AI): Meta is rolling out a "Add yourself to images and videos" flow that lets users upload a selfie (and optionally voice) and then choose who is allowed to generate AI images or videos featuring them—ranging from "Only me" through tiered follower options to "Everyone" as detailed in feature screenshots. The settings panel exposes a "Media that includes me" section where people can review or delete drafts and generated media that depict them, alongside actions to retake the reference photo, add a voice sample, or delete enrolled data.

• Policy framing: The disclosure screen notes that interactions with AIs, including photos and voice recordings, may be used to improve Meta’s AI systems, and references a "right to object" link, indicating Meta is trying to formalize consent and contestability for likeness use rather than treating AI avatars as generic stock characters via feature screenshots. This marks a concrete product‑level mechanism for controlling inclusion in AI‑generated media, sitting between default opt‑out policies and fully open remix of public images.

🧠 Reasoning recipes: RL, mid‑training and multi‑agent optimizers

Fresh research on when RL helps, why mid‑training matters, and how to coordinate multiple agents efficiently.

CMU maps when RL, mid-training and process rewards really boost reasoning LMs

Interplay of training phases (CMU): Carnegie Mellon researchers released a detailed study on how pre‑training, mid‑training, and RL each contribute to reasoning performance in language models, arguing that RL only helps in a narrow regime near the model’s current limits and that mid‑training plus process‑aware rewards often beats pure RL for a fixed compute budget as shown in the CMU summary and ArXiv paper.

• RL’s “sweet spot”: RL drives real gains only when tasks are hard enough that the model often fails but can sometimes succeed; when tasks are too easy or too far out-of-distribution, RL adds little or fails to learn as shown in the CMU summary.

• Need for light pre‑training exposure: the authors find that ~1% pre‑training exposure to a task family is enough for RL to generalize its reasoning to new contexts, while zero exposure leaves RL struggling to transfer skills as shown in the CMU summary.

• Mid‑training vs. pure RL: a structured mid‑training phase (on curated long‑tail or hard data) followed by RL yields larger improvements than spending the same compute purely on RL, especially on "nearby" task variants according to the compute split.

• Process‑aware rewards: step‑level or trace‑level rewards for intermediate reasoning reduce reward hacking and produce more faithful chains of thought than rewarding only final answers, and this holds across multiple difficulty regimes per the process rewards.

The study reads like a practical recipe book for frontier model teams, turning fuzzy intuitions about when RL helps into concrete guidance on data selection, phase ordering, and reward design as detailed in the ArXiv paper.

Ksenia_TuringPost

@TheTuringPost

·Follow

Interestingly, it was still hard to tell when AI models gain better reasoning – during pre-training, mid-training, or RL. Researchers at @CarnegieMellon found that each of them plays distinct roles: - RL truly improves reasoning only in specific conditions - Generalizing across Show more

1:41 AM · Dec 21, 2025

218

Read 6 replies

DocETL optimizer shows how to coordinate multi-agent query rewrites without collisions

Multi‑agent optimization (DocETL): The DocETL team outlined how they coordinate multiple LLM agents that rewrite and test expensive queries in parallel, moving from naive optimistic concurrency to a design that partitions the action space into mutually exclusive "rewrite classes" so agents can work concurrently without stepping on each other as detailed in the optimizer design and optimizer preprint.

• From optimistic control to locks: early versions let agents propose rewrites freely and aborted when duplicates appeared, but because agents tend to be myopic and propose similar ideas, this led to many aborts; a follow‑up design serialized hypothesis generation with locks while parallelizing expensive testing, which improved but still underused available parallelism as shown in the optimizer design.

• Semantic partitioning of tasks: the current approach divides rewrites into disjoint classes like "cost‑increasing for accuracy" (e.g., switch to a more expensive model) vs. "cost‑decreasing at fixed accuracy" (e.g., prune documents via embeddings), allowing separate agents to hypothesize in each class in parallel while a pool of testers runs all candidate queries concurrently as shown in the optimizer design.

• CRDT‑like intuition: the authors liken this to CRDTs for tasks—by designing roles and rewrite classes carefully, they achieve conflict‑free parallelism at the task level rather than the data level, which keeps the optimizer both efficient and easier to reason about as shown in the optimizer design.

The description offers a concrete pattern for anyone building multi‑agent optimizers: push creativity into separate, non‑overlapping action classes, and concentrate parallelism in the slow evaluation phase instead of letting agents thrash over the same idea space according to the optimizer preprint.

Shreya Shankar

@sh_reya

·Follow

Multi-agent coordination is a fun problem. We have this problem in the DocETL optimizer, where multiple agents are rewriting queries (in parallel) in our search for the most accurate & efficient query. Every task (ie hypothesizing and testing a rewrite) is very time consuming. Show more

tldraw

@tldraw

There were many issues with this. What if two agents start working on the same task? Even if agents work on different tasks, how do they work together towards an overall goal? And in doing this, how do they coordinate between different parts of the overall goal?

2:55 AM · Dec 21, 2025

Read 4 replies

GRL and Google turn Sokoban and Tetris into verifiable RL substrates for LLMs

Game‑based RL substrates (GRL x Google): GRL is collaborating with Google to use Sokoban, Tetris and similar games as verifiable environments for post‑training reasoning models, compiling the lightweight Tunix RL library onto TPUs via Google’s TPU Research Cloud so large‑scale runs can use machine‑checkable rewards rather than subjective human labels as described in the collab summary and google blog post.

• Verifiable data: the project treats games as rich yet fully observable tasks where success can be checked automatically (noisy real‑world labels replaced by exact win/loss or score conditions), making them attractive for RL‑style post‑training on reasoning policies according to the collab summary.

• Scaling infrastructure: Tunix is positioned as a hackable RL stack that compiles cleanly to TPUs, with Google TRC providing the compute so researchers can scale up experiments on many environment instances in parallel (Tunix repo).

For RL researchers and model‑training teams, this work points toward a supply of cheap, high‑fidelity, automatically‑graded trajectories to train and benchmark reasoning‑heavy agents at scale.

Hao AI Lab

@haoailab

·Follow

[GRL x Google] 🚀 Question: How do we enrich environments with verifiable reward for LLM post-training? We are excited to collaborate with Google to transform abundant game environments into machine-checkable reasoning substrates using the Tunix + JAX ecosystem on TPUs. Key Show more

8:44 PM · Dec 20, 2025

Read 2 replies

🗣️ Community outlook: timelines, automation and “code red”s

Macro discourse relevant to leaders/analysts: 2026 automation expectations, AGI timelines, and competitive ‘code red’ sprints.

Community treats METR’s exponential agent horizons as 2026 workday forecast

Agent horizons (METR): Builders are latching onto METR’s new charts showing autonomous coding task horizons doubling every ~4 months since 2024—down from ~7‑month doublings over 2019–2024—and projecting that agents could reliably complete a full 8‑hour human workday by April 2026, two days by mid‑year, and half a week by the end of 2026 as shown in the acceleration thread and the sevenfold 2025 gains.

Posters dub this the “world’s most important graph” and argue that leaders still model AI as linear progress even as the trendline (with R²≈0.98) suggests compounding capability growth that many feel they “haven’t internalized” yet as indicated by the five hour horizon chart and the cautionary reminder.

Sam Altman details OpenAI’s 6–8 week “code red” sprints triggered by rivals

Code red cycles (OpenAI): Sam Altman describes how OpenAI periodically enters short “code red” periods lasting roughly 6–8 weeks when a rival like DeepSeek or Gemini 3 is perceived as a serious threat, during which they act early and treat the situation like an incoming pandemic rather than waiting for obvious damage as detailed in the Altman code red summary. He calls these windows low‑stakes but “worth the paranoia,” notes that one was triggered earlier this year by DeepSeek and another by Gemini 3, and says “we won’t be in this code red much longer,” which commenters interpret as a signal that OpenAI has rapid response playbooks for competitive shocks as documented in the Altman code red summary and Gemini 3 mention.

Commentators frame 2026 as the year of long-running, test-driven AI agents

Long‑running agents (ecosystem): Investors and builders like Astasia Myers describe 2026 as the year when the key question shifts from “how smart is the model?” to “how long can the agent run?”, expecting multi‑hour and multi‑day autonomous loops to become a standard design target for serious systems according to the year of agents claim.

• From vibe coding to vibe engineering: Practitioner threads describe moving from ad‑hoc “vibe coding” to “vibe engineering,” using specs and agentic end‑to‑end tests (e.g., Claude Code SDK running full user journeys against brownfield codebases) as the primary interface to development rather than one‑off prompts via the agentic e2e demo.

• Agents optimizing their own use: New onboarding flows like Yutori’s Scouts try to offload meta‑reasoning about “what to scout” to LLMs themselves, reinforcing the idea that agents will increasingly select tasks, tools and sub‑agents on users’ behalf rather than waiting for explicit instructions as shown in the Scout onboarding demo.

Community splits on whether AGI is already here or still missing key pieces

AGI status and inevitability debate: Some posts argue that models already meet many older AGI definitions yet keep getting held to new tests, noting that benchmarks are repeatedly revised upward as they saturate and that memory plus the ability to distinguish true understanding from confident guessing may be the last major missing ingredients according to the jagged frontier comment and 2016 AGI reflection. Others emphasize how subjective this is—“one man’s slop is another man’s AGI”—and claim that “there are AGIs everywhere… for those with eyes to see,” while acknowledging that people’s personal expectations shape their judgments as much as any metric as detailed in the subjective AI view and AGIs everywhere claim. A contrasting thread dismisses “AI bubble” narratives as absurd, asserts that frontier labs will likely solve continual learning within two years, and treats artificial superintelligence as effectively inevitable once that happens, arguing that benchmark creators are already exhausted from constantly moving the goalposts as indicated by the ASI inevitability post and not bubble reminder.

Shane Legg and Ray Kurzweil’s old AGI forecasts resurface in 2026–2028 debate

AGI timelines 2025–2028 (Legg, Kurzweil): Timeline chatter resurfaced as people recirculated DeepMind co‑founder Shane Legg’s 2009 blog post, where his modal estimate put roughly human‑level AGI around 2025 with an expected value closer to 2028, and juxtaposed it with Ray Kurzweil’s late‑1990s predictions that also cluster in the late 2020s via the Legg timeline recap and Kurzweil prediction screenshot. Commenters note that “AGI by 2028 was predicted back in 2009” and debate whether these forecasts are “aging well” or not, pointing out that both capabilities and definitions have shifted and that skeptics are likely to deny that AGI has arrived even if systems meet older criteria as shown in the Legg thread and AGI 2026 comment.

Sholto Douglas: the Claude Code experience is coming for all knowledge work in 2026

Claude Code experience for all work (Sholto Douglas): Anthropic’s Sholto Douglas predicts that next year other forms of knowledge work will feel what software engineers did in 2025—going from typing most of their own lines at the start of the year to “barely any of them” by the end as agents take over execution according to the Sholto prediction. He also expects continual learning to be “solved in a satisfying way,” first test deployments of home robots, and software engineering itself to “go utterly wild,” framing 2026 as a step‑change in day‑to‑day automation rather than a distant AGI milestone per the Sholto prediction.

Yoshua Bengio warns AI won’t create enough new jobs to offset cognitive automation

Job creation shortfall (Yoshua Bengio): Yoshua Bengio argues that current AI trajectories appear unlikely to generate enough new roles to balance the cognitive work they will automate, sketching a future where a “few engineers” capture very high salaries while “a vast number of workers” face displacement as models master more knowledge tasks per the Bengio interview. He crystallizes the concern with the question “if we automate most of the cognitive work, what’s gonna be left?”, a line that is being widely quoted in discussions of white‑collar labor and long‑run social contracts via the Bengio interview.

“Jobpocalypse” threads tie rising Gen Z unemployment to early AI hiring shifts

Gen Z jobpocalypse framing: Commentary around new US labor data highlights that unemployment among recent graduates has climbed to 5.8%—the highest since 2013 outside the pandemic—and explicitly links this to companies rethinking entry‑level hiring as AI tools take over more junior‑level tasks as indicated by the Gen Z unemployment post. A Fortune piece cited in these threads calls the environment a “jobpocalypse” for young workers and relays a Goldman Sachs executive’s blunt advice that Gen Z must know “what you bring to the table” and the commercial impact of their human skills when tools like ChatGPT can already outperform many baseline functions via the Fortune article.

🦾 Humanoids and field robots move into public demos

A parade of real‑world robot showcases from stage to emergency response; notable for embodied AI engineers and integrators.

Unitree G1 humanoids draw global attention with concert flips and Elon retweet

Unitree G1 (Unitree): The Wang Leehom concert performance keeps snowballing—new fan-shot angles, mainstream coverage, and an "Impressive" quote‑tweet from Elon Musk have turned the Chengdu stage demo into a global reference clip for real‑time humanoid control as shown in the concert robots, flip performance , and musk retweet. The robots execute Webster flips and tightly synchronized choreography alongside human dancers under concert lighting, with multiple crowd videos confirming repeatable routines across songs as detailed in the concert side shot, audience angle , and media followup. Elon’s amplification plus Futurism and the artist’s own site picking up the story shift the demo from robotics Twitter to a wider tech and consumer audience, framing Unitree’s G1 as a near‑product‑ready entertainment and promo platform rather than a one‑off lab stunt.

Rohan Paul

@rohanpaul_ai

·Follow

Robots in China are doing it all now, even dancing on stage like pros. Here Unitree robots doing Webster flips and are performing at Chinese-American singer Wang Leehom’s concert in Chengdu.

Watch on X

4:38 PM · Dec 19, 2025

51.4K

Read 2.6K replies

China parades transforming military robots: spiders, missile dogs and modular snakes

Transforming military robots (China): A military parade clip highlights an ecosystem of transforming ground robots including wheeled–flying–amphibious "spider" platforms, missile‑armed robot dogs, and modular snake robots that can both swim and burrow as shown in the military parade. The video shows coordinated formations on paved grounds, transitions between drive and walk modes, and snake segments reconfigured into different gaits, signalling active prototyping of multi‑terrain, weaponizable robotic platforms for reconnaissance and potentially offensive roles.

Rohan Paul

@rohanpaul_ai

·Follow

🇨🇳 A parade of transforming military robots from China. multi-terrain spiders (wheeled, flying, amphibious), missile-armed robot dogs, and modular, all-terrain snakes that swim and burrow.

Watch on X

4:20 PM · Dec 20, 2025

556

Read 41 replies

China tests 200 kg‑payload firefighting robot dogs for hazardous response

Firefighting robot dogs (China): New footage from Sichuan shows quadruped "firefighting robot dogs" hauling hoses, streaming video, and sampling toxic gases in rubble‑strewn environments, with specs calling out ~200 kg load capacity and onboard temperature/gas sensing as shown in the firefighting demo. The units traverse stairs, debris and industrial interiors while dragging charged hoses, then pivot to survey mode with camera feeds and overlayed telemetry, positioning them as remote first‑entry tools for spaces that are unsafe or unreachable for human crews.

Rohan Paul

@rohanpaul_ai

·Follow

In Sichuan, China, new footage shows firefighting robots dogs being tested. They’re meant to move through places unsafe or unreachable for humans, haul hoses, send back real-time video, and record data on poisonous gases and temperature. carries ~200KG

Watch on X

3:20 PM · Dec 20, 2025

916

Read 36 replies

Nio’s Aru spider robot targets industrial inspection and maintenance jobs

Aru (Nio Robotics): French startup Nio Robotics is promoting Aru, a spider‑inspired inspection and maintenance robot, in new demo footage where the platform crawls over industrial structures as a proposed alternative to human workers in hazardous spaces as detailed in the aru demo and aru quote. Aru’s multi‑leg chassis and articulated body are pitched for tasks like visual checks, thermal imaging, and corrosion or leak inspection around pipes and tanks, with the company site describing a polymorphic, sensor‑modular design for sectors such as energy and heavy industry as detailed in the company page. The messaging leans heavily on nature‑derived locomotion and a "future is now" framing, positioning Aru alongside legged robots and drones as part of the next wave of industrial field automation.

Chubby♨️

@kimmonismus

·Follow

Aru Robot, manufactured by the French company Nio Robotics to do industrial inspection and maintenance. For robotics, it makes perfect sense to look to nature and the animal kingdom for inspiration. Aru resembles a spider. Futuristic and straight out of a sci-fi film: the Show more

Watch on X

5:26 PM · Dec 20, 2025

161

Read 14 replies

vLLM 0.13.0 doubles diffusion throughput – wires 309B MiMo-V2-Flash

# 106 · Sun, Dec 21, 2025

OpenAI Codex adopts Agent Skills – 99–100% usage limits reset

# 104 · Fri, Dec 19, 2025

METR long-horizon agent evals 7× in 2025 – Opus hits 4h49m

Executive Summary

Top links today

Feature: Long‑horizon agent evals (METR) go mainstream

Table of Contents

📈 Feature: Long‑horizon agent evals (METR) go mainstream

METR charts show autonomous coding horizons now doubling every ~4 months

Community dissects METR’s 50% vs 80% horizons for Opus 4.5 and GPT‑5.1‑Codex‑Max

Builders expect Gemini 3 Pro to break METR’s 5‑hour coding horizon

🛠️ Coding agents, CLIs and developer workflows

Amp Free moves from 24‑hour cap to hourly rolling credits

LangChain Deep Agents gain concrete deployment playbooks on Runloop and Bedrock

Agentic Coding Flywheel wizard standardizes multi-agent dev server setup

Claude Code 2.0.75 loosens prompt formatting for tool calls

Git hook recipe guards against destructive Claude Code git commands

Steipete’s CLI ecosystem tilts toward agent-friendly text pipelines

summarize CLI adds markitdown fallback and smarter preprocessing

Warp terminal adds shared env var groups synced across teams

ordercli brings Deliveroo and Foodora order tracking into the CLI

Yutori Scouts adds LLM-based onboarding with suggested starter agents

🧩 Agent Skills standardization and MCP plumbing

Skills head into ChatGPT via "hazelnuts" slash commands and in‑app editor

Agent Skills coalesce around SKILL.md markdown spec and AGENTS.md extension

Codex ships built‑in planning Skill that turns long chats into saved project plans

OpenSkills CLI emerges as cross‑agent Skill installer writing into AGENTS.md

Practitioners position Skills and MCP as complementary layers in agent stacks

🧪 Evals beyond METR: safety, retrieval and code agents

Anthropic releases Bloom for automated behavioral misalignment evaluations

Factory.ai debuts probe-based evaluation for agent context compression quality

Gemini 3 Flash clears Codeforces Div 1+2 via custom code-execution agent

CTO Bench launches as a "ground truth" benchmark for code agents

Dair.ai stresses retrieval, not LLMs, as main culprit in multi-hop RAG failures

🧬 New and upcoming models (gaming, agents, world video)

NVIDIA releases NitroGen, a generalist gaming agent trained from raw video

MiniMax opens M2.1 early access with stronger subagents and design sense

LongVie 2 teases controllable ultra-long video world model

⚡ Serving speed and day‑0 framework support

fal rolls out Flux 2 Flash, Turbo and Edit for sub‑second image serving

vLLM‑Omni gains first‑class support for Qwen‑Image‑Layered’s RGBA outputs

ComfyUI adds day‑0 Qwen‑Image‑Layered nodes plus presets for layer control

🎬 Creator stacks: Sora Characters, Cinema Studio and 3D

Sora Character goes global on InVideo with 7 days of unlimited use

ChatGPT adds Sora-powered 🎁 holiday video app that uses Memory

Higgsfield Cinema Studio becomes a template for indie AI film pipelines

Tripo v3 pipelines turn 2D concepts into textured, rigged 3D for animation

Veo 3.1 shows precise FPV camera control in winter village demo

Nano Banana Pro powers fashion editorials and 90s nostalgia shoots

Side‑by‑side grid compares DALL·E 3, Midjourney, FLUX.1 Pro and a real portrait

OpenAI’s generateImage function now supports editing existing images

🏗️ AI datacenters and power signals

China adds 500+ TWh generation and ~50 TWh nuclear while cutting coal

Amazon’s latest data center complex looks like a small city

🛡️ Platform policy: scraping and AI media controls

Google sues SerpApi over large‑scale scraping of Search and licensed content

Meta adds opt‑in selfie and voice controls for AI images and video

🧠 Reasoning recipes: RL, mid‑training and multi‑agent optimizers

CMU maps when RL, mid-training and process rewards really boost reasoning LMs

DocETL optimizer shows how to coordinate multi-agent query rewrites without collisions

GRL and Google turn Sokoban and Tetris into verifiable RL substrates for LLMs

🗣️ Community outlook: timelines, automation and “code red”s

Community treats METR’s exponential agent horizons as 2026 workday forecast

Sam Altman details OpenAI’s 6–8 week “code red” sprints triggered by rivals

Commentators frame 2026 as the year of long-running, test-driven AI agents

Community splits on whether AGI is already here or still missing key pieces

Shane Legg and Ray Kurzweil’s old AGI forecasts resurface in 2026–2028 debate

Sholto Douglas: the Claude Code experience is coming for all knowledge work in 2026

Yoshua Bengio warns AI won’t create enough new jobs to offset cognitive automation

“Jobpocalypse” threads tie rising Gen Z unemployment to early AI hiring shifts

🦾 Humanoids and field robots move into public demos

Unitree G1 humanoids draw global attention with concert flips and Elon retweet

China parades transforming military robots: spiders, missile dogs and modular snakes

China tests 200 kg‑payload firefighting robot dogs for hazardous response

Nio’s Aru spider robot targets industrial inspection and maintenance jobs

On this page