Zhipu GLM‑4.7 hits 73.8% SWE‑bench – 4–7× cheaper coding SOTA

Conductor 0.28.0 is out! Introducing a new workspaces page, context meter, interactive planning, and keyboard navigation. Workspace page Use the workspaces page to view a history of all of your workspaces. Open existing or un-archive old workspaces. Filter by repo, branch, Show more

10:14 PM · Dec 22, 2025

190

Read 11 replies

Codex experimental background terminals tackle long-running CLI tasks

Codex CLI (OpenAI): Codex picked up an /experimental toggle for background terminals, allowing the coding agent to keep long‑running shells alive for dev servers, tests, and installs without blocking other actions, as shown in the Background terminal note.

• Less babysitting, same workflows: With background mode enabled, tests can continue running while users and the agent keep working, npm publish can wait on browser‑based auth flows, and package installs no longer require constant supervision, which addresses one of the most common complaints about agent‑driven CLIs according to the Background terminal note and Benefits thread.

• Closer to real dev ergonomics: The feature effectively moves Codex closer to how human developers use terminals—multiple concurrent shells, some tailing logs, some executing long tasks—rather than the previous single‑shot command execution model that often stalled complex automation as detailed in the Benefits thread. There is no benchmark yet on how this impacts success rates, but the ergonomics change is substantial for anyone leaning on Codex as a primary driver of shell‑based workflows.

Kevin Kern

@kevinkern

Codex has (finally) a new /experimental setting that enables background terminals. (useful for long running processes) Especially when you run a dev server or logs, you won't be blocked and you can resume working in Codex.

3:01 PM · Dec 22, 2025

254

Read 13 replies

RepoPrompt 1.5.60 streamlines Codex prompt install and CLI use

RepoPrompt 1.5.60 (RepoPrompt): RepoPrompt 1.5.60 introduced a helper that installs its prompt workflows directly into Codex and added CLI variants of those prompts so the same automations can be triggered from the terminal as well as from MCP‑aware frontends, as shown in the RepoPrompt release and changelog page.

• One‑step Codex integration: The new installer wires RepoPrompt’s /rp-build, /rp-investigate and similar commands into Codex, removing earlier manual steps where users had to copy and maintain prompt text themselves across agents and projects according to the RepoPrompt release.

• CLI parity for workflows: By exposing CLI variants of each prompt, 1.5.60 lets teams run the same repo‑aware analysis and build workflows in CI, scripts, or local shells, not only via chat UIs, which is a shift from pure MPC‑only usage toward a more general automation layer as detailed in the RepoPrompt release and changelog page. The update turns RepoPrompt from a mostly ChatGPT/Claude‑side helper into a small but flexible command‑line tool that can sit inside existing engineering pipelines.

eric provencher

@pvncher

Just released @RepoPrompt 1.5.60 It includes a nice helper to install the rp prompts into codex, and you can also add cli variants of the prompts to make it easier to use the cli with these helpful workflows. Small other fixes included in the update too!

7:33 PM · Dec 22, 2025

Read 2 replies

Warp terminal exposes agent run and run-ambient commands

Warp agent CLI (Warp): Warp highlighted that its CLI can now run agents either locally with warp agent run or in a cloud sandbox via warp agent run-ambient, then let developers SSH into those ambient runs and interact as if they were local terminals as shown in the Warp agent cli and cli docs.

• Local vs ambient agents: warp agent run keeps the agent on the local machine, while warp agent run-ambient spins it up in a remote sandbox suitable for external collaborators or untrusted code, which separates experimentation from core dev environments without changing how you talk to the agent according to the Warp agent cli.

• SSH into agent shells: Once an ambient agent is running, Warp exposes an SSH endpoint so users can drop into the same shell the agent is using, inspect logs, or manually intervene, which moves agent runs closer to traditional long‑lived server processes rather than opaque chat sessions as detailed in the Warp agent cli and cli docs. The feature set positions Warp not only as an AI‑aware terminal, but as a hosting surface for persistent agent processes that can be inspected and debugged with normal Unix tools.

Warp

@warpdotdev

Did you know you can run Warp from a CLI? - `warp agent run` to query our agent locally - `warp agent run-ambient` to kick off an agent in a cloud sandbox (great for external use) Then, you can SSH into your running agent and interact, just like a local agent 🙌

12:22 AM · Dec 23, 2025

Read 4 replies

Agentic Coding Flywheel project grows into full beginner-friendly guide

Agentic Coding Flywheel (Dicklesworthstone): Building on the earlier VPS wizard for setting up multi‑agent dev servers (vps wizard), the author reports that the Agentic Coding Flywheel site now includes around 33k lines of shell scripts, 30k lines of TypeScript/React, and a new "beads_viewer" static site documenting the whole design and refinement process (Flywheel site).

• Targeting "hungry but clueless" users: The guide explicitly targets people with little computer background who still want to use real tools instead of "slop factory" sites; jargon is heavily defined, and the author ran multiple "audits" using an agent to simulate a novice’s perspective, with those audits published as step‑by‑step documents (Ux audit link, ux audit ).

• Beads viewer and context: A separate beads viewer site visualizes the dependency graph of tasks and scripts in the flywheel, making it easier to understand how context, agents, and infrastructure pieces fit together instead of treating the setup as a black box (Beads viewer mention, beads viewer ).

The project effectively turns one person’s agent‑heavy setup into a reproducible playbook for others, with both narrative and code artifacts evolving in lockstep.

Jeffrey Emanuel

@doodlestein

My agent-flywheel.com site seems to be working pretty well now, with a huge amount of new content. Not bad for a few days on the side as I worked on other stuff. Around 33k lines of high-quality shell scripts and nearly another 30k of typescript/react for the webapp.

Jeffrey Emanuel

@doodlestein

I made a new website and set of scripts and prompts to help people get set up with the same kind of setup that I use to develop software. You can see it here: agent-flywheel.com I get asked a lot about my workflows and so I wanted to have one single resource I could share

3:37 AM · Dec 23, 2025

140

Read 12 replies

CodexBar 0.12 refines cost tracking and credit buying UX

CodexBar 0.12 (Steipete): Following up on earlier cost‑charting features for CodexBar, which added detailed token and dollar histories for Codex usage as shown in the usage charts, version 0.12 reorganizes the macOS menu into submenus, adds a persistent credits bar, and streamlines buying credits via an automated browser flow as per the CodexBar update.

• Cleaner menu structure: The author reports iterating “for hours” on the menu layout, ultimately moving many options into submenus to cut clutter and adjusting highlight colors so submenu chevrons are visible without drawing too much attention according to the CodexBar update.

• Credits and auto‑updates: A new credits usage bar and a quick “Buy Credits…” entry open a window that navigates directly to Stripe checkout, while update checks now happen in the background with a “Click to restart” menu item only when a new version is ready, removing the old explicit “Check for updates” entry as detailed in the CodexBar update. These tweaks focus less on raw functionality and more on making heavy daily Codex users comfortable monitoring and topping up usage without breaking flow.

Peter Steinberger 🦞

@steipete

CodexBar 0.12 is out! I iterated on this for hours. Found the menu entries to cluttered until I got the idea to make the entries submenus. Ofc that was tricky to place the > chefron, I had to tweak the highlight to not be bright-blue; added a bar for credits, a buy shortcut, and Show more

Peter Steinberger 🦞

@steipete

I ended up rewriting a minimal (<1000 LOC) version of ccusage into codexbar and made parsing ~20x faster + added a tiny cache to make this efficient across restarts. Still sweating some details, coming later today. For heavy users, cost data should now appear within 30 seconds

3:43 AM · Dec 23, 2025

123

Read 11 replies

Oracle CLI improves recovery and gains agent skill wrapper

oracle CLI (Steipete): The oracle tool, which wraps GPT‑5.2 Pro in a browser‑driven debugging workflow, gained stronger recovery logic and an accompanying Skill definition so agents can call it more safely and efficiently as shown in the Oracle skill note, Recovery update , and oracle repo.

• Session reattachment after crashes: Version 0.7.3 ensures that even if an agent kills the process or Chrome is closed, oracle can reattach to existing sessions instead of losing state, by marking browser sessions as errored when ports drop and improving how it discovers and reconnects to them according to the Recovery update and release notes.

• Skill integration for agents: A new Skill in the agent-scripts repo describes how agents should invoke oracle, which reduces mistakes and speeds up runs compared to having the model guess shell commands and flags every time as detailed in the Oracle skill note and skill docs. The changes push oracle further toward being a reliable, reusable component in larger agent harnesses rather than a one‑off personal debugging script.

Peter Steinberger 🦞

@steipete

First time in days I need to use oracle 🧿 again (GPT 5.2 Pro via browser) - write a skill and it runs even smoother. github.com/steipete/oracle github.com/steipete/agent…

11:43 PM · Dec 22, 2025

Peakypanes debuts as YAML-driven tmux dashboard for agent sessions

peakypanes (Kevin Kern): A new CLI tool called peakypanes launched as a tmux dashboard and layout manager driven by YAML, aimed at developers juggling multiple agents, servers, and projects across terminals as shown in the Feature description and Usage reflection.

• Dashboard for many projects: The dashboard view shows projects, sessions, and windows in one screen so users can see which agents or processes are running and quickly start, switch, or manage tmux sessions, giving a higher‑level overview than raw tmux alone according to the Feature description and Peakypanes demo.

• Shared layouts as code: Layouts are described in a simple YAML format, including panes and the commands they should run, so teams can check them into git and share reproducible multi‑pane setups for things like multi‑agent harnesses or microservice dev stacks as detailed in the Peakypanes demo and Repo link. The author calls this an early release and warns about bugs, but the structure points toward tmux becoming a more first‑class orchestration surface for agent‑heavy workflows.

Kevin Kern

@kevinkern

This is the result. "peakypanes" 🎩. A simple tmux dashboard + layout manager. I use this for multi-agent work and reusable layouts. Basically one screen to see whats running across projects.

Kevin Kern

@kevinkern

Switched from figma to ascii

10:27 AM · Dec 22, 2025

188

Read 12 replies

ck code indexer accelerates Codex file lookup with new embedding backend

ck + Codex (Kevin Kern): The ck tool, which builds a local index of a codebase so Codex can find files more quickly, has been in daily use for a month and now has a pull request testing Mixedbread embeddings as a faster backend as shown in the Ck codex usage and Mixedbread pr.

• Local indexing for speed: Instead of relying on Codex to scan the repo from scratch each time, ck pre‑indexes files so the agent can jump straight to relevant paths, which the author says makes Codex locate files “much faster” in large projects according to the Ck codex usage.

• Embedding swap experiment: A new PR experiments with swapping in Mixedbread embeddings to see if they improve search latency and relevance over the existing setup, reflecting a trend toward treating vector backends as pluggable infrastructure beneath agent‑facing tools as detailed in the Mixedbread pr and pr details. The work is small‑scale but shows how practitioners are hand‑tuning retrieval layers around coding agents instead of waiting for monolithic IDE updates.

Kevin Kern

@kevinkern

I've been using ck + codex for a month now and I'm pretty happy with it. Basically it indexes your codebase and codex will find your files much faster. Credits to @runonthespot for this work.

7:30 PM · Dec 22, 2025

189

Read 10 replies

🔗 Agent interop and app surfaces (MCP, Apps SDK)

Interoperability and app surfaces saw movement: skills loading patterns, browser agents, and ChatGPT Apps SDK connectors. Excludes IDE‑specific coding features (see Coding agents).

OpenAI ships “Your Year with ChatGPT” as a first-party Apps SDK experience

Your Year with ChatGPT (OpenAI): OpenAI is rolling out an end‑of‑year recap experience, Your Year with ChatGPT, to Free, Plus and Pro users in the US, UK, Canada, Australia and New Zealand; it runs only when Memory and reference chat history are on and when a minimum activity threshold is met as shown in the rollout details and feature explainer. The recap is implemented as an internal ChatGPT app using a new connector called OpenAI Cocoon, making it one of the first public, production examples of the Apps SDK in action rather than a hard‑coded product feature as detailed in the cocoon mention and apps sdk docs.

• App-like UX inside ChatGPT: The experience surfaces as a tappable card with custom layouts, animations and navigation distinct from normal chats, and people describe it as a "first class" mini‑app that feels closer to WeChat‑style in‑app experiences than to a plain conversation, with one observer saying it shows "what a 'first class' app experience could look like inside ChatGPT" as shown in the ui commentary.

• Developer interest in the SDK: Seeing a polished first‑party app backed by the SDK has triggered renewed interest from builders, with developers noting that this convinced them to "look again at the ChatGPT Apps SDK and build something" and OpenAI’s PM for the SDK saying they "can’t wait to see what people build" in 2026 as mentioned in the dev reaction and sdk teaser.

• Connector usage and discovery: The "Your Year with ChatGPT" widget appears in the app list and can also be invoked via the plus‑menu and a natural language command, with a direct deep‑link to the experience shared so users can check whether they already have access as noted in the invocation hint and deep link.

The recap doubles as both a sticky user‑facing feature and a reference implementation for how OpenAI expects third‑party Apps SDK experiences and connectors to look and feel inside ChatGPT.

OpenAI

@OpenAI

Your Year with ChatGPT! Now rolling out to everyone in the US, UK, Canada, New Zealand, and Australia who have reference saved memory and reference chat history turned on. Just make sure your app is updated.

7:45 PM · Dec 22, 2025

5.1K

Read 1.1K replies

Claude in Chrome proves out practical browser agents for real dashboards

Claude browser agent (Anthropic): Multiple practitioners are now reporting real utility from Anthropic’s Claude in Chrome browser agent, with one detailed write‑up describing how it recovered a lost CORS configuration buried deep in Cloudflare’s dashboard by scanning pages, following links and identifying the relevant Transform Rule without the user remembering where it lived as shown in the cloudflare story and cors blog post. The same author extracted a full HTML transcript showing Claude’s step‑by‑step actions—navigation, text recognition in the UI and reasoning about which rule controlled which path—illustrating how a browser‑embedded agent can function as a point‑and‑click debugger for complex SaaS admin panels via the transcript link.

• Agent UX and safeguards: Screenshots from another user show Claude’s browser agent UI labeling itself as "HIGH RISK" when allowed to "take most actions on the internet", with options like "Act without asking" and warnings that the agent can click hidden CAPTCHA elements or forms, underscoring both the power and risk profile of giving an LLM direct control over a live browser session as detailed in the risk banner screenshot.

• From skepticism to adoption: The Cloudflare user notes that they had been skeptical of browser agents due to prompt‑injection risks but called this their "first successful experience" solving a real problem, contrasting manual hunting through a confusing UI with a guided session where the agent quickly identified the rule name, path pattern and header being set as noted in the cloudflare story.

Together these reports suggest browser‑level agents are starting to cross from novelty demos into tools that can operate vendor dashboards and consoles on behalf of engineers, as long as users are comfortable with the elevated access they require.

Simon Willison

@simonw

Had my first successful experience using a browser agent to solve a real problem - in this case I had the Claude in Chrome extension help me find some configuration I had lost deep within the Cloudflare control panel simonwillison.net/2025/Dec/22/cl…

5:21 PM · Dec 22, 2025

401

Read 24 replies

Claude Code gains OpenRouter backend, exposing 320+ models via one agent surface

Claude Code on OpenRouter (Anthropic/OpenRouter): Claude Code, Anthropic’s agentic coding environment, can now run against OpenRouter as a backend provider, letting users route its multi‑step coding and tool‑use workflows through more than 320 different LLMs rather than only Anthropic‑hosted models as noted in the claude code announcement. OpenRouter’s docs for the integration explicitly recommend "highly capable" models like Claude 4.5 Sonnet and GPT‑5.2 for best results but note that any compatible model—closed or open, including newcomers like GLM‑4.7—can be plugged into the same Claude Code harness per the

and glm model page.

• Single agent, many engines: The integration turns Claude Code into an interop surface where the front‑end agent logic (planning, diffing, tool orchestration) stays the same while the underlying inference stack can be swapped between Anthropic, OpenAI, Google, Z.ai and others using OpenRouter’s normalized API and routing layer as shown in the claude code announcement.

• Skills and tools compatibility: Because Claude Code already supports open Agent Skills and MCP‑style tools, binding it to OpenRouter means those higher‑level capabilities can now be exercised on top of cheaper or specialized models as they appear on the platform, without per‑model glue code.

For teams experimenting with a mix of closed and open‑weight models, this gives a single coding‑agent UX that can sit in front of a very fluid backend model portfolio.

OpenRouter

@OpenRouterAI

You can now use Claude Code with OpenRouter 🎊 Code with over 320 LLMs, including 39 free ones!

5:24 PM · Dec 19, 2025

3.3K

Read 134 replies

OpenRouter highlights nextTurnParams pattern for self-managing skills

Skills loader pattern (OpenRouter): OpenRouter is pushing a concrete design for self‑managing skills by showcasing how its SDK’s nextTurnParams can automatically enrich future turns with specialized instructions once a skill is loaded as noted in the tip on nextturnparams. The example skills loader skill turns a one‑time discovery call into a persistent context modifier, so tools can quietly attach domain‑specific guidance or routing hints to every subsequent model call without extra prompting boilerplate via the skills loader example.

• Encapsulated tools, minimal prompts: The pattern keeps skills configuration in one place (a skill manifest plus loader code) and relies on the harness, not the human prompt, to inject the right system messages or tool configs on future turns, which helps reduce prompt drift and lets different frontends reuse the same skills library.

• Ties into open Skills spec: The loader builds on the open Agent Skills spec that packages instructions and resources into SKILL.md folders, with prior work showing Codex and other agents adopting that standard; OpenRouter’s contribution is a concrete runtime hook (nextTurnParams) for turning those static skill bundles into living, per‑conversation behavior per the skills standard mention and skills overview.

This gives agent frameworks a portable way to keep long‑running conversations skill‑aware without forcing each step of the dialog to repeat the same configuration scaffolding.

OpenRouter

@OpenRouterAI

TIP 💡: use the powerful `nextTurnParams` in our SDK to create encapsulated, self-managing tools. When a skill is loaded, it automatically enriches subsequent turns with specialized instructions.

5:38 PM · Dec 22, 2025

136

Read 8 replies

⚙️ Serving stacks and latency tricks

Runtime/serving updates with concrete throughput/TTFT wins and day‑0 integration. This excludes model metrics (Feature) and training algorithms (see Training/Reasoning).

SGLang + Baidu context parallelism cuts DeepSeek V3.2 TTFT by up to 80%

Context parallelism for DeepSeek V3.2‑DSA (SGLang/Baidu): Baidu’s Baige AIAK team open‑sourced a context_parallel implementation for DeepSeek‑V3.2‑DSA in SGLang and reports that enabling it reduces time‑to‑first‑token by about 75% at 16k tokens and 80% at 32k versus the non‑CP baseline on a single machine as shown in the context parallel post.

The design reuses routing patterns across experts, does load‑balanced sequence splitting tailored to DeepSeek’s DSA layout, and avoids tensor‑parallel all‑reduce overhead while remaining compatible with data‑parallel attention and other parallelism schemes, aiming squarely at long‑context inference bottlenecks as detailed in the context parallel post.

LMSYS Org

@lmsysorg

SGLang + Baidu Baige AIAK Team: Open-sourcing Context Parallelism for DeepSeek V3.2-DSA We’re excited to announce that Baidu Baige has open-sourced a Context Parallelism (CP) solution for DeepSeek V3.2-DSA, now merged into SGLang. As long-context inference scales to 128K tokens, Show more

7:24 PM · Dec 22, 2025

vLLM adds day‑0 GLM‑4.7 serve with MTP and tool parsers

GLM-4.7 in vLLM (vLLM project): vLLM added day‑0 support for Z.AI’s GLM‑4.7, exposing a single vllm serve command that wires in MTP speculative decoding, GLM‑style tool/function calling, and a reasoning parser tuned to the model’s “thinking” traces, as shown in the vllm glm47 serve.

The example launch uses 4‑way tensor parallelism plus --speculative-config.method mtp with one speculative token, indicating a focus on higher throughput without any model retraining as detailed in the vllm glm47 serve.

vLLM

@vllm_project

Congrats to the GLM team on GLM-4.7 — a step up in the GLM-4.x series, with day-0 serving support in vLLM!🚀 ⚡ Support MTP decode (faster throughput). 🧰 Tool/function calling. 🧠 Thinking controls: interleaved/preserved/per-turn. Command in screenshot below👇 Read more: Show more

Z.ai

@Zai_org

GLM-4.7 is here! GLM-4.7 surpasses GLM-4.6 with substantial improvements in coding, complex reasoning, and tool usage, setting new open-source SOTA standards. It also boosts performance in chat, creative writing, and role-play scenarios. Default Model for Coding Plan:

1:00 AM · Dec 23, 2025

239

SGLang publishes GLM‑4.7‑FP8 serving recipe with EAGLE speculative decode

GLM-4.7-FP8 in SGLang (LMSYS/SGLang): LMSYS released a concrete sglang.launch_server command for serving Z.AI’s GLM‑4.7‑FP8 in SGLang, enabling the EAGLE speculative decoding algorithm, GLM‑specific tool and reasoning parsers, and 8‑way tensor parallelism in one config as shown in the sglang glm47 example.

The recipe also sets --speculative-num-steps 3, --speculative-num-draft-tokens 4, and pins GPU memory with --mem-fraction-static 0.8, signalling a production‑oriented setup for long‑context GLM‑4.7 serving on multi‑GPU hosts as outlined in the sglang glm47 example.

LMSYS Org

@lmsysorg

Congrats to the GLM team on GLM-4.7 🎉SGLang has day-0 support ready. If you’re deploying GLM-4.7 with SGLang, the command is already there in the image. Try it out and let us know how it goes!

Z.ai

@Zai_org

8:23 PM · Dec 22, 2025

Read 1 reply

vLLM‑Omni adds LongCat-Image-Edit for instruction-following image edits

LongCat-Image-Edit in vLLM-Omni (vLLM project): The vLLM community integrated Meituan’s LongCat‑Image‑Edit model into vLLM‑Omni, giving operators a unified runtime to serve instruction‑following image edits such as object insertion, background replacement, and style adjustments from the same stack that handles text LLMs as detailed in the longcat support.

A demo shows a Qwen bear illustration turned into a painting scene with an art board labeled “vLLM‑Omni” and a brush in the bear’s hand, reflecting how text prompts plus a reference image can drive structured edit actions inside the new image‑editing endpoint as demonstrated in the longcat support.

vLLM

@vllm_project

🎉 The vLLM community has added support for LongCat-Image-Edit (from @Meituan_LongCat team) in vLLM-Omni. - Simpler path to serve instruction-following image edits - Supports common operations like object add/replace, background changes, and style adjustments - Useful for Show more

3:57 AM · Dec 23, 2025

Read 1 reply

🏗️ Power and campuses for AI growth

Non‑model, supply‑side signals: datacenter power and siting moves. Mostly power procurement/capex news; separate from enterprise adoption metrics in Business.

Alphabet buys Intersect for $4.75B to align AI datacenters with new power

Alphabet–Intersect deal (Google): Alphabet is acquiring clean‑energy developer Intersect for about $4.75B in cash plus assumed debt to co‑locate new solar and battery projects with Google’s AI datacenters, targeting a pipeline of ~10.8 GW by 2028 as shown in the Intersect summary and Bloomberg report; the goal is to move from pure power‑purchase agreements to owning a "development platform" that handles land, permits, grid interconnection and financing on the same schedule as new compute.

Deployment impact: Adding large AI datacenter loads often means waiting years on grid upgrades and interconnection queues, so compute can be ready before power; Intersect’s model is to build solar plus battery storage next to new campuses so generation and transmission are planned around a known AI load, reducing dependence on constrained local grids as detailed in the Intersect summary. Google signals it will buy Intersect’s in‑development projects and team, but not all of its operating grid assets, which positions this more as a forward pipeline of custom power for AI than a generic utility buy as shown in the Intersect repost. For AI infra planners, this is a clear data point that power siting and permitting are now strategic bottlenecks on par with GPU supply, and that hyperscalers are willing to own more of the "electron supply chain" to keep model training and inference roadmaps on track.

🧠 Google is buying Intersect for $4.75B in cash plus assumed debt to secure more electricity for Google’s fast growing AI data center buildout. Intersect mainly gives Google a way to build new clean power and storage right next to new AI data centers, so the power shows up on Show more

12:07 AM · Dec 23, 2025

Read 8 replies

Amazon’s $11B Indiana AI campus adds 2.2 GW load and heavy water use

Indiana AI campus (Amazon): Amazon’s planned $11B data center complex in St. Joseph County, Indiana, will be sized for about 2.2 GW of power draw—enough electricity for roughly 1M homes—and is expected to use around 300M gallons of water per year for cooling as detailed in the Indiana campus update; the site is framed as one of Amazon’s largest AI training and inference hubs, following up on Indiana campus where the basic campus scale and on‑site power plant plans were first outlined.

Local grid and environment angle: A 2.2 GW load concentrated in a single AI campus effectively turns it into a dedicated power customer the size of a mid‑sized city, which is why Amazon pairs it with its own energy infrastructure rather than leaning entirely on the regional grid as noted in the Indiana campus update. The newly mentioned ~300M gallons/year cooling demand highlights the water footprint of large AI campuses, raising questions about sustainability and local resource planning that regulators and communities will have to weigh alongside economic benefits. For other hyperscalers, the Indiana numbers give a concrete reference point for what a next‑generation AI campus looks like in power and water terms, not just capex.

Wes Roth

@WesRoth

Amazon is investing $11 billion into a sprawling new data center campus in St. Joseph County, Indiana, one of its largest infrastructure projects ever. The facility will be primarily used to train and operate AI models and is expected to consume 2.2 gigawatts of power enough to Show more

Wall Street Apes

@WallStreetApes

This is Amazon’s new $11 billion dollar massive Data Center Campus in St. Joseph County, Indiana It will be primarily dedicated to training and running AI models It will use 2.2 gigawatts of power, equivalent to the electricity needed to power roughly 1 million homes and

1:00 PM · Dec 22, 2025

Read 9 replies

China’s power capacity reaches 3.75 TW, nearly triple US, shaping AI headroom

China power capacity (China): New charts from Morgan Stanley put China’s total power‑generation capacity at ~3.75 TW, compared with about 1.30 TW in the US, implying China now has nearly 3× the installed capacity and extending the earlier picture of rapid generation growth described via China grid and as noted in the China capacity tweet.

Why this matters for AI: The same report notes China accounted for about 54% of global industrial robot installations, tying its power build‑out to rising automation and electric load from factories and datacenters, while US capacity growth has been relatively flat per the robot patents recap and China capacity tweet. For AI infra, the 3.75 TW figure sets the ceiling on how far China can scale energy‑hungry GPU clusters, fast‑charge networks, and robotics plants before hitting hard power limits, whereas the US will need either faster capacity additions or more aggressive efficiency gains to support similar levels of AI and robotics deployment. The numbers do not say how much of that capacity is directly allocated to AI, but they define the macro headroom in which future Chinese AI campuses and model‑training projects will compete.

Wes Roth

@WesRoth

China has reached a historic 3.75 terawatts in total power generation capacity nearly triple that of the U.S., which stands at around 1.30 terawatts.

The Kobeissi Letter

@KobeissiLetter

China is dominating the worldwide race for power: China now has a record 3.75 terawatts of power generation capacity. That capacity has doubled over the last 8 years. This is nearly 3 TIMES more than the US, which has ~1.30 terawatts of capacity. Furthermore, China has 34

12:30 PM · Dec 22, 2025

Business signals around AI platforms and go‑to‑market. Continues yesterday’s adoption narrative with fresh metrics; excludes infra procurement (see Infrastructure).

GenAI web traffic (Similarweb): Similarweb’s Jan–Nov 2025 data shows Gemini’s share of global GenAI website traffic rising from about 5.64% to 14.95% (roughly 3×), while ChatGPT falls about 4 percentage points to ~74% and DeepSeek plunges from 12.79% to 5.35% as shown in the share chart; Grok grows from 0.02% to 2.53% and Perplexity holds steady around 3%, which commentators frame as “stable niche” rather than breakout according to the traffic recap. Following up on us traffic, where ChatGPT still dominated US visits, this new slice suggests Google and xAI are the only players meaningfully gaining share inside this traffic bucket.

• Google and Anthropic momentum: One analyst calls Google, alongside Anthropic, “the big winner in 2025” as Gemini’s share nearly triples while ChatGPT’s dips and DeepSeek’s visibly shrinks, arguing that “Google, along with Anthropic, is the big winner in 2025” as detailed in the share chart.

• DeepSeek and Grok repositioning: The same chart breakdown highlights DeepSeek’s sharp decline and Grok’s rise from essentially zero to a few percent, suggesting early traction for xAI while DeepSeek’s direct‑to‑consumer reach weakens as shown in the traffic recap. The point is: within this specific web traffic lens, the competitive field is still highly concentrated around ChatGPT, but Google’s Gemini and xAI’s Grok are now the only meaningful challengers gaining ground while some earlier contenders lose visibility.

xAI selected to power DoD GenAI.mil at IL5 for up to 3M users

GenAI.mil program (xAI): xAI says its Grok‑based “frontier AI” stack has been selected by the U.S. Department of Defense Chief Digital and Artificial Intelligence Office (CDAO) as a provider for the GenAI.mil initiative, targeting around 3 million DoD users at Impact Level 5 (IL5), the cloud security tier for Controlled Unclassified Information via the program summary and xai gov post. The company states its models will run inside the IL5 boundary and be exposed through an enterprise platform with an API plus agent tooling that can chain steps like search, drafting and summarization into single workflows as detailed in the program summary.

• Enterprise and mission split: xAI describes two tracks: Enterprise AI for day‑to‑day Pentagon knowledge work, and “mission systems” using government‑optimized foundation models for classified operational workloads, likely in separate deployment enclaves with tighter controls according to the program summary.

• Data and sourcing model: The announcement notes that DoD users will receive real‑time insights sourced from X, shifting answers from static training data toward live feeds, which is a data‑integration choice rather than a model architecture change but has clear implications for provenance and information governance in defense settings per the program summary.

• Procurement context: The deal slots into CDAO’s pattern of awarding multiple frontier AI vendors contracts with ceilings up to roughly $200M each to build agentic workflows across mission areas, signalling that xAI will now compete head‑to‑head with other large labs inside one of the highest‑stakes enterprise environments as shown in the program summary. This marks one of the first public large‑scale defense deployments of a Grok‑class model at IL5, putting xAI directly into the enterprise AI platform conversation alongside more established vendors.

Reports describe Microsoft Copilot adoption woes and Satya’s hands‑on reset

Copilot (Microsoft): Commentary around Microsoft’s Copilot paints a picture of underwhelming enterprise traction, with reports that Microsoft has cut Copilot AI sales targets after weak adoption and that CEO Satya Nadella has taken a hands‑on product management role to accelerate improvements via the copilot critique and sales target note. One summary says Nadella is “personally overseeing engineering and recruiting while delegating other executive duties,” driven by frustration over technical flaws and market share erosion versus rivals like Google and Cursor per the copilot critique.

• Adoption and perception issues: Posts describe users seeing Copilot as unreliable and agentic tools as “untrustworthy” in daily workflows, which reportedly slows enterprise rollout despite aggressive bundling such as an unremovable Copilot app appearing on LG TVs after a firmware update as shown in the forced install article.

• Competitive pressure: The same threads explicitly list Google and Cursor as key competitors, implying that Microsoft’s current Copilot experience is not winning developers by default and prompting this internal “code red” style response according to the copilot critique. For AI leaders, this is one of the clearest public signals that even a hyperscaler with distribution still has to win on perceived reliability and day‑to‑day usefulness, not only on bundling and brand.

🎬 Creator workflows: motion control, music, design

Generative media saw heavy traffic: motion‑controlled video pipelines, music creation tooling, and design iteration UX. This cluster is kept separate for creators/marketing teams.

Kling 2.6 Motion Control spreads across Higgsfield, fal and Replicate

Kling 2.6 Motion Control (Kuaishou / ecosystem): Following up on the earlier Kling 2.6 launch for motion‑controlled video workflows as shown in the Kling workflow, multiple platforms have now wired it into creator‑friendly pipelines—Higgsfield offers day‑0 access with 30‑second generations and full‑body sync, expression mapping and lip‑sync as demonstrated in the Higgs launch; fal hosts a one‑take 30s Motion Control endpoint targeting fast dance/sports/martial‑arts style moves as outlined in the fal integration; and Replicate exposes a "static image + reference video or motion library → animated output" flow with side‑by‑side previews for creators as presented in the replicate demo.

• Higgsfield workflows: Higgsfield markets Kling 2.6 as “any motion reference becomes any character’s performance,” and pairs it with Nano Banana Pro so users can stylize characters and then retarget complex, fast movement onto them as shown in the Higgs launch and workflow guide.

• fal and Replicate surfaces: fal’s hosted endpoint promises synchronized motion, expressions and lip sync in up to 30s clips with a single prompt as detailed in the fal integration; Replicate’s UI shows a split‑screen of source vs generated clip, highlighting how static photos like a Santa portrait can inherit motion from a live‑action reference as illustrated in the replicate demo.

• Multi‑character control: Community guides now document recording separate performances for each actor, converting first frames with Nano Banana Pro, then driving two Kling Motion Control runs and compositing, effectively turning a solo performer into a multi‑character cast as shown in the multi character demo and higgs tutorial.

The net effect is that Kling’s motion system is no longer a single web demo but a multi‑hosted primitive that creators can reach through Higgsfield, fal and Replicate, often chained with image models like Nano Banana Pro for character design.

ElevenLabs Music adds Explore, stem separation and better lyric tools

Eleven Music (ElevenLabs): ElevenLabs shipped a substantial update to its music model and UI, adding an Explore surface for discovering and remixing tracks, multi‑level stem separation, improved lyric generation and precise per‑line timestamps as shown in the music update.

• Stem control: Creators can now split songs into 2, 4 or 6 stems—ranging from simple vocal/instrumental all the way to vocals, drums, bass and an "other" channel—enabling fine‑grained remixing and targeted edits inside or outside ElevenLabs as detailed in the music update.

• Lyric workflow: The company reports better clarity, coherence and stylistic alignment for generated lyrics plus section‑level regeneration, making it easier to inpaint or extend specific song parts without discarding a whole take per the music update and UI improvements.

• Timing and UI polish: A new lyric timestamp system exposes exact timings via both UI and API, and the Music interface gains richer history, smoother navigation and real‑time highlighting of lyric lines during playback as detailed in the music update and UI improvements.

For music‑tool builders and sync‑heavy workflows, this turns Eleven Music from a pure generator into more of a DAW‑adjacent tool that can sit in the middle of editing, stems prep and lyric‑driven visualizations.

Genspark AI Developer builds games from screen recordings and one prompt

AI Developer (Genspark): Genspark showcased a workflow where its AI Developer agent turns a simple screen recording of the mobile game Block Blast plus a single prompt (“build a game like this”) into a playable clone in minutes, with no manual coding by the user as demonstrated in the game demo.

• Multi‑model orchestration: Behind the scenes, Gemini handles video understanding, Nano Banana Pro designs themes and assets, and Claude writes production‑ready code; Genspark’s orchestrator routes sub‑tasks across models as needed instead of following a fixed pipeline as explained in the stack explanation.

• Agentic workflow: The system identifies user journeys in the recorded game, plans the feature set, generates assets and code, and then assembles a working web game, effectively turning "vibe coding" into a repeatable pattern for prototyping casual games as shown in the game demo and stack explanation.

This positions Genspark’s tool as an example of how creator workflows can move from prompt‑only to "prompt + demonstration" inputs, especially for indie game and interactive prototype work.

Manus launches Design View for end‑to‑end AI design workflows

Design View (Manus): Manus introduced a new "Design View" on web and mobile that reframes its image model as a full design workflow—users can commission, create and iteratively refine visual assets as part of one continuous session rather than firing isolated prompts as shown in the design view launch.

• From prompt to layout: The demo shows a prompt box feeding into a canvas‑like interface where each refinement pass mutates composition, style and details, while preserving the project context instead of starting from scratch as illustrated in the design view launch.

• Agent extension: Manus positions Design View explicitly as an extension of its existing agent, so the same underlying model that chats can now act as a design assistant with persistent memory of prior iterations and instructions as demonstrated in the design view launch.

For designers and marketers, this shifts Manus from being “an image generator” to something closer to an AI design environment that tracks intent and makes iteration loops feel like editing rather than random re‑rolls.

YouTube Playables Builder lets creators prompt Gemini 3 into mini‑games

Playables Builder (Google / YouTube): Google DeepMind highlighted that the new YouTube Playables Builder web app uses Gemini 3 to help creators spin up small, playable games from text, video or image prompts, aimed at "fun, bite‑sized" interactive content as shown in the playables teaser.

• Prompt‑to‑game flow: A short demo shows a creator entering a prompt, selecting options in the Builder UI and previewing a simple text‑based game directly in the browser, all under a "Powered by Gemini 3" banner as presented in the playables teaser.

• Creator positioning: Commentary frames this as "the first big release in AI powered game creation," suggesting Google wants Playables Builder to be a mainstream on‑ramp for non‑programmer creators to experiment with interactive experiences inside the YouTube ecosystem as detailed in the playables analysis.

For game‑adjacent YouTube channels, this folds lightweight game prototyping into the same surface they already use for video publishing and monetization.

Hailuo and Nano Banana Pro front a Christmas AI‑video contest

Hailuo 2.3 + Nano Banana Pro (Hailuo / Flowith): Creator @ai_for_success shared a "Modern day Santa" short built by first designing a consistent Santa character in Nano Banana Pro, then using Hailuo’s First and Last Frame feature (with Veo 3.1 and the Hailuo 2.3 model) to animate those stills into a full video as shown in the santa entry.

• Contest mechanics: Hailuo is running a #HailuoChristmas campaign from Dec 19 to Jan 5 where participants either start from Christmas templates (≥15s) or original stories (≥30s), post on major social platforms with the hashtag and submit via a landing page, with prizes of $1,500, $1,000, $500 and ten $100 random awards plus 1,000 free credits for the first 20 submissions as outlined in the contest rules and contest page.

• Toolchain pattern: The shared workflow emphasizes creating a single hero image, generating story beats in Nano Banana Pro, then binding them with Hailuo’s frame‑to‑frame interpolation so the character identity remains stable across the clip as shown in the process recording.

• Promo tie‑in: Nano Banana Pro is temporarily free on Hailuo until Dec 31, explicitly marketed as a way to create contest entries without extra image‑model spend per the nb pro promo.

This campaign shows how model vendors are packaging specific visual pipelines (character design → first/last frame video) into seasonal events to drive both experimentation and user‑generated marketing assets.

New 3D and world‑event generators target editable scenes and characters

3D‑RE‑GEN, WorldCanvas and character animation tools (multiple): Several research and demo posts highlighted creator‑facing 3D and world‑event generators: 3D‑RE‑GEN shows a generative framework that reconstructs detailed indoor scenes, moving from wireframes to textured rooms as demonstrated in the 3d regen demo; "The World is Your Canvas" introduces WorldCanvas, which can "paint" promptable events into a scene using reference images, trajectories and text as shown in the worldcanvas link; and a separate "Animate Any Character in Any World" demo lets users drop characters into arbitrary environments and control their motion as shown in the character animator.

• Scene reconstruction: 3D‑RE‑GEN’s video cycles between mesh overlays and final renders of living rooms and other interiors, suggesting a pipeline where artists can start from sparse captures and end up with editable, photorealistic environments as demonstrated in the 3d regen demo.

• Promptable events: The WorldCanvas work focuses on combining text, guidance trajectories and reference visuals so that creators can specify not just a static scene but a dynamic sequence of events in it, effectively turning the world into a parametrizable canvas as shown in the worldcanvas link.

• Character/world decoupling: The "Animate Any Character in Any World" tool emphasizes dragging a character asset into a separate background and then keyframing or prompting motion, which mirrors how many 2D creators already think about layers and rigs as detailed in the character animator.

Taken together, these projects sketch a near‑future pipeline where world models, 3D recon and character rigs interlock, letting creators describe or sketch scenes and then iterate on both environment and motion without hand‑authoring every asset.

ImagineArt adds Topaz and Magnific upscalers for up to 16× enlargement

ImagineArt upscaling (ImagineArt): Commentators noted that ImagineArt has integrated Topaz Labs and Magnific AI as built‑in upscaling backends, enabling creators to push AI‑generated images up to 16× their original resolution inside the platform as highlighted in the imagineart mention and upscale comment.

• Workflow impact: Instead of exporting to separate tools, users can now generate an image in ImagineArt, choose between Topaz or Magnific as an upscaler and produce large prints or high‑res crops from the same interface as demonstrated in the imagineart mention.

• Cost angle: One thread calls out that this is a "great practical use" for large‑scale upscaling, implying that the main benefit is consolidating an otherwise multi‑tool, credit‑heavy workflow into a single app as noted in the upscale comment.

For illustrators and print‑oriented artists, native 16× upscaling reduces the friction of taking AI concepts into formats suitable for posters, merch and high‑dpi layouts.

🤖 Embodied AI: factory deployments and ‘Robot Olympics’

Embodied threads include real deployments in factories/borders and fine‑tuned generalist skills on household tasks. Mostly manipulation/control; distinct from media robots on stage.

Physical Intelligence’s π0.6 robot tackles “Robot Olympics” household chores

Robot Olympics chores (Physical Intelligence): Physical Intelligence fine‑tuned its π0.6 vision‑language‑action model to perform Benjie Holson’s "Robot Olympics" tasks—door traversal, sock inversion, key use, sandwich making, orange peeling and pan washing—with fully autonomous rollouts rather than teleoperation as shown in the pi demo thread and holson reference.

• Task coverage and data: The team reports solving 3 of 5 event categories at gold level and 2 at silver, using under 9 hours of new data per task on top of a generalist π0.6 pretrain according to the pi blog post.

• Success metrics vs baseline: Across events, π0.6 averages 52% full success and 72% task progress, while a standard VLM baseline achieves 0% success and ~9% progress, highlighting the importance of robot‑specific pretraining and fine‑tuning as described in the pi summary and pi blog post.

• Examples of skills: Videos show the robot keeping a self‑closing lever door open while walking through, turning a sock inside‑out, unlocking a lock with a key, making a peanut butter sandwich (open, spread, cut, close), washing both sides of a frying pan, and even peeling an orange with a tool when gripper limits make finger‑only peeling impossible via the door task video and sandwich video.

• Moravec’s paradox angle: Sergey Levine notes they "didn't actually do anything special" beyond fine‑tuning π0.6 and suggests this level of everyday manipulation might force a rethink of Moravec’s paradox in the age of robotic foundation models per Levine’s comment.

The work frames a concrete benchmark suite for embodied generalists and shows that a single VLA model plus modest per‑task fine‑tuning can span long‑horizon, contact‑rich household chores rather than one‑off lab tricks.

CATL’s Xiaomo humanoids achieve 99% success on high‑voltage battery plug‑ins

Xiaomo factory robots (CATL): Chinese battery giant CATL reports deploying its Spirit AI Xiaomo humanoid robots on high‑voltage battery production lines, claiming 99% successful plug‑in operations and roughly 3× the daily throughput of a human worker on that station as indicated in the catl deployment.

• Task characteristics: The target job used to be manual because connectors and cables shift slightly each cycle and mistakes at high voltage pose real safety risks; traditional industrial arms prefer fixed geometry cells and struggle with this kind of "fiddly" connector alignment as noted in the catl deployment.

• Control approach: CATL says Xiaomo uses a vision‑language‑action model that takes camera input plus a task goal and outputs motor actions directly, allowing real‑time adjustment of grip, approach angle and insertion force instead of brittle scripted trajectories according to the catl deployment.

• Utilization metrics: The robots reportedly hot‑swap their own batteries in under three minutes and walk at ~2 m/s, enabling true 24/7 operation on the line where the 3× workload uplift mainly comes from continuous running rather than speed alone as shown in the catl deployment.

If these performance and reliability numbers hold across product variants and maintenance cycles, this is a concrete example of end‑to‑end learned control replacing classic, hard‑tooled automation for variable, safety‑critical assembly steps.

China signs ¥264M deal to staff Vietnam border with UBTech Walker S2 humanoids

Walker S2 at the border (UBTech): China has reportedly signed a 264 million yuan (~$37M) contract to deploy UBTech’s Walker S2 humanoid robots at the Fangchenggang border with Vietnam, where they’ll handle personnel flow management, inspection and logistics in harsh, remote conditions around the clock as reported in the walker s2 summary and border article.

• Platform specs: The Walker S2 robots are about 176 cm tall and 70 kg, can walk at roughly 2 m/s, and can autonomously hot‑swap their batteries in under three minutes to support continuous 24/7 operation without human swaps as described in the walker s2 summary.

• Operational role: The deployment is framed as a way to staff a remote border crossing with persistent robotic presence for document checks and cargo handling, where human staffing is costly and the environment may be unpleasant or risky over long shifts according to the walker s2 summary and border article.

• Trend signal: Earlier Chinese deployments focused on pilots in factories and exhibitions; a dedicated, funded border deal suggests humanoids are starting to be evaluated as regular infrastructure in public security and customs workflows rather than one‑off demos via the patrol video.

The rollout will test whether bipedal platforms can meet reliability, maintenance and uptime expectations in an operational government setting rather than a controlled lab or expo.

Kyber Labs shows fully autonomous robotic arm assembling mechanical parts

Autonomous assembly (Kyber Labs): Kyber Labs has released a demo of its robotics system autonomously positioning and fastening a small metal part onto a base plate, with no human teleoperation or intervention during the sequence as demonstrated in the kyber demo.

• Task structure: The video shows a robot arm picking up a component, aligning it with pre‑drilled holes on a plate, placing it, then driving fasteners, which combines perception, precise pose estimation and force control rather than simple point‑to‑point moves as shown in the kyber demo.

• Claimed autonomy: The announcement stresses "no human intervention, just full‑stack robotic precision", implying the stack covers perception, planning and low‑level control internally instead of relying on offline teaching or joystick control according to the kyber demo.

The demo sits between academic manipulation benchmarks and full industrial deployment, hinting at how lab‑scale systems are being hardened into repeatable assembly skills.

Midea’s MIRO U “one head, six arms” robot targets flexible production lines

MIRO U multi‑arm robot (Midea Group): Chinese appliance maker Midea is showcasing MIRO U, a production robot with a humanoid torso, vertical lift, 360° rotation and six coordinated arms mounted on a wheeled base, pitched as delivering about 30% efficiency gains on factory lines according to the miro u summary.

• Mobility and workspace: MIRO U rides on a wheeled base for fast movement across stations, then uses torso lift and rotation to reach different fixtures and conveyors without re‑rigging the cell layout as detailed in the miro u summary.

• Multi‑arm coordination: The demo shows all six arms working around a large battery pack or appliance module, suggesting tasks like parallel fastening, inspection and cable routing that would normally require multiple separate arms or operators as shown in the miro u summary.

• Stack context: This comes alongside CATL’s humanoid deployment and other Vision‑Language‑Action industrial pilots, indicating Chinese manufacturers are experimenting with different embodied form factors for the same goal—handling high‑mix, geometry‑varying tasks that rigid cells don’t handle well according to the catl deployment.

The design points toward a hybrid between a mobile base and a dense arm cluster, trading humanoid leg complexity for more hands and reach on each station.

Morgan Stanley tallies China’s humanoid patent surge and 6.5B‑robot forecast

China’s robot footprint (Morgan Stanley): A Morgan Stanley report summarized by commentators counts 7,705 humanoid robot patents filed in China over five years—about 5× the 1,561 in the US—alongside estimates that China accounts for 54% of global industrial robot installations and a projection of 6.5 billion robots worldwide by 2050, heavily weighted toward drones and home robots as shown in the patent summary.

• Patent signal: Patents are described as a rough proxy for distinct technical ideas rather than product quality, but the 5× gap suggests a broad R&D push across Chinese labs and manufacturers on humanoid and related platforms according to the patent summary.

• Install base: The same summary notes that China already leads in industrial robot deployments, with 54% of new installs, reinforcing the idea that the country’s factories are becoming "small teams running big systems" rather than labor‑heavy floors as elaborated in the jobs chart and patent summary.

• Long‑term forecast: Morgan Stanley’s 6.5B‑robot forecast breaks down to about 34% small drones and 29% home robots by 2050, implying that most embodied AI units will operate outside classical factory settings even as industrial deployments like CATL and Midea scale up as detailed in the patent summary.

These numbers frame the CATL, MIRO U and border‑control deployments as early instances within a much larger projected shift toward ubiquitous embodied systems.

Disney’s Spider‑Man robot executes 25‑meter autonomous stunt at Avengers Campus

Stunt robot (Disney Avengers Campus): At Disneyland’s Avengers Campus, an AI‑driven Spider‑Man robot now performs 25‑meter aerial launches, mid‑air flips and self‑correcting landings over a show stage with no human in the loop, raising questions about the future of stunt work as described in the spiderman description.

• Performance profile: The robot is shown being catapulted above the stage, executing multiple flips, adjusting attitude mid‑flight and landing on a platform before transitioning to a hero pose sequence, suggesting a combination of precise model‑based control and robust state estimation as shown in the spiderman description.

• Job‑replacement discourse: Posts ask whether stunt performers’ jobs are "gone too", placing this system in the same conversation as dancing humanoids and factory robots about which physical roles get automated first as noted in the spiderman description and dance reaction.

The deployment highlights that high‑risk, repeatable acrobatics under tight safety constraints are a natural early niche for embodied autonomy, before broader adoption in less scripted environments.

Unitree G1 humanoids pull concert flips as stage robots mature

G1 stage performances (Unitree Robotics): Unitree’s G1 humanoid robots are drawing attention for human‑like dance precision, first in studio clips where they match backup dancers’ timing and moves, and then on stage at Wang Leehom’s 30th anniversary concert in Chengdu where they perform synchronized backflips in costumes as shown in the dance reaction and concert performance.

• Control fidelity: Commentators note the robots’ actions, timing and poses are "almost perfectly aligned" with human performers, including coordinated flips and choreography that depend on tight whole‑body control rather than simple pre‑scripted motions according to the dance reaction.

• Perceived labor impact: Posts half‑jokingly suggest "background dancers seriously need to find alternative jobs", reflecting both the quality of the demo and rising concern about how far embodied AI will encroach on repetitive stage roles as discussed in the dance reaction.

• Broader deployment arc: These entertainment‑grade routines sit alongside patrol trials and industrial pilots for other Chinese robots, indicating that dynamic balance and motion control are being exercised in public‑facing, high‑pressure settings before being pointed at more utilitarian tasks as shown in the patrol video and catl deployment.

While this is still performance robotics, the same control stacks could carry over to logistics or inspection roles where tight timing and coordination around humans matter.

📑 Fresh papers: long‑context, diffusion‑LLMs, agent safety

A dense day for papers: sparse attention for long context, diffusion‑LLMs at 100B, agent timing/safety, egocentric data, and 4D video perception. These are research artifacts, not product launches.

Distributional AGI Safety reframes risk around patchwork agent economies

Distributional AGI Safety (Google DeepMind): A new Distributional AGI Safety paper argues that real risk will come from "patchwork AGI"—many sub‑AGI agents coordinating via tools and protocols—so safety work should govern agent economies rather than a single monolithic model as shown in the paper summary.

The authors propose virtual agentic sandbox economies, ranging from impermeable to semi‑permeable, where agents trade work under cryptographically enforced identity, append‑only audit logs, and reputation‑gated access, with circuit‑breaker mechanisms to slow or halt cascades during unstable behaviors detailed in the paper summary. Their framework outlines four layers of "defense in depth"—market design, agent‑side hardening, real‑time graph monitoring for emerging AGI‑like clusters, and external standards and liability regimes—to align incentives in multi‑agent systems before those networks reach super‑human capability, as shown in the paper summary.

New @GoogleDeepMind paper says AGI safety is about governing networks of agents, not just aligning 1 model. The authors argue today’s monolithic AGI framing misses “patchwork AGI”, where specialized sub-AGI agents coordinate via agent-to-agent (A2A) protocols. Instead of just Show more

1:50 PM · Dec 22, 2025

Read 5 replies

“Fixing It in Post” shows smaller, cleaner post‑training mixes can beat larger ones

Fixing It in Post (IBM & TUM): The "Fixing It in Post" study compares post‑training data mixtures like Tulu‑3‑SFT‑Mix and SmolTalk using Magpie‑tagged metadata, and finds that a new TuluTalk mix, 23% smaller than SmolTalk and 14% smaller than Tulu, can still outperform them on standard instruction‑following and chat benchmarks as shown in the paper abstract.

Their pipeline tags each conversation with task type, turn count, and answer quality using another LLM, then systematically filters and re‑weights examples before training the same base model across all mixes, so differences in performance arise solely from data quality and composition, according to the paper abstract. Results suggest that high‑quality, targeted post‑training data matters more than raw volume, especially when optimizing for specific behaviors like multi‑turn support or coding, and that smaller curated mixtures can save compute without sacrificing downstream scores, as shown in the paper abstract.

New IBM and Munich Univ paper shows that smarter post training data picking can make a language model stronger with less data. Their TuluTalk mix is 23% smaller than SmolTalk and 14% smaller than Tulu, yet scores better on standard tests. A large language model predicts the Show more

11:27 AM · Dec 22, 2025

Read 7 replies

“When Reasoning Meets Its Laws” proposes LoRe laws and benchmark for LRMs

When Reasoning Meets Its Laws (LoRe): The LoRe paper introduces "Laws of Reasoning" for Large Reasoning Models (LRMs), positing compute and accuracy laws that should scale roughly linearly with task complexity, and builds LoRe‑Bench to test whether models obey monotonicity and compositionality constraints as shown in the paper summary.

The authors argue that current reasoning LMs often violate intuitive laws—for example, sometimes doing better on a harder variant than on an easier base case—so LoRe‑Bench decomposes tasks into structured families where such violations can be measured systematically as detailed in the paper summary. Early experiments show prominent LRMs deviating from idealized laws in nontrivial ways, which the paper frames as both a diagnostic for overfitting and a guide for future architecture and training changes aimed at more stable, law‑like reasoning behavior according to the paper summary.

When Reasoning Meets Its Laws huggingface.co/papers/2512.17…

6:06 PM · Dec 22, 2025

4D‑RGPT targets region‑level 4D video understanding with new R4D‑Bench

4D‑RGPT (NVIDIA): The 4D‑RGPT paper presents a multimodal LLM tuned for region‑level 4D understanding (space + time), using a Perceptual 4D Distillation (P4D) pipeline to import 4D structure from an expert model and a new benchmark, R4D‑Bench, focused on depth‑aware dynamic scenes as shown in the paper summary.

4D‑RGPT aims to fix two gaps in current video MLLMs: limited temporal reasoning and lack of region‑conditioned prompts, so R4D‑Bench includes tasks where models must answer questions about specific moving regions over time rather than whole frames per the paper summary. The authors show 4D‑RGPT improving on prior 4D video QA baselines and demonstrate that distilling 4D representations into a language‑conditioned model yields better temporal coherence and region accuracy than pure 2D or clip‑level training via the paper summary.

Nvidia presents 4D-RGPT Toward Region-level 4D Understanding via Perceptual Distillation huggingface.co/papers/2512.17…

3:32 PM · Dec 22, 2025

Read 4 replies

FrontierMath shows Chinese open‑weight models ~7‑month lag on hardest tiers

FrontierMath (Epoch AI): New FrontierMath results benchmark several open‑weight Chinese models and find their Tier 1–3 performance lags top frontier models by roughly seven months, while on the hardest Tier 4 set only DeepSeek‑V3.2 (Thinking) answers 1/48 problems (~2%) correctly as shown in the frontiermath update.

Epoch notes that FrontierMath data are largely private, with OpenAI having exclusive access to all Tier 1–3 problems and most Tier 4, while the public portion and a shared OTIS Mock AIME benchmark are used to sanity‑check third‑party API evaluations via Fireworks and Together for data security per the data access note. On aggregate Tiers 1–3, GPT‑5.2 and Gemini 3 Pro sit in the mid‑30% accuracy range, while top open‑weight Chinese models like DeepSeek‑V3.2 and Kimi K2 Thinking cluster near 20%, reinforcing a still‑visible capability gap on competition‑level math, especially at the frontier tier, according to the tier performance chart.

Epoch AI

@EpochAIResearch

We benchmarked several open-weight Chinese models on FrontierMath. Their top scores on Tiers 1-3 lag the overall frontier by about seven months.

6:57 PM · Dec 22, 2025

264

Read 9 replies

Generative Adversarial Reasoner boosts math accuracy with step‑level critics

Generative Adversarial Reasoner (Johns Hopkins): The Generative Adversarial Reasoner framework trains a math LLM reasoner together with an LLM discriminator that scores short reasoning slices, turning those local signals into reinforcement‑learning rewards that encourage correct intermediate steps, not just correct final answers as shown in the paper abstract.

On the AIME 2024 benchmark, the authors report accuracy gains from 54.0 → 61.3 (+7.3 points) for one backbone and 43.7 → 53.7 (+10.0) for another, attributing improvements to the discriminator’s ability to reward locally valid algebra and penalize wrong turns even when the final numeric answer is wrong as shown in the paper abstract. After training, only the reasoner is used at inference time; the discriminator’s cost stays in training, making the method a candidate for general reasoning‑oriented RL without permanent dual‑model overhead per the paper abstract.

New Johns Hopkins University paper's Generative Adversarial Reasoner trains a math solving LLM with a critic, so the reasoner learns to stop making bad logic moves. It fixes the usual "only final answer gets graded" problem by teaching the model with step by step feedback, so it Show more

12:31 PM · Dec 22, 2025

101

Read 8 replies

Learning to Wait trains agents to sleep instead of spamming async tools

Learning to Wait (Tsinghua): The Learning to Wait paper shows that LLM agents can learn when to insert sleep(t) calls for asynchronous tools—rather than polling status in tight loops—by predicting wait times from tool semantics and in‑context examples in a simulated Kubernetes cluster, according to the paper overview.

In their setup, real tools start work in the background and expose only coarse statuses like PENDING or DONE; excess status checks incur penalties, as do confirmations that are too delayed, so the agent must trade off latency against token and context overhead per the paper overview. The authors report that, after training, several models converge on policies with about one status check per task over multi‑episode runs, cutting unnecessary polling while still confirming completion on time, which they frame as evidence that agents can internalize a crude "time sense" for external actions detailed in the paper overview.

New Tsinghua University paper teaches LLM agents, text models that write code, to predict waits in asynchronous tools and avoid extra checks. Several models reached 1 status check per task while reducing extra waiting over 12 episodes. Real tools start work in the background, Show more

2:24 AM · Dec 23, 2025

Read 4 replies

PhysBrain uses human egocentric video to teach physical intelligence

PhysBrain (multi‑institution): The PhysBrain work introduces an Egocentric2Embodiment pipeline that turns large‑scale human egocentric videos into structured supervision for robots, aiming to bridge static vision‑language models and physical intelligence without collecting vast robot datasets as shown in the paper overview.

Their Egocentric2Embodiment (E2E‑3M) dataset converts first‑person videos into multi‑level, schema‑driven VQA signals about object states, contacts, and long‑horizon changes, designed to train models that can reason about state transitions and contact‑rich manipulation, not just label frames per the paper overview. The authors argue that leveraging human head‑camera footage as a surrogate for robot experience lets embodied models learn perception and planning priors, with robots fine‑tuning later for morphology‑specific control via the paper overview.

PhysBrain Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence huggingface.co/papers/2512.16…

5:32 PM · Dec 22, 2025

SGI‑Bench probes scientific general intelligence across deep research workflows

Scientific General Intelligence (SGI‑Bench): A new benchmark for Scientific General Intelligence (SGI) defines it as the ability to autonomously conceive, investigate, and reason across disciplines, and introduces SGI‑Bench, a 1,000+ sample suite aligned with the Practical Inquiry Model’s phases: deep research, idea generation, dry/wet experiments, and experimental reasoning as shown in the sgi paper.

The authors report that current top LLMs show low exact‑match rates (10–20%) on deep research tasks, despite high code executability in dry experiments, and that many generated ideas lack feasibility or sufficient detail, suggesting a gap between today’s agentic tooling and the level needed for autonomous science as detailed in the sgi paper. They frame SGI‑Bench as a target for methods like test‑time RL and improved tool integration, emphasizing that progress on SGI requires evaluating full research workflows rather than isolated QA or coding tasks, according to the sgi paper.

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows huggingface.co/papers/2512.16…

5:32 PM · Dec 22, 2025

Read 2 replies

87‑page survey maps techniques and tradeoffs for Small Language Models

Small Language Models survey (Penn State et al.): A comprehensive 87‑page survey defines Small Language Models (SLMs) as those between the emergent‑ability threshold and resource‑constrained upper bounds, cataloging architectures, training tricks, applications, and trustworthiness concerns across this size band, as shown in the survey overview.

The authors argue SLMs are increasingly favored for on‑device, low‑latency, and domain‑specific deployments, especially in settings where privacy constraints or edge hardware make giant LLMs impractical, and highlight their role as components in multi‑agent systems where many small models collaborate according to the survey overview. The survey reviews methods like distillation, quantization, retrieval‑augmentation, and modular fine‑tuning, and devotes a section to safety and evaluation practices tailored to SLMs rather than copying LLM‑centric benchmarks wholesale, as detailed in the survey overview.

elvis

@omarsar0

Great survey on small language models. This 87-page survey closely examines small language models (SLMs), defined as models sized between the minimum threshold for emergent abilities on specialized tasks and the maximum sustainable under resource constraints. SLMs are really Show more

5:31 PM · Dec 22, 2025

408

Read 17 replies

🚀 Other models: MiniMax M2.1 rolls into stacks

Non‑feature model updates with direct relevance to coding/agents. Excludes GLM‑4.7 (Feature). Focus on availability, early usage, and pricing/adoption signals.

MiniMax M2.1 officially launches as 10B OSS coding and agent model

MiniMax M2.1 (MiniMax): MiniMax has moved M2.1 from early access into an official release, positioning it as a 10B‑activated open‑source coding and agent model with strong evals in both SWE‑bench Multilingual (72.5%) and the new VIBE‑bench UI test (88.6%) as shown in the launch details; the team calls it "the most powerful OSS model for the agentic era" and says a full open‑weights drop will follow in two days per the launch blog. This is framed as a follow‑up to early access, where M2.1 first appeared as a design‑savvy coder rather than a fully benchmarked release.

Benchmarks and positioning: MiniMax highlights M2.1’s 72.5% score on SWE‑bench Multilingual and 88.6% on its newly open‑sourced VIBE‑bench, claiming it beats closed models like Gemini 3 Pro and Claude Sonnet 4.5 on those specific tests as shown in the launch details and launch blog; the company also emphasizes M2.1’s strength on long‑horizon, tool‑heavy "agent" workflows, pitching it as a general "Digital Employee" rather than just a code autocomplete. The post stresses that these numbers come from a 10B‑parameter active MoE slice rather than a giant dense model, which matters for cost and deployment, but external replication of the evals has not yet been shared in these threads.

The release sets M2.1 up as one of the main open competitors in multilingual coding and UI‑heavy development tasks, with the next concrete milestone being the promised open‑weights drop and independent confirmation of the VIBE‑bench leadership claims.

MiniMax (official)

@MiniMax_AI

MiniMax M2.1 is officially live🚀 Built for real-world coding and AI-native organizations — from vibe builds to serious workflows. A SOTA 10B-activated OSS coding & agent model, scoring 72.5% on SWE-multilingual and 88.6% on our newly open-sourced VIBE-bench, exceeding leading Show more

5:27 AM · Dec 23, 2025

2.6K

Read 116 replies

MiniMax M2.1 lands on Ollama and Cline as a general coding backend

Ecosystem adoption (MiniMax M2.1): M2.1 is quickly rolling into common developer stacks, with Ollama, Cline and Code Arena all adding support in the same week; this extends its reach well beyond MiniMax’s own UI as indicated in the ollama support, cline announcement , and arena note.

• Ollama runtime: Ollama now exposes minimax-m2.1:cloud, describing the updated model as performing "much better" across Rust, Java, Golang, C++, Kotlin, Objective‑C, TypeScript and JavaScript as noted in the ollama support and ollama model page; this gives local‑style workflows access to M2.1 while still hitting MiniMax’s cloud backend.

• Cline integration: The Cline team added M2.1 as a first‑class provider, calling out a 200k context window, 128k max output and a MoE design with 10B active / 230B total params, and emphasizing improved code quality, instruction following, and cleaner reasoning across refactors, feature work, bugfixes and DevOps scripting as mentioned in the cline announcement.

• Competitive eval interest: MiniMax notes that M2.1 has entered the Code Arena eval suite with live WebDev tasks, though results are still pending at this stage according to the arena note.

These integrations mean M2.1 can now be swapped into existing agent harnesses and CLIs with minimal wiring, letting teams compare its behavior directly against Claude, GPT‑5.x and Gemini in real codebases rather than only on MiniMax’s own platform.

ollama

@ollama

.@minimax_ai's M2.1 model is now available on Ollama! ollama run minimax-m2.1:cloud It's now updated to perform much better across Rust, Java, Golang, C++, Kotlin, Objective-C, TypeScript, and JavaScript. more information👇👇👇 Show more

3:20 AM · Dec 23, 2025

194

Read 6 replies

MiniMax Agent showcases M2.1 as a "Digital Employee" across 10+ workflows

MiniMax Agent (M2.1): MiniMax is leaning hard into the "Digital Employee" framing by wiring M2.1 into its MiniMax Agent product, advertising long‑horizon tool use, browser automation and office workflows such as multi‑step instructions and reasoning as highlighted in the agent upgrade; the agent UI now defaults to M2.1 in "Lightning" mode for these tasks.

• General‑purpose workflows: A "10 wild examples" page shows M2.1 handling everything from guided meditations and fact‑checking document citations to Taihu self‑drive trip planning, meme‑coin trend scans and dual‑moving‑average portfolio backtests, all inside MiniMax Agent’s task interface as seen in the example gallery and agent gallery.

• Productivity and coding: The launch copy describes M2.1 as a multilingual coding expert and long‑horizon tool user that can execute browser‑based tasks with autonomous planning as explained in the agent upgrade; separate threads highlight use as a "Digital Employee" for office workflows like email drafting and spreadsheet logic, in addition to classic coding roles.

• Design and app building demos: Community posts show M2.1 building a "Notion Lite" editor in a single prompt inside MiniMax Agent and shipping it as a playable web app, as well as generating full UI concepts in one shot, per the notion lite demo; another thread showcases M2.1 "vibing" custom art pieces on the same agent stack as shown in the notion lite demo and design gallery.

Taken together, these examples flesh out MiniMax’s earlier promises about M2.1’s agentic capabilities by showing it running real multi‑step tasks in the wild, rather than only synthetic code benchmarks.

TestingCatalog News 🗞

@testingcatalog

BREAKING 🚨: MiniMax M2.1 by @minimax_ai is now available on MiniMax Agent. What's new? 👀 - Multilingual Coding: Expert performance in multi-language development, excelling in test case generation, code optimisation, and code review. - Agentic Tool Use: Reliably executes Show more

Skyler Miao

@SkylerMiao7

Vibe your own masterpiece with M2.1 🎨 Built and played with on MiniMax Agent: agent.minimax.io Check out these creations 👇

3:44 PM · Dec 22, 2025

254

Read 6 replies

🛡️ Safety hardening and legal friction

Security/safety items focused on agent misuse defenses and scraping enforcement. Not general policy; both items have direct impact on AI agent/web operations.

OpenAI hardens ChatGPT Atlas browser agent against prompt injection with RL red‑teaming

ChatGPT Atlas (OpenAI): OpenAI details a new security pipeline for its ChatGPT Atlas browser agent that uses reinforcement‑learning‑based automated red‑teaming to discover and patch prompt‑injection attacks before they are widely exploited as shown in the OpenAI security note and OpenAI blog. The post describes an adversarially trained classifier that runs inside Atlas’s browser mode to detect untrusted page content trying to override system instructions, plus a continuous loop where RL agents search for new jailbreak patterns, engineers add mitigations, and the model is retrained and redeployed on short cycles; this moves Atlas closer to a traditional vulnerability‑management model for web agents rather than a one‑off prompt hardening exercise, which directly affects anyone relying on browser automation for workflows that touch sensitive internal data or credentials.

Adam.GPT

@TheRealAdamG