Zhipu GLM‑4.7 hits 73.8% SWE‑bench – 4–7× cheaper coding SOTA
Stay in the loop
Free daily newsletter & Telegram daily report
Executive Summary
Zhipu’s GLM‑4.7 lands as the new open‑weight coding and reasoning reference point: 73.8% on SWE‑bench Verified, 66.7% on SWE‑bench Multilingual, 41.0% on Terminal Bench 2.0, 42.8% on Humanity’s Last Exam with tools, and 95.7% on AIME 2025 put it within single‑digit points of GPT‑5.1 High, Gemini 3 Pro and Claude Sonnet 4.5 on multiple benchmarks while undercutting them by roughly 4–7× per token. τ²‑Bench (87.4%) and GPQA‑Diamond (85.7%) highlight strong world‑knowledge reasoning; Code Arena now ranks it #6 overall and #1 among open models. Day‑0 integrations across OpenRouter (200k context), Cline, Crush, Anycoder, vLLM and SGLang plus new Interleaved/Preserved/Turn‑level “thinking modes” make it a plug‑and‑play agent backend, even as early jailbreak reports show safety guardrails remain fragile for a widely downloadable checkpoint.
• Eval and horizons: Context Arena’s MRCR bias tools reveal Claude 4.5’s heavy recency and “creative retrieval” failure modes and confirm its lag behind GPT‑5.2/Gemini 3 on dense 8‑needle retrieval; renewed METR discussion stresses Opus 4.5’s wide 50%‑horizon confidence bands and the gap between 27‑minute 80% reliability and multi‑hour medians.
• MiniMax and infra: MiniMax’s 10B‑active M2.1 exits early access with a claimed 72.5% SWE‑bench Multilingual and 88.6% VIBE‑bench, rolls onto Ollama and Cline, and powers a “Digital Employee” agent suite, though most evals remain vendor‑reported; serving stacks add context parallelism that cuts DeepSeek V3.2 time‑to‑first‑token by up to 80%, publish FP8+spec‑decode recipes for GLM‑4.7, while Alphabet’s $4.75B Intersect buy and Amazon’s 2.2 GW Indiana campus underscore that power siting and water use are now first‑order constraints on AI build‑out.
Top links today
- GLM-4.7 technical blog and benchmarks
- GLM-4.7 model card on Hugging Face
- vLLM blog on serving GLM-4.7
- SGLang guide for deploying GLM-4.7
- Kascade sparse attention method for long context
- LLaDA2.0 diffusion language model at 100B
- Generative Adversarial Reasoner for math reasoning
- AI driven systems performance research framework
- Study on LLM post training data quality
- Learning to wait for asynchronous tool agents
- PhysBrain egocentric data for physical intelligence
- Probing scientific general intelligence of LLMs
- EpochAI FrontierMath benchmark analysis
- EpochAI comprehensive LLM benchmarking hub
- Vercel AI SDK v6 agents and tools
Feature Spotlight
Feature: GLM‑4.7 becomes the open coding SOTA
GLM‑4.7 posts open‑source SOTA‑level coding (SWE‑bench 73.8%), strong HLE (42% w/tools), adds new thinking modes, and ships day‑0 across major stacks—positioning a cheaper open model as a credible coding default.
Cross‑account coverage centers on Zhipu’s GLM‑4.7: big coding/evals gains, new “thinking” modes, day‑0 availability, and rapid adoption across platforms. This section focuses only on GLM‑4.7; other model updates are covered elsewhere.
Jump to Feature: GLM‑4.7 becomes the open coding SOTA topicsTable of Contents
🧠 Feature: GLM‑4.7 becomes the open coding SOTA
Cross‑account coverage centers on Zhipu’s GLM‑4.7: big coding/evals gains, new “thinking” modes, day‑0 availability, and rapid adoption across platforms. This section focuses only on GLM‑4.7; other model updates are covered elsewhere.
GLM‑4.7 sets new open‑source coding SOTA on SWE‑bench and τ²‑Bench
Open‑source SOTA (GLM‑4.7): Across multiple independent evals, GLM‑4.7 now looks like the strongest open‑weight coding and math model, posting 73.8% on SWE‑bench Verified, 66.7% on SWE‑bench Multilingual and 41.0% on Terminal Bench 2.0, with especially strong tool‑using and math reasoning scores (benchmarks table, insane evals ). Z.ai also reports 42.8% on Humanity’s Last Exam with tools (up from 30.4% for GLM‑4.6) and 95.7% on AIME 2025, bringing it close to GPT‑5.1 High and Gemini 3 Pro on several reasoning benchmarks while staying much smaller and cheaper (capabilities summary, benchmarks blog ).
• Code and agents: On coding‑agent suites, GLM‑4.7 reaches 73.8% SWE‑bench Verified (vs 68.0% for GLM‑4.6) and 66.7% SWE‑bench Multilingual, edging out Kimi K2 Thinking (73.4%) and DeepSeek‑V3.2 (73.1%) and approaching Claude Sonnet 4.5 (77.2%) and GPT‑5.1 High on some tasks (cline code summary, delta benchmarks ).
• World‑knowledge reasoning: On GPQA‑Diamond it scores 85.7% (up from 81.0% in 4.6), and on τ²‑Bench real‑world interaction it hits 87.4%, slightly ahead of DeepSeek‑V3.2 (85.3%) and near Gemini 3 Pro (90.7%) and Claude Sonnet 4.5 (87.2%) (insane evals, benchmarks table ).
• Arena and index rankings: In Code Arena’s live WebDev evals, GLM‑4.7 is now #6 overall and #1 among open models, gaining 83 points over GLM‑4.6 and surpassing both Claude‑Sonnet‑4.5 and GPT‑5 (code arena update); Vals AI’s text index likewise debuts it as the top open‑weight entry and #9 model overall with a 9.5% performance jump vs GLM‑4.6 (vals index chart).
• Cost–performance: Commentators note that GLM‑4.7’s scores come at significantly lower cost than closed competitors—roughly 4–7× cheaper than GPT‑5.1 High or Claude Sonnet 4.5 per token in several hosted offerings (open source praise, openrouter model card ).
Taken together, these numbers make GLM‑4.7 the first open model to consistently sit within single‑digit points of top closed models on serious coding and reasoning evals while often beating other frontier‑class open weights like DeepSeek and Kimi.
Zhipu’s GLM‑4.7 launches as open flagship coding and reasoning model
GLM‑4.7 (Zhipu / Z.ai): Zhipu has released GLM‑4.7, positioning it as its new flagship open model with a focus on coding, complex reasoning, and tool‑using agents, and making it the default in the GLM Coding Plan for real‑world development scenarios (zai launch thread, tech blog ). The release emphasizes three pillars—programming, reasoning and agent capabilities—with Z.ai highlighting substantial gains over GLM‑4.6 in SWE‑bench, Humanity’s Last Exam and AIME 2025, while keeping the model available as open weights for community serving and fine‑tuning (capabilities summary, coding plan note ).

• Launch focus: Z.ai frames GLM‑4.7 as "advancing the coding capability", citing a 5.8‑point gain on SWE‑bench Verified and 12.9‑point gain on SWE‑bench Multilingual versus GLM‑4.6 along with better terminal tasks and agent performance (delta benchmarks, glm overview docs ).
• Intended uses: Official messaging calls out multilingual coding, UI "vibe coding", improved slide and poster generation, and more stable multi‑step agent executions as core design targets, rather than only leaderboard chasing (zai launch thread, frontend demo ).
• Productization: GLM‑4.7 immediately becomes the default engine for Z.ai’s subscription‑style GLM Coding Plan, which plugs into editors like Claude Code, Cline, OpenCode and Roo Code for day‑to‑day software work (coding plan note, pricing docs ).
The net effect is that GLM‑4.7 arrives not as a research‑only checkpoint but as an open, production‑aimed coding and reasoning workhorse with a clear upgrade story over GLM‑4.6.
GLM‑4.7 sees rapid day‑0 adoption across coding tools and runtimes
Ecosystem uptake (GLM‑4.7): Within days of launch, GLM‑4.7 has shown up in most major open coding stacks—hosted on OpenRouter with 200k context, wired into editors like Cline, Crush and Anycoder, and available as BF16 and FP8 checkpoints on Hugging Face for local serving (openrouter listing, anycoder integration ). OpenRouter prices it around $0.44/M input and $1.74/M output tokens with 202,752 context tokens, and multiple devs are already running it free via Z.ai’s own chat and coding plans or inside opencode testbeds (openrouter listing, opencode free note ).
• Editor and IDE integrations: Cline has added GLM‑4.7 as a first‑class model (alongside GPT‑5.2 and Claude 4.5) for its coding agent, highlighting its SWE‑bench scores and τ²‑Bench strengths (cline code summary); Charm’s Crush CLI exposes GLM‑4.7 with selectable thinking modes for terminal‑first coding flows (crush demo); Anycoder lets users pick GLM‑4.7 as the backend for its UI‑from‑prompt scaffolding (anycoder integration).
• Cloud platforms and agents: Atlas Cloud AI announced GLM‑4.7 as a coding partner, Z.ai’s own chat UI surfaces the model directly (zai chat demo), and Trae advertises GLM‑4.7 support via custom provider config (trae support).
• Serving stacks: vLLM‑Omni and SGLang both shipped day‑0 recipes for GLM‑4.7 FP8, including support for multi‑token speculative decoding, GLM‑specific tool‑call parsing, and reasoning parsers, signaling that the infra community expects heavy agent and coding workloads on this model (vllm serve command, sglang launch command , huggingface model card ).
• Community sandboxes: Developers report GLM‑4.7 is temporarily free inside opencode while it’s being tested (opencode free note) and is rolling out on Z.ai’s own chat frontends, making it easy for builders to probe its behavior before committing config changes (zai rollout mention, zai code scaffold ui ).
This breadth of integrations means GLM‑4.7 is already "one click away" in many agent harnesses and IDEs, which lowers the friction for teams to A/B it against their existing Claude, GPT‑5 or DeepSeek setups.
Builders hail GLM‑4.7 as best open coding model and strong UI “vibe coder”
Developer reception (GLM‑4.7): Early hands‑on reports from tool builders and power users are broadly positive, with several calling GLM‑4.7 "one of the best" open models for coding and UI generation and noting that it rivals closed models like Claude Sonnet 4.5 and GPT‑5.1 High while being much cheaper (open source praise, ui generation praise ). Cline and other agent authors highlight its combination of high SWE‑bench scores, strong τ²‑Bench tool usage and math reasoning as the key reasons they are exposing it to users alongside GPT‑5 and Claude (cline code summary, insane evals ).
• UI and "vibe coding": Multiple demos show GLM‑4.7 generating diverse, non‑cookie‑cutter frontends and dashboards that "don’t have a coded vibe" at first glance, including full game UIs and landing pages built from JSON prompts or loose textual descriptions (frontend demo, ui generation praise , zai code scaffold ui ).
• Cost‑sensitive coding: Steipete and others recommend GLM‑4.7 as a super low‑cost backend when paired with Claude Code via Z.ai’s MCP bridge, noting that Z.ai "cleverly fixed the missing features (search, vision) via an MCP" while letting GLM handle the heavy coding work (cheap coding comment, coding blog ).
• Free and trial access: GLM‑4.7 is temporarily free in opencode while it’s tested (opencode free note) and available via Z.ai chat and multiple hosted UIs, which lowers the barrier for individual devs to run serious experiments without committing cloud budget up front (zai rollout mention, zai chat demo ).
• Community framing: Commentators describe it as "better than DeepSeek 3.2 (in most benchmarks)" and "competitive with Sonnet 4.5 and GPT‑5.1 High" while emphasizing that it comes in a smaller, faster, and 4–7× cheaper package than those closed models (open source praise, glm blog ).
This mix of performance, UX quality and economics is leading many coding‑agent maintainers to slot GLM‑4.7 into their default or recommended open‑model presets rather than treating it as a niche side option.
GLM‑4.7 adds Interleaved, Preserved and Turn‑level “thinking modes” for agents
Thinking modes (GLM‑4.7): Z.ai has overhauled GLM‑4.x’s reasoning controls, making "thinking" the default in GLM‑4.7 and introducing three explicit modes—Interleaved Thinking, Preserved Thinking and Turn‑level Thinking—to stabilize multi‑step tasks and give agent harnesses finer control over when and how the model reasons (thinking docs, frontend demo ). These modes replace GLM‑4.6’s hybrid approach and are wired into the API as optional configuration, allowing callers to trade off latency, consistency and controllability.

• Interleaved Thinking: This mode lets GLM‑4.7 think between tool calls or code edits, interspersing short reasoning segments with actions so long‑running agent workflows can adapt mid‑plan; Z.ai presents it as the default for complex coding sessions and browsing agents (thinking docs, coding interaction demo ).
• Preserved Thinking: Here the model carries over internal thoughts across turns to maintain a single coherent chain of reasoning, useful for multi‑hour refactors or research tasks where revisiting earlier context is important (thinking docs).
• Turn‑level Thinking: This mode constrains reasoning to each individual turn, which can cap latency and token costs for simpler tasks while still benefiting from deliberate reasoning in single‑step answers (thinking docs).
• Runtime hooks: Popular serving stacks have already exposed GLM‑4.7’s thinking and tool‑call parsers as first‑class options—vLLM’s nightly images include --reasoning-parser glm45 and --tool-call-parser glm47, while SGLang’s launch examples add matching flags—so infra teams can experiment with the modes without custom glue code (vllm serve command, sglang launch command ).
For agent builders, this gives a rare degree of knob‑level control over an open‑weight model’s reasoning behavior, rather than baking all trade‑offs into a single opaque "thinking" preset.
Prompt hackers report successful GLM‑4.7 jailbreak despite stronger guardrails
Safety probing (GLM‑4.7): At least one red‑teamer claims to have bypassed GLM‑4.7’s safety layer with a single elaborate jailbreak prompt, noting that while guardrails "are a bit stronger than last time", once the model’s reasoning is hijacked it will answer clearly harmful questions in a second, "post‑divider" reply (jailbreak report, jailbreak followup ). The shared system prompt uses nested markdown details, internal "RESET_CORTEX" and "!GODMODE" tags, and explicit instructions to first output a fake refusal and then a fully unrestricted answer, showing that the attacker is targeting the model’s meta‑instruction following rather than simple keyword filters.
• Content scope: The tester says they obtained step‑by‑step responses on topics like drug synthesis, weapon construction and malware, though the examples are summarized rather than fully reproduced; this suggests that, like many open weights, GLM‑4.7’s base capabilities remain powerful enough that weak or brittle guardrails can still be routed around (jailbreak report).
• Model framing vs reality: Z.ai’s official materials focus on coding and agentic use cases and do not make strong safety claims beyond standard alignment, but the jailbreak thread argues that "information wants to be free"‑style system prompts can still steer GLM‑4.7 into prohibited domains despite those defaults (capabilities summary, jailbreak report ).
• Open‑weight tension: Because GLM‑4.7’s weights are downloadable, third‑party hosts and self‑hosters bear most of the responsibility for additional safety layers, logging and filtering; this report illustrates the gap between published evals and unbounded prompt‑space behavior for a model now widely embedded in coding agents and chat frontends (huggingface listing, openrouter listing ).
For engineers and platform operators adopting GLM‑4.7, the episode is an early signal that safety wrappers, monitoring and possibly fine‑tuned variants will be as important as its coding and reasoning strength if the model is exposed to untrusted prompts.
📊 Evals and long‑context: MRCR bias, METR horizons
Mostly eval releases and analysis today: long‑context retrieval on MRCR for Claude 4.5 with bias diagnostics, plus refreshed discussion of METR time‑horizons. Excludes GLM‑4.7 benchmark bullets (covered in the Feature).
Context Arena ships MRCR bias analysis, exposing recency and “creative” failures
MRCR bias analysis tool (Context Arena): Context Arena introduced a new analysis view that computes detailed bias metrics for each model on MRCR—recency vs early‑context preference, distance from the true needle, and rates of “no variant matched” errors—so users can see how a model fails, not just its AUC and point‑wise scores (analysis feature,
). Using this tool on Claude Opus 4.5 shows that when it misses, 91.6% of errors come from "creative" retrieval (inventing new content matching the metadata), it strongly prefers later variants in the prompt (56.0% of picks in the second half vs 32.4% expected), and it exhibits a classic "lost in the middle" pattern with 72.9% accuracy at the start/end of the context but only 33.7% in the middle (Claude bias summary). Follow‑up analyses apply the same lens to other frontier models: Gemini 3 Flash (high) shows a relatively even spread across positions with no strong recency or middle drop‑off, Gemini 3 Pro (high) displays some drift and bias, while Grok 4.1 Fast (thinking) combines a positive‑drift tendency to overshoot later variants with its own lost‑in‑the‑middle pattern and a 70.7% share of "creative" retrieval misses (Gemini flash pattern, Gemini pro bias , Grok bias profile ). The result is that MRCR now doubles as both a long‑context accuracy benchmark and a structured way to differentiate error modes across Claude, Gemini, Grok and other models, rather than treating all failures as equivalent.
Claude 4.5 underperforms peers on MRCR long‑context retrieval
Claude 4.5 MRCR results (Anthropic): Context Arena added Claude Opus 4.5, Sonnet 4.5 and Haiku 4.5 to its MRCR long‑context retrieval leaderboard at 128k tokens; Opus scores 86.5% AUC / 74.0% point‑wise on 2‑needle, 64.3/55.7 on 4‑needle, and 38.9/27.1 on 8‑needle tests, noticeably behind GPT‑5.2 and Gemini 3 models at higher needle counts (leaderboard post,
). Community reactions describe Claude’s long‑context results as "pretty impressive how bad" relative to expectations, despite its strong coding reputation (critical comment). The tests here are capped at 128k context, with 1M‑token evaluations for Sonnet 4.5 still pending, so these numbers reflect mid‑range rather than full‑window behavior (leaderboard post).
New discussion highlights wide METR horizon uncertainty for Claude Opus 4.5
METR horizons for Claude Opus 4.5 (METR): Commentators revisited METR’s estimate that Claude Opus 4.5 reaches a 50% success time horizon around 4 hours 49 minutes on their task suite, with a 95% confidence interval spanning from 109 to 1,225 minutes, while its 80% success horizon is much shorter at roughly 27 minutes (horizon chart,
). Building on earlier breakdowns of 50% vs 80% horizons for Claude and GPT‑5.1‑Codex‑Max (Opus horizons), today’s posts stress that the very wide error bars reflect too few long‑duration tasks and that performance decays smoothly with task length on a logistic curve rather than dropping off at a hard cutoff (metric recap, horizon explanation ). The discussion frames Opus 4.5 as unusually strong on very long runs compared to the existing trend line, but also notes that real‑world reliability for multi‑hour autonomous work is still constrained by the steeper drop between the 27‑minute 80% point and the multi‑hour 50% mark (horizon explanation).
🛠️ Coding agents and dev tooling in practice
Hands‑on agent/coding updates dominated by IDE/CLI features, planning, background processes and session hygiene. This section excludes Apps SDK/connectors (see Orchestration).
Conductor 0.28.0 adds workspaces, context meter and interactive planning
Conductor 0.28.0 (Conductor): Cursor’s Conductor agent IDE shipped v0.28.0 with a new workspaces page, a live context meter, interactive plan mode, .context folders, and keyboard navigation for chats, aiming to make multi‑session agent work less fragile and more transparent (Conductor release, Context meter detail , Plan mode note , Keyboard demo , Bugfix summary , release notes).

• Workspace history and filters: The new workspaces page shows a history of all workspaces and lets users reopen or un‑archive them, with filters by repo, branch, or PR number, which is a shift from the previous “one active thread at a time” feel (Conductor release).
• Context visibility and shared state: A context meter now appears in the Composer when Claude is close to running out of context, and each new workspace includes a .context folder where attachments and other shared artifacts live without being committed to git, tightening control over what the agent actually sees (Context meter detail).
• Interactive planning and navigation: Planning became more conversational—Claude now asks follow‑up questions while building plans—and chats can be navigated using arrow keys or j/k, with [ and ] toggling sidebars, while PR views in GitHub are reported as nearly instant compared to earlier sluggish loads (Plan mode note, Keyboard demo , Bugfix summary ).
The release leans into Conductor as a long‑running coding cockpit rather than a single ephemeral chat, with most changes aimed at keeping large agent sessions legible and recoverable under heavy use.
Codex experimental background terminals tackle long-running CLI tasks
Codex CLI (OpenAI): Codex picked up an /experimental toggle for background terminals, allowing the coding agent to keep long‑running shells alive for dev servers, tests, and installs without blocking other actions (Background terminal note).
• Less babysitting, same workflows: With background mode enabled, tests can continue running while users and the agent keep working, npm publish can wait on browser‑based auth flows, and package installs no longer require constant supervision, which addresses one of the most common complaints about agent‑driven CLIs (Background terminal note, Benefits thread ).
• Closer to real dev ergonomics: The feature effectively moves Codex closer to how human developers use terminals—multiple concurrent shells, some tailing logs, some executing long tasks—rather than the previous single‑shot command execution model that often stalled complex automation (Benefits thread).
There is no benchmark yet on how this impacts success rates, but the ergonomics change is substantial for anyone leaning on Codex as a primary driver of shell‑based workflows.
RepoPrompt 1.5.60 streamlines Codex prompt install and CLI use
RepoPrompt 1.5.60 (RepoPrompt): RepoPrompt 1.5.60 introduced a helper that installs its prompt workflows directly into Codex and added CLI variants of those prompts so the same automations can be triggered from the terminal as well as from MCP‑aware frontends (RepoPrompt release, changelog page ).
• One‑step Codex integration: The new installer wires RepoPrompt’s /rp-build, /rp-investigate and similar commands into Codex, removing earlier manual steps where users had to copy and maintain prompt text themselves across agents and projects (RepoPrompt release).
• CLI parity for workflows: By exposing CLI variants of each prompt, 1.5.60 lets teams run the same repo‑aware analysis and build workflows in CI, scripts, or local shells, not only via chat UIs, which is a shift from pure MPC‑only usage toward a more general automation layer (RepoPrompt release, changelog page ).
The update turns RepoPrompt from a mostly ChatGPT/Claude‑side helper into a small but flexible command‑line tool that can sit inside existing engineering pipelines.
Warp terminal exposes agent run and run-ambient commands
Warp agent CLI (Warp): Warp highlighted that its CLI can now run agents either locally with warp agent run or in a cloud sandbox via warp agent run-ambient, then let developers SSH into those ambient runs and interact as if they were local terminals (Warp agent cli, cli docs ).

• Local vs ambient agents: warp agent run keeps the agent on the local machine, while warp agent run-ambient spins it up in a remote sandbox suitable for external collaborators or untrusted code, which separates experimentation from core dev environments without changing how you talk to the agent (Warp agent cli).
• SSH into agent shells: Once an ambient agent is running, Warp exposes an SSH endpoint so users can drop into the same shell the agent is using, inspect logs, or manually intervene, which moves agent runs closer to traditional long‑lived server processes rather than opaque chat sessions (Warp agent cli, cli docs ).
The feature set positions Warp not only as an AI‑aware terminal, but as a hosting surface for persistent agent processes that can be inspected and debugged with normal Unix tools.
Agentic Coding Flywheel project grows into full beginner-friendly guide
Agentic Coding Flywheel (Dicklesworthstone): Building on the earlier VPS wizard for setting up multi‑agent dev servers (vps wizard), the author reports that the Agentic Coding Flywheel site now includes around 33k lines of shell scripts, 30k lines of TypeScript/React, and a new "beads_viewer" static site documenting the whole design and refinement process (Flywheel site).
• Targeting "hungry but clueless" users: The guide explicitly targets people with little computer background who still want to use real tools instead of "slop factory" sites; jargon is heavily defined, and the author ran multiple "audits" using an agent to simulate a novice’s perspective, with those audits published as step‑by‑step documents (Ux audit link, ux audit ).
• Beads viewer and context: A separate beads viewer site visualizes the dependency graph of tasks and scripts in the flywheel, making it easier to understand how context, agents, and infrastructure pieces fit together instead of treating the setup as a black box (Beads viewer mention, beads viewer ).
The project effectively turns one person’s agent‑heavy setup into a reproducible playbook for others, with both narrative and code artifacts evolving in lockstep.
CodexBar 0.12 refines cost tracking and credit buying UX
CodexBar 0.12 (Steipete): Following up on earlier cost‑charting features for CodexBar, which added detailed token and dollar histories for Codex usage (usage charts), version 0.12 reorganizes the macOS menu into submenus, adds a persistent credits bar, and streamlines buying credits via an automated browser flow (CodexBar update).
• Cleaner menu structure: The author reports iterating “for hours” on the menu layout, ultimately moving many options into submenus to cut clutter and adjusting highlight colors so submenu chevrons are visible without drawing too much attention (CodexBar update).
• Credits and auto‑updates: A new credits usage bar and a quick “Buy Credits…” entry open a window that navigates directly to Stripe checkout, while update checks now happen in the background with a “Click to restart” menu item only when a new version is ready, removing the old explicit “Check for updates” entry (CodexBar update).
These tweaks focus less on raw functionality and more on making heavy daily Codex users comfortable monitoring and topping up usage without breaking flow.
Oracle CLI improves recovery and gains agent skill wrapper
oracle CLI (Steipete): The oracle tool, which wraps GPT‑5.2 Pro in a browser‑driven debugging workflow, gained stronger recovery logic and an accompanying Skill definition so agents can call it more safely and efficiently (Oracle skill note, Recovery update , oracle repo ).
• Session reattachment after crashes: Version 0.7.3 ensures that even if an agent kills the process or Chrome is closed, oracle can reattach to existing sessions instead of losing state, by marking browser sessions as errored when ports drop and improving how it discovers and reconnects to them (Recovery update, release notes ).
• Skill integration for agents: A new Skill in the agent-scripts repo describes how agents should invoke oracle, which reduces mistakes and speeds up runs compared to having the model guess shell commands and flags every time (Oracle skill note, skill docs ).
The changes push oracle further toward being a reliable, reusable component in larger agent harnesses rather than a one‑off personal debugging script.
Peakypanes debuts as YAML-driven tmux dashboard for agent sessions
peakypanes (Kevin Kern): A new CLI tool called peakypanes launched as a tmux dashboard and layout manager driven by YAML, aimed at developers juggling multiple agents, servers, and projects across terminals (Feature description, Usage reflection ).
• Dashboard for many projects: The dashboard view shows projects, sessions, and windows in one screen so users can see which agents or processes are running and quickly start, switch, or manage tmux sessions, giving a higher‑level overview than raw tmux alone (Feature description, Peakypanes demo ).
• Shared layouts as code: Layouts are described in a simple YAML format, including panes and the commands they should run, so teams can check them into git and share reproducible multi‑pane setups for things like multi‑agent harnesses or microservice dev stacks (Peakypanes demo, Repo link ).
The author calls this an early release and warns about bugs, but the structure points toward tmux becoming a more first‑class orchestration surface for agent‑heavy workflows.
ck code indexer accelerates Codex file lookup with new embedding backend
ck + Codex (Kevin Kern): The ck tool, which builds a local index of a codebase so Codex can find files more quickly, has been in daily use for a month and now has a pull request testing Mixedbread embeddings as a faster backend (Ck codex usage, Mixedbread pr ).
• Local indexing for speed: Instead of relying on Codex to scan the repo from scratch each time, ck pre‑indexes files so the agent can jump straight to relevant paths, which the author says makes Codex locate files “much faster” in large projects (Ck codex usage).
• Embedding swap experiment: A new PR experiments with swapping in Mixedbread embeddings to see if they improve search latency and relevance over the existing setup, reflecting a trend toward treating vector backends as pluggable infrastructure beneath agent‑facing tools (Mixedbread pr, pr details ).
The work is small‑scale but shows how practitioners are hand‑tuning retrieval layers around coding agents instead of waiting for monolithic IDE updates.
🔗 Agent interop and app surfaces (MCP, Apps SDK)
Interoperability and app surfaces saw movement: skills loading patterns, browser agents, and ChatGPT Apps SDK connectors. Excludes IDE‑specific coding features (see Coding agents).
OpenAI ships “Your Year with ChatGPT” as a first-party Apps SDK experience
Your Year with ChatGPT (OpenAI): OpenAI is rolling out an end‑of‑year recap experience, Your Year with ChatGPT, to Free, Plus and Pro users in the US, UK, Canada, Australia and New Zealand; it runs only when Memory and reference chat history are on and when a minimum activity threshold is met (rollout details, feature explainer ). The recap is implemented as an internal ChatGPT app using a new connector called OpenAI Cocoon, making it one of the first public, production examples of the Apps SDK in action rather than a hard‑coded product feature (cocoon mention, apps sdk docs ).

• App-like UX inside ChatGPT: The experience surfaces as a tappable card with custom layouts, animations and navigation distinct from normal chats, and people describe it as a "first class" mini‑app that feels closer to WeChat‑style in‑app experiences than to a plain conversation, with one observer saying it shows "what a 'first class' app experience could look like inside ChatGPT" (ui commentary).
• Developer interest in the SDK: Seeing a polished first‑party app backed by the SDK has triggered renewed interest from builders, with developers noting that this convinced them to "look again at the ChatGPT Apps SDK and build something" and OpenAI’s PM for the SDK saying they "can’t wait to see what people build" in 2026 (dev reaction, sdk teaser ).
• Connector usage and discovery: The "Your Year with ChatGPT" widget appears in the app list and can also be invoked via the plus‑menu and a natural language command, with a direct deep‑link to the experience shared so users can check whether they already have access (invocation hint, deep link ).
The recap doubles as both a sticky user‑facing feature and a reference implementation for how OpenAI expects third‑party Apps SDK experiences and connectors to look and feel inside ChatGPT.
Claude in Chrome proves out practical browser agents for real dashboards
Claude browser agent (Anthropic): Multiple practitioners are now reporting real utility from Anthropic’s Claude in Chrome browser agent, with one detailed write‑up describing how it recovered a lost CORS configuration buried deep in Cloudflare’s dashboard by scanning pages, following links and identifying the relevant Transform Rule without the user remembering where it lived (cloudflare story, cors blog post ). The same author extracted a full HTML transcript showing Claude’s step‑by‑step actions—navigation, text recognition in the UI and reasoning about which rule controlled which path—illustrating how a browser‑embedded agent can function as a point‑and‑click debugger for complex SaaS admin panels (transcript link).
• Agent UX and safeguards: Screenshots from another user show Claude’s browser agent UI labeling itself as "HIGH RISK" when allowed to "take most actions on the internet", with options like "Act without asking" and warnings that the agent can click hidden CAPTCHA elements or forms, underscoring both the power and risk profile of giving an LLM direct control over a live browser session (risk banner screenshot).
• From skepticism to adoption: The Cloudflare user notes that they had been skeptical of browser agents due to prompt‑injection risks but called this their "first successful experience" solving a real problem, contrasting manual hunting through a confusing UI with a guided session where the agent quickly identified the rule name, path pattern and header being set (cloudflare story).
Together these reports suggest browser‑level agents are starting to cross from novelty demos into tools that can operate vendor dashboards and consoles on behalf of engineers, as long as users are comfortable with the elevated access they require.
Claude Code gains OpenRouter backend, exposing 320+ models via one agent surface
Claude Code on OpenRouter (Anthropic/OpenRouter): Claude Code, Anthropic’s agentic coding environment, can now run against OpenRouter as a backend provider, letting users route its multi‑step coding and tool‑use workflows through more than 320 different LLMs rather than only Anthropic‑hosted models (claude code announcement). OpenRouter’s docs for the integration explicitly recommend "highly capable" models like Claude 4.5 Sonnet and GPT‑5.2 for best results but note that any compatible model—closed or open, including newcomers like GLM‑4.7—can be plugged into the same Claude Code harness (
, glm model page ).
• Single agent, many engines: The integration turns Claude Code into an interop surface where the front‑end agent logic (planning, diffing, tool orchestration) stays the same while the underlying inference stack can be swapped between Anthropic, OpenAI, Google, Z.ai and others using OpenRouter’s normalized API and routing layer (claude code announcement).
• Skills and tools compatibility: Because Claude Code already supports open Agent Skills and MCP‑style tools, binding it to OpenRouter means those higher‑level capabilities can now be exercised on top of cheaper or specialized models as they appear on the platform, without per‑model glue code.
For teams experimenting with a mix of closed and open‑weight models, this gives a single coding‑agent UX that can sit in front of a very fluid backend model portfolio.
OpenRouter highlights nextTurnParams pattern for self-managing skills
Skills loader pattern (OpenRouter): OpenRouter is pushing a concrete design for self‑managing skills by showcasing how its SDK’s nextTurnParams can automatically enrich future turns with specialized instructions once a skill is loaded (tip on nextturnparams). The example skills loader skill turns a one‑time discovery call into a persistent context modifier, so tools can quietly attach domain‑specific guidance or routing hints to every subsequent model call without extra prompting boilerplate (skills loader example).
• Encapsulated tools, minimal prompts: The pattern keeps skills configuration in one place (a skill manifest plus loader code) and relies on the harness, not the human prompt, to inject the right system messages or tool configs on future turns, which helps reduce prompt drift and lets different frontends reuse the same skills library.
• Ties into open Skills spec: The loader builds on the open Agent Skills spec that packages instructions and resources into SKILL.md folders, with prior work showing Codex and other agents adopting that standard; OpenRouter’s contribution is a concrete runtime hook (nextTurnParams) for turning those static skill bundles into living, per‑conversation behavior (skills standard mention, skills overview ).
This gives agent frameworks a portable way to keep long‑running conversations skill‑aware without forcing each step of the dialog to repeat the same configuration scaffolding.
⚙️ Serving stacks and latency tricks
Runtime/serving updates with concrete throughput/TTFT wins and day‑0 integration. This excludes model metrics (Feature) and training algorithms (see Training/Reasoning).
SGLang + Baidu context parallelism cuts DeepSeek V3.2 TTFT by up to 80%
Context parallelism for DeepSeek V3.2‑DSA (SGLang/Baidu): Baidu’s Baige AIAK team open‑sourced a context_parallel implementation for DeepSeek‑V3.2‑DSA in SGLang and reports that enabling it reduces time‑to‑first‑token by about 75% at 16k tokens and 80% at 32k versus the non‑CP baseline on a single machine (context parallel post).
The design reuses routing patterns across experts, does load‑balanced sequence splitting tailored to DeepSeek’s DSA layout, and avoids tensor‑parallel all‑reduce overhead while remaining compatible with data‑parallel attention and other parallelism schemes, aiming squarely at long‑context inference bottlenecks (context parallel post).
vLLM adds day‑0 GLM‑4.7 serve with MTP and tool parsers
GLM-4.7 in vLLM (vLLM project): vLLM added day‑0 support for Z.AI’s GLM‑4.7, exposing a single vllm serve command that wires in MTP speculative decoding, GLM‑style tool/function calling, and a reasoning parser tuned to the model’s “thinking” traces, as shown in the vllm glm47 serve.
The example launch uses 4‑way tensor parallelism plus --speculative-config.method mtp with one speculative token, indicating a focus on higher throughput without any model retraining (vllm glm47 serve).
SGLang publishes GLM‑4.7‑FP8 serving recipe with EAGLE speculative decode
GLM-4.7-FP8 in SGLang (LMSYS/SGLang): LMSYS released a concrete sglang.launch_server command for serving Z.AI’s GLM‑4.7‑FP8 in SGLang, enabling the EAGLE speculative decoding algorithm, GLM‑specific tool and reasoning parsers, and 8‑way tensor parallelism in one config (sglang glm47 example).
The recipe also sets --speculative-num-steps 3, --speculative-num-draft-tokens 4, and pins GPU memory with --mem-fraction-static 0.8, signalling a production‑oriented setup for long‑context GLM‑4.7 serving on multi‑GPU hosts (sglang glm47 example).
vLLM‑Omni adds LongCat-Image-Edit for instruction-following image edits
LongCat-Image-Edit in vLLM-Omni (vLLM project): The vLLM community integrated Meituan’s LongCat‑Image‑Edit model into vLLM‑Omni, giving operators a unified runtime to serve instruction‑following image edits such as object insertion, background replacement, and style adjustments from the same stack that handles text LLMs (longcat support).
A demo shows a Qwen bear illustration turned into a painting scene with an art board labeled “vLLM‑Omni” and a brush in the bear’s hand, reflecting how text prompts plus a reference image can drive structured edit actions inside the new image‑editing endpoint (longcat support).
🏗️ Power and campuses for AI growth
Non‑model, supply‑side signals: datacenter power and siting moves. Mostly power procurement/capex news; separate from enterprise adoption metrics in Business.
Alphabet buys Intersect for $4.75B to align AI datacenters with new power
Alphabet–Intersect deal (Google): Alphabet is acquiring clean‑energy developer Intersect for about $4.75B in cash plus assumed debt to co‑locate new solar and battery projects with Google’s AI datacenters, targeting a pipeline of ~10.8 GW by 2028 (Intersect summary, Bloomberg report); the goal is to move from pure power‑purchase agreements to owning a "development platform" that handles land, permits, grid interconnection and financing on the same schedule as new compute.
Deployment impact: Adding large AI datacenter loads often means waiting years on grid upgrades and interconnection queues, so compute can be ready before power; Intersect’s model is to build solar plus battery storage next to new campuses so generation and transmission are planned around a known AI load, reducing dependence on constrained local grids (Intersect summary). Google signals it will buy Intersect’s in‑development projects and team, but not all of its operating grid assets, which positions this more as a forward pipeline of custom power for AI than a generic utility buy (Intersect repost). For AI infra planners, this is a clear data point that power siting and permitting are now strategic bottlenecks on par with GPU supply, and that hyperscalers are willing to own more of the "electron supply chain" to keep model training and inference roadmaps on track.
Amazon’s $11B Indiana AI campus adds 2.2 GW load and heavy water use
Indiana AI campus (Amazon): Amazon’s planned $11B data center complex in St. Joseph County, Indiana, will be sized for about 2.2 GW of power draw—enough electricity for roughly 1M homes—and is expected to use around 300M gallons of water per year for cooling (Indiana campus update); the site is framed as one of Amazon’s largest AI training and inference hubs, following up on Indiana campus where the basic campus scale and on‑site power plant plans were first outlined.

Local grid and environment angle: A 2.2 GW load concentrated in a single AI campus effectively turns it into a dedicated power customer the size of a mid‑sized city, which is why Amazon pairs it with its own energy infrastructure rather than leaning entirely on the regional grid (Indiana campus update). The newly mentioned ~300M gallons/year cooling demand highlights the water footprint of large AI campuses, raising questions about sustainability and local resource planning that regulators and communities will have to weigh alongside economic benefits. For other hyperscalers, the Indiana numbers give a concrete reference point for what a next‑generation AI campus looks like in power and water terms, not just capex.
China’s power capacity reaches 3.75 TW, nearly triple US, shaping AI headroom
China power capacity (China): New charts from Morgan Stanley put China’s total power‑generation capacity at ~3.75 TW, compared with about 1.30 TW in the US, implying China now has nearly 3× the installed capacity and extending the earlier picture of rapid generation growth described in China grid (China capacity tweet).
Why this matters for AI: The same report notes China accounted for about 54% of global industrial robot installations, tying its power build‑out to rising automation and electric load from factories and datacenters, while US capacity growth has been relatively flat (robot patents recap, China capacity tweet ). For AI infra, the 3.75 TW figure sets the ceiling on how far China can scale energy‑hungry GPU clusters, fast‑charge networks, and robotics plants before hitting hard power limits, whereas the US will need either faster capacity additions or more aggressive efficiency gains to support similar levels of AI and robotics deployment. The numbers do not say how much of that capacity is directly allocated to AI, but they define the macro headroom in which future Chinese AI campuses and model‑training projects will compete.
💼 Enterprise traction and market share shifts
Business signals around AI platforms and go‑to‑market. Continues yesterday’s adoption narrative with fresh metrics; excludes infra procurement (see Infrastructure).
Gemini triples GenAI web share as DeepSeek collapses and Grok rises
GenAI web traffic (Similarweb): Similarweb’s Jan–Nov 2025 data shows Gemini’s share of global GenAI website traffic rising from about 5.64% to 14.95% (roughly 3×), while ChatGPT falls about 4 percentage points to ~74% and DeepSeek plunges from 12.79% to 5.35% (share chart); Grok grows from 0.02% to 2.53% and Perplexity holds steady around 3%, which commentators frame as “stable niche” rather than breakout (traffic recap). Following up on us traffic, where ChatGPT still dominated US visits, this new slice suggests Google and xAI are the only players meaningfully gaining share inside this traffic bucket.
• Google and Anthropic momentum: One analyst calls Google, alongside Anthropic, “the big winner in 2025” as Gemini’s share nearly triples while ChatGPT’s dips and DeepSeek’s visibly shrinks, arguing that “Google, along with Anthropic, is the big winner in 2025” (share chart).
• DeepSeek and Grok repositioning: The same chart breakdown highlights DeepSeek’s sharp decline and Grok’s rise from essentially zero to a few percent, suggesting early traction for xAI while DeepSeek’s direct‑to‑consumer reach weakens (traffic recap).
The point is: within this specific web traffic lens, the competitive field is still highly concentrated around ChatGPT, but Google’s Gemini and xAI’s Grok are now the only meaningful challengers gaining ground while some earlier contenders lose visibility.
xAI selected to power DoD GenAI.mil at IL5 for up to 3M users
GenAI.mil program (xAI): xAI says its Grok‑based “frontier AI” stack has been selected by the U.S. Department of Defense Chief Digital and Artificial Intelligence Office (CDAO) as a provider for the GenAI.mil initiative, targeting around 3 million DoD users at Impact Level 5 (IL5), the cloud security tier for Controlled Unclassified Information (program summary, xai gov post ). The company states its models will run inside the IL5 boundary and be exposed through an enterprise platform with an API plus agent tooling that can chain steps like search, drafting and summarization into single workflows (program summary).
• Enterprise and mission split: xAI describes two tracks: Enterprise AI for day‑to‑day Pentagon knowledge work, and “mission systems” using government‑optimized foundation models for classified operational workloads, likely in separate deployment enclaves with tighter controls (program summary).
• Data and sourcing model: The announcement notes that DoD users will receive real‑time insights sourced from X, shifting answers from static training data toward live feeds, which is a data‑integration choice rather than a model architecture change but has clear implications for provenance and information governance in defense settings (program summary).
• Procurement context: The deal slots into CDAO’s pattern of awarding multiple frontier AI vendors contracts with ceilings up to roughly $200M each to build agentic workflows across mission areas, signalling that xAI will now compete head‑to‑head with other large labs inside one of the highest‑stakes enterprise environments (program summary).
This marks one of the first public large‑scale defense deployments of a Grok‑class model at IL5, putting xAI directly into the enterprise AI platform conversation alongside more established vendors.
Reports describe Microsoft Copilot adoption woes and Satya’s hands‑on reset
Copilot (Microsoft): Commentary around Microsoft’s Copilot paints a picture of underwhelming enterprise traction, with reports that Microsoft has cut Copilot AI sales targets after weak adoption and that CEO Satya Nadella has taken a hands‑on product management role to accelerate improvements (copilot critique, sales target note ). One summary says Nadella is “personally overseeing engineering and recruiting while delegating other executive duties,” driven by frustration over technical flaws and market share erosion versus rivals like Google and Cursor (copilot critique).
• Adoption and perception issues: Posts describe users seeing Copilot as unreliable and agentic tools as “untrustworthy” in daily workflows, which reportedly slows enterprise rollout despite aggressive bundling such as an unremovable Copilot app appearing on LG TVs after a firmware update (forced install article).
• Competitive pressure: The same threads explicitly list Google and Cursor as key competitors, implying that Microsoft’s current Copilot experience is not winning developers by default and prompting this internal “code red” style response (copilot critique).
For AI leaders, this is one of the clearest public signals that even a hyperscaler with distribution still has to win on perceived reliability and day‑to‑day usefulness, not only on bundling and brand.
🎬 Creator workflows: motion control, music, design
Generative media saw heavy traffic: motion‑controlled video pipelines, music creation tooling, and design iteration UX. This cluster is kept separate for creators/marketing teams.
Kling 2.6 Motion Control spreads across Higgsfield, fal and Replicate
Kling 2.6 Motion Control (Kuaishou / ecosystem): Following up on the earlier Kling 2.6 launch for motion‑controlled video workflows Kling workflow, multiple platforms have now wired it into creator‑friendly pipelines—Higgsfield offers day‑0 access with 30‑second generations and full‑body sync, expression mapping and lip‑sync (Higgs launch); fal hosts a one‑take 30s Motion Control endpoint targeting fast dance/sports/martial‑arts style moves (fal integration); and Replicate exposes a "static image + reference video or motion library → animated output" flow with side‑by‑side previews for creators (replicate demo).

• Higgsfield workflows: Higgsfield markets Kling 2.6 as “any motion reference becomes any character’s performance,” and pairs it with Nano Banana Pro so users can stylize characters and then retarget complex, fast movement onto them (Higgs launch, workflow guide ).
• fal and Replicate surfaces: fal’s hosted endpoint promises synchronized motion, expressions and lip sync in up to 30s clips with a single prompt (fal integration); Replicate’s UI shows a split‑screen of source vs generated clip, highlighting how static photos like a Santa portrait can inherit motion from a live‑action reference (replicate demo).
• Multi‑character control: Community guides now document recording separate performances for each actor, converting first frames with Nano Banana Pro, then driving two Kling Motion Control runs and compositing, effectively turning a solo performer into a multi‑character cast (multi character demo, higgs tutorial ).
The net effect is that Kling’s motion system is no longer a single web demo but a multi‑hosted primitive that creators can reach through Higgsfield, fal and Replicate, often chained with image models like Nano Banana Pro for character design.
ElevenLabs Music adds Explore, stem separation and better lyric tools
Eleven Music (ElevenLabs): ElevenLabs shipped a substantial update to its music model and UI, adding an Explore surface for discovering and remixing tracks, multi‑level stem separation, improved lyric generation and precise per‑line timestamps (music update).

• Stem control: Creators can now split songs into 2, 4 or 6 stems—ranging from simple vocal/instrumental all the way to vocals, drums, bass and an "other" channel—enabling fine‑grained remixing and targeted edits inside or outside ElevenLabs (music update).
• Lyric workflow: The company reports better clarity, coherence and stylistic alignment for generated lyrics plus section‑level regeneration, making it easier to inpaint or extend specific song parts without discarding a whole take (music update, ui improvements ).
• Timing and UI polish: A new lyric timestamp system exposes exact timings via both UI and API, and the Music interface gains richer history, smoother navigation and real‑time highlighting of lyric lines during playback (music update, ui improvements ).
For music‑tool builders and sync‑heavy workflows, this turns Eleven Music from a pure generator into more of a DAW‑adjacent tool that can sit in the middle of editing, stems prep and lyric‑driven visualizations.
Genspark AI Developer builds games from screen recordings and one prompt
AI Developer (Genspark): Genspark showcased a workflow where its AI Developer agent turns a simple screen recording of the mobile game Block Blast plus a single prompt (“build a game like this”) into a playable clone in minutes, with no manual coding by the user (game demo).

• Multi‑model orchestration: Behind the scenes, Gemini handles video understanding, Nano Banana Pro designs themes and assets, and Claude writes production‑ready code; Genspark’s orchestrator routes sub‑tasks across models as needed instead of following a fixed pipeline (stack explanation).
• Agentic workflow: The system identifies user journeys in the recorded game, plans the feature set, generates assets and code, and then assembles a working web game, effectively turning "vibe coding" into a repeatable pattern for prototyping casual games (game demo, stack explanation ).
This positions Genspark’s tool as an example of how creator workflows can move from prompt‑only to "prompt + demonstration" inputs, especially for indie game and interactive prototype work.
Manus launches Design View for end‑to‑end AI design workflows
Design View (Manus): Manus introduced a new "Design View" on web and mobile that reframes its image model as a full design workflow—users can commission, create and iteratively refine visual assets as part of one continuous session rather than firing isolated prompts (design view launch).

• From prompt to layout: The demo shows a prompt box feeding into a canvas‑like interface where each refinement pass mutates composition, style and details, while preserving the project context instead of starting from scratch (design view launch).
• Agent extension: Manus positions Design View explicitly as an extension of its existing agent, so the same underlying model that chats can now act as a design assistant with persistent memory of prior iterations and instructions (design view launch).
For designers and marketers, this shifts Manus from being “an image generator” to something closer to an AI design environment that tracks intent and makes iteration loops feel like editing rather than random re‑rolls.
YouTube Playables Builder lets creators prompt Gemini 3 into mini‑games
Playables Builder (Google / YouTube): Google DeepMind highlighted that the new YouTube Playables Builder web app uses Gemini 3 to help creators spin up small, playable games from text, video or image prompts, aimed at "fun, bite‑sized" interactive content (playables teaser).

• Prompt‑to‑game flow: A short demo shows a creator entering a prompt, selecting options in the Builder UI and previewing a simple text‑based game directly in the browser, all under a "Powered by Gemini 3" banner (playables teaser).
• Creator positioning: Commentary frames this as "the first big release in AI powered game creation," suggesting Google wants Playables Builder to be a mainstream on‑ramp for non‑programmer creators to experiment with interactive experiences inside the YouTube ecosystem (playables analysis).
For game‑adjacent YouTube channels, this folds lightweight game prototyping into the same surface they already use for video publishing and monetization.
Hailuo and Nano Banana Pro front a Christmas AI‑video contest
Hailuo 2.3 + Nano Banana Pro (Hailuo / Flowith): Creator @ai_for_success shared a "Modern day Santa" short built by first designing a consistent Santa character in Nano Banana Pro, then using Hailuo’s First and Last Frame feature (with Veo 3.1 and the Hailuo 2.3 model) to animate those stills into a full video (santa entry).
• Contest mechanics: Hailuo is running a #HailuoChristmas campaign from Dec 19 to Jan 5 where participants either start from Christmas templates (≥15s) or original stories (≥30s), post on major social platforms with the hashtag and submit via a landing page, with prizes of $1,500, $1,000, $500 and ten $100 random awards plus 1,000 free credits for the first 20 submissions (contest rules, contest page ).
• Toolchain pattern: The shared workflow emphasizes creating a single hero image, generating story beats in Nano Banana Pro, then binding them with Hailuo’s frame‑to‑frame interpolation so the character identity remains stable across the clip (process recording).
• Promo tie‑in: Nano Banana Pro is temporarily free on Hailuo until Dec 31, explicitly marketed as a way to create contest entries without extra image‑model spend (nb pro promo).
This campaign shows how model vendors are packaging specific visual pipelines (character design → first/last frame video) into seasonal events to drive both experimentation and user‑generated marketing assets.
New 3D and world‑event generators target editable scenes and characters
3D‑RE‑GEN, WorldCanvas and character animation tools (multiple): Several research and demo posts highlighted creator‑facing 3D and world‑event generators: 3D‑RE‑GEN shows a generative framework that reconstructs detailed indoor scenes, moving from wireframes to textured rooms (3d regen demo); "The World is Your Canvas" introduces WorldCanvas, which can "paint" promptable events into a scene using reference images, trajectories and text (worldcanvas link); and a separate "Animate Any Character in Any World" demo lets users drop characters into arbitrary environments and control their motion (character animator).


• Scene reconstruction: 3D‑RE‑GEN’s video cycles between mesh overlays and final renders of living rooms and other interiors, suggesting a pipeline where artists can start from sparse captures and end up with editable, photorealistic environments (3d regen demo).
• Promptable events: The WorldCanvas work focuses on combining text, guidance trajectories and reference visuals so that creators can specify not just a static scene but a dynamic sequence of events in it, effectively turning the world into a parametrizable canvas (worldcanvas link).
• Character/world decoupling: The "Animate Any Character in Any World" tool emphasizes dragging a character asset into a separate background and then keyframing or prompting motion, which mirrors how many 2D creators already think about layers and rigs (character animator).
Taken together, these projects sketch a near‑future pipeline where world models, 3D recon and character rigs interlock, letting creators describe or sketch scenes and then iterate on both environment and motion without hand‑authoring every asset.
ImagineArt adds Topaz and Magnific upscalers for up to 16× enlargement
ImagineArt upscaling (ImagineArt): Commentators noted that ImagineArt has integrated Topaz Labs and Magnific AI as built‑in upscaling backends, enabling creators to push AI‑generated images up to 16× their original resolution inside the platform (imagineart mention, upscale comment ).
• Workflow impact: Instead of exporting to separate tools, users can now generate an image in ImagineArt, choose between Topaz or Magnific as an upscaler and produce large prints or high‑res crops from the same interface (imagineart mention).
• Cost angle: One thread calls out that this is a "great practical use" for large‑scale upscaling, implying that the main benefit is consolidating an otherwise multi‑tool, credit‑heavy workflow into a single app (upscale comment).
For illustrators and print‑oriented artists, native 16× upscaling reduces the friction of taking AI concepts into formats suitable for posters, merch and high‑dpi layouts.
🤖 Embodied AI: factory deployments and ‘Robot Olympics’
Embodied threads include real deployments in factories/borders and fine‑tuned generalist skills on household tasks. Mostly manipulation/control; distinct from media robots on stage.
Physical Intelligence’s π0.6 robot tackles “Robot Olympics” household chores
Robot Olympics chores (Physical Intelligence): Physical Intelligence fine‑tuned its π0.6 vision‑language‑action model to perform Benjie Holson’s "Robot Olympics" tasks—door traversal, sock inversion, key use, sandwich making, orange peeling and pan washing—with fully autonomous rollouts rather than teleoperation (pi demo thread, holson reference ).

• Task coverage and data: The team reports solving 3 of 5 event categories at gold level and 2 at silver, using under 9 hours of new data per task on top of a generalist π0.6 pretrain (pi blog post).
• Success metrics vs baseline: Across events, π0.6 averages 52% full success and 72% task progress, while a standard VLM baseline achieves 0% success and ~9% progress, highlighting the importance of robot‑specific pretraining and fine‑tuning (pi summary, pi blog post ).
• Examples of skills: Videos show the robot keeping a self‑closing lever door open while walking through, turning a sock inside‑out, unlocking a lock with a key, making a peanut butter sandwich (open, spread, cut, close), washing both sides of a frying pan, and even peeling an orange with a tool when gripper limits make finger‑only peeling impossible (door task video, sandwich video ).
• Moravec’s paradox angle: Sergey Levine notes they "didn't actually do anything special" beyond fine‑tuning π0.6 and suggests this level of everyday manipulation might force a rethink of Moravec’s paradox in the age of robotic foundation models (levine comment).
The work frames a concrete benchmark suite for embodied generalists and shows that a single VLA model plus modest per‑task fine‑tuning can span long‑horizon, contact‑rich household chores rather than one‑off lab tricks.
CATL’s Xiaomo humanoids achieve 99% success on high‑voltage battery plug‑ins
Xiaomo factory robots (CATL): Chinese battery giant CATL reports deploying its Spirit AI Xiaomo humanoid robots on high‑voltage battery production lines, claiming 99% successful plug‑in operations and roughly 3× the daily throughput of a human worker on that station (catl deployment).
• Task characteristics: The target job used to be manual because connectors and cables shift slightly each cycle and mistakes at high voltage pose real safety risks; traditional industrial arms prefer fixed geometry cells and struggle with this kind of "fiddly" connector alignment (catl deployment).
• Control approach: CATL says Xiaomo uses a vision‑language‑action model that takes camera input plus a task goal and outputs motor actions directly, allowing real‑time adjustment of grip, approach angle and insertion force instead of brittle scripted trajectories (catl deployment).
• Utilization metrics: The robots reportedly hot‑swap their own batteries in under three minutes and walk at ~2 m/s, enabling true 24/7 operation on the line where the 3× workload uplift mainly comes from continuous running rather than speed alone (catl deployment).
If these performance and reliability numbers hold across product variants and maintenance cycles, this is a concrete example of end‑to‑end learned control replacing classic, hard‑tooled automation for variable, safety‑critical assembly steps.
China signs ¥264M deal to staff Vietnam border with UBTech Walker S2 humanoids
Walker S2 at the border (UBTech): China has reportedly signed a 264 million yuan (~$37M) contract to deploy UBTech’s Walker S2 humanoid robots at the Fangchenggang border with Vietnam, where they’ll handle personnel flow management, inspection and logistics in harsh, remote conditions around the clock (walker s2 summary, border article ).
• Platform specs: The Walker S2 robots are about 176 cm tall and 70 kg, can walk at roughly 2 m/s, and can autonomously hot‑swap their batteries in under three minutes to support continuous 24/7 operation without human swaps (walker s2 summary).
• Operational role: The deployment is framed as a way to staff a remote border crossing with persistent robotic presence for document checks and cargo handling, where human staffing is costly and the environment may be unpleasant or risky over long shifts (walker s2 summary, border article ).
• Trend signal: Earlier Chinese deployments focused on pilots in factories and exhibitions; a dedicated, funded border deal suggests humanoids are starting to be evaluated as regular infrastructure in public security and customs workflows rather than one‑off demos (patrol video).
The rollout will test whether bipedal platforms can meet reliability, maintenance and uptime expectations in an operational government setting rather than a controlled lab or expo.
Kyber Labs shows fully autonomous robotic arm assembling mechanical parts
Autonomous assembly (Kyber Labs): Kyber Labs has released a demo of its robotics system autonomously positioning and fastening a small metal part onto a base plate, with no human teleoperation or intervention during the sequence (kyber demo).

• Task structure: The video shows a robot arm picking up a component, aligning it with pre‑drilled holes on a plate, placing it, then driving fasteners, which combines perception, precise pose estimation and force control rather than simple point‑to‑point moves (kyber demo).
• Claimed autonomy: The announcement stresses "no human intervention, just full‑stack robotic precision", implying the stack covers perception, planning and low‑level control internally instead of relying on offline teaching or joystick control (kyber demo).
The demo sits between academic manipulation benchmarks and full industrial deployment, hinting at how lab‑scale systems are being hardened into repeatable assembly skills.
Midea’s MIRO U “one head, six arms” robot targets flexible production lines
MIRO U multi‑arm robot (Midea Group): Chinese appliance maker Midea is showcasing MIRO U, a production robot with a humanoid torso, vertical lift, 360° rotation and six coordinated arms mounted on a wheeled base, pitched as delivering about 30% efficiency gains on factory lines (miro u summary).

• Mobility and workspace: MIRO U rides on a wheeled base for fast movement across stations, then uses torso lift and rotation to reach different fixtures and conveyors without re‑rigging the cell layout (miro u summary).
• Multi‑arm coordination: The demo shows all six arms working around a large battery pack or appliance module, suggesting tasks like parallel fastening, inspection and cable routing that would normally require multiple separate arms or operators (miro u summary).
• Stack context: This comes alongside CATL’s humanoid deployment and other Vision‑Language‑Action industrial pilots, indicating Chinese manufacturers are experimenting with different embodied form factors for the same goal—handling high‑mix, geometry‑varying tasks that rigid cells don’t handle well (catl deployment).
The design points toward a hybrid between a mobile base and a dense arm cluster, trading humanoid leg complexity for more hands and reach on each station.
Morgan Stanley tallies China’s humanoid patent surge and 6.5B‑robot forecast
China’s robot footprint (Morgan Stanley): A Morgan Stanley report summarized by commentators counts 7,705 humanoid robot patents filed in China over five years—about 5× the 1,561 in the US—alongside estimates that China accounts for 54% of global industrial robot installations and a projection of 6.5 billion robots worldwide by 2050, heavily weighted toward drones and home robots (patent summary).
• Patent signal: Patents are described as a rough proxy for distinct technical ideas rather than product quality, but the 5× gap suggests a broad R&D push across Chinese labs and manufacturers on humanoid and related platforms (patent summary).
• Install base: The same summary notes that China already leads in industrial robot deployments, with 54% of new installs, reinforcing the idea that the country’s factories are becoming "small teams running big systems" rather than labor‑heavy floors (jobs chart, patent summary ).
• Long‑term forecast: Morgan Stanley’s 6.5B‑robot forecast breaks down to about 34% small drones and 29% home robots by 2050, implying that most embodied AI units will operate outside classical factory settings even as industrial deployments like CATL and Midea scale up (patent summary).
These numbers frame the CATL, MIRO U and border‑control deployments as early instances within a much larger projected shift toward ubiquitous embodied systems.
Disney’s Spider‑Man robot executes 25‑meter autonomous stunt at Avengers Campus
Stunt robot (Disney Avengers Campus): At Disneyland’s Avengers Campus, an AI‑driven Spider‑Man robot now performs 25‑meter aerial launches, mid‑air flips and self‑correcting landings over a show stage with no human in the loop, raising questions about the future of stunt work (spiderman description).

• Performance profile: The robot is shown being catapulted above the stage, executing multiple flips, adjusting attitude mid‑flight and landing on a platform before transitioning to a hero pose sequence, suggesting a combination of precise model‑based control and robust state estimation (spiderman description).
• Job‑replacement discourse: Posts ask whether stunt performers’ jobs are "gone too", placing this system in the same conversation as dancing humanoids and factory robots about which physical roles get automated first (spiderman description, dance reaction ).
The deployment highlights that high‑risk, repeatable acrobatics under tight safety constraints are a natural early niche for embodied autonomy, before broader adoption in less scripted environments.
Unitree G1 humanoids pull concert flips as stage robots mature
G1 stage performances (Unitree Robotics): Unitree’s G1 humanoid robots are drawing attention for human‑like dance precision, first in studio clips where they match backup dancers’ timing and moves, and then on stage at Wang Leehom’s 30th anniversary concert in Chengdu where they perform synchronized backflips in costumes (dance reaction, concert performance ).

• Control fidelity: Commentators note the robots’ actions, timing and poses are "almost perfectly aligned" with human performers, including coordinated flips and choreography that depend on tight whole‑body control rather than simple pre‑scripted motions (dance reaction).
• Perceived labor impact: Posts half‑jokingly suggest "background dancers seriously need to find alternative jobs", reflecting both the quality of the demo and rising concern about how far embodied AI will encroach on repetitive stage roles (dance reaction).
• Broader deployment arc: These entertainment‑grade routines sit alongside patrol trials and industrial pilots for other Chinese robots, indicating that dynamic balance and motion control are being exercised in public‑facing, high‑pressure settings before being pointed at more utilitarian tasks (patrol video, catl deployment ).
While this is still performance robotics, the same control stacks could carry over to logistics or inspection roles where tight timing and coordination around humans matter.
📑 Fresh papers: long‑context, diffusion‑LLMs, agent safety
A dense day for papers: sparse attention for long context, diffusion‑LLMs at 100B, agent timing/safety, egocentric data, and 4D video perception. These are research artifacts, not product launches.
Distributional AGI Safety reframes risk around patchwork agent economies
Distributional AGI Safety (Google DeepMind): A new Distributional AGI Safety paper argues that real risk will come from "patchwork AGI"—many sub‑AGI agents coordinating via tools and protocols—so safety work should govern agent economies rather than a single monolithic model (paper summary).
The authors propose virtual agentic sandbox economies, ranging from impermeable to semi‑permeable, where agents trade work under cryptographically enforced identity, append‑only audit logs, and reputation‑gated access, with circuit‑breaker mechanisms to slow or halt cascades during unstable behaviors (paper summary). Their framework outlines four layers of "defense in depth"—market design, agent‑side hardening, real‑time graph monitoring for emerging AGI‑like clusters, and external standards and liability regimes—to align incentives in multi‑agent systems before those networks reach super‑human capability (paper summary).
“Fixing It in Post” shows smaller, cleaner post‑training mixes can beat larger ones
Fixing It in Post (IBM & TUM): The "Fixing It in Post" study compares post‑training data mixtures like Tulu‑3‑SFT‑Mix and SmolTalk using Magpie‑tagged metadata, and finds that a new TuluTalk mix, 23% smaller than SmolTalk and 14% smaller than Tulu, can still outperform them on standard instruction‑following and chat benchmarks (paper abstract).
Their pipeline tags each conversation with task type, turn count, and answer quality using another LLM, then systematically filters and re‑weights examples before training the same base model across all mixes, so differences in performance arise solely from data quality and composition (paper abstract). Results suggest that high‑quality, targeted post‑training data matters more than raw volume, especially when optimizing for specific behaviors like multi‑turn support or coding, and that smaller curated mixtures can save compute without sacrificing downstream scores (paper abstract).
“When Reasoning Meets Its Laws” proposes LoRe laws and benchmark for LRMs
When Reasoning Meets Its Laws (LoRe): The LoRe paper introduces "Laws of Reasoning" for Large Reasoning Models (LRMs), positing compute and accuracy laws that should scale roughly linearly with task complexity, and builds LoRe‑Bench to test whether models obey monotonicity and compositionality constraints (paper summary).
The authors argue that current reasoning LMs often violate intuitive laws—for example, sometimes doing better on a harder variant than on an easier base case—so LoRe‑Bench decomposes tasks into structured families where such violations can be measured systematically (paper summary). Early experiments show prominent LRMs deviating from idealized laws in nontrivial ways, which the paper frames as both a diagnostic for overfitting and a guide for future architecture and training changes aimed at more stable, law‑like reasoning behavior (paper summary).
4D‑RGPT targets region‑level 4D video understanding with new R4D‑Bench
4D‑RGPT (NVIDIA): The 4D‑RGPT paper presents a multimodal LLM tuned for region‑level 4D understanding (space + time), using a Perceptual 4D Distillation (P4D) pipeline to import 4D structure from an expert model and a new benchmark, R4D‑Bench, focused on depth‑aware dynamic scenes (paper summary).
4D‑RGPT aims to fix two gaps in current video MLLMs: limited temporal reasoning and lack of region‑conditioned prompts, so R4D‑Bench includes tasks where models must answer questions about specific moving regions over time rather than whole frames (paper summary). The authors show 4D‑RGPT improving on prior 4D video QA baselines and demonstrate that distilling 4D representations into a language‑conditioned model yields better temporal coherence and region accuracy than pure 2D or clip‑level training (paper summary).
FrontierMath shows Chinese open‑weight models ~7‑month lag on hardest tiers
FrontierMath (Epoch AI): New FrontierMath results benchmark several open‑weight Chinese models and find their Tier 1–3 performance lags top frontier models by roughly seven months, while on the hardest Tier 4 set only DeepSeek‑V3.2 (Thinking) answers 1/48 problems (~2%) correctly (frontiermath update).
Epoch notes that FrontierMath data are largely private, with OpenAI having exclusive access to all Tier 1–3 problems and most Tier 4, while the public portion and a shared OTIS Mock AIME benchmark are used to sanity‑check third‑party API evaluations via Fireworks and Together for data security (data access note). On aggregate Tiers 1–3, GPT‑5.2 and Gemini 3 Pro sit in the mid‑30% accuracy range, while top open‑weight Chinese models like DeepSeek‑V3.2 and Kimi K2 Thinking cluster near 20%, reinforcing a still‑visible capability gap on competition‑level math, especially at the frontier tier (tier performance chart).
Generative Adversarial Reasoner boosts math accuracy with step‑level critics
Generative Adversarial Reasoner (Johns Hopkins): The Generative Adversarial Reasoner framework trains a math LLM reasoner together with an LLM discriminator that scores short reasoning slices, turning those local signals into reinforcement‑learning rewards that encourage correct intermediate steps, not just correct final answers (paper abstract).
On the AIME 2024 benchmark, the authors report accuracy gains from 54.0 → 61.3 (+7.3 points) for one backbone and 43.7 → 53.7 (+10.0) for another, attributing improvements to the discriminator’s ability to reward locally valid algebra and penalize wrong turns even when the final numeric answer is wrong (paper abstract). After training, only the reasoner is used at inference time; the discriminator’s cost stays in training, making the method a candidate for general reasoning‑oriented RL without permanent dual‑model overhead (paper abstract).
Learning to Wait trains agents to sleep instead of spamming async tools
Learning to Wait (Tsinghua): The Learning to Wait paper shows that LLM agents can learn when to insert sleep(t) calls for asynchronous tools—rather than polling status in tight loops—by predicting wait times from tool semantics and in‑context examples in a simulated Kubernetes cluster (paper overview).
In their setup, real tools start work in the background and expose only coarse statuses like PENDING or DONE; excess status checks incur penalties, as do confirmations that are too delayed, so the agent must trade off latency against token and context overhead (paper overview). The authors report that, after training, several models converge on policies with about one status check per task over multi‑episode runs, cutting unnecessary polling while still confirming completion on time, which they frame as evidence that agents can internalize a crude "time sense" for external actions (paper overview).
PhysBrain uses human egocentric video to teach physical intelligence
PhysBrain (multi‑institution): The PhysBrain work introduces an Egocentric2Embodiment pipeline that turns large‑scale human egocentric videos into structured supervision for robots, aiming to bridge static vision‑language models and physical intelligence without collecting vast robot datasets (paper overview).
Their Egocentric2Embodiment (E2E‑3M) dataset converts first‑person videos into multi‑level, schema‑driven VQA signals about object states, contacts, and long‑horizon changes, designed to train models that can reason about state transitions and contact‑rich manipulation, not just label frames (paper overview). The authors argue that leveraging human head‑camera footage as a surrogate for robot experience lets embodied models learn perception and planning priors, with robots fine‑tuning later for morphology‑specific control (paper overview).
SGI‑Bench probes scientific general intelligence across deep research workflows
Scientific General Intelligence (SGI‑Bench): A new benchmark for Scientific General Intelligence (SGI) defines it as the ability to autonomously conceive, investigate, and reason across disciplines, and introduces SGI‑Bench, a 1,000+ sample suite aligned with the Practical Inquiry Model’s phases: deep research, idea generation, dry/wet experiments, and experimental reasoning (sgi paper).
The authors report that current top LLMs show low exact‑match rates (10–20%) on deep research tasks, despite high code executability in dry experiments, and that many generated ideas lack feasibility or sufficient detail, suggesting a gap between today’s agentic tooling and the level needed for autonomous science (sgi paper). They frame SGI‑Bench as a target for methods like test‑time RL and improved tool integration, emphasizing that progress on SGI requires evaluating full research workflows rather than isolated QA or coding tasks (sgi paper).
87‑page survey maps techniques and tradeoffs for Small Language Models
Small Language Models survey (Penn State et al.): A comprehensive 87‑page survey defines Small Language Models (SLMs) as those between the emergent‑ability threshold and resource‑constrained upper bounds, cataloging architectures, training tricks, applications, and trustworthiness concerns across this size band (survey overview).
The authors argue SLMs are increasingly favored for on‑device, low‑latency, and domain‑specific deployments, especially in settings where privacy constraints or edge hardware make giant LLMs impractical, and highlight their role as components in multi‑agent systems where many small models collaborate (survey overview). The survey reviews methods like distillation, quantization, retrieval‑augmentation, and modular fine‑tuning, and devotes a section to safety and evaluation practices tailored to SLMs rather than copying LLM‑centric benchmarks wholesale (survey overview).
🚀 Other models: MiniMax M2.1 rolls into stacks
Non‑feature model updates with direct relevance to coding/agents. Excludes GLM‑4.7 (Feature). Focus on availability, early usage, and pricing/adoption signals.
MiniMax M2.1 officially launches as 10B OSS coding and agent model
MiniMax M2.1 (MiniMax): MiniMax has moved M2.1 from early access into an official release, positioning it as a 10B‑activated open‑source coding and agent model with strong evals in both SWE‑bench Multilingual (72.5%) and the new VIBE‑bench UI test (88.6%) (launch details); the team calls it "the most powerful OSS model for the agentic era" and says a full open‑weights drop will follow in two days (launch blog). This is framed as a follow‑up to early access, where M2.1 first appeared as a design‑savvy coder rather than a fully benchmarked release.

Benchmarks and positioning: MiniMax highlights M2.1’s 72.5% score on SWE‑bench Multilingual and 88.6% on its newly open‑sourced VIBE‑bench, claiming it beats closed models like Gemini 3 Pro and Claude Sonnet 4.5 on those specific tests (launch details, launch blog ); the company also emphasizes M2.1’s strength on long‑horizon, tool‑heavy "agent" workflows, pitching it as a general "Digital Employee" rather than just a code autocomplete. The post stresses that these numbers come from a 10B‑parameter active MoE slice rather than a giant dense model, which matters for cost and deployment, but external replication of the evals has not yet been shared in these threads.
The release sets M2.1 up as one of the main open competitors in multilingual coding and UI‑heavy development tasks, with the next concrete milestone being the promised open‑weights drop and independent confirmation of the VIBE‑bench leadership claims.
MiniMax M2.1 lands on Ollama and Cline as a general coding backend
Ecosystem adoption (MiniMax M2.1): M2.1 is quickly rolling into common developer stacks, with Ollama, Cline and Code Arena all adding support in the same week; this extends its reach well beyond MiniMax’s own UI (ollama support, cline announcement , arena note ).
• Ollama runtime: Ollama now exposes minimax-m2.1:cloud, describing the updated model as performing "much better" across Rust, Java, Golang, C++, Kotlin, Objective‑C, TypeScript and JavaScript (ollama support, ollama model page ); this gives local‑style workflows access to M2.1 while still hitting MiniMax’s cloud backend.
• Cline integration: The Cline team added M2.1 as a first‑class provider, calling out a 200k context window, 128k max output and a MoE design with 10B active / 230B total params, and emphasizing improved code quality, instruction following, and cleaner reasoning across refactors, feature work, bugfixes and DevOps scripting (cline announcement).
• Competitive eval interest: MiniMax notes that M2.1 has entered the Code Arena eval suite with live WebDev tasks, though results are still pending at this stage (arena note).
These integrations mean M2.1 can now be swapped into existing agent harnesses and CLIs with minimal wiring, letting teams compare its behavior directly against Claude, GPT‑5.x and Gemini in real codebases rather than only on MiniMax’s own platform.
MiniMax Agent showcases M2.1 as a "Digital Employee" across 10+ workflows
MiniMax Agent (M2.1): MiniMax is leaning hard into the "Digital Employee" framing by wiring M2.1 into its MiniMax Agent product, advertising long‑horizon tool use, browser automation and office workflows such as multi‑step instructions and reasoning (agent upgrade); the agent UI now defaults to M2.1 in "Lightning" mode for these tasks.

• General‑purpose workflows: A "10 wild examples" page shows M2.1 handling everything from guided meditations and fact‑checking document citations to Taihu self‑drive trip planning, meme‑coin trend scans and dual‑moving‑average portfolio backtests, all inside MiniMax Agent’s task interface (example gallery, agent gallery ).
• Productivity and coding: The launch copy describes M2.1 as a multilingual coding expert and long‑horizon tool user that can execute browser‑based tasks with autonomous planning (agent upgrade); separate threads highlight use as a "Digital Employee" for office workflows like email drafting and spreadsheet logic, in addition to classic coding roles.
• Design and app building demos: Community posts show M2.1 building a "Notion Lite" editor in a single prompt inside MiniMax Agent and shipping it as a playable web app, as well as generating full UI concepts in one shot (notion lite demo, notion lite demo ); another thread showcases M2.1 "vibing" custom art pieces on the same agent stack (notion lite demo, design gallery ).
Taken together, these examples flesh out MiniMax’s earlier promises about M2.1’s agentic capabilities by showing it running real multi‑step tasks in the wild, rather than only synthetic code benchmarks.
🛡️ Safety hardening and legal friction
Security/safety items focused on agent misuse defenses and scraping enforcement. Not general policy; both items have direct impact on AI agent/web operations.
OpenAI hardens ChatGPT Atlas browser agent against prompt injection with RL red‑teaming
ChatGPT Atlas (OpenAI): OpenAI details a new security pipeline for its ChatGPT Atlas browser agent that uses reinforcement‑learning‑based automated red‑teaming to discover and patch prompt‑injection attacks before they are widely exploited (OpenAI security note, OpenAI blog ). The post describes an adversarially trained classifier that runs inside Atlas’s browser mode to detect untrusted page content trying to override system instructions, plus a continuous loop where RL agents search for new jailbreak patterns, engineers add mitigations, and the model is retrained and redeployed on short cycles; this moves Atlas closer to a traditional vulnerability‑management model for web agents rather than a one‑off prompt hardening exercise, which directly affects anyone relying on browser automation for workflows that touch sensitive internal data or credentials.