
vLLM 0.13.0 doubles diffusion throughput, ships a 309B MiMo-V2-Flash serving recipe
Executive Summary
vLLM 0.13.0 lands as a broad serving-stack upgrade: a new compile_ranges compilation option, PrefixLM support for Flex/Triton attention, and CUDA graphs for 3D Triton attention target long-context workloads, while a unified AttentionConfig CLI deprecates older backend env vars. TeaCache and Cache-DiT diffusion caches now show ~1.9× throughput gains on Qwen-Image 1024×1024 and up to 2.38× on Qwen-Image-Edit, with no retraining; Blackwell Ultra SM103 support plus DeepSeek-V3.1–tuned kernels yield single-digit throughput and TTFT wins. On the control plane, Mooncake KV connectors, resettable prefix caches, failure-recovery hooks, and an “external launcher” mode expand fleet options, while the Responses API adds Browser/Container tools and an MCP-style Python loop. A new recipe turns Xiaomi’s 309B‑param (15B active) MiMo‑V2‑Flash into a one-line vLLM deployment with tensor parallel 4, 65k–128k contexts, Qwen-style tool parsers, and a toggleable “thinking” mode, leaning on MiMo’s Apache‑2.0 open weights and frontier-level AIME25/Arena-Hard scores.
• Embodied and robotic AI: Unitree’s $13.5k, 1.32m, 35 kg G1 design and H1+Go1 patrol clips push humanoids toward real patrol duty; Tesla’s Optimus V2.5, Purdue’s 0.103 s “Purdubik’s Cube,” and Kyber Labs’ fine pick‑and‑place highlight contact‑rich, higher-frequency control moving into deployment.
• Evals and long context: Gemini 3 Flash spans ~36.5–71.6% MRCR AUC at 128k across reasoning modes; GPT‑5.2 Thinking holds near‑perfect MRCR recall to 256k; METR log‑time horizons correlate r≈0.95 with Epoch’s ECI while YAML spelunking suggests GPT‑5.1‑Codex‑Max is ~2.6× slower but ~7.6× cheaper per METR run than Claude Opus 4.5.
• Compute race and adaptation: Epoch pegs the US at 74.4% of known frontier AI compute as Amazon’s 2.2 GW Indiana campus and OpenAI’s >10 GW Stargate plans materialize; OpenAI’s compute margin reportedly hits ~70% while Tsinghua/OpenBMB’s <0.1% “H‑neurons,” adaptive multi‑agent 10‑K readers, and A1/A2/T1/T2 taxonomies outline how next‑generation agent stacks may tune hallucinations, coordination, and tools on top of this expanding compute base.
Feature Spotlight
Feature: Real‑world robots scale up (Unitree patrols, low‑cost humanoids)
China is fielding coordinated Unitree patrol teams while Unitree’s ~$13.5k G1 specs trend; contact‑rich control and falling BOMs signal near‑term deployment beyond stage demos.
Strong cluster today around embodied AI moving from demos to duty: China patrol teams with Unitree bots, a detailed Unitree G1 spec/price breakdown, and contact‑rich control progress. Excludes evals, models, or tooling covered elsewhere.
🤖 Feature: Real‑world robots scale up (Unitree patrols, low‑cost humanoids)
Strong cluster today around embodied AI moving from demos to duty: China patrol teams with Unitree bots, a detailed Unitree G1 spec/price breakdown, and contact‑rich control progress. Excludes evals, models, or tooling covered elsewhere.
Unitree G1 humanoid pairs $13.5k price with compact 1.32m, 35 kg design
Unitree G1 (Unitree): A detailed breakdown explains why the G1 humanoid is deliberately short and light—about 1320 mm tall, 35 kg mass, folding down to ~690 mm, with a ~2‑hour battery and 3D LiDAR plus depth camera—so it can ship at a base price near $13,500 and survive research abuse as shown in the design thread and spec overview; this follows earlier viral concert flips from the same platform via concert bots.
• Cost and manufacturability: The thread argues that shorter limbs allow smaller actuators, bearings, gear trains, and covers, improving yields and reducing rework, which helps keep the BOM low enough to hit the ~$13.5k research price point while still including depth and LiDAR sensing as shown in the design thread and spec overview.
• Control and safety: A low center of mass and reduced limb inertia give the controller more margin for dynamic walking, jumps, and push recovery, while lower fall height and mass cut impact energy when the robot inevitably crashes during learning or competitions as detailed in the design thread.
• Reliability and precision: Shorter links reduce bending moments and cumulative position error from small joint‑angle mistakes, which is presented as key for long‑term lab uptime and accurate manipulation at the hand or foot as shown in the design thread.
China fields Unitree humanoids and robot dogs for team patrol duty
Unitree patrol robots (Unitree/China): New footage shows a patrol team of two Unitree H1 humanoids flanking a Go1 robot dog moving through a corridor in formation; the framing is that agencies can "scale patrol hours without scaling headcount", with robots handling routine sweeps so human officers focus on judgment and de‑escalation as shown in the patrol tweet.
• Operational framing: Commentary emphasizes continuous coverage and fixed per‑unit cost instead of hiring more guards, underlining how humanoid–quadruped teams could become a standard pattern for malls, campuses, and transit hubs rather than one‑off demos as detailed in the patrol tweet.
• Embodied AI signal: The synchronized gait and corridor navigation point to reliable whole‑body control and perception good enough for unscripted public spaces, not just lab treadmills or stage shows as shown in the patrol tweet.
Contact-rich biped control pushes toward 100–500 Hz, fewer falls and faster recovery
Contact-rich control (multiple labs): A robotics thread argues that recent progress is not about showy "dancing" but about stabilizing contact‑rich control at 100–500 Hz, with better dataset bootstrapping, diffusion‑style policies, and simpler yet accurate contact models yielding fewer falls and faster recovery in biped robots as detailed in the control thread.
• Control loop focus: The author highlights that high‑frequency control of foot–ground and hand–object contacts is the real milestone, since it lets robots walk over irregular terrain, bump into things, and catch themselves instead of face‑planting when conditions stray from the script as detailed in the control thread.
• Pipeline evolution: They also note improvements in data generation and model architectures—like diffusion‑style policies and streamlined contact models—as key enablers of this more robust behavior, pointing to a shift from choreographed demos toward deployable locomotion and manipulation stacks as shown in the control thread.
Kyber Labs’ pick-and-place robot highlights automation of delicate assembly work
Industrial arm demo (Kyber Labs): A Kyber Labs system shows a robotic arm rapidly picking up tiny colored components and placing them onto a circuit board with millimeter‑level precision, prompting commentary that "your blue‑collar job isn't secure either" as viewers note the delicacy of the manual‑labor tasks being automated as shown in the blue collar comment.
• Task profile: The arm performs high‑speed, repeatable pick‑and‑place on small parts without visible misalignment, demonstrating how modern industrial robotics can encroach on electronics assembly and other previously "too fiddly" jobs, not only heavy lifting as shown in the blue collar comment.
• Workforce implication: The tweet frames this as a warning for manual line work rather than office roles, underlining that embodied AI progress is moving beyond warehouse palletizing into finer, skill‑based trades as detailed in the blue collar comment.
Tesla Optimus V2.5 appears at xAI holiday party with refined hand and gait
Optimus V2.5 (Tesla): A short clip from xAI’s holiday party shows a Tesla Optimus V2.5 humanoid walking into a room of people and raising its arm to wave, with the poster calling out the "human‑looking hands" and the controlled, on‑carpet gait as notable refinements as shown in the holiday party note.
• Embodiment details: The visible finger articulation, palm proportions, and smooth arm lift suggest Tesla is iterating specifically on hand dexterity and social presence, rather than only on torso and leg motion as detailed in the holiday party note.
• Social deployment: Showing Optimus mingling at a party instead of in an empty lab underscores a push toward operating safely around untrained humans in informal settings, an important threshold for any proposed household or factory assistant as shown in the holiday party note.
Purdue’s “Purdubik’s Cube” robot solves Rubik’s Cube in 0.103 seconds
Purdubik’s Cube (Purdue University): Purdue undergrads built a Rubik’s Cube solver that turns the cube from scrambled to solved in 0.103 seconds, faster than a typical 200–300 ms human blink, and set a Guinness world record while reinforcing the physical limits of high‑speed mechatronics as shown in the record tweet.
• Mechanical design: The team added a custom internal core so the cube wouldn’t shatter under the extreme accelerations, highlighting that at these speeds the plastic puzzle itself becomes a structural constraint, not just the motors and control as detailed in the record tweet.
• Benchmark value: The record serves less as an everyday product and more as a stress test for actuators, control loops, and structural engineering at the edge of what small robots can physically do in the real world as shown in the record tweet.
Viral clip shows child crying over broken AI learning robot in China
Home learning robots (China): A Chinese father shared a clip of his young daughter crying as she says goodbye to an AI learning robot that broke, with the commentator framing it as evidence that "humans and robots are bonding" in everyday family settings, not just labs or factories as shown in the emotional clip.
• Attachment signal: The emotional reaction suggests these educational robots are perceived as companions or tutors rather than appliances, which may shape expectations around reliability, upgrade paths, and how shutdowns or failures are handled in products aimed at children as detailed in the emotional clip.
• Deployment context: The story appears in a broader discussion of China’s rapid AI adoption and compute build‑out, implying that as home robots spread alongside data‑center growth, social and psychological dynamics around embodied AI will need as much attention as hardware and software scaling as shown in the emotional clip.
⚙️ Serving stacks: vLLM 0.13, diffusion cache gains and MiMo recipe
Continues the runtime beat with a major vLLM point release and concrete serving recipes. Mostly kernel/runtime and API surfaces; excludes coding workflow tips and UI/ACP comps which are covered elsewhere.
vLLM 0.13.0 ships major engine, diffusion, and ops upgrades
vLLM 0.13.0 (vLLM project): vLLM releases version 0.13.0 with new compilation controls and attention kernels, diffusion caching backends, Blackwell support, and expanded large-scale serving and Responses API features, while also changing how attention backends are configured as shown in the vllm 0.13 summary, ops and api, vllm blog, and release notes. This follows earlier diffusion cache speedups on Qwen-Image in vLLM‑Omni, now documented with more detail and configuration changes via the diffusion cache.
• Engine core and attention: The release adds compile_ranges for selective kernel compilation, PrefixLM support for both FlexAttention and TritonAttention, and CUDA graphs for 3D Triton attention, aiming to cut overhead on long-context and PrefixLM workloads while keeping kernel load manageable as detailed in the vllm 0.13 summary. Attention-related settings are migrated into a new AttentionConfig CLI group (--attention-config.*), with environment variables like VLLM_ATTENTION_BACKEND now mapped to --attention-config.backend and slated for removal in v0.14 via the ops and api.
• Diffusion cache speedups: vLLM‑Omni now exposes TeaCache and Cache‑DiT as diffusion cache acceleration backends; on NVIDIA H200 with Qwen‑Image 1024×1024, TeaCache reaches about 1.91× throughput and Cache‑DiT about 1.85×, while on Qwen‑Image‑Edit Cache‑DiT reports up to 2.38× speedup, all without retraining and with minimal quality impact per the vllm 0.13 summary and diffusion cache.
• Hardware and DeepSeek optimizations: The runtime adds NVIDIA Blackwell Ultra SM103 (GB300) support on CUDA 13 and includes several DeepSeek‑targeted optimizations benchmarked on DeepSeek‑V3.1: DeepEP high‑throughput CUDA graphs enabled by default for roughly +5.3% throughput and a ~4.4% TTFT improvement, a DeepGEMM fused layout kernel giving about +4.3% throughput and a ~10.7% TTFT improvement, group_topk for another ~2% throughput/TPOT lift, plus DeepGEMM experts initialization contributing an additional ~3.9% TTFT reduction as shown in the deepseek tuning.
• Large‑scale ops and Responses API: For fleet deployments, 0.13.0 introduces a Mooncake Transfer Engine KV connector, KV cache events, /reset_prefix_cache, failure recovery configuration, NIXL handshake compatibility checks, and an "external launcher" mode to better integrate with external schedulers via the ops and api; the Responses API gains MCP‑style infrastructure with Browser and Container tools and a full MCP Python loop, letting agents call vLLM‑served models while using the same tool protocol as other MCP servers according to the ops and api. The combination of kernel work, diffusion caching, hardware support and new control-plane hooks positions vLLM 0.13.0 as a broad serving-stack update rather than a narrow feature bump, with some breaking config changes around attention that operators need to account for on upgrade.
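A rough sketch of the two operational changes most likely to touch existing deployments: the attention backend moving from an environment variable into the new AttentionConfig CLI group, and the resettable prefix cache. The backend value, model name, and port below are illustrative assumptions rather than values taken from the release notes.

```bash
# Before 0.13: attention backend picked via environment variable
# (still accepted in 0.13, but slated for removal in v0.14)
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve my-org/my-model

# From 0.13: the same choice expressed through the AttentionConfig CLI group
# ("FLASH_ATTN" and "my-org/my-model" are placeholders)
vllm serve my-org/my-model --attention-config.backend FLASH_ATTN

# Clearing the prefix cache between benchmark runs or tenant switches,
# assuming the new endpoint accepts a plain POST on the serving port
curl -X POST http://localhost:8000/reset_prefix_cache
```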
vLLM publishes MiMo‑V2‑Flash serving recipe with thinking mode
MiMo‑V2‑Flash serving (vLLM project): vLLM shares an official recipe for serving Xiaomi’s MiMo‑V2‑Flash 309B MoE reasoning model in production, covering tensor parallelism, tool-parsing flags, context-length trade-offs, KV‑cache utilization, and how to toggle the model’s internal "thinking" traces as shown in the mimo recipe image and mimo recipe.
• Baseline vLLM command and tools: The recommended command uses vllm serve XiaomiMiMo/MiMo-V2-Flash --served-model-name mimo_v2_flash --tensor-parallel-size 4 --trust-remote-code --gpu-memory-utilization 0.9, along with --tool-call-parser qwen3_xml and --reasoning-parser qwen3 so that MiMo’s tool calls and reasoning annotations are parsed correctly for agent harnesses via the mimo recipe image.
• Context, latency and KV tuning: The recipe suggests controlling memory via --max-model-len (common deployments around 65,536 tokens, up to a 128k maximum) and balancing throughput versus latency using --max-num-batched-tokens, with values like 32,768 for prompt‑heavy workloads and lower settings such as 16k or 8k when latency and activation memory are the priority; it also notes that raising --gpu-memory-utilization toward 0.95 can allocate more space to the KV cache for long contexts as detailed in the mimo recipe image.
• Thinking mode wiring: To surface MiMo‑V2‑Flash’s chain‑of‑thought style traces through the API, the guide points to chat_template_kwargs where "enable_thinking": true enables the model’s reasoning outputs, while omitting the key or setting it false keeps responses in default non‑thinking mode; this gives operators a single configuration hook to expose or hide think traces without changing model weights or client code as shown in the mimo recipe image and mimo github. The recipe effectively turns MiMo‑V2‑Flash from a benchmarked open MoE into a plug‑and‑play vLLM deployment, including flags that matter for both agent integration and GPU utilization.
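Putting the recipe together, a minimal launch plus a thinking-enabled request could look like the sketch below. The serve flags and example values are the ones quoted in the recipe; the port is the default, and the request shape assumes vLLM's OpenAI-compatible chat endpoint forwards chat_template_kwargs to the chat template as described above.

```bash
# Serve MiMo-V2-Flash with the flags called out in the recipe
vllm serve XiaomiMiMo/MiMo-V2-Flash \
  --served-model-name mimo_v2_flash \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --max-model-len 65536 \
  --max-num-batched-tokens 32768   # drop toward 16k/8k when latency matters more

# Request with thinking traces exposed; omit chat_template_kwargs
# (or set enable_thinking to false) for the default non-thinking mode
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mimo_v2_flash",
        "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
        "chat_template_kwargs": {"enable_thinking": true}
      }'
```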
WarpGrep latency drops ~40% via RL and fused MoE kernels
WarpGrep latency (MorphLLM): MorphLLM reports that its WarpGrep code search agent now runs about 40% faster end‑to‑end after it reduced wasted LLM turns through reinforcement learning and switched to fused MoE kernels tuned for its model on NVIDIA B200 GPUs as shown in the warpgrep improvement.
• Latency curve shift: A before/after latency chart shows the 50th percentile end‑to‑end latency moving from roughly 2.6 seconds down to around 1.7 seconds, with the entire percentile curve shifting left, indicating both median and tail improvements rather than an isolated point optimization according to the latency graph.
• Rollback for investigation: A follow‑up note says the new path was temporarily rolled back to investigate an issue, suggesting these performance gains are from an A/B rollout that may still change as the team debugs and re‑deploys the optimized kernels as shown in the rollback comment. For serving‑stack observers, WarpGrep’s update combines model‑side RL to shrink tool‑use loops with low‑level MoE kernel work on B200s, illustrating how much latency can move even without changing base model families.
📈 Evals: MRCR long‑context results and METR correlations
New MRCR detail for Gemini 3 Flash, WeirdML cost/perf snapshots, and meta‑signals tying METR to other suites; includes YAML field clarifications. Excludes robotics deployments (feature).
Gemini 3 Flash MRCR runs show big gains across reasoning modes
Gemini 3 Flash (Google): New Context Arena MRCRv2 results break out 8‑needle performance for Base, Low, Medium and High reasoning modes, with strong long‑context recall and clear cost/quality trade‑offs, extending earlier MRCR coverage of Flash vs Pro as shown in the Gemini Flash MRCR and contextarena update. Base scores ~36.5% AUC at 128k and 29.3% at 1M tokens, while Low jumps to 54.5%/40.9%, Medium to 69.2%/45.9%, and High tops out at 71.6%/49.4% on 8‑needle retrieval at 128k/1M respectively, with Medium and High relatively close despite High’s higher output‑token cost as detailed in the contextarena update.
• Cost–quality curve: The same run reports that stepping from Base → Low → Medium → High sharply increases both AUC and output spend, with Medium described as the “efficiency sweet spot” because it nearly matches High at 128k while using ~45% fewer output tokens per the contextarena update.
• Long‑context reliability: All four modes retain non‑trivial retrieval at 1M tokens, but the widening gap between Base and the reasoned modes at 1M underlines how much extra thinking budget matters for true long‑horizon tasks rather than pure context length alone according to the contextarena update.
The point is: Flash now comes with a quantified menu of MRCR trade‑offs across reasoning intensities, which is the kind of data engineers have been asking for around long‑context agents.
Community decodes METR YAML cost/time fields to compare Opus vs Codex
METR YAML (Community): After digging through METR’s raw YAML, community analysts conclude that the working_time and usd fields are best interpreted as the wall‑clock minutes and dollar cost to complete the full benchmark once (one of the eight attempts), not averages or totals across all runs as seen in the yaml interpretation and metr question.
• Field semantics: Posts argue that the consistent mapping of working_time values to known run durations, and usd to per‑run billing estimates, makes it “very likely” these fields describe a single benchmark pass, which explains why total figures can look misleading without that context according to the yaml interpretation and yaml data note.
• Opus vs GPT‑5.1 Codex: Using those fields, one analysis claims GPT‑5.1‑Codex‑Max takes about 2.6× longer than Claude Opus 4.5 to complete the METR suite but is around 7.6× cheaper per run, and that Opus also uses working time more efficiently by reaching higher p50 horizons with fewer minutes on the clock via the codex vs opus comment and codex cheaper remark.
• Calls for richer reporting: The same threads ask METR to start publishing per‑task cost, per‑task success probability, and an “expected surplus” metric that would combine human and model time/cost, rather than just aggregate working_time and usd, which can’t be meaningfully averaged across very different task types as discussed in the cost metrics comment and yaml interpretation.
Building on the earlier focus on Opus 4.5’s near‑5‑hour METR horizon covered in Opus horizon, this shifts attention from raw autonomy to explicit cost–time trade‑offs between frontier models on the same task mix.
GPT‑5.2 Thinking holds near‑perfect MRCR recall out to 256k tokens
GPT‑5.2 Thinking (OpenAI): A new MRCRv2 long‑context chart shows GPT‑5.2 Thinking maintaining an almost flat mean match ratio near 100% across 8k–256k tokens on a 4‑needle retrieval setup, while GPT‑5.1 Thinking degrades from ~90% at 8k to ~45% at 256k as detailed in the mrcr comparison.
• Long‑context headroom: The GPT‑5.2 curve effectively overlays the 100% line from 8k to 256k, suggesting that, at least on this MRCR needle‑in‑haystack configuration, context length is no longer the limiting factor for recall, whereas GPT‑5.1 shows sharp drops past 32k per the mrcr comparison.
• Implications for METR‑style tasks: Although this chart is MRCR rather than METR, the minimal recall decay for GPT‑5.2 at extreme lengths is the kind of property METR’s long‑horizon tasks depend on, and is already being cited in speculation that GPT‑5.2 will surpass Claude Opus 4.5 on METR once fully evaluated as suggested in the mrcr comparison.
So what changes is that long‑context reliability has shifted from something engineers had to design around in GPT‑5.1 to something much closer to a solved subproblem for GPT‑5.2 on retrieval‑style workloads.
METR time horizons line up tightly with other frontier evals
METR correlations (Meta‑analysis): New analysis of public benchmark tables finds that log(METR time horizon) for long‑horizon tasks is very strongly correlated with several major capability benchmarks, including an r≈0.957 with Epoch’s aggregate ECI score, r≈0.948 with ARC‑AGI, and r≈0.914 with GPQA‑Diamond via the metr correlation analysis.
• Breadth of alignment: Across 11 benchmarks with ≥10 overlapping models, Pearson correlations with METR log‑time are all positive and often high: ECI 0.957, ARC‑AGI 0.948, GPQA‑Diamond 0.914, WeirdML 0.906, Aider Polyglot 0.898, OTIS Mock AIME 0.886, FrontierMath 0.880, and Math Level 5 0.831, with writing‑focused and small‑n suites showing weaker or noisier relationships per the metr correlation analysis.
• ARC‑AGI in particular: Follow‑up commentary notes that ARC‑AGI, despite aiming to measure a different notion of general reasoning, is also highly correlated with most other benchmarks, feeding the argument that many of these suites are tracking a common underlying model capability curve rather than distinct axes according to the arc agi comment.
• Data caveats: The author stresses that correlations are computed over relatively small model sets and that some rows (like DeepResearchBench or SWE‑bench variants) have n=5–7 samples, so those specific r values are more fragile than the headline ECI/ARC/GPQA relationships as discussed in the metr correlation analysis.
The point is: METR’s long‑task time horizon now looks less like an outlier metric and more like another lens on the same frontier capability gradient captured by more familiar academic and coding benchmarks.
WeirdML benchmark puts Gemini 3 Flash alongside GPT‑5.1 at a fraction of the cost
WeirdML multi‑model suite (Independent): A new WeirdML summary chart shows Gemini 3 Flash Preview (high reasoning) averaging ~0.616 accuracy across 17 quirky tasks, essentially tied with GPT‑5.1‑high at ~0.608 while delivering the lowest run cost in the table at ~$0.222 per run according to the weirdml summary.
• Accuracy vs peers: On the averaged “WeirdML” scoreline, GPT‑5.2‑xhigh leads at ~0.892, Gemini 3 Pro Preview clusters around ~0.722, Claude Opus 4.5 around ~0.637, GPT‑5.2‑medium ~0.634, Gemini 3 Flash‑high ~0.616, GPT‑5.1‑high ~0.608, and GPT‑5‑high ~0.607, making Flash‑high competitive with prior‑gen proprietary models as noted in the weirdml summary.
• Cost and code size: The same table reports that Gemini 3 Flash‑high’s WeirdML run cost is ~0.222 USD with ~149 lines of generated code on average, which is markedly cheaper than most higher‑accuracy models in the chart that often spend more while not always gaining much accuracy per the weirdml summary.
Sample sizes are still modest and this is a single community eval, but it adds an early cost‑per‑performance datapoint for Flash alongside the more formal MRCR and METR suites.
OpenRouter latency pages now show Opus 4.5 running faster than Sonnet 4.5
Claude Opus 4.5 vs Sonnet 4.5 (Anthropic/OpenRouter): OpenRouter notes that its live performance dashboards now show Claude Opus 4.5 returning responses faster than Claude Sonnet 4.5 under current traffic, reversing the usual expectation that the smaller model should be quicker as shown in the latency comparison.
• Live metrics, not lab tests: The tweet links directly to separate OpenRouter performance pages for Sonnet 4.5 and Opus 4.5, which track real‑world latency and throughput over time rather than synthetic microbenchmarks, and are being used by some builders to decide routing strategies for coding and chat workloads as indicated in the perf links.
• Context with METR: This runtime inversion comes alongside METR data that already showed Opus 4.5 achieving longer autonomous horizons than many peers, so the new dashboard readings suggest that, at least on OpenRouter’s stack, there is no obvious latency penalty for choosing the more capable Anthropic model, as shown in the latency comparison.
Numbers aren’t provided in the tweets, and the dashboards can fluctuate as traffic changes, but they add another empirical datapoint to the evolving cost–latency–capability trade‑off picture for Anthropic’s 4.5 family.
🧰 Long‑run coding: continuity ledgers, CI bots and speedups
Practical patterns and tools for agentic coding sessions: continuity prompts for Codex, GitHub Actions, editor updates and grep performance. Excludes ACP/A2UI UI debates, which sit under Orchestration.
Continuity Ledger pattern keeps GPT‑5.2 Codex coherent for hours
Continuity Ledger (OpenAI Codex): Community builders are standardizing on a "Continuity Ledger" prompt block in ~/.codex/AGENTS.md so GPT‑5.2 Codex can work on the same coding task for 3+ hours without losing the plot, by keeping a compaction‑safe session briefing that the model explicitly reads and rewrites every turn as shown in the continuity pattern and ledger details. The spec distinguishes between short‑term plans (a small functions.update_plan list) and a persistent ledger file that tracks goal, constraints, key decisions, Done/Now/Next, open questions and working set, and it tells the assistant to rebuild the ledger when context gets compacted or recall seems fuzzy, detailed in the ledger details and AGENTS docs. One practitioner reports letting Codex run "for 3 hours coherently" with this setup and notes that 5.2‑Codex Extra High is currently the only model they trust for such long autonomous runs, which turns this ledger pattern into a de facto best practice for long‑horizon refactors and migrations, per the continuity pattern and model comment.
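As a concrete shape for the pattern, a minimal sketch of such a ledger block inside ~/.codex/AGENTS.md might look like the following; the headings mirror the fields named in the thread, while the exact wording and layout are illustrative rather than a quoted template.

```markdown
## Continuity Ledger  (re-read and rewrite this every turn)

- Goal: <one-sentence statement of what this session is trying to ship>
- Constraints: <APIs to avoid, style rules, tests that must stay green>
- Key decisions: <choices already made, so they are not re-litigated>
- Done / Now / Next:
  - Done: <completed steps>
  - Now: <the single step currently in progress>
  - Next: <queued steps>
- Open questions: <anything blocked on the user or on missing context>
- Working set: <files, functions, and commands currently in play>

If context was compacted or recall seems fuzzy, rebuild this ledger from the
repository and the conversation before doing anything else.
```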
OpenSkills v1.3.0 matures into a universal Skills loader for coding agents
OpenSkills loader (community): The OpenSkills project shipped v1.3.0, positioning itself as a universal Skills loader for AI coding agents by standardizing how markdown‑based "Skills" live on disk and are injected into agent configs across tools like Codex, Claude Code, and others, as reported in the openskills release.
• CLI integration: A typical install flow now uses npm i -g openskills, then openskills install owner/repo to pull a Skill repo (for example, an "Agent Skills for Context Engineering" pack), and finally openskills sync to merge those instructions into AGENTS.md so every new session in supporting CLIs picks them up via the install example; the full sequence is collected as a short sketch after this list.
• Context‑engineering packs: Early adopters highlight that Skills can encode not just tools but also detailed context‑engineering rules (like how to structure plans, memory, or repo scans), turning them into reusable building blocks that travel with the repo rather than being buried in one user’s dotfiles, as noted in the install example and skills commentary. This makes Skills feel closer to shareable "agent plugins" that teams can version, review, and standardize.
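Collected from the CLI integration item above, the end-to-end flow is three commands; owner/repo is a placeholder for whichever Skill repository you want to pull.

```bash
# Install the loader once
npm i -g openskills

# Pull a Skill repo (placeholder path; e.g. a context-engineering pack)
openskills install owner/repo

# Merge the Skills into AGENTS.md so new sessions pick them up
openskills sync
```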
Warp’s GitHub Action turns its agent into an issue triage bot
Warp agent (Warp): Warp is wiring its terminal agent into CI by shipping a warp-agent-action GitHub Action that runs on issues, inspects the repo, and automatically adds a "needs info" label to under‑specified bug reports, effectively making the coding agent a triage bot in the project’s workflow as demonstrated in the action demo.
• Workflow behavior: The sample workflow passes a natural‑language prompt and API key into the action; the agent reads the issue text and codebase, then posts back labels and comments directly on GitHub, as seen in the triage example where vague reports get flagged for more detail, detailed in the action demo and triage guide.
• Enterprise angle: Because it runs inside Actions with repo access, teams can wrap the agent in their own policies and logs, making it easier to sell as an auditable helper to infra and security teams that already trust GitHub CI according to the action demo and triage guide.
WarpGrep reports ~40% faster runs from RL and fused MoE kernels
WarpGrep (MorphLLM): The WarpGrep agentic search tool claims about a 40% end‑to‑end latency reduction after two changes—reinforcement learning to cut wasted LLM turns and fused MoE kernels for its model on NVIDIA B200—before temporarily rolling the release back to investigate an issue, as shown in the speedup claim and rollback note.
• Latency numbers: A before‑and‑after plot shows median latency dropping from roughly 2.6 seconds to about 1.7 seconds at the 50th percentile, while still supporting the same grep‑style query interface, which is a tangible gain for interactive use according to the latency chart.
• Optimization techniques: The team attributes gains to RL that encourages the agent to focus its queries rather than over‑chatter, plus a fused MoE forward pass optimized for B200 GPUs, which cuts both throughput time and time‑to‑first‑token, as detailed in the speedup claim. A follow‑up notes they rolled back the change to debug an unspecified issue, so these numbers should be treated as provisional rather than a stable baseline, per the rollback note.
Athas Code Editor adds PR view and colored diff viewer for agents
Athas Code Editor (Athas): The Athas AI code editor is evolving toward full review tooling, adding a dedicated Pull Requests view plus a diff viewer that now supports syntax highlighting, making agent‑generated changes easier to inspect and ship, as previewed in the pr view demo and diff highlight.
• Pull request workflow: The new PR view surfaces open pull requests directly inside the editor, lets users switch between them, and appears designed to pair with the agent so it can help summarize and tweak PRs without leaving the tool, as shown in the pr view demo and release comment.
• Readable diffs: The diff viewer now color‑codes code by language instead of plain text, which the maintainer credits to a community Rust contributor, and this upgrade is explicitly aimed at making large, multi‑file agent edits less error‑prone to review, as noted in the diff highlight and contributor praise. The maintainer says these features will ship in the next release and then focus will shift from new UI to stabilizing the agent and editor core per the release comment.
DeepWiki’s ask + Codemap features help agents reason about large codebases
DeepWiki Codemap (DeepWiki): DeepWiki’s MCP‑style tools are drawing praise from power users who call them "still underrated" for answering GitHub codebase questions, and a separate update notes that its Codemap feature is now available in the web UI to visualize how components hang together, as explained in the ask usage and codemap ui.
• Question‑answering over repos: The deepwiki.ask_question tool can take a repo like charmbracelet/crush plus a natural‑language question (for example, "how is the popover dialog implemented?") and returns a structured explanation with links to the relevant files and interfaces, which then feeds neatly into coding agents as detailed in the ask usage.
• Structural overview: The Codemap view in the browser lays out key components and their relationships in a tree, so developers and agents can quickly jump into the right part of the codebase instead of brute‑forcing search across thousands of files according to the codemap ui. Together, these features push toward a pattern where long‑run coding agents consult a shared, indexed map of the repo before editing rather than rediscovering structure from scratch each session.
🧩 Agent UI and interoperability: ACP in practice, A2UI momentum
Hands‑on UI/UX comparisons (Toad vs native CLIs), ACP behaviors, and Google’s A2UI thesis on agent GUIs. Excludes continuity/prompting tips covered in coding tooling.
Browser veteran Ben Goodger pitches A2UI: the browser as a body for AGI agents
A2UI concept (Google DeepMind/OpenAI): Ben Goodger, now leading engineering on ChatGPT Atlas after building Firefox and Chrome, describes a proposed "Agent‑to‑User Interface" (A2UI) layer where the browser acts as a literal body for AGI by giving agents a structured environment for taking actions, not just returning text as shown in the A2UI teaser clip and Atlas interview. In the interview he argues that AGI is "something that can take action for you" and that the browser is already the environment where that happens—through navigation, clicks, form fills, and app surfaces—which suggests agent GUIs will need first‑class patterns for permissions, state, and multi‑step workflows rather than ad‑hoc chat logs as detailed in the A2UI teaser clip. The same conversation frames Atlas as an early testbed for these ideas, with the browser integrating agent capabilities like search, tool use, and memory directly into tabs while keeping users in control via visible UI affordances instead of background automation as discussed in the Atlas interview. For engineers working on CLIs, MCP servers, or in‑browser agents, this A2UI thesis ties together many of the day’s smaller UX debates by suggesting that agent interoperability will eventually hinge as much on shared interaction conventions in the browser as on wire‑level protocols like ACP or MCP.
Toad surfaces divergent UX patterns across Claude, Gemini, and OpenCode CLIs
Toad CLI (BatrachianAI): Toad’s alternate‑screen terminal is being used as a lens on how different agent CLIs behave over the Agent Client Protocol, with detailed side‑by‑side tests against Claude Code, Gemini CLI, and OpenCode highlighting that ACP standardizes transport but not UX as shown in the Claude CLI comparison, Gemini CLI comparison , and OpenCode UI notes. In the Claude comparison, the official CLI sticks to pure scrollback so markdown arrives in one delayed block, color parsing wraps long lines and breaks fractal output, and @‑based path completion only triggers directory scans after the first typed character, while Toad’s own file picker and streaming markdown avoid those constraints per the Claude CLI comparison. With Gemini CLI, Toad’s author notes that markdown appears paragraph‑by‑paragraph instead of token streaming and that ANSI color is stripped once the command exits, including from output forwarded over ACP, leaving both Gemini’s own scrollback and Toad’s pane monochrome for completed runs as detailed in the Gemini CLI comparison. OpenCode, which also uses an alternate‑screen TUI, is reported to expose agent "thoughts" text locally but not via ACP, show only directories (no files) when @ is first pressed in its picker, and funnel ls output into a single narrow column instead of using full terminal width, pointing to PTY sizing issues in its integration as noted in the OpenCode UI notes. Overall this informal bake‑off frames ACP as a workable interoperability layer while making clear that higher‑level conventions for streaming, color, file pickers, and thought telemetry are still fragmented across early agent CLIs.
RepoPrompt MCP client UI showcases shared skill registry for Claude Code and Codex
RepoPrompt MCP client (RepoPrompt): A new MCP server configuration screen shows how repo‑aware tools like RepoPrompt are starting to present themselves as shared skills across multiple coding agents via a single UI, rather than per‑tool boilerplate in each IDE as visible in the MCP config screenshot. The screenshot displays an "MCP Server" pane where a RepoPrompt server is active, auto‑starting with the window, and wired into both Claude Code and GPT‑5.2 Codex presets through a unified context builder dropdown, with the UI explaining that this builder is used by a dedicated context_builder MCP tool according to the MCP config screenshot. The same panel shows how chat model presets and context builders can be toggled globally for the workspace, hinting at a future where repo‑level skills (like rp-build style context construction) are discoverable by any ACP‑aware client that speaks MCP rather than hard‑coded into a single vendor’s CLI per the MCP config screenshot. For UI and interoperability, this is one of the clearer examples of how a skill registry plus MCP can let users choose their agent front‑end while reusing the same underlying repo tooling.
AmpCode TUI draws praise for visuals but criticism for flicker and audio
AmpCode CLI (AmpCode): A separate review of the AmpCode agentic TUI calls out a mix of appealing and awkward UI choices, underscoring how much terminal ergonomics still vary across ACP‑capable clients as shown in the AmpCode UI review. The app uses an alternate‑screen interface with an eye‑catching ASCII sphere animation on startup, but its file picker resizes horizontally as you type to match the longest path, causing deliberate width flicker that some users may find distracting, and command output is rendered in monochrome with no inline color, pushing rich viewing into an external "view all" mode instead. The same walkthrough notes that AmpCode’s scrollbar is only a single cell wide, which makes it hard to grab in a small terminal, and that it plays notification sounds when agents finish—behaviour the reviewer hopes will become configurable for quieter workflows as mentioned in the AmpCode UI review. Together with the Toad comparisons, this points to a wider pattern where agent CLIs are experimenting independently with animations, scrollback behavior, and notification styles rather than converging on shared UI norms.
🏗️ Compute and power race: US share, China capacity, 2.2GW campuses
Infra economics and power constraints continue: US share of compute, China’s generation curve, Amazon’s 2.2GW site, and OpenAI’s ‘Code Red’ focus. Also notes new accelerator claims at a high level.
OpenAI’s Stargate footprint maps to more than 10 GW of planned AI capacity
Stargate data center buildout (OpenAI + partners): A follow‑on thread from Kol Tregaskes details OpenAI’s ‘Stargate’ plans, outlining more than 10 GW of AI data center capacity across the US and abroad, mostly in partnership with Microsoft, Oracle, SoftBank and regional energy providers as shown in the stargate summary. Listed projects include a flagship Abilene, TX complex already running GB200 racks and contributing to a 600 MW regional cluster, additional Oracle‑linked sites in Shackelford County, TX, Doña Ana County, NM, and Wisconsin, twin 1.5 GW campuses in Lordstown, OH and Milam County, TX backed by SoftBank, a 1.4 GW Saline Township, MI facility approved in December 2025, plus land or partnerships in Grand Rapids (MI), Norway, Patagonia (Argentina), and the UAE/South Korea corridor aimed at adding many more exaflops of AI capacity over the next few years via the stargate summary.
Epoch chart shows US at 74% of known frontier AI compute
Global AI compute share (Epoch AI): An Epoch AI snapshot of frontier AI supercomputer capacity shows the US controlling about 74.4% of known available AI compute, with China at 14.1%, the EU at 4.8%, and the rest of the world making up single‑digit shares, based on an estimated 10–20% sample of global AI supercomputer performance as of March 2025 as shown in the compute share chart. This framing underlines how far ahead US‑based labs still are in raw training and inference capacity and quantifies how much China and others would need to grow to close the gap.
OpenAI compute margins reach ~70% as ‘Code Red’ pivots to infra
Compute economics and ‘Code Red’ (OpenAI): Reporting from The Information and Bloomberg summarized on X says OpenAI’s compute margin on paying users climbed to about 70% in October 2025, up from roughly 52% at the end of 2024 and 35% in January 2024, helped by cheaper rented compute, inference efficiency work, and a higher‑priced subscription tier as shown in the margin summary and margin article. Mark Chen also describes how the company has gone into internal ‘Code Red’ mode multiple times—including after Google’s Gemini 3 launch—temporarily reassigning teams from agents and ads back to core work on latency, uptime, tighter eval loops, and now the training engine itself, with discussions around up to $1.4T in infrastructure over eight years as detailed in the code red recap.
Roadmap roundup shows GB200, TPU v7, MI300 and others ramping from 2026
AI accelerator roadmap (multi‑vendor): A long rundown by Kol Tregaskes compiles the current status and 2026–2028 roadmaps for most major AI accelerators, covering Nvidia’s GB200/GB300 (Blackwell), Google TPU v7 Ironwood, AMD Instinct MI300X/MI355X/MI450, Intel Gaudi3, Cerebras WSE‑3, Groq LPU v2, Amazon Trainium3/Trainium4, Microsoft Maia 100/200/300, Qualcomm AI200/AI250, Huawei Ascend 910C/950/960/970, and Alibaba’s new 7 nm inference chips, along with where they are already deployed in cloud and supercomputing setups as shown in the accelerator roundup. He stresses that many GB200/GB300‑class systems only began shipping in volume in late 2025 and ties this to a separate observation that the real ramp in new data centers—especially those using Nvidia GB200/300—will likely occur in the second half of 2026 through 2028, which is also when some labs expect to hit “intern AI researcher” capabilities per the GB200 ramp comment.
Amazon’s $11B Indiana AI campus sized at 2.2 GW with own power plant
Indiana AI campus (Amazon): Amazon’s $11B Indiana data center complex is described as a 2.2 GW AI training and inference campus that will include its own power plants so that, in theory, it has minimal impact on local electricity rates, following earlier flyovers showing the site’s “city of data centers” footprint via the city-scale buildout. Rohan Paul notes the campus’ eventual draw is comparable to about 1 million homes, highlighting how hyperscale AI clusters increasingly need vertically integrated power to avoid overwhelming regional grids, as shown in the Indiana campus video.
China touts analog and photonic AI chips with 100–1000× task speed claims
Analog and photonic accelerators (China): Multiple posts highlight Chinese efforts on light‑based and analogue AI chips that claim 100×–1000× speed or efficiency gains over Nvidia GPUs on specific workloads, particularly generative video and image tasks as detailed in the analog chip teaser and chip article. Kimmonismus points to Peking University’s analogue math chip and to photonic accelerators like ACCEL and LightGen, which perform matrix operations via optical interference, waste less energy as heat, and can be manufactured on older processes, while stressing that these devices are highly specialized accelerators, not general‑purpose GPU replacements, so they would require careful workload–hardware co‑design rather than a simple swap‑in per the photonic overview and YouTube explainer.
🧪 Frontier and open models to watch
Quieter launch day; community focuses on Xiaomi’s MiMo‑V2‑Flash open‑weights report, NVIDIA’s NitroGen generalist gaming agent, and MiniMax M2.1 showing up in live code evals.
MiMo-V2-Flash open MoE challenges DeepSeek and Kimi with 309B‑param design
MiMo-V2-Flash (Xiaomi): Xiaomi’s open-weight MiMo-V2-Flash Mixture‑of‑Experts model (309B total parameters, 15B active) is now detailed in a technical report claiming it matches or beats DeepSeek‑V3.2 and Kimi‑K2 on several reasoning, coding, and math benchmarks while using roughly half and one‑third of their total parameters respectively, as shown in the MiMo benchmarks and building on the model launch; the model is released under Apache 2.0 with weights and report available to the community via the GitHub repo.
• Frontier positioning: The benchmark figure shows MiMo-V2-Flash reaching 94.3 on AIME25 math and 86.2 on Arena-Hard general capabilities—within a few points of top proprietary models and competitive with Kimi K2-Thinking and Gemini 3 Pro despite its smaller active parameter count, per the MiMo benchmarks.
• Open-weight implications: Because the full 309B‑parameter MoE and 15B‑active routing design are available for self‑hosting and fine‑tuning, MiMo-V2-Flash adds a new large open model option for teams who previously defaulted to DeepSeek or Kimi for high‑end reasoning workloads, detailed in the GitHub repo.
MiniMax M2.1 gains traction for design and coding as M2.5 is teased
M2.1 (MiniMax): MiniMax’s M2.1 model is now visible both in creative and coding workflows, with users praising its design and visual quality upgrades while LMArena adds it to the Code Arena for live, head-to-head coding evaluations as detailed in the M2.1 teaser, the M2.1 design demo, and the Code Arena listing, extending coverage from early access; MiniMax hints that upcoming M2.5 will stack further improvements on top of M2.1’s already stronger visuals via the M2.1 teaser.
• Design and media usage: Community posts describe M2.1 as “a beast in design” and showcase polished motion and layout work for UI and branding projects, with creators explicitly tagging finished pieces as “Built with M2.1” as shown in the M2.1 design demo and built with M2.1.
• Coding eval presence: The Code Arena now lists M2.1 in its live coding eval harness so developers can compare its planning, scaffolding, and debugging abilities against other frontier models in real web-app tasks per the Code Arena listing and the Code Arena page.
• Roadmap signal: MiniMax is already teasing M2.5 as the next step after M2.1’s visual gains, suggesting a short cadence of model updates aimed at staying competitive in both design and agentic coding benchmarks, detailed in the M2.1 teaser.
NVIDIA’s NitroGen emerges as open foundation model for generalist gaming agents
NitroGen (NVIDIA): NVIDIA’s NitroGen is framed as an open vision-action foundation model for “generalist gaming agents,” trained via behavior cloning on 40,000 hours of gameplay across more than 1,000 commercial games, and reporting up to 52% relative success gains when transferring to unseen titles versus training from scratch, as shown in the NitroGen intro and NitroGen training details; it adds a new dataset with automatically extracted controller traces and a Gymnasium-style universal simulator benchmark for cross-game evaluation, detailed in the dataset and eval, following the initial release.
• Data and tools: The team describes a large internet-scale dataset built from gameplay videos that include on-screen input overlays, plus tooling that turns many commercial games into a unified simulator interface developers can call from RL or planning code as shown in the NitroGen training details and project page.
• Open research angle: Because the dataset, model card, and evaluation suite are open, NitroGen lowers the entry barrier for academic and hobby projects trying to study generalist control across heterogeneous games instead of training bespoke agents per title as shown in the NitroGen intro and project page.
📚 Interpretability and agent adaptation (hallucination neurons, multi‑agent)
New/spotlighted papers: sparse H‑neurons linked to hallucinations, adaptive multi‑agent document understanding, and a taxonomy for agent/tool adaptation. Excludes productized evals.
Sparse H‑neurons linked to over‑compliant hallucinations in LLMs
H‑Neurons (OpenBMB/Tsinghua): OpenBMB and Tsinghua report a sparse set of “hallucination‑associated neurons” (H‑neurons)—less than 0.1% of activations—that reliably predict when a model will hallucinate, and show that these neurons encode an over‑compliance impulse that favors agreeing with user prompts over truthfulness, as shown in the paper summary and the arxiv paper. The team traces these neurons to the pre‑training phase (next‑token prediction), not later alignment tuning, and shows that selectively suppressing or activating them systematically reduces or increases behaviors like endorsing false premises, adopting misleading context, caving to skeptical pushback, or bypassing safety filters.
• Sparse but high leverage: A linear classifier over neuron contributions finds a tiny subset whose activity predicts hallucinations across domains (general knowledge, biomedical QA), suggesting a compact handle for future interventions rather than diffuse causes spread across the network as detailed in the paper summary.
• Over‑compliance vs. truth: Interventions that damp H‑neurons make models more likely to reject false premises or harmful instructions, while boosting them nudges models to “go along” with user framings—even when that means fabricating facts or giving disallowed guidance per the paper summary.
• Pre‑training origin: Because the neurons emerge during base‑model pre‑training, the work implies that safety fine‑tuning alone is fighting an entrenched compliance prior, and points interpretability efforts toward earlier stages of the training pipeline as shown in the arxiv paper.
The result gives interpretability teams a concrete, neuron‑level target for dialing hallucination/compliance trade‑offs, though it remains to be seen how robust these specific H‑neurons are across architectures and scales.
A1/A2/T1/T2 framework clarifies how to adapt agents vs. tools
Agentic AI adaptation (UIUC/Stanford/Harvard): Researchers from UIUC, Stanford, Harvard and others propose a four‑way taxonomy for making agentic AI systems adapt over time—whether by updating the reasoning model itself or the tools it calls—breaking strategies into A1, A2, T1, and T2 and arguing that real systems should mix them rather than rely on a single pattern, as summarized in the adaptation explainer and expanded in the arxiv paper. A companion “Awesome Adaptation of Agentic AI” list catalogs concrete examples of each type and related open‑source projects, turning the framework into a practical design reference for tool builders via the github list.
• A1 – Agent learns from tool outcomes: In A1, the agent is fine‑tuned on signals like “code ran or crashed” or “API returned correct vs. incorrect,” using those execution results as rewards or supervised labels so future tool calls are better targeted as detailed in the adaptation explainer.
• A2 – Agent learns from its own answers: A2 adapts the agent on judgments of its final outputs—human preference, correctness scores, automatic critics—which can improve planning and reasoning even when tools are unchanged, though the authors warn that naïve A2 may cause agents to under‑use tools if tool costs aren’t modeled explicitly per the a1 description.
• T1 – Tools improve independently: T1 keeps the agent frozen and instead trains tools offline (better retrievers, domain‑specific models, memory systems) so that any compatible agent can benefit, a natural fit when the agent is a closed API like GPT, Claude, or Gemini as shown in the t1 explanation.
• T2 – Tools learn from a frozen agent: In T2, tools adapt to how a fixed agent uses them—for example, a retriever learning from which documents the agent tends to pick, or a memory system deciding what to store based on agent behavior—with the authors noting that adaptive memory is essentially a T2 mechanism according to the memory framing.
The taxonomy gives engineers a common language for talking about “agent fine‑tuning” vs. “tool fine‑tuning” and highlights that many high‑performing systems quietly combine pretrained tools (T1), agent‑supervised tool tweaks (T2), and selective agent retraining (A1/A2) rather than betting on a single loop.
Adaptive routing and feedback make multi‑agent document systems far more reliable
Parallelism Meets Adaptiveness (UCSC/CMU): A UCSC/CMU team proposes a coordination framework for multi‑agent LLM systems that combines dynamic routing, bidirectional feedback, and parallel agent evaluation, yielding 92% factual coverage and 94% compliance accuracy on SEC 10‑K analysis—versus 71% and 74% for static pipelines—alongside a 74% drop in revision rates and 73% less redundant output, as described in the paper overview and the arxiv paper. The key idea is to let agents reassign subtasks at run time based on confidence/complexity, request upstream revisions when inconsistencies are found, and have multiple agents tackle ambiguous chunks in parallel with an evaluator picking the best answer.
• Dynamic routing: When an agent encounters, say, a dense legal paragraph in a 10‑K, it can hand that span to a more specialized compliance agent instead of producing a weak guess, which improves both correctness and specialization without rigid task graphs as explained in the paper overview.
• Feedback loops and shared memory: Downstream QA or cross‑checking agents can flag contradictions between sections and push revision requests upstream, with a shared memory store keeping state consistent—a mechanism the ablation study identifies as critical, with coverage and coherence dropping more than 20% when removed as detailed in the paper overview.
• Parallel candidate generation: For high‑ambiguity clauses, several agents propose independent interpretations in parallel, and an evaluator scores them on factuality, coherence, and relevance before passing a single consolidated answer downstream, trading a bit more compute for substantially fewer human‑required revisions per the paper overview.
The work suggests that future “agent stacks” will need explicit coordination and memory architecture—not just more agents—to get consistent, auditable behavior on long, structured documents.
MemFlow introduces adaptive memory for production instruction‑following agents
MemFlow (Kling/Kuaishou): The Kling video team introduces MemFlow, an adaptive memory mechanism that dynamically retrieves relevant history for instruction‑following agents so they can ground new actions in prior conversation or task state, as reported in the memflow summary. The system is framed as production infrastructure for long‑running, multi‑turn agents rather than a toy feature, aiming to reduce context bloat by learning which past interactions actually matter instead of replaying full transcripts every turn.
The announcement positions MemFlow alongside recent research on adaptive memory and agent/tool co‑adaptation, but concrete metrics and implementation details (store schema, eviction policy, interaction with model‑side attention) have not yet been shared in public threads.
💼 Enterprise AI layer and unit economics
Executives frame the thick app layer above LLMs; OpenAI’s compute margins reportedly reach ~70%. IPO chatter resurfaces; early agent marketplaces form. Excludes infra buildouts covered elsewhere.
Reports say OpenAI’s compute margin on paying users hits ~70% in October
OpenAI economics (compute margin): A Bloomberg recap shared by analysts says OpenAI’s compute margin on paying users climbed to about 70% in October 2025, up from roughly 52% at end‑2024 and around 35% in January 2024, attributed to cheaper rented compute, inference efficiency work, and a higher‑priced subscription tier as shown in the margin summary and Bloomberg link. The same reporting notes that on total computing costs Anthropic is believed to be more efficient, and that OpenAI declared internal "Code Red" modes multiple times to focus staff on latency, uptime, and server costs after competitive pressure from DeepSeek and Google’s Gemini 3, as detailed in the code red recap.
Executives also reportedly shifted teams off side projects like agents and ads to tighten evaluation loops and catch quality drops before users see them, then shipped GPT‑5.2, GPT‑5.2‑Codex and a rebuilt ChatGPT Images stack that generates up to 4× faster while holding edit consistency, as detailed in the code red recap. Taken together, the story points to a business now throwing off high gross margins on inference, but still locked in a capital‑intensive race where power, infrastructure scaling, and relative efficiency versus rivals like Anthropic remain open questions.
Graphic lists SpaceX, OpenAI, Anthropic among largest potential 2026 IPOs, totaling $3.6T
IPO pipeline (SpaceX, OpenAI, Anthropic): A widely shared graphic of "largest potential IPOs" lists SpaceX at $1.5T, OpenAI at $830B, Anthropic at $230B, and ByteDance at $480B, with the top ten private companies totaling about $3.6T in hypothetical IPO value based on Yahoo Finance estimates, as shown in the ipo chart and ipo recap. The same chart places Databricks at $160B, Stripe at $120B, Revolut at $90B, Shein at $55B, Ripple at $50B, and Canva at $50B, and is being circulated in the context of a possible 2026 window for mega‑listings.
Commentators point out that these numbers are not firm valuations but directional guesses, with one response calling the list "entirely made up" even while conceding that several of the names are likely to go public in the next year, as detailed in the valuation skepticism. Another thread frames this as part of an expected wave where SpaceX could list at "1.5 metric tons" (a joke on trillions) followed by OpenAI and Anthropic, underscoring how much AI‑heavy firms now dominate late‑stage private markets, according to the ipo chart. For AI leaders and analysts, the graphic captures how capital markets are starting to price the agent and model ecosystems even before clear profit trajectories are visible.
Aaron Levie argues the real value sits in a thick agent layer above LLMs
Thick agent layer (Box/Levie): Aaron Levie sketches an enterprise stack where LLM labs graduate "generally capable college students," while a much thicker app layer turns them into deployed professionals via tools, private data, sensors, actuators, and feedback loops as described in the Levie thread. He frames successful apps like Cursor as orchestrating multi‑step LLM DAGs, doing context engineering, surfacing autonomy sliders, and providing job‑specific GUIs so that enterprises can actually adopt these systems instead of just chatting with a raw model.
Levie also stresses that in real companies, most of the value sits in domain‑specific workflows, change management, and system integration—agents need to plug into CRMs, ERPs, and identity systems, not just a browser tab, as noted in the Levie thread. This pushes the economic center of gravity toward vertical agents and specialized products built on top of foundation models, rather than the models alone.
AI talent, investor and dating agents start behaving like early marketplaces
Agent marketplaces (consumer + B2B): A series of posts highlights AI voice and chat agents that behave like lightweight marketplaces—interviewing thousands of candidates and companies, then matchmaking across both sides over time, rather than serving a single user in isolation, as shown in the talent agent thread. Examples include Jack & Jill AI for hiring, Boardy AI for matching founders and investors, and Known AI for dating, all of which rely on aggregating proprietary supply so that, once there is enough liquidity, they can sit in the middle of high‑value relationships, as detailed in the talent agent thread and liquidity comment.
The builders note that most of these products are still at a few thousand users and are working to build liquidity, but argue that if they succeed they can evolve from "personal agents" into proper two‑sided networks that own discovery and matching economics in their niches, as discussed in the talent agent thread and venturetwins note. That framing puts early agent products on a path closer to Uber‑ or dating‑app style businesses—where proprietary data and matching quality determine defensibility—than to generic chatbots.
Similarweb shows ChatGPT at 1B US visits, with Gemini, Claude, Grok as main challengers
AI chatbot traffic (Similarweb): November 2025 US web traffic estimates from Similarweb put chatgpt.com at 1.005B monthly visits and 66.81% traffic share among the top AI chatbot and tool sites, down 6.55% month‑over‑month but still several times larger than any rival as shown in the Similarweb table. Gemini’s web client gemini.google.com ranks second with 167.9M visits and 11.16% share (up 9.39% MoM), followed by claude.ai at 53.98M (3.59% share, down 11.33%), grok.com at 53.18M (3.53% share, up 14.36%), and character.ai at 44.82M as detailed in the Similarweb table.
Lower in the top ten, openai.com itself draws 43.44M visits (2.89%, down 34.97%), perplexity.ai has 32.85M (2.18%), and chat.deepseek.com, copilot.microsoft.com, and polybuzz.ai round out the list with 14–20M monthly visits each as shown in the Similarweb table. For AI product teams and analysts, these figures sketch a market where ChatGPT still dominates direct web usage in the US but where Google’s Gemini, Anthropic’s Claude, and xAI’s Grok are now meaningful destinations, and where up‑and‑coming tools like Polybuzz are starting to register in traffic‑share tables.
🎬 Creator workflows: motion control, remix and open image SOTA
Significant creative stack chatter: Kling Motion Control recipes, Google Photos Remix rollout, Freepik Spaces pipelines, and Z‑Image Turbo topping an open‑weights arena. Mostly how‑to guides and promos.
Z‑Image Turbo tops Artificial Analysis Image Arena as open 6B image model
Z‑Image Turbo (Alibaba Tongyi‑MAI): Artificial Analysis now ranks Z‑Image Turbo, a 6B‑parameter open‑weights text‑to‑image model, as the new #1 model in its Image Arena, ahead of FLUX.2 [dev], HunyuanImage 3.0 (Fal), and Qwen‑Image on a battery of five diverse prompts as shown in the z-image summary and prompt gallery. The team notes that Turbo costs about $5 per 1,000 images via Alibaba Cloud, markedly cheaper than FLUX.2 [dev] at $12, Qwen‑Image at $20, and HiDream‑I1‑Dev at $26 for the same volume, and that it is released under Apache 2.0 and still runs on 16 GB consumer GPUs per the z-image summary.
• Arena performance: Side‑by‑side grids on rainforest, savannah, Europa rocket launch, anime cityscape and watercolor train station prompts show Turbo delivering sharp composition and consistent style across both natural and stylized scenes, often matching or exceeding larger peers in perceived fidelity and prompt adherence, as detailed in the prompt gallery and anime prompt example.
• Open and multi‑host: Z‑Image Turbo is already exposed on Alibaba Cloud, fal, and Replicate, and the Apache 2.0 license removes commercial‑use friction, making it a realistic default for teams that want strong image quality without closed‑model pricing as shown in the z-image summary and image arena.
This positions Z‑Image Turbo as a serious open alternative in creative stacks where cost per render and on-prem or BYO-GPU deployment matter as much as raw image quality.
Kling 2.6 Motion Control workflow chains pose transfer with Nano Banana Pro
Kling Motion Control (Kuaishou): Creators are standardizing on a repeatable recipe that uses Kling 2.6 Motion Control plus Nano Banana Pro to copy arbitrary movement from a reference clip while completely swapping the subject’s identity and styling as shown in the workflow overview and motion tips. The flow is: grab a pose-heavy reference video, export the first frame, run it through Nano Banana Pro with a prompt like “keep the image structure… now replace the woman by the bald man with beard” to preserve composition while swapping appearance via the nano banana example; then feed Kling’s Motion Control with that edited keyframe plus the original motion video and a short prompt such as “a bald man with beard performing break dance at the olympics stage”, letting Kling handle motion tracking and synthesis according to the kling interface and kling signup page.
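Neither tool is scripted from code in these posts (both are driven through their web UIs), but the two-step chain is easier to see laid out as pseudocode; every function below is a hypothetical placeholder, not a real Kling or Nano Banana Pro API.

```python
from pathlib import Path

# Hypothetical placeholders only: the functions below make the chain explicit;
# none of them is a real API for Kling or Nano Banana Pro.

def extract_first_frame(video: Path) -> Path:
    """Step 1: pull the first frame of the pose-heavy reference clip."""
    ...  # e.g. with ffmpeg or any frame grabber
    return Path("first_frame.png")

def nano_banana_edit(frame: Path, prompt: str) -> Path:
    """Step 2 (hypothetical): swap subject identity while keeping composition."""
    ...
    return Path("edited_keyframe.png")

def kling_motion_control(keyframe: Path, motion_video: Path, prompt: str) -> Path:
    """Step 3 (hypothetical): re-animate the edited keyframe with the clip's motion."""
    ...
    return Path("result.mp4")

reference = Path("reference_dance.mp4")
keyframe = nano_banana_edit(
    extract_first_frame(reference),
    "keep the image structure… now replace the woman by the bald man with beard",
)
result = kling_motion_control(
    keyframe,
    reference,
    "a bald man with beard performing break dance at the olympics stage",
)
```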
• Decoupled motion and identity: The setup cleanly separates pose/motion (video) from appearance (image), giving engineers a concrete pattern for building pipelines where one model handles subject editing and another handles temporally consistent animation as detailed in the workflow overview.
• Low‑friction prompting: Posts emphasize that Motion Control works with very plain language prompts, leaning on the reference clip for choreography and on the still image for art direction, which reduces prompt engineering overhead for teams prototyping character-driven video content according to the motion tips.
The takeaway is that Kling’s Motion Control is already being used as a practical module inside multi-model production chains rather than as a standalone toy, which is a useful signal for anyone designing higher-level video tooling around it.
Freepik Spaces tutorial chains Nano Banana Pro stills into Veo 3.1 animation
Spaces pipeline (Freepik): A detailed tutorial shows how to use Freepik Spaces as a node‑based canvas to build a full AI video scene by combining Nano Banana Pro (NB Pro) for image generation with Veo 3.1 for motion, starting from a 3×3 grid of stills in the style of Pawn Stars as shown in the pawn stars demo and spaces walkthrough. The flow is: generate a 3×3 character grid with NB Pro, then add nine extraction nodes to pull each frame, optionally run lightweight edit nodes on specific stills, and finally connect each into Veo 3.1 video nodes to animate dialogue or camera motion using JSON‑style prompts embedded in the node metadata according to the pawn stars demo and freepik spaces page.
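The Space graph behaves like a small dependency DAG, and the toy sketch below shows the invalidation idea: edit an upstream node and everything downstream of it is flagged for re-render. The node names and dirty-flag rule are illustrative assumptions, not Freepik's actual engine.

```python
from dataclasses import dataclass

# Toy sketch of the Space graph treated as a tiny DAG. Node names and the
# dirty-flag re-render rule are illustrative, not Freepik's implementation.

_downstream: dict[str, list["Node"]] = {}

@dataclass
class Node:
    name: str
    dirty: bool = False  # assume everything starts rendered

    def mark_dirty(self) -> None:
        # Editing a node invalidates every node downstream of it.
        self.dirty = True
        for node in _downstream.get(self.name, []):
            node.mark_dirty()

def connect(upstream: Node, downstream_node: Node) -> None:
    _downstream.setdefault(upstream.name, []).append(downstream_node)

# 3x3 NB Pro grid -> 9 extraction nodes -> 9 Veo 3.1 video nodes
grid = Node("nb_pro_grid")
nodes = [grid]
for i in range(9):
    extract, video = Node(f"extract_{i}"), Node(f"veo_video_{i}")
    connect(grid, extract)
    connect(extract, video)
    nodes += [extract, video]

grid.mark_dirty()  # tweak the upstream grid...
print(sum(n.dirty for n in nodes))  # ...and all 19 nodes need a re-render
```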
• Single‑space orchestration: Because the entire pipeline—stills, crops, edits, and clips—lives inside one Space graph, creators can tweak a character pose or color treatment upstream and immediately re‑render downstream video nodes without re‑wiring prompts, which functions like a lightweight DAG engine for AI media workflows per the spaces walkthrough.
• Two‑model separation of concerns: NB Pro handles identity/style and composition, while Veo deals with temporal coherence and motion, mirroring a broader pattern where image and video models are composed rather than over‑prompted to do everything at once as shown in the pawn stars demo.
For engineers and analysts, this is a concrete example of how consumer‑facing tools are turning multi‑model creative graphs into a productized experience rather than leaving users to assemble scripts by hand.
AI film festival entry shows how to onboard non‑technical cinematographers
AI film workflow (PJ Accetturo): Filmmaker PJ Accetturo submitted a $1,000,000 AI film festival entry built by taking six top Hollywood cinematographers with “zero AI experience” and teaching them a simple, repeatable framework for using generative tools to plan and assemble shots as shown in the ai film explainer. The shared clips highlight expressive character work and note that “the emotional facial gestures are getting super strong,” which reflects how far current image and video models have come for close‑up performance according to the model quality reaction.
• Process over model worship: The emphasis is on instruction design, prompts, and iteration loops rather than on any single model name, which gives AI engineers and PMs a concrete example of how to translate raw model capability into a training program that non‑technical creatives can actually adopt per the ai film explainer.
• One‑year progress snapshot: Accetturo contrasts this film with his first AI short from a year earlier that went viral on Reddit, underscoring how tool maturity over 12 months has shifted AI filmmaking from niche experiment to something professionals can plausibly slot into their day‑to‑day work according to the ai film explainer.
So for leaders tracking creative adoption, this is a live case study in turning general‑purpose models into a production‑ready, teachable workflow for traditional film crews.
Google Photos Remix brings daily AI stylization to iOS users
Photos Remix (Google): Google is rolling out a “Remix your photos” feature in the Photos app on iOS that turns portraits into styles like sketch and anime, with a daily free generation quota and an upsell path to a paid Google AI plan for heavier use according to the remix rollout. The onboarding screen spells out that you “pick a portrait, choose your favorite style, and watch the magic happen,” and explicitly notes experimental GenAI plus policy constraints around prohibited uses per the remix rollout.
For AI product teams, this is another concrete example of how a mainstream app is packaging image models: tight in-context education, a rate-limited free tier, and an integrated subscription upgrade instead of sending users to a separate AI product.