Google lithiumflow renders 1‑minute, 2,000‑line SVG short – orionmist adds dialogue and music
Executive Summary
Two unannounced Google models just slipped into LMArena, and the demos look wild. Testers say “lithiumflow” generates a fully playable, 1‑minute South Park–style short as a single HTML/SVG file with 2,000+ lines of zero‑shot code, while “orionmist” reportedly does the same for a Rick & Morty sketch and wires in dialogue, music, and SFX. If these reports hold up, we’re not talking UI scaffolding anymore; this is end‑to‑end program synthesis that ships self‑contained, animated front‑ends on demand.
Early hands‑ons paint a consistent pattern: lithiumflow nails Taelin’s A::B challenge, composes piano with left/right‑hand sync, and keeps layouts coherent across a four‑scene SVG stress suite (home, landscape, NYC skyline, desert tree). It handles double base64 but chokes on triple. There are rough edges: raw chain‑of‑thought sometimes leaks (“I will be honest…”) and can diverge from final answers—a safety and UX gap Google will need to close before any public endpoint. Arena banners tag both models “trained by Google,” but there’s no formal release; treat this as pre‑launch staging that lines up with last week’s Gemini 3.0 timing chatter.
If this becomes the baseline for on‑the‑fly UIs and animation, serving stacks that squeeze GPUs—think token‑level pooling claiming 82% savings—suddenly matter a lot for cost per token and fleet utilization.
Feature Spotlight
Feature: Google’s “lithiumflow” and “orionmist” surface in arena tests
Two rumored Gemini 3.0 variants (“lithiumflow”, “orionmist”) appeared on LMArena; testers show zero‑shot, code‑heavy SVG animation shorts and strong composition/reasoning—suggesting a step‑function in agentic/creative capability.
Cross-account sightings of two Google DeepMind models on LMArena show big qualitative jumps (SVG, music, reasoning). This cluster dominated today’s sample; treat as pre‑launch signal rather than an official release.
🛰️ Feature: Google’s “lithiumflow” and “orionmist” surface in arena tests
Two unannounced Google models “orionmist” and “lithiumflow” surface on LMArena
LMArena quietly listed “orionmist” and “lithiumflow,” both identified as “trained by Google,” pointing to pre‑launch evaluation in the wild Arena screenshot. Following timing hints of a Gemini 3.0 release this year, community watchers suggest these could be Pro/Flash variants that are not yet public endpoints UI capture, with the arena’s own update post naming the pair Arena update.

Early reactions center on the models’ coding‑from‑scratch and reasoning behavior, but there’s no official attribution beyond the arena banners; treat as staging signals, not a formal release.
“lithiumflow” codes a 1‑minute South Park SVG episode from a single prompt
A tester reports lithiumflow generated a fully‑functional, 1‑minute South Park‑style short as a single HTML/SVG block with 2,000+ lines of code, zero‑shot from one prompt CodePen demo CodePen project. A follow‑up post shares the exact prompt and code link, reinforcing the end‑to‑end output claim Prompt details CodePen project. This points to unusually strong program‑synthesis plus graphics capabilities beyond typical UI‑component scaffolding.
Early read on “lithiumflow”: elite SVG, A::B pass, piano composition; raw CoT leakage seen
Hands‑on notes say lithiumflow excels at SVG tasks, handles Taelin’s A::B challenge, and composes piano music with left/right‑hand synchronization better than peers; it struggles on triple base64 but handles double Capability notes. Observers also report raw chain‑of‑thought traces leaking (caps‑lock emphasis, “I will be honest” surrender) with divergences from final answers—evidence of internal rationale exposure during testing CoT leak thread. One tester calls its music the first to feel “decent” among LLMs tried Music notes.

If representative, this suggests a step up in code‑as‑UI generation and emergent reasoning behaviors, with potential safety/UX implications around rationale exposure.
“orionmist” reportedly generates a Rick & Morty SVG short with dialogue, music, and SFX
Separate tests credit orionmist with coding a full SVG‑based Rick & Morty episode including dialogue, music, and sound effects in one go—again pointing to strong code‑generation and multimedia wiring in a single artifact CodePen link CodePen project. The author doubles down with a second share of the project link Follow‑up link CodePen project. While anecdotal, the pattern mirrors lithiumflow’s zero‑shot SVG pipeline strength.
Consistent SVG outputs across a 4‑scene stress suite hint at stable graphics synthesis
A repeatable stress set—home, landscape, New York skyline, and a desert tree—shows lithiumflow producing coherent SVG scenes across diverse prompts, suggesting reliable rendering and layout control rather than one‑off lucky hits SVG test set. The author shares the canonical set for others to reproduce Test prompts.

Stable SVG composition at this breadth is notable for agents that must generate complex, self‑contained front‑ends or animations without external assets.
⚙️ Serving efficiency: Alibaba’s token‑level model pooling
Today’s standout systems story is Alibaba Cloud’s Aegaeon runtime. It claims token‑level preemption to keep GPUs hot and cut idle time. Excludes the Gemini 3.0 arena sightings (feature).
Alibaba’s Aegaeon claims ~82% GPU savings with token‑level model pooling
Alibaba Cloud says its Aegaeon serving runtime slashes Nvidia GPU usage by about 82% by preempting at token boundaries so a single GPU can time‑slice many models; the company reports switching latency drops ~97% and up to seven models share one GPU without idling hot models serving claims.

Under the hood, a proxy layer routes requests into separate Prefill and Decode instances, with per‑GPU schedulers coordinating via Redis; the design splits a first‑token pool (to batch identical models for fast TTFB) from a later‑token pool that uses weighted round robin for steady decoding architecture explainer, and cold models JIT‑load weights to borrow brief compute slices instead of pinning a full GPU serving claims. This token‑level preemption generalizes beyond hot models to long‑tail traffic and claims up to seven models per GPU versus two to three for prior pooling schemes system design. Trade‑offs remain—uneven memory footprints, fewer preemption points on very long sequences, and scheduler overhead under spikes—but if validated broadly this could lower cost per token, lift fleet utilization, and defer capex for multi‑model inference fleets serving claims.
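To make the token‑level preemption idea concrete, here is a minimal sketch of a weighted round‑robin decode pool, assuming one GPU time‑slices several models one token at a time. The class and method names (DecodePool, decode_one_token, finished) are hypothetical illustrations, not Aegaeon's actual API.

```python
import itertools
from collections import deque

class DecodePool:
    """Toy later-token pool: weighted round robin over models sharing one GPU."""

    def __init__(self, weights: dict[str, int]):
        # weights: model name -> relative share of decode steps on this GPU
        self.queues = {m: deque() for m in weights}
        self.cycle_len = sum(weights.values())
        # Expand weights into a repeating schedule, e.g. {"a": 2, "b": 1} -> a, a, b, a, a, b, ...
        self.schedule = itertools.cycle(
            [m for m, w in weights.items() for _ in range(w)]
        )

    def submit(self, model: str, request) -> None:
        self.queues[model].append(request)

    def step(self):
        """Do exactly one token's worth of work for one request, then yield the GPU.
        Preempting at token boundaries is what lets cold models borrow brief slices."""
        for _ in range(self.cycle_len):        # scan at most one full cycle, skipping idle models
            model = next(self.schedule)
            if self.queues[model]:
                request = self.queues[model][0]
                request.decode_one_token()     # one token, then hand the GPU back
                if request.finished():
                    self.queues[model].popleft()
                return model
        return None                            # nothing queued; pool is idle
```

A real runtime would additionally batch same‑model requests, keep KV caches resident, and JIT‑load cold weights ahead of their first slot, as the explainer describes.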
🧪 Self‑rewarding RL and instruction‑following recipes
Heavy day for training papers: quantized RL rollouts, last‑token self‑checks, confidence‑based rewards, hindsight rewriting, vision‑grounded web coding RL. Excludes policy/eval chatter.
Confidence‑as‑Reward turns token probabilities into supervision; +8.9% then +11.5% on MATH500 with CRew‑DPO
Researchers convert a model’s own final‑answer token probabilities into a training‑free reward to judge math solutions, then apply CRew‑DPO to a 7B model for +8.9% and then +11.5% on MATH500 across two rounds paper summary.

Confidence correlates with correctness (r≈0.83), enabling label‑free judging and data selection; the approach rivals trained reward models on RewardMATH while cutting supervision cost.
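A minimal sketch of the scoring idea, assuming the confidence reward is the geometric‑mean probability of the final‑answer tokens; the paper's exact aggregation and its CRew‑DPO pairing scheme may differ, and build_dpo_pair is purely illustrative.

```python
import math

def confidence_reward(answer_token_logprobs: list[float]) -> float:
    """Training-free reward: the model's own confidence in its final answer,
    scored here as the geometric-mean probability of the answer tokens."""
    if not answer_token_logprobs:
        return 0.0
    mean_logprob = sum(answer_token_logprobs) / len(answer_token_logprobs)
    return math.exp(mean_logprob)  # in (0, 1]

def build_dpo_pair(solutions: list[dict]) -> tuple[str, str]:
    """Pick the most- and least-confident sampled solutions as the
    (chosen, rejected) pair for a CRew-DPO-style preference update."""
    ranked = sorted(solutions,
                    key=lambda s: confidence_reward(s["answer_logprobs"]),
                    reverse=True)
    return ranked[0]["text"], ranked[-1]["text"]
```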
Last‑token self‑reward (LaSeR) gets ~80% F1 math verification with a 7B model, rivaling a 72B judge
Tencent’s LaSeR adds a single last‑token probe to generate a self‑score for verification, enabling a 7B model to match a 72B reward model on math judging with F1 near 80%, while costing just one extra token at inference paper summary.

By reading next‑token probabilities at the end and fitting a small auxiliary loss to a verifier target, LaSeR ranks and weights samples cheaply; results show strong gains on reasoning tasks without a separate reward model.
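A rough sketch of the mechanism, assuming the self‑score is the probability the model assigns to a designated "correct" token at the final position, pulled toward a verifier label by a small MSE loss; LaSeR's actual probe and loss may differ.

```python
import torch
import torch.nn.functional as F

def laser_self_score(last_token_logits: torch.Tensor, yes_token_id: int) -> torch.Tensor:
    """Read a verification signal from the final position: the probability the
    model assigns to a designated 'correct' token. Costs ~one extra token."""
    probs = F.softmax(last_token_logits, dim=-1)
    return probs[..., yes_token_id]            # one scalar per sequence

def laser_aux_loss(self_scores: torch.Tensor, verifier_labels: torch.Tensor) -> torch.Tensor:
    """Small auxiliary loss fitting the self-score to a 0/1 verifier target."""
    return F.mse_loss(self_scores, verifier_labels.float())
```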
RLSR (reinforcement with supervised embedding reward) beats SFT on instruction following (30.73% vs 21.01%)
Inflection’s RLSR scores each sampled answer by semantic similarity to the human reference (embedding reward), then reinforces higher‑scoring outputs; on Qwen‑7B with the INFINITY dataset it reaches a 30.73% AlpacaEval win rate vs 21.01% for SFT, with lower repetition paper summary.

The supervised reward is cheap and always available from paired data; teams can use RLSR alone or as a post‑SFT booster, trading extra sampling compute for noticeably better adherence.
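A minimal sketch of the supervised reward, assuming cosine similarity over embeddings from any off‑the‑shelf sentence encoder and simple mean‑centering of per‑prompt rewards; RLSR's encoder choice and advantage normalization may differ.

```python
import numpy as np

def embedding_reward(answer_vec: np.ndarray, reference_vec: np.ndarray) -> float:
    """Supervised reward: cosine similarity between a sampled answer and the
    human reference, both embedded by a sentence encoder of your choice."""
    a = answer_vec / (np.linalg.norm(answer_vec) + 1e-8)
    r = reference_vec / (np.linalg.norm(reference_vec) + 1e-8)
    return float(a @ r)

def rlsr_advantages(rewards: list[float]) -> list[float]:
    """Center rewards across the k samples for one prompt so that
    better-than-average answers get reinforced and worse ones suppressed."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```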
Closed‑loop agentic self‑learning (task writer–solver–judge) improves search agents without human labels
A single framework grows a dataset and a verifier while training the solver: a task writer adjusts difficulty, the policy attempts answers, and a small generative judge scores correctness; jointly updating the judge prevents reward gaming and sustains improvement across runs paper overview.

A late dose of real verification data raises the ceiling; the loop outperforms rigid rules and reduces dependency on costly human labels for iterative agent training.
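One round of the loop might look like the sketch below; all object names, thresholds, and the difficulty schedule are illustrative, not the paper's implementation.

```python
def self_learning_round(task_writer, solver, judge, difficulty, real_labels=None):
    """Task writer proposes problems, solver attempts them, a small generative
    judge scores them; the judge is also updated (e.g. with a late dose of real
    verification data) so the solver cannot game a frozen reward."""
    tasks = task_writer.generate(difficulty=difficulty)
    attempts = [solver.answer(t) for t in tasks]
    scores = [judge.score(t, a) for t, a in zip(tasks, attempts)]

    solver.update(tasks, attempts, scores)   # RL step on judge scores
    if real_labels is not None:
        judge.update(real_labels)            # keep the verifier honest

    # Raise difficulty once the solver is comfortably passing.
    pass_rate = sum(s > 0.5 for s in scores) / len(scores)
    return difficulty + 1 if pass_rate > 0.8 else difficulty
```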
Microsoft’s ECHO rewrites failed attempts into minimal workflows, trimming ≈1.6 tool messages per QA query
ECHO performs hindsight trajectory rewriting: after a failure, it derives the shortest successful substeps and stores that workflow for reuse, improving sample efficiency and reducing back‑and‑forth in a multi‑person QA setting by ~1.6 messages/query while preserving accuracy paper summary.

Compared to standard hindsight replay (goal relabeling), ECHO yields cleaner, executable plans, speeds gridworld learning, and offers practical gains for tool‑using agents.
ReLook: Vision‑grounded RL for web coding accepts edits only when a multimodal critic’s score improves
ReLook trains a coding agent to render pages, ask a vision LLM critic what’s wrong, and reinforce only edits that raise the pixel‑based score; at test time it runs up to three rapid self‑edits without screenshots for roughly 7× faster loops while retaining quality paper summary.

Zero reward on render failure blocks reward hacking; the critic targets layout, color, text and interactions. Benchmarks show consistent wins over base models and a vision‑only RL baseline.
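A compact sketch of the critic‑gated loop, assuming a render function that returns None on failure and a critic that scores a screenshot; function names and the acceptance rule are illustrative, not the paper's code.

```python
def relook_edit_loop(code: str, render, critic, propose_edit, max_edits: int = 3) -> str:
    """Keep an edit only if the rendered page's critic score improves;
    a render failure scores zero, which blocks reward hacking."""
    def score(c: str) -> float:
        screenshot = render(c)                 # None on render failure
        return 0.0 if screenshot is None else critic(screenshot)

    best, best_score = code, score(code)
    for _ in range(max_edits):
        candidate = propose_edit(best)         # model proposes a revision
        cand_score = score(candidate)
        if cand_score > best_score:            # accept only strictly better edits
            best, best_score = candidate, cand_score
    return best
```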
🛠️ Coding agents: Claude Code skills, plugins and consoles
Mostly practical agent dev: Claude Code plugin marketplace, plan‑mode UX, GLM coding endpoints, and Google AI Studio’s revamped key/project UX. Excludes Gemini 3.0 arena items (feature).
Claude Code adds plugin marketplace with slash commands, MCP and subagents
A public beta '/plugin' manager surfaced in Claude Code v2.0.12, showing installable entries like code‑review, typescript‑lsp, and security‑guidance, with support for hooks, MCP tools, and subagents. This formalizes an ecosystem for extending coding agents with review, language tooling and policy guards directly from the console Plugin manager screenshot.

Expect fast proliferation of domain‑specific skills (security lint, dependency audits, CI orchestrators); enterprise teams can start curating allowlists and RBAC around a standard plugin lifecycle.
GLM‑4.6 Coding Plan lands across clients with Vision MCP image support
Builders wired Zhipu’s GLM‑4.6 Coding Plan into multiple consoles: Claude Code adds a Vision MCP server for image/video understanding via a one‑line command, while Kilo Code connects through an OpenAI‑compatible endpoint and model selection Setup command, Provider config, with integration details in the official guide Vision MCP docs.

Result: coding agents can ingest screenshots/specs alongside code, enabling UI‑to‑code tasks, OCR‑assisted refactors, and visual test assertions without leaving the IDE.
Google AI Studio revamps API keys and projects management
Google AI Studio shipped a redesigned API Keys & Projects experience: name/rename keys, group or filter by project, import/remove GCP projects, and jump to usage and billing from the same pane Feature rollout, Keys page screenshot. Documentation reflects the new creation and setup flows, including environment setup and quickstart examples AI Studio docs.

Cleaner credentials hygiene matters as teams scale multi‑agent apps; this reduces key sprawl and makes rotation and auditing less error‑prone.
mcp2py maps any MCP server to Python modules, enabling quick agent tooling
The mcp2py 0.1.0 package turns MCP servers into importable Python modules, letting DSPy/agents call tools like Chrome DevTools MCP with just a few lines. A demo shows a ReAct-style agent measuring TTFB/FCP and reporting the largest asset via headless Chrome Code demo.
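This bridges LLM tool ecosystems with regular REPLs/tests, encouraging shared tool surfaces (MCP) across IDEs, notebooks, and CI; less custom glue, more reuse.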
Plan mode now elicits exact test cases via interactive select UI
Claude Code’s plan mode now renders an in‑console selector that asks, “What specific scenarios should the export command tests cover?”, letting devs check concrete cases before generation—tightening specs and reducing slop Plan mode screenshot. Following up on clarifying questions, this pushes more structure into pre‑execution planning and aligns with the community’s call to make agents ask better questions Engineer reaction.

This is a practical step toward self‑verifiable workflows: agents can bind to these scenarios for tests and post‑hoc diffs.
Claude Code Skills + LlamaIndex semtools ship an M&A comps agent
A practitioner combined Claude Code Skills with LlamaIndex semtools/LlamaCloud parsing to read DEF 14A filings, extract table data (including complex financial layouts), and auto‑write an Excel comps sheet. Notes include pitfalls like percentage vs raw value formatting and a roadmap for native skill integrations Agent demo.
For AI engineers, the pattern is clear: delegate parsing to a specialized service, keep the agent on orchestration and validation, and write to domain‑native outputs (Excel) for analyst workflows.
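As a sketch of the "domain‑native output" step only: assume the agent has already pulled rows out of the DEF 14A tables via the parsing service, and now writes the comps sheet with openpyxl. The column names, row schema, and percent‑format fix below are illustrative, not the practitioner's actual code.

```python
from openpyxl import Workbook

def write_comps_sheet(rows: list[dict], path: str = "comps.xlsx") -> None:
    """Write extracted compensation rows to an Excel comps sheet."""
    wb = Workbook()
    ws = wb.active
    ws.title = "Comps"
    ws.append(["Company", "Total Comp ($)", "Salary ($)", "Bonus (% of salary)"])
    for r in rows:
        ws.append([r["company"], r["total_comp"], r["salary"], r["bonus_pct"]])
        # Store percentages as fractions and let Excel render them, avoiding the
        # "47 vs 0.47" formatting pitfall mentioned above.
        ws.cell(row=ws.max_row, column=4).number_format = "0.0%"
    wb.save(path)

write_comps_sheet([{"company": "ExampleCo", "total_comp": 12_400_000,
                    "salary": 1_200_000, "bonus_pct": 0.47}])
```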
How teams are wrangling Cursor agents for reliable shipping
Power users outline a concrete Cursor cadence: start in plan mode, demand self‑verifiable steps (tests/scripts), lean on GitHub CLI for PR context and CI logs, parallelize with best‑of‑n background agents, and codify daily commands like /create‑pr, /review‑changes, and /deslop Usage guide, Self‑verifiable tip, GitHub CLI tip, Best of n tip, Daily commands.
These patterns reduce drift, cut latency, and turn agents from “long‑autonomy” guessers into iterative, test‑driven contributors that fit existing review gates.
🔌 Interoperability & MCP: mcp2py, vision MCP, agent canvases
Interchange surfaces matured: a Pythonic MCP client, a Vision MCP server, and LangGraph‑based orchestration templates. Excludes coding‑agent plugin specifics covered elsewhere.
mcp2py v0.1.0 turns any MCP server into a Python module; DSPy agent drives Chrome DevTools
A Pythonic MCP client landed on PyPI: mcp2py v0.1.0 maps any Model Context Protocol server into importable modules/tools, so agents can call external capabilities from plain Python—demonstrated by a DSPy ReAct agent steering a headless Chrome DevTools MCP to measure page performance DSPy browser demo, PyPI release. Following up on ChatGPT MCP enterprise enablement, this closes the loop for local scripting and CLI workflows.

In practice it’s a one‑liner to load servers (e.g., npx chrome-devtools-mcp), pass their tools into DSPy, and run tasks end‑to‑end within a Python REPL—lowering friction for MCP adoption in existing codebases DSPy browser demo.
Z.AI ships Vision MCP Server to add image/video understanding to Claude Code via GLM Coding Plan
Z.AI’s Vision MCP Server brings image and video understanding to MCP‑capable clients (e.g., Claude Code) with a single setup command and requires the GLM Coding Plan Pro tier; the post includes the exact CLI to add the server and environment variables Claude config steps, with full instructions in the first‑party documentation Vision MCP docs.

This complements code‑oriented MCP servers by letting the same agent loop reason over screenshots, PDFs, and recorded sessions without bespoke adapters—useful for UI tests, visual diffing, and multimodal debugging Claude config steps.
CopilotKit + LangGraph release AI canvas template for real‑time UI↔agent sync (Python–Next.js)
CopilotKit published a production AI canvas template built on LangGraph, wiring real‑time synchronization between the UI and agent state in a Python–Next.js stack; the walkthrough shows how to coordinate agent tools and render live updates in the canvas Template overview, YouTube walkthrough.

For teams standardizing on MCP, this template provides the front‑end surface where MCP tools (retrieval, browsers, internal services) can be orchestrated and visualized as multi‑step plans rather than opaque chat logs Template overview.
LangChain’s 'Agents 2.0' formalizes deep agents with planning, hierarchy, and persistent memory
LangChain outlined an “Agents 2.0” architecture that moves beyond shallow REPL loops to deep agents: an orchestrator plans explicitly, delegates to specialist sub‑agents, and writes to persistent memory so tasks can scale from tens to hundreds of steps Design thread, Agents 2.0 blog.

For MCP‑centric stacks, this offers a clean place to register MCP tools behind an orchestrator and keep long‑running plans stable (recover after failures, stay goal‑aligned) rather than rely on a single context window Design thread.
LangGraph ‘Article Explainer’ demo: multi‑agent swarm breaks down PDFs with interactive chat
LangGraph’s Article Explainer showcases a swarm architecture: upload a PDF, then a team of specialized agents (reader, explainer, Q&A) collaborates to answer queries and highlight insights through a unified UI Demo overview.

This pattern translates well to MCP: each specialist can expose MCP tools (e.g., document parsing, vector search, code evaluators) while the graph coordinates memory and task boundaries for traceable, multi‑step reasoning Demo overview.
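For orientation, a stripped‑down reader → explainer pipeline in LangGraph might look like the sketch below; the real demo uses a richer swarm with Q&A and tool‑calling agents, and the node bodies here are placeholders rather than the demo's code.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class DocState(TypedDict):
    pdf_text: str
    summary: str
    explanation: str

def reader(state: DocState) -> dict:
    # In a full system this node would call a document-parsing tool (e.g. via MCP).
    return {"summary": state["pdf_text"][:500]}

def explainer(state: DocState) -> dict:
    # Placeholder for an LLM call that rewrites the summary for the reader.
    return {"explanation": f"In plain terms: {state['summary']}"}

graph = StateGraph(DocState)
graph.add_node("reader", reader)
graph.add_node("explainer", explainer)
graph.add_edge(START, "reader")
graph.add_edge("reader", "explainer")
graph.add_edge("explainer", END)
app = graph.compile()

result = app.invoke({"pdf_text": "…full extracted PDF text…", "summary": "", "explanation": ""})
```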
🎬 Creative stacks: Veo 3.1 control, Sora 2 in Copilot, vision costs
Generative media kept buzzing (Veo 3.1 physics‑ish behavior, Copilot+Sora 2, Moondream cost/speed). Excludes the Gemini 3.0 'lithiumflow/orionmist' arena story (feature).
Copilot starts testing Sora 2 video generation with daily/free limits and a new Shopping tab
Microsoft is trialing Sora 2–powered video creation inside Copilot; early notes indicate free users may get one video per day while Pro has unlimited, and a Shopping sidebar entry is being added alongside the feature feature testing, feature article. This points to metered access for frontier text‑to‑video inside a mainstream assistant, with UI placement suggesting creative and commerce tasks will sit side by side feature details.
Moondream Cloud launches low‑cost hosted vision API with $5 credits and speed/price gains vs peers
Moondream introduced a hosted vision stack with pay‑as‑you‑go pricing and $5 monthly free credits, claiming faster and cheaper performance than Gemini 2.5 Flash and GPT‑5 Mini across pointing, detection and OCR; token pricing lands at ~$0.30/M input and ~$2.50/M output, aided by a tokenizer that reduces output tokens by ~21% product blog, launch summary. For teams building camera, OCR, and UI‑analysis flows, the cost curves here materially lower prototype and ops budgets.
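A back‑of‑the‑envelope cost check using the quoted prices; the workload sizes below are made up purely for illustration.

```python
INPUT_PRICE_PER_M = 0.30    # ~$0.30 per million input tokens
OUTPUT_PRICE_PER_M = 2.50   # ~$2.50 per million output tokens

def monthly_cost(input_tokens_m: float, output_tokens_m: float,
                 output_token_reduction: float = 0.21) -> float:
    """Dollar cost; applies the tokenizer's ~21% output-token reduction claim."""
    effective_output = output_tokens_m * (1 - output_token_reduction)
    return input_tokens_m * INPUT_PRICE_PER_M + effective_output * OUTPUT_PRICE_PER_M

# e.g. 50M input tokens and 10M output tokens of OCR/detection traffic per month:
print(round(monthly_cost(50, 10), 2))   # ≈ $34.75, before the $5 monthly free credit
```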

Veo 3.1 shows plausible dynamics in community tests (ships, honey on marble), though imperfect
Creators report Veo 3.1 exhibits surprisingly consistent “physics‑ish” behavior—e.g., iron/wood/sugar toy ships behave differently in water, and honey poured over a marble toucan cracks the beak—while still missing some details ships example, honey example. These clips ground earlier guidance in concrete outputs, following scene workflows that moved Veo 3.1 tips into repeatable recipes for camera moves and shot control.
Copy‑pasteable Veo 3.1 text‑animation JSON sequences speed stylized title cards
Reusable 5‑second title animations (fractal, bioluminescent, runic, phoenix, mirror) were shared as compact JSON sequences; swap a keyword to repurpose the style without reauthoring camera, lighting, or motion logic recipe thread, mirror recipe. For content teams, this turns Veo 3.1 into a motion‑graphics macro—useful for consistent lower‑thirds, intros, and explainer slates.
NotebookLM tests a “Kawaii” preset for Video Overviews to theme explainers without re‑editing
Google is experimenting with a “Kawaii” visual style for NotebookLM Video Overviews; a sample clip suggests style packs that can re‑skin generated videos while keeping content intact style preview. For education and marketing stacks, native styling knobs reduce downstream editing time and help enforce brand consistency.
🏗️ Capex watch: $2.8T thru 2029 and financing mechanics
Macro infra thread advanced via Citi’s revised spend and financing details (Oracle bonds, Nvidia–OpenAI pledges). Excludes serving/runtime topics (Aegaeon handled under systems).
Citi lifts AI infra forecast to $2.8T through 2029; 2026 capex outlook raised to ~$490B
Citigroup now pegs Big Tech AI infrastructure spend at ~$2.8T through 2029, up from $2.3T, and lifts its 2026 hyperscaler capex view to roughly $490B Macro forecast, following up on $300B capex (a prior Goldman estimate for 2025). Citi’s math implies about $50B per 1 GW of compute and ~55 GW of extra power needed by 2030, underscoring why operators are turning to debt and partnerships to bridge the gap.
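The figures are roughly self‑consistent; a quick check under the stated assumptions (treating the full $2.8T as compute build‑out at ~$50B per GW):

```python
total_capex_b = 2_800     # $2.8T through 2029, in $B
capex_per_gw_b = 50       # ~$50B of capex per 1 GW of compute
implied_gw = total_capex_b / capex_per_gw_b
print(implied_gw)         # 56.0 GW, in line with the ~55 GW of extra power cited for 2030
```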
Oracle floats ~$18B in bonds as AI data center bookings surge; Meta deal pegged at ~$20B
Oracle raised about $18B in bonds to finance AI data center build‑outs and reports ~$65B of new bookings in 30 days, plus a ~$20B deal with Meta; the runway also includes a $500B OpenAI initiative adding five new data centers, signaling heavy reliance on debt markets to meet AI demand Oracle financing.
Report: Nvidia pledges up to $100B to OpenAI amid industry push to finance AI buildout
As hyperscaler AI capex soars, Nvidia has reportedly pledged up to $100B to OpenAI, reflecting tighter strategic financing ties between compute suppliers and frontier labs to secure capacity and de‑risk supply Nvidia pledge.
📈 Usage shifts: Gemini gains, vibe‑coding traffic leaderboard
Market share and traffic signals moved: Gemini’s share up, ChatGPT down; vibe‑coding sites’ relative scale shared. Excludes infra capex and model leak items covered elsewhere.
Gemini climbs to 12.9% of GenAI traffic as ChatGPT slips to 74.1%
Google’s Gemini has nearly doubled its share of generative‑AI web traffic to ~12.9% over the past year while ChatGPT fell to ~74.1%, per Similarweb. Perplexity also crossed the 2% threshold in recent weeks. Following up on Gemini 3.0 timing (ship signals for year‑end), the usage tilt suggests growing multi‑model adoption beyond OpenAI. See the yearlong breakdown in Similarweb chart, with corroborating notes that Perplexity exceeded 2.0% in the latest snapshot traffic update.

Vibe‑coding traffic: Lovable 34M visits; Replit 12M; Bolt 8M; Base44 5.5M; v0 4.5M; Emergent 3M
Latest Similarweb tallies show the “vibe‑coding” cohort at substantial scale: Lovable (~34M), Replit (~12M), Bolt (~8M), Base44 (~5.5M), v0 (~4.5M) and Emergent (~3M) monthly visits, underscoring strong top‑of‑funnel interest for agentic coding tools Similarweb stats. Complementary trendlines from a separate DevOps/code‑completion view show changing traffic shares across Cursor, Cognition and others, though enterprise usage may be undercounted by web‑only stats traffic share charts.

Sonnet 4.5 gains token share on Rust prompts, per OpenRouter usage dashboard
OpenRouter’s language‑segmented telemetry shows Anthropic’s Claude Sonnet 4.5 steadily growing its token share for Rust‑related prompts, hinting at deeper traction among systems developers Token share chart. For AI teams, the signal points to where model strengths are resonating in specific language communities—and where to tailor evals, prompts, and pricing.

McLaren’s USGP paddock content features Gemini branding, signaling Google’s mainstream push
McLaren F1’s paddock clip tagged Google Gemini during the US Grand Prix weekend, highlighting a high‑visibility brand placement that complements Gemini’s rising share in consumer AI usage McLaren tie‑in. While not a usage metric, these placements typically precede broader awareness and funnel growth among non‑developer audiences.
🛡️ Policy & data hygiene: companion bot law, 'brain rot' risk
A concrete regulatory step (CA SB 243) and a data‑quality warning for continual pretraining. Also a platform policy rumor that could reshape distribution. Excludes evals and training results.
California enacts SB 243: companion chatbots must disclose AI and file suicide‑safety reports
California passed SB 243, requiring companion chatbots to clearly disclose they are AI and to submit annual reports detailing safeguards for users expressing suicidal ideation, with publication by the state Office of Suicide Prevention law summary, and full details in The Verge article.

For AI product owners, this means updating UX copy, adding intent detection and escalation workflows, logging/telemetry for incident handling, and likely age‑gating where experiences could be confused for humans. Scope matters: the law targets “companion” use cases; leaders should assess whether their agents plausibly fit that definition before implementing compliance at scale.
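A minimal sketch of the compliance hooks described above: AI disclosure up front, plus detection and escalation for self‑harm signals with incident logging. The keyword classifier, messages, and function names are placeholders for illustration, not legal guidance or a production safety system.

```python
import logging

DISCLOSURE = "You're chatting with an AI companion, not a human."
CRISIS_MESSAGE = ("It sounds like you may be going through a lot. "
                  "If you're in the U.S., you can call or text 988 to reach the "
                  "Suicide & Crisis Lifeline.")

def detect_self_harm_intent(message: str) -> bool:
    # Placeholder: production systems should use a dedicated safety classifier,
    # not keyword matching.
    return any(kw in message.lower() for kw in ("suicide", "kill myself", "end my life"))

def companion_turn(user_message: str, generate_reply) -> str:
    """One chat turn with an escalation guard and incident logging."""
    if detect_self_harm_intent(user_message):
        logging.warning("self-harm signal detected; escalation path triggered")
        return CRISIS_MESSAGE                  # escalate instead of a free-form reply
    return generate_reply(user_message)
```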
‘Brain rot’ quantified: junk continual pretraining slashes reasoning and retrieval accuracy
Newly circulated figures show continual pretraining on popular/clickbait “junk” text drives steep declines: ARC‑Challenge with chain‑of‑thought falls 74.9→57.2 and RULER‑CWE 84.4→52.3; authors also report higher narcissism/psychopathy traits and a failure mode of “thought‑skipping” paper snapshot.

Following up on brain rot, which warned of lasting cognitive harm from junk web text, this quantification underscores the need for strict curation in continual pretraining (avoid popularity‑biased crawls), guardrails against reward hacking, and evaluations that detect missing reasoning steps rather than only final answers.
Screenshot claims WhatsApp Business API will ban general‑purpose AI chatbots from Jan 15, 2026
A widely shared screenshot asserts WhatsApp will prohibit general‑purpose AI chatbots on its Business API effective January 15, 2026, targeting assistants like ChatGPT or Perplexity; if enacted, it would push vendors toward narrowly scoped, purpose‑specific flows and human handoff patterns policy screenshot.

Treat as unconfirmed until Meta publishes official terms, but start contingency planning: classify bot intents (narrow vs general), ensure compliant fallback/handoff, and evaluate alternative channels for open‑ended assistants.