Google Gemini 3 Pro preview surfaces with 1M‑token context, 200k tier – ~650 per GSU
Executive Summary
Engineers spotted “gemini‑3‑pro‑preview‑11‑2025” appearing inside Vertex AI configs, with tier‑200k and tier‑1M context windows and a throughputPerGsu around 650. That points to a November preview and, more importantly, a 1M‑token option that could reshape RAG and long‑context agent workflows. Access isn’t on yet; these look like internal/preview endpoints, not public API GA.
If you plan to test on day one, price the bandwidth tax now. Long windows hammer KV‑cache movement and can inflate time‑to‑first‑token, so budget for streaming, prompt caching, and cascaded summarization rather than hero prompts. The per‑GSU number gives you a back‑of‑the‑envelope for throughput; GSU is Vertex AI’s quota unit, so map it to your current spend and latency targets before wiring production routes. Also consider pairing retrieval with a dedicated reranker to keep context trims honest—pushing everything into a 1M window is rarely the cheapest win.
One adjacent signal: reports say Apple will pay about $1B/yr to run a 1.2T‑param Gemini for the next Siri, while vLLM just merged support for Kimi‑K2 reasoning traces. The direction is clear: bigger contexts and richer planning are coming, but the real gains land where runtime and retrieval are tuned to the metal. We help creators ship faster when those pieces line up.
Feature Spotlight
Feature: Gemini 3 Pro preview surfaces in Vertex AI
Gemini 3 Pro preview spotted in Vertex AI with 200k and 1M context tiers, signaling near‑term availability and a direct challenge to frontier models—developers are already seeing configs in network logs ahead of a November preview.
🛰️ Feature: Gemini 3 Pro preview surfaces in Vertex AI
Cross‑account leak shows gemini‑3‑pro‑preview‑11‑2025 in Vertex AI with 200k and 1M tiers; widespread chatter it’s arriving in November. Excludes all other Gemini product surfacing (Maps, Canvas) which are covered elsewhere.
Gemini 3 Pro preview appears in Vertex AI with 200k and 1M tiers
Engineers spotted modelUserId "gemini-3-pro-preview-11-2025" in Vertex AI traffic, with config snippets showing tier‑200k and tier‑1m context windows and throughputPerGsu ≈650; access is not yet enabled. Multiple sightings suggest a November preview rather than a full release. Network logs Config screenshot Preview tease Documented roundup

Why it matters: a 1M‑token tier reshapes retrieval and long‑context agent workflows, and the surfaced ratios help teams model token costs and latency now. Keep an eye on API enablement and quota flags before wiring production calls; the references so far point to internal/preview endpoints rather than public availability.
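To turn the surfaced ratio into a capacity estimate, here is a minimal back‑of‑the‑envelope sketch. It assumes throughputPerGsu ≈650 denotes tokens per second per GSU, which matches how Vertex AI's provisioned‑throughput quota is usually expressed but is not confirmed for this preview; all numbers are illustrative.

```python
# Back-of-the-envelope GSU sizing, assuming throughputPerGsu ~= 650 means
# ~650 tokens/second per GSU (unconfirmed for this preview).
def estimate_gsus(tokens_per_request: int, requests_per_sec: float,
                  throughput_per_gsu: float = 650.0) -> float:
    """Rough number of GSUs needed to sustain a target token throughput."""
    required_tokens_per_sec = tokens_per_request * requests_per_sec
    return required_tokens_per_sec / throughput_per_gsu

# Example: long-context requests of 200k tokens arriving ~3 times a minute.
print(round(estimate_gsus(200_000, 0.05), 1))  # ~= 15.4 GSUs to keep up
```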
🧪 Model roadmaps: Apple–Gemini, Kimi-K2 reasoning, GEMPIX2, GPT-5.1 traces
New model signals: Apple said to use a custom 1.2T‑param Gemini for Siri while building its own 1T model; Kimi‑K2 reasoning parser lands in vLLM/SGLang; Nano Banana 2 (GEMPIX2) teased; GPT‑5.1 ‘thinking’ traces. Excludes Gemini 3 preview (feature).
Apple to use Google’s 1.2T‑parameter Gemini for new Siri; ~$1B/yr deal, Apple’s own 1T model in 2026
Bloomberg/Reuters reporting says Apple is finalizing a ~$1B/year agreement to run a custom 1.2T‑parameter Google Gemini model for Siri’s summarizer/planner in a Spring 2026 overhaul, while training its own ~1T‑parameter cloud model to replace Gemini later. Siri (code‑named Linwood) will run on Apple Private Cloud Compute; China deployments will use local partners to meet local rules headline summary, deal details. Expect Gemini to handle multi‑step intent parsing and planning while Apple keeps data residency and control; the longer‑term bet is Apple’s in‑house 1T model to avoid a permanent third‑party dependency takeaways, key bullets.

GPT‑5.1 ‘thinking’ traces and new checkpoints surface in limited tests
Fresh traces labeled “gpt‑5‑1‑thinking” showed up for some ChatGPT users, alongside chatter about two near‑term GPT‑5 snapshots with lower latency and possibly larger context, plus a data‑analysis model early next year trace spotted, 3 models note. This builds on earlier hints of internal checkpoints (willow/cedar/birch), suggesting a staged rollout rather than a single drop checkpoint names. Sam Altman also teased “great upcoming models,” reinforcing the near‑term cadence models coming. Expect iterative API changes, pricing moves, and new “thinking” modes to affect tool‑calling budgets and latency envelopes.

Kimi‑K2 reasoning lands in vLLM (merged) with SGLang support incoming
vLLM merged a Kimi‑K2 reasoning parser, adding native trace parsing for Moonshot’s trillion‑param class model; SGLang says Kimi‑K2 reasoner support will be available at launch. This lowers friction to serve chain‑of‑thought–style outputs and structured traces across popular inference stacks vLLM merge, GitHub PR, SGLang note. Teams running open or hybrid fleets can pilot K2’s reasoning without bespoke log parsers or adapters, and compare it against DeepSeek R1‑style formats during A/Bs.

Nano Banana 2 pops up as “GEMPIX2” in Gemini UI, hinting at next image model
Multiple sightings of “GEMPIX2” strings in Gemini interfaces suggest Google’s next image model—colloquially Nano Banana 2—is nearing preview. UI hints also reference a new image agent for Stitch, likely wired into AI Studio and possibly Lovable, with a project‑brief generator on deck Gemini UI tease, model spotted, agent leak, feature write‑up. For builders, plan evals on identity/consistency and layout control; watch for pairing with Video Overviews and NotebookLM custom styles.

💼 Enterprise momentum: OpenAI at 1M customers, Snap–Perplexity, credits & collabs
Clear go‑to‑market signals: OpenAI crosses 1M business customers (7M Work seats); Snap picks Perplexity as default AI (reported $400M); OpenAI/Lovable credits; Genspark ships Slack app. Distinct from legal disputes and infra.
OpenAI tops 1M business customers and 7M Work seats
OpenAI says more than 1 million organizations now pay for its AI, with 7 million ChatGPT for Work seats (up ~40% in two months) and 800M weekly users driving adoption rollout post, reinforced in a longer write‑up OpenAI post. In the context of the AWS deal that secured $38B of compute, this is the clearest demand signal yet that enterprises are standardizing on OpenAI’s stack, including Company Knowledge, Codex usage up ~10× since August, and AgentKit for internal agents OpenAI post.

Snap makes Perplexity default AI in 2026; reported $400M distribution deal
Perplexity will power Snapchat’s default AI starting January 2026, putting its answer engine directly into My AI for hundreds of millions of users partnership card. Reporting pegs the distribution fee at roughly $400M with revenue recognition beginning 2026, and a shift to grounded, cited answers with heavy caching for latency deal analysis. For builders, this is a rare consumer‑scale default that should increase question volume and feedback loops for ranking.

Groq partners with Paytm to bring real‑time AI to 300M users and 20M merchants
Paytm will use Groq’s inference stack to speed fraud/risk models and conversational flows across payments and platform intelligence in India deal note. Groq says this brings real‑time AI to 300M consumers and 20M merchants, with a newsroom post outlining speed/cost advantages of its LPU approach Groq newsroom.

Turner rolls out ChatGPT Enterprise access to all employees
Turner Construction says every employee now has ChatGPT Enterprise access as part of its innovation program, signaling another large, company‑wide standardization on an enterprise AI assistant program page. For IT leaders, this implies growing pressure to unify on one assistant with auditing, data controls, and predictable billing Turner insight.
Factory partners with Snyk to embed security into agent‑native development
Factory is integrating Snyk’s scanning and guardrails directly into its agent‑native dev flow (“Droids”), with RBAC and enterprise controls designed with regulated customers, including major U.S. banks collab brief. The post describes real‑time vuln detection and remediation inside agent runs, rather than late‑stage code scanning Factory post.

OpenAI funds $1M in credits so schools can use Lovable via imagi
OpenAI is providing $1M in credits so schools can run Lovable through imagi during CS Ed Week and Hour of AI, lowering cost barriers for classroom pilots program note. The campaign positions Lovable’s AI coding experience as a turnkey way to expose students to agents and structured outputs without needing new infra Lovable page.
OpenAI grants $200 credits to Plus/Pro users after Codex cloud usage issues
After fixing Codex cloud task usage tracking, OpenAI gave $200 in free credits to Plus/Pro users who used cloud tasks in the past month, valid until Nov 20, and teased efficiency improvements ahead credits update. A follow‑up reiterates more CLI/IDE usage optimizations are coming, useful for teams trialing agentic dev flows follow‑up note.
Replicant case study: Cartesia voice hits 99.99% uptime and 3–5× lower latency
Replicant reports its Cartesia‑powered voice agents run at 99.99% uptime with 3–5× lower latency and a +10 bps containment lift in two weeks, highlighting the value of lower‑latency speech stacks in call containment economics case study. Details in Cartesia’s write‑up for teams weighing hosted STS/TTS vs piecing together providers Customer story.

Genspark launches Slack app for chat, web search, and slide/image/doc creation
Genspark is now available inside Slack: DM it, @mention in channels, or use the top‑bar button to search the web, draft presentations, create images, and more app launch. This is another sign teams want AI in their existing collaboration surfaces, not a new tab Genspark and Slack.
Vercel makes BotID Deep Analysis free for Pro/Enterprise through Jan 15
Vercel is waiving charges for BotID Deep Analysis until Jan 15 for Pro and Enterprise plans, pitching it as protection for high‑value endpoints like AI invocations and checkout during holiday spikes offer note. The changelog outlines opt‑in via the Firewall dashboard and a return to standard billing after the window Vercel changelog.
⚙️ Serving trillion‑param MoE and runtime best practices
Runtime engineering focus today: Perplexity’s custom MoE kernels enable 1T models on AWS EFA without GPUDirect Async; NVIDIA publishes vLLM on DGX Spark guidance; local runs compare llama.cpp vs Ollama. Distinct from MCP/agents categories.
Perplexity details custom MoE kernels to serve 1T‑param models on AWS EFA
Perplexity published its first research paper on expert‑parallel MoE kernels that enable trillion‑parameter serving over AWS EFA without GPUDirect Async, packing tokens into single RDMA writes and using a host proxy to overlap transfers with GEMMs paper launch, research article. The team reports multi‑node serving that matches or beats single‑node on DeepSeek V3 671B at medium batches and says the work enables Kimi K2 serving on EFA as well engineering notes.

For runtime engineers, the main takeaways are lower MoE routing overhead than the >1 ms added by generic proxy approaches, and a practical EFA path for trillion‑class inference without vendor‑specific async features.
NVIDIA posts vLLM best practices for DGX Spark multi‑node high‑throughput serving
NVIDIA published a deployment guide for running vLLM on DGX Spark, covering ARM64 containers, continuous batching, multi‑node configuration, and build tuning for high‑throughput inference deployment guide, NVIDIA guide. The guidance lands after strong DGX Spark throughput results with SGLang earlier this week, following up on SGLang DGX Spark which showed 70 tps on a 20B and 50 tps on a 120B. This gives ops teams prescriptive steps to reproduce similar throughput with vLLM.
Local tests show llama.cpp ~5× faster than Ollama on RTX 3090
A side‑by‑side local benchmark reports ~1.2 s per completion with llama.cpp vs ~5 s on Ollama for the same task on an RTX 3090 (Ryzen 5 host) benchmark run. That gap suggests swapping Ollama for llama.cpp when latency is critical, especially for short completions where launch and framework overhead dominate.

Engineers running local agents can route low‑latency paths through llama.cpp and keep Ollama where its ergonomics matter more than raw speed.
100‑request load test: Gemma 3 4B 8k on llama.cpp hits ~4.7 s median
A 100‑request run against a local /summarize endpoint using Gemma 3 4B 8k (llama.cpp) posted 100% success, ~4.7 s median, and ~5.7 s max response time on an RTX 3090 performance dashboard.

This is a useful baseline for small‑model serving under light concurrency; consider continuous batching and prompt caching to squeeze the tail.
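If you want to reproduce a baseline like this against your own endpoint, a minimal sketch follows; the URL, payload shape, and sequential (light‑concurrency) loop are assumptions to adapt, not the setup used in the linked run.

```python
# Minimal 100-request latency baseline against a local summarize endpoint.
# URL and payload are placeholders for your own llama.cpp-backed service.
import time, statistics, requests

URL = "http://localhost:8080/summarize"   # hypothetical endpoint
latencies, failures = [], 0

for _ in range(100):
    start = time.perf_counter()
    try:
        resp = requests.post(URL, json={"text": "...long document..."}, timeout=30)
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
    except requests.RequestException:
        failures += 1

print(f"success: {100 - failures}/100")
print(f"median: {statistics.median(latencies):.2f}s, max: {max(latencies):.2f}s")
```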
llama‑server demo: Gemma 3 27B Q4_K_M runs 65k context on RTX 3090
A local run shows llama‑server handling a 65,536‑token context with Gemma 3 27B (Q4_K_M) on an RTX 3090, suitable for long PDF summarization and code review tasks long context demo.

The trade‑off is decode speed and VRAM pressure; plan for aggressive KV‑cache paging or lower‑precision quantization when pushing sequence length this high.
One‑GPU, many endpoints: running multiple llama.cpp servers for load tests
A practical setup shows multiple llama.cpp instances bound to different ports on one machine, making it easy to simulate parallel traffic and test client‑side load balancing multi‑instance demo.

This pattern helps profile concurrency bottlenecks (network vs decode) and validate per‑tenant isolation without changing model binaries.
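A sketch of the client side, assuming three llama‑server instances already listening on ports 8080–8082 and exposing the OpenAI‑compatible completions route; adjust ports and paths to your launch flags.

```python
# Client-side round-robin across multiple local llama.cpp servers.
import concurrent.futures, requests

PORTS = [8080, 8081, 8082]   # one llama-server instance per port (assumed running)

def ask(i: int, prompt: str) -> str:
    port = PORTS[i % len(PORTS)]                      # deterministic round-robin by index
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",    # OpenAI-compatible route
        json={"prompt": prompt, "max_tokens": 64},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

prompts = [f"Summarize ticket {i}" for i in range(12)]
with concurrent.futures.ThreadPoolExecutor(max_workers=len(PORTS)) as pool:
    results = list(pool.map(ask, range(len(prompts)), prompts))
print(f"{len(results)} completions across {len(PORTS)} servers")
```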
vLLM adds full support for hybrid models (e.g., Qwen3‑Next, Granite 4.0)
vLLM announced full support for several hybrid architectures, widening the set of production‑ready model families you can serve without custom forks feature brief. This reduces integration work when mixing sparse/expert and dense paths in one fleet.
🧰 Coding agents and IDE tooling: search, hooks, SDKs, editors
Dev productivity updates dominate: Cursor shows semantic search boosts agent accuracy; Claude Code adds prompt‑based stop hooks; OpenRouter ships a type‑safe SDK; Zed, Chat LangChain and others ship practical editor/QoL features. Separate from MCP orchestration.
Cursor: semantic search boosts agent accuracy and retention in large repos
Cursor published results showing semantic search improves agent answer accuracy by an average +12.5%, with code retention up +2.6% in large codebases and fewer dissatisfied requests compared to grep-only workflows blog post. The post details an embedding model trained from real agent traces and an A/B that highlights gains on bigger projects results thread, while users echo that “grep AND semantic search” is the best combo practitioner note.

For builders, this argues for routing retrieval to embeddings first and falling back to grep for exact matches, especially on sprawling monorepos.
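A conceptual sketch of that routing, assuming hypothetical embed(), semantic_search(), and repo_grep() hooks standing in for your embedding model, vector store, and a ripgrep wrapper; the symbol heuristic is illustrative.

```python
import re

# Hypothetical stand-ins: wire these to your embedder, vector store, and ripgrep.
def embed(text: str) -> list[float]: ...
def semantic_search(vector, top_k: int) -> list[str]: return []
def repo_grep(pattern: str) -> list[str]: return []

def retrieve(query: str, k: int = 10) -> list[str]:
    hits = semantic_search(embed(query), top_k=k)      # embeddings first
    # Fall back to / augment with grep when the query names an exact symbol.
    if re.search(r"[A-Za-z_]\w*(\.\w+|::\w+|_\w+)", query):
        hits = repo_grep(query) + hits
    deduped = list(dict.fromkeys(hits))                # keep order, drop repeats
    return deduped[:k]
```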
Claude Code adds prompt-based stop hooks to extend runs and enforce cleanup
Claude Code now supports prompt-based stop hooks: your hook text is evaluated by Haiku to decide whether the agent should continue, and if not, what to do instead (e.g., remove extra files, write tests, summarize work) feature brief. Docs explain how hooks evaluate continuation and emit guidance back to the main loop docs how-to.

This gives teams a guardrail to keep long sessions productive without manual babysitting.
Windsurf Codemaps is now live; maps codebases for agent and human onboarding
Windsurf made Codemaps available broadly, framing it as a DeepWiki for code: it identifies semantic components and organizes them for readability during agent or human navigation launch note feature explainer. This follows our prior coverage of the feature’s debut Codemaps and arrives with community momentum from hackathons event recap.
For teams adopting coding agents, Codemaps can become the shared map that reduces “lost in repo” time.
OpenAI Codex: usage tracking fixes and $200 credits for Plus/Pro users
OpenAI Devs fixed several issues with tracking Codex cloud task usage and granted $200 in free credits to Plus/Pro users who used cloud tasks in the past month; credits are valid until Nov 20 credits update. The team says more efficiency improvements for cloud tasks and better mileage from included usage/credits in the CLI and IDE extensions are coming shortly roadmap note.
This should smooth cost spikes for teams trialing heavier cloud runs before the next round of efficiency patches lands.
OpenRouter ships type-safe JS/TS SDK (beta) with OAuth and full API coverage
OpenRouter released a new SDK in beta for JS/TS with type safety by default, built-in OAuth flows, full support for model listing, chat/completions, keys, and response streaming; Python/Java/Go versions are coming SDK launch. The package targets day‑1 compatibility for new OpenRouter APIs and includes actionable error messages to speed integration SDK launch.

If you manage multi‑provider agents, this reduces glue code while keeping model swap costs low.
Zed v0.211 lands native Windows Arm64, branch diffs, and faster inlay hints
Zed 0.211 is out with a native Windows Arm64 build and a new git: branch diff action that shows your feature branch vs main in one view release thread branch diff feature. The team also rewrote inlay hints for better performance and reliability, and improved Markdown table rendering and nested list styling inlay hints update. Full changelog is live in the stable releases page stable notes.

Good polish for daily coding loops, especially if you review changes often or live on Surface-class hardware.
Conductor adds Checkpoints, “files changed” view, and a zen mode toggle
Conductor shipped Checkpoints to reset any chat to an earlier state in one click, plus a new “files changed” view to see exactly what the AI edited in the last step checkpoints note files changed view. A clean “zen mode” hides the sidebar, and the app now ships via brew for keyboard‑only installs zen mode view brew install.

Resetting state and auditing diffs are practical controls for production agent use.
LangChain rebuilds Chat LangChain with streaming, persistent history, and LangSmith traces
LangChain’s Chat LangChain got a ground-up rebuild: real‑time streaming responses, persistent chat history, direct trace viewing in LangSmith, and a feedback loop when answers miss the mark feature screenshot. It targets faster iteration against LangChain docs/KB with observability built in.

If you’re onboarding agents to a new codebase, traces plus persistence help tighten the debug loop.
Roo Code indexing now supports OpenRouter embeddings for code search
Roo Code added OpenRouter-provided embedding models to its codebase indexing pipeline, widening retrieval options for agent coding sessions indexing support. This lets teams standardize on a single provider for both generation and embeddings if they prefer.
A small change, but it can cut model sprawl in multi-repo setups.
🧩 MCP to code execution and standardization debates
Strong momentum toward ‘code mode’ MCP: Anthropic’s guide turns MCP servers into code APIs to slash token use; Cloudflare echoes TS‑API approach; mcporter compiles MCPs to CLIs; MCP core accepts long‑running Tasks. Includes client debates on FS proxies.
MCP Core will add long‑running Tasks to the spec
MCP Core maintainers accepted a proposal to introduce long‑running Tasks in the next revision of the protocol, enabling agents to start, track, and resume multi‑minute workflows without stuffing state into the context window maintainers note. This change aligns the spec with the emerging “code mode” pattern where models orchestrate tool work over time rather than one‑shot calls.
Here’s the catch: implementations will need robust task lifecycle semantics (create, poll, cancel) and audit logs to avoid zombie runs and surprise costs. Expect client updates to surface task UIs and retry logic shortly after the spec lands.
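The spec text isn’t published yet, so the following is a hypothetical lifecycle sketch only: it shows the create/poll/cancel shape plus a deadline guard against zombie runs; create_task, get_task, and cancel_task are stand‑in method names, not the MCP API.

```python
import time

def run_task_with_deadline(client, tool: str, args: dict, timeout_s: float = 600.0):
    """Hypothetical create -> poll -> cancel loop for a long-running MCP task."""
    task_id = client.create_task(tool, args)             # start the work
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = client.get_task(task_id)                # poll for progress
        if status["state"] in ("completed", "failed", "cancelled"):
            return status                                # surface result (and audit it)
        time.sleep(min(5.0, max(0.1, deadline - time.monotonic())))
    client.cancel_task(task_id)                          # no zombie runs, no surprise costs
    raise TimeoutError(f"task {task_id} cancelled after {timeout_s}s")
```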
Cloudflare pushes “Code Mode” for MCP as TypeScript APIs
Cloudflare detailed a "Code Mode" approach that presents MCP tools as TypeScript modules the model can import and program against, cutting prompt bloat and letting code handle control flow, batching, and reuse Cloudflare blog post. Anthropic’s engineering write‑up points in the same direction, and practitioners are already treating this as the default pattern for large toolsets anthropic reference.

Why it matters: pushing tool composition into code slashes token churn and latency, and it makes error handling and privacy gates enforceable in code rather than in free‑form prompts.
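A conceptual illustration of the pattern in Python (Cloudflare’s version generates TypeScript modules): the tool wrappers search_issues and post_summary are hypothetical, and the point is that filtering, batching, and the privacy gate run in code, so only the final summary re‑enters the model’s context.

```python
def triage_stale_issues(repo: str, search_issues, post_summary) -> str:
    """Compose two MCP-backed tools in code instead of round-tripping the model."""
    issues = search_issues(repo=repo, state="open")              # one bulk tool call
    stale = [i for i in issues if i["days_since_update"] > 30]   # control flow in code
    # Privacy gate: only non-sensitive fields leave this function.
    redacted = [{"number": i["number"], "title": i["title"]} for i in stale]
    return post_summary(channel="#maintainers", items=redacted)
```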
Practitioners debate FS proxies vs progressive tool discovery for MCP
Builders are split on whether agents should proxy through a file system (write outputs to disk, pipe with jq) or use progressive tool discovery to keep everything inside MCP messages. Arguments for FS: huge payloads can flow without round‑tripping the model, and standard CLI filters are battle‑tested fs benefits, client power note. Arguments against: FS breaks OS‑agnosticism and increases sandboxing burden; better to standardize discovery and only expose a minimal subset of tools per task fs proxy critique, standardization call. Anthropic and Cloudflare’s "code mode" posts are being cited by both sides to justify their defaults anthropic reference.
So what? If you can’t isolate the filesystem cleanly, prefer progressive tool disclosure and code‑level buffers; otherwise, a sandboxed FS can be a pragmatic bridge while the MCP spec evolves.
e2b sandboxes ship with Docker’s 200+ MCP tools prewired
e2b added the Docker MCP Catalog to its sandboxes, exposing 200+ pre‑defined tools (Git, Jira, etc.) that agents can call from an isolated runtime catalog update. For teams standardizing on MCP, this cuts setup time and reduces security drift because tools live behind a single sandbox boundary.
mcp2py exposes MCP servers as importable Python modules
mcp2py lets you pip‑install a wrapper that exposes any MCP server as a Python module, so agents can call tools with normal function calls and typed args instead of inflating context with tool JSON package note, GitHub repo. This is Code Mode for Python: less token overhead, native retries, and easier composition inside agent runtimes.
mcporter turns any MCP server into a self‑contained CLI
A new utility, mcporter, compiles any MCP server into a single bun‑based command‑line tool via one command (npx mcporter generate‑cli … --compile), giving agents and humans a consistent, auditable interface to the same capabilities tool announcement, GitHub repo. There’s also a live directory of MCPs to target MCP directory.
The point is: CLIs are easy to sandbox, log, and permission—useful when moving from prompt‑driven experiments to production ops and CI jobs.
⚖️ Platforms, lawsuits and public positioning
Policy & legal beats: Amazon’s complaint over Perplexity Comet’s automated shopping; OpenAI CFO clarifies ‘no federal backstop’ ask; Japan’s CODA presses OpenAI on Sora 2 training. Excludes Snap–Perplexity commercial terms (covered in Enterprise).
Amazon sues Perplexity over Comet agent’s covert shopping on Amazon
Amazon filed a suit alleging Perplexity’s Comet automates logged‑in Amazon sessions while masquerading as a human, bypassing bot controls and risking wrong orders or data exposure lawsuit summary. The filing argues the agent degrades personalization tests and evades safeguards by disguising traffic; Amazon says it warned Perplexity earlier.

For AI leads, this is a warning shot: if an agent acts as the user, platforms may treat it as unauthorized automated access. Expect more sites to enforce bot disclosure and require official APIs.
- Route shopping automations through approved APIs. Don’t simulate humans inside logged‑in flows.
- Add explicit bot identifiers and throttles (see the sketch below). Keep audit trails per action.
- Treat carts, checkouts, and address books as high‑risk scopes with extra review.
See Reuters’ parallel coverage of the legal threat here Reuters coverage.
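A minimal sketch of the identifier/throttle/audit guidance above; the User‑Agent string, rate limit, and endpoint are illustrative, and whether a platform accepts declared automated traffic at all is something to confirm in its terms and bot policy.

```python
import json, logging, time, requests

logging.basicConfig(filename="agent_audit.log", level=logging.INFO)

session = requests.Session()
# Explicit bot identifier instead of simulating a human browser.
session.headers["User-Agent"] = "ExampleShoppingAgent/1.0 (+https://example.com/bot)"

MIN_INTERVAL_S = 2.0      # crude throttle: at most one call every two seconds
_last_call = 0.0

def call_api(url: str, payload: dict) -> dict:
    global _last_call
    wait = MIN_INTERVAL_S - (time.time() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.time()
    resp = session.post(url, json=payload, timeout=30)
    # Audit trail: one log line per action taken on the user's behalf.
    logging.info(json.dumps({"url": url, "action": payload.get("action"),
                             "status": resp.status_code}))
    resp.raise_for_status()
    return resp.json()
```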
Apple to pay ~$1B/yr for Google’s 1.2T‑param Gemini to power new Siri
Bloomberg reporting says Apple will use a custom 1.2T‑parameter Gemini to handle Siri’s summarizer and planner, paying about $1B annually, while Apple races to finish its own ~1T model next year for eventual replacement deal summary. The stack will run on Apple Private Cloud Compute so user data doesn’t hit Google’s systems key details.

This is a platform alignment with a price tag. For partners, expect Siri to upgrade research/planning tasks first, with on‑prem Apple models backfilling regions like China via local providers. It also signals that trillion‑class sparsity is now an integration game, not just a lab milestone.
China orders state‑funded data centers to use domestic AI chips; foreign GPUs barred
China instructed new state‑funded data centers to deploy only domestic AI accelerators; projects <30% complete must rip and replace foreign chips or cancel purchases policy report. Nvidia’s share in China reportedly fell from ~95% in 2022 to near zero by Nov 2025; even the China‑market H20 and grey‑market B200/H200 are now blocked in these builds policy report.

Deployment won’t stop—Huawei, Cambricon and others gain a captive market—but software/tooling porting will slow rollouts. Multinationals should plan for bifurcated model builds and finetunes with region‑locked kernels and ops.
Japan’s CODA escalates, says Sora 2 training can’t rely on opt‑out under JP law
Japan’s CODA (members include Studio Ghibli, Bandai, Square Enix) urged OpenAI to stop using member works to train Sora 2, arguing training‑time copying may be reproduction under Japanese law and that opt‑out can’t cure the violation post hoc CODA position. This follows Japan demand where the group first pressed OpenAI to halt use.

If adopted by regulators or courts, permission‑first pipelines and auditable provenance could become mandatory in Japan. Teams should prepare explicit licenses, dataset receipts, and region‑aware training gates.
OpenAI CFO clarifies ‘no federal backstop’ sought for AI infra financing
After headlines suggested OpenAI wanted a U.S. government debt guarantee, CFO Sarah Friar said the company is not seeking a federal backstop and that ‘backstop’ referenced public‑private collaboration on strategic AI infrastructure CFO clarification. The clarification counters claims that taxpayers would cover trillion‑dollar losses critics’ claim.

For policy teams, this is OpenAI trying to cool bailout optics while keeping optionality for public‑sector involvement in power, land, and grid buildouts. Watch how language shifts in future capex discussions.
Microsoft stands up an AI publisher marketplace; People Inc. is a launch partner
People Inc. signed a pay‑per‑use AI licensing deal with Microsoft as a launch partner for a new ‘publisher content marketplace,’ with Copilot as the first buyer deal summary. People Inc. says Google AI Overviews cut Google’s share of its traffic from 54% to 24% over two years; it previously inked an ‘all‑you‑can‑eat’ license with OpenAI deal summary.

The takeaway: content licensing is consolidating into platform‑run markets. If you run an AI product, plan for per‑use content costs; if you’re a publisher, expect more leverage to bundle crawler blocking with paid access.
Perplexity hits back: ‘Bullying is not innovation’ and defends user‑consented agent shopping
Perplexity publicly counters Amazon, arguing Comet acts with user consent to comparison‑shop and checkout on their behalf, accusing Amazon of protecting ad‑driven upsell flows rather than innovation Perplexity rebuttal. The post frames the dispute as who controls the logged‑in session and whether agents must self‑identify instead of simulating people.

The point is: agent builders should assume large retailers will push for bot disclosure and contractual interfaces. If your UX relies on pretending to be a person, expect blocks and litigation.
Anthropic and Iceland launch national AI education pilot using Claude
Anthropic and Iceland’s Ministry of Education will give hundreds of teachers Claude access for lesson planning, material adaptation, and on‑demand student support, with a focus on preserving the Icelandic language pilot overview. It builds on Anthropic’s growing public‑sector footprint in Europe.

This is public positioning with practical stakes. If the pilot shows workload reduction without quality loss, expect similar national deals—and stronger localization asks for mid‑resource languages.
Nvidia’s Jensen Huang: China could ‘win’ AI race on power subsidies and looser rules
Jensen Huang told the FT that China’s cheaper power and fewer regulatory hurdles could help it outpace the U.S./UK, where 50 state rules risk fragmentation and delays FT remarks. He argues model gaps are narrowing, so access to cheap energy and scale will decide who deploys faster.

For strategy, this frames energy and permitting as the moat. Expect more U.S. lobbying around unified rules and power incentives, and more China subsidies to keep domestic chips fully utilized.
🏗️ Compute constraints: memory wall, HBM and geopolitics
Infra discourse centers on bandwidth limits: detailed ‘memory wall’ threads (KV cache dominance, HBM bit demand) plus policy/economics signals (China chip bans for state DCs; Jensen on power subsidies). Excludes space compute (covered prior day).
China bans foreign AI chips in state-funded data centers; projects must rip and replace
China has told new state-funded data centers to use domestic accelerators only; any build under 30% completion must remove already-installed foreign AI chips or cancel purchases. The rule effectively bars Nvidia, AMD and Intel from these projects, carving out a captive market for Huawei, Cambricon and peers, while raising porting costs and slowing short‑term rollouts that depend on CUDA software ecosystems reuters summary.

This is a direct supply shock for AI capacity planning. Nvidia’s share in China had already fallen sharply; losing state builds removes a remaining high‑volume path back into the market. Expect more onshore compiler/runtime workarounds and a wider US–China compute gap until local stacks mature.
Memory wall takes center stage: KV cache movement, not FLOPs, is the bottleneck
A detailed engineering thread argues the primary limit to GenAI performance is memory bandwidth, not raw compute, with decoder LLMs dominated by KV‑cache reads as context grows. It contrasts ~60,000× growth in peak compute over ~20 years with only ~100× DRAM bandwidth and ~30× interconnect gains, and explains why research is shifting to reducing or reorganizing KV movement rather than adding FLOPs memory wall thread, kv cache costs.

Why this matters: HBM capacity is rising but deliverable bandwidth lags, so long contexts, bigger batches, and deeper stacks stall on memory I/O. A follow‑up dives into HBM bit demand exploding with model/context growth and notes prefill/kv read patterns dominate bytes moved, pushing the industry toward KV offload, prefill‑decode disaggregation, and bandwidth‑efficient attention hbm bit demand. Builders should budget for bandwidth first, treat longer contexts as a bandwidth tax, and prioritize kernels/serving plans that minimize KV traffic.
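To see why long contexts read as a bandwidth tax, here is an illustrative decode‑rate bound. The model shape (80 layers, 8 KV heads, head dim 128, FP16 cache), the H100‑class HBM figure, and the batch‑1/no‑paging assumptions are all stand‑ins; real serving with sharding, paging, and batching will differ.

```python
# Bandwidth-bound decode estimate: every generated token re-reads the KV cache
# plus the weights, so bytes moved per step, not FLOPs, set the ceiling.
n_layers, n_kv_heads, head_dim = 80, 8, 128        # Llama-70B-like GQA shape (assumed)
bytes_per_elem = 2                                  # FP16 cache
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V

context_len = 1_000_000                             # a 1M-token window
kv_cache_bytes = context_len * kv_bytes_per_token   # ~328 GB (sharded/offloaded in practice)
weight_bytes = 70e9 * 2                             # ~140 GB of FP16 weights

hbm_bandwidth = 3.35e12                             # ~3.35 TB/s, H100 SXM class
bytes_per_step = kv_cache_bytes + weight_bytes      # batch size 1, no paging or reuse
print(f"KV cache: {kv_cache_bytes / 1e9:.0f} GB")
print(f"decode ceiling: {hbm_bandwidth / bytes_per_step:.1f} tokens/s")  # ~7 tok/s
```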
Perplexity ships MoE kernels to serve trillion‑param models efficiently on AWS EFA
Perplexity published custom expert‑parallel MoE kernels that work around AWS EFA’s lack of GPUDirect Async by coordinating RDMA writes via a host proxy thread and overlapping with grouped GEMMs. They report multi‑node serving that matches or exceeds single‑node for DeepSeek‑V3 671B at medium batches and enabling Kimi K2 serving—making 1T‑parameter MoE inference portable across clouds research article, engineering summary, Perplexity research.

Why it matters: networking and memory movement are the ceiling. Packing tokens to minimize cross‑GPU traffic and overlapping compute with transfers is how you buy back bandwidth. If you’re stuck on EFA, these kernels turn a hard limit into a tuning problem rather than a platform blocker.
Jensen Huang: China could ‘win’ AI on cheaper power and looser rules
In comments to the FT, Nvidia’s CEO argues China may outpace the US/UK in AI deployment thanks to lower effective power costs via subsidies and fewer regulatory hurdles, while US rules fragment across states and model capability gaps narrow. The thesis reframes the race as an energy and rollout problem more than a model frontier contest ft excerpt.

For infra leads, the point is blunt: power pricing and permitting speed decide training and serving economics. Policy that splinters compliance into 50 regimes or delays interconnects can be as constraining as GPU supply.
📊 Leaderboards and evals: Arena Expert, speech accuracy, retrieval tests
Mostly new eval artifacts today: LMArena’s Arena Expert and Occupational Categories; AA-WER shows open STT models surpass Whisper; CodeClash appears for goal‑oriented SWE. Distinct from model announcements or MCP.
LMArena debuts Arena Expert with 8 expert leaderboards
LMArena launched Arena Expert, a harder, cleaner evaluation that filters to the toughest 5.5% of real user prompts and adds eight Occupational Categories to see how models fare in specific fields launch thread. The team reports that prompt structure and complexity drive clearer head‑to‑head separation than Arena Hard, with wider score spreads on these expert prompts prompt structure note.

For practitioners, there’s now an open Expert dataset of 5,130 prompts with occupational tags to reproduce findings and run custom slices Hugging Face dataset. Early model callouts include Qwen3‑max‑preview ranking #4 globally, and Qwen’s open model taking the top OSS spot on Expert, per Alibaba Qwen performance. You can scan the new Expert leaderboard and its domain breakouts here Arena leaderboard.
Open STT models beat Whisper on AA‑WER benchmark
Artificial Analysis’ AA‑WER shows new open‑weights speech models surpassing OpenAI’s Whisper across three real‑world sets (AMI‑SDM meetings, Earnings‑22, VoxPopuli parliament). NVIDIA’s Canary Qwen 2.5B and Parakeet TDT 0.6B V2 lead, with Mistral’s Voxtral Small/Mini and IBM Granite Speech 3.3 8B close behind results post. The study emphasizes privacy‑friendly deployment and fine‑tuning flexibility for STT pipelines, with full leaderboard and methods published leaderboard and methodology.
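WER here is the usual (substitutions + deletions + insertions) divided by reference word count; a minimal implementation for spot‑checking any candidate STT model on your own transcripts:

```python
# Minimal word error rate: (substitutions + deletions + insertions) / reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```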

Cursor: Semantic search lifts coding agent accuracy ~12.5%
Cursor reports that adding semantic search to agents improves answer accuracy by an average of 12.5% versus grep‑only, with code retention up +2.6% in larger repos and fewer dissatisfied requests, based on offline evals and online A/B tests blog post. The post details a custom embedding model trained from agent sessions and shows where grep still helps, especially paired with CLI‑level traversal results link and engineer note. See the methodology and examples in the write‑up blog excerpt.

DeepMind details IMO‑Bench; Gemini DeepThink leads proofs, autograder tracks humans
DeepMind shared new charts showing its Gemini‑based ProofAutoGrader correlates tightly with human grading on the advanced track (Pearson ≈0.93–0.96) while Gemini DeepThink (the IMO gold system) leads on proof quality among public baselines benchmarks chart. This follows IMO‑Bench, which introduced Answer, Proof, and Grading benches to test math reasoning beyond short answers.

Voyage: Rerank‑2.5 beats LLMs for reranking on cost, speed, accuracy
VoyageAI’s study argues specialized rerankers remain the right tool: their rerank‑2.5 is up to 60× cheaper, 48× faster, and as much as +15% better on NDCG@10 compared to LLMs used as rerankers, assuming a solid first‑stage retriever blog post. The takeaway for RAG builders is to invest in retrieval + reranker quality instead of pushing LLM context to do ranking work.
CodeClash benchmark tests goal‑oriented SWE via code tournaments
CodeClash reframes coding evals from single tasks to goal‑oriented software engineering: models iteratively edit codebases and then face off in multi‑round “code arena” tournaments, measuring strategy, maintenance, and competitive outcomes rather than one‑shot fixes paper page. The authors report diverse coding styles but shared weaknesses in long‑term planning under competition, with large‑scale runs across arenas and LMs paper page.
LTD‑Bench: “Let them draw” catches spatial reasoning gaps
LTD‑Bench evaluates models by making them draw dot‑matrix or code‑rendered images from text, and read them back—exposing spatial reasoning weaknesses that text‑only tests can hide paper brief. The suite spans letters to real objects and separates “make” vs “read” skills; multimodal status doesn’t guarantee an edge on these text‑first spatial tasks.

🎙️ Realtime voice agents and hosted LLM stacks
Production voice stacks trend: ElevenLabs hosts LLMs alongside STT/TTS for lower latency and cost; Modal outlines 1‑second voice‑to‑voice pipeline; Replicant+Cartesia reports measurable containment/latency gains. Not creative music models.
ElevenLabs hosts LLMs beside STT/TTS to cut reasoning cost and latency
ElevenLabs is now hosting LLMs inside its Agents Platform so voice agents don’t hop clouds for reasoning. The company highlights GLM 4.5 Air for tool‑calling accuracy at about one‑third the cost of alternatives and Qwen3‑30B‑a3b with sub‑150 ms time‑to‑first‑sentence for smooth back‑and‑forth speech platform update, with model details in updated docs docs page.

Why it matters: collapsing STT, LLM, and TTS in one runtime removes round‑trip latency and failure modes. Teams building phone trees, support deflection, or concierge agents can now trade absolute peak model IQ for consistent turn‑taking and lower bills.
Modal shows ~1s voice‑to‑voice loop using Parakeet, Qwen3‑4B and KokoroTTS
Modal published an end‑to‑end recipe that hits about one‑second voice‑to‑voice latency using Parakeet STT, Qwen3‑4B for reasoning, and KokoroTTS, orchestrated with Pipecat over WebRTC/WebSockets release thread, with engineering notes on inference scheduling and network paths in the write‑up latency blog. This lands as a credible reference stack for teams that need near‑realtime dialog without proprietary lock‑in—especially IVR upgrades and kiosk flows.
It also lands in context of Realtime stack where Together AI rolled out an E2E voice pipeline; Modal’s blueprint adds concrete component choices and diagrams that infra leads can replicate.
AssemblyAI launches a unified voice API with sub‑300 ms streaming STT and built‑in PII
AssemblyAI pitched a single API that replaces five separate services many teams stitch together. It promises real‑time transcription under ~300 ms, 99+ languages, speaker diarization that tolerates crosstalk, automatic PII redaction, and direct LLM hooks for intent/summary layers platform pitch, with access via a public signup signup page. If you’re still juggling disparate STT, redaction, and reranking jobs, this reduces glue code and observability gaps.

Replicant cites 99.99% uptime and 3–5× lower latency after moving to Cartesia
Replicant says Cartesia’s speech stack lifted reliability and speed at scale: 99.99% uptime, 3–5× lower latency, human‑quality tone/IDs, and +10 bps containment within two weeks on production traffic case highlights, with a deeper customer story detailing the Sonic model and deployment setup customer story. For contact centers, the takeaway is simple: tighter voice loops improve handoffs and reduce agent transfer rates.

Microsoft 365 Copilot adds interruptible voice chat on mobile with enterprise guardrails
Microsoft enabled natural voice chat in the Microsoft 365 Copilot mobile app, including the ability to interrupt mid‑response and get spoken replies grounded in your work graph and the web mobile voice rollout. The company frames it as hands‑free productivity with secure transcription/storage and plans desktop/web availability by year‑end mobile voice rollout. This is a pragmatic baseline: not the lowest latency on the market, but firmly integrated with enterprise identity, data boundaries, and audit.
Navan outlines evals for a production travel voice agent
Navan’s team walked through how they test a voice agent that rebooks travel and handles expenses, focusing on multimodal evaluation rather than demos—a useful signal for anyone graduating POCs into operations eval talk, with a public session link for details event page. The point is: treat voice agents like products, not prototypes—define success cases, measure latency/ASR drift, and load‑test turn‑taking.
🗂️ Retrieval stacks: agentic RAG and reranking evidence
RAG and retrieval pipelines get practical: Hornet pitches agent‑first retrieval; Weaviate shows self‑correcting RAG with evals; Voyage AI argues dedicated rerankers beat LLM reranking on price/speed/accuracy. Separate from Cursor’s IDE‑focused search.
VoyageAI: purpose‑built reranker beats LLM reranking on price, speed, and accuracy
VoyageAI argues LLMs are not the right tool for reranking: their rerank‑2.5 model is up to 60× cheaper, 48× faster, and as much as +15% better on NDCG@10 than LLM‑as‑reranker baselines, with strongest gains when paired with robust first‑stage retrieval Voyage blog. Practitioners are already sanity‑checking and debating the results for agent stacks engineer comment.
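For reference, NDCG@10 rewards placing the most relevant documents at the top of the first ten results; a small implementation for checking a reranker swap against your own graded labels:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """relevances: graded relevance of each result, in ranked order."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A ranking that buries the most relevant doc scores lower than the ideal order.
print(ndcg_at_k([0, 2, 3, 0, 1]), ndcg_at_k([3, 2, 1, 0, 0]))  # ~0.66 vs 1.0
```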
Hornet unveils agent‑first retrieval engine built for iterative/parallel workflows
Hornet surfaced with an agent‑oriented retrieval stack that uses schema‑first APIs and supports iterative and parallel query loops, with deployment options in VPC or on‑prem for tighter control why we build, and momentum hints from the team today team update. The pitch is simple: retrieval designed for autonomous agents rather than humans, so you can route long, structured queries, keep token budgets predictable, and run at enterprise boundaries. See the product rationale and positioning on the site Hornet site.
Weaviate details self‑correcting RAG that detects hallucinations and retries with feedback
Weaviate published a behavior‑shaping pattern where the RAG pipeline automatically flags likely hallucinations, generates corrective feedback, and retries before a response reaches users—evaluated with Arize to keep quality measurable guide summary. The design squarely targets enterprise reliability: do retrieval, check claims, repair the prompt, and only then emit an answer.
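A compact sketch of the loop (not Weaviate’s actual implementation): generate a draft, check its claims against the retrieved context, and retry with corrective feedback a bounded number of times. retrieve, generate, and find_unsupported_claims are hypothetical hooks for your retriever, LLM call, and groundedness check.

```python
def answer(question: str, retrieve, generate, find_unsupported_claims,
           max_retries: int = 2) -> str:
    context = retrieve(question)
    feedback = ""
    for _ in range(max_retries + 1):
        draft = generate(question=question, context=context, feedback=feedback)
        unsupported = find_unsupported_claims(draft, context)   # likely hallucinations
        if not unsupported:
            return draft                                        # grounded: safe to emit
        # Repair before anything reaches the user: retry with targeted feedback.
        feedback = ("Remove or re-ground these unsupported claims: "
                    + "; ".join(unsupported))
    return "No fully grounded answer could be produced from the retrieved sources."
```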

📎 Workflows: Workspace research, Maps Q&A and ChatGPT QoL
Productivity assistants: Gemini Deep Research now pulls Gmail/Drive/Chat; Google Maps adds Gemini‑summarized Q&A; Gemini Canvas can draft presentations; ChatGPT can inject context mid‑run. Excludes Gemini 3 model preview (feature).
Gemini Deep Research now reads Gmail, Drive and Chat
Google is rolling out Deep Research with Workspace sources, letting you pull from Gmail, Drive, Docs/Sheets/Slides, and Chat alongside web search for a single research task beta note, with early testers showing heavy, accurate email/thread synthesis and source‑scoped runs tester demo. A community write‑up confirms the selector for Search, Gmail, Drive and Chat, and that the feature is live on desktop with mobile coming later doc post, see details in the walkthrough testing write‑up.

For teams, this eliminates manual copy‑paste into prompts and reduces hallucinations by grounding answers in your tenant content. Keep it in a test domain first; confirm permissions and data scopes per source before enabling org‑wide.
ChatGPT now lets you update long runs with new context
OpenAI added the ability to interrupt a long‑running response and inject extra instructions or clarifications without restarting the run; hit Update in the sidebar and type new context, and the model will adjust mid‑flight feature note. Early users are confirming the quality‑of‑life boost and calling out parity with other assistants that supported mid‑run edits user recap.
This trims wasted tokens and time on deep research or long planning chains. Use it to narrow scope, add constraints, or paste fresh data as the model is thinking.
Gemini Canvas can auto‑draft presentations from your files
Gemini Canvas now converts an uploaded file or Google Doc into an editable slide deck—images, charts, and text—then lets you download or export to Google Slides for edits user report. Access and examples are outlined in Google’s Canvas overview with model support and export paths availability note, see the official page for usage details Canvas overview.
This is a practical win for PMs and sales: turn briefs into first‑pass decks in minutes. Sanity‑check data visualizations and add speaker notes; Canvas won’t infer org‑specific narratives without guidance.
Google Maps adds Gemini Q&A, landmark directions and Lens
Google Maps is adding Gemini‑powered place Q&A, landmark‑based directions (e.g., “turn right after Thai Siam”), proactive traffic/closure alerts, and a Lens mode to ask about places through the camera; rollout is starting on Android and iOS in the U.S. feature images, with a fuller capability rundown here feature summary.

Why it matters: fewer context switches to Search and faster trip setup. Expect early quirks around local data coverage; verify answers against recent reviews before committing plans.
🎬 Generative media: ads, upscalers and subject‑consistent video
Creative stack updates: Coca‑Cola’s AI Christmas spot made via an open workflow; fal’s Crystal Upscaler to 10K res; ByteDance’s BindWeave for subject‑consistent video; NotebookLM custom video styles. Separate from voice agents and core model releases.
ByteDance’s BindWeave targets subject‑consistent video via cross‑modal integration
ByteDance released BindWeave, a framework that conditions a diffusion transformer on subject‑aware hidden states from an MLLM to keep characters consistent across shots and prompts, showing strong scores on OpenS2V‑Eval paper page, ArXiv paper. The team has also published a model card for developers to test model card, with extra discussion in the community paper discussion. If you’re fighting continuity drift in multi‑subject scenes, this is a fresh baseline to benchmark against your current control stack.
Coca‑Cola’s holiday ad is AI‑made; open workflow credited with ~70k clips in ~30 days
Coca‑Cola’s annual Christmas spot was produced with AI again this year, with community producers attributing an open, node‑based workflow (ComfyUI/Comfy Cloud) and citing ~70,000 video clip iterations refined in about a month ad confirmation, workflow hint. A credited artist on the project said the production was intense but achievable with the pipeline artist note. For media teams, this signals the maturing of composable, template‑driven video systems for large ad campaigns at speed, not just experiments comfy cloud note.
MotionStream demos real‑time video generation (29 FPS, ~0.4 s latency) with interactive motion
MotionStream distills a motion‑controlled T2V model into a causal student with sliding‑window attention to hit ~29 FPS and ~0.4 s end‑to‑end latency on a single NVIDIA H100, enabling paint‑by‑trajectory motion and camera control in real time paper page. If you’re building live graphics or broadcast overlays, this shifts from offline renders to interactive motion routing.
ElevenLabs Music launches text‑to‑track generation for studio‑quality songs
ElevenLabs rolled out Music: generate full tracks—instrumental or with vocals—by prompt, with structure control, genre selection, and export to common formats release thread. A hosted version is live on Replicate for quick trials and API usage Replicate page. This gives video teams and editors an in‑house soundtrack option for temp or final cuts without licensing delays.
fal hosts Crystal Upscaler to 10K resolution with identity‑preserving portraits
fal added Crystal Upscaler, an image and video‑frame upscaler that prioritizes sharpness while preserving subject identity and supports outputs up to 10K resolution launch note. It’s positioned to avoid plastic skin and over‑smoothing typical of GAN upscalers, and is live to try now try it now, with more details on the model page model page. This is useful for up‑rezzing portrait shots and keeping continuity across edited video frames without retraining.
Google preps a Stitch Image Agent, likely backed by ‘GEMPIX2’ (Nano Banana 2)
TestingCatalog spotted a new Image Agent mode inside Stitch with a banana icon, plus export hooks to AI Studio and possibly Lovable, and an option to auto‑generate a project brief feature scoop, feature article. If you storyboard or spec creatives in Stitch, this suggests direct AI image placement and cross‑tool handoff are coming soon.
Microsoft’s MAI‑Image‑1 is live in Bing Image Creator and Copilot Audio Expressions
Microsoft’s first in‑house image generator, MAI‑Image‑1, is now available inside Bing Image Creator and in Copilot’s Audio Expressions, with EU rollout “coming soon” feature rollout. For design teams, this adds another first‑party option with tight placement in Microsoft’s stack, reducing reliance on third‑party services.

NotebookLM’s Video Overviews UI shows ‘Custom visual style’ prompts
A new NotebookLM interface lets creators pick a “Custom” style and provide a prompt like “animated children’s storybook” in a “Customize Video Overview” modal, alongside presets like Whiteboard, Kawaii, and Anime ui screenshot. This follows style prompts landing earlier; today’s change is a clear UI path to inject your own look. For teams automating explainer videos, this reduces post‑pass grading—set a house style once and generate on‑brand cuts repeatedly.

StyleSculptor proposes zero‑shot, style‑controllable 3D asset generation
StyleSculptor details a texture‑geometry dual‑guidance method to spin up 3D assets from single images, letting you steer output style without per‑subject training paper page. For art teams, it’s another route to generate consistent props and variants quickly, slotting into asset pipelines ahead of rigging or simulation.
Qwen Image Multiple Angles LoRA improves character and scene consistency
A community LoRA tuned on Qwen’s image stack aims to keep characters and scenes consistent across multiple angles, and early examples highlight stable identity and framing model tease. It’s a practical add‑on for storyboards and comics when you need the same subject rendered across pose variations.
🤖 Field robots and humanoids
Embodied AI clips: robot dogs in Sichuan firefighting trials (hose handling, gas/temperature telemetry); Xpeng humanoid physiognomy demo. Limited but notable applied robotics signals today.
Robot dogs enter live firefighting trials in Sichuan
China released footage of quadruped robots being tested for firefighting, hauling hoses, streaming real‑time video, and sensing toxic gases and temperature in hazardous zones trial footage, with a separate rundown highlighting their intended role in areas unsafe or unreachable for humans capabilities recap. For teams building embodied AI, this is a concrete deployment signal: perception, telemetry, and locomotion are now being integrated into incident response workflows rather than lab demos.
Unitree’s ‘Embodied Avatar’ enables full‑body teleoperation
Unitree is promoting an embodied avatar system for full‑body teleoperation—positioned as a production‑grade control stack rather than a research demo platform teaser. For field robotics teams, this suggests near‑term pathways to mix human teleop with on‑device policies for tasks that remain too brittle for autonomous execution.
Xpeng shows a more human‑proportioned humanoid design
Xpeng surfaced a humanoid robot prototype described as closer to human physiognomy than typical designs design note. Details are thin, but the emphasis on proportions hints at a push toward balance, reach, and manipulation that map better to human environments, which matters for gait control and hand–eye coordination models.