Anthropic locks ~1M Google TPUs; capacity tops 1 GW
Executive Summary
Anthropic just turned rumor into steel: it's locking up roughly 1M Google TPUs, with well over 1 GW slated to come online in 2026. The spend is "tens of billions," and it isn't just about training; Anthropic says it chose TPUs for price-performance on serving too, which is where margins go to die if you pick the wrong silicon. This is the rare compute deal that actually changes a roadmap: guaranteed throughput means shorter training queues and fewer rate-limit headaches for customers next year.
Google is pushing Ironwood (TPU v7) as the serving-first piece of the puzzle, and that tracks with Anthropic's pitch to enterprise buyers who care more about steady-state token costs than one-off megatrain runs. Demand doesn't look made-up either: company commentary pegs annualized revenue near $7B, which explains why they're pre-buying capacity instead of praying for cancellations on GPU waitlists. Still, Anthropic is careful to say it's staying multi-cloud and multi-silicon, with Amazon Trainium and NVIDIA GPUs in the mix so workloads can land where unit economics and latency actually make sense.
Net: this is a compute hedge and a serving bet wrapped into one, and it puts real pressure on rivals to show similar 2026-dated capacity, not just MOUs.
Feature Spotlight
Feature: Anthropic × Google secure ~1M TPUs, >1 GW by 2026
Anthropic locks a multi-year, multi-billion Google Cloud deal for up to 1M TPUs (>1 GW by 2026), materially expanding Claude training and serving capacity and reshaping compute economics for enterprise AI.
⚡ Feature: Anthropic × Google secure ~1M TPUs, >1 GW by 2026
Cross-account confirmation that Anthropic will massively expand on Google Cloud TPUs, with tens of billions in spend, to scale Claude training/inference. Multiple tweets cite the 1M TPU figure, >1 GW capacity online in 2026, and current enterprise traction.
Anthropic locks up ~1M Google TPUs and >1 GW for 2026 in a deal worth tens of billions
Anthropic and Google confirmed a massive TPU expansion (approximately one million chips and well over 1 GW of capacity coming online in 2026) to scale Claude training and serving, with spend described as "tens of billions." The company frames the move as price-performance driven on TPUs, timed to accelerating demand. Following up on compute pact, which noted talks and early signals, today's posts quantify capacity and timing and reiterate why TPUs fit Anthropic's cost curve Deal announcement, Anthropic blog post, Press confirmation, Google press page.
For AI leads, the headline is concrete: guaranteed throughput for 2026 (training queues and serving readiness) and a visible hedge against GPU scarcity, without abandoning other stacks.
Anthropic demand picture: 300k+ business customers, large accounts up ~7× YoY, revenue near $7B
Anthropic says it now serves 300,000+ businesses with nearly 7× growth in large accounts over the past year; commentary adds annualized revenue approaching $7B and Claude Code surpassing a $500M run-rate within months, helping justify the TPU scale-up Anthropic blog post, Company summary, Analysis thread.
Implication for buyers: capacity won't just shorten waitlists; it should stabilize SLAs and rate limits as onboarding accelerates.
Ironwood, Google's 7th-gen TPU for high-throughput inference, is central to Anthropic's plan
Google highlights Ironwood (TPU v7) as a serving-first design that lowers cost per token via 256-chip pods and 9,216-chip superpods, matching Anthropic's need to scale inference economically alongside training Google press page. Anthropic's own post ties the expansion to observed TPU price-performance over multiple generations, reinforcing why this capacity lines up for 2026 Anthropic blog post.
For platform teams, this signals practical gains: cheaper steady-state throughput for enterprise traffic, not just big-bang training windows.
Despite the TPU megadeal, Anthropic reiterates a multi-cloud, multi-silicon strategy
Alongside the Google TPU expansion, Anthropic stresses it will continue training and serving across Amazon Trainium and NVIDIA GPUs; Amazon remains a core training partner via Project Rainier, tempering vendor lock-in and letting workloads land where unit economics and latency fit best Anthropic blog post, Analysis thread.
For architects, this means portability pressures remain: plan for heterogeneous kernels, model builds, and orchestration that can shift between TPU, GPU, and ASIC targets as prices and queues move.
🖥️ OpenAI buys Sky: screen-aware Mac actions
OpenAI acquired Software Applications Inc. (Sky), an Apple-veteran team building a Mac, screen-aware natural language interface. Excludes the Anthropic-Google compute pact (covered in Feature). Focus here is OS-level agent UX and M&A signal.
OpenAI buys Sky to add screen-aware Mac actions to ChatGPT
OpenAI acquired Software Applications Inc. (Sky), a Mac overlay agent that understands what's on screen and can take actions through native apps; the team is joining to bring these capabilities into ChatGPT, with terms undisclosed OpenAI blog, and the acquisition confirmed across community posts acquisition post, announcement link.
OpenAI frames the deal as moving from "answers" to helping users get things done on macOS, implying deeper OS-level permissions, context, and action execution beyond web automations OpenAI blog.
Signal in the noise: Sky shows OpenAI's platform push into OS-level agents
Practitioner briefs note OpenAI has been on an acquisitions streak and describe Sky's product as a floating desktop agent that understands the active window and can trigger actions in local apps like Calendar, Messages, Safari, Finder, and Mail, an explicit platform move beyond web-only automation feature explainer. Coupled with OpenAI's own integration plan, this suggests a near-term consolidation of agent UX at the OS layer to win trust, control latencies, and harden permissions around sensitive actions OpenAI blog.
Workflow/Shortcuts alumni behind Sky bring deep macOS automation chops to OpenAI
Sky's founders previously built Workflow (acquired by Apple and turned into Shortcuts), and community posts say the team had a summer release queued before the acquisition, an overlay agent that could read the screen and drive Mac apps, highlighting rare, low-level macOS automation expertise now in OpenAI's stack prelaunch details, product description, community recap. This background reduces integration risk and accelerates building a reliable, permissions-aware OS agent versus purely browser-bound automation.
OpenAI positions Sky as a shift from chat to action, and discloses an Altman-linked passive investment
In its note, OpenAI emphasizes Sky will help "get things done" on macOS, not just respond to prompts, while stating all team members are joining OpenAI to deliver these capabilities at scale OpenAI blog. The post also discloses that a fund associated with Sam Altman held a passive Sky investment and that independent Transaction/Audit Committees approved the deal, a governance detail leaders will track as OS-level agents gain wider powers OpenAI blog.
What a screen-aware Mac agent unlocks for developers and IT
A Sky-style agent can reason over on-screen context and invoke native intents, bridging ambiguous dialog ("what's on my screen?") with deterministic app actions and user approvals. Community summaries cite concrete app domains Sky targeted (Calendar/Messages/Notes/Safari/Finder/Mail) and a desktop overlay UX, signaling new integration surfaces for secure, auditable automations and policy controls on macOS fleets feature explainer, product description.
🎬 Cinematic AI video goes open: LTX-2 arrives
Lightricks' LTX-2 dominates today's gen-media chatter: native 4K up to 50 fps with synchronized audio, 10-15s sequences, and day-0 availability via fal/Replicate; weights to open later this year. Excludes Genie world-model news (separate category).
LTX-2 debuts with native 4K, up to 50 fps, and synced audio; open weights coming later this year
Lightricks' LTX-2 arrives as a cinematic-grade AI video engine: native 4K output, up to 50 fps, synchronized audio/dialog, and ~10-15-second sequences designed for real creative workflows, with API availability today and weights slated to open later this year capability highlights, weights plan. Early hands-on testers are positioning it as a step-change over prior demo-grade models, citing resolution fidelity and motion smoothness aligned to professional pipelines review thread.
fal ships day-0 LTX-2 APIs (Fast/Pro) for text-to-video and image-to-video up to 4K with per-second pricing
fal made LTX-2 available on day one with Fast and Pro endpoints for both text-to-video and image-to-video at 1080p, 1440p, and 4K, supporting synchronized audio and up to 50 fps; usage is metered per-second with published rate tiers on each model page availability brief, Text to video fast, Text to video pro, Image to video fast, Image to video pro.
In practice, this gives teams an immediate path to prototype and scale high-fidelity clips via API without managing custom serving, while preserving a clean upgrade track to Pro for higher-quality runs.
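For engineers who want to kick the tires, here is a minimal sketch of calling one of the fal endpoints from Python with the fal_client package. The endpoint ID, argument names, and response shape are assumptions for illustration; check the fal model pages above for the exact schema, and set FAL_KEY in your environment first.

```python
# Minimal sketch of an LTX-2 text-to-video call on fal (illustrative, not the exact schema).
import fal_client  # pip install fal-client; reads FAL_KEY from the environment

result = fal_client.subscribe(
    "fal-ai/ltx-2/fast/text-to-video",    # hypothetical endpoint ID for the Fast tier
    arguments={
        "prompt": "Slow dolly shot through a rain-soaked neon market at night",
        "resolution": "1080p",            # per the model pages: 1080p, 1440p, or 4K tiers
        "duration": 10,                   # seconds; LTX-2 targets ~10-15s clips
        "generate_audio": True,           # assumed flag for synchronized audio
    },
)

# Response schema is endpoint-specific; a video URL field is the common pattern on fal.
print(result.get("video", {}).get("url"))
```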
Replicate lists LTX-2 Fast and Pro with prompt guidelines and example workflows
Replicate now hosts lightricks/ltx-2-fast and lightricks/ltx-2-pro, complete with prompt-writing guidance, example pipelines, and API playgrounds to speed adoption into existing tooling hosting update, Replicate fast model, Replicate pro model. For AI engineers, this lowers integration friction (one-click deploys, consistent SDKs) while enabling side-by-side Fast/Pro comparisons for cost-quality tuning in production.
Practitioners call LTX-2 a new bar; native 4K motion and texture beat upscaled outputs
Early testers report a clear perceptual gap between LTX-2's native 4K and prior upscaled pipelines, citing sharper textures, steadier motion, and coherent audio that shortens post-production cycles review thread, native vs upscaled. For teams evaluating model swaps, expect fewer artifacts in fast action and dialogue-driven scenes, plus simpler editorial passes when cutting short spots and trailers.
🧭 Agentic browsers: Edge Copilot Mode and fall updates
Microsoft's Edge adds Copilot Mode with Actions for on-page navigation, tab management, and history context. Copilot Sessions "Fall release" teases Mico/Clippy, groups, and health features. Excludes OpenAI Atlas (prior day) to keep today focused on Edge updates.
Edge adds Copilot Mode with Actions, autonomy levels, and opt-in Page Context
Microsoft is turning Edge into an agentic browser: Copilot Mode can navigate pages, execute multi-step Actions (unsubscribe, book, scroll to sections), manage tabs, and draw on browsing history if users enable Page Context. Hands-on reports show three autonomy settings (light, balanced, strict) and a Preview toggle to watch or background-run tasks feature brief, how to enable, deep dive thread.
- Action sequences and tool use are visible, with suggested flows for common chores and guardrails around history access actions samples.
Copilot Sessions Fall update brings Groups, Mico/Clippy, and cross-app memory
At Copilot Sessions, Microsoft previewed a broad Fall update: Groups for up to 32 participants, a long-term memory that spans apps, and a more expressive Mico avatar, with an Easter-egg return of Clippy. Early notes also highlight health Q&A grounded in vetted sources, stronger privacy opt-ins, and a staged U.S. rollout before expanding feature collage, event stream, feature recap.
Following up on Feature lineup that teased 12 areas, today's session put numbers (32-user Groups) and concrete capabilities on the roadmap while reinforcing an "AI agentic browser" framing across Edge and Copilot.
Ship faster in AI Studio: Annotate & vibe coding
Google AI Studio adds Annotate Mode: draw on your running app UI and have Gemini implement changes. Builders showcase "vibe coding" flows with prebuilt components and grounded Search. Strong traction signals (traffic spike) surfaced today.
Google AI Studio adds Annotate Mode for point-and-edit coding
Google AI Studio now lets you draw directly on your app preview and have Gemini implement the change in code, collapsing review-spec-commit loops into a single pass. The update ships inside the Build experience and supports fine-grained tweaks (e.g., animations) without leaving the IDE-like canvas feature brief, announcement, AI Studio build, annotate details.
For teams, this makes UI polish and stakeholder feedback far more executable: non-developers can mark targets in context while engineers keep a clean diff trail. Early users report the feature feels natural in the new AI-assisted flow of "point, narrate intent, ship" feature mention.
Vibe coding in AI Studio: NL intents to runnable apps with Search grounding
Creators showcased "vibe coding" in AI Studio: pick prebuilt components (speech, image analysis), describe the app in natural language, and get runnable code plus a live preview grounded in Google Search. The demo walks through highlight-and-edit cycles, showing Gemini wiring UI changes and data calls end-to-end video demo, YouTube demo.
Beyond prototyping speed, Search grounding adds production-like behavior (fresh results/citations) to early builds, reducing the gap between demo logic and real integrations feature brief.
AI Studio traffic jumps 64% in September, topping ~160M monthly visits
AI Studio's site traffic spiked ~64% in September to ~160M+ visits, its biggest surge since the Gemini 2 cycle, evidence that annotate-and-vibe coding workflows are resonating with builders traffic chart. Following up on traffic surge that highlighted the 160M+ milestone, today's chart underscores momentum rather than a one-off bump, suggesting sustained interest as new Build features roll out.
Agent infra for builders: Vercel Agent, WDK, Marketplace
Vercel Ship AI day brings a cohesive agent stack: Vercel Agent (code review + investigations), Workflow Development Kit ("use workflow" durability), a Marketplace for agents/services, and zero-config backends for AI apps.
Vercel Agent launches in public beta with AI code review and incident investigations
Vercel introduced an AI teammate that performs PR reviews by running simulated builds in a Sandbox and triggers AI-led investigations when telemetry flags anomalies, now available in Public Beta on AI Cloud product blog, and documented in full on the launch post Vercel blog. This slots into a broader Ship AI push aimed at making agentic workflows first-class for app teams.
Workflow Development Kit makes reliability "just code" with durable, resumable steps
Vercel's WDK adds a use workflow primitive that turns async functions into durable workflows that pause, resume, and persist automatically; each use step is isolated, retried on failure, and state-replayed across deploys feature brief, with deeper details in the launch write-up Vercel blog. Early builders immediately pressed for controls like cancellation, idempotency keys, handling code changes, and rollbacks, useful signals for WDK ergonomics and docs to address next dev questions, follow-up questions.
Vercel Marketplace debuts with agent apps and AI infrastructure services, unified billing
Vercel opened a marketplace that ships plug-in "agents" (e.g., CodeRabbit, Corridor, Sourcery) and "services" (Autonoma, Braintrust, Browser Use, Chatbase, Mixedbread and more) behind one install and bill marketplace blog, with partners announcing day-one availability coderabbit launch, mixedbread launch. The intent is to reduce the integration sprawl for teams adopting agentic patterns while keeping observability centralized.
AI SDK 6 (beta) unifies agent abstraction with human-in-the-loop tool approvals and image editing
Vercel's AI SDK 6 beta stabilizes an agent abstraction layer, adds tool-execution approval for human-in-the-loop control, and extends image editing support, positioning the SDK as the default interface across models and providers for agent apps sdk beta image. These capabilities complement Vercel Agent and WDK so teams can define logic once and run it reliably on AI Cloud.
Zero-config backends on Vercel AI Cloud bring framework-defined infra and unified observability
Vercel AI Cloud now provisions and scales backends from your chosen framework with no extra YAML or Docker, adds per-route scaling, and centralizes logs, traces, and metrics so AI apps get a production-grade control plane out of the box backends blog, Vercel blog. For agent builders, this pairs with the AI stack to simplify deploying tool-rich, stateful services without bespoke infra plumbing.
🧩 Enterprise collaboration & context: projects, knowledge, memory
Teams features dominated: OpenAI expands Shared Projects (with per-tier limits) and ships Company Knowledge with connectors/citations; Anthropic rolls out project-scoped Memory to Max/Pro with incognito chats. Excludes OpenAI's Sky M&A (separate).
Company Knowledge arrives for Business, Enterprise, and Edu with GPT-5 search across Slack/SharePoint/Drive/GitHub and citations
ChatGPT can now pull trusted answers from your organization's tools (Slack, SharePoint, Google Drive, GitHub) with a GPT-5-based model that searches across sources and cites where each answer came from, now rolling out to Business, Enterprise, and Edu feature screenshot, OpenAI blog.
New connectors were added alongside the rollout (e.g., Asana, GitLab Issues, ClickUp), and admins can review the Business release notes for setup details and visibility controls business notes, Business release notes. See OpenAI's overview for capabilities and citation behavior OpenAI blog.
OpenAI rolls out Shared Projects to Free, Plus, and Pro with tier caps and project-only memory
OpenAI is expanding Shared Projects to all ChatGPT tiers so teams can work from shared chats, files, and instructions, with project-scoped memory enabled automatically on shared projects feature post, rollout summary.
- Tier limits: Free supports up to 5 files and 5 collaborators, Plus/Go up to 25 files and 10 collaborators, and Pro up to 40 files and 100 collaborators, per OpenAI's notes release summary, OpenAI release notes.
Anthropic ships project-scoped Memory to Max and starts Pro rollout with incognito chats and safety guardrails
Anthropic enabled Memory for Max customers and will roll it out to Pro over the next two weeks; each project keeps its own memory that users can view/edit, with an incognito chat mode that avoids saving, following internal safety testing rollout note, memory page.
Practitioners highlight project-scoped memory as a practical way to prevent cross-pollination between unrelated workstreams user sentiment, with full details and controls in Anthropic's announcement Anthropic memory page.
Document AI momentum: LightOnOCR-1B and tooling
OCR/VLM remained hot: LightOnOCR-1B debuts as a fast, end-to-end, domain-tunable model; vLLM adds OCR model support; applied guides explain deployment and "optical compression" angles. Mostly practical releases and how-tos today.
LightOnOCR-1B debuts: fast, end-to-end OCR with SOTA-class quality; training data release teased
LightOn unveiled LightOnOCR-1B, an end-to-end OCR/VLM that targets state-of-the-art accuracy while running significantly faster than recent releases, and says a curated training dataset will be released soon. The team details design choices (e.g., teacher size, resolution, domain adaptation) and shipped ready-to-run models, including vLLM availability. See the announcement and technical blog for architecture and ablation results release thread, with more notes that the dataset is "coming soon" follow-up note, and the model and collection pages for immediate use Hugging Face blog, Models collection.
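Because the model is already runnable behind vLLM, a quick local test looks roughly like the sketch below: serve the checkpoint with vLLM's OpenAI-compatible server, then send a page image via the standard chat format. The model ID and the exact prompt the model expects are assumptions; confirm both on the Hugging Face collection linked above.

```python
# Sketch: query LightOnOCR-1B served by vLLM's OpenAI-compatible server.
# Launch first (shell):  vllm serve lightonai/LightOnOCR-1B   <- repo name is an assumption
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lightonai/LightOnOCR-1B",  # must match the served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/scanned_page.png"}},
            {"type": "text", "text": "Transcribe this page to markdown."},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```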
Baseten explains DeepSeek-OCR's "optical compression" and ships a 10-minute deploy path
Baseten breaks down why DeepSeek-OCR's image-native pipelines are dramatically cheaper and faster (compressing text visually before decoding) and provides a ready template to stand up inference in under ten minutes. This adds actionable ops guidance following up on vLLM support and library-scale conversions reported yesterday, with concrete throughput/cost angles for production teams blog summary, Baseten blog, and an additional pointer from the team blog pointer.
Hugging Face updates open OCR model comparison with Chandra, OlmOCR-2, Qwen3-VL and averaged scores
Hugging Face refreshed its applied guide and comparison for open OCR/VLMs, adding Chandra, OlmOCR-2, and Qwen3-VL plus an averaged OlmOCR score, giving practitioners clearer trade-offs on accuracy, latency, and deployment patterns. The post complements recent LightOnOCR and DeepSeek work by focusing on practical pipelines and costs blog update, with the full write-up here Hugging Face blog.
vLLM flags surge of small, fast OCR models landing for production serving
vLLM highlighted that compact OCR models are "taking off" on the platform, underscoring practical, high-throughput serving for document AI workloads. This aligns with LightOnOCR-1B's immediate vLLM availability and broader momentum toward efficient OCR/VLM deployment vLLM comment, model availability.
Hugging Face promotes few-click deployment for the latest OCR models
Hugging Face highlighted that current OCR models can be deployed in a few clicks on its platform, lowering the bar for teams to productionize document AI without bespoke infra. This dovetails with the updated model comparison to help practitioners choose and ship quickly deployment note.
🧠 Research: agent routing, proactive problem-solving, trace fidelity
New papers target where agents fail: response-aware routing (Lookahead), distributed self-routing (DiSRouter), proactive E2E eval (PROBE), and instruction-following inside reasoning traces (ReasonIF).
ReasonIF finds frontier LRMs violate reasoning-time instructions >75% of the time; finetuning helps modestly
Together AI's ReasonIF benchmark shows models like GPT-OSS-120B, Qwen3-235B, and DeepSeek-R1 ignore step-level directives (formatting, length, multilingual constraints) in >75% of reasoning traces; multi-turn prompting and a lightweight finetune improve scores but don't fully fix process-level compliance paper overview.
Code, paper, and blog are available for replication and training recipes GitHub repo, project blog.
Lookahead routing predicts model outputs to choose the best LLM, averaging +7.7% over SOTA routers
A new routing framework, "Lookahead," forecasts latent response representations for each candidate model before routing, yielding a 7.7% average lift across seven benchmarks and working with both causal and masked LMs paper thread, with details in the preprint ArXiv paper.
It improves especially on open-ended tasks by making response-aware decisions instead of input-only classification, and reaches full performance with ~16% of training data, cutting router data needs.
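The core idea, scoring candidates on what they would likely say rather than on the query alone, is easy to illustrate. The toy sketch below uses random stand-ins for the response predictor and scorer (the paper trains these), so it only shows the routing shape, not the method itself.

```python
# Toy illustration of response-aware routing in the spirit of Lookahead:
# predict a latent "response" vector per candidate model, then score (query, prediction).
# Predictors and the scorer are random stand-ins, not the paper's trained components.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64
CANDIDATES = ["small-fast-lm", "mid-lm", "large-expensive-lm"]

def embed_query(text: str) -> np.ndarray:
    # Stand-in for a real text encoder: deterministic pseudo-embedding of the query.
    return np.random.default_rng(abs(hash(text)) % (2**32)).standard_normal(DIM)

# One predictor per candidate: query embedding -> predicted response representation.
predictors = {name: rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for name in CANDIDATES}
# Shared scorer: concat(query, predicted response) -> scalar quality estimate.
scorer = rng.standard_normal(2 * DIM) / np.sqrt(2 * DIM)

def route(query: str) -> str:
    q = embed_query(query)
    scores = {
        name: float(scorer @ np.concatenate([q, np.tanh(w @ q)]))
        for name, w in predictors.items()
    }
    return max(scores, key=scores.get)   # pick the model whose forecasted response scores best

print(route("Summarize this 40-page contract and flag unusual indemnity clauses."))
```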
PROBE benchmark shows proactive agent limits: only ~40% end-to-end success on real-work scenarios
PROBE (Proactive Resolution of Bottlenecks) tests agents on three steps (search, identify the root blocker, then execute a precise action) over long, noisy corpora (emails, docs, calendars); top models reach ~40% end-to-end success, with frequent failures on root-cause ID and parameterizing the final action paper abstract.
Chained tool frameworks underperform when retrieval misses key evidence, underscoring that proactive help hinges on evidence selection and exact action specification.
DiSRouter: Distributed self-routing across small and large LLMs with sub-5% overhead
DiSRouter removes the central router and lets each model decide to answer or say "I don't know" and forward upstream, chaining small to large models for better utility at low cost; authors report <5% routing overhead and robustness when the model pool changes paper abstract.
By training models to self-reject via SFT and RL, the system avoids brittle global routers that must be retrained whenever the pool updates.
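Mechanically the cascade is simple, as the toy sketch below shows: try models from cheapest to largest, and let each one either answer or defer. The model calls and the abstention convention here are placeholders; in the paper the self-rejection behavior is trained in via SFT and RL rather than hand-coded.

```python
# Toy sketch of DiSRouter-style self-routing: each model answers or abstains and defers upstream.
from typing import Callable, List, Tuple

ABSTAIN = "I don't know"

def cascade(query: str, models: List[Tuple[str, Callable[[str], str]]]) -> Tuple[str, str]:
    """Try models cheapest-first; the last (largest) model answers unconditionally."""
    *smaller, (last_name, last_model) = models
    for name, call_model in smaller:
        answer = call_model(query)
        if answer.strip() != ABSTAIN:
            return name, answer          # a cheaper model was confident enough
    return last_name, last_model(query)  # everyone deferred; pay for the big model

# Illustrative stand-ins for a small and a large model.
def small_lm(q: str) -> str:
    return "Paris" if "capital of France" in q else ABSTAIN

def large_lm(q: str) -> str:
    return f"(large-model answer to: {q})"

print(cascade("What is the capital of France?", [("small", small_lm), ("large", large_lm)]))
print(cascade("Derive the KL term in the ELBO.", [("small", small_lm), ("large", large_lm)]))
```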
SmartSwitch curbs "underthinking" by blocking premature strategy switches; QwQ-32B hits 100% on AMC23
SmartSwitch monitors generation for switch cues (e.g., "alternatively"), scores the current thought with a small model, and, if it is still promising, rolls back to deepen that path before allowing a switch; across math tasks it raises accuracy while cutting tokens/time, with QwQ-32B reaching 100% on AMC23 paper abstract.
Unlike "be thorough" prompts or fixed penalties, the selective intervention preserves agility while enforcing depth where it matters.
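A rough feel for the mechanism, with the generator and judge mocked out (the paper intervenes inside decoding with a small scoring model and rollback, not with a post-hoc filter like this):

```python
# Toy sketch of SmartSwitch-style intervention: watch a reasoning stream for switch cues,
# score the current line of thought with a (mocked) judge, and suppress premature switches.
SWITCH_CUES = ("alternatively", "another approach", "instead, let's try")

def judge_promise(thought: str) -> float:
    # Stand-in for the small scoring model; here longer partial derivations score higher.
    return min(len(thought) / 200.0, 1.0)

def smartswitch_filter(chunks, threshold=0.5):
    """Yield reasoning chunks, replacing premature switches with a 'go deeper' nudge."""
    current_thought = ""
    for chunk in chunks:
        if any(chunk.lower().startswith(cue) for cue in SWITCH_CUES):
            if judge_promise(current_thought) >= threshold:
                yield "Let me push this line of reasoning further before switching."
                continue                      # drop the switch; keep deepening the current path
            current_thought = ""              # switch allowed; start tracking the new path
        current_thought += " " + chunk
        yield chunk

stream = [
    "Set x = 2k + 1 and substitute into the left-hand side.",
    "Expanding gives 4k^2 + 4k + 1, which matches the target form exactly.",
    "Alternatively, maybe induction would be cleaner.",
    "So the identity holds for all odd x.",
]
for out in smartswitch_filter(stream):
    print(out)
```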
Ensembling multiple LLMs via "consortium voting" reduces hallucinations and boosts uncertainty signals
A study ensembles diverse LLMs and groups semantically equivalent answers to take a majority vote, introducing "consortium entropy" as an uncertainty score; this black-box setup often outperforms single-model self-consistency while costing less than many-sample decoding paper abstract.
The result doubles as a triage signal, flagging low-confidence cases to humans, which is useful for production gateways where retraining isn't feasible. Following up on self-consistency, which offered error guarantees for majority vote, this extends the idea across heterogeneous models rather than multiple samples of one.
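The vote and the entropy signal are cheap to compute once you have answers from each model. A minimal sketch, with exact-string matching standing in for the paper's semantic grouping:

```python
# Toy consortium voting: cluster answers from several models, majority-vote,
# and use the entropy of the cluster distribution as an abstain/escalation signal.
import math
from collections import Counter

def consortium_vote(answers):
    clusters = Counter(a.strip().lower() for a in answers)  # crude proxy for semantic grouping
    total = sum(clusters.values())
    probs = [count / total for count in clusters.values()]
    entropy = -sum(p * math.log(p) for p in probs)          # 0.0 when every model agrees
    winner, _ = clusters.most_common(1)[0]
    return winner, entropy

answer, uncertainty = consortium_vote(["Paris", "paris", "Paris", "Lyon"])
print(answer, round(uncertainty, 3))   # route to a human reviewer when uncertainty is high
```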
Letta Evals snapshots agents for stateful, reproducible evaluation via "Agent File" (.af) checkpoints
Letta introduced an evaluation method that checkpoints full agent state and environment into an Agent File (.af) so teams can replay and compare agent behavior holistically (not just prompts) over long-lived, learning agents product note.
This targets a growing gap in agent testing where memory and environment drift make traditional single-turn or stateless evals misleading for production readiness.
🧪 Serving quality: provider exactness and open-model stabilization
Production notes on improving open models in agents: Cline's GLM-4.6 prompt slimming and provider filtering (:exacto) lift tool-call reliability; OpenRouter confirms :exacto gains; Baseten adds fast GLM-4.6 hosting.
Cline stabilizes GLM-4.6 agents with a 57% prompt cut and :exacto provider routing
Cline reports a production hardening of open models by shrinking GLM-4.6's system prompt from 56,499 to 24,111 characters (-57%), which sped responses, lowered cost, and reduced tool-call failures; they also now auto-select OpenRouter's ":exacto" endpoints to avoid silently degraded hosts that broke tool calls. See details and the before/after instruction tuning in Cline blog, a side-by-side run where glm-4.6:exacto succeeds while a standard endpoint fails by emitting calls in thinking tags in provider demo, and OpenRouter's confirmation that Cline's quality jump came from :exacto in OpenRouter note.
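If you want the same behavior outside Cline, pinning the :exacto variant is a one-line change in any OpenAI-compatible OpenRouter client. A sketch follows; the exact model slug is an assumption, so confirm it on the OpenRouter model page.

```python
# Sketch: request the ":exacto" provider filter through OpenRouter's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="<OPENROUTER_API_KEY>")

response = client.chat.completions.create(
    model="z-ai/glm-4.6:exacto",   # assumed slug; ":exacto" restricts routing to vetted hosts
    messages=[{"role": "user", "content": "List the files under src/ and summarize each."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "list_files",
            "description": "List files under a directory",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
)
print(response.choices[0].message.tool_calls)  # expect structured tool calls, not calls inside thinking tags
```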
SGLang Model Gateway v0.2 adds cache-aware multi-model routing and production-grade reliability
LMSYS rebuilt SGL-Router into the SGLang Model Gateway: a Rust gRPC, OpenAI-compatible front door that runs fleets of models under one gateway with policy-based routing, prefill/decode disaggregation, cached tokenization, retries, circuit breakers, rate limiting, and Prometheus metrics/tracing. It targets agent backends where endpoint quality varies and failover, observability, and tool/MCP integration are mandatory gateway release, with a feature list of reliability/observability upgrades for production workloads reliability brief.
Baseten lights up GLM-4.6 hosting with a usage-billed API and a fastest-third-party claim
Baseten announced GLM-4.6 availability via its managed inference with API pricing for teams that prefer usage billing, and reiterated it's the fastest third-party host for this model per recent bake-offs. For teams standardizing on open models across providers, this adds a turnkey endpoint option alongside self-hosted stacks hosting note.
Factory CLI's mixed-model plan-execute keeps 93% quality at lower cost
Factory advocates splitting agent work across models: use a strong, pricier model (e.g., Sonnet) to plan and a cheaper open model (e.g., GLM) to execute, claiming you keep ~93% of performance while "only paying premium for thinking." This is a practical pattern for taming provider variance and stabilizing tool calls without locking into a single endpoint claims thread, with broader mixed-model support landing in the Factory CLI mixed models note.
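The pattern is easy to reproduce with any two OpenAI-compatible endpoints. A minimal sketch; the model slugs and the OpenRouter base URL are assumptions, and a real agent would add tool use and verification between steps.

```python
# Sketch of the plan/execute split: a premium model writes the plan, a cheaper model executes it.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="<OPENROUTER_API_KEY>")

PLANNER = "anthropic/claude-sonnet-4.5"   # assumed slug for the pricier "thinking" model
EXECUTOR = "z-ai/glm-4.6"                 # assumed slug for the cheaper executor

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

task = "Add retry-with-backoff to the HTTP client in utils/http.py and update its tests."
plan = ask(PLANNER, f"Write a short numbered plan for: {task}")

for step in [line for line in plan.splitlines() if line.strip()]:
    print(ask(EXECUTOR, f"Carry out this step and show the code changes:\n{step}"))
```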
🛡️ Trust & uptime: data deletion policy and outage recap
Operational signals: OpenAI confirms return to 30-day deletion for ChatGPT/API after litigation hold ended; separate brief outage caused "Too many concurrent requests" with status updates to recovery.
OpenAI reinstates 30-day deletion for ChatGPT and API after litigation hold ends
OpenAI says deleted and temporary ChatGPT chats will again be auto-deleted within 30 days, and API data will also be deleted after 30 days, following the end of a litigation hold on September 26, 2025 policy screenshot.
Teams should verify retention assumptions in privacy notices, DSR workflows, and logging/backup pipelines; OpenAI notes it will keep a tightly access-controlled slice of historical user data from April-September 2025 for legal/security reasons only policy screenshot. Community commentary stresses this mirrors prior standard practice and that the earlier hold stemmed from external litigation constraints, not a product policy change context thread.
ChatGPT outage triggers "Too many concurrent requests"; status page shows same-day recovery
ChatGPT briefly returned "Too many concurrent requests" errors; OpenAI's status page tracked investigation, mitigation, and full recovery within the same afternoon error screenshot, OpenAI status.
According to the incident log, errors began mid-afternoon, a mitigation was applied within about an hour, and all impacted services recovered shortly thereafter OpenAI status. Users and third-party monitors reported elevated error rates during the window, aligning with OpenAI's outage acknowledgment and remediation updates outage report.
🕹️ World models in the browser: Genie 3 experiment
Google's Genie 3 public experiment appears imminent: UI for sketch-your-world and character prompts surfaces, with reporting that users will generate and explore interactive worlds. Separate from the LTX-2 video engine.
Genie 3 public experiment UI surfaces; "create world" flow suggests launch soon
Google's Genie 3 appears ready for a public browser experiment: a "Create world" interface with Environment and Character prompts, plus a First-person toggle, has been spotted alongside reports that users will generate and then explore interactive worlds. Multiple screenshots and write-ups point to an imminent rollout rather than a lab-only demo documented scoop, and community observers are now calling the release all but confirmed confirmation post.
The new UI invites text descriptions of the world and avatar and hints at sketch-to-world creation, aligning with Google's earlier "world model" framing. For analysts and engineers, this signals hands-on data about user-steered simulation, control inputs, and first-person interaction loops, which are key to agent training and evaluation in browser-safe sandboxes ui preview. Full details and artifact references are compiled in TestingCatalog's coverage TestingCatalog article, with additional UI capture corroborating the same flow ui screenshot.
Agent evals & observability: multi-turn and automated insights
Evals tooling advanced: LangSmith adds an Insights Agent and multi-turn evals for goal completion; Letta ships stateful agent evals using Agent File snapshots to replicate full state and environment. Practical, production-oriented.
LangSmith adds Insights Agent and multi-turn conversation evals
LangChain rolled out two eval features in LangSmith: an Insights Agent that automatically categorizes agent behavior patterns, and Multi-turn Evals that score entire conversations against user goals rather than single turns feature brief. This closes a common gap in production agent QA by shifting from turn-level rubric checks to trajectory-level success measurement across tasks like planning, tool use, and error recovery.
ReasonIF finds LRMs ignore reasoning-time instructions >75% of the time
Together AI's ReasonIF study shows frontier large reasoning models often fail to follow instructions during the chain-of-thought itself (over 75% non-compliance across multilingual reasoning, formatting, and length control) even when they can solve the underlying tasks paper summary. The authors release a benchmark plus code and data; simple interventions like multi-turn prompting and instruction-aware finetuning partially improve adherence resources bundle, ArXiv paper, and GitHub repo.
For evaluators, this clarifies why output-only checks miss latent failures: process-level audits and instruction-fidelity metrics belong alongside accuracy.
Letta Evals debuts stateful agent testing via Agent File snapshots
Letta introduced an eval suite purpose-built for long-lived agents, snapshotting full agent state and environment into an Agent File (.af) so tests can deterministically replay behavior, compare changes, and evaluate upgrades apples-to-apples product note, launch claim. Teams can evaluate an entire agent (not just prompts) and even target existing agents as eval fixtures, addressing the core challenge of drift in memoryful, tool-rich systems.
New PROBE benchmark stresses proactive agents; top models ~40% end-to-end
A new dataset, PROBE (Proactive Resolution of Bottlenecks), evaluates agent workflows that must search long noisy corpora, identify a single true blocker, and execute one precise action with parameters. Leading models manage roughly 40% end-to-end success, with most failures in root-cause identification and incomplete action arguments paper thread.
This style of eval mirrors real knowledge work: find the right evidence, disambiguate ownership/deadlines, and act once. That makes it useful for assessing enterprise agent readiness beyond chat quality.
Multi-model "consortium voting" cuts hallucinations and adds calibrated uncertainty
A paper from Cambridge Consultants and collaborators proposes teaming multiple LLMs, grouping semantically equivalent answers, and majority-voting to both reduce hallucinations and expose confidence via consortium entropy, often beating single-model self-consistency at lower cost paper details. In context of certified majority-vote methods with error guarantees reported yesterday error guarantees, this offers a pragmatic, black-box route to production risk flags without retraining.
The approach also provides a cheap abstain signal for eval pipelines: throttle or escalate when answer clusters disperse.





