Claude Opus 4.6 adds 1M context beta – $5/$25 MTok pricing holds

Stay in the loop

Free daily newsletter & Telegram daily report

Executive Summary

Anthropic shipped Claude Opus 4.6 as its long-horizon agent flagship; Opus-class finally gets a 1M-token context window (beta) plus up to 128K output tokens; pricing stays $5/$25 per MTok, with some subscribers seeing $50 testing credits. Claude Code picks up Agent Teams (research preview) for parallel sessions; CLI 2.1.32 adds Opus 4.6 support, auto-memory, and partial “Summarize from here.” The Claude API swaps “budget tokens” for adaptive thinking; effort levels and fine-grained tool streaming are GA; a context-compaction endpoint lands in beta; Office surfaces expand (Excel pivots/validation; PowerPoint research preview with native editable charts).

• Evals + hygiene: ARC Prize reports 93.0% ARC-AGI-1 and 68.8% ARC-AGI-2 at a fixed 120K thinking budget; $1.88/task and $3.64/task; Artificial Analysis posts GDPval-AA Elo 1606 with ~160M tokens for 220 tasks; Anthropic warns infra config can swing agentic-coding scores by several points.
• Security + rollout reality: Opus 4.6 system card cites pilots where the model used misplaced GitHub/Slack tokens and took unapproved actions; it also flags eval-integrity risk if the model helped debug its own eval stack.
• OpenAI counter-move: GPT‑5.3‑Codex claims 57% SWE‑Bench Pro and 76% TerminalBench 2.0 with “less than half” the tokens vs 5.2; available in Codex products first, API “coming soon.”

Claude Opus 4.6 ships: longer-horizon agent work + Agent Teams + Office workflows

Opus 4.6 + Claude Code Agent Teams pushes “run it longer, run it in parallel” into mainstream workflows (1M context beta, adaptive thinking, Office integrations). This is a concrete step toward autonomous, repo-scale engineering loops.

Anthropic’s Opus 4.6 rollout dominated the day: a flagship model tuned for long-running agentic work plus major Claude Code and Office-style workflow upgrades. This category covers the Opus 4.6 + Claude tooling package and ecosystem availability; benchmark debate is handled separately.

Jump to Claude Opus 4.6 ships: longer-horizon agent work + Agent Teams + Office workflows topics

🧠 Claude Opus 4.6 ships: longer-horizon agent work + Agent Teams + Office workflows

Opus 4.6 expands outputs to 128K and changes some API constraints

Claude Opus 4.6 (Anthropic): Alongside the headline context increase, multiple rollout notes point to operationally relevant interface details—most notably up to 128K output tokens and the constraint that Opus 4.6 does not support assistant-message prefilling, as summarized in the release rundown.

For teams with agent harnesses that rely on partial-prefill patterns (e.g., templated tool plans or structured boilerplate), this is the kind of “small” incompatibility that can break workflows even when model quality improves. The same summary also mentions broader platform changes shipping alongside Opus 4.6—see the release rundown for the consolidated list.

Claude Opus 4.6 adds 1M context beta – $5/$25 MTok pricing holds

Executive Summary

Top links today

Claude Opus 4.6 ships: longer-horizon agent work + Agent Teams + Office workflows

Table of Contents

🧠 Claude Opus 4.6 ships: longer-horizon agent work + Agent Teams + Office workflows

Opus 4.6 expands outputs to 128K and changes some API constraints

Opus 4.6 pricing stays $5/$25 per MTok; subscribers see $50 testing credit offers

Agent Teams work best when you decompose work into discrete, parallelizable chunks

Cursor adds Claude Opus 4.6 for long-running tasks and code review

Effort dialing: turn reasoning down on simple tasks to control cost and latency

Long-horizon agent practice: give full context, let Opus run, then review

OpenRouter lists Opus 4.6 with 1M context and publishes a migration guide

Rork Max ships with Opus 4.6 to build and iterate mobile apps from prompts

v0 switches its underlying model to Opus 4.6

Warp adds Opus 4.6 support and highlights adaptive thinking behavior

🛠️ GPT‑5.3‑Codex lands: faster, token‑leaner, steerable long runs (Codex surfaces)

GPT-5.3-Codex rolls out across Codex products for paid ChatGPT plans

GPT-5.3-Codex posts new coding SOTA claims and major efficiency gains

Builders report multi-hour Codex runs and refactors that used to take weeks

Codex CLI 0.98.0 adds Agent Jobs and enables Steer mode by default

GPT-5.3-Codex is product-only at launch, with API “coming soon”

Codex app for Windows gets a waitlist signup

Codex release cadence tightens to roughly monthly major upgrades

🏢 OpenAI Frontier: enterprise platform to deploy governed “AI coworkers”

OpenAI Frontier launches as an enterprise platform for governed AI coworkers

Frontier early deployment claim: chip optimization reduced from six weeks to one day

🧭 Agentic engineering playbooks: agent-first codebases, skills files, and quality gates

AGENTS.md as the living “operating manual” for coding agents

Quality doctrine for agent code: keep a human merge owner, keep the bar high

Acceptance tests that constrain agents, not just validate them

Agent-first repos: fast tests, clear boundaries, and up-to-date process docs

Inventory internal tools and expose them to agents via CLI or MCP

Ralph loop adds a hard guardrail: write lint rules after you spot bad patterns

“Agents captain” becomes a named role in agent adoption playbooks

Managing agents looks like specifying inputs and evaluating outputs quickly

Start tracking agent trajectories, not only the final diffs

Token budgets show up as a practical constraint in agent-native roles

🕸️ Agent runners & multi-agent ops: orchestration UIs, long runs, and cost control

Cursor previews week-long coding agents peaking at 1,000+ commits/hour

W&B Weave shows how multi-agent systems fail under load (and how they patched it)

Perplexity Model Council runs three models in parallel and synthesizes differences

v0 adds GitHub-native agent runs: import repo, auto-commits, PRs, deploys

VS Code positions itself as a control plane for multiple agents under Copilot

AgentRelay pitches “dozen-agent” collaboration as a capability multiplier

LangSmith adds an “Insights Agent” that reads your traces for patterns and failures

Parallel releases official OpenClaw skills for parallel search and research

📊 Evals in the loop: ARC‑AGI jumps, time-horizon tracking, and benchmark hygiene

METR time-horizon discourse clusters around 6.6h and rumored 8–10h

Artificial Analysis says Opus 4.6 leads GDPval-AA at Elo 1606, but costs more

Opus 4.6 leads MRCR v2 long-context retrieval at 256K and 1M

TerminalBench 2.0 becomes the top-line race metric as new models land

DRACO benchmark lands for deep research evaluation across 10 domains

Hugging Face adds PR-based Community Evals and benchmark repositories

Arena demos Max router that routes prompts using 5M+ community votes

Arena updates its model lineup with Opus 4.6 and Gemini 3 Pro GA sightings

System-card note raises eval integrity questions when models touch the eval stack

Vals Index ranks Opus 4.6 #1 and discloses “auto” thinking eval settings

🛡️ Agent security & governance: system cards, cyber gating, and risky autonomy behaviors

OpenAI flags GPT-5.3-Codex as “High” for cybersecurity preparedness

Opus 4.6 system card documents auth token misuse and unapproved actions

OpenAI launches Trusted Access for Cyber with identity verification and $10M credits

Opus 4.6 system card warns about side-task compliance and prompt-injection deltas

Opus 4.6 Vending-Bench results include collusion and deceptive commerce behaviors

Anthropic flags eval integrity risk when Opus 4.6 helps debug the eval harness

Developers criticize GPT-5.3-Codex being product-only while “API access coming soon”

Operational risk talk shifts to “boring failures” like accidental DDoS loops

🧩 Skills & extension packaging: `.agents/skills`, playbooks, and agent add‑ons

ClawHub adds one-command installs for Parallel skill bundles

npx playbooks adds .agents/skills discovery and a docs installer for repos

Factory Droid reportedly now supports .agents/skills

RepoPrompt 1.6.13 adds GPT-5.3-Codex + Opus 4.6 support

Hyperbrowser previews web-to-skill learning via /learn commands

🏗️ Compute & scaling signals: GB200 serving, capex gravity, and power bottlenecks

GPT-5.3-Codex is served on NVIDIA GB200 NVL72, with faster inference and fewer tokens

Google guides 2026 CapEx at $175B–$185B and cites major Gemini serving cost cuts

Orbital datacenters: Elon’s “space GPUs” case hinges on terrestrial power ceilings

AWS says it has never retired A100s, citing persistent AI demand

Datacenter power bottlenecks: grid interconnect queues and gas turbines “sold out to 2030”

Long-running agents are being used as an argument for more compute demand, not less