Gemini 3.1 Pro hits 77.1% ARC-AGI-2 – $2/$12 MTok pricing holds
Executive Summary
Google/DeepMind shipped Gemini 3.1 Pro as a global rollout across Gemini app, AI Studio (preview API), Vertex AI, and NotebookLM tiers; launch framing is reasoning-first with a claimed 77.1% on ARC-AGI-2 and broad distribution via third-party pickers (Perplexity, OpenRouter). AI Studio cards show pricing unchanged at $2/$12 per 1M input/output tokens up to 200k context (then $4/$18), plus a Jan 2025 knowledge cutoff; Artificial Analysis ranks the preview #1 on its Intelligence Index at 57, saying the harness run consumed ~57M tokens and cost $892, with a reported hallucination-rate drop ~88%→50% on AA-Omniscience, but these are harness-dependent and not yet corroborated by independent end-to-end task suites.
• Builder UX/capacity: early testers report “Canvas” gating for runnable code in the web UI; launch-day latency shows up as ~5-minute generations plus capacity errors.
• Claude Code 2.1.49: adds --worktree isolation and background-agent kill controls; prompt footprint reportedly down 15,248 tokens.
• OpenAI surfaces: ChatGPT Code Blocks adds in-chat editing + live previews; separate UI leaks hint at “Codex Security” malware-analysis workflows, details still thin.
Top links today
- Gemini 3.1 Pro launch blog post
- Google AI Studio for Gemini API
- Gemini 3.1 Pro Preview on OpenRouter
- Claude in PowerPoint announcement
- ChatGPT interactive Code Blocks update
- Artificial Analysis model page for Gemini 3.1 Pro
- OpenClaw GitHub repository
- pi_agent_rust agent harness repository
- Cursor agent sandboxing technical writeup
- Next.js version-matched docs for agents
- Gemini 3.1 Pro Preview on Replicate
- State of Generative Media report by fal
- Jina embeddings v5 text announcement
- Team of Thoughts multi-agent paper
- Vercel AI Gateway video generation docs
Feature Spotlight
Gemini 3.1 Pro ships: big reasoning jump + rollout across developer surfaces
Gemini 3.1 Pro’s jump in reasoning + agentic coding (e.g., 77.1% ARC-AGI-2) and rapid rollout across AI Studio/Vertex/app surfaces reshape default-model choices for builders, especially at $2/$12 per MTok pricing.
🧠 Gemini 3.1 Pro ships: big reasoning jump + rollout across developer surfaces
High-volume story today: Google/DeepMind shipped Gemini 3.1 Pro/Preview and pushed it across AI Studio, Vertex, Gemini app/NotebookLM, and multiple third-party gateways—plus lots of early “does it actually code” smoke tests. Excludes non-Gemini tool updates which are covered elsewhere.
Gemini 3.1 Pro ships and starts rolling out across app + API surfaces
Gemini 3.1 Pro (Google DeepMind/Google): Following up on launch watch—imminent-drop chatter—Google says Gemini 3.1 Pro is now rolling out globally in the Gemini app with higher limits on Google AI Pro/Ultra plans, with developer access via Gemini API in AI Studio (preview) and enterprise availability via Vertex AI, per the rollout note in rollout post and the launch blog post.

The release framing is reasoning-first, anchored by 77.1% on ARC-AGI-2 (claimed as >2× Gemini 3 Pro) in the DeepMind thread.
• Access points: Gemini app model picker now exposes a dedicated “Pro” option for 3.1 Pro, as shown in model picker UI; the model is also showing up across third-party surfaces like Perplexity Pro/Max, per the Perplexity model picker, and in OpenRouter listings for gemini-3.1-pro-preview, per the OpenRouter model page screenshot.
• API specs: AI Studio’s model card indicates the same pricing tiers as Gemini 3 Pro—$2/$12 per 1M input/output tokens (≤200k context) and $4/$18 (>200k)—and a Jan 2025 knowledge cutoff, as shown in pricing screenshot and AI Studio model card; a minimal API call against the preview is sketched below.
Launch messaging emphasizes “preview” / validation via real usage; early reports of capacity constraints show up elsewhere in today’s feed (see separate topic).
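For builders who want to smoke-test the preview themselves, here is a minimal, hedged sketch of a Gemini API call using the google-genai Python SDK. The model string gemini-3.1-pro-preview matches the OpenRouter slug in today’s links, but the exact AI Studio model ID may differ, so treat it as an assumption.

```python
# Minimal sketch: one generate_content call against the preview model.
# Assumes the google-genai SDK (pip install google-genai) and an API key in
# GEMINI_API_KEY; "gemini-3.1-pro-preview" is the slug from today's
# OpenRouter listing and may not be the final AI Studio model ID.
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="In three bullets, explain why ARC-AGI-2 is hard for LLMs.",
)
print(response.text)
```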
Artificial Analysis ranks Gemini 3.1 Pro Preview #1 on its Intelligence Index at lower run cost
Gemini 3.1 Pro Preview (Artificial Analysis): Artificial Analysis says Gemini 3.1 Pro Preview tops its Intelligence Index at 57, 4 points ahead of Claude Opus 4.6 (max), while keeping pricing at $2/$12 per 1M input/output tokens (≤200k), as summarized in index breakdown and reiterated in cost comparison.
Their writeup emphasizes token/cost efficiency: Gemini 3.1 Pro Preview used ~57M tokens to run the index and cost $892 vs $2,486 (Opus 4.6 max) and $2,304 (GPT-5.2 xhigh), per the cost details in index breakdown and cost comparison.
• Hallucination reduction signal: multiple posts cite a major drop in AA-Omniscience hallucination rate (reported ~88% → 50%), framed as “competitive with Claude,” as shown in hallucination chart and discussed again in index breakdown.
• Benchmark mix caveat: builders are already pointing out that “agentic real-world task” rankings don’t fully match the index ordering (e.g., GDPval-AA), so treat leaderboard deltas as harness-dependent, as suggested by the commentary in index breakdown and critiques elsewhere today.
Net: today’s most repeated engineering-relevant claim is less about raw scores than that Gemini 3.1 Pro’s improvements arrived without a pricing jump and with fewer tokens consumed at high reasoning settings.
Early Gemini 3.1 Pro usage: Canvas is the switch for runnable code, and launch-day latency is real
Gemini 3.1 Pro (Builder experience): Early hands-on usage reports cluster around (a) a UI toggle that changes whether Gemini will actually build/run code, and (b) launch-day throughput issues (multi-minute generations, capacity errors) despite strong outputs when it does run.

• Canvas requirement for runnable artifacts: Ethan Mollick reports that on Gemini web UI you need to select the “Canvas” tool for code execution and interactive builds—otherwise it “doesn’t even want to write code,” per Canvas tip.
• Launch-day latency/capacity: Simon Willison reports an “excellent” animated SVG result but it took ~5 minutes and came with capacity errors, as described in SVG latency note and expanded in his writeup.
• Smoke tests builders are using: posts repeatedly probe Gemini 3.1 Pro with “one-shot” UI/code generation tasks (animated SVGs, webOS clones, shader-heavy 3D scenes) as fast capability checks, e.g. an animated ghost-hunter SVG example in SVG demo and a photorealistic ocean shader prompt in ocean sim demo.
The dominant mood is excited but operationally cautious; representative phrasing includes “you have to select the ‘Canvas’ option” in Canvas tip and “took over 5 minutes… capacity” in SVG latency note.
🧰 Claude Code CLI 2.1.49: worktrees, background control, and prompt/system changes
Today’s Anthropic-side tooling news is centered on Claude Code CLI 2.1.49: new isolation mechanics, background-agent controls, and substantial system prompt diffs. Excludes Gemini 3.1 Pro story (feature).
Claude Code 2.1.49 adds git worktree isolation for sessions and subagents
Claude Code CLI 2.1.49 (Anthropic): The CLI adds a new --worktree / -w flag to start Claude inside an isolated git worktree, and extends subagent isolation to support a worktree mode as well, according to the system prompt changes. This is aimed at reducing “agent touched my repo” risk by defaulting work into a throwaway branch/worktree rather than your main checkout.
This also tightens the mental model for multi-agent runs: each worker can operate in a clean, bounded workspace with a normal git boundary instead of ad-hoc temp directories, as described in the system prompt changes.
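For readers who haven’t used git worktrees, here is a minimal sketch of the underlying isolation primitive (plain git driven from Python, not Claude Code internals), showing why an agent’s edits stay off your main checkout.

```python
# Illustrative sketch of the git-worktree primitive the --worktree flag builds
# on (not Claude Code's implementation): the agent gets its own checkout on a
# scratch branch, so nothing it writes touches your main working tree.
import pathlib
import subprocess
import tempfile

repo = pathlib.Path(".")  # assumes this runs inside an existing git repository
scratch = pathlib.Path(tempfile.mkdtemp()) / "agent-worktree"

# Create an isolated checkout on a new branch.
subprocess.run(
    ["git", "worktree", "add", "-b", "agent/scratch", str(scratch)],
    cwd=repo,
    check=True,
)

# ... let the agent edit, build, and test inside `scratch` here ...

# Tear the worktree down when the session ends; the branch keeps any commits.
subprocess.run(
    ["git", "worktree", "remove", "--force", str(scratch)],
    cwd=repo,
    check=True,
)
```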
Claude Code 2.1.49 adds ConfigChange hook for auditing or blocking settings edits
Claude Code CLI 2.1.49 (Anthropic): A new ConfigChange hook event fires when configuration files change during a session, enabling enterprise audit/controls around live settings edits, as described in the CLI changelog.
The same drop also strengthens managed-settings enforcement: disableAllHooks now respects the managed settings hierarchy, so non-managed settings can’t disable managed hooks set by policy, per the CLI changelog.
Claude Code 2.1.49 adds Ctrl+F to kill background agents, fixes Ctrl+C/ESC edge cases
Claude Code CLI 2.1.49 (Anthropic): Background agent control gets a new Ctrl+F keybinding to kill background agents with a two-press confirmation, and the long-standing “Ctrl+C/ESC ignored while idle” behavior is addressed by making a double-press within 3 seconds kill all background agents, per the CLI changelog.
This is a workflow fix, not a model fix. It reduces the cost of spawning background work because you can reliably regain control when a background run goes sideways.
Claude Code 2.1.49 rewrites tool policy and removes some prior injection-handling rules
Claude Code CLI 2.1.49 (Anthropic): Tooling rules in the system prompt were reworked: the “tool permission + injection-handling system rules” are removed, and tool usage policy is rewritten to prefer Task for search, as summarized in the system prompt changes.
One concrete behavior-level change called out is WebFetch guidance: authenticated URLs should require ToolSearch first, per the system prompt changes.
Claude Code 2.1.49 system prompt pushes “professional objectivity” and bans time estimates harder
Claude Code CLI 2.1.49 (Anthropic): The system prompt gets explicit tone policy changes: a new “professional objectivity” directive (“prioritize technical accuracy… disagree when necessary”) and a strengthened “no time estimates” rule, as shown in the system prompt changes.
It also reinforces “CLI voice” constraints (no emojis unless asked, concise answers, and “don’t use tools as communication”), per the same system prompt changes.
Claude Code 2.1.49 cuts prompt tokens by 15,248 and changes prompt composition
Claude Code CLI 2.1.49 (Anthropic): ClaudeCodeLog reports a material prompt footprint reduction—total prompt tokens down 15,248—alongside notable prompt/string churn (+263/-233 files), as detailed in the prompt token diff.
The same breakdown notes a big composition shift (system tokens up; “system-data” removed), plus model-selection guidance narrowed to Opus 4.6, per the prompt token diff.
Claude Code 2.1.49 lets plugins ship default settings via settings.json
Claude Code CLI 2.1.49 (Anthropic): Plugins can now include a settings.json to provide default configuration, which changes how plugin authors can distribute “works out of the box” experiences, as noted in the CLI changelog.
Related quality-of-life: plugin enable/disable now auto-detects scope when --scope isn’t specified, instead of always defaulting to user scope, per the same CLI changelog.
Claude Code 2.1.49 patches long-session WASM memory growth in parsing and layout
Claude Code CLI 2.1.49 (Anthropic): The release calls out multiple long-running-session fixes where WASM memory would grow without bound, including periodically resetting the tree-sitter parser and addressing Yoga WASM linear memory not shrinking, as listed in the CLI changelog.
This is a stability patch that matters most for “leave it running all day” agent workflows. It’s also one of the few changes here that directly targets crash/OOM dynamics rather than UX.
Claude Code 2.1.49 reduces startup overhead with MCP auth caching and batched token counts
Claude Code CLI 2.1.49 (Anthropic): Startup performance is tuned by caching authentication failures for HTTP/SSE MCP servers (avoiding repeated connection attempts) and batching MCP tool token counting into a single API call, as outlined in the CLI changelog.
It also reduces unnecessary API calls during -p non-interactive startup, per the same CLI changelog.
Claude Code 2.1.49 adds “did you mean” path suggestions for file-not-found errors
Claude Code CLI 2.1.49 (Anthropic): A practical reliability tweak: when the model “drops the repo folder” and produces a wrong path, file-not-found errors now suggest corrected paths, as called out in the CLI changelog.
This doesn’t eliminate hallucinated paths. It shortens the recovery loop by turning a dead-end error into a likely-fixable edit/read retry path.
🧩 OpenAI surfaces: interactive Code Blocks, Codex packaging, and product signals
OpenAI-related chatter is mostly about developer UX and product surface area: interactive Code Blocks in ChatGPT, Codex plan/limits discussions, and security/product UI hints. Excludes Gemini 3.1 Pro story (feature).
ChatGPT adds interactive Code Blocks for editing and previewing code in chat
ChatGPT Code Blocks (OpenAI): Code Blocks in ChatGPT are now interactive—users can write, edit, and preview output in-place, including split-screen review and a full-screen editor, as shown in the feature demo and reiterated in the game example. This pushes ChatGPT closer to a lightweight “REPL + renderer” for shipping small UI prototypes and debugging snippets.

• Diagram + mini-app rendering: OpenAI highlights preview support for diagrams (including Mermaid) and “mini apps” directly in chat, as described in the feature demo and the Mermaid preview clip.
• Release note confirmation: The rollout is also reflected in the ChatGPT release notes referenced by release notes link.
Early examples are heavily HTML/visualization-oriented; there’s no detail here yet on sandbox limits or persistence.
OpenAI’s Aardvark appears to rebrand to Codex Security with malware analysis UI
Codex Security (OpenAI): The agentic security researcher previously referred to as “Aardvark” appears to be rebranded as Codex Security, with a new “Malware analysis” section that supports uploading an archive, kicking off analysis, and retrieving structured reports/artifacts from a single timeline, as shown in the product UI leak.
The UI implies a job-based pipeline (status, verdict, SHA256, runtime, artifacts). The tweets don’t include model details or access requirements, but it’s a clear product-surface expansion beyond coding into security workflows.
ChatGPT web code hints at “Citron Mode” and 18+ gating for sensitive shared chats
ChatGPT web app (OpenAI): Strings in the web client reference a citronModeEnabled setting and a “Sensitive content” warning on shared chats noting that viewers “may need to verify they’re 18+ to view this,” as surfaced in the code screenshot and echoed via a sharing UI example in the share warning UI. It’s being interpreted as an “adult mode” gate, but the tweets only show UI/strings—not an actual public toggle.
A practical engineering implication is that ChatGPT share links may gain viewer verification flows or restricted visibility flags (the “citron-only” hint), if these strings ship as-is.
Codex usage complaints prompt hints of a plan change beyond the $200 tier
Codex plans (OpenAI): Codex users are reporting hard usage ceilings on the $20 Plus plan (“~1.5 days of work and then blocks you for 3 days”), with the sharpest critique aimed at the $200/month Pro plan, as described in the Plus throttle complaint.
A response from a Codex team member says they’re “working on fixing this” and that users will get the usage they need “in something that is not the current $200 pro plan,” according to the plan response.
No concrete SKU, rate-limit numbers, or timing are in the tweets. It reads like an impending packaging change rather than an immediate limit lift.
Debate flares: “ChatGPT subscription includes Codex” vs observed throttles
Codex access (OpenAI): A pricing/packaging claim says “you only need your ChatGPT subscription to use Codex” and that even Plus has “very generous usage,” attributing this to gpt-5.3-codex being cheaper for SoTA than competitors, as asserted in the subscription claim.
Another thread pushes back on real-world limits and plan economics; one user reports Plus gets blocked after ~1.5 days of work and argues the $200 Pro plan is poor value relative to Claude’s Max tier, per the Plus throttle complaint.
This is still unresolved from the tweets alone; the statements conflict on actual usable weekly throughput for Plus.
OpenAI’s Atlas browser adds “Auto Organize” for extreme tab clutter
Atlas browser (OpenAI): A new “Auto Organize” feature groups and sorts 100+ open tabs using AI categorization, as shown in the browser demo. It’s positioned as a workflow fix for people who use ChatGPT/agents while drowning in research tabs.

The clip shows one-click consolidation into grouped sets. There’s no detail on whether grouping is purely local, synced, or agent-triggerable, but it’s a concrete browser UX change aligned with “research assistant” usage patterns.
Signals point to a Codex Windows app, alongside working Codex CLI on Windows
Codex on Windows (OpenAI): A Windows screenshot of the Codex CLI shows “OpenAI Codex (v0.104.0)” running with model selection set to “gpt-5.3-codex xhigh,” including a note about “2x rate limits until April 2,” as shown in the Windows CLI screenshot. Separate chatter speculates that a dedicated Windows app may be imminent, per the Windows app speculation.
This is mostly signal, not a release note. But it does indicate that Windows is a first-class target for the Codex “app” surface, not only the CLI.
OpenAI commits $7.5M to fund independent AI safety mitigations research
AI Security Institute Alignment Project (OpenAI): OpenAI announced a $7.5M commitment to the AI Security Institute’s Alignment Project to fund independent research on safety mitigations, as stated in the funding commitment.
No program structure (RFP vs direct grants), timelines, or evaluation criteria are in the tweets provided here. The notable part is that it’s framed as support for independent mitigations research rather than an internal-only effort.
🧪 Agent runtime platforms: isolated compute, sandboxes, and always-on sessions
A set of operational products and infra patterns show up today: isolated cloud compute per agent session, local-agent sandboxing, and parallel environment management. Excludes tracing/observability UI (covered separately).
Airtable launches Hyperagent with per-session isolated compute environments
Hyperagent (Airtable): Airtable announced Hyperagent, positioning it as an “agents platform” where every agent session gets an isolated, full computing environment in the cloud—real browser, code execution, image/video generation, data warehouse access, and “hundreds of integrations,” with one-click deployment into Slack as proactive coworkers, according to the launch post.

• Runtime model: Each session runs in its own cloud environment (explicitly “no Mac Mini required”), which is the core primitive for long-running agents that need deterministic execution and state isolation, as described in the launch post.
• Operational angle: Hyperagent also frames “skill learning” as how teams encode domain methods (e.g., due diligence) so outputs reflect internal process rather than generic templates, per the launch post.
Airtable says they’re onboarding early users now, but details like pricing and resource limits weren’t specified in the tweets.
Cursor ships agent sandboxing across macOS, Linux, and Windows
Agent sandboxing (Cursor): Cursor says it rolled out agent sandboxing across macOS, Linux, and Windows over the last three months—agents can run freely inside a controlled environment and only request approval when stepping outside it, as described in the build post and detailed in the Cursor blog via sandboxing write-up.
• Workflow impact: Cursor reports sandboxed agents “stop 40% less often” than unsandboxed ones (less approval fatigue), per the build post.
• Security posture: The implementation is framed as balancing convenience with limiting blast radius for terminal actions, with OS-specific enforcement under a uniform API, per the sandboxing write-up.
Ramp details its “Inspect” coding agent built on Modal Sandboxes
Modal Sandboxes (Modal + Ramp): Modal published a case study on Ramp’s internal coding agent, Ramp Inspect, claiming it now writes “over half” of merged PRs and runs on Modal Sandboxes, with full dev environments spun up in seconds and “hundreds of sessions in parallel,” per the case study thread and the full write-up in case study.

This is a concrete data point for the “agent runtime platform” category: the differentiator isn’t only model quality, but how quickly you can provision isolated, reproducible environments at scale while keeping execution close to production-like services (DBs, queues, browsers) as described in the case study.
Modal documents Sandbox Snapshots for saving and restoring agent environments
Sandbox Snapshots (Modal): Modal’s sandbox “snapshots” capability was highlighted as a way to capture and restore sandbox state—useful for agents that need persistence or fast resume across runs—via the snapshots docs in snapshots guide.
This is distinct from “run a command in a fresh container”: snapshots imply stateful checkpoints for long-running or iterative agent workflows, including debugging and experiment replay, as described in the snapshots guide.
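A hedged sketch of the save/restore loop this enables, assuming Modal’s documented Sandbox helpers (Sandbox.create, exec, snapshot_filesystem, terminate); verify exact names and arguments against the snapshots guide before relying on it.

```python
# Hedged sketch of checkpoint/resume with Modal Sandboxes (names assumed from
# Modal's docs; confirm against the snapshots guide).
import modal

app = modal.App.lookup("agent-sandbox-demo", create_if_missing=True)

# Do some setup work in a fresh sandbox.
sb = modal.Sandbox.create(app=app)
sb.exec("bash", "-c", "echo ready > /root/state.txt").wait()

# Checkpoint the filesystem so later runs can resume from this state.
snapshot_image = sb.snapshot_filesystem()
sb.terminate()

# Restore: a new sandbox starts from the snapshot instead of a cold environment.
sb2 = modal.Sandbox.create(app=app, image=snapshot_image)
proc = sb2.exec("cat", "/root/state.txt")
print(proc.stdout.read())  # expected: "ready"
sb2.terminate()
```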
📊 Claude in PowerPoint + connectors: LLMs move into office artifacts
Anthropic/Claude product updates today focus on office workflows—PowerPoint generation/editing plus connectors to pull context from enterprise data tools. Excludes Claude Code CLI releases (covered separately).
Claude in PowerPoint expands to Pro and adds connectors for data-backed slides
Claude in PowerPoint (Anthropic): Anthropic is rolling Claude in PowerPoint out to the Pro plan and adding connectors so Claude can pull context from other tools directly into slide creation/editing, as announced in the Launch note and detailed on the Feature page. This is a step toward “agent inside office artifacts” workflows: instead of pasting data into chat, the model can fetch portfolio/market data (or other connected sources) and generate or modify specific slides in place.
• Pro plan availability: The same PowerPoint-native editing loop (generate a deck, then iteratively revise specific slides/objects while keeping layout/style) is now explicitly available to Pro users, per the Launch note and echoed by the Pro plan confirmation.
• Connector-driven slide pipelines: The example UI shows Claude offering to “Read portfolio data” before creating a “biggest movers” slide, with connectors toggled for sources like S&P Global, Moody’s, LSEG, PitchBook (and Daloopa), as shown in the Launch note.
A common reaction framing this as “stop hand-authoring decks” shows up in the User reaction, but there’s no technical detail yet on which connector APIs are supported beyond what’s visible in the in-product panel.
🦞 OpenClaw + agent harness engineering: releases, runtimes, and security hardening
OpenClaw-adjacent engineering shows up as concrete shipping work: new beta releases, watch companion work, and a Rust re-architecture of an agent harness emphasizing capability-gated extensions. Excludes OAuth policy debate (covered in ecosystem/policy).
OpenClaw iOS gateway can wake backgrounded nodes via APNs
OpenClaw (openclaw/openclaw): The v2026.2.19 beta adds iOS gateway logic to wake disconnected iOS nodes via Apple Push Notifications (APNs) before invoking commands, aiming to reduce background-mode failures; this is called out in the beta announcement with implementation notes in the GitHub release notes.
This is a practical reliability improvement for “always-on” agent deployments where the phone app isn’t foregrounded when a job needs to run.
pi_agent_rust ships a capability-gated agent harness with big perf claims
pi_agent_rust (doodlestein): A Rust “version” of Mario Zechner’s Pi Agent is released, aiming to keep extension ecosystem compatibility while rebuilding the runtime substrate for performance and security; the announcement includes claims like 4.95× faster than Node and ~8×–13× lower RSS at million-token session scales in the release thread, with code and details in the GitHub repo.

• Security model shift: Instead of ambient Node/Bun filesystem access, JS extensions run in embedded QuickJS behind typed hostcalls with capability gating, policy profiles, quotas, and audit telemetry, as described in the release thread; a generic sketch of the gating idea follows below.
• OpenClaw relevance: The author notes Pi is used as the core harness inside OpenClaw, so this release is effectively a “drop-in harness substrate” option for OpenClaw-style long-running agent systems, per the release thread.
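As a rough mental model only (illustrative Python, not the Rust/QuickJS implementation): capability gating means an extension never calls the OS directly; every hostcall passes a per-extension policy check and leaves an audit record.

```python
# Illustrative sketch of capability-gated hostcalls (not pi_agent_rust code):
# extensions request named operations, a per-extension policy decides whether
# the host performs them, and every decision is logged for audit.
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Policy:
    allowed: set = field(default_factory=set)    # e.g. {"fs.read"}
    audit: list = field(default_factory=list)    # (extension, capability, allowed)

    def check(self, extension: str, capability: str) -> bool:
        ok = capability in self.allowed
        self.audit.append((extension, capability, ok))
        return ok


def hostcall_read_file(policy: Policy, extension: str, path: str) -> str:
    # Typed hostcall: the extension never touches the filesystem itself.
    if not policy.check(extension, "fs.read"):
        raise PermissionError(f"{extension} lacks fs.read for {path}")
    return Path(path).read_text()


policy = Policy(allowed=set())            # this extension gets no filesystem access
try:
    hostcall_read_file(policy, "web-scraper", "/etc/passwd")
except PermissionError as err:
    print(err)                            # denial is also recorded in policy.audit
```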
OpenClaw v2026.2.19 beta focuses on security hardening and fixes
OpenClaw (openclaw/openclaw): A new beta release (v2026.2.19) is out with a tight scope—“mostly security hardening and fixes,” per the beta announcement, with the concrete change list in the GitHub release notes.
This is a maintenance-style release: fewer new surfaces, more reliability and security baseline work. Some of the notable functional additions (Watch + iOS gateway) are called out separately below.
OpenClaw’s “Gateway control plane + markdown state on disk” design gets documented
OpenClaw (architecture pattern): A detailed “how it’s built” explanation frames OpenClaw as a stateful AI control plane: a workspace directory of Markdown files for identity/memory/skills/tool policy, JSONL session transcripts, and a single long-running Gateway process that everything flows through; WhatsApp/Slack/etc. are treated as channel adapters, as summarized in the architecture thread.
This is a concrete reference design for teams building multi-agent systems where “state on disk” is the source of truth and the orchestrator is a durable daemon rather than a per-request wrapper.
OpenClaw adds paired-device management and push notification configuration
OpenClaw (openclaw/openclaw): v2026.2.19 adds new Gateway/CLI commands for managing paired devices (e.g., removing paired entries and handling pending requests), plus updates around push registration and notification-signing configuration; the release framing is in the beta announcement and the full checklist lives in the GitHub release notes.
For teams running multiple endpoints (phone, desktop, watch), this is the unglamorous but necessary device-hygiene layer.
OpenClaw v2026.2.19 includes an early Apple Watch companion app
OpenClaw (openclaw/openclaw): The v2026.2.19 beta includes an early Apple Watch app “if you wanna go digging,” as flagged in the beta announcement and detailed in the GitHub release notes.
The Watch work appears to be in “companion plumbing” territory (inbox UI + notification/command handling), which is the kind of foundation you need before Watch-first agent control loops become realistic.
OpenClaw maintainer recruitment emphasizes security mindset and process-based onboarding
OpenClaw (project ops): The project is looking for maintainers with “running larger projects” experience and a security mindset; the maintainer explicitly asks candidates to follow the documented process rather than replying/DMing, as stated in the maintainers call.
This is a governance signal: the project’s operational scaling is being treated as part of the security posture, not a side quest.
📱 Vibe-coded native apps: Rork Max says it “replaces Xcode” (and triggers “App Store flood” discourse)
Rork Max dominated the app-building chatter: browser-based SwiftUI app generation across Apple devices, quick deploy/publish flow, and a wave of demos/tutorials. Excludes Gemini feature story.
Rork Max launches as a browser Swift builder with 1‑click install and App Store publish claims
Rork Max (Rork): Rork announced Rork Max as a web app that generates native Swift apps for iPhone/Watch/iPad/TV/Vision Pro and positions itself as “a website that replaces Xcode,” with “install on device in 1 click” and “publish to App Store in 2 clicks” claims in the launch thread.

The technical bet is that a prompt-driven Swift pipeline can cover not just CRUD apps, but AR/3D experiences—Rork explicitly name-checks “Pokémon Go with AR & 3D” and shows rapid code generation plus multi-device previews in the launch thread’s montage.
• Stack and positioning: The launch frames Max as “Powered by Swift, Claude Code & Opus 4.6,” anchoring it to a top-tier coding model plus an agentic harness rather than “template + snippets” generation, per the launch thread.
• Proof points used in marketing: Rork’s examples include a Minecraft clone and a prompt-to-world generator, used as evidence that Max can produce playable 3D content rather than static UI, per the launch thread.
Early reactions from investors/early testers amplify autonomy claims (e.g., “build almost any app idea… completely autonomously”) in the early access reaction post, but there isn’t an independent eval of build correctness or App Store submission reliability in these tweets.
Rork Max walkthrough frames “Pokémon Go in hours” as a repeatable build recipe
Rork Max (Rork): Rork followed the launch with a longer walkthrough that tries to make the headline claim concrete: a recorded tutorial on rebuilding Pokémon Go mechanics (AR/3D) quickly, positioned as compressing a multi-year mobile build into hours, per the tutorial clip.

In the same thread, Rork stacks additional “can it do X?” examples—Apple Watch, Vision Pro games, AR push-up tracking via on-device Vision, and “screen-blocking” apps—trying to show breadth across Apple-specific frameworks (widgets, Live Activities, Siri intents) as part of the same prompt-to-Swift workflow, per the expanded examples.
• Demos as capability probes: The original launch post also points to a Minecraft clone where a prompt generates a world you can play, used as a stress test for 3D state + rendering rather than UI layout, per the launch thread.
The tutorial is the closest thing in this tweet set to an implementation pattern (“here’s how to do it”), but it still leaves open questions engineers will care about—signing, entitlements, privacy prompts, and “what breaks when you go off the happy path.”
Rork Max triggers “Apple will sherlock this” and App Store flood predictions
Rork Max adoption signal: The community reaction quickly moved from demos to platform dynamics—one widely shared take predicts Apple will “acquire this or sherlock it within 18 months,” explicitly framing the product as “the fastest path from idea to app,” per the acquire or sherlock claim.
A second cluster of reactions focuses on supply-side consequences: that low-friction native app generation could “flood” the App Store and compress the advantage of traditional Swift/Xcode expertise, as voiced in the “App Store is about to be FLOODED” repost, per the flood prediction.
These are directional signals, not measurements—no tweet here quantifies actual submission volume or review throughput changes—but they explain why Rork Max is getting treated as more than “another app generator” in today’s discourse.
🧭 Coding workflow patterns: context discipline, “apps are dead”, and prompt-caching-first design
Practitioner discourse today is heavy on workflow: bespoke software replacing app stores, context engineering tactics, and the ergonomics problems that show up with agentic coding. Excludes specific product release notes.
Karpathy argues the app store model breaks when agents can improvise custom apps
Bespoke personal apps: Andrej Karpathy describes a shift from “download an app” to “generate the tool you need right now,” after vibe-coding a custom cardio experiment dashboard that pulled treadmill data from a reverse‑engineered API and shipped as ~300 LOC in about an hour, per the Cardio dashboard thread. The claim is not that everything is easy today (he lists unit bugs and calendar mismatches), but that the iteration loop is collapsing from hours to minutes, and the app itself becomes disposable.
He frames the engineering consequence as a new default: “highly custom, ephemeral apps” assembled by LLM glue, where the limiting factor is less app discovery and more integrating your own context/data/services into something the agent can wire up quickly, as laid out in the Cardio dashboard thread.
“AI-native CLI” becomes the bottleneck for agentic integration
AI-native CLIs: A recurring workflow claim today is that integration friction—not model quality—is the blocker, captured by the line “99% of products/services still don't have an AI-native CLI yet,” as quoted in AI-native CLI quote and argued in detail by Karpathy in the Cardio dashboard thread. The point is that agents are forced to scrape HTML docs, click UIs, or reverse engineer APIs, which turns “1 minute” automations into “1 hour” debugging.
Karpathy’s framing is that vendors should re-orient around “sensors & actuators with agent-native ergonomics,” where the UI is optional and the canonical interface is an API/CLI that an agent can call reliably, as described in the Cardio dashboard thread.
A Claude Code contributor says agents must be designed for prompt caching first
Prompt caching as a design constraint: A Claude Code contributor claims “you fundamentally have to design agents for prompt caching first,” because “almost every feature touches on it,” per the Caching-first claim. This is a workflow-level point: the harness architecture (what goes in stable prefixes, what changes per turn, where you branch subagents) determines whether you get cache hits—or pay full-price every turn.
That framing lines up with the “automatic caching” docs pattern Anthropic is now pushing—set caching at the request level and let the system apply the breakpoint, as shown in the Docs screenshot and detailed in the Prompt caching docs.
Automatic prompt caching lands, shifting effort to prompt template structure
Automatic prompt caching: An API-side change means developers “no longer have to set cache points” manually, according to the API caching update. The operational implication is that caching becomes less about per-request plumbing and more about maintaining stable prompt prefixes so the cache can actually hit.
The matching doc pattern is “single cache_control at the top level” with the system placing the breakpoint automatically, as shown in the Docs screenshot and spelled out in the Prompt caching docs.
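For context, here is a minimal sketch of the manual breakpoint pattern that automatic caching reportedly replaces, using the Anthropic Python SDK’s block-level cache_control; the model name is a placeholder, and the new request-level shape should be taken from the Prompt caching docs rather than this sketch.

```python
# Minimal sketch of the *manual* cache-breakpoint pattern: mark the large,
# stable prefix (system prompt, tool docs) with cache_control so repeat turns
# hit the cache, and keep per-turn content out of the cached prefix.
# Model name below is a placeholder, not a confirmed ID.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_AGENT_PROMPT = "…large, rarely-changing instructions and tool documentation…"

response = client.messages.create(
    model="claude-opus-4-6",  # placeholder; use whatever model you actually run
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_AGENT_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint ends here
        }
    ],
    messages=[{"role": "user", "content": "Summarize the open TODOs in this repo."}],
)
print(response.content[0].text)
```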
Builders describe losing codebase familiarity as prompting becomes the default
Distance-from-code risk: One developer describes a cycle where prompting replaces navigation because it’s lower friction, but that “distance… between you and the codebase” accumulates until you “can’t even reliably @ a precise file,” as written in the Agentic programming reflection. It’s a productivity win, but it changes how ownership and debugging feel.
A concrete failure mode shows up in the “parallel agent psychosis” anecdote: a feature was effectively lost because work happened in a transient location, then recovered by extracting it from agent session logs, as described in the Lost feature recovery. The common thread is that the agent adds throughput, while humans become more dependent on traceability artifacts (sessions, logs, worktrees, reproducible state).
🧷 Skills & MCP plumbing: turning docs into reusable agent capabilities
Today’s “extensibility” content is about packaging repeatable capabilities: skills generators, structured skill graphs, and repo-local workflows. Excludes general SDKs and model launches.
Firecrawl ships a docs-to-skill generator for Claude Code workflows
Firecrawl (Firecrawl): Firecrawl introduced a Skill Generator flow that converts a tool’s documentation URL into an agent Skill, positioned as a way to make “any tool with docs” usable by agents without manual skill authoring, as shown in the Skill generator demo.

The mechanism is exposed as a Claude Code plugin command—install the plugin, then run /firecrawl:skill-gen <docs_url> per the Skill generator demo—and it also supports bundling multiple generated skills together (they call out examples like Polar and Resend) in the same workflow, as described in the Skill generator demo.
HyperGraph proposes “skill graphs” instead of one giant SKILL.md
HyperGraph (Hyperbrowser): HyperGraph is being pitched as a way to turn technical topics into navigable, linked skill graphs so agents can focus on relevant nodes rather than ingesting a single large skill file, as described in the HyperGraph overview.

This is framed as a context-control primitive: the graph structure is meant to reduce “dump everything into one skill file” behavior and instead route the agent through smaller skill nodes, per the HyperGraph overview.
Repo-local /do-work skills as a lightweight alternative to bloated AGENTS files
Repo workflow packaging: A practical pattern is to keep a repo-local /do-work skill that encodes the author’s default loop—Plan → Explore → Build → Run tests/types → Commit—and then hang optional sub-routines off separate markdown files so they don’t crowd context, as laid out in the Repo skill pattern.
The operational detail that matters is the division of responsibilities: the repo’s AGENTS file stays nearly empty, while the skill file becomes the reusable entrypoint for consistent agent behavior across tasks, according to the Repo skill pattern.
🛠️ Dev tooling upgrades: agent-friendly docs, model switching UX, and editor integrations
Assorted developer tooling updates that help teams ship with agents: version-matched docs, unified gateways, editor integrations, and terminal UX improvements. Excludes AI model releases.
Next.js canary installs now include version-matched docs for agents
Next.js (Vercel): Next.js canary installs will ship with version-matched docs embedded alongside the project, explicitly to give coding agents accurate context on new/changed APIs; the team reports ~20% higher success rates in evals when agents have those docs locally, per the canary install demo.

• How you get it: the rollout is via npx create-next-app@canary, as shown in the canary install demo.
This is a direct “agent context drift” mitigation: fewer failures from reading stale online docs or guessing API shapes.
Vercel AI Gateway adds video generation and a generateVideo API in AI SDK 6
AI Gateway (Vercel): Vercel added video generation support to AI Gateway alongside a new generateVideo API in AI SDK 6, positioning it as a model-swap surface for apps/agents that want to route across providers, as announced in the gateway video support and shown in the generateVideo demo.

• Promo window: Grok Imagine Video and Grok Imagine Image are free through Feb 25 via AI Gateway, per the gateway video support and the Grok on gateway note.
This lands as an infra/UX change more than a model release: it’s about unifying “text ↔ media” calls behind one gateway surface.
keep.md pushes “markdown bookmarks as an API” for feeding agents saved context
keep.md (keep.md): keep.md is being used as a “bookmarks → markdown → API” pipeline so agents can ingest saved sources (including YouTube transcripts and comments) for recurring research automation, as shown in the YouTube ingestion demo and described in the product page.

• Agent-facing use cases: the pitch is competitor tracking, finding unsolved problems, and automating content research, per the YouTube ingestion demo and the follow-on feature ideas thread.
The key idea is not “another bookmarking app”; it’s making saved context addressable and readable by tools without screen scraping.
Warp adds /models fuzzy switching for faster model swaps
Warp (Warp.dev): Warp added /models to fuzzy-find and switch LLMs without using the mouse; the UI also includes comparison charts and supports binding a custom keyboard shortcut for “Spotlight-style” swapping, as shown in the model switching demo.

This is a workflow-speed improvement for teams that are constantly swapping between “fast” and “thinking” models while iterating in terminals.
Zed adds Gemini 3.1 Pro support for subscribers and BYOK users
Zed (Zed Industries): Zed added Gemini 3.1 Pro support; Pro subscribers are told to restart Zed, while BYOK users need to update to Stable v0.224.7 or Preview v0.225.2, as described in the Gemini support note.
This is a practical “model availability in-editor” change: it reduces friction to standardize on a new model across a team that already lives in Zed.
Zed gets official GitHub Copilot support through a partnership
Zed (Zed Industries): Zed announced official support for GitHub Copilot via a partnership, formalizing Copilot integration in the editor as stated in the partnership announcement.
This is a distribution and maintenance signal: fewer “best-effort” integrations and more vendor-supported behavior for Copilot users inside Zed.
Warp improves terminal tab density and contrast
Warp (Warp.dev): Warp updated terminal tabs to be larger (more detail visible) and increased color contrast for colored tabs, as shown in the tab UI demo.

This is minor on paper, but it targets the real issue in agent-heavy terminal workflows: many concurrent sessions and lots of context switching.
🧾 Tracing & debugging agents: making runs searchable, comparable, and cheaper
Observability content today centers on trace navigation UX and verification tools for agent-browser workflows. Excludes benchmark leaderboards (covered separately).
Raindrop’s Trajectory Explorer makes agent traces searchable and failure-shaped
Trajectory Explorer (Raindrop AI): Raindrop is pitching a new trace UI that treats an agent run as a “trajectory” you can search by intent (e.g., “show me traces where edit failed 5 times because it didn’t read the file”) instead of scrolling nested spans, as described in the trajectory launch RT and expanded with new visualization modes in the duration view walkthrough.
• Two visualization modes: output-size helps spot heavy tool calls; duration mode is flame-graph-like but grouped by tool so slow reads/edits jump out quickly, as shown in the duration view screenshot.
• Workflow implication: this is explicitly aimed at making “what actually happened?” answerable faster than raw trace JSON; it also frames recoveries/errors as first-class navigation targets, per the duration view walkthrough.
agent-browser adds snapshot diffs and pixel-level visual regression for cheaper verification
agent-browser (ctatedev): agent-browser added a diffing feature to compare pages via DOM-ish “snapshot diffs” and pixel-level visual regression, positioning it as a way to verify web actions with “up to 90% fewer tokens,” as announced in the diffing release post.
• Snapshot diff UX: the tool shows structural diffs like textbox value changes and disabled buttons (useful for asserting state transitions in agent runs), as shown in the CLI diff output.
• Visual regression option: pixel diffs are positioned for catching UI changes and regressions when snapshot text misses canvas/icon-only UI, per the diffing release post.
Cursor Debug Mode essay: agents should instrument the code they run by default
Cursor Debug Mode (Cursor): A writeup on Cursor’s Debug Mode argues the next step for coding agents is “always instrumenting the code they’re running,” framing observability as part of the agent UX rather than an afterthought, as described in the debug mode essay teaser.
• Core claim: treating runtime signals (traces, spans, errors) as agent-readable context could reduce the “guess, rerun, hope” loop in long agent sessions, per the debug mode essay teaser.
This is presented as a design direction more than a shipped feature spec; details beyond the thesis aren’t in the tweets.
📏 Benchmarks & eval methodology: what to trust (and what to ignore)
Evals discussion today includes new benchmark releases, methodology disputes, and domain-specific gaps (e.g., docs parsing vs reasoning). Excludes Gemini 3.1 Pro leaderboard victory (feature).
ValsAI releases MedScribe + MedCode to separate “note quality” from “coding correctness”
MedScribe + MedCode (ValsAI): ValsAI says clinical workflows need two different evals—one for narrative documentation and one for structured billing/coding—because models that write plausible notes can still fail rule-bound coding tasks, as framed in the benchmark announcement and follow-up analysis in the results summary.
• Documentation looks mature: they report SOAP-note generation accuracy hitting 88% for GPT-5.1 and 87% for Claude Opus 4.6, per the results summary.
• Coding remains brittle: they report best medical coding performance at 56% (Gemini 3 Flash), with particular weakness on mental health diagnoses, according to the results summary.
The split matters because teams can stop treating “good narrative notes” as evidence of “safe automation” for billing or downstream structured systems.
SkillsBench finds curated skills boost agents; self-generated skills don’t on average
SkillsBench (research): A new benchmark argues “skills” are only net-positive when curated—reporting a +16.2 percentage point average pass-rate lift from curated skills, while self-generated skills provide negligible or negative benefit, as summarized in the paper takeaway.
The operational implication is that “generate a skill file from docs” is not a solved step yet; the paper’s core claim is that short, human-condensed procedures beat long model-summarized instructions in practice, as argued in the paper takeaway.
GDPval-AA gets pushback: “public questions + model judge” isn’t a headline agent eval
GDPval-AA (eval methodology): A critique argues GDPval-AA shouldn’t be treated as a flagship “real-world agent” benchmark because it uses public GDPval questions with a general model as judge—missing the expert judging that made GDPval meaningful—per the benchmark rant and broader “2025 agent evals are dated” framing in the agent product comment.
The same thread notes multiple “knowledge” benchmarks are saturating while ARC-style tests remain hard to interpret, reinforcing that leaderboard movement can hide measurement problems, as stated in the benchmark rant.
Higher “thinking” settings didn’t improve document understanding on OmniDocBench
OmniDocBench reasoning-mode experiment (LlamaIndex): LlamaIndex reports that increasing “thinking” on GPT-5.2 didn’t improve doc understanding scores, and can degrade fidelity to input structure (e.g., table extraction), as shown in the benchmark writeup.
The example contrasts GPT-5.2 outputs that hedge with “illegible” cells or drift structurally versus a dedicated parser result, reinforcing that “reasoning tokens” and “document parsing quality” are not the same knob, per the benchmark writeup.
Benchmark-heavy launches trigger skepticism about what labs optimize for
Benchmarks as product signaling: Some builders say a launch framed primarily around benchmark wins is itself a negative signal about model usefulness and lab direction, as stated in the benchmarks skepticism.
A key rebuttal is that “benchmarks don’t matter” still needs a replacement metric—otherwise it becomes “nothing”—as argued in the metric substitution reply.
🧪 Research & training methods: multi-agent test-time scaling, retrieval models, and reasoning control loops
Research threads today span multi-agent orchestration methods, retrieval model pretraining, and new reasoning strategies that trade long CoT for iterative summarize/continue loops. Excludes product announcements.
ColBERT-Zero argues full multi-vector pretraining beats KD-on-top recipes
ColBERT-Zero (LightOn + EPFL): A new paper and companion blog argue that the common recipe “take a strong dense retriever, then do a small KD step to get ColBERT” underperforms doing large-scale pretraining directly in the multi-vector (late-interaction) setting; the work is linked in the ArXiv link and discussed as an “independence day from dense vectors” moment in the Community reaction.
• Training recipe nuance: The thread claims a supervised contrastive phase before KD gets close to full pretraining at ~10× lower cost (99.4% of performance), plus that prompt alignment between pretraining and finetuning is “non‑negotiable,” as detailed in the Training phases thread.
• Ecosystem signal: The community reaction frames this as LI-first pretraining becoming practical again, per the Retrieval ecosystem note.
The tweets don’t include a canonical results table image; the only visual here is a celebratory meme, so treat performance claims as provisional until you inspect the paper artifacts directly.
InftyThink+ trains models to reason in summarize-and-continue loops
InftyThink+ (ZJU + Ant Group): A new reasoning-control strategy replaces one long chain-of-thought with iterative “think → summarize → continue” rounds, trained with a supervised cold start followed by trajectory-level RL; the loop structure is diagrammed in the Loop diagram thread.
• Claimed trade-off: The thread reports +21% accuracy on AIME24, +9% vs long-CoT RL, and 32.8% lower reasoning latency, as stated in the Loop diagram thread.
• Key constraint: Each iteration only sees the original question plus the latest summary (not full prior reasoning), which is called out in the Mechanics note.
This sits in the same family as “compress reasoning state” approaches, but with the summary explicitly trained as the state interface rather than relying on ad-hoc self-compression.
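A minimal sketch of that control loop, illustrative only, with a hypothetical llm() callable standing in for any completion API; each round is conditioned on the question plus the latest summary, never the full prior reasoning.

```python
# Sketch of the summarize-and-continue loop described above (illustrative, not
# the paper's code). `llm` is a placeholder for any text-generation call.
from typing import Callable


def iterative_reasoning(question: str, llm: Callable[[str], str], max_rounds: int = 4) -> str:
    summary = ""  # compressed reasoning state carried between rounds
    for _ in range(max_rounds):
        prompt = (
            f"Question: {question}\n"
            f"Summary of reasoning so far: {summary or '(none)'}\n"
            "Continue reasoning. If you can answer, reply 'FINAL: <answer>'.\n"
            "Otherwise reply 'SUMMARY: <updated summary of your reasoning>'."
        )
        reply = llm(prompt).strip()
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        if reply.startswith("SUMMARY:"):
            summary = reply[len("SUMMARY:"):].strip()
        else:
            summary = reply  # fall back: treat the whole reply as the new state
    return f"(no final answer after {max_rounds} rounds; last summary: {summary})"
```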
Team of Thoughts proposes heterogeneous model teams with orchestrator calibration
Team of Thoughts (Imperial + Microsoft Research): A new multi-agent test-time scaling framework argues that “all agents the same model” is leaving performance on the table; it explicitly orchestrates heterogeneous models and adds two mechanisms—an orchestrator calibration step to pick the best coordinator model, and a self-assessment protocol where tool agents rate their own domain proficiency, as described in the Paper summary.
• Reported deltas: The authors claim 96.67% on AIME24 vs 80% for homogeneous baselines, plus 72.53% on LiveCodeBench vs 65.93%, per the Paper summary.
The key engineering implication is that “router policies” (who coordinates, who tools) become a first-class training-time and inference-time knob, not just a product routing layer.
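A hypothetical sketch of the two mechanisms as summarized above (not the paper’s code): orchestrator calibration scores candidate coordinators on a few probe tasks, and self-assessment asks each tool model to rate its own domain proficiency before routing; ask() is a placeholder for any chat-completion call.

```python
# Hypothetical sketch (not the paper's implementation) of orchestrator
# calibration and tool-agent self-assessment. `ask(model, prompt)` stands in
# for any chat-completion call and is assumed, not a real API.
from typing import Callable, Dict, List, Tuple

Ask = Callable[[str, str], str]  # (model_name, prompt) -> reply text


def calibrate_orchestrator(candidates: List[str], probes: List[Tuple[str, str]], ask: Ask) -> str:
    """Pick the coordinator whose answers match the probe references most often."""
    def score(model: str) -> int:
        return sum(ask(model, prompt).strip() == ref for prompt, ref in probes)
    return max(candidates, key=score)


def self_assess(tool_models: List[str], domain: str, ask: Ask) -> Dict[str, float]:
    """Each tool agent reports a 0-10 proficiency rating the orchestrator can route on."""
    ratings: Dict[str, float] = {}
    for model in tool_models:
        reply = ask(model, f"Rate your {domain} proficiency from 0 to 10. Reply with one number.")
        try:
            ratings[model] = float(reply.strip())
        except ValueError:
            ratings[model] = 0.0  # unparseable self-report -> don't route here
    return ratings
```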
Google Research separates “missing knowledge” vs “can’t recall” failures
WikiProfile + “Empty shelves or lost keys?” (Google Research): A new paper argues that frontier LLMs mostly encode facts, but often fail to recall them—so many “hallucinations” are better seen as retrieval/access failures (“lost keys”) rather than missing knowledge (“empty shelves”); this framing and the proposed WikiProfile benchmark are introduced in the Paper screenshot and available via the ArXiv paper.
• What’s new vs typical factuality evals: The paper proposes profiling at the level of facts (encoded vs recallable vs recallable-with-thinking) rather than question-level accuracy, per the Paper screenshot.
In practice, this points to a research gap: retrieval-style scaffolds and reasoning budget can improve correctness without changing the model’s underlying “knowledge,” but you need instrumentation to tell which failure mode you’re in.
⚖️ Auth & ecosystem policy fallout: subscription OAuth confusion and enforcement signals
Continues yesterday’s OAuth/tooling shock, but today’s content is about on-the-ground enforcement symptoms and developer uncertainty (denials, bans, partial walk-backs). Excludes OpenClaw technical releases (covered separately).
Anthropic docs still prohibit OAuth use outside Claude surfaces, and 401s persist
Anthropic OAuth policy (docs vs behavior): A new round of screenshots shows Anthropic documentation still stating that OAuth tokens from Free/Pro/Max are intended only for Claude Code and Claude.ai, and that using them in other tools “including the Agent SDK” is not permitted, as captured in Policy screenshot.
Enforcement symptom: despite reports of partial walk-backs, some users say they still receive HTTP 401 “authentication_error: OAuth authentication is currently not supported” responses, and that any “carve-out” doesn’t work out-of-the-box in their current toolchain, as summarized in Policy screenshot and reiterated in Follow-up note.
Anthropic OAuth fallout turns into a live status poll (denials vs unaffected vs bans)
Anthropic OAuth (consumer plans): Following up on OAuth ban, developers are now explicitly polling for operational outcomes—“Nothing happened”, “I’m getting OAuth denials”, or “My account was banned”—to map enforcement in the wild, as asked in Impact poll. One data point in-thread is that an OAuth token that had failed started working again, per Token works again.
This is still noisy, but it’s a concrete shift from policy reading to measuring symptoms (auth errors, bans, reversals) across accounts and tools.
🏢 Enterprise distribution & GTM signals: India push, revenue races, and subscription pivots
Business-side AI news today includes major enterprise distribution moves (India infra), subscription strategy pivots, and revenue-growth comparisons between labs. Excludes pure infra hardware/cooling items.
OpenAI launches “OpenAI for India” with Tata, pitching in-country capacity and mass enterprise rollout
OpenAI for India (OpenAI): OpenAI launched its OpenAI for India initiative, citing “more than 100 million weekly ChatGPT users” in India and framing the program around “sovereign AI capabilities” plus enterprise adoption, as described in the launch post linked from OpenAI for India launch. A separate report summary claims OpenAI secured 100MW of AI-ready data center capacity from Tata with plans to scale to 1GW, explicitly positioning it as a way to run advanced models within India for lower latency and data residency/regulatory constraints, per India capacity details.
• Enterprise distribution hook: The initiative calls out deploying ChatGPT Enterprise across “hundreds of thousands” of TCS employees and using Codex to standardize AI-native software development, as stated in OpenAI for India launch.
Operationally, this reads like a GTM move where distribution (Tata/TCS) and locality (in-country inference) are being treated as product features, not procurement footnotes.
Epoch projects Anthropic could surpass OpenAI’s annualized revenue by mid‑2026
Revenue race (Epoch AI): Epoch AI published a projection that Anthropic has been growing faster since both firms hit $1B annualized revenue (labeled 10×/year vs 3.4×/year) and “could overtake OpenAI by mid‑2026,” while explicitly flagging that extrapolating short trends is aggressive, as shown in ARR crossover chart.
• New nuance: Epoch adds that Anthropic’s growth rate may already be slowing—“since July 2025… 7×/year rather than 10×,” per Growth slowdown note.
This is mainly an analyst signal about how quickly frontier-lab revenue concentration can flip, which tends to ripple into hiring, pricing power, and enterprise contracting strategy.
Perplexity says it’s moving away from ads and doubling down on subscriptions and enterprise
Perplexity (Perplexity AI): Perplexity executives say they’re pivoting away from advertising—arguing ads in AI answers “damage user trust”—and are instead prioritizing subscriptions and enterprise sales, with reported revenue growth of 4.7× last year to about $200M ARR by Oct 2025, per the strategy summary in Business strategy summary.
The clear GTM implication is that “answer quality + trust” is being treated as incompatible with ads inside responses, while enterprise search positioning (competing with internal tools like Glean) becomes the main expansion vector.
LangChain launches “LangSmith for Startups” with $10k credits and VC distribution
LangSmith for Startups (LangChain): LangChain launched a startup program offering $10,000 in LangSmith credits plus technical sessions and community programming, and it’s explicitly tied to a long list of partner VCs (eligibility: partner‑backed, Series A or earlier), as laid out in Program announcement.

This is a GTM pattern: turning evaluation/observability tooling into a VC‑amplified default, similar to how cloud credits bootstrap early infra choices.
Perplexity opens iOS pre-orders for Comet, its AI assistant browser
Comet (Perplexity AI): Perplexity put Comet up for iOS pre‑order, per the App Store pre-order and the linked App Store listing, which indicates a release date of March 11, 2026.
This is a distribution move toward making “AI assistant” a default browsing surface (and a funnel for subscriptions), rather than a separate chat product.
🗣️ Voice agents: experimentation, liability/insurance, and sales automation
Voice-agent updates today are about operating voice systems safely and improving them with controlled experiments, plus enterprise sales automation claims. Excludes creative music generation (covered in gen media).
ElevenLabs adds A/B testing workflow to ElevenAgents with Experiments
ElevenAgents (ElevenLabs): ElevenLabs introduced Experiments, an A/B testing feature for iterating on production voice agents by running controlled traffic splits across agent variants and promoting the winner with version control, as shown in the feature demo.

• What’s testable: The product framing explicitly calls out prompt structure, workflow logic, voice, and personality as experiment dimensions, with the workflow spelled out in the feature demo.
• Ops mechanics: Teams create a new variant, route a slice of traffic, measure impact on business metrics, then promote to production; ElevenLabs ties this to outcomes like CSAT/containment/conversion in the follow-up details and the accompanying feature blog post.
This is a concrete step toward treating voice agents like continuously optimized production systems, rather than “prompt tweaks” done ad hoc.
Cartesia and Simple AI push “autonomous sales” voice agents with conversion claims
Sales voice agents (Cartesia + Simple AI): Cartesia positioned its voice stack as a production-grade foundation for enterprise sales agents, with Simple AI cited as a reference customer and partner in the partnership post.

The concrete claims being circulated: 30% higher conversion/upsell vs human reps, 100× scalability during peak traffic, and millions of production calls; these are presented as vendor metrics in the partnership post rather than as independently audited results. The integration emphasis is on voice naturalness and low time-to-first-token (TTFT) as the main failure mode for sales deployments (robotic voices → drop-off).
Tavus ships Phoenix-4 for real-time “active listening” human avatars
Phoenix-4 (Tavus): Tavus launched Phoenix-4, a real-time human rendering model aimed at making interactive agents feel responsive while listening (not just while speaking), as described in the launch summary.

• Realtime spec: Tavus claims 40fps at 1080p for head-and-shoulders generation, with emotional transitions and micro-expressions, per the launch summary.
• Stack integration: Phoenix-4 is positioned as part of a multi-model “behavior stack” alongside perception (Raven-1) and timing (Sparrow-1), with use cases named as higher-stakes interactions like healthcare/coaching in the launch summary.
For teams building voice agents with a face, the key technical bet here is that “listening reactions” are a first-class model output, not a UI animation layer.
🎬 Generative media & creative tooling: music, video style transfer, and marketing assets
Generative media chatter is still strong today: Google’s music generation, video style-transfer workflows, and AI marketing asset creation. Excludes voice-agent operational tooling (covered in voice agents).
Seedance 2.0 style transfer how-to: reference frames + “All-round reference” prompting
Seedance 2.0 (ByteDance/Jimeng): Following up on v2v tests—the new actionable detail is a reproducible workflow: extract 1–2 frames from a source video, restyle those frames first (they mention Nano Banana Pro), then run Seedance v2v with those images + video as “All-round reference,” using a prompt that explicitly demands frame-by-frame consistency, per the step-by-step in Workflow prompt recipe.
• Reference strategy: Two restyled frames become the “style anchor,” then the full video gets overhauled “including characters and environment,” as described in Workflow prompt recipe.
• Prompt wording to preserve identity: The recipe repeats “character details remain consistent” and “focus on each individual frame,” which is the practical hedge against drift noted in Workflow prompt recipe.
fal’s “State of Generative Media” says infra became the differentiator in 2025
State of Generative Media Volume 1 (fal.ai): fal published its first “State of Generative Media” report, arguing 2025 was defined by release velocity across image/video/audio/3D rather than a single breakout model, as announced in Report launch.
• Infra as the bottleneck: fal’s own slide highlights that teams prioritize cost optimization (58%), model availability (49%), generation speed (41%), reliability/uptime (37%), and security/compliance (34%), as shown in Decision criteria chart.
• Why this matters to builders: the report frames serving readiness—latency, uptime, and cost—as the main differentiator once multiple models are “good enough,” matching the framing in Decision criteria chart.
You can skim the full report directly via the report page.
Lyria 3 early users say non-English vocals finally sound native
Lyria 3 (Google DeepMind): Following up on Lyria 3 rollout—Gemini-integrated music generation—one of the clearest new signals today is multilingual vocal quality: a user reports generating a German song “with no accent” and “perfect” German, calling the lyrics-to-music fit unusually strong for a non-English prompt in the German song example.

This is still anecdotal (not a benchmark), but it’s a concrete capability claim that matters for teams building localized creative tools and ad pipelines, where accent artifacts have been a reliability blocker.
Google Labs adds “Photoshoot” to Pomelli for product-shot generation
Pomelli Photoshoot (Google Labs): Google is now publicly pushing a “Photoshoot” feature inside Pomelli that turns a single product image into multiple brand-style marketing assets, with an initial free rollout limited to the US, Canada, Australia, and New Zealand per the availability callout in Availability clip and the SMB positioning in Early access demo.

The practical implication is that the UI is being packaged as an SMB workflow (product photo → variants) rather than a general image editor, which usually changes how teams should evaluate it: output consistency and template controllability matter more than raw aesthetics.
Netflix threatens “immediate litigation” over Seedance 2.0 outputs
Seedance 2.0 (ByteDance) legal pressure: Following up on studio pressure around copyright blowback, today’s new datapoint is Netflix specifically threatening “immediate litigation” and labeling Seedance 2.0 a “high-speed piracy engine,” with examples cited like Stranger Things finale clips and “Bridgerton Season 4 costumes” in the reporting summarized by Legal threat summary.
From a builder’s standpoint, this is less about who’s right legally and more about product risk: it increases the odds of sudden capability gating, takedowns, or platform-level filtering for workflows that depend on these generators.







