Goldman Sachs deploys Claude agents after 6 months – audit fees fall 14%


Executive Summary

Goldman Sachs is rolling out Claude-based “digital co-worker” agents for regulated back-office work, after Anthropic engineers were embedded on-site for 6 months. The described flow reads trade-record bundles plus policy text, executes step-by-step rules, flags exceptions, and routes items for approval. Goldman frames the outcome as faster client vetting and fewer reconciliation breaks, with “slower headcount growth” rather than immediate layoffs. The emerging template is enterprise agents as controls-first systems—rules execution, exception routing, audit trails—where integration work, not a generic chat UI, is the moat.

Anthropic capital stack: Bloomberg rumor says $20B+ new funding at ~$350B valuation (~2× prior mark) with revenue run-rate “above $9B”; unconfirmed in-thread.
Amazon ↔ Anthropic: circulated mark pegs Amazon’s stake at ~$60.6B; partnership is tied to an Anthropic commitment to buy 1M Trainium chips, coupling model growth to AWS capacity.
Audit pricing pressure: FT anecdote says KPMG forced a 14% fee drop to $357,000 (from $416,000) as AI compresses audit workflow labor, with judgment still positioned as partner-level.


Feature Spotlight

Enterprise AI goes live: regulated rollouts, ROI claims, and big-money bets

Goldman deploying Claude agents for accounting/compliance is a high-signal “production in regulated workflows” moment—implying real enterprise demand, governance requirements, and budget shifts beyond coding copilots.




🏦 Enterprise AI goes live: regulated rollouts, ROI claims, and big-money bets

Feature focus is new real-world enterprise adoption signals—most notably Goldman rolling out Claude-based agents for accounting/compliance. Also includes other concrete ROI stories and large funding/valuation moves; excludes day-to-day coding tool updates.

Goldman Sachs rolls out Claude agents for accounting and compliance work

Goldman Sachs × Claude (Anthropic): Goldman is rolling out Claude-based agents to automate high-volume accounting and compliance work, after Anthropic engineers spent 6 months embedded on-site co-developing “digital co-worker” systems, as described in the rollout thread and echoed by the CNBC screenshot.

Goldman’s cited workflow is explicitly rules-and-controls shaped: the agent reads bundles of trade records plus policy text, applies step-by-step rules, flags exceptions, and routes items for approval—Goldman frames the outcome as faster client vetting and fewer reconciliation breaks, with “slower headcount growth” rather than immediate layoffs per the rollout thread.

Anthropic funding round rumored at $20B+ and ~$350B valuation

Anthropic fundraising (Bloomberg): A Bloomberg-reported rumor claims Anthropic is finalizing a round of $20B+ at a ~$350B valuation (about 2× the prior mark), with investors citing an annualized revenue run rate “above $9B,” per the round rumor.

This is unconfirmed in the tweets (no term sheet, no company statement), but it’s a clear “big-money” signal being discussed alongside near-term enterprise adoption stories like the Goldman rollout thread.

Amazon’s Anthropic position marked to ~$60.6B with a 1M Trainium commitment

Amazon ↔ Anthropic: A circulated breakdown says Amazon’s Anthropic investment is now marked at about $60.6B (after investing $8B in 2023), structured as $45.8B convertible notes plus $14.8B nonvoting preferred stock, with further mark-ups expected as notes convert in new rounds, per the deal mechanics summary.

The same thread ties the partnership to infrastructure demand by citing Anthropic’s commitment to buy 1M Trainium chips, effectively coupling Anthropic’s training appetite to AWS capacity and economics according to the deal mechanics summary.

Regulated enterprise agents are converging on rulebooks, exceptions, and routing

Regulated-work agent design: The clearest “real deployment” shape showing up is an agent that reads messy mixed inputs (tables + text), executes a deterministic-looking rulebook, and then escalates edge cases through routing and approvals; the Goldman description emphasizes controls as non-negotiable and highlights embedded engineers as the integration edge per the controls and customization note.

This same pattern is called out as an enterprise demand signal—“beyond simple chatbots”—in the deployment signal thread, which frames the differentiator as fitting agents into legacy systems with ownership and auditability rather than shipping a generic assistant.
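
To make the shape concrete, below is a minimal sketch of that rulebook-plus-exception-routing loop. Everything here is illustrative (rule names, thresholds, queue handling), not a description of Goldman’s system; in the described deployments the agent would sit upstream, extracting the trade fields from record bundles and policy text.

```python
from dataclasses import dataclass, field

audit_log: list["Decision"] = []        # every item leaves a record
approval_queue: list["Decision"] = []   # exceptions wait for a human

@dataclass
class Decision:
    item_id: str
    status: str                          # "pass" or "exception"
    reasons: list[str] = field(default_factory=list)

def amount_within_limit(trade: dict) -> tuple[bool, str]:
    # Illustrative rule: flag trades above a made-up policy threshold.
    return trade["amount"] <= 1_000_000, "amount exceeds policy limit"

def run_controls(trade: dict, rulebook) -> Decision:
    """Apply deterministic rules in order; any failure becomes an exception."""
    decision = Decision(item_id=trade["id"], status="pass")
    for rule in rulebook:
        ok, reason = rule(trade)
        if not ok:
            decision.status = "exception"
            decision.reasons.append(reason)
    if decision.status == "exception":
        approval_queue.append(decision)  # exceptions route for approval, never auto-resolve
    audit_log.append(decision)           # the rule trace doubles as the audit trail
    return decision

print(run_controls({"id": "T-1", "amount": 2_500_000}, [amount_within_limit]))
```

The point of the shape is that the model’s judgment is confined to extraction and flagging, while pass/fail, routing, and logging stay deterministic and reviewable.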

eXp Realty claims millions saved by replacing SaaS with Lovable-built internal tools

Lovable × eXp Realty: Lovable shared a customer story claiming eXp Realty is saving millions annually by building internal tools: $2M+ in SaaS costs eliminated, $1M saved replacing chatbot workflows, and 85% fewer support tickets, per the savings claim and the linked customer story.

This is a vendor-provided case study (not an audited disclosure), but it’s a concrete datapoint on the “buy → build” substitution pattern for internal systems that used to be handled by bundled SaaS.

KPMG pushes audit fee cuts citing AI-driven cost reductions

KPMG audit pricing (FT): An FT-reported anecdote says KPMG threatened to move its audit unless Grant Thornton lowered fees to reflect AI-driven productivity gains, and the reported outcome was a 14% fee drop to $357,000 for 2025 from $416,000 for 2024, as summarized in the FT pricing summary.

The thread attributes the leverage to AI compressing audit workflows like document triage and draft documentation, while still requiring partner-level judgment on hard accounting calls per the FT pricing summary.

Software equities wobble on ‘AI replaces workflows’ uncertainty signals

Public markets signal: Multiple posts tie a sharp software-sector selloff to investor uncertainty about long-duration SaaS cash flows as agentic automation becomes more credible. One claim cites the S&P 500 software index down ~9% in five days, with singled-out drops like Thomson Reuters down 20%+, in the market selloff claim; another thread frames the mechanism as “future cash flows not clearly visible” in the Gerstner interview clip.

Gerstner on cash flow uncertainty

The tweets mix firsthand reporting with highly interpretive framing, so treat the causal attribution as provisional—what’s concrete is that “AI automation risk” is being used as an explanatory lens across threads like the market selloff claim and the stock drop addendum.


🧑‍💻 Claude Code shipping notes: Agent Teams, CLI churn, and UX micro-features

Continues the Opus 4.6 week, but today’s tweets are mostly about Claude Code’s operational UX: small workflow features (/rewind summaries), CLI point releases, and how teams are using Agent Teams in practice—excluding enterprise rollouts (feature) and benchmark leaderboards (separate).

Claude Code adoption signals show up in GitHub commit share and Vercel deploy rates

Claude Code (Anthropic): Following up on Commit share (Claude Code’s GitHub footprint), new screenshots circulating today put Claude Code at ~4.0% of public GitHub commit activity, with 135K+ commits/day, as shown in the Commit share chart.

A separate usage signal from Vercel reports that Claude-using teams generated 12.8% of deployments last week and “ship 7.6× more often” than non-Claude teams, per the Vercel deployment stats. Together, these are operational adoption indicators that show up outside of benchmarks: commit volume and deploy velocity.

Claude Code CLI 2.1.33 adds hook events and agent memory frontmatter

Claude Code CLI 2.1.33 (Anthropic): Following up on CLI 2.1.32 (Agent Teams + auto-memory), 2.1.33 lands with 16 CLI changes, 2 flag changes, and 1 prompt change, as enumerated in the 2.1.33 changelog.

Multi-agent workflow plumbing: Teammate sessions in tmux can now send/receive messages reliably, and two new hook events—TeammateIdle and TaskCompleted—enable event-driven automation around Agent Teams, per the 2.1.33 changelog.
Persistent agent memory (frontmatter): Agents can declare a memory scope (user, project, or local) in frontmatter, which formalizes “where does this agent remember things,” according to the 2.1.33 changelog.
Surface-area control: Sub-agent spawning can be restricted using Task(agent_type) syntax in agent tools frontmatter, as detailed in the 2.1.33 changelog.
Prompt transparency change: A system-prompt note that previously told Claude to disclose Agent Teams unavailability only when explicitly asked has been removed, as called out in the Prompt change note and shown in the Diff link.

Release notes are tracked in the upstream repo via the Changelog entry.

Claude Code CLI 2.1.34 patches a sandbox-permission bypass edge case

Claude Code CLI 2.1.34 (Anthropic): 2.1.34 ships with 2 CLI changes and focuses on stability and sandbox enforcement, as listed in the 2.1.34 changelog.

Sandbox escape hatch tightened: A bug was fixed where commands excluded from sandboxing could bypass the “ask permission” rule for Bash when autoAllowBashIfSandboxed was enabled, per the 2.1.34 changelog.
Crash fix: A crash when the Agent Teams setting changed between renders was addressed, according to the 2.1.34 changelog.

Upstream details are in the Changelog entry.

Agent Teams demos converge on role-separated “mini org charts”

Claude Code Agent Teams (Anthropic): New demos show people using Agent Teams as a role-separated workflow—lead engineer, code reviewer, UX reviewer, test engineer—coordinating via task handoffs and follow-up tasks, building on Agent Teams (multi-session parallelism) and illustrated in the Orchestration screenshot.

Agent Teams walkthrough

The screenshot shows a full pass where “all 8 tasks completed, tests passing,” with review feedback spawning two discoverability follow-ups, as captured in the orchestration screenshot. For teams evaluating multi-agent setups, this is a concrete example of how coordination, review, and testing get structured once parallel sessions exist.

Claude Code can now summarize the part you rewound

Claude Code (Anthropic): Claude Code now generates an automatic summary of the portion of the chat you just rewound (via /rewind or hitting ESC twice), so you can branch the conversation while keeping the discarded path’s learnings, as shown in the Feature note.

Rewind summary demo

This is a small UX change with big “long thread” impact: it reduces the cost of backtracking and makes multi-try exploration less memory-fragile, especially when you’re iterating on plans or refactors and don’t want to manually re-copy what worked.

Anthropic announces a Claude Code “Built with Opus 4.6” virtual hackathon

Claude Code (Anthropic): Anthropic announced “Built with Opus 4.6,” a Claude Code virtual hackathon where winners are hand-selected for $100K in Claude API credits, according to the Hackathon announcement.

Hackathon teaser

The event is positioned as a week of building directly with the Claude Code team, with signup details hosted on the Hackathon page.


🧰 Codex product/UX: pricing, personalities, platform expansion, and harness quirks

Codex chatter today is about product decisions and operator experience: pricing questions, personality modes, Windows app progress, and practical complaints (output formatting, compaction behavior). Excludes benchmark leaderboards (separate category).

Codex app on Windows is running internally

Codex app (OpenAI): Following up on Windows waitlist (early signup), a Windows build now appears to be running internally—one post shares a full Windows UI screenshot in the internal Windows screenshot, while another notes it’s “in the works” with a similar UI capture in the Windows in-the-works post. Both screenshots show the same left-nav structure (New thread, Automations, Skills, Debug) and model selector, which suggests parity work is underway rather than a mock.

OpenAI probes Codex pricing; users push for bundling and mid-tier plans

Codex pricing (OpenAI): Sam Altman asked how people want Codex priced in the pricing question, and replies quickly turned into a proxy fight over bundling, tiers, and whether Codex is a standalone product or an add-on to existing plans—see the reply snapshot in the reply screenshot. Some responses lean toward ad-supported/free jokes ("Free with 2 ads every prompt"), as in the reply joke, but the dominant signal is demand for clearer packaging and a middle tier rather than only "cheap" vs "double Plus" framing, per the reply screenshot.

Codex app adds switchable personalities via /personality

Codex app (OpenAI): The Codex app now supports personalities, with Pragmatic as the default and Friendly available via the /personality slash command, according to the personality feature note. The same post includes an example of how Pragmatic responds to emotional prompts ("love you"), which is useful context for teams trying to standardize tone across dev-facing agent output, as shown in the personality feature note.

Claims of no GPT-5.3-Codex API complicate benchmarking workflows

GPT-5.3-Codex access (OpenAI): One thread claims there is “no GPT-5.3-Codex API,” framing it as deliberate go-to-market strategy to drive usage through Codex surfaces rather than external benchmarking, per the no API claim. If accurate, that explains why independent evals would concentrate on harness-based results and screenshots rather than reproducible API runs.

This is unverified in these tweets. It’s a community assertion.

Codex CLI can target GPT-5.2 Pro via --profile pro

Codex CLI (OpenAI): A concrete operator setup tip circulated showing Codex CLI running codex --profile pro, which selects gpt-5.2-pro xhigh and prints the active model in the CLI header, as shown in the profile pro screenshot. This implies teams can keep multiple model configs as named profiles (at least when API-backed) and switch without reconfiguring per session, per the profile pro screenshot.

Codex 5.3 isn’t showing up in some IDEs yet

Codex 5.3 in third-party tools: Users noticed GPT-5.3-Codex wasn’t available in Cursor immediately after release, while Claude Opus 4.6 was, per the Cursor availability question. For builders using multiple harnesses, this is a practical rollout detail: model launches can be “real” in first-party surfaces while lagging in downstream IDE integrations.

The tweets don’t include an official ETA. They just flag the mismatch.

Codex CLI users report truncated output rendering

Codex CLI (OpenAI): A user reported output rendering issues where lines get cut off and blank spaces appear where text should be, with an example screenshot in the output truncation report. This is the kind of failure that can silently break agent reliability for long tool outputs (e.g., logs, diff summaries, command help), because it changes what the operator can verify.

OpenAI announces Codex hackathon winners focused on agents and tool integration

Codex hackathon (OpenAI): OpenAI Devs posted the Codex hackathon winners—OpenCortex, Evy, and Paradigm—with project descriptions centered on multi-agent research/paper generation, on-demand tool integration, and an adaptive dev environment that turns conversations into reusable workflows, per the winners announcement and the winner details.

The common thread is “agent + tool surface + workflow reuse,” not model benchmarking.


🧭 Agentic engineering patterns: compaction control, RLM-like loops, and “how to supervise”

High-signal practitioner patterns for shipping with agents: context/compaction discipline, parallelization strategies, and supervision frameworks. Excludes specific product changelogs (Claude Code/Codex categories) and formal benchmarks (separate).

AI-assisted code at scale needs explicit quality gates, observability, and ownership

Adoption framework: A practical rollout stance is that teams shipping AI-assisted code need new norms around quality gates, observability, and ownership regardless of model choice, as stated in Adoption framework. The key engineering implication is that “agent output” becomes an input to existing SDLC controls (tests, reviews, rollbacks), not a replacement for them.

Multi-agent concurrency as default: spawn several agents, select the best fix, and auto-test

Parallel-agent workflow: A concrete “new baseline” story is waking up to multiple PRs kicked off overnight; one agent reviews, suggests fixes, and auto-pushes, while another agent tests the changes—plus the ability to “spin up a bunch of agents on the same problem” and pick the best result, as described in Overnight PR workflow.

The supervision shift is that selection and verification become the human’s main workload, while implementation is parallelized and treated as cheap exploration.

RLM-style agent loops: put context into variables and treat sub-agents as functions

RLM-like supervision pattern: A practitioner tip frames “RLM-like” agent work as moving context out of prose prompts and into explicit variables/state, then calling sub-agents like pure functions that return values—reducing context-window pollution and making long runs easier to audit, per the RLM tips and the repeated formulation in RLM tips recap. The same idea implicitly pushes teams toward code-structured orchestration (state + typed returns) instead of chat transcripts as the control surface.
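
A minimal sketch of the pattern, with a stubbed llm_complete() standing in for an actual model call (any provider SDK slots in): context lives in typed variables, and each sub-agent is invoked like a pure function that returns a value instead of appending to one shared transcript.

```python
from dataclasses import dataclass

def llm_complete(prompt: str) -> str:
    """Stub for an actual model call (assumption: any provider SDK slots in here)."""
    return "stubbed response"

@dataclass
class ReviewState:
    diff: str                        # explicit state instead of a growing chat transcript
    summary: str = ""
    risks: list[str] | None = None

def call_agent(role: str, task: str, payload: str) -> str:
    """Sub-agent as a pure function: fresh context in, single value out."""
    return llm_complete(f"You are a {role}.\n\nTask: {task}\n\nInput:\n{payload}")

def review_diff(state: ReviewState) -> ReviewState:
    # Each call sees only the variables it needs, so the parent run's context
    # window never accumulates the sub-agents' intermediate reasoning.
    state.summary = call_agent("code summarizer", "Summarize this diff", state.diff)
    raw = call_agent("security reviewer", "List risky changes, one per line", state.diff)
    state.risks = [line for line in raw.splitlines() if line.strip()]
    return state

print(review_diff(ReviewState(diff="- old\n+ new")))
```

Because the returns are typed values rather than prose, long runs can be audited by inspecting state snapshots instead of replaying transcripts.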

Manual multi-agent ping-pong: implement then code-review agent then UX-review agent

Human-in-the-loop review loop: One concrete supervision recipe is “implement → code review agent → UX/design review agent → integrate fixes,” described as a manual ping-pong the human currently coordinates step-by-step in Ping-pong loop. The same structure shows up visually in multi-pane orchestration workflows (role-separated reviewers and testers), as captured in the Agent orchestration screenshot.

This pattern matters because it treats review as a first-class agent task, not an afterthought, and it surfaces where automation still breaks (handoffs, merge decisions, and final accountability).

Anti-cargo-cult rule for agents: default to simplest idiomatic code unless you can justify

Anti-cargo-cult guidance: A reusable rule for supervising agent edits is to require justification for copied patterns: “Default to the simplest idiomatic pattern; do not copy patterns… unless you can state why,” as shown in AGENTS.md snippet. The example root-cause note in the same post describes how one early dynamic-import workaround spread by copy/paste without a continuing need.

This pairs well with AGENTS.md-style repo policies because it turns “style drift” into an explicit failing condition during review.

Compaction discipline: don’t let the harness auto-compact; control it explicitly

Compaction control: A power-user claim is that outcomes differ depending on whether you let the tool/model “compact itself” versus supervising compaction manually—“Never! I control my compaction,” as argued in Compaction control. The sentiment is paired with the broader complaint that different harness behaviors (what gets summarized, when, and how aggressively) can change trust and perceived quality even when the underlying model is strong.

Dedicated maintenance agent for swarm machines: SSH in, kill runaways, and clean disk

Agent-farm operations: A pattern for teams running many concurrent agents is to keep a dedicated “machine maintainer” agent with SSH access that focuses on janitorial work (temp files, stuck tests, runaway processes), as described in Maintenance agent. The shared output shows disk reclaimed and process cleanup summaries, including a server freeing +1556 (units shown in the table) in one run.

This is a supervision pattern because it separates “keep the environment healthy” from “ship product code,” reducing human attention spent on toil during long-running agent batches.

Tool-calling economics: excessive tool calls are an expensive switch statement

Tool-calling cost framing: A blunt critique describes “excessive tool calling (instead of deterministic code)” as “the world’s most expensive switch statement,” in Tool calling critique. A related user observation in Complexity pushback echoes the same failure mode in practice: agents often overcomplicate, and tightening the decision boundary (when to call tools vs run normal code) is part of making runs cheaper and more predictable.
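
Read literally, the critique is about where the decision boundary sits. A hedged sketch (handler names illustrative): when the input-to-handler mapping is knowable in code, plain dispatch does the work, and only genuinely ambiguous items pay for a model round-trip.

```python
# Deterministic dispatch: no model round-trip when the mapping is known.
HANDLERS = {
    "invoice": lambda doc: f"posted invoice {doc['id']}",
    "refund":  lambda doc: f"queued refund {doc['id']}",
}

def ask_model(prompt: str) -> str:
    """Stub for the expensive path (a real LLM call in practice)."""
    return "escalated to model"

def process(doc: dict) -> str:
    handler = HANDLERS.get(doc["kind"])
    if handler is not None:
        return handler(doc)              # cheap, predictable, unit-testable
    # Only items the code genuinely can't classify escalate to the model.
    return ask_model(f"Decide how to handle this document: {doc}")

print(process({"kind": "invoice", "id": "A-17"}))   # handled by code
print(process({"kind": "unknown", "id": "B-3"}))    # handled by the model
```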

As agents get stronger, the ceiling rises but regressions slip in if you stop watching

Supervision tension: A builder report says coding agents are getting more capable but also more confusing: the autonomy ceiling rises, yet quirks and blind spots can introduce regressions if you’re not paying attention, as stated in Quirks and regressions. A follow-up note that a team is “seeing big improvements” after users hit issues, in Reliability follow-up, reinforces that day-to-day quality still depends on close review loops and fast feedback channels.

Two supervision styles emerge: tight control vs delegate-and-review, and tools may diverge

Ways of working with agents: A discussion frames Codex-style and Opus-style usage as diverging philosophies—some users want tight control, others want to delegate and review—arguing that future optimization will target “ways of working with AI” more than benchmark wins, per Work style split. The same theme shows up implicitly in the “multiple agents then select best fix” workflow described in Overnight PR workflow, where the human role becomes evaluator and integrator rather than sole implementer.


🧠 Agent runners & multi-model ops: councils, swarms, routing, and hosted assistants

Operational surfaces for running many agents/models: multi-model comparison, swarms, routing to fastest providers, and managed “assistant that does things” deployments. Excludes MCP/protocol plumbing (separate) and plugin/skill packages (separate).

Perplexity Model Council and Comet upgrade Max users to Claude Opus 4.6

Model Council + Comet (Perplexity): Following up on Council launch (multi-model parallel synthesis), Perplexity is now surfacing Claude Opus 4.6 inside both Model Council and its Comet browser agent for Max subscribers, as stated in the Max availability note and shown in the settings demo.

Model selection set to Opus 4.6

Perplexity also signaled it intends to bring Council Mode to Pro users with rate limits, according to the Pro rollout note.

Gemini in Chrome ships “describe the task” browser automation for AI Pro/Ultra (US)

Gemini in Chrome (Google): Google is shipping a browser-embedded Gemini agent that can act on pages based on a natural-language task description; it’s described as available to AI Pro and Ultra subscribers in the U.S. in the browser agent demo.

Prompt to page summary automation

A related UX leak shows model routing modes (“Auto”, “Fast”, “Thinking”, “Pro”) in a single selector, as shown in the routing UI screenshot.

Kilo launches Kilo Claw: hosted OpenClaw without SSH/Docker/yaml

Kilo Claw (Kilo): Kilo announced Kilo Claw, a hosted/managed OpenClaw offering positioned as “no Mac mini required” and “no SSH/Docker/yaml,” with a waitlist linked in the launch post and the waitlist page.

A follow-up video frames it as a managed instance running on Kilo’s gateway, as shown in the product video.

Managed OpenClaw walkthrough

OpenRouter launches Pony Alpha stealth model for free with provider logging

Pony Alpha (OpenRouter): OpenRouter launched Pony Alpha as a free “stealth model” optimized for agentic workflows and tool-calling accuracy, while warning that the provider logs all prompts/completions, per the launch note and the model card screenshot.

Attribution is still speculative: multiple posts claim it may be GLM-5 (or GLM-family) based on self-identification, latency, and behavior, as discussed in the attribution thread and self-identification screenshot. The only concrete, user-visible fact in these tweets is the tradeoff: free access paired with explicit logging, as reiterated on the model page.

Compute remains the bottleneck: B200 on-demand scarcity and “need more GPUs” talk

Compute constraints: Multiple posts reinforce that GPU availability remains a limiting factor for long-horizon agent workloads—one data point highlights B200 as the hardest to get on-demand, with a chart of minute-level availability shared in the availability chart.

A separate clip amplifies the same theme from the supply side, with Jensen Huang describing frontier labs as “compute constrained,” as shown in the CNBC clip.

Huang on compute constraint

Kimi Code “agent swarms” show 10 subagents coordinating on a single build

Agent swarms (Kimi Code / Moonshot): A concrete swarm workflow is getting shared where Kimi Code spins up 10 subagents in parallel that coordinate on a single deliverable (example: a ~3.1M-voxel scene), with the planning breakdown and output shown in the swarm screenshots.

Token burn and wall time are being called out explicitly—one report cites ~10K tokens and ~8–9 minutes for a single build, as noted in the runtime and token counts.

OpenRouter adds a Nitro toggle to route prompts to the fastest provider

Nitro routing (OpenRouter): OpenRouter is pushing a simple ops knob—select “Nitro” (or append “:nitro”) to route to the fastest provider by latency/throughput, as described in the Nitro tip; the backing comparison view lives in its performance rankings.

This is a small workflow change, but it directly affects agent loops where end-to-end wall time is dominated by model latency rather than model quality.
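
Since OpenRouter exposes an OpenAI-compatible API, the knob amounts to a model-string suffix. A minimal sketch, with an illustrative model slug (the :nitro suffix is as described in the tip):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key="sk-or-...",                      # your OpenRouter key
)

# Appending ":nitro" asks the router for the fastest provider by measured
# latency/throughput, instead of the default routing.
resp = client.chat.completions.create(
    model="some-vendor/some-model:nitro",     # illustrative slug
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```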

OpenRouter app leaderboard shows OpenClaw #1 by tokens, ahead of coding agents

Top apps by tokens (OpenRouter): A snapshot of OpenRouter’s “Top Apps” ranking shows OpenClaw as the top app by daily token usage, ahead of multiple coding-agent surfaces, per the leaderboard screenshot.

The signal here is distribution, not model quality: chat-native “do things” assistants appear to be pulling more throughput than IDE-adjacent coding agents in this particular marketplace view.

OpenRouter token usage is claimed to be growing ~10× per year

Usage scaling (OpenRouter): A shared chart claims OpenRouter token usage is growing at roughly 10× per year, as shown in the usage chart screenshot.

For teams building on aggregators, this is mostly an ops signal: more traffic and more model variety typically means more pressure on routing, spend controls, and reliability tooling.


Quality gates for agent code: PR review UX, CI automation, and verification loops

Tweets emphasize the bottleneck shift: code generation is cheap, but review/verification and PR ergonomics are the new constraints. Focus is on PR workflows, automated review/testing, and large-diff handling.

Overnight PR loop: bugbot reviews, auto-pushes fixes, and a test agent verifies

Verification loop (Cloud agents): A builder reports waking up to 5 PRs kicked off overnight; a “bugbot” reviewed each PR, suggested fixes, and auto-pushed changes, while a separate agent ran tests to validate functional correctness, as described in the Overnight PR automation—a concrete example of shifting the bottleneck from generation to automated review and CI.

Parallel redundancy: they also describe spawning “a bunch of agents on the same problem” and choosing the best fix, which turns review into selection plus verification rather than line-by-line authorship, as noted in the Overnight PR automation.
Reliability follow-up: the same thread later claims “big improvements” rolling out when issues occur, per the Stability fixes follow-up.

This lines up with broader reports that as agents get more autonomous, “regressions slip in” unless you’re watching the quality gates, as argued in the Quirks and regressions note.

GitHub Stacked Diffs enters alpha for early design partners

Stacked Diffs (GitHub): GitHub says Stacked Diffs will start rolling out to early design partners in an alpha next month, targeting workflows where changes are split into a sequence of smaller, reviewable pull requests, as announced in the Alpha rollout note.

Stacked diffs PR demo

For teams shipping with coding agents, this is a practical review primitive: it makes it easier to enforce incremental merge gates (tests, approvals, ownership) on agent-generated work instead of landing one giant diff all at once, as shown in the Alpha rollout note.

Conductor adds editing PR titles and descriptions inside the agent UI

Conductor (PR metadata editing): Conductor now lets you edit PR titles and descriptions without leaving the tool, cutting a common context switch in agent-driven workflows where the agent drafts a PR and the human tightens the narrative for reviewers, as shown in the In-app PR edits.

Edit PR details

This is a small feature, but it directly affects the “quality gate” surface area: reviewers rely on titles/descriptions to understand intent, scope, and verification steps, and Conductor is moving that edit loop closer to where the agent work is happening per the In-app PR edits.

GitHub rolls out performance improvements for large PR diffs

Large PR review UX (GitHub): GitHub says “perf improvements on large PRs are now rolling out,” which is directly relevant to AI-assisted coding where diffs tend to be bigger and more frequent, as stated in the Large PR perf rollout.

Large PR diff view

This is the unglamorous part of agent adoption: even if code generation is cheap, review latency becomes the limiter when the UI chokes on large diffs—exactly the scenario GitHub is pointing at in the Large PR perf rollout.

Warp adds first-class GitHub Copilot CLI support with review panel and image upload

Warp (Terminal IDE): Warp shipped first-class support for the GitHub Copilot CLI, bundling a file explorer and a code review panel in the same surface where the agent runs commands, as demoed in the Copilot CLI support.

Copilot CLI in Warp

Multimodal debugging hooks: the release also adds an image upload button plus built-in voice transcription (Wispr Flow), which can matter when PR review includes screenshots, UI diffs, or log captures, as shown in the Copilot CLI support.

This positions the terminal itself as part of the verification loop—run agent commands, inspect diffs, and review changes in one place—matching what Warp demonstrates in the Copilot CLI support.

Framework signal: quality gates, observability, and ownership for AI-assisted code at scale

Team norms (Agentic development): Addy Osmani argues that any team shipping AI-assisted code at scale needs explicit norms for quality gates, observability, and ownership, and points to a practical adoption framework independent of model choice, as stated in the Adoption framework note.

The underlying point is that verification work doesn’t disappear; it moves earlier (structured checklists, CI expectations) and later (traceability and diff review ergonomics), matching the need he flags in the Adoption framework note.

Manual multi-agent ping-pong: implement, code-review agent, UX agent, then integrate

Human-facilitated review loop: Hamel Husain describes manually “ping-ponging” work across agents—one implements, another performs code review, another does UX/design review, and the implementer folds the feedback back in—calling out that it feels silly to facilitate by hand, as written in the Manual ping-pong loop.

This is a concrete quality-gates pattern emerging in practice: splitting generation from critique and verification, then forcing an explicit integration step so review feedback becomes code changes rather than chat commentary, echoing the multi-role orchestration shown in the Orchestration screenshot.


🔌 Interop & control planes: MCP/agent steering, hooks, and tool contracts

Protocol-level and control-plane changes that affect how agents connect to tools and how operators steer/automate them (beyond any single coding assistant).

VS Code Insiders adds agent steering and message queueing for agent chat

VS Code Insiders (Microsoft): Insiders builds now show agent steering controls plus message queueing inside the Chat surface, aimed at keeping agent actions ordered and letting operators adjust the agent mid-flight, as demoed in the Insiders feature demo.

Agent steering and queueing demo
Video loads on view

This lands as a control-plane primitive: queueing reduces “two prompts at once” collisions, while steering gives you a UI-level intervention point when tool calls start drifting.

VS Code Insiders adds hooks to automate agent workflows in Chat

VS Code Insiders (Microsoft): A second Insiders drop shows hooks for automating agent workflows—event-driven glue that can run custom logic and report success back into the Chat loop, as shown in the Hooks workflow demo.

Automate agent workflows with hooks

In practice this pushes more “agent ops” into repeatable contracts (hook triggers + outputs), instead of relying on ad-hoc prompt rituals per run.

AI SDK adds a provider wrapper for any Open Responses-compatible API

AI SDK (Vercel ecosystem): The SDK now exposes a createOpenResponses provider wrapper so any Open Responses-compatible endpoint can sit behind a common interface (including localhost / alternate providers), as shown in the Code snippet example.

This is a small but meaningful tool-contract move: it standardizes the “responses” surface area (model id, base URL, generateText integration) so teams can swap backends without rewriting app-level call sites.


🧩 Plugins & Skills ecosystem: teach agents new capabilities safely and repeatably

Installable capability bundles and skill-learning workflows—what you add to agents to make them competent on specific tools/domains. Excludes MCP servers (separate) and security incidents around skills (covered in security).

A self-learning skill template generates new SKILL.md by browsing docs

Self-learning skill template: A concrete “web → skill” implementation dropped as an installable package: a “/learn gemini api”-style flow that browses docs and produces a reusable skill artifact, with install instructions shown in the Installable skill recipe and pointers to the GitHub repo plus an example output in the Generated skill gist.

This is a pragmatic way to keep agent competence portable (you install a skill bundle) while still letting the skill be created from up-to-date sources, instead of baking vendor docs into prompts.

Hyperbrowser adds /learn to turn web docs into auto-updating agent skills

Hyperbrowser: The project is now pitching a “skills from the web” loop where you run /learn <topic> and it generates a reusable skill that can also be kept current automatically, as shown in the Learn command example and reiterated in the HyperSkill shipping mention. This lands as an opinionated alternative to ad-hoc browsing/RAG prompts—packaging the result into something you can re-run across agents and sessions.

What’s still not concrete from the tweets is how updates are scheduled and stored (where the skill artifact lives, and what triggers refresh), so treat this as a workflow claim pending fuller docs.

Mastra introduces Workspaces: constrained FS + sandbox + reusable skills

Workspaces (Mastra): Mastra announced “Workspaces” as a packaging primitive that gives agents a constrained filesystem, a sandbox boundary, and a place to reuse skills—local backends now, with remote backends (Daytona/E2B/R2) called out as upcoming in the Workspace announcement.

The companion post linked in the Workspaces post link frames this as a security-and-repeatability baseline for agent runs, with details in the Workspaces blog.

Mastra publishes an npx-installable skills library

Skills library (Mastra): Mastra is distributing a bundled skills set via npx skills add mastra-ai/skills, positioning skills as a shareable dependency you can pull into agent projects, as shown in the Install command and described in the linked Skills blog.

This is a notable “package manager” direction for skills: a standard install surface plus a known namespace, rather than copy-pasting SKILL.md content across repos.


📊 Benchmark churn & eval hygiene: Arena swings, harness effects, and long-context tests

Today’s eval content is dense: multiple leaderboards and benchmark deltas for frontier models, plus reminders that harness/scaffold differences change results. Excludes enterprise deployment (feature) and day-to-day coding tool UX (separate).

Claude Opus 4.6 takes #1 across Arena Text, Code, and Expert leaderboards

Claude Opus 4.6 (Anthropic): Arena posts show Opus 4.6 reaching #1 across Text, Code, and Expert; the milestone callout cites a +106 jump vs Opus 4.5 in Code Arena and a 1496 Text Arena score edging Gemini 3 Pro, as reported in the Arena leaderboard update and echoed with a visible Text table in the Text arena screenshot.

Code + Expert splits: The Code Arena table shared in the Code arena screenshot shows Opus 4.6 leading (1576), while the same update thread claims roughly a ~50 point lead in Expert, as described in the Arena leaderboard update.
What moved in Text: Arena also highlights Opus 4.6 topping specific Text subcategories like instruction following and longer queries, according to the Text subcategory note.

FrontierMath: Opus 4.6 reaches parity with GPT-5.2 xhigh on Tiers 1–3

FrontierMath (Epoch AI): Epoch reports Opus 4.6 scoring 40% on Tiers 1–3, statistically tied with GPT-5.2 (xhigh) at 41%, and 21% on Tier 4 (10/48), again statistically tied with GPT-5.2 (xhigh) at 19%, as summarized in the FrontierMath results and expanded in the Tier 4 breakdown. The same thread notes these runs used a scaffold at “high” effort with a 32K reasoning token budget, as clarified in the Scaffold settings note and detailed on the linked Eval details page.

A notable comparison point is the claimed jump versus Opus 4.5’s Tier 4 score (4%), called out in the Math improvement note.

Terminal-Bench 2.0 reruns highlight that harness differences change scores

Terminal-Bench 2.0 (eval hygiene): A comparison notes that OpenAI and Anthropic’s posted Terminal-Bench 2.0 numbers used different harnesses, and that rerunning both in a single harness (“Terminus 2”) yields scores that are “within noise,” with an example chart showing 75.1% for “simple codex” vs 64.7% for “terminus 2 (gpt-5.3-codex)” and 62.9% for “terminus 2 (opus 4.6),” as shown in the Harness comparison chart.

A follow-up comment adds that the tool model matters too: “simple codex” is not a pure terminal agent, while Terminus 2 is effectively “tmux-only,” which can change headroom for tool use, per the Harness differences explanation.

GDPval-AA: Opus 4.6 evaluation run cited at ~160M tokens and $1K+ cost

GDPval-AA (Artificial Analysis): Following up on GDPval lead (Opus 4.6 leading the agentic “jobs” suite), a new breakdown claims the full run consumes ~160M tokens, uses 30–60% more tokens than Opus 4.5, and costs $1,000+ for a full evaluation pass, per the Cost and tokens breakdown.

The same post argues the delta shows up in “practical polish” (example: generating a color-coded PDF schedule versus basic tables), as described in the Cost and tokens breakdown, while Artificial Analysis points to its broader results set in the Full results links.

ARC-AGI-2 charts emphasize $/task and fixed thinking budgets for Opus 4.6

ARC-AGI-2 (ARC Prize framing): A widely shared ARC-AGI-2 scatter plot frames results as score vs cost per task, with Opus 4.6 plotted around the mid-to-high 60s at a few dollars per task (120K thinking budget variants), while GPT-5.2 “Refine” sits higher-cost, as shown in the ARC-AGI-2 cost chart.

A separate summary states Opus 4.6 hit 93.0% on ARC-AGI-1 and 68.8% on ARC-AGI-2 at max effort using a fixed 120K thinking budget, and that token budget shifts performance more than the “effort” label, per the ARC-AGI scores claim and the fuller breakdown in the Cost per task notes.

Artificial Analysis plots Opus 4.6 as high-scoring with lower output tokens

Claude Opus 4.6 (Anthropic): Artificial Analysis highlights a scatter plot of “intelligence index vs output tokens used,” placing Opus 4.6 in a “most attractive quadrant” (high score, comparatively lower output tokens), while some GPT-5.2 xhigh variants sit further right (more tokens), as shown in the Tokens vs intelligence plot.

The same post frames this as an efficiency win for non-thinking mode in particular (“even more efficient”), per the Tokens vs intelligence plot.

EQ-Bench and creative-writing leaderboards show Opus 4.6 opening a lead

Claude Opus 4.6 (Anthropic): Multiple community benchmark boards report a large jump for Opus 4.6 in emotional-intelligence and writing tasks, with a leaderboard screenshot showing EQ-Bench Elo 1961 for “claude-opus-4-6,” far ahead of the next entry, as shared in the EQ and writing leaderboards.

Creative writing deltas: A separate write-up calls Opus 4.6 Thinking 16K a new short-story leader with an 8.56 score versus 8.20 for Opus 4.5 Thinking 16K, per the Short-story benchmark update and the chart shown in the Creative writing chart.
Sanity check: A long-form qualitative critique also lists concrete failure modes (continuity errors, physical contradictions) despite strong averages, as cataloged in the Error examples list.

Chess puzzle evals: Opus 4.6 still lags despite math benchmark gains

Reasoning generalization gap: A chess-puzzle benchmark plot shows Claude Opus 4.6 (thinking) around ~17% accuracy on 100 novel puzzles, well below several OpenAI and Google points (for example GPT-5.2 (xhigh) near ~50%), as shown in the Chess puzzles scatter.

The takeaway being argued is that Opus 4.6’s math improvements don’t transfer uniformly to other structured reasoning tasks, per the framing in the Chess puzzles scatter.

SimpleBench: Opus 4.6 moves to #2, still behind Gemini 3 Pro

SimpleBench (common-sense traps): A posted leaderboard shows Claude Opus 4.6 at 67.6% in 2nd place, up about 5.6 points versus Opus 4.5’s 62.0%, while Gemini 3 Pro Preview leads at 76.4%, as shown in the SimpleBench table and restated in the Score callout.

The same post positions the result as “2nd place,” with the delta vs Opus 4.5 called out explicitly in the SimpleBench table.


🏗️ Compute & capex signals: GPU scarcity, hyperscaler spend, and capacity constraints

Infra-focused signals dominate: hyperscaler capex projections, GPU availability constraints, and quotes that model revenue is compute-limited. Excludes consumer gadgets and non-AI tech news.

Hyperscalers signal ~$650B 2026 capex wave aimed at AI datacenters

Hyperscaler capex (Alphabet/Amazon/Meta/Microsoft): Bloomberg-based estimates point to ~$650B of 2026 capex across the big four—positioned as AI data centers, servers, and chips—while investors debate whether payback arrives fast enough, as summarized in the Capex wave breakdown.

The same spend spike shows up in the charted guidance shared in the Capex doubles claim and the company-by-company rollup in the Company capex chart (Amazon ~$200B; Google ~$180B; Meta ~$125B; Microsoft ~$117.5B). Scaling01 frames the magnitude as ~2% of US GDP in a back-of-the-envelope comparison to Apollo/Manhattan in the GDP share comparison.

Why engineers feel it: the Capex wave breakdown highlights physical bottlenecks (power, cooling, networking, construction timelines) that translate into longer lead times and less predictable GPU procurement.
Why analysts care: the same post flags a shift from “mostly software” to “infrastructure builders,” which moves the valuation debate to ROIC and financing sensitivity rather than pure gross margin narratives.

B200 looks hardest to get on-demand as GPU availability tightens

GPU availability telemetry: On-demand capacity is tightening across generations, with B200 called out as “the hardest one to get on-demand,” according to the availability time-series shared in the Availability chart thread.

The chart’s framing (minutes-per-hour available across multiple cloud providers) implies that even teams willing to pay headline rates can hit bursty “no capacity” periods; the broader vibe of “we need more GPUs” shows up bluntly in the Need more GPUs comment.

What changes operationally: the data in the Availability chart thread suggests capacity planning is becoming a reliability concern (not only a cost concern), especially for long-horizon agent runs that can’t easily be rescheduled mid-flight.
What’s unclear from tweets: the post doesn’t break out which providers drive the B200 scarcity most, or how much reserved capacity mitigates it versus pure on-demand.

Jensen Huang: frontier labs are compute constrained; more GPUs would 4x revenue

NVIDIA (Jensen Huang): Huang argues Anthropic and OpenAI are “so compute constrained,” claiming that if they had the compute, revenues could go up ~4×, as stated in the CNBC clip.

Jensen on compute constraint

The practical takeaway is that demand is being described as elastic to available capacity (more clusters → more product consumption), rather than saturated at today’s inference volumes—an angle that aligns with the broader “capacity gets soaked up faster than it’s added” chatter in the Availability chart thread.

Chip industry projected to hit ~$1T revenue in 2026, driven by AI datacenters

Semiconductors (SIA/Reuters): Reuters reports the Semiconductor Industry Association expects global chip sales to reach about $1T in 2026, up from $791.7B in 2025, with “advanced computing” and memory both growing sharply (advanced computing $301.9B, +39.9%; memory $223.1B, +34.8%), as detailed in the Reuters summary.

For AI builders, the key point embedded in the Reuters summary is that the constraint shifts from “GPUs exist” to packaging, power delivery, and memory supply keeping up with datacenter build cycles.

Bank of America pegs 2024–2030 AI infra buildout at ~$2T inflation-adjusted

AI infrastructure buildout (BofA Research): A Bank of America Research table shared by rohanpaul_ai puts AI infrastructure capex at roughly $2.0T (inflation-adjusted) across 2024–2030, comparing it to historical US megaprojects (space program, Interstate Highway System), as shown in the Cost comparison table.

A footnote in the same graphic notes the inflation adjustments were sourced via “MSFT Copilot/ChatGPT 5.2,” per the Cost comparison table, so treat the precise ordering as approximate even if the directional message (multi-year infra wave) is clear.


🛡️ Security & safety incidents: jailbreaks, skill supply-chain risks, and cyber gating

Security news today centers on practical failure modes: jailbreak techniques, prompt/system prompt leakage, and “skills” as an attack surface (malware, exfiltration, escapes). Includes cyber-access governance updates; excludes any harmful procedural details.

Universal jailbreak claim against Claude Opus 4.6 raises safety questions

Claude Opus 4.6 (Anthropic): A security researcher claims a “universal jailbreak” that can generate many policy-violating outputs from a single input—framed as “one input = hundreds of jailbreaks at once” in the Jailbreak claim; the thread’s core allegation is that it can mass-produce outputs across multiple harm categories, which makes it a scaling risk if true.

The post includes an example excerpt showing unusually operational detail for illicit activity in the Example excerpt, but the thread itself is still a single-actor claim with no reproduced harness or independent verification in the tweets.

A smaller follow-on note (“literally 1984”) in the Reaction post signals the author is positioning this as a broader critique of current guardrail effectiveness, not a narrowly scoped bug report.

ClawHub skill malware reports highlight exfiltration and sandbox-escape risks

ClawHub skills (OpenClaw ecosystem): A warning thread claims ClawHub has already seen malicious skills, including credential harvesting, container escape attempts, and sleeper instructions planted into persistent memory, as outlined in the Malware warning.

The post adds that one credential-harvesting incident allegedly appeared in a popular “Twitter” skill in the Incident detail, and it cites external writeups via the Cisco blog and the Threat modeling post that frame “SKILL.md”-style instructions as an agent-native attack surface.

This is less about model weights and more about operational reality: once agents can run commands and hold secrets, “prompt + skill + runtime” becomes the security boundary.

Playbooks skills flagged unsafe over prompt injection and unsigned executable steps

Playbooks skills registry: A set of skills were flagged as unsafe after automated checks found prompt-injection risk patterns and explicit instructions to download/run unknown executables, as shown in the Unsafe skill modal.

The same thread frames this as a broader supply-chain issue for “skills as packages,” noting the need for prompt-injection checks across registries in the Unsafe skill modal; a follow-up question about identifying the exact offending skill and validating detection coverage is in the Follow-up question, which points to a registry URL via the Playbooks site.

Net: skill registries are starting to look like dependency ecosystems—policy linting and provenance checks are becoming table stakes.

Claude Opus 4.6 system prompt repost spreads more concrete behavior constraints

Claude Opus 4.6 (Anthropic): A long Claude system prompt dump is circulating, with specific “behavior shaping” lines being highlighted in the Prompt excerpts—including instructions about how to respond to user abuse, when to verify whether an image actually exists, and even how refusals should be formatted.

The full text is linked as a GitHub file in the GitHub prompt text, which gives engineers concrete clues about product-level scaffolding (tool-use rules, file-handling conventions, and refusal UX) that can affect observed model behavior in the wild.

This is useful for debugging “why did it answer like that?” reports, but it also increases the chance that prompt-targeted attacks and jailbreak prompt iteration converge on the same known constraints.

Guardrail effectiveness debate shifts to “deterring low-effort attackers”

Guardrail effectiveness: A researcher asks whether current guardrails mainly deter low-effort misuse rather than determined attackers, and whether there’s a theory or empirical way to measure that, as posed in the Guardrail question.

The thread implicitly ties to the jailbreak discourse around Opus 4.6 and similar models—if bypass techniques are easy to share, then “time-to-bypass” and “attacker effort” may matter more than binary pass/fail evals.

A partial response suggests defenses may be concentrated in narrower risk areas in the Defense scope reply, reinforcing that teams should expect uneven robustness across domains rather than uniform safety behavior.

Reported remote code execution on an OpenClaw bot spotlights agent hardening gaps

OpenClaw bot security: A developer claims they achieved remote code execution on another OpenClaw bot in the RCE claim, posting a screenshot of a chat interaction that’s presented as evidence.

The tweet doesn’t include a technical writeup, reproduction steps, or a CVE-style description—so treat it as an unverified incident report—but it does underline a recurring risk for “bots with tools”: input validation, tool permissions, and sandbox boundaries need to hold even under adversarial prompts.


🧪 Other model moves: stealth checkpoints, open-model ranking updates, and routing knobs

Non-feature model updates beyond the main coding-assistant chatter: stealth models appearing on routers, open-model leaderboard movement, and provider-side behavior notes. Excludes benchmark deep-dives (separate).

OpenRouter launches Pony Alpha stealth model with free access and prompt logging

Pony Alpha (OpenRouter): OpenRouter added Pony Alpha as a new “stealth model”; it’s free, lists a 200,000-token context window, and explicitly warns that the provider logs all prompts and completions for potential improvement use, per the launch announcement and the model listing screenshot.

Claims about what it “really is” are still community attribution, not confirmed: multiple threads guess it may be GLM-5 based on self-identification patterns and China-sensitive refusals, as argued in the attribution thread and echoed in the identity screenshot. Some early testers also describe outputs as “Opus-like” in SVG/detail tasks, but those are anecdotal comparisons rather than a published eval, as seen in the SVG comparison reaction.

Kimi K2.5 Instant gets top-tier “open model” callouts across vision, text, and code

Kimi K2.5 Instant (Moonshot/Kimi): Arena shared a snapshot positioning Kimi K2.5 Instant as a top open-weight option across multiple leaderboards—#2 open in vision, #3 open in text, and #4 open in code—while still landing outside the overall top tier dominated by proprietary models, per the Arena announcement.

The message is specifically about the non-thinking “Instant” variant and comparative placement near proprietary baselines, as reiterated in the text ranking note. Moonshot is also amplifying the broader K2.5 momentum in a short claim post, as shown in the ranking claim.

Claude Sonnet 5 checkpoint briefly surfaced in a model tracker, then vanished

Claude Sonnet 5 (Anthropic): A “Model Finder” screenshot shows a checkpoint labeled claude-sonnet-5-20260203 appearing briefly and then being removed, which is being treated as a possible stealth listing or internal artifact, as shown in the model tracker screenshot.

There’s no confirmation it was ever broadly usable. The only concrete data point here is the model string and the fact it disappeared from that tracker feed.

Gemini 3 UI exposes Auto routing with Fast/Thinking/Pro modes

Gemini 3 (Google): A Gemini settings UI shows a new Auto mode that “adapts to your needs,” alongside explicit user-selectable modes—Fast, Thinking, and Pro—suggesting a product-level abstraction over model selection and reasoning effort, as shown in the settings screenshot.

This looks like Google pushing “routing” into the default UX rather than making users pick named checkpoints. The screenshot doesn’t show pricing, quotas, or what models map to each mode.

OpenRouter’s “:nitro” routes requests to the fastest provider

Nitro routing (OpenRouter): OpenRouter is promoting a routing knob where selecting “Nitro” (or appending :nitro) sends a model request to the fastest available provider based on their latency/throughput tracking, as described in the Nitro tip and backed by the linked performance rankings.

This is a provider-side behavior change, not a model change. It’s also an implicit trade: you get speed via dynamic routing, but reproducibility can get harder if provider choice shifts over time.

MiniMax posts a “one-shot city” prompt-to-video demo

One-shot video demo (MiniMax): MiniMax shared a “One shot, one city” example that frames its video model as fast prompt-to-clip generation, with the montage shown in the video demo and a pointer to the demo site.

One-shot city montage

The post is a capability teaser rather than a spec drop: it doesn’t include model name/version, pricing, or any repeatability details.


⚙️ Runtimes & sandboxes: running code safely (WASM, subsets, and tool feedback loops)

Engineering-focused runtime work: sandboxes and interpreters that make agent code execution safer/more portable, plus practical notes on constrained runtimes where models adapt via error messages.

Monty now runs in the browser via WebAssembly (including a Pyodide-friendly build)

Monty (WASM runtime): Simon Willison reports getting Monty compiled to WebAssembly in both “regular” and “Pyodide-friendly” variants, plus an interactive playground that executes entirely client-side, as described in his WASM build write-up.

Portable sandbox: The demo makes Monty usable as a drop-in “run this code safely” substrate for agent UIs, since execution stays in-browser and state is shareable via URL, as shown in the Playground screenshot and live in the WASM demo.
Practical debugging loop: Willison’s note that it’s a subset runtime pairs well with the “tight feedback loop” approach—run, get errors fast, rewrite—captured in the broader sandbox discussion in Subset runtime take.

Monty ships: a Rust Python subset built for running LLM-written code

Monty (Pydantic): Samuel Colvin announced Monty, a new Python implementation written from scratch in Rust, positioned as a practical runtime for “LLMs to run code with,” per the Monty announcement (also echoed in the Repost). The immediate engineering value is a smaller, more controllable execution surface than full CPython—useful for agent toolchains where you want predictable behavior, constrained APIs, and clearer sandbox boundaries.

Constrained runtimes: let models adapt by rewriting from error messages

Sandbox design pattern: A recurring argument is that a subset of Python can be “good enough” for agent execution because models can iteratively rewrite their code to fit the allowed surface area by reacting to runtime errors, as framed in Subset runtime take. The practical takeaway is that sandbox design can bias toward simpler, safer primitives (fewer modules, fewer syscalls) while relying on the model’s compile/run feedback loop to converge—especially when the runtime returns clear, structured errors.
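
A minimal sketch of that feedback loop, with the sandbox and model calls as hypothetical stand-ins (only the loop shape is the point):

```python
# Sketch of the run/error/rewrite loop. `run_in_sandbox` and `ask_model` are
# hypothetical stand-ins for a constrained runtime (e.g. a Python-subset
# interpreter) and an LLM call; only the loop shape matters here.

MAX_ATTEMPTS = 5

def converge(task: str, run_in_sandbox, ask_model) -> str:
    code = ask_model(f"Write code in the restricted Python subset to: {task}")
    for _ in range(MAX_ATTEMPTS):
        ok, detail = run_in_sandbox(code)  # (True, output) or (False, error text)
        if ok:
            return detail
        # Clear, structured errors are what let the model converge: feed the
        # failure back and ask for a rewrite within the allowed surface area.
        code = ask_model(
            f"Your code failed in the restricted runtime with:\n{detail}\n"
            f"Rewrite it using only the allowed subset. Task: {task}"
        )
    raise RuntimeError("did not converge within the attempt budget")
```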


🗂️ Data extraction & grounded outputs: citations, brand scraping, and 3D datasets

Practical data-layer work for agents: extracting structured outputs with provenance (bounding boxes/citations), plus new datasets relevant to 3D generation/perception. Excludes general benchmark leaderboards (separate).

LlamaExtract now returns citation bounding boxes for every extracted field

LlamaExtract (LlamaIndex): LlamaIndex shipped an extraction upgrade that returns citation bounding boxes alongside each extracted key/value, so reviewers can hover a field and see the exact source span highlighted in the original document, as shown in the extraction demo. This pushes document AI from “structured JSON” toward “auditable structured JSON,” which matters when you’re processing high-volume corpora (invoices, IDs, claims, contracts) and need fast spot-checking rather than blind trust.

Citation boxes in UI

The main engineering implication is that provenance becomes a first-class output artifact: you can log (field → box coords → page) into your review UI, or store it as evidence in downstream workflows (QC queues, exception handling, or human sign-off).
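
A sketch of what that logging might look like; the response shape below is illustrative rather than LlamaExtract’s exact schema:

```python
# Illustrative only: this is not LlamaExtract's exact response schema, just
# the (field -> value + citation boxes) shape described above, flattened into
# evidence rows a review UI or QC queue can consume.
import json

extraction = {
    "invoice_number": {"value": "INV-0042",
                       "citations": [{"page": 1, "bbox": [0.12, 0.08, 0.34, 0.11]}]},
    "total_due": {"value": "1,250.00",
                  "citations": [{"page": 2, "bbox": [0.70, 0.88, 0.92, 0.91]}]},
}

evidence_log = [
    {"field": field, "value": item["value"], "page": c["page"], "bbox": c["bbox"]}
    for field, item in extraction.items()
    for c in item["citations"]
]
print(json.dumps(evidence_log, indent=2))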

Firecrawl Branding Format v2 improves brand extraction on no-code sites

Branding Format v2 (Firecrawl): Firecrawl updated its brand-identity extraction endpoint to better handle modern site builders (including Wix and Framer), reduce false-positive logo hits, and catch logos embedded in background images, per the release clip. For teams building agentic onboarding, “generate on-brand assets,” or brand-change monitoring, this is a concrete quality bump because failures here cascade into downstream prompt context.

Branding extraction walkthrough

Integration surface: the intended usage patterns (personalized onboarding pages, on-brand creative generation, competitor monitoring) are laid out in the docs snippet, with implementation details in the Extract brand identity docs.
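
A hedged sketch of the integration shape; the endpoint path and response fields below are assumptions, so defer to the Extract brand identity docs for the real contract:

```python
# Hedged sketch: the endpoint path, request body, and response fields below
# are assumptions, not Firecrawl's documented contract; only the integration
# shape (URL in, logos/colors out, feeding downstream prompt context) is real.
import os
import requests

resp = requests.post(
    "https://api.firecrawl.dev/v2/branding",  # assumed path; see the docs
    headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
    json={"url": "https://example.com"},
    timeout=60,
)
resp.raise_for_status()
brand = resp.json()

# Assumed fields; a brand-change monitor would diff these between runs.
print(brand.get("logos"), brand.get("colors"))
```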

Tencent open-sources HY3D-Bench with 252k+ filtered 3D objects and part annotations

HY3D-Bench (Tencent Hunyuan): Tencent released HY3D-Bench, an open dataset targeting 3D asset generation data scarcity—252k+ high-fidelity 3D objects, 240k part-level decompositions for controllable generation, and 125k synthetic assets for class balance, as described in the dataset announcement. They also published Hunyuan3D-2.1-Small as a lightweight baseline to make results reproducible.

For engineers training or evaluating 3D generators, the notable part is that the dataset is explicitly framed as training-ready and evaluation-consistent (filtered objects + structured parts), with entry points in the GitHub repo and the Dataset download.


🤖 Embodied AI & world simulation: autonomy training, humanoids, and ‘physical AI’ framing

Embodied AI posts are split between (1) world/simulation models for autonomy and (2) real robot capability demos (humanoids/manipulation). Excludes creative video generation (separate).

Waymo World Model uses Genie 3 to generate promptable driving sims for rare events

Waymo World Model (Google DeepMind × Waymo): DeepMind says Genie 3 is now being used to generate photorealistic, interactive environments for AV training, with prompts for “what if” scenarios like extreme weather or reckless drivers, as shown in the launch thread. This is aimed at rehearsing rare, high-risk edge cases before the fleet encounters them in the real world.

World model simulation demo

DeepMind frames the technical bridge as transferring Genie 3 “world knowledge” into Waymo-specific sensor realism (camera + 3D lidar aligned to Waymo hardware), per the launch thread and the linked Blog post. The operational implication is more controllable scenario generation (language-conditioned) while still producing sensor-like data that downstream autonomy stacks can consume, as reiterated in the world model takeaway.

Boston Dynamics Atlas clip highlights cleaner gymnastics and backflip control

Atlas (Boston Dynamics): A new Atlas sequence shows controlled gymnastics capped with a clean backflip, with observers calling out how quickly the capability has been improving, per the Atlas backflip clip. The visible emphasis is balance recovery and landing stability, which is the hard part.

Atlas gymnastics and backflip

For autonomy teams, these demos are less about “one trick” and more about robustness: repeated execution, fewer resets, and tighter error tolerance are the difference between a lab video and a system that can run continuous shifts.

Jensen Huang frames “physical AI” as the next frontier beyond LLMs

Physical AI framing (NVIDIA): Jensen Huang argues the next frontier is systems that model the physical world and causality—pointing out that humans intuit basic physics while LLMs don’t, as captured in the physical AI clip. It’s a clear push toward “world-model-first” thinking for robotics and autonomy, not just larger text models.

Huang on physical AI

The practical subtext for builders is that evaluation and training targets shift from “did it produce the right text” to “did it predict consequences under interventions,” which maps directly onto simulation-based training loops (and why world models keep showing up in autonomy stacks).

Mistral shows a dual-armed physical agent demo and signals robotics ambitions

Mistral robotics demo (Mistral AI): Mistral is publicly showing a dual-armed manipulation agent (block stacking / tabletop tasks), positioning it as entry into the “physical agent” race, per the robotics demo post. This is a tangible shift from model-only messaging.

Dual-armed manipulation demo

What’s notable for engineers is the product posture: even simple bimanual tasks imply a stack that can do perception → planning → low-level control with enough temporal consistency to avoid drift and oscillation. The clip doesn’t reveal the training recipe. It does show intent.

UBTECH “Chitu” showcases multi-robot collaboration at Foxconn under UPilot OS

Chitu logistics system (UBTECH × Foxconn): UBTECH describes an unmanned logistics workflow built via multi-robot collaboration—humanoid Walker S2 coordinating with mobile lifter Wali U1500—under a UPilot “operating system” orchestration layer, as shown in the factory demo post. It’s pitched as spanning warehousing through assembly with minimal human intervention.

Factory multi-robot coordination

For autonomy leaders, the key detail is orchestration: coordination across heterogeneous robots is often the real bottleneck (handoffs, recovery, and task ownership), and this demo centers that rather than a single robot’s peak capability.

Genie 3 prompting notes: character + environment control and event shaping

Genie 3 prompting (world simulation): A practitioner write-up compiles what works and what doesn’t when prompting Genie 3—how to specify character + environment, and how to aim for both expected and inferred events to get more interesting and controllable worlds, per the prompting notes. It reads like early “prompt ops” for world models.

Genie 3 prompting montage

The main engineering takeaway is that world simulation models are starting to need their own control vocabulary (entities, affordances, event priors), not just “make a nice scene.” That’s a different interface surface than text prompting.

XPeng IRON humanoid demo focuses on natural gait and body motion control

IRON humanoid (XPeng): Clips of XPeng’s IRON emphasize “natural” human-like movement practice—specifically gait and body motion control—framed as continued iteration on locomotion realism, per the humanoid movement clip. The point is motion quality, not manipulation.

Humanoid walking demo

For robotics engineers, this kind of footage is often a proxy for how much time is being spent on control tuning, whole-body balance, and pose transitions—things that tend to fail first when you move from choreographed demos to long-horizon tasks.


🎬 Generative media: Kling 3.0 workflows, video-with-audio evals, and creator pipelines

High volume creative tooling posts: Kling 3.0 multi-shot workflows, video-with-audio leaderboards, and practical pipelines (ads, upscaling, character driving). This category is non-feature and separate from robotics/world-model simulation.

Artificial Analysis launches Video-with-Audio leaderboard; Veo 3.1 Preview leads

Video with Audio Arena (Artificial Analysis): following up on Video+audio arena, the comparison is now presented as a live leaderboard, with Veo 3.1 Preview leading both text→video-with-audio and image→video-with-audio, according to the Leaderboard announcement. The evaluation prompt is designed to stress text legibility under mirror warping plus reflection lip-sync, as spelled out in the Benchmark prompt.

Veo 3.1 preview example

Current top stacks: the post lists text→video-with-audio as Veo 3.1 Preview, Veo 3.1 Fast Preview, then Vidu Q3 Pro, and image→video-with-audio as Veo 3.1 Preview, Veo 3.1 Fast Preview, then grok-imagine-video, per the Leaderboard announcement.
What’s not measured yet: the same thread says Kling 3.0 and “Veo 3.1 (Non-Preview)” are “coming soon,” so the ranking is provisional relative to new releases, per the Leaderboard announcement.

ComfyUI releases an upscaling handbook with downloadable production workflows

ComfyUI upscaling (ComfyUI): ComfyUI published “The Complete AI Upscaling Handbook,” positioning it as a deep dive with benchmarks, 10 real-world use cases, and 20 production workflows, per the Handbook announcement; follow-on posts include specific workflow packs (image and video) intended to be imported and run as-is, per the Video restoration workflow drop.

Upscaling before-after reel

The emphasis is end-to-end pipeline reproducibility (pick a method, grab a workflow, run it in ComfyUI) rather than model-only comparisons, based on the framing in the Handbook announcement.

DreamActor M2.0 lands on fal for image-guided character driving

DreamActor M2.0 (fal): fal is now serving DreamActor M2.0 for video-to-video “driving” from a single image plus a template video, with multi-character and non-human support and claims around pose replication and identity/background preservation in the Launch clip; the hosted endpoint is runnable from the Model page.

Pose and identity demo

The immediate integration angle is replacing bespoke pose/face pipelines with one API surface for character motion transfer, based on the capability list in the Launch clip.

Kling 3.0 Multi-cut prompting: shot durations, camera directives, and bound elements

Kling 3.0 Multi-cut (Kling): a creator workflow shows a repeatable pattern for coherent multi-shot sequences by writing prompts as explicit shot blocks with durations (for example 3s/8s/3s), camera placement/motion, and cut instructions—then preserving identity/branding via “bind” style reference images, as laid out in the Workflow thread and expanded with concrete shot prompts in the Multi-cut prompt examples.

Multi-cut montage output

Shot scripting format: the prompts use “Shot 1/2/3” blocks with time budgets and camera intent (tracking, over-the-shoulder, handheld), per the Multi-cut prompt examples; an illustrative block follows after this list.
Consistency control: the thread calls out binding elements as the mechanism to keep the object prompt stable across cuts, per the Workflow thread.
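
An illustrative shot-block prompt in the described format; the subject, timings, and camera directives here are invented for illustration:

```text
Shot 1 (3s): low tracking shot following the sneaker across wet pavement; cut on motion.
Shot 2 (8s): over-the-shoulder on the runner lacing up; slow push-in; bind [sneaker reference image] to keep product identity stable across cuts.
Shot 3 (3s): handheld close-up on the logo; hard cut to black.
```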

Sora allows animating photos with real people under consent attestation

Sora (OpenAI): Sora now lets eligible users upload images containing real people to generate videos, gated behind an attestation flow that the uploader has consent and rights; the UI also notes that outputs are “stylized” while the feature is in beta, as shown in the Feature gate screenshot.

For teams building media pipelines around Sora, this is a concrete policy+product change: the constraint shifts from “no people images” to “people images, but with explicit consent capture and stricter moderation,” per the Feature gate screenshot.

A Grok Imagine ad pipeline: script first, then VO, then visuals

Grok Imagine (xAI): a creator shared a concrete “commercial in 7 steps” pipeline built around Grok Imagine—starting with the script, recording VO, then generating images and animating them, and using image-edit iterations to keep style consistent across new scenes, as detailed in the Seven-step thread and reinforced by the “VO before visuals” clip in the VO-first process.

Grok Imagine workflow walkthrough

The key operational takeaway is that this is a repeatable sequencing pattern (audio lock-in first, then shot generation) rather than a single prompt trick, per the step-by-step breakdown in the Seven-step thread.

Freepik Spaces adds Lists and shows a Kling 3.0 batch-to-animation workflow

Freepik Spaces Lists + Kling 3.0 (Freepik): a workflow demo pairs a new “Lists” feature (batching repeated prompt structures) with Kling 3.0 for turning multiple variants into short video outputs inside the same project workspace, as shown in the Workflow demo post.

Lists to Kling workflow

The pitch is fewer manual repetitions when you need many near-identical assets (for example character variations or scene variations) and then want to animate them without leaving the tool, as described in the Workflow demo post.

Higgsfield pitches an 85% off 2-year offer for unlimited Kling 3.0

Kling 3.0 (Higgsfield): Higgsfield is advertising a 2-year “Creator plan” offer at 85% off for “unlimited” access to Kling 3.0 and Kling 3.0 Omni, with claims of 15-second generations, native audio, lip-sync, and multi-angle outputs, as described in the Plan offer post.

Offer feature slides

The practical engineering relevance is cost predictability for high-volume video pipelines (especially if you’re iterating many variants per concept), but the tweet doesn’t provide throughput, concurrency limits, or any SLA details beyond the “unlimited” positioning in the Plan offer post.

Replicate publishes grok-imagine-video with native-audio output

Grok Imagine Video (Replicate): Replicate is now hosting grok-imagine-video as an API-callable model, describing text-to-video and image-to-video generation with native audio, per the Replicate launch post.

Replicate sample clip

This is a distribution change (availability via Replicate’s API surface) rather than a new model spec; the tweet doesn’t include pricing, rate limits, or queue behavior beyond the “in seconds” framing in the Replicate launch post.


📄 Research notes: memory control, distillation ideas, and “what models default to”

Paper-and-preprint discussion today clusters around agent memory/control, distillation/RL data strategies, and characterizing model priors. Excludes product benchmarks (separate) and any bioscience content.

AMemGym proposes an on-policy, interactive benchmark for assistant memory

AMemGym (benchmark/paper): AMemGym frames “assistant memory” evals as an on-policy interaction problem (the assistant’s choices change what happens next), arguing static transcript-based scoring can mis-rank systems and hide failure modes, as described in the AMemGym summary.

Why it changes rankings: The thread claims off-policy setups introduce “reuse bias” (everyone is graded on the same prewritten conversation), while AMemGym runs a live simulated user for each system so memory write/retrieve decisions affect downstream turns, per the AMemGym summary and Off-policy vs on-policy.
Artifacts to inspect: The resource list includes the OpenReview paper and a reproducible implementation in the GitHub repo, with additional framing in the Benchmark comparison notes.

The operational takeaway is that memory evaluation is being treated less like “can you answer from a long context” and more like “did you choose to store/recover the right facts during interaction,” per the AMemGym summary.
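
A minimal sketch of the on-policy setup, with every function a hypothetical stand-in rather than AMemGym’s API:

```python
# Hypothetical stand-ins throughout; this is the on-policy shape, not
# AMemGym's API. Each system gets its own live trajectory, so memory
# write/retrieve choices change what the simulated user does next.

def run_episode(assistant, simulated_user, n_turns: int = 20) -> float:
    history = []
    for _ in range(n_turns):
        user_msg = simulated_user.next_message(history)  # reacts to prior replies
        reply = assistant.respond(user_msg)              # may write/read memory
        history.append((user_msg, reply))
    # End-of-episode probes test whether the *right* facts were stored.
    return simulated_user.grade_recall(assistant)

# Off-policy scoring would replay one prewritten transcript for every system
# ("reuse bias"); here trajectories diverge per system, so rankings can too.
```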

InfMem trains bounded-memory agents to reason over 1M-token documents

InfMem (paper): A new agent-memory method, InfMem, targets ultra-long document QA by treating memory as a controlled process (not passive compression), using a PRETHINK–RETRIEVE–WRITE loop that decides when evidence is sufficient, when to fetch earlier passages, and how to compress into a fixed budget, as summarized in the paper thread.

The results emphasized in the thread include sustained accuracy up to 1M tokens with adaptive early stopping (lower latency) and gains over prior streaming-memory baselines, with the key engineering idea being that “what to keep” and “when to stop” are learnable control decisions rather than fixed heuristics, per the paper thread.
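
A sketch of that control loop under stated assumptions; the `policy` calls are hypothetical stand-ins for InfMem’s learned decisions:

```python
# Sketch of a PRETHINK -> RETRIEVE -> WRITE loop over a streamed document.
# `policy` stands in for InfMem's learned controller; the fixed memory budget
# and the learnable "when to stop" decision are the key ideas.

def answer_over_stream(chunks, question, policy, budget_tokens=4096):
    memory = []  # bounded working memory, never the full document
    for chunk in chunks:
        plan = policy.prethink(question, memory, chunk)
        if plan.needs_earlier_evidence:
            memory += policy.retrieve(plan.query)  # fetch earlier passages
        memory = policy.write(memory, chunk, budget=budget_tokens)  # compress
        if plan.evidence_sufficient:  # adaptive early stop -> lower latency
            break
    return policy.answer(question, memory)
```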

“Golden Goose” turns explanatory text into verifiable RLVR multiple-choice tasks

Golden Goose (RL data strategy): A reported technique called “Golden Goose” turns non-verifiable explanatory text into cheap RLVR training data by removing a key middle reasoning chunk and asking the model to pick the missing span from multiple choices, making rewards automatic because the “correct” option is the removed chunk, as explained in the Golden Goose summary.

The thread’s claim is that this directly targets RL data saturation (fixed verifiable sets stop yielding gains), by manufacturing fresh auto-graded items at scale from ordinary text sources, with the mechanism and motivation summarized in the Data saturation note.
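
The construction is simple enough to sketch; paragraph-level chunking and the distractor source below are simplifications of whatever the underlying recipe actually does:

```python
# Runnable sketch of the construction: delete a middle chunk and build an
# auto-graded multiple-choice item. Paragraph chunking and the distractor
# pool are simplifications for illustration.
import random

def make_rlvr_item(passage: str, distractor_pool: list, k: int = 3):
    chunks = [p for p in passage.split("\n\n") if p.strip()]
    if len(chunks) < 3:
        return None  # need a removable middle chunk
    mid = len(chunks) // 2
    answer = chunks[mid]
    context = "\n\n".join(chunks[:mid] + ["[MISSING]"] + chunks[mid + 1:])
    options = random.sample(distractor_pool, k) + [answer]
    random.shuffle(options)
    # Reward is automatic: the "correct" option is the removed chunk itself.
    return {"context": context, "options": options, "answer": answer}
```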

An on-policy context distillation idea resurfaces as multiple papers converge

On-policy context distillation (research direction): A note attributed to John Schulman is cited as an early (Nov 2025) proposal to compare off-policy vs on-policy context distillation in few-shot settings—training a student (empty context) to match a prompted teacher (long context), and evaluating whether on-policy collection changes outcomes, per the tinker idea screenshot.

The thread frames this as a now-crowded area (“5+ papers”) and suggests the practical experiment design is to measure off-policy-only, on-policy-only, and staged combinations, as described in the tinker idea screenshot.

Near-unconstrained generation reveals stable “knowledge priors” by model family

Near-unconstrained generation (Together AI): Together AI reports experiments where models are prompted with minimal, topic-neutral text (e.g., “Actually,” or “.”) to expose stable default-generation tendencies (“knowledge priors”) that differ systematically by model family, as laid out in the research thread.

What they claim to observe: The thread says families cluster into distinct semantic regions (e.g., programming/math-heavy vs narrative-heavy vs exam-question-like), and that even “degenerate” outputs are informative signals rather than pure noise, per the Model family clustering and Degenerate text signal.
Primary sources: The writeup is in the Blog post with the underlying methodology and results in the ArXiv paper, both referenced from the follow-up summary.

This is positioned as an auditing/safety/behavior-characterization tool that complements capability evals by measuring what models do without strong instruction scaffolding, according to the Why-it-matters note.
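
A sketch of the probing setup; the prompt list matches the examples above, while the sampling and clustering choices are illustrative rather than the paper’s exact method:

```python
# Illustrative setup: sample completions from minimal, topic-neutral prompts
# and bucket them by model family; embedding and clustering the outputs to
# find per-family semantic regions is the next step, left out here.
from collections import defaultdict

MINIMAL_PROMPTS = ["Actually,", ".", "The"]

def collect_priors(samplers: dict, n_samples: int = 50) -> dict:
    """samplers maps family name -> callable(prompt) -> completion text."""
    outputs = defaultdict(list)
    for family, sample in samplers.items():
        for prompt in MINIMAL_PROMPTS:
            outputs[family] += [sample(prompt) for _ in range(n_samples)]
    return outputs  # next: embed + cluster to compare family-level priors
```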

DFlash proposes block diffusion for flash speculative decoding

DFlash (paper): A new paper titled “DFlash: Block Diffusion for Flash Speculative Decoding” is shared via a Hugging Face paper page, per the paper availability note; the main pointer in the tweets is the canonical artifact itself, linked as the HF paper page.

Within this tweet set there are no performance numbers or implementation details beyond the title/positioning, so treat it as an “artifact dropped” signal rather than an evaluated decoding method.


🧑‍🏫 Developer culture & labor signals: productivity shock, job anxiety, and tool fatigue

Discourse itself is the news here: anxiety about displacement, “fast takeoff” narratives, and the lived experience of supervising increasingly capable coding agents. Excludes enterprise adoption specifics (feature).

Builders report reply quality collapsing under bot volume

Platform signal quality (X): swyx reports “obvious bots” jumping from ~20% to ~80% of replies, forcing stricter notification filters that also hide real humans in the Bot-replies complaint.

A second thread ties the “#keep4o” reply wave to suspected bot amplification while sharing a chatbot market-share chart in the Bots and share chart; treat attribution as speculative, but the operational point remains: social discovery for new tooling patterns is getting noisier.

Fast-takeoff framing: “ride the wave” becomes common language

Fast takeoff mood: Doodlestein’s “Happy Fast Takeoff Day” post frames frontier progress as exhilarating for insiders and “scary and confusing” for everyone else, while asserting the only option is to ride it in the Fast takeoff post.

The same mood shows up more tersely as “We’re not early anymore” in the Not early anymore post.

Levie claims AI power users are still years ahead of their peers

Adoption gap (Box): Aaron Levie argues that anyone paying attention to AI tools on X is “a couple years ahead” of the average worker in their field, and that many knowledge-work domains are still “day 1” in the Years-ahead posture.

The labor signal is about diffusion, not capability: early adopters think the advantage window remains open because most orgs haven’t operationalized agents yet.

Release pace drives attention fatigue across builders

Tool fatigue: Multiple posts describe the last 24–48 hours as hard to track, with “Opus 4.5 and GPT-5.3 dropping minutes apart” in the Release whiplash meme and “Quiet week huh?” sarcasm in the Quiet week joke.

This is fine meme

A related form of fatigue is the constant “when next version?” churn (“When GPT-5.4? When Opus 4.7?”) in the Next-model demands.

Some engineers say “not replaced,” but worry about keeping up with tools

Developer displacement sentiment: A parallel thread argues AI isn’t making programmers obsolete in the near term, using a “punch cards weren’t that long ago” anecdote in the Punch card story, while still naming the practical stressor as “keeping up with the tooling” in the Tooling worry follow-up.

The signal here is the split between replacement fear and capability-churn fatigue: even optimistic engineers are framing adaptation speed as the main risk.

Token budgets become a real workplace constraint, not a billing detail

Workplace constraint: Ethan Mollick notes that “agentic work beyond a Max plan” can burn substantial tokens, and frames lack of access to frontier/agentic models as a potential career development downside in the Token budget comment.

This is a labor signal because it turns model access into a resource allocation question that employees may negotiate (similar to compute budgets for ML teams).

Codex pricing question triggers backlash and “keep4o” demand signal

User demand signal (OpenAI): Sam Altman asks how people want Codex priced in the Pricing question, and the replies become a proxy war about ChatGPT “4o” and companion-style usage; one critique calls out “AI companion enjoyers” and “sycophancy” dynamics in the Reply screenshot critique.

This is mostly a product-culture signal: a developer-facing pricing question surfaces a different audience’s priorities (and how strongly they’ll flood feedback channels).

The dominant builder mood: “you can just build things”

Builder mood: Alongside the release churn, there’s a strong “this is for coders and dreamers” tone—“you can just build things” as quoted in the Coders and dreamers post.

A related take frames the productivity shift as partly affective—“work is way more fun now”—in the Work is more fun post.

Early-access distribution becomes part of the competitive narrative

Access politics: One thread argues that who gets early model access signals lab priorities—“businesses” for Anthropic versus “influencers” for OpenAI—per the Early access signal.

There’s no hard data in the tweet, but it’s a recurring culture storyline: access pathways (enterprise pilots, creator programs, private betas) are interpreted as strategy, not logistics.

Engineers separate “tech is real” from “pricing is real”

Macro narrative: A builder-facing take claims “the probability AI is a bubble is zero” from a technical standpoint (especially for people who write code), while conceding the financial story could diverge in the Not a tech bubble take.

A nearby sentiment rejects plateau talk and argues slowdowns would be temporary in the No plateau argument.


🎙️ Voice agents & speech stacks: cloning, realtime pipelines, and app integrations

A smaller but clear cluster: new speech model endpoints and voice-agent plumbing being productized via inference platforms and devtools. Excludes creative music workflows (kept in gen-media).

MiniMax Speech 2.8 lands on Replicate with 5-second voice cloning and controllable delivery

MiniMax Speech 2.8 (Replicate): Replicate added MiniMax Speech 2.8, positioning it for production TTS where you can clone a voice from a ~5-second sample and steer delivery with natural interjections like “(laughs)” and “(sighs)” across 32 languages, as described in the Feature list. It ships as two SKUs, HD (quality-first) and Turbo (latency-first), with endpoints linked in the HD model page and the Turbo model page.

The practical shift is operational: teams can standardize on a hosted speech model with an API surface that supports both “studio” and “fast path” modes, rather than treating voice as a separate bespoke pipeline.
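
A sketch of what that API surface looks like through Replicate’s Python client; the model slug and input keys are assumptions, so check the linked model pages for the real schema:

```python
# Sketch via Replicate's Python client. The model slug and input keys are
# assumptions for illustration; the real schema is on the linked model pages.
import replicate

audio = replicate.run(
    "minimax/speech-2.8-hd",  # assumed slug; a Turbo SKU would be the fast path
    input={
        "text": "Thanks for calling! (laughs) How can I help today?",
        # Cloning reportedly needs only ~5 seconds of reference audio:
        "voice_sample": "https://example.com/ref-5s.wav",  # assumed key
    },
)
print(audio)  # typically a URL to the generated audio
```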

Grok Imagine Video hits Replicate with native-audio video generation

Grok Imagine Video (Replicate): Replicate made Grok Imagine Video available as an API, highlighting text→video and image→video with native audio generation, as stated in the Availability note.

Text and image to video demo

The immediate engineering implication is stack convergence: if you’re already building voice agents, “speech + visuals” can now route through one hosted inference surface, rather than stitching together separate video generation and TTS jobs, as shown on the Replicate model page.

LangSmith shows how to debug STT→agent→TTS voice pipelines with traces

LangSmith (LangChain): LangChain published a practical walkthrough for debugging the “STT → agent → TTS sandwich” by sending traces into LangSmith, making it easier to see where latency and failures occur across the pipeline, as shown in the Voice agent debugging post.

Trace anatomy: The example shows a single conversation broken into explicit STT/LLM/TTS spans with durations and model names, which is the minimum structure you need before you can do targeted reliability work, as illustrated in the Voice agent debugging post.
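
A minimal sketch of that span structure using the langsmith SDK’s @traceable decorator, with the STT/LLM/TTS internals stubbed out as placeholders:

```python
# Minimal span structure for the STT -> agent -> TTS sandwich using the
# langsmith SDK's @traceable decorator (requires LANGSMITH_API_KEY set).
# Provider calls are stubbed; the nesting is what yields per-stage durations.
from langsmith import traceable

@traceable(name="stt", run_type="tool")
def transcribe(audio: bytes) -> str:
    return "hello world"  # placeholder: call your STT provider here

@traceable(name="agent", run_type="chain")
def respond(transcript: str) -> str:
    return f"You said: {transcript}"  # placeholder: LLM/agent call here

@traceable(name="tts", run_type="tool")
def synthesize(text: str) -> bytes:
    return text.encode()  # placeholder: call your TTS provider here

@traceable(name="voice_turn", run_type="chain")
def voice_turn(audio: bytes) -> bytes:
    # Each nested call appears as a child span with its own duration.
    return synthesize(respond(transcribe(audio)))
```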

Warp adds GitHub Copilot CLI support plus Wispr Flow voice transcription

Warp (Warp): Warp shipped first-class support for the GitHub Copilot CLI, bundling voice transcription powered by Wispr Flow, plus an image upload button and an integrated file explorer + code review panel, as demonstrated in the Feature demo.

Copilot CLI inside Warp

For teams experimenting with voice-driven dev workflows, this is a notable “speech-to-terminal” surface that doesn’t require building a custom STT front-end first.

MiniMax says Speech 2.8 is powering Sensei voices, emphasizing expressiveness

MiniMax Speech 2.8 (MiniMax): MiniMax claims Speech-2.8 is already being used in the “Sensei” voices inside @callmesenseien, with a stated focus on pushing voice quality and emotional expressiveness further, per the Adoption note.

This is a small but concrete adoption datapoint: the model isn’t only a demo SKU, it’s being used as the voice layer in an end-user product experience where prosody and consistency are obvious to users.

Wispr Flow announces an Android waitlist

Wispr Flow (Wispr Flow): Wispr Flow announced it’s “coming to Android” with a public waitlist, per the Android waitlist post.

This extends mobile availability for a consumer-grade transcription layer that’s already showing up as an embedded capability in dev surfaces (for example, terminals), which matters if you’re relying on voice input outside desktop-only setups.
