xAI Grok 4.20 Beta ships 2M context – $2/$6 per 1M tokens
Executive Summary
xAI rolled out Grok 4.20 Beta across the xAI API and OpenRouter; lineup includes reasoning, non-reasoning, and a Multi-Agent beta SKU; pricing is listed at $2/M input and $6/M output with a 2,000,000-token context window. Artificial Analysis reports 22% hallucination rate on AA-Omniscience, #1 on IFBench at 82.9%, and ~265 output tok/s; their suite run is cited at $484 and Grok’s Intelligence Index at 48; charts also show a coding gap, with an AA Coding Index of 42 vs GPT‑5.4 at 57 and Gemini 3.1 Pro Preview at 56. BridgeBench screenshots put Grok 4.20 Multi-Agent #1 at 96.1 overall with 100% completion and 87.8s latency; OpenRouter provider stats show 1,122 tok/s throughput; many of these numbers remain third-party or screenshot-level artifacts.
• Anthropic/Claude UI: Claude chat adds interactive in-chat charts/diagrams (beta; all plans incl. free); builders frame it as “generative UI,” with unverified claims it’s MCP-backed.
• Cursor/CursorBench: Cursor published a methodology combining offline tasks with online telemetry; token efficiency is plotted alongside correctness; OpenAI DevRel claims GPT‑5.4 leads on correctness with efficient token use.
• OpenAI/Codex app: Automations move to GA with per-run model/reasoning settings and worktree isolation; hooks are teased with early SessionStart/Stop screenshots; users report weekly usage limits hitting 0% and recurring server_error interruptions.
Top links today
- Claude interactive charts and diagrams beta
- Cursor agentic coding eval methodology
- OpenAI Video API updates with Sora 2
- Perplexity Computer for Pro subscribers
- Google Maps Ask Gemini and Immersive Navigation
- Codex app themes and automations GA
- Firecrawl CLI for agent web scraping
- Grok 4.20 Beta API model page
- Nemotron 3 Super on OpenRouter
- Zed editor open source repo and hiring
- Mistral AI Now Summit event details
- Gemini API spend caps documentation
- Gemini Embedding 2 multimodal vector model
- Open source interactive charts for Claude
Feature Spotlight
Claude: interactive charts & diagrams rendered in-chat
Claude now renders interactive charts/diagrams inside chat across all plans, making data exploration + UI-like outputs a native response type—useful for analytics, reporting, and agent UIs without external viz tooling.
High-volume cross-account story: Claude can generate interactive charts/diagrams directly inside chat (beta, all plans incl. free). This is a concrete step toward “generative UI” outputs as first-class responses instead of static images or text tables.
📊 Claude: interactive charts & diagrams rendered in-chat
High-volume cross-account story: Claude can generate interactive charts/diagrams directly inside chat (beta, all plans incl. free). This is a concrete step toward “generative UI” outputs as first-class responses instead of static images or text tables.
Claude adds interactive charts and diagrams rendered directly in chat (beta for all plans)
Claude chat visualizations (Anthropic): Claude can now render interactive charts and diagrams directly inside the chat UI; Anthropic says it’s available “today” in beta across all plans including free, per the launch post.

The interaction model shown so far looks stateful rather than “image output”: charts respond to hover/click and can transition between views (for example bar-to-line drilldowns), as demonstrated in the UI demo clip. Another demo shows an in-chat sidebar for switching chart types while keeping the underlying data, as shown in the chart settings demo.
Diagrams are included in the same feature surface (not a separate tool); one clip shows Claude generating multiple diagram types (including a Venn-style visualization) directly in chat, as shown in the chart and diagram demo. Access is via the standard Claude UI—see the Claude web app referenced in the announcement.
“Generative UI is here” sentiment clusters around Claude’s new visualization surface
Generative UI signal: Multiple builders are framing Claude’s new interactive charts/diagrams as a concrete arrival of “generative UI,” rather than a nicer plotting feature—see reactions like “the generative UI dream is happening” in one builder reaction and “Generative UI is here and it works very very well” in another.

One interesting ecosystem detail is that at least one practitioner believes the feature is “powered by MCP” (Model Context Protocol) and is using it as a building block inside their own orchestrator, per the MCP speculation. The tweets don’t include an implementation write-up yet, so treat “MCP-powered” as an unverified claim rather than a confirmed architecture.
Builders are prototyping interactive “instrument panels” inside Claude chat
In-chat UI prototyping (Claude): One early usage pattern is treating Claude’s interactive chart/diagram output as a lightweight dashboard surface—e.g., generating an interactive Cessna 172-style instrument panel in the chat itself, as shown in the instrument panel demo.

The clip suggests Claude’s renderer can drive multiple coordinated widgets (gauges with changing values) and supports direct manipulation/interaction, not just a static artifact. The author frames it as “pretty cool” but “not perfect,” with a specific education/training use case in mind, per the instrument panel demo.
🧰 Codex desktop app: themes, automations GA, and hooks teasers
Today’s Codex-specific churn is mostly about making Codex feel like a programmable desktop IDE agent: customization (themes) and unattended runs (Automations GA), plus ongoing talk about upcoming Hooks and rate-limit realities.
Codex app Automations reach GA for recurring repo work
Codex app Automations (OpenAI): Automations are now generally available, with per-automation controls for model choice and reasoning level plus execution isolation (worktree vs existing branch) and reusable templates, per the GA announcement. The framing is recurring dev chores—daily repo briefings, issue triage, PR follow-ups—run as scheduled background work.

• Operational detail: the GA notes explicitly call out worktree-based runs as a first-class option for safer unattended changes, as described in the GA announcement.
Codex app ships customizable themes (import/share, fonts, contrast)
Codex app (OpenAI): The desktop app now supports theme personalization, including importing themes you like and sharing your own, as shown in the themes announcement and echoed with a “Matrix” preset in the UI screenshot. The settings surface exposes concrete knobs—accent/background/foreground hex colors, contrast, translucent sidebar, and separate UI vs code fonts—making Codex feel more like a configurable IDE than a fixed chat UI.
• Sharing format emerges: people are already posting full codex-theme-v1 blobs (fonts, semantic colors, surfaces) for copy/paste sharing, as in the theme string example.
Codex usage-limit pressure shows up as weekly exhaustion screenshots
Codex usage limits: Multiple users are circulating Codex app limit banners showing near-exhaustion states—e.g., “Weekly usage limit 5% remaining” in the limit warning screenshot and “0% remaining” in the fully exhausted screenshot. Another UI shows “Rate limits remaining 1%” early in a billing period, per the rate limit screenshot, suggesting that rate/credit budgeting is becoming a visible constraint in day-to-day agent usage.
Hooks are coming to Codex, with early users already testing them
Codex hooks (OpenAI): OpenAI-affiliated accounts are teasing that “Hooks are coming to codex,” as stated in the hooks teaser and reinforced by follow-up sentiment in the follow-up post. Separately, at least one user is already “testing the new codex hooks feature,” showing SessionStart/Stop hooks running and injecting session rules, per the hooks output screenshot.
The public details on configuration surface and ordering are still sparse in these posts.
Codex app server_error interruptions are still being reported
Codex reliability: At least one report shows Codex returning a server_error (“An error occurred while processing your request…retry…include request ID”), as captured in the error screenshot. The post framing (“ugh it’s happening again…codex come on”) suggests recurrence rather than a one-off incident, but the tweets don’t include status-page confirmation or scope.
📐 CursorBench: scoring agentic coding on correctness vs token efficiency
Cursor shared more transparency on how they score agentic coding quality beyond saturated public benchmarks—positioning token usage (efficiency) alongside correctness and online eval signals from real usage.
CursorBench: Cursor opens up how it scores agentic coding beyond public benchmarks
CursorBench (Cursor): Cursor shared a new method for scoring agentic coding models that combines offline tasks with online metrics from real Cursor usage, aiming to stay useful even as public benchmarks saturate, as outlined in the method announcement and expanded in the CursorBench blog post.
• Efficiency as a first-class metric: Their “token efficiency frontier” plot maps CursorBench score against token usage, making it easier to reason about “good enough correctness” versus cost/latency tradeoffs (model points are shown in the method announcement).
• Transparency shift: Cursor leadership frames this as intentionally more open about internal scores after being “coy” in the past, per the Cursor eval transparency.
CursorBench vs SWE-bench Verified: internal tasks show bigger gaps between models
Benchmark interpretation: A shared comparison suggests CursorBench produces materially more separation between models than SWE-bench Verified, implying the internal workload is stressing different failure modes than “mostly-solved” public sets, as shown in the side-by-side chart.
The same discussion ties back to Cursor’s claim that public benchmarks are increasingly saturated, and that measuring with real Cursor sessions should better reflect day-to-day agent performance, per the CursorBench blog post.
Evals ops pattern: pair offline suites with live-traffic signals for construct validity
Evals operations: Cursor’s write-up makes a concrete case for using online metrics from real product traffic alongside offline eval suites—less for leaderboard bragging and more for catching regressions that only show up in real multi-step sessions, as described in the CursorBench blog post.
The approach implicitly treats “model quality” as multi-dimensional (correctness, interaction behavior, efficiency), with token usage used as a proxy for runtime/cost pressure in the scoring plots shown in the frontier chart.
GPT-5.4 gets positioned as a CursorBench correctness leader with efficient tokens
GPT-5.4 (OpenAI): OpenAI DevRel amplified CursorBench results by claiming GPT-5.4 “leads CursorBench on correctness with efficient token usage,” per the OpenAI DevRel note.
That claim lands in the context of Cursor’s own framing that token usage should be considered alongside correctness—an idea visualized directly in the efficiency frontier plot in the CursorBench chart.
🖥️ Perplexity Computer: Pro rollout, credits, connectors, and Slack interface
Perplexity continues pushing “computer-as-agent” packaging: Pro access, credit mechanics, connectors, and Slack as an enterprise-facing UI surface for running tasks without switching contexts.
Perplexity Computer lands in Slack as an enterprise UI surface
Perplexity Computer (Perplexity): Computer can now run directly in Slack, with installs via the Slack App Marketplace and workflows that use channel context while syncing results back to the web Computer experience, according to the Slack integration post.

• Why Slack matters: The integration frames Slack as the “where work happens” UI for agent actions (not just Q&A), with explicit connect-and-act affordances in-chat (e.g., “Connect Stripe”), as shown in the Slack app screenshot.
Perplexity Computer rolls out to Pro with 20+ models and connectors
Perplexity Computer (Perplexity): Computer is now available to Pro subscribers; Perplexity positions it as a bundled agent surface with “20+ advanced models,” prebuilt/custom skills, and “hundreds of connectors,” as stated in the rollout announcement and detailed on the launch page. Max is framed as the higher-spend tier with monthly credits and higher limits, per the same rollout announcement.

• Packaging shift: The pitch is less “pick a model” and more “pick a workspace with routing, skills, and integrations,” which is the operational unit most agent teams end up rebuilding internally anyway, per the rollout announcement.
Perplexity Computer adds bonus-credit mechanics and a Usage & credits page
Perplexity Computer (Perplexity): A new Usage and credits view is showing up alongside the Pro rollout, including 4,000 bonus credits for Pro users and an upsell path to Max with much larger bonus and monthly credits, as shown and described in the credits screenshot.
• Credit details surfaced in-product: The UI shows bonus-credit expiry dates and plan prompts (e.g., “Upgrade to Max… get 45,000 credits”), which makes the effective cost model visible to anyone running long agent workflows, per the credits screenshot.
Perplexity launches Computer for Enterprise as an autonomous digital worker
Computer for Enterprise (Perplexity): Perplexity is also pitching Computer for Enterprise as an “autonomous digital worker” for corporate environments—positioned around collaboration, multi-model orchestration, and institutional-grade research, per the enterprise announcement.

• Enterprise posture: The enterprise framing emphasizes controlled connectors and org workflow execution (versus individual “computer use”), aligning with the examples shown in Slack-style tasking flows in the Slack app screenshot.
Perplexity Computer gets positioned in the chat-to-action agent race
Competitive positioning: Builders are explicitly grouping Perplexity Computer with “computer-use” products (Operator, Claude computer use, etc.) and describing the market shift as moving from chat into end-to-end task execution, as framed in the Max buyer note and reinforced by Perplexity’s Pro and Enterprise pushes in the Pro rollout and enterprise launch.
• Sentiment snapshot: Early adopters are paying for higher tiers specifically to access the Computer workflow surface and report back on how it compares to other agent runners, per the Max buyer note and the longer-form reactions in the hands-on post.

🛰️ xAI Grok 4.20: 2M context, multi-agent variants, and benchmark deltas
Grok 4.20 Beta is the day’s big model-cycle storyline: new API snapshots, multi-agent variant packaging, and lots of third-party benchmarking around hallucination rate, instruction following, speed, and coding gaps.
Grok 4.20 Beta ships with 2M context, multi-agent variant, and $2/$6 pricing
Grok 4.20 Beta (xAI): xAI’s new Grok 4.20 Beta lineup is now live via the xAI API and widely routed through OpenRouter, with a 2,000,000-token context window and three SKUs (multi-agent beta, reasoning, non-reasoning) priced at $2/M input and $6/M output, as listed in the model pricing screenshot and reiterated in the OpenRouter listing.
Compared to peers called out in the same threads, the main operational change is the context jump (Claude Opus at 200K, GPT-5.4 at 1M) alongside a lower input/output price point than prior Grok snapshots, per the launch comparison.
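For orientation, here is a minimal sketch of hitting the new snapshot through xAI’s OpenAI-compatible API; the base URL is xAI’s documented endpoint, but the exact model id string below is an assumption based on the naming in these posts (check the model page for the real identifiers).

```python
# Minimal sketch: calling the new snapshot through xAI's OpenAI-compatible API.
# The base_url is xAI's documented endpoint; the model id below is a guess based
# on the naming in these posts, so check the model page for the real string.
from openai import OpenAI

client = OpenAI(
    api_key="XAI_API_KEY",            # your xAI API key
    base_url="https://api.x.ai/v1",   # xAI's OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="grok-4.20-beta",           # hypothetical SKU name
    messages=[{"role": "user", "content": "Summarize the attached 1.5M-token corpus."}],
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```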
Artificial Analysis: Grok 4.20 posts 22% hallucination rate, #1 IFBench, and ~265 tok/s
Grok 4.20 Beta 0309 (xAI): Artificial Analysis reports three headline deltas—22% hallucination rate on AA-Omniscience (lower is better), 82.9% on IFBench (their #1 instruction-following score), and ~265 output tokens/sec on xAI’s API—summarized in the benchmark charts and echoed in the follow-on recap.
• Benchmark + cost framing: the broader Artificial Analysis write-up also pins Grok 4.20 (reasoning) at 48 on their Intelligence Index and describes a $484 run cost for that suite, per the index breakdown.
The comparisons in the charts are mixed: Grok leads on non-hallucination and instruction following, but those wins don’t automatically carry into coding-centric aggregates covered elsewhere.
Artificial Analysis Coding Index shows Grok 4.20 still behind the top coding models
Artificial Analysis Coding Index: a circulated chart puts Grok 4.20 Beta 0309 at 42, behind GPT‑5.4 at 57 and Gemini 3.1 Pro Preview at 56, and also behind Claude Opus 4.6 at 48, as shown in the coding index chart.
The recurring theme across posts is that 2M context and strong non-hallucination metrics don’t automatically translate into top-tier coding aggregates, as framed in the coding gap note.
BridgeBench ranks Grok 4.20 Multi-Agent #1 while base Grok 4.20 lands #6
BridgeBench (BridgeMind): a BridgeBench screenshot shows Grok 4.20 Multi-Agent (4-agent) ranked #1 with 96.1 overall, 100% completion, and 87.8s latency, with the 16-agent variant close behind at 95.9, as shown in the leaderboard table. The same benchmark later places Grok 4.20 Beta at #6 overall (93.4) with 59.0s latency, per the follow-up table.
BridgeMind’s post leans into the multi-agent framing—"xAI came out of nowhere" and "The multi-agent future is here"—as stated in the BridgeBench commentary; the table itself highlights the completion-rate difference versus GPT-5.4 on that benchmark.
BullshitBench v2 shows Grok 4.20 ranking jump; high-reasoning runs can score worse
BullshitBench v2 (petergpt): the benchmark author reports Grok 4.20 moving up sharply—Grok 4.1 was ranked 54th and 72nd, while Grok 4.20 takes 13th–16th—as shown in the BullshitBench table.
• Reasoning sensitivity: the same post notes the multi-agent variant did better than base, but an “xHigh” run spent far more tokens ("cost me like $75") while scoring 3 points lower, alongside the claim that on this benchmark "reasoning either doesn't help much or makes things worse," per the benchmark commentary.
This is a narrow eval (pushback vs accepted nonsense), but it’s one of the clearer datapoints in the tweets where extra reasoning budget appears to be a liability rather than a help.
OpenRouter provider stats show 1,122 tok/s throughput for Grok 4.20 Multi-Agent
Grok 4.20 Multi-Agent Beta (OpenRouter): an OpenRouter provider table shows 1,122 tokens/sec throughput for the xAI provider on Grok 4.20 Multi-Agent, alongside 2M context and tiered pricing beyond 200K tokens, as captured in the provider metrics screenshot.
This is a practical datapoint for long-context agent workloads where wall-clock time matters as much as per-token price, and it’s one of the few posts that includes an explicit throughput number rather than latency anecdotes, per the routing view.
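As a back-of-envelope illustration of why throughput matters for agent wall-clock time, here is the arithmetic using the two figures quoted in today’s posts (1,122 tok/s from the OpenRouter provider table vs ~265 tok/s measured by Artificial Analysis); the 50k-token output size is an invented example.

```python
# Back-of-envelope wall-clock comparison using the throughput figures quoted
# above; the 50k output-token turn is an invented example size.
output_tokens = 50_000

for label, tok_per_s in [("OpenRouter xAI provider", 1_122), ("AA-measured xAI API", 265)]:
    seconds = output_tokens / tok_per_s
    print(f"{label}: ~{seconds:.0f}s for {output_tokens:,} output tokens")
# -> OpenRouter xAI provider: ~45s; AA-measured xAI API: ~189s
```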
Vals Index places Grok 4.20 Beta (reasoning) at #13 overall with low cost/latency
Vals Index (ValsAI): ValsAI places Grok 4.20 Beta (Reasoning) at #13 overall, reporting 58.05% ± 1.98 accuracy, $0.28 cost per test, and 85.42s latency, as shown in the Vals Index table.
The same thread claims it “shines” on a SWE‑Bench split at #4 (72.55%) and improves on Terminal Bench 2 versus earlier Grok models, per the evaluation summary.
LisanBench shows Grok 4.20 near Grok 4 performance with better token efficiency
LisanBench: a shared LisanBench chart shows Grok 4.20 Beta scoring roughly in line with Grok 4 (slightly lower on the displayed slice), while using fewer tokens—"only 9k tokens vs 11.7k tokens"—as stated alongside the LisanBench screenshot.
The thread frames this as an efficiency/price story rather than a capability leap, and it aligns with other posts emphasizing Grok 4.20’s speed and cost profile even when aggregate intelligence isn’t at the very top, per the token note.
🧪 Hermes Agent: fast OSS releases, connectors, MCP client, and provider routing
Hermes Agent updates are mostly operational/platform work: big v0.2.0 release notes, install footprint changes, Slack improvements, MCP client support, and provider routing refactors—useful if you run agents across channels.
Hermes Agent v0.2.0 lands with MCP client, messaging gateway, and centralized provider routing
Hermes Agent (Nous/Community): v0.2.0 is the first big tagged milestone after the initial foundation—216 merged PRs from 63 contributors and 119 issues resolved, as summarized in the Release notes and echoed in the Release card. It’s a platform-style release (not a single feature): it adds native MCP client support, a multi-platform messaging gateway, and a centralized call_llm() router that collapses scattered provider logic (a generic sketch of that routing pattern follows the bullets below).
• MCP client: Native stdio + HTTP transports, reconnection, resource/prompt discovery, and server-initiated sampling are called out in the Release notes.
• Messaging gateway: Unified sessions + attachments across Telegram/Discord/Slack/WhatsApp/Signal/Email/Home Assistant are bundled per the Release notes.
• Operational ergonomics: Git worktree isolation plus filesystem checkpoints and /rollback show up as first-class safety rails in the Release notes.
• Test surface: Release notes claim 3,289 tests, framing this as a move toward more reliable automation, as stated in the Release card.
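For readers unfamiliar with the pattern, here is a generic sketch of what a centralized call_llm() router looks like; this illustrates the idea only and is not Hermes Agent’s actual code, with the provider wrappers shown as ordinary SDK calls.

```python
# Generic illustration of a centralized call_llm() router (not Hermes Agent's
# actual implementation): one entry point maps "provider/model" strings to
# per-provider wrappers, so channel and tool code never touches provider SDKs.
from typing import Callable

def _call_openai(model: str, messages: list[dict]) -> str:
    from openai import OpenAI
    resp = OpenAI().chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def _call_anthropic(model: str, messages: list[dict]) -> str:
    import anthropic
    resp = anthropic.Anthropic().messages.create(
        model=model, max_tokens=1024, messages=messages
    )
    return resp.content[0].text

PROVIDERS: dict[str, Callable[[str, list[dict]], str]] = {
    "openai": _call_openai,
    "anthropic": _call_anthropic,
}

def call_llm(model_id: str, messages: list[dict]) -> str:
    """Route 'provider/model' ids to the right backend; fail loudly on unknowns."""
    provider, _, model = model_id.partition("/")
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    return PROVIDERS[provider](model, messages)
```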
Hermes Agent adds official Claude provider and trims install weight
Hermes Agent (Teknium): A same-day batch of operational updates adds official Claude provider support and makes installs “much lighter” by making the RL pieces optional, per the Daily updates. Slack integration work also got a round of improvements.
• Cost/control tweak: Default context compression ratio was reduced to 50%, which is framed as a cost-saver in the Daily updates.
• Ecosystem interop: Teknium also mentions an adapter PR to PaperClip (a multi-agent orchestrator), as noted in the Daily updates.
Hermes Agent finishes a routing refactor aimed at reducing provider-switching bugs
Hermes Agent (Teknium): Teknium says a “huge foundational refactor” is complete, targeting recurring issues from model/provider switching and routing/handling; the ask is to test latest builds and report regressions, per the Refactor note. This reads like stability work around the provider abstraction layer.
The post doesn’t enumerate diffs. It’s an ops-quality change.
Hermes Agent recipe: use OpenRouter’s free Nemotron 3 Super as the model driver
Hermes Agent (Teknium): Teknium shared a concrete configuration path to run Hermes with OpenRouter, selecting a custom model name of nvidia/nemotron-3-super-120b-a12b:free, as described in the Config instructions and the corresponding OpenRouter listing.
This is a practical way to swap the agent’s reasoning core without changing the rest of the harness.
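A minimal sketch of the equivalent raw API call (the OpenAI-compatible client pointed at OpenRouter with the free model id from the post); this is not the Hermes config format itself, just the underlying request it maps to.

```python
# Minimal sketch: the raw OpenAI-compatible request that the Hermes custom-model
# setting maps to, using the free Nemotron endpoint named in the post.
from openai import OpenAI

client = OpenAI(
    api_key="OPENROUTER_API_KEY",
    base_url="https://openrouter.ai/api/v1",
)

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b:free",
    messages=[{"role": "user", "content": "Plan the next step for this task."}],
)
print(resp.choices[0].message.content)
```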
Hermes Agent getting-started tutorial circulates as onboarding keeps changing fast
Hermes Agent (Teknium): Teknium points people at a “great tutorial” for setting up Hermes Agent, suggesting onboarding/documentation is still moving alongside rapid releases, per the Tutorial mention.
No new mechanics are described in the tweet itself.
Hermes Agent hackathon nears deadline, with an “idea generator” demo clip
Hermes Agent (Nous Research): NousResearch posted a final reminder that three days remain for hackathon submissions, alongside a demo where Hermes generates “1000 project ideas” and uses an ASCII-video skill, as described in the Submission reminder.

A shorter reminder also went out from Teknium, per the Hackathon call.
🔌 MCP & agent interoperability: Figma loop, enterprise adoption, and “MCP is dead” debates
Today’s MCP content spans real integrations (code↔design loops) and discourse about whether MCP is foundational or overhyped. Net signal: MCP keeps showing up in production stacks despite recurring “dead” memes.
Factory AI pushes working prototypes into Figma via the Figma MCP server
Factory (Factory AI): Factory added a code→design handoff where an agent can take a working page from your local app and push it into a Figma canvas for designers/PMs to edit, using the Figma MCP server workflow described in the feature demo.

• Setup path: The flow starts by adding the Figma MCP server from an MCP registry and then prompting the agent to send a page from a local web app into Figma, as shown in the feature demo.
• Why it matters: This turns “code as the source of truth” into an artifact designers can directly manipulate in their native tool, without exporting static screenshots, as demonstrated in the feature demo.
Enterprise signal: Uber is cited as running MCPs internally
MCP adoption (enterprise): A practitioner thread argues MCPs are “the life blood” for how agents use internal services in mid-sized+ companies, citing Uber as a concrete case in the Uber example, with more detail in the linked inside look article. The same thread frames MCP as operational infrastructure (not a hobby protocol), positioning “MCP is dead” takes as miscalibrated for enterprise reality, per the Uber example.
Warp adds a code↔Figma roundtrip using the Figma MCP server
Warp (Warp): Warp shipped “code to canvas” support for the Figma MCP server—render UI from code, push it to a Figma canvas, get edits/feedback, and pull it back into code, as shown in the workflow walkthrough.

• Loop closure: The demo shows the UI rendering in Figma and updating as code changes, framing the MCP server as the transport for keeping the design surface in sync with working code, per the workflow walkthrough.
Figma expands its MCP partner list for “code to canvas” workflows
Figma MCP server (Figma): Figma expanded its MCP ecosystem with additional “code to canvas” partners—called out as Cursor, Warp, Factory, Augment Code, and Firebender in the partner list. The new Warp and Factory implementations show what this looks like in practice via the Warp demo and Factory demo.
Unix text-stream tooling vs typed tools resurfaces in agent design debates
Tool interface design: A thread argues that text-based CLIs outperform typed tool catalogs for LLM agents because Unix commands are heavily represented in training data and because “everything is a text stream / tokens,” as captured in the CLI argument screenshot.
The claim is framed as a design preference for a single run(command="...")-style interface over large structured tool inventories, per the CLI argument screenshot.
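To make the design preference concrete, here is an illustrative sketch of the single run(command=...) tool interface being argued for; the tool schema below is generic and not tied to any particular agent framework.

```python
# Illustrative sketch of the single text-stream tool interface: one run(command)
# tool instead of a large typed catalog. The schema is generic JSON-schema style,
# not taken from any specific framework.
import subprocess

RUN_TOOL_SCHEMA = {
    "name": "run",
    "description": "Execute a shell command and return stdout/stderr as text.",
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

def run(command: str, timeout: int = 60) -> str:
    """Single entry point: every capability is expressed as a shell command."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.stdout + proc.stderr
```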
Progressive disclosure is pitched as the missing layer that makes MCP feel usable
MCP ergonomics: A practitioner response argues MCP is “not perfect” but becomes more legible as agents gain new interaction patterns, and specifically calls out progressive disclosure via an execution/routing layer as the way MCP starts to “make a lot more sense,” per the progressive disclosure note. The same post frames MCP failures as a harness problem more than a protocol problem, according to the progressive disclosure note.
“MCP is dead” becomes a real NYC meetup ahead of an MCP dev summit
MCP community (meme → meetup): An April 1 NYC “Celebration of Life” event for MCP was announced in the event post, explicitly tying the “MCP is dead” meme to an in-person gathering ahead of an MCP dev summit, with details in the event page. The meme itself continues to circulate as a one-liner in posts like hot take, which is part of why a tongue-in-cheek event can still draw attention.
🛠️ Agent developer tooling: sandboxes, scraping CLIs, doc editors, and gateway reliability knobs
Tooling today is about making agents practical: web data ingestion CLIs/SDKs, agent-native docs/collaboration, sandbox lifecycle automation, and reliability knobs (timeouts/failover). Excludes the Claude charts feature.
Firecrawl launches a CLI for agent-grade web ingestion (Markdown/JSON output)
Firecrawl CLI (Firecrawl): Firecrawl introduced a terminal-first toolkit to let coding agents scrape, search, and browse the web into LLM-ready Markdown/JSON, positioning it as higher-fidelity than “raw HTML” workflows per the CLI announcement and the explainer clip. This lands squarely in the “give agents reliable web I/O” bucket.

• Why it changes workflows: it’s built to be callable from agents like Claude Code/Codex/OpenCode without building a bespoke scraper each time, as described in the explainer clip.
The tweets don’t include a compatibility matrix (auth flows, JS rendering, rate limits), so exact site coverage remains unclear from today’s material.
Vercel AI Gateway adds per-provider timeouts to trigger earlier failover
AI Gateway (Vercel): Vercel added provider-level custom timeouts (providerTimeouts) so you can fail over before a provider’s default timeout, shipping in beta for BYOK credentials with non-BYOK support “coming soon,” as described in the feature post and detailed in the changelog entry. This is a pragmatic reliability knob for multi-provider routing.
• Operational nuance: Vercel notes some providers may still bill timed-out requests if they don’t support stream cancellation, per the changelog entry.
No screenshots were shared in today’s tweets; the artifact is primarily the config surface and docs.
E2B adds Auto Resume so paused sandboxes wake on incoming activity
E2B Sandboxes (E2B): E2B shipped “Auto Resume” so a sandbox can pause on timeout but automatically resume when traffic arrives, per the feature note. This targets the common agent pattern where compute shouldn’t run 24/7, but cold-start friction still hurts.
• Config surface: examples show on_timeout: "pause" plus autoResume: true (TypeScript) / "auto_resume": True (Python), as captured in the feature note.
The tweet frames this as automatic; it doesn’t quantify wake latency or billing semantics.
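Echoing only the config keys shown in the post, here is a hedged sketch of what the Python surface might look like; where exactly those keys land in the E2B SDK (constructor versus a separate call) is an assumption, not documented usage.

```python
# Hedged sketch echoing the config keys shown in the post ("on_timeout": "pause",
# "auto_resume": True). Their exact placement in the E2B Python SDK (constructor
# vs a create() call) is an assumption here, not documented usage.
from e2b import Sandbox

sandbox = Sandbox(
    timeout=300,            # seconds of inactivity before the sandbox pauses
    on_timeout="pause",     # key from the post: pause instead of killing
    auto_resume=True,       # key from the post: wake when traffic arrives
)
# Subsequent requests routed to this sandbox would transparently resume it
# rather than failing on a paused instance (per the feature framing above).
```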
Proof open-sources its agent-native collaborative doc stack after heavy-load outages
Proof SDK (Every): Proof went down temporarily and performance degraded due to “insane” launch load, while the team pointed people to run it locally since it’s open source, as stated in the outage note with the repo in the GitHub repo. The interesting bit for agent builders is the goal: shared documents where humans and agents edit the same artifact, not scattered markdown files.
• Agent integration hook: the install/setup instructions explicitly tell an agent how to install Proof and how to report bugs via an HTTP endpoint, as shown in the agent instructions screenshot.
Today’s tweets don’t include uptime numbers or scaling details beyond “heavy load,” so reliability characteristics are still mostly anecdotal.
agent-browser adds an inspect mode to open DevTools during agent runs
agent-browser (ctatedev): A new agent-browser inspect command opens Chrome DevTools while an agent uses a headless browser, giving real-time visibility and the ability to steer/debug mid-run, per the command announcement. This is aimed at the practical failure mode where browser agents stall and you need to see console/network state.
The tweet suggests “pair debugging” with agents; it doesn’t specify which browser driver/runtime it targets beyond the DevTools workflow shown in the command announcement.
Firecrawl ships a Java SDK for scrape/search/crawl (Java 17+)
Firecrawl Java SDK (Firecrawl): Firecrawl released a Java client with “full support” for core endpoints—scrape, search, and crawl—and called out compatibility with Maven/Gradle and Java 17+ in the SDK launch post. This gives JVM shops a first-class path to put web ingestion behind internal agent tools.

The post doesn’t specify streaming, retries, or rate-limit behavior; it reads like a surface-area-first SDK drop.
Modal crosses 1B+ sandboxes launched as agent infrastructure usage spikes
Modal Sandboxes (Modal): Modal says more than 1 billion sandboxes have been launched in three years, framing Sandboxes as foundational infra for coding platforms, background agents, and RL workloads at scale, per the milestone post. It’s a usage signal: “ephemeral, isolated execution” is becoming the default substrate for agent products.

The post name-checks multiple agentic builders using Sandboxes; it doesn’t include a breakdown of what % are agent sessions vs other workloads.
🧩 Installable skills & agent extensions: flags, fetch, and harness command packs
These are shippable add-ons you install into your agent workflow (skills/CLIs) rather than core assistant releases. Good for teams standardizing repeatable agent actions across repos.
Vercel adds `vercel flags` CLI and a Skill so agents can manage feature flags programmatically
Vercel Flags (Vercel): Vercel added programmatic flag management via a new vercel flags CLI and a companion Skill so coding agents can create/manage flags without touching the dashboard, as described in the changelog note and detailed in the changelog post. The same write-up frames this as “agent-native” flag operations—useful when your agent is already running deploy loops and needs to gate rollouts or experiments without a UI hop.
• Agent integration path: The changelog notes a Skill install flow (npx skills add vercel/flags) and natural-language creation of flags, with server-side evaluation positioned as a way to avoid client-side layout shifts, per the changelog post.
This is an incremental but real workflow change: “feature flags as CLI surface” becomes scriptable in the same environment where agents already run Git and CI steps.
Browserbase ships a Fetch API skill for agents via `npx skills add`
Browserbase Fetch API (Browserbase): Browserbase is positioning Fetch as a generic web-content retrieval primitive for agents, and it’s installable as a Skill using npx skills add browserbase/skills --skill fetch, as shown in the install command output.

The install output suggests this is meant to be a shared building block in “skills-first” agent setups (one standardized fetch tool, reused across different harnesses), rather than bespoke scraping code per project.
LLMock adds WebSockets support and a Claude Code skill for deterministic fixtures
LLMock (CopilotKit): LLMock added WebSockets support for OpenAI and Gemini endpoints and shipped a Claude Code Skill for generating test fixtures, pushing the “deterministic LLM testing” angle further in the release note.

This reads like a response to CI brittleness in agent-heavy codebases: instead of snapshotting model outputs ad hoc, you stand up a mock server that can replay controlled streaming/tool-call behaviors, and you generate fixtures directly from your coding harness.
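As a sketch of the general pattern (not LLMock’s documented interface), deterministic CI tests usually just point the OpenAI client at a local mock server; the localhost URL, port, and model name below are illustrative.

```python
# Sketch of the general pattern (not LLMock's documented interface): point the
# OpenAI client at a local mock server so agent tests replay deterministic
# streaming/tool-call fixtures in CI. URL, port, and model name are illustrative.
import os
from openai import OpenAI

mock_url = os.environ.get("LLM_MOCK_URL", "http://localhost:8080/v1")
client = OpenAI(api_key="test-key", base_url=mock_url)

resp = client.chat.completions.create(
    model="gpt-test",  # the mock maps this to a recorded fixture
    messages=[{"role": "user", "content": "run the flaky tool-call scenario"}],
)
assert resp.choices[0].message.content is not None  # stable across CI runs
```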
gstack open-sources a Claude Code slash-command pack for repeatable eng workflows
gstack (Garry Tan): gstack is an MIT-licensed command pack intended to make Claude Code behave like a set of repeatable workflow tools (planning, architecture review, QA, retros), as announced in the repo launch and implemented in the GitHub repo.
The repo framing suggests “commands as process”: instead of re-prompting the same checks every time, teams can standardize a handful of opinionated entrypoints that encode what “good” looks like for their org.
🧭 Workflow patterns: autoresearch loops, harness-first thinking, and agent orchestration habits
Practice-level content focused on how builders get reliable output: iterative optimization loops (/autoresearch), orchestration patterns, and workflow hygiene for running agents without drowning in context or review debt. Excludes the Claude charts feature.
Shopify’s Liquid gets 53% faster via an autoresearch micro-optimization loop
Liquid (Shopify): A Karpathy-style /autoresearch loop (propose tiny change → benchmark → keep/revert → repeat) was used to land a large performance gain on a mature codebase—53% faster parse+render and 61% fewer allocations, as summarized in the Performance notes and further explained in the Autoresearch breakdown.
A concrete enabling detail was the existence of a big, trusted regression suite—974 unit tests—which made rapid micro-changes safe to try, as called out in the Performance notes. The full technical write-up is in Simon Willison’s write-up with benchmarks.
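A minimal sketch of the loop shape being described, with the benchmark, patch-proposal, and VCS steps left as placeholders you would wire to your own suite:

```python
# Minimal sketch of the propose -> benchmark -> keep/revert loop. The benchmark,
# patch-proposal, and apply/revert callables are placeholders for your perf
# suite, codegen step, and VCS commands respectively.
def hill_climb(iterations, run_benchmark, propose_patch, apply_patch, revert_patch):
    best = run_benchmark()            # e.g., parse+render time on a fixed corpus
    for _ in range(iterations):
        patch = propose_patch()       # one tiny, single-idea change
        apply_patch(patch)
        score = run_benchmark()
        if score < best:              # lower is better (latency, allocations, ...)
            best = score              # keep it
        else:
            revert_patch(patch)       # discard regressions immediately
    return best
```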
Orchestration as a personal workflow: background subagents on bounded tasks
Agent orchestration habit: Multiple builders describe a workflow where you spin up background agents for clearly scoped tasks while you stay on “the nuanced part,” with one practitioner saying “orchestration was a big unlock” in the Orchestration note.
The same pattern shows up in day-to-day tool usage where people explicitly call out “subagents” as part of their default setup, as mentioned in the Subagents usage. The core operational idea is parallelism plus tighter task boundaries, rather than a single long monolithic agent run.
Spec-led change control paired with mutation testing and incremental mutation runs
Spec-led development loop: A concrete checklist pairs acceptance-scenario-first changes with quality gates like “crap” thresholds and differential mutation testing—write scenarios, confirm they fail, implement via unit tests until scenarios pass, then refactor and kill mutation survivors—spelled out in the Workflow checklist.
A few operational heuristics accompany it: keeping modules below a mutation-site cap (<=50) per the Mutation site cap, and reducing compute burn via differential mutation runs that only re-test changed areas as described in the Incremental mutation idea.
Agent UX debate: dedicated agent browsers vs extension-first tools
Agent surfaces: One thread argues that “dedicated agentic browsers were a mistake” and that most automation can ship as a Chrome extension instead, framing it as “betting on where users are,” as stated in the Extension-first argument.
A parallel take emphasizes the same ergonomic principle—tools win by “meet[ing] you where you are” in existing workflows—per the Workflow fit thesis. The tweets don’t include an objective head-to-head; treat it as a workflow/UX positioning signal rather than a benchmarked conclusion.
AutoHarness proposes “synthesize a harness” as a reliability primitive for agents
AutoHarness (Google DeepMind): A newly shared paper proposes automatically synthesizing a code harness to constrain an agent’s action space (demonstrated in TextArena-style environments), with a practitioner noting they’re testing a similar idea “without training” and getting good results, as described in the Paper screenshot.
This lands as a practical “harness-first” pattern: instead of asking the model to behave, generate guardrail code that makes illegal/invalid moves impossible, then let the model operate inside that sandbox, per the Paper screenshot.
Pi open-sources an /autoresearch plugin for automated benchmark hill-climbing
Pi (pi.dev): A public note says the /autoresearch plugin for Pi has been open-sourced, turning “make it faster” into an automated loop driven by benchmarks, per the Plugin announcement and Repost.
Pi itself is positioned as a minimal terminal harness that can swap providers/models and package reusable “skills” and prompts, as described on the Project page and in the Pi overview. The tweets don’t include a repo link for the plugin, so treat the OSS detail as directional until the source is posted.
Dev metrics drift: tokens and execution time replacing lines-of-code as a proxy
Productivity metrics: A small but recurring argument is that “lines of code” is an increasingly misleading success metric for AI-assisted development, and that token usage is a closer proxy for real work/cost, as stated in the Metric critique.
A concrete artifact of this shift is people sharing token-efficiency plots and celebrating token thrift directly, as shown in the Token efficiency chart. The tweets don’t settle on a single metric; they show that teams are now tracking spend-shaped measures (tokens, latency) alongside correctness.
✅ Quality gates for agent speed: reviews, mutation testing, and approval policies
These tweets are about keeping software mergeable as agent throughput rises: code review economics, recall/precision benchmarks, org policy changes, and testing discipline (mutation/differential mutation).
AWS reportedly adds senior sign-off gate for AI-assisted code by junior and mid engineers
AWS (Amazon): A circulating claim says junior and mid-level engineers at AWS can no longer push AI-assisted code without a senior engineer signing off, per the Policy claim; if accurate, it’s a concrete example of org-level merge gates tightening as agent throughput rises.
The underlying detail (what counts as “AI-assisted,” how enforcement is measured, and whether it’s team-specific) isn’t included in the post, so treat it as an early signal rather than a fully specified policy change.
Qodo publishes code review benchmark claiming higher recall than Claude at similar precision
Qodo Code Review (Qodo): Qodo published a head-to-head comparison claiming materially higher issue-finding recall than Claude Code Review at the same precision, using an open benchmark described as 100 real PRs with 580 injected issues across 8 production repos and multiple languages, as summarized in the Benchmark claims.
• Benchmark deltas: The post claims all tools hit 79% precision, while recall differs—Claude Code Review at 52%, Qodo Default at 60%, and Qodo Extended at 71%, per the Benchmark claims.
• Cost narrative: The comparison asserts Qodo is ~10× cheaper per review, while also citing Claude’s token-based reviews at “$15–$25 per review,” with a separate reaction calling that price “impractical … regularly” in the Cost reaction.
The methodology details and exact judging criteria aren’t fully included in the tweets, so treat the result as directional until you can inspect the underlying benchmark artifacts.
Workflow checklist combines scenarios, CRAP score targets, and differential mutation
Agent-era test workflow (Uncle Bob Martin): A concrete loop is shared for keeping changes mergeable under high agent output: write acceptance scenarios for behavior changes, ensure they fail, then add unit tests until scenarios pass; for each changed module, refactor until “crap is 8 or less”; then run differential mutation tests module-by-module and “kill survivors,” with max-workers set to 3, per the Workflow checklist.
• Why mutation is emphasized: The workflow is motivated by the claim that mutation-style breaking and exploration finds bugs traditional TDD didn’t uncover, as described in the Bug-finding note.
The checklist is detailed enough to copy into team conventions (scenario-first gates, a numeric maintainability target, and a mutation budget), even if the exact CRAP threshold varies by codebase.
Incremental mutation testing pattern: mutate only what changed
Differential mutation testing (Uncle Bob Martin): A proposed optimization is to record what was mutation-tested in the last run and, on the next run, only mutate what changed—implemented by writing the “last tested” info into the module itself, per the Incremental mutator idea.
The motivation is that mutation testing is a CPU-heavy loop in practice, as described in the CPU hog note, so incremental scope is positioned as a way to keep mutation runs routinely usable on developer hardware.
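A rough sketch of the idea follows, simplified to a sidecar state file rather than writing the record into the module itself as the post suggests; run_mutation_tests() stands in for whatever mutation tool you use.

```python
# Rough sketch of "only mutate what changed": hash each module after a mutation
# run and skip unchanged modules next time. Simplified to a sidecar state file
# (the post suggests writing the record into the module itself); the
# run_mutation_tests callable stands in for your mutation tool.
import hashlib, json, pathlib

STATE_FILE = pathlib.Path(".mutation_state.json")

def differential_mutation(modules: list[pathlib.Path], run_mutation_tests) -> None:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    for mod in modules:
        digest = hashlib.sha256(mod.read_bytes()).hexdigest()
        if state.get(str(mod)) == digest:
            continue                   # unchanged since last run: skip the CPU-heavy step
        run_mutation_tests(mod)        # mutate + kill survivors for this module only
        state[str(mod)] = digest       # record "last tested" content
    STATE_FILE.write_text(json.dumps(state, indent=2))
```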
Mutation-site count proposed as a practical module-size limit
Mutation testing discipline (Uncle Bob Martin): A pragmatic heuristic is proposed to cap module size by mutation-test surface area—setting a hard limit of <=50 “mutation sites” per module, as stated in the Mutation-site threshold.
The framing is that mutation testing used to be too expensive to run frequently, but tooling and automation make it viable enough to use as a sizing constraint, per the same Mutation-site threshold.
Kilo reports $1.14 average cost per Opus 4.6 code review with no markup
Kilo Code Reviewer (Kilo Code): Kilo says 80%+ of reviews on its product use Opus, and reports an average cost of $1.14 per Opus 4.6 review, emphasizing “zero markup” and that users pay only LLM tokens, per the Pricing disclosure.
This is a concrete data point for teams comparing “token pass-through” review tooling versus bundled per-PR pricing.
Spec-led development pitches specs-in-CI as a backpressure mechanism for agents
Spec-led development (specleddev): A repository and framing describe “Specs in CI” as “agentic backpressure,” i.e., a small, enforced spec layer that constrains what gets merged as agent output scales, per the Spec-led framing and the linked GitHub repo.
The stated premise is that humans remain responsible for the spec loop (“Humans write software, not LLMs”), and CI is the enforcement point for coordination and drift control, according to the same Spec-led framing.
🔎 Retrieval & search stacks: late-interaction wins, multimodal embeddings, and GraphRAG skepticism
Retrieval discourse is unusually high-volume: late-interaction/multi-vector results vs single-vector embeddings, multimodal embedding rollout echoes, and continued skepticism about GraphDB/GraphRAG as default infrastructure.
Mixedbread Wholembed v3 posts outsized gains on structured “metadata-like” search
Wholembed v3 (Mixedbread): Mixedbread’s new retrieval model is being cited for an extreme jump on the LIMIT structured-data search benchmark—Recall@100 98.00 vs Gemini Embedding 2’s 6.90 in one shared comparison—re-igniting the late-interaction / multi-vector conversation following up on Embedding launch (Gemini’s multimodal embeddings preview), as shown in the Bench comparison and reinforced by posts arguing it “makes embedding models look like they don’t work” in this regime, per the same Bench comparison.
• What’s in the shared table: The same image also shows competitive results on agentic and document retrieval tasks (BrowseComp-Plus answer accuracy 64.82; ViDoRe V3 Markdown NDCG@10 62.29; ViDoRe V3 Crosslingual NDCG@10 60.02), indicating the claim isn’t only about one synthetic dataset, per the Bench comparison.
• Why LIMIT spikes are “all or nothing”: One explanation circulating is that LIMIT is packed with long attribute lists (“Tom likes X…”) and queries like “Who likes Scrabble?”, which tends to break single-vector semantic search while remaining easy for methods that preserve local token-level signals—often even BM25 does well here, per the LIMIT breakdown.
• Builder sentiment: Commentary frames this as validation that “multi-vector is going to win,” per the Multi-vector claim, with extra emphasis that the Gemini Embedding 2 baseline used for comparison was only “2 days old,” per the Baseline note (so treat the exact gap as provisional until there’s a stable eval artifact).
LIMIT’s “attribute list” pattern explains why late interaction can look dominant
LIMIT benchmark mechanics: A useful explainer notes that LIMIT is built from documents containing lots of “metadata-like” attributes and direct lookup queries (e.g., “Who likes Scrabble?”), which can cause single-vector retrieval to fail sharply and motivate multi-vector or lexical hybrids; the concrete description is in the LIMIT breakdown, and it matches the kind of failure mode implied by the LIMIT row in the Wholembed v3 comparison image, per the Bench comparison.
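For intuition on why the two approaches diverge on this kind of data, here is a toy contrast between single-vector scoring and MaxSim-style late interaction; the embeddings are random placeholders rather than outputs of any real encoder.

```python
# Toy contrast: single-vector scoring vs MaxSim-style late interaction. The
# embeddings are random placeholders, not outputs of any real encoder.
import numpy as np

def single_vector_score(q_vec, d_vec):
    # One pooled vector per side: local attribute signals ("likes Scrabble")
    # tend to get averaged away in long attribute-list documents.
    return float(q_vec @ d_vec)

def maxsim_score(q_tokens, d_tokens):
    # Late interaction: every query token keeps its own vector, matches its
    # best document token, and the per-token maxima are summed.
    sims = q_tokens @ d_tokens.T            # [n_query_tokens, n_doc_tokens]
    return float(sims.max(axis=1).sum())

q_tokens = np.random.randn(4, 128)          # e.g., "who likes scrabble ?"
d_tokens = np.random.randn(300, 128)        # long "metadata-like" document
print(maxsim_score(q_tokens, d_tokens))
print(single_vector_score(q_tokens.mean(axis=0), d_tokens.mean(axis=0)))
```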
Graph databases for RAG are still optional infrastructure, not a default
GraphDB skepticism: A recurring stance in retrieval circles is that GraphDB/GraphRAG is often overkill versus simpler stacks (including plain Postgres) and can add complexity without measurable gains; that position is captured in the GraphDB take and expanded in the linked Video explainer.
SMVE: turning multi-vector retrieval into sparse vectors for scale
SMVE (TopK): A new write-up describes “Sparse Multi-Vector Encoding” as a way to make late-interaction / MaxSim-style retrieval practical at scale by converting multi-vector representations into sparse vectors, aiming to reduce the usual storage and compute pain points; the approach and motivation are summarized in the SMVE post and detailed in the linked Blog post.
No single silver bullet: retrieval for agents needs hybrid + multimodal thinking
Hybrid retrieval framing: A “bm25 guy” argument making the rounds is that agents change query dynamics (they iterate and reformulate relentlessly), but you still have to make information retrievable in the first place—and a lot of enterprise context isn’t text, so embeddings (increasingly multimodal) remain important even if you rely heavily on keyword search; see the full reasoning in the Hybrid retrieval thread.
Retrieval after RAG: hybrid search and infra choices from Turbopuffer
Turbopuffer (Retrieval infra): An interview is circulating that frames “retrieval after RAG” as an infra and cost problem (not just embeddings quality), claiming large cost reductions for production users—e.g., “Cursor cut costs by 95%” is cited in the Podcast note and expanded in the Interview page.
📊 Benchmarks & leaderboards: webdev arenas, hallucination indices, and pushback tests
A lot of the news is meta-evaluation: multi-leaderboard comparisons across coding, hallucinations, instruction-following, and “push back on nonsense” behavior. Excludes CursorBench methodology (covered separately).
Artificial Analysis: Grok 4.20 Beta posts record-low hallucination and tops IFBench
Grok 4.20 Beta (xAI): Artificial Analysis reports 22% hallucination rate on AA-Omniscience (lower is better), #1 on IFBench at 82.9%, and ~265 output tokens/sec on xAI’s API, framing this as a large jump in “don’t make things up” behavior plus strong prompt adherence, per the benchmark breakdown charts.
• Index + cost-to-eval: the full write-up says Grok 4.20 (reasoning) scores 48 on the Artificial Analysis Intelligence Index (up +6 vs Grok 4) and that running their index cost $484 at the new $2/$6 per 1M input/output tokens pricing, per the analysis summary.
• Context window signal: multiple posts emphasize the 2,000,000-token context alongside these evals, but also note a remaining coding index gap vs GPT-5.4/Gemini/Claude in separate charts, as shown in the pricing table and coding index chart.
GPT-5.4-high enters Code Arena top 6 for WebDev with Codex harness
Code Arena (Arena): gpt-5.4-high configured with the Codex harness shows up at #6 on WebDev overall with a 1460 score, sitting just above Gemini 3.1 Pro Preview at #7 (1457) and below multiple Claude 4.6 variants at the top of the table, as shown in the leaderboard post screenshot.
• What’s being measured: the post calls out specific subtracks where gpt-5.4-high is #6 for Multi-File React and top 10 for Single-File HTML, per the same leaderboard post.
• Community read: one take notes this leaderboard appears “mostly frontend dev,” which may explain different relative standings than coding-agent-heavy boards, according to the frontend skew comment.
BridgeBench: Grok 4.20 Multi-Agent takes #1 with 100% completion
BridgeBench (BridgeMind): the posted table ranks Grok 4.20 Multi-Agent (4-agent) at #1 overall (96.1) with 100% completion and 87.8s latency, with the 16-agent variant close behind at #2 (95.9), based on the results table.
• Latency tradeoff: the same table shows GPT-5.4 at #3 (95.5) but with much higher reported latency (704.4s), per the results table.
• Single-model baseline: a separate post places Grok 4.20 Beta (non multi-agent) around #6 overall (93.4) with 59s latency and 88.5% completion, according to the beta placement post.
BullshitBench v2: Grok 4.20 jumps up the “push back on nonsense” rankings
BullshitBench v2 (Peter Gostev): the updated leaderboard shows Grok 4.20 jumping from 54th/72nd (prior Grok 4.1 placements) up to roughly 13th–16th, while the author notes a reasoning-heavy run cost about $75 yet scored a few points lower than a cheaper setting, according to the leaderboard update.
• Benchmark adoption: the repo crossed ~1,000 GitHub stars shortly after launch, as shown in the star history chart.
• Reproducibility: the maintainer links both a public data viewer and the GitHub repo for questions + scoring artifacts in the viewer and repo links.
Vals Index ranks Grok 4.20 Beta #13 overall with $0.28/test cost
Vals Index (ValsAI): Grok 4.20 Beta (reasoning) lands at #13 overall with 58.05% ± 1.98 accuracy, an estimated $0.28 per test, and 85.42s latency, as shown in the Vals Index screenshot.
• Where it looks stronger: Vals calls out #4 on their SWE Bench split (72.55%) and a +10 pp improvement on Terminal Bench 2 vs previous Grok models, per the Vals Index screenshot.
LisanBench: Grok 4.20 scores about the same as Grok 4 with fewer tokens
LisanBench: following up on LisanBench (new coding/model arena), a comparison chart shows Grok 4.20 Beta scoring 3786 vs Grok 4 at 3885, with the claim that Grok 4.20 is cheaper/faster and uses fewer tokens (about 9k vs 11.7k), per the LisanBench chart post.
The posted evidence is a single chart + token count note; no shared eval harness details are included in the thread.
WeirdML scatterplot puts GPT-5.4 (xhigh) near the accuracy frontier at high token use
WeirdML model comparison: an interactive scatterplot highlights gpt-5.4 (xhigh) at 77.7% average accuracy across 17 tasks while using 71,878 output tokens (tooltip also shows $5.7199 cost and 83.0s median exec time), framing an explicit “accuracy vs tokens” tradeoff, per the scatterplot tooltip.
This is a different lens than leaderboard ranks: it’s token-heavy by construction, but makes token economics visible when comparing near-frontier models.
🏗️ Compute economics & supply constraints: packaging/HBM bottlenecks and metered intelligence
Infra signals today are about constraints and pricing models: packaging/HBM scarcity, GPU supply narratives, and “intelligence as a utility” framing that affects how teams budget inference-heavy agent workloads.
Epoch AI estimates ~90% of advanced packaging + HBM was consumed by top AI chip designers in 2025
Advanced packaging & HBM (Epoch AI): A new estimate says the four largest AI chip designers consumed ~90% of global CoWoS advanced packaging and HBM supply by value in 2025, implying these were the binding constraints (not logic dies), as shown in the supply share chart.
• Why the split matters: the same analysis shows advanced logic dies remained mostly “other” demand (NVIDIA at ~9%); the practical read is that scaling inference/training capacity is gated by memory and packaging throughput more than foundry wafer capacity in the near term, per the supply share chart.
• Method signal: Epoch flags they modeled manufacturing lags and inventory timing, adding detail in the methods note.
Sam Altman repeats “intelligence as a utility” framing and ties it to extreme long-run reasoning spend
OpenAI token economics (OpenAI): Sam Altman described the core business model as “selling tokens,” positioning “intelligence as a utility” where people buy it “on a meter,” as captured in the metered utility quote.

• Long-horizon spend: he also says some future high-stakes tasks could rationally spend tens/hundreds of millions—and eventually billions—on a single problem, according to the long-reasoning clip.
• Incentive alignment: in the same cycle of interviews, he frames OpenAI’s custom chip goal as cheapest/most power-efficient inference (not peak speed), which aligns token-metering with energy-per-answer constraints, per the inference chip goal.
Jensen Huang frames custom ASICs as “science projects” versus NVIDIA’s full AI factory platform
NVIDIA vs custom ASICs (NVIDIA): In a financial analyst Q&A clip, Jensen Huang argues that a custom chip effort is a “science project” while NVIDIA is shipping revenue-producing “AI factories,” with the real moat being the integrated platform (silicon + packaging + software + roadmap), as recapped in the Jensen ASIC remarks.

The claim is directional rather than a spec drop, but it’s a clear procurement narrative: reduce appetite for bespoke inference/training ASIC bets when roadmap pace and packaging/HBM constraints are moving targets.
Compute scarcity talk shifts toward market mechanisms: “bidding for AI compute like ads”
Compute scarcity (ecosystem): One thread argues that if “barely 0.1%” of people use AI full-time and supply already feels exhausted, demand could rise “10000%,” pushing the market toward “bidding for AI compute, like bidding for ads,” as claimed in the compute bidding take.
• Token pressure from agents: Aaron Levie adds a concrete driver—frontier agent use-cases already using ~100× more tokens than a year ago, with long-running background agents poised to expand that load beyond coding, per the token usage expansion note.
These posts are speculative rather than measured, but they match the lived budgeting story: token-metered products create feedback loops where better agents directly translate into higher steady-state inference demand.
🚢 Model & capability drops (non-Grok): retrieval, vision, and editing speedups
Outside Grok 4.20, today still has several notable model/capability updates: retrieval models, image editing acceleration, stealth model listings, and open-weight comparisons. Excludes Grok 4.20 (covered separately).
Mixedbread Wholembed v3 pushes multi-vector retrieval with outsized gains on structured search benches
Wholembed v3 (Mixedbread AI): Mixedbread’s new retrieval model is framed as an “omni” multi-vector / late-interaction system across modalities and 100+ languages, with shared benchmark screenshots showing extremely large deltas on structured “metadata-like” retrieval tasks, as highlighted in the Benchmark table post and reinforced by practitioner reactions in the Multi-vector praise.
• Notable metric: The posted table shows 98.00 Recall@100 on LIMIT “structured data search,” compared to 6.90 for Gemini Embedding 2 and 8.95 for a Voyage baseline, as shown in the Benchmark table post.
• Broader retrieval claim: The same table shows gains on BrowseComp-Plus “agentic search” and ViDoRe document search metrics, per the Benchmark table post.
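As a rough illustration of what the "multi-vector / late-interaction" framing means mechanically (this is the generic ColBERT-style MaxSim pattern, not Mixedbread's actual Wholembed v3 implementation, and the shapes and normalization are assumptions):

```python
import numpy as np

# Minimal late-interaction (MaxSim) scoring sketch: each query token embedding
# is matched to its best document token embedding, then the per-token maxima
# are summed. This illustrates the general multi-vector idea only.

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (num_query_tokens, dim); doc_vecs: (num_doc_tokens, dim)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))                      # e.g. 8 query token embeddings
docs = [rng.normal(size=(200, 128)) for _ in range(3)]  # 3 candidate documents
ranked = sorted(range(len(docs)), key=lambda i: maxsim_score(query, docs[i]), reverse=True)
print("ranking:", ranked)
```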
FLUX.2 [klein] 9B gets a ~2× speedup for multi-reference image editing via KV-caching
FLUX.2 [klein] 9B (Black Forest Labs): Image editing gets up to a ~2× speedup (and sometimes more) when you supply multiple reference images, achieved by KV-caching the reference encodings so the model skips redundant work; quality and pricing are positioned as unchanged, per the Speedup announcement and follow-up rollout details in the API and weights note. A conceptual sketch of the caching idea follows the bullets below.
• What changes in practice: Multi-reference edit workflows (character/object consistency, style transfer with several refs) should see the biggest gains because the cache amortizes the cost of processing reference inputs, as explained in the Speedup announcement.
• Deployment detail: BFL also points to FP8 quantized weights and a “free upgrade” path for API users, according to the API and weights note.
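The caching happens server-side, but the underlying idea (encode each reference once, reuse the result across edit calls) is simple to sketch; the function names below are hypothetical placeholders, not BFL's API.

```python
import hashlib

# Conceptual sketch of the idea behind the FLUX.2 [klein] speedup: pay the cost
# of encoding each reference image once, then reuse the cached encoding across
# edit calls. encode_reference() and run_edit() are hypothetical placeholders.

_ref_cache: dict[str, object] = {}

def _key(image_bytes: bytes) -> str:
    return hashlib.sha256(image_bytes).hexdigest()

def get_reference_encoding(image_bytes: bytes, encode_reference):
    """Return a cached encoding for a reference image, computing it only once."""
    k = _key(image_bytes)
    if k not in _ref_cache:
        _ref_cache[k] = encode_reference(image_bytes)  # the expensive step
    return _ref_cache[k]

def edit_with_references(prompt: str, reference_images: list[bytes], encode_reference, run_edit):
    """Only the edit itself pays full cost per call; reference encodings are amortized."""
    encodings = [get_reference_encoding(img, encode_reference) for img in reference_images]
    return run_edit(prompt, encodings)
```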
Nemotron 3 Super consolidates as a practical open-weights baseline, with free OpenRouter access
Nemotron 3 Super (NVIDIA): Builder posts increasingly treat Nemotron 3 Super as a default open-weights “intelligence baseline,” with OpenRouter offering a free endpoint that highlights 1M context and MoE-style efficiency, as documented in the OpenRouter model page and summarized via benchmark commentary in the Open-weights index post.
• Bench signal: The Artificial Analysis “open weights” index graphic places Nemotron 3 Super at 36 (with some peers cited at 42/39/33), per the Open-weights index post.
• Operational shape: The OpenRouter listing emphasizes long context (1M) plus an "activates ~12B"-style MoE compute framing, which is spelled out in the OpenRouter model page.
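For teams that want to try it as a baseline, a minimal call against OpenRouter's OpenAI-compatible chat completions endpoint might look like the sketch below; the model slug is an assumption based on OpenRouter's usual naming (free variants typically carry a ":free" suffix), so confirm the exact ID on the model page.

```python
import os
import requests

# Minimal sketch: call the free Nemotron 3 Super endpoint through OpenRouter's
# OpenAI-compatible chat completions API. The model slug below is an assumption;
# confirm the exact ID on the OpenRouter model page before use.

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "nvidia/nemotron-3-super:free",  # assumed slug
        "messages": [{"role": "user", "content": "Summarize MoE inference tradeoffs."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```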
OB-1 opens general access, pitching “self-improving” and #1 Terminal Bench results
OB-1 (OpenBlock Labs): The OB-1 CLI coding agent moves to general access and is marketed as “the coding agent that built itself,” alongside a claim of ranking #1 on Terminal Bench, per the General access announcement.

• Benchmark positioning: The shared chart shows OB-1 at 82.5 vs Droid 77.3, Codex 75.1, and Claude Code 58.0, as shown in the General access announcement.
• Go-to-market detail: The launch includes a time-limited incentive of “$10/day in free credits,” according to the General access announcement.
OpenRouter’s stealth models list adds Hunter Alpha and Healer Alpha with 1M-context claims
OpenRouter stealth models: Two new unnamed-origin models—Hunter Alpha and Healer Alpha—show up in OpenRouter’s “stealth models” listings, with multiple posts repeating the claim that they’re free and offer ~1M context, as noted in the OpenRouter mention in release notes and the Stealth model speculation.

• What’s known vs. not: Community screenshots and summaries disagree on provenance and specs (one post asserts a May 2025 cutoff and makes strong origin guesses), which is visible in the Stealth model clip—so treat capabilities as unverified until OpenRouter (or a lab) publishes a real model card.
• Why engineers care: If the 1M-context claim holds, these are immediately relevant as long-context backends for agent memory/retrieval-heavy workloads, without the usual per-token cost tradeoffs.
Sentence Transformers v5.3.0 adds InfoNCE variants, hardness weighting, and new losses
Sentence Transformers v5.3.0: The release updates MultipleNegativesRankingLoss with alternative InfoNCE formulations and optional hardness weighting, and adds new losses (including GlobalOrthogonalRegularizationLoss and CachedSpladeLoss), per the Release announcement and detailed notes in the Release notes.
• Why it matters for retrieval stacks: These knobs directly affect how quickly you can iterate on embedding training recipes (hard-negative emphasis, symmetric directions) without retooling your pipeline, as shown in the Release announcement.
• Forward-looking note: The maintainer also teases multimodal support in a future v5.4.0, according to the Roadmap note.
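For context, a minimal contrastive fine-tuning loop with MultipleNegativesRankingLoss looks like the sketch below; the v5.3.0 additions (alternative InfoNCE formulations, hardness weighting, the new losses) are exposed as loss options per the release notes, but their exact argument names aren't shown in the posts, so this sticks to long-standing defaults.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Minimal sketch: in-batch-negatives (InfoNCE-style) fine-tuning with
# MultipleNegativesRankingLoss. The v5.3.0 knobs mentioned in the release
# notes are configured on the loss; argument names are not shown in the posts,
# so they are omitted here.

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

train_dataset = Dataset.from_dict({
    "anchor":   ["what is late interaction retrieval?"],
    "positive": ["Late interaction scores query and document token vectors pairwise."],
})

loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```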
DeepSeek v4 “imminent” rumor resurfaces, with builders watching for another open-weights shock
DeepSeek v4: A model-rumor thread claims DeepSeek v4 is “imminent,” explicitly pointing back to the v3 playbook (open weights, frontier-ish performance, lower cost) and arguing v4 could re-pressure frontier API pricing, as framed in the Release rumor post.
There’s no primary release artifact in the tweets (no model card, API post, or weights link), so treat the timeline and specs as speculative while the market watches for confirmation.
Pony Alpha 2 teased in preview with “faster than GLM-5” and intent-inference improvements claimed
Pony Alpha 2 (Z.ai): A researcher with access says Pony Alpha 2 is coming and reports qualitative deltas—“much faster than GLM 5,” “less sycophant,” and better at “inferring intent” and semantic reasoning—while noting they don’t have a release date or clarity on whether it’s a new framework vs a GLM-5 checkpoint, per the Preview notes.
A separate Z.ai post suggests interested testers can request access via DM, as shown in the Access interest post.
🎬 Generative video & image pipelines: Sora 2 API, reference-to-video, and edit acceleration
Generative media is a meaningful chunk today: video APIs, reference-driven generation, and practical workflow tips for consistency—relevant for teams building creative tools or branded content pipelines.
OpenAI expands Video API with Sora 2 features: 20s clips, continuation, batch jobs
Video API (OpenAI): OpenAI shipped new Video API capabilities powered by Sora 2; the update adds clip exports up to 20 seconds, video continuation for extending scenes, batch jobs for parallel generation, plus 16:9 and 9:16 output and support for custom characters and objects, as outlined in the feature list post.

• What changes for pipelines: “custom characters/objects” + continuation + batch jobs combine into a more production-shaped API surface (asset reuse, then scale rendering) rather than single-shot prompts, per the feature list post.
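A minimal generate-and-poll sketch against the Video API might look like the following; the post doesn't show parameter names for the new 20-second, continuation, or batch features, so treat fields like seconds and size as assumptions to verify against the current SDK reference.

```python
import time
from openai import OpenAI

# Minimal generate-and-poll sketch against the OpenAI Video API (Sora 2).
# The update advertises 20s clips, continuation, and batch jobs, but the post
# doesn't show those parameter names; field names below are assumptions to
# check against the current SDK reference.

client = OpenAI()

video = client.videos.create(
    model="sora-2",
    prompt="A slow dolly shot across a rain-soaked neon street at night.",
    seconds="8",       # assumed duration field; the update advertises up to 20s
    size="1280x720",   # 16:9; the update also lists 9:16 output
)

while video.status in ("queued", "in_progress"):
    time.sleep(5)
    video = client.videos.retrieve(video.id)

if video.status == "completed":
    content = client.videos.download_content(video.id)
    content.write_to_file("clip.mp4")
```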
Grok Imagine adds reference-to-video: up to 7 images with @-style control
Grok Imagine (xAI): Grok Imagine added a reference-driven video feature that accepts up to 7 reference images, using @ references in the prompt to pull in specific characters/objects and keep appearance/style consistent across the generated clip, per the product walkthrough and the prompting example.

• Prompt ergonomics: creators are writing prompts with explicit “Action/Camera/Lighting/Sound” blocks while binding refs via @-tokens, as shown in the prompting example.
Sora adds References to keep characters, style, and props consistent across clips
Sora References (OpenAI): OpenAI released References in Sora, letting creators anchor generation on specific characters, styles, props, and camera moves so those elements stay consistent across multiple clips, as described in the feature explanation.

The tweets frame this as closing a long-standing consistency gap for multi-clip workflows, but no API surface or pricing detail appears in the posts shared today.
FLUX.2 [klein] 9B editing gets ~2× faster via KV-caching for multi-reference
FLUX.2 [klein] 9B (Black Forest Labs): BFL says FLUX.2 [klein] 9B is now about 2× faster for image editing—especially with multiple reference images—by using KV-caching to skip redundant computation on the reference set, with “same quality, no price increase” claims in the speedup announcement and additional rollout details in the upgrade post.
The thread also points to newly released weights (including FP8 quantized) and docs via the model weights and the API docs.
Seedance 2.0 “asset recycling” workflow for higher-consistency ad-style videos
Seedance 2.0 workflow: A repeatable pattern is emerging for video consistency: generate an initial clip, then extract frames/audio/clips as new references and re-prompt (“generate → extract → repeat”), with Seedance 2.0 called out as supporting up to 9 references across image/video/audio inputs in the workflow thread and demonstrated in the commercial-style demo.

• Cross-model comparison signal: one creator reports running the same prompt through Sora 2 and Kling 3.0 and then comparing to Seedance 2.0 via Flowith Canvas, with qualitative notes in the Flowith comparison clip.
🛡️ Security & policy collisions: agent risk, AI data-center bans, and copyright limits
Today’s security/policy beat is about operational constraints on AI deployment: proposals to restrict data centers, rising concern about agent-driven security incidents, and legal rulings that affect AI-generated content workflows.
Bernie Sanders introduces bill to ban new AI data centers
US policy (Bernie Sanders): Sanders introduced legislation to ban construction of all new AI data centers, framing it as an existential-risk response, as described in the bill announcement clip.

• Feasibility and geopolitics pushback: critics argue a moratorium is not enforceable in practice and would shift advantage to China, as laid out in the anti-moratorium argument and echoed via an “economic harm + China catch-up” framing in the Karp rationale video.
Aaron Levie warns prompt-injected agents could trigger major security incidents
Agent security (Box): Aaron Levie says agents will outnumber humans “by several orders of magnitude,” and that “spectacular” security incidents can happen when prompt-injected agents traverse systems and exfiltrate data they shouldn’t access, as shown in the Levie warning clip.

• Why the risk surface is widening: Levie also notes frontier use cases are already ~“100X” more token-hungry than a year ago, with long-running agents burning inference capacity across knowledge work, per the token growth estimate. The combination implies more autonomous actions per day, more connectors touched, and more opportunities for injection-style failures.
Report: Chinese state-linked orgs told to remove OpenClaw from office computers
China security action (OpenClaw): A report claims Chinese state-run enterprises, government agencies, and major banks received an urgent directive to restrict or remove OpenClaw AI applications from office computers, citing security risks and requiring internal audits plus employee disclosure of prior installs, per the ban report clip.

This is presented as a security-driven clampdown tied to rapid “OpenClaw” adoption inside enterprises; the reporting in these tweets doesn’t include a primary government document or scope numbers.
US Supreme Court leaves “human author required” rule in place for AI-only works
US copyright (AI-generated works): The US Supreme Court declined to hear a challenge over whether art generated entirely by AI can receive copyright protection, leaving lower-court rulings that require a human creator, as summarized in the Reuters decision recap.
The case description in the tweets centers on Stephen Thaler’s 2018 attempt to register an AI-generated image; the practical implication is that “AI-only” outputs remain hard to protect under US copyright without a human authorship claim.
🧱 Silicon roadmaps: custom inference chips and edge boxes for builders
Hardware mentions today are mostly about inference-focused custom silicon and developer-accessible compute boxes—relevant for teams thinking about cost-per-token and deployment footprints.
Meta outlines four-generation MTIA chip roadmap aimed at GenAI inference scale
MTIA (Meta): Meta disclosed plans for four new generations of its in-house Meta Training and Inference Accelerator chips, positioning MTIA 400 as the inference workhorse for GenAI features and large-scale ranking/recommendation workloads, as summarized in the MTIA chip report.
• Roadmap shape: The forward plan is described as MTIA 300 (ranking/recs in production) followed by MTIA 400/450/500 targeting rising GenAI inference and training demand through 2027, per the roadmap details.
• Builder implication: This signals more “model-specific” performance-per-watt tuning (and less generic GPU dependence) for Meta’s internal serving fleet, with the emphasis explicitly framed around high-volume inference in the MTIA chip report.
OpenAI frames its custom chip as an inference-efficiency play for agent workloads
Custom inference chip (OpenAI): Sam Altman said OpenAI’s chip goal is not peak speed, but being “the cheapest and most power-efficient for inference,” explicitly tying it to always-on agent demand and power constraints, as stated in the clip on inference efficiency.

• Workload assumption: The argument is that future agents create “massive demand for constant inference at scale,” making efficiency per watt a primary competitive axis, per the clip on inference efficiency.
• Business model fit: Altman’s “selling tokens” framing—intelligence sold “on a meter” like utilities—sets up why inference cost structure matters, as described in the metered tokens quote.
DGX Spark lands with builders as a small, hands-on training and data-labeling box
DGX Spark (NVIDIA): A DGX Spark dev box is showing up in individual hands, with one builder describing plans to use it for dataset labeling and later training/open-sourcing models, while first experimenting with Nemotron-3 Super, per the DGX Spark in the wild.
This is a small but concrete signal that “edge-ish” GPU boxes are becoming part of the workflow for solo researchers and small teams—at least for data work and local experimentation—based on the usage described in the DGX Spark in the wild.
💼 Enterprise moves & traction: agent platforms, valuations, and adoption signals
Business signals today center on agent platforms becoming products: big ARR/series updates, platform partnerships (Notion/Vercel), and valuation/raise chatter around major dev tools.
Genspark AI Workspace 3.0 launches “AI employee” Claw and cites ~$200M ARR
Genspark AI Workspace 3.0 (Genspark): Genspark announced Workspace 3.0 and positioned it as a shift from “AI tools you use” to “AI employees you hire,” centered on Genspark Claw running on a dedicated cloud computer; the company also claimed $200M annual run rate in ~11 months and said it extended its Series B to $385M, as stated in the launch announcement.

• Product framing: Claw is described as a persistent agent that executes work “across the apps and surfaces where work happens,” paired with a one-click “Cloud Computer” concept, per the product overview.
• Go-to-market signal: the post explicitly ties the product shift to revenue momentum (the ARR claim plus the extended round), which is the main concrete traction data point shared so far in the same launch thread.
Notion Workers use Vercel Sandbox to run untrusted code at scale
Notion Workers on Vercel Sandbox (Vercel/Notion): Vercel says Notion’s developer platform executes untrusted code via Vercel Sandbox, with Workers used for sync/automation/API calls, as described in the product note and detailed in the architecture blog post.
The positioning frames “run user code safely” (microVM isolation plus secret-handling patterns) as a core enabler for agent-style extensions inside Notion, consistent with the broader claim that Notion sits at the center of “agent runtime = docs/specs” workflows in the founder commentary.
Replit reportedly raises $400M Series D at ~$9B valuation
Replit (Replit): A report circulated that Replit closed a $400M Series D led by Georgian, tripling its valuation to $9B in six months, with a mix of institutional and strategic investors listed in the funding recap.

The claim matters as a market signal: it implies sustained demand for an “agentic dev platform” category at late-stage pricing, even while many agent products are still sorting out reliability and cost narratives.
Cursor is rumored to be raising at ~$50B valuation
Cursor (Anysphere): A rumor circulated that Cursor is in talks to raise a new round at a $50B valuation, per the valuation chatter.
No terms or primary sourcing were included in the tweet itself, so treat it as market noise until a named source or filing appears; still, it’s a useful read on consolidation pressure around AI-native developer tooling.
Modal passes 1 billion launched sandboxes and cites agent/RL infra demand
Modal Sandboxes (Modal): Modal reported that 1B+ sandboxes have been launched since it started three years ago, and named multiple agent/coding/RL-heavy customers as drivers in the milestone post.

This is a straightforward traction datapoint for “ephemeral compute” as a default primitive for agent execution, especially when paired with isolation requirements.
Notion signals early use of open-weight models with MiniMax
Notion open-weight models (Notion/MiniMax): A Hugging Face post claimed Notion “rolled out their first open weight models with MiniMax,” framing it as a hedge against proprietary API cost and competitive risk, according to the open-weight mention.
The tweet doesn’t include model names, deployment scope, or performance details; it’s mainly a directional signal that a major app is experimenting with optionality beyond single-provider APIs.
Mistral announces AI Now Summit (Paris, May 28) focused on enterprise transformation
AI Now Summit (Mistral AI): Mistral announced its first flagship event in Paris on May 28, pitching it as an “own your AI transformation” day with themes like open source in enterprise deployments, scaling pilots to production, infra, and robotics/VLMs, per the event announcement.
This is an enterprise demand signal: Mistral is explicitly packaging “transformation” guidance (not just models) as a product surface.
Baseten hires Matt Slagle to lead global revenue org
Baseten (Baseten): Baseten announced Matt Slagle will lead its global revenue organization, per the hire announcement and the longer rationale in the company announcement post.
It’s a classic commercialization signal: inference platforms are staffing up for enterprise procurement cycles, not just developer-led adoption.
🤖 Robotics & embodied demos (light): dexterous hands and quadrupeds
Robotics content is lighter than software/tooling today, but there are multiple notable demos of dexterity and locomotion—useful for leaders tracking embodied AI trajectories without deep technical detail.
ChangingTek shows a reconfigurable robotic hand that flips left-to-right in real time
ChangingTek Robotics: A demo clip shows a dexterous robotic hand that can reconfigure between left-hand and right-hand configurations while also demonstrating high range-of-motion finger articulation and tool-like grasps, as described in the hand demo thread and re-shared in the short demo clip. One claimed spec in the post is a joint movement speed of 230°/sec, with a tendon-cord driven design and “exceeds human degrees of freedom” framing in the hand demo thread.

For AI/robotics leaders, this is mostly a capability signal about mechanical versatility (and the control stack implied by it), but the tweets don’t include details on autonomy level, sensing, or controller training—treat it as a hardware/dexterity demo until those show up.
Unitree G1 demo shows the robot hopping onto a board and riding a sidewalk
Unitree G1: A short clip shows the robot hopping onto a skateboard-like board and riding along a sidewalk, with no human controller mentioned in the post, as surfaced via the skateboard clip.

This reads as a practical balance + disturbance-handling demonstration (cracks, turns, transitions) rather than a new product announcement; there are no published controller details, sensors used, or repeatability metrics in the tweet itself. Still, it’s a clean “real-world stability” datapoint for anyone tracking how fast legged platforms are expanding beyond flat-lab demos.
Deep Robotics turns a quadruped into a “robot horse” bionic makeover demo
Deep Robotics: A short video shows a quadruped platform modified into a horse-like form factor (“robot horse”), positioned as a bionic makeover of its M20 Pro, as described in the robot horse post and reacted to in the confused but curious reaction.

The visible novelty here is the packaging and gait presentation rather than a new autonomy claim; the posts don’t specify perception, on-board compute, or learned vs scripted locomotion. For product and research tracking, it’s a reminder that form-factor demos are increasingly part of go-to-market storytelling for embodied systems.
🎙️ Voice agent stacks: co-located pipelines, real-time STT/TTS knobs, and model partnerships
Voice news is mostly platform plumbing: running STT+LLM+TTS in one place to reduce handoff latency, plus operational details like commit strategies and model partnerships.
Together AI ships a unified, co-located voice stack (STT + LLM + TTS)
Together AI (Together): Together says its Voice Platform now runs the full real-time pipeline in one cloud/cluster—speech-to-text, LLM, and text-to-speech—so the handoffs stay local to the same infra surface, per the launch summary; it is positioned as a latency and ops simplification move, with a single billing/deploy surface.
The product framing and implementation details are expanded in the platform blog, which claims sub-700ms end-to-end latency via co-location and highlights native hosting of third-party models (including Cartesia for TTS and Deepgram for STT). Model swapping across the stack is also described in the launch summary.
ElevenLabs explains commit strategies for real-time transcription streams
Scribe v2 Realtime (ElevenLabs): ElevenLabs laid out how their streaming transcription distinguishes partial vs committed transcripts, where commits shape downstream structure and latency, as described in the commit strategies explainer. It’s a practical interface contract: interim text is unstable; committed segments are “finalized.”
They also documented a preferred auto-commit mode using Voice Activity Detection with tunable thresholds (silence seconds, VAD threshold, min speech/silence durations), as shown in the VAD config snippet. The posts also note rapid consecutive commits can degrade performance, per the commit strategies explainer.
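As an illustration of that interface contract, here is a hedged sketch of a VAD auto-commit config and a handler that only acts on committed segments; the parameter names are placeholders modeled on the knobs named in the posts, not confirmed ElevenLabs field names.

```python
# Illustrative sketch of the partial-vs-committed contract described above.
# Parameter names are placeholders modeled on the knobs named in the posts
# (VAD threshold, silence duration, min speech/silence); check the Scribe v2
# Realtime docs for the actual field names.

vad_commit_config = {
    "commit_strategy": "vad",       # auto-commit on detected end of speech
    "vad_threshold": 0.5,           # speech-probability cutoff (placeholder name)
    "silence_duration_s": 0.8,      # silence that triggers a commit (placeholder name)
    "min_speech_duration_s": 0.25,  # ignore very short bursts (placeholder name)
}

def handle_transcript_event(event: dict, state: dict) -> None:
    """Treat interim text as unstable; only act on committed segments."""
    if event.get("type") == "partial":
        state["preview"] = event["text"]          # safe to display, not to act on
    elif event.get("type") == "committed":
        state.setdefault("final_segments", []).append(event["text"])
        # Downstream steps (LLM turn, tool calls) should key off commits only;
        # the posts note rapid consecutive commits can degrade performance.
```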
Cartesia becomes a first-class model partner on Together’s Voice Platform
Cartesia (Cartesia): Cartesia announced it’s now a dedicated model partner on Together’s Voice Platform, aligning TTS distribution with Together’s “single cloud” voice pipeline story, as stated in the partnership post. This is a go-to-market signal more than a new model release.
The integration sits inside the broader stack Together described—co-located STT/LLM/TTS and model swapping—per the platform announcement. There’s no public performance or pricing delta called out in these tweets; it reads as a routing/availability partnership rather than a new capability claim.
MiniMax Speech 2.6 Turbo is now part of Together’s voice stack
MiniMax Speech 2.6 Turbo (MiniMax): MiniMax says its Speech 2.6 Turbo model is now part of Together’s voice stack, with the pitch that real-time voice agents are “fast enough to feel conversational,” per the integration note. This is a distribution/hosting update.
Together’s own Voice Platform announcement lists a co-located pipeline and a roster of hosted voice models (MiniMax included), as shown in the platform graphic; the additional infra context for why Together is co-locating the pipeline appears in the platform blog.