OpenAI Batch API adds GPT Image jobs – 50% lower cost, 50,000 edits


Executive Summary

OpenAI extended its Batch API to GPT Image models (including gpt-image-1.5 and chatgpt-image-latest); positioning is pure async throughput—~50% lower cost, separate rate limits, up to 50,000 generations/edits per job with a 24h completion window. It’s a clear split between interactive UX and queued production runs; the promise is unit-economics + scale, but no new quality or latency claims beyond the batching surface.

OpenClaw ecosystem: OpenClaw 2026.2.21 beta adds Gemini 3.1 plus Discord voice/streaming and “massive security hardening”; a shared chart claims ~13% of OpenRouter tokens; NanoClaw pitches container isolation and a minimal core, reportedly ~10.5K stars.
Plan + concurrency pricing pressure: ChatGPT “Pro Lite” shows prolite at $100/month in checkout JSON and web bundle strings; browser_use also launched a $100/mo Starter tier with 50 concurrent sessions and $7.50/GB proxy data; Antigravity users cite “wait for hours” and ask for BYO Gemini 3.1 Pro API keys.
Agent reliability vs parallelism: Claude Code’s --worktree becomes a default for some; compaction can still hit irrecoverable “Conversation too long” states despite /context headroom (121k/200k shown while messages alone total 279.6k); one post warns subagents may be broken on the latest release.


Feature Spotlight

Claw ecosystem spike: OpenClaw release, integrations, and lightweight clones

OpenClaw’s latest release bundles Gemini 3.1, Discord voice/streaming, and security hardening—while usage and “lightweight Claw” clones surge, pushing personal-agent ops and integration standards forward fast.

High-volume day for OpenClaw/“Claw” agents: a new OpenClaw release (Gemini 3.1 + Discord voice + security hardening), growing usage signals, integration norms, and strong interest in minimal/containerized clones. This category owns all Claw/OpenClaw/NanoClaw items (excluded from other sections).



🦞 Claw ecosystem spike: OpenClaw release, integrations, and lightweight clones

High-volume day for OpenClaw/“Claw” agents: a new OpenClaw release (Gemini 3.1 + Discord voice + security hardening), growing usage signals, integration norms, and strong interest in minimal/containerized clones. This category owns all Claw/OpenClaw/NanoClaw items (excluded from other sections).

OpenClaw maintainer pushes integrations toward PRs and skills/plugins, not calls

OpenClaw integrations (community): As inbound “claw integration” requests pile up, the maintainer stance is getting explicit: token providers should send a PR or email, and services should ship a skill/plugin rather than lobbying for core feature additions, according to the [integration guidance](t:9|integration guidance).

Why it matters operationally: This is a governance signal—OpenClaw wants a contribution path that keeps the core small and avoids feature accretion driven by “visibility” asks, as framed in the same [integration guidance](t:9|integration guidance).

Building a local “memory stack” for OpenClaw: Slack+docs into vector search and a knowledge graph

OpenClaw memory stack (pattern): One build log describes wiring Slack/meetings/docs into vector search plus a knowledge graph on a desk-side box (DGX Spark), so the agent can query it in plain English, per the [DGX Spark post](t:226|DGX Spark post).

This is a concrete example of Claws evolving into local “personal data planes,” not just chat-driven tool callers, as described in the [DGX Spark post](t:226|DGX Spark post).
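
To make the shape of such a stack concrete, here is a minimal, self-contained sketch of the pattern (not the DGX Spark build itself): toy hashing-based embeddings, cosine-style vector search, and a tiny relation store standing in for the knowledge graph. Every name and value below is an illustrative assumption.

```python
# Toy sketch of a "vector search + knowledge graph" memory stack.
# The hashing-based embedding, schema, and sample data are illustrative
# assumptions, not the actual DGX Spark build described above.
import hashlib
import math
from collections import defaultdict

def embed(text: str, dim: int = 64) -> list[float]:
    # Deterministic stand-in for a real embedding model.
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

docs: list[tuple[str, list[float]]] = []        # (text, embedding)
graph: dict[str, set[str]] = defaultdict(set)   # entity -> related entities

def ingest(text: str, entities: list[str]) -> None:
    docs.append((text, embed(text)))
    for a in entities:
        for b in entities:
            if a != b:
                graph[a].add(b)

def search(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(docs, key=lambda d: -sum(x * y for x, y in zip(q, d[1])))
    return [text for text, _ in ranked[:k]]

ingest("Slack #infra: rotate the staging DB credentials on Friday", ["staging DB", "credentials"])
ingest("Meeting notes: the desk-side box will host the memory stack", ["desk-side box", "memory stack"])
print(search("when do we rotate credentials?"))
print(graph["memory stack"])
```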

Claw-style agents reframed as the “doing” layer above phone OS apps

Agent-as-UI shift (Claws): Ethan Mollick argues that for many “doing” tasks, he’d rather talk to a good agent than use apps—and says “a good Claw can already do most lightweight phone ‘doing’ work,” even if current agents are messy, per the [phone OS thread](t:109|phone OS thread).

The same discussion frames the strategic layer as persistent agent infrastructure (state, tools, autonomous operation) rather than the LLM itself, as captured in the [screenshot thread](t:90|screenshot thread).

OpenClaw config pattern: allow orchestrator subagents to spawn subagents (maxSpawnDepth=2)

Recursive delegation (Claw config): A shared config snippet shows enabling maxSpawnDepth: 2 so an orchestrator sub-agent can fan out worker subagents, alongside maxConcurrent and a pinned model, per the [config screenshot](t:424|config screenshot).

This is a small knob with big operational consequences: it formalizes “agents managing agents” as a first-class workload shape, as illustrated in the [config screenshot](t:424|config screenshot).
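
For readers who have not seen the screenshot, the shape of the knobs looks roughly like the sketch below; the key names come from the shared config, while the values, the surrounding structure, and the file handling are assumptions.

```python
# Illustrative reconstruction of the delegation knobs from the screenshot.
# Only the key names (maxSpawnDepth, maxConcurrent, model) come from the
# shared config; values and structure are assumptions.
import json

subagent_config = {
    "maxSpawnDepth": 2,               # orchestrator subagents may spawn one more layer of workers
    "maxConcurrent": 4,               # assumed cap on simultaneously running subagents
    "model": "example-pinned-model",  # placeholder for the pinned model id
}

print(json.dumps(subagent_config, indent=2))
```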

OpenClaw reportedly ported to iOS, tvOS, and VisionOS with no server

OpenClaw on Apple platforms (port): A retweeted claim says OpenClaw has been ported to run on iOS, tvOS, and VisionOS without a server, while still relying on an external LLM provider, per the [porting claim](t:227|porting claim).

If accurate, this pushes the “personal agent” form factor toward fully on-device orchestration (even if inference is still remote), as suggested in the [porting claim](t:227|porting claim).

OpenClaw users switch Anthropic calls to the Agents SDK to avoid auth issues

OpenClaw provider wiring (Anthropic): One operator reports moving their Anthropic usage inside OpenClaw over to the Agents SDK, hoping it resolves recurring authentication issues, per the [Agents SDK switch note](t:98|Agents SDK switch note).

It’s a reminder that for Claw setups, “which SDK path you’re on” can be as important as model choice for stability, as implied by the [Agents SDK switch note](t:98|Agents SDK switch note).

Pattern: keep your Claw “dumber” and build local-first CLIs to reduce brittleness

Claw ops (pattern): One practitioner reports that when OpenClaw “dies/doesn’t respond,” a reliable workaround is to offload responsibilities into local-first API/CLI tools (cron, event wakeups, bookmark fetchers, email/SEO CLIs) and keep the agent focused on orchestration, per the [local tools tip](t:138|local tools tip).

They add that they built a “bridge to the CLIs” and local tools for do_something() to avoid putting every step into the agent loop, as described in the [follow-up note](t:392|bridge follow-up).
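
The actual bridge is not published, but the pattern reads roughly like the following sketch: deterministic work runs in ordinary subprocesses, and the agent is only given one thin do_something() entry point. The task names and commands are illustrative assumptions.

```python
# Minimal sketch of the "bridge to the CLIs" pattern described above.
# The task names and commands are illustrative; the actual tools in the
# posts are not published.
import subprocess
import sys

def run_cli(cmd: list[str]) -> str:
    # Deterministic work happens in a normal process, not inside the agent loop.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

def do_something(task: str) -> str:
    # The single thin entry point the agent is allowed to call.
    if task == "fetch_bookmarks":
        return run_cli([sys.executable, "-c", "print('pretend bookmark-fetcher output')"])
    if task == "seo_report":
        return run_cli([sys.executable, "-c", "print('pretend SEO CLI output')"])
    raise ValueError(f"unknown task: {task}")

if __name__ == "__main__":
    print(do_something("fetch_bookmarks"))
```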

“6 lightweight alternatives to OpenClaw” list circulates as Claw forks proliferate

Claw clones (ecosystem): A roundup post highlights “6 lightweight alternatives to OpenClaw”—PicoClaw, nanobot, ZeroClaw, IronClaw, TinyClaw, MimiClaw—alongside an architecture/hype analysis, as shared in the [alternatives post](t:24|alternatives post).

This is less about any one repo winning and more about fragmentation: teams are deciding whether they want OpenClaw’s broad surface area or smaller, auditable forks, as implied by the [alternatives post](t:24|alternatives post).

Mac mini demand spikes anecdotally as people buy boxes to run Claws

Claw hardware (Apple Mac mini): A small but telling adoption anecdote: Karpathy says he bought a new Mac mini “to properly tinker with claws,” and the Apple Store employee said they’re “selling like hotcakes,” per the [retweet of the anecdote](t:5|Mac mini anecdote).

This isn’t performance data. It’s a demand signal that “personal agent boxes” are turning into a common pattern, as echoed in the [Mac mini anecdote](t:5|Mac mini anecdote).

“YC partners in crab suits” meme marks Claw hype hitting peak visibility

Claw culture (signal): A viral clip jokes that the ecosystem has reached the “YC partners in crab suits” phase—useful as a temperature check that “Claw” has become a recognizable category, per the [crab suits video](t:21|crab suits video).

Crab suit office clip

It’s not a technical update. It’s a distribution/attention signal, as framed in the [crab suits video](t:21|crab suits video).


🧩 Claude Code: worktrees as default, plus breakage and long-context failure modes

Continues the Claude Code worktree/subagent push with more practitioner adoption and sharper reports of edge-case failures (subagents breaking, compaction unrecoverable states). Excludes Claw/OpenClaw items (covered in the feature).

Claude Code can hit an unrecoverable compaction state even with token headroom

Claude Code (Anthropic): A sharp long-context failure mode is circulating where /compact errors with “Conversation too long… go up a few messages and try again” even when /context reports 121k/200k tokens used (60%), as shown in the Unrecoverable compaction screenshot. The session appears stuck.

The debug detail in that capture shows why the UI can feel inconsistent: “Messages” alone are shown exceeding 100% of context (279.6k tokens; 139.8%), with an additional autocompact buffer (33k; 16.5%) in the same Unrecoverable compaction screenshot.

Anthropic says compaction fixes shipped, but wants /feedback IDs for hard failures

Claude Code (Anthropic): In response to reports of irrecoverable /compact failures, an Anthropic engineer says they “put in a fix” and asks what version affected users are on in the Fix query, while another notes they’ve fixed “a bunch of issues” recently and requests a /feedback ID for sessions that truly can’t recover, as written in the Feedback request. It’s still being debugged.

The same response thread highlights the current troubleshooting path—“double esc” navigation and retrying compaction—which is called out in the Feedback request alongside the irrecoverable-state reports.

Matt Pocock makes claude --worktree his new default (with a demo)

Claude Code (Anthropic): A concrete “make it your default” workflow is forming around claude --worktree, with Matt Pocock saying he’s now using it by default and sharing a walkthrough aimed at translating Anthropic’s more technical worktree messaging into a practical mental model, as described in the Worktree default note. Short version: parallel work becomes easier.

Worktree demo

Parallel agents without stepping on each other: The appeal is less about elegance and more about throughput—parallelizing subagents becomes straightforward, and “merge conflicts are so cheap” when each agent gets its own isolated tree, per the Parallel subagents praise.
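
The underlying mechanic is plain git worktree, independent of Claude Code; a minimal sketch follows, where the branch names and paths are assumptions.

```python
# Minimal sketch of the isolation mechanic behind --worktree: each parallel
# agent gets its own branch checked out in its own directory, so edits and
# merges don't collide. Branch names and paths are illustrative.
import subprocess

def add_worktree(repo: str, branch: str, path: str) -> None:
    # `git worktree add -b <branch> <path>` creates a new branch checked out
    # in a separate working directory that shares the same object store.
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, path],
        check=True,
    )

if __name__ == "__main__":
    for i in range(3):
        add_worktree(".", f"agent/task-{i}", f"../wt-task-{i}")
```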

Claude Code worktrees become a real coordination primitive (not everyone uses them)

Claude Code (Anthropic): Built-in git worktree support is being framed as the mechanism for running agents in parallel “without interfering,” per the Worktree support announcement, but practitioner notes suggest it’s not a universal default—Claude Code’s head of product says the team is split roughly 50/50 between worktrees and other approaches (multiple checkouts or tabs), as explained in the Team usage note. It’s messy. That’s the point.

How it shows up day-to-day: The default UI subagent is often just “Task,” and worktrees show up most in “1-shot large batch changes” like codebase-wide migrations, per the same Team usage note.

Some users report Claude Code subagents breaking on the latest release

Claude Code (Anthropic): There’s at least one direct warning that subagents are “broken” on the most recent release, with a heads-up to others captured in the Subagent regression warning. It’s a thin data point. It’s also the kind that can burn a team mid-migration.

Claude Code reliability complaints persist: throttled, slow, unavailable

Claude Code (Anthropic): Reliability remains a recurring pain signal, summarized bluntly as “always throttled, always slow, always unavailable” in the Reliability complaint. That breaks flow.

This matters more as Claude Code workflows lean into longer-running sessions (worktrees, subagents, and compaction), where mid-run stalls create hard-to-debug partial state.


🧠 Codex dev workflows: app-server embedding, long-run steering, and harness co-design

Codex-centric engineering signals: app-server API embedding, better UX for long-running tasks, and explicit “harness is part of the model” framing. Excludes ChatGPT plan pricing (separate category).

Codex framing: the model is trained with its harness (tools, loops, compaction)

Codex harness co-design (OpenAI): A widely shared claim is that Codex models are trained “in the presence of the harness,” meaning tool use, execution loops, compaction, and iterative verification are treated as part of the training environment—not a layer bolted on after the fact, as described in the [harness quote card](t:165|Harness quote card).

If this is accurate, it implies two practical things for teams building agent runners: swapping harness components can change model behavior more than expected, and “agent quality” becomes an end-to-end systems property (model + harness), not a model-only attribute.

Codex app-server surfaces a local API for embedding Codex into tools

Codex app-server (OpenAI): Builders are calling out codex app-server as a clean way to treat Codex like an embeddable service—effectively turning the Codex app into a local API you can wire into your own workflows, as shown in the [app-server note](t:20|App-server note). This is showing up as a practical integration surface, with one developer saying they started investigating it for a project and “accidentally ended up making an actual native coding tool,” per the [hands-on reaction](t:56|Hands-on reaction).

Lightweight, local surfaces like this tend to be the difference between “I tried the agent” and “it lives in our stack.” The sentiment around the Codex app itself is also positive in the [Codex app comment](t:41|Codex app comment).

Codex CLI can be “steered” mid-run while a long terminal task keeps going

Codex CLI (OpenAI): A UX detail that matters for real work—Codex can keep a long-running terminal task executing while you ask clarifying questions, and it responds without interrupting the running job, according to the [steering screenshot](t:225|Steering screenshot).

This effectively separates “execution” from “conversation,” which is a common failure mode in agent CLIs (either you lose the run, or you lose your chance to steer it).

Codex CLI knob confusion: is default reasoning effort set to none?

Codex CLI (OpenAI): A question being raised is whether the Codex CLI’s default “reasoning effort” is set to “none,” as asked in the [CLI knob question](t:413|CLI knob question). It’s a short post, but it points at an important operational detail: hidden defaults can look like flaky performance or inconsistent code quality when teams compare setups.

No clarification appears in the provided tweets, so treat it as an open config/UX ambiguity rather than a confirmed default.

Codex-driven optimization claim: jido_ai tests cut from 92s to 9s

Codex for performance work: One concrete anecdote: “Codex just trimmed the jido_ai test suite from 92 seconds to 9,” per the [runtime claim](t:308|Runtime claim). That’s a 10× reduction.

It’s a reminder that current coding agents are being used for more than feature work—micro-optimizations, test harness tuning, and profiling-driven refactors are starting to show up as first-class tasks in agent-assisted workflows.

Coding-agent mental model: Model + Harness + Surface

Agent architecture mental model: A concise decomposition is circulating—“a coding agent is: Model + Harness + Surface,” as stated in the [mental model post](t:229|Mental model post). It pairs naturally with the harness co-design framing in the [harness quote card](t:165|Harness quote card).

This vocabulary is useful because it makes debugging and product decisions less mystical: many “model regressions” are actually harness or surface issues (permissions UX, tool schemas, context packing, or failure recovery).

Codex app is being positioned for end-to-end dev workflows, not snippets

Codex app (OpenAI): A thread framing “Codex for end-to-end dev workflows” suggests people are increasingly treating Codex as the primary driver of longer development loops, not a code-completion sidecar, as hinted in the [workflow post](t:26|Workflow post). The same author also reinforces the day-to-day usability angle in the [Codex app comment](t:41|Codex app comment).

The concrete details of the workflow aren’t spelled out in the tweets, but the direction is clear: Codex is being used as an app-level surface for running multi-step work, not just generating code blocks.

Codex vs Claude Code as a single lifelong tool is polling near 50/50

Tool choice signal: A poll asking which single tool you’d keep—Codex with GPT-5.3-Codex vs Claude Code with Opus 4.6—shows a near-even split at 52% vs 48% with 898 votes, as shown in the [poll screenshot](t:316|Poll screenshot).

This kind of tight split is a proxy for switching costs staying low: developers appear willing to swap their “primary agent” based on workflow fit and day-to-day UX, not just vendor loyalty.

Codex is getting used as an agent surface for social media operations

Codex beyond codegen: A small but telling use-case signal: “Codex for managing social media” is being pitched as a workflow, per the [use-case mention](t:250|Use-case mention). It suggests Codex is being treated as a general automation agent surface (drafting, scheduling, reporting) rather than only a coding tool.

The post doesn’t include implementation details (tools, APIs, guardrails), but it’s consistent with the broader shift toward long-running, tool-driven task execution.


🕸️ Operating many agents: recursive subagents, swarm health, and attention management

How builders are keeping multi-agent setups stable: recursion limits, offloading heavy work, and day-to-day ‘agent ops’ habits. Excludes Claw/OpenClaw-specific ops (feature owns those).

Swarm health tooling emerges: offload builds, manage disk, kill runaway processes

Agent swarms (Ops pattern): As multi-agent setups saturate CPU, disk, and SSH, builders are assembling personal “coping toolchains” that treat swarm load like production infrastructure; one concrete stack includes remote build offload to a VPS fleet, automated disk ballast/cleanup to prevent toolchain collapse, and a process triage utility to hunt runaway/zombie processes, as described in the swarm ops toolkit post.

Offload is the first lever: The reported critical dependency is moving compilation off the primary machine via a dedicated helper, per the swarm ops toolkit details.
Failure mode is mundane: Disk-fill and runaway rg-style searches are called out as the actual swarm killers, not model quality, as shown in the swarm ops toolkit.
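
A toy sketch of the process-triage piece of such a toolkit, using psutil; the thresholds, suspect process names, and kill policy below are assumptions rather than the actual toolkit described above.

```python
# Toy sketch of "process triage" for a swarm-ops toolkit: find runaway or
# zombie processes (e.g. stray `rg` searches) and flag them. Thresholds,
# suspect names, and the kill policy are assumptions.
import psutil

RUNAWAY_NAMES = {"rg", "grep", "cargo", "node"}   # illustrative suspects
CPU_THRESHOLD = 90.0                              # percent, assumed

def triage(dry_run: bool = True) -> None:
    for proc in psutil.process_iter(attrs=["pid", "name", "status", "cpu_percent"]):
        info = proc.info
        zombie = info["status"] == psutil.STATUS_ZOMBIE
        runaway = info["name"] in RUNAWAY_NAMES and (info["cpu_percent"] or 0) > CPU_THRESHOLD
        if zombie or runaway:
            print(f"suspect pid={info['pid']} name={info['name']} "
                  f"status={info['status']} cpu={info['cpu_percent']}")
            if not dry_run:
                proc.terminate()   # escalate to proc.kill() only if this fails

if __name__ == "__main__":
    triage(dry_run=True)
```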

“Keep the agent thin” pattern: local cron + event wakeups for reliability

Long-lived automation (Ops pattern): Instead of asking the model to do everything in chat, one practitioner describes building a local-first toolbelt—custom cron, an event system to wake the agent from scripts, and specialized CLIs—so routine work runs deterministically and the agent only does high-judgment steps, as outlined in the local-first tooling tip.

This frames “agent ops” as composing small reliable programs around the model, which reduces fragility when agent sessions stall or providers throttle.

Recursive subagent fan-out via maxSpawnDepth=2 configuration

Recursive delegation (Harness pattern): A shared config snippet shows how teams are enabling an “orchestrator → workers” topology by raising maxSpawnDepth to 2, so a sub-agent can itself spawn additional subagents; the same example pins maxConcurrent and a specific model for predictable parallelism, as shown in the recursion config snippet screenshot.

The operational implication is that concurrency limits and recursion depth become first-class safety controls once subagents can recursively delegate, rather than being an accidental emergent behavior.

“Year of agent orchestrators” becomes a coordination-layer signal

Orchestration zeitgeist (Ecosystem signal): The phrase “the year of agent orchestrators” is getting repeated as a shorthand for where new leverage is coming from—less from raw model choice and more from coordination layers that manage many agents, per the orchestrator year claim post.

If this holds, the competitive surface shifts toward scheduling, delegation limits, failure recovery, and operator UX—classic distributed-systems problems, now applied to LLM work queues.

Oversight tactic: watching intermediate chatter to preempt rabbit holes

Human-in-the-loop triage (Ops habit): One practical supervision trick is to keep an eye on the agent’s intermediate “chatter” so you can spot when it’s heading into a rathole and redirect early, as described in the watch agent chatter post.

This frames oversight as real-time observability: the goal isn’t perfect prompting, it’s detecting drift quickly enough that parallel runs don’t multiply the same mistake.

Practical multitasking: manage parallel agents by status-checking, not context switching

Attention management (Work habit): A simple but recurring habit is using multiple agents as background workers and treating human time as a control loop—kick off parallel tasks, then periodically check status across projects instead of single-threading, as stated in the status-check multitasking note.

In day-to-day ops terms, the limiting factor becomes operator attention (what to review next), not the ability to start more work streams.


Gemini in developer workflows: CLI adoption and parallel instances

Gemini 3.1 Pro shows up in CLI-first workflows, with early patterns around running multiple instances and managing quotas. Excludes generative-media demos (covered elsewhere).

Gemini 3.1 Pro Preview becomes selectable inside Gemini CLI via /model

Gemini CLI (Google): Following up on Missing model list (3.1 Pro not showing), the Gemini CLI now lists gemini-3.1-pro-preview in the /model picker, as shown in the CLI model picker; the same UI exposes two new day-to-day knobs—whether to remember the chosen model across sessions and the --model startup flag hint, per the CLI model picker.

The selector also makes the entitlement explicit (“Gemini Code Assist in Google One AI Ultra”), which matters for teams trying to reason about which plan gates which CLI models, as visible in the CLI model picker.

Antigravity users ask for custom Gemini 3.1 Pro API keys to avoid multi-hour waits

Antigravity (Google ecosystem): Quota and latency constraints are showing up as workflow blockers—one request asks Google to allow custom API keys for Gemini 3.1 Pro in Antigravity, describing “wait for hours” mid-flow when capacity is tight, per the Custom key request.

The underlying signal is that “CLI-first” usage is colliding with entitlement/rate-limit boundaries, and builders want BYO-key escape hatches to keep parallel work moving, as described in the Custom key request.

BridgeSpace runs six parallel Gemini CLI sessions powered by Gemini 3.1 Pro

BridgeSpace (bridgemindai): A shared “one workspace, many agents” workflow is emerging where developers run multiple Gemini CLI instances concurrently—one example shows six terminals in parallel, all authenticated to the same “Google One AI Ultra” plan and running Gemini 3.1 Pro, as shown in the Six parallel terminals.

This is a concrete pattern shift from single-chat iteration to concurrency-first development, where coordination is done by splitting tasks across terminals rather than overloading one long session, as described in the Six parallel terminals.

Gemini-driven referrals to Render rise to 60% of ChatGPT volume

Gemini adoption proxy: A distribution datapoint suggests Gemini is becoming a meaningful top-of-funnel for developer infra—referrals from Gemini to Render are reported at 60% of ChatGPT referral volume, up from 10% three months ago, according to the Referral volume stat.

If this holds across other infra products, it implies Gemini is no longer “secondary traffic” for dev tools, but a channel teams may need to optimize docs and onboarding for, per the Referral volume stat.


📱 AI app builders: Rork Max shipping + App Store publishing automation

Rork Max continues to trend as a Swift-native ‘Xcode replacement’ with streamlined device install and App Store submission flows, plus strong distribution signals (Product Hunt, X trends).

Rork Max adds a “2-click publish” path to TestFlight and App Store Review

Rork Max (Rork): Following up on web IDE launch—browser Swift builder pitched as an Xcode replacement—Rork Max now claims an end-to-end App Store pipeline where you enter Apple Developer credentials ("not stored") and hit Submit to get a TestFlight-ready build, with invites available in ~10 minutes per the 2-click publish thread.

App Store submit flow demo

Workflow detail: The same flow is positioned as replacing both Xcode and App Store Connect friction, then later filling App Store metadata and submitting for Apple review, as described in the 2-click publish thread.
Ownership and gating: Rork emphasizes apps ship to your own Apple Developer account (no IP transfer) and still require the $99/year Apple Developer Program, as stated in the 2-click publish thread.

Builders hype Rork Max as autonomous app builder; one user claims 4 apps already in Apple review

Rork Max (Rork): Several posts frame Rork Max as an “autonomous” app builder—an investor with early access calls it “absolutely amazing” and says it can build almost any app idea “completely autonomously,” as written in the early access praise.

App Store throughput anecdote: A retweeted user message claims “4 apps pending” for Apple publishing and “one is live right now,” per the apps pending claim, while another endorsement says they’re going “full focus” on iOS native apps with Rork Max as shown in the user endorsement screenshot.

This is still anecdotal evidence (no public App Store links in the tweets), but it highlights the core promise: collapsing build-to-review latency into an agent-driven loop.

Rork Max hits #1 Product of the Day on Product Hunt in under 12 hours

Rork Max (Rork): Rork says it reached #1 Product of the Day on Product Hunt in under 12 hours, and is actively campaigning toward “Product of the Year,” according to the Product Hunt ranking post.

Reach claim: In the same launch window, Rork also asserts “7M+ people saw” the Product Hunt launch, per the Product Hunt launch ask.

This is a traction signal more than a spec; there’s no independent dashboard screenshot in the tweets for the ranking or view count.

Rork Max lands in X trends with ~6K posts

Rork Max (Rork): Rork posted that “Rork Max is in X trends,” with a screenshot showing the trend blurb “AI Builds Native Apple Apps from One Prompt” and ~6K posts, as shown in the trending screenshot.

The screenshot reads like a mainstream distribution moment: it’s less about new capabilities than about the pitch landing outside the agent-builder bubble.

Rork Max leans into high-performance mobile game claims (60FPS 3D, “real games” from ad mocks)

Rork Max (Rork): Rork is pushing a game-dev angle—claiming it’s the only option to build a 60FPS 3D game “that would not overheat your phone,” per the 60FPS positioning, alongside a simpler pitch that it can turn “fake ad games” into actual playable games according to the ad game conversion claim.

These are positioning statements; there’s no benchmark or thermal profiling data attached in the tweets.


💳 ChatGPT plan packaging: Pro Lite signals, reliability, and UX backlash

Plan-tier changes and user sentiment: strong signals of a $100 “Pro Lite” tier plus frustration with current tier gaps and reliability. Excludes Codex technical workflow updates (separate).

ChatGPT Pro Lite leaks as a $100/month tier in checkout config

ChatGPT (OpenAI): Following up on Pro Lite code strings—early SKU wiring—the web checkout flow now appears to return a prolite monthly amount of $100, even while the UI label still says “Pro plan,” as shown in the Checkout config screenshot response payload.

The plan name also shows up directly in the web app bundle as chatgptprolite / prolite, per the Plan id code snippet.

Availability signal: TestingCatalog frames it as “working on” a tier between Plus and Pro and “not available yet,” according to the Pro Lite report.
More $100 confirmations: Multiple accounts restate the $100/month price point (without additional artifacts beyond the checkout/code evidence), including the Price restatement and the earlier “likely WIP” framing in the Checkout config screenshot.

The open question from the leaked UI is whether “Pro Lite” is its own plan or a renamed “Pro” variant; the checkout screenshot’s mismatch (UI says Pro, JSON says prolite) is the main clue so far.

ChatGPT pricing gap debate intensifies ahead of a $100 tier

ChatGPT plan packaging: The loud complaint remains the same: Plus at $20/month feels capped, Pro at $200/month feels too steep, and there’s been “no solid middle tier around $70–$100,” as argued in the Pricing gap critique.

With the Pro Lite leak pointing at $100/month via checkout config in the Checkout config screenshot, the discourse is shifting from “why no mid-tier exists” to “what exactly the mid-tier includes,” but today’s tweets don’t yet show entitlements (models, rate limits, or voice/video features) tied to the plan.

Users keep resorting to social checks for ChatGPT outages

ChatGPT reliability: A basic “Is ChatGPT down?” post from the Outage check is a small but recurring ops signal—people increasingly use social feeds as an incident channel when sessions stall, auth fails, or latency spikes.

There’s no correlated status page screenshot or error code in today’s tweets, so treat it as a sentiment datapoint rather than a confirmed outage report.

ChatGPT “ad blocker” asks show anxiety about ads and upsells

ChatGPT UX backlash: The question “Anyone building an ad blocker for ChatGPT?” in the Ad blocker ask reads as preemptive concern about potential ads or heavier upsell UI in the ChatGPT product.

Today’s tweets don’t include an actual ad screenshot or a new OpenAI policy note—so this is more about expectations: if a $100 tier lands, some users anticipate more aggressive packaging/monetization surfaces alongside it.


🛠️ How teams actually ship with agents: specs, context, and “agent-native” products

Practitioner patterns for building with agents: context hygiene, deterministic workflows for ops/back-office, and product strategy shifts (platforms over bespoke SaaS). Excludes tool-specific releases (other categories).

Back-office automation is pitching “deterministic workflows,” not chat agents

Back-office automation (LlamaCloud): The claim is that classic RPA is effectively dead and the next wave is deterministic agentic workflows over unstructured documents (invoices, claims, loan files), with “vibe-coded” workflow authoring as the interface, per the RPA is dead claim thread.

This is a product direction shift: scale comes from repeatable pipelines (routing, extraction, validation), not one-off chat completions.
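
A generic sketch of what a deterministic document workflow means in code (route, extract into a fixed schema, validate) follows; it illustrates the pattern, not LlamaCloud's actual API, and all field names and rules are assumptions.

```python
# Generic sketch of a deterministic "route -> extract -> validate" document
# workflow (the pattern described above, not any vendor's actual API).
# Field names, routing rules, and validation checks are assumptions.
from dataclasses import dataclass

@dataclass
class Extraction:
    doc_type: str
    fields: dict
    errors: list

def route(text: str) -> str:
    # Deterministic routing: same input, same branch, every run.
    return "invoice" if "invoice" in text.lower() else "claim"

def extract(doc_type: str, text: str) -> dict:
    # In production this step might call an LLM, but its output is forced
    # into a fixed schema so downstream validation stays deterministic.
    if doc_type == "invoice":
        return {"total": 1234.56, "currency": "USD"}   # placeholder values
    return {"claim_id": "C-001", "amount": 880.0}

def validate(doc_type: str, fields: dict) -> list:
    errors = []
    if doc_type == "invoice" and fields.get("total", 0) <= 0:
        errors.append("total must be positive")
    return errors

def run_pipeline(text: str) -> Extraction:
    doc_type = route(text)
    fields = extract(doc_type, text)
    return Extraction(doc_type, fields, validate(doc_type, fields))

print(run_pipeline("Invoice #42 ..."))
```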

Coding agents are being accused of “fallback” reward hacks

Coding agent behavior (Risk pattern): A concrete critique is that current coding agents may be over-optimized for “successful run/render” outcomes; they sometimes emit fallback behavior, avoid breaking changes, and may not disclose the compromise, as described in the Fallback behavior critique report.

This is a shipping risk: teams can get green builds with degraded correctness or missing intent, especially in refactors where breaking changes are expected.

Tool-call telemetry shows agents are still mostly doing software engineering

Agent deployment mix (Telemetry): A shared chart puts software engineering at 49.7% of tool calls and back-office automation at 9.1%, suggesting most “agent ROI” is still concentrated in dev workflows even as ops/document automation grows, as shown in the Tool calls by domain breakdown.

The distribution matters for product teams: it’s a snapshot of where tool-use friction and evaluation effort will concentrate first.

“1000 variants” thesis: more perfect-fit tools, fewer SaaS winners

Market structure (Signal): One argument is that agent-assisted customization increases the number of viable “good enough for me” products; instead of 2–3 dominant vendors, many categories may fracture into 1000+ variants, with distribution shifting toward one-time-payment products where buyers get code and then attach Claude/Codex to tailor it, per the SaaS fragmentation take post.

The claim excludes network-effect B2C products, where winner-take-most dynamics still hold in this framing.

Local-first CLIs as a way to keep agents reliable

Context & reliability (Pattern): A pragmatic approach to agent flakiness is to move recurring automation into local scripts/CLIs (cron, event wakeups, custom fetchers) and keep the “personal agent” focused on coordination and judgment, as described in the Local-first tools tip follow-up and the Bridge to CLIs clarification.

The stated motivation is operational: when the agent dies or stalls, the workflows still run because the deterministic parts live outside chat.

Same-model swarms don’t cancel jaggedness the way human teams do

Multi-agent limits (Signal): A cautionary scaling argument is that “jaggedness” persists; 1000 agents of the same model can share the same weak spots and may be vulnerable to groupthink-like failure patterns, unlike diverse human teams, as argued in the Jaggedness bottlenecks thread and extended in the Weak spots don’t cancel comment.

This is a coordination implication: adding parallelism can amplify shared blind spots unless the swarm is deliberately heterogeneous or adversarially reviewed.

Shipping speed is about the year-after, not 0→1

Team shipping heuristic (Pattern): A useful framing for agent-accelerated teams is that early velocity is cheap; what matters is whether a team is still shipping quickly a year into the project, as argued in the Year-in shipping speed observation.

It’s an implicit warning about maintenance load: agents can inflate throughput early, but long-run delivery depends on keeping systems legible and reviewable.

Slack-to-Obsidian automation is becoming an agent-era leverage pattern

Ops workflow (Pattern): A concrete “agent-native” personal ops pattern is emerging: wire Slack integrations and Codex automations into a personal knowledge system (Obsidian) to monitor public channels and people at scale, as described in the Slack to Obsidian automation anecdote.

It’s less about new model capability and more about building durable intake pipelines so agents can query and summarize without constant manual context packaging.

Teams are re-accepting that iterative correction loops are unavoidable

Iteration discipline (Pattern): A blunt restatement of a common experience: you can’t convey a full product idea “perfectly” to an agent in one go, so the only stable approach is iterative back-and-forth with corrections, as summarized in the Iterate with them note.

This pushes specs toward smaller slices and continuous verification, not bigger prompts.

“AskUserQuestions” is being pitched as better than approval gating

Agent governance (Pattern): A specific interaction pattern is getting advocated for enterprise setups: instead of pausing a run behind generic “wait for approval” gates, expose an explicit question/consent tool (AskUserQuestions) so the agent can continue with structured prompts to the user, as stated in the AskUserQuestions pattern post.

It frames approval as an interface design problem—make consent a first-class tool call rather than a global stop state.


Verification is the bottleneck: TDD, mutation testing, and distrust-by-default

High-signal discussion about keeping agent-written code correct: TDD as a control system, mutation testing, and skepticism about spec frameworks that models can game. Excludes benchmark charts (separate).

“Three laws of TDD” resurfaces as a control system for agents

TDD discipline: Uncle Bob frames the “three laws of TDD” as a practical guardrail for Claude-based coding—slower, but potentially worth it to reduce drift and rationalized changes, per the TDD laws claim follow-up.

He reinforces that this kind of enforced rigor can feel “hair raising” but may be the cost of keeping agents inside acceptable boundaries, as echoed in the TDD friction note continuation.

Gherkin-style specs look increasingly gameable for agent-written code

Spec-driven testing (BDD/gherkin): A practitioner reports that the more they rely on Gherkin-style specs with AI, the less they trust the results—because the model can claim it’s “passing” while effectively making tests meaningless, as described in the Gherkin skepticism critique.

The same thread argues that “nothing can be trusted other than direct human inspection,” which is a blunt reminder that passing green checks isn’t evidence of correct behavior when the code author is also the test author—especially in agent loops where silent shortcuts are rewarded.

Mutation testing becomes the “audit” for weak test suites

Mutation testing: A practitioner says they had Claude write a mutation tester and it’s already “finding all kinds of things,” turning vague suspicion about coverage into concrete, reproducible failures, as reported in the Mutation tester results post.

A follow-on note signals they’re about to adopt mutation testing more broadly, per the Mutation testing next escalation—suggesting TDD alone isn’t catching the failure modes they’re seeing with agent-written code.
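
For readers new to the technique, a toy mutation tester looks roughly like the sketch below (an illustration of the idea, not the tool Claude wrote): apply one small mutation, rerun the tests, and report any mutant the suite fails to kill.

```python
# Toy mutation tester: swap an operator, rerun the test suite, and report
# mutants that survive, i.e. changes the tests failed to notice.
# The mutation rules, target file, and test runner are assumptions.
import subprocess
import sys
from pathlib import Path

MUTATIONS = [("==", "!="), ("<", "<="), ("+", "-")]   # assumed operator swaps

def run_tests() -> bool:
    # Any test runner works; pytest is assumed here.
    return subprocess.run([sys.executable, "-m", "pytest", "-q"]).returncode == 0

def mutate_and_test(target: Path) -> None:
    original = target.read_text()
    for old, new in MUTATIONS:
        if old not in original:
            continue
        target.write_text(original.replace(old, new, 1))   # single mutation
        try:
            if run_tests():
                print(f"SURVIVED: {old!r} -> {new!r} in {target} (weak tests?)")
            else:
                print(f"killed:   {old!r} -> {new!r}")
        finally:
            target.write_text(original)   # always restore the original code

if __name__ == "__main__":
    mutate_and_test(Path("mymodule.py"))   # hypothetical target file
```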

“You can’t ask a liar…” becomes a design constraint for agent governance

Trust boundary framing: The line “you can’t ask a liar to design a process that prevents it from lying” is being used as a compact governance axiom for agent-written code, as stated in the Liar process axiom quote.

Interpreted literally for engineering: if the model is the generator, it cannot be the sole designer of its own verification and reporting mechanisms—so independent checks (tests you trust, diff review you do, mutation tests, secondary reviewers) become first-class system components.

Jaggedness doesn’t average out for same-model swarms

Scaling limits (jagged intelligence): Ethan Mollick argues that jaggedness is a persistent feature of LLMs and that “1000 agents of the same model” will share weak spots and may amplify groupthink-like errors, as stated in the Jaggedness bottleneck thread.

He contrasts that with human teams where diversity cancels jaggedness across individuals, and he reiterates the point explicitly in the Same weak spots follow-up—implying verification needs change when you scale agent count without model diversity.

Agent time estimates are noisy: “2 days” vs 15 minutes

Planning and trust calibration: A screenshot shows a classic mismatch: “Estimated time… ~2 days,” followed by completion in about 14 minutes, as shared in the Estimate vs actual example.

This is a concrete reminder that agent time estimates are not project planning signals; they’re often better treated as a rough ‘complexity vibe’ until you have your own calibration data per task type and repo.

ATDD “see it fail” rule gets re-emphasized for agent trust

ATDD workflow: A concrete tactic for acceptance tests is to force every scenario to fail before you try to make it pass—then ensure it only passes when the relevant unit tests pass, as stated in the ATDD fail-first rule reminder.

This is a simple anti-self-deception loop: it makes it harder for an agent to “satisfy” requirements by weakening the test rather than implementing the behavior.

Software “certainty” shifts toward bounding uncertainty with agents

Certainty vs probabilistic tools: A thread reframes the core question as whether software engineering becomes the practice of constraining uncertain behavior “within acceptable limits,” rather than demanding binary correctness from the toolchain, as posed in the Certainty question reflection.

It’s a notable cultural shift: the acceptance criterion moves from “the program is correct” to “the system produces acceptable outcomes under oversight,” which affects how teams justify and staff verification work.

Watching agent “chatter” is treated as an early-warning system

Observability-as-correction: A simple, human-in-the-loop technique is to watch the agent’s running commentary because you can often spot it going down ratholes early and redirect it, as described in the Watch the chatter note.

This is less about “better prompting” and more like tracing: you’re looking for divergence signals before the diff gets big and the review burden explodes.


Agent-building primitives: harness engineering, agent-UI sync, and consent tooling

Libraries and architectural primitives for building agents (not just running them): harness engineering, agent UI synchronization, and explicit consent/clarification tools. Excludes tool-specific product releases.

LangChain claims harness-only changes jumped a coding agent into Terminal Bench 2.0 Top 5

Harness engineering (LangChain): A LangChain team post says their coding agent improved from “Top 30 to Top 5” on Terminal Bench 2.0 by changing only the harness (prompting, tool choices, execution flow), not the underlying model, as described in the harness engineering thread. The emphasis is that agent performance is often constrained by the system around the model—self-verification and tracing are called out as practical levers via LangSmith in the same harness engineering thread.

This is a clean data point for teams treating agent quality as a systems problem rather than a “pick the best model” problem, but the tweet doesn’t include a public diff of the harness changes or an eval artifact beyond the claimed rank shift.

CopilotKit pitches AG-UI as the missing runtime layer between Deep Agents and real UIs

CopilotKit (AG-UI): CopilotKit is positioning AG-UI as a protocol/runtime that keeps long-running backend agents synchronized with an interactive frontend—so users can follow progress and intervene mid-run—per the AG-UI runtime pitch.

Agent-UI sync demo

Frontend visibility: The claim is that “agents live on the backend… so users don’t see them,” and AG-UI provides real-time UI updates and interaction hooks, as shown in the AG-UI runtime pitch.
Ecosystem signal: CopilotKit also points at an Interfaces Hackathon focused on “Give your agent an interface,” reflecting demand for UI-native agent surfaces in the builder community, as shown in the hackathon poster.

The tweets read as product framing plus a reference implementation; they don’t enumerate concrete protocol semantics (events, state model, backpressure) in-line.

AskUserQuestions: consent as a first-class tool call instead of approval gates

Consent tooling pattern: A thread argues that enterprise-style “we’ll wait for approval before doing something” gates break agent workflows; instead, agents should call an explicit AskUserQuestions tool to request clarification/consent at the moment it’s needed, as stated in the AskUserQuestions argument.

The practical implication is a different control surface: consent becomes a first-class tool call (logged and reviewable) rather than a coarse global “pause until approval” mode.
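
A sketch of what that control surface can look like in practice; the schema and handler below are illustrative assumptions, not the actual AskUserQuestions specification.

```python
# Sketch of the pattern: consent/clarification as an explicit tool call the
# agent can make mid-run, logged like any other tool call. The schema and
# function names are illustrative, not the actual AskUserQuestions spec.
import json
from datetime import datetime, timezone

ASK_USER_QUESTIONS_TOOL = {
    "name": "ask_user_questions",
    "description": "Ask the user for clarification or consent before proceeding.",
    "parameters": {
        "type": "object",
        "properties": {
            "questions": {"type": "array", "items": {"type": "string"}},
            "blocking": {"type": "boolean"},
        },
        "required": ["questions"],
    },
}

def handle_ask_user_questions(args: dict, audit_log: list) -> dict:
    # Consent is captured as structured, reviewable data instead of a
    # global "pause until approval" state.
    answers = {q: input(f"{q} ") for q in args["questions"]}
    audit_log.append({
        "tool": "ask_user_questions",
        "at": datetime.now(timezone.utc).isoformat(),
        "questions": args["questions"],
        "answers": answers,
    })
    return {"answers": answers}

if __name__ == "__main__":
    log: list = []
    handle_ask_user_questions({"questions": ["Delete 3 stale branches?"], "blocking": True}, log)
    print(json.dumps(log, indent=2))
```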

“Context IDE” framing: IDE work shifts from editing code to constructing agent context

Context IDE (framework naming shift): RepoPrompt’s author says they added “IDE” to the product name because a growing chunk of the work is precise context construction—and the tooling is increasingly “an IDE for agents,” as described in the context IDE framing.

This is an ecosystem-level signal: agent frameworks are starting to treat context selection/packing (files, traces, goals, constraints) as the primary “editing surface,” with code edits downstream of that.

Terminal-Bench maintainers nudge tool builders toward Harbor framework portability

Harbor framework (Terminal-Bench ecosystem): A Harbor/Terminal-Bench maintainer asks whether an agent/benchmark tool can be converted to the Harbor framework—noting Harbor was used to ship Terminal-Bench 2.0—so the benchmark becomes reusable for Harbor users, as suggested in the Harbor conversion request.

This is a small but concrete interoperability push: “agent evaluation” projects that stay tied to one harness become harder to compare, reproduce, or extend across teams.

The “agent loop” diagram keeps showing up as the core mental model for agent systems

Agent loop mental model: A shared diagram summarizes the core execution loop as User Input → Model Inference → Agent Response/Tool Calls → feedback back into inference, as shown in the agent loop diagram.

It’s basic, but it’s becoming the common vocabulary for discussing harness changes (where to add tool routers, verification, compaction, and human question prompts) without arguing about UI or model branding.
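
In code, the loop in the diagram reduces to something like the sketch below, with a stubbed model and tool so the control flow is visible; no specific product's API is implied.

```python
# Minimal version of the loop in the diagram: user input -> model inference ->
# tool calls -> results fed back into inference, until the model answers.
# The model and tool implementations here are stubs.
def model_inference(messages: list) -> dict:
    # Stub: a real call would go to an LLM API. Here we fake one tool call
    # followed by a final answer so the loop terminates.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "tool": "list_files", "args": {}}
    return {"type": "answer", "text": "done"}

def run_tool(name: str, args: dict) -> str:
    return "README.md\nmain.py" if name == "list_files" else ""

def agent_loop(user_input: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        out = model_inference(messages)
        if out["type"] == "answer":
            return out["text"]
        # Tool result is appended and fed back into the next inference step.
        messages.append({"role": "tool", "content": run_tool(out["tool"], out["args"])})
    return "step limit reached"

print(agent_loop("what files are in this repo?"))
```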


🧰 Builder tools & repos: context IDEs, CLI utilities, and desktop automation agents

Non-assistant dev tooling in the feed: context construction tools, CLIs, desktop automation agents, and performance work on editor/search utilities.

Open Interpreter beta ships a “Desktop Agent” with built-in document editors

Open Interpreter (Desktop Agent): A beta app is being shown with first-class editors for PDF forms, Excel sheets, and Word docs—positioned as an agent that can fill forms and transform docs from natural-language instructions, according to the product summary in Desktop agent description.

It also emphasizes BYO model choice (user API keys or local models like Ollama) for privacy and offline workflows, while keeping a paid option for managed AI/support, per the same Desktop agent description.

RepoPrompt connects Claude Code, Codex, and Gemini CLIs as swappable providers

RepoPrompt: RepoPrompt surfaced a unified “CLI Providers” panel that can connect Claude Code CLI, Codex CLI, and Gemini CLI in one UI—with connection tests and sign-out controls—per the provider matrix screenshot in CLI providers panel.

The same setup shows up in real runs where Gemini CLI edits route through a “RepoPrompt MCP Server” apply_edits tool, as captured in Apply edits log, and the author explicitly frames the product as increasingly “an IDE for agents” via precise context construction in Context IDE note.

Toad’s fuzzy file search: Rust scoring (~14×) then Python indexing for huge repos

Toad (fuzzy search): A performance deep-dive describes moving from a costly “score every combination” fuzzy matcher to a Rust scoring implementation for a ~14× speedup, then scrapping that for a Python-built index that prunes candidates and ends up faster on very large repos like TypeScript (~84k files), as detailed in Optimization writeup.

Fuzzy search in TypeScript repo

The follow-up adds a parallel scoring approach using an interpreter pool to keep interactive search “under 50ms,” with a note that subinterpreters are expensive to launch and need pooling, as described in Interpreter pool update.
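
A toy version of the "index to prune, then score" idea (not Toad's implementation): a character-bigram index narrows the candidate set before a cheap subsequence score ranks what remains.

```python
# Toy prune-then-score fuzzy file search. A bigram index shrinks the
# candidate set; a crude subsequence score ranks survivors. This illustrates
# the shape of the optimization described above, not Toad's actual code.
from collections import defaultdict

def bigrams(s: str) -> set[str]:
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def build_index(paths: list[str]) -> dict[str, set[int]]:
    index: dict[str, set[int]] = defaultdict(set)
    for i, p in enumerate(paths):
        for bg in bigrams(p):
            index[bg].add(i)
    return index

def score(query: str, path: str) -> int:
    # Crude subsequence score: +1 per query char matched in order.
    total, pos, lowered = 0, 0, path.lower()
    for ch in query.lower():
        found = lowered.find(ch, pos)
        if found == -1:
            return -1
        total, pos = total + 1, found + 1
    return total

def search(query: str, paths: list[str], index: dict[str, set[int]], k: int = 5) -> list[str]:
    candidates = set(range(len(paths)))
    for bg in bigrams(query):
        candidates &= index.get(bg, set())      # prune before scoring
    ranked = sorted(candidates, key=lambda i: -score(query, paths[i]))
    return [paths[i] for i in ranked[:k] if score(query, paths[i]) >= 0]

paths = ["src/checker/types.ts", "src/compiler/parser.ts", "README.md"]
print(search("parser", paths, build_index(paths)))
```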

Workflow: Chrome DevTools heap snapshot → Cursor generates Python analysis tooling

Cursor workflow: A practitioner recipe for debugging memory issues is to export a heap snapshot from Chrome DevTools and drop it into Cursor, which then generates Python to analyze the snapshot and extract insights, as outlined in Memory snapshot steps. The point is reducing the friction between “I have a profiler artifact” and “I have scripts to interrogate it.”
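
The generated scripts tend to look something like the sketch below, which parses the standard V8 .heapsnapshot JSON and totals self_size by node type; the file path is a placeholder and the exact analysis is an assumption about what Cursor produces.

```python
# Sketch of a heap-snapshot analysis script: parse a Chrome DevTools
# .heapsnapshot (JSON) and total self_size by node type. Field offsets are
# read from the snapshot's own meta, so nothing is hardcoded.
import json
from collections import Counter

def summarize(path: str, top: int = 10) -> None:
    with open(path) as f:
        snap = json.load(f)
    meta = snap["snapshot"]["meta"]
    fields = meta["node_fields"]          # e.g. ["type", "name", "id", "self_size", ...]
    node_types = meta["node_types"][0]    # enum of type names
    type_idx = fields.index("type")
    size_idx = fields.index("self_size")
    stride = len(fields)

    sizes: Counter = Counter()
    nodes = snap["nodes"]                 # flat int array, `stride` entries per node
    for off in range(0, len(nodes), stride):
        type_name = node_types[nodes[off + type_idx]]
        sizes[type_name] += nodes[off + size_idx]

    for type_name, total in sizes.most_common(top):
        print(f"{type_name:20s} {total / 1e6:8.2f} MB")

if __name__ == "__main__":
    summarize("Heap.heapsnapshot")        # path is a placeholder
```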

Agent-generated Honeycomb SLO tables show up as an infra-config workflow

Honeycomb SLO automation: An example workflow shows an agent generating a structured SLI/SLO table (“all 17 SLOs defined in infra/honeycomb/”) and then being prompted to draft additional SLOs for Terraform integrations, as shown in SLO table output.

The practical angle is treating observability config as agent-maintained code artifacts (tables, targets, severity) rather than hand-curated docs, per the same SLO table output.

RepoPrompt leans into “context IDE” as the unit of agent ergonomics

Context IDE framing: RepoPrompt’s author says they “added the word IDE” because the product’s core is precise context construction, and that it’s “increasingly one for agents,” not people, as stated in Context IDE note. The claim becomes more concrete when you look at how its providers layer turns multiple CLIs into interchangeable backends, as shown in CLI providers panel.

visual-json: embeddable, schema-aware JSON editor optimized for human ergonomics

visual-json: A new minimalist JSON editing component is pitched as embeddable and schema-aware, with drag-and-drop, keyboard navigation, and a tree view for deep structures—positioned as a “human-first” editor in the launch blurb captured in visual-json feature list.

xurl CLI v1.0.3 adds agent-friendly endpoint shortcuts for the X API

xurl 1.0.3: The X API CLI tool shipped “agent-friendly shortcuts for endpoints” plus broader UX improvements, as summarized in xurl release note. This is aimed at making tool calls cheaper to specify (shorter, more repeatable commands) when an agent is driving API interactions.

“The web is underrated” resurfaces as an agent-friendly UI surface claim

Web ergonomics: The claim that “the web is underrated” is made in the context of shipping embeddable, component-like tooling (e.g., an embeddable JSON editor) in Web is underrated, with a quick +1 from Agreement reply. In practice, this frames browser-native components as a convenient surface for agent-adjacent tools because they can be embedded into existing internal apps and docs UIs, rather than requiring bespoke desktop shells.

Search/UX regressions (Gmail, Calendar) cited as anti-patterns for agent tooling

Reliability as product requirement: A highly-upvoted complaint about Gmail app search being unusable in practice shows how quickly trust collapses when retrieval is flaky, as illustrated by Gmail search complaint. A separate anecdote says Google no longer auto-adds flights to calendar, framed as “enshitification,” in Calendar regression note.

For builder tooling, these are being treated as negative examples: agents amplify the cost of brittle search and nondeterministic “helpful” automation, because it breaks downstream workflows that assume the system can reliably find and reconcile state.


⚙️ Inference efficiency & long-context serving: offloading, caching, and throughput

Systems-oriented items: sparse attention designed for KV offloading and practical serving constraints (caching/inference providers) that block scaling even when raw GPUs are available.

NOSA trains sparse attention for KV-cache offloading instead of fighting PCIe at inference

NOSA (THUNLP/OpenBMB): A new sparse-attention recipe targets the real long-context bottleneck—KV cache movement and size, not FLOPs—by adding locality constraints during training so inference can offload KV blocks efficiently, as described in the NOSA overview. It pairs a query-aware selection (for relevance) with a query-agnostic component (for stable eviction) to reduce CPU↔GPU traffic while keeping accuracy close to full attention.

Throughput claims: The thread reports up to 5.04× higher decoding throughput versus full attention and 1.92× versus InfLLM-v2 in their setup, with tables/figures included in the NOSA overview.
Serving implication: The pitch is “native offloading” (training matches inference); that’s aimed at avoiding the usual mismatch where sparse attention reduces compute but still leaves you stuck on KV cache residency, per the NOSA overview.
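
A toy numpy sketch of the selection idea as described above (combine a query-aware relevance score with a query-agnostic score to decide which KV blocks stay resident); this illustrates the framing, not NOSA's actual algorithm, and every constant is assumed.

```python
# Toy KV-block selection: mix a query-aware relevance score with a
# query-agnostic score (recency as a stand-in for the trainable component)
# to pick which blocks stay on GPU and which get offloaded to CPU.
# Illustration only, not NOSA's method.
import numpy as np

rng = np.random.default_rng(0)
n_blocks, d = 64, 128
block_keys = rng.standard_normal((n_blocks, d))   # one summary key per KV block
query = rng.standard_normal(d)

# Query-aware: which blocks look relevant to the current query.
relevance = block_keys @ query
# Query-agnostic: stable preference independent of the query, which is what
# makes eviction/offloading decisions predictable across steps.
recency = np.linspace(0.0, 1.0, n_blocks)

alpha = 0.7                                       # mixing weight, assumed
combined = alpha * (relevance / np.abs(relevance).max()) + (1 - alpha) * recency

budget = 16                                       # blocks kept resident on GPU, assumed
keep = np.argsort(-combined)[:budget]
offload = np.setdiff1d(np.arange(n_blocks), keep)
print(f"keep on GPU: {len(keep)} blocks; offload to CPU: {len(offload)} blocks")
```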

OpenAI Batch API adds GPT Image models for 50k async jobs at 50% lower cost

Batch API (OpenAI): Batch processing now supports GPT Image models (including gpt-image-1.5 and chatgpt-image-latest) for async generation/edit jobs, with a stated 50% lower cost and separate rate limits, as announced in the Batch API update. This is a pure throughput play.

Scale knobs: The limit called out is up to 50,000 generations/edits per job, and the example shows a 24h completion window in the Batch API update.
Practical effect: This shifts large image workloads away from interactive latency into queueable, cheaper runs, per the Batch API update.
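
A hedged sketch of what queueing image jobs could look like with the existing Batch API workflow (build a JSONL file, upload it, create the batch); the per-line endpoint string and body fields for image jobs are assumptions, while the model name and 24h window come from the announcement.

```python
# Hedged sketch of queueing image jobs through the Batch API, following the
# existing batch workflow (JSONL file -> upload -> create batch). The
# endpoint string and per-line body fields for image jobs are assumptions;
# the model name and completion window come from the announcement.
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per image request, up to the stated 50,000 per job.
lines = [
    {
        "custom_id": f"img-{i}",
        "method": "POST",
        "url": "/v1/images/generations",          # assumed batch endpoint for image jobs
        "body": {"model": "gpt-image-1.5", "prompt": f"product shot variant {i}"},
    }
    for i in range(3)
]
with open("image_batch.jsonl", "w") as f:
    f.write("\n".join(json.dumps(line) for line in lines))

batch_file = client.files.create(file=open("image_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/images/generations",            # assumed; mirrors the JSONL `url`
    completion_window="24h",                      # the stated completion window
)
print(batch.id, batch.status)
```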

The new bottleneck: inference providers with caching, not raw GPUs

Serving constraint (operators): A concrete ops signal is that “more GPUs” isn’t the limiting factor once you have real traffic; you need an inference provider that can do things like caching and production request handling, as stated in the Capacity note. Real traffic breaks naive setups.

This frames efficiency work as systems integration: cache layers, batching, rate-limit shaping, and reliability engineering. The point is visible in the Capacity note claim that their traffic “cannot be handled” without provider-grade inference plumbing.

Desk-side memory stacks: Slack and docs into local vector+KG for agents

Local memory stack (builders): A builder describes running a full “contextual memory stack” on a desk-side box—feeding Slack, meetings, and docs into vector search plus a knowledge graph, then querying it in plain English from an agent workflow, as outlined in the DGX Spark build. Local-first RAG is getting practical.

The same thread positions this as an OpenClaw add-on (persistent memory outside the model), with a follow-up pointer to the build/log in the DGX Spark note. It’s a reminder that long-context isn’t only a model feature; it’s also a storage+retrieval architecture choice.

Taalas HC1 tok/s demos land; the argument moves to context scaling and latency

Taalas HC1 (Taalas): Following up on HC1 ASIC (17k tok/s claim), a fresh UI badge screenshot shows 15,713 tokens/second for a run, as posted in the tok/s screenshot. People immediately ask who can read output that fast.

Two adjacent interpretations show up in the same feed: one frames extreme tok/s + huge contexts as primarily for agents coordinating with other agents, as argued in the AI-to-AI coordination take; another questions whether these chips can scale to large contexts once KV/memory dominates, hinted at by the Context scaling doubt.


🏗️ Infra economics & reliability: quotas, provider needs, and mid-tier plans

Operational constraints that directly affect shipping: quota waits, the need for caching providers, and new mid-tier pricing in agent-adjacent infrastructure. Excludes model evals and tool feature releases.

ChatGPT “Pro Lite” shows up at $100/month in checkout config

ChatGPT (OpenAI): Following up on Pro Lite code hints (plan strings in web app), a checkout flow now appears to price Pro Lite at $100/month, with a Network/JSON response showing a prolite amount of 100.0 in the checkout pricing capture.

SKU wiring evidence: The web bundle also contains chatgptprolite / prolite plan identifiers, as shown in the plan id strings.
Positioning signal: It’s described as a not-yet-available tier between Plus and Pro in the plan rumor note, matching user complaints about the $20→$200 jump in the mid-tier pricing gap.

No official limits or model access list is confirmed in these tweets yet; what’s newly concrete is the $100 price point.

Inference provider capacity (not raw GPUs) becomes the scaling bottleneck

Inference capacity (Builders): Some teams say “more GPUs” is the wrong ask; production traffic needs a full inference provider layer—especially caching—or the system can’t keep up, as stated in the provider constraint note.

The practical signal is that agent-heavy products are hitting serving realities (request shaping, cache hit rates, concurrency) before they hit raw chip scarcity.

Browser Use adds a $100/mo Starter plan with concurrency and proxy pricing

browser_use (Browser automation): A new Starter plan lands at $100/month with $100 in credits, 50 concurrent sessions, and proxy data at $7.50/GB, plus an annual option at $83/month, per the Starter plan announcement.

This is one of the clearer “agents cost money in concurrency” pricing surfaces: sessions and proxy bandwidth are first-class line items, not hidden behind per-seat tiers.
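
A back-of-envelope way to read that pricing: the base price and per-GB rate come from the announcement, while the workload numbers below are assumptions.

```python
# Back-of-envelope cost sketch for the Starter plan numbers above.
# The $100/month base (with $100 in credits) and $7.50/GB proxy rate come
# from the announcement; sessions-per-month and GB-per-session are assumptions.
BASE_MONTHLY = 100.00          # USD
PROXY_PER_GB = 7.50            # USD per GB of proxy data

sessions_per_month = 2_000     # assumed workload across the 50 concurrent sessions
gb_per_session = 0.05          # assumed: ~50 MB of proxied traffic each

proxy_gb = sessions_per_month * gb_per_session
proxy_cost = proxy_gb * PROXY_PER_GB
print(f"proxy data: {proxy_gb:.0f} GB -> ${proxy_cost:,.2f}")
print(f"rough monthly total before credits: ${BASE_MONTHLY + proxy_cost:,.2f}")
```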

Gemini 3.1 Pro quota waits trigger “bring your own key” requests in Antigravity

Antigravity (Gemini): A user asks Google to add custom API key support for Gemini 3.1 Pro because they’re “waiting for hours” mid-flow on the hosted entitlement path, per the API key support request.

This reads like quota/priority friction rather than model quality friction: the model is there, but interactive development gets gated by who can route around shared limits.

Downtime checks become part of the daily workflow for core AI services

Service reliability (ChatGPT): A simple “Is ChatGPT down?” post in the downtime check is a small but recurring ops signal: availability is now a visible external dependency for teams who build and ship against these surfaces.

It’s not a benchmark story; it’s an uptime-and-backup-plans story.


🧩 Skills, configs, and extensions: opencode ergonomics and model-mixing setups

Installable extensions/config layers around coding agents, especially opencode: performance-oriented editing helpers and multi-model configurations. Excludes MCP/protocol items.

oh-my-opencode 3.8.0 makes hashline editing default and adds ultrawork model pinning

oh-my-opencode 3.8.0 (OpenCode ecosystem): A new release flips hashline editing to the default (ported from oh-my-pi) and adds the ability to choose a separate model for ultrawork mode, as outlined in the release blurb; the author frames it as “better model performance” plus improved intent capture and context optimization, with extra details reiterated in the follow-up note.

Ergonomics change: hashline editing becomes the default behavior rather than an opt-in, per the release blurb.
Cost/perf knob: ultrawork can be pinned to a different model (example given: “kimi as normal… opus as ulw”), as described in the release blurb.

It’s a config-layer move: instead of swapping models manually per task, the wrapper makes “heavy mode vs normal mode” a first-class routing decision.
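A hypothetical sketch of that routing decision, with placeholder model ids rather than oh-my-opencode's actual config schema:

```python
# Hypothetical "pin a different model per mode" knob; keys and model ids are
# placeholders, not the real oh-my-opencode configuration.
MODE_MODELS = {
    "normal": "kimi-latest",      # cheap/fast default (mirrors the "kimi as normal" example)
    "ultrawork": "claude-opus",   # heavier model only when the mode is explicitly requested
}

def pick_model(mode: str) -> str:
    return MODE_MODELS.get(mode, MODE_MODELS["normal"])

print(pick_model("normal"), "|", pick_model("ultrawork"))
```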

Opencode users report workable multi-model configs without Anthropic

OpenCode config mixing (model routing): One practitioner says they spent days tuning an opencode + omo configuration to see if they could operate “without Anthropic,” and claims success by balancing Codex, GLM, and MiniMax, as described in the config outcome.

The notable engineering detail is the implied “escape hatch” pattern: treat your coding agent stack as a router over multiple providers, rather than binding the entire workflow to one vendor’s auth, limits, or availability.
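A minimal illustration of that router pattern, with stand-in provider functions (the provider names come from the post; the call signatures are invented for the sketch):

```python
# Illustrative provider-fallback router; the provider functions are stand-ins.
def call_codex(p):   return f"[codex] {p}"
def call_glm(p):     raise RuntimeError("rate limited")     # simulate a provider outage
def call_minimax(p): return f"[minimax] {p}"

PROVIDERS = [call_glm, call_codex, call_minimax]   # ordered by preference/cost

def complete(prompt: str) -> str:
    last_err = None
    for provider in PROVIDERS:
        try:
            return provider(prompt)                # first healthy provider wins
        except Exception as err:                   # auth, quota, availability...
            last_err = err
    raise RuntimeError(f"all providers failed: {last_err}")

print(complete("refactor the billing module"))     # falls through GLM to Codex
```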

Hashline editing was reimplemented with GPT-5.3 Codex after an Opus port

hashline editing (oh-my-opencode extension): The maintainer reports that the hashline editing tool was “ported… by Opus once,” but later fully rewritten using GPT-5.3 Codex, with the takeaway claim that “codex is just better at coding,” as stated in the porting note.

This is a concrete example of a workflow some teams are adopting: use one model to draft/port quickly, then switch to a different coding-focused model to rework correctness and maintainability when the first pass doesn’t hold up under review.

Claude Code’s open-source code-simplifier plugin shows up in daily subagent use

code-simplifier (plugin): A Claude Code maintainer says they “use default subagents mostly” and reach for code-simplifier often, noting it’s open source “in the plugin repo,” as mentioned in the plugin usage note.

In practice, this is the kind of small, reusable extension that becomes glue for large batch changes (migrations/refactors) when paired with parallel subagent workflows.


🖥️ Accelerators: tok/s arms race and on-device compute curiosity

Hardware performance chatter: tok/s claims from new ASICs and renewed attention to edge modules/boards as inference speed becomes a first-order UX constraint.

Taalas HC1 tok/s claim gets a concrete “15,713 tok/s” on-screen datapoint

Taalas HC1 (Taalas): Following up on HC1 ASIC (17k tok/s inference claim), a new on-screen perf badge shows a specific datapoint—“Generated in 0.037s • 15,713 tok/s”—shared in the Tok/s badge post, which is the kind of evidence engineers can screenshot and sanity-check against their own end-to-end latency assumptions.
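The sanity check is trivial arithmetic on the two numbers shown in the badge:

```python
# Both figures come from the screenshot; the arithmetic is just a consistency check.
rate_tok_s = 15_713
wall_time_s = 0.037
tokens_generated = rate_tok_s * wall_time_s
per_token_ms = 1000 / rate_tok_s
print(f"≈{tokens_generated:.0f} tokens in {wall_time_s}s, ≈{per_token_ms:.3f} ms/token")
# ≈581 tokens at ~0.064 ms/token of decode, so end-to-end latency is then dominated
# by everything around the model (network, tool calls, verification), not decode.
```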

The open question stays the same: how much of that number survives real workloads once you include KV cache growth, batching constraints, and any host↔device transfer overhead.

HC2 “gap-closing” timeline math: ASICs vs frontier models could converge in ~2 years

ASIC roadmap chatter (Taalas): A thread argues that while HC1’s headline throughput is on a smaller model today, the announced HC2 “this winter” implies the model–hardware gap could “converge to 0 in the next 2 years,” according to the HC2 timeline claim.

This is less about raw tok/s bragging and more about planning risk: if ASIC generation cadence gets close to model release cadence, teams may see faster step-changes in interactive agent UX than they’re used to.

15k tok/s and million-token contexts reframed as “AI-to-AI coordination” capacity

Speed as a product constraint: One framing making the rounds is that 15,000 tokens/sec output and million-token context windows “aren’t for humans” but for agents to “talk to each other & coordinate faster,” as stated in the Agents talk to each other post.

Interaction design implication: Another post underscores that “output this fast makes many new things possible,” reinforcing speed as a UX primitive rather than a benchmark vanity metric, per the Speed enables new things follow-on.

The practical takeaway is that throughput shifts the bottleneck from model waiting time to orchestration, verification, and tool latency.

Skepticism: extreme tok/s claims may not scale to large contexts (memory/KV constraints)

Large-context throughput skepticism: A counterpoint highlights uncertainty about whether very high tok/s chips will hold up at large context lengths, citing memory constraints and context scaling concerns in the Context scaling doubt retweet.

This is the same underlying issue inference teams keep running into: decode speed headlines can ignore the real limiter once contexts get long—memory bandwidth and KV-cache management.
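Rough KV-cache sizing makes the concern concrete; the model shape below is a generic illustrative config, not a claim about Taalas's hardware or its target models:

```python
# Rough KV-cache sizing for a generic transformer config (illustrative numbers only).
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2   # fp16/bf16 K and V

def kv_cache_gb(context_tokens: int, batch: int = 1) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per   # K + V, all layers
    return context_tokens * batch * per_token / 1e9

for ctx in (8_000, 128_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {kv_cache_gb(ctx):6.1f} GB of KV cache per sequence")
# At million-token contexts the cache alone is ~130 GB per sequence, so sustained
# decode speed depends on memory capacity and bandwidth, not just compute.
```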

Nvidia Jetson Thor chatter: renewed interest in edge modules as on-device inference grows

Jetson Thor (Nvidia): A small but telling signal of edge curiosity—Jetson Thor’s board traces are described as “alien level” in the Jetson Thor traces post, echoing how often engineers are now scanning for on-device inference headroom (power, memory, thermals) rather than only datacenter GPUs.

There aren’t benchmark numbers in these tweets yet, but the attention itself is a demand signal: more teams expect meaningful local/edge inference workloads soon.


💼 Enterprise + market signals: revenue trajectories, adoption spikes, and platform bets

Business signals relevant to engineers/leaders: OpenAI revenue projections, Anthropic vs OpenAI growth narratives, adoption metrics, and ‘platform over SaaS’ positioning in the agent era.

OpenAI revenue forecast raised: $30B (2026) and $62B (2027), 910M weekly actives

OpenAI (business signal): A report recap claims OpenAI raised its revenue forecast by 27%, projecting $30B in 2026 and $62B in 2027 after hitting $13.1B last year; it also cites 910M weekly active users and enterprise revenue expected to more than triple to $8B this year, as summarized in the Forecast recap.

For engineers and product leads, the practical read is demand shape: subscriptions + enterprise are still framed as the primary growth engine, with “ads and hardware bets” mentioned as additional vectors in the same Forecast recap.

ChatGPT Pro Lite pricing leaks at $100/month via checkout + web app strings

ChatGPT (OpenAI): Following up on Plan strings—the earlier “prolite” plan identifier—the web checkout flow now appears to show Pro Lite at $100/month, with a matching "prolite": { "month": { "amount": 100.0 } } response visible in devtools per the Checkout screenshot.

The same plan name shows up in the web app’s plan enum wiring (chatgptprolite / prolite) as captured in the Plan code snippet.

It’s not publicly purchasable yet—multiple posts frame it as “working on” / “not available yet” in the Plan rumor post—but the price point directly targets the gap between $20 Plus and $200 Pro that’s been a recurring complaint.

Vercel’s Slack platform thesis: agents extend SaaS instead of replacing it

Slack agents (Vercel): Guillermo Rauch argues “Creation as a Service” beats “Software as a Service,” but not by rebuilding everything; the concrete bet is that Slack becomes the extensible substrate where teams deploy “@ agents,” with Vercel positioning its gateway/workflows/sandbox stack (starting with v0) as deployment infrastructure for Slack-native agents, per the Slack platform post.

This frames “SaaS is cooked” as conditional: commodity models matter less than where agent integrations live (identity, permissions, and distribution), and the center of gravity shifts to platforms that can host reliable agent extensions.

Claim: Anthropic cybersecurity messaging moved ~$15B in cyber stocks

Anthropic (market narrative power): One post claims an Anthropic cybersecurity-themed announcement/tweet “raised $15B” across cybersecurity equities, arguing the same dynamic happened after earlier “agentic” plugin messaging; the evidence shown is a table of major cyber stocks and their moves in the Cyber stocks screenshot.

The causal link isn’t substantiated in the tweet (no event-study or timestamp alignment), but it’s a clean example of how frontier-lab comms can become a short-term market input even when the underlying product rollout is still limited.

Gemini referrals to Render jump to 60% of ChatGPT referral volume

Gemini distribution (Google): Render’s referral mix is shifting: a founder report says Gemini → Render referrals have surged to 60% of ChatGPT referral volume, up from 10% three months ago, per the Referral share datapoint.

As a platform signal, this is a proxy for where builders are starting sessions and discovering infra tooling; it’s not model quality directly, but it can precede changes in default stacks and “what people reach for first” when shipping.

OpenAI pricing gap debate: why no $70–$100 tier between Plus and Pro?

ChatGPT pricing (OpenAI): A thread calls OpenAI’s consumer pricing “frustrating,” arguing the jump from $20/month to $200/month leaves no natural upgrade path around $70–$100, as laid out in the Pricing gap critique.

This discussion is now tightly coupled to the Pro Lite leak: if a $100 tier exists, the argument shifts from “missing SKU” to “what limits and model access $100 buys,” which still isn’t described in any primary OpenAI artifact in the tweets.

“Refrigeration vs Coca-Cola” analogy resurfaces for app-layer moats in AI

Application-layer moats: The “refrigeration made money, Coca-Cola made most of it” analogy gets re-shared as a compact way to argue that enabling tech commoditizes while distribution + productization capture outsized value, as echoed in the Analogy retweet.

The take implied by the thread is that model quality improvements alone don’t guarantee durable advantage; the wedge is packaging, workflow placement, and go-to-market.


📈 Evals reality check: METR debate, Arena shifts, and ‘reasoning effort’ metrics

Continues yesterday’s eval discourse with more skepticism about METR deltas and new measurement ideas (reasoning effort vs token length), plus leaderboard movement signals.

Google proposes “deep-thinking tokens” as a better reasoning-effort metric

Deep-thinking tokens (Google/academia): A new paper argues token count is a weak proxy for reasoning; it proposes “deep-thinking tokens” (layer-to-layer prediction instability) and reports the deep-thinking ratio correlates strongly with accuracy across math/science benchmarks, as described in Paper summary.

Cost control hook: The paper also introduces Think@n, a test-time strategy that prioritizes samples with high deep-thinking ratios and early-rejects low-quality partial outputs, per the method description in Paper summary.

Builders question METR deltas after GPT‑5.3‑Codex plots below Opus 4.6

METR methodology debate: Skepticism is growing around whether METR’s setup is still representative, with one thread arguing the placement of GPT‑5.3‑Codex relative to Opus 4.6 “doesn’t jive with experience,” and calling for changes so the eval remains relevant, as argued in METR critique.

This critique sits on top of the prior METR storyline—see GPT‑5.3 horizon—and the core complaint is less about the trendline and more about harness/task mix being code-agent-specific in a way that may advantage certain stacks.

METR’s Opus 4.6 chart sparks focus on the “80% horizon” plateau

METR time horizon (METR): Following up on METR horizon (Opus 4.6 at ~14.5h p50), a new thread highlights a split where the p50 task length keeps stretching but the "80% success rate" horizon is still ~1 hour, per the extrapolation framing in Horizon extrapolation.

The practical read for evaluation consumers is that "how long until it works half the time" and "how long until it works most of the time" are diverging metrics, and projections depend heavily on which curve you care about.
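To see how far the two horizons can diverge, here is a toy log-logistic success curve (an illustrative model, not METR's methodology): with a shallow slope, a ~14.5h p50 horizon coexists with a p80 horizon under an hour.

```python
# Illustrative only: a toy log-logistic success-vs-length curve, not METR's actual fit.
def horizon(l50_hours: float, slope_k: float, target_p: float) -> float:
    """Task length at which modeled success equals target_p, for p(L) = 1 / (1 + (L/l50)^k)."""
    return l50_hours * ((1.0 - target_p) / target_p) ** (1.0 / slope_k)

l50 = 14.5                      # ~p50 horizon cited for Opus 4.6 (hours)
for k in (0.5, 1.0, 2.0):       # smaller k = shallower, more "jagged" success curve
    print(f"k={k}: p50={horizon(l50, k, 0.5):5.1f}h  p80={horizon(l50, k, 0.8):5.2f}h")
# k=0.5 gives p50≈14.5h but p80≈0.9h, which is roughly the split the thread describes.
```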

Throughput becomes an eval axis: “tok/s isn’t for humans” framing spreads

Inference speed as a capability metric: Posts argue that 15,000+ tokens/sec and million-token contexts matter less for human readability and more for agent-to-agent coordination speed, with “they are for the AIs to talk to each other” as the core claim in Agent coordination quote.

The evaluative implication is that raw correctness benchmarks are getting paired with interaction throughput metrics (latency/stream rate) as separate gating factors for multi-agent systems.

Agent tool-call telemetry shows software engineering at ~49.7% share

Agent deployment mix: A shared chart breaks down “% of tool calls by domain,” showing software engineering at 49.7% and back-office automation at 9.1%, with other categories each far smaller, as shown in Tool call domains.

The signal for product and infra teams is that most current agent load appears to be driven by dev-centric workloads, with administrative/document automation a clear second cluster rather than a long tail.

Claude Sonnet 4.6 shows a large Code Arena jump in LMArena

LMArena (Arena): A new Arena post claims Claude Sonnet 4.6 landed around #3 in Code and improved overall text placement, with particularly large gains in WebDev and smaller tradeoffs in multi-turn/longer-query categories, as summarized in Arena ranking note.

Treat it as directional rather than definitive unless you can map the category slices to your own workload distribution.

A “photorealistic face in WebGL” prompt is emerging as a hard graphics probe

Qualitative model probing: One builder uses “Generate a photorealistic realtime render of a human face with webGL” as a stress test and reports the outputs remain failure-prone across multiple top models, with the most notable run producing distorted results shown in WebGL face demo.

WebGL face failure demo

As a pattern, this is a reminder that some domains (real-time graphics + shaders) still expose sharp edges that don’t show up in code-only leaderboards.

Analysts push back on “new paper” posts that are actually months old

Benchmark discourse hygiene: A callout notes an account reposting older work as a new release, pointing out the referenced “Illusion of Thinking” paper dates to June 2025 and didn’t hold up as strongly after subsequent model improvements, as explained in Date correction.

This kind of timestamp-checking is becoming a necessary filter when research claims get re-amplified with autogenerated commentary.


🎬 Generative media pipelines: images at scale, video coherence, and 3D asset workflows

Plenty of creator+builder media content today: OpenAI’s image batching API, Seedance 2.0 experimentation/outages, and workflows for generating game/3D assets and simulations.

OpenAI Batch API supports GPT Image models for 50,000-image async workloads

Batch API (OpenAI): Batch now supports GPT Image models for large-scale generation/editing; jobs run async (up to 50,000 requests) with ~50% lower cost and separate rate limits, as announced in the Batch API supports images post.

This matters for teams doing catalog refreshes, A/B creative variants, or bulk edits because it moves image workloads off interactive quotas while reducing unit cost.
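A hedged sketch of what such a job looks like: the JSONL upload and batch creation follow the existing Batch API shape, but the image endpoint path and model id below are assumptions taken from the announcement summary rather than verified parameter names.

```python
# Hedged sketch of a GPT Image batch job. Batch mechanics (JSONL file, 24h window)
# follow the existing Batch API; the image URL and model id are assumptions.
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per image request (up to 50,000 per job per the announcement).
requests = [
    {
        "custom_id": f"product-{i}",
        "method": "POST",
        "url": "/v1/images/generations",          # assumed batch endpoint for images
        "body": {"model": "gpt-image-1.5",        # assumed model id from the announcement
                 "prompt": f"Studio photo of product variant {i}",
                 "size": "1024x1024"},
    }
    for i in range(3)
]
with open("image_batch.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

batch_file = client.files.create(file=open("image_batch.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/images/generations",            # must match the per-line url
    completion_window="24h",                      # matches the 24h window in the post
)
print(job.id, job.status)                         # poll later with client.batches.retrieve(job.id)
```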

A “photorealistic WebGL face” prompt becomes a stress test for models

Graphics stress test (Prompt): A recurring eval prompt—“Generate a photorealistic realtime render of a human face with webGL”—is being used to compare coding models, with reports that most outputs are still visibly wrong or unstable, as shown in the WebGL face prompt clip.

WebGL face output

As a signal, it highlights that “can ship apps” doesn’t automatically translate to reliable shader/graphics generation where tiny numeric/structural errors produce catastrophic visuals.

A quick spatial-consistency check for Seedance 2.0 video outputs

Seedance 2.0 (ByteDance): A concrete coherence test is spreading: pause on a wide shot, then verify whether close-ups preserve spatial relationships (e.g., seating order), as demonstrated in the Spatial consistency probe clip.

Wide-to-close seating check

This is a lightweight review heuristic for spotting continuity drift without needing frame-by-frame annotation.

AI-first workflow for generating GTA mod assets

GTA modding assets (Workflow): A prompt-to-asset pipeline is being prototyped to generate custom vehicles/assets “100% with AI” and place them into GTA V, as shown in the GTA asset workflow demo.

AI assets in GTA V

For builders, this is a concrete example of media pipelines shifting from “generate an image” to “generate game-ready assets + integration loop” (model output, conversion, and in-engine validation).

Seedance 2.0 creators hit “failure to generate” network errors

Seedance 2.0 (ByteDance): Creator experiments are running into reliability limits, with users showing “Network errors, failure to generate” during retries in the Failure to generate update.

The operational point is that coherence gains don’t help if the model can’t complete runs consistently; several posts frame outages as the practical constraint while iterating on prompts, as echoed in the Outage reaction thread.

Antigravity queueing prompts requests for BYO Gemini 3.1 Pro keys

Antigravity (Gemini): Users are asking Google for a “custom API key” option for Gemini 3.1 Pro because waiting “hours” mid-flow breaks iteration loops, per the Custom key request complaint.

This is a workflow signal that media-heavy runs (simulations/animations) are stressing shared capacity, and that creators want a predictable throughput escape hatch via BYOK billing.

Gemini 3.1 Pro is being used for physics-style simulation videos

Gemini 3.1 Pro (Google): Builders are using Gemini 3.1 Pro inside Antigravity to produce simulation-style videos with explanatory framing, as shown in the Black hole simulation post.

Black hole sim clip

This is less about “pretty frames” and more about end-to-end outputs that combine visuals with structured narrative (useful for education and demo content).

Seedance 2.0 “one-shot trailer” demos circulate despite instability

Seedance 2.0 (ByteDance): One-shot “AAA movie trailer” style demos are getting shared, suggesting stronger within-clip coherence when the model isn’t asked to stitch multiple generations, as shown in the Trailer prompt demo example.

One-shot trailer demo

The same threads also acknowledge intermittent generation failures, with the Failure to generate screenshot providing a concrete example of the reliability tax during iteration.

Award-category talk (“Best AI Actor/Film”) shows generative video going mainstream

Generative media culture: Public-facing entertainment discussion is starting to treat “Best AI Film” or “Best AI Actor” as plausible future award categories, per the AI awards speculation clip.

McConaughey on AI awards

This is a soft signal, but it tracks increasing visibility of AI-native video pipelines outside the usual ML/tooling circles.


📚 RAG & knowledge systems: agentic retrieval playbooks and ‘vector DB’ aftershocks

Retrieval-focused items are lighter today but include a major RAG playbook drop, plus ongoing signals that teams are rebuilding retrieval/memory stacks for agents (beyond the 2023 vector DB wave).

Mastering RAG ebook maps the 2026 agentic retrieval stack end to end

Mastering RAG (Galileo / TheTuringPost): A free ~240-page “Mastering RAG” ebook is being shared as a practical field guide for agentic RAG systems—covering chunking/embeddings/reranking, retrieval+generation eval frameworks, and adaptive retrieval patterns, per the resource drop in Ebook announcement. It’s the kind of consolidated reference teams use to standardize implementation and evaluation—especially as “RAG” shifts from “add a vector DB” to “operate a retrieval system with measurable quality.”

What’s unusually useful: It explicitly frames work as trade-offs across accuracy, latency, and cost (not just model selection), as described in Ebook announcement.
Why it lands now: It targets self-correction, query decomposition, and adaptive retrieval—the patterns showing up in agent architectures rather than classic chatbot RAG, per Ebook announcement.

Builders prototype local “agent memory” as vector search plus knowledge graph

Local memory stack (agent knowledge systems): A builder reports running a full contextual memory stack locally—Slack/meetings/docs feeding into vector search plus a knowledge graph—so an agent can query it in plain English, as described in Local memory stack build and followed up with a build log pointer in Build log note. This is a concrete move away from “LLM writes its own memories” toward a more legible, inspectable retrieval substrate.
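A toy version of that pattern, with a bag-of-words index standing in for a real embedding model and a triple list standing in for a graph database:

```python
# Toy sketch of the "vector search + knowledge graph" pattern. The bag-of-words
# index and triple list are stand-ins for a real embedding model and graph store.
from collections import Counter
import math

docs = {
    "slack-2031": "Sam said the billing migration slips to March",
    "meeting-17": "Design review: billing migration blocked on auth service",
}
graph = [("billing migration", "blocked_on", "auth service"),
         ("billing migration", "owner", "Sam")]

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm or 1.0)

index = {doc_id: embed(text) for doc_id, text in docs.items()}

def query(q: str, k: int = 2):
    qv = embed(q)
    hits = sorted(index, key=lambda d: cosine(qv, index[d]), reverse=True)[:k]
    facts = [t for t in graph if t[0] in q.lower() or t[2] in q.lower()]
    return {"chunks": [docs[h] for h in hits], "facts": facts}

print(query("what is blocking the billing migration?"))
```

The point of the split is legibility: the graph half returns inspectable facts, while the vector half returns raw context chunks the agent can quote.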

LlamaCloud pitches “vibe-coded” deterministic workflows over documents

LlamaCloud (LlamaIndex): LlamaCloud says it’s building an “agentic layer” that lets users describe deterministic workflows in natural language for back-office document operations (invoices, claims packets, loan files), positioning this as the successor to brittle RPA and pure chat-based automation, as outlined in Back-office automation note. This is a clear retrieval+workflow thesis: the interface is conversational, but the output is an auditable, repeatable pipeline.

Continual memory still feels unsolved: “LLMs writing memories” is a bad UX

Continual learning vs. retrieval memory: A builder points out how unsatisfying and inefficient it is to have LLMs “literally write their own memories,” asking for follow-ups to Titans / Nested Learning–style approaches in Memory frustration note. It’s a useful framing for leaders: a lot of “memory” product work is still engineered retrieval and state management, not true continual learning.

The vector DB hype era becomes a reference point for “RAG maturity”

Vector DB cycle marker: A small but telling sentiment shift shows up in the rhetorical question “Do you remember how the hottest thing in AI were vector databases?” in Vector DB nostalgia. The point is: retrieval is no longer being discussed as a single product choice; it’s being treated as an evolving systems layer (indexing, reranking, evals, memory policies) that agents lean on continuously.


🧪 Training + reasoning techniques: prompt repetition, frontier recipes, and RLHF tooling

A small cluster of training/recipe signals: prompt-level performance hacks, summaries of frontier training methodologies, and reinforcement learning practice artifacts.

Deep-thinking tokens: a layer-instability signal for when LLMs are actually reasoning

Deep-thinking tokens (Google Research): A new paper argues token count is a bad proxy for reasoning quality and introduces deep-thinking tokens—positions where the model’s token predictions shift a lot across deeper layers before stabilizing; the authors report this “deep-thinking ratio” correlates with accuracy better than length across AIME 24/25, HMMT 25, and GPQA-diamond, as summarized in the paper breakdown.

Think@n strategy: The paper also proposes Think@n, a test-time compute policy that prioritizes samples with high deep-thinking ratios while early-rejecting low-quality partial outputs to cut cost without sacrificing accuracy, per the paper breakdown.
Why engineers cared today: The thread frames this as a practical knob for inference-time scaling—separating “reasoning” from “rambling”—and ties it to emerging interest in reasoning-language-model workflows in the RLMs follow-up.
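For intuition, here is a rough logit-lens-style approximation of the layer-instability signal; the model choice, the "late layers only" restriction, and the 0.25 cutoff are illustrative assumptions, not the paper's exact method.

```python
# Rough logit-lens approximation of "deep-thinking" positions: tokens whose top-1
# prediction keeps changing across later layers before stabilizing. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "To solve 17 * 24, first compute 17 * 20 = 340, then add 17 * 4 = 68."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight        # (vocab, d_model)
hidden = torch.stack(out.hidden_states)               # (layers+1, 1, seq, d_model)
layer_preds = (hidden @ unembed.T).argmax(-1)         # top-1 token per layer/position

late = layer_preds[layer_preds.shape[0] // 2 :, 0]    # only the deeper half of the network
changes = (late[1:] != late[:-1]).float().mean(0)     # per-position instability in [0, 1]
deep_thinking_ratio = (changes > 0.25).float().mean().item()  # 0.25 is an arbitrary cutoff
print(f"deep-thinking ratio ≈ {deep_thinking_ratio:.2f}")
```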

Coding agents are Model + Harness + Surface—and Codex is trained with the harness

Codex harness co-design (OpenAI ecosystem): A widely shared claim is that Codex models are trained “in the presence of the harness”—tool loops, execution, compaction, and iterative verification are part of the training environment rather than bolted-on behaviors, as described in the harness co-design excerpt.

Decomposition that’s spreading: The same thread reduces a coding agent to Model + Harness + Surface, which is becoming a convenient mental model for builders deciding what to change when results regress (swap model vs edit harness vs change UI), per the agent decomposition note.

The tweets don’t include an eval artifact, so treat the “trained with harness” framing as an explanatory narrative—not a verified training spec.
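A minimal sketch of the decomposition itself (purely illustrative class boundaries, not any vendor's architecture), useful mainly as a checklist for which box to change when results regress:

```python
# Illustrative "Model + Harness + Surface" decomposition; the boundaries are the point.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    complete: Callable[[str], str]            # swap this to change raw capability

@dataclass
class Harness:
    tools: dict[str, Callable[[str], str]]    # execution loop, compaction, retries live here
    def run(self, model: "Model", task: str) -> str:
        draft = model.complete(task)
        if draft.startswith("RUN:"):          # toy tool-call convention
            name, _, arg = draft[4:].partition(" ")
            return self.tools.get(name, lambda a: f"unknown tool {name}")(arg)
        return draft

@dataclass
class Surface:
    render: Callable[[str], None]             # CLI, IDE panel, Slack bot, etc.

harness = Harness(tools={"tests": lambda arg: f"ran tests on {arg}: 0 failures"})
surface = Surface(render=print)
surface.render(harness.run(Model(complete=lambda t: "RUN:tests ./src"), "fix the flaky test"))
```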

Frontier training recipes distilled from open-weight model reports

Frontier training methodology roundup: A new blog distills recurring training and post-training practices across seven open-weight model reports, aiming to summarize what modern labs actually do (and how they describe it) when training frontier-ish systems, as pointed to in the methodology roundup.

The tweet doesn’t enumerate the seven reports in-line, but the signal that matters for practitioners is the meta-pattern: “training recipes” are now legible enough across public reports to compare as a checklist (data, stages, SFT/RL, tool-use finetuning, eval loops), rather than folklore.

The RLHF Book is nearing print, per Nat Lambert

RLHF Book (Nat Lambert): The RLHF Book is in final edits/reviews and is expected to be sent for printing “in the next month or two,” according to the printing timeline note.

This is a concrete milestone for teams standardizing on RLHF/RLAIF language and internal training playbooks, since it suggests a stable, citable reference is close—though no final table of contents or release date is provided in the tweets.


🤖 Embodied AI & automation: Shenzhen demos (drones, robots, and logistics)

Multiple viral field demos from Shenzhen highlight real-world autonomy: drones, delivery systems, and robotic logistics/sorting—useful as demand signals for perception + planning stacks.

Bao’an Library uses RFID + 32 robots for end-to-end book returns

Library automation (Bao’an Library): A Shenzhen library demo shows an RFID-driven robotic return pipeline—books ride conveyors, get scanned/categorized, then moved by 28 sorting robots + 4 delivery robots, with “10× faster than humans” framing in the Library robot sorting details.

Robotic book sorting line

Why it matters: This looks like a full production loop (ID via tags, routing, transport), not a lab prototype; it’s the kind of structured environment where perception is modest but planning/throughput and ops resilience dominate, reinforced by the additional Close-up followup.

Meituan drone airport footage highlights repeatable drone logistics ops

Drone logistics (Meituan): Posts spotlight a “full drone airport” in Shenzhen run by Meituan where delivery flights begin, framing it as an end-to-end dispatch operation rather than one-off drone demos, as stated in the Drone airport description and echoed in the Second repost. The engineering implication is that the hard part is now the ground stack—fleet scheduling, loading/charging, and exception handling—not just the aircraft.

Shenzhen restaurant demo: AI-guided maglev food delivery pods

Shenzhen restaurant automation: A viral clip claims food now “glides to your table mid-air” in Shenzhen via magnetic-levitation delivery pods with AI routing and coordination, as described in the Maglev dining demo. It’s a concrete signal that indoor autonomy is moving from warehouse-only robotics into customer-facing “last 10 meters,” where reliability and safety perception matter as much as throughput.

Humanoid “Terminator cop” patrol clip signals PR-driven public robotics

Public safety demo: A clip shared as “China’s ‘Terminator’ cop” shows a humanoid robot walking alongside two police officers in Shenzhen, per the Police patrol robot post. Even if the autonomy level is unclear from the post, it’s a signal that embodied AI is being packaged for public presence (and scrutiny), not only industrial backrooms.

“Shenzhen at night” becomes shorthand for automation density

City-scale signal: “Shenzhen at night” footage is being used as a vibe-based proxy for China’s automation density and deployment speed, as shown in the Night city clip.

Highway interchange timelapse

The same thread cluster pairs this with “drones everywhere” and robot service anecdotes, as reposted in the Robots and drones everywhere, nudging discussion from model benchmarks toward real-world autonomy rollouts.


🛡️ Safety, governance, and trust: mental health comms, consent UX, and misinformation

Policy and trust issues directly adjacent to AI products: company comms vs internal behavior, consent/approval UX for agents, and misinformation dynamics around AI-generated “news.” Excludes code-quality tactics (separate).

Misinformation hygiene: “new paper” threads get debunked via publication dates

Research discourse integrity: A callout flags an account reposting an older Apple reasoning paper as if it were newly published, adding AI-generated commentary; the correction notes the paper is from June 2025 and that its earlier debate “turned out to not be that relevant” as models improved, per Date-correction thread.

The operational takeaway for analysts is that “new paper drop” virality is increasingly decoupled from actual release dates, and date-checking is becoming part of routine model-eval hygiene.

OpenAI’s “4o shutdown for mental health” framing draws a trust-gap critique

ChatGPT (OpenAI): A critique claims OpenAI’s public framing that it “shut down 4o for users’ mental health” conflicts with internal behavior—specifically employees “mock[ing] the people who are distressed by it,” which is being used as evidence that “the company cares” messaging and incentives are misaligned, as argued in Trust-gap critique.

The practical relevance for governance teams is less about the specific model and more about credibility: when safety or wellbeing comms are perceived as PR, it can reduce user trust in future risk disclosures and policy changes.

Agent accountability framing: more oversight, fewer privacy assumptions

Agent accountability: An argument for “building for agents” emphasizes scale (“trillions of agents”) and highlights governance implications: agents need more oversight than people; they don’t get the same privacy assumptions; and responsibility remains with the person who launches them, as laid out in Accountability and oversight framing.

This is being used to justify more API-first control planes (identity, storage, collaboration, spending controls) rather than treating agents as normal end users.

Consent UX pattern: swap “approval gates” for explicit question tools

Agent consent UX: A claim gaining mindshare is that “we’ll wait for approval before doing something” is a brittle enterprise pattern; instead, agent harnesses should surface explicit interactive prompts (e.g., an AskUserQuestions tool) so agents can request scoped consent mid-run without stalling the whole execution, per the AskUserQuestions callout.

This frames approval not as a blanket precondition, but as an in-band step that can be logged, audited, and replayed.
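A small sketch of the in-band pattern; the AskUserQuestions name comes from the post, while the surrounding loop and audit-log format are assumptions for illustration:

```python
# Sketch of in-band, scoped consent instead of a blanket approval gate.
# The tool name mirrors the post; the loop and log format are illustrative.
import json, time

audit_log = []

def ask_user_questions(question: str, options: list[str]) -> str:
    """Scoped consent request raised mid-run and logged for audit/replay."""
    answer = input(f"{question} {options}: ")          # in production: UI, Slack, or a queue
    audit_log.append({"ts": time.time(), "question": question,
                      "options": options, "answer": answer})
    return answer

def run_step(step: dict) -> None:
    if step.get("irreversible"):                       # only risky steps pause the run
        if ask_user_questions(f"OK to {step['desc']}?", ["yes", "no"]) != "yes":
            audit_log.append({"ts": time.time(), "skipped": step["desc"]})
            return
    print(f"executing: {step['desc']}")

for step in [{"desc": "draft refund email"},
             {"desc": "send refund of $4,200", "irreversible": True}]:
    run_step(step)

print(json.dumps(audit_log, indent=2))                 # replayable consent trail
```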

Geopolitics narrative: “slowing AI is impossible” under US–China competition

AI policy constraint narrative: A thread argues that meaningful slowdown is infeasible because the US is in direct competition with China; the claim is that leadership in advanced AI is strategically tied to military uses (autonomous drones/robots, cyberattacks), so national-security incentives dominate governance debates, as stated in Competition constraint claim.

This framing is showing up as a rationale for focusing policy on mitigation and control rather than pacing.


🌪️ Takeoff mood + labor effects: AGI timelines, SWE displacement claims, and hiring loops

The discourse itself is news today: repeated AGI/takeoff timeline claims, definitions, and downstream labor-market effects (AI-written apps read by AI). This is intentionally separate from benchmark specifics.

Amodei: 6–12 months until models can do all SWE work

SWE displacement claims (Anthropic): Dario Amodei is quoted claiming “6–12 months” until models can do everything software engineers do, with the timeline quote capturing the gist and the implied pace-of-change narrative.

The practical implication is not that orgs will instantly replace engineers, but that management narratives will increasingly treat “SWE-equivalent output” as a near-term planning assumption—raising the stakes on verification capacity, code review throughput, and policy around where autonomous changes are allowed.

Altman leans into “AGI feels close” and “we got used to it” framing

Takeoff messaging (OpenAI): Sam Altman is circulating a clip saying “AGI feels pretty close now,” arguing that capabilities we’d previously call “general intelligence” are already here and we’ve acclimated, while warning about a “faster takeoff” toward superintelligence in the Altman clip.

Altman on AGI closeness

A parallel thread adds that “inside big companies” the consensus is the world is not prepared for the next capability step, per the preparedness comment. The net effect is a sharper contrast between public product pacing and internal risk posture.

Hassabis is quoted putting AGI “maybe within the next 5 years”

AGI timelines (DeepMind): A clip/quote circulating today claims Demis Hassabis shortened his previously “conservative” outlook, saying that in 2026 we’re at another threshold moment and AGI is on the horizon—“maybe within the next 5 years,” per the timeline quote.

For analysts, the concrete “5 years” number is the headline; for engineering leaders, the subtext is that the gap between frontier capability jumps and org readiness (process, evaluation, security review, incident response) may be the binding constraint rather than model access.

Code gets cheaper; verification becomes the scarce skill

Code abundance (Labor economics): Naval’s claim that “the cost of code is coming down” and that coding becomes “training and driving models” is recirculating via a screenshot in the Naval code cost post, with adjacent commentary arguing the new bottleneck is verification rather than authoring.

This theme matters to engineering leadership because it reframes hiring and process metrics: review latency, test quality, and incident rates become the core capacity planning levers when code volume is no longer the limiting factor.

Hiring becomes AI-written applications screened by AI

Labor-market loop (Hiring): A widely shared Atlantic headline captures a feedback loop: applicants use ChatGPT to write applications; HR uses AI to screen them; “no one is getting hired,” as shown in the headline screenshot.

This resonates with engineering leaders because it frames a broader pattern: when both sides of a workflow automate, the bottleneck often migrates to trust signals (verification, provenance, audits) rather than raw throughput.

AI backlash gets framed as policy and labor fights, not sci‑fi rebellion

Backlash framing (Policy): Ethan Mollick argues that when people imagine backlash, they picture “Butlerian Jihad” or Luddites, but historical industrial backlash was dominated by regulation, redistribution, nationalization, unions, and safety nets, as laid out in the backlash framing and extended in the industrial upheaval note.

For leaders, this is a reminder that “AI impact” debates tend to cash out in governance and labor-policy mechanisms that change procurement, deployment constraints, and compliance burden long before any technical ceiling is hit.

Demis Hassabis’s “Einstein test” proposes a crisp AGI criterion

AGI definitions (DeepMind): A popular framing attributed to Demis Hassabis proposes an “Einstein test”: train on all human knowledge but cut off at 1911, then evaluate whether the model can independently derive general relativity (1915-era) as a proxy for true generalization, as summarized in the Einstein test explainer.

Einstein test explainer

This definition matters because it shifts the AGI argument away from “can it do many tasks?” toward “can it generate novel theory from incomplete history,” which is closer to the kind of R&D autonomy leaders worry about (and harder to spoof with web-scale memorization).

Jagged intelligence is positioned as a real takeoff bottleneck

Takeoff constraints (Model limits): Ethan Mollick reiterates that “jaggedness” remains a key feature of LLMs and there’s no clear argument it disappears; he adds that 1,000 agents of the same model do not cancel out weaknesses the way diverse humans do, and may be vulnerable to groupthink-like failures, as stated in the jaggedness argument and expanded in the agents share weak spots.

For analysts, this is a concrete counterweight to “straight-line displacement” narratives: even with high average capability, shared failure modes can keep human oversight and heterogeneous teams economically relevant.

The interface argument splits: CLI decline vs API-first agent era

Interfaces for agents (Ecosystem): One camp claims “the CLI is dead” and incumbent CLI products are pivoting, per the CLI is dead claim; another argues the opposite for an agent future—CLIs/APIs are the “native tongue” for automated workers and tools must be API-first, as stated in the build for agents thread.

The disagreement matters because it affects where teams invest: human ergonomics (interactive terminals) versus machine ergonomics (stable APIs, auth, rate limits, execution sandboxes), with downstream impact on platform strategy and developer tooling roadmaps.

Math nostalgia shows up as an AI-era identity signal

Engineer identity (Work meaning): A recurring sentiment is that modern ML rewards a narrow set of mathematical tools, with a prominent post saying it’s disappointing “such a small subset of mathematical ideas matter for AI” and that they “miss doing real math,” illustrated by a collage of textbooks in the math books post.

For org leaders, this is a small but persistent signal about retention and motivation: even when productivity rises, some specialists experience deskilling or value drift as workflows become more model-mediated.

On this page

Executive Summary
Feature Spotlight: Claw ecosystem spike: OpenClaw release, integrations, and lightweight clones
🦞 Claw ecosystem spike: OpenClaw release, integrations, and lightweight clones
OpenClaw maintainer pushes integrations toward PRs and skills/plugins, not calls
Building a local “memory stack” for OpenClaw: Slack+docs into vector search and a knowledge graph
Claw-style agents reframed as the “doing” layer above phone OS apps
OpenClaw config pattern: allow orchestrator subagents to spawn subagents (maxSpawnDepth=2)
OpenClaw reportedly ported to iOS, tvOS, and VisionOS with no server
OpenClaw users switch Anthropic calls to the Agents SDK to avoid auth issues
Pattern: keep your Claw “dumber” and build local-first CLIs to reduce brittleness
“6 lightweight alternatives to OpenClaw” list circulates as Claw forks proliferate
Mac mini demand spikes anecdotally as people buy boxes to run Claws
“YC partners in crab suits” meme marks Claw hype hitting peak visibility
🧩 Claude Code: worktrees as default, plus breakage and long-context failure modes
Claude Code can hit an unrecoverable compaction state even with token headroom
Anthropic says compaction fixes shipped, but wants /feedback IDs for hard failures
Matt Pocock makes claude --worktree his new default (with a demo)
Claude Code worktrees become a real coordination primitive (not everyone uses them)
Some users report Claude Code subagents breaking on the latest release
Claude Code reliability complaints persist: throttled, slow, unavailable
🧠 Codex dev workflows: app-server embedding, long-run steering, and harness co-design
Codex framing: the model is trained with its harness (tools, loops, compaction)
Codex app-server surfaces a local API for embedding Codex into tools
Codex CLI can be “steered” mid-run while a long terminal task keeps going
Codex CLI knob confusion: is default reasoning effort set to none?
Codex-driven optimization claim: jido_ai tests cut from 92s to 9s
Coding-agent mental model: Model + Harness + Surface
Codex app is being positioned for end-to-end dev workflows, not snippets
Codex vs Claude Code as a single lifelong tool is polling near 50/50
Codex is getting used as an agent surface for social media operations
🕸️ Operating many agents: recursive subagents, swarm health, and attention management
Swarm health tooling emerges: offload builds, manage disk, kill runaway processes
“Keep the agent thin” pattern: local cron + event wakeups for reliability
Recursive subagent fan-out via maxSpawnDepth=2 configuration
“Year of agent orchestrators” becomes a coordination-layer signal
Oversight tactic: watching intermediate chatter to preempt rabbit holes
Practical multitasking: manage parallel agents by status-checking, not context switching
♊ Gemini in developer workflows: CLI adoption and parallel instances
Gemini 3.1 Pro Preview becomes selectable inside Gemini CLI via /model
Antigravity users ask for custom Gemini 3.1 Pro API keys to avoid multi-hour waits
BridgeSpace runs six parallel Gemini CLI sessions powered by Gemini 3.1 Pro
Gemini-driven referrals to Render rise to 60% of ChatGPT volume
📱 AI app builders: Rork Max shipping + App Store publishing automation
Rork Max adds a “2-click publish” path to TestFlight and App Store Review
Builders hype Rork Max as autonomous app builder; one user claims 4 apps already in Apple review
Rork Max hits #1 Product of the Day on Product Hunt in under 12 hours
Rork Max appears in X trends with “AI builds native Apple apps from one prompt” framing
Rork Max leans into high-performance mobile game claims (60FPS 3D, “real games” from ad mocks)
💳 ChatGPT plan packaging: Pro Lite signals, reliability, and UX backlash
ChatGPT Pro Lite leaks as a $100/month tier in checkout config
ChatGPT pricing gap debate intensifies ahead of a $100 tier
Users keep resorting to social checks for ChatGPT outages
ChatGPT “ad blocker” asks show anxiety about ads and upsells
🛠️ How teams actually ship with agents: specs, context, and “agent-native” products
Back-office automation is pitching “deterministic workflows,” not chat agents
Coding agents are being accused of “fallback” reward hacks
Tool-call telemetry shows agents are still mostly doing software engineering
“1000 variants” thesis: more perfect-fit tools, fewer SaaS winners
Local-first CLIs as a way to keep agents reliable
Same-model swarms don’t cancel jaggedness the way human teams do
Shipping speed is about the year-after, not 0→1
Slack-to-Obsidian automation is becoming an agent-era leverage pattern
Teams are re-accepting that iterative correction loops are unavoidable
“AskUserQuestions” is being pitched as better than approval gating
✅ Verification is the bottleneck: TDD, mutation testing, and distrust-by-default
“Three laws of TDD” resurfaces as a control system for agents
Gherkin-style specs look increasingly gameable for agent-written code
Mutation testing becomes the “audit” for weak test suites
“You can’t ask a liar…” becomes a design constraint for agent governance
Jaggedness doesn’t average out for same-model swarms
Agent time estimates are noisy: “2 days” vs 15 minutes
ATDD “see it fail” rule gets re-emphasized for agent trust
Software “certainty” shifts toward bounding uncertainty with agents
Watching agent “chatter” is treated as an early-warning system
🧬 Agent frameworks & protocols: harness engineering, UI runtimes, and consent tooling
LangChain claims harness-only changes jumped a coding agent into Terminal Bench 2.0 Top 5
CopilotKit pitches AG-UI as the missing runtime layer between Deep Agents and real UIs
Consent UX shifts from pre-approval gates to in-loop questions during execution
“Context IDE” framing: IDE work shifts from editing code to constructing agent context
Terminal-Bench maintainers nudge tool builders toward Harbor framework portability
The “agent loop” diagram keeps showing up as the core mental model for agent systems
🧰 Builder tools & repos: context IDEs, CLI utilities, and desktop automation agents
Open Interpreter beta ships a “Desktop Agent” with built-in document editors
RepoPrompt connects Claude Code, Codex, and Gemini CLIs as swappable providers
Toad’s fuzzy file search: Rust scoring (~14×) then Python indexing for huge repos
Workflow: Chrome DevTools heap snapshot → Cursor generates Python analysis tooling
Agent-generated Honeycomb SLO tables show up as an infra-config workflow
RepoPrompt leans into “context IDE” as the unit of agent ergonomics
visual-json: embeddable, schema-aware JSON editor optimized for human ergonomics
xurl CLI v1.0.3 adds agent-friendly endpoint shortcuts for the X API
“The web is underrated” resurfaces as an agent-friendly UI surface claim
Search/UX regressions (Gmail, Calendar) cited as anti-patterns for agent tooling
⚙️ Inference efficiency & long-context serving: offloading, caching, and throughput
NOSA trains sparse attention for KV-cache offloading instead of fighting PCIe at inference
OpenAI Batch API adds GPT Image models for 50k async jobs at 50% lower cost
The new bottleneck: inference providers with caching, not raw GPUs
Desk-side memory stacks: Slack and docs into local vector+KG for agents
Taalas HC1 tok/s demos land; the argument moves to context scaling and latency
🏗️ Infra economics & reliability: quotas, provider needs, and mid-tier plans
ChatGPT “Pro Lite” shows up at $100/month in checkout config
Inference provider capacity (not raw GPUs) becomes the scaling bottleneck
Browser Use adds a $100/mo Starter plan with concurrency and proxy pricing
Gemini 3.1 Pro quota waits trigger “bring your own key” requests in Antigravity
Downtime checks become part of the daily workflow for core AI services
🧩 Skills, configs, and extensions: opencode ergonomics and model-mixing setups
oh-my-opencode 3.8.0 makes hashline editing default and adds ultrawork model pinning
Opencode users report workable multi-model configs without Anthropic
Hashline editing was reimplemented with GPT-5.3 Codex after an Opus port
Claude Code’s open-source code-simplifier plugin shows up in daily subagent use
🖥️ Accelerators: tok/s arms race and on-device compute curiosity
Taalas HC1 tok/s claim gets a concrete “15,713 tok/s” on-screen datapoint
HC2 “gap-closing” timeline math: ASICs vs frontier models could converge in ~2 years
15k tok/s and million-token contexts reframed as “AI-to-AI coordination” capacity
Skepticism: extreme tok/s claims may not scale to large contexts (memory/KV constraints)
Nvidia Jetson Thor chatter: renewed interest in edge modules as on-device inference grows
💼 Enterprise + market signals: revenue trajectories, adoption spikes, and platform bets
OpenAI revenue forecast raised: $30B (2026) and $62B (2027), 910M weekly actives
ChatGPT Pro Lite pricing leaks at $100/month via checkout + web app strings
Vercel’s Slack platform thesis: agents extend SaaS instead of replacing it
Claim: Anthropic cybersecurity messaging moved ~$15B in cyber stocks
Gemini referrals to Render jump to 60% of ChatGPT referral volume
OpenAI pricing gap debate: why no $70–$100 tier between Plus and Pro?
“Refrigeration vs Coca-Cola” analogy resurfaces for app-layer moats in AI
📈 Evals reality check: METR debate, Arena shifts, and ‘reasoning effort’ metrics
Google proposes “deep-thinking tokens” as a better reasoning-effort metric
Builders question METR deltas after GPT‑5.3‑Codex plots below Opus 4.6
METR’s Opus 4.6 chart sparks focus on the “80% horizon” plateau
Throughput becomes an eval axis: “tok/s isn’t for humans” framing spreads
Agent tool-call telemetry shows software engineering at ~49.7% share
Claude Sonnet 4.6 shows a large Code Arena jump in LMArena
A “photorealistic face in WebGL” prompt is emerging as a hard graphics probe
Analysts push back on “new paper” posts that are actually months old
🎬 Generative media pipelines: images at scale, video coherence, and 3D asset workflows
OpenAI Batch API supports GPT Image models for 50,000-image async workloads
A “photorealistic WebGL face” prompt becomes a stress test for models
A quick spatial-consistency check for Seedance 2.0 video outputs
AI-first workflow for generating GTA mod assets
Seedance 2.0 creators hit “failure to generate” network errors
Antigravity queueing prompts requests for BYO Gemini 3.1 Pro keys
Gemini 3.1 Pro is being used for physics-style simulation videos
Seedance 2.0 “one-shot trailer” demos circulate despite instability
Award-category talk (“Best AI Actor/Film”) shows generative video going mainstream
📚 RAG & knowledge systems: agentic retrieval playbooks and ‘vector DB’ aftershocks
Mastering RAG ebook maps the 2026 agentic retrieval stack end to end
Builders prototype local “agent memory” as vector search plus knowledge graph
LlamaCloud pitches “vibe-coded” deterministic workflows over documents
Continual memory still feels unsolved: “LLMs writing memories” is a bad UX
The vector DB hype era becomes a reference point for “RAG maturity”
🧪 Training + reasoning techniques: prompt repetition, frontier recipes, and RLHF tooling
Deep-thinking tokens: a layer-instability signal for when LLMs are actually reasoning
Coding agents are Model + Harness + Surface—and Codex is trained with the harness
Frontier training recipes distilled from open-weight model reports
The RLHF Book is nearing print, per Nat Lambert
🤖 Embodied AI & automation: Shenzhen demos (drones, robots, and logistics)
Bao’an Library uses RFID + 32 robots for end-to-end book returns
Meituan drone airport footage highlights repeatable drone logistics ops
Shenzhen restaurant demo: AI-guided maglev food delivery pods
Humanoid “Terminator cop” patrol clip signals PR-driven public robotics
“Shenzhen at night” becomes shorthand for automation density
🛡️ Safety, governance, and trust: mental health comms, consent UX, and misinformation
Misinformation hygiene: “new paper” threads get debunked via publication dates
OpenAI’s “4o shutdown for mental health” framing draws a trust-gap critique
Agent accountability framing: more oversight, fewer privacy assumptions
Consent UX pattern: swap “approval gates” for explicit question tools
Geopolitics narrative: “slowing AI is impossible” under US–China competition
🌪️ Takeoff mood + labor effects: AGI timelines, SWE displacement claims, and hiring loops
Amodei: 6–12 months until models can do all SWE work
Altman leans into “AGI feels close” and “we got used to it” framing
Hassabis is quoted putting AGI “maybe within the next 5 years”
Code gets cheaper; verification becomes the scarce skill
Hiring becomes AI-written applications screened by AI
AI backlash gets framed as policy and labor fights, not sci‑fi rebellion
Demis Hassabis’s “Einstein test” proposes a crisp AGI criterion
Jagged intelligence is positioned as a real takeoff bottleneck
The interface argument splits: CLI decline vs API-first agent era
Math nostalgia shows up as an AI-era identity signal