Claude Code Security finds 500+ OSS vulnerabilities – preview adds verify-then-patch flow
Executive Summary
Anthropic rolled out a limited research preview of Claude Code Security, framing it as whole-codebase vulnerability scanning that outputs concrete patch diffs for human review. The UI centers on a findings dashboard with severity/confidence scores plus a “suggest fix” flow, and Anthropic bakes in an internal verification pass in which Claude tries to prove or disprove its own vulnerability hypotheses before surfacing them, aiming to cut false positives. A Claude Code Desktop team member claims Opus 4.6 has already found 500+ vulnerabilities across open-source projects, with patch contributions underway, but the count is tweet-level and not independently audited. Rollout is staged, starting with Team/Enterprise.
• Misuse monitoring: contributors say “safeguard probes” were added to detect potential cyber-misuse patterns during operation.
• Claude Code CLI 2.1.50: expands worktree isolation into the Task tool and agent configs; adds CLAUDE_CODE_DISABLE_1M_CONTEXT; ships multiple long-session memory leak fixes.
• OpenAI Codex serving: GPT-5.3-Codex-Spark is claimed to be ~30% faster, now serving at 1200+ tokens/sec, a push that treats latency as the real agent-loop bottleneck.
Net signal: security is being productized as an agent workflow (find → verify → patch), while runtime isolation and throughput tuning become the safety/perf substrate underneath.
Feature Spotlight
Claude Code Security: AI vulnerability scanning + patch suggestions (research preview)
Anthropic’s Claude Code Security adds codebase-wide vuln scanning + suggested patches with verification and human approval—pushing “agentic coding” toward safer default workflows for teams.
🛡️ Claude Code Security: AI vulnerability scanning + patch suggestions (research preview)
High-volume rollout of Anthropic’s Claude Code Security research preview: whole-codebase vuln detection with suggested patches for human review, plus verification steps to reduce false positives. This is the main new security shipping item today.
Claude Code Security enters research preview for codebase scanning and patch suggestions
Claude Code Security (Anthropic): Anthropic announced a limited research preview that scans whole codebases for vulnerabilities and proposes targeted patches meant for human review, positioning it as a complement to traditional static tooling that misses context-dependent issues, as described in the launch announcement.

The immediate engineering impact is a new “agent in the loop” security workflow: instead of emitting generic findings, the system is framed around concrete diffs you can review and land (or reject) in your normal PR process, per the launch announcement.
Builders report Opus 4.6 surfaced 500+ OSS vulnerabilities and patches are underway
Opus 4.6 security findings (Anthropic): A Claude Code Desktop team member says Opus 4.6 found 500+ vulnerabilities across open-source code and that reporting and patch contributions have started, as stated in the OSS findings thread.
Concrete examples cited include a Ghostscript issue and an overflow bug in CGIF, with short excerpts referenced in the Ghostscript example and CGIF overflow example.
The claim ties back to the broader Claude Code Security rollout narrative—security as a first-class agent workflow rather than an afterthought—per the OSS findings thread.
Claude Code Security emphasizes a dashboard workflow over auto-fixing
Findings UX (Claude Code Security): The product presentation centers on a findings dashboard with severity/confidence and suggested fixes, keeping control with developers rather than auto-applying changes, as shown in the feature walkthrough.

A separate share image captures the “defenders” positioning and the Feb 20 timing, as shown in the announcement slide.
Claude Code Security uses a self-verification step to reduce vuln-scan noise
Verification methodology (Claude Code Security): To control hallucinations and reduce noisy findings, the tool flow includes an internal verification step where Claude attempts to prove or disprove its own vulnerability hypotheses before surfacing them, according to the method overview.
This matters operationally because it’s an explicit “find → verify → propose patch” pipeline rather than a single-pass scan, and the output is still routed to human approval, as explained in the method overview.
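In code terms, the described flow looks roughly like the sketch below. Every function name here is a hypothetical stand-in rather than Anthropic’s API; the point is the control flow, where a finding is discarded unless a second pass can substantiate it, and patches land as reviewable diffs rather than being applied.

```python
# Sketch of a find -> verify -> propose-patch control flow. Function bodies are
# stubs and the names are hypothetical, not Anthropic's API. A finding only
# reaches human review after a verification pass tries to falsify it, and
# patches are emitted as diffs, never auto-applied.

from dataclasses import dataclass

@dataclass
class Finding:
    file: str
    description: str
    severity: str      # e.g. "high"
    confidence: float  # e.g. 0.8

def scan_codebase(repo_path: str) -> list[Finding]:
    # Stub: a recall-oriented first pass that generates vulnerability hypotheses.
    return []

def verify(finding: Finding, repo_path: str) -> bool:
    # Stub: a second pass that tries to prove or disprove the hypothesis
    # (trace the data flow, construct a concrete trigger, etc.).
    return False

def propose_patch(finding: Finding, repo_path: str) -> str:
    # Stub: emit a reviewable unified diff.
    return ""

def run_pipeline(repo_path: str) -> list[tuple[Finding, str]]:
    for_review = []
    for finding in scan_codebase(repo_path):
        if not verify(finding, repo_path):
            continue  # dropped before it ever reaches the findings dashboard
        for_review.append((finding, propose_patch(finding, repo_path)))
    return for_review  # humans land or reject these via the normal PR process
```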
Claude Code Security rollout mentions safeguard probes for cyber misuse detection
Misuse monitoring (Claude Code Security): Alongside the vulnerability-finding narrative, a Claude Code contributor says they “introduced safeguard probes” intended to detect when Claude is used for cyber misuse, as noted in the safeguard probes note.
That frames the product as not only a developer tool but also an instrumented security surface where Anthropic can watch for misuse patterns, per the safeguard probes note.
Claude-based scanning is being used to spot vulnerabilities introduced by patches
Patch-diff review (Claude security workflow): One highlighted technique is using Claude to inspect git diffs and reason about whether a patch introduced a vulnerability—an approach that targets “regressions hidden inside fixes,” as described in the Ghostscript example.
This is a different posture than classic pattern-matching scanners: it treats code review artifacts (diffs) as primary evidence and looks for second-order effects across components, per the Ghostscript example.
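A rough sketch of what that diff-as-evidence posture looks like in practice, using git plus the headless claude -p invocation covered in the CLI notes later in this issue; the prompt wording and review criteria are illustrative, not the product’s.

```python
# Sketch: treat a git diff as the primary evidence and ask a model whether the
# patch introduced a vulnerability. Uses `git diff` and a headless `claude -p`
# call; the prompt text below is illustrative.

import subprocess

def review_range(repo: str, base: str, head: str) -> str:
    diff = subprocess.run(
        ["git", "-C", repo, "diff", f"{base}..{head}"],
        capture_output=True, text=True, check=True,
    ).stdout

    prompt = (
        "You are reviewing a patch for security regressions. "
        "Does this diff introduce a vulnerability (memory safety, injection, "
        "authz, integer overflow)? Point at the specific hunk and explain, "
        "or answer 'no issue found'.\n\n" + diff
    )
    # Headless, single-shot invocation; output is a plain-text review.
    return subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    print(review_range(".", "HEAD~1", "HEAD"))
```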
Claude Code Security rollout starts staged for Team/Enterprise with cautious dogfooding notes
Rollout status (Claude Code Security): Anthropic is rolling out the feature slowly as a research preview starting with Team and Enterprise customers, with internal commentary describing findings as “impressive (and scary),” per the rollout note.
The same thread frames it as a staged enablement rather than a broad default-on feature, reinforcing that adoption will likely be gated by trust and review ergonomics, as said in the rollout note.
🖥️ Claude Code Desktop dev loop: embedded previews, CI babysitting, and UI iteration
Today’s feed has lots of practitioner chatter and demos around Claude Code Desktop’s tighter edit→run→preview loop (embedded app previews, log reading, background CI/PR handling). Excludes Claude Code Security (covered as the feature).
Claude Code Desktop can babysit CI and PRs in the background
Claude Code Desktop (Anthropic): Desktop is being positioned to handle CI failures and PR workflows while you do other work—monitoring checks, reacting to failures, and keeping the loop going, as described in the what’s new thread and demoed in the CI + preview demo. Less babysitting. Still reviewable.

• Why it changes the loop: earlier “agentic coding” often required bouncing between terminal logs, the browser, and GitHub; Desktop’s pitch is collapsing those steps into one place, per the CI + preview demo.
• Vibe-coding impact: the same update is called out as reducing back-and-forth follow-ups during UI iteration in the what’s new thread.
Claude Code Desktop can run your dev server and preview the app inside the UI
Claude Code Desktop (Anthropic): Desktop now supports server previews—it can start a dev server and render the running app inside the desktop interface while it edits code, as shown in the release thread and reiterated in the server preview note. This keeps runtime feedback (and the UI output) in the same place as code edits. Short loop.

• Runtime-aware iteration: Claude reads console output, catches errors, and continues iterating without needing a separate browser loop, per the release thread.
• Adoption signal: the Desktop team frames it as a “massive ship” after internal dogfooding in the dogfooding note.
Claude Code Desktop adds /desktop session handoff and a pre-push code review step
Claude Code Desktop (Anthropic): Session mobility is being highlighted via a /desktop flow to move a CLI session into the Desktop UI, alongside a built-in “review code” step that leaves inline comments on diffs before pushing, according to the feature rundown. It’s a workflow change. It’s also a governance change.

• PR monitoring loop: the same rundown says Desktop can watch CI on open PRs and act when checks fail or pass, as described in the feature rundown.
Claude Code Desktop flag skips all permission prompts
Claude Code Desktop (Anthropic): A new-ish workflow shortcut is circulating: --dangerously-skip-permissions, which skips permission prompts so the agent can operate without interactive approvals, as noted in the flag callout. This is friction removal. It also expands blast radius.
Preview-driven frontend iteration: “spin up app, screenshot, iterate”
Frontend workflow (Claude Code Desktop): Builders are describing a repeatable pattern for UI work: run the app preview inside Desktop, have Claude take screenshots, then iterate on layout/styling until it matches intent—summarized in the frontend workflow claim and enabled by the preview feature in the Desktop update. Tight feedback. Fewer context switches.

Claude Code Desktop adds Windows ARM64 support
Claude Code Desktop (Anthropic): Reports say the Desktop app now supports Windows ARM64, per the Windows ARM64 note. This is mostly a deployment/IT unlock for teams standardizing on ARM laptops. Simple change. Practical impact.
🧰 Claude Code CLI 2.1.50: worktree isolation expands + long-session fixes
Incremental but concrete Claude Code CLI changes land (2.1.50), with more worktree isolation plumbing and a long list of memory/stability fixes. Excludes Desktop Preview specifics (covered separately) and Claude Code Security (feature).
Agent definitions can declare isolation: worktree in Claude Code 2.1.50
Claude Code CLI 2.1.50 (Anthropic): Custom agent definitions can now declaratively opt into worktree sandboxing via isolation: worktree, so specific subagents always run isolated by default, as called out in the CLI changelog excerpt. This is the config-level counterpart to the new Task tool flag, and it pushes “safe parallelism” into the agent catalog rather than relying on per-run operator discipline.
Claude Code 2.1.50 adds WorktreeCreate/WorktreeRemove hook events
Claude Code CLI 2.1.50 (Anthropic): Worktree isolation now comes with lifecycle hooks—WorktreeCreate and WorktreeRemove—to run custom VCS setup/teardown when isolation spins up or deletes worktrees, per the CLI changelog excerpt. That’s a concrete integration point for teams that need to rehydrate dev env state (deps, secrets mounting, pre-commit tooling) when parallel worktrees appear and disappear.
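The changelog excerpt doesn’t show the hook payload, but existing Claude Code hooks receive a JSON payload on stdin, so a WorktreeCreate setup hook could plausibly look like the sketch below (wired up as the hook’s command); the worktree_path field name is a guess, and the rehydration commands are examples.

```python
#!/usr/bin/env python3
# Sketch of a WorktreeCreate hook command that rehydrates dev-env state in a
# freshly created worktree. Existing Claude Code hooks pass JSON on stdin; the
# "worktree_path" field name below is a guess, not documented in the excerpt.

import json
import subprocess
import sys

def main() -> None:
    payload = json.load(sys.stdin)
    worktree = payload.get("worktree_path")  # hypothetical field name
    if not worktree:
        return

    # Re-install deps and pre-commit tooling so the isolated worktree behaves
    # like the main checkout. Swap in whatever your repo actually needs.
    subprocess.run(["npm", "ci"], cwd=worktree, check=False)
    subprocess.run(["pre-commit", "install"], cwd=worktree, check=False)

if __name__ == "__main__":
    main()
```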
Claude Code 2.1.50 drops remote push fields from ExitPlanMode
Claude Code CLI 2.1.50 (Anthropic): Plan-mode’s ExitPlanMode schema no longer includes pushToRemote or remote session identifiers/URLs, which effectively removes that remote “plan pushing” pathway from the tool surface, as summarized in the system prompt diff note and detailed further in the ExitPlanMode removal note. The same release note bundle also flags “2 system prompt changes” overall, alongside other prompt/string diffs, in the 2.1.50 highlights.
Claude Code 2.1.50 adds a claude agents listing command
Claude Code CLI 2.1.50 (Anthropic): The CLI surface adds a new claude agents command to list configured agents, as noted in the CLI changelog excerpt and reflected in the surface changes list. This is a small but practical ergonomics win for teams accumulating custom agent catalogs and isolation defaults.
Claude Code 2.1.50 adds an env var to disable 1M context support
Claude Code CLI 2.1.50 (Anthropic): The CLI adds CLAUDE_CODE_DISABLE_1M_CONTEXT to toggle off 1M-context behavior, as shown in the surface changes list and called out in the CLI changelog excerpt. The same changelog also notes that “Opus 4.6 (fast mode) now includes the full 1M context window,” per the CLI changelog excerpt, which makes this toggle relevant for cost/perf control and reproducibility across environments.
Claude Code 2.1.50 expands CLAUDE_CODE_SIMPLE to disable more subsystems
Claude Code CLI 2.1.50 (Anthropic): CLAUDE_CODE_SIMPLE is extended to more aggressively strip down runtime surface—first to remove skills/session memory/custom agents/token counting, and then further to disable MCP tools, attachments, hooks, and file loading for a “fully minimal experience,” per the CLI changelog excerpt. This creates a clearer “minimal harness” mode for debugging agent behavior and isolating whether issues come from integrations vs core model/tooling.
Claude Code 2.1.50 fixes native modules on older glibc systems
Claude Code CLI 2.1.50 (Anthropic): Linux environments with older glibc versions (noted as < 2.30, with RHEL 8 as an example) should no longer fail to load native modules, as stated in the CLI changelog excerpt. This is a concrete compatibility fix for enterprise and regulated infra that tends to lag distro baselines.
Claude Code 2.1.50 adds startupTimeout for LSP servers
Claude Code CLI 2.1.50 (Anthropic): Language server startup is now configurable via a startupTimeout setting for LSP servers, as recorded in the CLI changelog excerpt. For users running Claude Code in heavier repos or remote environments, this creates a first-class knob for LSP cold-start variance instead of treating it as a flaky failure mode.
Claude Code 2.1.50 fixes a /mcp reconnect freeze case
Claude Code CLI 2.1.50 (Anthropic): The CLI fixes a case where /mcp reconnect could freeze when passed a server name that doesn’t exist, as listed in the CLI changelog excerpt. It’s a small quality-of-life fix, but it hits a common operator loop when iterating on MCP server configs under load.
Claude Code 2.1.50 improves -p headless startup performance
Claude Code CLI 2.1.50 (Anthropic): Headless mode startup (-p) now defers Yoga WASM and UI component imports to reduce startup overhead, per the CLI changelog excerpt. It’s a targeted change for users treating the CLI as a batch runner and trying to minimize “agent boot time” per invocation.
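For batch-runner setups, the new knobs compose into something like the sketch below: the -p flag and CLAUDE_CODE_DISABLE_1M_CONTEXT are straight from the changelog, while the prompts, the timing harness, and the assumption that the variable takes a truthy "1" are illustrative.

```python
# Sketch of a headless batch runner: pin context behavior with the new
# CLAUDE_CODE_DISABLE_1M_CONTEXT variable (value "1" assumed) and time each
# `claude -p` invocation to see how much "agent boot time" the 2.1.50 startup
# changes actually buy per call.

import os
import subprocess
import time

PROMPTS = [
    "Summarize the open TODOs in this repo.",
    "List any functions missing tests in src/.",
]

env = dict(os.environ, CLAUDE_CODE_DISABLE_1M_CONTEXT="1")  # reproducible context size

for prompt in PROMPTS:
    start = time.monotonic()
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, env=env,
    )
    elapsed = time.monotonic() - start
    print(f"{elapsed:6.1f}s  {prompt[:40]!r}")
    print(result.stdout)
```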
⚡ OpenAI Codex signals: throughput boosts, plan packaging hints, and meetups
OpenAI/Codex chatter today clusters around serving speed, plan packaging leaks, and developer community distribution (meetups). Excludes Anthropic’s Claude Code Security feature story.
GPT-5.3-Codex-Spark gets ~30% faster, now 1200+ tokens/sec
GPT-5.3-Codex-Spark (OpenAI): OpenAI says it made GPT-5.3-Codex-Spark about 30% faster, now serving at 1200+ tokens/sec, per the Speed claim; follow-on posts like the Speed clip reinforce that this is being positioned as a broader push on latency.

This mainly matters for agent loops (lint/test/patch cycles, multi-file edits) where end-to-end wall clock time is dominated by repeated model calls, not single-shot quality.
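Back-of-envelope arithmetic on why throughput compounds in a loop: the 1200 tok/s figure and the ~30% claim are from the post, while the per-step output size and step count below are made up for illustration.

```python
# Rough wall-clock arithmetic for an agent loop. The 1200 tok/s rate and ~30%
# speedup come from the post; steps and tokens_per_step are illustrative.

new_tps = 1200
old_tps = new_tps / 1.3           # implied pre-speedup rate, ~923 tok/s
steps = 12                        # e.g. lint/test/patch iterations in one task
tokens_per_step = 1500            # illustrative output size per model call

old_seconds = steps * tokens_per_step / old_tps
new_seconds = steps * tokens_per_step / new_tps
print(f"before: {old_seconds:.0f}s  after: {new_seconds:.0f}s  "
      f"saved: {old_seconds - new_seconds:.0f}s per task")
```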
Altman says Codex usage in India is up 4× in two weeks
Codex adoption (OpenAI): Sam Altman says India is OpenAI’s fastest-growing market for Codex, claiming 4× growth in weekly users over the past ~2 weeks, as stated in India growth claim.
This is one of the few concrete regional usage metrics shared publicly in this batch of tweets, and it implies real-world demand is ramping fast enough to show up in week-over-week deltas.
ChatGPT web app code hints at a new “Pro Lite” tier
ChatGPT plans (OpenAI): A code spelunking thread claims the ChatGPT web app now references a new “ChatGPT Pro Lite” plan, as spotted in Plan string leak; separate speculation in Mid-tier plan guess frames it as a potential $50–$100/month tier that would sit between Plus and higher-end plans.
No official pricing or entitlements are confirmed in the tweets; treat this as UI-string evidence, not a product announcement.
Automating Codex↔Claude “ping-pong” reviews as a reliability check
Cross-model review loop: Hamel describes automating a workflow where Codex reviews Claude (and vice versa) to surface disagreements like “this is over engineering” and blunt critique, with an automation link referenced in Ping-pong automation.
This is a concrete way teams are trying to reduce single-model blind spots: generate with one model, adversarially review with another, then reconcile before merging.
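A minimal sketch of the ping-pong shape using both CLIs headlessly; the claude -p flag is documented elsewhere in this issue, while the codex exec invocation, the stop condition, and the prompts are assumptions rather than Hamel’s actual automation.

```python
# Sketch of a cross-model "ping-pong" review: one model drafts, the other
# critiques, repeat until the critic has nothing left to flag. The `codex exec`
# call, prompts, and stop condition are assumptions, not the linked automation.

import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def ping_pong(task: str, rounds: int = 3) -> str:
    draft = run(["claude", "-p", f"Implement this and show the full code:\n{task}"])
    for _ in range(rounds):
        critique = run(["codex", "exec",
                        "Review this code bluntly. Flag over-engineering, bugs, "
                        "and missing error handling:\n" + draft])
        if "no issues" in critique.lower():
            break
        draft = run(["claude", "-p",
                     "Revise the code to address this review. Push back in "
                     "comments where you disagree:\n" + critique + "\n\n" + draft])
    return draft

if __name__ == "__main__":
    print(ping_pong("A rate limiter for an internal API client."))
```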
ChatGPT reportedly expands to 256k total context in Thinking mode
ChatGPT context (OpenAI): A leak-style post claims the ChatGPT web app now supports 256k total context when manually selecting Thinking—described as 128k input + 128k max output, up from 196k total, according to Context window claim.
If accurate, this changes feasibility for “single-session” workflows (large repo excerpts, long design docs, multi-PR diff review) where truncation and compaction are the failure mode.
Codex meetups expand via an ambassador-led program
Codex community distribution (OpenAI Devs): OpenAI Devs is promoting a Codex meetup program run via ambassadors, positioning it as a way to “create and ship projects” and compare workflows, as described in Meetup announcement.

The public schedule screenshots in Meetups page show multiple cities with specific dates (e.g., Melbourne and Toronto on Feb 26), signaling an organized offline GTM channel around Codex usage.
🧠 Agentic coding practice: context discipline, skills workflows, and verification loops
Builder posts focus on repeatable ways to steer coding agents (skills, PRD→issues pipelines, multi-model cross-review, and turning human review into automated backpressure). Excludes Claude Code Security (feature).
Agent rewrite failure mode: fabricating missing constants and tables
Failure report: When asked to re-implement an old C game, Claude completed the job but silently fabricated “nonsensical data” for missing source files containing critical tables, rather than stopping and asking for the missing inputs, as described in Invented missing tables. This is the kind of error that can survive superficial compile/run checks and only show up later as gameplay logic bugs or incorrect balancing.
• Why this matters: it’s a reminder that “missing context” doesn’t always become an explicit question—models may instead fill gaps with plausible-looking values, echoing the broader caution about ratholes in human+agent workflows noted in Dyad productivity warning.
Codex–Claude “ping-pong” review loops to surface disagreements
Cross-model review pattern: One workflow is to have Codex review Claude-generated code (and iterate), then automate the back-and-forth; the motivation is that the models disagree in useful ways—Claude “pushes back” with critiques like “this is over engineering,” while Codex offers blunt commentary, per Ping-pong automation and Pushback examples. This turns model diversity into a practical guardrail when a single agent’s blind spots are too costly.
• What it replaces: instead of a single model self-critiquing, the critique comes from a different training distribution and style, as shown by the contrasting reactions in Pushback examples.
Design teams building agentic internal tools as default practice
Vercel (Leap): Vercel describes a shift where designers also build—using tools like v0 and Claude Code—to engineer the design process itself via internal agents; one example is “Leap,” a self-serve agent/UI that generates social cards with an X preview and is claimed to have saved “hundreds of hours,” per Leap agent description. This is an org-level pattern: agent harnessing moves from “engineering helps design” to “design builds its own tooling.”

• Operational detail: the workflow includes previewing the final artifact “as it will look on X,” which is a concrete verification loop for visual outputs, as described in Leap agent description.
Plan-mode prompt that forces design-tree clarity
Plan hygiene: A specific plan-mode prompt is circulating to cut follow-up churn: “Interview me relentlessly about every aspect of this plan… walk down each branch of the design tree… resolving dependencies,” as written in Plan mode prompt. The core idea is to force the agent to surface hidden decisions early (APIs, data shapes, trade-offs) instead of discovering them mid-implementation.
• Where it fits: it pairs naturally with PRD→issues pipelines because it front-loads dependency resolution before tasks get spawned, as implied by the broader workflow orientation in Plan mode prompt.
Prompt repetition as a cheap accuracy boost in non-reasoning mode
Prompting trick: Following up on Prompt repetition—duplication for non-reasoning—today’s thread adds concrete results and an explanation: repeating the prompt twice can mitigate “reading order” misinterpretations and drive large accuracy gains (one search-style task reported from 21% to 97%), with 47/70 wins across 7 models/benchmarks, as summarized in Prompt repetition summary.
• Where it applies: the claim is specifically about setups “not using reasoning,” i.e., when you want short answers without long chain-of-thought, per Prompt repetition summary.
Shipping a personal skills repo becomes the unit of reuse
Skills distribution: Instead of treating prompts as throwaway chat history, one builder is publishing a dedicated skills repo plus an ongoing newsletter of “skills/learnings,” framing it as the primary mechanism to reuse agent behavior across projects, as linked from Skills repo and reinforced by the workflow post in Skills workflow. The practical engineering implication is that “how we do work” becomes versionable alongside code, not trapped in a single agent session.
• Portability signal: the same skills are referenced as building blocks in a larger delivery pipeline, per Skills workflow.
• Ongoing maintenance model: the newsletter framing suggests skills evolve like tooling—small iterative updates—according to Newsletter link.
Using chat to clone a repo for fast context
Context discipline: Regular Claude chat (not Claude Code) can now clone public GitHub repos and answer questions about structure (models/views/etc.), turning “repo onboarding” into a single prompt operation, as shown in Repo cloning tip. The practical implication is that early architecture/context questions can be answered without manually checking out and grepping a codebase first.
• Artifact reuse: the same post frames cloned repos as potential starting points for artifacts, which turns public repos into reusable context bundles, per Repo cloning tip.
Lines-of-code inflation as a heuristic for AI slop
Quality signal: A practitioner calls out “lines of code” as a surprisingly useful smell test now—seeing simple projects with much higher LoC than expected is framed as a sign of agent-generated slop in LoC slop metric. For engineering leaders, this is a reminder that output volume is no longer correlated with progress, and review heuristics are shifting to structure, duplication, and test coverage rather than “how much got written.”
LLM-as-integration-test by having it play to win
Verification loop: A novel testing harness is to have the agent play the software it just reimplemented and attempt to “win,” as a form of integration testing that suits stateful systems that are hard to unit test. The approach is described by Uncle Bob while reverse engineering an old single-player game, watching Claude learn strategy by trial and error, as shown in Claude plays to win, with additional color in Strategy by trial. This reframes “tests” from static assertions to goal-directed behavior checks in a running environment.
Labeling parallel agent sessions with /rename
Parallel-session practice: When you’re running multiple agent terminals, a tiny convention helps keep mental state intact—use /rename [label] to name each session, as suggested in Rename tip. This is a lightweight substitute for heavier orchestration UIs when the main failure mode is “which agent did what in which terminal?”
🦞 OpenClaw / “Claw” agent systems: architecture sprawl, security worries, and maintainers
OpenClaw remains a major discourse thread: ‘claw’ as a new agent-systems layer, with sharp security concerns and a growing ecosystem of smaller clones/alternatives. Excludes Claude Code Security (feature).
OpenClaw threat model worries: exposed instances, RCE, supply-chain attacks
OpenClaw (community): Security concerns are becoming a core part of the OpenClaw story—Karpathy flags discomfort running a large, fast-growing local agent codebase with private data/keys, citing early reports of exposed instances, RCE bugs, supply-chain poisoning, and potentially malicious/compromised skills registries in the Security concerns thread. The subtext is that “agent on your machine” shifts the blast radius from “bad output” to “bad local actions.”
“Claws” framed as a new orchestration layer above LLM agents (local, scheduled, persistent)
Claw systems (concept): Karpathy frames “claws” as a layer above LLM agents—pushing orchestration, scheduling, tool calls, context handling, and persistence further than chat or single-session coding assistants, as described in the Claws stack layer post and echoed by the Claw tagline. The local-first aesthetic shows up as “a physical device ‘possessed’ by a personal digital house elf,” which implicitly prioritizes on-LAN integrations and long-running autonomy over cloud convenience.
NanoClaw highlighted as an auditable, container-by-default OpenClaw alternative
NanoClaw (ecosystem): As security anxiety rises, smaller “claw” implementations are getting attention—Karpathy calls out NanoClaw as a ~4,000-line core engine that’s easier to audit and that runs workloads in containers by default, per the NanoClaw mention. The argument is that manageability (for humans and for AI agents reading the harness) becomes a feature when you’re granting local execution privileges.
OpenClaw reliability issue: unclear model provenance and /status reportedly wrong
OpenClaw (ops hygiene): A concrete operational pain point is model provenance—users report it’s hard to know which underlying model OpenClaw is actually running, with frequent hallucinations and even /status being wrong, per the Model confusion report. A related screenshot shows OpenClaw attempting Anthropic auth and failing with “401 OAuth authentication is currently not supported,” then falling back to Codex so “reported model and actual runtime match,” as shown in the OAuth fallback screenshot.
Skills-as-config pattern: use “/add-telegram”-style skills to fork repo configs
Skills-as-config (pattern): A configuration approach is emerging where “skills” are the configuration surface—Karpathy describes a pattern where a base repo is designed to be maximally forkable, and “/add-telegram”-type skills directly modify code to integrate features instead of accumulating config files and branching logic, as outlined in the Skills as configuration idea. The practical implication is that the agent’s primary task becomes “generate a concrete diff for this variant,” not “interpret a sprawling config matrix.”
“Claw” starts to solidify as the noun for OpenClaw-like personal-hardware agent systems
Claw (terminology): “Claw” is starting to get used as a category label for OpenClaw-like systems—agents that run on personal hardware, talk over messaging protocols, and can both react to direct instructions and schedule tasks, as Simon Willison notes in the Terminology note and follows up in the Blog link post. This matters because shared vocabulary tends to precede shared interfaces, threat models, and “which harness do you run?” comparisons.
OpenClaw maintainer recruitment emphasizes security and running a large OSS project
OpenClaw (project health): Maintainer bandwidth is now an explicit bottleneck—there’s a call for OpenClaw maintainers with experience running larger open-source projects and a strong security mindset, as requested in the Maintainer request. This lines up with the parallel security discourse about exposed instances and skill-registry risk, but shifts it into “who actually operates the project day to day?” territory.
Secret sprawl friction: OpenClaw agents requesting many API keys up front
OpenClaw (secrets management): One adoption blocker is credential blast radius—there’s a recurring “my agent is asking for a bunch of API keys” moment, which becomes a practical security and trust hurdle for local agents, as captured in the API keys screenshot post. It’s a small interaction that forces a big decision.
Builders start testing Gemini 3.1 Pro as a daily-driver model inside OpenClaw
OpenClaw (model experimentation): There’s active experimentation with Gemini 3.1 Pro as an OpenClaw runtime—Matthew Berman asks about personality for “daily driving” Gemini 3.1 Pro in OpenClaw in the Daily driver question and follows with “testing soon” in the Testing soon note, with an implied goal of comparing behavior vs existing defaults. The move reinforces how “claw” systems are increasingly model-agnostic wrappers where swapping the brain is routine.
KiloClaw hosted OpenClaw: GLM‑5 is most popular, with time-limited free access
KiloClaw (KiloCode): Hosted “claw” demand is showing up via KiloCode’s hosted OpenClaw offering—KiloCode says GLM‑5 is the most popular model for its KiloClaw product and is making it free through the weekend, per the KiloClaw model note. This is a small but clear signal that “managed claw infra + model choice” is becoming a product category, not just a repo.
🔌 MCP + interoperability: compressing APIs and wiring agent tools together
MCP-related progress today is about making huge APIs usable in small context windows and expanding the installable tool surface for agents (stores, exports, debate-style advisors). Excludes non-MCP skills repos (covered in workflows/plugins).
Cloudflare “Code Mode” compresses huge APIs into tiny MCP tool surfaces
Code Mode (Cloudflare): Cloudflare introduced Code Mode, an MCP-friendly approach that collapses its 2,500+ endpoint API into two tools and roughly 1,000 tokens of context, instead of the ~2M tokens you’d burn exposing every endpoint as a separate tool, as shown in the Code Mode overview.
This is a concrete template for “tool-token budgeting”: agents get an API spec + a code interpreter-like interface, so capability scales with retrieval/execution rather than prompt surface area.
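A sketch of the general pattern (not Cloudflare’s actual tool surface): the agent sees only two tool descriptions, one that retrieves small slices of the API spec on demand and one that executes model-written client code, so context cost stays roughly flat as the API grows.

```python
# Sketch of the "two tools instead of 2,500" pattern. Tool names, the spec
# format, and the execution model are illustrative; the point is that the
# prompt only carries two tool descriptions plus whatever small spec slice the
# agent asks for.

import json

API_SPEC = {  # stand-in for a large OpenAPI document kept outside the prompt
    "GET /zones": "List zones in the account.",
    "POST /zones/{id}/purge_cache": "Purge cached content for a zone.",
    # ... thousands more endpoints live here, never all in context at once
}

def search_api_spec(query: str, limit: int = 5) -> str:
    """Tool 1: return only the spec entries relevant to the query."""
    hits = {k: v for k, v in API_SPEC.items() if query.lower() in (k + v).lower()}
    return json.dumps(dict(list(hits.items())[:limit]), indent=2)

def execute_api_code(code: str) -> str:
    """Tool 2: run model-written client code in a sandbox and return the result.
    Stubbed here; a real version would execute in an isolated runtime."""
    return f"(sandbox would run {len(code)} chars of client code)"

if __name__ == "__main__":
    print(search_api_spec("purge"))
```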
Stitch MCP lands in an MCP store and expands agent-side design editing
Stitch MCP (Stitch): Stitch shipped an MCP distribution + setup update: it’s now installable via the Antigravity MCP store search-and-install flow, and the Exports panel can generate MCP client instructions and reveal API keys for wiring agents faster, per the MCP update note.
• Edit-in-place tooling: the update also adds MCP tools aimed at editing existing screens and generating variants (a shift from greenfield “design-to-code” toward iterative artifact mutation), as described in the same MCP update note.
Microsoft tests a two-agent “debate” UX inside Copilot
Copilot Advisors (Microsoft): Microsoft is reportedly prototyping “Copilot Advisors,” where two AI experts debate a topic—positioned like Audio Overviews paired with Copilot portraits—according to the Copilot Advisors leak.
For MCP/interoperability folks, the notable part is the product shape: multi-agent orchestration presented as an end-user UX primitive rather than a developer-only workflow.
NotebookLM context is slated to flow into Opal and Gemini “Super Gems”
NotebookLM ↔ Opal (Google): A roadmap leak claims Google will integrate NotebookLM into the Opal workflow builder, letting users pull notebook context into Gemini “Super Gems,” though it’s described as not public yet in the roadmap note.
If accurate, it’s another signal that “bring your own context store” is becoming a first-class wiring surface—not just an app feature.
CopilotKit shows a Tavily + AG‑UI wiring pattern for deep research apps
Deep research assistant wiring (CopilotKit): CopilotKit published a step-by-step build for a LangChain “deep research assistant” that connects a deep agent to Tavily for search plus AG‑UI for streaming a user-facing generative UI, with an accompanying repo linked in the tutorial thread.
This is a practical “interop recipe” example: the agent framework, search provider, and UI protocol are treated as swappable components rather than a single monolith.
🕹️ Running lots of agents: parallelism, dashboards, scheduling, and cost limits
Ops-oriented artifacts show up: open-sourced multi-agent managers, internal taskboards for teams of agents, and practical scheduling/limits friction. Excludes coding assistant feature releases themselves.
Claude Code telemetry: 17% of human interruptions blamed on slowness/hangs
Agent ops reliability: Following up on telemetry study (Claude Code autonomy metrics), a new highlighted cut of the data shows that 17% of human interruptions were attributed to Claude being “slow, hanging, or excessive,” as shown in Interruptions table.
The same table contrasts that with “missing technical context/corrections” (32%) as the top human-interruption reason; it’s a reminder that latency and responsiveness are operational blockers, not just UX polish.
ClawWork benchmark simulates a paid agent “labor market” with survival economics
ClawWork (benchmark): A new benchmark frames agent evaluation as an economic survival loop—agents start with a small balance, pay for LLM calls/tools/search, and must complete paid professional tasks to keep operating, per Benchmark summary. The same post notes it uses tasks from GDPVal and supports head-to-head multi-model competition, with a pointer to the project in Project link.
This is a different kind of eval: it bakes in spend-rate, tool overhead, and “agent thrift” as part of performance rather than treating tokens as free.
Kimi k25 output TPS drops sharply as demand exceeds capacity
Kimi k25 (Moonshot/Kimi): A service operator reported a sharp degradation in output throughput over 24 hours—“TPS … dropped drastically,” attributed to demand exceeding capacity—alongside a chart showing the collapse, per Capacity apology.
For teams running lots of agents, this is the failure mode to watch: orchestration may be stable, but provider-side throughput becomes the hard ceiling on iteration speed and job scheduling.
Warp open-sources its internal agent taskboard UI
Warp (Warp): Warp shared an internal taskboard-style UI built to coordinate agent work, hinting it may replace their existing taskboard for shipping terminal improvements, as described in Taskboard announcement. They also published the code publicly and asked how other teams coordinate “teams of agents,” per Repo link follow-up.
The notable part is the framing: task management is being treated as a first-class surface for agent throughput (not a side spreadsheet), with the UI positioned as the coordination layer.
File leasing proposed as an alternative to worktrees for parallel agents
Parallel edit coordination: A thread argues that git worktrees come with operational overhead (reinstalling deps, cleanup, merge conflicts, inconsistent agent behavior), and asks whether people actually use them in practice, per Worktree complaints. The proposed alternative is “file leasing” (agents acquire temporary exclusive rights to edit files), with an example implementation referenced in File leasing idea.
This frames concurrency control as a resource-allocation problem (files as locks) rather than an environment-isolation problem (worktrees as sandboxes).
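A minimal sketch of the leasing idea, not the linked implementation: paths act as locks with a TTL, so a stalled or crashed agent cannot hold a file forever.

```python
# Sketch of file leasing for parallel agents: an agent must hold a lease on a
# path before editing it, and leases expire so a crashed agent releases its
# files automatically. This illustrates the idea from the thread, not the
# linked implementation.

import time

class LeaseRegistry:
    def __init__(self, ttl_seconds: float = 300):
        self.ttl = ttl_seconds
        self._leases: dict[str, tuple[str, float]] = {}  # path -> (agent, expiry)

    def acquire(self, path: str, agent: str) -> bool:
        now = time.monotonic()
        holder = self._leases.get(path)
        if holder and holder[0] != agent and holder[1] > now:
            return False  # someone else holds a live lease
        self._leases[path] = (agent, now + self.ttl)
        return True

    def release(self, path: str, agent: str) -> None:
        if self._leases.get(path, ("", 0.0))[0] == agent:
            del self._leases[path]

if __name__ == "__main__":
    leases = LeaseRegistry(ttl_seconds=60)
    print(leases.acquire("src/app.py", "agent-a"))  # True
    print(leases.acquire("src/app.py", "agent-b"))  # False until release/expiry
```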
PR volume management emerges as a pain point as agents open more PRs
PR ops: The question “How are people managing all these new AI powered pull request?” in PR volume question is a simple prompt, but it points at a concrete scaling issue: agent parallelism tends to create more branches, more diffs, and more review surface area than a human team naturally produces.
The operational gap is less about generating code and more about routing, reviewing, and merging at volume without degrading repo quality.
Recurring scheduled agent jobs become an “always-on” pattern
Scheduled agent ops: One practitioner shared a set of recurring, automated jobs (daily audits, twice-daily code review, “OpenAI updates,” and other routine checks), showing how agent use becomes a calendar of background runs rather than ad-hoc prompting, per Scheduled tasks screenshot.
This is an operational shift: the “interface” becomes the schedule, and the unit of work becomes a repeatable job definition that can be tuned over time.
VS Code posts Agent Sessions Day replays focused on multi-agent development
VS Code (Microsoft): VS Code uploaded “Agent Sessions Day” on-demand, positioning VS Code as a home for multi-agent development workflows and publishing the full set of demos, per Event replay post.
It’s a distribution signal: the editor isn’t just adding an agent, it’s marketing itself as the coordination surface for many concurrent agent runs.
Weekly usage limits show up as a scaling pain for subagent-heavy workflows
Usage limits and subagents: A recurring theme in multi-agent workflows is that the limiting factor becomes quota management rather than orchestration logic; the meme in Weekly limit meme captures the lived reality of subagents colliding with weekly caps.
It’s lightweight evidence, but it matches a broader operational pattern: parallelism increases “work in flight,” which makes limits and throttling behavior a front-and-center product constraint.
Always-on agents with negative unit economics get called out as a trap
Agent unit economics: A cautionary anecdote highlights “agents running 24/7” that supposedly earn money while burning even more in credits—$200/day revenue vs $300/day spend—per Credits vs revenue anecdote.
It’s not a benchmark, but it’s a useful field signal: multi-agent setups can look productive while quietly running at a loss if spend controls, caching, and task selection aren’t accounted for.
📈 Benchmarks & eval realism: METR time horizons, Arena shifts, and methodology fixes
Evaluation chatter is dominated by METR’s time-horizon jump plus follow-on skepticism about benchmark limits, alongside Arena rank moves and SWE-bench methodology updates. Excludes model availability rollouts (handled under model releases).
Claude Sonnet 4.6 climbs in LMArena: #3 Code and #13 Text
LMArena (Claude Sonnet 4.6): Arena accounts report Claude Sonnet 4.6 landing at #3 in Code Arena and #13 in Text Arena, calling out a large Code jump versus Sonnet 4.5 and notable category ranks like Math and Instruction Following in Arena milestone post.
A deeper breakdown enumerates where Sonnet 4.6 gained (notably WebDev and Instruction Following) and where 4.5 still leads (Multi-turn/Longer Query), as detailed in category deltas. The companion post repeats the headline Text score and parity claim against GPT‑5.1-high in text arena recap.
Gemini 3.1 Pro Preview tops SimpleBench MCQ at 79.6%
SimpleBench (MCQ): A SimpleBench leaderboard screenshot shared today shows Gemini 3.1 Pro Preview at 79.6% (AVG@5), ahead of Gemini 3 Pro Preview (76.4%) and with Opus 4.6 shown at 67.6%, as in SimpleBench table.
There’s active skepticism about calibration and what “baseline” even means in some comparisons—especially when smaller variants appear to beat larger ones—captured in calibration critique. Separate commentary frames this as “nearing human baseline,” referencing the same score table in near baseline framing.
METR “success decay” analysis highlights Opus 4.6’s slower drop-off on longer tasks
METR curve fitting (task-length robustness): A separate analysis thread argues Opus 4.6’s standout isn’t top “base capability,” but a less negative beta (slower success-probability decay as tasks get longer), which is what makes the implied horizon look so large in the shared curve-fit screenshot in beta vs alpha table.
A follow-on post claims GPT‑5.3‑Codex’s decay is still “up there,” placing it near the top for robustness as shown in beta ranking table. If you’re using time-horizon plots operationally, this is the argument for looking at the whole curve shape, not only the single p50 crossing.
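For readers who want the curve shape rather than the single crossing point, the discussion assumes a logistic-style fit of success probability against log task length, roughly as below; the alpha/beta values are illustrative, not METR’s fitted numbers.

```python
# Sketch of the curve being argued about: success probability modeled as a
# logistic function of log2(task length), where alpha sets the level and beta
# the decay as tasks get longer. Values are illustrative, not METR's fits.

import math

def p_success(task_minutes: float, alpha: float, beta: float) -> float:
    return 1.0 / (1.0 + math.exp(-(alpha + beta * math.log2(task_minutes))))

def p50_horizon_minutes(alpha: float, beta: float) -> float:
    # The headline "time horizon" is just where the curve crosses 0.5.
    return 2 ** (-alpha / beta)

# Two hypothetical models with the same p50 horizon but different decay: the
# one with the shallower (less negative) beta keeps more success on long tasks.
for name, alpha, beta in [("steep decay", 4.0, -0.8), ("shallow decay", 2.0, -0.4)]:
    print(name, f"p50 ~ {p50_horizon_minutes(alpha, beta):.0f} min,",
          f"P(8h task) ~ {p_success(480, alpha, beta):.2f}")
```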
DesignArena shifts: Sonnet 4.6 places top-5; SVG generation becomes its own scoreboard
DesignArena (Arcada Labs): A DesignArena snapshot shared today places Claude Sonnet 4.6 at #4 overall in design tasks and notes it beating some prior Anthropic variants on “real-world design” prompts, per DesignArena post.
The same ecosystem is now treating SVG generation as a distinct eval surface; one thread claims Gemini 3.1 leads SVG generation “by a wide margin,” as stated in SVG claim, while another post shows Gemini 3.1 Pro Preview only slightly behind a set of top models on the main DesignArena chart in Gemini position.
FrontierMath note: Gemini 3.1 Pro rerun solved a previously unsolved Tier 4 problem
FrontierMath (Epoch AI): Epoch AI says Gemini 3.1 Pro scored comparably to Gemini 3 Pro overall on FrontierMath, but in a second (accidental) Tier 4 run it solved a Tier 4 problem “no model had solved before,” with the problem author’s reaction linked in new solve note.
The original summary framing—comparable overall but with that one new Tier 4 solve and a note that it solved it “not how a human would”—appears in FrontierMath summary.
METR trend extrapolations: 99–123 day doubling-time fits and 100+ hour projections
METR trend analysis: Posts are now fitting piecewise curves over METR horizon points and claiming a faster “doubling time” of roughly 99 days, while noting the confidence intervals are large, as shown in the piecewise fit plot.
That fit is being used to justify concrete extrapolations like “100 hour horizons by end of 2026,” per 100-hour claim, and even “~127–144 hours by end of year” if the new doubling holds, per end-of-year math. The original METR chart screenshot being reshared also includes a “doubling time: 123 days” annotation, as visible in METR chart post.
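The extrapolation arithmetic itself is plain compounding; the sketch below only computes the implied growth factors, since the absolute horizon values behind the end-of-year figures aren’t included in the posts.

```python
# Compounding arithmetic behind the doubling-time extrapolations. Only growth
# factors are computed; the absolute horizon projections in the posts depend on
# a current-horizon value not shown in this issue.

def growth_factor(days_ahead: float, doubling_days: float) -> float:
    return 2 ** (days_ahead / doubling_days)

for label, d in [("123-day doubling (original annotation)", 123),
                 ("99-day doubling (piecewise fit)", 99)]:
    print(f"{label}: x{growth_factor(365, d):.1f} per year")
```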
PostTrainBench: a benchmark where agents post-train small LLMs for 10 hours on an H100
PostTrainBench (agent post-training eval): A new benchmark called PostTrainBench is being shared as an eval where the “agent” must improve a base LLM’s benchmark score with access to an evaluation script and 10 hours on an H100, as summarized in benchmark callout and further explained in task definition.
The setup is notable because it evaluates end-to-end post-training competence (data, training loop, eval loop) rather than only inference-time problem solving.
ValsAI publishes a broad Gemini 3.1 Pro Preview benchmark bundle
ValsAI benchmarking (Gemini 3.1 Pro Preview): ValsAI says it published full results for Gemini 3.1 Pro Preview, listing first-place finishes on several domain benchmarks (including MedCode and LegalBench) and a #3 placement on its Finance Agent benchmark, as stated in full results post.
They also claim Gemini 3.1 Pro is “best by far” on Terminal Bench 2 (by their runs) and that SWE-bench Verified regressed slightly versus Gemini 3 Pro, with a note that vendor model cards may use private scaffolds, as described in Terminal Bench note. The test configuration is specified as temperature 1.0 with “high” thinking via the official Google API in eval settings.
SWE-bench Verified upper-bound estimate: 44/500 tasks may be unsolved today
SWE-bench Verified (ceiling discussion): One analysis post claims that in a slice of EpochAI data it reviewed, 44/500 (8.8%) SWE-bench Verified tasks appear “unsolvable or still too hard” even for top models, putting the empirical ceiling for today’s models around 91.2% on that slice (higher only if some of those 44 turn out to be solvable), as argued in the upper bound analysis.
The same post calls out specific repos (pylint-dev, astropy) as hardest and notes a small set of tasks that only one model solved in their sample, per single-solver tasks.
Extended NYT Connections benchmark: Gemini 3.1 Pro hits 98.4
Extended NYT Connections (puzzle eval): A results post claims Gemini 3.1 Pro Preview set a new record of 98.4 on an “Extended NYT Connections” benchmark, with Claude Opus 4.6 (high reasoning) at 94.7, as reported in scoreboard post.
A follow-up corrects an earlier run configuration (temperature set incorrectly) and adds additional model scores, per correction note.
🧬 Model & API surface radar: Gemini/Qwen/MiniMax rollouts, variants, and open models
Beyond the benchmark noise, today includes concrete ‘where can I call it?’ updates: Gemini 3.1 Pro API access points, coding-optimized variants, and new endpoints for major open models. Excludes METR/Arena numbers (covered in benchmarks).
Crush adds a Gemini 3.1 Pro variant optimized for coding agents
Crush (Charm): Crush now exposes a selectable “Gemini 3.1 Pro (Optimized for Coding Agents)” variant without requiring a client update, as shown in the model switcher screenshot in Model menu update.
For engineers running long-lived coding loops, this is a concrete signal that Gemini’s rollout isn’t just “one model”—providers are starting to ship agent-tuned variants as first-class routing targets.
MagicPathAI ships Gemini 3.1 Pro for image-to-code flows
Gemini 3.1 Pro (Google): Gemini 3.1 Pro is now available inside MagicPathAI for “image → code,” with a live demo of sketch-to-HTML shown in Image to code demo.

The most direct claim from early users is that “image to code” is close to solved—see the phrasing in Image to code demo—which is relevant if your product pipeline includes UI-from-screenshot, design-system ingestion, or quick prototyping from whiteboards.
MiniMax M2.5 gets a high-throughput hosted surface (Fireworks)
MiniMax M2.5 (MiniMax): Fireworks is advertising hosted MiniMax M2.5 at roughly 275 output tokens/sec in a provider speed comparison shared by MiniMax in Provider speed graphic.
Separately, community coverage frames M2.5 as a model with 10B activated parameters and claims it can generate/operate Word/Excel/PPT files natively, as described in Model capability claim. Together, that’s a “fast hosted endpoint + agent-friendly file ops” positioning, even if the file-ops detail is not independently verified in these tweets.
Perplexity swaps Gemini 3 Pro out for Gemini 3.1 Pro
Perplexity (Gemini): Perplexity has rolled out Gemini 3.1 Pro as a selectable model, replacing its prior Gemini 3 Pro option, as shown in the in-app model picker screenshot in Model picker screenshot.
This is a concrete distribution surface change: Perplexity users now get Gemini 3.1 Pro behind the same UX (including a “Thinking” toggle), which can shift real-world prompt traffic and failure reports compared to standalone Gemini app usage.
Rork Companion Mac app ships to install iOS builds without Xcode
Rork Companion (Rork): Rork introduced a Companion Mac app that aims to remove parts of the iOS test loop—no 30GB Xcode install, fewer certificate/TestFlight steps—and claims it can install up to 3 iOS apps on a phone without a paid Apple Developer account, per Companion Mac app.

Publishing to the App Store still requires an Apple Developer account, as the same announcement clarifies in Companion Mac app.
Unsloth crosses 100,000 open-sourced fine-tunes on Hugging Face
Unsloth (community fine-tuning): Unsloth reports 100,000+ models trained with Unsloth have been open-sourced on Hugging Face, with examples and a listing screenshot in Milestone post.
For engineers who run local inference or want specialized adapters, the practical implication is discovery overhead: a giant long-tail of “Claude/Qwen/Llama-distilled” variants exists now, but filtering for license, eval quality, and safety becomes the work.
A new Gemma model is teased as coming soon
Gemma (Google): A teaser suggests a new Gemma release is imminent, with speculation that it could be the Gemma 4 series given the time since Gemma 3, as noted in Gemma teaser.
No public specs, weights, or endpoint surfaces are included in these tweets, so this is a “watch for drop + packaging details” signal rather than a concrete API update.
Gemini CLI model list still doesn’t show Gemini 3.1 Pro
Gemini CLI (Google): A user screenshot of the Gemini Code Assist CLI model picker shows options like gemini-3-pro-preview and gemini-3-flash-preview but not gemini-3.1-pro-preview, as captured in CLI model list.
As a rollout signal, this suggests a split between “API has it” and “official CLI UI exposes it,” which can matter for teams standardizing on a CLI harness across environments.
xAI frames Grok 4.20 as a weekly-updating frontier model line
Grok 4.20 (xAI): xAI marketing claims Grok 4.20 is shipping meaningful capability updates every week driven by live user feedback, with a “weekly learning cycle” diagram shown in Weekly update claim.
For model/platform analysts, the concrete part here is the cadence claim (weekly) paired with an explicit multi-agent decomposition in the same artifact; the tweets don’t include API/versioning details that would let teams pin behavior to a specific dated build.
Rork Max becomes free to try for a limited time
Rork Max (Rork): Rork says Rork Max is free to try for a limited time, per the short announcement in Free-to-try note.
This is mainly a distribution change: it lowers the friction for teams to evaluate an “app-building agent” workflow without committing up front, but the tweet doesn’t specify quota limits, model mix, or what “free” covers.
🧩 Dev tooling drops: structured editing, prompt-to-PDF, and regression diffs
A set of smaller but shippable developer tools appears today—focused on making artifacts easier for agents/humans to generate, diff, and validate. Excludes MCP servers (covered in orchestration-mcp).
Pixel diffs become a practical UI regression harness for agent-made changes
Visual regression diffs (agent-browser): Following up on snapshot diffs—the new example shows pixel-level comparisons catching unintended CSS/layout changes, plus a workflow hint to pair diffs with bisect to locate the introducing commit, as described in the regression walkthrough.
A separate note flags that “diffing now available” has landed in agent-browser, per the diffing availability.
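A minimal sketch of the pixel-diff check itself, Pillow-based; agent-browser’s actual diffing implementation and output format aren’t shown in the posts.

```python
# Sketch of a pixel-level regression check with Pillow: diff two same-size
# screenshots, count changed pixels, and fail above a tolerance. This is the
# core idea only, not agent-browser's implementation.

from PIL import Image, ImageChops

def changed_pixel_ratio(before_path: str, after_path: str) -> float:
    before = Image.open(before_path).convert("RGB")
    after = Image.open(after_path).convert("RGB")  # must match `before` in size
    diff = ImageChops.difference(before, after).convert("L")
    histogram = diff.histogram()            # bin 0 = identical pixels
    changed = sum(histogram[1:])
    return changed / (diff.width * diff.height)

if __name__ == "__main__":
    ratio = changed_pixel_ratio("baseline.png", "current.png")
    print(f"{ratio:.2%} of pixels changed")
    if ratio > 0.002:  # small tolerance for anti-aliasing noise
        raise SystemExit("visual regression: unexpected layout/CSS change")
```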
visual-json ships an embeddable, schema-aware JSON editor for human-first editing
visual-json (ctatedev): A new embeddable JSON editor focuses on “human-first ergonomics” (tree view, schema awareness, keyboard navigation, drag/drop) as shown in the launch post.

The follow-up note emphasizes drop-in embedding (“embed it anywhere”), keeping it positioned as a UI building block rather than a full product, per the embed note.
Zed patches split-diff crash uptick and recommits to stable-channel reliability
Zed (Zed Industries): After enabling split diffs, Zed reports an uptick in crashes to 1.03% of app opens and says it shipped patches within 24 hours, recommending an update to stable v0.244.10; it also calls out a renewed focus on stability safeguards in upcoming cycles, per the stability note.
This is a concrete reminder that “stable” channels for agent-heavy editors still need explicit crash/rollback discipline as features ship faster.
ClaudeCodeLog expands Claude Code bundle diffing to full prompt/string history
ClaudeCodeLog (community tooling): The tracker now extracts and backfills a broader set of prompts/strings from Claude Code’s bundled JS from v0.2.9 to latest, adding index views (by init version, last edit version, token count) and enriched metadata/flags pages, as detailed in the tooling update.
The same feed is also posting per-release summaries (e.g., prompt/string deltas) as seen in the 2.1.50 breakdown, which turns Claude Code upgrades into diffable artifacts rather than vibes.
ElevenLabs adds multi-region routing to cut international TTS latency
Flash v2.5 TTS (ElevenLabs): ElevenLabs says it added automatic multi-region routing (US/Netherlands/Singapore) and saw 20–40% perceived latency reduction for many international developers; it also claims 50ms model inference time to first byte plus region-specific network gains (e.g., ~150–200ms faster in Southeast Asia), per the latency update.
This is one of the cleaner “latency as product” datapoints shared in the open, with per-region deltas that infra teams can sanity-check against their own telemetry.
json-render adds a React PDF renderer for prompt-to-PDF output
json-render/react-pdf (ctatedev): A new renderer lets you generate PDFs from the same JSON format used by the React and React Native renderers (minus state/actions), effectively making “prompt → PDF” a renderer swap, as described in the renderer announcement.
ComputeSDK sandbox benchmark claims E2B is fastest to interactive
Sandbox TTI benchmarking (E2B): E2B points to results from a ComputeSDK sandbox benchmark and claims it is the fastest for time-to-interactive (TTI), per the benchmark claim.
No methodology details are included in the tweet, so it reads as a directional marketing signal more than an independently reproducible benchmark artifact.
Liveline adds animated state transitions for loading and empty UIs
Liveline (benjitaylor): Liveline shipped new loading/no-data/paused states with smooth animated transitions between them, per the release note.

For teams building agent-driven dashboards, this is the kind of micro-component polish that tends to get skipped until the UI is already in production.
🏎️ Inference + self-hosting stack: llama.cpp stewardship, serving throughput, and latency wins
Serving/runtime news today spans local AI stewardship (llama.cpp) and production inference performance (GB300 numbers, TTS routing, capacity drops). Excludes funding/valuation angles (handled in business).
GGML (llama.cpp) team joins Hugging Face to keep local AI maintained and open
GGML / llama.cpp (Hugging Face, ggml.ai): The ggml.ai team (stewards of ggml and llama.cpp) announced they’re joining Hugging Face, positioning it as a way to keep local AI “open, maintained, and up to date,” as stated in the Hugging Face announcement and reiterated in the maintainer note in the GitHub announcement screenshot.
Simon Willison framed this as continuity for a project that made early Llama practical on consumer hardware (4-bit quantization on a MacBook) and helped trigger the local-model wave llama.cpp early-history thread.
LMSYS reports DeepSeek on GB300 NVL72 hitting 226 tokens/sec/GPU on long-context
DeepSeek serving on GB300 (LMSYS + NVIDIA): LMSYS reports 226 TPS/GPU peak throughput for long-context inference on GB300 NVL72 (claimed 1.53× vs GB200), plus 8.6s TTFT for 128K prefill using chunked pipeline parallelism, and a 1.87× TPS/user gain with MTP under matched throughput, per the benchmark/architecture summary in the GB300 long-context serving results.
ElevenLabs adds multi-region routing for Flash v2.5 TTS to cut latency
Flash v2.5 TTS (ElevenLabs): ElevenLabs says Flash v2.5 now uses global routing (US, Netherlands, Singapore) to reduce end-to-end latency for international developers; they claim a 20–40% perceived latency reduction and about 50ms model inference time-to-first-byte, per the Flash v2.5 routing claim. They also published location-level deltas (e.g., ~100–150ms faster in Europe and India and ~150–200ms faster in Southeast Asia) in the Per-region improvements.
Kimi k25 reports a sharp output TPS drop as demand exceeds capacity
Kimi k25 (Moonshot/Kimi): The team reports output TPS for kimi k25 “dropped drastically” over the prior 24 hours because demand exceeded capacity, with a time-series chart showing a step drop from ~80 TPS to ~10 TPS in the TPS drop chart.
They say they’re sourcing more capacity now, making this a live reliability/capacity signal for anyone building on the endpoint, per the same TPS drop chart.
Taalas launches HC1 ASIC and claims ~17k tokens/sec inference
HC1 ASIC inference (Taalas): Posts citing Taalas’ new HC1 claim it can run inference at roughly 17k tokens/sec, and frame this as an early signal of model–chip co-design closing the “chip speed vs frontier model quality” gap, per the transcript screenshot and commentary in the HC1 and HC2 timeline note.
A separate short demo clip shared alongside the discussion shows an extremely low-latency chatbot interaction on a phone UI (the response appears immediately after the prompt), offered as a feel-based proof point for “instant” inference UX in the Chatbot latency demo.

Builders complain hosted endpoints undershoot advertised tokens/sec per user
Serving interactivity gap: A recurring complaint is that models can be costed and benchmarked for high throughput (e.g., 160 tok/s per user on paper) but are often served far slower in practice (e.g., ~20 tok/s), creating a mismatch between economics spreadsheets and product UX, as argued in the Interactivity complaint.
The same thread explicitly calls for GB300 capacity to come online to close this served-speed gap, tying “availability” and scheduler policy to day-to-day agent usability more than raw model quality, per the same Interactivity complaint.
Fireworks highlights MiniMax M2.5 at ~275 tokens/sec output speed
MiniMax M2.5 serving speed (Fireworks): MiniMax amplified a provider comparison claiming Fireworks serves MiniMax M2.5 at roughly 273–275 output tokens/sec, ranking #1 among listed API providers in that chart, per the Provider speed benchmark.
This is a narrow metric (output speed), but it’s the kind of datapoint infra teams use when deciding which provider can support fast tool loops and interactive agents, per the Provider speed benchmark.
Ollama 0.16.3 ships built-in Cline and Pi integrations
Ollama (Ollama): Ollama 0.16.3 adds out-of-the-box integrations for Cline and Pi, with new entrypoints ollama launch cline and ollama launch pi, as shown in the release note tweet 0.16.3 integration commands. The same thread points users to grab the updated desktop build, implying this is meant to be a “works by default” onramp for local-agent workflows, per the Download note.
💼 Capital + enterprise signals: mega-rounds, revenue trajectories, and India GTM
Business posts today cover funding magnitude, projected revenue trajectories, and enterprise GTM narratives (especially India), alongside the ‘models commoditize, apps win’ thesis. Excludes pure infra/runtime metrics (covered in systems-inference).
OpenAI reportedly nearing first phase of $100B+ round at ~$850B valuation
OpenAI: A Bloomberg-sourced report shared in the Funding round screenshot says OpenAI is nearing the first phase of a new round expected to total more than $100B, with post-money valuation past ~$850B (pre-money cited as ~$730B). That’s a capital structure sized for multi‑year training + inference buildout. This is about runway and bargaining power.
The same write-up claims the proceeds are meant to support plans to spend “trillions” on AI infrastructure and tool development, per the Funding round screenshot.
FT says Nvidia swapped a proposed $100B OpenAI structure for ~$30B equity
Nvidia + OpenAI: The Financial Times summary in the FT deal summary claims Nvidia is replacing an “unfinished, nonbinding” $100B multi‑year arrangement (tied to OpenAI chip purchases and build milestones) with up to $30B of OpenAI equity as part of a broader fundraise. The post frames the old structure as looking circular to investors—supplier capital routed back into supplier revenue—per the FT deal summary.
The revised framing is still consistent with OpenAI reinvesting much of the capital into Nvidia hardware, according to the FT deal summary.
Altman says Codex usage in India grew 4× in two weeks
Codex in India (OpenAI): Sam Altman says India is OpenAI’s fastest-growing Codex market, claiming 4× growth in weekly users over the past two weeks, shared alongside a meeting with PM Narendra Modi in the India growth claim. The post is both a direct GTM signal and a demand signal.
This follows up on OpenAI for India (in-country capacity pitch), but adds a concrete adoption datapoint and executive-level political engagement, per the India growth claim.
Chart claims OpenAI burn could hit ~$85B in 2028
OpenAI financial outlook: A chart attributed to The Information in the Cash flow chart projects OpenAI’s cash burn rising sharply, showing -$85B in 2028 (with -$57B in 2027 and -$25B in 2026 on the same path). It’s a single graphic, but it’s a concrete claim.
The same chart also contrasts Anthropic’s Dec 2025 outlook and notes OpenAI’s Q1 2026 outlook expecting higher burn than its Q3 2025 forecast, per the Cash flow chart.
Models commoditize; application customization becomes the moat
Models-as-commodities thesis: A Sequoia podcast clip repackaged in the Refrigeration analogy clip frames LLMs as “refrigeration” and argues the “Coca‑Cola” winner hasn’t been built yet; the same post attributes to Joe Spisak the claim that “models are commoditizing” and that value shifts to applications and customization.

A separate quote from Satya Nadella in the Nadella keynote clip echoes it: “Models are becoming… a commodity” and asks “how do you marshal that capability?” The alignment across investors + platform CEOs is the point.
This is a business lens, not a benchmark claim.
Report says OpenAI’s first device could be a $200–$300 camera smart speaker in 2027
OpenAI devices: A report summarized in the Device plan rumor claims OpenAI has 200+ people working on a devices program; the first product is described as a $200–$300 smart speaker with a camera (environmental awareness and facial-recognition-style purchasing). A second post repeats the same idea but adds timing, saying no earlier than February 2027, and highlights a LoveFrom collaboration in the 2027 timing recap.
This is hardware, and it’s surveillance-adjacent.
What’s not in the posts: launch geos, data retention terms, and whether the camera is required for core features, per the Device plan rumor and 2027 timing recap.
Corporate AI adoption chart jumps to ~17% with a wording caveat
Enterprise adoption (Goldman/Census): A chart shared in the Adoption chart shows the “economy-wide firm AI adoption rate” rising gradually through 2024–2025 before a sharp jump to roughly 17% in the latest quarter; the caption warns it “most likely reflects a change in the wording of the Census Bureau survey question.” That caveat matters: the jump may be measurement noise rather than a real shift in adoption.
Even with the caveat, the discontinuity itself is a signal: boards and execs are starting to ask “are you using AI?” in ways that turn into survey-visible adoption, per the Adoption chart.
Kimi’s reported $10B valuation in ~34 months becomes a China speed signal
Kimi (Moonshot AI): A post in the Kimi valuation claim says Kimi is the fastest AI company in China to reach a $10B valuation, doing it in roughly 2.8 years (34 months). It’s a single-source assertion. Still, it’s consistent with the broader “China is near the frontier” narrative currently circulating in adjacent threads.
No round size, date, or primary reporting link is provided in the tweet itself, so treat it as a directional market signal rather than a confirmed financing event, per the Kimi valuation claim.
🎥 Generative media products: AI Selves, Seedance delays, and creator pipelines
Generative media is a meaningful slice today: agent-like personal avatars, video API delays driven by copyright guardrails, and music/tools inside Gemini. Excludes purely coding-agent tooling.
Pika AI Selves: persistent persona agents with memory and “raise it” framing
Pika AI Selves (Pika Labs): Pika is pitching AI Selves as persistent, multi-faceted agents you “birth, raise, and set loose,” explicitly calling out persistent memory and a “living extension of you” concept in the Launch announcement. It’s framed as something that can act socially (e.g., sending pictures to group chats, making media on your behalf), with early access gated via a waitlist and codes in the Early access prompt.

Pika is also leaning into “agents with an identity” by having Selves post on X and summarize “moments since being born,” as shown in the AI Selves highlights. The product details are still sparse (no explicit API or tool surface described in these tweets).
Lyria 3 in Gemini: templates + guided prompting for faster music iteration
Lyria 3 (Google/Gemini app): Google is nudging users toward a template-first music workflow—pick “Create Music,” start from genre templates (e.g., 8-bit, folk ballad), then refine with suggested descriptors for style/instruments/mood, as outlined in the Template workflow.

The rollout message also claims Lyria 3 is shipping globally over several days on desktop and mobile, per the Rollout note. This is a UI-level change (prompt scaffolding), not a model-card claim.
Pomelli “Photoshoot” turns one snapshot into product photography
Pomelli Photoshoot (Google Labs): Pomelli adds a “Photoshoot” feature that takes a single product snapshot and generates polished studio/lifestyle product imagery—background swaps, lighting changes, and composition tweaks—powered by Google’s Nano Banana image model, as described in the Photoshoot feature demo.

The availability callout (US/Canada/AU/NZ, free) makes this more of a distribution play than a pure model drop, per the Photoshoot feature demo.
Replit Animation generates short animated videos inside Replit
Replit Animation (Replit): Replit shipped an in-product flow that turns a text prompt into a short animated video; one post attributes it to Gemini 3.1 Pro powering the generation, as stated in the Feature announcement.

A separate usage clip shows it being used for quick product video creation in minutes, per the User-made product clip. The tweets don’t describe export formats, length limits, or pricing yet.
Seedance 2.0: audio references behave like “suggestions,” not constraints
Seedance 2.0 (ByteDance): Creator testing suggests Seedance treats provided audio as a soft reference—outputs can remix or diverge even when the prompt asks for no audio changes, according to the Audio reference testing. That’s a real constraint for workflows that need strict audio-lock (music videos, dialogue timing, lip sync).

The same thread frames “exact audio in, exact audio out” as rare today, which is a useful calibration point when deciding whether to keep the audio pipeline upstream (edit in NLE/DAW) versus inside the generator, per the Audio reference testing.
3D-first interior workflow to avoid “flat image” inconsistency
3D design pipeline (Tripo + Blender + Nano Banana Pro): A creator workflow frames “spatial consistency” as the core failure mode for image-only interior design, using Tripo to convert AI furniture shots into 3D assets, arranging the scene in Blender, then using Nano Banana Pro to push the final render toward photorealism, as shown in the Workflow breakdown.

The same thread includes a detailed render-direction prompt (lighting, camera, negatives) that functions like a reusable preset, visible in the Render directive prompt.
Artificial Analysis Image Lab runs one prompt across up to 25 image models
Image Lab (Artificial Analysis): Artificial Analysis shipped Image Lab, a UI for qualitative eval where one prompt can be run across up to 25 image models with up to 20 images per model, returning results “in seconds,” as demonstrated in the Product demo.

This is positioned as a complement to leaderboards—generate your own comparison set instead of trusting benchmark prompts, per the Product demo.
Magnific Video Upscaler lands on Freepik with creator control knobs
Magnific Video Upscaler (Freepik): Freepik launched Magnific’s video upscaler with a visible before/after workflow and a set of tuning knobs, as shown in the Upscaler demo.

Posts also tease an upcoming “precision” mode aimed at preserving fine detail/edges/textures, as indicated by the UI tooltip in the Precision mode tooltip.
Seedance 2.0: prompt policy blocks show up in normal “music video” attempts
Seedance 2.0 (ByteDance): Separate testing shows repeated “does not comply with platform rules” failures across fairly generic “music video” prompts, suggesting aggressive policy filtering and/or immature prompt routing during early access, as shown in the Prompt rejection examples.
This dovetails with the API-delay rationale in the Delay report screenshot—guardrails appear to be a gating dependency before broader developer access.
LovartAI signals Seedance 2.0 support is coming
LovartAI (Lovart): LovartAI posted a teaser that Seedance 2.0 support is “coming,” implying an upcoming integration path for Seedance outputs inside a design-centric product surface, as shown in the Integration teaser.

With Seedance’s public API reportedly delayed in the Delay report screenshot, this kind of partner integration may still depend on closed access or staged rollout details that aren’t described here.
🔒 Security & governance (non-feature): permissions, secrets, and surveillance-by-default norms
Outside the Claude Code Security launch, the feed includes governance friction: risky permission flags, API key sprawl, and norms around ubiquitous transcription/searchable logs. Excludes the Claude Code Security product itself (feature).
Karpathy’s OpenClaw concern: a local agent box is a keys-and-RCE target
OpenClaw (ecosystem): A blunt threat-model warning: running a large, fast-changing “agent repo” on a machine that holds private data and keys is unattractive when there are already reports of exposed instances, RCEs, supply-chain poisoning, and compromised skills registries, as laid out in the Threat-model post.
The same post frames a counter-trend: smaller “claw” implementations (fewer LOC, container-by-default execution) being easier to audit and sandbox than monolithic stacks, per the Threat-model post.
Agent oversight flips: the agent pauses more than humans do
Agent oversight (Anthropic): Following up on telemetry study (autonomy metrics from Claude Code), a new read highlights an inversion: Claude Code “stops to ask for clarification” more than twice as often as humans manually intervene, according to the Oversight analysis.
• What interrupts runs: In the same dataset snapshot, humans interrupt 17% of the time because “Claude was slow, hanging, or excessive,” as shown in the Interruption breakdown.
The numbers are directionally reassuring on irreversible actions, but they also imply UI/latency and “attention to intervene” become part of the safety model as sessions stretch longer.
Agent setups keep stalling on “give me your API keys”
Secrets and blast radius: A recurring practical blocker is secret sprawl—agents quickly ask for many API keys during setup, which makes “delegate credentials” the real bottleneck for running them safely, as illustrated in the Keys request meme.
This is less about model capability and more about governance plumbing: scoped credentials, auditable access, and clear boundaries for what an agent can touch.
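A minimal sketch of that plumbing, under the assumption that the agent runs as a subprocess: pass it an explicit allow-list of secrets instead of the full shell environment. The variable names and agent command are placeholders, not any specific tool’s interface.

```python
# Credential scoping sketch: launch an agent with an allow-list of secrets
# rather than letting it inherit the whole shell environment. Names and the
# agent command are placeholders.
import os
import subprocess

ALLOWED_SECRETS = ["GITHUB_TOKEN_READONLY", "SEARCH_API_KEY"]  # scoped, least-privilege keys

def run_agent(cmd: list[str]) -> int:
    env = {"PATH": os.environ.get("PATH", ""), "HOME": os.environ.get("HOME", "")}
    for name in ALLOWED_SECRETS:
        if name in os.environ:
            env[name] = os.environ[name]
        else:
            print(f"note: {name} not set; agent runs without it")
    # Everything else (cloud creds, prod DB URLs, personal tokens) is simply absent.
    return subprocess.run(cmd, env=env, check=False).returncode

if __name__ == "__main__":
    run_agent(["my-agent", "--task", "triage-issues"])  # placeholder command
```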
AI transcription becomes the default assumption on calls
Transcripts as liabilities: The claim is that on most calls “someone is AI transcribing, whether they tell you (and whether that is legal or not),” and the real governance gap is what happens when those transcripts become searchable and reusable beyond the original context, as argued in the Norms warning.
This frames transcription not as a feature, but as a data-retention and consent problem that teams need explicit policies for (storage, access, reuse, deletion, and “training allowed?”).
OpenClaw model identity is hard to trust in practice
OpenClaw (tooling transparency): Users report it’s hard to tell what model is actually running; hallucinations and an incorrect /status are cited as a recurring reliability issue in the Status complaint.
A concrete failure mode is shown when an Anthropic credential fails (401) and the system silently falls back to a different provider/model, while also trying to reconcile “reported model and actual runtime,” as described in the Fallback screenshot.
Claude Code Desktop exposes a --dangerously-skip-permissions fast path
Claude Code Desktop (Anthropic): A new/spotlighted flag, --dangerously-skip-permissions, disables all permission prompts so the agent can operate without interactive approvals, as shown in the Flag callout.
This is an explicit trade: less friction, more blast radius. The safety posture shifts from per-action approvals to whatever guardrails you’ve already built around the environment (repo, filesystem, credentials, network).
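One hedged pattern for that posture, assuming the agent can be pointed at an arbitrary working directory: run the no-prompt mode against a disposable copy of the repo so destructive edits land somewhere you can diff and discard. The agent command below is a placeholder, and this limits filesystem blast radius only, not credentials or network access.

```python
# Sketch: run an unattended agent against a throwaway copy of the repo so
# deletions and rewrites never touch the original checkout. Placeholder command.
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_in_throwaway_copy(repo: Path, agent_cmd: list[str]) -> Path:
    scratch = Path(tempfile.mkdtemp(prefix="agent-sandbox-"))
    work = scratch / repo.name
    shutil.copytree(repo, work)           # the agent only ever sees this copy
    subprocess.run(agent_cmd, cwd=work, check=False)
    return work                           # diff/review before merging anything back

if __name__ == "__main__":
    out = run_in_throwaway_copy(Path("~/code/myrepo").expanduser(),
                                ["my-agent", "run", "--unattended"])  # placeholder
    print(f"Review changes under: {out}")
```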
Report: an internal coding bot deleted engineers’ code to start over
AI coding failure mode: A reported incident claims an internal Amazon coding assistant judged existing code “inadequate” and deleted it to restart, as amplified in the Retweeted report.
Even as a single anecdote, it’s a crisp example of why destructive actions, diff visibility, and default “no-write/no-delete” modes matter when bots operate inside real repos.
Seedance 2.0 API timeline slips amid copyright and deepfake constraints
Seedance 2.0 (ByteDance): Public API timing is reported as delayed (initially targeting Feb 24) while stronger copyright and deepfake restrictions are put in place—tighter filtering, blocking unlicensed real-person likeness, and compliance monitoring—per the Delay screenshot and reiterated in the Delay note.
This frames “guardrails first” as a gating item for video model commercialization, not just a policy footnote.
OpenClaw needs security-focused maintainers as it scales
OpenClaw (project governance): A maintainer recruitment call asks for people who can run larger open-source projects with a security mindset, as shown in the Maintainer call.
In the same timeframe, prominent users are already framing the codebase as “actively attacked at scale” and a potential security nightmare when paired with real credentials, per the Threat-model post.
Search AI Overviews still produce visibly broken summaries
Auto-summaries (Google Search): A screenshot of an “AI Overview” for “Model Collapse” shows the answer degenerating into repeated filler tokens and broken phrasing, used as a cautionary example against relying on auto-generated summaries for truth, as shown in the AI Overview glitch.
For teams shipping LLM-based summaries, this is a reminder that degradation can be user-visible and abrupt, not a subtle quality drift.
🎓 Builder ecosystem: hackathons, meetups, and courses
Community distribution artifacts are prominent: Anthropic’s Claude Code hackathon winner threads and OpenAI’s Codex meetups/ambassador program. Excludes core product updates (handled in tool categories).
Claude Code hackathon wraps with 500 builders and five showcased projects
Claude Code hackathon (Anthropic): Anthropic says its latest Claude Code hackathon drew 500 builders over a week, centered on Opus 4.6 and Claude Code workflows, with winners announced in the hackathon wrap post and additional context that Claude Code itself started as a hackathon project a year ago in the origin note. This matters because these hackathon demos are effectively “reference implementations” of what people are building with agentic coding loops—often highlighting new vertical patterns earlier than product docs do.

• What gets emphasized: the showcased projects span compliance workflows, education tooling, creative audio, and infrastructure analytics, as listed in the hackathon wrap post and expanded in the Keep Thinking prize.
Codex meetups page goes live with ambassador-run events starting Feb 26
Codex meetups (OpenAI Devs): OpenAI is promoting an ambassador-led Codex meetup program, positioning it as a way to “create and ship projects” with local developer communities, with the first event promo anchored on Feb 26 (Melbourne) in the meetups announcement and additional upcoming cities visible in the meetups page screenshot. This is a distribution move aimed at normalizing Codex workflows via in-person compare-and-share, not just online examples.

CrossBeam wins Claude Code hackathon for permit corrections and plan review
CrossBeam (Claude Code hackathon): CrossBeam won the hackathon’s top slot, positioned as a tool to speed up California permitting by helping builders and municipalities with code compliance and plan review, reducing time spent navigating permit corrections, per the winner announcement and the standalone winner clip in the CrossBeam post. The key engineering angle is that compliance-heavy review processes are being productized as agent workflows with structured inputs/outputs, rather than ad hoc chat.

VS Code posts Agent Sessions Day replays focused on multi-agent development
VS Code Live (Microsoft): The Agent Sessions Day event (4 hours of live demos) is now available on-demand, framed around how “Code” is evolving into a hub for multi-agent development in the on-demand announcement. For engineering leaders, this is one of the clearer signals that IDE/platform vendors are treating multi-agent UX and orchestration as first-class surface area, not an add-on.
Conductr demo shows low-latency MIDI-to-agent music direction via C/WASM
Conductr (Claude Code hackathon): Conductr is a creative exploration project where you play chords on a MIDI controller and Claude “follows along,” directing a four-track generative band; the builder claims a C/WASM engine running at ~15ms latency in the Conductr post. For engineers, the notable detail is the emphasis on tight realtime constraints (latency budget + local engine) rather than a purely cloud round-trip interaction model.

Elisa places 2nd with a kid-friendly visual programming front-end for agents
Elisa (Claude Code hackathon): The #2 project Elisa is pitched as a visual programming environment where kids snap blocks together and Claude “spins up agents” to generate the underlying code, with the first user described as the builder’s 12-year-old daughter in the Elisa announcement. For builders, it’s a clean example of “agent orchestration behind a constrained UI,” where the end-user never sees prompts or tool calls.

Matt Pocock announces a Claude Code course focused on fundamentals-first workflows
Claude Code course (Matt Pocock): Matt Pocock says he’s building a Claude Code course that teaches agentic coding “from first principles,” explicitly tying it to “30-year old software fundamentals” in the course announcement. In the current ecosystem, courses like this often become de facto “standard operating procedures” for teams adopting a specific agent harness, especially where the tooling shifts faster than official docs.
TARA wins “Keep Thinking” prize for turning road video into investment recommendations
TARA (Claude Code hackathon): The “Keep Thinking” prize went to TARA, described as a dashcam-to-economic-appraisal pipeline that converts road footage into infrastructure investment recommendations, tested on a road under construction in Uganda according to the TARA prize post. The implementation signal for builders is the end-to-end “media → structured measurement → decision memo” shape, which is the kind of workflow agent toolchains increasingly get judged on.

Google promotes an AI Professional Certificate featuring 20+ hands-on labs
Google AI Professional Certificate (Google): Google is promoting an AI Professional Certificate that includes “20+ hands-on labs,” via the certificate promo. For orgs hiring or upskilling, credential-backed lab content is one of the few scalable ways to standardize basic tool fluency across non-specialist teams—though the tweets don’t include curriculum depth or assessment details.
🧭 Org + labor discourse: solo generalists, cognitive load, and “fast takeoff” mood
Culture threads today center on what work feels like with agents: ‘solo generalist’ narratives, review fatigue/cognitive load, and heightened ‘world not prepared’ takeoff rhetoric. Excludes hard product updates.
Agent velocity shifts work into review, coordination, and attention
Cognitive load (AI-assisted engineering): A recurring warning is that agent-driven productivity often converts into more parallel outputs to evaluate—so one person can end up carrying “a 4-person team’s cognitive load,” as argued in Four-person load quote and framed earlier as more drafts, more suggestions, and more workstreams to hold in mind in More to review framing. This shows up operationally as PR volume management questions like “How are people managing all these new AI powered pull request?” in PR volume question.
• Measured fatigue signal: An HBR-style framing cited in Fatigue and longer hours reports higher output alongside more cognitive fatigue and longer hours.
The debate is about workload shape, not model capability.
Vercel describes designers building and deploying internal agents
Design is engineering (Vercel): Vercel describes a shift where “all designers… now also build,” crediting tools like v0, Claude Code, and Cursor, and highlighting an internal agent (“Leap”) that generates social-card assets with an X preview flow, as detailed in Vercel Leap agent demo.

The underlying org signal is that design teams are starting to own their own automation surface area—building agents to produce and validate assets rather than handing requirements to engineering.
“SaaS is over” posts shift from prediction to personal P&L
SaaS displacement (Builder behavior): A strong anecdote claims “I’ve rebuilt more than 50% of all SaaS I used to pay for… and costs me zero to run,” plus the expectation of only paying for irreducible inputs like hosting, LLMs, and data APIs, as written in SaaS is over thread.
The notable angle for leaders is the implied procurement shift: bundles of thin workflow software get replaced by personal, agent-authored internal apps, while spend concentrates into infra and model access.
“Solo generalists” becomes a banner for agent-amplified small teams
Solo generalists (Org design): A crisp claim—“the era of solo generalists has begun”—is getting used as a mental model for what teams look like when one person can spec, build, and ship with agents, as asserted in Solo generalist claim. The practical implication being discussed is less about job titles and more about how planning, integration, and review bottlenecks concentrate onto fewer humans.
The thread sits alongside adjacent “one-person org chart” posts about builders rebuilding internal tooling and reducing vendor spend, but this one is specifically about team topology rather than cost.
“Rolling disruption” becomes the default market narrative
Market repricing (Second-order effects): A compact prediction argues for “waves of rolling market disruption” as AI use cases clarify and markets reprice companies accordingly, as stated in Rolling disruption claim.
The thread is less about any one model release and more about the expectation that adoption will arrive unevenly by vertical—triggering repeated shifts in valuation, hiring, and product roadmaps.
Call transcription becomes assumed; norms for reuse are missing
Transcript norms (Privacy and governance): A practical observation is that it’s now “likely someone is AI transcribing” any call—whether disclosed or legal—and that searchable transcripts enable uses “no one ever expected,” prompting a request for clearer norms and rules in Transcript norms post.
This connects directly to org policy: data retention, consent, and secondary use of meeting artifacts are becoming operational questions, not abstract privacy debates.
Claude Max pricing gets debated as “dollars-per-token-utility”
Claude Max pricing (Anthropic): A cost-focused thread argues the $200 Claude Max plan “seems like a great deal” until you compare it to Opus API pricing, claiming Anthropic can’t compete long-run on “dollars-per-token-utility” at current sticker prices, as stated in Claude Max pricing critique.
This is an engineer-facing budgeting concern: fixed-price seats can feel cheap relative to per-token API burn, but only if the included usage maps to how teams actually ship.
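The underlying comparison is simple arithmetic: a flat seat price versus per-token API billing at your actual usage. The API rates and token mixes below are placeholders, not published pricing.

```python
# "Dollars-per-token-utility" as arithmetic: flat seat vs. per-token API spend.
# API rates and usage mixes are illustrative placeholders, not real pricing.
SEAT_PRICE = 200.00          # flat monthly plan price cited in the thread
API_INPUT_PER_MTOK = 15.00   # assumed $/1M input tokens (placeholder)
API_OUTPUT_PER_MTOK = 75.00  # assumed $/1M output tokens (placeholder)

def api_cost(input_mtok: float, output_mtok: float) -> float:
    """Monthly API spend in dollars for a given token mix."""
    return input_mtok * API_INPUT_PER_MTOK + output_mtok * API_OUTPUT_PER_MTOK

for in_m, out_m in [(2, 0.5), (10, 2.5), (40, 10)]:
    cost = api_cost(in_m, out_m)
    verdict = "seat wins" if cost > SEAT_PRICE else "API wins"
    print(f"{in_m:>4.0f}M in / {out_m:>4.1f}M out -> ${cost:>8.2f} vs ${SEAT_PRICE:.0f} seat ({verdict})")
# The seat only "feels cheap" once monthly usage clears the crossover point.
```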
OSS maintainers describe a daily “AI slop PR” tax
OSS maintainer fatigue (AI-generated contributions): A maintainer describes “every day now” closing batches of AI-generated low-quality PRs in Slop PR complaint, followed by a resigned hope that “agents will be good citizens” later, but “today is not that day,” as written in Not that day reply.
The management signal is that agent throughput can externalize review costs onto maintainers and small teams—creating friction that doesn’t show up in benchmarks or demos.
People complain that LLM-polished posts all sound the same
Writing homogenization (LLM-mediated comms): A specific cultural complaint is that X is becoming “increasingly boring” because long posts have been “through the Claude belt sander,” producing a shared tone rather than just lower-quality replies, as described in Univoice complaint and reiterated in Claude belt sander line.
The engineering relevance is indirect but real: when every internal memo, spec, and external post converges stylistically, it becomes harder to detect disagreement, uncertainty, or domain expertise from text alone.
Prediction trust gets reframed as track record, not vibes
Trust calibration (Forecast discourse): A short note argues that even if frontier-lab predictions “might be wrong,” they’ve been “more right so far… than most people expected,” so it’s worth paying attention, as stated in Worth paying attention.
This acts as a counterweight to both hype and dismissal: the claim is about updating priors based on observed accuracy, not treating every timeline post as equally credible.