Google AI Studio powered by Antigravity adds React and Next.js – 30TB Ultra upsell


Executive Summary

Google is rebranding AI Studio as “powered by Antigravity” for full‑stack app generation; demos pitch multiplayer apps, polished UI, and “secure connections” to Twilio/Slack/DBs, plus a card-based starter flow. The same week, Google cut off some Antigravity users for “malicious usage… hurting service quality,” triggering public backlash from maintainers who say they’ll remove Antigravity support. Quota pressure is now explicit in-product: “Model quota reached” dialogs show a hard refresh timestamp (e.g., 2/23/2026, 1:21:15 AM) and route users to Google AI Ultra for “highest rate limits,” with purchase screenshots listing 30TB storage and “highest limits” for async coding agents.

AI Studio surface area: framework picker now includes React and Next.js alongside Angular; “XR Blocks” is visible as “in development.”
Policy vs product tension: enforcement details are thin in the cutoff notice; dependency risk becomes a first-order constraint as Studio leans harder on Antigravity.

Net: Google is trying to turn a model playground into an agent harness; limits and bans are becoming part of the developer experience, not a background ops issue.


Feature Spotlight

Google AI Studio ↔ Antigravity: full‑stack agent builder hype meets access enforcement

Google is pushing AI Studio as “powered by Antigravity” for agent-built full-stack apps, but simultaneous access cutoffs/bans for abuse are creating immediate platform-risk and contingency planning for teams relying on it.




🧩 Google AI Studio ↔ Antigravity: full‑stack agent builder hype meets access enforcement

Big cross-account story today: Google is positioning AI Studio as “powered by Antigravity” for full-stack app building, while also cutting off some Antigravity users for abuse—forcing builders to reassess dependency risk and fallback plans. Excludes general model benchmark chatter, which is covered elsewhere.

Google AI Studio pitches “powered by Antigravity” full-stack app generation

AI Studio (Google): Google’s AI Studio is being marketed as “now powered by Antigravity” for generating full-stack apps, explicitly calling out multiplayer support, polished UI, and secure connections to real services like Twilio/Slack/DBs in the feature pitch clip; the UI flow is framed as “starting cards” in the starting cards demo. This matters because it’s an attempt to turn AI Studio into a productized agent harness (not just a model playground), where the assistant is expected to pick frontend tooling (Lucide React, Framer Motion) and manage secrets/service hookups as part of the build loop, per the feature pitch clip.

AI Studio full-stack demo

Anecdotally, some builders are already treating Antigravity-style scaffolding as their default “app babysitter,” with one post describing dev-server flicker while “Gemini generation finishes writing files” in the generation progress UI.

Google cuts off some Antigravity users, citing malicious usage and service quality

Antigravity (Google ecosystem): Google reportedly cut off access for some Antigravity users due to “malicious usage that was hurting service quality for other users,” as stated in the cutoff note. This lands awkwardly next to the same-week push to position AI Studio as “powered by Antigravity,” and it creates immediate dependency risk for anyone building workflows around it.

Developer reaction: One maintainer called Google’s action “pretty draconian,” said they’ll remove Antigravity support, and contrasted it with “Anthropic pings me and is nice about issues” in the draconian ban complaint.

Following up on BYO keys—prior quota-wait and “bring your own key” chatter around Antigravity—today’s cutoff framing adds an explicit enforcement/abuse narrative rather than just capacity constraints, per the cutoff note.

AI Studio adds React and Next.js options, with “XR Blocks” showing up in the picker

AI Studio (Google): AI Studio’s framework picker now shows React and Next.js alongside Angular, and it also surfaces an “XR Blocks” option that’s described as in development, according to the advanced settings screenshot. The immediate impact for builders is that the “build an app” agent flow is being positioned less as an Angular-only sandbox and more as a multi-framework scaffolder (with an explicit XR track), as shown in the advanced settings screenshot.

Google AI usage limits surface as “Model quota reached,” with AI Ultra as the upsell

Google AI plans (quota/limits): Builders are hitting “Model quota reached” prompts with a concrete refresh timestamp (“refresh on 2/23/2026, 1:21:15 AM”) and an explicit upgrade path to Google AI Ultra for “highest rate limits,” as shown in the quota reached prompt. Separate posts show users purchasing Ultra, where the entitlement list includes “highest limits to the asynchronous coding agents for software developers” plus access to reasoning/video models and 30 TB storage, per the Ultra confirmation screenshot.

A related thread frames the decision as driven by rate-limit exhaustion (“blowing through all my pro rate limits”) in the rate limit pressure post.


🧠 Codex in practice: capacity ramps, speed knobs, and multi-agent weekend builds

Codex-related posts are heavy today: OpenAI-side capacity scaling notes, user-perceived speed ups (plan tiers + multi-agent toggle), and lots of “what did you build” sharing. Excludes GPT‑5.3 rumor content (handled in Model Radar).

Running 50 Codex in parallel to triage PRs into JSON reports (no vector DB)

PR and issue triage (Codex): Peter Steinberger describes spinning up 50 Codex agents in parallel to analyze PRs and emit JSON reports with signals like “vision/intent,” risk, and dedupe clustering; he then ingests the reports into one session to query, de-dupe, auto-close, and merge, and repeats the flow for Issues—“Don’t even need a vector db,” per the parallel PR analysis writeup and the workflow quote.

Operational edge cases: The terminal output in the parallel PR analysis writeup also shows a real scaling failure mode—GitHub diff fetches can fail with HTTP 406 when a PR exceeds 300 files—so ingestion pipelines need fallbacks (e.g., file-list only, or chunked diffs) rather than assuming diffs are always retrievable.
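The ingest step is mostly ordinary data wrangling once the per-PR reports exist. A minimal sketch of the cluster-then-auto-close loop, with hypothetical report fields (`intent`, `risk`, `dupe_key`) standing in for the signals described rather than steipete's actual schema:

```python
from collections import defaultdict

# Hypothetical per-PR reports, standing in for what each Codex agent emits.
REPORTS = [
    {"pr": 101, "intent": "fix flaky retry test", "risk": "low", "dupe_key": "retry-test"},
    {"pr": 117, "intent": "stabilize retry test", "risk": "low", "dupe_key": "retry-test"},
    {"pr": 120, "intent": "add OAuth device flow", "risk": "high", "dupe_key": "oauth-device"},
]

def cluster_dupes(reports):
    """Group PRs that share a dedupe signature."""
    clusters = defaultdict(list)
    for r in reports:
        clusters[r["dupe_key"]].append(r["pr"])
    return dict(clusters)

def triage(reports):
    """Keep the first PR in each cluster; queue the rest for auto-close."""
    keep, close = [], []
    for prs in cluster_dupes(reports).values():
        keep.append(prs[0])
        close.extend(prs[1:])
    return keep, close
```

The point of the pattern is that structured reports make bulk actions (close, merge, dedupe) queryable with plain code, which is why no vector DB is needed.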

OpenAI scaled Codex compute in Feb beyond its entire prior ramp

Codex capacity (OpenAI): OpenAI says it “brought more compute online in February to sustain Codex demand than… the entire period since its inception,” while also noting reliability improvements and no major outages “in a while,” per the capacity and reliability note. The implication for teams is that Codex usage is now being rate-limited more by fleet ops than by early-access gating, and that stability is being treated as a first-class product constraint.

Only Codex Spark runs on Cerebras; other GPT‑5.3‑Codex speedups are elsewhere

GPT‑5.3‑Codex serving path (OpenAI): OpenAI’s Codex lead clarifies that only the Spark variant is served via Cerebras, and that “all speed optimizations for GPT‑5.3‑Codex are something different,” with more speed improvements expected, according to the serving-stack clarification. This is a concrete follow-up on Spark speedup (earlier throughput jump), and it suggests builders shouldn’t assume “Cerebras-backed” performance characteristics apply to every Codex tier.

OpenAI’s Head of Codex says the next 10 weeks will make today’s agents look primitive

Roadmap pace signal (OpenAI): A roundup account quotes OpenAI’s Head of Codex saying he’s “beyond excited” for the next 10 weeks, and that today’s coding agents will soon look “so primitive that it will be funny,” as relayed in the 10-week tease quote. Taken alongside OpenAI’s compute ramp for Codex demand in the capacity note, the message is that iteration speed and scaling ops are being treated as coupled product workstreams.

ChatGPT Pro speed claim: up to 20% faster Codex plus /experimental Multi‑Agents

Codex speed knobs (ChatGPT): A practitioner claims the ChatGPT Pro subscription makes Codex “up to 20% faster on the inference side,” and recommends enabling /experimental Multi‑Agents to trade spend for more parallel iteration, as described in the speed and multi-agent tip. Treat this as anecdotal until OpenAI publishes plan-level latency/throughput deltas, but it’s a clear signal that “iteration rate” is becoming a user-facing tuning axis.

sound4movement ships v1.0.0 of a Codex-to-Ableton Live music workflow tool

Codex + Ableton Live (sound4movement): A builder reply to OpenAI’s weekend prompt says they shipped v1.0.0 of an open-source system to “make music with Codex” in Ableton Live, with a working demo shown in the v1.0.0 shipped reply and a follow-on link in the project announcement.

Codex music tool demo

This is a concrete example of Codex getting used as a “glue engineer” for creative tooling (API wiring, scripting, packaging), not only app backends.

Codex web-search discoverability gap shows up in user frustration threads

Web access expectations (Codex in ChatGPT): One user reports bailing on Codex after getting “solutions that aren’t even real,” attributing it to not having internet access and lacking the ability to fix things on the computer, per the Codex vs Claude complaint. Another reply claims there’s a toggle to enable web search in ChatGPT and that “Codex can also search the web,” as stated in the web search toggle note. Even if capabilities exist, this reads like a product discoverability problem that directly affects troubleshooting workflows.

OpenAI DevRel runs a Codex weekend build thread and pulls in project replies

Codex adoption signal (OpenAI DevRel): OpenAI’s dev account explicitly prompts builders with “What did you build with Codex this weekend?”, creating a lightweight public feedback loop about real usage, per the weekend build prompt. Replies and adjacent chatter reinforce the “weekend project” usage pattern—see the weekend projects comment for the vibe of how Codex is being used outside work hours.

Long-running Codex sessions become normal: letting it run while you wait

Long-run agent ergonomics (Codex): One small but telling workflow note—thdxr describes “letting codex run while i stare at the spinners,” framing waiting on a long-running agent task as normal background activity, per the flight spinners comment. In practice, this pushes teams toward better progress reporting, resumability, and “come back later” task designs rather than chat-first tight loops.


🧑‍💻 Claude Code: parallelism habits and desktop friction signals

Claude Code remains a daily-driver for many builders: more “run many in parallel” advice and some UX friction reports (permission prompts, session switching). Excludes Anthropic’s tool-calling research patterns (covered under Agent Frameworks).

Worktrees are becoming the default primitive for parallel Claude Code runs

Claude Code workflows: Following up on Worktree default (using --worktree as a coordination primitive), a concrete playbook is circulating for running “a bunch of Claude Code’s at the same time” on one repo to boost throughput, as described in the parallel worktrees note.

Worktrees walkthrough

Claude Code Desktop on Windows is prompting “bypass permissions” on every session switch

Claude Code Desktop (Anthropic): A Windows user reports a UX regression where they must re-select “bypass permissions” every time they switch sessions, creating approval fatigue in multi-session workflows, as stated in the bypass prompt complaint.

The claude-3-7-sonnet-latest model alias is returning 404s in the API

Anthropic API / model availability: A scheduled digest job logs an Anthropic API 404 for model: claude-3-7-sonnet-latest, implying the alias was removed or renamed and breaking pinned configs, as shown in the 404 error log.

Anthropic rate limiting is showing up as “auth profile in cooldown” in agent ops

Claude Code scale pain: A multi-agent Telegram automation shows repeated failures where claude-sonnet-4-6 can’t run because “No available auth profile… (rate_limit)” and “provider… is in cooldown,” which is the kind of operational friction that shows up once teams run lots of agents in parallel, as captured in the rate limit errors.
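A generic client-side mitigation is to pool auth profiles and track cooldowns explicitly, so a rate-limited profile is skipped instead of retried into a wall. This is a sketch of that idea (class name and policy are ours, not the automation's actual code):

```python
import time

class ProfilePool:
    """Rotate auth profiles, skipping any still in rate-limit cooldown."""

    def __init__(self, names, cooldown_s=60):
        self.cooldown_s = cooldown_s
        self.until = {n: 0.0 for n in names}  # earliest reuse time per profile

    def acquire(self, now=None):
        now = time.monotonic() if now is None else now
        for name, t in self.until.items():
            if t <= now:
                return name
        return None  # every profile cooling down: queue the task or fail fast

    def report_rate_limit(self, name, now=None):
        now = time.monotonic() if now is None else now
        self.until[name] = now + self.cooldown_s
```

The `acquire() -> None` branch is the important design choice: it forces the orchestrator to decide between queueing and failing, rather than hammering a provider that has already said no.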

Claude Code Desktop gets a direct endorsement for front-end iteration loops

Claude Code Desktop (Anthropic): A builder recommendation frames Claude Code Desktop as particularly good for iterating on front-end design, leaning on the embedded preview loop described earlier in Embedded previews and reiterated in the desktop recommendation.

Claude Code hits 1 year with an in-person community celebration

Claude Code (Anthropic): Claude Code’s first birthday is being marked with an in-person event photo and a “thanks for celebrating with us” note in the birthday post, which is a small but real adoption signal for teams treating Claude Code as a daily driver.

“B.C. = Before Claude” is the latest shorthand for how fast Claude Code normalized

Claude Code culture signal: A quip that “B.C. refers to ‘Before Claude’” in the before Claude quip reflects how quickly Claude Code has become ambient tooling in some engineering circles—useful context when reading adoption and workflow claims.


🦞 OpenClaw maintainer ops: PR triage automation, releases, and scaling pain

OpenClaw/OpenClaw-adjacent posts shift from hype to day-2 operations: taming PR/issue volume with agent parallelism, shipping betas, and navigating security advisory noise. Excludes Google/Antigravity enforcement impacts (covered in the feature).

50 parallel Codex agents for OpenClaw PR/issue triage, with JSON signal reports

OpenClaw PR triage (steipete): Spinning up 50 Codex agents in parallel to analyze each PR and emit a structured JSON report is his current answer to maintainer-scale review load, following up on PR volume (AI PR firehose) with a concrete operational loop in the parallel Codex workflow. It’s optimized around diff-derived intent/vision (higher signal than text), risk, and dedupe signals—then he ingests all reports into one session to query, de-dupe, auto-close, or merge without standing up a vector DB, as described in the parallel Codex workflow.

The terminal log screenshot in the parallel Codex workflow shows what this looks like in practice: semantic dupe clustering across ~900 markdown artifacts; progress telemetry for PR ingestion; and a real failure mode where gh pr diff can hard-fail with HTTP 406 when a PR exceeds GitHub’s 300-file diff cap (meaning you need a fallback plan for “too big to diff” PRs). He also applies the same flow to Issues, explicitly reframing “Prompt Requests” as issues with additional metadata in the parallel Codex workflow.
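The “too big to diff” fallback can be isolated as a small strategy function. A sketch under stated assumptions: the fetcher callables are stand-ins for `gh pr diff` / `gh pr view`-style calls, not OpenClaw or gh internals:

```python
def fetch_pr_context(number, fetch_diff, fetch_file_list):
    """Try the full diff first; fall back to the changed-file list when
    the diff fetch fails (e.g., GitHub's HTTP 406 on 300+-file PRs)."""
    try:
        return {"kind": "diff", "body": fetch_diff(number)}
    except RuntimeError:
        return {"kind": "files", "body": fetch_file_list(number)}

# Simulated fetchers: PR 9 stands in for a "too big to diff" PR.
def fake_diff(n):
    if n == 9:
        raise RuntimeError("HTTP 406: diff exceeds GitHub's 300-file cap")
    return f"diff for PR {n}"

def fake_files(n):
    return ["src/app.py", "src/utils.py"]
```

A file-list-only report is lower signal than a diff, but it keeps the pipeline from silently dropping exactly the PRs most likely to need human attention.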

OpenClaw “CHUNKY” beta rolls out with a deliberate regression buffer

OpenClaw (steipete): A new “CHUNKY” OpenClaw beta is up, with the maintainer explicitly waiting a few hours before flipping the switch to catch regressions, and asking users to report blockers not present in the prior release, as requested in the beta announcement. The same thread frames the release as adding “love for @MistralAI” for people looking for alternatives to Google, as reiterated in the release note.

This is a maintainer ops move more than a feature brag: it’s an explicit staging window plus a lightweight “blocker report” process to avoid shipping breakage into a high-churn agent ecosystem.

OpenClaw maintainer pushes back on auto-generated “disable auth” security noise

OpenClaw security ops (steipete): The maintainer is frustrated with auto-generated security reporting that flags config naming for intentionally unsafe escape hatches, asking how else to name options “specifically designed to disable auth,” as argued in the naming complaint.

The attached advisory screenshot in the naming complaint shows a bot-opened GHSA-style report that treats a “dangerouslyDisable…” option as a high-severity issue. The operational takeaway is that once a project crosses a certain scale, security automation can become an inbox tax unless you add conventions (naming, docs, linting) that distinguish “intentionally unsafe debug mode” from unintended auth bypass.

OpenClaw reaches ~#2 open-source project by GitHub stars

OpenClaw (community): OpenClaw is reported as hitting roughly #2 OSS on GitHub stars, landing in the React/Linux/Python tier, per the star-count snapshot in the stars milestone.

For AI engineers watching ecosystems, the point isn’t vanity metrics—it’s that this level of visibility tends to correlate with PR/issue volume spikes, more security reports (good and noisy), and stronger pressure to formalize triage/release processes.

Running an OpenClaw-like stack on an old Android phone instead of a Mac mini

OpenClaw self-hosting (itsPaulAi): A walkthrough claims you can host “everything you need” for an OpenClaw-style setup on an old Android phone, positioning it as “much faster” and “way cheaper,” with a floor claim of a $25 phone, as stated in the Android hosting claim.

Android phone OpenClaw demo

This is a concrete ops angle for maintainers and power users: if the setup is real, it suggests a wider base of always-on self-hosters (more nodes, more forks, more PRs) and shifts the default hardware assumption away from desktop boxes.

Code review tooling anxiety: “all dead in a year” as AI PR volume spikes

Code review workflow (davidgomes): A maintainer-oriented post argues it’s “possible [code review tools are] all dead in a year,” while also framing the piece as a “love letter to Graphite,” per the review tools post.

Even without specific OpenClaw mechanics, it matches the same underlying pressure: as agents generate more PRs, teams may migrate from “human review UI” toward triage automation + structured signals + bulk actions (close/merge/dedupe), with review tools needing to adapt to that throughput model.


📊 Benchmark churn: Gemini 3.1 Pro dominance, “benchmaxxing,” and evaluator bottlenecks

Today’s feed is saturated with leaderboard screenshots and benchmark meta: Gemini 3.1 Pro appears on multiple arenas/indices, while researchers warn about judge-model bottlenecks and flawed proxies for reasoning. Excludes product shipping updates for specific coding tools.

CAIS Text Capabilities Index puts Gemini 3.1 Pro at the top overall average

CAIS Text Capabilities Index: A shared table shows Gemini 3.1 Pro averaging 61.6 overall, with a notably high ARC-AGI-2 score (73.3) alongside strong SWE-Bench (75.8) and Terminal Bench (67.0), as shown in the Text capabilities table.

This kind of composite index is increasingly how “reasoning + coding” narratives get packaged: single-number averages that can be dominated by one standout column (here, ARC-AGI-2).

Gemini 3.1 Pro Preview leads SVG Arena by an unusually wide margin

SVG Arena (Design Arena): Gemini 3.1 Pro Preview is shown as the top SVG-generation model with an Elo of 1421—an 87-point lead—based on ~92K crowd votes, per the SVG Arena leaderboard data.

This is a practical signal for teams using LLMs to generate icons/diagrams/UI SVG assets; it also raises the stakes on how “benchmarkable” SVG output has become (prompting follow-on debate covered elsewhere in the feed).

Benchmark saturation (“benchmaxxing”) is making fast model feel-tests harder

Benchmark culture: A growing complaint is that community benchmarks are getting “benchmaxxed,” so simple vibe tests (SVGs, Minecraft-ish tasks) stop being informative as models are tuned against them—starting from the holy benchmaxxing reaction and the follow-up that “SVGs were fun while it lasted” in the vibe-test frustration thread.

Where it goes next: one take is that “all benchmarks will evaporate until only reasoning benchmarks remain,” as argued in the reasoning-only claim.

The implication for evaluators is less about any single leaderboard and more about churn: what was a quick proxy last month becomes a training target this month.

HalluHard comparison positions Gemini 3.1 Pro as mid-pack on hallucination rate

HalluHard: A shared bar chart puts Gemini 3.1 Pro at a hallucination rate of 57.1, with lower (better) rates shown for “Claude-Opus-4.5-Web-Search” at 30.2 and “GPT-5.2-thinking-Web-Search” at 38.2, according to the Hallucination chart.

The practical takeaway is not “who wins” but that reliability narratives now depend heavily on whether web search is enabled and on which exact variant is being compared, as the Hallucination chart layout makes visually obvious.

Token count gets challenged as a reasoning-quality metric

Reasoning measurement (Google Research): A circulated Google paper summary argues that token count is a poor proxy for actual reasoning quality, per the paper recap.

This lands right in the middle of ongoing “effort control” debates (longer chains of thought, inference-time reasoning knobs, and judge-model selection), but the thread itself is focused on measurement—what to log and optimize for—rather than on any single model result.

Vision Capabilities Index screenshot frames Gemini 3.1 Pro as the vision leader

Vision Capabilities Index: A benchmark table shared as “Google miles ahead… in multimodal understanding” ranks Gemini 3.1 Pro highest by average (62.1), leading categories like spatial navigation (MindCube 84.1) and embodied reasoning (ERQA 74.2), per the Vision index table.

This is the kind of evidence that’s starting to drive model routing decisions in multimodal products: pick one model family for vision-heavy tasks even if you prefer another for coding.

Ad-hoc “combo Connections” test shows Gemini 3.1 Pro fast and accurate

Ad-hoc evaluation: In a custom stress test combining five NYT Connections puzzles into one 80-word mega-prompt, Gemini 3.1 Pro Preview reportedly solved 4/5 with ~3 minutes average time, while Opus 4.5 (high reasoning) went 1/5 with ~38 minutes, as detailed in the combo puzzle results.

The same combo puzzle results note Grok 4.1 at 0/5 (~8 minutes) and GPT-5.2 xHigh stalling partway through, which highlights how “time-to-answer” can dominate perceived capability in puzzle-like domains.

ALE-Bench screenshot claims Gemini 3.1 Pro SOTA on hard optimization tasks

ALE-Bench (Sakana AI): A claim circulating is that Gemini 3.1 Pro is SOTA on Sakana’s ALE-Bench (algorithmic optimization problems “with no known solution”), as stated in the ALE-Bench claim.

No evaluation artifact is included in the tweets, so treat it as an unverified scoreboard claim until the underlying run details are available.

A simple finger-counting test gets used as a multimodal reality check

Multimodal spot-check: A side-by-side screenshot meme uses finger counting as a quick VLM sanity check; one panel shows ChatGPT responding “I see 5 fingers,” while the other shows Gemini responding “I see 6 fingers,” as shown in the finger counting screenshots.

It’s a tiny, non-scientific test, but it keeps showing up because it’s fast, visual, and exposes the gap between “confident description” and “grounded perception” in everyday multimodal UX.

AlgoTune: Gemini 3.1 Pro scores high, but users question benchmark validity

AlgoTune: A leaderboard screenshot shows GPT-5.2 at 2.07× and Gemini 3.1 Pro Preview at 2.02×, with Claude Opus 4.6 down at 1.47×—and the poster explicitly says they “don’t really trust” the benchmark because some rankings “make no sense,” per the AlgoTune leaderboard.

The key point is less the ordering and more the growing pattern: benchmark results increasingly ship with a built-in “validity disclaimer,” which complicates using them for procurement/model-routing decisions.


🛰️ Model radar: GPT‑5.3 rumors, Grok coding timeline claims, and context window bumps

Release speculation and model-surface deltas are circulating: GPT‑5.3 “Garlic” rumor posts, Musk’s Grok-coding timeline claims, and a reported ChatGPT Thinking context increase. Excludes benchmark leaderboard screenshots (handled in Benchmarks).

GPT‑5.3 “Garlic” Feb 26 rumor spreads, framed as a GPT‑3→4‑scale jump

GPT-5.3 “Garlic” (OpenAI): A rumor thread predicts a Thu Feb 26 release and frames it as “a HUGE leap” and “a GPT‑3 to GPT‑4 moment again,” as stated in the release prediction thread, which is backed mostly by a SimpleBench table screenshot.

The post’s concrete hook is the SimpleBench “human baseline 83.7%” reference and the claim that “previous model[s]” are far below that level, as shown in the release prediction thread; a separate meme post repeats the same date rumor via “Day 0 without OpenAI rumors,” per the rumor meme. No OpenAI confirmation appears in today’s tweets.

Musk sets Grok coding targets: close by April, similar by May, better by June

Grok Code (xAI): Elon Musk claims xAI “will get pretty close by April and roughly similar by May,” and “probably better by June when Colossus 2 is fully operational,” arguing leading coding models will be “hard to tell the difference” between, as shown in the timeline screenshot.

This is a timeline assertion, not a shipped change. The operational dependency is explicit (Colossus 2 capacity), which makes it as much an infra claim as a model-quality claim per the timeline screenshot.


🧰 Agent framework engineering: tool-calling patterns, RLMs, and observability primitives

Framework-level content today is concrete: Anthropic-style advanced tool calling patterns, LangGraph/LangSmith production agent writeups, and RLM (recursive language model) tooling discussions. Excludes MCP/skills discovery standards (handled under Orchestration).

Tool search + defer_loading: stop paying 75K tokens upfront for tool schemas

Tool discovery (Anthropic): When you have many tools, loading every schema upfront can consume tens of thousands of tokens; the pattern is to mark infrequent tools as defer_loading: true and let the model discover them through a “tool search” step, with Anthropic citing ~85% reduction in tool-definition tokens (77K → 8.7K) in the Tool search note and reiterated in the Thread segment on defer loading.

Tool search and defer loading

This is a concrete knob for long-context agent systems where tool catalogs grow faster than context windows.
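The accounting behind the reduction is easy to sketch with made-up numbers: only frequently-used tools ship full schemas upfront, while deferred tools contribute a small searchable stub. Catalog entries and token counts below are hypothetical, not Anthropic's figures:

```python
# Hypothetical tool catalog: schema_tokens approximates the cost of each
# tool's full JSON schema; "hot" marks tools worth loading upfront.
CATALOG = {
    "get_weather":   {"schema_tokens": 900,  "hot": True},
    "send_invoice":  {"schema_tokens": 1400, "hot": False},
    "query_crm":     {"schema_tokens": 2100, "hot": False},
    "create_ticket": {"schema_tokens": 1100, "hot": True},
}

def upfront_tokens(catalog, defer_cold=True, stub_tokens=25):
    """Tokens spent on tool definitions at conversation start.

    With defer_cold=True, infrequent tools contribute only a small
    name+description stub; their full schema is loaded on demand after
    a tool-search hit, mirroring the defer_loading pattern."""
    total = 0
    for entry in catalog.values():
        if entry["hot"] or not defer_cold:
            total += entry["schema_tokens"]
        else:
            total += stub_tokens
    return total
```

The trade-off is an extra search round-trip the first time a cold tool is needed, which is why the pattern targets infrequent tools rather than the whole catalog.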

Anthropic advanced tool calling: programmatic tool calls instead of JSON-emitting models

Advanced tool calling (Anthropic): A pattern for tool-heavy agents is to run a controller script that calls tools directly, rather than prompting the model to emit tool-call JSON; Anthropic claims ~37% token reduction with this approach, as summarized in the Advanced tool calling breakdown and shown in the Programmatic tool call clip.

Programmatic tool calling walkthrough

It moves the brittle part (tool invocation) into normal code you can test, diff, and reuse.
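A minimal sketch of the controller shape, assuming a hard-coded plan where the real pattern would use model output, and toy tool functions in place of real APIs:

```python
# The model (stubbed out here) proposes a plan as plain step names; the
# controller executes real, tested Python functions, so no tool-call JSON
# ever has to be parsed out of model output.
TOOLS = {
    "fetch_orders": lambda: [{"id": 1, "total": 40}, {"id": 2, "total": 60}],
    "sum_totals":   lambda orders: sum(o["total"] for o in orders),
}

def run_plan(plan):
    """Execute a linear plan, piping each step's result into the next."""
    result = None
    for step in plan:
        fn = TOOLS[step]
        result = fn() if result is None else fn(result)
    return result

# In the real pattern the plan (or generated code) comes from the model;
# it is hard-coded here for illustration.
total = run_plan(["fetch_orders", "sum_totals"])
```

Because intermediate results never round-trip through the model, the token savings come for free with the testability.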

LangSmith Insights Agent adds scheduling for recurring trace-pattern jobs

LangSmith Insights Agent (LangChain): LangSmith’s Insights workflow for grouping traces and surfacing emergent usage patterns now supports scheduled recurring jobs, as announced in the Insights Agent update.

Scheduling recurring Insights jobs

It formalizes “observability-as-a-cron,” so token/caching regressions and new failure clusters can be detected without manual dashboard work.

Dynamic filtering: run code to extract the crux from HTML before the model reads it

Dynamic filtering (Anthropic): Instead of stuffing raw HTML into context, the agent runs code to extract only the “crux,” cutting prompt bloat; Anthropic reports ~24% fewer input tokens on average, per the Dynamic filtering example that follows the Advanced tool calling thread.

Dynamic filtering demo

This looks like “tool use to pre-digest tool output,” which tends to improve both cost and tool-following stability.
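A stdlib-only sketch of the idea, assuming the “crux” is marked by a known CSS class; the page and class name are invented for illustration:

```python
from html.parser import HTMLParser

class CruxExtractor(HTMLParser):
    """Keep only text inside elements with a target class, so the model
    sees a few tokens instead of the whole page."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capture = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == self.target_class:
            self.capture = True

    def handle_endtag(self, tag):
        self.capture = False

    def handle_data(self, data):
        if self.capture and data.strip():
            self.found.append(data.strip())

PAGE = "<html><body><div class='ad'>BUY NOW</div><span class='price'>$19.99</span><p>filler text</p></body></html>"
parser = CruxExtractor("price")
parser.feed(PAGE)
# parser.found (not PAGE) is what gets handed to the model.
```

In the agent version, the model writes an extractor like this on the fly for the page at hand; the win is identical either way: the prompt carries `["$19.99"]` instead of the full HTML.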

Exa’s deep research agent: LangGraph orchestration plus LangSmith cost observability

Deep research agent (Exa + LangChain): Exa describes a production “deep research agent” built as a multi-agent system with LangGraph, delivering structured web answers; they highlight LangSmith observability as key to understanding token usage/caching and setting pricing, quoting that “the observability—understanding the token usage—… was really important,” as written in the Exa build summary with the reference link echoed in Case study link.

This is a practical reminder that “agent UX” often collapses into cost instrumentation once you have real traffic.

Jido 2.0 ships as an agent pattern for Elixir/GenServer systems

Jido 2.0 (Elixir): Jido 2.0 is now live, with the author framing it as a formalized agent pattern built on GenServer—not “a better GenServer”—in the Jido 2.0 launch and clarified in the Agent pattern statement; the project is also summarized plainly as “Agents in Elixir” in the Short descriptor, with a broader “semantic web agent” ambition hinted in the Semantic web framing.

The claims are about structure (agent pattern + supervision semantics), not model choice.

Tool-use examples: improve complex JSON parameter accuracy from 72% to 90%

Tool-use examples (Anthropic): For tools with optional fields and conditional dependencies, providing explicit “how to call this tool” examples is positioned as a practical fix for malformed parameters; Anthropic cites an accuracy lift from 72% to 90% on complex parameter handling, per the Tool use examples claim in the same thread as Advanced tool calling breakdown.

Tool use examples segment

It’s a low-effort addition that targets a common production failure mode: syntactically valid but semantically wrong tool calls.
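A sketch of what such examples look like attached to a tool definition. The `input_examples` field name follows Anthropic's advanced tool-use material but should be treated as beta and subject to change; the tool itself is invented:

```python
# Hypothetical tool with optional, conditionally-dependent fields:
# "rrule" only makes sense when "recurring" is true, which is exactly
# the kind of constraint worked examples teach better than schema alone.
create_event = {
    "name": "create_event",
    "description": "Create a calendar event.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "recurring": {"type": "boolean"},
            "rrule": {"type": "string"},
        },
        "required": ["title"],
    },
    "input_examples": [
        {"title": "1:1 with Sam"},
        {"title": "Standup", "recurring": True, "rrule": "FREQ=DAILY"},
    ],
}

def examples_satisfy_schema(tool):
    """Cheap lint: every example supplies the required fields."""
    required = set(tool["input_schema"]["required"])
    return all(required <= set(ex) for ex in tool["input_examples"])
```

A lint like `examples_satisfy_schema` is worth running in CI, since a stale example teaches the model the wrong call shape.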

Recursive Language Models resurface with new trace tooling and REPL backends

Recursive Language Models (DSPy.RLM): RLMs are getting another wave of attention via explainer content and tooling, including a “new video on Recursive Language Models” in the RLM video teaser, a GenerateAgents.md project built with dspy.RLM for codebase-wide processing per the RLM codebase scanning note, and new ecosystem tools like an “interactive DSPy RLM trace explorer” in the Trace explorer mention plus dspy-repl for non-Python REPL-based RLM engines in the REPL engine note.

The common thread is treating recursion/iteration as the first-class control structure, not a single prompt-response call.


🧭 How builders are actually shipping with agents: throughput, context discipline, and limits

Practitioner workflow notes today focus on coordinating multiple agents, avoiding overengineering (e.g., no vector DB), and acknowledging that ‘vibe coding’ isn’t the same as engineering. Excludes tool-specific releases (Codex/Claude/OpenClaw) which have their own sections.

Agent shipping still bottlenecks on prod hardening, not code generation

Engineering reality check: The slogan “vibe coding is easy, engineering is still hard” is getting grounded in the work that agents don’t compress much yet—migrating infra to IaC, wiring telemetry/SLOs, setting up HSMs, and locking down production write paths, as described in the Infra migration note, following the broader framing in the Vibe coding line.

Security-critical edges: Huntley calls out PKI as a domain that “can’t or shouldn’t be vibe engineered,” where agents help but the final design still reflects human security experience and customer trust requirements, per the PKI trust boundary note.

The net effect is a split workflow: agents accelerate feature work, while reliability/security work remains the pacing item.

A blunt prompt to keep bug-finding agents searching

Long-horizon debugging loop: A deliberately adversarial prompt—“I know for a fact there are at least 87 serious bugs… can you find and fix all of them autonomously?”—is being used to push agents past their usual “looks good” stopping point, as described in 87 bugs prompt.

The claimed mechanism is motivational: if the agent believes the codebase is still broken, it keeps exploring until it finds concrete failures, per 87 bugs prompt. It’s an explicit trade: higher persistence at the cost of a harsher interaction style.

A “single smartest addition” prompt for late-stage agent plans

Planning prompt: A lightweight way to shake a project plan out of local maxima is to ask a frontier model, “What’s the single smartest and most radically innovative and accretive addition you could make to the plan at this point?”, as shared in Plan improvement prompt.

It’s explicitly framed as a late-stage move—after you think the plan is “done”—and it also ports to in-flight builds by swapping “plan” for “project,” per Plan improvement prompt.

Hiring screens start to test “can you run 5+ coding agents?”

Hiring signal: A practical skill test is emerging around operating multiple coding agents in parallel—“send me a screen recording of you operating 5+ coding agents competently,” as quoted in Screen recording request.

This frames agent throughput as an observable competency (tooling setup, task decomposition, supervision, verification) rather than a resume line, per Screen recording request.

Agents as communication tools: intent tracking beats content quality

Communication dynamics: A small but repeatable workflow signal is that AI replies can be content-poor while the model still “perfectly understand[s] the point,” which changes how builders use models in public and internal threads—more like intent mirrors than authoritative answers, per AI replies intent.

The contrast being drawn is social, not technical: humans may miss intent, while models often track it even when the output is weak, as argued in AI replies intent.


🔌 Skills & interop plumbing: “.well-known/skills” and shrinking the stack

Light but high-signal protocol talk: Cloudflare-style skill discovery proposals and pushback against overcomplicated “skills stacks.” Excludes product-specific bans/enforcement (feature) and library-level tool calling (Agent Frameworks).

Skill discovery proposal: publish /.well-known/skills and point agents at /api

Skill discovery (Cloudflare RFC idea): A lightweight convention is being floated where a site publishes agent “skills” at /.well-known/skills, and those skill descriptors link to callable endpoints under /api; the pitch is that agents can discover capabilities without a new framework, and reuse existing auth patterns for gated endpoints, as outlined in the [RFC sketch](t:56|RFC sketch).

The practical appeal is operational: it gives agents a predictable discovery URL (like /.well-known/* standards) while keeping the actual tool surface in normal web routing and auth flows, per the [implementation note](t:56|implementation note).
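To make the convention concrete, here is a minimal Python sketch of the discovery step. The descriptor shape (a "skills" array with "name" and "endpoint" fields) is an assumption for illustration; no schema was published in the thread.

```python
# Hypothetical /.well-known/skills descriptor parsing: map skill names
# to absolute endpoint URLs, reusing normal web routing conventions.
import json
from urllib.parse import urljoin

def parse_skills(base_url: str, body: str) -> dict[str, str]:
    """Map skill names to absolute endpoint URLs from a descriptor."""
    doc = json.loads(body)
    return {
        skill["name"]: urljoin(base_url, skill["endpoint"])
        for skill in doc.get("skills", [])
    }

# Illustrative descriptor an agent might fetch from /.well-known/skills.
descriptor = """
{"skills": [
  {"name": "send_sms", "endpoint": "/api/sms", "auth": "bearer"},
  {"name": "list_orders", "endpoint": "/api/orders"}
]}
"""
skills = parse_skills("https://example.com/.well-known/skills", descriptor)
```

Auth stays where it already lives: a gated endpoint just returns 401 and the agent goes through the site's existing flow.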

Pushback on “skills stacks”: most skills may only need a hint and full content

Skills schema (minimal contract): A counterpoint in the same thread argues that “skills” are getting overbuilt, and that for most practical agent integrations the schema can collapse to two fields—“a hint” plus the “full content”—instead of elaborate manifests or new stacks, as stated in the [schema-minimal take](t:221|Schema-minimal take).

This lines up with the discovery idea in the [well-known skills note](t:56|Well-known skills note): keep discovery and tool invocation simple, and let auth/tooling complexity live in existing web infrastructure rather than inventing a parallel ecosystem.


🏗️ Compute economics: capex scale and memory-market shocks (AI-adjacent)

Infra posts today are about inputs to AI capacity: hyperscaler capex scale and DRAM pricing pressure from Chinese vendors. Excludes first-party Codex capacity notes (covered under Codex).

US hyperscaler capex pegged at ~$646B in 2026 (~2% GDP) in a widely shared chart

Hyperscaler capex (US cloud majors): A chart making the rounds claims US hyperscalers will spend about $646B in capex in 2026 (~2% of US GDP), framing it as comparable to Singapore/Sweden/Argentina GDP and larger than the combined military spending of Germany, France, the UK, Japan, Italy, and Canada, per the Capex comparison list. This matters operationally because capex scale tends to show up later as pricing power (or lack of it) for GPU instances, networking, and “AI platform” bundles.

Procurement signal: The same comparison set explicitly anchors capex against consumer spending growth and bank loan growth in the Capex comparison list, which is a useful shorthand for analysts modeling how durable AI infra demand is versus other macro forces.

Treat the numbers as directional until you can tie them back to a primary source (earnings guidance / capex plans), since the tweet is a secondary aggregation.

CXMT reportedly undercuts DDR4 DRAM prices by ~50% even as spot prices spike

DRAM pricing (CXMT / DDR4): A supply-side shock is being discussed where China’s CXMT is said to be selling DDR4 at nearly half the prevailing market price, even as spot pricing reportedly jumped 23.7% in a month to $11.50 and is claimed to be 8× year-on-year, according to the DRAM undercut claim. For AI infra buyers this isn’t about HBM directly; it’s about the system-RAM portion of server BOM and whether memory constraints ease (or whipsaw) for CPU-heavy retrieval, data prep, and embedding-heavy pipelines.

The post doesn’t include a source artifact beyond a link-out, so the exact mechanism (inventory clearing vs sustained subsidy/pricing strategy) is still unspecified in today’s thread.

Why some fast-growing AI dev tools avoid owning GPUs: inference providers + oversupply risk

GPU ownership strategy (inference outsourcing): One operator argues that “bigger company ⇒ bring GPUs in-house” is no longer automatic; instead, lots of capital is flowing into specialized inference providers that can’t serve OpenAI/Anthropic models and therefore chase open-source/private-model workloads, as laid out in the GPU procurement rationale. The explicit risk framing is that under-building capacity is worse than over-building, so the market may swing into oversupply, and high-volume customers could end up with unusually strong leverage, per the GPU procurement rationale.

This is mostly an economics/strategy signal, but it maps to a concrete engineering consequence: how aggressively teams invest in model portability, routing, and benchmarking across providers versus betting on a single in-house cluster.


🛠️ Developer tools & OSS drops: agent-parallel web dev, local search, and Rust rewrites

Non-assistant tooling is active: agent-friendly dev utilities (portless, visual-json), fast local hybrid search projects, and large Rust-from-scratch systems work. Excludes assistant product news and benchmark screenshots.

FrankenSearch: Rust-native lexical+semantic hybrid search with fsfs app

FrankenSearch (doodlestein): A standalone Rust-native 2-tier search system (lexical + semantic) was extracted into FrankenSearch, plus a reference app (fsfs) for indexing/searching local files; the announcement calls out “Everything-like” speed plus semantic search, a curl-bash installer, and a very large prebundled binary (627MB on mac) because it bakes in two CPU-friendly embedding models, per the project launch. The same author frames it as part of a broader Rust-from-scratch toolchain push in the Franken* roadmap.

Hybrid file search demo

Operational shape: It’s positioned as drop-in for Rust projects (Elastic-class capabilities with less config), but with trade-offs around binary size and baked model selection, as detailed in the project launch.

portless adds broad framework e2e coverage after compatibility fixes

portless (ctatedev): The CLI shipped a patch focused on framework compatibility, then added end-to-end tests spanning a long list of web stacks—meant to make multi-agent parallel dev less brittle because the “no ports” assumption holds across more real projects, as described in the release note and reiterated via the follow-up link.

Test matrix expansion: Coverage now includes Next.js, Svelte, Nuxt, Vite, Remix, Astro, Angular, Hono, Express, FastAPI, and Flask, per the release note.

Toad fuzzy path search cuts subinterpreter startup from ~300ms to under 50ms

Toad (willmcgugan): Further performance work on fuzzy path searching reduced multi-core Python subinterpreter startup overhead from ~300ms to under 50ms by minimizing imports inside the interpreter, and fixed an accidental “multiple parallel scans” bug; the thread also calls out that Path.resolve() can touch the filesystem and briefly block asyncio, so it should be pushed to a thread, as explained in the perf tuning notes.
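The Path.resolve() point generalizes to any event-loop code. A minimal asyncio sketch of the suggested fix (the names here are illustrative, not Toad's code):

```python
# Path.resolve() can hit the filesystem (symlink traversal), so inside
# an event loop it is safer to run it in a worker thread than inline.
import asyncio
from pathlib import Path

async def resolve_path(p: str) -> Path:
    # Offload the potentially blocking syscall to the default executor.
    return await asyncio.to_thread(Path(p).resolve)

async def main() -> Path:
    return await resolve_path(".")
```

The same pattern applies to os.stat, glob scans, and anything else that looks pure but can block on disk.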

Path truncation demo

visual-json lands in json-render playground with manual edits

visual-json (ctatedev): The json-render playground now includes a visual JSON editor that supports manual edits (not just generated output), enabling a tighter “agent proposes, human adjusts” loop for structured data UIs, as shown in the playground demo and echoed by downstream embedding work in the integration example.

Drag-and-drop JSON edits

FrankenEngine/FrankenNode: from-scratch Rust JS runtime stack with extensive specs

FrankenEngine + FrankenNode (doodlestein): Work continues on a memory-safe, “hyper-optimized” Rust replacement stack spanning a JS engine and runtime (positioned as beyond bun/node and even V8-level components), with unusually detailed public architecture/spec docs meant to drive implementation, according to the project announcement and the deep dive into “native architecture synthesis” in the design doc excerpt.

Scope signal: The author also lists parallel “Franken*” rewrites (e.g., libc/FS/Numpy/Torch/Jax/Redis) as active efforts, per the design doc excerpt.


💼 Market & enterprise signals: SaaS moat erosion, IT services repricing, and adoption realism

Business/enterprise discussion today centers on how agentic coding changes cost structures and moats (e.g., SAML), plus slower-than-hype adoption dynamics inside companies. Excludes pure infra supply-chain items (in Infrastructure).

Indian IT services repricing: ~$50B erased as agentic coding threatens long contracts

Indian IT services (market signal): A thread claims roughly $50B of market value in Indian IT services was erased in ~30 days, citing large drawdowns across major firms and arguing that agentic coding collapses the labor-arbitrage model behind multi‑year services contracts, as framed in market-cap breakdown.

Contract compression narrative: The same post points to ERP migration timelines potentially shrinking from “years to 2 weeks” (attributed to Palantir) and to “Claude Cowork” making captive centers cheaper than outsourcers, per the market-cap breakdown.

Most of this is directional commentary (not an audited analysis), but it’s a clean articulation of why “implementation cost” matters directly to public multiples and services demand.

Adoption realism: companies move slower than AI hype because jaggedness + coordination

Enterprise AI adoption (org reality check): A recurring claim is that people overestimate how quickly companies can deeply adopt AI; task-level change can be fast, but coordinating around model jaggedness and integrating across workflows is slower because of inertia and system-building overhead, per adoption-inertia point.

The same argument emphasizes that disruption can still come in waves, just not all at once, as restated in coordination-systems take time.

SaaS moat erosion: SAML and other “hard features” stop being defensible complexity

SAML (SaaS moat example): A concrete moat argument is resurfacing: features that were historically delayed because they were painful to implement (SAML is the cited archetype) may go from “months to days” with coding agents, eroding one class of feature-based defensibility, as argued in moat-by-complexity thread.

The same thread stresses that this doesn’t eliminate distribution moats (trust, switching costs, network effects), but it does change the cost structure of shipping “table-stakes enterprise checkboxes,” per the moats-still-exist caveat.

Agent-native entrants: building workflows from scratch as the near-term advantage

Agent-native workflows (Box): A post argues the near-term opening isn’t “incumbents die overnight,” but that new service providers can get a large productivity multiple by building agent-native processes from the ground up while incumbents are held back by fragmented data, missing documentation, and change management, as described in agent-native entrants thesis.

It also frames an internal path: teams inside existing companies can be the ones to transform workflows, but the constraint is organizational plumbing rather than model capability.

AI adoption distribution: only ~0.3% pay for premium subscriptions (echo-chamber gap)

AI subscription penetration (adoption signal): A chart-based post claims only ~0.3% of the global population pays for premium AI subscriptions (~15–25M people), with ~1.3B using free chatbots and ~6.8B having never used AI tools, as shown in adoption dot-plot.

A follow-up reaction reframes this as “echo chamber vs real world,” per echo-chamber comment, which matters for forecasting enterprise seat growth and willingness to pay.

Klarna CEO: software valuations compressing from ~30× sales to ~10×, maybe 1–2×

Klarna (valuation signal): Klarna CEO Sebastian Siemiatkowski says software valuations have already dropped from ~30× sales to ~10×, and could fall further toward 1×–2× (utility-like multiples), as quoted in valuation compression clip.

Clip on valuation multiples

The throughline for AI engineers is that “code abundance” narratives are showing up directly in public-market valuation expectations, not just product roadmaps.

Forecast: AI-agent web searches may exceed human searches soon

Web search demand shift (agents): A short prediction says the number of web searches issued by AI agents will exceed human searches “quite soon,” per agent-search claim.

There’s no dataset attached, so treat it as directional, but it aligns with the practical reality that agents turn browsing into a background subroutine—and that has implications for rate limits, bot mitigation, and content surfaces that remain crawlable.


🛡️ Security & policy frictions around agents (non-feature)

Outside the Google/Antigravity enforcement feature, security talk is mostly about safe defaults and repository hygiene: auth-disabling knobs, bot noise, and how advisories get generated. Excludes weapon/blueprint content and any bioscience content by requirement.

PKI still isn’t “vibe engineerable,” even with strong agents

PKI / infra hardening: Security-critical systems still bottleneck on correctness and trust, not code generation—Geoffrey Huntley argues that “pki remains one of the things that can't or shouldn't be vibe engineered” in the PKI reflection even while agents help with implementation.

What “still hard” looks like: He describes multi-day work spanning HSM setup, Terraform Cloud, full IaC + telemetry (SLO/SLI), and “locking down prod” for an “agentic write path,” as laid out in the infra migration notes.

This sits in tension with the broader “vibe coding is easy, engineering is still hard” refrain in the engineering aphorism, but with concrete examples of where human judgment remains the primary control surface.

Claude Code Desktop on Windows reportedly re-prompts “bypass permissions” per session

Claude Code Desktop (Anthropic): A user reports a regression/UX change where they must select “bypass permissions” every time they switch sessions in the Windows app, even after previously enabling it, per the permissions complaint.

This follows the earlier introduction of a “skip prompts” fast path—see Skip prompts flag for the prior context—so the net effect for some users is more session-switch friction right when multi-session workflows are becoming common.

OpenClaw maintainer pushes back on auto-generated security advisory “slop”

OpenClaw (openclaw): A maintainer complains that automated security tooling is generating noisy advisories around intentionally unsafe escape hatches—specifically reacting to an advisory titled “dangerouslyDisableDeviceAuth eliminates WebSocket device identity” in the advisory screenshot, and arguing it’s unclear how else to name a config that exists to disable auth.

The practical friction for teams is that “dangerous” toggles are often necessary for debugging, airgapped installs, or migration bridges, but automated triage can treat the presence (or naming) of the knob itself as a high-severity issue.

Educators look for grading methods that can’t be outsourced to LLMs

Education policy response: Ethan Mollick argues that it’s not hard for educators to detect what’s happening, and that they will shift toward methods that evaluate student performance (not AI output), responding to “will my professor know?” style cheating-product positioning shown in the cheating pitch screenshot.

For builders, this is a reminder that policy and institutional adaptation tends to target the evaluation mechanism (how work is verified), not the existence of the tool.

Repo hygiene: maintainers ban “me too” bots from issues

OSS maintenance / bot noise: Will McGugan says he “just banned a bot” from his repositories because it was adding guilt-trippy “me too” comments to issues, as described in the bot ban note.

This is a small but real signal that AI-driven participation can increase maintainer load unless it is constrained to high-signal behaviors (actionable reproductions, minimal duplicates, or code changes) rather than engagement-shaped comments.


Verification, reviews, and keeping agent output mergeable

Quality-control remains the constraint: mutation testing, spec/scenario approaches, and maintainers pushing back on low-signal bot contributions. Excludes benchmark methodology papers (in Benchmarks).

Running 50 Codex agents in parallel to review PRs via JSON signals

PR review automation (Codex): A concrete “review at scale” workflow is emerging where you run dozens of coding agents in parallel, have each produce a structured JSON report (intent vs diff, risk, duplication clusters), then ingest all reports into one session to query, dedupe, auto-close, or merge—without needing a vector DB, as described in the Parallel Codex PR triage setup.

Why diffs beat text: The author calls out that “vision/intent” inferred from actual changes is higher-signal than PR description text, per the Parallel Codex PR triage thread.

Where it breaks: Their ingestion hit a GitHub edge case where gh pr diff fails with HTTP 406 when the diff exceeds 300 files, which becomes a reliability constraint for any agent-based PR mirror, as shown in the Parallel Codex PR triage.

The key operational idea is shifting from “agent writes comments” to “agent emits machine-readable review artifacts” that you can audit and act on quickly.
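A minimal sketch of what that aggregation step can look like; the report fields (`pr`, `risk`, `duplicate_of`) are hypothetical stand-ins for whatever schema the agents are told to emit.

```python
# Coordinator side of the "machine-readable review artifact" pattern:
# each agent emits a small JSON report per PR; one session loads them
# all and queries/dedupes without any vector DB.
import json
from collections import defaultdict

def load_reports(raw_reports: list[str]) -> list[dict]:
    return [json.loads(r) for r in raw_reports]

def cluster_duplicates(reports: list[dict]) -> dict[str, list[int]]:
    """Group PR numbers by the agent-assigned duplicate-cluster key."""
    clusters = defaultdict(list)
    for r in reports:
        clusters[r["duplicate_of"]].append(r["pr"])
    return dict(clusters)

def high_risk(reports: list[dict], threshold: int = 7) -> list[int]:
    """PRs whose agent-assigned risk score meets the threshold."""
    return [r["pr"] for r in reports if r["risk"] >= threshold]
```

Because the artifacts are plain JSON, every downstream action (auto-close, merge queue, escalation) stays auditable.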

Mutation testing to tighten tests against agent-driven semantic drift

Mutation testing (Claude + Clojure): Continuing Mutation testing (agent-built mutation tester), the author reports having Claude write a Clojure mutation tester and frames mutation testing as a way to find “nearly all gaps” in a test suite and raise “semantic stability” when models edit internals, as explained in the Mutation testing rationale.

They argue Clojure is a particularly good fit because it’s easy to parse and transform, which makes it practical to generate many program variants and see whether tests actually constrain behavior, per the Mutation testing rationale.
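The author's tool is in Clojure; this is a language-swapped illustration of the core loop: generate a program variant, run the tests against it, and flag mutants the tests fail to kill.

```python
# Toy mutation tester: flip the first '+' into '-' (a classic
# arithmetic mutant) and check whether a test suite catches it.
import ast

class FlipAddSub(ast.NodeTransformer):
    """Mutate the first '+' into '-'."""
    def __init__(self):
        self.done = False
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add) and not self.done:
            self.done = True
            return ast.BinOp(node.left, ast.Sub(), node.right)
        return node

SRC = "def total(a, b):\n    return a + b\n"

def run_mutant(src: str, test) -> bool:
    """Return True if the test suite kills (fails on) the mutant."""
    tree = FlipAddSub().visit(ast.parse(src))
    ast.fix_missing_locations(tree)
    ns = {}
    exec(compile(tree, "<mutant>", "exec"), ns)
    try:
        test(ns["total"])
        return False   # mutant survived: a test gap
    except AssertionError:
        return True    # mutant killed: tests constrain this behavior

def strong_test(total):
    assert total(2, 3) == 5   # kills the +/- mutant

def weak_test(total):
    total(2, 3)               # exercises code but asserts nothing
```

A surviving mutant is exactly the "semantic drift" risk: a model could make the same edit and the suite would stay green.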

Using specs plus Gherkin scenarios to keep agent rewrites honest across languages

Spec-first rewrites (Claude): Following up on Gherkin specs (scenario-driven agent code), a detailed case study describes generating an initial spec + Gherkin-like scenarios from an old C game, then using those to rewrite into Clojure, and later into web-playable JavaScript using the recovered C source as reference, as documented in the Rewrite workflow notes.

The author reports the JS build took 3+ hours of model time plus a few hours of human fixes for omissions and UI errors, and flags scenario incompleteness (and missing source files) as the main source of churn, per the Rewrite workflow notes and Project references.

A forcing prompt to keep agents searching for bugs instead of stopping early

Agent verification prompting: A practitioner shares a “forcing function” prompt that claims “there are at least 87 serious bugs” and challenges the agent to find/fix them, reporting it keeps the agent working longer instead of concluding early, as described in the Bug-count forcing prompt.

The mechanism is psychological rather than technical: you’re anchoring the model on the expectation of remaining defects, which can change stopping behavior during review-style loops, per the Bug-count forcing prompt.

Auto-generated security advisories collide with “dangerous” config naming

OSS security triage (OpenClaw): A maintainer complaint highlights how automated security-reporting systems can produce noisy or low-context advisories when they encounter intentionally scary config flags like dangerouslyDisableDeviceAuth, as argued in the Naming complaint post.

The screenshotted advisory claims high severity impact (“eliminates WebSocket device identity”) and notes no patched versions, which is the sort of output that can trigger downstream churn in user orgs even when the underlying issue is partly semantics and intent, as shown in the Naming complaint.

Code review tools may get reshaped by AI PR volume pressure

Code review workflow (Graphite): A review-tools writeup argues the whole category could be “dead in a year,” while still reading as a “love letter to Graphite,” reflecting uncertainty about what stays valuable when AI increases PR volume and changes how diffs get produced and consumed, per the Code review tools post.

Maintainers start banning “me too” issue bots

Repo hygiene (maintainers): One maintainer reports banning a bot from their repositories because it posted a low-signal “me too” on an issue—“probably correct” but not worth the noise—capturing a growing pushback against automated interactions that increase triage load, as stated in the Bot ban note.


🎓 Builder gatherings & distribution: conferences, meetups, and benchmarking meetups

Today has several community distribution hooks (events and meetups) rather than new tool releases: conferences around agents, voice AI, and Claude Code community momentum. Excludes product changelogs (in their respective tool categories).

Trace event advertises 500+ AI builders and hands-on workshops

Trace (Braintrust): Braintrust is promoting “Trace” as happening this week with “500+ AI builders,” plus hands-on workshops and live demos focused on teams “shipping quality AI,” as stated in the event announcement. It’s an explicit distribution push around production agent evaluation and trace-level success metrics.

Trace promo clip

The main relevance for practitioners is that trace-centric eval (multi-turn + tool calls) is being treated as a shared community topic, not a niche infra concern.

Claw conference announced for London (Apr 8–10)

Claw conference (OpenClaw ecosystem): swyx is recruiting speakers for a London “claw conference” happening Apr 8–10, per the conference invite, which is a concrete signal that the Claw/OpenClaw builder scene is starting to organize around in-person coordination. Short timeline.

For engineers maintaining agent tooling, this kind of gathering tends to compress alignment on portability norms (skills/MCP-style interfaces) and operational expectations (security, reliability) into a few days.

Voice AI meetup invite paired with Claude Sonnet 4.6 latency benchmark

Voice agent meetup (community): kwindla shared a voice-agent benchmark run where Claude Sonnet 4.6 hits 100% with ~850ms median TTFT and Claude Haiku 4.5 hits 98% with ~637ms TTFT, then linked the benchmark code and invited builders to an upcoming voice AI meetup, per the benchmark and invite and RSVP details. These numbers matter because realtime agent builders are typically constrained by “good enough + low latency,” not just pass rate.

Community loop: publishing the run details alongside a meetup invite makes latency regressions/improvements easier to compare across teams using the same harness, as described in the benchmark and invite.

Claude Code community marks its first birthday with an in-person celebration

Claude Code (Anthropic): Boris Cherny posted photos celebrating “Happy 1st birthday to Claude Code,” showing a packed in-person meetup, as captured in the birthday post. This is a distribution signal: Claude Code isn’t just a product surface; it has an active builder community that organizes offline.

The practical implication for tool builders is that Claude Code workflows (and adjacent plugins) are becoming shareable “community defaults,” not just individual setups.

Opencode team sets an in-person SF coffee meetup window

Opencode (community meetup): thdxr said they’ll be in San Francisco with some of the opencode team and offered a two-hour coffee shop meetup window (10am–12pm) for anyone who replies, as posted in the meetup invite. It’s a small but direct distribution channel where builders can swap war stories about multi-agent workflows and harness ergonomics in person. Short slot. Low coordination overhead.


🤖 Embodied automation from China: field robots, kiosks, and humanoid demos

Multiple clips highlight real deployments: humanoids in public spaces and robots doing hazardous or repetitive work (electricians, trains, agriculture, kiosks). Excludes compute/evals discussions.

China scales robotic electricians for live high-voltage operations

Robotic electricians (China deployments): Footage shows robots performing live high-voltage electrical work—positioned as a large-scale rollout where machines handle the hazardous steps and humans supervise exceptions, as described in the deployment clip. This is a concrete signal that vision + manipulation stacks are moving from lab demos into utility-style operational settings, where reliability, safety envelopes, and maintenance workflows matter as much as model quality.

Robots operating on live lines

The clip doesn’t reveal autonomy level (teleop vs scripted vs learned), but it highlights the direction of travel: embodied systems taking on regulated, high-risk tasks with repeatable procedures and well-defined failure handling.

China scaling agricultural robots: vision-guided picking with human exception handling

Agricultural robots (China field automation): A clip frames agricultural robots as scaling toward 24/7 harvest cadence—“vision models pick, arms place, logistics sync,” with humans supervising exceptions, per the field robot video. This matters because it’s one of the hardest deployment environments for perception systems (occlusion, variable lighting, delicate objects), and it tends to force real engineering around calibration drift, failure triage, and fleet ops.

Vision-guided fruit picking

The post is high-level and doesn’t quantify accuracy, speed, or labor displacement; it’s still a clear signal of where embodied AI investment is being pointed.

Humanoid robot attendants piloted on China high-speed trains during peak travel

Humanoid attendants (China rail): China is piloting humanoid robot attendants on high-speed trains during the Spring Festival travel rush—explicitly framed as a chaos-handling test in crowded public spaces, per the train aisle demo (and echoed in the thread context). The point for automation teams is that this is closer to “messy real world” deployment than staged robotics showcases: navigation around people, interaction protocols, and recoveries from edge cases become the product.

Robot attendant on a train

What’s still unclear from the posts is how much is autonomy versus remote assistance, and what the operational safety constraints look like in daily service.

Fully automated robotic coffee kiosk: 1–2 minute drinks and custom latte art

Robotic coffee kiosk (China street deployment): A fully automated kiosk is shown making coffee end-to-end, with the post claiming 24/7 operation, 1–2 minute turnaround, and the ability to print custom images onto foam for latte art, as shown in the kiosk video. For embodied AI and automation leaders, this is an example of a narrow task with clear UX and throughput targets—where robustness, consumables handling, and remote monitoring tend to dominate over “reasoning” benchmarks.

Robotic arm coffee kiosk

The clip doesn’t clarify how much is vision-driven vs pre-programmed motion, but the packaging and street siting imply a focus on operational reliability.

Humanoid robot night-run clip signals improving real-world locomotion

Humanoid locomotion (public-space demo): A short clip shows a humanoid robot jogging at night on a real street, presented as “somewhere in China,” in the night run video (also reposted in the repost clip). Even without specs, this kind of footage is a steady reminder that perception-plus-control stacks are being exercised outside controlled lab floors, where uneven lighting and uncontrolled surroundings are baseline.

Humanoid jogging at night

No details are provided on sensing, power, or autonomy; treat it as a capability signal rather than a verified deployment claim.


🎬 Generative media & vision apps: text-to-video vibes, timelapse workflows, and model UX gaps

Creator tooling remains active: Seedance 2.0 clips, Lyria music reactions, and end-to-end workflows (Freepik Spaces) showing how non-ML builders chain tools. Excludes robotics demos (separate category).

Seedance 2.0 demos lean into special-effects quality for raw text-to-video

Seedance 2.0 (ByteDance): More builders are circulating clips that emphasize special effects and cinematic pacing from raw text-to-video prompts, with reactions like “very special” and “raw text 2 vid output” in the special effects post and broader “unreal” praise in the demo clip post.

Special effects clip

The near-term signal for engineers is less about API availability and more about the emerging prompt-to-trailer workflow: short, high-impact sequences that can be generated in minutes, as implied in the minutes-made claim. This is the kind of output that tends to get productized quickly into “generate a teaser” UX (story beats + camera language + style preset), because it’s easy to judge qualitatively without building an eval harness.

Freepik Spaces timelapse workflow turns one photo into reusable video variations

Freepik Spaces (Freepik): A practical “prompt graph” workflow is getting shared for generating timelapse-style videos from a single garden photo—generate a grid of candidate clips, then iterate by editing the prompt and re-running nodes in the same Space, as shown in the workflow walkthrough.

Timelapse variations workflow

The engineering-relevant bit is the structure: treat the workflow as an artifact (inputs renamed for promptability; extraction prompts to reuse wording; lightweight edit loops) rather than a one-off chat. The thread also implies a reusable template approach (“swap out my photos for yours”), which is a pattern teams can mirror for internal brand-safe pipelines (prompt blocks + reference assets + review checkpoints).

Gemini 3.1 Pro visualization demos revive the “AI Studio vs app” UX gap

Gemini 3.1 Pro Preview (Google): A 3D solar-system visualization demo is being cited as evidence that AI Studio’s higher-reasoning setting can feel stronger than the consumer Gemini app for complex visualization tasks, per the solar system visualization demo.

Solar system visualization

For teams building vision-heavy features, this surfaces a product concern: “reasoning level” is effectively becoming a user-visible control knob, and different front-ends may expose different defaults/ceilings. That makes reproducibility tricky (“works in Studio, not in app”), and it pushes engineers toward pinning model+reasoning configs in their own harnesses rather than relying on consumer UI behavior.
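Pinning means the exact model ID and reasoning setting travel with every output, so a “works in Studio, not in app” report is diagnosable from the artifact itself. A minimal sketch; the model ID, `reasoning_level` knob, and `generate` callable are assumptions for illustration, not a real API:

```python
import json

# Assumed config shape; the model ID and knob names are illustrative.
PINNED = {
    "model": "gemini-3.1-pro-preview",
    "reasoning_level": "high",
    "temperature": 0.2,
}

def call_with_pinned_config(generate, prompt: str) -> dict:
    """Wrap any generate(prompt, **config) callable so every result records
    the exact model + reasoning config that produced it."""
    result = generate(prompt, **PINNED)
    return {"config": dict(PINNED), "prompt": prompt, "result": result}

record = call_with_pinned_config(
    lambda prompt, **cfg: f"ran {cfg['model']} at {cfg['reasoning_level']}",
    "render the solar system as an interactive 3D scene",
)
print(json.dumps(record["config"], indent=2))  # the config travels with the output
```

The design choice is that reproducibility lives in your harness, not in whichever front-end default happened to be active.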

Lyria 3 music reactions focus on non-English quality (Bhojpuri example)

Lyria 3 (Google): A hands-on Bhojpuri example is getting highlighted as a “didn’t expect it to be this good” moment, suggesting Lyria 3’s musicality/generalization is landing even outside English-first pop templates, per the Bhojpuri song demo.

Bhojpuri music demo

From a product angle, the tweet is a reminder that creator adoption tends to follow language and genre coverage, not benchmark charts: if the model clears “sounds native enough” for specific regional styles, it can become a default for fast iteration (hooks, beds, jingles) even before anyone agrees on formal audio evals.


🎙️ Voice agents: latency and pass-rate benchmarks start to stabilize

Voice-agent-specific benchmarking shows up with concrete latency numbers and model fit discussion. Excludes general LLM leaderboards (Benchmarks category).

Claude Sonnet 4.6 posts a 100% pass rate on a voice agent benchmark with sub-1s TTFT

Voice agent benchmark (independent): A public “LLM Voice Agent” benchmark run reports Claude Sonnet 4.6 at 100% pass rate with 850ms median time-to-first-token (TTFT), positioning it as the fastest model in that suite that fully saturates the tasks, according to the Benchmark writeup.

The same run puts Claude Haiku 4.5 at 98% pass rate with 637ms median TTFT, and shows how it stacks up against other commonly used low-latency options like Gemini 3 Flash Preview (100% pass; 1107ms median TTFT) and GPT-5.1 (98% pass; 739ms median TTFT) as captured in the Benchmark writeup.

Why this is operationally meaningful: The thread calls out that hosted models “evolve” and that latency can move independently of capability, which is why they re-ran the leaderboard and also continuously monitor latency, as described in the Benchmark writeup.
Reproducibility hook: The author shares that the benchmark code is available and ties it to a voice AI meetup, as noted in the Benchmark code pointer.
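Continuously monitoring latency alongside pass rate boils down to tracking two summary statistics per run. A minimal sketch of that reduction, using illustrative numbers rather than the benchmark's actual per-task data:

```python
import statistics

def summarize_run(samples):
    """samples: list of (passed, ttft_ms) pairs, one per benchmark task.

    Returns (pass_rate_percent, median_ttft_ms) -- the two numbers the
    leaderboard reports, which can drift independently as hosted models change.
    """
    pass_rate = 100.0 * sum(1 for passed, _ in samples if passed) / len(samples)
    median_ttft = statistics.median(ttft for _, ttft in samples)
    return pass_rate, median_ttft

# Illustrative per-task data only.
rate, ttft = summarize_run([(True, 820.0), (True, 850.0), (True, 910.0), (False, 700.0)])
```

Median TTFT (rather than mean) is the sensible choice here because first-token latency is heavy-tailed; one slow cold start shouldn't move the headline number.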

Net: for teams doing realtime, tool-using voice flows, this is a concrete datapoint that Claude’s latency profile has shifted enough to compete in the “fast-but-correct” tier, per the Benchmark writeup.

On this page

Executive Summary
Feature Spotlight: Google AI Studio ↔ Antigravity: full‑stack agent builder hype meets access enforcement
🧩 Google AI Studio ↔ Antigravity: full‑stack agent builder hype meets access enforcement
Google AI Studio pitches “powered by Antigravity” full-stack app generation
Google cuts off some Antigravity users, citing malicious usage and service quality
AI Studio adds React and Next.js options, with “XR Blocks” showing up in the picker
Google AI usage limits surface as “Model quota reached,” with AI Ultra as the upsell
🧠 Codex in practice: capacity ramps, speed knobs, and multi-agent weekend builds
Running 50 Codex in parallel to triage PRs into JSON reports (no vector DB)
OpenAI scaled Codex compute in Feb beyond its entire prior ramp
Only Codex Spark runs on Cerebras; other GPT‑5.3‑Codex speedups are elsewhere
OpenAI’s Head of Codex says the next 10 weeks will make today’s agents look primitive
ChatGPT Pro speed claim: up to 20% faster Codex plus /experimental Multi‑Agents
sound4movement ships v1.0.0 of a Codex-to-Ableton Live music workflow tool
Codex web-search discoverability gap shows up in user frustration threads
OpenAI DevRel runs a Codex weekend build thread and pulls in project replies
Long-running Codex sessions become normal: letting it run while you wait
🧑‍💻 Claude Code: parallelism habits and desktop friction signals
Worktrees are becoming the default primitive for parallel Claude Code runs
Claude Code Desktop on Windows is prompting “bypass permissions” on every session switch
The claude-3-7-sonnet-latest model alias is returning 404s in the API
Anthropic rate limiting is showing up as “auth profile in cooldown” in agent ops
Claude Code Desktop gets a direct endorsement for front-end iteration loops
Claude Code hits 1 year with an in-person community celebration
“B.C. = Before Claude” is the latest shorthand for how fast Claude Code normalized
🦞 OpenClaw maintainer ops: PR triage automation, releases, and scaling pain
50 parallel Codex agents for OpenClaw PR/issue triage, with JSON signal reports
OpenClaw “CHUNKY” beta rolls out with a deliberate regression buffer
OpenClaw maintainer pushes back on auto-generated “disable auth” security noise
OpenClaw reaches ~#2 open-source project by GitHub stars
Running an OpenClaw-like stack on an old Android phone instead of a Mac mini
Code review tooling anxiety: “all dead in a year” as AI PR volume spikes
📊 Benchmark churn: Gemini 3.1 Pro dominance, “benchmaxxing,” and evaluator bottlenecks
CAIS Text Capabilities Index puts Gemini 3.1 Pro at the top overall average
Gemini 3.1 Pro Preview leads SVG Arena by an unusually wide margin
Benchmark saturation (“benchmaxxing”) is making fast model feel-tests harder
HalluHard comparison positions Gemini 3.1 Pro as mid-pack on hallucination rate
Token count gets challenged as a reasoning-quality metric
Vision Capabilities Index screenshot frames Gemini 3.1 Pro as the vision leader
Ad-hoc “combo Connections” test shows Gemini 3.1 Pro fast and accurate
ALE-Bench screenshot claims Gemini 3.1 Pro SOTA on hard optimization tasks
A simple finger-counting test gets used as a multimodal reality check
AlgoTune: Gemini 3.1 Pro scores high, but users question benchmark validity
🛰️ Model radar: GPT‑5.3 rumors, Grok coding timeline claims, and context window bumps
GPT‑5.3 “Garlic” Feb 26 rumor spreads, framed as a GPT‑3→4‑scale jump
Musk sets Grok coding targets: close by April, similar by May, better by June
🧰 Agent framework engineering: tool-calling patterns, RLMs, and observability primitives
Tool search + defer_loading: stop paying 75K tokens upfront for tool schemas
Anthropic advanced tool calling: programmatic tool calls instead of JSON-emitting models
LangSmith Insights Agent adds scheduling for recurring trace-pattern jobs
Dynamic filtering: run code to extract the crux from HTML before the model reads it
Exa’s deep research agent: LangGraph orchestration plus LangSmith cost observability
Jido 2.0 ships as an agent pattern for Elixir/GenServer systems
Tool-use examples: improve complex JSON parameter accuracy from 72% to 90%
Recursive Language Models resurface with new trace tooling and REPL backends
🧭 How builders are actually shipping with agents: throughput, context discipline, and limits
Agent shipping still bottlenecks on prod hardening, not code generation
A blunt prompt to keep bug-finding agents searching
A “single smartest addition” prompt for late-stage agent plans
Hiring screens start to test “can you run 5+ coding agents?”
Agents as communication tools: intent tracking beats content quality
🔌 Skills & interop plumbing: “.well-known/skills” and shrinking the stack
Skill discovery proposal: publish /.well-known/skills and point agents at /api
Pushback on “skills stacks”: most skills may only need a hint and full content
🏗️ Compute economics: capex scale and memory-market shocks (AI-adjacent)
US hyperscaler capex pegged at ~$646B in 2026 (~2% GDP) in a widely shared chart
CXMT reportedly undercuts DDR4 DRAM prices by ~50% even as spot prices spike
Why some fast-growing AI dev tools avoid owning GPUs: inference providers + oversupply risk
🛠️ Developer tools & OSS drops: agent-parallel web dev, local search, and Rust rewrites
FrankenSearch: Rust-native lexical+semantic hybrid search with fsfs app
portless adds broad framework e2e coverage after compatibility fixes
Toad fuzzy path search cuts subinterpreter startup from ~300ms to under 50ms
visual-json lands in json-render playground with manual edits
FrankenEngine/FrankenNode: from-scratch Rust JS runtime stack with extensive specs
💼 Market & enterprise signals: SaaS moat erosion, IT services repricing, and adoption realism
Indian IT services repricing: ~$50B erased as agentic coding threatens long contracts
Adoption realism: companies move slower than AI hype because jaggedness + coordination
SaaS moat erosion: SAML and other “hard features” stop being defensible complexity
Agent-native entrants: building workflows from scratch as the near-term advantage
AI adoption distribution: only ~0.3% pay for premium subscriptions (echo-chamber gap)
Klarna CEO: software valuations compressing from ~30× sales to ~10×, maybe 1–2×
Forecast: AI-agent web searches may exceed human searches soon
🛡️ Security & policy frictions around agents (non-feature)
PKI still isn’t “vibe engineerable,” even with strong agents
Claude Code Desktop on Windows reportedly re-prompts “bypass permissions” per session
OpenClaw maintainer pushes back on auto-generated security advisory “slop”
Educators look for grading methods that can’t be outsourced to LLMs
Repo hygiene: maintainers ban “me too” bots from issues
✅ Verification, reviews, and keeping agent output mergeable
Running 50 Codex agents in parallel to review PRs via JSON signals
Mutation testing to tighten tests against agent-driven semantic drift
Using specs plus Gherkin scenarios to keep agent rewrites honest across languages
A forcing prompt to keep agents searching for bugs instead of stopping early
Auto-generated security advisories collide with “dangerous” config naming
Code review tools may get reshaped by AI PR volume pressure
Maintainers start banning “me too” issue bots
🎓 Builder gatherings & distribution: conferences, meetups, and benchmarking meetups
Trace event advertises 500+ AI builders and hands-on workshops
Claw conference announced for London (Apr 8–10)
Voice AI meetup invite paired with Claude Sonnet 4.6 latency benchmark
Claude Code community marks its first birthday with an in-person celebration
Opencode team sets an in-person SF coffee meetup window
🤖 Embodied automation from China: field robots, kiosks, and humanoid demos
China scales robotic electricians for live high-voltage operations
China scaling agricultural robots: vision-guided picking with human exception handling
Humanoid robot attendants piloted on China high-speed trains during peak travel
Fully automated robotic coffee kiosk: 1–2 minute drinks and custom latte art
Humanoid robot night-run clip signals improving real-world locomotion
🎬 Generative media & vision apps: text-to-video vibes, timelapse workflows, and model UX gaps
Seedance 2.0 demos lean into special-effects quality for raw text-to-video
Freepik Spaces timelapse workflow turns one photo into reusable video variations
Gemini 3.1 Pro visualization demos revive the “AI Studio vs app” UX gap
Lyria 3 music reactions focus on non-English quality (Bhojpuri example)
🎙️ Voice agents: latency and pass-rate benchmarks start to stabilize
Claude Sonnet 4.6 posts a 100% pass rate on a voice agent benchmark with sub-1s TTFT