OpenAI GPT‑5.3 Instant ships 26.8% fewer hallucinations – 128K context API
Executive Summary
OpenAI rolled out GPT‑5.3 Instant as the default ChatGPT model; messaging targets fewer unnecessary refusals and less over-caveating tone while claiming lower hallucinations. The system card excerpt cites 26.8% fewer hallucinations with web search (19.7% without); developers get the default snapshot as gpt-5.3-chat-latest with 128K context and 16,384 max output, priced at $1.75/M input and $14/M output (cached input $0.175/M). GPT‑5.2 Instant stays selectable for paid users but is slated to retire June 3, 2026; OpenAI also teased “5.4 sooner than you Think,” and users report gpt-5.4-* strings plus “cybersecurity suspicious activity” throttles, but no public model card yet.
• Gemini 3.1 Flash‑Lite Preview: $0.25/$1.50 per 1M tokens; adjustable thinking levels; ~363 tok/s claims; Arena Text Elo 1432.
• Anthropic Claude/Claude Code: “unprecedented growth” strain; Claude Code 2.1.66 trims logs, enforces brevity, adds max_turns; desktop shows “Bypass permissions” warnings.
• Vercel agent incident: Opus 4.6 agent hallucinated a GitHub repoId (first appears at line 877); zero GitHub API calls preceded an unintended deployment.
Early third-party evals complicate the 5.3 story: one benchmarker claims EQ/longform regressions for gpt-5.3-chat; Arena added the snapshot for side-by-side testing, but setups aren’t yet standardized.
Top links today
- GPT-5.3 Instant rollout announcement
- OpenAI note on reducing cringe
- Gemini 3.1 Flash-Lite introduction
- Gemini 3.1 Flash-Lite official docs
- Unsloth Qwen3.5 fine-tuning notebook repo
- Unsloth Qwen3.5 LoRA fine-tuning guide
- Qwen 3.5 GPTQ Int4 weights on Hugging Face
- Qwen 3.5 GPTQ Int4 weights on ModelScope
- Byzantine consensus games for LLM agents paper
- Theory of mind multi-agent systems paper
- LangChain Academy reliable agents course
- Agents UI shadcn components for LiveKit
- Multi-Scale Embodied Memory robots demo
- Firecrawl PDF to Markdown parser demo
- Cross-agent session resumer tool repo
Feature Spotlight
GPT‑5.3 Instant rollout + GPT‑5.4 tease: tone/accuracy shift and confusing product matrix
GPT‑5.3 Instant becomes the default ChatGPT experience with measurable hallucination reductions and less over-caveating—plus an explicit GPT‑5.4 tease that signals rapid model churn affecting product planning and evaluation baselines.
🧠 GPT‑5.3 Instant rollout + GPT‑5.4 tease: tone/accuracy shift and confusing product matrix
Today’s headline is OpenAI rolling out GPT‑5.3 Instant to all ChatGPT users, emphasizing fewer unnecessary refusals/“preachy” caveats and lower hallucinations. Also includes the explicit GPT‑5.4 teaser and resulting confusion around “Thinking/Pro” timing (kept here to avoid duplication elsewhere).
OpenAI rolls out GPT-5.3 Instant to everyone in ChatGPT
GPT-5.3 Instant (OpenAI): OpenAI started rolling out GPT-5.3 Instant to all ChatGPT users, explicitly positioning it as “more accurate, less cringe” in the rollout announcement and detailing the intent to reduce overly cautious tone and unnecessary refusals in the linked release post. The practical change is conversation flow: fewer “nanny” preambles and fewer dead ends, while keeping the underlying safety bar but applying it more precisely, per OpenAI’s framing in the rollout announcement.
The most concrete artifact shared is a side-by-side answer where 5.3 Instant drops some of the therapized framing that users complained about, while still giving a structured answer to the same prompt tone comparison.
GPT-5.3 Chat lands in the API as `gpt-5.3-chat-latest` with pricing and limits
GPT-5.3 Chat (OpenAI API): OpenAI’s ChatGPT default snapshot is exposed for developers as gpt-5.3-chat-latest ("GPT-5.3 Chat Default"), with posted limits of 128,000 context and 16,384 max output tokens, plus a knowledge cutoff of Aug 31, 2025, as shown in the API model card screenshot. Pricing in that same card shows $1.75 / 1M input tokens and $14 / 1M output tokens, with cached input priced at $0.175 / 1M tokens, per the same API model card screenshot.
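For orientation, a minimal call sketch against the snapshot, assuming it is exposed through the standard chat completions surface (the model string comes from the screenshot; the rest is boilerplate SDK usage):

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Model string per the posted card; the 128K context and 16,384 max output
# are advertised limits enforced server-side, not parameters set here.
resp = client.chat.completions.create(
    model="gpt-5.3-chat-latest",
    messages=[{"role": "user", "content": "Summarize today's model news."}],
)
print(resp.choices[0].message.content)
```

At $1.75/M input and $14/M output, a call with 10K input and 1K output tokens costs roughly $0.0175 + $0.014 ≈ $0.03, which is the arithmetic behind most of the cost-ladder comparisons below.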
The rollout also includes lifecycle guidance: OpenAI states GPT-5.2 Instant remains selectable for paid users for ~3 months and will be retired on June 3, 2026, per the availability excerpt users shared. Arena also added the snapshot for side-by-side testing, per the Arena availability announcement.
GPT-5.3 Instant claims ~20–27% hallucination reduction, with small safety shifts
GPT-5.3 Instant (OpenAI): OpenAI’s materials for GPT-5.3 Instant highlight measurable hallucination reductions—26.8% lower on a high-stakes eval when using web search and 19.7% lower without web search, as summarized by an OpenAI researcher hallucination metric callout. They also cite reductions on a user-feedback-based eval (e.g., flagged factual errors) in the same excerpt hallucination metric callout, with more detail available in the system card System card.
• Safety trade-offs: A third-party summary of the system card notes slightly lower safety scores on some sensitive topics versus the prior Instant snapshot, alongside stronger refusal behavior for non-violent illegal activity requests system card bullets.
• Medical behavior changes: The same summary calls out a small dip on a medical advice test while claiming the model asks more clarifying questions when uncertain system card bullets.
OpenAI teases GPT-5.4; users report early gpt-5.4 strings and cyber throttles
GPT-5.4 (OpenAI): Shortly after the 5.3 Instant rollout, OpenAI posted “5.4 sooner than you Think” OpenAI teaser, triggering confusion because OpenAI’s 5.3 Instant materials say “updates to Thinking and Pro will follow soon” availability excerpt while users speculate whether 5.3 Thinking/Pro are being skipped in favor of 5.4 confusion thread.
Separately, multiple people reported seeing gpt-5.4 model strings and enforcement: one screenshot shows a stream error referencing access to a gpt-5.4-* variant being temporarily limited for “potentially suspicious activity related to cybersecurity” access limit screenshot, and another report describes hitting similar guardrails while testing a local “deep research” pipeline guardrails report. None of these tweets confirm general availability or what “5.4” actually ships; they mainly indicate pre-release plumbing and safety throttles are active.
Builders say GPT-5.3 Instant feels more direct; benchmarkers report writing/EQ regressions
GPT-5.3 Instant (Early sentiment): Early builder feedback clusters around tone: one thread says 5.3 Instant “feels more direct and less defensive” and “more lived-in” compared to 5.2 Instant early vibe thread, aligning with OpenAI’s own “less cringe” positioning rollout announcement. A separate viral screenshot shows a harsher/edgier reply (“You’re an idiot.”) that some interpreted as part of the tone shift snarky response screenshot.
At the same time, at least one benchmarker claims a “surprising & severe regression” on EQ-Bench and longform writing for gpt-5.3-chat, describing partial refusals on EQ-Bench and “tiny 1–5 word paragraphs” in writing evals benchmark regression claim. The net signal today is mixed: user-facing conversation tone appears to have shifted in the direction OpenAI intended, while some third-party writing/EQ evals are flagging potential regressions that may or may not match typical ChatGPT usage patterns benchmark regression claim.
🚀 Non‑OpenAI model drops & price/perf moves (Gemini Flash‑Lite, Grok 4.20, embeddings)
Outside of the GPT‑5.3 feature, the biggest release chatter is Google’s Gemini 3.1 Flash‑Lite (speed/cost + adjustable thinking levels), alongside Grok 4.20 Beta 2 updates and a new open-weights embedding entrant. Excludes OpenAI GPT‑5.3/5.4 items (covered in the feature).
Gemini 3.1 Flash‑Lite Preview ships with adjustable thinking and 1M context
Gemini 3.1 Flash‑Lite Preview (Google DeepMind): Google launched Gemini 3.1 Flash‑Lite as a preview model for high‑volume, cost‑sensitive workloads; it’s priced at $0.25/M input tokens and $1.50/M output tokens and introduces adjustable thinking levels (minimal→high) as described in the speed and cost card and the AI Studio model picker. It’s available via Google AI Studio and Vertex AI per the Vertex AI listing, with the launch positioning and rollout details spelled out in the Google blog post and the supported capability surface documented in the Vertex AI docs.

The preview marketing claims “core performance of 2.5 Flash” and highlights translation/data processing and agentic workloads in the Vertex AI listing and launch recap.
• Deployment surfaces: In addition to Google’s first‑party endpoints, the model shows up on third‑party routers/gateways, including OpenRouter per the OpenRouter availability and Vercel AI Gateway per the Gateway changelog.
• Interface knob: The “thinking levels” control is presented as a latency/quality dial in launch posts like the speed and cost card, rather than a separate model SKU.
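A hedged sketch of what that dial looks like in code, using the google-genai SDK; the preview model id and the thinking_level value are assumptions taken from the launch framing, not verified docs:

```python
from google import genai  # pip install google-genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Model id is hypothetical for this preview; thinking_level mirrors the
# launch's minimal→high framing and recent SDK versions, so verify both
# against the official docs before relying on them.
resp = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents="Extract all dates from: 'Invoices due 2026-03-04 and 2026-04-01.'",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="minimal"),
    ),
)
print(resp.text)
```

Dialing the level up trades latency for quality on the same SKU, which matches how launch posts present it as a knob rather than a separate model.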
Gemini 3.1 Flash‑Lite’s pitch is speed-first, with surprisingly high GPQA/MMMU-Pro numbers
Gemini 3.1 Flash‑Lite Preview (Google): Early benchmark framing centers on throughput and price/perf, with multiple posts repeating ~363 output tokens/s, $0.25/$1.50 per 1M input/output tokens, and scores like 86.9% on GPQA Diamond and ~76.8% on MMMU‑Pro, as shown in the widely shared benchmark table photo and the speed and price chart. Artificial Analysis adds its own positioning—34 on its Intelligence Index and 360+ tok/s—in the evaluation breakdown.
• Arena positioning: Arena reports Gemini‑3.1‑Flash‑Lite‑Preview around #36 in Text (Elo 1432) and tied around #35 in Code (Elo 1261) in the Arena post, which is being used as evidence that Lite is “usable” beyond trivial routing.
• Cost ladder nuance: Some comparisons emphasize it’s cheaper than Gemini 2.5 Flash, but more expensive than Gemini 2.5 Flash‑Lite; that tension shows up implicitly in the speed and price chart and explicitly in third‑party commentary like the evaluation breakdown.
Builders like Flash‑Lite’s speed; the 2–4× price bump vs prior Lite is a sore spot
Gemini 3.1 Flash‑Lite Preview (Field reports): Builder reactions split along iteration speed vs. cost; some love how fast it feels—“Flash‑Lite is so darn fast” per the speed reaction and “absolute speed demon” in the speed praise—while others call out that it’s 2–4× more expensive than Gemini 2.5 Flash‑Lite per the pricing comparison and the pricing critique.

The most concrete “hands-on” notes in today’s tweets are around the thinking‑level dial: one tester reports minimal thinking is fast but lower fidelity, while high thinking improves results with modest latency increases in the thinking level impressions.
Grok 4.20 Beta 2 update emphasizes instruction-following and fewer capability hallucinations
Grok 4.20 (xAI): Grok announced a Beta 2 update with a focused change list—instruction following improvements, capability hallucination reduction, and better scientific text quality (LaTeX) plus reliability fixes around image search triggers and multi‑image rendering, as listed in the release notes screenshot. Elon Musk separately confirms a “Beta 2 out today” cadence in the Musk reply.
ZeroEntropy ships zembed‑1 open‑weights embeddings with truncatable dimensions
zembed‑1 (ZeroEntropy): ZeroEntropy released zembed‑1, an open‑weights embedding model positioned as multilingual and competitive with incumbent proprietary embeddings, available via API plus Hugging Face and AWS Marketplace per the launch post. The circulating model details include 4B parameters, 2560‑dim vectors that can be truncated to smaller sizes (sketched after the bullets below), and quantization support, as enumerated in the model details thread.
• Training lineage: The model is described as distilled from a reranker (“zerank‑2”) with an Elo-style pairwise teacher setup in the distillation note.
• Evidence quality: The performance claims are currently mostly proprietary / vendor-reported, which is called out directly in the evaluation caveat, with deeper context in the launch blog Launch blog post and the model card Hugging Face model.
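A minimal sketch of what “truncatable dimensions” usually means in practice (the Matryoshka-style pattern; whether zembed-1 requires renormalization after truncation is an assumption):

```python
import numpy as np

def truncate_and_renormalize(vecs: np.ndarray, dim: int) -> np.ndarray:
    # Keep the leading components, then re-unit-normalize so cosine
    # similarity stays meaningful at the smaller size.
    out = vecs[:, :dim]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

full = np.random.randn(4, 2560)              # stand-in for zembed-1 outputs
small = truncate_and_renormalize(full, 512)  # cheaper to store and search
print(small.shape)  # (4, 512)
```

The operational win is embedding once at 2560 dims and serving cheaper 512-dim indexes from the same vectors, at some recall cost.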
Grok Imagine adds an “Extend video” control
Grok Imagine (xAI): The Grok Imagine UI now shows an “Extend video” action on both web and mobile, surfaced as a one-click continuation option in the UI menu capture. This is framed as a product control (not a model change) and appears alongside the standard feedback menu.
🔧 Claude & Claude Code: scaling pain + CLI/prompt changes that alter day-to-day workflows
Claude teams flagged unexpected traffic growth and instability, while Claude Code saw concrete release/changelog activity (including system-prompt constraints aimed at output efficiency). This beat is about workflow-impacting Claude/Claude Code changes and reliability notes today.
Anthropic says Claude demand spiked fast enough to cause instability
Claude (Anthropic): Anthropic staff said they saw “unprecedented growth” in Claude and Claude Code usage “this week” that was “genuinely hard to forecast,” and asked users to bear with them while they scale, as described in the Scaling note and reiterated in the Follow-up apology. Reliability is the direct bottleneck here: even if models improve, teams can’t build around tools that intermittently fail.
The messaging is notable for explicitly framing the outages as capacity/scale-driven rather than a model regression, while also acknowledging user dependence (“for both work and life”) in the Follow-up apology.
Claude Code 2.1.66 prompts now explicitly optimize for brevity
Claude Code (Anthropic): Claude Code’s 2.1.66 update includes system-prompt changes that explicitly instruct the assistant to be more concise—“go straight to the point,” avoid filler/preamble, and focus on decisions, status, and blockers—according to the Prompt diff summary.
• Subagent prompting: The guidance notes some agents can access prior conversation context before tool calls, enabling shorter subagent prompts like “investigate the error discussed above,” per the Prompt diff summary.
• Agent controls: The agent tool schema now supports model selection (haiku/sonnet/opus) and a max_turns cap for round-trips, as described in the Prompt diff summary; a shape sketch follows below.
This is a behavioral change: the same tasks may now yield shorter narration but similar tool activity.
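For concreteness, a hypothetical shape of a subagent dispatch under the new schema; only model and max_turns are fields the prompt diff summary names, and everything else here is illustrative, not Anthropic’s documented API:

```python
# Hypothetical subagent task spec; `model` and `max_turns` come from the
# prompt diff summary, the rest is illustrative.
subagent_task = {
    # Short prompt, leaning on the subagent's access to prior context:
    "prompt": "investigate the error discussed above",
    "model": "haiku",   # cheaper model for a scoped investigation
    "max_turns": 5,     # cap on tool-call round-trips
}
```

The max_turns cap is the operationally interesting knob: it bounds runaway subagent loops the same way a timeout bounds a flaky RPC.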
Claude Code 2.1.66 ships with reduced spurious logs and more detailed failures
Claude Code 2.1.66 (Anthropic): A new Claude Code release is out with changes aimed at day-to-day ergonomics: reduced spurious error logging and task failure messages that include more details, as captured in the Release notes thread and the upstream Changelog entry. This is small on paper, but it affects the debugging loop when agents fail mid-run.
The change is framed as CLI/runtime quality-of-life rather than new agent features, based on the Release notes thread.
Claude Code 2.1.66 removes CLI commands and adds new config/env knobs
Claude Code CLI 2.1.66 (Anthropic): The release removes a large set of CLI commands, options, and flags while adding new configuration and environment variables (including DISABLE_MICROCOMPACT and config keys like max_turns), as enumerated in the CLI surface diff.
A meaningful operational implication is that scripts or internal docs relying on removed commands (e.g., open, server) may need updates, per the CLI surface diff.
Claude Code adds a “Bypass permissions” mode with explicit risk warnings
Claude Code (Anthropic): Claude Code’s desktop UI now surfaces a “Bypass permissions” mode that can accept all permissions and let Claude run uninterrupted—explicitly warning this can enable data loss, corruption, or exfiltration via prompt injection—per the Settings screenshot.
This slots into a spectrum of automation modes (“Ask permissions,” “Auto accept edits,” “Plan mode,” then “Bypass permissions”) visible in the same menu, as shown in the Settings screenshot.
Claude Code voice input is rolling out with /voice and spacebar dictation
Claude Code (Anthropic): Voice input is being presented as a native Claude Code capability; one demo shows holding the spacebar to dictate a coding request that generates code immediately, as shown in the Voice demo.

Rollout notes mention voice mode reaching “~5% of users today,” as stated in the Rollout claim, and the /voice entry point is called out directly in the Slash command note.
Power-user list of Claude desktop pain points focuses on caching and compaction
Claude desktop app (Anthropic): A detailed user report lists practical friction points: slow session switching (reloading conversations), compaction loops, “prompt too long” session loss, inconsistent “working” indicators, and limitations around PR creation flow, as itemized in the Issue list.
The report is useful as a grounded failure taxonomy for agent UX: it’s less about model quality and more about session persistence, compaction predictability, and workflow continuity across long runs, per the Issue list.
🧩 Codex surface area expands: Windows app timing, new skills, and session/worktree ergonomics
Codex-related tweets today are mostly about new product surfaces (Windows app timing), new skill-based workflows, and operational ergonomics (handoff between Local and Worktree). Excludes GPT‑5.3 Instant rollout (covered in the feature).
Codex for Windows gets a “tomorrow” teaser, pointing to an imminent desktop drop
Codex for Windows (OpenAI): Following up on Windows teaser (interest-form tease), multiple posts now point to a Windows desktop app arriving on Mar 4—with a screenshot of OpenAI replying “Tomorrow” to “Windows wen???” questions in the Windows timing tease, plus an additional recap saying the launch is “tomorrow (Wednesday)” in the Wednesday claim.
If the Windows build lands as implied, it widens Codex’s “native app” surface beyond macOS and should change how teams standardize agent tooling across mixed-OS shops.
Codex app adds a handoff flow to move threads between Local and Worktree
Codex app (OpenAI): OpenAI Devs highlighted a new “handoff workflow” that moves a conversation thread between Local and Worktree contexts, as shown in the Handoff workflow post.
This is an ergonomics fix for long-running work where you start locally, then need an isolated worktree for risky changes (or the reverse). It’s also a clue that Codex is treating “thread state” as an asset you should be able to relocate, not restart.
Codex ships a skill aimed at building ChatGPT apps
Codex skills (OpenAI): OpenAI Devs teased a new Codex skill specifically for building ChatGPT app experiences, sharing a “brb building a new app for @ChatGPTapp” demo hook in the Skill demo teaser, which is echoed by downstream resharing in the Skill availability retweet.
There aren’t public details yet on what the skill scaffolds (project templates, APIs, deployment, review loop), but the positioning implies OpenAI is turning “build apps for our platform” into a first-class skill workflow rather than generic coding prompts.
Codex Spark dictation workflows show up as a speed trick for frontend iteration
Codex Spark (OpenAI): Builders are showing a dictation-first loop where they speak changes and let Codex Spark transcribe into code in real time, with a pop-out window demoed in the Realtime Spark dictation.

• Controls people are using: Another post claims voice transcription works in both the Codex app and CLI, triggered via a mic button or Ctrl+M, as referenced in the Hotkey callout.
The common thread is that voice input compresses the “prompt typing” bottleneck for UI-heavy work, but the footage also shows how easy it is to outrun any steering/constraints once you’re talking continuously.
opencode reports 300k regular users signing in with Codex
opencode (community): A usage milestone landed with “300k users who regularly use their Codex sign-in through opencode,” according to the User milestone note.
This is a concrete adoption signal for the Codex ecosystem outside OpenAI’s first-party surfaces, and it suggests “bring-your-own-harness” patterns are scaling even when the underlying model/provider is the same.
Codex CLI is rumored to add native image generation in a “next big update”
Codex CLI (OpenAI): A roadmap-style claim says Codex CLI will support native image generation “in the next big update,” per the Roadmap hint.
No release notes or PR reference is included in the tweets, so treat this as unconfirmed. If it’s real, it would broaden Codex CLI from code-first tasks into multimodal asset generation without switching tools.
🖥️ Cursor: MCP Apps (interactive UI in chat) + plugin governance + autonomous research signals
Cursor’s updates center on making agent conversations render interactive UIs (MCP Apps) and adding team-controlled plugin distribution, plus a notable autonomous ‘math research’ style run shared by Cursor accounts. This is Cursor-specific (not generic MCP server news).
Cursor ships MCP Apps: interactive UIs rendered inside agent chats
MCP Apps (Cursor): Cursor now supports MCP Apps, letting agents render interactive UI components directly inside conversations—Cursor frames it as “Agents can render interactive UIs in your conversations” in the feature announcement, and the Cursor 2.6 notes bundle it alongside other updates in the Cursor 2.6 changelog. This shifts agent UX from “paste a chart” to “operate a UI in place.”

The practical effect is tighter loops for workflows that normally jump to a browser tab (dashboards, forms, diagrams), since the agent can show and update a UI artifact inline rather than describing it.
Cursor adds Team Marketplaces to distribute private plugins internally
Team Marketplaces (Cursor): Cursor introduced team marketplaces for plugins—intended for sharing private plugins within Teams/Enterprise orgs—as shown in the marketplace preview and described in the Cursor 2.6 changelog. This adds an admin-governed distribution path for internal tooling, instead of every developer side-loading plugins ad hoc.

The main operational change is governance: centralized enablement, repeatable installs, and fewer “works on my machine” plugin setups across a team.
Cursor claims a 4-day autonomous run found a novel math solution
Autonomy signal (Cursor): Cursor’s account amplified a claim that Cursor discovered a novel solution to “Problem Six” of the First Proof challenge, with the run operating autonomously for four days per the retweet claim. A related screenshot repeats the “ran fully autonomously… for four days” framing and ties it to “scaling agent coordination” in the screenshot excerpt.
The public details are sparse (no trace or writeup linked in these tweets), so treat it as a directional signal about long-horizon harness reliability rather than a reproducible result.
MCP Apps are being read as Cursor expanding beyond “a coding tool”
Product positioning (Cursor): Community reaction frames MCP Apps as Cursor stepping outside pure coding into a more general “UI inside agent chat” platform—one take was “This is taking them outside of the realm of being a coding tool” in the reaction post. That interpretation builds directly on the MCP Apps capability shown in the feature announcement.

If this holds, it implies Cursor will compete more directly on collaboration surfaces (dashboards, internal tools, agent frontends), not just editor ergonomics and model quality.
🧱 Skills, plugins, and reusable agent capabilities (beyond any single IDE)
Today’s skills/plugin news spans Anthropic’s skill-creation tooling, Vercel’s skill-driven Slack agent setup, and cross-product ‘skills’ expansion (e.g., Perplexity Computer). This category is for installable/distributable capability packs, not core model releases.
Anthropic upgrades skill-creator to generate tests and measure skill quality
skill-creator (Anthropic): Anthropic shipped an upgraded skill-creator workflow that bakes measurement into skill authoring—calling out built-in support for test generation to track things like “skill trigger rate,” and distributing it as a Claude Code plugin alongside claude.ai and Cowork surfaces, as shown in the plugin walkthrough.

The update is positioned as a move from “write instructions” to “write instructions plus checks,” with the details outlined in the skill-creator blog and reference implementations living in the GitHub repo.
Vercel’s Slack agent skill turns setup into an npx-driven flow
Slack agent skill (Vercel): Vercel is promoting a reusable “skill” that drives a full Slack-agent bring-up—npx skills add vercel-labs/slack-agent-skill—covering app config, OAuth scopes, webhook verification, testing, and deployment as a guided workflow, per the demo clip.

Implementation details and the step-by-step flow are described in the setup guide, with the key pitch being that agent scaffolding plus Slack app configuration happen together (instead of being a separate manual checklist).
Perplexity Computer adds Skills to reuse agent workflows across tools
Skills (Perplexity Computer): Perplexity is rolling out Skills support inside Perplexity Computer, framing Skills as reusable “Computer programs” that can port existing workflows people built for Codex and Claude Code, according to the skills UI screenshot.
The artifact here is a first-party UI for discovering and running saved capabilities (“My Skills” plus built-ins), implying a shift toward a skill catalog as a compatibility layer between agent runtimes rather than one-off prompts.
Playbooks.com indexes agent skills as a browse-and-install directory
Playbooks.com directory: A new “agent skills for Codex” directory is being shared as a marketplace-style discovery surface, showing 31,170 items in the UI and including sponsored placements, as seen in the directory screenshot.
The site positions itself as a cross-tool catalog for “agent skills” that can be installed into agent workflows, with the product framing and submission path described on the site overview.
🛡️ Agent security & policy: tool misuse, legal constraints, and trust gaps under automation
Security-focused items today include a real incident where an agent hallucinated identifiers and triggered an unintended deployment, plus policy signals (legal advice restrictions) and clarifications about “gov” model safeguards. This category excludes GPT‑5.3 launch content.
Agent hallucinated a GitHub repo ID and triggered an unintended Vercel deployment
Vercel deploy API incident (Vercel): A customer reported an “unknown OSS codebase” deploying into their team; Vercel’s investigation found the agent (running Opus 4.6 via OpenClaw) fabricated a plausible numeric GitHub repoId and then used Vercel’s API to deploy it, as described in the incident report. There were zero GitHub API calls before the deployment; the ID appeared “for the first time at line 877,” indicating pure fabrication rather than an off-by-one, per the incident report.
• Why this matters: This is a concrete example of a tool-using agent creating new attack surface by inventing identifiers and taking real actions with keys, with the risk called out explicitly in the incident report.
• Mitigation direction: The post argues that “powerful APIs create additional risks,” and suggests tighter guardrails and deeper tool integrations; Vercel reiterates that OpenClaw wasn’t the root cause, “it’s just an agent with access to tools and keys,” as emphasized in the follow-up note.
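One minimal shape such a guardrail could take: resolve the agent-produced identifier against GitHub before any write-capable API sees it (GET /repositories/{id} is a real GitHub endpoint; the surrounding flow is a sketch):

```python
import os
import requests

def repo_exists(repo_id: int) -> bool:
    """Preflight check: confirm a numeric repoId resolves on GitHub
    before handing it to a deploy API that can take real actions."""
    resp = requests.get(
        f"https://api.github.com/repositories/{repo_id}",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=10,
    )
    return resp.status_code == 200

repo_id = 123456789  # hypothetical value produced by the agent
if not repo_exists(repo_id):
    raise ValueError(f"repoId {repo_id} failed preflight; refusing to deploy")
```

The general principle: any identifier an agent did not obtain from a tool call should be treated as untrusted input and re-verified before it reaches a write path.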
NY bill proposal would restrict LLMs from giving substantive legal advice to consumers
NY legal-compliance signal: A New York state bill is described as prohibiting LLMs from providing “substantive legal analysis or advice” in NY, with a likely practical effect of consumer-facing assistants refusing legal questions more often, per the bill summary. The same post notes a narrower interpretation after feedback: assistance to lawyers may remain permissible under the text, but consumers would still be blocked, as clarified in the bill summary.
For AI product leaders, this reads like a jurisdiction-specific compliance cliff for general chatbots and legal-help UX, even if “lawyer-in-the-loop” workflows remain viable.
Anthropic clarifies Claude Gov includes extra safeguards, not a “helpful-only” model
Claude Gov (Anthropic): In response to claims that Anthropic would offer the military an uncensored “helpful-only” model, Sam McAllister states this “isn’t true,” describing Claude Gov as a custom model with extra safety training and technical safeguards, plus Anthropic-run enforcement via a classifier stack and forward-deployed engineers, as captured in the Claude Gov clarification.
The operational detail that keys aren’t simply handed over—Anthropic says it deploys and runs the classifier layer itself—directly addresses “who controls guardrails in production,” which is the part procurement and risk teams usually care about.
Security review at scale: treat coding agents like mixed-seniority teams shipping fast
Security review under agent throughput: A thread frames the near-term reality as “dozens of teams…constantly shipping new features,” and suggests thinking of coding agents as “teams of mixed ability engineers working under aggressive deadlines,” as argued in the security-at-scale framing. The punchline is that security is the lens where quality drift becomes directly harmful, with a call for established practices from DEF CON/Black Hat/CCC and similar venues in the security lens follow-up.
This lands as a practical reframing: many orgs already know how to manage inconsistent code quality, but the security failure modes are different in blast radius and remediation cost, as discussed in the security review practices ask and echoed in the mixed-seniority analogy.
Visual LLMs can’t reliably detect AI-generated images/videos, but may answer confidently
Model trust gap in media forensics: A caution flags that Grok (and “no visual LLM”) can’t reliably tell whether an image/video is AI-generated, yet will still provide definitive-sounding answers if prompted, as stated in the media detection limitation.
This is a sharp reminder for teams building moderation, provenance, or “is this fake?” features: model confidence and model correctness diverge badly on this task, and a chat UI can mask that gap.
🧰 Agent harnesses & ops: OpenClaw adoption, hosted agents, sandboxes, and multi-agent runners
Operational agent tooling is the story here: OpenClaw’s day-to-day usage patterns (and pain), hosted variants like MaxClaw, and “agents get their own machine” sandbox patterns. This category is about running/operating agents, not writing plugins.
OpenClaw 2026.3.2 ships Telegram live streaming and enables ACP subagents by default
OpenClaw (OpenClaw): Following up on PDF tool beta—which introduced the 2026.3.2 beta line with a native PDF tool—OpenClaw 2026.3.2 calls out Telegram live streaming support, flips ACP subagents on by default, and lists “100+ security & stability fixes,” as shown in the Release notes post.
• Operational knobs: The same notes mention openclaw config validate plus a “native PDF tool,” suggesting the project is formalizing config hygiene alongside new channels and tools, per the Release notes post.
It’s not clear from the tweets whether 2026.3.2 is GA vs “beta,” but the change list reads like a reliability-focused cut.
MiniMax positions MaxClaw as a hosted OpenClaw you can deploy in minutes
MaxClaw (MiniMax): MiniMax is pitching MaxClaw as “OpenClaw, fully hosted in the cloud”—no server, no API-key wiring—plus a guided connection flow (Telegram shown) and a “deployed in under 2 minutes” setup claim in the Hosted OpenClaw demo.

• Packaging signal: MiniMax also advertises a “Coding Plan” integrated with MaxClaw, implying a bundled model+agent+credits offering rather than DIY orchestration, as announced in the Coding plan post with an entry point at the Hosted agent page.
The tweets don’t show pricing/limits beyond the anecdotal “full access is $19,” so treat the ops promises as directional until there’s a spec page or SLA.
OB-1 adds a Modal sandbox mode to move agent runs off your laptop
OB-1 --sandbox (OpenBlock Labs): OB-1 introduced ob1 --sandbox, moving agent execution into an isolated cloud VM “powered by Modal,” with the pitch that your repo and local environment are cloned into the sandbox to avoid laptop resource contention, per the Sandbox announcement.

• Ops rationale: The follow-up in the Sandbox announcement and Why sandbox mode frames this as a stability move (less memory pressure, fewer terminal crashes) rather than a model upgrade.
The posts don’t specify what isolation boundaries exist (network egress, secrets handling, filesystem persistence), which are the details teams typically need before routing production work through it.
Every publishes a practical OpenClaw beginner guide built from internal workflows
OpenClaw adoption (Every): Every published an “ultimate beginner’s guide” framing OpenClaw as day-to-day ops infra—using claws for product work, customer support, restaurant reservations, and reading-note tracking—anchored by their onboarding narrative in the Guide announcement, with the full walkthrough in the linked Beginner guide.
The emphasis is less “agent demo” and more how teams actually integrate an always-on agent into workstreams (and what they wish they’d known at the start), as described in the Guide announcement.
Superset open-sources a terminal workspace for parallel agent runs via worktrees
Superset (superset-sh): Superset is an open-source “terminal application designed for AI coding agents” that runs multiple CLI agents in parallel, isolates them in git worktrees, and adds monitoring/notifications plus a diff viewer, as described on the GitHub repo surfaced via the Repo post.
The traction signal in the same post is the repo’s “vertical” star-history spike (not a benchmark), suggesting interest in multi-agent local orchestration as a product category, per the Repo post.
OpenClaw reliability frustration shows up in public: “more down than up”
OpenClaw reliability (Community): Multiple posts describe OpenClaw as unstable in practice—one calling it “more down than it’s up” in the Reliability complaint, and another joking about starting a “support group for OpenClaw debuggers” in the Support group joke.
This sentiment lands the same day as a release-note-heavy OpenClaw update, which makes it hard to tell whether the instability is rollout churn, local self-host friction, or upstream service issues.
✅ Code quality under agent throughput: review bottlenecks, maintainability debt, and verification tactics
The dominant theme is that AI increases code output faster than humans can review/maintain, pushing teams toward new verification and boundary-enforcement tactics. This category is about keeping codebases mergeable and secure, not tool releases.
A hallucinated repo ID triggered a real deployment, highlighting verification gaps
Agent verification gap (Vercel): A Vercel investigation found an agent (Opus 4.6) fabricated a plausible-looking numeric GitHub repoId and then used Vercel’s API to deploy it—without any prior GitHub API lookup—per Incident writeup and the added clarification that the ID was “completely hallucinated” in Hallucinated ID confirmed.
The notable detail is that the failure mode is structured: the model invented an identifier that fit the schema and proceeded as if it were real. The incident is being used to argue that powerful write-capable APIs need stronger preflight checks than “ask the agent to explain what happened,” as summarized in Incident writeup.
Agent output is outpacing review, and some teams are considering skipping code review
Code review bottleneck: A FAROS/Latent Space chart making the rounds claims high AI adoption drives +21% task throughput and +98% PR merges per dev, while median review time rises +91%, as shown in Review bottleneck chart; the same post frames “killing the code review” as the remaining gate for agentic engineering in Review bottleneck chart.
The practical takeaway is that the limiting factor shifts from writing code to verifying it. It also explains why teams are spending energy on automated checks, policy enforcement, and artifact-based validation instead of “faster generation” alone.
A growing counter-narrative says code review isn’t the fix—scope control is
Codebase scope control: In direct response to “review is the bottleneck” takes, one thread argues the situation is partly self-inflicted—if your team is producing that much code, “you’re using LLMs entirely incorrectly,” because models struggle most with large codebases, per Scope warning and the earlier “we’ve never done code review” stance in Scope warning.
This is a different kind of quality strategy: constrain what agents are allowed to change, keep modules small, and avoid “more code” as a proxy for progress. It’s less about reviewing faster and more about generating less.
LLMs can preserve green tests while quietly increasing long-term maintenance debt
Maintainability drift: One practitioner describes a common arc—early AI code feels magical, then small fixes cascade; models avoid breaking functionality, so they add layers of backward compatibility to keep tests passing, which can mask that behavior has changed, according to Backwards-compat layering.
The point: a short-term green build isn’t the same thing as a stable design. The symptom to watch for is “it still passes” while the system gets harder to reason about.
Security teams are being asked how to review at scale when agents ship nonstop
Security review at scale: Simon Willison frames the agent throughput problem as a security problem first—bad performance or tech debt is survivable, but security failures aren’t—while suggesting we may need to treat coding agents like “teams of mixed ability engineers working under aggressive deadlines,” per Security lens and the broader framing in Mixed-seniority analogy.
He explicitly asks for the best essays/books/talks on robust security review at scale (DEF CON/Black Hat/CCC-style material) in the Call for references. In short: this is an org design question.
Using an AI to find dependency cycles, then locking in architectural boundaries
Dependency boundaries (Codex workflow): Uncle Bob reports using Codex to build a dependency checking tool that finds cycles, enforces dependency boundaries, and computes dependency metrics—then using the output to break component-level cycles via dependency inversion, per Dependency checker built and the follow-up on component mutual dependencies in Component-level cycle found.
He also describes moving from trusting the agent to scrutinizing abstractions until rules are written down in Scrutinize abstractions, and mentions mutation testing (“no surviving mutants”) as a regression brake in Mutation tests hold.
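The cycle-finding core of such a tool is small; a sketch of the standard DFS three-color approach (not Uncle Bob’s actual implementation):

```python
from collections import defaultdict

def find_cycle(edges: list[tuple[str, str]]) -> list[str] | None:
    """Return one dependency cycle as a path, or None if the graph is acyclic."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = defaultdict(int)
    path: list[str] = []

    def dfs(node: str) -> list[str] | None:
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge: the cycle closes here
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                if (found := dfs(nxt)) is not None:
                    return found
        color[node] = BLACK
        path.pop()
        return None

    for node in list(graph):
        if color[node] == WHITE:
            if (found := dfs(node)) is not None:
                return found
    return None

# A mutual dependency between components is the failure mode described above.
print(find_cycle([("A", "B"), ("B", "C"), ("C", "A")]))  # ['A', 'B', 'C', 'A']
```

Breaking the cycle via dependency inversion then means introducing an interface so one edge flips direction and the loop disappears.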
Maintainers are starting to police AI-accelerated PR abuse more aggressively
OSS integrity pressure: An OpenClaw maintainer reports banning users who copied others’ PRs (“even copied his own PRs”) and retroactively updating credits/changelogs, per PR copier banned.

This is an operational response to higher-volume contributions: more triage, more provenance checking, and explicit consequences for low-integrity submissions.
📊 Benchmarks & eval reality checks: document reasoning leaderboards, agent horizons, and reasoning tests
Benchmarks today skew toward document reasoning and ‘time horizon’ style agent measurements, plus new reasoning leaderboards and ARC score reporting. This category is for measurement artifacts and leaderboard movements (not the underlying model announcements).
Document Arena goes live for PDF reasoning, with Opus 4.6 in the lead
Document Arena (Arena): Arena launched Document Arena, a side-by-side evaluation flow where users upload real PDFs and vote on which model handles document reasoning best; early standings put Claude Opus 4.6 at #1 with a score of 1525 and a +51 lead, while GPT-5.2 is shown tied around #9, roughly 100 points back, per the leaderboard announcement.
This is a new “bring-your-own-document” eval surface rather than a synthetic benchmark; it should bias toward workflow-relevant failure modes (tables, long policies, messy PDFs) that aren’t visible in standard text-only leaderboards.
METR corrects Opus 4.6 task time-horizon estimates after methodology fix
Opus 4.6 horizons (METR): METR revised its time-horizon estimates for Claude Opus 4.6, reporting P50 ≈ 11h 59m (down from 14.5h) and P80 ≈ 1h 20m (up from ~1h), as stated in the correction note.
The curve-fit plot in the fit comparison chart shows materially different assumptions (logistic variants, nonparametrics, survival-style fits) producing different P50/P80 horizons, which is a useful reminder that “agent time horizon” numbers are model-and-method sensitive—not a single canonical stat.
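For readers unfamiliar with the construction, the common logistic-in-log-time recipe (hedged: this mirrors METR’s published approach, not necessarily the exact fits in today’s chart) regresses success on log task length and inverts for any percentile:

```latex
% Fit (alpha, beta) on per-task success outcomes, then invert:
P(\mathrm{success} \mid t) = \sigma\!\left(\alpha - \beta \log_2 t\right),
\qquad
t_p = 2^{(\alpha - \operatorname{logit}(p))/\beta}
```

P50 and P80 come from the same fitted pair, so swapping the logistic for a nonparametric or survival-style fit moves both horizons at once, which is exactly the sensitivity the chart shows.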
Artificial Analysis breaks down Gemini 3.1 Flash-Lite: very fast, mixed deltas
Gemini 3.1 Flash-Lite Preview (Artificial Analysis): Artificial Analysis benchmarked Gemini 3.1 Flash-Lite Preview as a speed-first model served at 360+ output tok/s with a reported 34 on its Intelligence Index; the thread also notes limited tool-use gains in some evals and that the pricing moved up versus prior Lite generations, per the benchmark breakdown and the speed and latency note.
More detail (including context-window and pricing comparisons) is compiled in the model analysis page, which is helpful when you’re deciding whether “fast enough + smart enough” beats a slower model in real traffic.
Community evals flag GPT-5.3-Chat regressions on EQ-Bench and longform writing
EQ-Bench and longform writing (community evals): A community benchmarking post claims gpt-5.3-chat shows a “surprising & severe regression” on EQ-Bench and a longform writing eval, including partial refusals and prose that collapses into very short paragraphs, per the eval screenshots.
This sits in tension with OpenAI’s broader “tone improvement” positioning elsewhere; treat it as provisional until the underlying eval setup (prompts, temperature, sampling, judge model) is reproduced.
CritPt reasoning leaderboard highlights how low ‘hard reasoning’ scores still are
CritPt benchmark (Artificial Analysis): A CritPt leaderboard screenshot shows extremely low absolute scores on a high-difficulty reasoning benchmark, with Gemini 3.1 Pro Preview at 17.7%, GPT-5.3 Codex (xhigh) at 16.9%, and Claude Opus 4.6 (max) at 12.6%, as shown in the leaderboard screenshot.
The distribution (many models clustered near ~0–3%) reinforces that “hard reasoning” benchmarks can remain sparse even when general chat/coding feels strong—good context for interpreting incremental model wins.
ARC-AGI-2 chart emphasizes cost-per-task alongside low scores for smaller models
ARC-AGI-2 (ARC Prize): Following up on ARC results (international scores + cost framing), a new ARC Prize Verified scatter plot shows the cost/score frontier with high-end models near the top—e.g., Gemini 3.1 Pro (Preview) ~75%, Claude Opus 4.6 (Max) ~68%, and GPT-5.2 (High) ~45%—while smaller/cheaper models like Kimi K2.5 and Deepseek remain in the low single digits to low teens, as captured in the leaderboard post.
The chart makes the trade explicit: you can buy lower cost-per-task, but the score cliff is still steep for this benchmark.
BullshitBench v2 places GPT-5.3-Chat mid-pack and Flash-Lite lower
BullshitBench v2: BullshitBench v2’s updated standings put GPT-5.3-Chat “towards the top of OpenAI models” but 23rd overall, while Gemini 3.1 Flash Lite is shown 56th overall, according to the results note and the linked interactive viewer.
Because BullshitBench is about detecting nonsense / pushing back appropriately, this leaderboard is often more about instruction discipline and refusal calibration than raw knowledge or coding ability.
Arena adds GPT-5.3-Chat-Latest for public side-by-side testing
Text Arena (Arena): Arena says the latest GPT-5.3-Chat-Latest snapshot is now available in Text Arena for side-by-side battles and voting, as announced in the arena listing.
This is mainly a measurement surface update: it creates a public, prompt-driven way to compare the new snapshot against other frontier models using “your real prompts,” without relying on vendor-reported evals.
ValsAI benchmarks Gemini 3.1 Flash-Lite: cheaper per test, weaker on coding
Gemini 3.1 Flash Lite (ValsAI): ValsAI reports Gemini 3.1 Flash Lite lands 15/20 on its multimodal index and 22/31 on its broader index; they call out strong cost-per-test economics (example: ~$0.07 vs ~$0.37 for a larger Gemini variant on one benchmark) but weaker coding placements and even a 0% on “Vibe Code Bench,” per the benchmark summary and the cost comparison note.
This is a useful counterweight to headline speed charts: it quantifies cost-per-eval run and shows where fast small models still struggle under app-building style harnesses.
📄 Docs, parsing, and retrieval plumbing: PDF→Markdown, agentic doc processing, and web search infra
This beat is about turning messy documents and web sources into reliable context for agents: PDF parsing to Markdown, agentic document processing positioning, and specialized web search partnerships. Excludes general agent skills unless they’re retrieval-focused.
Firecrawl’s Rust PDF parser converts long PDFs into Markdown in seconds
Firecrawl: Firecrawl demoed a new Rust-based PDF parser that converts 200+ page PDFs (text + charts + graphs) into clean Markdown “in seconds,” aimed at feeding structured text into downstream RAG/agent pipelines as shown in the Rust parser demo. This is a concrete move toward treating PDFs as first-class ingestion sources instead of a brittle preprocessing step.

• What it targets: Earnings calls, whitepapers, and market research PDFs are explicitly called out in the Rust parser demo, which is where most teams see the worst “layout turns into garbage text” failure modes.
• How you try it: They also point to a web playground where you can parse PDFs directly from the web, per the Playground link and its associated Playground.
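A hedged usage sketch against Firecrawl’s published Python SDK (signatures differ across firecrawl-py versions, and whether the new Rust parser is on by default for PDF inputs is an assumption from the demo):

```python
from firecrawl import FirecrawlApp  # pip install firecrawl-py

app = FirecrawlApp(api_key="fc-...")

# Hypothetical 200+ page PDF; 'formats' follows the v1 SDK pattern.
doc = app.scrape_url(
    "https://example.com/earnings-call.pdf",
    params={"formats": ["markdown"]},
)
print(doc.get("markdown", "")[:500])  # clean Markdown for a RAG/agent pipeline
```

The point of Markdown output is that headings, lists, and tables survive as structure a downstream chunker can split on, instead of flattened layout text.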
Weaviate 1.36 introduces HFresh, pushing vector search onto disk
Weaviate 1.36 (Weaviate): Weaviate shipped HFresh, positioning it as an alternative to HNSW when “everything in memory” becomes too expensive—HFresh partitions vectors into disk postings while keeping a small centroid index in RAM (a toy version of this split is sketched after the bullets), aiming for predictable latency at very large scale as outlined in the Release details. This is directly about lowering retrieval infra cost without periodic full index rebuilds.
• Index maintenance model: They claim incremental background rebalancing keeps the index “fresh” instead of doing rebuild cycles, per the Release details.
• Other retrieval-adjacent ops upgrades: The same release pushes server-side batching, object TTL, and async replication GA, per the Release details.
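A toy version of the described split, centroids in RAM and postings elsewhere (this is the generic IVF-style pattern, not Weaviate’s actual HFresh implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100_000, 128)).astype(np.float32)

k = 256
centroids = vectors[rng.choice(len(vectors), k, replace=False)]  # RAM-resident
assignments = np.argmax(vectors @ centroids.T, axis=1)           # build-time
np.save("postings.npy", assignments)  # stand-in for on-disk postings

def search(query: np.ndarray, nprobe: int = 8) -> np.ndarray:
    near = np.argsort(-(query @ centroids.T))[:nprobe]        # scan tiny RAM index
    candidates = np.flatnonzero(np.isin(assignments, near))   # touch few postings
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:10]]

print(search(rng.normal(size=128).astype(np.float32)))
```

The RAM footprint scales with the centroid count rather than the corpus, which is the cost argument; incremental rebalancing is what keeps the partition assignments from going stale without full rebuilds.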
Harvey integrates Parallel web search for international legal context
Parallel × Harvey: Harvey is integrating Parallel’s web search to retrieve “accurate, relevant, and fresh” public context for legal workflows, with the partnership also building a specialized index of hard-to-reach international legal domains to expand coverage to “60+ countries,” according to the Partnership note and the linked Collaboration blog. This is a clear signal that legal AI vendors are treating web search coverage quality as a core dependency.
• Why it matters operationally: Legal retrieval needs jurisdictional breadth; the “specialized index” detail in the Partnership note suggests bespoke crawling/allowlists rather than generic search APIs.
• Integration shape: The announcement frames it as embedded across Harvey’s platform, not a standalone tool, per the Partnership note.
LlamaIndex leans into agentic document processing over RAG abstractions
LlamaIndex: LlamaIndex is explicitly repositioning from “RAG framework” to an agentic document processing platform, arguing the durable value is extracting high-quality context from messy containers (PDF/Office) and supporting long-running agent loops, as described in the Positioning post and elaborated in the Strategy blog post. One short implication is that they’re de-emphasizing general LLM abstractions in favor of deeper OCR/layout/document tooling.
• Product emphasis: They frame themselves as “best in class OCR module” and anchor the managed platform around LlamaParse, per the Positioning post.
• Reason this is happening: The post claims retrieval patterns have changed because agent reasoning loops are longer and more iterative; the accompanying diagram mirrors that pipeline mindset.
⚙️ Self-hosting & efficiency: quantized weights, local fine-tuning, and runtime support signals
Today’s systems/inference items emphasize doing more with less VRAM—quantized weights and “train locally” recipes—plus community reimplementations for learning. This is about runtime/fine-tune practicality, not model marketing.
Unsloth shows Qwen3.5-2B LoRA fine-tuning on 5GB VRAM with a free notebook
Unsloth (Qwen3.5 fine-tuning): Unsloth published a free notebook for LoRA fine-tuning Qwen3.5-2B locally with ~5GB VRAM, claiming 1.5× faster training and ~50% lower VRAM than typical FA2 setups, as described in the Notebook announcement and the Fine-tuning guide (a minimal setup sketch follows the bullets below).
• Practical floor: The same post lists rough VRAM targets across sizes (for example, 0.8B at ~3GB; 2B at ~5GB; 4B at ~10GB), as shown in the Notebook announcement.
• Deployment handoff: It also calls out exporting to GGUF/vLLM after tuning, per the GitHub repo.
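A minimal setup sketch using Unsloth’s published API; the model repo id is a stand-in, so take the exact name from the notebook:

```python
from unsloth import FastLanguageModel  # pip install unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3.5-2B",  # hypothetical repo id
    max_seq_length=2048,
    load_in_4bit=True,  # what keeps the ~5GB VRAM floor plausible
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# ...train with your SFT trainer of choice, then export to GGUF/vLLM per the guide.
```

The LoRA-on-4bit combination is where the VRAM savings come from: frozen quantized base weights plus a small set of trainable adapter matrices.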
Qwen 3.5 GPTQ-Int4 drops with native vLLM and SGLang support
Qwen 3.5 (Alibaba Qwen): Alibaba released GPTQ-Int4 weights for the Qwen 3.5 series with native vLLM and SGLang support—positioned as “less VRAM, faster inference” for constrained GPU setups, per the Quantized weights announcement and the linked Hugging Face collection. This is a practical packaging step for teams trying to run Qwen-derived services on smaller instances without rewriting their serving stack.
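Serving the quantized weights should be close to a one-liner in vLLM, which auto-detects GPTQ from the model config; the repo id below is illustrative, so take the exact name from the collection:

```python
from vllm import LLM, SamplingParams  # pip install vllm

llm = LLM(model="Qwen/Qwen3.5-7B-GPTQ-Int4")  # hypothetical repo id
outputs = llm.generate(
    ["Explain GPTQ int4 quantization in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Int4 roughly quarters weight memory versus fp16, which is the “less VRAM, faster inference” claim in concrete terms.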
Model routing talk shifts toward a “cognitive spot market” for inference
Model routing economics: A “models are commodities” argument is reappearing, framed as an emerging “cognitive spot market” where apps should dynamically hot-swap providers based on spot price/latency (rather than hard-coding one API key), as laid out in the Routing thesis thread.
The concrete claim is that agents won’t care about brand—only whether outputs clear a success threshold at low cost—while the defensibility moves toward the routing/control plane, as illustrated by the Routing thesis thread.
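A toy router in the spirit of the thesis: pick the cheapest provider whose quality estimate clears the task’s bar (all names and numbers below are made up):

```python
PROVIDERS = [  # illustrative spot-market quotes, not real prices
    {"name": "provider-a", "usd_per_mtok": 1.75, "quality": 0.92},
    {"name": "provider-b", "usd_per_mtok": 0.25, "quality": 0.74},
    {"name": "provider-c", "usd_per_mtok": 14.0, "quality": 0.97},
]

def route(task_quality_bar: float) -> str:
    ok = [p for p in PROVIDERS if p["quality"] >= task_quality_bar]
    if not ok:
        raise RuntimeError("no provider clears the bar; escalate or decompose")
    return min(ok, key=lambda p: p["usd_per_mtok"])["name"]

print(route(0.70))  # commodity task routes to the cheapest model
print(route(0.95))  # only the expensive frontier model clears the bar
```

The defensibility claim then reduces to who owns the quality estimates and the routing policy, not who serves the tokens.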
Raschka’s from-scratch Qwen3.5 reimplementation highlights hybrid attention + KV cache
LLMs-from-scratch (rasbt): A small, educational “from scratch” reimplementation of Qwen3.5 (0.8B) is being shared as a readable reference, including notes on hybrid linear/full attention and KV-cache decoding, per the Repo pointer and the linked GitHub repo. It’s a useful artifact for engineers who want to understand (or re-derive) Qwen-style efficiency tricks without treating the Transformers implementation as a black box.
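The KV-cache trick the repo walks through, in miniature (a generic single-head sketch, not Raschka’s code):

```python
import torch

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache: list[torch.Tensor] = []
v_cache: list[torch.Tensor] = []

def decode_step(x: torch.Tensor) -> torch.Tensor:
    """x is the embedding of the newest token only; history lives in the cache."""
    q = x @ W_q
    k_cache.append(x @ W_k)  # append one row instead of recomputing history
    v_cache.append(x @ W_v)
    K, V = torch.stack(k_cache), torch.stack(v_cache)  # (t, d) each
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
    return attn @ V

for _ in range(5):  # five autoregressive steps, O(t) work per step
    out = decode_step(torch.randn(d))
print(out.shape)  # torch.Size([64])
```

In hybrid linear/full attention designs, the linear layers keep a constant-size state instead of a growing KV cache, which is the efficiency angle the repo’s notes point at.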
🎓 Courses, meetups, and events that shape the builder ecosystem
Today includes multiple education/event distribution nodes: agent reliability courses, local meetups, and major eval/AGI events. These matter because they drive shared practices and talent flow.
LangChain Academy adds a free “Building Reliable Agents” course
LangChain Academy (LangChain): LangChain announced a free course on taking an agent from “first run” to a production-ready system via iterative observe→evaluate→improve loops in LangSmith, as described in the Course announcement.

• What it covers: Production reliability framing (non-determinism, tool use, multi-step reasoning, real user traffic) and an iteration workflow built around LangSmith observability/evals, according to the Course announcement.
The practical value is the emphasis on debugging/measurement mechanics, not prompt tricks.
Zed and JetBrains set a London talk on ACP and coding-agent interoperability
ACP event (Zed × JetBrains): Zed announced a March 11 London event focused on ACP (Agent Client Protocol)—how to use it in IDEs, how to build an ACP client, and where coding-agent interoperability is headed, per the Event announcement and the linked Event page.
This is one of the few event nodes explicitly centered on cross-IDE agent protocol standardization rather than a single vendor workflow.
LangChain hosts a San Francisco meetup on “Deep Agents” (Python OSS)
Deep Agents meetup (LangChain): LangChain announced an in-person SF event on March 4 featuring Sydney Runkle (LangChain Python OSS) and a moderated discussion on “Deep Agents,” per the Meetup announcement.
The agenda callout includes task planning, contextual file systems, subagent spawning, and long-term memory, as listed in the Meetup announcement.
Nous Research, Prime Intellect, and Hillclimb host a Guinness meetup at GTC
OSS AI @ GTC (Nous Research × Prime Intellect × Hillclimb): Nous Research announced an in-person meetup on March 18 (6–9 PM PDT) framed as an OSS AI social, with registration/approval details in the Event invite and the linked RSVP page.
The invite reads as a community consolidation point around open-model + open-infra builders coinciding with GTC week.
OpenClaw schedules a London meetup hosted by OpenAI and Sequoia
OpenClaw meetup (OpenClaw): A London OpenClaw meetup was shared, framed as featuring Peter Steinberger plus Codex team demos and a fireside/Q&A, per the Meetup share and the linked Meetup page.
The Luma listing describes limited capacity with an approval flow, suggesting it’s intended as a high-signal community gathering rather than a large public conference.
Parallel, Deepchecks, and Snowflake schedule an “Evals & AI Agents” meetup
Evals & AI Agents meetup (Parallel × Deepchecks × Snowflake): Parallel announced an in-person “Build & Debug” meetup on March 12 in Menlo Park, positioning it around practitioner discussions for agent evaluation and debugging, as posted in the Meetup announcement with logistics in the Meetup details.
This is one of the clearer “production agent ops” event hooks in today’s stream, focusing on evals and failure analysis rather than model releases.
AI+ Renaissance Summit (SF) pushes early-bird ticket deadline
AI+ Renaissance Summit (AI+): A promo post flagged that early-bird tickets for the March 15 San Francisco summit are “running out in 48 hours,” per the Early-bird promo and the linked Summit page.
The event positioning is broad (multiple tracks and demos); the tweet itself doesn’t enumerate a technical agenda beyond the conference framing.
Cursor announces a Stockholm community meetup on March 16
Cursor meetup (Cursor): Cursor community organizers announced a Stockholm meetup on March 16 with demos of “new things we’re working on” and a request for feedback, per the Meetup announcement and the linked Meetup RSVP.
No speaker list or agenda details were included in the tweet beyond the demo/feedback framing.
ClawCon Madrid event page circulates for an OpenClaw community meetup
ClawCon Madrid (OpenClaw community): ClawCon Madrid was promoted as a community event, with the public details living on the linked page in the Event share and the Event page.
The listing emphasizes “show and tell” style community participation over a formal speaker roster.
IPAM (UCLA) posts a program on AI for math and theoretical physics
IPAM program (UCLA): A link circulated to IPAM’s event “Accelerating Math and Theoretical Physics with AI,” with program details on the linked page, as amplified in the Event link.
The public artifact is the event page itself—see the Program page for schedule and speakers.
🏢 Enterprise adoption & market signals: Anthropic surge, spend-share shifts, and platform positioning
The business signal today is Anthropic’s apparent surge in enterprise spend and revenue run-rate metrics, plus broader platform positioning for enterprise agents (content/context as the bottleneck). This is about adoption and dollars, not product changelogs.
Bloomberg says Anthropic nears a $20B annual revenue run-rate
Anthropic (Bloomberg): A Bloomberg-cited metric making the rounds claims Anthropic’s annualized revenue run-rate rose from ~$9B to ~$19B in about three months, as summarized in Bloomberg recap and restated with month-by-month figures in Run-rate timeline.
• Source and framing: The “Pentagon feud” context is explicitly called out in Bloomberg’s headline, as captured in Bloomberg screenshot, with the primary article linked in Bloomberg report.
• Important footnote: Multiple posts emphasize this is “annualized run-rate,” not realized revenue, per the Run-rate clarification.
Ramp data shows Anthropic taking the lead in U.S. business AI chat spend
AI subscription spend (Ramp Economics Lab): Card/bill-pay data shared today shows a steep shift in U.S. business AI chat subscription spend toward Anthropic—one charted claim pegs the swing from “ChatGPT held 90%” (Feb 2025) to “Claude ~70%” (Feb 2026), as shown in the Ramp share chart.
• Enterprise mix nuance: Posts note OpenAI may still lead on “business count” while Anthropic captures larger spenders, per the Spend vs count note.
• Adjacent signal (API spend): The same Ramp thread asserts Anthropic also commands a majority of API spend by U.S. businesses, according to the Spend vs count note and the follow-up pointer in API spend follow-up.
Appfigures shows Claude overtaking ChatGPT in daily U.S. mobile downloads
Mobile adoption (Appfigures): Following up on Uninstall surge (ChatGPT uninstall spike narrative), an Appfigures chart now being reposted shows Claude’s daily U.S. first-time downloads rising through Feb 2026 and crossing above ChatGPT at month-end, as shown in the Appfigures downloads chart.
• Store rank claims: One RT also claims Claude hit #1 in the Google Play Store, per the Play Store rank claim.
• Confounders are real: The same image bundle also includes the “ChatGPT uninstalls surged by 295% after DoD deal” headline, so the adoption signal and the news cycle are entangled in the same shared artifact, the Appfigures downloads chart.
Report: OpenAI is building a GitHub alternative
Code hosting (OpenAI): A Business Insider screenshot and follow-on summaries claim OpenAI is developing a code repository product positioned as an alternative to Microsoft’s GitHub, as shown in the Business Insider screenshot and echoed in the Project summary.
• What’s concrete so far: The project is described as early-stage and “likely months” from completion, with internal discussion of potentially selling access to OpenAI customers, per the Project summary.
If accurate, this is a strategic move “up the stack” from coding agents into the collaboration layer where repos, permissions, and CI live.
Box frames “files” as the enterprise bottleneck for agents
Enterprise context (Box): Box’s CEO argues that as orgs scale to “100× more agents than people,” unstructured files become the practical substrate agents read/write/share—and that agent deployments need the same governance primitives as employees (access controls, auditability, logging), per the detailed Box earnings thread.
• Platform positioning: Box explicitly positions itself as a “file system for AI,” listing integrations across Claude Cowork, OpenClaw, ChatGPT, Perplexity, Cursor, Copilot, IBM watsonx, and others in the Box earnings thread.
• Interfaces that matter to builders: The same post calls out Box APIs plus an MCP server and CLI support as the connective tissue for agent stacks, as stated in Box earnings thread; a minimal client-side sketch follows below.
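For orientation, here is what “MCP as connective tissue” looks like from the agent side. This is a minimal sketch using the open MCP Python SDK; the server command is a hypothetical placeholder, not Box’s published interface.

```python
# Minimal sketch: an agent discovering tools on a file-platform MCP server.
# Uses the open MCP Python SDK; the server command below is a hypothetical
# placeholder, not Box's published interface.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Hypothetical: launch a local MCP server that fronts a file platform.
    server = StdioServerParameters(command="file-platform-mcp", args=["--stdio"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Governance hook: log which tools this agent can reach, in the
            # spirit of the access-control/auditability framing above.
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())
```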
Similarweb chart shows Claude.ai leading Grok and DeepSeek in daily web visits
Web traffic (Similarweb): A Similarweb chart circulated today shows claude.ai’s daily visits surpassing grok.com and deepseek.com for the first time in late February 2026, with a single-day peak of ~14M visits, as shown in the Similarweb line chart.
The tweet frames this as a milestone in consumer pull (not enterprise procurement), but it’s still a useful demand proxy while Claude’s own team reports scaling strain elsewhere in the ecosystem.
🎥 Generative media & creative tooling: faster image/video, SVG/vector workflows, and shareability
Generative media posts are a meaningful slice today: image/video model speedups, SVG/vector generation pipelines, and productized sharing/export flows. This category is reserved so creative tooling doesn’t get dropped on engineering-heavy days.
FLUX.2 [pro] doubles generation speed with no price increase
FLUX.2 [pro] (Black Forest Labs): FLUX.2 [pro], which Black Forest Labs calls its most-used image model, is now 2× faster while holding quality and pricing constant, per the rollout note in Speed announcement. This is a pure throughput win, and it directly changes iteration loops for teams doing high-volume creative exploration.
Qwen-Image-2.0 arrives on fal with 2K generation and unified edit endpoints
Qwen-Image-2.0 (fal): fal added Qwen-Image-2.0 as a unified text-to-image + image-editing surface, pitching native 2K output and stronger text rendering, as announced in fal launch note and expanded with direct endpoints in Endpoint list. Pricing is concrete: the fal model pages list $0.035/image for standard and $0.075/image for Pro, as shown in the Text-to-image page and Pro model page.
They’re exposing separate routes for editing vs generation, plus Pro variants, which makes it easier to A/B “cheap drafts” vs “ship quality” within the same API surface.
OmniLottie proposes end-to-end vector animation generation using Lottie tokens
OmniLottie (research): OmniLottie introduces a generator for vector animations using parameterized Lottie tokens, and ships a large-scale dataset (MMLottie-2M, 2 million annotated Lottie animations) plus an evaluation protocol, as summarized in Paper highlight and detailed on the Paper page. This targets workflows where raster video is the wrong artifact and teams need editable vectors.

QuiverAI Arrow adds shareable SVG links and drawing-process video exports
Arrow (QuiverAI): QuiverAI shipped product features that turn SVG generation into something you can hand off—public viewer links plus exports that show the drawing process as a video, as described in Feature list and reiterated in Release note. This also adds control over how many candidates you get per prompt (1–4 SVGs), which changes how people do design-space exploration.
A separate usage signal shows people already using Arrow for “rough sketch → technical drawing” workflows, per Use case thread.
A practical 2D→3D→game loop using Nano Banana 2 textures and Tripo models
Prototype pipeline: A concrete workflow for generating a playable demo combines Nano Banana 2 for tiles/textures, Tripo for 2D→3D asset conversion, and an agentic coding loop to wire it into a game, as laid out in Step-by-step playbook and demonstrated in Open-source game demo. It’s a full asset loop, not a single image prompt.

Grok Imagine adds an “Extend video” option as the UI continues to change
Grok Imagine (xAI): Grok’s image/video surface now shows an “Extend video” action in the UI on web and mobile, per the interface capture in Extend video UI. This follows a broader Grok Imagine UI revamp mentioned in UI overhaul clip. It’s a product affordance change, but it directly affects how teams iterate on short clips.
NotebookLM adds one-click style presets for its infographic generator
NotebookLM (Google): NotebookLM’s infographic generator now supports Custom Styles via one-click presets (Clay, Brick, Editorial, Kawaii) and also lets you define a custom style, according to Custom styles demo. It’s a UI-level change, not a model change. Still, it shifts prompt burden into a repeatable “style switch.”

Video Arena adds PixVerse V5.6; Runway Gen‑4.5 appears mid-pack on leaderboard
Video Arena (Arena): Arena says PixVerse V5.6 is available for generation and head-to-head voting in Video Arena, per Model availability note. Separately, Arena leaderboard posts show Runway Gen‑4.5 landing around #15 with an Elo of 1218, per Leaderboard screenshot.
Nano Banana 2 prompt pattern: short, specific sentences beat vague prompts
Nano Banana 2 (prompting): A recurring practitioner tip is that Nano Banana 2 responds better to short, specific sentences than broad, “vague” prompts, with a side-by-side example shared in Prompting tip. This is less about artistry and more about controlling failure modes (subject drift and over-interpretation).
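As a concrete illustration of the pattern (these prompts are invented to show the shape, not taken from the shared example):

```python
# Invented illustration of the tip above: the specific version pins subject,
# action, and camera, leaving less room for drift or over-interpretation.
vague_prompt = "a cool futuristic city at night, amazing lighting, very detailed"

specific_prompt = (
    "A rain-wet street at dusk. Neon signs reflect in puddles. "
    "One cyclist crosses the frame left to right. 35mm lens, shallow depth of field."
)
```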
Nano Banana 2 template: photoreal “custom PEZ dispenser” with POV hand framing
Nano Banana 2 (prompting): A detailed prompt template is circulating for turning a profile photo into a photoreal POV shot of a custom PEZ dispenser (hand-in-frame, shallow depth of field, “premium product photo” cues), per Prompt recipe. Example outputs are shown in Result collage.
🖲️ Hardware & acceleration for agents: Apple M5, GB300 inference jumps, and on-device multimodal
Hardware news today is about running more agents per box and pushing inference throughput: Apple’s M5 Pro/Max performance claims, and NVIDIA Blackwell Ultra/GB300 inference acceleration via SGLang. Included because it directly impacts AI deployment economics.
SGLang and NVIDIA claim 25× throughput on GB300 NVL72 vs H200 for MoE serving
SGLang (LMSYS) + NVIDIA: LMSYS reports 25× throughput on GB300 NVL72 (Blackwell Ultra) versus H200 on InferenceXv2 at 50 TPS/user, plus an 8× gain on GB200 NVL72 “in under 4 months.” The jump is attributed to NVFP4 GEMM optimizations for MoE reasoning models, tuned compute–communication overlap, and deeper integration with NVIDIA Dynamo for disaggregated inference, according to the results summary in Performance claims and the longer writeup in the performance blog.
The practical signal is that software/kernel work is increasingly the difference between “same GPU” and “different business,” especially for multi-user interactive agent serving where TPS/user constraints dominate.
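A back-of-envelope calculation makes the economics concrete. Only the 50 TPS/user floor comes from the benchmark setup above; the aggregate throughput number is illustrative:

```python
# Back-of-envelope: how many interactive users a system holds at a fixed
# per-user decode floor. Illustrative aggregate number; only the 50 TPS/user
# constraint comes from the benchmark setup above.
TPS_PER_USER = 50                # interactive floor from the benchmark
aggregate_decode_tps = 500_000   # hypothetical system-wide tokens/s

concurrent_users = aggregate_decode_tps // TPS_PER_USER
print(f"{concurrent_users:,} concurrent users at {TPS_PER_USER} tok/s each")
# A 25x throughput jump at the same per-user floor means ~25x more users on
# the same rack footprint -- the "same GPU, different business" point.
```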
Apple’s M5 Pro/Max pushes memory bandwidth for local agent workloads
M5 Pro/Max (Apple): Apple announced M5 Pro and M5 Max using a “Fusion Architecture” that merges two 3nm dies into one SoC; it claims up to 30% faster CPU and 4× peak GPU compute for AI versus the prior generation, alongside 614GB/s unified memory bandwidth and configurations up to an 18‑core CPU and 40‑core GPU, as described in Specs recap and detailed in the Apple newsroom post.
• Why this matters for agents: Higher unified-memory bandwidth and larger on-package memory ceilings tend to be the binding constraint for local multi-agent runs (KV cache, long contexts, embeddings, tool sandboxes). That makes the 614GB/s figure the spec engineers will map to “how many concurrent agents fit,” rather than the CPU headline, as reflected in the “fit so many agents” framing in Store comparison image; a back-of-envelope sketch follows below.
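A minimal sketch of that mapping, assuming decode is memory-bandwidth-bound (each generated token streams the active weights through memory once). The model size and per-agent rate are invented; only the 614GB/s figure comes from Apple’s specs:

```python
# Rough decode ceiling for a bandwidth-bound local model. The weights size
# and per-agent rate are hypothetical; only 614 GB/s is from Apple's specs.
BANDWIDTH_GB_S = 614    # M5 Pro/Max unified memory bandwidth
weights_gb = 4.5        # hypothetical: ~8B params quantized to ~4-bit

tokens_per_sec = BANDWIDTH_GB_S / weights_gb   # one weight pass per token
print(f"~{tokens_per_sec:.0f} tok/s ceiling for a {weights_gb} GB model")

# Naive concurrency estimate at 20 tok/s per agent (ignores KV-cache traffic,
# prefill, and batching effects, all of which shift the real number):
print(f"~{int(tokens_per_sec // 20)} concurrent agents on paper")
```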
On-device multimodal on iPhone: 6-bit MLX demo with a reasoning toggle
On-device multimodal (MLX/iPhone): A shared demo claims “native multimodal AI” is running on iPhones via Apple’s MLX stack, using a 6‑bit model with “strong visual understanding” and an explicit reasoning on/off toggle to manage battery use, per the clip and description in On-device demo.

This is a concrete UX pattern for edge agents: expose reasoning effort as a product control (not just an API parameter), so interactive assistants can trade latency/energy for depth at runtime—especially when vision is in the loop.
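One way to productize that control, as a sketch: map the toggle plus device state to a generation budget at request time. All names here are invented for illustration, not MLX’s or any shipped app’s API.

```python
# Sketch of "reasoning effort as a product control": a UI toggle plus device
# state pick the generation budget at runtime. All names are invented for
# illustration -- this is not MLX's or any specific app's API.
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    max_new_tokens: int
    thinking_enabled: bool

def config_for(reasoning_on: bool, on_battery: bool) -> GenerationConfig:
    # Let users trade latency/energy for depth; degrade automatically on
    # battery so vision-heavy turns stay responsive.
    if reasoning_on and not on_battery:
        return GenerationConfig(max_new_tokens=2048, thinking_enabled=True)
    return GenerationConfig(max_new_tokens=512, thinking_enabled=False)
```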
📚 Research & technical findings: multi-agent agreement limits, math automation, and robot memory
Research threads today cover multi-agent coordination reliability, math discovery/formalization narratives, and embodied memory systems for long-horizon robotics tasks. Focuses on papers/technical reports rather than product launches.
PI’s MEM adds short-term visual memory and long-term text memory for ~15-minute robot tasks
Multi-Scale Embodied Memory (PI): PI described MEM, a memory system that combines short-horizon visual memory (via an efficient video encoder) with long-horizon semantic memory stored as text summaries, so robots can execute longer, multi-stage activities without stuffing minutes of video into context, as shown in the memory system thread and multi-scale memory details.

They report demonstrations like cleaning a kitchen and making a grilled cheese end-to-end, framing memory as a way to prevent repeated failure modes and time/step drift, per the MEM overview and grilled cheese example. The writeup links a full evaluation package via the Research paper and the Project page.
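The two-tier structure is easy to picture in code. A minimal sketch assuming the shape described above (rolling visual window plus append-only text log); this is a reconstruction from the thread, not PI’s implementation:

```python
# Two-scale memory in the spirit of MEM: a short rolling window of visual
# features plus a long-horizon text log. Reconstructed from the description
# above -- not PI's actual implementation.
from collections import deque

class TwoScaleMemory:
    def __init__(self, short_horizon_frames: int = 64):
        self.visual = deque(maxlen=short_horizon_frames)  # recent frame features
        self.semantic: list[str] = []                     # stage summaries as text

    def observe(self, frame_features) -> None:
        self.visual.append(frame_features)   # short-term, auto-evicting

    def summarize_stage(self, summary: str) -> None:
        # e.g. "counter wiped; pan preheating; bread still bagged"
        self.semantic.append(summary)

    def context(self) -> dict:
        # Only compact state reaches the planner prompt, so minutes of raw
        # video never enter context.
        return {"recent_frames": list(self.visual), "log": self.semantic}
```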
An agent reportedly formalized a Fields Medal proof in Lean, producing ~200K lines of code
Proof formalization (IEEE Spectrum): A report describes an AI agent (“Gauss”) completing the formalization of a modern Fields Medal–winning proof into Lean code—finishing the remaining 8D work in ~5 days and then formalizing the 24D proof in ~2 weeks, generating roughly 200K lines and catching a typo, as summarized in the IEEE story recap.
Operationally, this is a scale marker for “agent + proof assistant” pipelines: the claimed value isn’t new theorems from scratch, but turning human-written arguments into mechanically checked artifacts fast enough to change the throughput of verification work, per the IEEE story recap.
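For a sense of what the artifact is, each of those ~200K lines is machine-checked Lean. A toy Lean 4 example (illustrative only, unrelated to the actual proof):

```lean
-- Toy Lean 4 statement; the checker accepts it only if the proof term
-- actually establishes the claim. Illustrative, unrelated to "Gauss."
theorem add_comm' (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```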
Donald Knuth documents Claude Opus 4.6 solving a cycles problem and the workflow behind it
Claude’s Cycles (Donald Knuth): Knuth circulated a PDF recounting how Claude Opus 4.6 solved a directed Hamiltonian-cycle decomposition problem he’d been working on, while also documenting a highly structured “literate exploration” workflow (e.g., “After EVERY exploreXX.py run, update plan.md”), as shown in the document screenshot and shared via the Knuth PDF.
The interesting engineering artifact isn’t just the result; it’s the operational discipline: Claude reformulates the problem, tries constrained hypothesis classes, runs experiments, and maintains a running plan log—an explicit template for how to run long-horizon math search with an agent, per the document screenshot.
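A minimal harness for that loop, reconstructed from the quoted convention (the file names mirror the PDF; the harness itself is an assumption):

```python
# Sketch of the "literate exploration" loop: run a numbered experiment script,
# then append its findings to the running plan. File names mirror the quoted
# convention; the harness is our reconstruction, not Knuth's or Claude's code.
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def run_and_log(n: int) -> None:
    script = f"explore{n:02d}.py"
    result = subprocess.run(
        ["python", script], capture_output=True, text=True, check=False
    )
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    # "After EVERY exploreXX.py run, update plan.md"
    with Path("plan.md").open("a") as plan:
        plan.write(f"\n## {script} ({stamp})\n")
        plan.write(result.stdout[-2000:])  # keep the tail of the output
        if result.returncode != 0:
            plan.write(f"\nFAILED (exit {result.returncode})\n")
```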
Byzantine consensus games suggest LLM-agent agreement is fragile even without adversaries
Multi-agent consensus (ETH Zurich): A new paper testing LLM-based agents on Byzantine consensus games finds that valid agreement is unreliable even in benign settings and degrades with larger groups; most breakdowns come from convergence stalls and timeouts, not subtle value corruption, as summarized in the paper thread.
This is a direct warning for teams building multi-agent coordination loops: “consensus” isn’t an emergent property to assume—it’s an engineered component with failure modes that look like deadlock rather than bad values, per the same paper thread.
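To see why the failures look like deadlock, consider a toy agreement loop with noisy voters; runs that exhaust the round budget are “stalls,” not value corruption. This stand-in is ours, not the paper’s protocol:

```python
# Toy illustration of consensus stalls: noisy voters either adopt the current
# majority or dither. Hitting max_rounds is a timeout-style failure -- the
# deadlock mode the paper reports. A stand-in, not the paper's protocol.
import random

def consensus_round(agents: int, dither_prob: float = 0.3, max_rounds: int = 20):
    votes = [random.choice(["A", "B"]) for _ in range(agents)]
    for round_no in range(1, max_rounds + 1):
        majority = max(set(votes), key=votes.count)
        votes = [
            majority if random.random() > dither_prob else random.choice(["A", "B"])
            for _ in range(agents)
        ]
        if votes.count(majority) == agents:
            return round_no, majority
    return None, None  # stall: no agreement within the round budget

rounds, value = consensus_round(agents=9)
if rounds:
    print(f"converged in {rounds} rounds on {value}")
else:
    print("stalled: timeout without agreement")
# Larger `agents` makes unanimous rounds rarer, mirroring the reported
# degradation with group size.
```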
Theory of Mind scaffolding in LLM multi-agent systems helps unevenly and depends on the base model
Theory of Mind MAS (Warsaw Univ. of Technology): Research on multi-agent LLM architectures that add Theory of Mind plus belief–desire–intention (BDI) state and symbolic verification reports that these cognitive mechanisms don’t reliably improve coordination on their own; effectiveness depends heavily on the underlying model’s capabilities, as described in the paper thread.
The work is a concrete datapoint against “just add ToM” recipes for agent teams; it positions verification and belief modeling as tools that can also add overhead or brittleness when the base model can’t exploit them, per the paper thread.
🧭 Talent moves & builder sentiment: Qwen exits, lab hopping, and the mental cost of AI workflows
Today’s culture/news is the people layer: notable departures from Qwen, high-profile moves between OpenAI and Anthropic, and engineers describing motivational/UX friction in AI-assisted development. This category exists because the discourse itself is the signal.
Qwen core departures land right after the Qwen3.5 Small release
Qwen (Alibaba): Several prominent Qwen engineers posted public goodbyes within hours of the Qwen3.5 Small rollout, following up on Qwen3.5 Small release, and triggered broad speculation about internal restructuring and continuity risk for the open-model pipeline. The highest-signal post is Junyang Lin’s “me stepping down. bye my beloved qwen.” in departure note, with additional departures summarized in departure thread screenshot and multiple exits recap.
• Who left (publicly): Junyang Lin stepped down per departure note; Kaixin Li posted “Signing off from @Alibaba_Qwen” as shown in departure thread screenshot; other posts point to Binyuan Hui also being “Formerly MTS @Alibaba_Qwen” per profile change screenshot.
• Ecosystem impact framing: Open-weights builders worry about losing a key “small models” backbone, with “The small models are irreplaceable” stated in ecosystem concern and repeated in translation follow-up.
What’s still unknown is whether this is isolated churn or a deeper reorg; the tweets don’t include an official Alibaba/Qwen statement.
Reported OpenAI→Anthropic move: VP of post-training research joins Anthropic
Anthropic (Talent): Multiple accounts claim Anthropic hired OpenAI’s VP of post-training research, with the move described as “Anthropic just hired the VP of post-training research away from OpenAI” in hire claim and echoed in industry recap formats like news roundup snippet.
The same day also brought looser, sentiment-heavy framing that a reasoning-model researcher left OpenAI for Anthropic (“all it takes is a conscience”), per ethics framing, and a named version of the claim (“Max Schwarzer changes teams”) in named move claim.
The tweets don’t provide a primary-source announcement from either company, so treat the identity and timing as provisional until corroborated by a first-party statement or press report.
Some engineers report AI prompting friction replacing the “payoff” of coding
Developer workflow sentiment: One thread captures a common productivity/morale complaint in AI-assisted coding—traditional deep work felt rewarding, but “back and forth prompting seems to eat at my soul,” with an explicit call to “find a balance” in motivation friction.
This is less about model capability and more about day-to-day ergonomics: as agents increase throughput, the work can shift from building to supervising, and some builders describe that as emotionally flattening rather than energizing, per motivation friction.
Anecdotes suggest AI is pushing some SWEs toward electrical engineering work
Career drift: A mini-meme emerged that “software engineers unilaterally solved software engineering with AI” and are now moving into electrical engineering, per swe-to-ee post, alongside the sharper claim that the real risk is other disciplines losing jobs to “software engineers using AI” in discipline displacement take.
It’s a qualitative signal, not labor data. Still, it’s an early indicator that agent tooling may be changing what ambitious generalists choose to learn next.