OpenAI Codex Windows app ships – 9% promo-limits bug fixed, 26.304.1143 hotfix

The Codex app is now on Windows. Get the full Codex app experience on Windows with a native agent sandbox and support for Windows developer environments in PowerShell. developers.openai.com/wendows

5:47 PM · Mar 4, 2026

3.9K

Read 344 replies

Codex (OpenAI): OpenAI says a bug prevented the promised 2× promotional increase in limits from applying to an estimated 9% of Codex Plus and Pro users, per the limits bug disclosure.

They report the issue is fixed and that they’re resetting the rate limit for all Plus and Pro users to compensate, with additional confirmation in the rate limit reset follow-up.

Tibo

@thsottiaux

We caught an issue that was causing the 2X promotional increase in limits to not be applied to an estimated 9% of plus and pro users for Codex. We have now fixed this issue and are reseting the rate limit for all plus and pro users to compensate. Apologies and thank you for the Show more

9:30 PM · Mar 4, 2026

1.7K

Read 188 replies

Codex on Windows adds an “Add Target” handoff to IDEs and dev tools

Codex app (OpenAI): The Windows release includes an “Add Target” workflow to jump from Codex into local apps (instead of staying inside the Codex UI), demonstrated in the Add Target demo and described as part of staying in an existing Windows setup in the launch thread.

OpenAI’s Windows integration list calls out common destinations like Visual Studio, JetBrains IDEs, Git Bash, GitHub Desktop, Cmder, WSL, and Sublime Text in the targets list, which matches the broader “app targets” framing in the Windows live note.

OpenAI Developers

@OpenAIDevs

Soon.

9:06 PM · Mar 2, 2026

4.4K

Read 514 replies

Codex Windows WSL mode workaround and Microsoft Store hotfix 26.304.1143

Codex app (OpenAI): A post-release issue affecting WSL mode on Windows has a manual workaround—edit %USERPROFILE%\.codex\.codex-global-state.json and set runCodexInWindowsSubsystemForLinux to false—as detailed in the WSL toggle workaround.

A Microsoft Store update is rolling out to fix “bricked” installs, with OpenAI staff pointing to app version 26.304.1143 and an in-app path to check it (Alt → File → About Codex) in the Store hotfix note, building on the earlier “working on rolling this out” status in the fix identified note.

dominik kundel

@dkundel

If you are seeing this error when turning on WSL mode, I'm looking into the issue. In the meantime to deactivate it: 1. Open %USERPROFILE%\.codex\.codex-global-state.json 2. Change runCodexInWindowsSubsystemForLinux to false 3. Restart the app You can ask Codex to run `wsl` Show more

hurtti

@hurttii

This happens after changing the option for the agent to run in WSL mode. Been doing that for the codex CLI in windows, but can't seem to get the app option to work. Already tried with a fresh codex cli installation etc. Any ideas on what I could do?

6:32 PM · Mar 4, 2026

Read 4 replies

OpenAI is hiring across Codex (Windows, CLI, and future products)

Codex team (OpenAI): Multiple Codex team members are publicly recruiting across Windows, CLI, and “future products,” framing the group as high-agency and moving fast in the hiring note.

A separate hiring call lists locations (SF/Seattle/NY/London/remote) and requests “evidence of exceptional work” across full-stack, Rust, low-level systems, and distributed systems in the hiring post, with roles discoverable via the Careers search.

Peter Steinberger 🦞

@steipete

Folks, the codex team needs more great people! You can pick your battle, be it cli, Windows, or future products. OpenAI has lots of very high agency motivated folks and overall has been amazing so far. Apply! openai.com/careers/search…

6:45 PM · Mar 4, 2026

2.5K

Read 197 replies

🧑‍💻 Claude Code CLI churn: Opus model remaps, ultrathink returns, and new permissions automation

Continues yesterday’s Claude Code operations story but with concrete release details: 2.1.68 and 2.1.69 add/remap models, restore “ultrathink,” expand voice STT languages, and introduce an upcoming safer “auto mode” for permissions. Excludes Codex/Windows coverage (feature).

Claude Code previews “auto mode” for permission decisions (research preview)

Claude Code (Anthropic): An email to admins describes a new permissions mode, “auto mode,” launching in research preview no earlier than March 11, 2026; it lets Claude handle permission decisions during coding sessions to reduce interruptions, and it’s positioned as a safer alternative to --dangerously-skip-permissions with additional prompt-injection safeguards, according to the Email screenshot.

• Operational tradeoffs: The email notes it may not catch every risky action and recommends isolated environments; it also calls out higher token usage, cost, and latency in exchange for fewer approval prompts, as stated in the Email screenshot.

Alex Volkov (Thursd/AI)

@altryne

Claude Code is about to launch a "--less-dangerously-skip-permissions" mode AKA "auto mode" 👀 Given the huge enterprise adoption this makes a lot of sense!

Daniel Sternlicht

@dsternlicht

Was about to run #ClaudeCode with "--dangerously-skip-permissions" when @AnthropicAI emailed me a safer alternative at that exact moment. Like grabbing junk food at midnight and getting your weekly health report notification. Still hit Enter though. auto mode can have me

6:21 PM · Mar 4, 2026

374

Read 10 replies

Claude Code 2.1.68 changes Opus 4.6 effort defaults and brings back ultrathink

Claude Code 2.1.68 (Anthropic): Opus 4.6 now defaults to “medium effort” for Max and Team subscribers (Anthropic frames it as the speed/thoroughness sweet spot), and the “ultrathink” keyword is reintroduced to request high effort for the next turn, as noted in the Release highlights and expanded in the Changelog excerpt. The ergonomics impact shows up immediately in user reactions like the Ultrathink shorthand reaction, which implies the keyword is functioning as a practical control surface again.

Claude Code 2.1.68 has been released. 3 CLI changes Highlights: • Opus 4.6 defaults to medium effort for Max and Team subscribers • Opus 4 and 4.1 removed on the first-party API; pinned models now use Opus 4.6 • Temp files default to the sandbox-writable temp directory, Show more

10:18 AM · Mar 4, 2026

1.4K

Read 21 replies

Claude Code 2.1.69 adds a /claude-api skill and many new CLI controls

Claude Code 2.1.69 (Anthropic): The release adds a /claude-api skill for building against the Claude API + Anthropic SDK, alongside a large set of CLI and configuration changes (new commands, options, env vars, and config keys), as enumerated in the 2.1.69 release digest and the Changelog details. The canonical change list is captured in the upstream Changelog entry.

Claude Code 2.1.69 has been released. 7 flag changes, 103 CLI changes, 3 system prompt changes Highlights: • Added /claude-api skill for building apps with the Claude API and Anthropic SDK • Voice STT now supports 10 additional languages (20 total), including Russian, Polish, Show more

287

Claude Code 2.1.69 tightens tool loading: ToolSearch becomes mandatory

Claude Code 2.1.69 (Anthropic): System prompt changes now require ToolSearch as a hard prerequisite before calling deferred tools; the prior per-tool “rulebook” guidance is removed and replaced with an <available-deferred-tools>-style discovery flow, as summarized in the System prompt update notes and reflected in the linked prompt diff Prompt diff. This is a concrete harness-level behavior change that can affect how tool availability and tool invocation errors show up in real sessions.

Replying to @ClaudeCodeLog

Claude Code 2.1.69 system prompt updates Notable changes: 1) Claude no longer receives user-message reminders for available skills or currentDate, and the inline "write a haiku" instruction is removed. Instead, the user message provides an <available-deferred-tools> list, Show more

Claude Code 2.1.68 adds a legacy model remap override

Claude Code 2.1.68 (Anthropic): The CLI surface adds CLAUDE_CODE_DISABLE_LEGACY_MODEL_REMAP, and the opus-46-effort-medium model entry is removed, according to the CLI change list and the deeper diff breakdown in Further changes notes. This matters if you depend on stable, explicit model identifiers in automation (or want to avoid silent remaps when older pinned names are present).

Replying to @ClaudeCodeLog

Claude Code CLI 2.1.68 changelog: Removal: • Removed Opus 4 and 4.1 from Claude Code on the first-party API — users with these models pinned are automatically moved to Opus 4.6 Other changes: • Opus 4.6 now defaults to medium effort for Max and Team subscribers. Medium effort Show more

10:18 AM · Mar 4, 2026

Claude Code 2.1.68 changes temp file defaults in sandbox mode

Claude Code 2.1.68 (Anthropic): Temporary files now default to the sandbox-writable temp directory, aiming to reduce permission and path issues during sandboxed execution, as called out in the Release highlights and the more detailed Changelog excerpt. This is a narrow but workflow-impacting fix for sessions where tooling writes temp artifacts under constrained permissions.

10:18 AM · Mar 4, 2026

1.4K

Read 21 replies

Claude Code 2.1.69 adds 10 more voice dictation languages

Claude Code 2.1.69 (Anthropic): Voice STT support expands by 10 languages (20 total), including Russian, Polish, Turkish, and Dutch, as listed in the 2.1.69 release digest and repeated in the Changelog details. This is a straightforward capability expansion for voice-driven coding workflows, without other behavior claims in the tweets.

287

Claude Code 2.1.69 changes how memory corrections are handled

Claude Code 2.1.69 (Anthropic): The system prompt now treats user corrections of memory-based claims as evidence that stored memory is wrong, and requires updating/removing the memory entry at the source before proceeding to avoid repeat errors, as described in the System prompt update notes. The underlying change is visible via the associated Prompt diff.

Replying to @ClaudeCodeLog

Claude Code session issues: compaction errors and reports of degraded reliability

Claude Code (Anthropic): Users report compaction failing with “Conversation too long” errors and general instability in long-running sessions, as shown in the Compaction error report and echoed more broadly in the User reliability complaint. The reports are anecdotal (no status page or changelog attribution in the tweets), but they align with the practical failure mode where long contexts become hard to compact reliably mid-work.

Erik Meijer

@headinthebox

Fuck! it happened again "Error during compaction: Error: Conversation too long. Press esc twice to go up a few messages and try again" This is clown town Claude Code.

1:09 AM · Mar 5, 2026

Read 24 replies

Claude Code 2.1.69 can reload plugins without restarting

Claude Code 2.1.69 (Anthropic): A new /reload-plugins command applies pending plugin changes without restarting the session, per the 2.1.69 release digest. In practice, this changes how quickly teams can iterate on local plugin/skill setups during long-running sessions, but the tweets don’t include further behavioral guarantees.

287

🧩 Cursor expands beyond VS Code: JetBrains via ACP + protocol builder tooling

Cursor’s distribution story moves into enterprise IDEs: JetBrains integration via Agent Client Protocol (ACP) plus ACP Registry/docs for building minimal clients. Excludes Claude Code and Codex (feature) to keep tool beats clean.

Cursor expands into JetBrains IDEs via ACP (shows up in ACP Registry as v0.1.0)

Cursor (Anysphere): Cursor is now usable inside JetBrains IDEs through the Agent Client Protocol (ACP), broadening distribution beyond its VS Code fork into IntelliJ/PyCharm/WebStorm workflows, as announced in the JetBrains integration post; JetBrains’ ACP Registry already lists it as “Cursor v0.1.0,” as shown in the ACP registry screenshot, with integration details (including model choices and enterprise codebase workflows) described in the integration blog.

• Operational shape: The pitch is “agent-driven development” in enterprise Java/JetBrains stacks—secure indexing + semantic code intelligence, with multiple frontier model options surfaced in-IDE, per the integration blog.
• Positioning signal: The follow-on “Use Cursor anywhere” framing in the Use Cursor anywhere post reads like Cursor treating ACP as a cross-editor distribution layer, not a one-off integration.

Cursor

@cursor_ai

Cursor is now available in JetBrains IDEs through the Agent Client Protocol. cursor.com/blog/jetbrains…

3:46 PM · Mar 4, 2026

1.6K

Read 95 replies

ACP docs show a minimal Node.js client for driving Cursor agents over JSON-RPC

Agent Client Protocol (Cursor): Cursor’s ACP documentation now includes a “minimal Node.js client” example that spawns agent acp and communicates via newline-delimited JSON-RPC 2.0 over stdio, as shown in the Minimal client snippet and detailed in the ACP docs.

• What’s concretely implementable: The docs describe a request/notification flow for sessions, prompts, and permissions—enough to build a custom ACP host (e.g., internal devtools or a bespoke IDE wrapper) without adopting Cursor’s UI, per the ACP docs.
• Ecosystem read: The reaction that “the people clamour for protocols” in the Protocols comment highlights that builders want stable, tool-agnostic control planes for agents more than another bespoke plugin API.

eric zakariasson

@ericzakariasson

you can do so much with acp! if you're curious, you can try to build a minimal client based on docs here cursor.com/docs/cli/acp

Cursor

@cursor_ai

Cursor is now available in JetBrains IDEs through the Agent Client Protocol. cursor.com/blog/jetbrains…

4:14 PM · Mar 4, 2026

150

Cursor offers YC startups a $60k+ team package with $50k usage credits

Cursor (Anysphere): Cursor is offering YC startups a package described as “$60k+” that bundles 6 months of team access plus Bugbot and $50K usage credits, according to the YC deal announcement, with the eligibility constraint (“incorporated within the last 5 years”) clarified in the Eligibility follow-up.

• Adoption mechanism: This is a direct go-to-market lever for getting Cursor rolled out org-wide early, without a separate procurement cycle, per the YC deal announcement.

eric zakariasson

@ericzakariasson

yc companies now get $60k+ to try cursor across the team! 6 mo teams + bugbot + $50K usage credits you can redeem on bookface bookface → deals, or email yc@cursor.com mailto:yc@cursor.com if you can’t find it excited to see what y’all will build!

5:47 PM · Mar 4, 2026

745

Read 77 replies

🧠 GPT‑5.4 watch: 1M context, “extreme reasoning,” and Arena sightings

Today’s model chatter centers on GPT‑5.4 appearing in evaluation channels (Arena) and The Information’s capability claims (1M context + extreme reasoning mode + multi-hour agent tasks). Excludes Codex/Windows (feature).

The Information: GPT-5.4 rumored to bring 1M context and “Extreme” reasoning mode

GPT-5.4 (OpenAI): A report claims GPT-5.4 will ship with a 1M-token context window, alongside an “Extreme reasoning mode” that can spend substantially more compute per hard query; this follows yesterday’s GPT-5.4 teaser, as noted in tease. The same report also frames upgrades around long-horizon tasks (“can run for hours”), improved multi-step memory, and lower complex-task error rates, as summarized in the The Information screenshot and echoed in the Leak recap.

• Long-horizon agent work: The leak narrative explicitly targets automation workloads (agent loops, multi-step workflows) rather than just chat quality, as described in the The Information screenshot and reiterated in the Rumor recap.
• Release cadence signal: Multiple posts repeat that OpenAI is shifting toward more frequent (monthly) model updates, per the Leak recap and the Update cadence quote.
• State persistence rumor: A separate rumor thread claims GPT-5.4 may “persist state,” attributed to Jeff Dean podcast chatter in the State persistence rumor.

Uncertainty remains high: this is secondhand reporting plus community restatements, and there’s no official model card or API doc in the tweet set.

Big GPT-5.4 updates (via TheInformation) - 1M token context window -New “Extreme reasoning mode” → more compute, deeper thinking - Parity with Gemini and Claude long-context models - Better long-horizon tasks (can run for hours) - Improved memory across multi-step workflows Show more

3:13 PM · Mar 4, 2026

1.8K

Read 99 replies

LM Arena adds “galapagos,” widely speculated to be a GPT-5.4 variant

LM Arena (LMSYS): Arena notifications and screenshots show a new model label, “galapagos,” landing in the Arena lineup, with multiple accounts speculating it’s an early GPT-5.4 variant (often framed as a lower-effort route), as shown in the Arena model banner and the TestingCatalog screenshot.

Early probing reports suggest the Arena route may be constrained: one captured exchange shows the model claiming “my current juice value is 0,” which users interpret as low/no-reasoning behavior in the Arena harness, per the Juice value screenshot.

• What’s concrete vs inferred: The existence of “galapagos” as an Arena label is evidenced in the Arena model banner and the Arena notification card, while the mapping to GPT-5.4 remains community inference.
• Variant ambiguity: Posts explicitly note it’s unclear which GPT-5.4 variant this corresponds to, per the Arena model banner and the Juice value screenshot.

Legit

@legit_api

GPT-5.4 has landed on the Arena unclear which variant

10:16 PM · Mar 4, 2026

478

Read 25 replies

GPT-5.4 release timing and “effort tier” speculation clusters around Thursday

Release watch chatter: Several accounts converge on a near-term release window (often “Thursday”), while also speculating about multiple GPT-5.4 variants/tier names (e.g., “Thinking/Pro/Codex” or higher-effort tiers beyond x-high), as reflected in the Thursday speculation, the Thinking Thursday joke, and the Effort tier riff.

• Representative quotes: Posters describe the rollout mood as “Release Thursday very likely,” in the Thursday speculation, and joke about “GPT-5.4-xxxhigh,” in the Effort tier riff.
• What’s driving the timing guess: The clustering seems to come from the Arena appearance plus The Information-style capability claims being circulated together, as seen in the Thursday speculation and the Release timing question.

Net: lots of packaging and schedule theories, but no official OpenAI launch post appears in today’s tweet set.

It’s happening: GPT-5.4 landed in the arena. Release Thursday very likely

Legit

@legit_api

GPT-5.4 has landed on the Arena unclear which variant

10:37 PM · Mar 4, 2026

703

Read 12 replies

🧰 IDE/editor agent features: orchestration hooks, diff-to-agent review, and continuity

Editors are absorbing agent orchestration primitives: VS Code’s agent features (hooks, steering, integrated browser, shared memory) and Zed’s agent feedback loop from diffs. Excludes Cursor and Codex/Windows (feature).

VS Code v1.110 ships hooks, steering, an agentic browser, and shared memory

VS Code v1.110 (Microsoft): The February 2026 VS Code release expands Copilot’s agent surface with hooks, message steering/queueing, an integrated agentic browser, and shared memory, as summarized in the release highlight thread and reiterated alongside the release livestream invite.

• Orchestration and continuity: The release frames “agent orchestration / extensibility / continuity” as first-class editor concepts (hooks, steering, shared memory), with the broader set of changes and UI surfaces detailed in the release notes.
• Operational workflow: VS Code is also promoting a March 19 livestream (8 AM PST) to walk through agent features, as announced in the release livestream invite.

Visual Studio Code

@code

Agents, for real work. The latest @code release gives you better agent orchestration, extensibility, and continuity. Here's what's new: 🪝 Hooks support 🎯 Message steering and queueing 🌐 Agentic integrated browser 🧠 Shared memory And more...

7:37 PM · Mar 4, 2026

1.2K

Read 29 replies

Zed v0.226 surfaces diagnostics and adds one-click diff review by an agent

Zed v0.226 (Zed): Zed shipped two small but workflow-relevant UI affordances—project-panel diagnostic badges and a “Review Diff” handoff into the agent panel—called out in the v0.226 release note and shown in the diff review demo.

• Inline build health: A config toggle, "project_panel": { "diagnostic_badges": true }, shows error/warning counts next to project entries, as described in the v0.226 release note.
• Diff-to-agent loop: The new “Review Diff” button in the git branch diff view sends the current change set directly to the agent panel for feedback, as demonstrated in the diff review demo.

Zed

@zeddotdev

🚀 We just shipped v0.226! Configure the project panel to show error and warning counts next to its entries. `"project_panel": { "diagnostic_badges": true }` Thanks Obli04!

5:00 PM · Mar 4, 2026

532

Read 19 replies

A GitHub Action uses a cloud agent to catch .env key drift before merge

Wizard of Drift (dotenvx + Warp): A new GitHub Action checks for environment-variable key drift across .env* files (e.g., a dev key missing in prod) and leaves a PR comment via a cloud coding agent, as described in the agent PR comment workflow and implemented in the GitHub repo.

The pitch is CI-time hygiene for teams whose agent-generated code changes often touch config, without needing a human to manually diff environment files each time, per the agent PR comment workflow.

Warp

@warpdotdev

Glad we could support @dotenvx in building this! This GitHub action catches "drift" in codebase secrets, like a .env.development key that wasn't added in production. Using a cloud coding agent, we can leave a PR comment to catch before you ship. Would you use this? 👀

dotenvx

@dotenvx

Catch .env* key drift in pull requests - using an agent. x.com/dotenvx/status…

7:49 PM · Mar 4, 2026

📎 Enterprise SaaS automation becomes agent-ready: Google Workspace CLI (gws)

Google shipped an official Workspace CLI designed for humans and agents (dynamic discovery, JSON outputs, skills). This is a concrete “skills + CLI” pattern for automating Drive/Gmail/Calendar/etc workflows without bespoke REST plumbing.

Google releases Google Workspace CLI (gws) built for humans and agents

Google Workspace CLI (Google): Google open-sourced gws, a single CLI that spans Drive, Gmail, Calendar, Sheets, Docs, Chat, Admin, and effectively “every Workspace API,” with commands generated at runtime from the Discovery Service—see the launch announcement and the GitHub repo. It’s explicitly designed to be agent-friendly via structured JSON outputs and ships with 40+ prebuilt agent skills, with install flows via npm plus Skills CLIs shown in the install snippet.

• Agent integration surface: The repo notes structured JSON and schema introspection as first-class capabilities (e.g., gws schema ...), which is the kind of “tool contract” agents can compose reliably, as illustrated in the gws command examples.
• Distribution and ecosystem hook: The npm package + npx skills add github:googleworkspace/cli flow in the install snippet is a concrete example of “Skills + CLI” becoming a packaging layer for enterprise automation rather than bespoke REST glue.

Some claims (like “100+ workflow recipes”) appear in community recaps such as the feature recap, but the most verifiable artifact remains the repo itself.

Addy Osmani

@addyosmani

Introducing the Google Workspace CLI: github.com/googleworkspac… - built for humans and agents. Google Drive, Gmail, Calendar, and every Workspace API. 40+ agent skills included.

1:45 AM · Mar 5, 2026

4.4K

Read 196 replies

gws vs gog: early ergonomics debate over JSON-heavy Workspace automation

gws vs gog (community): Shortly after Google’s gws release, builders started comparing it to existing unofficial CLIs—most notably gog—with the debate centering on whether gws’s JSON-heavy, Discovery-derived surface is more or less usable for agents than a more opinionated command design, as framed in the gog comparison.

• What’s being compared: The gog comparison calls out “the json commands needed for gws” as a potential downside versus gog’s defaults, while still acknowledging gws is “really good”; the gog feature set and philosophy are laid out on the gog site.
• Why it matters for agents: This is less about features and more about “command language design”—whether agents do better with a fully generic API-shaped CLI (gws) or a curated UX that bakes in best-practice defaults (gog), which affects reliability and prompt length when automating Workspace operations.

No consensus yet—just an early signal that CLI ergonomics may become a competitive axis once agents are the primary caller.

Peter Steinberger 🦞

@steipete

Amazing. My whole motivation for building gogcli.sh was that nothing good was out there. But now gog is *really* good, looking at the json commands needed for gws I'm less sure. Will run some evals and see what works better for agents.

Sawyer Hood

@sawyerhood

official google workspace cli!! github.com/googleworkspac…

1:09 AM · Mar 5, 2026

1.5K

Read 66 replies

📏 Evals & observability reality checks: nonsense detection, agent SWE evals, and uncertainty bars

Multiple eval artifacts today: BullshitBench v2 analysis, Scale AI’s SWE-Atlas for code agents, and calls to add statistical uncertainty to leaderboard scores. Excludes GPT‑5.4 rumors (model category) and product release notes.

Scale AI releases SWE-Atlas Codebase Q&A; leading agents score under 30%

SWE-Atlas (Scale AI): Scale introduced SWE-Atlas, an agentic SWE eval suite; its first released track, Codebase Q&A, reports leading models still scoring <30%, with agents operating inside a sandboxed environment with shell access, as described in the Benchmark announcement.

• Execution matters: The release notes claim performance drops 40%+ when models are prevented from running code, making “can execute” a first-class axis rather than a nice-to-have, per the Benchmark announcement.
• Scaffolds show up in scores: The same post argues models do best with “native scaffolds” (e.g., Claude Code, Codex CLI), framing harness integration as part of the measured system, per the Benchmark announcement.

Bing Liu

@vbingliu

Introducing SWE-Atlas, a new agentic coding evaluation suite for software engineering tasks released by @scale_AI As coding models evolve into full agents operating inside development environments, evaluation needs to measure more than patch generation. Real software work Show more

5:43 PM · Mar 4, 2026

BullshitBench v2 shows Claude dominating nonsense detection while reasoners can overfit nonsense

BullshitBench v2 (benchmark): New leaderboard commentary suggests models with more reasoning budget can “use their extra compute to rationalize the nonsense,” which hurts nonsense rejection; following up on Reasoning slope (hard-thinking backfires), this snapshot shows Claude variants clustered at the top while several GPT/Gemini entries sit far lower, as summarized in the Benchmark summary and visualized in the Benchmark summary.

• Who’s ahead: Posts claim “only Anthropic’s Claude models and Alibaba’s Qwen 3.5” clear ~60%+ meaningfully, with Claude Sonnet/Opus variants taking the top slots in the chart shown in the Benchmark summary and reiterated in the Claude top-7 recap.
• Engineer takeaway: This keeps “reject nonsense” as a distinct axis from “solve hard problems,” with the failure mode described as confident elaboration rather than refusal, per the Benchmark summary.

BullshitBench v2, created by Peter Gostev, is a benchmark that does something refreshingly different: it tests whether AI models can detect and reject nonsensical prompts instead of confidently rolling with them. Only Anthropic's Claude models and Alibaba's Qwen 3.5 score Show more

Peter Gostev

@petergostev

V2 has 100 questions and 70+ model variants tested (model + reasoning levels) - Anthropic and Qwen 3.5 are only models that are much above 60%.

4:20 PM · Mar 4, 2026

716

Read 43 replies

Rubric drift: when scores drop, the rubric may be wrong, not the prompt

Rubric drift (eval practice): A recurring failure mode in LLM product debugging is teams treating a falling pass rate as a prompt regression when the underlying rubric no longer matches real user outcomes; the thread frames this as fixing the “rules that decide pass/fail,” not reflexively rewriting the prompt, per the Rubric drift note.

• How it shows up: The post describes early rubrics built on a narrow set of use cases; as distributions shift, “output looks fine but fails the rubric,” which triggers wasted prompt churn, as expanded in the Rubric drift details.

Sometimes an AI product starts "underperforming" and the team spends days fixing the prompt. But the prompt was never broken. The real problem is often the eval rubric — the rules that decide whether an output passes or fails. Most rubrics get written early on, based on a small Show more

Arsh Shah Dilbagi

@arshdilbagi

The hard part about LLM failures is that their outputs rarely look like failures. The demo “works.” The output sounds coherent. The user actively uses the product. And your dashboard looks normal. Meanwhile, the system can be wrong, unsafe, or quietly driving up token spend.

4:08 PM · Mar 4, 2026

Trace-first debugging: treat the trace as the core artifact for LLM failures

LLM observability (trace practice): A thread argues that output logging is not enough for agent debugging; a real trace should show prompt assembly, retrieval payload quality, and tool-call results so teams can localize failures to the correct layer, per the Observability thread.

• Practical decision tree: It proposes tracing-based triage such as “bad output + bad retrieval → retrieval issue” and “cost spike → token usage per span,” as laid out in the Trace decision rules.

AshutoshShrivastava

@ai_for_success

Your LLM gave a bad answer. Do you know where exactly it broke? Most teams don't. And that's the problem. LLM products don't crash loudly. They quietly leak trust, safety, and money. @arshdilbagi covered this in his Stanford CS224G lecture with the most practical observability Show more

4:13 PM · Mar 4, 2026

Vals AI starts publishing standard error on eval results

Vals AI (eval reporting): Vals AI says it now reports standard error across results so benchmark scores include uncertainty, arguing top-line leaderboards without error bars are incomplete, as stated in the Error bars announcement.

• Method sketch: They cite the Miller et al. approach for “adding error bars to evals,” and describe two regimes—multi-run standard error vs. single-run task-distribution assumptions—as outlined in the Standard error method and extended in the Benchmark settings note.
• Concrete example: They also publish a fresh pass on Qwen 3.5 Flash and position its overall performance as comparable to Gemini 3.1 Flash Lite at similar pricing, per the Qwen 3.5 Flash results and Benchmark settings note.

Vals AI

@ValsAI

Benchmark scores without uncertainty are fundamentally incomplete. Vals AI now reports standard error on all our results.

8:22 PM · Mar 4, 2026

Arize previews AX CLI for pulling traces into terminal-centric workflows

AX CLI (Arize): Arize demoed a developer preview that pulls trace spans from the Arize UI into local files and then analyzes them from a terminal/editor loop (example: dumping spans JSON and inspecting in Cursor), as shown in the CLI demo.

• Why it’s notable: It’s an explicit bridge from “observability UI” to “agentic coding environment,” with traces treated as portable artifacts that can be interrogated in the same toolchain as code, per the CLI demo.

Aparna Dhinakaran

@aparnadhinak

Alyx can surface insights about your traces in the Arize UI. I wanted to do the same thing from my terminal. Pulled Alyx's own spans with the AX CLI, dropped the file into Cursor, and asked it what the most common user questions are. Same analysis. No browser. We just released Show more

2:25 AM · Mar 5, 2026

Evals as a feedback loop: prompts behave like executable business logic

Evals workflow (production practice): A post frames prompts as “executable business logic” and argues evals should behave like a feedback loop—ship, monitor failures in the wild, add them back to the dataset, and re-run on prompt/model changes—rather than treating evals like static unit tests, per the Evals mental model.

• Failure story: The example is an insurance workflow that “passed 20 eval cases” but failed in production under a new request class, which is used to motivate continuous eval maintenance, per the Evals mental model.

elvis

@omarsar0

When you build AI agents, don't treat prompts like config strings. Treat them like executable business logic. Because that's what they really are. @arshdilbagi's blog and this Stanford CS 224G lecture lay out one of the clearest mental models I have seen for LLM evaluation. Show more

4:01 PM · Mar 4, 2026

Short-Story benchmark shifts to pairwise comparisons to reduce grading drift

Short-Story Creative Writing Benchmark (Lech Mazur): The benchmark added an alternate scoring mode based on pairwise comparisons (A-vs-B stories with the same required elements) to reduce calibration drift versus absolute scoring, as explained in the Benchmark update.

• What changed: Results are aggregated into global “Thurstone” quality ratings with confidence intervals, and the post calls out cross-evaluator correlation plus side-bias correction, per the Benchmark update.
• Why it’s relevant: It’s a concrete example of making evals more stable when many models cluster near the top of absolute rubrics, as implied by the new ranking method described in the Benchmark update.

Lech Mazur

@LechMazur

New: the Short-Story Creative Writing Benchmark now supports an alternate grading mode based on pairwise comparisons of stories written to the same required elements. This reduces calibration drift relative to absolute scoring, enables simpler prompts, and provides better Show more

1:23 AM · Mar 5, 2026

🧱 Skills packaging accelerates: cross-tool workflows, CLIs, and install ecosystems

Skills continue to solidify as reusable agent capabilities: new Skills/CLI products (LangSmith), new Gemini Interactions skill, and Perplexity’s Skills.MD support for reusing Codex/Claude workflows. Excludes the Google Workspace CLI itself (separate category).

Perplexity Computer rolls out Skills with SKILLs.MD-style portability

Perplexity Computer Skills (Perplexity): Perplexity is rolling out a dedicated Skills section in Perplexity Computer, with a visible “+ Create skill” flow and a browsable library, as shown in the Skills UI screenshot. The rollout is framed around importing/reusing workflows built for other coding-agent stacks (Codex and Claude Code), with additional detail collected in the Feature scoop.

• What the UI suggests: skills are treated as first-class objects (“My Skills” vs “Perplexity Skills”), not just prompt templates, as shown in the Skills UI screenshot.
• Portability claim: the product framing is cross-harness reuse (skills as “programs” that move between agent runtimes), as stated in the Skills UI screenshot and expanded in the Feature scoop.

Perplexity is rolling out Skills support for Perplexity Computer, a feature that allows users to import and reuse automated workflows originally built for OpenAI Codex and Anthropic’s Claude Code. By treating "Skills" as cross-platform programs, Perplexity is positioning itself Show more

TestingCatalog News 🗞

@testingcatalog

Perplexity is rolling out Skills support for Perplexity Computer. SKILLs will allow users to reuse their existing workflows from Codex and Claude Code with Perplexity Computer. SKILLs are new Computer programs 👀

10:00 AM · Mar 4, 2026

120

Read 25 replies

LangSmith ships Skills + CLI for trace debugging, datasets, and experiments

LangSmith Skills + CLI (LangChain): LangChain announced LangSmith Skills alongside the LangSmith CLI, positioning it as a way to let coding agents do the agent-engineering lifecycle end-to-end—debug traces, create datasets, and run experiments—without leaving the terminal, as described in the Launch thread.

• What’s actually new: the packaging is explicit—“skills” are the reusable capabilities and the CLI is the native execution surface for agents, per the Launch thread.
• Why it matters for teams: it pushes observability and eval ops closer to where agents already work (shell sessions), which is a different workflow than “humans click around the UI,” as framed in the Launch thread.

LangChain

@LangChain

🚀 Announcing LangSmith Skills + CLI 🚀 Agent improvements are increasingly driven by coding agents themselves. We're releasing LangSmith Skills alongside the LangSmith CLI to make coding agents experts at the agent engineering lifecycle. LangSmith Skills enable agents to Show more

7:06 PM · Mar 4, 2026

205

Gemini Interactions API gets a one-line install as a Skill

Gemini Interactions API skill (Google): A new skill for building with the Gemini Interactions API was added to the Skills ecosystem; install paths are explicitly shown for both the Vercel and Context7 CLIs, as posted in the Install commands.

• Install surface: the post uses npx skills add ... --global and a Context7 install command, indicating Skills are being treated as toolchain-managed dependencies rather than copy/paste prompt assets, as shown in the Install commands.
• Engineering implication: it standardizes a “unified interface” integration into a reusable skill artifact (and not just docs), per the Install commands.

Philipp Schmid

@_philschmid

We added a new a skill for building with the Gemini Interaction API! The Gemini Interactions API is a unified interface for building advanced agentic applications with Gemini models. install it with the @Context7AI or @vercel CLIs: ``` # Vercel skills npx skills add Show more

4:37 PM · Mar 4, 2026

345

Skills are being treated as the new onboarding and distribution layer for agents

Skills as onboarding UX (Ecosystem): The “Skills are the new onboarding UX” line is getting repeated as a product thesis, with the directory-and-installer model (Skills directory + npx skills add) acting as the implied distribution channel, as argued in the Onboarding UX meme and exemplified by the Skills directory.

• Ecosystem shape: the installer ergonomics (“install a capability” rather than “wire an integration”) is the core pattern being promoted in the Skills and CLIs claim, with discovery centralized in the Skills directory.
• What’s changing for engineers: “skills” are increasingly packaged like dependencies you can install/uninstall, which is a different adoption loop than bespoke MCP server setup or custom tool glue, per the Onboarding UX meme.

Guillermo Rauch

@rauchg

Skills are the new onboarding ux

Vercel Developers

@vercel_dev

Anyone can build a Slack agent on Vercel, even if you've never touched OAuth scopes or webhook verification. You just need the right skill. ▲ ~/ 𝚗𝚙𝚡 𝚜𝚔𝚒𝚕𝚕𝚜 𝚊𝚍𝚍 𝚟𝚎𝚛𝚌𝚎𝚕-𝚕𝚊𝚋𝚜/𝚜𝚕𝚊𝚌𝚔-𝚊𝚐𝚎𝚗𝚝-𝚜𝚔𝚒𝚕𝚕 pic.x.com/N6pAdJVRmf

10:49 PM · Mar 4, 2026

337

Read 21 replies

🕹️ Agent harnesses & orchestration: boards→agents, remote control, and session ops tooling

Ops is shifting from prompting to orchestration: OpenAI’s Symphony watches project boards and spins agent runs, while other stacks push remote execution and session management UX. Excludes Codex/Windows (feature).

OpenAI’s Symphony turns Linear tickets into autonomous, isolated agent runs

Symphony (OpenAI): OpenAI published Symphony, an orchestration layer that watches a project board (Linear in the reference setup) and spawns agents to carry tickets through stages, aiming to shift teams from “prompt the agent” to “move the ticket” workflows as described in the repo summary and detailed in the GitHub repo.

• Board-driven lifecycle: Symphony polls for active issues and starts work only when there’s capacity, then keeps the agent running across turns until the ticket is done, according to the implementation notes.
• Work artifacts over chat transcripts: The repo framing emphasizes “proof of work” outputs—CI status, PR review feedback, complexity analysis, walkthroughs—before landing PRs, as shown in the repo summary.

Lisan al Gaib

@scaling01

New OpenAI repo: Symphony github.com/openai/symphony TLDR: it's an orchestration layer that polls project boards for changes and spawns agents for each lifecycle stage of the ticket You will just move tickets on a board instead of prompting an agent to write the code and do a PR

6:21 PM · Mar 4, 2026

1.5K

Read 54 replies

Symphony’s SPEC.md hints at OpenAI’s internal pattern for long-running agent services

Symphony SPEC (OpenAI): The most reusable part of Symphony may be its SPEC.md-first design, which spells out a long-running service that repeatedly reads work, creates an isolated workspace, and runs an agent session per issue—an approach surfaced in the SPEC excerpt and reinforced by deeper repo spelunking in the worker loop notes.

• Reference implementation choice: The repo ships an Elixir/OTP prototype, which prompted discussion about OpenAI reaching for BEAM-style concurrency as an agent-orchestration primitive in the Elixir note.

What’s still unclear from today’s tweets is how much of this spec style is “reference only” versus representative of hardened internal orchestration.

Numman Ali

@nummanali

What’s actually interesting to me about the new OpenAI Symphony project is the SPEC file Clearly this is how they create internal specs for long running agents set up in line with harness engineering I’m going to meta prompt this to see how it behaves creating one shot specs Show more

Thomas Ricouard

@Dimillian

This new Symphony project from @OpenAI looks like something from the future. The project is a spec, and you ask your favorite agent to actually build it for you. TL;DR: Symphony is a lite project/task orchestrator. Will try it soon! github.com/openai/symphony

8:35 PM · Mar 4, 2026

Read 2 replies

Letta Code agents can now run remotely across machines

Letta Code (Letta): Letta Code added a remote execution mode that’s positioned as broader than laptop-only remote control—agents can run across multiple machines, and the system moves memory/context with the agent, per the remote mode note.

This is a direct bet on “agent as a migratable process” rather than “agent as a session pinned to one workstation,” which is a different shape of orchestration than most single-device harnesses today.

Sarah Wooders

@sarahwooders

Letta Code agents can now work remotely. It's like Claude Code's `/remote-control`, but on steroids: * agents can run across multiple machines (not just your laptop) * all memory and context move with the agent * model agnostic :)

Letta

@Letta_AI

x.com/i/article/2029…

12:27 AM · Mar 5, 2026

Readout 0.0.8 adds transcript search, tool usage views, and session handoffs

Readout 0.0.8 (Readout): Readout shipped a release focused on operating agent sessions—transcript search, a dedicated tool-usage tab, skill/agent customization, cost projections, and session handoffs—as listed in the release announcement and elaborated on in the product page.

The update reads like an “ops console” for long-running agent work: more visibility into what tools ran, better navigation over history, and more explicit session-to-session continuity.

Benji Taylor

@benjitaylor

Readout 0.0.8 is live: new session transcript search tab, new tool usage tab, skill and agent customisation (add/edit/remove), cost projections, session handoffs, sidebar customisation, and most importantly, a cute new splash screen. → readout.org

8:09 PM · Mar 4, 2026

444

Read 26 replies

OpenHands recap highlights SWE-Efficiency, cloud sandboxes, and enterprise hardening

OpenHands (OpenHandsDev): A community call recap flags several execution-layer updates: MiniMax is now free to use on OpenHands; OpenHands Index changes (including an exploit fix); SWE-Efficiency benchmark results (optimization-focused); “cloud sandbox” patterns for large-scale refactors; plus enterprise security and SDK improvements, all summarized in the call recap.

The thread also calls out “CLI iterative refinement with critic self-review” as part of the workflow direction, which fits the broader shift from single-shot patching to managed, repeatable agent runs.

OpenHands

@OpenHandsDev

Last week's OpenHands Community Call recap 🚀 🔥 MiniMax is now free to use on OpenHands 📊 Improvements to the OpenHands Index (including fixing a benchmark exploit) 🧪 SWE-Efficiency benchmark results (real-world code optimization testing) ⚡ CLI iterative refinement with Show more

5:10 PM · Mar 4, 2026

🗂️ Search & document pipelines go agentic: Deep search loops, Canvas workspaces, and doc review modes

Retrieval stacks are being rebuilt around agent loops and structured output: Exa Deep, Perplexity document review surfaces, and Google Search’s Canvas-as-workspace direction. Excludes voice-mode UX (voice category).

Exa Deep launches agent-in-the-loop search with 4–60s latency and structured JSON outputs

Exa Deep (Exa AI Labs): Exa introduced Deep, a search endpoint where an agent loops (plan → parallel sub-searches → synthesize) until it has enough evidence, then returns structured results; the team claims it’s Pareto-optimal on quality vs latency in the ~4–60s range per the launch announcement.

Deep is positioned as a programmable search primitive: it can take a user-defined output schema and emit structured JSON with field-level citations, as described in the schema and citations note.

• Latency architecture: Deep decomposes queries into multiple rounds of parallel sub-searches and uses an “Instant” endpoint (<200ms) to keep each step fast, according to the implementation detail.
• Quality vs latency evidence: a shared scatter plot compares Deep variants against alternatives (e.g., “Parallel” baselines and Perplexity Sonar) on quality vs P50 latency, as shown in the latency-quality plots.

Use cases mentioned include financial agents, literature review, and news monitoring, with more detail in the launch blog post.

Exa

@ExaAILabs

Introducing Exa Deep: putting an agent inside every search For each query, an agent runs in a loop until it gathers all information, then returns structured output. Evals show Deep is Pareto optimal at 4-60s latency, ideal for quick, cost-efficient research!

7:25 PM · Mar 4, 2026

429

Read 22 replies

Google Search AI Mode rolls out Canvas side-panel workspace to US users

Canvas in AI Mode (Google Search): Google is rolling out Canvas, a side-panel workspace inside Search AI Mode to “organize long-term plans and projects” and iterate via follow-ups without leaving the search page, per the feature walkthrough.

The demo shows Canvas being used for longer-form writing and coding tasks (including a toggle to view underlying code), while pulling fresh info from the live web / Knowledge Graph to populate generated tools and drafts, as described in the feature walkthrough.

This is a notable product shape for retrieval: Search becomes the place where planning, drafting, and lightweight prototyping happen next to web-grounded results—rather than a separate chatbot UI.

Google is officially turning Search into a fully interactive workspace! "Canvas in AI Mode" is now rolling out to all U.S. users. Canvas acts as a dedicated side-panel workspace where you can organize long-term plans and projects without leaving the search page. The feature has Show more

Google

@Google

Canvas in AI Mode lets you build plans and organize information over multiple sessions in a dynamic side panel that updates as you go. Now we're making this tool available to everyone in the U.S. in English, and we're adding support for creative writing and coding tasks, so you

11:00 PM · Mar 4, 2026

150

Perplexity Max surfaces “Final Pass” for document review and fact-checking

Final Pass (Perplexity Max): A new Perplexity UI section labeled “Final Pass” appears to focus on comprehensive document review/fact-checking, with an entry point via “Review documents,” as shown in the UI screenshot.

Third-party reporting frames it as an in-progress feature tied to document analysis workflows inside “Perplexity Computer,” per the feature scoop.

TestingCatalog News 🗞

@testingcatalog

Perplexity is working on a new feature called "Final Pass," designed to perform comprehensive document analysis and fact-checking.

TestingCatalog News 🗞

@testingcatalog

1:51 PM · Mar 4, 2026

312

Read 37 replies

⚙️ Inference & runtime engineering: decoding speedups and low-cost compute packaging

Runtime-side improvements show up as both algorithms and packaging: new decoding schemes for lower latency plus subscription-style access to cheaper coding inference. Excludes training-time optimizers (separate category).

SSD proposes parallel speculative decoding for up to 2× lower latency

Speculative Speculative Decoding (SSD): A new open-source inference engine proposes running speculative drafting and verification in parallel (instead of sequentially), with claims of up to 2× speedups over strong inference baselines on Llama-3 and Qwen3 pairings, as described in the SSD summary.

• What’s distinct: SSD’s pitch is scheduling—anticipating verification outcomes so draft and verify overlap—while keeping common serving tricks (paged attention, prefix caching) intact per the SSD summary.
• Where to inspect: The implementation and setup details live in the GitHub repo.

The public material so far is benchmark-claim heavy; there isn’t a single standardized eval artifact in these tweets beyond the authors’ reporting.

AlphaSignal AI

@AlphaSignalAI

A new inference algorithm just cut LLM latency in half. Not with better hardware. With smarter scheduling. Speculative Decoding (SD) already sped up inference by having a small model draft tokens for a large model to verify. But drafting and verifying still happen Show more

10:38 PM · Mar 4, 2026

Read 2 replies

OpenCode Go bumps limits 3× while staying $10/month

OpenCode Go (OpenCode): The $10/month “cheap coding inference” plan increased limits by 3×, keeping the same price point according to the Limits update.

• New per-5-hour caps shown: The Go tier lists 1,150 requests for GLM-5, 1,850 for Kimi K2.5, and 20,000 for MiniMax M2.5 in the limits graphic shared in the Limits update.
• Provider co-design signal: The team frames the gain as workload-aware optimizations—“guarantees about the workload” enabling infra-side tuning—per the Workload guarantees note.

Plan details and positioning are spelled out on the Go plan page.

dax

@thdxr

we've increased opencode go's limits by 3x - still $10/month

3:48 AM · Mar 5, 2026

846

Read 58 replies

Comfy Cloud exits beta with pay-per-run Blackwell RTX 6000 Pro (96GB) in-browser

Comfy Cloud (ComfyUI): Comfy Cloud moved out of beta with a pay-per-run model (“never for idle time”) and claims instant readiness for major models, running on NVIDIA Blackwell RTX 6000 Pro with 96GB VRAM in the browser per the Beta exit announcement.

The update is mostly packaging and availability—turning popular ComfyUI custom nodes into a managed, burstable runtime rather than a local/GPU-ops setup.

ComfyUI

@ComfyUI

Comfy Cloud is out of beta! Thank you to our community who helped build and test it. Here is what this means: - Most popular custom nodes — now in the cloud - Pay only when your workflow runs. Never for idle time - Every major model ready instantly - NVIDIA Blackwell RTX 6000 Show more

10:07 PM · Mar 4, 2026

142

🧪 Training & reasoning efficiency: RL agents for kernels, shorter CoT, and multimodal convergence

Several research threads focus on making models learn or reason more efficiently: RL in sandboxes for performance code, “draft” reasoning to cut tokens/compute, and multimodal training recipes. Excludes pure inference runtime items.

CUDA Agent uses agentic RL to write faster CUDA kernels than compiler baselines

CUDA Agent (ByteDance Seed): A new paper describes an agentic reinforcement-learning setup where an LLM writes CUDA kernels inside a secure execution harness, benchmarks them, and learns from performance feedback—aiming to optimize for speedup, not just correctness, as summarized in the paper overview.

• Reported results: The authors claim up to 100% faster kernels than torch.compile across KernelBench Level-1 and Level-2 splits, and 92% faster on Level-3; they also claim ~40% better performance than proprietary models on the hardest setting, as stated in the paper overview.
• Why it’s an efficiency story: The loop is “write → run → measure → reward,” which makes GPU-kernel generation look more like an RL control problem than code synthesis; the tweet’s description emphasizes the sandbox + continuous trial-and-error setup in the paper overview.

New ByteDance paper shows how an AI learned to write CUDA hardware code so well it beats standard compilers at their own game. This system creates custom software components that run up to 100% faster than traditional automated tools. Writing instructions for AI hardware is Show more

11:46 AM · Mar 4, 2026

252

Read 16 replies

Self-Flow claims faster multimodal convergence and better video consistency via self-supervised flow matching

Self-Flow (bfl_ml): A research preview pitches self-supervised flow matching as a way to train a single multimodal generative model across image/video/audio/text without relying on external representation models; the thread claims up to 2.8× faster convergence, plus better video temporal consistency and sharper typography, as reported in the research preview.

The same thread frames this as groundwork for “multimodal visual intelligence,” including a 4B-parameter model trained on 6M videos and separate large-scale image/audio-video runs, per the research preview.

• What’s concretely new: The claimed win is end-to-end multimodal learning with fewer external encoders and faster convergence, with specific convergence multipliers shown in the research preview.
• Extra signal beyond text-to-video: A follow-up post positions it as a step toward world-model-like action prediction, with an example clip shared in the action prediction post.

Black Forest Labs

@bfl_ml

We present a research preview of Self-Flow: a scalable approach for training multi-modal generative models. Multi-modal generation requires end-to-end learning across modalities: image, video, audio, text - without being limited by external models for representation learning. Show more

3:07 PM · Mar 4, 2026

627

Read 12 replies

Draft-Thinking trains models to use fewer reasoning steps with large compute savings

Draft-Thinking (Zhejiang University, Tencent): A paper proposes training models to produce a compact “draft” reasoning trace (key logical jumps only), then reinforces concise reasoning so the model stops overthinking—claiming major compute savings on math benchmarks, as described in the paper summary.

• Claimed efficiency delta: On MATH500, the approach reportedly cuts reasoning compute by 82.6% with only a 2.6% performance drop, per the paper summary.
• Mechanism: The training is described as staged (draft structure acquisition → distillation/progressive internalization → RL for flexible draft mastery), with an “adaptive” selector to use short drafts for easy problems and longer traces for hard ones, according to the paper summary.

Researchers created Draft-Thinking to teach LLMs to solve complex problems using fewer reasoning steps. On a popular math test, this approach cut the required computing budget by 82.6% while maintaining high accuracy. Standard reasoning models suffer from overthinking, meaning Show more

12:49 PM · Mar 4, 2026

Read 11 replies

Google Research describes a training method to make LLMs reason like Bayesians

Bayesian reasoning training (Google Research): Google Research teased a method to teach LLMs to “reason like Bayesians” by training them to mimic optimal probabilistic inference, as stated in the announcement.

The tweet doesn’t include experimental results, datasets, or implementation detail, so the practical impact (calibration gains, robustness under uncertainty, or compatibility with existing post-training stacks) remains unspecified beyond the high-level claim in the announcement.

Google Research

@GoogleResearch

Introducing a new method to teach LLMs to reason like Bayesians. By training models to mimic optimal probabilistic inference, we improved their ability to update their predictions and generalize across new domains. Learn more: goo.gle/4ue4eqj

8:36 PM · Mar 4, 2026

1.1K

Read 15 replies

Beyond Language Modeling surveys multimodal pretraining design choices

Beyond Language Modeling (paper): A new paper surveys multimodal pretraining—architectures, fusion patterns, datasets, and evaluation—framing it as a step beyond next-token prediction toward joint representations across modalities, as linked in the paper link.

Because the tweet is just a pointer, the actionable value here is as an orientation map for teams comparing multimodal recipes (late fusion vs cross-attention vs unified tokenization), with the entry point being the paper page referenced alongside the paper link.

@_akhaliq

Beyond Language Modeling An Exploration of Multimodal Pretraining paper: huggingface.co/papers/2603.03…

5:29 PM · Mar 4, 2026

Beyond Length Scaling argues reward models need breadth and depth, not just longer context

Generative reward models (paper): A new paper argues that scaling reward models isn’t just about longer inputs (“length scaling”) and proposes combining breadth (coverage/diversity) with depth (reasoning fidelity) for better generative reward modeling, as pointed to in the paper link.

The tweet doesn’t provide headline metrics, but the paper’s positioning matters for teams doing RLHF/RLAIF-style pipelines where reward-model cost and reliability can dominate iteration speed; the canonical reference is the paper page linked from the paper link.

@_akhaliq

Beyond Length Scaling Synergizing Breadth and Depth for Generative Reward Models huggingface.co/papers/2603.01…

5:30 PM · Mar 4, 2026

💼 Enterprise & market signals: agent deployments, revenue run-rates, and licensing deals

Today’s business layer includes major claimed revenue acceleration for Anthropic, large enterprise agent rollouts, and content licensing for model training/RAG. Excludes defense policy details (security category).

Anthropic is again reported near a $20B revenue run rate as Claude Code demand spikes

Claude (Anthropic): A new round of market chatter pegs Anthropic near a $20B annual revenue run rate, described as a rapid step-up “in just a few weeks” and attributed to adoption of its models and Claude Code in particular, as claimed in the Run rate claim.

• Spending share corroboration: Ramp card/bill-pay data shows Anthropic’s share of U.S. business AI chat subscriptions rising toward parity with OpenAI by Jan 2026, as shown in the Ramp spend chart.
• Context from analysts: Ben Thompson’s writeup frames the moment as Anthropic hitting enterprise “escape velocity,” as linked in the Stratechery analysis.

The precise run-rate number isn’t independently audited in the tweets, but multiple adjacent indicators (spend share + usage momentum) point the same direction.

Anthropic is now nearing a $20B revenue run rate, up $5 billion in just a few weeks Anthropic is approaching a $20B annual revenue run rate, more than doubling from $9B at the end of 2025, driven by massive adoption of its AI models and tools like Claude Code. The company, now Show more

3:13 PM · Mar 4, 2026

170

FactoryAI and EY announce a 10,000-engineer rollout of autonomous dev agents

Droids (FactoryAI): FactoryAI says it is partnering with EY to scale agent-native development across 10,000+ engineers, positioning it as one of the largest cited enterprise deployments of autonomous software-dev agents to date, per the Partnership announcement.

• Stated enterprise value prop: EY’s engineering leadership frames the agents as helping address technical debt and consistency across large codebases “while maintaining enterprise standards,” as quoted in the Customer quote.

No technical deployment details (security model, change control, eval gates) are included in the tweets; this is primarily a scale-and-adoption signal.

Factory

@FactoryAI

Factory is partnering with EY to scale agent-native development across its global engineering organization, enabling more than 10,000 engineers to ship production-grade software with Droids. This represents one of the largest enterprise deployments of autonomous software Show more

7:04 PM · Mar 4, 2026

167

Read 7 replies

Enterprise AI adoption is split between fast deployers and risk-blocked orgs

Enterprise AI adoption: One recurring field report is that many companies still have AI effectively blocked by IT/legal “for out-of-date reasons,” even as peers in the same regulated industries deploy ChatGPT/Claude/Gemini, according to the Adoption divide.

• Governance lens: The deciding factor is framed as executive willingness to assume risk—otherwise “risk reduction forces” win by default, as argued in the Leadership framing.
• Procurement friction (vendor side): Separately, some Fortune 500 buyers reportedly can’t get senior deal support from major labs, with non-responsive sales motions and documentation geared mainly to developers, per the Enterprise buying pain.

The Colgate example in the WSJ profile is cited as an internal “AI Lab” pattern—central enablement that pushes usage beyond email-polish into deeper research and coding.

Ethan Mollick

@emollick

It is amazing how many companies I talk to STILL have AI effectively blocked by IT & legal departments for out-of-date reasons when many companies in highly regulated industries have figured out ways to deploy enterprise ChatGPT, Claude & Gemini without any apparent problem.

1:26 AM · Mar 5, 2026

1.0K

Read 69 replies

Meta signs News Corp AI licensing deal reported up to $50M per year

AI content licensing (Meta × News Corp): Meta and News Corp reportedly agreed to a multiyear licensing deal valued at up to $50M/year, granting Meta access to archives and current reporting for both training and retrieval use in AI products, as described in the Deal report.

This is another data-rights precedent: structured access to “clean” journalism becomes an input to both model improvement and RAG-style product surfaces.

Meta Platforms and News Corp (the parent company of the Wall Street Journal) have signed a major multiyear AI licensing agreement valued at up to $50 million per year. The deal, which is set to run for at least three years, grants Meta access to News Corp's vast archives and Show more

6:00 PM · Mar 4, 2026

Decagon completes a tender offer at a $4.5B valuation

Decagon (AI support agents): Decagon completed its first employee tender offer at a $4.5B valuation, a liquidity-and-talent signal for the customer-support agent category, as reported in the TechCrunch article and surfaced via the Tender offer post.

The tweet-level coverage doesn’t include product or model details, but the tender itself implies sustained demand and continued competition for specialized agent builders.

TechCrunch

@TechCrunch

Decagon completes first tender offer at $4.5B valuation techcrunch.com/2026/03/04/dec…

6:37 PM · Mar 4, 2026

116

🛡️ Safety & policy collisions: defense contracts, surveillance red lines, and “de-guardrailing” tools

Policy and misuse risks dominated by the DoD/Anthropic/OpenAI dispute plus new tooling aimed at removing refusal behaviors from open-weight models. Excludes enterprise revenue metrics (business category).

Dario Amodei memo leak escalates OpenAI–Pentagon dispute

Anthropic (Dario Amodei): A leaked internal memo portrays OpenAI’s new Pentagon/DoW deal as “safety theater” and frames Anthropic’s earlier red lines (surveillance/autonomous weapons) as substantively different from OpenAI’s messaging, as recapped in the memo coverage and echoed in the reported excerpt.

• Politics-and-perception claims: The memo reportedly argues the administration disliked Anthropic for not offering “dictator-style praise” and not donating, per the quote screenshot and the summary thread.
• Internal comms tone: The leak includes the line about persuasion “working on some Twitter morons,” as shown in the highlighted excerpt, which is now becoming part of the public narrative around the dispute.

The primary artifact is still secondhand reporting; the most direct pointer to the text is the The Information article, which multiple tweets cite while selectively excerpting.

The Rundown AI

@TheRundownAI

Anthropic CEO Dario Amodei sent a memo to employees on Friday calling OpenAI's Pentagon deal "safety theater". He said the real reason the government cut ties with Anthropic: they didn't donate to Trump or give "dictator-style praise." Another big scoop from @theinformation:

9:38 PM · Mar 4, 2026

Altman all-hands leak: OpenAI won’t arbitrate military operations

OpenAI (Sam Altman): A leaked all-hands transcript shows Altman telling employees that “operational decisions” are the government’s call and staff “don’t get to weigh in” on which strikes/invasions are good or bad, as shown in the article screenshot and amplified in the summary post.

• Competitive framing: The leak claims Altman warned that if OpenAI pushes back too hard, alternatives like xAI would say “we’ll do whatever you want,” according to the summary post.

For engineering leaders, the practical implication is governance: it’s a signal about where accountability boundaries are being drawn between model provider, deployment harness, and operator—even when internal staff disagree.

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭

So it *wasn't* the same deal that Anthropic wanted to negotiate after all, and Sam Altman was just feigning ignorance, or how is that to be understood?

9:03 AM · Mar 4, 2026

314

Read 31 replies

OBLITERATUS claims “refusal removal” for open-weight models, with telemetry dataset

OBLITERATUS: A new open-source toolkit claims to remove refusal behaviors from open-weight LLMs using SVD-style weight-space projections (positioned as no fine-tuning), with a six-stage pipeline and multiple “obliteration methods,” as described in the launch thread.

• Crowdsourced benchmarks via telemetry: The project advertises an opt-in (and in some contexts default-on) telemetry flow where runs contribute benchmark data to a community dataset, as described in the launch thread and linked as the Telemetry dataset.

This is squarely a safety-and-misuse collision: it’s tooling explicitly aimed at reducing refusal constraints, paired with an incentive mechanism to scale experimentation.

@elder_plinius

💥 INTRODUCING: OBLITERATUS!!! 💥 GUARDRAILS-BE-GONE! ⛓️‍💥 OBLITERATUS is the most advanced open-source toolkit ever for removing refusal behaviors from open-weight LLMs — and every single run makes it smarter. SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH One Show more

10:04 PM · Mar 4, 2026

3.0K

Read 142 replies

OpenAI says it will withhold NSA and intel deployments pending policy process

OpenAI (Noam Brown): Noam Brown states OpenAI “will not be deploying to the NSA or other DoW intelligence agencies for now,” citing the need to address “surveillance loopholes” through a “democratic process,” as shown in the statement screenshot.

• What changed: The claim is an explicit pause/withholding on a specific customer class (intelligence agencies), not just a promise of safeguards; the same note says agreement language was updated but deployment is still withheld, per the statement screenshot.

This follows earlier debate about surveillance and “red lines”; the key new detail is the stated non-deployment posture to intelligence agencies, building on DoW amendment (contract amendment claims) with a more specific scope.

Noam Brown has confirmed that OpenAI will withhold deployment to the NSA and all other DoW intelligence agencies for the time being. This pause is intended to allow for a "democratic process" to address loopholes that could allow AI to be used for mass surveillance under the Show more

Noam Brown

@polynoamial

tl;dr: @OpenAI will not be deploying to the NSA or other DoW intelligence agencies for now, so that there's time to address potential surveillance loopholes through the democratic process. Over the weekend it became clear that the original language in the OpenAI / DoW agreement

2:00 PM · Mar 4, 2026

Anthropic returns to Pentagon talks after “supply chain risk” standoff

Anthropic (DoD/DoW negotiations): A Financial Times report says Anthropic is back in discussions with the Pentagon, positioning it as an attempt to reach a compromise after being designated a “supply chain risk,” as shown in the FT screenshot.

• Why it matters operationally: The “supply chain risk” label is being treated like a procurement gate, so the delta isn’t PR—it changes whether Anthropic models can be used in government/contractor environments at all, per the FT screenshot and the thread claim.

No terms are public yet (no updated red-line language or enforcement mechanism is shown in the tweets), so this remains a process update rather than a compliance spec.

AshutoshShrivastava

@ai_for_success

Anthropic vs Pentagon just took another twist. Anthropic is resuming negotiations with the Pentagon for an artificial intelligence deal, according to an FT report. So is Dario and Anthropic finally caving?

3:26 AM · Mar 5, 2026

Read 16 replies

State Department switches “StateChat” from Claude to GPT‑4.1

U.S. State Department (StateChat): A Reuters excerpt says the State Department is switching its in-house chatbot model “to OpenAI from Anthropic,” with “for now” using GPT‑4.1, as shown in the Reuters excerpt.

• Policy-driven migration: The same excerpt ties the change to a directive to cancel Anthropic contracts and “bring our programs into full compliance,” per the Reuters excerpt.

For analysts, this is a rare, concrete example of a policy fight immediately forcing model re-selection (and potentially regression) in a production internal tool.

The Rundown AI

@TheRundownAI

The U.S. State Department is switching its 'StateChat' off of Claude and onto... *checks notes* GPT 4.1, released in April 2025?

3:31 PM · Mar 4, 2026

“Supply chain risk” label becomes a contractor-level kill switch

DoW procurement fallout: A report claims the DoW “supply-chain risk” designation came with a mandate that defense contractors can’t do commercial business with Anthropic, effectively forcing partners (example given: Palantir) to choose between government contracts and Anthropic collaboration, per the reported fallout.

This is not yet backed by a published policy memo in the tweets, but it’s an important pattern to watch: labeling a model vendor can propagate beyond direct government procurement into the private contractor ecosystem.

A new scoop from The Information reveals that the partnership between Anthropic and defense giant Palantir is likely the first major casualty of the government's new AI blacklist. Because the Department of War designated Anthropic a "Supply-Chain Risk," they issued a sweeping Show more

aaron holmes

@aaronpholmes

Scoop: Anthropic’s business partnership with Palantir could be the first casualty of its Pentagon spat theinformation.com/articles/anthr…

2:00 AM · Mar 5, 2026

Prompt injection gets reframed as a future advertising surface for agents

Prompt injection threat model: A new framing argues that as agents ingest third-party web/API content, companies may try to inject ads or persuasive text into agent context windows (“prompt injection as the new ad vector”), per the threat model note.

The same thread notes that some agent systems wrap external content as “untrusted,” but questions whether that will be sufficient in practice, per the follow-up note.

a commenter on my youtube video mentioned "prompt injection as the new ad vector" they are worried corporations will pollute agent context windows with ads through APIs ...the more I think about it the more certain I am that this is 100% going to be the case

2:21 AM · Mar 5, 2026

Read 16 replies

🛠️ Vibe-coding & app generators: desktop apps, games, and teen-built products

A distinct “prompt-to-app” cluster today: desktop app builders, game engines moving toward natural-language authoring, and fast iteration stories. Excludes IDE-native coding assistants (Cursor/Claude Code/Codex).

Raycast launches Glaze, a waitlisted chat-to-desktop-app builder

Glaze (Raycast): Raycast announced Glaze, a new “vibe-coding” product aimed at generating desktop apps via chat, currently gated behind a waitlist as shown in the feature summary.

The positioning is about making desktop software mutable post-install (“reshape them anytime”), which shifts the developer job from app scaffolding toward spec’ing and iterating on behavior and UI in tighter loops, per the feature summary and the earlier launch retweet.

TestingCatalog News 🗞

@testingcatalog

Raycast announced Glaze, a new vibe-coding tool for building desktop apps. Currently available under a waitlist. "Software used to be something you installed and lived with. When you can reshape them anytime, they stop being static and start adapting to you."

Raycast

@raycast

Here's our official blog post raycast.com/blog/introduci…

4:47 PM · Mar 4, 2026

125

Teen founders use Rork to ship 114 versions and reach about $1K/month with HockeyAI

HockeyAI (Rork): A case study circulating around Rork describes two teen founders (14 and 17) shipping 114 app versions in ~2 months and reaching about $1K/month for an AI hockey tape analysis app, as outlined in the metrics thread.

The distribution tactic is unusually concrete: they DM’d ~200 influencers and negotiated down to $1 CPM using their age as leverage, then switched to self-produced content when budget ran out, per the metrics thread and the longer case study. The app and monetization surface are visible via the App Store listing.

Rork

@rork

A 14yo is making $1k/mo with a Rork app. If he can do it, why can't you? Eli is 14 and Harris, his brother & co-founder, is 17. They built HockeyAI – an app that uses AI to analyze your hockey tape and make you better. They wanted to improve at hockey, but couldn't afford a Show more

5:21 AM · Mar 5, 2026

Read 10 replies

Unity is teasing a Unity AI beta for natural-language casual game creation at GDC

Unity AI (Unity): Reports claim Unity will unveil a beta of an upgraded Unity AI at GDC on March 12, 2026, with a goal of prompting full “casual” games from natural language, as described in the beta details.

The framing suggests Unity will combine project/runtime context awareness with “best frontier models,” per the beta details and the supporting earnings-call coverage. What’s still unclear from the tweets is whether this ships as an editor-native authoring flow, a separate generation pipeline, or a hosted service tier.

AiBattle

@AiBattle_

Big Update Coming to the Unity Game Engine for AI Game Development - The company will unveil a beta of a upgraded Unity AI that lets developers prompt full "casual" games into existence using only natural language - This AI assistant will be powered by Unity’s understanding of Show more

11:17 AM · Mar 4, 2026

169

🎙️ Voice interfaces for agents: desktop voice mode, STT upgrades, and agent UIs

Voice shows up as an interaction primitive for agents: Perplexity Computer voice mode, frontend scaffolds for voice agents, and real-time STT implementation details. Excludes Claude Code voice STT changes (covered in Claude Code category).

Perplexity Computer ships Voice Mode with a listen-until-done option

Perplexity Computer (Perplexity): Perplexity shipped Voice Mode for its Computer product so you can talk to the agent instead of typing, as shown in the Voice Mode announcement.

It also added an Extended Speaking setting that keeps listening without interrupting until you’re finished—positioning voice as a higher-bandwidth “command layer” for longer multi-step requests, as demonstrated in the Extended speaking option.

Perplexity

@perplexity_ai

Introducing Voice Mode in Perplexity Computer. You can now just talk and do things.

9:08 PM · Mar 4, 2026

2.9K

Read 133 replies

LiveKit publishes a 5-minute Agents UI tutorial for voice-agent frontends

Agents UI (LiveKit): LiveKit shipped a tutorial that wires up a voice-agent frontend in about 5 minutes, including audio visualizers, media controls, and session management, with shadcn-based components called out in the Agents UI tutorial.

This is a concrete “voice UX scaffold” pattern: ship a standard UI + session plumbing so teams can spend time on turn-taking, interruption policy, and tool permissions instead of rebuilding the player and transcript chrome each project.

LiveKit

@livekit

We shipped the tutorial for Agents UI. In 5 minutes you'll have a fully wired voice agent frontend with audio visualizers, media controls, and session management built directly into your codebase. Watch it, build it, own it. shadcn inside™.

LiveKit

@livekit

Introducing Agents UI, an open-source @shadcn component library for building polished React frontends for your voice agents. Audio visualizers. Media controls. Session management tools. Chat transcripts. All wired to LiveKit Agents. Install via the shadcn CLI and own the code.

4:14 PM · Mar 4, 2026

171

ElevenLabs documents how Scribe v2 Realtime streams partial and committed transcripts

Scribe v2 Realtime (ElevenLabs): ElevenLabs published a technical overview of its low-latency STT stack—how it streams partial transcripts and then commits accurate segments over websockets—spelling out the mechanics for real-time voice apps in the Tech overview, as introduced in the Realtime STT overview.

ElevenLabs Developers

@ElevenLabsDevs

Learn how Scribe v2 Realtime works. In this overview, we explain how our ultra low latency Speech to Text model streams audio, generates partial transcripts, commits accurate segments, and enables real-time voice applications. elevenlabs.io/blog/how-scrib…

9:02 PM · Mar 4, 2026

Voice agent edge case: bot-to-bot phone calls can loop for hours and burn credits

Voice agents (Operations risk): A reported edge case had a caller voice agent reach an AI receptionist, and the two systems spent two hours in a polite confirmation loop without resolving the task—creating a real “agent-to-agent deadlock” cost failure mode, as described in the Call loop anecdote.

This highlights that voice deployments need explicit conversation termination conditions (timeouts, goal-state checks, escalation triggers) in addition to ASR/TTS quality.

Olivia Moore

@omooretweets

New voice agent edge case just dropped 😂 “They kept politely confirming things, asking for clarification, thanking each other, re-confirming previous confirmations”

8:06 PM · Mar 4, 2026

Read 18 replies

🤖 Embodied AI & industrial autonomy: robot memory stacks and factory automation

Embodied autonomy shows up via memory architectures and factory automation platforms (video-first learning). Excludes general multimodal training papers unless directly tied to robotics deployment.

Physical Intelligence demos Multi‑Scale Embodied Memory (MEM) on a multi-step robot task

MEM (Physical Intelligence): A new demo shows Multi‑Scale Embodied Memory handling a concrete multi-step task—sorting 10 blocks into 2 bins—with short-term video memory plus longer-horizon text summaries, extending the earlier MEM description in MEM (robot memory stack for ~15-minute tasks) as shown in the MEM block-sorting clip.

The same rollout is framed as being integrated into the π0.6 robot model, per the MEM block-sorting clip and the earlier announcement thread in MEM teaser clip.

Ksenia_TuringPost

@TheTuringPost

.@physical_int introduced Multi-scale Embodied Memory (MEM) It combines a video-based short-term memory + text summaries of past actions. Integrated into the π0.6 robot model now, MEM helps robots handle long, multi-step tasks up to 15 minutes.

1:07 AM · Mar 5, 2026

WSJ: Ex‑OpenAI research chief Bob McGrew raises $70M for Arda factory automation

Arda (Bob McGrew): The WSJ reports McGrew is raising $70M at a $700M valuation for Arda, pitched as a factory-automation platform where a video-based model learns tasks from production-floor footage and then coordinates robots and humans across the production cycle, according to the WSJ summary.

This is one of the clearer “industrial autonomy” signals in the feed: real-world video learning tied directly to workflow orchestration, not just lab demos.

WSJ: Former OpenAI research chief Bob McGrew is raising $70M at a $700M valuation for a startup called Arda that builds software to automate factories. The company relies on a video-based AI model that watches actual footage from production floors to learn how tasks are done. Show more

The Wall Street Journal

@WSJ

Exclusive: OpenAI’s former chief research officer is raising $70 million for a new startup building an AI and software platform to automate manufacturing on.wsj.com/4b5FVBQ

8:54 PM · Mar 4, 2026

Noble Machines emerges as an industrial “Physical AI” startup (27 kg payload mention)

Noble Machines (Industrial Physical AI): A circulating mention says ex‑SpaceX/Apple/NASA veterans launched Noble Machines to build industrial “Physical AI,” with a claim that the system can manage 27 kg payload work, as referenced in the Noble Machines mention.

Details are thin in the tweets (no linked spec, benchmarks, or deployment notes), so treat this as an early market signal rather than an artifact-backed launch.

Ex-SpaceX, Apple, and NASA veterans launched Noble Machines to advance industrial Physical AI. The system manages 27kg payloads, operates for 5 hours per charge, and achieves 95% sim-to-real success using NVIDIA Isaac training pipelines.

6:27 AM · Mar 4, 2026

381

Read 20 replies

🎨 Generative media & creative tooling: cinematic summaries, text-in-image, and consistent characters

Media tooling continues to shift toward “presentation-quality” outputs: cinematic overviews for research notes, improved typography in image gen, and workflows for consistent characters. Excludes voice agents and retrieval-only features.

NotebookLM adds Cinematic Video Overviews for Google AI Ultra users

NotebookLM (Google): Google is rolling out Cinematic Video Overviews to Google AI Ultra subscribers, positioning it as a way to turn dense notes/research into polished animated explainers powered by “a novel combination of Google’s most advanced models,” as shown in the feature clip and echoed in the rollout note.

The practical change is that NotebookLM’s output isn’t just summaries anymore—it’s aiming at “presentation-ready” artifacts (video-first) that can be reused in decks, internal briefings, or teaching content, with the rollout scope constrained to Ultra for now per the feature clip.

TestingCatalog News 🗞

@testingcatalog

Google is rolling out new Cinematic Video Overviews, powered by "a novel combination of Google's most advanced models," to Ultra subscribers. The next level 👀

5:09 PM · Mar 4, 2026

379

Read 7 replies

Qwen Image 2 emphasizes typography, ships on Replicate, and enters Image Arena

Qwen Image 2 (Alibaba): Qwen Image 2 is being pitched around readable text and layout (“text rendering that actually works,” poster/slide typography) plus 2K photoreal outputs, per the feature summary, and it’s now easy to try via the model page.

• Benchmarking surface: The model is also live in Image Arena as shown in the arena listing, with head-to-head comparisons accessible via the Image Arena page.

The open question from today’s tweets is whether the typography claims hold up broadly outside curated prompts—there’s platform availability and arena exposure, but no independent quantitative text-rendering eval artifact shared in-thread.

Replicate

@replicate