Claude computer use hits macOS – per-session scopes, shipped in ~4 weeks

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.

9:38 PM · Mar 23, 2026

71.5K

Read 2.7K replies

Claude computer use ships with per-session app permissions and a safety checklist

Permission UX (Anthropic): Turning on computer use triggers a session-scoped permission flow that spells out screenshots + mouse/keyboard control, plus app-level scopes (including “Finder: full control”), and clipboard read/write—paired with a “keep in mind” safety checklist—shown in the Permissions dialog.

The UI implies a capability-style model rather than a single “take over my desktop” toggle; it also makes the risk surface explicit (file access, clipboard access, unintended actions) before the agent starts clicking.

Claude

@claudeai

Replying to @claudeai

Claude uses your connected apps first: Slack, Calendar, and other integrations. When there's no connector for the tool you need, it asks for your permission to open the app on your screen directly.

Two dialog boxes showing Claude requesting permission to use specific apps like Finder and Google Chrome, with options to deny or allow for the session.

9:38 PM · Mar 23, 2026

Anthropic shipped computer use about four weeks after acquiring Vercept

Acquisition-to-launch timeline (Anthropic): Several posts claim Anthropic acquired a computer-use company and shipped computer use roughly four weeks later, with one explicitly tying this to the Vercept acquisition timing, as stated in the Four weeks claim and summarized again in the Acquisition recap.

The concrete datapoint here is the reported ~4-week integration window; it suggests the feature was already near production internally, or Vercept’s work slotted directly into an existing Claude Desktop/Code surface.

Thariq

@trq212

Replying to @trq212

btw we acquired a computer use company and then 4 weeks later shipped computer use x.com/LucaWeihs/stat…

Luca Weihs

@LucaWeihs

It's been 26 days since Vercept joined Anthropic ➡️ Claude is now able to use your mouse and keyboard to control your computer in Cowork and the Claude Code Desktop app 🖱️⌨️💻 I'm so excited for everyone to play with what we've been building!

4:43 AM · Mar 24, 2026

133

Read 12 replies

Claude computer use defaults to connectors, then asks to drive on-screen apps

Computer-use execution flow (Anthropic): The rollout is structured as “connectors first” (Slack, Calendar, and other integrations), then a permissioned fallback where Claude can open and operate whatever app is on your screen if there’s no connector, as described in the Connector-first note and expanded in the Thread context.

This is a practical design choice for enterprises: where an API exists, it’s usually faster and more auditable; where it doesn’t (or the tool is legacy), UI control becomes the escape hatch.

• Operational implication: teams that already invested in connectors get more deterministic runs; teams with long-tail internal tools can still automate workflows via UI when needed, but only after explicit user approval, per the Connector-first note.

Claude

@claudeai

Replying to @claudeai

Claude uses your connected apps first: Slack, Calendar, and other integrations. When there's no connector for the tool you need, it asks for your permission to open the app on your screen directly.

9:38 PM · Mar 23, 2026

Dispatch adds a mobile-to-Mac loop for Claude’s computer-use sessions

Claude Dispatch + computer use (Anthropic): Posts describe a workflow where you can send instructions from mobile while a linked Mac executes the task using computer control, positioning Dispatch as the “remote control” layer for these desktop sessions, as shown in the Dispatch demo clip and echoed in the Rundown summary.

This changes the practical “availability” model: instead of an agent requiring you to sit at the Mac, the Mac becomes an execution surface the agent can drive while you’re elsewhere.

TestingCatalog News 🗞

@testingcatalog

BREAKING 🚨: Claude can now use your computer and apps to complete its tasks. The same also works with Claude Dispatch, where you can instruct Claude from your mobile device. A full-featured Computer Use 👀 "Research preview in Claude Cowork and Claude Code, macOS only."

Claude

@claudeai

Claude uses your connected apps first: Slack, Calendar, and other integrations. When there's no connector for the tool you need, it asks for your permission to open the app on your screen directly.

10:14 PM · Mar 23, 2026

336

Read 11 replies

A builder argues “computer use” is the wrong abstraction for software

Interface strategy debate: One thread argues that UI-driving “computer use” is an inefficient and less controllable approach for software (vs humanoid robotics), advocating instead for agent-first connectors with fine-grained permissions, auditing, and headless execution—positioned as a more secure and ergonomic path for agents—per the Connector-first critique.

This critique doesn’t dispute usefulness; it disputes whether “click the UI” should be the default interface layer once teams can realistically build and standardize connectors.

Jeffrey Emanuel

@doodlestein

The new Anthropic computer use system is cool and looks useful, but I think it’s ultimately the wrong approach. It’s one thing to use humanoid robots because they can seamlessly slot into a human-centric physical world that’s difficult and expensive to reconfigure, and use the Show more

Felix Rieseberg

@felixrieseberg

Today, we’re releasing a feature that allows Claude to control your computer: Mouse, keyboard, and screen, giving it the ability to use any app. I believe this is especially useful if used with Dispatch, which allows you to remotely control Claude on your computer while you’re

11:48 PM · Mar 23, 2026

Read 19 replies

Anthropic Labs frames computer use as catching up to model capability

Anthropic Labs shipping cadence: An Anthropic Labs team member says the small team that shipped MCP, Skills, Claude Desktop, and Claude Code is now releasing “full computer use” in Cowork and Dispatch; they describe early desktop prototypes as “clunky and slow,” and position today’s release as the point where the harness is closer to what the models can do, per the Team note.

This is a useful signal for engineers tracking product direction: Anthropic is treating desktop control as a first-class harness primitive (alongside connectors and MCP), not a side demo.

Boris Cherny

@bcherny

Little known fact, the Anthropic Labs team (the team I joined Anthropic to be on) shipped: - MCP - Skills - Claude Desktop app - Claude Code It was just a few of us, shipping fast, trying to keep pace with what the model was capable of. Those early Desktop computer use Show more

Claude

@claudeai

3:38 AM · Mar 24, 2026

Read 168 replies

Desktop control reopens automation for “no-API” enterprise software

Enterprise applicability: Commentary calls this “another domino” because it enables automation across arbitrary desktop apps—especially legacy or bespoke corporate tools that lack modern APIs—though it also notes the likely limited near-term impact given “macOS only” and “research preview” constraints, per the Enterprise legacy apps note.

For analysts, this is a shift in go-to-market surface area: “works with your weird internal app” becomes plausible without waiting on vendor integrations, but the security and permission model becomes the gating factor.

Peter Gostev

@petergostev

This is another domino to fall - computer use of arbitrary apps, not just your browser. This is a big deal for lots of corporates who have custom crappy apps from 20-30 years ago. Since it is Mac only and Research Preview, I'm guessing it won't be very impactful, but it is here.

Claude

@claudeai

9:47 PM · Mar 23, 2026

300

“Orbit” rumor points to Claude phone-use capabilities

Phone-use expansion (Anthropic, unconfirmed): A leak-style post claims Anthropic is likely working on a “Phone Use” capability (code-named “Orbit”) to let Claude execute tasks on a mobile device and make calls, per the Orbit rumor.

This is not a shipped feature in the tweets; treat it as roadmap speculation until there’s an Anthropic doc, UI, or release note confirming surfaces, permissions, and rollout constraints.

TestingCatalog News 🗞

@testingcatalog

BREAKING 🚨: Anthropic is also likely working on a Phone Use, so Claude will be able to make calls and execute tasks on your mobile device. Orbit 🪐 Discovered by @M1Astra

TestingCatalog News 🗞

@testingcatalog

10:37 PM · Mar 23, 2026

1.6K

Read 50 replies

🧰 Claude Code ops: scheduled cloud tasks, permissions, and usage-limit bugs

Non-computer-use Claude Code news: recurring cloud jobs (/schedule), channel permission prompts, and reports of Max-tier rate-limit/accounting issues. Excludes the macOS computer-use feature covered in the feature section.

Claude Code adds Scheduled Cloud Tasks for recurring background agent runs

Claude Code (Anthropic): Anthropic is rolling out Scheduled Cloud Tasks for Claude Code—recurring agent workflows that run in Anthropic’s cloud so you don’t need to keep a local terminal/tab/machine open, as described in the Scheduled Cloud Tasks clip.

The feature is being discussed as a terminal-first primitive ("Use /schedule") for periodic automation, as echoed in the Schedule command retweet, with early examples centered on ops-style loops (polling, triage, fixes) rather than one-shot codegen.

Wes Roth

@WesRoth

Anthropic rolled out Scheduled Cloud Tasks for Claude Code, allowing developers to automate recurring agentic workflows that run entirely in the background. You no longer need to keep your terminal, browser tab, or local machine running. Once a task is scheduled, Claude executes Show more

Noah Zweben

@noahzweben

You can now schedule recurring cloud-based tasks on Claude Code. Set a repo (or repos), a schedule, and a prompt. Claude runs it via cloud infra on your schedule, so you don’t need to keep Claude Code running on your local machine.

10:00 AM · Mar 23, 2026

323

A concrete /schedule loop: hourly Sentry triage to PR fix plus self-review

Claude Code (Anthropic): One detailed /schedule workflow shows what “background agents” look like when you wire them to real ops inputs: hourly polling Sentry via MCP, investigating root cause in-repo, opening a PR, then having Claude review and iterate—ending with an email notification, as shown in the Sentry auto-fix config.

The important design detail is that the schedule is tied to a connector (Sentry MCP) and ends at a durable artifact (a PR), not an in-chat summary—making the loop auditable and easy to hand off.

Dan McAteer

@daniel_mac8

Used /schedule in Claude Code to: 1. Poll Sentry w/ the MCP 1/hr 2. Retrieve new issues 3. Investigate root cause 4. Create PR fix 5. Claude reviews PR 6. Fix findings 7. Email notification of new PR from Claude Claude autonomous Prod fixes for aceagent.io!

Noah Zweben

@noahzweben

Use /schedule to create recurring cloud-based jobs for Claude, directly from the terminal. We use these internally to automatically resolve CI failures, push doc updates, and generally power automations that you want to exists beyond a closed laptop

8:43 PM · Mar 23, 2026

Claude Max subscribers report a rate-limit/accounting bug after allowance changes

Claude Max (Anthropic): Developers on the Claude Max ($100/mo) and Max 20x ($200/mo) tiers report getting locked out almost immediately due to what’s described as a token-usage accounting bug in how Claude Code calculates consumption, following a weekend of expanded allowances, per the Rate-limit bug report.

If accurate, this is an ops problem more than a model limit—users are hitting session/rolling-window caps unexpectedly, and the symptoms show up as rapid saturation of the usage bars rather than gradual depletion.

Wes Roth

@WesRoth

Following a weekend of expanded usage allowances, Anthropic’s highest-tier subscribers are waking up to a crippling rate-limit bug. Developers paying top dollar for the "Claude Max" ($100/mo) and "Max 20x" ($200/mo) plans are reporting that their accounts are being locked out Show more

Brad Groux

@BradGroux

Something is up with Claude Code usage today. $200 Claude Max, 0%, 52% to 62%, then 68%, 76% and 84% in 5-hour rolling window in the time it took me to write this tweet. WTF, @AnthropicAI? I'm working on one GitHub PR for regression testing. Not folding proteins to cure cancer.

11:00 PM · Mar 23, 2026

Anthropic suspension report raises questions about third-party tooling boundaries

Anthropic account enforcement: A developer reported a first-time Anthropic account suspension, attributing it to using a third-party usage-stats tool (“CodexBar”) and sharing the suspension email screenshot in the Suspension email.

Follow-up replies in the same orbit question whether the endpoint involved is official in the Endpoint legitimacy question, with an Anthropic-affiliated account saying it “will follow up” in the Follow-up reply. The net signal is uncertainty about what counts as acceptable automation/telemetry around Claude usage vs. what triggers enforcement.

Max Rovensky

@MaxRovensky

baby's first @AnthropicAI ban, this one is for using CodexBar for usage stats clown company I swear to god

2:40 PM · Mar 23, 2026

740

Read 50 replies

Claude Code channels add Permission Prompts, with updates required

Claude Code (Anthropic): Claude Code channels now support Permission Prompts, and the update requires both updating Claude and updating channel plugins, per the Permission prompts note.

This lands as a harness-level control point (permissions as an interaction step) rather than a model capability change; Anthropic’s broader desktop documentation enumerates multiple permission modes and environments in the Desktop docs, though the tweet callout here is specifically about channels and plugin updates.

Noah Zweben

@noahzweben

Claude Code channels now support Permission Prompts. Update to latest claude and update your channel plugins!

11:49 PM · Mar 23, 2026

257

Read 34 replies

Claude Code users report multi-minute stalls on basic tasks like pushing a repo

Claude Code reliability: A user report highlights Claude “thinking” for ~5 minutes about a simple “push my repo” instruction, per the Five-minute billowing screenshot.

This is consistent with a growing class of harness complaints where long-horizon agents feel intermittently “stuck” on mundane glue steps (git, auth, release steps), which can dominate wall-clock time even when the model is capable of the actual code change.

finbarr

@finbarrtimbers

Claude's been very weird today; has spent 5m thinking about how to push my repo

7:34 PM · Mar 23, 2026

Read 5 replies

🔎 Cursor ships Instant Grep: millisecond regex search across huge codebases

Cursor’s big engineering update: local indexed regex search (‘Instant Grep’) to accelerate agentic coding loops by avoiding full scans. This is mostly deep systems details (n-grams/inverted index tradeoffs), plus practitioner commentary.

Cursor adds Instant Grep: local indexed regex search in milliseconds

Instant Grep (Cursor): Cursor says it can now “search millions of files and find results in milliseconds,” aimed at cutting agent wall-clock time that’s dominated by repeated codebase search operations, as announced in the Instant Grep launch.

• Performance claim: the published benchmark shows 13ms for Instant Grep locally, 243ms with a us-east-1 roundtrip, versus 16.8s for ripgrep on a Chromium query, as shown in the Latency comparison.

• Why it matters for agents: the pitch is less “faster regex” and more “faster candidate set,” so the agent can iterate on hypotheses quickly instead of paying a full scan penalty every time it asks another broad query, per the Fast regex indexing post.

Cursor

@cursor_ai

Cursor can now search millions of files and find results in milliseconds. This dramatically speeds up how fast agents complete tasks. We're sharing how we built Instant Grep, including the algorithms and tradeoffs behind the design.

4:47 PM · Mar 23, 2026

4.9K

Read 151 replies

Instant Grep’s core trick: index-first regex to avoid opening most files

Cursor indexing design: Cursor’s write-up frames Instant Grep as a pragmatic regex acceleration stack—n-gram-based indexing, inverted posting lists, and probabilistic filters—optimized around the reality that agents run far more searches than humans and will happily spam wide regexes, as described in the Fast regex indexing post and introduced in the Article share.

• Key pattern: treat regex as a two-phase system—(1) derive required-ish substrings to get a small candidate set; (2) run the true regex engine only on that set—so “search” becomes an indexing and IO orchestration problem rather than pure compute.

• Engineering tradeoffs called out: query decomposition quality versus index size, and the need for predictable local latency (index locality and caching) rather than relying on server-side grep that adds network jitter.

Cursor

@cursor_ai

Replying to @cursor_ai

Learn more: cursor.com/blog/fast-rege…

4:47 PM · Mar 23, 2026

446

Read 8 replies

Instant Grep discourse centers on whether “trigrams” is the point

Community reaction to Instant Grep: The release triggered the predictable “this is just trigrams” critique—see the Trigram jab—followed by pushback that trigrams are the baby example and the real work is query decomposition, index size, and avoiding worst-case regex behavior, as argued in the Trigrams are toy follow-up.

• Constraints that show up in practice: defenders emphasize editor realities that classic code-search infra doesn’t face—users with limited disk/CPU, much bigger-than-Twitter monorepos, and agents generating more adversarial regexes than humans—captured in the Local constraints note.

kache

@yacineMTB

took them this long to learn what a trigram index is? lel

Cursor

@cursor_ai

5:38 PM · Mar 23, 2026

310

Read 13 replies

Cursor’s positioning debate shifts from tooling to owning the best model

Cursor competitive narrative: Alongside the Instant Grep shipping story, a sharper ecosystem take is circulating that tooling improvements won’t be the decisive moat; the argument is Cursor “will die” unless it builds “the best coding model in the world,” as stated in the Model moat claim.

The subtext is that fast local search helps the harness, but the market may still reward whoever owns the strongest model+tool loop, rather than whoever best wraps everyone else’s models.

ben

@benhylak

only one thing will save cursor: building the best coding model in the world. nothing else matters. if they can't do this, they'll die.

9:59 PM · Mar 23, 2026

132

🗂️ ChatGPT file workflows: Library tab, recents, and cross-chat document reuse

OpenAI shipped account-level file persistence UX: a Library for uploaded/created files with quick insertion and cross-conversation reference. This category stays on ChatGPT’s document surface (not ads/monetization, which is covered elsewhere).

ChatGPT adds a Library for persistent files and cross-chat reuse

ChatGPT (OpenAI): OpenAI is rolling out a Library tab (web sidebar) that automatically saves uploaded/created files so they can be reused across conversations, alongside a composer flow for Recent files → Add from Library, as shown in the product announcement.

Rollout details in the same announcement note it’s live for Plus, Pro, and Business globally, while EEA, Switzerland, and the UK are listed as “coming soon,” per the product announcement and the release notes recap (which also points to the updated release notes in Release notes entry).

• Limits and what persists: File storage is now separated from the originating chat thread (so the file becomes an account-level artifact); the commonly cited caps circulating with the rollout include 512MB per file, 2M tokens per text/doc file (not spreadsheets), ~50MB for CSV/spreadsheets, and 20MB per image, as summarized in the limits recap.

Net effect: the “upload once, reference later” loop becomes a first-class part of ChatGPT’s doc workflows instead of being tied to a single chat.

OpenAI

@OpenAI

It’s now easier to find, reuse, and build on the files you upload and create in ChatGPT. You can quickly reference files in a chat using recent files in the toolbar, ask ChatGPT about something you’ve uploaded, or browse your files in the new Library tab in the web sidebar. Show more

8:47 PM · Mar 23, 2026

3.4K

Read 282 replies

ChatGPT restores editing and retrying for all messages

ChatGPT (OpenAI): OpenAI is bringing back the ability to edit and retry any message, not only the most recent one, according to a user-visible update highlighted in the feature change note.

This change isn’t specific to file storage, but it affects document-heavy threads where teams iterate on prompts and outputs over long histories (revising an earlier instruction and replaying downstream steps) rather than restarting a new chat.

Tibor Blaho

@btibor91

Update - looks like OpenAI took the feedback seriously and they're bringing back editing and retrying for all messages in ChatGPT, not just the most recent ones

Tom Spis

@hey_tommy

Wow — it just got A LOT worse. As of today, you can only edit and regenerate the last user and the last model response, respectively. Thought it was a bug, but no, it's the new normal, and OAI has already updated their support articles. I think this is the last straw for me.

6:18 PM · Mar 23, 2026

373

Read 25 replies

🧑‍✈️ OpenClaw ops & ecosystem: plugin marketplace, providers, and release automation pain

High-volume OpenClaw chatter: major releases, provider plugins, marketplace mechanics, and maintainers dealing with release automation/CI realities. This is about running/orchestrating agents, not underlying model research.

OpenClaw 2026.3.22 adds ClawHub marketplace, OpenShell/SSH sandboxes, and search integrations

OpenClaw 2026.3.22 (OpenClaw): The project shipped a large release that turns extensibility into a first-class surface (ClawHub plugin marketplace) while also adding safer execution primitives (OpenShell plus SSH sandboxes) and wiring in multiple search backends—details are summarized in the release highlight list and spelled out in the upstream Release notes.

The release reads like an attempt to make “agent ops” repeatable: predictable plugin install, model/provider fan-out, and sandboxes you can hand to an agent without giving it your whole machine.

OpenClaw🦞

@openclaw

OpenClaw 2026.3.22 🦞 🏪 ClawHub plugin marketplace 🤖 MiniMax M2.7, GPT-5.4-mini/nano + per-agent reasoning 💬 /btw side questions 🏖️ OpenShell + SSH sandboxes 🌐 Exa, Tavily, Firecrawl search This release is so big it needs its own table of contents. github.com/openclaw/openc…

11:34 AM · Mar 23, 2026

5.6K

Read 520 replies

OpenClaw 2026.3.23 adds DeepSeek, Qwen pay-as-you-go, and OpenRouter auto pricing

OpenClaw 2026.3.23 (OpenClaw): A day-later release adds more provider surface area (DeepSeek plugin and Qwen pay-as-you-go) plus operational tweaks like OpenRouter auto-pricing and an “Anthropic thinking order,” with additional Chrome MCP waits and chat-integration fixes called out in the release highlight list and expanded in the upstream Release notes.

This is mostly “keep the harness stable while providers multiply”: pricing, ordering, and browser-state coordination becoming baked-in instead of tribal knowledge.

OpenClaw🦞

@openclaw

OpenClaw 2026.3.23 🦞 🧪 DeepSeek provider plugin ☁️ Qwen pay-as-you-go ♻️ OpenRouter auto pricing + Anthropic thinking order 🖥️ Chrome MCP waits for tabs 🔧 Discord/Slack/Matrix + Web UI fixes Upgrade before your agent does it for you. github.com/openclaw/openc…

4:05 AM · Mar 24, 2026

817

Read 90 replies

OpenClaw 2026.3.22-beta.1 prefers ClawHub installs and tightens sandbox defenses

OpenClaw 2026.3.22-beta.1 (OpenClaw): The beta introduces breaking changes aimed at safer, more deterministic ops—plugin installation now prefers ClawHub over npm, Chrome MCP configuration migrates (including removal of a relay path), and the plugin SDK surface was reorganized, as detailed in the Beta release notes shared via beta announcement.

• Supply-chain and sandbox hardening: The notes call out sandbox restrictions that block JVM, glibc, and .NET hijacking attempts, plus multiple migrations that force operators to run explicit “doctor/fix” style steps rather than silently carrying legacy behavior, as described in the Beta release notes.

New @openclaw beta bits are up! github.com/openclaw/openc…

10:01 AM · Mar 23, 2026

1.1K

Read 103 replies

OpenClaw maintainer disputes acquisition rumor and emphasizes foundation ownership

OpenClaw governance (OpenClaw Foundation): After a claim that OpenAI bought OpenClaw circulated, the maintainer explicitly says OpenClaw is owned by an independent foundation and that OpenAI did not buy the project, as stated in ownership correction in response to the earlier acquisition assertion in acquisition claim.

For teams adopting OpenClaw in production, this is a practical signal about stewardship and incentives: who can change direction, and what “model-agnostic” support means long-term.

Replying to @AlexFinn

Alex, don't spread misinformation. OpenAI did not buy the project. OpenClaw is owned by the independent OpenClaw foundation. (which I lead)

2:25 AM · Mar 24, 2026

1.6K

Read 74 replies

“Token session refund” request highlights QA expectations for agent workflows

OpenClaw operator expectations: A user requested a refund for an 8+ hour session after repeated factual and calculation errors in sensitive financial documents, per the maintainer’s anecdote in refund request screenshot.

This is a blunt reminder that token-billed, long-running agent sessions are getting evaluated like professional services—especially when the agent touches spreadsheets or board-facing docs.

This guy emailed me asking for a *token session refund* because his claw made mistakes. 🙃

10:12 AM · Mar 23, 2026

6.2K

Read 912 replies

GitHub Actions limits push OpenClaw toward release-pipeline automation and sponsorships

OpenClaw release engineering: While automating releases, the maintainer hit limits on GitHub’s free tier and reports going from “asking” to “yes, we sponsor you” in ~5 minutes, as described in sponsorship turnaround.

The same thread of work shows up again in a later note about automating the pipeline to reduce human mistakes, with e2e tests mentioned as part of the fix-forward posture in release step miss.

Working on automating our whole release pipeline (gotta protect myself from mistakes) and ran into some limits of GitHub's free tier. From asking to "yes ofc we sponsor you": 5 min. Kudos, @github !

11:01 PM · Mar 23, 2026

1.7K

Read 48 replies

OpenClaw web control UI shipped broken due to missed build step; beta fixes it

OpenClaw web control UI (OpenClaw): A release went out with a missing build step for web control UI assets, causing the control UI not to load correctly; the maintainer says users can update to beta for the fix or wait for an updated release, as described in regression explanation.

The error message shown in the field (“Control UI assets not found… build with pnpm ui:build”) matches the symptom captured in control UI error screenshot.

I missed a release step last night with the web control UI assets, current release doesn't load that correctly, you can update to beta where it's fixed, or wait for the updated release later. Just working on automating the whole release pipeline, and adding e2e tests for web.

11:09 PM · Mar 23, 2026

690

Read 72 replies

Apple notarization remains the macOS release automation bottleneck for OpenClaw

macOS release automation (OpenClaw): The maintainer says the hardest part of automating releases is the macOS build and Apple’s notarization process, per notarization bottleneck note.

This keeps showing up as the “last manual step” for teams shipping agent harnesses on macOS, even when everything else in CI is scripted.

Replying to @steipete

what's the hardest part of automating? The macOS release + Apple's notarization process.

11:11 PM · Mar 23, 2026

OpenClaw plugin connects the Codex app server into OpenClaw’s toolchain

OpenClaw plugin integrations: The maintainer highlights work that connects a Codex app server to OpenClaw “via plugins,” positioning it as a practical proof that OpenClaw’s plugin surface can bridge between agent runtimes, as noted in integration shoutout.

This is the kind of integration that matters operationally: instead of picking one agent harness, teams can wire them together behind a consistent plugin boundary.

Harold did some really great work connecting codex app server with openclaw (the power of plugins!)

Harold Hunt

@huntharo

@openclaw Codex App Server - Your bridge to using Codex in OpenClaw youtu.be/GKkipfNEJJQ?si…

11:58 PM · Mar 23, 2026

443

Read 11 replies

🧪 Beyond codegen: production debugging, auto-fixing, and review discipline

Today’s code-quality thread focuses on the “after code is written” layer: predicting/triaging production issues, autonomous fixes, and how reviewers keep agent output maintainable. Includes PlayerZero and self-healing codebase claims.

PlayerZero ships an “AI production engineer” built on a system-wide world model

PlayerZero: A new “AI production engineer” product is positioning itself as the layer after codegen—connecting code, observability, incidents, and tickets into a single graph (“world model”) that can predict what a PR will break and then trace production issues back to a specific change, generate a fix, and route it to the right engineer (often via Slack approval), as described in the world model pitch and the launch claims.

The headline metrics being repeated across threads are 64% confirmation rate (flagged issues that later became real production tickets) vs 16.3% for Cursor BugBot and 11% for Claude Code, as shown in the benchmark screenshot.

• Pre-ship simulation: The “Sim-1” component is framed as running production-like simulations before merge—using historical incidents, configs, and real usage patterns—to flag breakage without teams writing bespoke tests, per the product description and the launch recap.
• Post-ship triage: The company claims 92.6% accuracy on “real production test cases” (with recall and precision splits quoted in threads) and a <2 hour root-cause path when observability is partial, in the metrics thread.

Treat the numbers as self-reported until a public eval artifact exists, but the product shape is clear: “context graph + simulation + routing” as the new battleground beyond code generation.

Chubby♨️

@kimmonismus

Agentic coding just got a major upgrade nobody is talking about. There's now a context graph that sits on top of your entire production system. Your codebase, your incidents, your customer tickets, all connected into one living model. When a PR opens, it already knows what's Show more

Animesh Koratana

@akoratana

Introducing: PlayerZero The world's first Engineering World Model that puts debugging, fixing, and testing your code on autopilot. We've raised $20M from Foundation Capital, @matei_zaharia (Databricks), @pbailis (Workday), @rauchg (Vercel), @zoink (Figma), @drewhouston

5:03 PM · Mar 23, 2026

221

Read 15 replies

A Codex-assisted PR review loop that gates on clarity and often rewrites for maintainability

PR review discipline: Peter Steinberger describes a repeatable review flow where Codex helps find issues, but the reviewer still enforces two human gates—“is the issue clear?” and “is this the best possible fix?”—and he says “95% of the time” the best fix requires continued discussion and usually rewriting the PR, per the review workflow post.

The key operational point is that review quality is being treated as a maintainability control, not a correctness check; the “rewrite the PR” step is positioned as the default outcome when contributors submit localized patches that would accumulate project debt.

Pretty much every PR I review: 0) review <URL> [codex does it's thing and finds issues] 1) is the issue clear? [if not, trash PR] 2) is this the best possible fix? [95% of the time no] 3) continue discussion, consider tradeoffs, usually rewrite PR Most folks send too localized, Show more

1:23 AM · Mar 24, 2026

575

Read 52 replies

Ramp describes an agent-run loop that triages alerts and pushes fixes via 1,000 monitors

RampLabs: Ramp describes a system where an agent instruments every pull request, triages alerts, and autonomously pushes fixes; they claim it’s backed by “a thousand AI-generated monitors, one for every 75 lines of code,” in the self-maintaining codebase post.

This is notable because it’s not “agent writes code,” it’s “agent owns the pager-adjacent loop”—monitoring + diagnosis + PR creation as a continuous process.

The tweet doesn’t specify what “monitor” means (runtime checks vs CI assertions vs log-derived detectors) or what the approval gates look like, so the operational safety model is still unclear from today’s data.

Ramp Labs

@RampLabs

We built a codebase that maintains itself. An agent instruments every pull request, triages alerts, and pushes fixes autonomously. The system runs on a thousand AI-generated monitors, one for every 75 lines of code.

Ramp Labs

@RampLabs

x.com/i/article/2036…

7:54 PM · Mar 23, 2026

836

Wharton study: humans follow AI even when it’s wrong, weakening “review as a safeguard”

Human review reliability: A Wharton study recap is circulating under the label “cognitive surrender,” arguing that the common safety pattern “AI writes, humans review” is not dependable—people accept AI answers at high rates even when incorrect, and their confidence rises too, per the study summary thread.

The post cites three preregistered studies (1,372 participants; 9,593 trials) and reports that when AI was wrong, people still followed it 79.8% of the time; access to AI increased confidence even on wrong answers.

In an engineering context, this is a direct challenge to review-only controls for agentic changes and incident writeups: the failure mode is not just “review misses issues,” it’s “reviewers stop believing they need to verify.”

Rohan Paul

@rohanpaul_ai

Wharton’s latest AI study points to a hard truth: “AI writes, humans review” model is breaking down Why "just review the AI output" doesn't work anymore, our brains literally give up. We have started doing "Cognitive Surrender" to AI - Wharton’s latest AI study points to a hard Show more

5:35 PM · Mar 23, 2026

1.7K

Read 93 replies

Turn every agent bugfix into a regression-test ratchet

Workflow pattern: A practical prompt pattern is circulating for agentic coding—after an agent fixes a bug and verifies it, explicitly require it to write “extremely in-depth e2e integration tests” that would have caught the bug and similar variants, as described in the life hack post.

The claimed benefit isn’t only coverage; it’s forcing the agent to enumerate the failure surface, which can expose adjacent issues immediately once the test harness exists.

📏 Evals & scoreboards: FrontierMath breakthrough, coding leaderboards, and long-horizon benchmarks

Benchmarks were busy: a FrontierMath open problem solved with GPT‑5.4 Pro plus multiple coding/agent leaderboard snapshots and new long-horizon evaluation setups. This category is strictly measurement (not training methods).

GPT-5.4 Pro produces first FrontierMath Open Problems solution (publishable write-up planned)

FrontierMath: Open Problems (Epoch AI): Epoch AI says GPT-5.4 Pro was used to elicit a solution to one of the benchmark’s research problems—the first marked solved so far—as announced in the FrontierMath solve thread; the problem is a conjecture contributed by Will Brian (from a 2019 paper), and Brian plans to write up the result for publication, per the Problem provenance and Write-up plan.

• Elicitation credit and publication path: Epoch credits the first elicitation to specific users and notes they can be coauthors with Brian on any resulting paper, as stated in the Coauthor option.
• Replicability and “not just one model” signal: Epoch reports they replicated the elicitation in their internal scaffold, and that Gemini 3.1 Pro, GPT-5.4 (xhigh), and Opus 4.6 (max) can solve the problem at least some of the time, per the Scaffold replication note.

The most useful artifact to bookmark is the detailed problem page, which includes transcripts and variants, as linked in the Problem page.

Epoch AI

@EpochAIResearch

AI has solved one of the problems in FrontierMath: Open Problems, our benchmark of real research problems that mathematicians have tried and failed to solve. See thread for more.

4:14 PM · Mar 23, 2026

1.0K

Read 14 replies

EvoClaw benchmark quantifies how coding agents degrade on continuous software evolution

EvoClaw (OpenHands): OpenHands introduced EvoClaw, a benchmark for continuous software evolution using milestone DAGs reconstructed from real repo history—aimed at measuring whether agents can keep a codebase healthy over many dependent steps, not just solve isolated tickets, per the Benchmark announcement and the Milestone DAG rationale.

• Headline results: The project reports a big drop from isolated-task performance (“can exceed 80%”) to a best overall score of 38.03% with Claude Opus 4.6 + OpenHands, while the highest resolve rate shown is 13.37% for Gemini 3 Pro + Gemini CLI, as stated in the Topline numbers.
• Failure mode they call out: They claim recall keeps improving while precision saturates early—regressions and technical debt accumulate faster than the agent repairs them, per the Recall vs precision note.

The benchmark materials are published in the Benchmark blog and the ArXiv paper, with ongoing results tracked on the Leaderboard.

OpenHands

@OpenHandsDev

Agents are getting good at isolated coding tasks. The harder question is whether they can keep a real codebase healthy as it evolves. EvoClaw is a new benchmark for continuous software evolution, built with OpenHands to measure exactly that.

2:30 PM · Mar 23, 2026

Read 2 replies

BullshitBench v2 ranks Claude models highest; Grok 4.20 beats GPT-5.4 on pushback rate

BullshitBench v2 (prompt-robustness eval): A shared chart claims Claude Sonnet 4.6 (High) leads with 91% clear pushback, while Grok 4.20 Multi-Agent (Low) sits at 67% clear pushback versus GPT-5.4 at 48%, as shown in the BullshitBench v2 chart.

The framing in the thread is that higher pushback matters for agentic coding trustworthiness (less “accepted nonsense”), but the post doesn’t include the prompt set details—so interpret the numbers as a comparative signal, not a full safety characterization, per the BullshitBench v2 chart.

BridgeMind

@bridgemindai

Grok 4.20 Beta is the hardest model to fool on BullshitBench v2 outside of Anthropic. 67% clear pushback rate. Only 19% accepted nonsense. GPT 5.4? 48% pushback. 16% accepted nonsense. Gemini 3.1 Pro? 31% pushback. 46% accepted nonsense. Almost half the time. The top 5 is all Show more

4:14 PM · Mar 23, 2026

Read 7 replies

DesignArena Code Categories shows a 10-point gap between Opus 4.6 and GPT-5.4 (Medium)

DesignArena “Code Categories” (Arcada Labs): A circulating chart puts Claude Opus 4.6 at 66.8 and highlights GPT-5.4 (Medium) at 56.7, framing it as “can code, can’t design,” as shown in the Code categories chart.

• Adjacent signal on the same Arena ecosystem: Another DesignArena-style chart places MiniMax M2.7 around the middle of the pack with an Elo of 1289 (called out as #12 overall in that snapshot), as shown in the MiniMax M2.7 ranking.

Treat this as a point-in-time scoreboard rather than a stable ordering; the underlying question it surfaces is whether “frontend taste” needs different training/harnessing than general coding ability.

BridgeMind

@bridgemindai

GPT 5.4 scores 56.7 on DesignArena Code. Below MiniMax M2.5. Below Kimi K2.5. Below GLM 5. Claude Opus 4.6 sits at #1 with 66.8. That's a 10 point gap. GPT 5.4 can code. It can't design. This has been true since GPT 4 and it still hasn't been fixed. If you're using GPT Show more

11:02 AM · Mar 23, 2026

SWE-rebench leaderboard snapshot puts Claude Opus 4.6 on top at 65.3%

SWE-rebench (coding benchmark): A shared leaderboard chart shows Claude Opus 4.6 leading with 65.3%, with a tight pack behind it including gpt-5.2-2025-12-11-medium at 64.4% and GLM-5 / Gemini 3.1 Pro Preview at 62.8%, as shown in the SWE-rebench chart.

The same snapshot also places several “agent harness” entries close together (e.g., Claude Code 58.4%, Codex 58.3%), which makes this chart as much about end-to-end tooling as raw model capability, per the SWE-rebench chart.

Lisan al Gaib

@scaling01

Opus 4.6 on top of SWE-rebench

Ibragim

@ibragim_bad

🚨 SWE-rebench update! SWE-rebench is a live benchmark with fresh SWE tasks (issue+PR) from GitHub every month. updates: > we removed demonstrations and the 80-step limit (modern models can now handle huge contexts without getting trapped in loops!). > we added auxiliary

8:02 PM · Mar 23, 2026

127

📚 Research notes that matter to builders: AI-for-science, math limits, and measurement pitfalls

Research signal today is dominated by Anthropic’s Science Blog launch and practitioner-facing writeups on how AI accelerates scientific work, plus broader commentary on math reliability limits. Excludes training recipes and product launches.

Claude Opus 4.5 as an “AI grad student” for a two-week theoretical physics derivation

Vibe physics (Anthropic): A Harvard physicist reports running a long, supervised workflow with Claude Opus 4.5 through a graduate-level calculation, positioning the model as “roughly the level of a second-year grad student” and emphasizing speedups without claiming autonomous originality, per the Theoretical physics post. It’s a concrete pattern for long-horizon, correctness-sensitive work.

• Scale and cadence: The writeup describes an extended iteration loop (many drafts; large token budget; non-trivial local compute) and argues that “AI can’t yet do original work autonomously, but it can vastly accelerate it,” as summarized in the Physics workflow writeup. This is a supervision-heavy approach.

• Builder takeaway: The post reads like a template for “single agent, many revisions” work where you need traceable intermediate artifacts. That’s a different shape than multi-agent parallelization.

Anthropic

@AnthropicAI

Replying to @AnthropicAI

We’re launching with two new posts. Can AI do theoretical physics? Harvard physicist Matthew Schwartz led Claude Opus 4.5 through a graduate-level calculation. AI can’t yet do original work autonomously, but it can vastly accelerate it. Read more: anthropic.com/research/vibe-…

8:31 PM · Mar 23, 2026

617

Read 25 replies

Anthropic launches a Science Blog focused on AI-accelerated scientific work

Anthropic Science Blog (Anthropic): Anthropic launched a dedicated Science Blog to publish research stories and practitioner workflows for using Claude in real scientific work, as announced in the Science blog launch. This is a public “how it’s used” channel, not a model release.

• What to expect: The launch post frames the blog as a mix of research writeups and field notes about how scientists are using AI, with the intro laid out in the Science blog intro. That makes it a new, citable reference for teams trying to justify (or audit) AI-in-research workflows.

The initial launch also points to two concrete posts; those show up as separate items below.

Anthropic

@AnthropicAI

Introducing the Anthropic Science Blog. Increasing the pace of scientific progress is a core part of Anthropic’s mission. The Science Blog will feature new research and stories of how scientists are using AI to accelerate their work. Read the intro: anthropic.com/research/intro…

8:31 PM · Mar 23, 2026

3.2K

Read 122 replies

Single-agent, sequential Claude setup for scientific computing where mistakes compound

Long-running Claude for scientific computing (Anthropic): Anthropic published a workflow note arguing that some scientific computing tasks are better handled by a single agent working sequentially (optionally spawning subagents) because small errors compound across tightly coupled steps, as described in the Long-running agent post. This is about orchestration shape, not model capability.

• Why sequential beats parallel here: The post explicitly contrasts “many agents in parallel” with “one agent, many steps,” using a cosmology/scientific-computing example and emphasizing debug loops, careful validation, and integration with existing compute environments, as detailed in the Scientific computing post. Short version: parallelism can add coordination overhead and multiply subtle mistakes.

This is one of the clearer public statements that agent topology should be task-dependent.

Anthropic

@AnthropicAI

Replying to @AnthropicAI

Models keep improving on long-horizon tasks, but splitting work across many agents doesn’t suit every problem. We walk through the setup for a single agent working sequentially on a task where mistakes compound: modeling the early universe. Read more: anthropic.com/research/long-…

8:31 PM · Mar 23, 2026

252

Read 13 replies

Creativity study: LLMs beat humans on originality ratings; prompting boosts humans more

Serendipity by Design (arXiv): A new study reports that LLMs were rated as generating more original product-development ideas than human participants on Prolific, while a “cross-domain mapping” intervention boosted humans more reliably than it boosted LLMs, per the paper screenshots shared in Creativity paper screenshots. This is evidence about ideation quality, not code generation.

• Prompting implication: The intervention (forcing analogies from distant domains) appears to change human outputs more than model outputs on average, while semantic distance still matters for both, as summarized in the Creativity paper screenshots.

Treat this as a measurement datapoint: “creativity interventions” may not transfer 1:1 from humans to models, even when both produce plausibly novel text.

Ethan Mollick

@emollick

Interesting finding in this paper showing that, for product development ideas, AIs consistently rank above humans (well, humans on Prolific) & larger and more recent models are more creative than previous ones. (It also tries a creativity intervention that doesn’t work on LLMs)

3:36 PM · Mar 23, 2026

178

Read 22 replies

Terence Tao: math “wins” are real, but the broad hit rate stays around 1–2%

Math reliability reality check: Following up on Tao clip (limits on open problems), Dwarkesh relays Tao’s claim that AI has solved “50 Erdős problems” recently while overall success on broader problem sweeps remains around 1–2%, with labs tending to publicize the wins, as said in the Tao on hit rates. That’s a reminder to treat math breakthrough anecdotes as a selection-biased sample.

• Where models still lag: The same thread highlights a practical failure mode—models apply standard techniques well but don’t reliably iterate on partial progress across sessions (a continual-learning-shaped gap), as described in the Tao on hit rates.

The signal for builders is about evaluation hygiene: a few spectacular solves don’t translate to dependable “research autopilot” behavior yet.

Dwarkesh Patel

@dwarkesh_sp

AI has solved 50 Erdős problems in the last year. But on a wider sweep of problems, the models’ success rate is only about 1-2%: labs have just been publishing the wins. This isn’t because AI isn’t useful for mathematicians. Terence Tao thinks the models are currently at the Show more

3:00 PM · Mar 23, 2026

608

Read 40 replies

🎯 RL practice & theory: small-VRAM notebooks, collaborative agents, and unsupervised RLVR limits

Training talk today is practical (run RL at low VRAM) plus theory about where unsupervised/verifiable reward training collapses. This category excludes benchmark results that are purely evaluation snapshots.

Unsloth publishes an 8GB-VRAM RL notebook for Qwen3.5 vision GRPO

Unsloth (Qwen3.5 RL notebook): Unsloth published a free, runnable notebook that trains Qwen3.5-2B with RL locally on ~8GB VRAM, using vision GRPO to teach the model to solve math problems autonomously, as described in the RL notebook announcement.

The practical value for engineers is the “RL-on-a-laptop GPU” shape: you can iterate on reward functions, formatting constraints, and anti-cheating checks without waiting for a cluster—Unsloth calls out reward-function scaffolding plus guardrails against code-execution reward hacking in the RL notebook announcement. For setup details, Unsloth points to its RL guide and a ready-to-run Colab notebook, which should make it easier to replicate the exact environment and training loop.

Unsloth AI

@UnslothAI

You can now train Qwen3.5 with RL in our free notebook! You just need 8GB VRAM to RL Qwen3.5-2B locally! Qwen3.5 will learn to solve math problems autonomously via vision GRPO. RL Guide: unsloth.ai/docs/get-start… GitHub: github.com/unslothai/unsl… Qwen3-4B: colab.research.google.com/github/unsloth…

1:21 PM · Mar 23, 2026

2.0K

Read 25 replies

Miles adds ROCm support for RL post-training on AMD Instinct clusters

Miles (Radix / LMSYS): Miles added ROCm support for large-scale RL post-training on AMD Instinct MI300/MI350-class GPUs; rollout generation is framed as the main compute sink, with reported MI300X throughput of ~1.1–1.3k tok/GPU/s and a mean step time of 388.5s on one 8-GPU node, per the ROCm support thread.

• Measured training delta: LMSYS reports AIME accuracy improving from 0.665 → 0.729 while training Qwen3-30B-A3B with GRPO, as stated in the ROCm support thread.
• Repro path: the implementation and run guidance are expanded in the ROCm blog post, and the open-source codebase is linked via the Miles repo.

This is a concrete “non-CUDA RL” datapoint; the tweet frames it as validated end-to-end for multi-turn agentic training, but the underlying benchmark/eval harness details aren’t in the tweet itself.

LMSYS Org

@lmsysorg

🚀 New blog: ROCm Support for Miles: Large-Scale RL Post-Training on AMD Instinct™ GPUs Together with @AMD, Miles brings end-to-end RL pipelines to MI300/350-class clusters: ⚡️ Rollout generation dominates RL compute, and AMD’s HBM bandwidth directly addresses this bottleneck Show more

6:25 PM · Mar 23, 2026

Read 2 replies

Unsupervised RLVR study claims intrinsic rewards ‘sharpen’ then collapse

Unsupervised RLVR (OpenBMB / TsinghuaNLP): A new study argues that intrinsic-reward “unsupervised RLVR” often creates a “sharpening” illusion—models converge toward a deterministic policy and then hit a reward-hacking / collapse phase; the thread introduces “Model Collapse Steps” (steps until reward accuracy drops below 1%) as a predictor of RL trainability, as summarized in the paper overview.

• Core claim: intrinsic rewards (confidence/consistency-style) don’t add knowledge; they amplify existing preferences until they break, according to the paper overview.
• Operational hook: the paper positions “collapse steps” as a quick-to-measure scalar that correlates with downstream gains (the thread cites AIME24 comparisons), which could be useful when you need a budget-friendly “should we even try RL here?” filter.

The same thread says extrinsic/self-verification-style rewards look more promising than intrinsic ones, but it doesn’t pin down a single best recipe in the tweet.

OpenBMB

@OpenBMB

When #LLMs surpass human experts in math and coding, who provides the ground truth? How can we train models when human supervision becomes the bottleneck? 🤔 Today, we dive into Unsupervised RLVR—a comprehensive new study by @TsinghuaNLP (an #OpenBMB member)alongside researchers Show more

2:33 PM · Mar 23, 2026

🏃 Self-hosting & on-device inference: streaming MoE weights, iPhone runs, and weekend assistants

Systems posts focus on running bigger models on smaller hardware via weight streaming and on-device assistants. This category is about runtime tricks and deployment patterns, not model announcements.

Qwen3.5-397B-A17B gets name-checked as “running on iPhone” via MoE streaming

Extreme on-device MoE (Qwen3.5-397B-A17B): Following the SSD-per-token MoE streaming idea, a follow-on claim says Qwen3.5-397B-A17B (397B total parameters) is being run on an iPhone using that streaming approach, as stated in 397B iPhone claim. This is a claim, not a benchmark.

• What’s actually new vs “35B on phone”: The implied step-function here is the move from “fits if you compress hard” to “doesn’t fit; stream what’s active,” building directly on the technique described in Technique spread note.

The tweets don’t include reproducible details (repo/config/latency breakdown). That’s the missing piece.

Simon Willison

@simonw

Replying to @simonw

Here's Qwen3.5-397B-A17B- a 397B model - using the streaming MoE weights trick to run on an iPhone!

Anemll

@anemll

Running 400B model on iPhone! 0.6 t/s Credit @danveloper @alexintosh @danpacary @anemll

4:16 AM · Mar 24, 2026

Read 7 replies

Streaming MoE weights from SSD makes oversized MoE inference practical on Mac hardware

MoE weight streaming (local inference): A runtime pattern is circulating where an MoE model runs without keeping all experts in RAM by streaming just the active expert weights from SSD per generated token—effectively trading disk bandwidth for memory, as described in MoE streaming note. It’s a deployment trick. It changes what “fits” locally.

• Why people care: The same thread points at Kimi 2.5 as an intuition anchor—~1T total params but ~32B active—so it “fits” within 96GB because only the active slice matters at inference time, per MoE streaming note.
• Momentum signal: The technique is being framed as a fast-moving community exploration (“more people join in”), with attribution that Dan Woods helped kick off the current surge, as noted in Technique spread note.

Simon Willison

@simonw

Turns out you can run enormous Mixture-of-Experts on Mac hardware without fitting the whole model in RAM by streaming a subset of expert weights from SSD for each generated token - and people keep finding ways to run bigger models Kimi 2.5 is 1T, but only 32B active so fits 96GB

seikixtc

@seikixtc

I got a 1T-parameter model running locally on my MacBook Pro. LLM: Kimi K2.5 1,026,408,232,448 params (~1.026T) Hardware: M2 Max MacBook Pro (2023) w/ 96GB unified memory Running on MLX with a flash-style SSD streaming path + local patching. This is an experimental setup and

Log showing Kimi K2.5 generating 'A GPU processes graphics and parallel tasks.' with flash streaming active and 28 GB Metal usage.

4:08 AM · Mar 24, 2026

590

Read 23 replies

Qwen3.5 35B is reported running fully on-device on an iPhone at 5.6 tok/s

On-device iPhone inference (Qwen3.5): A field report claims Qwen3.5 35B runs fully on an iPhone at ~5.6 tokens/sec, using 4-bit weights and a Mixture-of-Experts setup (noted as “256 experts”), per iPhone run report. No cloud required.

The key engineering implication is that “phone-class” hardware is now being used as a serious inference target for medium/large MoE models, assuming aggressive quantization and careful memory/IO handling (the post also cites a ~19.5GB model size in the same report, per iPhone run report).

alexintosh

@Alexintosh

I just ran Qwen3.5 35B on my iPhone at 5.6 tok/sec. Fully on-device. 4bit | 256 experts. Model: 19.5GB. iPhone: 12GB RAM. wild.

4:02 PM · Mar 21, 2026

2.2K

Read 86 replies

A weekend-built, fully on-device “Siri replacement” stack uses Whisper + Qwen 14B

On-device assistant stack (DIY): A builder demoed a “Siri has been broken so I built my own” assistant that runs offline and controls a Mac (reminders, live data fetch, general Q&A), claiming it was built in a weekend, per On-device assistant demo. It’s an app-shaped reference design.

The stack is called out explicitly—Whisper for speech recognition, Qwen 14B as the LLM, and Kokoro for voice—per Stack details. No internet needed.

Deedy

@deedydas

Siri has been broken for 13 years so I built my own. Completely on-device. No internet needed. Controls my Mac, sets reminders, fetches live data, answers questions. Built in a weekend. This is the future of software.

4:14 AM · Mar 24, 2026

204

🧾 Agent retrieval stack: code web search evals + fast PDF parsing from URLs/streams

Retrieval-related posts are about improving agent grounding via better web search evals and high-throughput document parsing pipelines. This excludes Cursor’s in-editor grep index (covered under Cursor).

Exa publishes WebCode evals to measure coding-agent web search quality vs latency

WebCode (Exa): Exa published a write-up and open evaluation set for coding-related web search, aiming to quantify retrieval quality for agents using two axes—groundedness and correctness—against latency, as shown in the latency vs score plot.

The post frames the practical failure mode as “stale/noisy retrieval poisons long-running agents,” and positions a dedicated ingestion + evaluation pipeline for fast-changing artifacts (docs, changelogs, issues), with the details in the Blog and open evals.

Exa

@ExaAILabs

Exa powers most of the popular coding agents. We wrote about how we build and evaluate coding-related web search. Blog and open evals: exa.ai/blog/webcode

5:43 PM · Mar 23, 2026

121

Google shows an agentic pipeline for parsing dense financial PDFs with LlamaParse + Gemini 3.1 Pro

Financial PDF parsing workflow (Googledevs + LlamaParse + Gemini): Google’s developer blog walks through a multi-step agentic pattern for brokerage-style PDFs—parse with LlamaParse, extract text/tables, then synthesize a human-readable summary with Gemini 3.1 Pro—as described in the workflow outline, with implementation details in the Google blog post.

The same post calls out measured parsing accuracy gains ("~13–15%" improvement) and treats layout-heavy tables/charts as the core reason a single-pass OCR pipeline tends to break.

Jerry Liu

@jerryjliu0

We're excited to collaborate with @googledevs on building an agentic workflow over complex financial documents - using LlamaParse and Gemini 3.1 Pro Brokerage statements have complex layouts, dense tables, and oftentimes visual elements like charts. Our multi-step agentic Show more

Google for Developers

@googledevs

Improve document parsing accuracy by 15% for financial PDFs. Use LlamaParse and Gemini 3.1 Pro to extract high-quality data from unstructured brokerage statements and complex tables. 📈 Precise reasoning 📂 Structured PDF data ⚡️ Event-driven scaling Dive into the code on

6:58 PM · Mar 23, 2026

147

Read 7 replies

LiteParse adds URL and stream parsing for PDFs via stdin

LiteParse (LlamaIndex): LiteParse added URL/stream-first parsing so agents can read remote PDFs through pipes (for example, curl -sL … | lit parse -) without relying on a VLM, per the CLI example and guide screenshot.

The update emphasizes agent-friendliness—stdin buffers/streams plus screenshotting support—so document ingestion can run in cheap, fast, containerized steps instead of “upload then reason” workflows.

Jerry Liu

@jerryjliu0

Let your AI agent read any PDF on the internet in seconds 🌐⚡️ ``` curl -sL example.com/report.pdf | lit parse - ``` LiteParse is our fast and free document parser designed to seamlessly plug into 40+ different agents. Includes both text parsing and screenshotting Show more

8:00 PM · Mar 23, 2026

119

Read 6 replies

🛠️ Developer tools shipping: terminal dashboards, emulators, editors, and storage primitives

A grab bag of concrete dev tools: terminal-rendered dashboards, local emulation for integration testing, editor improvements, and agent-oriented storage primitives. This excludes MCP/connectors (covered separately).

Emulate adds a programmatic API for creating and resetting local service emulators

emulate (Vercel Labs): A programmatic API is now called out for emulate, aimed at automated test suites and local emulators—create an emulator (Vercel/GitHub/Google), set env vars or pass to an SDK, then reset and close deterministically, per the programmatic API announcement. The underlying project is linked in the GitHub repo.

This frames emulate as a test harness primitive for agent runs in no-network environments (or flaky-integration environments), where you want repeatable state resets rather than mocked responses.

Chris Tate

@ctatedev

New: Programmatic API for 𝚎𝚖𝚞𝚕𝚊𝚝𝚎 Automated test suites + local emulators: 1. Create emulator (Vercel, GitHub, Google) 2. Set env var (and/or pass to SDK) 3. Reset data and close when appropriate

10:19 PM · Mar 23, 2026

121

Read 4 replies

Hugging Face pushes “Buckets as S3 for agents” with a CLI-first private store

Hugging Face Buckets (Hugging Face): Hugging Face is explicitly pitching hf buckets as “the S3 for agents,” with a CLI that can create private blob stores and address them with hf:// handles, as shown in the CLI snippet.

The immediate engineering implication is a potential default storage primitive for agent runs that need durable artifacts (datasets, logs, build outputs) without wiring up cloud credentials for S3/GCS in every environment.

clem 🤗

@ClementDelangue

We want to make @huggingface buckets the S3 for agents! Let us know if you can use something like that and happy to work with you on your usecase

3:30 PM · Mar 23, 2026

115

Read 19 replies

Vercel Labs ships json-render + Ink pattern for live, streaming terminal dashboards

json-render (Vercel Labs): A “Generative TUI” workflow is circulating that turns prompts into live terminal dashboards using json-render plus Ink, positioning JSON-as-UI as a reusable rendering layer for agents and CLIs, as described in the Generative TUI announcement. The implementation and component catalog live in the GitHub repo, with the install flow shown as npx skills add vercel-labs/json-render --skill ink in the Generative TUI announcement.

This is a concrete pattern for agent UIs that don’t require a browser—useful when the agent is already operating in a terminal-first loop or when you want deterministic, copy-pastable outputs rather than web app state.

Chris Tate

@ctatedev

Introducing Generative TUI Ask anything - get polished dashboards with real data, rendered live in your terminal. 27 components. Streaming. json-render + Ink. npx skills add vercel-labs/json-render --skill ink

6:35 PM · Mar 23, 2026

1.8K

Read 79 replies

Lovable ships Security Checker 2.0 with modular scans gated on changes

Security Checker 2.0 (Lovable): Lovable says it now runs four automated security scans before projects are published—RLS analysis, database security checks, code security review, and dependency auditing—and only re-runs modules when relevant diffs change, per the scanner announcement.

This is an example of “incremental security scanning” being treated as a first-class part of AI-assisted app generation workflows, with modularity positioned as the path to keeping checks current as new threat patterns show up.

Lovable

@Lovable

Lovable runs 4 automated security scanners on every project before they’re even published: • RLS analysis: checks your database access policies • Database security check: reviews schema and configuration • Code security review: analyzes generated code for vulnerabilities • Show more

4:00 PM · Mar 23, 2026

246

Read 21 replies

Zed lands “editor: align selections” as a stable text-manipulation command

Zed (Zed Industries): Zed is shipping a new stable command, editor: align selections, for multi-line alignment edits, as shown in the command demo.

For teams using Zed in agent-heavy workflows, this is a small but concrete speedup on repetitive formatting and refactor cleanup steps (the kind of edits agents often request humans to review or apply).

Zed

@zeddotdev

A handy new text manipulation command is landing on stable this Zednesday thanks to tiagolobao: `editor: align selections`

5:37 PM · Mar 23, 2026

280

🔌 MCP & interoperability: WeChat bots, agent “cloud computers,” and cross-agent review hooks

Interoperability news centers on MCP servers/clients and agent execution surfaces (cloud desktops) plus patterns for chaining agents together (reviewer hooks).

Agent Computer spins up cloud desktops for agents in under 0.5 seconds

Agent Computer: A new “cloud computer” execution surface for agents is live, promising full Ubuntu machines in <0.5s, with persistent disks, shared/swappable credentials, and SSH access—positioned as a way to run Claude/Codex-style agents in an isolated sandbox instead of on your laptop, as described in the Launch post and the Product page.

• What’s actually new for builders: provisioning speed + persistence means agent runs can span sessions without re-installing dependencies; SSH makes it fit existing dev workflows (CI repro, debugging, dotfiles) rather than a browser-only VNC toy, per the Product page.
• Interop angle: the pitch is “bring your existing agent subscriptions” and run multiple agents in parallel inside consistent environments, which is often the missing glue when teams mix local IDE agents with background automation, as stated in the Launch post.

Advait Paliwal

@advaitpaliwal

Introducing Agent Computer Cloud computers for AI agents in <0.5s with persistent disk, shared credentials, and SSH access agentcomputer.ai

4:51 PM · Mar 23, 2026

853

Read 79 replies

Tencent’s compliant WeChat API gets turned into an MCP server for agent bots

WeChat MCP wrapper (Tencent/openclaw-weixin): A community wrapper turns Tencent’s official WeChat messaging API into an MCP server, so any MCP-capable agent harness can send/receive WeChat messages—framed as “scan & go” without reverse engineering or ban risk, according to the WeChat MCP server write-up.

• Why it matters: it turns “WeChat bot” from a one-off Claude Code tunnel hack into a portable integration you can plug into Claude/Cursor/OpenCode-style stacks; the wrapper maps the WeChat (QClaw) API into MCP tools, as explained in the WeChat MCP server write-up.
• Operational detail: the loop described is WeChat message → agent receives → agent replies → response is pushed back with real-time “typing…” UX, which is the shape teams want for customer support/community automation in markets where WeChat is the primary surface, per the WeChat MCP server write-up.

Zhihu Frontier

@ZhihuFrontier

🔥 This might be one of the most fun and practical Agent hacks for WeChat built by Zhihu contributor HowardZhangdqs. So, how to turn ANY AI Agent into a WeChat bot via MCP?👇 🧩 What happened Tencent released @ tencent-weixin/openclaw-weixin → an official, compliant API to Show more

10:13 AM · Mar 23, 2026

OpenRouter adds a hook for Claude to auto-request Codex code reviews

OpenRouter (cross-agent review hook): OpenRouter shared a workflow hook that triggers automatic Codex reviews when Claude asks for them, aiming to make “main agent + reviewer agent” setups easier with one-command auth and centralized observability/cost tracking, as described in the Hook announcement.

• Interoperability payoff: this is explicitly about mixing agent personalities—“maximum neurodivergence” between the writing agent and the reviewer agent—while avoiding duplicated auth + scattered spend across tools, per the Hook announcement.
• Where it fits: teams already doing plan→implement→review loops can formalize “Claude drafts, Codex audits” as a repeatable primitive rather than a manual copy/paste step, matching the integration intent in the Hook announcement.

OpenRouter

@OpenRouter

Set up a hook to request automatic *Codex* reviews when Claude wants them: ✨ Authenticate both agents with one command 🧠 Maximum neurodivergence between your main agent and reviewer agent 📊 Centralized observability and cost management for both Docs to do it yourself: Show more

Alex Atallah

@alexatallah

Just open-sourced a personal project, Redline: Automatic reviews from Codex, without leaving Claude Code. github.com/alexanderatall… A few design choices to be aware of:

5:40 PM · Mar 23, 2026

Read 2 replies

⚡ Compute & energy deals: fusion power talks and demand-response capacity

Infra signals today are about energy as compute bottleneck: OpenAI exploring large-scale power purchase from Helion plus hyperscaler demand-response contracting. This category is the one allowed non-software ‘real world’ beat because it directly gates AI capacity.

OpenAI explores a Helion fusion power deal, with 5GW by 2030 figures circulating

OpenAI + Helion Energy: OpenAI is reported to be in advanced talks to buy electricity from Sam Altman–backed fusion startup Helion, potentially securing an initial 12.5% of Helion’s output, according to the Advanced talks report; separate reporting frames the scale target as 5GW by 2030 and 50GW by 2035, as summarized in the Power output targets post and detailed in the Axios story.

• Scale math in the thread: one recap notes Helion has said each reactor targets 50MW, implying ~800 reactors for 5GW and ~7,200 reactors more for 50GW, as laid out in the Reactor scaling breakdown.

The open question is timing and deliverability—Helion still has to turn prototype milestones into repeatable grid electricity at the volumes implied by the numbers in circulation.

Chubby♨️

@kimmonismus

OpenAI is in advanced talks to buy electricity from Sam Altman-backed fusion startup Helion Energy, according to a person familiar with the situation. OpenAI could secure a guaranteed portion of Helion's production, initially 12.5%, per the source.

3:25 PM · Mar 23, 2026

244

Read 23 replies

Sam Altman leaves the Helion board as OpenAI and Helion discuss working together

Helion governance (OpenAI): Sam Altman says he’s stepping off the Helion board because “as Helion and OpenAI start to explore working together at significant scale,” it’s difficult to sit on both boards—he notes he’ll keep a financial interest and be recused from negotiations, but the move simplifies governance for both companies, as stated in the Altman board statement.

The stepdown lands alongside reporting that OpenAI is discussing a large power purchase arrangement with Helion, including a cited 12.5% initial allocation in the Advanced talks report and scale figures repeated in the Power output targets post.

Sam Altman

@sama

I have loved being on the Helion board; I continue to be extremely excited about a future with abundant energy and Helion in particular. As Helion and OpenAI start to explore working together at significant scale, it is difficult for me to be on both boards. (I will have a Show more

David Kirtley

@Dkirtley

Sam Altman is stepping down from Helion’s Board of Directors. This decision enables Helion and OpenAI to explore future partnerships to bring zero-carbon, safe electricity to the world, which Helion is perfectly poised to deliver. Sam has played an integral role in Helion’s

5:47 PM · Mar 23, 2026