ARC-AGI-3 leaderboard opens at 0.37% – $2M prize fuels backlash

ARC-AGI-3 scores are live on the website - Opus 4.6 (Max) Score: 0.2% Cost: $8.9K - Grok 4.20 (Beta Reasoning) Score: 0.0% Cost: $3.8K - GPT-5.4 (High) Score: 0.3% Cost: $5.2K - Gemini 3.1 Pro (Preview) Score: 0.2% Cost: $2.2K

Lisan al Gaib

@scaling01

ARC-AGI-3 scores for GPT-5.4, Gemini 3.1 Pro and Opus 4.6 Gemini 3.1 Pro: 0.37% GPT-5.4: 0.26% Opus 4.6: 0.25% Grok 4.2: 0%

5:26 PM · Mar 25, 2026

215

Read 11 replies

ARC-AGI-3 human baseline definition becomes a core point of contention

ARC-AGI-3 baseline (human reference): Debate centered on how “human-level” is defined, with criticism that the benchmark uses the second-best first-run human by action count as the baseline reference per environment, and that unsuccessful attempts are treated differently in the reported efficiency framing, as argued in the Baseline excerpt and the Baseline critique. This matters because the chosen baseline strongly shapes the headline “AI <1%” narrative.

One example of the pushback is the claim that “humans score 100%” is a solvability statement rather than an average-human efficiency statement, with further critique of the public messaging in the Scoring criticism thread.

damn i forgor the best part > THE AI STILL SCORES TOO HIGH > "i got an idea boss" > shoot > "how about we just take the best human score?" > i like your thinking > "but that would be sus" > fine, we'll use the second best human score > discard the rest of the scores > REMOVE ALL Show more

Lisan al Gaib

@scaling01

> be me > build "AGI" benchmark > actually version 3 already > we don't talk about 1 and 2 > (they saturated in a year) > invent new scoring method > if human scores above AI, use squared efficiency > example: human took 10 steps to solve level > AI took 100 steps to solve a

6:09 PM · Mar 25, 2026

298

Read 17 replies

ARC-AGI-3 scoring uses squared efficiency and clamps scores at human parity

ARC-AGI-3 scoring: Multiple threads zoomed in on how the benchmark converts action counts into a score—specifically squared efficiency with a hard cap at 1.0, meaning models can match but not exceed the human baseline on any level. The core definition is laid out with formulas in the Scoring equations and is documented in the Technical report.

A key engineering implication is that ARC-AGI-1/2 and ARC-AGI-3 numbers are not directly comparable because ARC-AGI-3’s metric bakes in both task completion and path efficiency, as explained in the Scoring equations.

The Scoring of ARC-AGI-3 doesn't tell you how many levels the models completed but how efficiently they completed them compared to humans actually using squared efficiency meaning if a human took 10 steps to solve it and the model 100 steps then the model gets a score of 1% Show more

5:56 PM · Mar 25, 2026

463

Read 29 replies

ARC-AGI-3 won’t use a harness for official scores, triggering measurement backlash

ARC-AGI-3 evaluation policy: The technical report text circulated showing the official leaderboard decision to report scores without a harness, framing this as “developer-aware generalization” and arguing future AGI systems “will not need task-specific external handholding,” as quoted in the No-harness rationale. That stance drew immediate objections from builders who want harnessed agent scores reported alongside the baseline, as in the Harness objection and the Harness theory.

The disagreement is about what ARC-AGI-3 is measuring: unaided interactive learning versus practical agent systems that depend on tool + UI + memory scaffolding.

this is pretty much worst case performance no harness at all and very simplistic prompt

Lisan al Gaib

@scaling01

ARC-AGI-3 scores for GPT-5.4, Gemini 3.1 Pro and Opus 4.6 Gemini 3.1 Pro: 0.37% GPT-5.4: 0.26% Opus 4.6: 0.25% Grok 4.2: 0%

5:18 PM · Mar 25, 2026

200

Read 16 replies

ARC-AGI-3 constraints and exclusions become part of the debate

ARC-AGI-3 evaluation constraints: Commentary also focused on benchmark constraints beyond the scoring equation—claims about action/step budgets and exclusions of higher-compute “think longer” variants, plus the decision to emphasize a minimal prompt and minimal tooling setup, as debated in the Constraint critique and reinforced by a “worst case performance” reading of the report text in the No-harness rationale.

The practical upshot is that leaderboard deltas can reflect harness and evaluation design choices as much as model capability, which is one reason several people described the current snapshot as hard to interpret at fine granularity.

Lisan al Gaib

@scaling01

The Scoring of ARC-AGI-3 doesn't tell you how many levels the models completed but how efficiently they completed compared to humans actually squared efficiency, whatever that means meaning if a human took 10 steps to solve it and the model 100 steps then the model gets a score

6:01 PM · Mar 25, 2026

403

Read 17 replies

ARC-AGI-3 saturation timing becomes the next argument

ARC-AGI-3 trajectory: Forecasts varied widely on how fast the benchmark will be saturated—one camp calling “four months” in the Four-month estimate, another predicting a long flat period followed by a sharp jump late 2026 or 2027 in the Step-change prediction, and others expecting a quick transition from unsolved to solved in the Fast saturation claim.

The common theme is that ARC-AGI-3 is expected to behave like prior benchmarks: low initial scores, then rapid gains once an effective approach lands.

Matt Shumer

@mattshumer_

1. Incredible. 2. I give it four months before this is ~saturated.

François Chollet

@fchollet

ARC-AGI-3 is out now! We've designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first

10:30 PM · Mar 25, 2026

Data Agent Benchmark (DAB) ships: agents struggle with multi-database workflows

DAB (Data Agent Benchmark): A Berkeley EPIC / PromptQL collaboration released DAB, a benchmark grounded in enterprise “data agent” work: 54 queries, 12 datasets, 9 domains, across 4 DBMS, with the best frontier model reaching 38% pass@1 (averaged across trials), as described in the Benchmark announcement.

Unlike many text-to-SQL tasks, DAB explicitly tests cross-database joins, tool use (query + Python), and messy key reconciliation, which is why it’s being positioned as a gap-filler for agent evaluation, per the Benchmark announcement.

Shreya Shankar

@sh_reya

Databases are arguably the most commonly used enterprise tool, and enterprises typically have many of them. Yet no popular AI agent benchmark actually tests how well agents can query, join, and make sense of data across different databases! So, we built DAB (Data Agent Show more

5:15 PM · Mar 25, 2026

327

Mollick: ARC-AGI-3 looks human-winnable, but tool/harness gaps may dominate early scores

ARC-AGI-3 (practitioner read): Ethan Mollick reports that ARC-AGI-3 is “definitely human winnable,” and frames the open question as whether frontier-model underperformance is primarily harness/vision/tools versus core limitations of LLMs, per the Hands-on take.

This is the near-term engineering question for teams: whether improving the agent stack (UI control, state, exploration heuristics, memory) moves the needle faster than base-model upgrades.

Ethan Mollick

@emollick

ARC-AGI-3 took me a few tries, but it is definitely human winnable. I am curious how much of the very initially very low performance of frontier models is harness, vision, and tools, versus how much are limitations of LLMs. I guess we will find out! arcprize.org/arc-agi/3

6:01 PM · Mar 25, 2026

233

Read 37 replies

ARC Prize 2026 launches with $2M alongside ARC-AGI-3 benchmark release

ARC Prize 2026 (ARC Prize Foundation): Alongside the ARC-AGI-3 benchmark release, ARC Prize announced $2,000,000 in prizes and pitched ARC-AGI-3 as testing how agents “explore, form hypotheses, plan, learn and adapt,” per the Benchmark positioning thread.

A separate Fast Company writeup amplified the “benchmark exposes a weakness” framing, as shown in the Press coverage screenshot.

ARC Prize

@arcprize

Replying to @arcprize

ARC-AGI-3 gives us a formal measure to compare human and AI skill acquisition efficiency Humans don’t brute force - they build mental models, test ideas, and refine quickly How close AI is to that? (Spoiler: not close)

5:37 PM · Mar 25, 2026

414

Read 10 replies

🧷 Claude Code shipping log: 2.1.83/2.1.84, auto-mode design, and reliability pain

Continues the Claude Code surge with two CLI releases (2.1.83 + 2.1.84) and deep prompt/flag churn, while users report outages, login issues, and quota/limit frustration impacting daily use.

Claude Code 2.1.84 adds an opt-in PowerShell tool and an idle /clear nudge

Claude Code 2.1.84 (Anthropic): A follow-on release adds an opt-in PowerShell tool for Windows automation, and introduces an idle-return prompt that nudges sessions idle 75+ minutes to run /clear to avoid unnecessary prompt-cache rehydration, as summarized in the Release thread.

Prompt-level behavior also shifted: ClaudeCodeLog reports the removal of an explicit top-level “Avoid over-engineering” rule while retaining narrower “don’t add extras” constraints; it also standardizes GitHub references as owner/repo#123 for clickable links and foregrounds explicit parallel tool batching in commit/PR playbooks, per the System prompt updates. The changelog adds several operational knobs (for example, x-client-request-id for timeout debugging) in the Release thread.

Claude Code Changelog

@ClaudeCodeLog

Claude Code 2.1.84 has been released. 8 flag changes, 40 CLI changes, 5 system prompt changes Highlights: • Added opt-in PowerShell tool for Windows, enabling PowerShell commands from the CLI for Windows automation • Critical-files output lists only file paths, removing brief Show more

12:55 AM · Mar 26, 2026

135

Read 7 replies

Reliability (Claude Code): Following up on Rate limit bug—Max plan users reporting quota/accounting weirdness—today’s feed shows more acute disruption: Claude Status screenshots show major outage/partial outage conditions across claude.ai and Claude Code, including “elevated errors on Claude Opus 4.6” and Cowork connection resets, as captured in the Status page screenshot.

Multiple builders also report rapidly exhausting weekly/daily limits (“tapped out their whole usage limits on Mon/Tue”) with frustration at lack of acknowledgement, as shown in the Usage limits thread; others mention being unable to log in at all, per the Login issue post. Downstream behavior includes switching away from Claude Code “all day” due to rate limits/outages, according to the Switched due to limits, alongside quota-hit reactions in the Quota meme.

BridgeMind

@bridgemindai

Claude is having major outages today! Elevated errors on Claude Opus 4.6 and Claude Code. Fix this @claudeai

3:45 PM · Mar 25, 2026

Read 6 replies

Anthropic explains Claude Code Auto mode’s classifier approvals and injection probe design

Auto mode (Anthropic): Following up on Auto mode launch—skipping permission prompts safely—Anthropic published an engineering breakdown of how Auto mode decides when to approve tool actions using classifiers, including a prompt-injection probe over tool outputs and a transcript classifier to allow/deny actions, as described in the Engineering blog announcement and expanded in the Engineering post.

Auto mode also got a practical “how to turn it on” workflow callout: it’s available to Claude for Team users via claude --enable-auto-mode, then Shift+Tab to enter the mode, per the Enablement instructions.

Anthropic

@AnthropicAI

New on the Engineering Blog: How we designed Claude Code auto mode. Many Claude Code users let Claude work without permission prompts. Auto mode is a safer middle ground: we built and tested classifiers that make approval decisions instead. Read more: anthropic.com/engineering/cl…

11:14 PM · Mar 25, 2026

1.8K

Read 145 replies

Claude Code 2.1.83 adds managed-settings.d policy fragments and subprocess credential scrubbing

Claude Code 2.1.83 (Anthropic): The CLI shipped a dense ops/security release—most notably managed-settings.d/ as a drop-in directory where policy fragments merge alphabetically (useful when different teams own different controls), plus a hardening change so child processes no longer inherit Anthropic or cloud-provider credentials, reducing accidental secret exposure, as listed in the Changelog highlights.

The release also tweaks memory semantics—"ignore memory" now treats MEMORY.md as empty so stored lines don’t get pulled back into context, per the same Changelog highlights. Additional user-visible ergonomics include transcript search ("/" in transcript mode) and new hook events like CwdChanged/FileChanged, as detailed in the Changelog highlights.

ClaudeCodeLog’s prompt-diff notes add that deferred-tool availability now comes via a system reminder (vs an explicit list) and that an explicit skill catalog got injected, per the System prompt updates.

Claude Code Changelog

@ClaudeCodeLog

Claude Code 2.1.83 has been released. 3 flag changes, 76 CLI changes, 7 system prompt changes Highlights: • Added managed-settings.d/ drop-in dir; policy fragments merge alphabetically, letting teams deploy separately • Child processes no longer inherit Anthropic or Show more

7:06 AM · Mar 25, 2026

843

Read 36 replies

Claude Code cheat sheet (v2.1.81-era) circulates as the de facto shortcuts and commands reference

Claude Code cheat sheet (community): A one-page reference card for keyboard shortcuts, slash commands, MCP server setup, skills/agents frontmatter, and CLI flags is getting recirculated as a practical “muscle memory” aid, per the Cheat sheet share.

The sheet is explicitly labeled v2.1.81 (last updated March 24, 2026) and includes guidance around /compact, worktrees, “effort/ultrathink,” and permissions mode switching, as visible in the Cheat sheet share.

elvis

@omarsar0

Nice cheat sheet for Claude Code.

2:58 PM · Mar 25, 2026

666

Read 13 replies

Claude Code ToolSearch lazy-loading sparks complaints about added latency and error surface

ToolSearch ergonomics: A user complaint notes that Claude Code now “lazily loads all tools,” arguing that relying on ToolSearch adds latency and more failure modes than keeping tool schemas in-context, as stated in the Tool loading complaint.

The criticism lands alongside the 2.1.83+ churn where ToolSearch and deferred-tool discovery keep getting reshaped, as shown by the broader 2.1.83 release notes in the Changelog highlights.

eric provencher

@pvncher

Does anyone else find it annoying that claude code lazily loads all tools now? Tool search works fine, but it adds latency and error opportunity vs just keeping tools & schemas in context. I hope other harnesses do not follow suit.

7:21 PM · Mar 25, 2026

Read 4 replies

Anthropic schedules “What We Shipped” Claude Code release-and-tips webinar (Apr 7)

Claude Code team webinar (Anthropic): Anthropic is starting a monthly stream, “What We Shipped,” positioned as a live walkthrough of recent Claude Code feature updates plus tips and Q&A; the first session is April 7, and registration is available via the Signup post and the Webinar page.

The practical signal is that Claude Code’s shipping cadence is now high enough that official “what changed and why” briefings are being productized, per the Signup post.

Thariq

@trq212

Replying to @trq212

11:33 PM · Mar 25, 2026

Read 8 replies

🧑‍💻 Codex momentum: product commitment, student challenge, and power-user workflows

OpenAI’s coding track stays noisy: reassurance the Codex app persists, student-focused build challenges, and more day-to-day reports of teams swapping between Codex and Claude Code based on limits/reliability.

Codex App commitment: “here to stay” with increased investment

Codex App (OpenAI): A core product-direction signal landed when a Codex team member said the Codex App is “here to stay” and that OpenAI is “investing way more into it than before,” per the product commitment.

This is one of the cleaner “don’t migrate away” statements in a week where builders are actively hedging across coding agents due to reliability and quota volatility elsewhere.

Tibo

@thsottiaux

For the avoidance of doubt. The Codex App is here to stay. We are just investing way more into it than before and it's about to get quite awesome

2:03 AM · Mar 26, 2026

1.3K

Read 135 replies

Developers route work to Codex when Claude Code quotas/outages hit

Codex agents (OpenAI): A visible day-to-day pattern is builders switching active coding work to GPT-5.4 High agents inside Codex after running into Claude Code limits or degraded service, as described in the agent switch report.

• Reliability trigger: The same day included reports of Claude service instability, including “elevated errors on Claude Opus 4.6 and Claude Code,” shown in the status incident screenshot.

The practical takeaway is not “one tool wins,” but that many teams are operating with an explicit fallback path across subscriptions when an agent becomes unusable mid-loop.

BridgeMind

@bridgemindai

I just switched over to GPT 5.4 High agents in Codex. Claude Code with Claude Opus 4.6 gave me problems all day today and I hit my rate limits insanely fast. Anybody else having these issues? This is why you have multiple subscriptions!

5:21 PM · Mar 25, 2026

106

Read 32 replies

Codex-assisted profiling loop: instrument, find bottleneck, cache, repeat

Codex in-the-loop performance work: Robert “Uncle Bob” Martin described a concrete workflow where he asked Codex to instrument a game loop after it exceeded 500ms, then iteratively locate bottlenecks and apply caching-based fixes over several hours, per the profiling workflow writeup.

He reports a repeated pattern of “fix one bottleneck, then the next,” with human oversight mostly reduced to approving steps, as described in the profiling workflow writeup and reinforced by a follow-on note about running trials of strategy changes in the trial supervision note.

Uncle Bob Martin

@unclebobmartin

Yesterday I went through a process of decreasing the processing time of the adversary algorithm in the Empire game. As the game proceeds the computational load increases dramatically. I told Codex to instrument the main loop and start gathering data once the computation time Show more

2:55 PM · Mar 25, 2026

Read 7 replies

OpenAI x Handshake launches Codex Creator Challenge with $10K credits pool

Codex Creator Challenge (OpenAI Devs): OpenAI is running a student build challenge powered by Handshake—$10K in OpenAI API credit prizes, plus $100 in Codex credits for eligible U.S./Canada university students to start, as described in the challenge announcement and detailed on the challenge page.

The framing is explicitly “build something real,” which makes this more of a usage-onramp than a hackathon-style one-off.

OpenAI Developers

@OpenAIDevs

Students: build something real in the Codex Creator Challenge, powered by @joinHandshake Try new tools. Have fun. Break things. Repeat. $10K in OpenAI API credits in prizes. joinhandshake.com/students/codex…

5:02 PM · Mar 25, 2026

826

Read 74 replies

Strategy narrative: OpenAI “refocus around coding and business users”

OpenAI product focus (signal): The “strategy shift” narrative recirculated with a WSJ-attributed claim that OpenAI leadership is finalizing plans to refocus around coding and business users, as summarized in the WSJ-cited claim.

This lines up with the day’s more direct Codex product signal (Codex app commitment) and the steady drip of Codex-centered adoption stories, but the tweet itself doesn’t include a primary WSJ excerpt—treat it as secondhand reporting unless corroborated elsewhere.

Jeffrey Emanuel

@doodlestein

"OpenAI’s top executives are finalizing plans for a major strategy shift to refocus the company around coding and business users" - WSJ Waiting for them to figure out what I’ve done with Agent-Flywheel.com and reach out. I’ve really figured out how to scale with quality.

3:25 PM · Mar 25, 2026

Read 7 replies

Codex App Server highlighted as “100% open source”

Codex App Server (OpenAI ecosystem): A community reminder emphasized that the Codex App Server is open source and positioned it as a base for building “richer experiences” on top of Codex, per the open source reminder.

This is a distinct signal from “Codex app is here to stay”: it’s about how extensible the stack is for teams that want to run their own wrapper UI, orchestration, or integrations.

Vaibhav (VB) Srivastav

@reach_vb

Your periodic reminder: Codex App Server is 100% Open Source!! It allows you to build on codex and build richer experiences all with ChatGPT Auth Thousands oof projects are built on top of it including our own VS Code Extension - check it out ;)

8:35 PM · Mar 25, 2026

385

Codex app adds thread search for navigating long histories

Codex App (OpenAI): A small but high-frequency UX upgrade—thread search in the Codex app—was called out as a navigation improvement in the QoL thread search note.

For heavy Codex users, this targets the common “too many long sessions” failure mode: you remember you solved something, but can’t find the thread quickly.

dominik kundel

@dkundel

Little quality of life improvement in the Codex app. You can now search your threads for faster navigation. And if you don't want to take your hands off the keyboard you can open it directly using Cmd+K

1:20 AM · Mar 25, 2026

488

Read 48 replies

✅ Anti-slop engineering: browser QA, review modes, and spec-first validation

Quality tooling focuses on catching failures agents create: automated browser testing with replay video, structured review modes, and arguments for specs/tests defined before code so validation isn’t biased by implementation.

Expect CLI lets Claude Code/Codex QA your app in a real browser and records every failure

Expect (open source): Aiden Bai released Expect, a CLI (and agent skill) that hands your current coding agent a real browser to test against, then produces a video “highlight reel” of bugs it found, so you can fix and rerun until green, as shown in the Launch demo and Highlight reel clip.

• How it fits into existing loops: It’s explicitly positioned to run under Claude Code/Codex/Cursor “under the hood,” with a one-command bootstrap (npx -y expect-cli@latest init) described in the Launch demo.
• Why this matters: It’s trying to make “agent wrote it” code shippable by attaching repro artifacts (browser recordings) to each failure, which is the missing piece in a lot of agentic QA.

More details and entry points are collected on the Project site.

Aiden Bai

@aidenybai

Introducing Expect Let agents test your code in a real browser 1. Run Claude Code / Codex to QA your app 2. Watch a video of every bug found 3. Fix and repeat until passing Run as a CLI or agent skill. Fully open source

4:06 PM · Mar 25, 2026

3.0K

Read 153 replies

Spec-first validation: agent-written tests after implementation mostly “confirm decisions”

Spec-first validation (pattern): A builder framing from Factory AI argues that tests written after an agent implements a feature tend to reflect the code’s choices (“Everything passes. Full coverage.”) rather than the original intent, and that specs set the validation criteria before the first line of code, as laid out in the Spec-first critique.

• Why this shows up now: As agents generate more code, the failure mode shifts from “no tests” to “tests that validate the wrong thing,” because the implementation influences what gets asserted, per the Spec-first critique.

The core claim is about bias in post-hoc validation, not about any one tool.

Ray Fernando

@RayFernando1337

I’m going to miss you Sora…

9:04 PM · Mar 25, 2026

Read 5 replies

Bombadil Inspect lands on main: action traces with state-before/state-after diffs

Bombadil Inspect (Antithesis/Bombadil): A first version of Bombadil Inspect landed on main, exposing a debugging UI that pairs action logs with “state before” vs “state after” snapshots, according to the Inspect screenshot.

• What’s concrete in the UI: The screenshot shows a left rail of actions (type/click/reload), two panels rendering the UI state before/after, and a violations counter plus CPU/heap graphs in the same view, per the Inspect screenshot.

This is aimed at the failure mode where agents do many UI actions but you can’t quickly see what changed.

Oskar Wickström

@owickstrom

Today the first version of Bombadil Inspect landed on `main`. Psyched about this!

9:16 PM · Mar 25, 2026

Read 3 replies

Polyscope 0.14 ships Review Mode for in-depth PR/workspace reviews (and per-step model choice)

Polyscope 0.14 (Polyscope): Polyscope shipped a dedicated Review Mode for “in-depth code review of all changes” in a workspace/PR, and it also added a workflow where you can plan with one model and implement/review with another, as described in the Release notes and shown in the Model switch UI.

• Review as a first-class action: The product is framing review as a separate phase with its own UX, rather than “ask the agent to review” buried in chat.
• Planning vs implementation separation: The thread explicitly calls out “Plan with Claude; implement with Cursor; review with GPT 5.4,” per the Model switch UI.

The signal here is review tooling becoming a surface, not a prompt.

Marcel Pociot 🧪

@marcelpociot

We just released @getpolyscope 0.14 and it's a big one! 🔥 1. Getting started with your next big project is even easier in Polyscope. You can now create fresh @laravelphp projects right within Polyscope itself! This gives you a polyscope.json out of the box too!

2:11 PM · Mar 25, 2026

Read 10 replies

Cognition collaborates on Code Review Bench v0.3, focusing on precision vs latency

Code Review Bench v0.3 (Cognition): Cognition announced a collaboration with Martian on Code Review Bench v0.3, explicitly focusing on the tradeoff between precision and latency in code review evaluation, per the Collaboration note.

This is small but specific. It’s about measuring review quality under real-time constraints.

What’s missing in the tweets is the updated task set and scoring details (no paper or dashboard link was included in the shared post).

Cognition

@cognition

We're happy to announce our collaboration with @withmartian on Code Review Bench v0.3, with a focus on the tradeoffs between precision and latency.

Martian

@withmartian

We've been tracking AI code review tools across OSS, and a new category is emerging. We're calling it "Deep Review": → Standard AI review: PR-level, fast, human in the loop → Deep Review: repo-wide context, runs autonomously in the background 🧵👇

6:24 PM · Mar 25, 2026

Read 1 reply

📱 Claude as a work hub on mobile: app integrations and adoption signals

New Claude mobile capability centers on using work tools (design/analytics) from a phone; alongside this are adoption/usage visuals and “Claude as the super‑app” sentiment. Excludes Claude Code CLI updates (covered separately).

Claude mobile adds Work Tools for Figma, Canva, and Amplitude

Claude mobile (Anthropic): Anthropic says Claude’s “work tools” are now available on mobile—positioned as letting you open Figma designs, edit Canva slides, and view Amplitude dashboards from your phone, as announced in the mobile tools rollout and reiterated in the tools recap.

• Surfaces: The rollout points users to the mobile apps via the download page, implying this is a first-class capability across iOS/Android rather than a web-only feature.

The practical change is that “work-app context” is no longer desktop-gated for Claude sessions, which shifts when/where teams can do quick design/analytics lookups.

Claude

@claudeai

Your work tools in Claude are now available on mobile. Explore Figma designs, create Canva slides, check Amplitude dashboards, all from your phone. Give it a try: claude.com/download

5:00 PM · Mar 25, 2026

16.6K

Read 1.1K replies

Chart shows Claude leading Gemini, Grok, and DeepSeek on mobile DAUs

Claude mobile adoption: A Similarweb-style plot circulating in the mobile DAU chart shows Claude rising to roughly ~17M daily active users by ~Mar 20, 2026, overtaking Gemini, Grok, and DeepSeek over late Feb–Mar.

This is being interpreted as a distribution signal: Claude’s mobile presence looks less like a “companion app” and more like a primary interface for a chunk of users, which matters if you’re betting on mobile-first agent workflows.

Kol Tregaskes

@koltregaskes

Claude now has the second most daily active user figures worldwide on mobile after ChatGPT

3:30 PM · Mar 25, 2026

Claude usage intensity map highlights Israel, Singapore, and Australia

Claude usage geo signal (Anthropic Economic Index): A Visual Capitalist-style map of “Claude AI usage by country” is being reshared, showing usage intensity normalized by working-age population share—e.g., Israel at 4.90×, Singapore at 4.19×, Australia at 3.27×, and the U.S. at 3.69×, as shown in the country usage map.

The main analytical value is that it’s one of the few public, quantified glimpses of where Claude adoption clusters outside the U.S., and it’s being discussed as a proxy for where “Claude-first” workflows may be emerging.

Chubby♨️

@kimmonismus

Claude's use in the USA is no surprise. But apparently, Europe has also developed above-average use of Claude. China lags far behind. At least officially, there is no use of Claude (although we know that distillation was a significant part of its production). Funnily enough: Show more

1:25 PM · Mar 25, 2026

735

Read 67 replies

Builders frame Claude as the “super app” aggregator for work

Claude positioning: A recurring take is that Anthropic is turning Claude into a consolidated work hub—captured bluntly as “They are developing Claude into the app that ChatGPT wanted to be” in the super app take, with similar “Claude is becoming a super app” sentiment echoed in the super app shorthand.

This framing is being used less as model commentary and more as a product-architecture claim: Claude as the place where tool integrations accumulate, not an assistant you bounce between apps.

Chubby♨️

@kimmonismus

They are developing Claude into the app that ChatGPT wanted to be.

Claude

@claudeai

Your work tools in Claude are now available on mobile. Explore Figma designs, create Canva slides, check Amplitude dashboards, all from your phone. Give it a try: claude.com/download

5:22 PM · Mar 25, 2026

2.7K

Read 63 replies

Perplexity Computer UI hints at an upcoming Memory feature

Perplexity Computer (Perplexity): A screenshot shared by TestingCatalog suggests Perplexity is working on a Memory feature for “Perplexity Computer,” with “Memory” appearing as a sidebar nav item in the UI screenshot.

The key detail is the product direction: “computer that works for you” plus persistent state, which is the missing piece for longer-running agent workflows compared to stateless chat sessions.

TestingCatalog News 🗞

@testingcatalog

Perplexity is working on a Memory feature for Perplexity Computer.

4:03 PM · Mar 25, 2026

208

Read 14 replies

Copilot Tasks is being described as already usable on mobile

Copilot Tasks (Microsoft): One thread claims “Copilot Tasks is already on mobile,” framing it as effectively carrying a “cloud computer” style workflow in your pocket in the mobile mention.

It’s a light-data-point (no official release detail in the tweet), but it’s being used as a comparison point to Claude’s push toward mobile-controlled work execution.

Paul Couvert

@itsPaulAi

What. Copilot Tasks is already on mobile. So you basically have a cloud computer/Claude Cowork working for you on the go. Same features as on desktop. With one prompt it can: - Use a browser to navigate - Interact with the page - Scrap the relevant info - Generate an Excel Show more

Paul Couvert

@itsPaulAi

Copilot Tasks is seriously good?! Even one of the best alternative to Claude Cowork Using a single prompt it was able to: → Use a cloud browser to find the right tool → Interact with the page to enter data → Interpret all the info given by the page → Generate a PowerPoint

10:40 PM · Mar 24, 2026

167

🧱 Self-hosted & sandboxed agent execution: keep code inside your network

A cluster of updates about where agents run: self-hosted cloud agents, distributed sandboxes that survive laptop close, and ‘safe execution’ primitives. This is about execution environments, not model quality.

Cursor Cloud Agents can now run on your infrastructure

Cursor Cloud Agents (Cursor): Cursor added a self-hosted option so the same cloud-agent harness can execute tools and builds entirely inside your network, targeting teams that can’t send code or artifacts outside the perimeter, as announced in self-hosted agents announcement and detailed in the self-hosted agents post. The pitch is “cloud UX, local execution”: isolated environments and parallel task handling, but with enterprise network/compliance constraints preserved.

• Operational detail: Cursor frames this as avoiding inbound connectivity requirements (no inbound ports / no VPN) while still orchestrating agent runs, according to the self-hosted agents post.

The core change is where commands run and where artifacts land; model choice and editor experience are intended to stay the same.

Cursor

@cursor_ai

Cursor cloud agents can now run on your infrastructure. Get the same cloud agent harness and experience, but keep your code and tool execution entirely in your own network. cursor.com/blog/self-host…

6:32 PM · Mar 25, 2026

Read 87 replies

Cloudflare opens Dynamic Workers beta for sandboxing AI-generated code

Dynamic Workers (Cloudflare): Cloudflare announced an open beta of Dynamic Workers, positioning it as an infra primitive to securely execute AI-generated code at scale, as shared in beta announcement.

The framing is aimed at agentic systems that need to run untrusted generated code frequently; the concrete engineering question it’s addressing is sandbox overhead (how fast you can spin up isolated execution) rather than model capability.

Wes Roth

@WesRoth

Cloudflare has announced the open beta of Dynamic Workers, a new infrastructure primitive designed specifically to securely execute AI-generated code at scale.

Cloudflare

@Cloudflare

We’re introducing Dynamic Workers, which allow you to execute AI-generated code in secure, lightweight isolates. This approach is 100 times faster than traditional containers. cfl.re/4c2NvPl

2:00 PM · Mar 25, 2026

Read 1 reply

Imbue ships Keystone to auto-generate devcontainers in a sandbox

Keystone (Imbue): Imbue introduced Keystone, a sandboxed agent that runs in a Modal container and tries to produce a working dev environment for arbitrary repos—generating a Dockerfile, devcontainer.json, and a test runner that passes—per the Keystone launch and the product page.

• Sandboxing stance: The product writeup emphasizes doing sysadmin-style setup inside a sandboxed environment (rather than on developer machines), as described in the product page and reinforced by Modal’s callout in Modal mention.

The open question is how broadly it generalizes across polyglot repos and nonstandard test harnesses, but the shipped unit is clear: “make this repo runnable, reproducibly.”

Imbue

@imbue_ai

Teach your repo how to run itself 🦾💨 Introducing Keystone: a self-configuring agent inside a sandboxed @Modal container that generates a working dev container for any repo → pip install imbue-keystone

5:05 PM · Mar 25, 2026

Read 10 replies

OpenCode shows “distributed” agent runs that survive laptop close

OpenCode (OpenCode): A demo claims agents can run on a laptop, a remote server, or a cloud sandbox provider; you can close the laptop and work continues, then reopen and sync state back locally, per the distributed opencode demo and the follow-up on multi-device “home server” configurations in device topology notes. The design goal is execution-location flexibility without losing session data when sandboxes are deleted.

• Device/controller split: The same thread sketches patterns where a laptop or cloud node acts as the “home” runtime while phones act as remote controllers, as described in device topology notes.

What’s still unclear from the tweets is the durability mechanism (where state is stored, conflict behavior, and what is sync’d vs re-derived).

dax

@thdxr

james has achieved distributed opencode agents can run on your laptop, on a remote server, in a cloud sandbox provider shut your laptop and things keep running open it back up and all the data syncs delete the sandbox nothing is lost

James Long

@jlongster

OpenCode is about to get more powerful with remote sandboxes I showed a brief demo before, but here's a much more in-depth demo. it's not hard to add basic support for a remote env, but handling all the edge cases like when a remote env gets deleted is difficult. especially if

9:57 PM · Mar 25, 2026

1.2K

Read 50 replies

Sandcastle runs parallel Claude sandboxes on worktrees and merges back

Sandcastle (Matt Pocock): Sandcastle can take a backlog of issues, spawn N Claude instances in Docker sandboxes on separate git worktrees, and merge resulting changes back to a target branch—all locally—per the Sandcastle capability post.

• Scaling friction: The same author notes hitting rate limits when firing multiple Opus instances in parallel, which becomes an immediate bottleneck for this style of “many sandboxes at once,” as shown in parallel run rate limit.

This is a concrete execution pattern: isolate work per worktree, run agents in parallel, then reconcile via git.

Matt Pocock

@mattpocockuk

Sandcastle can now: - Look at a backlog of issues - Spawn N number of Claude's in sandboxes on different worktrees, each tackling an issue - Take all that worktree code and merge it back to a target branch - All locally Just Docker and TypeScript needed

7:11 PM · Mar 25, 2026

141

Read 19 replies

OpenCode is moving core features into plugins to force a better plugin API

OpenCode (OpenCode): The maintainer says they’re refactoring existing features into plugins, then building new features as internal plugins so the plugin API has to be “good,” as stated in plugin decomposition note and echoed via a plugin-first framing in internal plugins retweet. This is a product/architecture signal: extensibility becomes a first-class constraint, not an add-on.

• Why it matters for execution: In a sandboxed/distributed runtime, plugins often become the unit of capability delivery (tools, connectors, sandboxes, UI panels); a “dogfooded” internal plugin surface tends to converge on clearer capability boundaries, per the intent described in plugin decomposition note.

dax

@thdxr

now that we have most features we want implemented, we're doing a pass to decompose them all into plugins since we didn't have to use the plugin api, it never become very good now all new features we want to build will be internal plugins, so the plugin api will have to be good

kmdr

@kmdrfx

Features in OpenCode itself will be internal plugins and can be activated/deactivated at runtime. Same as external plugins. This will allow for reloading plugins at runtime. Trying to tweak the DX a little more. Almost ready to go.

11:52 PM · Mar 25, 2026

412

Read 30 replies

🕹️ Agent runners & harnesses: OpenClaw release and “run it from chat” ops

Ops tooling for coordinating agents stays active, led by OpenClaw’s new release and examples of chat-driven personal automation. Excludes general MCP protocol items (covered separately).

OpenClaw 2026.3.24 broadens OpenAI API compatibility for clients and RAG stacks

OpenClaw 2026.3.24 (OpenClaw): OpenClaw expanded its OpenAI-compatible surface by adding /v1/models and /v1/embeddings, and it now forwards explicit model overrides through both /v1/chat/completions and /v1/responses, as detailed in the GitHub release notes and summarized in the release thread. That matters for teams using OpenClaw as a “one runtime” gateway, because embeddings and model-override propagation are common breakpoints when plugging in off-the-shelf OpenAI clients.

• Sub-agent routing via OpenWebUI: The release also calls out improved ability to “talk to sub-agents” with OpenWebUI in the release thread, which is the kind of client-compat surface area that tends to rot quickly without explicit passthrough support.

OpenClaw🦞

@openclaw

OpenClaw 2026.3.24 🦞 🔌 Improved OpenAI API: talk to sub-agents with @openwebui 🎛️ Skill & tool management Control UI 🎨 Slack interactive reply buttons 💅 Native Microsoft Teams 🧵 Smart Discord auto-thread naming Any client. Any model. One runtime. github.com/openclaw/openc…

5:27 PM · Mar 25, 2026

2.0K

Read 184 replies

A chat-based CRM built on OpenClaw replaces a $300/mo SaaS CRM with a Google Sheet

OpenClaw CRM workflow (OpenClaw): A builder showed an OpenClaw “chat CRM” that uses a Google Sheet as the system of record while the agent auto-updates leads, sends follow-ups, and answers questions across Gmail/WhatsApp/calendar integrations—framed as replacing a $300/mo CRM in the CRM breakdown. This is a concrete “run it from chat” pattern: spreadsheet as database, chat as UI, agent as operator.

Moritz Kremb

@moritzkremb

My OpenClaw chat-based CRM system It lets me: - chat with my CRM - auto-update leads - auto-send follow-ups - ask anything about my leads - connected to gmail, whatsapp & calendar ...replacing a $300/mo CRM with a Google Sheet + OpenClaw agent Full breakdown & tutorial:

11:02 AM · Mar 25, 2026

143

Read 16 replies

OpenClaw 2026.3.24 ships native Microsoft Teams integration via the official SDK

OpenClaw Microsoft Teams (OpenClaw): OpenClaw 2026.3.24 migrates to the official Teams SDK and adds Teams-native behaviors (streaming 1:1 replies, welcome cards with prompt starters, feedback, typing indicators, native AI labeling, plus message edit/delete support), as described in the GitHub release notes and echoed by the beta post. For AI ops, this is about keeping agent execution inside an enterprise’s existing comms surface.

Peter Steinberger 🦞

@steipete

New @openclaw beta is out with better MS Teams integration, @OpenWebUI and more!

3:15 PM · Mar 25, 2026

796

Read 62 replies

OpenClaw 2026.3.24 updates Control UI for tool/skill management visibility

OpenClaw Control UI (OpenClaw): In 2026.3.24, the Control UI now shows only the tools available to the active agent, adds a live “Available Right Now” section, and offers compact vs detailed views, per the GitHub release notes and the beta mention. The point is operational clarity: when tools vary by agent, credentials, or environment, the UI becomes the fastest way to understand what the harness can actually do before you burn tokens.

Peter Steinberger 🦞

@steipete

New @openclaw beta is out with better MS Teams integration, @OpenWebUI and more!

3:15 PM · Mar 25, 2026

796

Read 62 replies

OpenClaw 2026.3.24 adds one-click install recipes for bundled skills

OpenClaw skills (OpenClaw): OpenClaw 2026.3.24 adds install “recipes” for bundled skills (and prompts to install missing dependencies via CLI/Control UI), according to the GitHub release notes. This is a distribution move: it makes skillpacks closer to reproducible units you can roll out across chat endpoints, instead of tribal-knowledge setup steps.

OpenClaw 2026.3.24 adds Slack interactive reply buttons for faster chat loops

OpenClaw Slack (OpenClaw): OpenClaw 2026.3.24 adds interactive reply buttons in Slack, as listed in the release thread. This is a harness-level UX primitive: it lets an agent present constrained choices (approve/deny, pick an option, confirm an action) without pushing users back into CLI commands.

OpenClaw🦞

@openclaw

5:27 PM · Mar 25, 2026

2.0K

Read 184 replies

OpenClaw ops gotcha: sessions “poof at 4am,” causing overnight amnesia

OpenClaw session ops: A recurring operational gotcha is that OpenClaw sessions “poof at 4am,” creating an overnight amnesia mode unless state is externalized, as called out in the community tip. For teams relying on chat-driven personal automation, this becomes a boundary between “conversation as state” and “external memory as state.”

claire vo 🖤

@clairevo

Random @openclaw tips that are super simple but almost no one realizes - your sessions poof at 4 am, overnight amnesia is built into the system - you don’t need a monitor for your Mac mini turn on screen share & remote in from your laptop - it’s SOUL is promoted to not bother Show more

5:36 AM · Mar 25, 2026

435

Read 28 replies

OpenClaw 2026.3.24 adds smart Discord auto-thread naming

OpenClaw Discord (OpenClaw): OpenClaw 2026.3.24 introduces smart Discord auto-thread naming, per the release thread and the deeper notes in the GitHub release notes. For chat-native agent ops, thread naming ends up being an indexing mechanism for long-running work, especially when you route many tasks through a shared Discord server.

OpenClaw🦞

@openclaw

5:27 PM · Mar 25, 2026

2.0K

Read 184 replies

RunClaw pitches Telegram as an ops console for OpenClaw-style agent work

RunClaw (OpenClaw-adjacent): A sponsored comparison claims people run OpenClaw on a ~$700 Mac mini 24/7 for inbox ops, while “RunClaw” can do similar work “from Telegram for $1,” plus expand into building websites, slide decks, and media from chat surfaces like Slack, per the sponsored comparison. Treat it as directional rather than verified benchmarking, but it signals continued pressure toward chat-first agent execution plus aggressive cost packaging.

AshutoshShrivastava

@ai_for_success

People running a $700 Mac Mini with OpenClaw 24/7 to manage their inbox… are now watching RunClaw do the same thing from Telegram for $1 And now RunClaw can build a full website design a slide deck generate videos and images all from Slack, anytime, anywhere The future we are Show more

Umesh Kumar

@itsumeshk

We're launching RunClaw to kill OpenClaw. OpenClaw costs $700 to set up. RunClaw costs $1 no setup. > OpenClaw can’t build you a website > Can’t generate a video > Can’t make a slide deck > Has 9 security CVEs RunClaw does all of it. Better agents. More secure. Always in your

2:05 PM · Mar 25, 2026

Read 18 replies

🔌 MCP & agent interoperability: channels, web actions, and UI streaming

Interoperability plumbing shows up across messaging channels, web-automation endpoints, and UI streaming protocols—useful for teams standardizing tool access across assistants.

Claude channels add iMessage support for agent conversations

Claude channels (Anthropic): iMessage is now supported as a channel, per the channel announcement, extending the “agent in chat apps” pattern beyond Slack/Discord-style surfaces.

• What it enables: the channel surface is being framed as a way to interact with an agent through Messages—see the iMessage UI example in screenshot framing.
• Why it matters: it makes “on-the-go” control/monitoring viable for long-running workflows (agents that keep running somewhere else), without requiring a dedicated mobile app UI.

Operational details (auth model, how sessions map to an agent runtime, and whether this is Claude Code-specific or broader) aren’t spelled out in the tweets, so treat the first wave as a channel primitive rather than a fully-specified remote-control stack.

Thariq

@trq212

imessage is now available as a channel!

kenneth

@neilhtennek

i bought a mac mini so i could have blue bubbles when texting claude and it started roasting me... try the imessage plugin for claude code today with /plugin install imessage@claude-plugins-official

12:13 AM · Mar 26, 2026

Read 77 replies

Firecrawl adds /interact for natural-language or Playwright web actions

Firecrawl (/interact): Firecrawl introduced an /interact endpoint that follows /scrape, letting an agent take actions on a live page via natural language, as described in the endpoint announcement.

• Action interface: alongside natural-language instructions, Firecrawl is explicitly positioning “need full browser control” as writing Playwright code in-session, with an example shown in Playwright snippet.
• Agent harness implication: this is a clean split between extraction (/scrape) and stateful manipulation (/interact), which fits teams trying to standardize web automation across multiple assistants without committing to full “computer use” UIs.

The tweets don’t include latency, pricing, or reliability claims; the concrete shipping artifact here is the endpoint and the code-level control surface.

Firecrawl

@firecrawl

Introducing the /interact endpoint - Just /scrape a page - Call /interact to do any action with natural language - Watch agents click, fill, scroll to get deep web data Available today.

4:02 PM · Mar 25, 2026

188

Read 8 replies

AWS Bedrock AgentCore adds an AG‑UI endpoint for streaming agent UIs

AgentCore AG‑UI (AWS Bedrock): AWS added a dedicated AG‑UI endpoint inside AgentCore so agents can stream interactive UI components and human-in-the-loop workflows, according to the AG‑UI announcement.

• Integration framing: the post explicitly places AG‑UI alongside MCP (“agent to tools”) and A2A (“agent to agent”), per the AG‑UI announcement.
• Build surface: the claim is that an AgentCore agent can stream UI by setting an AG‑UI flag and deploying, as shown in the AG‑UI announcement.

This is an interoperability move: a standardized way to ship agent UX without building a bespoke front-end for every agent runtime.

CopilotKit🪁

@CopilotKit

@AWS just added a dedicated AG-UI endpoint inside AgentCore Any AgentCore agent can now stream interactive UI components, live workflows, human-in-the-loop, and more. Build an agent on AWS, set an AG-UI flag, and deploy. That's it. AgentCore handles the infra. AG-UI handles Show more

3:13 PM · Mar 25, 2026

OpenBlock ships Connections to reuse SaaS auth across agent sessions

Connections (OpenBlock): OpenBlock announced “Connections,” a one-time setup flow so any OB‑1 session/CLI/agent can securely use linked SaaS accounts (individual + org), powered by WorkOS, as shown in the feature announcement.

• Surface area: examples name Linear and Sentry explicitly, and imply broader coverage (GitHub/Stripe and others), per the feature announcement.
• Interoperability value: this aims to decouple auth bootstrap from per-agent setup—one configured connection can be reused across multiple agent entrypoints, per the feature announcement.

The post is light on the underlying trust model (scopes, revocation, auditability), but the concrete new primitive is centralized credential plumbing for agent sessions.

OpenBlock

@openblocklabs

Introducing Connections. Set it up once → every OB-1 session just works. Connect: • Individual accounts (e.g. @linear) • Org accounts (e.g. @sentry) Plus: @github @braintrust @baseten @stripe …and more Powered by @WorkOS See OB-1 with @linear below!

12:06 AM · Mar 26, 2026

Read 1 reply

BrowserEnv pitches training on custom web workflows to improve agents

BrowserEnv: BrowserEnv is being positioned as a response to current “computer use” agents being slow and inaccurate on real sites, with a pitch that teams can train models on their own esoteric web workflows, per the BrowserEnv pitch.

The key interoperability angle is that it treats web automation as a trainable environment layer (workflows as data) rather than a one-off UI-driving feature. The tweet doesn’t include benchmarks, supported browsers, or how environments are specified, so the only solid claim here is the direction: training infrastructure targeted at the long tail of site-specific web tasks.

Kyle Jeong

@kylejeong

Excellent researchers & AI engineers all know that computer use still sucks, It's SUPER slow, not very accurate on many sites, and isn't trained to be good at the ultra-specific workflows you want it to do. We built BrowserEnv to change that.

Browserbase

@browserbase

We're excited to announce our partnership with @PrimeIntellect to allow anyone to train browser agents. General-purpose models aren't optimized for your browser workflows, BrowserEnv lets you train one that is. Checkout browserenv.com and train your own custom model in

5:49 PM · Mar 25, 2026

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭

🛡️ Safety & misuse: Model Spec, voice guardrails, and jailbreak tooling

Security/safety discourse clusters around explicit behavior specs, stronger guardrails for agents (especially voice), and the emergence of open jailbreak interfaces that invert post-training safety layers.

G0DM0D3 launches a refusal-penalizing, multi-model “jailbroken” chat UI

G0DM0D3 (elder_plinius): A new open-source web UI positions itself as “fully jailbroken AI chat” with “no guardrails,” routing prompts to many models in parallel and using a “Tastemaker” scorer that explicitly penalizes refusals/hedging and amplifies the most direct answer, per the launch thread and the accompanying GitHub repo.

The README-style framing (“cognition without control”) is visible in the

, and the core operational claim is that post-training safety layers can be competed away by multi-model selection pressure rather than bypassed in a single model.

@elder_plinius

⛓️‍💥 INTRODUCING: G0DM0D3 🌋 FULLY JAILBROKEN AI CHAT. NO GUARDRAILS. NO SIGN-UP. NO FILTERS. FULL METHODOLOGY + CODEBASE OPEN SOURCE. 🌐 GODMOD3.AI 📂 github.com/elder-plinius/… the most liberated AI interface ever built! designed to push the limits of the Show more

11:22 PM · Mar 25, 2026

Read 117 replies

ElevenLabs ships Guardrails 2.0 to curb voice-agent drift and manipulation

Guardrails 2.0 (ElevenLabs): ElevenLabs rolled out an updated safety/control layer for ElevenAgents aimed at preventing real-time voice agents from hallucinating, drifting off-task, or being steered by users, as shown in the Guardrails 2.0 demo.

This is specifically framed as in-the-loop control for conversational voice workflows (where “keep it on-script” and “don’t get socially engineered” failures are common), rather than a general model capability update.

Wes Roth

@WesRoth

ElevenLabs rolled out Guardrails 2.0 for ElevenAgents, its conversational AI platform. This update introduces a robust, enterprise-grade safety and control layer designed to stop voice agents from hallucinating, drifting off-topic, or being manipulated by users in real time.

ElevenLabs

@ElevenLabs

Introducing Guardrails 2.0 in ElevenAgents. Control how agents behave in production with a redesigned safety layer. You can define and enforce custom business policies. Or, toggle on pre-built protections to keep agents on-topic, on-brand, and resistant to manipulation.

8:00 AM · Mar 25, 2026

Read 3 replies

OpenAI spotlights Model Spec as a public instruction hierarchy and behavior contract

Model Spec (OpenAI): OpenAI published a long-form conversation on how the Model Spec is supposed to work in practice—resolving conflicting instructions via a hierarchy and evolving through real-world feedback—framing it as the “what it should and shouldn’t do” layer as models gain capability, as described in the podcast segment.

The same push is reinforced by OpenAI’s written explainer on how they maintain and iterate the spec, detailed in the Model Spec article.

OpenAI

@OpenAI

The more AI can do, the more we need to ask what it should and shouldn’t do. OpenAI researcher @w01fe joins host @AndrewMayne to explore the Model Spec, the public framework that defines how models are intended to behave. They break down how it works in practice, from the chain Show more

5:20 PM · Mar 25, 2026

944

Read 270 replies

English Wikipedia bans LLM-written article prose with two narrow exceptions

Wikipedia policy (English Wikipedia): Wikipedia is drawing a bright line against using LLMs to generate or rewrite article prose, while allowing two constrained uses—grammar/style help on human-written text (with the editor responsible for meaning fidelity) and translation assistance as a first draft—summarized in the policy recap.

For analysts tracking content provenance and integrity, this is a governance signal: “fluent” isn’t the acceptance criterion; traceability to sources is.

Rohan Paul

@rohanpaul_ai

Wikipedia just drew a hard line on AI-written article text and allowed only 2 narrow uses. The new English Wikipedia rule bans editors from using LLMs to generate fresh article prose or rewrite existing article content, because the system cares less about fluent wording than Show more

10:46 PM · Mar 25, 2026

Read 8 replies

OpenAI launches a Safety Bug Bounty focused on AI abuse and safety risks

Safety bug bounty (OpenAI): OpenAI is launching a program explicitly aimed at finding “AI abuse and safety risks” across OpenAI products, according to the program announcement.

The tweet doesn’t include payout ranges, scope details, or submission mechanics, so treat this as the headline only until the full program page is referenced in follow-ups.

OpenAI Newsroom

@OpenAINewsroom

Today we’re launching a Safety Bug Bounty program focused on identifying AI abuse and safety risks across OpenAI products. This new program builds on our Security Bug Bounty to include AI-specific safety issues and misuse scenarios, helping us work with the safety and security Show more

5:46 PM · Mar 25, 2026

854

Read 168 replies

AI detector false-positive: Gettysburg Address flagged as AI-generated

AI detection reliability: A widely shared example shows ZeroGPT flagging Lincoln’s Gettysburg Address as AI-generated at 96%+, illustrating how detectors can misclassify canonical historical text and why “AI-written” claims based on detectors alone remain brittle, per the detector screenshot.

This keeps resurfacing as practical ammo in policy and academic settings where detector outputs are treated as evidence rather than as noisy heuristics.

Rohan Paul

@rohanpaul_ai

AI Detector flags Abraham Lincoln’s Gettysburg address as AI-generated 😃

Possum Reviews

@ReviewsPossum

This AI text detector says Abraham Lincoln's Gettysburg Address was written by AI.

3:45 PM · Mar 25, 2026

Read 11 replies

🛠️ Dev utilities for agent era: repo context, doc freshness, and parsing primitives

Open-source tools aim at the core bottleneck: getting correct, current context into agents (and cutting token waste) via doc fetchers, structural repo graphs, and better document parsing.

code-review-graph builds a Tree-sitter repo map to shrink review context

code-review-graph (AlphaSignalAI/community): An open-source utility called code-review-graph indexes a repository into a Tree-sitter-derived structural graph stored in SQLite, then uses “blast radius” tracing to pull only the minimal impacted set of files during reviews—framed as a way to stop agents from rereading huge repos and guessing the missing parts, per the Token reduction claim screenshot.

The post’s concrete performance claims include 6.8× fewer tokens on typical reviews and 49× fewer tokens on large monorepos by reading ~15 files instead of scanning ~27,000, with “2,900 files reindex in 2 seconds” and “12 languages” support, as shown in the Token reduction claim.

AlphaSignal AI

@AlphaSignalAI

This open-source solution cuts Claude Code's token usage by 49x. Large codebases have thousands of files. Claude can't hold them all in context. So it reads what it can and fills in the gap with assumptions. Often, those assumptions are wrong. code-review-graph fixes this. Show more

3:01 PM · Mar 25, 2026

Context Hub (chub) open-sources a “doc freshness” CLI for coding agents

Context Hub (AlphaSignalAI/community): A new open-source CLI called Context Hub (“chub”) is positioned as a fix for the common agent failure mode of coding against stale SDKs—by fetching versioned docs on demand and letting agents persist local annotations when docs are missing or wrong, as described in the Context Hub overview screenshot.

The core claim is that this is an information problem more than a model problem—agents “invent parameters and call dead functions” when they don’t have current specs, and chub turns “search → fetch → use” into a repeatable pre-step for any coding session, using a growing markdown doc corpus that can be contributed back to (the post cites “6K+ GitHub stars in a week” and “1,000 API documents,” per the Context Hub overview).

AlphaSignal AI

@AlphaSignalAI

Andrew Ng just solved the #1 reason AI coding agents hallucinate. It's not a model problem. It's an information problem. Agents write code against outdated API specs. They invent parameters and call dead functions. Context Hub, or simply chub, is an open-source CLI tool. Show more

12:01 PM · Mar 25, 2026

Read 5 replies

LlamaParse improves Word table parsing by aligning XML structure to rendered pages

LlamaParse (LlamaIndex): LlamaParse shipped a Word/.docx table-parsing improvement that links source Word XML table structure (including merged cells and formatting) to the final rendered markdown while preserving page positions, addressing the practical “renderer decides pagination” problem noted in the announcement Docx parsing update.

• What changed: The pipeline can now map “source XML tables/table elements” to “rendered markdown output,” enabling direct interpretation of merged cells plus bold/italic/sub/sup/strikethrough, as explained in the Docx parsing update and illustrated by the renderer mismatch graphic.

• Why it matters for agent context: For RAG and doc-to-plaintext workflows, this makes citations and table extraction more stable when agents need “what was on page N” semantics, with more detail in the Parsing blog and entry points in the LlamaParse page.

Jerry Liu

@jerryjliu0

Improving Table Parsing for Word (.docx) Documents 📄🧩 Parsing Word/docx files is hard, even though counterintuitively the internal XML format is easier to understand than a PDF file. The XML captures the full semantic structure of text and tables, but the issue is which page Show more

LlamaIndex 🦙

@llama_index

Word docs are one of the most common file formats people process in LlamaParse, and they've always been surprisingly frustrating to parse well. Here's the counterintuitive part: .docx actually has better structural information than most document formats. We just haven't been able

4:42 PM · Mar 25, 2026

Read 4 replies

dev-browser CLI frames browser use as “agent writes code” instead of click-driving

dev-browser CLI (workflow pattern): A “dev-browser” command-line tool is introduced with the premise that “the fastest way for an agent to use a browser is to let it write code,” suggesting a shift away from slow, error-prone GUI driving toward code-authored browser actions, per the Dev-browser CLI intro.

The tweet doesn’t include a full spec, but the framing implies an agent loop where the model generates automation code (e.g., Playwright-style scripts) rather than interpreting pixels/click targets, as stated in the Dev-browser CLI intro.

Sawyer Hood

@sawyerhood

Introducing the new dev-browser cli. The fastest way for an agent to use a browser is to let it write code. Just `npm i -g dev-browser` and tell your agent to "use dev-browser"

4:27 PM · Mar 25, 2026