LiteLLM 1.82.7 and 1.82.8 compromised on PyPI – 10:39–14:35 UTC

New in Claude Code: auto mode. Instead of approving every file write and bash command, or skipping permissions entirely, auto mode lets Claude make permission decisions on your behalf. Safeguards check each action before it runs.

6:01 PM · Mar 24, 2026

31.2K

Claude Code Auto mode uses a pre-tool-call classifier to block risky actions

Safety layer for tool calls (Anthropic): Auto mode isn’t “auto-approve everything”; before each tool call, a classifier checks for potentially destructive actions so safe actions can proceed while risky ones are blocked and Claude is forced to try a different approach, according to the safeguard explanation in Classifier safeguard description.

• Risk framing: Anthropic explicitly says this “reduces risk but doesn’t eliminate it,” and recommends isolated environments, as stated in Auto mode announcement.
• Operational behavior: The model may reroute its strategy after a block instead of repeatedly requesting approvals, as summarized in Runtime safety filter summary.
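
The gate described above can be pictured with a toy sketch. Everything here is an assumption for illustration: Anthropic's classifier is a model rather than a regex list, and `ToolCall`, `classify`, `run_with_gate`, and the patterns are hypothetical names, not Claude Code internals.

```python
import re
from dataclasses import dataclass

# Hypothetical deny patterns; the real safeguard is a classifier, not a regex list.
RISKY_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bgit\s+push\s+.*--force\b",
    r"\bdrop\s+table\b",
]


@dataclass
class ToolCall:
    tool: str       # e.g. "bash" or "write_file"
    argument: str   # command text or file path


def classify(call: ToolCall) -> str:
    """Return "allow" or "block" for a proposed tool call."""
    text = call.argument.lower()
    if any(re.search(pattern, text) for pattern in RISKY_PATTERNS):
        return "block"
    return "allow"


def run_with_gate(call: ToolCall, execute) -> str:
    """Run execute(call) only if the pre-call check allows it."""
    if classify(call) == "block":
        # The block is reported back so the agent can pick another approach
        # instead of asking the user for approval.
        return "blocked: choose a different approach"
    return execute(call)
```

The point of the shape is that the check sits in front of every call, so a block changes the agent's plan rather than pausing for a human. As the announcement says, this reduces risk without eliminating it.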

Claude

@claudeai

Replying to @claudeai

Before each tool call, a classifier reviews it for potentially destructive actions. Safe actions proceed automatically. Risky ones get blocked, and Claude takes a different approach. This reduces risk but doesn't eliminate it. We recommend using it in isolated environments.

6:01 PM · Mar 24, 2026

1.5K

Claude Code permission UX: Shift+Tab mode switch, with Auto mode as a distinct setting

Claude Code CLI/UX (Anthropic): Auto mode shows up as a dedicated permission mode in the UI, sitting alongside options like “Auto accept edits” and “Bypass permissions,” with quick switching via Shift+Tab, as shown in the settings capture in Permission mode menu.

• Practical implication: This makes it easier to dial autonomy up/down mid-session without dropping all safeguards, which is the friction point highlighted in the original Auto mode pitch from Auto mode announcement.

TestingCatalog News 🗞

@testingcatalog

Anthropic released Auto Mode for Claude Code CLI, which allows Claude to make its own decisions on which permissions to accept. It is only available on the Team plan in research preview for now. On the desktop app, it is not yet available, but it is in the works.

Claude

@claudeai

7:41 PM · Mar 24, 2026

153

Claude Code Auto mode rollout: Team plan only for now, with scaling to other surfaces planned

Rollout mechanics (Anthropic): Multiple posts emphasize the current constraint: Auto mode is limited to the Team plan today, in research preview, with hints at broader availability once Anthropic “scales it,” per the note in Teams-only and scaling note.

• Surface gap: TestingCatalog notes it’s not on desktop yet and is “in the works,” while still being CLI-activatable, as summarized in Teams preview details.
• Why it matters: This tier-gating shapes who can actually run longer unattended agent tasks without prompt fatigue, which is the core complaint in the “no more permission prompts” chorus in Permission prompt fatigue.

Thariq

@trq212

turns out being an AI safety company is useful for when you need to make sure AIs can run safely

Claude

@claudeai

6:38 PM · Mar 24, 2026

1.2K

Builder sentiment: approval fatigue is the bottleneck Auto mode is targeting

Agentic coding workflow (community): The dominant reaction isn’t about new capabilities so much as removing interruption—“no more permission prompts” shows up as the headline value prop in Permission prompt fatigue, with others echoing that prompts should be “a thing of the past,” as in Permission prompts comment.

• Pragmatic take: Some builders frame Auto mode as a way to keep moving while still feeling responsible, rather than going straight to full bypass, as captured in YOLO substitute remark.

Boris Cherny

@bcherny

no 👏 more 👏 permission prompts 👏

Claude

@claudeai

9:26 PM · Mar 24, 2026

3.1K

🎨 Figma MCP as a first-class design surface for coding agents (Claude/Cursor/Copilot)

High-signal interop cluster: Figma’s MCP tool + skills enable agents to read/write real Figma files with design-system context; multiple vendors showcase design-to-code loops. Excludes general non-MCP design tools (covered elsewhere).

Figma’s use_figma MCP tool makes the canvas writable by agents

use_figma MCP (Figma): Figma is opening up direct agent control of the canvas via a new use_figma MCP tool plus teachable “skills,” positioning it as a standard way for agents to read/write real Figma files instead of relying on screenshots or brittle UI automation, as described in the [Figma MCP announcement](t:16|Figma MCP announcement).

• Why engineers care: MCP turns “design system context” (components, variables, tokens) into something an agent can query and mutate deterministically, which is the missing link for design-to-code loops that don’t drift.
• Ecosystem signal: downstream tools are already demoing agents writing into Figma through this interface, as shown in the [Factory demo](t:206|Factory demo).

Cursor adds Figma component generation with design-system tokens

Cursor + Figma (Cursor): Cursor can now create new components and frontends directly in Figma while adhering to a team’s design system, including variables/tokens and naming conventions, according to the [Cursor Figma demo](t:15|Cursor Figma demo).

• Design-system enforcement: the flow explicitly calls out implementing variables, tokens, and naming conventions through the Figma plugin, as noted in the [plugin details](t:274|Plugin details).
• Workflow impact: this moves “UI scaffold” from a manual handoff into an agentic step that can be replayed and kept consistent with system primitives.

A more reliable Claude Code → Figma loop via Plugin API codegen

Claude Code + Figma MCP (Anthropic/Figma): One emerging reliability pattern is to have Claude generate code that targets Figma’s Plugin API (i.e., translate intent into known Figma functions) rather than “freehand” design edits; the claim is that this makes outcomes more repeatable when working with design-system context, per the [integration note](t:7|Integration note) and the [Plugin API detail](t:218|Plugin API detail).

Copilot CLI can edit Figma files through Figma’s MCP server

Copilot CLI + Figma MCP (GitHub/Figma): GitHub is highlighting that, with Figma’s MCP server, you can drive changes directly to Figma files from GitHub Copilot CLI or @code, per the [GitHub MCP mention](t:103|GitHub MCP mention). This is an interoperability step: the same MCP surface can be used by multiple agent frontends without bespoke Figma integrations per tool.

Warp ships a Figma MCP skill pack for token-aware edits

Warp Figma skills (Warp): Warp is shipping a packaged skill set for editing Figma designs through the Figma MCP server; installation is via npx skills add warpdotdev/figma-skills, as shown in the [Warp Figma walkthrough](t:397|Warp Figma walkthrough).

• What’s shipped: a public skill repo exists for the integration, as linked from the [repo pointer](t:887|Repo pointer) and detailed in the [GitHub repo](link:887:0|GitHub repo).

FactoryAI’s agents write directly into Figma via use_figma MCP

FactoryAI + Figma (FactoryAI): FactoryAI is demoing a native connection from its agents (“Droids”) into the Figma canvas using use_figma MCP, with the pitch that agents can write real components/variables with full design-system awareness, as shown in the [FactoryAI canvas demo](t:206|FactoryAI canvas demo).

Figma and Anthropic schedule a Claude Code ↔ Figma roundtrip livestream

Workflow education (Figma/Anthropic): A livestream titled “From Claude Code to Figma – and Back Again” is scheduled for March 31 (9:00AM PST), framed as hands-on guidance for roundtrip workflows between Claude Code and Figma using the MCP server, as announced in the [livestream post](t:175|Livestream post) and described on the [event page](link:175:0|Event page).

🛡️ Supply-chain wake-up: LiteLLM PyPI credential-stealer and downstream fallout

Today’s dominant security story: compromised LiteLLM releases (1.82.7/1.82.8) exfiltrated credentials and hit transitive dependents; ecosystem response includes PyPI quarantine/yank, incident writeups, and calls for stronger package-manager install-script controls. Excludes Claude Code Auto mode (feature).

DSPy warns about transitive exposure and signals it may remove LiteLLM as a default dep

DSPy (DSPyOSS): DSPy maintainers published a time-bounded advisory saying the malicious LiteLLM versions were available from 10:39–14:35 UTC, and that anyone who installed LiteLLM 1.82.7 or 1.82.8 should treat the environment as compromised and rotate potentially exposed credentials, per DSPy incident advisory.

They also said a forthcoming DSPy 3.3 will “likely drop the dependency on LiteLLM” and instead expect providers to follow a small set of standards (OpenAI-style completions/Responses), as stated in DSPy dependency plan.
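
The verification step in the advisory comes down to a version check. A minimal sketch, assuming the only affected releases are the two named in this section; the helper names are hypothetical, and the check only inspects the current environment, so a clean result does not prove a compromised version was never installed there.

```python
from importlib import metadata

# Versions named in the advisories quoted in this section.
COMPROMISED_LITELLM = {"1.82.7", "1.82.8"}


def is_compromised(version: str) -> bool:
    """True if the given litellm version string is one of the flagged releases."""
    return version.strip() in COMPROMISED_LITELLM


def installed_litellm_version():
    """Return the installed litellm version, or None if it is not installed."""
    try:
        return metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return None
```

A caller would run `installed_litellm_version()` in each virtual environment and, if `is_compromised` returns true, follow the advisory: treat the environment as compromised and rotate credentials.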

DSPy

@DSPyOSS

Earlier today, LiteLLM had two malicious versions posted to PyPi. These versions were available freely starting at 10:39 UTC and the packages were quarantined on PyPi by 14:35 UTC. If you installed anything that depends on LiteLLM in that four-hour span, including running `pip …

10:13 PM · Mar 24, 2026

184

browser-use limits the blast radius to v0.12.3 installs during the LiteLLM window

browser-use (open source): The project reports that only browser-use v0.12.3 was impacted (it was the only version depending on LiteLLM), and only for installs between 10:39–16:00 UTC; their cloud services were not affected, according to Scope-limited advisory.

The post repeats the key verification step—checking for LiteLLM 1.82.7/1.82.8—and suggests rotating credentials if those versions were pulled, as outlined in Scope-limited advisory.

Browser Use

@browser_use

Security notice: 1 version of browser-use was part of today's LiteLLM supply chain attack — but the scope is very limited. The open-source package v0.12.3 (all other versions do not have LiteLLM as a dependancy), installed between 10:39–16 UTC today, is affected. Our cloud …

12:29 AM · Mar 25, 2026

Hermes Agent posted a LiteLLM incident notice and mitigation guidance

Hermes Agent (NousResearch): Hermes users were warned that LiteLLM was a dependency “within parts of Hermes Agent,” and that installs made in the preceding 4–24 hours could be affected; Teknium points users to the notice in Hermes security notice.

The notice highlights the impacted LiteLLM versions (1.82.7/1.82.8) and frames the expected impact as secrets exfiltration (API keys, logins), aligning with the broader incident description in Incident overview.

Teknium (e/λ)

@Teknium

Thank you Luba for notifying us as well as the discord community of @Lite_LLM having been hacked. Please see this important security notice if you are a Hermes Agent user who installed within the last 4-24 hours!

luba luft

@luba_loop

fyi @NousResearch @Teknium hermes-agent installs appear to be blocked blocked, bc the dependency `litellm` is qurantined on PyPi. looks like there might be a supply chain attack on litellm github.com/NousResearch/h… github.com/NousResearch/h…

3:40 PM · Mar 24, 2026

254

AI diff scanning and publish holds proposed for critical packages

Registry scanning proposal: A detailed suggestion is for registries such as PyPI, npm, and crates.io to run automated scans on releases of high-impact packages by diffing against the prior version and flagging suspicious signals (large base64 blobs, new URLs, unusual publish IP/location), then impose a 48-hour hold for review when risk is high, as laid out in Registry scanning proposal.

The argument is framed as low marginal cost (tokens per release) versus high blast radius, in the same spirit as the transitive-dependency risk described in Incident overview.
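
To make the diff-scan idea concrete, here is a rough sketch of two of the signals the proposal names (new URLs and large base64-like blobs). It is not how any registry works today; the function name, the regexes, and the 200-character threshold are arbitrary assumptions, and the proposal itself imagines an AI-driven review rather than fixed patterns.

```python
import re

# Threshold and patterns are illustrative assumptions, not a vetted detector.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/=]{200,}")
URL = re.compile(r"https?://[^\s\"'<>)]+")


def flag_release(previous_source: str, new_source: str) -> list[str]:
    """List signals present in the new release but absent from the prior one."""
    flags = []
    old_urls = set(URL.findall(previous_source))
    for url in sorted(set(URL.findall(new_source)) - old_urls):
        flags.append(f"new URL: {url}")
    old_blobs = set(BASE64_RUN.findall(previous_source))
    for blob in BASE64_RUN.findall(new_source):
        if blob not in old_blobs:
            flags.append(f"new base64-like blob ({len(blob)} chars)")
    return flags
```

A non-empty result would be the trigger for the review hold described above; an empty one says only that these two checks found nothing.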

Jeffrey Emanuel

@doodlestein

This kind of thing happens way too often. For any package that’s this popular (40k+ GitHub stars in this case), it just seems like a total no-brainer that PyPi/npm/crates.io/etc. should do AI-powered scans for this pattern of attack. It would be trivial to make a skill to do …

Daniel Hnyk

@hnykda

LiteLLM HAS BEEN COMPROMISED, DO NOT UPDATE. We just discovered that LiteLLM pypi release 1.82.8. It has been compromised, it contains litellm_init.pth with base64 encoded instructions to send all the credentials it can find to remote server + self-replicate. link below

2:21 PM · Mar 24, 2026

182

Lockfile discipline resurfaces as an incident-response control for agent toolchains

OpenHands (OpenHandsDev): In response to the LiteLLM compromise, OpenHands reported production environments were unaffected and emphasized that open-source developers who bypassed the lockfile while installing dependencies should check if they were affected, as stated in Exposure investigation note.

This is a concrete reminder that “agent stacks” often install large dependency trees, and lockfile bypass turns a time-bounded PyPI incident into local compromise risk, per the framing in Exposure investigation note.

OpenHands

@OpenHandsDev

We're investigating our exposure to the LiteLLM vuln tl;dr: our production environments are unaffected. Investigating dev now. If you're an OpenHands open source dev, and bypassed the lockfile while installing dependencies, you should check if you're affected 👇

3:12 PM · Mar 24, 2026

Package-manager install-script controls get proposed as a post-LiteLLM mitigation

Package management controls: A concrete mitigation proposal is to make “nouveau” package managers (explicitly calling out uv and bun) reduce risk from install-time scripts—e.g., adding guardrails up to manually approving batches of network calls—per Install-script guardrails idea.

This is directly tied to the LiteLLM attack’s install-time execution mechanism described in Incident overview.
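
For context on the mechanism: the compromised release is reported in this section to have shipped a `litellm_init.pth` file, and Python's `site` module executes any line in a `.pth` file that begins with `import` when the interpreter starts. A minimal audit sketch, with a hypothetical function name, that lists such lines in one directory:

```python
from pathlib import Path


def executable_pth_lines(directory: str) -> dict[str, list[str]]:
    """Map each .pth file in `directory` to the lines that `site` would execute."""
    findings = {}
    for pth in sorted(Path(directory).glob("*.pth")):
        lines = pth.read_text(encoding="utf-8", errors="replace").splitlines()
        # `site` runs lines starting with "import" followed by a space or tab;
        # all other non-comment lines are treated as paths to add to sys.path.
        hits = [ln for ln in lines if ln.startswith(("import ", "import\t"))]
        if hits:
            findings[pth.name] = hits
    return findings
```

Pointing it at `sysconfig.get_paths()["purelib"]` covers the active environment's site-packages. Hits are not proof of compromise, since some legitimate tools also ship executable `.pth` lines, so the output is a list to read, not a verdict.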

swyx

@swyx

Replying to @karpathy

we should probably also treat this as a wake up moment for all noveau package managers - uv and bun presumptively - to make these entire classes of things far less risky, eg by adding a lot of guards on install scripts up to the point of manually approving baches of network calls

6:19 PM · Mar 24, 2026

392

Security-audit branding gets scrutinized after the LiteLLM compromise

Audit/assurance signal: Commentary argues that LiteLLM’s “Secured by Delve” positioning looks hollow after the compromise, with specific criticism of Delve’s audits and lack of response in Audit backlash thread.

A related practitioner take suggests “AI-powered scans” for popular packages should be table stakes at registries, but also implies audit badges are not a substitute for release-channel controls, per Registry scanning proposal and the follow-up correction in PyPI scanning context.

Gergely Orosz

@GergelyOrosz

Oh damn, I thought this WAS a joke ... but no, LiteLLM *really* was "Secured by Delve" (the company that rubber stamped all of these audits, and seems to have been on the edge of fraudlent auditing, but useless for sure) And so unspririsingly LiteLLM was compromised, badly

SPEC

@___4o____

6:24 PM · Mar 24, 2026

2.3K

Supply-chain fear pushes a renewed “fewer dependencies” stance

Dependency posture shift: Karpathy frames the LiteLLM incident as a reminder that deep dependency trees are a systemic risk, and says this has made him “growingly averse” to dependencies—preferring to “yoink” simple functionality via LLMs when feasible, per Dependency critique.

This is less about LiteLLM specifically and more about the engineering response to transitive compromise risk, which the incident narrative in Dependency critique made concrete.

Andrej Karpathy

@karpathy

Software horror: litellm PyPI supply chain attack. Simple `pip install litellm` was enough to exfiltrate SSH keys, AWS/GCP/Azure creds, Kubernetes configs, git credentials, env vars (all your API keys), shell history, crypto wallets, SSL private keys, CI/CD secrets, database …

Daniel Hnyk

@hnykda

4:56 PM · Mar 24, 2026

20.8K

Incident response got noisier: suspicious spam comments show up on the LiteLLM GitHub issue

GitHub incident-response noise: During the LiteLLM disclosure, Simon Willison called out the odd pattern of many low-effort “thanks that helped” comments on the GitHub issue thread, asking for theories in Suspicious comments question.

This matters because operational guidance (which versions are compromised, how to verify installs) often concentrates in a single issue thread, and large-scale spam can bury remediation details, as implied by Suspicious comments question and the broader urgency in Incident overview.

Simon Willison

@simonw

Replying to @simonw

Anyone got any theories as to why there are hundreds of comments like this on the GitHub issue reporting the exploit? github.com/BerriAI/litell…

Screenshot of a threaded comment section from GitHub, issues, showing four replies all posted 2 hours ago. User "praiitt" says "Thanks, that helped!" followed by another comment from "praiitt" saying "This was the answer I was looking for." User "Hancie123" comments "Worked like a charm, much appreciated." and user "programonaut" says "Thanks, that helped!"

2:51 PM · Mar 24, 2026

PyPI’s existing scanning-partner API is cited as a reason LiteLLM was quarantined fast

PyPI scanning capability: Simon Willison notes that PyPI already supports scanning via an API used by partners, and suggests this may explain why LiteLLM was quarantined quickly after going live, per PyPI scanning note.

That comment directly answers calls for registry-side detection made in Registry scanning proposal, while leaving open how comprehensive the current partner scanning is in practice.

Simon Willison

@simonw

Replying to @doodlestein

"just seems like a total no-brainer that PyPi/npm/crates.io/etc. should do AI-powered scans for this pattern of attack" PyPI does that via an API used by scanning partners. I expect that may be why the package was quarantined on PyPI within an hour of it going live Show more

7:19 PM · Mar 24, 2026

🧵 Agent runners & swarms: Hermes 0.4.0, API backends, and parallelism UX

Operational agent tooling saw big movement: Hermes Agent’s largest release adds background self-improvement and an OpenAI-compatible API server, while builders highlight multi-agent swarms and long-running missions. Excludes MCP-specific Figma items (separate category).

Hermes Agent v0.4.0 adds background self-improvement and an OpenAI-compatible API server

Hermes Agent (NousResearch): v0.4.0 lands as the largest Hermes release ("300 merged PRs") and turns Hermes into an OpenAI-compatible agent backend while adding a background post-response improvement loop, as described in the release post and the release highlights thread.

• OpenAI-compatible API server: Hermes now exposes both /v1/chat/completions and /v1/responses, including stateful chaining via previous_response_id, per the API server details in API server details.
• Background self-improvement: after a response is delivered, a separate review agent decides what to remember and what to convert into reusable skills, as outlined in self-improvement loop.
• Ops surface expansion: the release adds more messaging adapters (including Signal/Matrix/SMS) and ships CLI/context-handling upgrades (streaming by default, queue/status tooling, CLAUDE.md support), as listed in CLI upgrades.
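
As a rough illustration of the stateful chaining mentioned above, the sketch below only builds request bodies in the OpenAI Responses style and sends nothing. The host, port, and model name are placeholders, and whether Hermes accepts exactly these fields is an assumption based on the "OpenAI-compatible" description rather than on its documentation.

```python
import json

BASE_URL = "http://localhost:8000"  # hypothetical host and port


def responses_request(text, model="hermes-agent", previous_response_id=None):
    """Build (url, json_body) for a POST to the /v1/responses endpoint."""
    body = {"model": model, "input": text}  # model name is a placeholder
    if previous_response_id is not None:
        # Chains this turn onto an earlier response instead of resending history.
        body["previous_response_id"] = previous_response_id
    return f"{BASE_URL}/v1/responses", json.dumps(body)
```

A follow-up turn would pass the `id` returned by the first response as `previous_response_id`, which is what lets a thin UI stay stateless while the server keeps the thread.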

The net change is Hermes moving from “agent you run” to “agent platform you can plug UIs into,” with full details in the GitHub release notes, per release notes link.

Nous Research

@NousResearch

Hermes Agent v0.4.0 is out:

Teknium (e/λ)

@Teknium

Hermes Agent v0.4.0 — 300 merged PRs this week. Biggest release we've done. Background self-improvement, OpenAI Responses API endpoint for your agent, new messaging platforms, new providers, MCP server management, and a lot more.

5:18 PM · Mar 24, 2026

792

Read 50 replies

Hermes Agent issues guidance for users exposed via LiteLLM dependency compromise

Hermes Agent (NousResearch): Nous/Hermes maintainers posted a security notice describing exposure via LiteLLM as a dependency in parts of Hermes Agent, including impacted versions and a short “check/rotate/remove” playbook, as shown in security notice screenshot.

The notice calls out LiteLLM 1.82.7 and 1.82.8 as affected releases and frames the safest response as treating the environment as compromised (rotate secrets/keys and remove the dependency) for anyone who installed during the relevant window, per the maintainer guidance in security notice screenshot.

Teknium (e/λ)

@Teknium

luba luft

@luba_loop

3:40 PM · Mar 24, 2026

254

BridgeSpace usage: 12-agent and 50-agent swarms for parallel code/security audits

BridgeSpace (BridgeMind): Multiple demos show BridgeSpace being used as a swarm runner for parallel security/audit work—including a phone-driven flow that triggers a 12-agent security audit and a separate run that launches 50 agents inside the same environment, per the 12-agent walkthrough in 12-agent swarm demo and the 50-agent clip in 50-agent swarm clip.

• Parallel audit decomposition: one example shows 10 explorer agents spawned in parallel for auth-flow review, each scoped to specific file paths and using the gpt-5.4-mini high variant, as captured in subagent roster screenshot.

The common thread is pushing long-horizon review work into many small, path-scoped investigations, then aggregating findings back into a single thread.
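
The fan-out and aggregate pattern can be sketched in a few lines. This is not BridgeSpace's implementation: `review_path` is a stand-in for a path-scoped agent call, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def review_path(path: str) -> dict:
    """Stand-in for one path-scoped agent; a real runner would call a model here."""
    return {"path": path, "findings": [f"reviewed {path}"]}


def run_swarm(paths, max_workers=10):
    """Review each path in parallel, then merge findings into one list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so the merged report is deterministic.
        results = list(pool.map(review_path, paths))
    return [finding for result in results for finding in result["findings"]]
```

Scoping each worker to a path keeps its context small, and the single merged list is the "one thread" the findings return to.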

BridgeMind

@bridgemindai

Claude Computer Use CHANGES EVERYTHING. I used my phone to tell Claude Opus 4.6 to open BridgeSpace, launch a 12 agent swarm, and run a full security audit on my codebase. It controlled my Mac. Navigated the app. Configured the agents. Submitted prompts to all 12 terminals. …

Claude

@claudeai

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.

1:30 PM · Mar 24, 2026

145

LangSmith Fleet adds custom Slack bots for calling agents by handle

LangSmith Fleet (LangChain): Fleet now supports custom Slack bots, giving each agent its own handle so teams can run agent workflows directly from Slack, as announced in Fleet announcement.

In practice, this is being framed as a shared collaboration surface where a team can see agent inputs/outputs in-channel (instead of fragmented per-user threads), as described in Slack-first workflow notes.

LangChain

@LangChain

LangSmith Fleet now supports custom Slack bots. Give your agent its own handle, then call it directly from Slack. Use agents where you already work. Try Fleet: smith.langchain.com/agents?skipOnb…

4:50 PM · Mar 24, 2026

Founder signal: engineering work moving into Slack/Linear via cloud-hosted agents

Cloud-hosted agent ops: A founder report describes spending multiple days without running local dev commands, with most engineering/marketing execution happening through Slack and Linear while agents run “in the cloud,” alongside the claim that building an internal orchestration layer is itself a full-time effort, as laid out in cloud agents workflow note.

The post also explicitly contrasts DIY orchestration with paying for “battle-hardened” systems (citing Devin) as a way to externalize the ops burden, per cloud agents workflow note.

Ryan Carson

@ryancarson

I haven't typed `npm run dev` on my local machine for three days now and it's absolute bliss. Having my agents 100% in the cloud is a massive unlock. (One of those agents is openclaw, which is technically on my mbp in my office, but the only way I interact with it is via …

12:08 PM · Mar 24, 2026

188

🧩 Cursor’s Composer 2: training report, RL recipe, and CursorBench economics

Cursor published technical details on how Composer 2 was trained (continued pretraining + RL + benchmark development) with emphasis on emulating the Cursor environment. This continues the Composer storyline with new concrete training/benchmark specifics and cost/performance plots.

Cursor details how Composer 2 was trained and where it sits on CursorBench cost vs quality

Composer 2 technical report (Cursor): Following up on the earlier RL claim (Composer 2’s RL story), Cursor released a training report describing three pillars (continued pretraining, reinforcement learning, and benchmark development) aimed at emulating the Cursor IDE environment, as stated in the Technical report announcement. The report also includes CursorBench positioning data in which Composer 2 lands around 61% at roughly $0.35/task and ~8k completion tokens, versus GPT-5.4 at ~63% and ~$1.20/task and Opus 4.6 at ~61% and ~$2.00/task, as shown in the CursorBench plots.

• Benchmark targets: The report frames Composer 2 as scoring strongly on CursorBench plus public SWE benchmarks (SWE-bench Multilingual, Terminal-Bench), per the Technical report announcement.
• What RL was trained on: The RL training task mix is dominated by “iterate on feature” (~39%) and “debugging” (~32%), based on the chart shared in the RL task mix.
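
The cost gap is easier to see as ratios. The sketch below uses only the approximate figures quoted above, which are read off a plot, so the outputs are rough; the function names are illustrative.

```python
# (score %, cost in $ per task), approximate values quoted in this section.
POINTS = {
    "Composer 2": (61.0, 0.35),
    "GPT-5.4": (63.0, 1.20),
    "Opus 4.6": (61.0, 2.00),
}


def cost_ratio(model: str, baseline: str = "Composer 2") -> float:
    """How many times more a model costs per task than the baseline."""
    return POINTS[model][1] / POINTS[baseline][1]


def dollars_per_point(model: str) -> float:
    """Cost per task divided by benchmark score, a crude efficiency measure."""
    score, cost = POINTS[model]
    return cost / score
```

On these numbers GPT-5.4 costs about 3.4 times as much per task as Composer 2 for roughly two more points, and Opus 4.6 about 5.7 times as much at a similar score.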

Cursor

@cursor_ai

We're releasing a technical report describing how Composer 2 was trained.

10:09 PM · Mar 24, 2026

3.4K

Composer 2 RL takeaway: improvements show up in both pass@k and pass@1

Composer 2 RL effect (Cursor): A notable interpretation circulating is that Composer 2’s RL phase improved both pass@k and pass@1, implying gains beyond “just sampling better” and pointing toward capability uplift rather than only reweighting, as highlighted in the RL pass@k and pass@1 note.
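
For readers who want the metric pinned down: pass@k is commonly computed with the unbiased estimator popularized by the HumanEval evaluation, 1 - C(n-c, k)/C(n, k) for n samples of which c pass. The sketch assumes that definition; the source does not say how Cursor computed its numbers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c of n passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to the plain success rate c/n. A gain at large k as well as at k=1 means more problems became solvable at all, not only that first attempts improved, which is why the tweet reads it as new capability.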

Niklas Muennighoff

@Muennighoff

One gem from Composer paper is that RL improved both pass@k & pass@1. Suggests RL does not just reweigh existing capabilities but also teaches new ones? 💎

Cursor

@cursor_ai

We're releasing a technical report describing how Composer 2 was trained.

10:20 PM · Mar 24, 2026

215

Composer 2’s early adoption pitch is feel: speed plus taste in frontend work

Composer 2 usage signal (Cursor): Multiple builders are emphasizing “feel” as the differentiator—“so fast, so smart” in the Composer 2 feel and “preferred model for frontend design work… at this speed” in the Frontend design preference—suggesting Cursor is winning some workflows where low-latency iteration matters more than raw benchmark deltas.

eric zakariasson

@ericzakariasson

i still can't believe how good it *feels* to use composer 2 so fast, so smart

8:09 PM · Mar 24, 2026

217

⚙️ Inference/serving performance: vLLM MRv2, KV-cache compression, and ultra-low latency UX

Systems posts centered on reducing CPU/GPU sync and KV-cache cost: vLLM’s new execution core, Google’s TurboQuant KV-cache compression claims, and editor-grade latency targets. Excludes on-device storage mounts (dev tools).

Google TurboQuant claims 6× KV-cache memory cuts and up to 8× faster attention

TurboQuant (Google Research): Google published TurboQuant, a KV-cache-focused quantization approach that claims ≥6× KV memory reduction and up to 8× faster attention scoring at 4-bit on H100, with “zero accuracy loss” framing via a two-stage scheme (PolarQuant + QJL) described in the TurboQuant breakdown and the underlying Google blog post.

A concrete detail that matters for serving teams is the emphasis on avoiding hidden overhead (extra per-block constants/metadata), since KV-cache is often bandwidth-bound in long-context workloads, as called out in the TurboQuant breakdown.
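
A back-of-the-envelope calculation shows why a 6× cut matters. The model dimensions below are hypothetical and not taken from the TurboQuant post; the formula is the standard per-sequence KV-cache size (keys plus values for every layer, KV head, and token).

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2.0):
    """Per-sequence KV-cache size; the leading 2 counts keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 128k-token context, fp16.
baseline = kv_cache_bytes(32, 8, 128, 131072)
compressed = baseline / 6  # applying the claimed lower bound on the reduction

print(f"baseline:   {baseline / 2**30:.1f} GiB")
print(f"compressed: {compressed / 2**30:.1f} GiB")
```

Under these assumed dimensions one long sequence drops from 16 GiB to under 3 GiB, which is the difference between one and several concurrent long-context requests per GPU.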

This is massive. Google released TurboQuant, advanced theoretically grounded quantization algorithms - massive compression for LLMs. Tackles one of the nastiest costs in long-context LLMs: the KV cache, which stores small memory vectors for every past token and keeps growing as …

Google Research

@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

2:09 AM · Mar 25, 2026

vLLM ships Model Runner V2: GPU-native input prep and async-first execution core

vLLM (vLLM project): vLLM introduced Model Runner V2 (MRV2), a ground-up rewrite of the execution core aimed at higher throughput and better speculative decoding behavior; it moves more prep onto the GPU, goes “async-first” with less CPU↔GPU synchronization, and adds Triton-native components, while keeping the external API unchanged per the MRV2 announcement and the deeper write-up in the MRV2 blog post.

• How to try it: it’s opt-in behind an env flag—export VLLM_USE_V2_MODEL_RUNNER=1—as shown in the MRV2 announcement.
• What else is bundled in the 2026 roadmap: the team also surfaced supporting work like KV/memory allocation and prefill disaggregation improvements in their GTC recap, which frames MRV2 as part of a broader “GPU-first” serving architecture rather than a one-off patch.
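
When vLLM is launched from a Python wrapper instead of a shell, the same opt-in can be passed through the child process environment. The flag name and value come from the announcement quoted above; the helper name is hypothetical, and it is an assumption that the flag must be present before the engine starts.

```python
import os


def mrv2_env(base=None):
    """Return a copy of the environment with the MRV2 opt-in flag set."""
    env = dict(os.environ if base is None else base)
    env["VLLM_USE_V2_MODEL_RUNNER"] = "1"
    return env
```

The result can be handed to `subprocess.run(..., env=mrv2_env())` when starting the server, which keeps the opt-in scoped to that one process instead of the whole shell session.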

vLLM

@vllm_project

We rebuilt vLLM's execution core from the ground up — more efficient, more modular. Introducing Model Runner V2! 🔧 Modular design with cleaner abstractions ⚡️GPU-native input preparation 🔄 Async-first with zero CPU–GPU sync 🔋 New Triton-native sampler Already seeing …

8:29 PM · Mar 24, 2026

288

Zed’s edit prediction runs in ~200ms via Baseten-hosted Zeta

Zed (Zed + Baseten): Zed highlighted an Edit Prediction loop where AI code completions appear in about 200ms, with the Zeta model running on Baseten according to the Latency demo and echoed in Baseten’s positioning around “inference has to be invisible” in the Inference feel framing.

This is one of the clearer “latency as UX” datapoints in editor-integrated inference: the demo shows completions arriving fast enough to feel like local tooling rather than a chat roundtrip, as visible in the Latency demo.

Zed

@zeddotdev

Your AI code completions in Zed show up in ~200ms. That's Zeta, our Edit Prediction model, running on @baseten. We love partnering with companies who keep the bar high — Baseten is one of them.

6:41 PM · Mar 24, 2026

595

Data center power and cooling constraints show up as an inference scaling ceiling

Serving capacity constraints: a recurring infra signal is that scaling models is increasingly bounded by electricity, heat, and cooling, not just GPUs; one widely shared claim is that data centers already consume ~10% of US electricity, with new builds hitting ~400MW scale (and sometimes discussed in GW terms), alongside water-cooling for chips dissipating ~2kW each, per the Datacenter power note.

This frames long-context and high-throughput inference as a physical-systems problem (site power delivery, cooling loops, and time-to-build), beyond model/kernel optimizations, as described in the Datacenter power note.

Chubby♨️

@kimmonismus

Data centers are consuming 10% of US electricity already, with new ones hitting 400 megawatts (or even GW). These massive, half-mile-long structures use advanced water-cooling for chips that output 2kW of heat. It's an insane amount of power and heat!

6:21 PM · Mar 24, 2026

🧭 Workflow patterns: memory compaction, “you still must read code,” and autonomy ladders

Practitioner guidance focused on how to keep agents effective over time: periodic memory extraction/compaction, understanding-first discipline, and staged autonomy (draft → guarded retrieval → supervised actions). Excludes specific product releases covered elsewhere.

Delegation ceiling: you can outsource code, not understanding

Understanding-first discipline: Multiple posts repeat the same constraint for agent-driven development: you can delegate writing and searching, but you still have to read and understand the code to know what you’re shipping and where you can go next, as stated in the Read and understand code post and reinforced in the Can't outsource understanding post.

In practice this frames “review” as comprehension (architecture + invariants), not line-by-line nitpicking—especially as agents increase output volume.

kache

@yacineMTB

you still need to read and understand code

1:52 AM · Mar 25, 2026

987

Read 111 replies

HBR autonomy ladder: treat agents like employees with roles, limits, and audits

Agent rollout pattern: A Harvard Business Review piece argues that the core risk is “bad actions,” so production agents need a job description, limits, and a manager; it highlights distinct requirements like agent identity + permissions, trusted data sources, hard rule checks between a model and transactions, and full audit trails, as summarized in the Autonomy ladder summary and expanded in the HBR article.

This frames safe deployment as staged autonomy (drafts → guarded retrieval → supervised actions → narrow bounded autonomy) rather than a binary “agent on/off” switch.
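One way to picture the ladder is as a rule check that sits between the model and any side effect. The sketch below is a minimal illustration under stated assumptions: the level names follow the stages listed above, while `REQUIRED_LEVEL`, `HARD_LIMIT`, and the `decide` function are hypothetical and not taken from the HBR piece.

```python
from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    DRAFT = 0              # agent only produces drafts
    GUARDED_RETRIEVAL = 1  # agent may read from trusted sources
    SUPERVISED_ACTION = 2  # agent may act, but a human approves
    BOUNDED_AUTONOMY = 3   # agent may act alone within hard limits


@dataclass
class Action:
    kind: str            # "draft", "read", or "write"
    amount: float = 0.0  # size of the transaction, if any


# Hypothetical policy: minimum autonomy level required for each kind of action.
REQUIRED_LEVEL = {
    "draft": Autonomy.DRAFT,
    "read": Autonomy.GUARDED_RETRIEVAL,
    "write": Autonomy.SUPERVISED_ACTION,
}
HARD_LIMIT = 500.0  # hypothetical per-transaction cap enforced outside the model


def decide(level: Autonomy, action: Action, audit_log: list) -> str:
    """Return 'allow', 'needs_approval', or 'deny', and record every decision."""
    required = REQUIRED_LEVEL.get(action.kind)
    if required is None or level < required or action.amount > HARD_LIMIT:
        verdict = "deny"
    elif action.kind == "write" and level < Autonomy.BOUNDED_AUTONOMY:
        verdict = "needs_approval"
    else:
        verdict = "allow"
    audit_log.append((level.name, action.kind, action.amount, verdict))
    return verdict
```

The design point is that the limit and the audit entry live in ordinary code, so they hold regardless of what the model outputs.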

Harvard Business Review just published a piece. A good AI agent needs a job description, limits, and a manager. Because, AI agents can fail like employees with too much access and too little supervision. firms keep treating agents like normal software, even though the real risk

12:48 PM · Mar 24, 2026

208

Read 30 replies

Teams are reporting worse production code from “heavily vibe-coded” work

Code quality signal: A concrete failure mode is circulating: someone inherits a “heavily vibe-coded” React area described as “the worst…in the last 10y,” used to argue that teams are seeing broad code-quality degradation and only catching it late, per the Vibe-coded React warning.

The actionable takeaway is organizational, not tooling: if agent output is allowed to bypass normal design/testing pressure, the cleanup arrives later as operational cost rather than PR friction.

Armin Ronacher ⇌

@mitsuhiko

There will be more of this. And as much as we're joking about it, we're seeing a massive degradation of code quality right now and we're increasingly only catching it way too late.

4:11 PM · Mar 24, 2026

1.0K

Read 33 replies

Claude Code /memory “Auto-dream” rumor points to background memory compaction

Claude Code (Anthropic): A /memory setting called Auto-dream is being spotted as an unreleased toggle; the reported behavior is a background subagent that periodically reviews recent sessions, consolidates learnings, updates MEMORY.md, and prunes/reorganizes stale detail into separate files, per the Auto-dream menu leak and earlier chatter in the Reddit feature rumor.

This is a concrete “memory hygiene” pattern (index file + topic shards) aimed at keeping project memory short and durable, instead of growing a single notes blob.
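The "index file + topic shards" layout can be sketched in a few lines. This is only an illustration of the file structure, not the reported Auto-dream behavior: the `compact` function, the `memory/<topic>.md` paths, and the keep-the-most-recent pruning rule are all assumptions, and a real implementation would presumably use a model to decide what is stale.

```python
from collections import defaultdict


def compact(notes: list[tuple[str, str]], max_per_topic: int = 3) -> tuple[str, dict[str, str]]:
    """Group (topic, note) pairs into per-topic shard files plus a short index."""
    by_topic: dict[str, list[str]] = defaultdict(list)
    for topic, note in notes:
        if note not in by_topic[topic]:  # drop verbatim duplicates
            by_topic[topic].append(note)

    shards: dict[str, str] = {}
    index_lines = ["# MEMORY.md (index)"]
    for topic in sorted(by_topic):
        kept = by_topic[topic][-max_per_topic:]  # keep only the most recent notes
        filename = f"memory/{topic}.md"          # hypothetical shard layout
        shards[filename] = "\n".join(f"- {n}" for n in kept) + "\n"
        index_lines.append(f"- {topic}: see {filename} ({len(kept)} notes)")
    return "\n".join(index_lines) + "\n", shards
```

The index stays short because it only points at shards, so the agent can load detail on demand.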

There seems to be a new Claude Code feature in /memory called "Auto-dream", possibly unreleased. per Reddit Auto-dream seems to run a background Claude subagent that periodically reviews recent sessions, consolidates what was learned, updates MEMORY.md, and prunes or reorganizes

12:35 PM · Mar 24, 2026

114

Read 31 replies

Cursor “Continual Learning” plugin turns chat history into AGENTS.md memory

Cursor (Plugin workflow): A new pattern is getting packaged as a plugin: every N prompts, a subagent reviews conversation history, extracts durable facts/preferences, and writes them into an AGENTS.md file that the agent can reuse later, as described in the Plugin behavior summary and detailed in the Plugin page.

This is a practical middle ground between ad-hoc summarization and full vector-memory: it produces an editable, repo-local artifact that can be code-reviewed and versioned.
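The "every N prompts" loop can be sketched as follows. The class and the `toy_extract` heuristic are hypothetical stand-ins: in the plugin as described, the extraction step is a subagent call, and nothing here reflects its actual code.

```python
from typing import Callable


class MemoryWriter:
    """Every `every_n` prompts, run an extractor over recent history and keep new facts."""

    def __init__(self, every_n: int, extract: Callable[[list[str]], list[str]]):
        self.every_n = every_n
        self.extract = extract          # stand-in for the subagent call
        self.history: list[str] = []
        self.facts: list[str] = []      # what would be written to AGENTS.md

    def on_prompt(self, prompt: str) -> bool:
        """Record a prompt; return True when an extraction pass ran."""
        self.history.append(prompt)
        if len(self.history) % self.every_n != 0:
            return False
        for fact in self.extract(self.history[-self.every_n:]):
            if fact not in self.facts:  # avoid duplicate entries across passes
                self.facts.append(fact)
        return True

    def render(self) -> str:
        """Render the facts as a markdown section suitable for AGENTS.md."""
        return "## Learned preferences\n" + "".join(f"- {f}\n" for f in self.facts)


def toy_extract(prompts: list[str]) -> list[str]:
    # Toy heuristic: treat prompts starting with "always" as durable preferences.
    return [p for p in prompts if p.lower().startswith("always")]
```

Because the output is a plain markdown section, it can be diffed and reviewed like any other file in the repo.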

eric zakariasson

@ericzakariasson

you can try this out in cursor today! cursor.com/marketplace/cu… this will look at your conversation history every N prompt, spawn a subagent to extract memories, and then store them where the agent can access them easily. if you're curious, there's a full article in thread!

Anthony

@kr0der

just found out Claude Code has a new (unreleased?) feature called "Auto-dream" under /memory according to reddit, this basically runs a subagent periodically to consolidate Claude's memory files for better long-term storage this is pretty crazy because that's basically how

9:31 AM · Mar 24, 2026

356

Read 25 replies

MCP vs CLI debate gets reframed as “computer vs no-computer”

Interface debate: The MCP vs shell argument is being reframed as whether you give the agent a full computer (Turing-complete bash) or a constrained API surface; the thread emphasizes that the security posture differs depending on whether the agent co-resides on your machine vs runs isolated, per the Computer vs no computer argument.

This pushes teams toward an explicit design choice: larger action space increases capability, while narrower connectors reduce blast radius when prompts or inputs are adversarial.
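The "narrow connector" side of that trade-off can be illustrated with a small allowlist wrapper. This is a generic sketch, not MCP itself: `make_connector` and the `get_issue` operation are hypothetical names. The contrast is with a shell, which accepts any command string the agent produces.

```python
from typing import Callable


def make_connector(ops: dict[str, Callable[..., str]]) -> Callable[..., str]:
    """Build a callable that only dispatches to an explicit set of named operations."""

    def call(name: str, **kwargs) -> str:
        if name not in ops:
            # Anything outside the allowlist is rejected before it can run.
            raise PermissionError(f"operation not exposed: {name}")
        return ops[name](**kwargs)

    return call


# Hypothetical read-only connector exposing a single operation.
connector = make_connector({"get_issue": lambda issue_id: f"issue {issue_id}"})
print(connector("get_issue", issue_id=7))
```

The blast radius is bounded by the keys of `ops`; widening capability means adding an entry deliberately rather than trusting the prompt.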

Onur Solmaz

@onusoz

The MCP versus CLI argument should be reframed as Computer vs No-computer argument I personally get the dunk on MCP. It didn't work last year, with earlier models. Then we saw CLIs perform much better with the same models. And giving access to bash was much simpler! Models'

8:19 PM · Mar 24, 2026

Read 14 replies

Agent code-audit prompt: find hard-coded constants and unfinished “TODO/will” paths

Repo hygiene pattern: A reusable agent prompt pattern is circulating: first force the agent to read AGENTS.md and README.md and map architecture; then sweep the entire repo for hard-coded constants that should be dynamic plus “TODO/will/would” comments as unfinished logic, as written in the Agent coding life hack.

The follow-on prompt asks the agent to fix everything while maintaining a granular TODO list (or converting the findings into dependency-structured tasks), turning “agent review” into a structured backlog generator.
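A deterministic pre-pass can give the agent a starting list before it reads anything. The sketch below is an assumption-heavy illustration: it only understands Python-style `#` comments, and both regular expressions are crude heuristics invented here, not part of the original prompt.

```python
import re

# Comment markers that often signal unfinished logic, per the prompt pattern above.
UNFINISHED = re.compile(r"#.*\b(TODO|FIXME|will|would)\b", re.IGNORECASE)
# Crude heuristic for hard-coded constants: NAME = <number with two or more digits>.
MAGIC_NUMBER = re.compile(r"^\s*\w+\s*=\s*\d{2,}\s*(#.*)?$")


def audit(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, finding kind, stripped line) for each suspicious line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if UNFINISHED.search(line):
            findings.append((lineno, "unfinished", line.strip()))
        elif MAGIC_NUMBER.match(line):
            findings.append((lineno, "hard-coded", line.strip()))
    return findings


sample = (
    "timeout = 30\n"
    "# TODO: make retries configurable\n"
    "x = compute()\n"
    "# this will be replaced later\n"
)
for finding in audit(sample):
    print(finding)
```

Each finding maps naturally onto one backlog item, which is the "structured backlog generator" idea in miniature.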

Jeffrey Emanuel

@doodlestein

Agent Coding Life Hack: This is like the coding agent equivalent of shining a blacklight on your clean-looking black shirt and seeing just how filthy it really is: ❯ First read ALL of the AGENTS.md file and README.md file super carefully and understand ALL of both! Then use

1:57 AM · Mar 25, 2026

Read 7 replies

Reliability is a systems property: handoffs and escalation are the missing primitives

High-reliability pattern: A recurring point from high-reliability orgs is being applied to agents: reliability comes from the system (handoffs, escalation, and when to pull in humans), and current agentic tooling is often weaker at these coordination edges than the models themselves, per the Reliability is systems property.

This fits cleanly with the “autonomy ladder” framing: the hard engineering work is designing the supervision and transfer points, not only improving single-agent capability.

Ethan Mollick

@emollick

Big lesson from high reliability organizations that AI agent builders need to learn is reliability is the property of systems. Current agentic tools are weaker than the agents: they are bad at agent-agent handoffs, escalation, when to call in humans. All keys to high reliability.

5:54 PM · Mar 24, 2026

118

Read 23 replies

🧰 Builder utilities: hf-mount, sandboxed local agents, and agent-friendly storage interfaces

Developer tooling highlights included filesystem-shaped primitives (mount remote assets as local FS) and local sandbox orchestration for coding agents. Excludes MCP servers (separate category).

hf-mount turns Hugging Face Hub assets into a local filesystem

hf-mount (Hugging Face): Hugging Face introduced hf-mount, a CLI that mounts Hub assets as a local filesystem—positioned as a way to use remote storage “100x bigger than your local disk,” with read-write mounts for Storage Buckets and read-only mounts for models/datasets, per the launch blurb in hf-mount announcement and the implementation notes in mount semantics.

• Why it matters for agent-heavy workflows: it turns “agent storage” into plain file ops (read/write/ls) so existing tools can treat Hub-hosted state like local state, as described in hf-mount announcement.

clem 🤗

@ClementDelangue

Local AI is free, fast & secure! So today we're introducing hf-mount: attach any storage bucket, model or dataset from @huggingface as a local filesystem. This is a game changer, as it allows you to attach remote storage that is 100x bigger than your local machine's disk. This

2:36 PM · Mar 24, 2026

979

Read 49 replies

LiteParse benchmarks a fast, non-VLM document parser for agent context

LiteParse (LlamaIndex): following up on its earlier URL/stream parsing work (URL parsing), LlamaIndex is now pushing LiteParse as a fast, non-VLM parser that outputs an interpretable spatial representation and supports a two-step “fast parse + screenshot deep-dive” workflow. The benchmark claims an LLM-judge pass rate of 0.9497 (vs 0.8495 for Markitdown) and CLI latency of around 2.235s on a 457-page file (vs 89.324s for Markitdown), as shown in LiteParse benchmark.

• Agent-builder framing: LiteParse is being positioned as “highest quality context to AI agents” without using a vision model, while still enabling targeted page-level screenshot inspection, per LiteParse benchmark.
• Concrete downstream use: a compliance-reporting example pairs extraction/classification with agent orchestration, citing LiteParse/LlamaParse as the ingestion layer in compliance workflow screenshot.
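To put the quoted benchmark numbers in relative terms, the short calculation below derives the latency ratio, per-page throughput, and pass-rate gap. All inputs are the figures as quoted in the post; nothing here was measured independently.

```python
# Figures as quoted in the benchmark post.
LITEPARSE_S, MARKITDOWN_S = 2.235, 89.324  # CLI latency on the 457-page file, in seconds
LITEPARSE_PASS, MARKITDOWN_PASS = 0.9497, 0.8495  # LLM-judge pass rates
PAGES = 457

speedup = MARKITDOWN_S / LITEPARSE_S
pages_per_second = PAGES / LITEPARSE_S
pass_rate_gap = LITEPARSE_PASS - MARKITDOWN_PASS

print(f"speedup: {speedup:.1f}x")
print(f"throughput: {pages_per_second:.0f} pages/s")
print(f"pass-rate gap: {pass_rate_gap * 100:.1f} points")
```

On these figures the latency gap is roughly 40x, while the quality gap is about ten percentage points of judge pass rate.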

Jerry Liu

@jerryjliu0

There’s not that many fast, free, non-VLM document parsers out there: there’s PyPDF, PyMuPDF, Markitdown, OpenDataLoader. Last week, we launched LiteParse ⚡️📄: a fast, free, and non-VLM based document parser that provides the highest quality context to AI agents compared to

1:05 AM · Mar 25, 2026

143

Read 10 replies

Sandcastle proposes offline Docker sandboxes for coding agents with git patch-back

Sandcastle (mattpocockuk): Sandcastle is a TypeScript tool-in-progress for orchestrating locally sandboxed coding agents inside Docker; the design goal is “Docker Desktop as the only dependency,” 100% offline, and “no GitHub involved, only git,” with commits produced in the sandbox then patched back onto the host, as outlined in Sandcastle overview and reiterated in design constraints.

• Workflow implication: it’s aiming at a safer default execution model for agentic coding (run tools in an isolated container, then apply deltas), without tying the workflow to any specific model vendor, per Sandcastle overview.
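The "only git" patch-back idea can be illustrated with two standard git commands: `git format-patch` to serialize commits and `git am` to replay them. This is a sketch of the general technique, not Sandcastle's implementation; the function names and the way paths are passed are assumptions.

```python
import subprocess
from pathlib import Path


def export_cmd(base_ref: str) -> list[str]:
    """Command that serializes commits after `base_ref` as a mailbox-format patch series."""
    return ["git", "format-patch", "--stdout", f"{base_ref}..HEAD"]


def apply_cmd(patch_file: Path) -> list[str]:
    """Command that replays that patch series as commits on another checkout."""
    return ["git", "am", str(patch_file)]


def patch_back(sandbox_repo: Path, host_repo: Path, base_ref: str, patch_file: Path) -> None:
    """Export commits from the sandbox checkout and apply them to the host checkout.

    `patch_file` should be an absolute path, since the two commands run in different directories.
    """
    series = subprocess.run(
        export_cmd(base_ref), cwd=sandbox_repo, check=True, capture_output=True
    ).stdout
    patch_file.write_bytes(series)
    subprocess.run(apply_cmd(patch_file), cwd=host_repo, check=True)
```

Because the transfer format is a plain patch file, no remote or hosting service is needed between the container and the host.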

Matt Pocock

@mattpocockuk

Working on a tool that orchestrates locally sandboxed coding agents in TypeScript - Sandboxed in Docker - 100% offline: commits made in the sandbox get patched back to the host - Build complex workflows in Typescript - Claude, Codex, OpenCode It's called Sandcastle

11:06 AM · Mar 24, 2026

408

Read 76 replies

Virtual filesystem interfaces as an agent-friendly storage primitive

Virtual filesystem pattern: a recurring agent ergonomics idea is to map storage backends (S3/Notion/Box/custom) onto filesystem operations—read/write/ls—so agents keep working in their “fs-ops” comfort zone while avoiding bulk data copying, as argued in virtual filesystem pattern.

• Why teams care: it standardizes “where state lives” behind one interface (including memory/scratchpads between agents) and reduces custom connector surface area, per the rationale in virtual filesystem pattern.
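A minimal sketch of the pattern: define the three fs-ops as an interface and adapt a backend to it. The in-memory dictionary below stands in for a remote store such as an object bucket; `FileLike`, `KeyValueFS`, and `save_note` are hypothetical names, not LangChain APIs.

```python
from typing import Protocol


class FileLike(Protocol):
    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...
    def ls(self, prefix: str) -> list[str]: ...


class KeyValueFS:
    """Adapter that presents a flat key-value store as filesystem operations."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}  # stand-in for a remote backend

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = data

    def ls(self, prefix: str) -> list[str]:
        prefix = prefix.rstrip("/") + "/"
        return sorted(k for k in self._objects if k.startswith(prefix))


def save_note(fs: FileLike, agent: str, text: str) -> None:
    # Agent code depends only on the fs-ops interface, not on the backend.
    fs.write(f"scratch/{agent}.md", text.encode())
```

Swapping the backend means writing another adapter with the same three methods; the agent-facing code does not change.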

Viv

@Vtrivedy10

virtual filesystems! filesystems are such a great agent pattern, we lean into it pretty heavily at LangChain for data storage, memory, collaboration scratchpad between agents, more a lot of storage can “look like a filesystem” if you map fs-ops like read, write, ls to a

2:48 PM · Mar 24, 2026

Read 4 replies

🏢 OpenAI product strategy: Sora shutdown and compute reallocation toward next frontier model

Multiple reports and reactions describe OpenAI discontinuing Sora (app + API) and shifting resources toward a forthcoming frontier LLM (“Spud”) and broader ‘agent’ tooling focus. This is primarily about compute allocation and product consolidation, not media workflows.

Reports say OpenAI is shutting down Sora to reallocate compute to “Spud”

OpenAI product focus shift: Coverage and internal-report summaries say OpenAI is discontinuing Sora as a consumer app and as a developer API—and also dropping plans to support video inside ChatGPT—in order to free up compute for its next major LLM (codename “Spud”), which leadership describes as arriving in “a few weeks,” according to the WSJ summary and the Compute reallocation excerpt.

• Compute rationale: The same thread claims Sora was viewed internally as a drag on scarce GPU resources during heightened model competition, per the Compute reallocation excerpt and the Side quests framing.

• Release expectation signal: Multiple posts repeat the “very strong model” / “accelerate the economy” language around Spud, as paraphrased in the Few weeks claim and the AGI Deployment excerpt.

OpenAI is killing Sora as a standalone app, its developer version, and video inside ChatGPT. Shows they see coding, enterprise software, and agent tools as a better use of scarce compute and top researchers instead of using it on video generation which is expensive. That makes

Sora

@soraofficialapp

We’re saying goodbye to the Sora app. To everyone who created with Sora, shared it, and built community around it: thank you. What you made with Sora mattered, and we know this news is disappointing. We’ll share more soon, including timelines for the app and API and details on

9:17 PM · Mar 24, 2026

Read 13 replies

OpenAI posts a shutdown notice for the Sora app, with timelines TBD

Sora (OpenAI): The official Sora account says it’s “saying goodbye” to the Sora app and acknowledges the news is disappointing, while promising more details soon—specifically timelines for the app and API plus how users can preserve their work, as shown in the Shutdown screenshot and reiterated in the Edited shutdown message.

The operationally relevant detail for teams is that the announcement is explicit about forthcoming migration/preservation guidance, but does not yet specify dates or data-export guarantees.

TestingCatalog News 🗞

@testingcatalog

BREAKING 🚨: OPENAI DISCONTINUES SORA, MORE DETAILS WILL BE SHARED LATER. Video generation is moving into an upcoming super app soon?!

Sora

@soraofficialapp

We’re saying goodbye to Sora. To everyone who created with Sora, shared it, and built community around it: thank you. What you made with Sora mattered, and we know this news is disappointing. We’ll share more soon, including timelines for the app and API and details on

8:01 PM · Mar 24, 2026

415

Read 24 replies

OpenAI reportedly renames its product org to “AGI Deployment” amid leadership reshuffle

OpenAI org structure: A report recap claims Sam Altman has stepped back from direct control of safety and security orgs—moving safety under CRO Mark Chen and security under President Greg Brockman—while OpenAI renames its product org to “AGI Deployment,” as quoted in the Org changes recap and highlighted by the AGI Deployment excerpt.

• What Altman is doing instead: The same reporting says Altman is focusing on capital raising, semiconductor supply chains, and building datacenters “at unprecedented scale,” per the Org changes recap and the Spud milestone recap.

Tibor Blaho

@btibor91

Sam Altman gave up direct control of OpenAI's safety and security teams, moving safety under CRO Mark Chen and security under president Greg Brockman, so he can focus on raising money, supply chains and building data centers at a massive scale OpenAI finished pretraining its

Stephanie Palazzolo

@steph_palazzolo

Breaking: OpenAI is canning Sora (mobile app, API and video capabilities in ChatGPT). It’s finished training its latest model, codenamed Spud, as CEO Sam Altman shifts his reports. w/ @amir theinformation.com/articles/opena…

8:29 PM · Mar 24, 2026

1.1K

Read 42 replies

Sora research is said to pivot to world models aimed at robotics

Sora research (OpenAI): Reporting snippets claim Sora’s research team is being redirected from consumer video productization toward “systems that deeply understand the world by learning to simulate arbitrary environments,” with an emphasis on longer-term world simulation for robotics, as shown in the World-model excerpt and echoed in the WSJ summary.

This frames Sora less as a sunset of video R&D and more as a rebrand/repurposing of the underlying work toward world modeling.

Chubby♨️

@kimmonismus

OpenAI's Sora team is now working on world-models - they prioritize longer-term world simulation research especially as it pertains to robotics. tl;dr what we know so far: - Sora has been cancelled because they needed the compute for their new LLM - they renamed product

Chubby♨️

@kimmonismus

Either OpenAI officially achieved AGI or this is the biggest troll move ever: - they rename product organization to "AGI Deployment" - Altman says the next LLM is a "very strong model" - it very much accelerate the economy Quote: "Altman also said that the company would be

9:02 PM · Mar 24, 2026

594

Read 42 replies

Sora postmortems focus on retention collapse and the creator power law

Sora adoption dynamics: A long creator-side post argues Sora usage “collapsed to zero” for many users after the initial novelty, and that the economics are rough because content creation is power-law distributed—“95%+ of users just want to passively consume”—making churny subscription monetization unattractive for a compute-heavy product, according to the Creator postmortem.

• What creators wanted: The same post suggests high-output creators gravitate toward more complex, power-user workflows rather than a constrained text box and short clips, as described in the Creator postmortem.

Jeffrey Emanuel

@doodlestein

I had a lot of fun using Sora and got a lot of laughs with absurd videos of me in various situations. But like everyone else, I kind of got it out of my system after a couple weeks. Not to mention that my family got sick of seeing them. And so my usage collapsed to zero. And

10:48 PM · Mar 24, 2026

Read 12 replies

A public request asks OpenAI to open-source Sora as it winds down

Open-source ask (Sora): Hugging Face CEO Clément Delangue publicly asks whether OpenAI would open-source Sora as the app is shut down, framing it as a meaningful contribution to the field and a way to preserve the work of the team, per the Open-source request.

No OpenAI response appears in today’s tweet set, and the request does not cite licensing, weights, or a specific artifact (model, dataset, tooling) that would be released.

clem 🤗

@ClementDelangue

Would be so cool if OpenAI open-sourced Sora as they're shutting down the app! Would be an amazing contribution to the field and make all the efforts of the teams working on it even more meaningful!

9:00 PM · Mar 24, 2026

1.6K

Read 105 replies

🖌️ AI-first design & prototyping tools (non-Figma): editable canvases, site-to-layers, and wireframe loops

A wave of design/prototyping products aimed at builders: importing live sites into editable layers, agent-driven layout editing, and “design agent with taste” pitches. Excludes Figma MCP specifics (covered separately).

Google demos a Flash-Lite browser that generates each web page in real time

Gemini 3.1 Flash-Lite (Google DeepMind): Google demoed a browser concept where pages are generated on-the-fly as you click and navigate—treating HTML/CSS as a streaming model output rather than a prebuilt site, as shown in the DeepMind demo.

A second clip shows the same idea applied to “imagined” historical UIs (e.g., “facebook in 2004”), per the Alt browsing demo, which frames this more as a prototyping surface than a faithful web renderer.

Google DeepMind

@GoogleDeepMind

Watch how fast Gemini 3.1 Flash-Lite can generate websites. ⚡ This browser creates each page in real-time as you click, search, and navigate. Give it a try → goo.gle/4t9In1R

4:40 PM · Mar 24, 2026

1.6K

Read 81 replies

Moda launches a URL-to-brand design agent that outputs editable slides and assets

Moda (Moda): Moda launched a design platform that imports brand identity from a website URL and generates fully editable slides, social posts, and one-pagers on a canvas—positioned explicitly as a “design agent with taste,” per the Funding tweet and the Product walkthrough.

• Brand in, slides out: The product page describes URL-based brand import and export targets including Google Slides and PowerPoint, as outlined on the Product page.
• Builder signal: LangChain notes it’s built with “Deep Agents” and uses LangSmith for observability, according to the Stack note.

Anvisha

@anvisha

We raised $7.5M to kill AI slop. Introducing Moda: the world's first design agent with taste. RT+ comment “Moda” and we’ll design your brand for FREE.

4:04 PM · Mar 24, 2026

5.5K

Read 1.8K replies

Paper Snapshot imports a live website into editable layers (no screenshots)

Paper Snapshot (Paper): Paper added a “snapshot” flow that pulls a live website into the editor as editable layers, aiming to preserve structure by using the site’s real HTML/CSS instead of a static screenshot, as shown in the Feature announcement.

The follow-up post suggests it’s already usable as a starting point for rebuilding/iterating on existing marketing pages, per the Try it prompt.

Stephen Haney

@stephenhaney

Stay ahead Today we're announcing Paper Snapshot Snapshot your live website and paste it into Paper as editable layers • start from your real site • no more screenshots • uses real html/css What will you make? Link in replies 🎶

4:50 PM · Mar 24, 2026

2.3K

Read 144 replies

Agentation adds Layout Mode for on-page wireframing and agent feedback loops

Layout Mode (Agentation): Agentation shipped a new mode for directly rearranging and resizing elements on the page, adding components, and generating structured design feedback intended to feed downstream agents, as demonstrated in the Layout mode launch.

The product write-up describes the output as structured placement/annotation data (coordinates, sizes, labels) that can be passed to an agent workflow, as detailed in the Feature write-up.
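To make "structured placement/annotation data" concrete, here is a sketch of what such a payload could look like. The schema is entirely hypothetical: the field names and JSON shape are assumptions for illustration and are not Agentation's actual output format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Placement:
    label: str      # human-readable name of the element
    x: int          # left offset in pixels
    y: int          # top offset in pixels
    width: int
    height: int
    note: str = ""  # free-form design feedback for the agent


def to_agent_payload(placements: list[Placement]) -> str:
    """Serialize placements as JSON that an agent prompt or tool call could consume."""
    return json.dumps({"placements": [asdict(p) for p in placements]}, indent=2)


print(to_agent_payload([Placement("hero", 0, 0, 1200, 400, "move CTA above the fold")]))
```

Coordinates plus labels give a downstream agent something it can act on directly, unlike a screenshot that it would first have to interpret.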

Benji Taylor

@benjitaylor

Introducing Layout Mode for Agentation, a new way to explore and wireframe directly on the page. Rearrange and resize existing elements, add new components, and generate structured design feedback for your agent.

5:35 PM · Mar 24, 2026

1.8K

Read 90 replies

💼 Funding & org moves: OpenAI Foundation spend, SoftBank leverage, and new AI labs

Business/organization updates with operational relevance: OpenAI Foundation expansion and spending commitment, financing pressure around big AI bets, and new well-funded labs/hardware efforts. Excludes OpenAI’s Sora/Spud strategy (separate category).

OpenAI Foundation commits $1B in 12 months and formalizes an “AI Resilience” org line

OpenAI Foundation (OpenAI): The Foundation published a new mission/operations update that includes a commitment to spend at least $1B over the next year, positioning it as a society-wide effort around AI benefits and risks, as outlined in the Foundation spend pledge and detailed in the Foundation update. It also sets named leadership over “AI Resilience,” with Wojciech Zaremba moving into that role, alongside new hires/transitions for operations and finance, as listed in the Foundation spend pledge and summarized in the Update recap.

• Leadership and org design: Zaremba transitions to Head of AI Resilience, with Jacob Trefethen named Head of life sciences and curing diseases in the same update—plus shifts for civil society/philanthropy and additions including a CFO and director of operations, according to the Foundation spend pledge and Exec team summary.

The update is high-signal for analysts because it turns “safety” into a budgeted program and a staffed org line (resilience) rather than a generic principle, per the Foundation spend pledge and Foundation update.

Sam Altman

@sama

AI will help discover new science, such as cures for diseases, which is perhaps the most important way to increase quality of life long-term. AI will also present new threats to society that we have to address. No company can sufficiently mitigate these on their own; we will

5:01 PM · Mar 24, 2026

5.4K

Read 1.4K replies

Figure founder Brett Adcock launches Hark, an AI lab targeting “personal intelligence” with custom devices

Hark (Brett Adcock): After ~8 months in stealth, Adcock announced a new AI lab called Hark aimed at a proactive multimodal “personal intelligence” system that pairs foundation models with bespoke hardware, as described in the Hark launch description and expanded in the Team and compute claims.

• Capital, team, and compute: The announcement claims $100M of Adcock’s own funding, 45+ engineers/designers, and thousands of B200 GPUs expected online by April, with a first model targeted for summer, according to the Team and compute claims.

• Product thesis: The pitch frames the device layer as the “interface” for a system with highly personalized memory and multimodal inputs/outputs—speech, text, vision—per the Interface plus memory framing and Hark launch description.

The immediate analyst signal is another well-funded entrant choosing an end-to-end stack (models plus hardware) for consumer-facing agent experiences, with unusually explicit near-term GPU sourcing claims in the Team and compute claims.

Wes Roth

@WesRoth

After eight months in stealth, Brett Adcock, the billionaire founder of the $39 billion humanoid robotics company Figure AI and aviation startup Archer announced his newest venture: an artificial intelligence lab named Hark. Hark is setting out to build a highly proactive,

Brett Adcock

@adcock_brett

Today I'm excited to introduce Hark, a new artificial intelligence lab building the most advanced, personal intelligence in the world We've been in stealth for 8 months, assembling one of the greatest AI and hardware teams on the planet I want to explain why I started Hark and

12:00 AM · Mar 25, 2026

188

Read 4 replies

SoftBank reportedly pushes its own leverage cap to fund a new ~$30B OpenAI bet

SoftBank financing (FT): A Financial Times report says SoftBank is pushing up against its self-imposed 25% loan-to-value ceiling to finance a reported ~$30B OpenAI investment, increasing borrowing against assets whose values are hard to mark in real time, as described in the FT leverage summary.

For AI leaders tracking capital availability, the key operational point is that this is debt capacity being used to underwrite AI bets (and, indirectly, compute buildout and model rollouts), with the risk profile tied to private-asset valuation and potential forced de-leveraging if marks move, per the FT leverage summary.
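The mechanics of a loan-to-value ceiling are easy to see with a small calculation. The numbers below are illustrative only and are not SoftBank's actual debt or asset figures; the 25% ceiling is the one quoted in the report.

```python
def loan_to_value(debt: float, asset_value: float) -> float:
    """LTV ratio: debt divided by the value of the assets backing it."""
    return debt / asset_value


def drawdown_to_ceiling(debt: float, asset_value: float, ceiling: float) -> float:
    """Fractional fall in asset value at which LTV reaches `ceiling`, with debt held fixed."""
    return 1 - debt / (ceiling * asset_value)


# Illustrative numbers only; these are not SoftBank's actual figures.
debt, assets, ceiling = 20.0, 100.0, 0.25
print(f"LTV today: {loan_to_value(debt, assets):.0%}")
print(f"LTV after a 10% asset decline: {loan_to_value(debt, assets * 0.9):.1%}")
print(f"Decline that hits the ceiling: {drawdown_to_ceiling(debt, assets, ceiling):.0%}")
```

The closer the starting ratio is to the ceiling, the smaller the asset decline needed to breach it, which is the de-leveraging risk the report points at.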

FT: SoftBank is pushing its own 25% loan-to-value ceiling to fund a new $30B OpenAI bet, which means it is borrowing more aggressively against the market value of its assets. Loan-to-value is a pressure gauge: if asset values fall while debt stays fixed, the ratio jumps, and

12:52 PM · Mar 24, 2026

Read 13 replies

📏 Benchmarks & measurement: new reasoning tests, SWE evals, and “review doesn’t scale” claims

Eval/benchmark chatter spans interactive reasoning (ARC-AGI-3), new SWE benchmarks, and ongoing concerns that “AI writes, humans review” breaks down at scale. Excludes pure research-paper summaries (separate category).

ARC-AGI-3 will test interactive reasoning across 1,000+ levels and 150+ environments

ARC-AGI-3 (ARC Prize): ARC-AGI-3 is slated to launch March 25, 2026 as an interactive reasoning benchmark—1,000+ levels across 150+ environments that require exploration, learning, planning, and rule discovery with no instructions, per the Launch announcement.

The same post anchors expected “ceiling” context by citing prior best-of results—Gemini 3.1 Pro at 98% on ARC-AGI-1 and Gemini 3 Deep Think at 84.6% on ARC-AGI-2—as background for how hard ARC-AGI-3 intends to be, as stated in the Launch announcement.

AiBattle

@AiBattle_

ARC-AGI-3 launches tomorrow - The first interactive reasoning benchmark built to test human-like intelligence in AI - 1,000+ levels across 150+ environments requiring exploration, learning, planning, and adaptation - Video-game-like tasks with no instructions, requiring

3:20 PM · Mar 24, 2026

580

Read 18 replies

Cognition and Mercor announce APEX-SWE for realistic SWE evaluation

APEX-SWE (Cognition x Mercor): Cognition says it collaborated with Mercor on APEX-SWE, a new benchmark aimed at evaluating models on “realistic software engineering tasks,” as announced in the Benchmark announcement.

The tweet doesn’t include task format, scoring methodology, or a public harness link yet, so comparability to SWE-bench-style setups is still unclear based on the Benchmark announcement.

Cognition

@cognition

We've collaborated with @mercor_ai on APEX-SWE, a new benchmark that evaluates AI models on realistic software engineering tasks.

adarsh

@adarsh_exe

Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with @cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems

5:52 PM · Mar 24, 2026

120

Read 9 replies

LisanBench correlation with ARC-AGI-1/2 fuels debate about benchmark “farming”

LisanBench (benchmark discourse): New correlation analysis between LisanBench and ARC-AGI-1/2 is being used as evidence in a “benchmark farming” argument—claiming Sonnet/Opus 4.6 may be over-optimized for LisanBench—based on correlations reported as 0.8741 (ARC-AGI-1) and 0.8244 (ARC-AGI-2) in the Correlation stats.

The same post flags uncertainty about the conclusion—“maybe ARC-AGI-1 is also just a cooked benchmark,” while noting METR and ARC-AGI-2 don’t show as drastic an effect—per the Correlation stats.

Lisan al Gaib

@scaling01

LisanBench correlation with ARC-AGI-1 and 2 (because tomorrow is ARC-AGI-3 day) ARC-AGI-1: 0.8741 - 95% CI: [0.7585, 0.9347] ARC-AGI-2: 0.8244 - 95% CI: [0.6642, 0.9191] ARC-AGI-1 looks like it confirms that Sonnet and Opus 4.6 are farming points on LisanBench*

12:29 AM · Mar 25, 2026

LisanBench vs METR time horizons shows a very high correlation in a small sample

METR horizons vs LisanBench (measurement chatter): A small-sample comparison claims a Spearman ρ = 0.965 between LisanBench average score and METR “p50 horizon,” with caveats that sample sizes are small and METR used high-compute settings for some GPT models, according to the Correlation plot.

A follow-up corrects an axis labeling mistake—“y-axis should be minutes”—and also notes uncertainty about the reasoning budget used for Opus 4.5/4.6, as stated in the Axis correction.
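For readers who want to reproduce this kind of comparison, Spearman's ρ is the Pearson correlation of the ranks, so it rewards any monotonic relationship rather than a linear one. The sketch below is a dependency-free implementation with average ranks for ties; the sample data is made up for illustration and is not the LisanBench or METR data.

```python
from statistics import mean


def ranks(values: list[float]) -> list[float]:
    """Average ranks (1-based), so tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined (division by zero) if either input is constant


# Made-up scores: monotonic but strongly non-linear.
print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 100]))
```

With a handful of models, a single rank swap moves ρ noticeably, which is why the small-sample caveat in the post matters.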

Lisan al Gaib

@scaling01

METR time horizons correlation with LisanBench samples sizes are pretty small and METR used high compute setting for GPT-5 and GPT-5.2 and not medium I don't know what thinking budget they used for Opus 4.5 and 4.6 the non-reasoning models in the corner look good tho

12:42 AM · Mar 25, 2026

Read 3 replies

PrinzBench adds GPT-5.4 Pro (Extended) and reports a new 79/99 top score

PrinzBench (community benchmark): GPT-5.4 Pro (Extended) was added to PrinzBench and reportedly scored 79/99, beating GPT-5.4 (xhigh) by 10 points, according to the Benchmark update.

The benchmark author notes they “had to throw out a lot of questions” that turned out not to be difficult for models, implying rapid saturation pressure on the task set, as stated in the Benchmark construction note.

prinz

@deredleritt3r

Replying to @corbtt

Lots. I had to throw out a lot of questions that I thought would be difficult for the models, but in fact were not.

5:10 AM · Mar 25, 2026

LisanBench vs AidanBench correlation shared, with a claimed “Gemini bias” effect

AidanBench vs LisanBench (measurement chatter): Another correlation plot reports Spearman ρ = 0.777 (n = 35) between LisanBench average score and AidanBench total, with the author attributing lower correlation to a previously identified “Gemini bias,” per the Correlation chart.

This is being framed as validation of benchmark-specific model effects rather than a clean “single capability axis,” as described in the Correlation chart.

Lisan al Gaib

@scaling01

AidanBench correlation with LisanBench

it's validating that the Gemini bias I identified in my AidanBench analysis from 1-1.5 years ago shows up here, which lowers correlation quite a bit

Lisan al Gaib

@scaling01

12:47 AM · Mar 25, 2026

📚 Docs-for-agents devex: content negotiation, llms.txt skepticism, and discoverability

A smaller but concrete devex thread: teams are iterating on agent-facing doc surfaces (content negotiation, nav surfacing) while calling out weak defaults like llms.txt. Excludes repo-local steering files (covered under workflows).

Sentry MCP minisite adds content negotiation for agent-friendly docs

Sentry MCP (Sentry): The Sentry MCP minisite now serves an agent-optimized experience via HTTP content negotiation, returning markdown when clients request it. It is a concrete move away from relying on llms.txt, which the team calls “useless” in this context, as noted in the Content negotiation change and the linked Agent docs note.

• Practical devex change: by varying responses on the Accept header, agent clients can fetch concise setup and usage info without scraping the full human-oriented site, as implied by the Implementation links.
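To make the Accept-header mechanism concrete, here is a minimal sketch of server-side selection between a markdown and an HTML representation. It is not Sentry's implementation: the function names are hypothetical, and the parsing is simplified (it honors q-values but ignores media-type specificity rules).

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0                               # default weight when no q is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0                   # treat a malformed weight as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(entries, key=lambda e: e[1], reverse=True)

def choose_representation(accept_header: str) -> str:
    """Return 'text/markdown' for clients that prefer it, else 'text/html'."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "text/markdown":
            return "text/markdown"
        if media_type in ("text/html", "text/*", "*/*"):
            return "text/html"
    return "text/html"                        # fallback for browsers and unknown clients

print(choose_representation("text/markdown"))
print(choose_representation("text/html,application/xhtml+xml,*/*;q=0.8"))
```

A server that negotiates this way should also send `Vary: Accept` so caches keep the markdown and HTML variants apart.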

Sentry adds an Integrations nav section to make MCP and CLI discoverable

Sentry (Sentry): A discoverability fix landed after the question “how do people know MCP exists?” Sentry is adding an Integrations section in org settings that surfaces an MCP & CLI page (plus integrations pages), as shown in the MCP discoverability note and the corresponding Navigation PR.

• Why it matters: it turns MCP setup from “read the docs somewhere” into a first-class in-product entry point, which tends to be the difference between agents getting used and agents being forgotten.

🔎 Retrieval for agents: late interaction, hybrid grep, and “deep research is retrieval” framing

Retrieval remained a core builder theme: late-interaction (ColBERT-style) momentum, arguments that deep research bottlenecks on evidence gathering, and codebase file-search stack discussions. Excludes document parsing tools (in dev tools).

BrowseComp-Plus: deep research bottlenecks on getting evidence into context

BrowseComp-Plus (Hornet): A new write-up argues BrowseComp-Plus is “a deep research benchmark” on paper but a retrieval benchmark in disguise, because the hardest step is usually getting the right evidence into context—not reasoning after it’s retrieved, as stated in Retrieval problem framing and expanded in the Blog post.

This framing matters for eval design: if toolchains change retrieval quality/latency, they can swing “agent reasoning” outcomes without any model changes.
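A back-of-the-envelope decomposition shows why. If the model's accuracy given the right evidence is held fixed, end-to-end accuracy still moves with retrieval recall. The function name and every number below are illustrative assumptions, not BrowseComp-Plus results.

```python
def end_to_end_accuracy(recall, acc_with_evidence, acc_without_evidence=0.0):
    """Law of total probability over whether the needed evidence reached context."""
    return recall * acc_with_evidence + (1 - recall) * acc_without_evidence

# Same hypothetical model in both cases: only retrieval recall changes.
weak = end_to_end_accuracy(recall=0.50, acc_with_evidence=0.80, acc_without_evidence=0.05)
strong = end_to_end_accuracy(recall=0.90, acc_with_evidence=0.80, acc_without_evidence=0.05)
print(f"weak retriever: {weak:.3f}, strong retriever: {strong:.3f}")
```

Under these assumed numbers the score gap comes entirely from the retriever, which is the confound the write-up is pointing at.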

Jo Kristian Bergum

@jobergum

New post: BrowseComp-Plus is presented as a benchmark for deep research agents. It is also a retrieval benchmark in disguise. The hardest part is often not reasoning once the evidence is in context, but getting the right evidence there at all. hornet.dev/blog/deep-rese…

11:38 AM · Mar 24, 2026

ColBERT fine-tuning story: MaxSim updates fewer tokens, making training less noisy

ColBERT training dynamics: One argument for why ColBERT-style late interaction is comparatively easy to fine-tune is that MaxSim selects only a small set of token matches to update and leaves the other document-token representations stable, so updates are more “surgical,” as explained in Fine-tuning intuition. The MLR paper covers the broader dual-encoder training tradeoff.
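The selection effect is easy to see in a toy MaxSim computation. The sketch below uses hand-written 2-d vectors in place of real token embeddings and a hypothetical `maxsim` helper; it shows which document tokens each query token selects. Because the score is a sum of maxima, its direct gradient with respect to document-token embeddings is nonzero only for the selected tokens. In real training, shared encoder weights still couple tokens, so this is intuition, not a full account.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: each query token keeps only its best doc token."""
    score = 0.0
    selected = []                      # index of the winning doc token per query token
    for q in query_vecs:
        sims = [dot(q, d) for d in doc_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        selected.append(best)
        score += sims[best]
    return score, selected

# Toy embeddings (assumed values): 2 query tokens, 4 document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [-0.3, 0.1]]

score, selected = maxsim(query, doc)
touched = set(selected)                        # doc tokens that contribute to the score
untouched = set(range(len(doc))) - touched     # doc tokens with zero direct gradient
print(score, selected, sorted(untouched))
```

A single-vector model, by contrast, pools every token into one embedding, so any update to that embedding shifts the representation of the whole document.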

Antoine Chaffin

@antoine_chaffin

ColBERT models are very easy to train/fine-tune proceedings.mlr.press/v162/menon22a.… This paper highlights that one of the issue of dense bi-encoder training against cross encoder is that the factorized expression of cross encoder allows to only modify the couple representation and not be …

Connor Shorten

@CShorten30

Late Interaction is not only great for inference, but also for training!! 🏭 Fine-tuning single-vector embedding models hasn’t really taken off… Late Interaction could change this. One of my favorite takeaways from the podcast, here is a clip explaining this further in ~1

2:18 PM · Mar 24, 2026

Coding-agent retrieval pattern: ColBERT file search paired with an RLM router

ColBERT file search for coding agents: A concrete workflow claim is that swapping file search to a late-interaction model changes agent outcomes enough that “if your coding agent is not an RLM with ColBERT file search, you’re ngmi,” as stated in File search results claim, with a follow-up push for an RLM+ColBERT+ColGrep stack in Collab suggestion.

This is less about “better reasoning” and more about moving higher-signal code snippets into context earlier, which shortens agent iteration loops.

Omar Khattab

@lateinteraction

Look at these results carefully. Codex and Gemini 3, with gemini file search and codex default tools, versus with @mixedbread’s new late interaction model. Soon enough, if your coding agent is not an RLM with ColBERT file search, you’re ngmi.

Mixedbread

@mixedbreadai

For Agentic tasks, Oracle-level performance is the maximum performance a system can achieve, assuming it is able to retrieve all relevant documents perfectly, every time. We're proud to show that Mixedbread Search approaches the Oracle on multiple knowledge intensive benchmarks.

11:07 PM · Mar 24, 2026

ColGrep: local regex speed with late-interaction semantics for agent workflows

ColGrep (hybrid retrieval): The ColGrep idea is presented as a pragmatic middle layer for coding agents—keep regex as the backbone (agent-friendly, precise), but add semantic matching via late interaction; the case stresses that local indexes avoid privacy and freshness issues, as argued in Local index direction and detailed with agent-search rationale in ColGrep for agents.

This also bears on the adjacent “MCP vs CLI” debate: retrieval quality often comes down to what evidence you can fetch cheaply and locally, not whether the agent can execute arbitrary tools.
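The "regex backbone plus semantic layer" idea can be sketched as a two-stage search: a regex decides which lines are eligible, then a scorer orders them. This is not ColGrep's actual implementation. The function names are hypothetical, and a simple word-overlap scorer stands in for a late-interaction model so the example runs locally with no dependencies.

```python
import re

def toy_semantic_score(query: str, text: str) -> float:
    """Stand-in for a late-interaction scorer: fraction of query words found in text."""
    q_words = set(re.findall(r"\w+", query.lower()))
    t_words = set(re.findall(r"\w+", text.lower()))
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def hybrid_search(pattern, query, lines, scorer=toy_semantic_score, top_k=3):
    """Regex decides what is eligible; the scorer decides the order."""
    regex = re.compile(pattern)
    hits = [(i, line) for i, line in enumerate(lines) if regex.search(line)]
    ranked = sorted(hits, key=lambda hit: scorer(query, hit[1]), reverse=True)
    return ranked[:top_k]

# A tiny in-memory "index" (made-up code lines).
corpus = [
    "def load_config(path): return json.load(open(path))",
    "def save_config(path, cfg): json.dump(cfg, open(path, 'w'))",
    "def retry(fn, attempts=3): ...",
    "# config is loaded once at startup",
]

results = hybrid_search(r"^def \w*config", "load config from path", corpus)
for line_no, line in results:
    print(line_no, line)
```

Keeping the regex as the gate preserves the precise, predictable behavior agents rely on, while the scorer only reorders what the pattern already admitted.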

Antoine Chaffin

@antoine_chaffin

Very cool blog post simply explaining smart ideas with great visualizations, love it! Also, it seems that Cursor will slowly converge towards ColGrep (which is logical): an hybrid of regex augmented by semantic search that is efficient enough to run locally

Cursor

@cursor_ai

Cursor can now search millions of files and find results in milliseconds. This dramatically speeds up how fast agents complete tasks. We're sharing how we built Instant Grep, including the algorithms and tradeoffs behind the design.

9:00 AM · Mar 24, 2026

Late-interaction retrieval argues the real enemy is single-vector search, not grep

Late interaction retrieval: The “grep-is-all-you-need” backlash is framed as a category error: people equate “neural search” with single-vector embedding retrieval, then conclude that neural search is bad. The counterclaim is that late interaction (ColBERT-style) has been winning for years, summed up by the “late interaction can’t stop winning” quote highlighted in Grep vs neural search take.

The practical implication is that agent retrieval stacks should treat “semantic search” as multi-vector by default when the task is iterative and keyword-ish (code search, investigative retrieval), rather than trying to replace grep outright.

Omar Khattab

@lateinteraction

The “grep-is-all-you-need” nonsense arguments arise from the fact that too many people think neural search means single-vector IR, which do in fact suck. But we’ve known that since 2019. Quoting @aaxsh18, CEO of Mixedbread: > late interaction cant stop winning

Mixedbread

@mixedbreadai

5:00 PM · Mar 24, 2026

PyLate lands in MTEB, and late-interaction models keep taking top slots

PyLate (MTEB integration): PyLate is reported as merged into MTEB so late-interaction models can be run and compared on the “official” benchmark harness and leaderboard surfaces, as announced in PyLate merged to MTEB.

• Code retrieval deltas: One claim highlighted is a ~150M LateOn-Code model beating “Gemini embedding” on a Borda-style metric, per MTEB code comparison.
• New small-model entry: ColBERT-Zero is described as taking the top spot under 150M parameters, according to ColBERT-Zero note.
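For context on the "Borda-style metric" mentioned above: a Borda count ranks models by how many competitors they beat on each task, then sums across tasks, so it rewards consistency over a single outlier score. The sketch below is one simple variant with made-up models and scores. The exact variant and tie handling used on the leaderboard are not stated in the posts and are assumptions here.

```python
def borda_ranking(scores_by_task):
    """Aggregate per-task scores into a Borda count: on each task a model
    earns one point for every model it outscores."""
    points = {}
    for task_scores in scores_by_task.values():
        for model, score in task_scores.items():
            beaten = sum(1 for other, s in task_scores.items()
                         if other != model and s < score)
            points[model] = points.get(model, 0) + beaten
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical models and scores, for illustration only.
scores = {
    "task_a": {"model_x": 0.71, "model_y": 0.65, "model_z": 0.60},
    "task_b": {"model_x": 0.40, "model_y": 0.55, "model_z": 0.35},
    "task_c": {"model_x": 0.90, "model_y": 0.80, "model_z": 0.85},
}

ranking = borda_ranking(scores)
print(ranking)
```

Because only relative order within each task matters, a small model can outrank a larger one on this metric without winning by a large margin anywhere.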

Antoine Chaffin

@antoine_chaffin