FlashCompact hits 33k tok/s – compacts 200k context in 1.5s

GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it’s 2x faster than GPT-5 mini. openai.com/index/introduc…

5:08 PM · Mar 17, 2026

GPT‑5.4 mini posts near-flagship scores on SWE‑Bench Pro and OSWorld

Benchmarks (OpenAI): OpenAI published a head-to-head table for GPT‑5.4 / GPT‑5.4 mini / GPT‑5.4 nano across SWE‑Bench Pro, Terminal‑Bench 2.0, OSWorld‑Verified, MCP Atlas, and GPQA Diamond in the benchmarks table, including competitor reference points for Claude Haiku 4.5 and Gemini 3 Flash.

Coding + tool use deltas: On SWE‑Bench Pro (Public), OpenAI reports 57.7% (GPT‑5.4) vs 54.4% (mini) vs 52.4% (nano), as shown in the benchmarks table and repeated in the benchmarks repost. Terminal‑Bench 2.0 is 75.1% / 60.0% / 46.3% for GPT‑5.4/mini/nano, again per the benchmarks table.

Computer use split: On OSWorld‑Verified, GPT‑5.4 is shown at 75.0% and mini at 72.1%, while nano trails at 39.0%, as highlighted in the osworld chart and in the main benchmarks table. A separate community recap flags the gap explicitly (“do not use nano for computer use”), as written in the usage note.

Reasoning effort note: Community screenshots and reposts indicate mini and nano can be run with an xhigh compute setting, with a consolidated score table shown in the xhigh compute table. The benchmark table itself notes OpenAI models were run with maximum available reasoning effort, per the footnote visible in the benchmarks table.

OpenAI

@OpenAI

Builders treat GPT‑5.4 mini as a subagent tier; nano as bulk labeling economics

Usage + sentiment (GPT‑5.4 mini/nano): Early builder reactions cluster around using mini to keep multi-agent workflows responsive and using nano for high-volume, low-stakes extraction/labeling—while pricing discourse is mixed.

Subagent economics and “in-the-loop” speed: OpenAI and OpenAIDevs emphasize that mini is built for subagents and high-throughput coding, and in Codex it consumes ~30% of the GPT‑5.4 quota, enabling “~3.3× more usage,” per the API details and the Codex quota note. OpenRouter’s early testing frames mini’s speedup as helping “stay in the loop” for agent tasks, as described in the OpenRouter availability.

Bulk multimodal math that now pencils out: Simon Willison shared a concrete cost model where nano could describe a 76,000-photo library for ~$52, as computed in the cost calculation with details in his write-up at cost write-up.

Representative split reactions: Some praise mini as “a wildly capable model” in the Codex quota note, while others call it “dead on arrival” due to price/perf comparisons against Kimi K2.5 in the price critique (including a competitor pricing/throughput screenshot). Separate complaints focus on the absolute price jump versus prior mini/nano tiers, as stated in the price hike complaint.
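
The quota and cost figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the three constants are the numbers quoted in this section, and the function names are made up for illustration.

```python
# Figures quoted in the text above; everything else is derived arithmetic.
MINI_QUOTA_FRACTION = 0.30   # mini uses ~30% of the GPT-5.4 quota in Codex
PHOTO_COUNT = 76_000         # size of the photo library in the cost model
TOTAL_COST_USD = 52.0        # approximate total quoted for nano


def usage_multiplier(quota_fraction: float) -> float:
    """How many times more work fits into the same quota."""
    return 1.0 / quota_fraction


def cost_per_item(total_usd: float, items: int) -> float:
    """Average cost per item implied by a quoted total."""
    return total_usd / items


if __name__ == "__main__":
    print(f"usage multiplier: {usage_multiplier(MINI_QUOTA_FRACTION):.2f}x")
    print(f"cost per photo: ${cost_per_item(TOTAL_COST_USD, PHOTO_COUNT):.5f}")
```

Run as a script, this prints a multiplier of 3.33x (consistent with the quoted “~3.3×”) and a per-photo cost of about $0.00068.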

OpenAI Developers

@OpenAIDevs

Replying to @OpenAIDevs

GPT-5.4 mini is available today in the API, Codex, and ChatGPT. In the API, it has a 400k context window. In Codex, it uses only 30% of the GPT-5.4 quota, letting you handle simpler coding tasks for about one-third of the cost. GPT-5.4 nano is only available in the API.

5:09 PM · Mar 17, 2026

🧑‍💻 Codex in practice: subagent ops, limits, and orchestration habits

Continues yesterday’s subagent narrative with practitioner-level operations: rate-limit pain, orchestrator patterns, and “finish criteria / verify with tools” instruction blocks. Excludes GPT‑5.4 mini/nano model-release details (covered in feature).

Codex instruction pattern: define done, run tools, then report plainly

Codex (OpenAI): A reusable custom-instructions block is circulating that splits “technical work” from “human-readable reporting”; it asks the agent to (1) define finishing criteria up front, (2) self-verify by running tests/tools and checking outputs, and (3) explain results in plain English rather than code-speak, as written in Custom instructions block.

The emphasis is on keeping the user out of the iteration loop (“only come back when you’ve confirmed things work”), per Custom instructions block, which pairs naturally with subagent-heavy workflows where review bandwidth becomes the constraint.

Matt Shumer

@mattshumer_

Add this to your Codex custom instructions for a way better experience: "When communicating your results back to me, explain what you did and what happened in plain, clear English. Avoid jargon, technical implementation details, and code-speak in your final responses. Write as …

5:02 PM · Mar 17, 2026

Continuous prod-to-green swarms with fixed subagent lanes

Codex subagents (OpenAI): A more formal incident/fix pattern is being shared as a prompt: keep the immediate blocker local, but maintain a small set of persistent subagent lanes (prod monitor, staging shepherd, pathology investigator, fix worker) and require concrete outputs like “likely root cause,” “smallest safe fix,” and “commit SHA,” as specified in Prod-to-green prompt.

A notable operational detail is the instruction to reuse subagents via send_input instead of respawning, which treats subagents as long-lived lanes rather than one-shot calls, per Prod-to-green prompt.
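
The “reuse a lane instead of respawning” idea can be sketched as a small pool keyed by lane name. This is an illustration under assumptions, not Codex's API: `Lane`, `LanePool`, and `dispatch` are hypothetical names, and `send_input` here only appends to an in-memory inbox rather than talking to a real subagent.

```python
from dataclasses import dataclass, field


@dataclass
class Lane:
    """Hypothetical stand-in for a long-lived subagent lane."""
    name: str
    inbox: list[str] = field(default_factory=list)

    def send_input(self, message: str) -> None:
        # A real lane would forward this to a running subagent;
        # here we only record the message.
        self.inbox.append(message)


class LanePool:
    """Creates each lane once, then reuses it for later messages."""

    def __init__(self) -> None:
        self._lanes: dict[str, Lane] = {}
        self.spawn_count = 0

    def dispatch(self, lane_name: str, message: str) -> Lane:
        lane = self._lanes.get(lane_name)
        if lane is None:
            # First use of this lane: spawn it exactly once.
            lane = Lane(lane_name)
            self._lanes[lane_name] = lane
            self.spawn_count += 1
        lane.send_input(message)
        return lane
```

Sending two messages to the same lane name leaves `spawn_count` at 1, which is the property the prompt is asking for: context accumulates in the lane instead of being rebuilt per call.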

Dan Shipper 📧

@danshipper

prompt get many PRs to prod autonomously using codex subagents: Run a continuous prod-to-green swarm loop. Keep the immediate blocking task local. Use a small stable set of persistent subagent lanes: 1. prod monitor 2. staging shepherd 3. current/newest pathology investigator …

4:21 PM · Mar 17, 2026

Self-review loops keep finding new issues, even after “done”

Coding agent reliability: A repeated workflow—ask an LLM to audit code, then in a fresh thread ask another LLM to implement the audit comments, and repeat until no new concerns—keeps running far longer than expected, which is being used as evidence that “claude-take-the-wheel” still breaks down on complex systems, per Review loop experiment.

The operational hypothesis is either persistent “cognitive dissonance” between runs or incentives to always find something wrong, as suggested in Review loop experiment, with a punchline reminder that “the real flex is how many LOC deleted” in LOC deleted quip.
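
The experiment's control flow is roughly the loop below. It is a sketch only: `review` and `apply_fixes` are hypothetical callables standing in for the two separate LLM threads, and the round cap is an added assumption so the loop always terminates.

```python
from typing import Callable


def review_until_quiet(
    code: str,
    review: Callable[[str], list[str]],
    apply_fixes: Callable[[str, list[str]], str],
    max_rounds: int = 10,
) -> tuple[str, int, bool]:
    """Alternate review and fix passes.

    Returns (final code, rounds run, converged), where converged means a
    review pass came back with no concerns before the cap was hit.
    """
    for round_no in range(1, max_rounds + 1):
        concerns = review(code)
        if not concerns:
            return code, round_no, True
        code = apply_fixes(code, concerns)
    return code, max_rounds, False
```

The observation in the thread corresponds to the `converged == False` branch: if the reviewer always returns something, only the cap ends the loop.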

Hamel Husain

@HamelHusain

One thing that makes me feel that code factory has not arrived yet is the following experiment: 1.Ask a LLM to do an in-depth rigorous review of your code 2. In a new thread, as same/different LLM to consider those review comments independently and address issues it agrees with …

2:55 PM · Mar 17, 2026

Spec length is converging with code length for agent-driven work

Spec vs code tradeoff: Builders are arguing that a spec detailed enough to reliably generate high-quality code is often “roughly the same length and detail as the code itself,” making spec-review a weak substitute for code-review, as laid out in Spec length argument.

The gap identified isn’t “write better specs” but the lack of a steering mechanism that can redirect the agent before it outputs thousands of lines, per Spec length argument, with a follow-on clarification about the current “stone age” of declarative config in Declarative config note.

dex

@dexhorthy

damn this is so good and encapsulates everything I've been seeing/saying in the last few months - a spec that is sufficiently detailed to generate code with a reliable degree of quality is roughly the same length and detail as the code itself - so don't review those things, …

gabby

@GabriellaG439

New blog post: "A sufficiently detailed spec is code" I wrote this because I was tired of people claiming that the future of agentic coding is thoughtful specification work. As I show in the post, the reality devolves into slop pseudocode haskellforall.com/2026/03/a-suff…

6:55 PM · Mar 17, 2026

Three audit modes: screen scan, code read, or runtime verification

Model audit workflow: A practical comparison of “audit styles” is circulating: Gemini “looked at the screen” and caught obvious issues; Opus “read the code” and cited line-level problems; GPT‑5.4 “ran the code,” hit endpoints, and traced bindings end-to-end, as written up in Audit comparison.

The key takeaway is less about which model “wins” and more about selecting an audit harness that forces runtime validation when correctness depends on wiring and behavior, per Audit comparison.

I was using Opus via Cursor, did an audit with Gemini 3.1 Pro, Opus 4.6 and GPT-5.4. Then I asked Opus to give assessment of the audit quality (anonymously). And I think it 100% nailed the current state of the models: Gemini 3.1 Pro: The weakest. Looked at the screen. Found the …

11:00 AM · Mar 17, 2026

Claude Code subagents and skills don’t compose cleanly yet

Claude Code subagents (Anthropic): There’s ongoing friction around how “skills” and subagents interact. One report says subagents don’t have the Skill tool, so users end up maintaining duplicate instruction sets to run the same capability in parent vs subagent contexts, per Skills overlap concern and clarified in Skill tool missing.

The underlying issue is instruction modularity vs context forking: builders want to be able to say “use the X skill” either locally or via a subagent without rewriting prompts, per Skills overlap concern.

dex

@dexhorthy

soooo subagent skills…feels like subagent instructions will overlap with skill and ppl won’t know what to put where. I can only assume that this will eventually replace custom subagents This also explains why they recently made it so subagents can’t invoke skills. I still …

Lydia Hallie ✨

@lydiahallie

Btw you can add `context: fork` to run a skill in an isolated subagent. The main context only sees the final result, not the intermediate tool calls It gets a fresh context window with CLAUDE.md + your skill as the prompt. The `agent` field even lets you set the subagent type!

2:10 PM · Mar 17, 2026

codex-planr adds task states to stop premature “done”

codex-planr (community): A small, explicit task-state machine is being added to Codex workflows to combat agents marking work complete too early; the loop is “plan → fix → review → fix → summary,” as described in Task system note and implemented in the GitHub repo.

This is a repo-local pattern: status and scope live in files, which makes it easier to resume or hand off without relying on chat history alone, per GitHub repo.
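
A file-backed version of the “plan → fix → review → fix → summary” loop might look like the sketch below. It is not codex-planr's implementation: the JSON layout and the function names are assumptions chosen for illustration, and the state file path is supplied by the caller.

```python
import json
from pathlib import Path

# The step order quoted above; "fix" appears twice on purpose.
STEPS = ["plan", "fix", "review", "fix", "summary"]


def load_state(path: Path) -> dict:
    """Read task state from disk, or start at the first step."""
    if path.exists():
        return json.loads(path.read_text())
    return {"index": 0, "done": False}


def current_step(path: Path) -> str:
    """Name of the step the task is currently in."""
    return STEPS[load_state(path)["index"]]


def advance(path: Path) -> dict:
    """Move to the next step; only finishing the last step marks the task done."""
    state = load_state(path)
    if state["index"] < len(STEPS) - 1:
        state["index"] += 1
    else:
        state["done"] = True
    path.write_text(json.dumps(state))
    return state
```

Because `done` can only become true after the final `summary` step has been advanced past, an agent cannot mark the task complete from `plan` or the first `fix`, and a new session can resume by reading the file instead of the chat history.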

Kevin Kern

@kevinkern

codex often marks tasks as finished even when some of the work is still left, so I added a simple task system to my project. It doesn't need much, so a few skills help here. simple: plan -> fix -> review -> fix -> summary code in the comments below

8:22 PM · Mar 17, 2026

Refactor rollback: tests passed, behavior still drifted

Agent-assisted refactors: A cautionary note from Uncle Bob describes a large reorganization that degraded behavior in ways tests didn’t catch; he reverted to a stable point and switched to smaller steps with more anchoring tests, warning that “AIs move fast, and they can take you off the rails,” per Refactor rollback.

This is a workflow signal about validation strategy: broad refactors amplify blind spots, and “go smaller” becomes the risk-management lever, per Refactor rollback.

Uncle Bob Martin

@unclebobmartin

I started a major reorganization of the empire game yesterday. I was confident that all my tests and rigor would keep the behavior stable. I was wrong. The behavior began to degrade in a way that the tests were not detecting. So after a couple of hours I reverted the work …

12:31 PM · Mar 17, 2026

Codex weekly usage limits are showing up as a UX pain point

Codex (OpenAI): The “weekly usage limit” UX is becoming a visible friction point in day-to-day agent work, with a screenshot showing “0% remaining” and a specific reset timestamp in Usage limit screenshot.

This is being framed less as a billing detail and more as an ops constraint: once parallel sessions and subagents become normal, teams hit limits in the middle of work, per the complaint setup in Usage limit screenshot.

PoV: the last thing you see before going on twitter and tagging @thsottiaux with an imaginary Codex issue

7:59 AM · Mar 17, 2026

The “Jason” subagent meme captures real coordination confusion

Codex subagents (OpenAI): The “who is Jason and why is he deleting my prod db?” joke is spreading as shorthand for subagent opacity and coordination overhead, starting from Jason confusion joke and reinforced by the community illustration in Subagents fanout art.

Even when it’s humor, it’s pointing at a real ergonomic gap: subagents need clearer identity, ownership, and boundaries so teams can track what’s happening across parallel lanes, per Jason confusion joke.

Ok I'll bite, who is Jason and why is he deleting my prod db?

2:06 PM · Mar 17, 2026

📲 Claude Cowork Dispatch + mobile-first agent workflows

New Claude Cowork “Dispatch” research preview and related workflow chatter: turning desktop agents into phone-controlled systems and reducing risk vs DIY remote control. Excludes OpenAI model-release content.

Claude Cowork adds Dispatch to relay phone messages to the Desktop agent

Dispatch (Claude Cowork, Anthropic): Anthropic is rolling out Dispatch as a research preview that lets you communicate with the Claude Desktop app from your phone, as described in the Dispatch rollout. It’s initially available to Max subscribers with an expansion to Pro planned, per the same Dispatch rollout.

In practice this turns “desktop agent sessions” into something you can keep moving while away from the machine—useful for longer-running work where you mainly need to unblock, redirect, or request updates without re-entering the full desktop setup.

TestingCatalog News 🗞

@testingcatalog

Anthropic is rolling out a new Dispatch feature for Claude Cowork that lets users communicate with the Claude Desktop app from their phones! Currently available to Max subscribers in research preview, with a later expansion to Pro plans. Mobile Cowork! 👀

Felix Rieseberg

@felixrieseberg

Your desktop has to be running. Like Cowork itself, we’re shipping an early version - you can expect more to come here within the next few days and weeks. Rolling out now to Max subscribers, with Pro coming in the next few days. Try it and let me know what you think. Download

9:25 PM · Mar 17, 2026

Dispatch is getting framed as a safer alternative to DIY phone-to-desktop agent control

Dispatch safety/UX comparison: Early user feedback frames Claude Cowork Dispatch as a lower-risk replacement for DIY setups, with Ethan Mollick saying it “covers 90% of what I was trying to use OpenClaw for” but “feels far less likely to upload my entire drive to a malware site,” as quoted in the Dispatch safety comparison.

This is a practical signal: teams want phone-to-desktop control, but they also want tight guardrails around what a remote-controlled agent can read and upload—especially when the “phone as remote control” pattern gets bolted onto general-purpose computer-use agents.

Ethan Mollick

@emollick

After using it a bit, Claude Cowork Dispatch covers 90% of what I was trying to use OpenClaw for, but feels far less likely to upload my entire drive to a malware site.

12:41 AM · Mar 18, 2026

Felix Rieseberg pitches Cowork as local-first agent workflows plus Skills

Claude Cowork product framing (Anthropic): Felix Rieseberg’s Cowork discussion emphasizes “local-first” agent workflows, Skills as a reusable capability layer, and the idea that execution is cheap enough to “build all the candidates,” as previewed in the Podcast episode blurb and expanded in the Podcast page.

The subtext is a product bet: putting agents into a dedicated desktop environment (rather than only a chat surface) becomes the way to make agents feel usable for non-trivial knowledge work, while still containing risk via sandboxing.

Latent.Space

@latentspacepod

🆕 Claude Cowork, Skills, and the Future of AI Coworkers latent.space/p/felix-anthro… @felixrieseberg has spent years working at the interface layer, from Electron and the Slack desktop app to now helping build @claudeai Cowork. In this episode, Felix explains why execution is …

9:39 PM · Mar 17, 2026

Cowork’s “touch grass” gag is another sign it’s shipping fast

Claude Cowork UI velocity: A small but telling datapoint: “You can now touch grass in Cowork, too,” per the Touch grass quip. Even without details, it reads like a steady cadence of UX tweaks and easter-egg features landing alongside bigger workflow features such as Dispatch.

Boris Cherny

@bcherny

You can now touch grass in Cowork, too 👏

Felix Rieseberg

@felixrieseberg

We're shipping a new feature in Claude Cowork as a research preview that I'm excited about: Dispatch! One persistent conversation with Claude that runs on your computer. Message it from your phone. Come back to finished work. To try it out, download Claude Desktop, then pair

9:44 PM · Mar 17, 2026

🧩 Plugins & skills shipping: agent capability packaging goes mainstream

Installable capability bundles and skill systems across agents (Vercel plugin, Hermes plugins, VS Code Agent Skills, Box CLI as agent filesystem). This is the “how do I add powers?” beat, distinct from model releases.

Intercom turns Claude Code into an internal full-stack platform via plugins+skills

Claude Code plugin system (Intercom): Intercom described an internal system with 13 plugins and 100+ skills that extend Claude into a “full-stack engineering platform,” per Plugin system thread; the thread highlights deep hooks, MCP-based capabilities, and observability loops that treat skills as product surface area.

• High-leverage capability: the “wildest” example is a read-only Rails production console exposed via MCP for safe production inspection (feature flags, business logic validation, cache state), as described in Prod console detail.
• Instrumentation and feedback loop: they instrument Claude Code lifecycle events with OpenTelemetry (SessionStart, PreToolUse, SubagentStart, etc.) flowing to Honeycomb, per Telemetry detail, which enables “real sessions → detected gaps → GitHub issues → new skills” style iteration.

This is an enterprise pattern: plugin hooks + skills + telemetry wired into a continuous improvement loop, as described across Plugin system thread and Telemetry detail.

Brian Scanlan

@brian_scanlan

We've been building an internal Claude Code plugin system at Intercom with 13 plugins, 100+ skills, and hooks that turn Claude into a full-stack engineering platform. Lots done, more to do. Here's a thread of some highlights.

6:46 PM · Mar 17, 2026

Hermes Agent v0.3.0 adds drop-in plugins and unified streaming across platforms

Hermes Agent v0.3.0 (Nous Research): Hermes shipped a release centered on capability packaging—Python plugins dropped into ~/.hermes/plugins/ can add tools/commands/skills without forking, per the Release notes; the same release also unifies real-time streaming across the CLI and gateway platforms and expands the provider/tooling surface (IDE integrations, browser attach, PII redaction).

• Plugin model: “drop Python files into a directory” is positioned as the extension mechanism, with shareable tools and hooks called out in the Release notes.
• Shipping details engineers will notice: the release notes list unified streaming, a provider router, /browser connect via Chrome CDP, and IDE integrations (VS Code/Zed/JetBrains) in the Release notes.
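
A generic “drop Python files into a directory” loader can be sketched with the standard library alone. This is not Hermes Agent's actual plugin API: the `register()` hook and both function names are hypothetical conventions chosen for illustration, and the directory is passed in by the caller.

```python
import importlib.util
from pathlib import Path
from types import ModuleType
from typing import Callable


def load_plugins(plugin_dir: Path) -> dict[str, ModuleType]:
    """Import every *.py file in plugin_dir as its own module."""
    plugins: dict[str, ModuleType] = {}
    for path in sorted(plugin_dir.glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue  # not importable; skip rather than fail the whole load
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        plugins[path.stem] = module
    return plugins


def collect_tools(plugins: dict[str, ModuleType]) -> dict[str, Callable]:
    """Merge tools from each plugin's register() hook (assumed convention)."""
    tools: dict[str, Callable] = {}
    for module in plugins.values():
        register = getattr(module, "register", None)
        if callable(register):
            tools.update(register())
    return tools
```

The design point is that the host never needs to be forked: adding a capability means adding a file, and removing it means deleting the file. Note that `exec_module` runs plugin code with full privileges, so a loader like this should only read from a directory the user controls.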

Nous Research

@NousResearch

Hermes Agent v0.3.0 ☤ 248 PRs. 15 contributors. 5 days. • Real-time streaming across CLI and all platforms • First-class plugin architecture, package and share tools+commands+skills • /browser connect to live Chrome via CDP • @vercel AI Gateway model provider • …

12:04 PM · Mar 17, 2026

Vercel plugin adds 47+ deploy/perf skills to Claude Code and Cursor via one command

Vercel plugin for coding agents (Vercel): Vercel shipped an installable plugin that turns “agent knows Vercel” into a dependency—installed with npx plugins add vercel/vercel-plugin as shown in Install command; it bundles 47+ specialized skills plus sub-agents for deployment and performance work, and it manages context injection dynamically for cost/precision, as detailed in the Changelog post.

• What’s actually new: rather than pasting docs into prompts, the plugin observes project activity (edits/commands) and injects the relevant Vercel knowledge at the right time, as described in the Changelog post.
• Adoption signal: the framing from Vercel leadership is “there’s no step two,” per Endorsement, which matches the direction teams are taking for making capabilities installable instead of re-prompted every session.

Vercel Developers

@vercel_dev

One plugin. One command. Every skill: ▲ ~/ npx plugins add vercel/vercel-plugin The Vercel plugin for coding agents turns isolated capabilities into coordinated expertise, with: • 47+ specialized skills • Sub-agents for deployments, performance, and more • Dynamic context …

1:11 AM · Mar 18, 2026

Box ships an official CLI so agents can treat Box as a cloud filesystem

Box CLI (Box): Box released an official CLI intended to act as a file-system surface for agents across tools like Claude Code, Codex, Perplexity Computer, and OpenClaw; the install path is npm install --global @box/cli, as announced in CLI announcement.

The tweet also notes availability to free users (including 10GB free storage), positioning the CLI as a shared “agent storage + file ops” primitive, as stated in CLI announcement.

Aaron Levie

@levie

The official Box CLI is here. Now you can use Box via Claude Code, Codex, Perplexity Computer, OpenClaw & more as a full cloud file system for agents. Available to all users, including free users with 10GB of free storage. npm install --global @box/cli

9:45 PM · Mar 17, 2026

VS Code Agent Skills: reusable capability bundles callable via slash commands

Agent Skills (VS Code): The Visual Studio Code account highlighted “Agent Skills” as a first-class packaging primitive: bundle instructions and resources into a named capability, load it on demand, and invoke it from chat via /skill-name, as described in Skills overview.

This is a concrete shift from “prompt templates” toward installable, callable modules that can be shared and reused across runs, matching the workflow shown in Skills overview.

Visual Studio Code

@code

🧠 Want to give your coding agent new capabilities? Use Agent Skills in @code! Agent skills let you package instructions and resources into reusable capabilities for your agent. Skills can be loaded on demand and called directly from chat using slash commands like /skill-name. …

7:00 PM · Mar 17, 2026

Hermes Agent adds skill curation via a skills config toggle UI

Hermes skills curation (Teknium): Hermes Agent users can now toggle installed skills on/off without uninstalling them by running hermes skills config, as described in Config tip.

This turns “skill sprawl” into an explicit config surface—skills can be installed broadly but selectively activated, per Config tip.

Teknium (e/λ)

@Teknium

Tip of the day for Hermes Agent - Did you know you can enable/disable skills you have installed whenever you want? Type `hermes skills config` into your console, and curate those skills even if you've installed them already.

8:10 PM · Mar 17, 2026

Skills discipline: treating skill design as its own operational competency

Skills practice (trq212): A practitioner thread argues that “using skills well is a skill issue” and that strong skill design can change how a team works, as stated in Skills reflection; they also preview open-sourcing an example iMessage skill and a livestream focused on how to use skills effectively, per Open source plan.

The emphasis is on skills as reusable capability units (not one-off prompts), with the next step being publishing real examples, as outlined in Open source plan.

Thariq

@trq212

Using Skills well is a skill issue. I didn't quite realize how much until I wrote this, the best can completely transform how your team works.

Thariq

@trq212

x.com/i/article/2033…

5:29 PM · Mar 17, 2026

🔌 MCP + interoperability: “apps inside chat” and cloud execution connectors

MCP servers/apps and cross-agent interoperability artifacts (diagramming inside chat, Colab control, auth bridges). Separate from non-MCP plugins/skills.

Google open-sources a Colab MCP server for agent-run notebooks

Colab MCP Server (Google): Google has open-sourced a Colab MCP Server that lets an agent create and control a Colab notebook as a remote execution environment—so code runs in a cloud sandbox rather than on your machine, per the announcement. It’s positioned as a universal connector for agents across multiple clients (Claude Code, Cursor, Codex, Gemini), with notebook lifecycle control (build/run/visualize) handled via MCP.

It’s part of a broader pattern in agent infra: “execution as a pluggable tool,” where MCP becomes the standard interface between a chat-based agent and a constrained compute surface.

AlphaSignal AI

@AlphaSignalAI

Your AI agent can now run code in the cloud, without touching your machine. Google just open-sourced the Colab MCP Server, letting any AI agent write and execute code directly inside a Colab notebook. → Works with Claude Code, Cursor, Codex, Gemini → Full notebook lifecycle …

10:47 PM · Mar 17, 2026

Kernel managed auth pulls credentials directly from 1Password vaults

Managed auth (Kernel + 1Password): Kernel announced a partnership with 1Password so agents using Kernel’s managed authentication can fetch credentials directly from 1Password vaults at runtime, as stated in the partner note. The implementation is documented in the integration docs, including domain/URL matching behavior and support for TOTP-based 2FA.

This is an interoperability move: secrets live in an existing enterprise vault, while the agent platform handles “log in and stay logged in” mechanics without copying credentials into prompts or ad-hoc config.

KERNEL

@usekernel

we’ve partnered with @1Password to take the next step toward solving authentication for agents. last month, we introduced managed auth: a standardized way for agents to log in and stay logged in across the internet. with this partnership, your agents can now use credentials …

2:31 PM · Mar 17, 2026

Excalidraw Studio brings editable diagrams into chat via MCP apps

Excalidraw Studio (CopilotKit): CopilotKit introduced Excalidraw Studio, which generates real, editable diagrams as MCP apps directly inside a chat session, with an edit loop that keeps the canvas and chat in sync, as described in the launch thread. It’s designed around local-first persistence—auto-saving to disk with persistent workspaces and “no database”—and includes a one-click path to publish artifacts.

This is a concrete step toward “apps inside chat” where the agent can emit a manipulable UI object (a diagram) rather than a static image, while still keeping the conversation as the control surface.

CopilotKit🪁

@CopilotKit

Introducing @Excalidraw Studio: Generate real, editable diagrams as MCP Apps inside your chat Edit in local canvas and refine via chat all within the same session. Auto-save to disk with persistent workspaces, no database. Then push it live in one click. Open-sourced for the …

4:43 PM · Mar 17, 2026

Intercom’s MCP pattern: a read-only Rails production console for Claude

Read-only prod console via MCP (Intercom): Intercom described giving Claude a read-only Rails production console via MCP, enabling arbitrary Ruby queries against production data for tasks like feature-flag checks and cache inspection, as highlighted in the plugin system thread and reiterated in the Rails console detail. They also described safety gates—read-replica only, blocked critical tables, mandatory verification steps, Okta auth, and an audit trail—per the safety gates note.

This is a notable enterprise pattern: MCP tools can expose “production visibility” to an agent, but only behind explicit guardrails and logging that resemble traditional admin tooling.
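
Two of the gates listed above (blocked tables and an audit trail), plus a read-only check, can be illustrated with a deliberately simplified guard. This is not Intercom's implementation: their console takes Ruby queries, whereas this sketch inspects SQL-like strings, and the table names, `Gate` class, and keyword list are all hypothetical. A real deployment would enforce read-only access at the database layer rather than by string matching.

```python
import re
from dataclasses import dataclass, field

# Statements that modify data; matched as whole words, case-insensitively.
WRITE_VERBS = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate)\b", re.IGNORECASE
)
BLOCKED_TABLES = {"payment_methods", "api_keys"}  # hypothetical names


@dataclass
class Gate:
    """Checks a query against the gates and records every attempt."""
    audit_log: list[dict] = field(default_factory=list)

    def check(self, user: str, query: str) -> bool:
        reason = None
        if WRITE_VERBS.search(query):
            reason = "write statement"
        else:
            words = set(re.findall(r"[a-z_]+", query.lower()))
            hit = words & BLOCKED_TABLES
            if hit:
                reason = f"blocked table: {sorted(hit)[0]}"
        allowed = reason is None
        # Denied attempts are logged too, so the trail shows what was tried.
        self.audit_log.append(
            {"user": user, "query": query, "allowed": allowed, "reason": reason}
        )
        return allowed
```

Logging before returning, for both allowed and denied queries, is the part that makes this resemble traditional admin tooling: the audit trail is a side effect of every call, not an optional extra.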

Brian Scanlan

@brian_scanlan

6:46 PM · Mar 17, 2026

A push for agent-backend interoperability across ACP and Codex App Server

Agent backend protocols (ACP + CASP): A protocol interoperability note claims support is coming for both ACP and the Codex App Server protocol (CASP), so different IDEs/clients can talk to the same agent backend without losing “native Codex-like” integration, as stated in the interop note. It also states intent to support additional protocols if vendors introduce them.

This is a concrete signal that agent tooling is drifting toward “one backend, many frontends,” with protocol bridges as the compatibility layer rather than bespoke plugins per editor.

Onur Solmaz

@onusoz

We will support ACP *and* Codex App Server* protocol (CASP) so you get native Codex-like support, and you can use all the others with native ACP or @zeddotdev’s compatibility shims If Anthropic develops their own protocol, we will support that too! The more interoperability and …

Harold Hunt

@huntharo

Steering a plan in Codex via @openclaw Codex App Server plugin. Full ability to go back and change answers just like Codex Desktop. Coming soon. Maybe tonight!

3:38 AM · Mar 18, 2026

🛡️ Secure execution & agent attack surface (sandboxes, prompt injection, OSS security)

Security posture becomes a first-class feature: isolated runtimes for agents, enterprise sandbox products, and prompt-injection warnings—grounded in OpenSandbox/LangSmith Sandboxes/OpenClaw security discourse.

CNCERT warns of indirect prompt injection against OpenClaw instances

OpenClaw security (KiloCode): KiloCode flags what it calls “inherently weak default security configurations” in OpenClaw, alongside an alert from China’s CNCERT that attackers are using indirect prompt injection to compromise instances, as summarized in the security warning. The immediate engineering takeaway is that agent deployments are now treated as exposed automation surfaces, not “just chatbots,” and default configs are part of the threat model.

KiloCode points readers to the CNCERT-linked reporting in the Hacker News writeup, framing it as a real-world example of how prompt-injection becomes an execution + data exfil problem once an agent can browse, fetch, and act.

Kilo

@kilocode

OpenClaw has "inherently weak default security configurations", according to new research.

12:27 PM · Mar 17, 2026

Alibaba open-sources OpenSandbox for isolated agent code execution

OpenSandbox (Alibaba): Alibaba’s Tongyi team open-sourced OpenSandbox, a general-purpose isolated execution environment meant to keep agents away from the host machine by running in sandboxes such as gVisor or Firecracker, as described in the release summary and shipped in the GitHub repo. It’s pitched as infra for agent apps that need code execution, a filesystem, and tightly controlled network access, without handing the agent your actual machine.

The project emphasizes running locally via Docker or scaling via Kubernetes, plus explicit network traffic controls (egress shaping/allowlisting) to narrow what an agent can reach online, per the release summary.
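
Egress allowlisting, at its simplest, is a host check applied before any outbound request. The sketch below is an application-level illustration only and does not reflect how OpenSandbox enforces network policy; the host names and the function name are assumptions.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org"}  # hypothetical allowlist


def egress_allowed(url: str, allowed_hosts: set[str] = ALLOWED_HOSTS) -> bool:
    """Allow a URL only if its host is an allowed host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)
```

Matching on the parsed hostname, rather than searching the URL string, is what keeps `https://evil.example/pypi.org` from passing. Sandboxes typically enforce this at the network layer as well, since an in-process check can be bypassed by code the agent itself runs.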

Alibaba just open-sourced OpenSandbox ( a general-purpose execution environment ) to give AI agents an isolated environment to run code safely. 8k+ Github stars ⭐️ This stops your AI Agent based applications from accessing your actual host infrastructure. By removing the …

12:08 PM · Mar 17, 2026

KiloClaw publishes a security architecture whitepaper for hosted OpenClaw

KiloClaw security (KiloCode): KiloCode published a security whitepaper describing how its managed OpenClaw hosting is structured around multi-layer tenant isolation and secret handling, claiming validation via an “independent 10-day security assessment” with threat modeling and adversarial testing, as shown in the whitepaper excerpt.

The accompanying post claims multiple isolation layers (routing, app env, network, VM isolation) and positions the platform as designed to mitigate agent-specific attacks such as prompt injection and data exfiltration, as detailed in the Security architecture post.

Kilo

@kilocode

99% of OpenClaw hosting providers claim their service “is secure.” Evidence > Claims KiloClaw doesn’t rely on claims alone. We stress-tested our OpenClaw hosting service across five layers of security risk to verify our claim, and published our findings in a whitepaper. Read …

8:56 AM · Mar 17, 2026

LangSmith launches Sandboxes for controlled agent code execution (private preview)

LangSmith Sandboxes (LangChain): LangChain launched LangSmith Sandboxes in private preview—ephemeral, locked-down execution environments for agents that need to run code, call APIs, or build artifacts, as announced in the product launch. This is framed as a way to make agents “useful” by adding execution while keeping isolation and lifecycle control.

LangChain

@LangChain

🚀 Today we're launching LangSmith Sandboxes Agents get a lot more useful when they can run code: analyze data, call APIs, build entire applications. Sandboxes give them a safe place to do it with ephemeral, locked-down environments you control. Now in Private Preview. Learn …

4:51 PM · Mar 17, 2026

Anchor adds 1Password Unified Access for runtime secrets in browser agents

Anchor × 1Password (AnchorBrowser): AnchorBrowser announced integration with 1Password Unified Access, positioning it as a way for browser-based agents to fetch credentials at runtime (instead of hardcoding secrets in .env files), with the session described as isolated and “logged and auditable,” per the integration note.

The announcement describes a workflow of isolated agent session start, secrets retrieval via 1Password, execution, and audit logging, as written in the integration note.
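The fetch-at-runtime pattern described above can be sketched as follows. This is a generic illustration, not Anchor's or 1Password's API: `run_with_runtime_secret`, `fetch_secret`, and the reference string format are all hypothetical, and session isolation is assumed to be handled elsewhere.

```python
import time
from typing import Callable


def run_with_runtime_secret(
    secret_ref: str,
    fetch_secret: Callable[[str], str],
    task: Callable[[str], str],
    audit: list,
) -> str:
    """Fetch a credential just in time, use it, and audit the access without storing the value."""
    secret = fetch_secret(secret_ref)  # retrieved at runtime, never written to a .env file
    entry = {"ref": secret_ref, "started_at": time.time(), "ok": False}
    try:
        result = task(secret)
        entry["ok"] = True
        return result
    finally:
        audit.append(entry)  # log the reference and outcome, never the secret itself


if __name__ == "__main__":
    fake_vault = {"vault/site/password": "s3cr3t"}  # stand-in for a real secrets manager
    log: list = []
    out = run_with_runtime_secret(
        "vault/site/password",
        fake_vault.__getitem__,
        lambda s: f"logged in with a {len(s)}-char secret",
        log,
    )
    print(out, "| audited:", log[0]["ref"])
```

The audit entry is appended in a `finally` block so that failed runs are recorded too, which is the property an "auditable" session needs.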

Anchorbrowser

@AnchorBrowser

Hardcoding secrets into agent scripts is an amateur hour move. Today, we’re changing that. Anchor is officially part of the @1Password Unified Access launch ecosystem! 🚀 Now, browser-based agents running on Anchor can pull credentials securely at runtime. No more .env file …

1:56 PM · Mar 17, 2026

Anthropic funds Linux Foundation work on open-source security

Linux Foundation funding (Anthropic): Anthropic says it is donating to the Linux Foundation to support open-source security, arguing that open source underpins “nearly every software system” and becomes more critical as AI capabilities grow, as stated in the donation note. The practical signal is that AI vendors are increasingly treating baseline OSS security as shared infrastructure risk rather than an externality.

Anthropic

@AnthropicAI

The open source ecosystem underpins nearly every software system in the world. As AI grows more capable, open source security becomes increasingly important. We're donating to the Linux Foundation to continue to help secure the foundations AI runs on.

The Linux Foundation

@linuxfoundation

The Linux Foundation Announces $12.5 Million in Grant Funding (via @AlphaOmegaOSS and @OpenSSF) @AnthropicAI , @AmazonWebServices, @GitHub, @Google, @GoogleDeepMind, @Microsoft, @OpenAI to Invest in Sustainable Security Solutions for #OpenSource linuxfoundation.org/press/linux-fo…

4:11 PM · Mar 17, 2026

🏎️ Inference/runtime engineering: compaction speed, browser tools, and throughput hacks

Runtime and serving improvements that change agent latency/cost: context compaction models, self-summarization for long horizons, and browser automation tool upgrades.

FlashCompact targets the compaction bottleneck in long-running agents

FlashCompact (Morph): Morph introduced FlashCompact, a specialized model for context compaction that claims 33k tokens/sec throughput and 200k → 50k compression in ~1.5s, aimed at making agent compaction fast enough to stay in the iteration loop, as shown in the launch speed claim.

Morph ties the work to agent failure modes seen in practice—after reviewing 200+ agent sessions they argue most context bloat comes from tool responses, not model text, and report compaction yielding “no performance drop” alongside fewer tokens/steps in the tool bloat finding. The infra angle is explicit too: they describe a custom PyTriton serving stack on H200 behind the speed numbers in the serving stack note, with more implementation details in the Compaction SDK blog.
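A quick arithmetic check shows how the headline numbers relate. The sketch below assumes the 33k tokens/sec figure counts compacted output tokens; the source does not say which side of the compaction the rate measures, so treat that reading as an assumption.

```python
def seconds_at_rate(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a fixed throughput."""
    return tokens / tokens_per_second


RATE = 33_000            # tokens/sec, from the launch claim
INPUT_TOKENS = 200_000   # context before compaction
OUTPUT_TOKENS = 50_000   # context after compaction

if __name__ == "__main__":
    print("compression ratio:", INPUT_TOKENS / OUTPUT_TOKENS)                          # 4.0
    print("output-side time (s):", round(seconds_at_rate(OUTPUT_TOKENS, RATE), 2))     # 1.52
    print("input-side time (s):", round(seconds_at_rate(INPUT_TOKENS, RATE), 2))       # 6.06
```

Only the output-side reading lands near the quoted ~1.5s, which is why the assumption above seems the more plausible interpretation.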

Morph

@morphllm

Introducing FlashCompact - the first specialized model for context compaction 33k tokens/sec 200k → 50k in ~1.5s Fast, high quality compaction

6:09 PM · Mar 17, 2026

Cursor trains self-summarization into Composer to extend long-horizon coding

Composer self-summarization (Cursor): Cursor says it trained Composer to self-summarize using reinforcement learning (instead of prompt-based summarization), reporting ~50% lower compaction error and better outcomes on “challenging coding tasks requiring hundreds of actions,” as described in the training claim.

It’s a direct attempt to turn summarization from a brittle harness step into a learned behavior. More detail is in Cursor’s write-up, linked from the training blog.

Cursor

@cursor_ai

We trained Composer to self-summarize through RL instead of a prompt. This reduces the error from compaction by 50% and allows Composer to succeed on challenging coding tasks requiring hundreds of actions.

6:04 PM · Mar 17, 2026

Mamba-3 pushes state-space models toward faster decode without quality loss

Mamba-3 (Together Research): Together AI announced Mamba-3, framing decode speed as a first-class constraint for agents and RL rollouts; the key claim is a MIMO (multi-input, multi-output) variant that replaces a vector outer-product recurrence with a matrix multiply to get “a stronger model at the same decode speed,” per the release thread.

The release includes open-sourced kernels and pointers to the paper/code/blog in the release links, with the public implementation in the kernel repo and the write-up in the blog post.
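The outer-product versus matrix-multiply contrast can be illustrated with a toy state update. This is a simplified sketch, not Mamba-3's kernels: the scalar decay, the shapes (state N x P, rank R), and the function names are assumptions chosen for readability.

```python
def outer_update(h, decay, b, x):
    """SISO-style step: h <- decay*h + outer(b, x), a rank-1 update. h is N x P."""
    return [[decay * h[i][j] + b[i] * x[j] for j in range(len(x))] for i in range(len(b))]


def matmul_update(h, decay, B, X):
    """MIMO-style step: h <- decay*h + B @ X^T, a rank-R update. B is N x R, X is P x R."""
    n, p, r = len(B), len(X), len(B[0])
    return [
        [decay * h[i][j] + sum(B[i][k] * X[j][k] for k in range(r)) for j in range(p)]
        for i in range(n)
    ]


if __name__ == "__main__":
    h0 = [[0, 0], [0, 0]]
    # With R = 1 the matrix-multiply form reduces to the outer-product form.
    print(outer_update(h0, 1, [1, 2], [3, 4]))
    print(matmul_update(h0, 1, [[1], [2]], [[3], [4]]))
    # With R = 2 each step writes a rank-2 update into the same-sized state.
    print(matmul_update(h0, 1, [[1, 0], [0, 1]], [[3, 5], [4, 6]]))
```

The state has the same size in both cases; the matrix-multiply form simply does more arithmetic per step against it, which is one plausible reading of how a model could get stronger without a larger per-token state to move during decode.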

Together AI

@togethercompute

Introducing Mamba-3 🐍 Inference speeds are more important than ever, driven by the rise in agents and inference-heavy RL …

Albert Gu

@_albertgu

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities

5:19 PM · Mar 17, 2026

agent-browser adds iframe-aware automation primitives

agent-browser (Vercel Labs): agent-browser now supports iframes—automatically snapshotting inline iframe content and enabling element interaction “using refs directly” without explicit frame switching, per the iframe support note.

This targets a common failure point in browser agents (embedded auth flows, embedded apps, payment widgets). Install and usage details live in the GitHub repo.

Chris Tate

@ctatedev

agent-browser now supports iframes → Snapshots inline iframe content automatically → Interact with elements inside iframes using refs directly → No frame switching needed

7:34 PM · Mar 17, 2026

Network-attached local inference: LM Studio tests “GPU over ethernet” setup

LM Studio + LM Link (local inference ergonomics): Matthew Berman reports that a GB300 box can be used as a network-attached accelerator: “plug in ethernet… it’ll work as an external GPU,” per the setup note.

A first throughput datapoint shared in the throughput screenshot shows 103 tok/sec using Nemotron 3 Super (Q4) in LM Studio “w/ LM Link,” with follow-on plans to try NVFP4 for more speed. LM Link’s remote-local model bridging is described in the LM Link preview page.

Matthew Berman

@MatthewBerman

Turns out I can just plug in ethernet to this beast and it'll just work as an external GPU (GB300) on my network. No need to even plug in a monitor or keyboard. The powerrrrr

Matthew Berman

@MatthewBerman

.@nvidia hand delivered a pre-production unit of the @Dell Pro Max with GB300 to my house. 100lbs beast with 750GB+ of unified memory to power the best open-source models in the world. What should I test first?

6:45 PM · Mar 17, 2026

🦞 OpenClaw ops ecosystem: providers, UI layers, and “chat as the agent surface”

Operational ecosystem around OpenClaw: plugins/providers, UI/control surfaces, and UX philosophy (chat vs pages). Excludes GPT‑5.4 mini/nano release specifics.

Ollama 0.18.1 ships OpenClaw web search/fetch and headless ollama launch

Ollama 0.18.1 (Ollama): Following up on the earlier provider-onboarding item, Ollama shipped a web search + web fetch plugin for OpenClaw and added a non-interactive mode for ollama launch, as detailed in the release notes. This lands as a practical ops upgrade: more “fresh info” workflows inside local/cloud OpenClaw runs, plus a cleaner way to script agent launches in CI.

• OpenClaw web access: The @ollama/openclaw-web-search plugin enables search and fetch with a stated constraint that it “does not execute JavaScript,” per the release notes.
• CI/container ergonomics: ollama launch can run headless (for example, with --yes), positioning it for ephemeral pipelines that spin up an integration, run prompts/evals, and tear down, as shown in the release notes.

ollama

@ollama

Ollama 0.18.1 is here! 🌐 Web search and fetch in OpenClaw Ollama now ships with web search and web fetch plugin for OpenClaw. This allows Ollama's models (local or cloud) to search the web for the latest content and news. This also allows OpenClaw with Ollama to be able to …

7:46 PM · Mar 17, 2026

OpenRouter shows a sharp OpenClaw usage spike (“NVIDIA effect”)

OpenClaw usage (OpenRouter): A 30-day OpenRouter chart shows a step-change upward in OpenClaw usage—framed as “the NVIDIA effect”—as shared in the usage chart. It’s a concrete adoption signal on routed traffic.

This reads like a continuation of the attention wave from the earlier “keynote star” item: OpenClaw is getting pulled into default stacks, and OpenRouter’s proxy telemetry is one of the few public windows into that shift.

Dave Morin 🦞

@davemorin

The @nvidia effect...

8:19 PM · Mar 17, 2026

Rauch argues “chat isn’t temporary” for agents; pages evolve into generative UI

Chat as an agent surface (Vercel): Guillermo Rauch argues the opposite of the “something better than chat is coming” take—he expects more work to run through chat and voice, with richer visualizations embedded in-chat and a one-click escape hatch to web pages, as laid out in the interface argument. Short version: chat is the control plane; pages become a higher-bandwidth view.

He also describes “Generative UI” as pages that accept natural language and stream back both text and complex data, and frames this as complementary to Slack/WhatsApp-style agent conversations, per the interface argument.

Guillermo Rauch

@rauchg

Every month I periodically see the recycled take that “something better than chat” is coming for AI. That chat is temporary. In fact, I predict the opposite. More of our work and life will happen through chat and voice interfaces of increasingly intelligent agents. 🦞 OpenClaw …

12:22 PM · Mar 17, 2026

A one-command recipe to run OpenClaw on MI300X via SGLang (with free credits)

OpenClaw deployment recipe (LMSYS + AMD Developer Cloud): LMSYS shared a concrete path to run OpenClaw on AMD’s Developer Cloud using $100 in credits (about 50 hours of MI300X time) and serving Qwen3.5-122B-A10B-FP8 via SGLang, as written in the setup thread. It’s positioned as a “self-hosted agent stack, on enterprise hardware, at zero cost.”

The operational hook for OpenClaw builders is that SGLang is selectable in OpenClaw’s onboarding CLI, per the setup thread, so the serving backend swap is becoming a first-class knob rather than a bespoke integration.

LMSYS Org

@lmsysorg

🚀 @AMD just dropped a full guide: run OpenClaw🦞 for free on AMD Developer Cloud with Qwen3.5 + SGLang on a single MI300X! Following SGLang's support in @OpenClaw, here's a free way to get your own stack running: 🆓 $100 AMD Developer Cloud credits (~50 hrs of MI300X) 🧠 …

8:20 PM · Mar 17, 2026

A pixel-art “office” UI for OpenClaw tracks agent state with a lobster avatar

Star-Office-UI (community OpenClaw UI): A community-built, pixel-art “office” interface for OpenClaw tracks agent status by moving a lobster character between work/rest/bug areas, with the repo called out as passing ~5K stars in the project spotlight. It’s a UI-layer attempt to make multi-agent state legible at a glance.

The project details and setup are in the GitHub repo, which emphasizes multi-agent status states and a lightweight web dashboard approach.

Beautiful. Someone created an open-source Pixel Office interface for OpenClaw. It visually tracks status by moving a lobster character into specific work, rest, or bug areas. The Github repo got 5K+ stars (⭐️)

12:03 PM · Mar 17, 2026

Nemotron 3 Nano 4B is now runnable via Ollama (and Pi)

Nemotron 3 Nano 4B (NVIDIA via Ollama): Ollama added nemotron-3-nano:4b to its library with a one-liner ollama run nemotron-3-nano:4b, and highlighted pairing it with Pi (the minimal runtime used by OpenClaw) via ollama launch pi --model nemotron-3-nano:4b, per the availability note. It’s framed as a fit for agents on constrained hardware.

The concrete shipping detail is the CLI path: local model pull/run plus a lightweight agent runner in the same toolchain, as described in the availability note and the model page.
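For scripting, the two commands quoted above can be assembled programmatically. The sketch below only builds and prints the argument lists; the helper names are hypothetical, and appending `--yes` to this particular `ollama launch pi` invocation is an assumption based on the headless example mentioned in the 0.18.1 release notes.

```python
import shlex
import shutil


def ollama_run_cmd(model: str) -> list[str]:
    """Argument list for `ollama run <model>`, as quoted in the availability note."""
    return ["ollama", "run", model]


def ollama_launch_cmd(integration: str, model: str, headless: bool = False) -> list[str]:
    """Argument list for `ollama launch <integration> --model <model>`."""
    cmd = ["ollama", "launch", integration, "--model", model]
    if headless:
        cmd.append("--yes")  # assumption: the headless flag composes with this invocation
    return cmd


if __name__ == "__main__":
    model = "nemotron-3-nano:4b"
    print(shlex.join(ollama_run_cmd(model)))
    print(shlex.join(ollama_launch_cmd("pi", model)))
    print("ollama on PATH:", shutil.which("ollama") is not None)
```

Passing these lists to `subprocess.run` (rather than a shell string) avoids quoting issues if a model tag ever contains shell-special characters.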

ollama

@ollama

Nemotron 3 Nano 4B is now available to run via Ollama: ollama run nemotron-3-nano:4b Try it with Pi, the minimal agent runtime that powers OpenClaw: ollama launch pi --model nemotron-3-nano:4b This new addition to @nvidia's Nemotron family is a great fit for building and …

11:17 PM · Mar 17, 2026

🏢 Enterprise agent products: browsers, grounded models, and workspace automation

Enterprise-oriented agent surfaces and grounded modeling offerings (AI browser for teams, enterprise-custom model training, and Workspace CLI automation).

Comet Enterprise brings admin controls and CrowdStrike security to Perplexity’s AI browser

Comet Enterprise (Perplexity): Perplexity launched an enterprise tier for its AI-native browser, adding centralized controls for rollout and monitoring—plus an enterprise security integration—per the launch announcement.

• Admin + fleet operations: Enterprise teams can deploy Comet to thousands of devices via MDM and get telemetry/audit logs for visibility, as described in the admin controls clip.
• Security layer: For Enterprise plans, Comet integrates with CrowdStrike Falcon to detect suspicious files/links and block phishing/malware, according to the CrowdStrike integration.
• Positioning and compliance claims: It’s framed as an “always-on assistant” that can automate multi-tab workflows, with compliance and prompt-injection/data-leakage protections called out in the product overview.

Claims about who’s already using it (Fortune, AWS, Bessemer and others) are attributed in the customer list post, but no independent security evaluation is included in these tweets.

Perplexity

@perplexity_ai

Today we're launching Comet Enterprise. Now, the most powerful AI browser is available to enterprise teams. Research, automate tasks, and get work done without leaving the browser.

4:43 PM · Mar 17, 2026

Gemini Personal Intelligence expands free in the U.S. across Gemini and Chrome

Personal Intelligence (Google Gemini): Google is rolling out Personal Intelligence more broadly for free in the U.S. across the Gemini app and Gemini in Chrome, with explicit opt-in to connect Google apps like Search, Gmail, Photos, and YouTube, as stated in the rollout post.

• What changes: Responses can use user-connected signals to be more tailored (e.g., recommendations based on past favorites), as described in the example use case.
• Control + privacy claims: Users can choose which apps are connected and toggle personalization off per chat, according to the control description and the Google blog post.

This is a meaningful “enterprise adjacent” signal because it normalizes connector-based personalization in the primary consumer surfaces (Chrome + Search-adjacent), which tends to become the UX baseline that workplace tools get compared against.

Google Gemini

@GeminiApp

Personal Intelligence is rolling out to more users for free across the Gemini app and Gemini in @GoogleChrome in the U.S. Access smarter responses uniquely relevant to you if you choose to connect your @Google apps like Search, @Gmail, @GooglePhotos, and @YouTube.🧵

4:00 PM · Mar 17, 2026

Mistral Forge targets enterprise training on proprietary knowledge and workflows

Forge (Mistral AI): Mistral introduced Forge, a system for enterprises to build “frontier-grade” models grounded in proprietary context (internal systems, workflows, policies), as announced in the Forge launch thread.

• Customer signal: Mistral says it has already partnered with ASML, Ericsson, the European Space Agency, and others, per the Forge launch thread.
• Core pitch: The product is positioned as bridging generic models to org-specific ones by training on internal knowledge rather than only relying on public data, as detailed in the Forge explainer.

The tweets don’t specify pricing, deployment topology (fully on-prem vs managed options), or what customization knobs (pretraining vs fine-tuning vs RL) are generally available versus bespoke engagements.

Mistral AI

@MistralAI

Today, we’re introducing Forge, a system for enterprises to build frontier-grade AI models grounded in their proprietary knowledge. 🌎 Forge bridges the gap between generic AI and enterprise-specific needs. Instead of relying on broad, public data, organizations can train models …

9:00 PM · Mar 17, 2026

Manus upgrades its Google Drive connector with Google Workspace CLI actions

Google Workspace connector (Manus): Manus says its Google Drive connector now supports the Google Workspace CLI, enabling more precise actions across Docs, Sheets, and Slides from a single prompt, per the connector upgrade note.

• Granular operations: Examples include replying to specific Doc comments, updating a single Sheet cell, reorganizing Drive folders, and renaming Slide titles, as listed in the connector upgrade note.
• Mechanism: The product framing is that it can manage a wider set of Workspace operations without bespoke per-app UI flows, with additional detail in the announcement post.

No demo media is included in the cited tweets, so the operational reliability (auth flows, rate limits, error handling) can’t be assessed from this thread alone.

Manus

@ManusAI

Our @googledrive connector just got a major upgrade. Manus now supports the Google Workspace CLI, so you can seamlessly manage your entire workflow in Docs, Sheets, and Slides from a single prompt. This means you can address comments, edit cells, and even organize your Drive, …

3:06 PM · Mar 17, 2026

OpenAI says ~3M daily US ChatGPT messages involve wages and earnings

Worker compensation usage (OpenAI): OpenAI published analysis claiming that in Jan–Feb 2026, “nearly 3 million messages each day” on consumer ChatGPT in the U.S. involved wages and earnings, as quoted in the usage stat.

• Why it matters to leaders: It’s a concrete adoption signal that a large share of consumer usage is already labor-market decision support (benchmarking pay, exploring earnings for roles), which can spill into HR/compensation tooling expectations.
• Primary source: OpenAI’s writeup and methodology context (including WorkerBench framing) are in the OpenAI report.

The tweets don’t include error bars or breakdowns by model/version, but they do anchor a specific daily volume for this use case.

Tibor Blaho

@btibor91

"In January and February 2026, on average, nearly 3 million messages each day on consumer ChatGPT in the US involve wages and earnings"

9:26 PM · Mar 17, 2026

🖥️ Local training & on-device fine-tuning: the ‘run it yourself’ tool wave

Tools that make local model training/running practical for engineers: Unsloth Studio, HF agent bootstrap tooling, and mobile-friendly fine-tuning frameworks.

Unsloth open-sources Studio: local LLM training + inference UI with dataset recipes

Unsloth Studio (UnslothAI): Unsloth shipped Unsloth Studio, an open-source web UI for running models locally (Mac/Windows/Linux) and fine-tuning 500+ models with claims of 2× faster training and ~70% less VRAM than typical setups, as announced in the launch thread and detailed in the GitHub repo.

The tooling angle is that it bundles a bunch of “agent-like” conveniences into a local runner—multi-format model support (GGUF, vision/audio/embeddings), automated dataset creation from office docs, and a sandbox for code execution to verify outputs, as described in the launch thread and reinforced by the sandbox execution example.

• Data-to-dataset pipeline: “Data Recipes” converts PDFs/CSVs/DOCXs/TXT into structured synthetic datasets through a node/graph workflow, as shown in the data recipes clip.
• Reliability hooks: “Self-healing tool calling” plus built-in code execution are positioned as ways to reduce unverified answers, per the launch thread and sandbox execution example.

Sentiment in the thread leans toward “local-first is becoming table stakes,” with one practitioner framing it as “no longer optional” in the local-first takeaway.

Unsloth AI

@UnslothAI

Introducing Unsloth Studio ✨ A new open-source web UI to train and run LLMs. • Run models locally on Mac, Windows, Linux • Train 500+ models 2x faster with 70% less VRAM • Supports GGUF, vision, audio, embedding models • Auto-create datasets from PDF, CSV, DOCX • …

3:19 PM · Mar 17, 2026

Hugging Face ships hf-agents: pick local model/quant and start a coding agent

hf-agents (Hugging Face): Hugging Face released an hf CLI extension that detects your hardware, recommends a model+quant, and then spins up a local coding agent. The CLI announcement pitches it as “one command to go local/private/free/fast,” with implementation details in the GitHub repo.

Under the hood, it’s framed as a practical bootstrapper: hardware fit → model selection → local server/runtime wiring (with llama.cpp and an agent runtime mentioned in the repo), so the “setup tax” drops to a CLI flow rather than a bespoke install script per machine.
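The "hardware fit → model selection" step can be approximated with a weights-only memory estimate. The source does not describe hf-agents' actual selection logic, so this is only a rough sketch: the function names, the candidate bit widths, and the 80% headroom factor are assumptions, and the estimate ignores KV cache and activation memory.

```python
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30


def pick_quant(
    params_billions: float,
    memory_gib: float,
    headroom: float = 0.8,  # hypothetical: leave 20% for KV cache, runtime, and the OS
    candidates: tuple = (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)),
):
    """Return the highest-precision candidate whose weights fit the budget, else None."""
    budget = memory_gib * headroom
    for name, bits in candidates:
        if weight_gib(params_billions, bits) <= budget:
            return name
    return None


if __name__ == "__main__":
    for mem in (8, 16, 24):
        print(f"7B model, {mem} GiB:", pick_quant(7, mem))
```

Real quantization formats carry per-block metadata, so their effective bits per weight run somewhat above the nominal width; a production selector would need to account for that.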

clem 🤗

@ClementDelangue

We just released an hf CLI extension to detect the best model/quant for a user's hardware and then spins up a local coding agent. Time to go local/private/free/fast for your agents thanks to open-source!

7:01 PM · Mar 17, 2026

Tether releases QVAC BitNet LoRA stack claiming billion-parameter fine-tunes on phones

QVAC Fabric BitNet LoRA (Tether): Tether introduced an open-source BitNet+LoRA fine-tuning framework with claims of up to 90% lower memory and demos like fine-tuning a 13B model on an iPhone 16 plus large speedups on mobile GPUs, per the launch claims.

The engineering-interesting part is the cross-device backend story—targeting heterogeneous consumer/edge GPUs for LoRA training rather than treating phones as inference-only endpoints—backed by the published source in the GitHub repo.

Paul Couvert

@itsPaulAi

Ok that's absolutely insane?? Tether has just introduced the QVAC BitNet LoRA fine‑tuning framework You can now run and fine-tune (!) billion-parameter models ON YOUR PHONE - It cuts memory use by up to 90% - They've fine-tuned a 13B model on an iPhone 16 - Runs 11x faster on …

Paolo Ardoino 🤖

@paoloardoino

Tether AI breakthrough Tether AI team just released new version of QVAC Fabric to include the World’s First Cross-Platform BitNet LoRA Framework to Enable Billion-Parameter AI Training and Inference on Consumer GPUs and Smartphones. Background Microsoft's BitNet uses one bit

2:54 PM · Mar 17, 2026

Despite compute concentration, pretraining research feels more alive again

Open research signal: Nathan Lambert notes that even though relatively few orgs can scale frontier models to mass deployment, pretraining research “is vibrant and progressing,” and he frames this as a shift toward optimism compared with a few years ago, per the sentiment note.

For builders, the implied takeaway is that systems and deployment constraints may be consolidating, while architecture/training ideas (and their open implementations) are still diversifying—two different dynamics moving at once.

Nathan Lambert

@natolambert

While there are fewer labs that have the compute to scale to models that tons of people use, it makes me really happy to see that pretraining research is vibrant and progressing with many interleaved ideas. Compared to a few years ago, it feels like a big shift in optimism.

6:04 PM · Mar 17, 2026

📊 Benchmarks & measurement: leaderboards, nonsense-tests, and AGI eval push

Evaluation and measurement signals: community benchmarks/leaderboards, usage-scale metrics, and new benchmark-building initiatives. Excludes the feature story’s launch metrics (GPT‑5.4 mini/nano) where possible; focuses on third-party evals and measurement tooling.

DeepMind launches $200K Kaggle hackathon to build new cognitive AI evaluations

Cognitive evals (Google DeepMind + Kaggle): DeepMind is crowdsourcing new cognitive capability benchmarks via a Kaggle hackathon with $200K in prizes, targeting dimensions like learning, metacognition, attention, executive function, and social cognition as described in the benchmark call and echoed in the hackathon announcement. The point is measurement: as classic leaderboards saturate, the shortage is now “good tests,” not more charts.

• Scope signal: it’s explicitly framed as “progress toward AGI” measurement rather than product evals, per the benchmark call and hackathon announcement.

Logan Kilpatrick

@OfficialLoganK

Help us measure the progress towards AGI (specifically cognitive capabilities) by building benchmarks on @kaggle, with $ 200K in prizes available! Details in 🧵

6:46 PM · Mar 17, 2026

OpenRouter hits ~1 quadrillion tokens/year pace, implying ~$1B/yr spend at $1/M

Usage-scale measurement (OpenRouter): OpenRouter usage is now paced at roughly 1 quadrillion tokens/year, computed from a shown ~20.47T tokens/week run rate, with an implied ~$1B annual spend at an assumed ~$1/M tokens as laid out in the usage pace chart. This is a rare “demand-side” datapoint that’s closer to real inference traffic than most lab benchmarks.

The evidence is directional (back-of-envelope pricing assumption), but the weekly token throughput plot in the usage pace chart is the core signal.
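The back-of-envelope math can be reproduced in a few lines. The weekly run rate and the $1/M price come from the cited post; using 52 weeks per year is an assumption about how the annual pace was derived.

```python
WEEKLY_TOKENS = 20.47e12        # ~20.47T tokens/week, from the usage pace chart
WEEKS_PER_YEAR = 52             # assumption: simple 52-week annualization
PRICE_PER_MILLION_USD = 1.0     # assumed blended price from the post, not a measured figure


def annual_tokens(weekly_tokens: float, weeks: int = WEEKS_PER_YEAR) -> float:
    """Annualize a weekly token run rate."""
    return weekly_tokens * weeks


def annual_spend_usd(tokens: float, price_per_million: float) -> float:
    """Spend implied by a token count at a flat per-million-token price."""
    return tokens / 1e6 * price_per_million


if __name__ == "__main__":
    tokens = annual_tokens(WEEKLY_TOKENS)
    print(f"tokens/year: {tokens:.3e}")
    print(f"implied spend: ${annual_spend_usd(tokens, PRICE_PER_MILLION_USD):,.0f}")
```

At these inputs the pace comes out slightly above 1 quadrillion tokens per year, so the "~$1B" figure scales linearly with whatever blended price is actually being paid.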

Deedy

@deedydas

OpenRouter just broke 1 quadrillion tokens a year. Assuming ~$1/M, $1B would be spent on it annually.

8:44 PM · Mar 17, 2026

BullshitBench update: GPT‑5.4 mini/nano rank low on nonsense detection; “thinking” didn’t help

BullshitBench (nonsense-prompt eval): Following up on the earlier nonsense-benchmark item, the benchmark’s maintainer reports GPT‑5.4 mini landing around 40th on the full list and GPT‑5.4 nano around 70th, with “thinking” effort not improving results much, according to the results update and the reasoning note.

If you want to inspect methodology or rerun, the maintainer links the public viewer and repo in the benchmark links.

BullshitBench update: The new GPT-5.4 mini and nano models score quite low. This screenshot shows OpenAI models only, on the full list would put GPT-5.4-mini around 40th place and Nano is around 70th place. Again thinking didn't help much at all.

Peter Gostev

@petergostev

BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn't helping. What's new: 100 new questions, by domain (coding (40 Q's), medical (15), legal (15), finance (15), physics(15)), 70+ model

7:54 PM · Mar 17, 2026

Vals Index placements: GPT‑5.4 mini #13, nano #18; MiniMax M2.7 debuts around #12

Vals Index (third-party composite eval): ValsAI places GPT‑5.4 mini at #13—stating it’s roughly “equivalent performance to GPT‑5” and strong on an in-house “Vibe Code Bench”—as shown in the mini placement and elaborated in the run settings note. They also slot GPT‑5.4 nano at #18, emphasizing cost-effectiveness and a performance gap on “ProofBench” style tasks per the nano placement and the ProofBench caveat.

• Another new entrant: ValsAI posts initial results putting MiniMax M2.7 at #12 overall, noting ~$0.15/test cost and a potential #2 open-weight placement if weights ship, per the M2.7 placement and cost note.

Treat as provisional: these are results run by Vals itself, with incomplete benchmark breakdowns still “to be released,” per the pending benchmarks note.

Vals AI

@ValsAI

GPT 5.4 Mini comes in at #13 on the Vals Index - equivalent performance to GPT 5 🚀

5:14 PM · Mar 17, 2026

Arena launches Video Edit Arena leaderboard; Grok-Imagine-Video leads initial rankings

Video Edit Arena (Arena): Arena launched a community-vote leaderboard focused specifically on video editing capabilities (not just generation), with early rankings showing Grok-Imagine-Video #1, Kling-o3-pro #2, Kling-o1-pro #3, and Runway Gen4-aleph #4, as listed in the leaderboard announcement and viewable via the leaderboard page.

This is a rare attempt to isolate “edit” as a capability class; the tradeoff is that votes are subjective and model availability can shift quickly.

Arena.ai

@arena

Today we’re launching the Video Edit Arena to evaluate the frontier capability of video models! - #1 Grok-Imagine-Video, @xAI - #2 Kling-o3-pro, @Kling_ai - #3 Kling-o1-pro, @Kling_ai - #4 Gen4-aleph, @Runwayml The leaderboard is powered by thousands of real-world community …

6:57 PM · Mar 17, 2026

Prinzbench adds GPT‑5.4 Pro (Extended), reports new top score of 79/99

Prinzbench (legal-research benchmark): Prinzbench added GPT‑5.4 Pro (Extended) and reports a new high score of 79/99, beating GPT‑5.4 (xhigh) by 10 points, per the benchmark result. The benchmark’s framing and dataset are described in the linked GitHub repo, which positions it as testing “obscure info + legal research” rather than coding/math.

The evaluation is niche by design; the useful signal is a third-party attempt to measure “economically valuable” research behavior outside common coding suites.

prinz

@deredleritt3r

By popular request, GPT-5.4 Pro (Extended) has been added to prinzbench. It's the best model I've ever benchmarked (not surprising), beating GPT-5.4 (xhigh) by 10 points to achieve a new high score of 79/99 on my benchmark (somewhat surprising; I thought it would score even …

4:51 AM · Mar 18, 2026

Arena adds customizable leaderboard columns (price, context, votes, license)

Leaderboard UI (Arena): Arena added per-user leaderboard customization—columns like price per MTok, max context, total votes, license, and org can be toggled—per the customization demo. This is a small change, but it moves Arena closer to “benchmark explorer” instead of a single global ranking.

Arena.ai

@arena

Customize your Arena leaderboard. Everyone's real-world use for AI differs. Select the columns and data that matters most to you:
- Rank Spread
- Model Organization
- License
- Total Votes
- Price ($/MToken)
- Max Context

Arena.ai

@arena

Arena leaderboards now include Price and Context. - Price is shown as input / output cost per 1M tokens, and context shows the maximum context window. Compare Arena scores based on what matters for your use case.

5:01 PM · Mar 17, 2026

Arena adds GPT‑5.4 mini and nano to Text and Vision Arena matchups

Model inclusion (Arena): GPT‑5.4 mini and nano were added to Arena’s Text and Vision matchups, per the arena availability note, and Arena points users to its main site for head-to-head comparisons via the voting link. This is mainly a distribution/measurement signal: the models will start accumulating public preference data outside vendor-reported evals.

Arena.ai

@arena

GPT 5.4 Mini and Nano by @OpenAI are available in the Text and Vision Arena! Check them out and don't forget to vote, we'll see how they stack up on the leaderboards.

7:49 PM · Mar 17, 2026

🏗️ Compute & supply-chain constraints: chips, energy, and scaling bottlenecks

AI scaling constraints and compute economics: EUV bottlenecks, memory crunch signals, and the ‘Jevons paradox’ framing for why cheaper intelligence can still drive total spend up.

EUV lithography supply chain looks like a medium-term hard cap on AI scaling

EUV lithography (ASML ecosystem): A detailed thread breaks down why EUV tools are likely to be the pacing item for AI chip scaling, citing a 10,000+ supplier chain, multi-step tin-droplet laser timing, 18-mirror optics, and 3nm overlay requirements, with a projection that production may not exceed ~100 EUV machines/year by 2030, as described in the EUV bottleneck thread.

The constraint here isn’t “money” so much as specialized sub-suppliers (for example Zeiss mirrors) and ultra-tight yield sensitivity, which turns one slow component into a global throughput limit for GPU/accelerator roadmaps.

Dwarkesh Patel

@dwarkesh_sp

EUV machines are the most complicated tools humans make. Their supply chain has over 10,000 individual suppliers, and any one of them not scaling fast enough can bottleneck the entire AI industry. An EUV tool fires lasers at a tiny tin droplet three times in precise sequence, …

8:10 PM · Mar 17, 2026

“Structural Jevons paradox” framing: cheaper inference can explode total compute

Digital Intelligence Capital (economics paper): A shared paper summary claims that falling inference prices can increase aggregate compute demand by enabling more compute-intensive agent architectures, which it calls a “structural Jevons paradox.” It also describes an “endogenous depreciation” dynamic in which models lose economic value when a smarter competitor ships, as summarized in the paper summary.

The thread’s additional claim is that data/network effects and ongoing compute costs can push the market toward winner-take-all outcomes; treat that as a modeling conclusion rather than an observed industry fact.
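
The arithmetic behind the Jevons framing can be sketched in a few lines of Python. All numbers below are hypothetical and chosen only for illustration; they do not come from the paper or the thread. The point is that total spend is a product of three terms, so a 10x drop in unit price is outweighed if agent architectures consume 50x more tokens per task.

```python
def total_spend(price_per_mtok: float, mtok_per_task: float, tasks: float) -> float:
    """Total inference spend = unit price x tokens per task x number of tasks."""
    return price_per_mtok * mtok_per_task * tasks


# Hypothetical "before" scenario: single-shot prompts at a higher unit price.
before = total_spend(price_per_mtok=10.0, mtok_per_task=0.01, tasks=1_000_000)

# Hypothetical "after" scenario: price falls 10x, but multi-step agents
# use 50x more tokens per task (same number of tasks).
after = total_spend(price_per_mtok=1.0, mtok_per_task=0.5, tasks=1_000_000)

spend_ratio = after / before  # greater than 1 means total spend went up
print(f"spend ratio after/before: {spend_ratio:.1f}x")
```

Whether real workloads behave this way depends on how much token usage per task actually grows, which is the paper's modeling claim rather than something this sketch demonstrates.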

Brilliant economic paper, directly models the "Structural Jevons Paradox" happening right now in the AI industry. The cost of running an LLM is dropping, but total computing energy is exploding anyway. It mathematically proves that as the unit cost of digital intelligence and …

Rohan Paul

@rohanpaul_ai

Citadel Securities published this graph showing a strange phenomenon. Job postings for software engineers are actually seeing a spike. The graph here is short term but still it's super interesting and really strange. Is it Jevons paradox at play. When AI makes coding cheaper,

8:18 PM · Mar 17, 2026

Anthropic’s compute ramp is framed as expensive without long-term commitments

Anthropic compute procurement: A note relayed by Dwarkesh suggests Anthropic could reach 5–6 GW by year end, but at higher cost than if it had locked in long-term compute early, because cloud partners (Bedrock/Vertex) can take ~50% gross margin and short-term rates have climbed, as stated in the compute cost note.

It also claims they may need to raise model prices to suppress demand if supply can’t keep up, which is a direct “capacity meets pricing” linkage that infrastructure teams should watch.
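
To make the margin claim concrete, here is a minimal sketch of what a given gross margin implies for the buyer's price. It uses the standard definition of gross margin, (price - cost) / price; the input cost is a hypothetical placeholder, and actual contract terms between Anthropic and its cloud partners are not described in the note.

```python
def price_at_gross_margin(provider_cost: float, gross_margin: float) -> float:
    """Price a buyer pays when the seller keeps `gross_margin` of revenue.

    Gross margin is (price - cost) / price, so price = cost / (1 - margin).
    """
    if not 0 <= gross_margin < 1:
        raise ValueError("gross_margin must be in [0, 1)")
    return provider_cost / (1 - gross_margin)


# Hypothetical: compute that costs the cloud partner 1.00 unit to deliver.
price = price_at_gross_margin(provider_cost=1.00, gross_margin=0.50)
print(f"buyer pays {price:.2f} per 1.00 of underlying cost")
```

Under this definition, a ~50% gross margin means the buyer pays roughly twice the provider's underlying cost.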

Dwarkesh Patel

@dwarkesh_sp

.@dylan522p thinks Anthropic will get to 5–6 GW by year end. But at a far higher cost than it would have had to pay if it had gone crazy on compute commitments early like OpenAI did. Ant will get the compute either through cloud partners like Bedrock and Vertex who take ~50% …

10:21 PM · Mar 17, 2026

Memory-chip crunch may persist to ~2030, pressuring AI system costs

Memory supply (SK Group): A reported outlook says the global memory-chip crunch could last until around 2030, with wafer supply running 20%+ behind demand and prices expected to keep rising, as relayed in the memory crunch claim.

If this holds, it directly hits AI system cost structure (HBM/DRAM content per accelerator node) and can make “compute is cheaper” narratives false at the rack level even when model efficiency improves.

Chubby♨️

@kimmonismus

SK Group Chairman Chey Tae-won says the global memory-chip crunch may last until around 2030 Prices will probably keep rising with wafer supply running more than 20% behind demand.

5:34 PM · Mar 17, 2026

Citadel argues generative AI adoption follows an S-curve, not exponential

GenAI adoption constraints (Citadel Securities): A post argues adoption will follow a historical S-curve because physical and economic boundaries (compute, data centers, energy) halt exponential substitution; it claims that if marginal AI operating costs rise above human labor costs, firms stop substituting, as described in the adoption S-curve view.

This is a counterpoint to “efficiency always wins” narratives—grounded in capital and energy constraints rather than model capability.
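
The difference between the two growth shapes can be sketched as follows. The parameters (starting adoption of 1%, growth rate of 0.5 per period, ceiling of 100%) are hypothetical and are not taken from the Citadel post; the logistic form is one common way to model an S-curve, not necessarily the one Citadel uses.

```python
import math


def exponential(t: float, rate: float = 0.5, start: float = 0.01) -> float:
    """Unbounded exponential growth from `start`."""
    return start * math.exp(rate * t)


def logistic(t: float, rate: float = 0.5, start: float = 0.01,
             ceiling: float = 1.0) -> float:
    """S-curve with the same starting value and early growth rate,
    but flattening as it approaches `ceiling`."""
    return ceiling / (1 + (ceiling / start - 1) * math.exp(-rate * t))


# The two curves look alike early on and diverge as the ceiling binds.
for t in (0, 4, 8, 12, 16, 20):
    print(f"t={t:>2}  exponential={exponential(t):10.3f}  logistic={logistic(t):.3f}")
```

Early data points alone cannot distinguish the two curves, which is why the argument rests on the physical and economic ceilings rather than on observed adoption so far.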

Citadel Securities: Generative AI adoption will follow a historical S-curve, eventually plateauing, rather than growing exponentially. Because economic and physical boundaries will halt exponential growth. Displacing human labor demands massive compute power, data centers, and …

8:25 PM · Mar 17, 2026

Older GPUs can get more expensive as new models monetize them better

GPU rental economics (model efficiency): A clip argues that because newer models deliver much higher “value per token” on the same GPU, older hardware can become more expensive to rent. The clip contrasts “3 years ago: GPT-4 on H100” with “now: GPT-5.4 on H100,” as described in the rental inversion example.

This frames a practical planning risk: demand can rise faster than efficiency gains, so unit-cost drops don’t necessarily translate into lower total spend.

Dwarkesh Patel

@dwarkesh_sp

The value produced by models is getting so much better so fast that old hardware is actually getting *more* expensive to rent. 3 years ago, the best model you could run on a H100 chip was GPT-4. Now, you can run GPT-5.4 on it, which is smaller and cheaper to run while …

5:06 PM · Mar 17, 2026

🗂️ Retrieval, parsing, and “make docs agent-readable” pipelines

Practical doc/search pipelines for agents and RAG: PDF/layout parsing, web content extraction partnerships, and prompt-optimization loops for relevance judging.

Dropbox Dash shows a DSPy loop for optimizing a relevance judge (and how it overfits)

Dropbox Dash (Dropbox): Dropbox shared a concrete “judge improvement loop” that uses DSPy to tune a relevance judge, with the NMSE metric dropping from 8.83 (hand-tuned) to 5.11 (MIPROv2) and 4.86 (GEPA), as shown in the NMSE chart. The lesson engineers will recognize: once you treat prompts like parameters, you need the same discipline as model training.

• Overfitting modes: early runs copied example-specific keywords/usernames or even changed task parameters like the rating scale, according to the overfitting caveats screenshot.
• Operational takeaway: the write-up frames prompt optimization as a measurable loop (not vibes), but it only works if you add constraints and review edits, as reinforced by the Dash judge post header.
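
A minimal sketch of the two pieces such a loop needs: a scoring metric and a constraint check on candidate prompts. This does not use DSPy and is not Dropbox's implementation. The NMSE normalization here (mean squared error divided by the squared width of the rating scale, scaled to 0-100) is an assumption, since the digest does not give the exact formula, and the 1-5 scale, `nmse`, and `constraint_violations` are hypothetical names and values for illustration.

```python
def nmse(predicted, reference, scale_min=1, scale_max=5):
    """Normalized mean squared error between judge scores and human labels.

    Assumed normalization: MSE / (scale width)^2, reported on a 0-100 scale.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    span = scale_max - scale_min
    mse = sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)
    return 100.0 * mse / span ** 2


def constraint_violations(candidate_prompt, required_phrases, banned_terms):
    """Flag optimizer edits that changed the task or leaked example details."""
    issues = []
    for phrase in required_phrases:
        # e.g. the rating-scale wording must survive optimization unchanged
        if phrase not in candidate_prompt:
            issues.append(f"missing required text: {phrase!r}")
    lowered = candidate_prompt.lower()
    for term in banned_terms:
        # e.g. usernames or keywords copied from training examples
        if term.lower() in lowered:
            issues.append(f"leaked example-specific term: {term!r}")
    return issues


# Hypothetical usage: score a judge on held-out labels, then vet its prompt.
score = nmse(predicted=[3, 4], reference=[4, 4])
problems = constraint_violations(
    "Rate 1-10; prefer docs by alice",
    required_phrases=["1-5 scale"],
    banned_terms=["alice"],
)
print(score, problems)
```

Scoring on held-out examples and rejecting any candidate with violations before accepting a lower NMSE addresses both overfitting modes described above: copied keywords/usernames and a silently altered rating scale.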

Omar Khattab

@lateinteraction