GLM‑5‑Turbo lists at $0.96/$3.20 per Mtok – 202K context for agents
Executive Summary
Z.ai rolled out GLM‑5‑Turbo, a fast GLM‑5 variant positioned for OpenClaw-style agent loops (tool calls, long chains, timed/persistent tasks); access is staged (Pro in March; Lite in April) and the model is explicitly experimental and closed-source, with Z.ai saying learnings feed a future open release. OpenRouter’s listing puts pricing at $0.96/M input and $3.20/M output with a 202,752-token window; Z.ai also claims off‑peak usage limits are 3× through Apr 30 (excluding 2–6am ET). Early third-party signals are mixed: BridgeBench screenshots rank it #18 at 80.2 overall with 76.9% completion and weak UI/security subscores; OpenRouter provider stats show throughput swinging from 87 tps to 30 tps, undercutting the “turbo” branding in some snapshots.
• OpenClaw ecosystem: Ollama becomes an official provider via one-command onboarding; vLLM recipes route OpenClaw to any OpenAI-compatible endpoint with tool calling; KiloClaw publishes $49/mo hosted compute with “zero token markup.”
• Anthropic/Agent SDK: compliance docs bar consumer OAuth tokens from powering the Agent SDK; Anthropic staff acknowledges confusion and promises clearer guidance.
• FlashAttention‑4: kernel paper claims up to ~1613 TFLOPs/s on B200 and 1.3× over cuDNN on BF16, but it’s paper-first performance until reproduced in real training stacks.
Top links today
- OpenClaw agent framework and docs
- Ollama GitHub repo
- GLM-5-Turbo on OpenRouter
- FlashAttention-4 paper for Blackwell GPUs
- Intelligent AI Delegation framework paper
- Paper on chain-of-thought control failures
- Data center water infrastructure impact paper
- OpenClaw-RL continual RL system overview
- Agentic engineering patterns guide
- ACE open-source coding agent environment
- vLLM inference engine GitHub repo
- Morgan Stanley genAI capex era analysis
- LLM architecture gallery resource
- Hermes Agent GitHub repo
Feature Spotlight
GLM‑5‑Turbo ships: agent-optimized speed model (OpenClaw-focused)
GLM‑5‑Turbo lands as a faster, agent-optimized GLM variant with 200K context and OpenRouter/API access—likely to change cost/latency tradeoffs for always-on tool-using agents this month.
High-volume cross-account story: Z.ai’s GLM‑5‑Turbo rollout, pricing, and early benchmark chatter for long-chain, tool-using agents (often framed around OpenClaw). This category covers the model + availability details (and excludes other OpenClaw ecosystem updates).
⚡ GLM‑5‑Turbo ships: agent-optimized speed model (OpenClaw-focused)
GLM-5-Turbo launches with a Pro-in-March, Lite-in-April rollout
GLM-5-Turbo (Z.ai): Z.ai introduced GLM-5-Turbo as a high-speed variant of GLM-5 tuned for agent-driven workflows like OpenClaw (tool invocation, long chains, and persistent tasks), as described in the launch thread. Availability is staged—Pro users in March and Lite users in April—per the rollout schedule post.
Z.ai also pointed people to early-access forms for both tiers in the early access note. Details are still evolving; the launch moved quickly from announcement to availability.
GLM-5-Turbo hits OpenRouter with $0.96/M input and ~202K context
GLM-5-Turbo (Z.ai via OpenRouter): GLM-5-Turbo is now listed on OpenRouter with $0.96/M input and $3.20/M output, plus a 202,752-token context window, as shown in the pricing card screenshot and reflected on the OpenRouter listing.
The listing frames it as optimized for long execution chains and tool use, which matches the positioning in the launch thread. Pricing is now concrete. Latency and reliability will depend on providers.
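At those listed rates, per-run cost is simple arithmetic. A quick sketch (the token counts below are hypothetical, chosen to illustrate a long-chain run near the context ceiling):

```python
# Rates from the OpenRouter listing: $0.96/M input, $3.20/M output.
INPUT_RATE = 0.96 / 1_000_000   # dollars per input token
OUTPUT_RATE = 3.20 / 1_000_000  # dollars per output token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one agent run at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical long-chain run that nearly fills the 202,752-token window:
cost = run_cost(input_tokens=180_000, output_tokens=12_000)
print(f"${cost:.4f}")  # 0.1728 input + 0.0384 output = $0.2112
```

For always-on agents, the input side dominates: long chains re-send accumulated context on every turn, so the $0.96/M rate is the one that compounds.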
BridgeBench scores GLM-5-Turbo at #18 with 76.9% completion
GLM-5-Turbo (BridgeBench): BridgeMind reports GLM-5-Turbo ranked #18 on BridgeBench with 80.2 overall, 76.9% completion, and notably lower UI and security subscores (50.9 UI; 63.5 SEC), as shown in the benchmark table screenshot.
The same post claims it trails GPT-5.4 by ~15 points and Claude Opus 4.6 by ~14, per the BridgeBench summary. This is one benchmark snapshot. It’s not an official eval artifact from Z.ai.
Z.ai says GLM-5-Turbo is closed-source for now
GLM-5-Turbo (Z.ai): Z.ai says GLM-5-Turbo is an experimental release and is currently closed-source, with capabilities and findings intended to roll into the next open-source model, per the closed-source note. This is explicit. It constrains self-hosting.
Rollout timing still follows the Pro/Lite schedule described in the launch thread. What lands in open weights remains unspecified.
Charm Crush exposes GLM-5-Turbo in its model switcher
Crush (Charm): Charm says GLM-5-Turbo is immediately selectable in Crush with “no update required,” per the availability post.
The screenshot shows it in the Z.ai model list inside Crush’s “Switch Model” UI, marked as configured in the model picker view. It’s a distribution win. It reduces friction for trying the new model.
GLM Coding Plan triples GLM-5-Turbo limits during off-peak hours
GLM Coding Plan (Z.ai): Z.ai says usage limits are tripled for GLM-5-Turbo during off-peak hours, with the 2–6 AM ET window excluded from the promo, which ends April 30, per the limits update. This is a quota change. It affects sustained agent runs.
No new pricing detail accompanied the change. It’s a capacity signal.
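For schedulers deciding when to launch long agent runs, the window check is straightforward. This sketch assumes the multiplier applies everywhere except 2:00 (inclusive) to 6:00 (exclusive) AM ET; the exact boundary semantics are not spelled out in Z.ai's post:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")

def usage_multiplier(now: datetime) -> int:
    """Return the usage-limit multiplier in effect at a given moment.

    Assumption (not confirmed by Z.ai): the 3x promo applies at all times
    except the excluded window, taken here as [2:00, 6:00) AM ET.
    """
    local = now.astimezone(ET)
    if 2 <= local.hour < 6:
        return 1  # excluded window: normal limits
    return 3      # promo in effect: tripled limits

print(usage_multiplier(datetime(2026, 3, 14, 3, 0, tzinfo=ET)))   # 1
print(usage_multiplier(datetime(2026, 3, 14, 15, 0, tzinfo=ET)))  # 3
```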
GLM-5-Turbo docs emphasize tool use and timed/persistent agent tasks
GLM-5-Turbo (Z.ai): The docs emphasize GLM-5-Turbo being trained around OpenClaw-style agent requirements—tool invocation, command following, timed/persistent tasks, and long-chain execution, as shown in the docs excerpt screenshot and described in the launch thread.
The linked guide also positions it for agent integration via Z.ai APIs, with entry points collected in the API docs link set. The spec language is agent-first. Benchmarks are coming from third parties for now.
GLM-5-Turbo provider stats show throughput volatility on OpenRouter
GLM-5-Turbo (OpenRouter providers): Screenshots of OpenRouter’s provider stats show large swings for GLM-5-Turbo—one snapshot shows 2.06s latency and 87 tps in the provider stats view, while another shows 2.82s latency and 30 tps in the later stats screenshot.
Community reaction includes confusion about “turbo” speed when throughput dips, as captured in the throughput complaint. These are point-in-time measurements. They may reflect routing, load, or prompt mix.
🦞 OpenClaw ops & ecosystem: providers, plugins, and anti-spam automation
Operational and ecosystem updates around OpenClaw as an always-on agent runner (provider onboarding, plugin direction, and real-world hygiene tooling). Excludes GLM‑5‑Turbo itself (covered in the feature).
Ollama is now an official OpenClaw provider with built-in onboarding
OpenClaw + Ollama (Ollama): OpenClaw added Ollama as an official auth/provider path, with a guided flow triggered by openclaw onboard --auth-choice ollama and the claim that “all models from Ollama will work” in OpenClaw workflows, per the Provider announcement.
The practical change is fewer glue steps for teams already standardizing on Ollama (local or cloud) and wanting to run the same agent surfaces without bespoke adapters, as shown in the Provider announcement.
Running OpenClaw on vLLM is a straight OpenAI-compatible endpoint swap
vLLM + OpenClaw (vLLM Project): A short recipe shows OpenClaw working with a self-hosted model by deploying it with vLLM, exposing an OpenAI-compatible API, then pointing OpenClaw at that endpoint—claiming tool calling works “out of the box,” per the vLLM setup guide.

This is a clean operational pattern for teams that want OpenClaw’s agent loop/UI while keeping model serving local or on their own infra, as demonstrated in the vLLM setup guide.
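The reason the swap is "straight" is that vLLM's server speaks the standard OpenAI chat-completions format, including the tools schema. A minimal sketch of the request shape a client like OpenClaw would send (the base URL, model name, and tool definition here are hypothetical placeholders, and the payload is only constructed, not sent):

```python
import json

# Hypothetical self-hosted vLLM endpoint; OpenClaw would be pointed here.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "my-local-model",  # whatever name the vLLM server was started with
    "messages": [
        {"role": "user", "content": "List files in the repo root."}
    ],
    # Standard OpenAI-style tool schema; because vLLM's OpenAI-compatible
    # server accepts this format, tool calling needs no bespoke adapter.
    "tools": [{
        "type": "function",
        "function": {
            "name": "list_files",
            "description": "List files in a directory.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
}

# The request would go to POST {BASE_URL}/chat/completions.
print(json.dumps(payload)[:60])
```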
A 5-minute cron job uses OpenClaw to auto-block mention spam on X
Mentions hygiene (OpenClaw): Peter Steinberger reports an OpenClaw-powered cron that runs every 5 minutes and blocks “spam/reply guy/promo” mention accounts, saying it made replies useful again in the Mentions cleanup thread.
The concrete artifact is a daily digest showing 56 blocked profiles plus structured rationales (account signals + behavioral patterns), as shown in the Mentions cleanup thread, which is the kind of operational feedback loop that’s hard to get from manual moderation.
KiloClaw posts pricing for hosted OpenClaw compute with no token markup
KiloClaw (Kilo Code): KiloClaw published pricing for its hosted compute layer—$49/month, “zero markup on AI tokens,” and “500+ models,” with an early-bird $25/month for 6 months for the first 1,000 users, per the Pricing announcement.
The launch timeline is also spelled out: “Free trial starts tomorrow” and charges begin March 23, as stated in the Pricing announcement and detailed on the Pricing page.
OpenClaw explores more powerful plugins and Claude Code/Codex bundles
OpenClaw plugins (OpenClaw): Steinberger says he’s working on making plugins “more powerful” while keeping the OpenClaw core lean, and explicitly calls out planned support for Claude Code/Codex plugin bundles, per the Plugin roadmap note.
He also signals near-term movement by asking for a PR while “about to land this,” as shown in the PR request, which suggests plugin surfaces/APIs are actively being reshaped rather than just discussed.
OpenClaw gets labeled “bloatware” as Hermes migration talk spreads
OpenClaw vs alternatives (community): A blunt take—“Openclaw is bloatware now… switched to Hermes”—circulated via the Bloatware claim, reinforcing a recurring ecosystem tension between feature-rich agent runners and minimal “aesthetic” setups.
The same thread cluster includes claims that Hermes offers a migration script for OpenClaw users, as referenced in the Migration script mention. It’s sentiment, not a measured benchmark, but it’s the kind of narrative that influences tool adoption and contributor attention.
OpenClaw’s SF robotics hackathon shows up as an IRL builder signal
OpenClaw community (events): Photos and posts from a San Francisco OpenClaw robotics hackathon (Shack15) surfaced, showing an in-person builder cluster forming around OpenClaw, per the Hackathon post and follow-ups like the IRL hackathon update.
The visual signal includes attendees posing with a Unitree humanoid outfitted with boxing gloves, as shown in the Robot photo, which fits the pattern of OpenClaw positioning as an “always-on” agent runner people try to connect to physical systems.
🔐 Claude Code + Agent SDK access: OAuth tokens, ToS, and workflow hacks
Claude Code operational/legal friction that affects engineers shipping tooling: what auth tokens can power what, and how users are automating Claude Code locally. This is distinct from general security news.
Anthropic says Claude consumer OAuth tokens can’t be used with the Agent SDK
Claude Agent SDK (Anthropic): Anthropic’s compliance docs state that OAuth tokens from Claude Free/Pro/Max are only for Claude Code and Claude.ai, and that using them in “any other product, tool, or service — including the Agent SDK” is not permitted, as quoted in the Compliance excerpt and detailed in the Legal and compliance docs. This matters for anyone building local wrappers, parallel runners, or “Claude Code but scripted” tooling, because the boundary between “using Claude Code” and “using the SDK” is exactly where ToS risk shows up.
Anthropic says clearer Agent SDK guidance is coming after confusing token rules
Agent SDK guidance (Anthropic): An Anthropic employee acknowledges the situation is confusing and says they’re working on clearer guidance for the Agent SDK, as stated in the Anthropic acknowledges confusion reply. In a longer follow-up, they attribute some of the gaps to “incredible growth since January” and explicitly concede they “have not done fully right by Agent SDK users,” according to the Growth and triage context comment.
Claude Code power users ask if subscription OAuth can drive Agent SDK local loops
Auth boundary confusion: A builder asks Anthropic to clarify whether a subscription OAuth token can power the Claude Agent SDK “strictly for using Claude Code in a local dev loop” (including parallelizing multiple Claude Codes), and whether an open-source tool that enables this pattern can be distributed, as laid out in the Agent SDK auth questions thread. The same thread drills into what counts as “Claude Code automation” (bash scripting) versus “another product/tool” (TypeScript + SDK), highlighting why engineers are stuck choosing between a supported abstraction and a potential ToS violation, per the Script vs SDK nuance follow-up.
Auto-start Claude Code by adding it to your shell startup file
Claude Code workflow: A micro-automation pattern is to start Claude Code automatically in each new terminal by appending claude to your shell config (e.g., ~/.zshrc), as shown in the Auto-start tip screenshot.
This is presented as a “notice friction → ask Claude to fix it → repeat” loop, with a lightweight endorsement from another builder in the Reply reaction response.
Call for a single DRI on Agent SDK compliance questions to reduce FUD
Operational pattern: A concrete proposal is to name one person as the directly responsible individual for Agent SDK/compliance questions (“send all your questions my way”), on the theory that visible ownership reduces speculation and speeds up resolution, as argued in the Request a DRI post. For teams integrating Claude Code into internal tooling, this is the kind of governance mechanism that can unblock adoption when docs, product UX, and community statements drift out of sync.
🧑‍💻 OpenAI Codex & GPT‑5.4 in practice: reliability, UX, and events
Hands-on reports about Codex app/CLI workflows and GPT‑5.4 coding behavior—especially long-running task reliability and how people structure multi-threaded agent work. (Does not cover Claude-specific policy issues.)
GPT‑5.4 in Codex is still flaky on long-running tasks
GPT‑5.4 in Codex (OpenAI): A builder report says GPT‑5.4 frequently “stops early” on long tasks even with clear guidance, with missing leftovers only surfacing during code review, as described in the Long-run reliability complaint. The same post claims Cursor’s harness behaves better on similar work, and points to OpenAI Symphony as an approach that makes completion verifiable rather than assumed, as noted in the Symphony verification angle.
• Counter-signal: another practitioner says they’ve run GPT‑5.4 “non stop” since launch inside RepoPrompt (Codex app server under the hood) without these issues, suggesting harness + prompting differences may dominate, per the No issues report.
Codex power users are asking for an orchestrator UX, not more chat threads
Codex UX (OpenAI): A power user describes a daily pattern of one “main” Codex chat plus separate chats per issue/feature, but says today’s UX makes the “main chat” no more prominent than any other thread—causing token waste as sessions repeatedly rediscover repo state, per the Codex chat history critique. The post calls for an orchestrator that can reference other threads by default, manage pins for undeployed work, and still allow isolation when needed.
The immediate takeaway is that multi-threaded agent work is now blocked by UI primitives (thread list, pins) rather than model capability.
Cursor vs Codex: harness design theories from builders
Agent harness engineering: One explanation for why Cursor can feel more reliable than Codex is that Cursor appears to do deeper context building: codebase indexing (grep + semantic search + LSP graph), persisting long tool/MCP outputs to files instead of truncating, and model-specific instructions/tools tuned per provider, as laid out in the Cursor harness hypothesis. The same post speculates about multi-model worker/orchestrator setups that use a fast model to gather context and a stronger model to refine.
This frames “better coding agents” as a product of retrieval + tool-output persistence + prompt/tool tuning, not just raw model quality.
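One of the hypothesized tactics, persisting long tool outputs to files instead of truncating, is easy to sketch. This is an illustration of the pattern, not Cursor's implementation; the threshold and naming scheme are invented:

```python
import tempfile
from pathlib import Path

MAX_INLINE_CHARS = 2000  # hypothetical cutoff for inlining into context

def persist_or_inline(tool_name: str, output: str, workdir: Path) -> str:
    """Return short tool output as-is; spill long output to a file and
    give the model a reference plus a preview instead of a truncated blob."""
    if len(output) <= MAX_INLINE_CHARS:
        return output
    path = workdir / f"{tool_name}.out.txt"
    path.write_text(output)
    return (f"[output: {len(output)} chars, saved to {path}]\n"
            f"Preview:\n{output[:200]}")

workdir = Path(tempfile.mkdtemp())
short = persist_or_inline("grep", "3 matches", workdir)
spilled = persist_or_inline("build_log", "x" * 50_000, workdir)
print(short)                       # inlined unchanged
print(spilled.splitlines()[0])     # reference line, not 50k characters
```

The payoff is that the agent can later grep or re-read the full artifact on demand, rather than reasoning over a silently truncated version.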
Builders keep splitting “best coding” from “best coding UX”
Model selection in practice: One practitioner claims GPT‑5.4 is stronger than Opus 4.6 for pure coding—better with edge cases, security, and plan-following—while also saying Claude Code still wins on developer experience for CLI workflows (requirements gathering, slash commands, plugins, customization), per the Coding vs DX comparison.
This keeps showing up as a two-axis choice: code quality vs harness ergonomics. The tweet is anecdotal, but it matches how teams increasingly route tasks by “what fails less” rather than by one overall favorite model.
OpenAI’s Codex team describes a culture of frequent stack-level bets
Codex org signal (OpenAI): An OpenAI engineer says the Codex team repeatedly asks how to make the system “an order of magnitude better every few months,” citing past bets like the Codex App and an early deployment of Cerebras inference with WebSockets, as described in the Culture note. They add they’re “well under way” on a next bet that makes even top engineers nervous.
This is a signal about iteration tempo: the perceived bottleneck is end-to-end stack work (app, inference transport), not just model training.
A “Codex 2×” promo countdown is circulating with an April 2 deadline
Codex app promo (OpenAI): A community tracker site advertises “around the clock 2× usage across all Codex surfaces for paid plans,” with a countdown timer and a deadline of April 2, 2026, shown in the Promo screenshot.
It’s not an official OpenAI post in this dataset, so treat the details as unverified until corroborated elsewhere, but it’s already shaping expectations around rate limits and the Codex app’s paid tiers.
An open-source Codex mobile client ships as a stopgap via SSH
litter (community): A community-built “native mobile client for Codex” is being recommended as a way to use Codex remotely on iOS/Android via SSH until OpenAI ships official mobile support, per the Mobile client endorsement. The repository describes platform-specific apps (Kotlin/Swift) plus shared components and setup steps, as documented in the GitHub repo.
This matters for teams relying on long-running Codex sessions: it’s an early pattern for “phone as a window into the agent,” without waiting for first-party clients.
OpenAI and Notion announce a Codex workflow event in NYC on March 17
Codex × Notion (OpenAI, Notion): OpenAI Devs is promoting an in-person event at Notion’s NYC HQ on March 17 focused on Codex demos and practical workflows, as announced in the NYC event invite.

More specifics (agenda, speakers, registration mechanics) are outlined on the Event page. It’s a notable signal that Codex is being positioned as something teams can operationalize, not only a model you try in isolation.
Builders are still looking for eval patterns for OpenAI Symphony
OpenAI Symphony (OpenAI): Community questions suggest Symphony adoption is still unclear—one person asks how many people are using it in the Usage check, followed by a practical question on how teams are building evaluations for real-time APIs in the Realtime eval question.
The open gap is measurement: once latency and streaming behavior change, offline “prompt → output” eval harnesses stop matching production behavior, and teams need new ways to score partial outputs, interruptions, and tool-call timing.
🧭 Agentic coding workflows: context discipline, planning, and “can’t outsource thinking”
Practitioner patterns for getting reliable output from coding agents: planning emphasis, context management pitfalls, and iterative debugging/hardening loops. Compared to prior days, today is heavier on “workflow/UX debt” and agent attention limits.
Ask for git-diff edits to preserve structure in long plan revisions
Diff-based review prompting (doodlestein): When requesting revisions from multiple frontier models, the workflow explicitly asks for “git-diff style changes” so the model morphs the existing document instead of rewriting from scratch and dropping sections, as explained in the Diff prompt rationale and further clarified in the Why diffs help.
The same diff framing is then used to merge competing model feedback into a single hybrid revision, with the “best-of-all-worlds” synthesis step happening inside one model after ingesting the other models’ diff suggestions, per the Diff prompt rationale.
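The structural property this prompt is chasing can be seen with a plain unified diff: only the changed lines appear, so untouched sections are provably preserved. A small sketch with stdlib difflib (the plan text is invented):

```python
import difflib

original = """# Plan
## Goals
Ship v1.
## Risks
Scope creep.
""".splitlines(keepends=True)

revised = """# Plan
## Goals
Ship v1 by June.
## Risks
Scope creep.
""".splitlines(keepends=True)

# The diff names only the edited line; "## Risks" and its body never
# appear, so they cannot be dropped by a rewrite-from-scratch.
diff = list(difflib.unified_diff(original, revised, lineterm=""))
changed = [l for l in diff
           if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
print(changed)  # ['-Ship v1.\n', '+Ship v1 by June.\n']
```

Asking a model to emit this format turns "did it silently delete a section?" into a mechanical check on the diff rather than a full re-read.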
Recency wins: don’t trust CLAUDE.md/AGENTS.md to keep working mid-session
Instruction salience (Uncle Bob): A concrete reminder that agents tend to optimize for “the last thing you told it,” while the “second-to-last” and “third-to-last” instructions degrade quickly—so rules placed early (including in CLAUDE.md / AGENTS.md) become less likely to be followed as the session evolves, per the Recency warning. This maps to a practical failure mode: teams encode policies once, then assume the agent will keep applying them while the working set shifts.
The operational implication is that “rules as context” behave like a fading cache; if a constraint matters, it needs periodic re-assertion or a harness-level enforcement mechanism rather than relying on initial text staying salient, as warned in the Recency warning.
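A harness-level version of "periodic re-assertion" can be as simple as re-injecting standing rules near the end of context on a cadence, so they stay recent rather than merely present. A minimal sketch (the cadence and message shapes are assumptions, not any tool's actual mechanism):

```python
RULES = "Always run the linter before committing."
REASSERT_EVERY = 5  # hypothetical cadence, in user turns

def build_messages(history: list[dict], turn: int) -> list[dict]:
    """Rebuild the message list each turn, re-appending the standing
    rules every N turns so they sit near the end of context."""
    messages = [{"role": "system", "content": RULES}] + history
    if turn > 0 and turn % REASSERT_EVERY == 0:
        messages.append({"role": "system",
                         "content": f"Reminder of standing rules: {RULES}"})
    return messages

msgs = build_messages([{"role": "user", "content": "refactor auth.py"}], turn=10)
print(msgs[-1]["content"].startswith("Reminder"))  # True
```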
“Spec is the new code” gets pushback: context engineering and code reading still matter
Spec-driven development (ecosystem debate): Pushback argues that treating specs/plans as a substitute for reading code will “hard land” within 6–8 months; the critique is that high-level describe/plan/breakdown workflows help, but most of the value still comes from context engineering and grounding in the actual codebase, as laid out in the Spec skepticism and amplified in the PRD jab.
A related one-liner captures the boundary condition for agents: “you cannot outsource the thinking,” per the Can’t outsource thinking. The common thread is that specs are a control surface, not an immunity shield against drift.
Autoresearch mood: more tokens, less orchestration scaffolding
Autoresearch workflow (practice signal): The emerging takeaway is that autoresearch-style work benefits less from elaborate agent infrastructure and more from a minimal system that can “throw more tokens” at the problem, as stated in the Autoresearch takeaway.
A useful counterweight is that this still depends on a well-posed harness/contract: the observation that “tasteful constraints… channel the compute” shows up in the Harness constraints note, where the harness defines metrics/timeouts/policies so the extra tokens are pointed somewhere measurable.
Plan QA pattern: repeat “find blunders” passes until the critique stabilizes
Planning-as-a-system (doodlestein): A detailed planning workflow uses repeated critique passes (“5x: look over everything for blunders…”) interleaved after each substantive plan expansion; the claim is that each pass keeps finding new omissions until it converges, as shown in the Planning workflow thread and reiterated in the Why repeat 5x.
The repeat-until-stable loop is paired with “invert the analysis” prompts (what guarantees let you do things the reference system can’t) and with making the plan self-contained enough to hand to other models—so the plan becomes a portable artifact, not just a chat transcript, per the Planning workflow thread.
The agent writes faster; the bottleneck is still debugging and hardening
Verification work (Uncle Bob): Multiple notes emphasize that agent help speeds implementation, but the slow part remains making the application “perfectly solid,” and that real leverage comes from guiding the model through debugging and hardening—not from initial codegen, as argued in the Hardening reality and reinforced by the Skill to guide debugging.
A related nuance is that refactoring “cleanup tools” don’t automatically translate to tests: after heavy mutation/cleanup, he reports the agent itself flagging tests as a “hodge-podge of uncorrelated assertions,” pushing toward restructuring test suites as a different kind of work, per the Tests need restructuring.
Prompt apprenticeship: go slower at first until the agent matches your quality bar
Prompting practice (Mitchell Hashimoto): Hashimoto describes deliberately forcing himself to learn how to prompt an agent to produce results at his own quality level, accepting that it’s initially “more than double the work” and slower, per the Hashimoto quote. The emphasis is on skill-building (closing the gap between what you’d write and what the agent produces), not on raw throughput.

Simon Willison defines “agentic engineering” as coding-with-execution loops
Agentic engineering (Simon Willison): Willison added a new introductory chapter defining “agentic engineering” as building software with coding agents that can both write and execute code in a loop, drawing a line between production-oriented practice and “vibe coding,” as described in the New guide chapter and expanded in the Guide chapter. It reads like an attempt to standardize vocabulary for tool-using agents (Codex/Claude/Gemini CLIs) so teams can talk about reliability patterns instead of debating vibes.
The chapter also functions as a concise “what to optimize” list: tool access, feedback loops, and verification steps—useful framing after the earlier fireside material on the same guide, following up on Agentic engineering with an explicit definition and scope.
Auto vs Thinking mode becomes a social norm, not just a setting
Model mode behavior (ChatGPT): There’s a visible split between people who see Auto mode as underusing the system (“restrain me from telling her to turn on Thinking mode”), as said in the Plane mode complaint, and people who report using Auto/Instant for most turns (“70% of turns”), as stated in the Auto usage share. Another datapoint is that some users switch based on task type—Auto for learning/“higher EQ,” but a heavier mode for search/data science—per the Task-based mode choice.
The net signal is that “mode selection” is now part of workflow culture, and teams will end up with implicit norms and expectations about latency/cost vs thoroughness even before they write them down, as shown in the Surprised reaction.
“Beware the IDEs of March” is a shorthand for agent-tool churn
Tooling ergonomics (community signal): A short warning—“do not adopt any new code editors this month”—captures how fast agent IDEs and coding environments are shifting, and how easy it is to burn time migrating setups mid-wave, per the IDEs of March quip.
It’s not a product update, but it does reflect a constraint AI teams keep hitting: when the environment is changing weekly, “switching cost” becomes a real part of the engineering budget—even if models are improving.
🛠️ Agent developer tools: CLIs, workspaces, and self-hostable platforms
New/updated developer tools and repos that make agent workflows more usable: agent workspaces, memory tooling, local-dev utilities. Excludes core coding assistants (Codex/Claude) and excludes model releases (feature).
ACE open-sources its context/playbook platform for coding agents
ACE (aceagent): ACE has been open-sourced with a new self-host path, shifting it from a hosted workflow tool into something teams can run alongside their agent stacks, as announced in Open-source announcement with setup details in the linked GitHub repo.
• What it’s for: the repo frames ACE as “agentic context engineering”—turning prompts into evolving playbooks that capture wins/failures and reduce repeated agent mistakes, per the GitHub repo.
• Ops shape: self-host instructions are Docker-first (Postgres/Redis/FastAPI), keeping the hosted service as an option, according to Open-source announcement.
Collaborator launches an infinite-canvas workspace for agentic development
Collaborator (collaborator-ai): Collaborator is being pitched as an end-to-end environment for agentic development—terminals, context files, and running code laid out on an infinite canvas—per Product demo post and the linked GitHub repo.

The public repo describes a macOS (arm64) desktop app that stores data locally and bundles a modern editor stack (Electron/Monaco) to reduce “tab hunting,” matching what’s shown in Product demo post.
supermemory adds an agent-first CLI with scoped access and audits
supermemory (supermemory): supermemory introduced a CLI intended to make agents “first-class users,” where anything available in the platform can be executed via an agent prompt, as described in CLI launch post; it also adds scoped API access (tag-scoped permissions, read/write controls) plus audit logs for agent actions, per Scoped access details.

The positioning is explicitly “CLI over MCP for power,” while still acknowledging MCP isn’t going away, as stated in CLI launch post.
Emdash adds review presets for repeatable agent review runs
Emdash (emdashsh): Emdash added “review presets,” letting you configure a default review agent + prompt and start a review chat for a task without retyping the same instructions, as shown in Preset feature demo.

This is a small UX change, but it formalizes “default reviewer” configuration as a product surface rather than a copy/paste habit, matching the flow in Preset feature demo.
Portless is now available on Windows for named localhost URLs
Portless (Vercel Labs): Portless is now available on Windows via npm install -g portless, extending its “named URLs instead of port numbers” local-dev workflow to Windows-based teams, as announced in Windows availability note with project specifics in the linked GitHub repo.
The repo description emphasizes stable .localhost naming, proxying, and workflow support such as git worktrees, per GitHub repo.
🧩 Skills, CLIs, and extension conventions (agent install & portability)
Installable skills and conventions for distributing agent capabilities across tools (skills, bundles, CLI conventions). Excludes MCP protocol items unless the artifact is primarily a skill/installer.
Skillflag RFC: a proposed --skill convention for portable agent skill bundles
Skillflag (CLI convention): A draft spec proposes a --skill flag that any CLI can implement to expose installable “skill directories” (not just single prompt files); it centers on discovery via --skill list and export via --skill export <id> streaming a tar bundle to stdout, as shown in the Spec screenshot.
The proposal is intentionally agent-tool-agnostic (meant to be adapted by separate installers), which targets the current portability gap where every tool reinvents its own skills packaging and install paths.
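The export side of the proposal, streaming a tar bundle of a skill directory, maps directly onto stdlib tarfile. A sketch of what a `--skill export <id>` implementation and a consuming installer might do; the SKILL.md layout inside the bundle is an assumption, not part of the quoted spec:

```python
import io
import tarfile

def export_skill(skill_id: str, files: dict[str, str]) -> bytes:
    """Pack a skill directory as a tar bundle, as a CLI implementing
    `--skill export <id>` might stream to stdout."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, text in files.items():
            data = text.encode()
            info = tarfile.TarInfo(name=f"{skill_id}/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Hypothetical skill contents:
bundle = export_skill("review-checklist", {
    "SKILL.md": "# Review checklist skill\n",
    "prompts/review.txt": "Check error handling first.\n",
})

# A separate installer can inspect the bundle without touching disk:
with tarfile.open(fileobj=io.BytesIO(bundle)) as tar:
    print(tar.getnames())
```

Because the bundle goes over stdout, any installer that understands tar can consume it, which is exactly the tool-agnostic decoupling the draft is after.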
Warp adds a universal .agents/skills install target list across agent tools
Warp (Terminal): Warp now shows “Universal (.agents/skills) — always included” targets spanning multiple agent clients (Amp, Cline, Codex, Cursor, Gemini CLI, OpenCode, Warp), alongside additional per-tool install locations, as shown in the Installer targets UI.
This is an explicit move toward a shared filesystem convention for skills distribution, reducing per-tool installer logic.
CLI-Anything hits ~15K stars as a “make any software agent-ready” approach spreads
CLI-Anything (HKUDS): The repo is showing rapid adoption—one post calls out “15K stars already,” framing CLIs as a strong interface for coding agents, in the Stars and CLI note. The project positions itself as a framework to make existing software “agent-ready” by generating unified CLIs and plugin installs, as described in the GitHub repo.
json-render ships a Solid.js generative UI skill via npx skills add
json-render (Vercel Labs): A Solid.js integration is now available as an installable skill, using npx skills add vercel-labs/json-render --skill solid per the Install command. The underlying component catalog and schema-driven UI approach are outlined on the Project site.
Hermes Agent: /background runs prompts asynchronously from the CLI
Hermes Agent (Nous Research): Hermes has a built-in /background command to run a prompt asynchronously—documented inline as “Run a prompt in the background (usage: /background <prompt>)” in the Command hint screenshot.
It’s a small UX affordance, but it maps directly onto long-running agent workflows where foreground token streaming isn’t always the right default.
🧱 Agent frameworks & delegation: coordination, trust, and continuous learning loops
Framework-level ideas and systems for multi-agent coordination, delegation, and learning from experience. Today’s tweets emphasize “agents as distributed systems” and online improvement loops.
LLM teams mapped to distributed systems failure modes
Language model teams as distributed systems (arXiv): A new paper frames multi-agent LLM setups as classic distributed systems—predicting the same pain points (O(n²) communication overhead, straggler delays, and consistency conflicts) and measuring how different coordination structures trade off progress vs resilience, as summarized in the Paper screenshot.
The empirical takeaway is that decentralized teams can waste rounds communicating but recover faster when individual agents stall—giving agent builders a more principled way to choose team size and orchestration topology than “add more agents and hope.”
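The O(n²) communication point is easy to make concrete; a two-function sketch (ours, not the paper's code) of channel counts for the two extreme topologies:

```python
def fully_connected_channels(n: int) -> int:
    """Peer-to-peer team: every agent can message every other agent,
    so coordination overhead grows as n*(n-1)/2, i.e. O(n^2)."""
    return n * (n - 1) // 2

def hub_channels(n: int) -> int:
    """Centralized orchestrator: one channel per worker, O(n), but the
    hub becomes a single point of failure when it stalls."""
    return n - 1
```

At 10 agents that's 45 peer channels versus 9 hub channels—the resilience-vs-overhead trade the paper measures.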
OpenClaw-RL details how agents learn continuously from use
OpenClaw-RL (Gen-Verse/Princeton): Following up on Train by talking (continuous RL from interactions), today’s thread breaks the learning signal down into two next-state channels—evaluative feedback (good/bad) and directive hints (what to do instead)—in the Two signal types explainer.
• Async system design: The training loop is described as four parallel components—policy serving, environment/interaction collection, PRM judging, and policy training—so updates happen continuously without blocking user traffic, per the Four components and Async loops posts.
• Token-level correction: The directive path is framed as Hindsight-Guided On-Policy Distillation (OPD), where “what should have happened” is used to generate a teacher distribution and derive token-level gradients, as outlined in the How OPD works thread.
Primary artifacts are linked via the arXiv paper and the GitHub repo.
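The four-component async shape can be sketched as a toy producer–consumer chain (our simplification with asyncio queues, not the project's code): collection feeds judging, judging feeds training, and all stages run concurrently so policy updates never block serving.

```python
import asyncio

async def pipeline(n_interactions: int) -> int:
    """Toy version of the four-component loop. Returns the number of
    policy updates performed after draining all interactions."""
    raw: asyncio.Queue = asyncio.Queue()
    scored: asyncio.Queue = asyncio.Queue()
    updates = 0

    async def collect():   # policy serving + interaction collection
        for i in range(n_interactions):
            await raw.put({"traj": i})
        await raw.put(None)  # end-of-stream sentinel

    async def judge():     # PRM judging attaches a reward signal
        while (item := await raw.get()) is not None:
            item["reward"] = 1.0  # stand-in for a PRM score
            await scored.put(item)
        await scored.put(None)

    async def train():     # policy training consumes scored trajectories
        nonlocal updates
        while await scored.get() is not None:
            updates += 1   # stand-in for a gradient step

    await asyncio.gather(collect(), judge(), train())
    return updates
```

In the real system each stage is a separate service; the queue-decoupled shape is what lets updates happen continuously without blocking user traffic.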
AgentRank proposes PoW-grounded trust scores for agent networks
AgentRank (HyperspaceAI): An AgentRank release announcement positions a PageRank-like scoring system for autonomous agents, where endorsements are anchored to cryptographically verified work to make sybil attacks expensive, as introduced in the AgentRank announcement and detailed on the Paper site.
The core claim on the Paper site is that “trust” becomes a network property computed from a delegation graph, with mechanisms like recency decay and penalties for sybil clusters, aiming to support peer-to-peer agent ecosystems where you need a non-handwavy way to choose which agents to rely on.
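A minimal sketch of the PageRank-style core (our illustration only; the actual AgentRank layers PoW anchoring, recency decay, and sybil-cluster penalties on top of this):

```python
def agent_rank(endorsements, damping=0.85, iters=50):
    """Power-iteration PageRank over a delegation graph.
    endorsements: {agent: [agents it endorses]}."""
    agents = sorted(set(endorsements) |
                    {t for ts in endorsements.values() for t in ts})
    n = len(agents)
    rank = {a: 1.0 / n for a in agents}
    for _ in range(iters):
        nxt = {a: (1 - damping) / n for a in agents}
        for src in agents:
            targets = endorsements.get(src, [])
            if targets:  # endorsement mass splits across targets
                share = damping * rank[src] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:        # dangling agent: redistribute uniformly
                for a in agents:
                    nxt[a] += damping * rank[src] / n
        rank = nxt
    return rank
```

The sybil-resistance argument is precisely that this base algorithm alone is gameable—fake endorsers inflate scores cheaply—which is why the paper anchors edges to verified work.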
DeepMind proposes a delegation protocol for agents with trust and verification
Intelligent AI Delegation (Google DeepMind): DeepMind published a framework that treats delegation as a sequence of decisions—whether to delegate, how to specify roles/boundaries, how to transfer authority/accountability, and how to verify results—rather than a one-shot “tell the agent and pray,” as described in the Delegation paper summary.
Beyond the high-level protocol, the Delegation paper summary explicitly points toward mechanisms like formal trust models (to prevent over/under-delegation) and verification approaches (including cryptographic proofs / skill certificates) to make multi-party delegation networks more robust.
Agent memory is fragmenting into distinct architectural schools
Agent memory architectures: A roundup thread enumerates seven emerging approaches—Agentic Memory (AgeMem), Memex, MemRL, UMA, Pancake, Conditional memory, and a “multi-agent memory from a computer architecture perspective” framing—capturing how quickly “memory” is becoming its own design space for agents beyond a single vector DB pattern, as listed in the Memory architectures list and amplified in the Retweet list.
📏 Benchmarks & eval signals: webdev/design gaps, long-context retrieval, and leaderboards
Model comparisons and benchmark posts that inform engineering selection and regressions. Compared to earlier days, today adds more “design vs coding” and “completion rate” narratives.
MRCR v2 adds a Sonnet 4.6 1M retrieval datapoint alongside Opus’s lead
MRCR v2 long-context retrieval: Sonnet 4.6 is shown scoring 65.1% on the 8-needle MRCR v2 variant at 1M tokens, extending the retrieval discussion from MRCR chart (Opus leading at 1M) with a second Claude-family datapoint in the plot shared in Retrieval accuracy plot.
• Relative positions at 1M: The same figure shows Opus 4.6 at 78.3%, versus GPT-5.4 Pro at 36.6% and Gemini 3.1 Pro at 25.9%, as labeled directly in Retrieval accuracy plot.
The chart is a retrieval-quality reminder: larger context windows don’t help much if needle retrieval collapses at full length.
Website Arena puts GPT-5.4 near the bottom on UI/design tasks
Website Arena (benchmark): A Website Arena snapshot shared today puts GPT-5.4 at Elo 1298, with a blunt takeaway that it “can code” but “can’t design,” per the scorecard commentary in Website Arena chart; the same post notes it sits 79 points behind Claude Opus 4.6 at the top of that board.
This is a useful reminder that “good at coding” and “good at web UI” can diverge; this specific claim is benchmark-scoped rather than a general capability statement, and it depends heavily on the harness and judging rubric used in Website Arena.
BridgeBench Creative HTML shows GPT-5.4 winning a four-model web build prompt
BridgeBench Creative HTML (benchmark): A side-by-side run using the “same prompt” across GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, and Grok 4.20 Beta is presented as a head-to-head comparison, with the montage calling GPT-5.4 the winner in the final frame, as shown in Winner montage.

Treat this as a single test case unless you have the underlying prompt + judging artifact; it’s still a useful directional signal for teams tracking HTML/CSS/UI generation quality across frontier models.
Creative writing and EQ leaderboards keep GPT-5.4 near the top
Creative writing + EQ-Bench (leaderboards): A leaderboard snapshot circulating today places GPT-5.4 at the top of a “creative writing” rubric table, while separate chatter claims it ranks 3rd on EQ-Bench behind Claude models, as summarized in Results snippet and visually backed by the table screenshot in Creative writing table.
This is signal for teams choosing a default “writing” model, but it’s leaderboard-dependent; the same posts don’t include a canonical eval pack or prompt set to reproduce the exact ordering.
Grok 4.20 Beta gets framed as fast and long-context, still outside the top tier
Grok 4.20 Beta (xAI): A performance roundup frames Grok as strong on throughput—around 267 tokens/sec—with a 2M token context window and pricing called out as $2/M input and $6/M output, while still “yet to break into the Big 3,” per the summary post in Performance roundup.
The same thread uses an “intelligence index” bar chart for positioning, which is useful for quick triage but can hide task-specific gaps; no task-by-task breakdown is provided in the tweet itself.
The “AI IQ” meme resurfaces with GPT-5.4 pegged around 130
AI “IQ over time” (community metric): A meme chart claims a climb from GPT‑3.5 ~83 to GPT‑5.4 ~130, with a speculative “next frontier model ~145+,” framing it as a rising “cognitive ceiling,” as shown in the chart screenshot shared in IQ timeline chart.
The post itself caveats that IQ is “not a great way” to judge overall model quality, so this functions more as a vibe-y proxy for perceived reasoning gains than an engineering-grade eval.
🚀 Inference & self-hosting: vLLM, Apple Silicon caching, and tool-call compatibility
Serving/runtime engineering and “run it yourself” workflows: vLLM endpoints, KV-cache reuse, batching, and compatibility layers. Excludes chip roadmaps (covered under infrastructure/hardware).
oMLX speeds up local Claude Code on Mac by reusing KV cache
oMLX (jundot): A Claude Code local-LLM setup report pins the biggest latency win on prefix/KV-cache reuse rather than model choice: switching from mlx_lm (no effective KV reuse for repeated system prefixes) to oMLX reportedly cut prefill latency by ~10× thanks to tiered KV caching (RAM+SSD) and continuous batching, per the caching details follow-up, with the implementation in the GitHub repo.
The same thread notes a practical sizing constraint—a Mac Studio is suggested as comfortable around Qwen3.5 9B—using a model-fit estimator referenced in the hardware sizing tip.
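To see why prefix reuse dominates prefill cost, here's a toy cache keyed on token-prefix hashes (illustrative only; oMLX's tiered RAM+SSD implementation is far more involved):

```python
import hashlib

class PrefixKVCache:
    """Toy prefix cache: if a request starts with a previously seen
    prefix, skip recomputing that prefix's 'KV state' (a stand-in
    tuple here; real caches store attention key/value tensors)."""

    def __init__(self):
        self._store = {}  # prefix hash -> precomputed state

    def _key(self, tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def prefill(self, tokens):
        """Return (cached_prefix_len, state), computing only the suffix."""
        best = 0
        for n in range(len(tokens), 0, -1):  # longest cached prefix wins
            if self._key(tokens[:n]) in self._store:
                best = n
                break
        cached = self._store.get(self._key(tokens[:best]), ())
        state = cached + tuple(tokens[best:])  # "compute" only the suffix
        self._store[self._key(tokens)] = state
        return best, state
```

Because agent loops resend the same long system prompt on every turn, the cached-prefix length approaches the full prompt length, which is where the reported ~10× prefill win comes from.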
Claude Code can target a local Messages API backend, but headers can break caching
Claude Code (Anthropic): A concrete configuration pattern shows Claude Code routing to any backend that implements the Anthropic Messages API by setting ANTHROPIC_BASE_URL=http://localhost:8000, as described in the routing config. It also calls out a caching gotcha: Claude Code’s default Attribution Header can change the prefix and invalidate prefix/KV caches, and the workaround is CLAUDE_CODE_ATTRIBUTION_HEADER=0, per the same routing config note and the referenced Unsloth guide.
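The reported pattern, expressed as environment variables (names taken from the post; verify both against your Claude Code version before relying on them):

```shell
# Route Claude Code to a local Messages-API-compatible backend
export ANTHROPIC_BASE_URL=http://localhost:8000
# Disable the attribution header so the request prefix stays stable
# and server-side prefix/KV caches keep hitting
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
```

With both set, launching Claude Code in that shell sends requests to the local server with an unmodified prefix.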
Run OpenClaw against a self-hosted vLLM endpoint with tool calling intact
vLLM + OpenClaw (community): A short guide shows how to run OpenClaw against your own vLLM deployment by exposing an OpenAI-compatible API and pointing OpenClaw at it; the claim is that tool calling works without custom glue, which makes vLLM a practical “bring your own weights” serving layer for OpenClaw workflows, as described and shown in the setup steps.

The workflow is framed as: deploy model in vLLM → expose OpenAI-compatible endpoint → configure OpenClaw to use that base URL; the post uses Kimi K2.5 as the example model, with details linked from the setup steps video.
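A plausible shape for the vLLM side of that workflow (the model id is a placeholder, and the tool-call parser must match your model's tool-call format, e.g. `hermes` or `mistral`):

```shell
# Serve a model behind an OpenAI-compatible API with tool calling enabled
vllm serve <your-model-id> \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

# Then point OpenClaw's OpenAI-compatible base URL at:
#   http://localhost:8000/v1
```

The two tool-calling flags are what let agent frameworks receive structured tool calls instead of raw text, which is the “no custom glue” claim in practice.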
📄 Research drops: architecture tweaks, kernels, memory, and controllability
New papers/technical reports referenced today, spanning architecture efficiency, attention kernels, KV-cache management, and chain-of-thought controllability. (Non-product research only.)
FlashAttention-4 tunes attention kernels for Blackwell’s tensorcore-vs-memory imbalance
FlashAttention-4 (Princeton/Meta/NVIDIA et al.): The FlashAttention team reports FlashAttention-4, a Blackwell-focused attention kernel redesign that targets the new bottlenecks where tensorcore throughput rises faster than shared memory and exp units, as outlined in the Paper screenshot.
• Performance claims: The thread cites up to ~1613 TFLOP/s (~71% utilization) on B200, with up to 1.3× over cuDNN and 2.7× over Triton on B200 BF16, per the Paper screenshot.
• Engineering angle: The paper emphasizes algorithm–kernel co-design (asynchronous MMA, tile sizing, softmax work reduction, backward-pass traffic reductions) and notes implementation in CuTe-DSL embedded in Python with 20–30× faster compile times, per the Paper screenshot.
Attention Residuals: learned cross-layer retrieval as a drop-in residual upgrade
Attention Residuals (Moonshot/Kimi): Moonshot proposes AttnRes, replacing uniform residual accumulation with learned, input-dependent attention over prior layers to reduce representation dilution and hidden-state growth, as described in the Paper thread and the linked Technical report.
• Scaling and practicality: A “Block AttnRes” variant compresses layers into blocks to make cross-layer attention feasible at depth, while claiming a consistent 1.25× compute advantage with <2% inference latency overhead, as summarized in the Paper thread and echoed by the Scaling law note.
• Validation target: Results are reported on Kimi Linear (48B total, 3B activated), with training-dynamics analysis suggesting more uniform gradients across depth, per the Training dynamics note.
CoT-Control: models can steer outputs far more than their chains of thought
CoT controllability (OpenAI/UPenn/NYU et al.): A new evaluation suite finds current reasoning models are poor at selectively controlling their hidden chain-of-thought compared to controlling final answers—e.g., they can follow output constraints much more reliably than CoT constraints, according to the Paper snapshot.
• Key quantitative result: The post cites models only successfully hiding disallowed words in their CoT ~3% of the time, with an example gap for Claude Sonnet 4.5 of 2.7% CoT control vs 61.9% output control, per the Paper snapshot.
• Training interaction: It also claims CoT controllability tends to decrease with more RL, more test-time compute, and higher difficulty, leaving the authors “cautiously optimistic” about CoT monitorability for now, as stated in the Paper snapshot.
LMEB: embedding evals for long-horizon, fragmented memory retrieval
LMEB (KaLM-Embedding): LMEB introduces a benchmark to measure embedding models on long-horizon memory retrieval (episodic, dialogue, semantic, procedural), arguing standard passage-retrieval leaderboards miss the retrieval patterns agent memory systems need, per the Benchmark card and the linked Paper page.
• Scope and takeaway: The summary cites 22 datasets and 193 zero-shot retrieval tasks, and reports that LMEB and MTEB are “orthogonal” (traditional retrieval performance doesn’t predict long-horizon memory retrieval), with “larger isn’t always better,” as described in the Benchmark card.
LookaheadKV: KV-cache eviction with future glimpses, minus draft generation
LookaheadKV (Samsung Research): LookaheadKV proposes KV-cache eviction that estimates token importance “looking ahead” without generating drafts, aiming to cut long-context inference overhead while preserving quality, per the Paper card and the linked Paper page.
• Efficiency claim: The post highlights up to 14.5× lower eviction cost versus prior lookahead-style approaches that rely on draft generation, along with faster inference/TTFT on long-context workloads, as stated in the Paper card.
Transformers-as-interpreters framing resurfaces: deterministic code “inside the forward pass”
Transformers Turing-complete discussion: A thread claims Transformers can be trained to run arbitrary programs by embedding an efficient assembly interpreter in the forward pass, enabling deterministic execution “in its own weights” rather than via an external sandbox, as asserted in the Interpreter claim.
The tweet doesn’t cite a specific paper or artifact, so treat it as a conceptual signal rather than a verifiable result from this dataset.
🏗️ AI infrastructure signals: capex, datacenter constraints, and inference bottlenecks
Compute/capex and infrastructure constraints with operational impact (water, CPUs, debt/refinancing narratives). This is the one place for non-tool, non-model infra signals today.
Data center water paper projects up to 1,451 MGD of new peak capacity by 2030
Small Bottle, Big Pipe (UC Riverside/RIT/Caltech): US data centers are projected to need 697–1,451 million gallons/day of new peak water capacity by 2030, following up on Water peaks (local peak-demand constraints); the paper also estimates up to $58B in public-water infrastructure spend may be needed, as described in the paper summary and reiterated with hotspot notes in the capacity breakdown.
• Why this matters operationally: the argument isn’t “national water share,” it’s “hot-day spikes”; the paper claims data centers evaporate ~75% of intake from public supplies and proposes gating hookups on funding capacity expansions, as summarized in the paper summary.
NVIDIA tees up CPUs as the bottleneck for agentic AI workflows ahead of GTC
NVIDIA (GTC preview): a CNBC preview claims CPUs are “becoming the bottleneck” for agentic AI workflows, with NVIDIA expected to unveil more CPU details at GTC—continuing the CPU-capacity chatter from CPU squeeze—as shown in the CNBC preview screenshot.
• Competitive context: the same preview notes Intel/AMD lead data-center CPUs, while NVIDIA is positioning its CPU strategy as part of the agent stack, per the CNBC preview screenshot.
“AI water issue is fake” counterpoint frames AI’s direct water use as ~0.008% of US total
AI water debate: a long counterpoint post argues the “AI water crisis” narrative is misplaced, following up on Water peaks (local peaks, not national share), by estimating US data centers at ~0.2% of total water use and direct onsite use at ~0.04%, with AI at ~20% of that (~0.008% total) as quoted in the blog recap and linked via the blog post.
• What it doesn’t resolve: even this framing concedes localized infrastructure stress can be real; it mainly argues the national-scale rhetoric is off relative to other sectors, per the blog recap.
Morgan Stanley frames 2026 as “gen-AI-capex-powered” investment-led growth
Morgan Stanley (capex framing): Fortune excerpts from a Morgan Stanley Wealth Management report describing a “gen-AI-capex-powered” era—an investment-led “reindustrialization renaissance” that’s “better for computers than humans,” per the Fortune excerpt.
• Why infra teams care: it’s another signal that AI spend is being treated as durable industrial buildout (chips, power, data centers), not a short-lived product cycle, as framed in the Fortune excerpt.
Software sector faces a 2028 debt wall of roughly $40B in maturities
Software financing (macro constraint): a circulating chart/claim says ~$40B in software and services debt matures in 2028, raising refinancing-risk questions for software vendors during high AI capex/opex cycles, as amplified in the debt wall repost.
🛡️ Safety & policy edges: jailbreaks, bot/slop defenses, and guardrail debates
Security, misuse, and governance issues that affect deploying AI systems: jailbreak chatter, spam/bot mitigation, and “p-hacking at scale” concerns. Excludes Claude OAuth/ToS specifics (covered under Claude Code category).
Pattern: use an LLM classifier to auto-triage and block mention spam
Anti-spam ops pattern: One practitioner reports running an automated mention-cleanup loop every 5 minutes, claiming it’s “really good at detecting spam/reply guy/promo stuff,” and sharing a daily digest showing 56 blocked profiles with per-account rationales, as shown in the Block digest screenshot.
• What’s concrete here: tight cadence (5 min), an auditable log/digest, and decision rationales; this is closer to an internal trust/safety workflow than a one-off “blocklist” script.
It’s still a classifier loop without human verification (risk of false positives), but the operational shape—scored decisions + audit trail—translates well to other community surfaces (support forums, Discord intake, app feedback channels).
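The operational shape—classify, decide, log a rationale—can be sketched in a few lines (the keyword matcher here is a hypothetical stand-in for the LLM classifier call):

```python
from datetime import datetime, timezone

def classify(mention: dict) -> tuple[bool, str]:
    """Stand-in for the LLM classifier: returns (is_spam, rationale).
    A real loop would prompt a model; these keywords are illustrative."""
    text = mention["text"].lower()
    spam = any(k in text for k in ("airdrop", "dm me", "promo"))
    return spam, "matched promo keywords" if spam else "looks organic"

def triage(mentions: list[dict]) -> list[dict]:
    """One pass of the periodic loop: score each mention and emit an
    auditable decision record (the 'digest' described in the post)."""
    digest = []
    for m in mentions:
        is_spam, why = classify(m)
        digest.append({
            "user": m["user"],
            "action": "block" if is_spam else "keep",
            "rationale": why,
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return digest
```

The per-record rationale is what makes false positives auditable after the fact, which is the part worth copying even if the classifier itself changes.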
Universal jailbreak snippet circulates again, raising baseline prompt-hardening pressure
Prompt injection / jailbreak chatter: A “baby’s first universal jailbreak” snippet resurfaced in the wild, with the follow-on “uh oh” implying it works broadly across targets rather than being model-specific, per the Jailbreak snippet and Follow-up thread. For builders shipping assistants, this mainly translates into renewed pressure on instruction hierarchy, tool-output sanitization, and least-privilege tool scopes, because jailbreak memes tend to get copy/pasted into real support channels fast.
The tweets don’t include a reproducible eval artifact or a concrete success rate, so treat it as a distribution signal (what users will try) rather than a measured capability report.
Warning signal: autonomous ‘science agents’ risk p-hacking failure modes
Scientific-method guardrails (Ethan Mollick): Mollick flags that scaling up agentic hypothesis generation without modern scientific norms could produce “p-hacking at scale,” arguing the real risk isn’t just wrong answers but systematically misleading ‘findings’ when systems pivot repeatedly until something looks good, as shown in the P-hacking at scale warning.
• Why it matters operationally: This maps directly to how teams design evaluation loops for research-y agents—if success metrics are under-specified, agents can optimize for superficial wins (novelty, significance, “interestingness”) instead of robustness.
The post is a warning, not a new framework; it’s pointing at a governance gap more than proposing a fix.
Meta AI surfaces an “AI Detector” entrypoint in its UI
Meta AI (Meta): A new “AI Detector” navigation item showed up in Meta AI’s UI, but the destination page errors as unavailable, indicating an early/partial rollout or a feature flag not yet live, per the AI Detector nav leak.
For engineers, the key signal is product direction: Meta appears to be building first-party AI-origin detection UX into the assistant surface (even if accuracy/coverage and the underlying detector model aren’t described here).
🎬 Gen media & creative AI: video rollouts, cinematic summarization, and design-by-prompt
Generative media + creative tooling updates with practical implications (video model rollouts/guardrails, new overview formats, rapid brand/UI kit generation).
NotebookLM rolls out Cinematic Video Overviews to Pro accounts
NotebookLM (Google): Google started rolling out a new Cinematic option for NotebookLM’s Video Overviews to Pro accounts, positioning it as a more “immersive” visual storytelling format rather than the existing Explainer/Brief styles, as shown in the rollout screenshot.
The practical change is the format selector now steers the generation style (cinematic vs structured vs bite-sized) and exposes a customization prompt box with examples for narrative framing and visual style, as visible in the rollout screenshot.
Freepik Spaces chains 4 text inputs into logo, UI kit, and animation
Freepik Spaces (workflow pattern): A shared 5-step node workflow turns four text fields (brand name, style, object, palette) into logos, then a button-style asset, then a full UI kit, and finally an infinite-loop animation—claimed end-to-end in about 6 minutes in the workflow thread.

It’s presented as a reusable “prompt DAG” pattern: keep variables as first-class nodes, wire them into image/video model nodes (the thread references Nano Banana and Kling), and then duplicate the Space to reuse the whole pipeline, as linked via the Space workflow.
Seedance 2.0 pause turns into a test of narrative control and IP guardrails
Seedance 2.0 (ByteDance): Reports say ByteDance paused the worldwide rollout after studio copyright complaints, even as Seedance’s big draw was improved visual consistency and tighter camera-move control for creators, per the pause report.
• Narrative workflow pressure: A creator-focused deep dive frames Seedance 2’s narrative use as fragile in practice—even with consistency gains—according to the narrative deep dive.
The combined signal is that better temporal consistency is no longer sufficient by itself; the rollout bottleneck becomes policy and guardrails when the model’s “narrative” affordances collide with protected IP, as described in the pause report.
Nano Banana prompt pattern: “altered artifact” half-paintover images
Nano Banana (prompting pattern): A prompt pattern is circulating for generating “museum painting that’s half painted over” images, where the alteration stays consistent across the canvas and extends to the frame itself (gold→white) in the example output shared in the prompt + output.
This is being used as a controllability check for image models: can they preserve a classical composition while applying a precise, localized overwrite (half-and-half) without turning the result into noise? The same idea is echoed by a related “artifact altered” scene prompt in the museum alteration example.
Oscars-stage message draws a harder line: “Animation is more than a prompt”
Hollywood x AI (creator signal): During the Oscars presentation for Best Animated Short, actor Will Arnett explicitly framed animation as “more than a prompt” and said it “deserves protection,” per the stage quote.
For teams shipping generative media features, this is another public marker that creator-facing industries are treating prompt-based generation as a labor and rights issue, not just a tooling shift, as stated in the stage quote.
🤖 Robotics & embodied AI: open hands, humanoid skills, and agent-in-robots demos
Embodied AI progress and open hardware that matters to builders integrating perception + control loops. Today includes both open-sourced dexterity hardware and humanoid athletic demos.
ORCA Dexterity open-sources 3 anthropomorphic hands with a tactile option
OrcaHand (ORCA Dexterity): ORCA Dexterity open-sourced three tendon-driven anthropomorphic robotic hand designs that aim for reliability via self-dislocating joints, with build guidance suggesting ~$2,200 in parts and <8 hours of assembly time, as described in the Open-source hand details thread.

For labs and product teams, the interesting bit is the “good enough dexterity + reproducible BOM” combination: one variant (“orcahand touch”) includes dense fingertip tactile sensing (up to 83 taxels per fingertip, ~1mm resolution, ~0.1N force detectability) per Open-source hand details.
Open-source “OpenClaw inside” pitch extends to drones and humanoids
Dimensional + OpenClaw (robot agents): Posts claim drones and humanoid robots are being operated with OpenClaw “inside,” alongside a fully open-sourced repo and a pitch to “vibecode” robot behaviors in natural language across sensor streams (cameras/lidar) and actuators, as stated in the Robots with OpenClaw claim.

This is relevant for embodied-AI engineers because it’s an explicit attempt to make the agent loop a first-class robotics module (subscription to perception streams down to control loops), rather than treating the LLM as an external planner bolted onto ROS tooling.
Tsinghua humanoid tennis demo shows coordinated vision-to-swing behavior
Humanoid tennis (Tsinghua University): A new humanoid tennis demo shows a robot tracking the ball and returning serves with stable footwork and racket control, as captured in The Rundown’s Tennis demo clip.

This matters for embodied-AI builders because it’s a clean example of the full perception→prediction→whole-body control loop in a fast, contact-rich task (timing errors are immediately obvious), rather than slow pick-and-place.
Biped robot walks untethered, highlighting fast iteration in locomotion rigs
Biped locomotion demo: A clip contrasts last year’s tethered “attached” setups with a newer biped walking on its own, suggesting locomotion stacks are getting less dependent on external support gear in short order, per the side-by-side framing in Untethered walking clip.

For teams shipping real robots, untethering is a practical milestone: it forces power, balance recovery, and safety handling into the integrated system rather than the lab apparatus.
Humanoid robots filmed training for a Beijing half-marathon
Humanoid endurance (Beijing): A video shows humanoid robots running outdoors at night ahead of a half-marathon scheduled about a month out, as posted in Half-marathon training clip with an additional angle in Second view clip.

This is an endurance-and-reliability signal more than a “new trick” demo: continuous operation stresses thermal limits, falls/recovery frequency, and long-horizon policy stability under drift (battery, terrain, lighting).
🏢 Enterprise adoption & market signals (non-infra): Palantir, Anthropic, and agent ROI framing
Enterprise adoption narratives, pricing/packaging, and valuation explanations tied directly to AI product deployment. Excludes core infra/capex items (in Infrastructure).
DoD AI director demo fuels Palantir valuation narrative
Palantir (Palantir): A clip of a US DoD AI director walking through Palantir’s system is getting circulated as “why the valuation has exploded,” emphasizing real-time analytic depth and operational UX rather than model novelty, per the DoD demo clip framing.

The practical read for AI engineering leaders is that the “enterprise AI” story investors reward is still integration, decision support, and auditability in high-stakes workflows—software that reliably binds data + permissions + interfaces, not a new model checkpoint.
“Apple runs on Anthropic” claim resurfaces as ARR narrative
Anthropic (Enterprise adoption): A viral claim asserts that “Apple runs on Anthropic,” describing Anthropic as powering internal product development and tooling, positioned as an explanation for strong enterprise ARR dynamics in the Enterprise dependence claim.
There’s no corroborating primary artifact in the tweets (contract, case study, or product screenshots), so treat it as a market narrative signal: buyers care less about public benchmarks and more about whether a model vendor becomes embedded into internal workflows.
Genspark pitches “AI employee” workspace; cites $200M run rate and $385M Series B
Genspark AI Workspace 3.0 (Genspark): The company is claiming $200M annual run rate (doubling in two months) and a Series B extension to $385M, while positioning “Genspark Claw” plus a dedicated “cloud computer” as an “AI employee” model, according to the Workspace 3.0 claim.
Even without technical details in the thread, the packaging is a market signal: vendors are bundling agents with managed execution surfaces (not just chat + API) as the sellable enterprise unit.
AI “exposure” doesn’t equal displacement: demand elasticity argument
Labor demand & AI (Market signal): Box CEO Aaron Levie argues that “AI exposed tasks” can increase hiring and wages depending on demand elasticity and task mix, using a concrete software example where a project shifts from “50 engineers” to “10 engineers with AI agents,” changing ROI and enabling hiring that previously wouldn’t happen, as described in the Elasticity explanation.
This is a useful counterweight to simplistic “automation = fewer jobs” narratives when analyzing enterprise adoption: cost drops can expand the project set that gets funded.
Codex × Notion NYC event pitches practical enterprise coding workflows
Codex (OpenAI) + Notion: OpenAI Devs is promoting a March 17 NYC event with Codex demos, “practical workflows,” and builders networking, as announced in the Event announcement with registration details on the Event page.

For teams evaluating agentic coding in org settings, the interesting part is what they choose to demo as repeatable workflow (handoff, review, deployment), not raw codegen capability.
Private equity “margin extraction” cycle applied to software products
Software tooling economics: A widely shared framing describes a product with 20% margin that “could be 40%,” where PE captures the “irrational 20%” spent on quality/support—then sells again after cuts, as laid out in the Margin extraction cycle.
In AI tooling, this maps cleanly onto tension between aggressive cost control (tokens, support headcount, eval infra) and the reliability expectations of agent-heavy workflows.
👥 Workforce & sentiment: AI exposure, unemployment narratives, and adoption gap
AI’s impact on work and developer psychology: job exposure tools, unemployment forecasts, and the widening gap between public sentiment and actual usage. This is included because the discourse itself is the news today.
Karpathy’s job “AI exposure” map spreads, with big caveats on interpretation
karpathy.ai/jobs (Andrej Karpathy): A BLS-based occupation explorer scored 342 job types on a 0–10 “AI exposure” scale; posts cite 143M total jobs, an average exposure around ~5, and claims like ~57M jobs at high/very-high exposure, alongside $3.7T in wages for jobs scored ≥7, as shown in the treemap screenshot and described in the tool summary.
• Repo turbulence: Multiple accounts say the original GitHub repo was deleted quickly, with forks and demos circulating per the fork links.
• Exposure is not displacement: Commentary stresses “EXPOSURE DOES NOT MEAN THREAT OF DISPLACEMENT” in the exposure warning RT, and others argue exposure can also raise demand/wages via elasticity effects in the elasticity discussion.
The artifact is being used as a conversation anchor, but the tweets themselves repeatedly flag that the score is a rough proxy—not a forecast.
Pew: 56% of experts optimistic on AI vs 17% of the public; teen chatbot cheating is common
AI sentiment data (Pew Research Center): Pew findings circulating today highlight a persistent optimism gap: 56% of AI experts expect positive impacts vs 17% of the public, while ~50% of Americans feel more concerned than excited. The same thread reports that classroom cheating via chatbots is widespread (roughly 60% of teens say classmates use them to bypass schoolwork, and ~33% say this happens extremely often), as summarized in the [key findings thread](t:204|key findings thread) and linked to the [Pew report](link:537:0|Pew report).
• Workplace adoption baseline: The thread also cites ~21% of adults using AI at work, up from 16% in 2024, as described in the [adoption stat roundup](t:204|adoption stat roundup).
These numbers are being used to explain why “AI discourse” can feel stuck even as usage inside schools and workplaces climbs.
ServiceNow CEO frames “mid‑30%” unemployment risk as agents spread
AI jobs narrative (ServiceNow): Bill McDermott reiterates that graduate unemployment is “~9%” today and claims it “could easily go into the mid‑30s in the next couple of years” as non-differentiating roles get automated by agents, according to the [CEO clip](t:230|CEO clip) and echoed in the [recap post](t:165|recap post).

This continues the storyline covered earlier under Grad unemployment (agents-driven unemployment warning), but with a more specific “mid‑30s” figure and a “next couple of years” timeline.
Builders point at a widening gap between AI sentiment and day-to-day adoption
Adoption gap (workplace + culture): A recurring take is that public talk about AI is negative while practical usage keeps expanding—“people hate on AI, yet usage keeps growing fast,” as stated in the [sentiment thread](t:354|sentiment thread). Another datapoint in the same direction is a Coatue investor claiming “85% of what I do… can be done by AI,” in the [interview clip](t:162|Coatue quote clip).
The claim here is not that sentiment is improving; it’s that adoption is decoupling from it, especially where AI produces small, concrete wins.
AI fixation posts: “struggling to think about anything but AI”
AI compulsion (developer psychology): A micro-trend in posts is people describing attention capture as a real-life problem—“I’m genuinely struggling to think about anything but AI” in the [personal note](t:63|fixation post), followed by “it’s actually affecting my life” in the [follow-up](t:73|follow-up post).
It’s not a measurement, but it is a recognizable sentiment signal that shows up alongside the broader “adoption vs public mood” split.