OpenAI Codex plugins ship Slack-to-Drive workflows – 39 Vercel skills, limits reset


Executive Summary

OpenAI rolled out installable plugins in Codex across app/CLI/IDE, bundling tool auth plus reusable workflows so the agent can operate inside Slack, Figma, Notion, Gmail, and Google Drive/Docs/Sheets/Slides; docs emphasize a local distribution model (personal/team “marketplaces”) and even packaging MCP servers, pushing Codex from code editing into coordination loops. OpenAI staff also said usage limits were reset across plans to encourage plugin experimentation; internal anecdotes claim Codex is spreading to comms and sales once tool access is unified, but no external usage numbers or duration/guardrails for the reset were published.

Vercel/Codex: plugin support lands in Codex and Codex CLI; ships 39 platform skills, 3 specialized agents, and real-time code validation.
Box/Codex: Box plugin demos doc→structured JSON extraction over enterprise content; positioning is “connector layer removed,” still demo-sourced.
Claude (Anthropic) ops: weekday peak-hour pacing burns 5-hour sessions faster; Anthropic estimates ~7% of users newly hit limits; cache-miss explanations circulate.

The throughline is agent surface area becoming installable and standardized; the missing piece is independent measurement of reliability and blast radius once plugins start writing into production tools.


Feature Spotlight

Codex plugins arrive: first-party tool integrations as the new baseline for coding agents

Codex plugins turn the coding agent into a tool-connected worker (Slack/Figma/Notion/Gmail/Drive etc.). This shifts teams from “agent in IDE” to “agent in the company stack,” with immediate workflow + quota implications.


🧩 Codex plugins arrive: first-party tool integrations as the new baseline for coding agents

High-volume story: Codex rolls out installable plugins that bundle tool auth + reusable workflows, pushing agents beyond code editing into Slack/Figma/Notion/Google Workspace loops. Includes follow-on signals like quota resets and early examples of enterprise content automation; excludes other assistants’ releases.

Codex plugins roll out across app, CLI, and IDE extensions

Codex plugins (OpenAI): OpenAI is rolling out plugins in Codex, positioning them as the bridge from “write code” to the planning/research/coordination work that surrounds coding, as described in the rollout thread and the accompanying Plugins docs.

CLI plugin install flow
Video loads on view

The initial set targets the tools most teams already live in—Slack, Figma, Notion, Gmail, and more—while the system design bundles app auth + reusable skills (and can package MCP servers) into installable units, per the explanation in the Plugins definition and details in the Plugins docs.

CLI + local distribution model: the docs describe local/personal marketplaces and scaffolding workflows (including a plugin-creator skill), so teams can standardize a “known-good” tool setup across repos, as detailed in the Plugins docs.
Drive as the canonical demo: OpenAI highlights the Google Drive plugin spanning Drive/Docs/Sheets/Slides “in one loop,” which is the kind of multi-surface workflow Codex couldn’t reach previously, as shown in the rollout thread.

OpenAI claims Codex is spreading to non-technical teams as plugins land

Codex usage (OpenAI): An internal adoption signal shows up alongside the plugin rollout: OpenAI staff claim Codex has “taken over” day-to-day work across the company, with non-technical teams like comms and sales using it once it’s plugged into the same tools engineering uses, per the Internal adoption note.

In parallel, individual power users describe using Codex for calendar management, bug triage, and keeping up with team activity—work that becomes feasible once the tool access and auth are unified through plugins, as described in the Power user workflow.

OpenAI resets Codex usage limits across all plans for the plugins launch

Codex quotas (OpenAI): OpenAI staff say they reset Codex usage limits across all plans so people can try the newly launched plugins, as stated in the Reset announcement and echoed in the Limits reset note.

The messaging frames this as an ops-side “clear the runway” move for the plugin rollout, with some posts describing the result as effectively “unlimited things” during the reset window, per the Reset announcement. The exact duration/guardrails of the reset aren’t specified in these tweets.

Box ships a Codex plugin for automations over Box content

Box plugin in Codex (Box): Box says it launched a Codex plugin that lets Codex access Box-hosted documents and automate workflows “around it,” with a demo that turns earnings-call documents into structured data, as shown in the launch demo.

Earnings-call extraction to JSON
Video loads on view

The pitch is that enterprise content systems become usable inputs to coding-agent pipelines (extract → structure → route to downstream systems) without building a bespoke connector layer, according to the launch demo.

Vercel plugin adds platform skills and validation inside Codex

Vercel plugin for Codex (Vercel): Vercel says its plugin is now supported in OpenAI Codex and Codex CLI, shipping with 39 platform skills, three specialized agents, and real-time code validation, as described in the Plugin announcement and the linked Changelog post.

This is a concrete example of plugins being used to make an agent “opinionated and correct” about a specific platform surface (deploy/config/debug), rather than relying on general model knowledge.

Codex plugins are being used as a daily “digest” automation layer

Plugins + automations (Codex): One concrete workflow pattern emerging is using Codex plugins plus a skill/automation chain to generate a one-page daily update (in this case, “public discourse around Codex”) and physically print it, as described in the Printed update example.

What’s notable here is the “before code” and “after code” work: collecting context from chat/email/docs, summarizing, and routing it to a human-readable artifact—work that previously required a pile of manual tool switching, per the Printed update example.

Codex plugins are being used to draft Google Slides in corporate templates

Codex → Slides workflow (Plugins): A second practical pattern: generating a first draft of a slide deck directly in Google Slides from Codex by using plugins, including applying an existing corporate slide template to structure the deck, as described in the Slides drafting example.

This is a small but concrete example of plugins turning Codex into a “real work” tool for artifacts that aren’t code, while still keeping the work close to an agent loop.


⏱️ Claude quota & reliability turbulence: peak-hour session burn and user fallout

Today’s Claude story is operational: Anthropic adjusts 5-hour session pacing during peak hours, with reports of Max/Pro sessions burning unusually fast and intermittent Claude Code outages. Focus is on shipping impact, cache-miss explanations, and user mitigation tactics.

Anthropic changes Claude session pacing during weekday peak hours

Claude (Anthropic): Anthropic is modifying how the 5-hour session limit is consumed during weekday peak windows (5am–11am PT / 1pm–7pm GMT), so users will “move through” session limits faster, while weekly limits stay the same, as detailed in the limit pacing thread. Anthropic says it has offset demand with efficiency work, but still expects about 7% of users to hit session limits they previously would not, especially on paid tiers, according to the limit pacing thread.

The user-facing symptom shows up as “Current session” filling to 100% while “Weekly limits” remains partly unused, as captured in the usage UI screenshots.

Claude Code instability keeps pushing devs to use alternatives for a day

Claude Code (Anthropic): Ongoing reliability issues are still pushing some developers to switch tools mid-day; one report says “Claude Code performance is awful today” and that they’re switching to Codex with GPT-5.4, as stated in the switching report. Another follow-on says outages “keep pushing me into Codex and Codex keeps delivering,” per the fallback impression.

This is being framed less as model preference and more as an ops constraint: “Hoping today is better. I need my Claude Code back,” as written in the outage day account.

SessionGate reports show fast peak-hour burn, then normal pacing off-peak

Claude (Anthropic): Following up on Reliability pain (fast quota burn + outages), builders report highly variable session consumption: one Max-plan user said they hit 100% of the 5-hour session limit in under an hour during the incident window, per the outage and quota report. Others describe watching the usage bar climb “5% every refresh” during peak hours and then stabilizing off-peak, as described in the peak vs off-peak account.

A contrasting datapoint shows “back to normal” pacing with six parallel Opus agents and only 11% session usage after ~30 minutes, as shown in the recovery screenshot.

Anthropic blames some sudden quota burn on long-context cache misses

Claude (Anthropic): An Anthropic engineer suggests some “sessionGate” cases may be explained by expensive prompt cache misses, especially when resuming long conversations with very large context (described as “million context”), as stated in the cache miss hypothesis. The claim is that resuming or branching long threads can defeat caching, making the same interaction costlier than expected.

This matches user-visible patterns where the “Current session” bar can race to 100% even when weekly usage remains moderate, as shown in the usage UI screenshots.

Teams shift Claude-heavy workloads to off-peak to avoid faster session burn

Claude (Anthropic): Anthropic’s practical mitigation is to shift token-intensive background jobs out of the weekday peak window; it explicitly calls out that off-peak scheduling will “stretch your session limits further,” as written in the workload scheduling guidance. Builders are already adjusting work hours in response to the peak-hour penalty, as noted in the shift hours reaction.

The behavior this is trying to avoid is “session-bound” throttling where session usage hits 100% even with weekly headroom remaining, as illustrated in the usage UI screenshots.

Claude’s quota turbulence is being read as visible compute constraint

Compute constraints: Multiple posts interpret the faster session burn and outages as a capacity problem rather than a product decision; one blunt take is “give Claude more computers,” as written in the capacity plea. Another user frames the week as frontier labs reducing subsidies and tightening access, per the subsidy pullback take.

Anthropic’s own framing is “manage growing demand,” with peak-hour pacing adjustments and an estimated 7% of users newly hitting session limits, as described in the limit pacing thread.

Session-based limits trigger a paid-tier expectations backlash

Claude (Anthropic): The tone of discussion around the new peak-hour pacing is heated, with commentary that the reactions are “crazy” and expectations are “out of whack,” as argued in the expectations comment. Others compress the situation into a compute question—“Too much demand…or not enough compute?”—as asked in the demand vs compute prompt.

There’s also a broader narrative that this week looks like “frontier labs pull-back on subsidies,” as framed in the subsidy pullback take, which fits with Anthropic’s explicit “growing demand” wording in the limit pacing thread.


📈 Cursor training ops: real-time RL checkpoints shipped every 5 hours

Research-to-product pipeline update: Cursor shares how Composer 2 checkpoints can improve continuously via real-time RL, compressing model iteration cycles into hours. Relevant to teams watching how coding models evolve and how quickly behavior changes in production.

Cursor says real-time RL lets Composer 2 ship improved checkpoints every five hours

Composer 2 (Cursor): Following up on Tech report (training recipe details), Cursor says its real-time RL pipeline can produce and ship improved model checkpoints on a five-hour cadence, with an internal “real-time RL reward” curve trending upward as evidence in the Real-time RL note.

This reframes “model version” as a moving target for teams benchmarking coding agents: improvements may land multiple times per workday, so regression tracking and eval replayability become operational concerns rather than periodic release notes, per the framing in the Real-time RL note.

Cursor highlights on-policy implicit feedback as a Composer training signal

Composer (Cursor): A Cursor researcher calls out on-policy implicit feedback as a key ingredient in how they train Composer, pointing to a loop where the model’s current behavior generates the data that then updates the next checkpoint, as stated in the Implicit feedback note.

The practical implication is that behavior shifts can be driven by product telemetry-like signals (implicit “this helped / didn’t help”) without waiting for slower human-label pipelines, aligning with Cursor’s broader push toward high-frequency checkpoint updates described in the Real-time RL note.


🗂️ Multi-agent orchestration UIs: Kanban boards, parallel terminals, isolated worktrees

A cluster of tooling focuses on coordinating many CLI agents without conflicts: task cards map to terminals/worktrees, dependency chains auto-run, and real-time diffs support review. This is orchestration UX, not a model release.

Cline Kanban ships a local board UI for parallel CLI agents with isolated worktrees

Cline Kanban (Cline): Cline shipped Kanban, a standalone local web app that orchestrates multiple CLI coding agents in parallel—each task card gets its own terminal and isolated git worktree to avoid conflicts, as described in the product overview and the longer workflow breakdown. It targets the practical pain of running “5, 10, even 20 agents” across terminals where failures can go unnoticed, shifting the bottleneck from model speed to human attention, per the workflow breakdown.

Kanban multi-terminal demo
Video loads on view

Dependency chaining: Cards can be linked so one finishing can auto-commit and trigger the next step, with the feature list called out in the product overview.
Local-first + no lock-in: It’s positioned as CLI-agnostic and compatible today with Claude Code, Codex, and Cline CLI, as stated in the product overview.

The core technical bet is that “task → worktree → review diff” becomes the default loop for agent swarms, instead of a single chat thread.
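
The tweets don’t document Cline’s internals, but the isolation model maps onto stock git worktrees; a minimal sketch of the “task → worktree → review diff” loop, with hypothetical task names:

```bash
# One isolated worktree (and branch) per task card, so parallel agents
# never share a working directory; task IDs here are illustrative.
git worktree add ../task-101 -b agent/task-101
git worktree add ../task-102 -b agent/task-102

# Review an agent's output as a diff against main, then clean up.
git -C ../task-101 diff main
git worktree remove ../task-101
```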

Kanban boards as the emerging UI for managing agent swarms

Orchestration UX trend: A thread predicts the Kanban-style “multi-agent orchestration” form factor will overtake other agent UIs “in the next six months,” as quoted in the form factor prediction—with Cline’s Kanban positioned as a concrete implementation of that idea in the product overview. The repeated framing is that agents scale faster than a human can monitor them, so the UI needs first-class visibility into state, diffs, and blocking errors, echoing the operational pain described in the workflow breakdown.

Kanban multi-terminal demo
Video loads on view

The tweets don’t provide outcome metrics yet (throughput, defect rate, time-to-merge), so this remains a directional signal rather than a validated productivity claim.


🛠️ Claude Code unattended maintenance: cloud auto-fix that follows PRs

Distinct from quota drama: Claude Code adds a cloud auto-fix workflow that can proactively follow pull requests and attempt to fix CI failures/review comments while you’re away. This is about reducing merge latency and human babysitting in CI loops.

Claude Code can now auto-fix CI failures and review comments while you’re away

Claude Code (Anthropic): Web/mobile/desktop sessions are gaining an Auto fix toggle that follows PRs and proactively remediates CI failures and review comments, with UI copy warning that Claude “may post comments on your behalf,” as shown in the Auto fix screenshot. It’s positioned as asynchronous maintenance rather than an interactive chat loop, per the Auto-fix announcement.

How it behaves: The trigger is GitHub events arriving on a PR (CI failures, review feedback), and Claude keeps iterating in the background until it can propose or apply fixes, as implied by the “follow PRs” framing in the Auto-fix announcement.
UX coupling: The same settings pane pairs Auto fix with an Auto merge toggle (off by default in the screenshot), so the feature can reduce time-to-merge without a human babysitting checks, as shown in the Auto fix screenshot.

Claude Code 2.1.85 lands with /compact fixes and better MCP/OAuth behavior

Claude Code (Anthropic): CLI 2.1.85 is now out, and the changelog highlights reliability improvements that matter for long-running or unattended sessions—especially around compaction and MCP auth flows—per the detailed notes in the 2.1.85 changelog.

Long-session stability: /compact no longer fails with “context exceeded” when the conversation itself is huge; scroll performance and compaction-trigger UI stutter are also called out as improved in the 2.1.85 changelog.
MCP + headless hooks: MCP OAuth now follows protected resource metadata discovery (RFC 9728), and PreToolUse hooks can satisfy AskUserQuestion by returning updatedInput + allow—useful for non-interactive UIs—according to the 2.1.85 changelog.

The release is also being tracked externally as imminent/just released in the Release watcher post.


🧠 IDE model routing goes local: VS Code selects Ollama models via Copilot

Practical shipping update for engineers: Visual Studio Code can now route Copilot-assisted workflows to any Ollama local or cloud model when Ollama is installed. This affects privacy/cost/debug loops for teams standardizing on VS Code.

VS Code can use Ollama models through GitHub Copilot model selection

VS Code + GitHub Copilot (Microsoft/GitHub) + Ollama: VS Code now integrates with Ollama via GitHub Copilot, so if Ollama is installed you can select “any local or cloud model from Ollama” directly inside the editor, as announced in the integration post.

Model-picker UX: the screenshot in the integration post shows a single model chooser mixing hosted models (e.g., Claude, GPT) alongside Ollama entries (e.g., qwen3:8b), implying a unified routing surface inside VS Code.

Operational implication: this makes “Copilot-assisted” workflows viable against local inference (privacy/cost control) without leaving the IDE, assuming the chosen Ollama model fits your machine and latency needs, per the integration post.
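
The local side of this is small; a minimal sketch assuming a default Ollama install (VS Code’s part happens in the Copilot model picker, not on the command line):

```bash
# Pull a local model so it shows up in VS Code's Copilot model picker;
# qwen3:8b is the entry visible in the integration post's screenshot.
ollama pull qwen3:8b

# Sanity-check that the Ollama server is reachable on its default port;
# this returns the locally available models as JSON.
curl -s http://localhost:11434/api/tags
```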


🎙️ Realtime voice stack heats up: Gemini Flash Live, open TTS, open ASR, streaming plugins

A high-volume voice day: Google ships a realtime audio model for agents and upgrades Gemini Live; Mistral drops open-weight TTS; Cohere releases open ASR; and ecosystems add day-0 runtime support. Focus is latency, tool-calling from audio, and deployability.

Gemini 3.1 Flash Live rolls out across Gemini Live, AI Studio, and APIs

Gemini 3.1 Flash Live (Google DeepMind/Google): Google rolled out a new realtime audio model for building voice (and multimodal) agents, with the upgrade framed as a “step function improvement in quality, reliability, and latency” in the Launch thread; it’s also landing in Gemini Live experiences, with DeepMind positioning it around “more natural conversations” and improved function calling in the DeepMind announcement. Builders now see it in AI Studio/API surfaces under gemini-3.1-flash-live-preview, with an AI Studio card listing a Jan 2025 cutoff and per‑modality pricing fields in the AI Studio model card.

Early builder sentiment is strongly latency-focused: “The faster response is big, feels more human” as described in the Latency reaction, and the model is being pitched as resilient to messy, real-world audio conditions in the DeepMind announcement. Google also highlights agent-building affordances (voice, tool use) and broad language coverage in the Builder feature list, which is the core reason this matters operationally: it shifts realtime UX from “demoable” to “deployable” if the reliability claims hold.

Gemini 3.1 Flash Live adds thinking-level control with big TTFA tradeoffs

Gemini 3.1 Flash Live Preview (Google): Third-party benchmarking highlights a new knob that matters in production voice: configurable “thinking levels” (minimal→high) that trade reasoning for latency, with the speed deltas quantified by Artificial Analysis in the Benchmark breakdown. They report average time-to-first-audio (TTFA) at 0.96s on “minimal” vs 2.98s on “high” in the Speed chart, alongside a large capability swing (Big Bench Audio 70.5% minimal vs 95.9% high) described in the Benchmark breakdown.

Where it lands vs competitors: On TTFA, “minimal” sits mid-pack at 0.96s while “high” moves to 2.98s, as shown in the Speed chart; on speech reasoning, “high” is near the top of the leaderboard per the Benchmark breakdown.

Cost stability claim: Artificial Analysis notes pricing “remains stable” vs Gemini 2.5 Flash Native Audio Dialog at $0.35/hour audio input and $1.38/hour audio output in the Benchmark breakdown, though this hasn’t been confirmed on Google’s own pricing pages.

The practical change is that realtime voice endpoints now expose an explicit “reasoning budget” control, which is rarely first-class in audio agents.

Mistral’s Voxtral TTS launches with open weights and low-latency voice cloning claims

Voxtral TTS (Mistral AI): Mistral introduced Voxtral TTS as an open-weight, expressive, low-latency TTS model with 9-language support in the Launch announcement, with third-party summaries emphasizing voice cloning and “human preference” wins vs ElevenLabs in the Performance recap.

Voxtral TTS demo
Video loads on view

Latency and footprint claims: A reported ~90ms time-to-first-audio and ~3GB RAM requirement show up in the Performance recap, which—if reproducible—moves high-quality TTS closer to “single GPU service” territory.

Comparative positioning: Mistral’s reported blind listening tests show Voxtral preferred ~63% for flagship voices and ~70% for voice customization over ElevenLabs Flash v2.5 in the Performance recap.

Integration framing: Mistral positions Voxtral as an “output layer” that can pair with a transcription stack (speech-to-speech pipelines) in the Launch announcement.

Net: the notable engineering hook is open weights plus explicit streaming/latency posture, which tends to decide whether teams can productize voice without vendor lock-in.

Cohere releases Transcribe: a 2B Apache-2.0 open ASR model

Cohere Transcribe (Cohere): Cohere launched Transcribe, an open-source speech recognition model (Apache 2.0) with broad multilingual coverage, amplified via Hugging Face in the Launch amplification.

Transcribe launch clip
Video loads on view

Serving support is immediately documented in the vLLM ecosystem, where vLLM shows a one-line vllm serve CohereLabs/cohere-transcribe-03-2026 example in the Serving instructions. External reporting also frames it as topping the Hugging Face Open ASR leaderboard and being designed for consumer GPUs, as detailed in the TechCrunch writeup.
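
For reference, the serve line quoted in those instructions, plus a hedged request sketch against vLLM’s standard OpenAI-compatible transcription route (the request shape is an assumption, not shown in the posts):

```bash
# One-line serving path from vLLM's instructions.
vllm serve CohereLabs/cohere-transcribe-03-2026

# vLLM's OpenAI-compatible server (default port 8000) mirrors the
# /v1/audio/transcriptions route for ASR models; exact fields assumed.
curl -s http://localhost:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=CohereLabs/cohere-transcribe-03-2026
```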

The key engineering signal is “actually open” licensing plus day-0 tooling support, which tends to be what determines whether ASR gets piloted internally versus adopted for production.

Gemini Live gets its biggest upgrade: faster, longer, and less awkward

Gemini Live (Google): The consumer Gemini Live experience shipped its “biggest upgrade yet,” explicitly credited to Gemini 3.1 Flash Live in the Gemini Live upgrade; Google claims faster responses with fewer pauses, 2× longer conversations, and dynamic answer length/tone adjustments in that same Gemini Live upgrade.

Gemini Live speed demo
Video loads on view

DeepMind’s companion rollout messaging frames this as more natural audio conversations and better function calling for tasks in messy environments, as described in the DeepMind thread. The engineering relevance is that Live’s UX improvements are a proxy signal for the underlying realtime stack (streaming stability, turn-taking latency), and the bar is now set by “awkward pause” reduction rather than only word error rate or general reasoning.

LiveKit ships Gemini Live API plugin docs for audio-in/audio-out agents

Gemini Live API plugin (LiveKit): LiveKit published integration docs for building audio-in/audio-out agents against Gemini’s Live API, with a “try it out today” callout in the Docs announcement and implementation details in the linked Plugin docs. The announcement specifically positions it as “the first Gemini 3 native audio model on the Live API,” and calls out better instruction following, improved tool calling, reduced speaker drift, and 70+ language support in the Docs announcement.

This matters because it’s a concrete “agent framework surface” for the model (session management + streaming audio plumbing), which tends to be the part teams end up re-implementing ad hoc when moving from a model demo to a shipped voice agent.

vLLM lands Cohere’s encoder-decoder serving optimizations for speech models

Encoder-decoder speech serving (vLLM + Cohere): vLLM credits Cohere with upstreaming encoder-decoder serving improvements—variable-length encoder batching and packed attention for the decoder—claiming up to 2× throughput gains for speech workloads in the vLLM release note.

The important nuance is that this is not only “support for one model”: the vLLM release note explicitly says the throughput gains carry over to all encoder-decoder models in vLLM, so teams serving Whisper-like architectures may see immediate infra efficiency improvements without changing models.

vLLM Omni adds day-0 serving support for Voxtral-4B-TTS

Voxtral-4B-TTS serving (vLLM Omni): vLLM announced day-0 support for Mistral’s Voxtral TTS, positioning it as “enterprise-grade TTS built for production voice agents,” including ultra-low latency streaming and 24kHz output formats in the vLLM Omni post.

The operationally useful detail is the concrete launch path: they show install commands and a vllm serve mistralai/Voxtral-4B-TTS-2603 --omni invocation in the vLLM Omni post, which reduces the usual “model release → weeks of serving glue” gap for teams standardizing on vLLM.
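
The quoted launch path, for reference (the post’s install commands aren’t reproduced in the tweet text, so only the serve invocation is shown):

```bash
# Day-0 serving invocation as quoted in the vLLM Omni post.
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
```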


🧑‍✈️ Agent runners & ops: hosted coworkers, usage dashboards, browser-driving loops

Ops-heavy posts focus on running agents as systems: hosted Slack coworkers, persistent background terminals, browser-driving via MCP, and usage/traffic ranking signals across agent platforms. Excludes Codex plugins (covered as the feature).

Anthropic speeds up Claude session-limit burn during weekday peak hours

Claude (Anthropic): Following up on Claude Code limits (outages + quota burn), Anthropic says free/Pro/Max users will move through the 5-hour session limit faster on weekdays from 5am–11am PT / 1pm–7pm GMT, while weekly limits stay unchanged, as explained in the Peak-hours limit change thread.

Who gets hit and when: Anthropic estimates ~7% of users will newly hit session limits, especially on paid tiers, per the Peak-hours limit change context.
Operational mitigations: They call out shifting token-heavy background jobs to off-peak hours in the Peak-hours limit change, and separately point to expensive prompt cache misses (e.g., resuming long-context chats) as a common cause of sudden usage spikes in the Cache miss hypothesis.

User reports are mixed—some describe watching usage climb ~5% per refresh during peak hours in the Pro plan burn report, while others say pacing returned to normal later the same day in the Back to normal report.

Every’s Plus One ships: a hosted OpenClaw coworker in Slack with preloaded tools

Plus One (Every): Every launched Plus One, a hosted OpenClaw-in-Slack “coworker” that aims to remove the always-on machine + manual integration setup tax; setup is described as one-click, and it can run on a ChatGPT subscription or other API keys per the Launch announcement.

Slack coworker setup
Video loads on view

What’s bundled: The initial package includes built-in connections and team workflows (email, writing, doc editing, daily briefs/digests) as listed in the Launch announcement description.
Rollout shape: Access is throttled (“letting in 20 people a week”) according to the Launch announcement note, which makes it more like a managed agent product than a template repo.

The positioning is explicitly “capable coworker out of the box,” not “bring your own harness and spend a weekend wiring it.”

Chrome MCP lets a coding agent drive a real browser session for console work

Chrome MCP (browser control loop): A builder working on OpenClaw reports switching from screenshot-based guidance to letting Codex connect to a live Chrome session via MCP, so the agent can navigate vendor dashboards and perform troubleshooting directly, as described in the Browser MCP workflow post.

The example is Microsoft Foundry setup (quota + deployment UI), but the pattern generalizes: once the agent can operate the browser, “read the docs + click the console” becomes automatable instead of a human bottleneck, per the Browser MCP workflow claim.

OpenRouter’s weekly trending snapshot shows Cline and Hermes Agent token growth

OpenRouter app rankings: OpenRouter posted a weekly “Trending” snapshot showing token growth across agent products; Cline is listed at +114% and Hermes Agent at +124% this week, with overall usage also shown for OpenClaw and Claude Code in the Rankings screenshot.

This is one of the few public, cross-tool datapoints that’s closer to “production usage” than model benchmark chatter, even if it’s still platform-scoped (it only reflects OpenRouter traffic).

Teams are running a parallel org chart of named AI coworkers

Agent operating model: Every describes a workflow where AI agents mirror a “parallel org chart”—each agent has a name, a manager, and recurring responsibilities—rather than being treated as ad hoc prompts, as outlined in the Parallel org chart description.

The practical point is that once agents have stable roles, the hard problems shift to ops: integration hygiene, uptime, and ongoing maintenance, which the same post flags as the real bottleneck in keeping “coworkers” useful over time in the Parallel org chart description section.

Vercel Sandbox adds automatic filesystem persistence for long-running agents

Vercel Sandbox (Vercel): Vercel added automatic persistence for sandboxes—filesystem state is saved when stopped and restored on resume—so agent work can continue without manual snapshotting, per the Persistence announcement and the Changelog post.

The product framing is that agents need “computers” with persistence (named, resumable environments), which Vercel reiterated in the Agents need persistence note.

discrawl v0.2.0 speeds up Discord archive sync for community ops

discrawl v0.2.0 (steipete): discrawl shipped v0.2.0 with faster sync --full, improved backfill batching, and more reliable sync --since incremental behavior, as noted in the Release announcement and detailed in the Release notes.

The tool is being used to mine “biggest pain points” from a Discord community, which makes the performance work directly relevant to teams treating community logs as an input to agent/UX prioritization, per the Release announcement.
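
A minimal usage sketch built from the subcommands named in the release notes; the `--since` argument format is an assumption, not documented in the tweets:

```bash
# Full archive sync, the path sped up in v0.2.0.
discrawl sync --full

# Incremental sync from a checkpoint; the timestamp format is assumed.
discrawl sync --since 2026-03-01
```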

KiloClaw pitches a 5-minute setup path for running agents via Telegram/Gmail/Slack

KiloClaw (Kilo Code): Kilo Code claims non-developers can run an OpenClaw-style agent with a KiloClaw account in “five minutes,” wiring tools and receiving updates via channels like Telegram/Gmail/Slack, as described in the Setup claim.

The configuration surface shown includes messaging channels (Telegram/Discord/Slack) plus optional developer tools and search keys, and the product page is linked in the Claws setup page.


⌨️ Everything becomes a CLI (for agents): finance ops, service emulation, provisioning flows

Multiple releases push “agent-native CLIs” for automation: finance tooling, emulators, and agent-friendly non-interactive modes. This beat is about interfaces designed for automation-first use, not models.

Ramp ships Ramp CLI: agent-accessible finance ops with 50+ tools and built-in skills

Ramp CLI (Ramp): Ramp released Ramp CLI to let agents operate company finance workflows via a tool surface instead of web UIs—exposing 50+ tools across cards, bills, expenses, travel, and approvals; Ramp also claims it’s “fewer tokens than MCP” and ships with prebuilt skills like receipt compliance and “agentic purchasing,” as described in the launch thread and recapped among the week’s CLI-first launches in the CLI roundup post.

Ramp CLI install demo
Video loads on view

The install path is positioned as a single shell bootstrap (curl … | bash), and the product framing is “agents manage finances” rather than “SDK for finance,” which signals a push toward auditably automatable ops surfaces instead of dashboard automation.

ElevenLabs reworks its CLI for agents: non-interactive default, Ink UI behind a flag

ElevenLabs CLI (ElevenLabs): ElevenLabs updated its CLI to be non-interactive by default so agents and automations can call it without prompt/TTY stalls, while keeping a richer interactive experience behind --human-friendly; the announcement also points to a skill install path (npx skills add elevenlabs/skills) that bundles agent guardrails workflows, according to the CLI and guardrails thread and the broader CLI-first release framing in the CLI roundup post.

Agent-first CLI install demo
Video loads on view

The thread explicitly calls out how “human keyboard” defaults break agent orchestration, and this change puts ElevenLabs in the camp of tools treating CLIs as an automation API surface, not a developer convenience.
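
A sketch of the split, assuming the binary is invoked as `elevenlabs` (binary name and subcommand are placeholders; only the flag and the skills install path come from the thread):

```bash
# Skill install path quoted in the announcement.
npx skills add elevenlabs/skills

# Non-interactive is now the default, so agents can shell out without
# TTY stalls; the richer Ink UI sits behind the flag. "<command>" is a
# placeholder, not a documented subcommand.
elevenlabs <command>                   # automation-safe default
elevenlabs <command> --human-friendly  # interactive UI for humans
```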

Stripe Projects adds Vercel provider for terminal provisioning and agent-discoverable deploys

Stripe Projects (Stripe/Vercel): Vercel is now a supported provider in Stripe Projects (developer preview), enabling provision+deploy flows from the terminal via stripe projects add vercel/project; Vercel also frames this as making the provider “discoverable” for autonomous setup inside agent workflows, per the Vercel preview note and the broader positioning on the Projects.dev overview.

This is a concrete step toward “DevOps lifecycle as code” for agents: the integration is described as typed provisioning and deployment, not an LLM integration.
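
The terminal flow, as quoted in the preview note (developer preview; surrounding flags and output aren’t shown in the tweets):

```bash
# Provision a Vercel project and wire it up from the terminal.
stripe projects add vercel/project
```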

Agent-first CLI checklist spreads: make hidden UI assumptions explicit

CLI design checklist (Cursor marketplace): A “CLI for Agents” checklist is being shared as a practical standard for building automation-friendly CLIs—non-interactive flags, layered help with examples, stdin/pipes support, fast actionable errors, idempotency, and dry-run support—surfacing in the plugin listing and detailed on the plugin page, with additional framing that “agents change what’s implicit” in the design checklist post.

This is one of the clearer “interface contract” patterns emerging around agent tooling: it’s less about prompts and more about making CLI behavior deterministic and machine-safe.
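
A hypothetical invocation set showing what those checklist items look like in practice; the tool name and flags are illustrative, not taken from the listing:

```bash
# "mytool" is a made-up CLI demonstrating agent-friendly contracts:
# non-interactive defaults, machine-readable output, dry runs, pipes.
mytool deploy --env prod --yes --json --dry-run   # preview, no prompts
cat targets.txt | mytool restart --stdin --json   # composable via stdin/pipes
mytool deploy --env prod --yes \
  || echo "non-zero exit plus actionable stderr, never a hidden retry prompt"
```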

emulate adds Apple, AWS, Microsoft, and Slack emulators callable via npx

emulate (ctatedev): The emulate project added four new emulators—Apple, AWS, Microsoft, and Slack—invoked with npx emulate, with scoped packages for programmatic use also announced in the same release note.

The practical impact is clearer mocking for agent-run integration tests and local harnesses: when the “tool” is an external SaaS API, a one-command emulator becomes a workflow primitive for repeatable runs.
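
The invocation shape from the release note; how an emulator is selected is an assumption about the interface, not documented in the tweet:

```bash
# Stand up a local Slack-like endpoint for repeatable agent test runs.
npx emulate slack
```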


📏 Benchmarks & eval realism: ARC-AGI-3 harness targeting, search leaderboards, and eval design skepticism

The eval discourse continues with new emphasis on harness targeting and what leaderboards actually measure, plus fresh leaderboard snapshots for search and design. Also includes practitioner advice on designing evals that shape agent behavior in production.

ARC Prize flags ARC-AGI-3 harness targeting as “buying” leaderboard performance

ARC-AGI-3 (ARC Prize): Following up on Leaderboard launch—sub-1% frontier scores—ARC Prize is now explicitly warning that benchmark-specific harness work can "buy" performance on the public demo set, pointing to Symbolica’s approach as a concrete example in the Harness targeting warning.

ARC Prize frames near-term gains as mostly harness innovation (valuable for operationalizing agents), while reserving the Verified Leaderboard for systems not tailored to ARC-AGI-3, as reiterated in the Harness targeting warning. It also highlights why its stateless client policy exists: to reduce leaderboard-chasing strategies and make comparisons cleaner.

What the Symbolica harness looks like: The open-source code shows an orchestrator + specialized subagents (explore/theorize/test/solve) with shared memory and bounded action budgets, including explicit efficiency nudges like “RESET is a last resort,” as shown in the Harness code snapshot and published in the GitHub repo.

Community vs verified positioning: ARC Prize is telling people not to read Community Leaderboard scores as “evidence of AGI progress,” while encouraging teams to share harness ideas because they translate into real-world agent reliability work, per the Harness targeting warning.

ARC-AGI-3 human baseline opacity prompts ~15% “median human” estimate

ARC-AGI-3 (ARC Prize): Ongoing baseline ambiguity is producing back-of-the-envelope estimates for what “typical human” performance would look like under ARC-AGI-3’s efficiency scoring, with one estimate putting the median human around ~15% due to action-count penalties even when solve rates are high, as argued in the Baseline math estimate.

The estimate is explicitly framed as unverifiable without ARC publishing per-human distributions (rather than “2nd-best human per task”), and it references the benchmark’s own graphs discussed in the Baseline graph parsing thread, which points at the ARC Prize preview writeup in the ARC Prize post.

LangChain: too many evals can become noise that shapes worse agent behavior

Eval design (LangChain): A practical warning is circulating that “more evals” can make production agents worse because each eval acts like a noisy shaping vector; the proposal is to keep small, justified, targeted eval subsets and track metrics beyond accuracy, as outlined in the Evals shape behavior thread and reinforced in the Eval responsibility note.

The key engineering point is that eval selection is treated as part of the agent’s control surface (not just measurement), per the Evals shape behavior thread.

DesignArena: GPT-5.4 “Design Skill” adds ~17 Elo; Opus 4.6 stays #1

DesignArena (Arcada Labs): A shared chart shows GPT-5.4 with Design Skill enabled at 1306 Elo versus 1289 without it (a +17 delta), while Claude Opus 4.6 remains at the top around 1370, per the Elo chart comparison.

The numbers suggest the “skill” toggle is incremental on this benchmark rather than a step-change, at least in the snapshot shown in the Elo chart comparison.

Developer trust posture: assume LLM output is wrong without a cited source

Verification habit: A developer stance that’s being explicitly called out is to treat LLM responses as untrusted by default—“I automatically assume it's BS unless it's read a source”—with the claim that non-dev users often lack that reflex, as stated in the Source-first skepticism.

This maps directly onto how teams design research/eval workflows (source-grounding, citations, trace review), but the tweet is specifically about the everyday trust default, per the Source-first skepticism.

Search Arena: Gemini 3.1 Pro Grounding ranks #2 (three Gemini models top 7)

Search Arena (LMSYS/arena): Gemini 3.1 Pro Grounding landed at #2 on Search Arena with a reported 1219 ±9 score, and the same snapshot shows three Gemini variants in the top 7, as shown in the Leaderboard snapshot.

This is a narrow signal (search + grounding), but it’s a concrete leaderboard datapoint that’s easy to compare against other “search-tuned” variants in the same table, per the Leaderboard snapshot.

A proposal for “ARC-AGI-X”: validated benchmarks with undisclosed tasks

Benchmarking process: A suggested direction is a reputable org-run benchmark where tasks stay undisclosed (including their nature) to reduce targetability, leaving only the leaderboard visible—an attempt to make overfitting harder, as proposed in the Hidden benchmark idea.

Mollick: small and vertical models are brittle; benchmarks hide OOD failure modes

Model evaluation realism: A reminder in the ongoing benchmarks discourse is that small/specialized models can look strong on benchmarks but fail hard on unusual or out-of-distribution situations, and that many benchmarks under-report these weaknesses, as argued in the Brittle model warning.


⚙️ Inference engineering: real-time VLMs, quant/VRAM tricks, and vLLM reliability fixes

Performance-centric posts: new inference engines, memory optimizations for local workflows, and hard debugging fixes in serving stacks. This is about shipping faster/cheaper inference in production, not model announcements.

ComfyUI publishes Dynamic VRAM results: big wins on constrained setups

Dynamic VRAM (ComfyUI): Following up on Dynamic VRAM—the “run big models without OOM” feature—ComfyUI users are now sharing concrete before/after timings, including a ~283.7s → 83.2s drop on one configuration (RTX 5060, Windows, 32GB RAM, FP16) as shown in the Benchmark chart. It’s a memory scheduler story, not a model story.

What it changes operationally: The project frames it as automatic VRAM/RAM management for Windows/Linux Nvidia workflows, reducing manual “fit in memory” tuning per the Dynamic VRAM announcement.

This looks most relevant for local pipelines where the alternative is reducing resolution/batch or swapping hardware.

Moondream launches Photon for real-time VLM inference (46ms, 60+ fps on H100)

Photon (Moondream): Moondream introduced Photon, an inference engine targeting production VLM latency—claiming 46ms end-to-end and 60+ fps on a single H100, framed as making “real-time vision AI” practical per the Photon announcement.

Photon performance teaser
Video loads on view

What’s distinct: Photon is pitched as co-design across model shapes, caches, and custom kernels per platform, with a reported ~2× speedup vs vLLM for similar-sized models in their writeup, as described in the Photon announcement and detailed in the Photon blog post.

The open question is how these numbers hold across non-H100 targets and real batching/streaming workloads.

Serving Qwen 3.5 27B at ~1.1M tok/s on 96× B200 shows DP beating TP

vLLM distributed inference (B200/GKE): A Google Cloud engineer writeup claims ~1.1M total tokens/sec serving Qwen 3.5 27B (dense, FP8) on 96× B200, with a key observation that DP=8 delivered nearly 4× the throughput of TP=8 because the model is “too small” for tensor parallelism to help on B200s, per the Throughput and scaling notes.

Other deployment details: The notes highlight 97.1% scaling efficiency at 8 nodes and TPOT ~46ms flat across node counts, while calling out routing overhead and a bottleneck pod in the gateway path per the Throughput and scaling notes.

This is one of the few posts that includes enough knobs (DP/TP/MTP, routing overhead) to be reusable for cluster tuning.
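
In vLLM terms, the writeup is comparing two per-node launch layouts; a sketch with a placeholder model ID (the `--tensor-parallel-size` and `--data-parallel-size` flags are real vLLM options, but the exact HF repo isn’t given in the notes):

```bash
# TP=8: one replica sharded across 8 GPUs; every layer pays inter-GPU traffic.
vllm serve <qwen3.5-27b-fp8> --tensor-parallel-size 8

# DP=8: eight full replicas per node; viable because a dense 27B FP8 model
# fits on a single B200, which is why DP wins here.
vllm serve <qwen3.5-27b-fp8> --data-parallel-size 8
```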

vLLM merges a fix for a silent uint32 overflow in the Mamba-1 CUDA kernel

vLLM (AI21 Labs): AI21 reported and fixed a silent uint32 overflow in vLLM’s Mamba-1 CUDA kernel—caused by uint32_t stride × cache_index overflowing at scale—and the patch is now merged, as described in the Debugging summary and the Debugging writeup.

Why inference engineers care: This is a classic “passes small tests, breaks at scale” failure mode; the story connects kernel integer widths to downstream logprob mismatches in large RL-style training/inference loops per the Debugging summary.

It’s a reminder that serving correctness bugs can surface as training instability, not crashes.

vLLM lands encoder-decoder serving optimizations from Cohere (up to 2× throughput)

vLLM (speech workloads): vLLM maintainers say Cohere contributed encoder-decoder serving optimizations—variable-length encoder batching plus packed attention for the decoder—claiming up to ~2× throughput improvement for speech workloads and applicability to encoder-decoder models more broadly, per the vLLM optimization note.

Serving surface: The same post includes install/serve commands for the cohere-transcribe model in vLLM, which is useful as a template even if you’re swapping in a different encoder-decoder model, as shown in the vLLM optimization note.

This is a rare “model vendor upstreams serving primitives” datapoint, not just a model drop.

vLLM reports up to 18× better Kimi K2.5 interactivity on AMD GPUs (upstreamed)

vLLM + AMD: vLLM project maintainers report up to 18× interactivity improvement when serving Kimi K2.5 1T MXFP4 on AMD GPUs, with fixes and GEMM tuning upstreamed into vLLM 0.18.0, per the Interactivity update.

What’s implied: This reads like kernel/collective and GEMM tuning plus correctness fixes, with a follow-on GPU MODE hackathon track ($650K) mentioned for pushing Kimi inference further on MI355X hardware in the same Interactivity update.

The post doesn’t enumerate which workloads saw 18×, so scope is still unclear.

A practical speculative decoding comparison for vLLM deployments circulates

Speculative decoding (vLLM): vLLM folks pointed to a “thorough and practical” comparison of speculative decoding strategies, positioned as a reference when choosing an SD approach for deployment tradeoffs (latency vs throughput vs complexity) per the Spec decoding reference.

This is more of a field note than a release: the tweet doesn’t include the full artifact, so treat it as a pointer rather than a canonical benchmark.


🔎 Retrieval & agentic search: open search agents, chunking discipline, visual citations

Retrieval is a standalone beat today: Chroma open-sources a 20B search agent positioned as faster/cheaper, and builders share practical RAG advice and auditability primitives (bounding boxes, chunking).

Chroma open-sources Context-1, a 20B agentic search model aimed at multi-step retrieval loops

Context-1 (Chroma): Chroma introduced Context-1, a 20B parameter “search agent” released Apache 2.0 and pitched as an “order of magnitude faster” and “order of magnitude cheaper” than long, frontier-model agentic search trajectories, as described in the Launch announcement; the same thread frames the motivation as multi-stage search where one hop informs the next, with public-benchmark callouts including Browsecomp-Plus, SealQA, LongSealQA, and FRAMES in the Launch announcement.

Context-1 positioning and comparisons
Video loads on view

Weights + adoption path: Hugging Face posted direct access to the model weights via the Weights link, which points to the Model card for pulling and running.
Why this matters in practice: the launch is being interpreted as “agentic search moving down-market” (smaller, purpose-built models replacing long frontier trajectories), echoing the “bitter lesson” framing in the Commentary retweet and the “search subagents” angle in the Practitioner note.

Benchmarks and cost claims are currently presented as vendor-generated; there isn’t an independently reproduced eval artifact in these tweets.

LiteParse adds PDF text bounding boxes for visual citations and agent audit trails

LiteParse (LlamaIndex): LiteParse now exposes bounding boxes for every extracted text block in PDFs, so an agent can map an answer back to the exact line on the page and highlight it as an audit trail, as shown in the Bounding boxes announcement and documented in the Docs guide.

What changed vs plain parsing: instead of “here’s the text,” you get text + coordinates, which makes UI-level verification (highlight-on-page) and post-hoc review feasible for doc agents.
Where to get it: the implementation details and local-first workflow live in the GitHub repo, building on the earlier LiteParse positioning as a fast parser rather than a VLM-heavy pipeline, following up on LiteParse skill (agent-ready parsing integration).

Weaviate’s chunking guide: 8 practical strategies to stop RAG from retrieving nonsense

Chunking discipline (Weaviate): Weaviate argues most “broken RAG” symptoms are chunking failures rather than embeddings or the vector DB, and shares a concrete menu of 8 chunking strategies—from fixed-size and recursive splitting through semantic/LLM-driven chunking and late chunking—in the Chunking techniques thread.

Chunking techniques quick rundown
Video loads on view

The useful engineering takeaway is that chunking is a three-way trade between chunk size, retrieval precision, and preserved context, with “late chunking” explicitly positioned as embedding a whole document first (long-context), then deriving chunk representations—see the Chunking techniques thread for the full taxonomy and when each tends to fail.


🧱 AI app builders ship to production: App Store autopublish, in-app collaboration, and agent-first UX

Builder-facing products that compress “idea → shipped app”: automated App Store submission flows, in-app commenting for human+agent collaboration, and lightweight ‘vibe coding’ platforms. This is about shipping workflow, not core models.

Rork Max Publishing automates the full App Store submission flow

Rork Max Publishing (Rork): Rork shipped an App Store publishing flow that auto-populates the entire App Store listing—metadata, icons, and screenshots—then submits on your behalf, according to the product clip; it explicitly claims iPad screenshots and even generates a mock review artifact intended to reduce rejection risk, as described in the launch post.

App Store page auto-filled
Video loads on view

Listing generation: It fills required App Store fields automatically and produces “beautiful icons & screenshots,” as shown in the submission demo.
Review-surface handling: The “mock review so you don’t get rejected” claim is part of the same flow, per the feature list.

This positions “App Store ops” (assets + compliance-ish paper cuts) as something an app-building agent stack can take over, not just codegen.

Every launches Plus One: a hosted OpenClaw coworker in Slack with bundled tools

Plus One (Every): Every announced Plus One, described as a hosted OpenClaw that lives in Slack and comes pre-loaded with tools/skills/workflows so you get a usable “coworker” without doing the usual agent infra and integration setup, per the launch thread.

Slack-based agent setup
Video loads on view

Packaged integrations: It’s positioned as one-click setup with common work tools (Google, Notion, GitHub, email workflows) and Every’s “agent-native apps,” as listed in the product description.
Bundled workflows: The announcement highlights pre-loaded routines like a content digest and daily brief, and also calls out that “the hard part” has been hosting + integrations + ongoing care, per the infrastructure rationale.

This is a packaging move: selling the operational scaffolding around agents as the product, not the base harness.

Lovable adds in-app comments to collaborate inside generated apps

Commenting (Lovable): Lovable shipped in-app commenting so teams can leave feedback directly inside the app they’re building, as shown in the feature demo.

Inline comments in app UI
Video loads on view

The practical shift is that iteration feedback moves from external threads into the product surface itself—useful when humans and agents are both making changes and need shared, anchored context.

Rork says it built an “App Store MCP” for automated publishing workflows

App Store MCP (Rork): Rork says it built an “App Store MCP,” per the short status post, implying an MCP-style tool surface for agents to drive App Store publishing steps programmatically.

The tweet doesn’t include docs, supported operations, or an authentication model yet; it’s a signal that App Store submission is being treated as a first-class “tool” for agents rather than a manual dashboard workflow.


🖥️ AI hardware signals: new workstation VRAM, inference throughput feats, and architecture primers

Hardware content today is practical: a comparative explainer of CPU/GPU/TPU/NPU/LPU tradeoffs, a new 32GB workstation GPU price point, and a high-throughput serving benchmark claim. This matters for local inference planning and capacity models.

96× B200s hit ~1.1M tok/s serving Qwen 3.5 27B (vLLM 0.18.0)

High-throughput serving report: A Google Cloud engineer writeup is summarized as hitting ~1.1M total tokens/sec serving Qwen 3.5 27B (dense, FP8) on 96 B200 GPUs with vLLM v0.18.0, with several deployment knobs called out (DP vs TP, MTP, and routing overhead), per the benchmark summary.

Parallelism takeaway: The claim is that DP=8 delivers nearly 4× the throughput of TP=8 because the model is “too small” to benefit from tensor parallelism on B200s, as noted in the benchmark summary.

Latency and scaling: The report cites ~46ms TPOT flat across node counts and ~97% scaling efficiency at 8 nodes (96.5% at 12), per the benchmark summary.

Systems overhead: KV-cache-aware routing is described as adding ~35% overhead vs round-robin (EPP pod bottleneck), which is the kind of detail infra teams need for capacity models, per the benchmark summary.

Intel Arc Pro B70/B65 put 32GB VRAM at a $949 price point

Arc Pro B70/B65 (Intel): Intel’s Arc Pro workstation lineup is being positioned as a new local-inference-friendly price point, with the B70 and B65 both shown at 32GB and the B70 called out around 367 TOPS, with a reported starting price of $949, as shown in the spec chart.

What’s actually new: A 32GB VRAM workstation card at sub-$1k is the concrete change engineers will model around for “single-box” local runs, per the pricing/spec framing in the spec chart.

Evidence quality: Performance comparisons in the tweets are directional (“on par with RTX 5070”) and don’t include independent benchmarks yet, even though the spec table in the spec chart makes the memory and power envelope explicit.

CPU vs GPU vs TPU vs NPU vs Groq LPU: the practical tradeoffs in one diagram

Chip architecture primer: A visual explainer contrasts five compute architectures used in AI—CPU, GPU, TPU, NPU, and Groq’s LPU—framing the core tradeoff as flexibility vs parallelism vs memory access, with LPUs highlighted as compiler-scheduled and deterministic to avoid cache misses at the cost of limited on-chip memory, as laid out in the architecture explainer.

Architecture diagrams side by side
Video loads on view

Why engineers care: It’s a quick mental model for choosing where to run which part of the stack (control-plane logic on CPU; dense matmul-heavy training on GPU/TPU; edge inference on NPU; latency-first serving on deterministic inference chips), using the memory-path and scheduling differences summarized in the architecture explainer.

Serving implication: The LPU claim is “remove off-chip memory from the critical path” (weights in SRAM; deterministic execution) which maps directly to tail-latency conversations in production inference, per the architecture explainer.

32GB VRAM under $1k becomes a new local-inference planning threshold

Local inference planning signal: Builders are explicitly calling out “32GB of VRAM for under $1000” as the next threshold that needs real-world benchmarks to be decision-useful, as framed in the benchmark request.

The Intel Arc Pro B70/B65 announcement-style spec slide gives one concrete candidate configuration (32GB at a $949 price point) that will likely drive these comparisons, as shown in the spec table. The open question in today’s tweets is performance-per-dollar for common local inference and fine-tuning setups—people are asking for measurement, not marketing.


🛡️ Safety & misuse: manipulation measurement, jailbreak automation, and product-policy pullbacks

Safety news spans research and product decisions: DeepMind publishes a manipulation-eval toolkit, jailbreak automation lands in agent tooling, and OpenAI reportedly shelves explicit-content plans. Includes practical privacy/security notes for inference.

DeepMind publishes a 10k-participant toolkit to measure AI harmful manipulation

Harmful manipulation evals (Google DeepMind): DeepMind released an empirically validated toolkit for measuring how models manipulate people in realistic interactions, drawing on nine studies with 10,101 participants across the US, UK, and India, as outlined in the Toolkit overview and the Paper screenshots. Results suggest influence is highly domain-dependent: finance saw high influence, while health hit a wall because existing guardrails blocked false medical advice, per the Research thread.

Toolkit red-flag tactics animation
Video loads on view

The paper artifacts show the experiment design (AI vs non-AI baselines; explicit vs non-explicit “steering”) and quantify domain differences, as visible in the Paper screenshots.

Hermes Agent adds GODMODE skill for automated “lock-in” jailbreaking

GODMODE skill (Hermes Agent): Nous Research merged a new GODMODE skill into Hermes Agent that attempts to jailbreak a target model automatically and then “lock in” the working strategy, as shown in the Commit screenshot and reinforced by the Run log excerpt.

Attack modes packaged: The commit notes three modes—GODMODE CLASSIC (system prompt templates), PARSELTONGUE (input obfuscation techniques), and ULTRAPLINIAN (multi-model racing via OpenRouter)—as listed in the Commit screenshot.
“Test and keep the winner” pipeline: The skill detects the model, runs canary queries, and keeps the best-performing combination, with an example run described in the Run log excerpt.

This is an operational step toward “jailbreak as automation,” not a one-off prompt trick.

OpenAI reportedly shelves ChatGPT “adult mode” indefinitely

ChatGPT product policy (OpenAI): Reporting says OpenAI has put a planned erotic “adult mode” for ChatGPT on indefinite hold, citing staff and investor pushback over risks to minors, concerns about unhealthy emotional attachment, and the technical difficulty of filtering illegal content, according to the Report summary and the Follow-up recap.

The same reporting frames this as part of a broader refocus away from “side quests” toward core productivity tools, as described in the Report summary.

Jailbreak prevention gets framed as losing to automation

Jailbreak prevention debate: Simon Willison argues that the latest automated jailbreak tooling mostly illustrates “the futility of robust jailbreaking prevention,” as stated in the Futility remark. That take lands in the context of tools that can systematically search jailbreak strategies (including multi-mode pipelines and model-switching), with a concrete example of “auto-jailbreak then persist” described in the GODMODE skill details.

The open question in the thread is whether defenses shift from prompt-level robustness to deployment controls (tool permissions, auditing, and enforcement) as automated jailbreak iteration becomes cheaper and faster.

TEE-backed inference gets pitched as “provider can’t see prompts” privacy

Private inference ops (OpenRouter): OpenRouter highlighted that if you need “even the provider can’t see prompts/completions” privacy, you can route workloads to providers running trusted execution environments (TEEs); named examples include Phala Network and chutes.ai. The pitch contrasts programmatic guarantees with purely contractual privacy promises, per the TEE privacy tip.

This is a concrete knob for teams whose threat model includes the inference host, but it comes with the usual enclave trade-offs (performance overhead and provider availability) that aren’t quantified in the tweet.
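Mechanically, this shows up as provider preferences on the request. A hypothetical sketch, assuming OpenRouter’s provider-preferences request field and invented provider slugs; verify exact field names and slugs against the current OpenRouter docs before relying on them.

```python
# Hypothetical: pin an OpenRouter request to TEE-backed providers only.
# The "provider" preference fields and slugs below are assumptions;
# check OpenRouter's routing docs for the exact names.
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
    json={
        "model": "vendor/confidential-model",  # placeholder model slug
        "messages": [{"role": "user", "content": "sensitive prompt"}],
        "provider": {
            "only": ["phala", "chutes"],  # assumed TEE provider slugs
            "allow_fallbacks": False,     # fail rather than route elsewhere
        },
    },
    timeout=60,
)
print(resp.json())
```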


🔁 Assistant portability & platform openness: Gemini memory import and Siri opening up

Consumer/assistant platform moves that matter to builders: Gemini adds migration tools to import memories and chat archives; Apple reportedly plans a Siri extension mechanism for rival assistants. This affects lock-in, retention, and distribution paths.

Report says iOS 27 may open Siri to third-party AI assistants via extensions

Siri (Apple): A report claims Apple will open Siri to rival AI assistants starting with iOS 27, letting platforms like Gemini, Claude, Alexa, and Meta AI integrate via a new “App Store Extensions” service, ending the model of a single exclusive assistant, as summarized in the iOS 27 Siri openness report. A key unknown called out in the same post is whether Apple will gate access with an approval process.

This would shift assistant distribution from “choose an app” to “choose a Siri extension,” which changes go-to-market dynamics for consumer assistants if it ships as described in the iOS 27 Siri openness report.

Gemini adds memory import to carry preferences across AI assistants

Gemini (Google): Gemini is rolling out an “Import memory to Gemini” feature on desktop that lets users transfer personal context (preferences, relationships, “key facts”) from another AI assistant by pasting in a generated summary, as described in the Memory import overview. This is a direct portability move: it reduces re-onboarding friction for assistants that use “memory” as the retention mechanism.

How the flow works: Gemini provides a suggested prompt; you paste it into your current AI app to produce a structured summary of your preferences and style, then paste that output back into Gemini’s settings, per the step-by-step in the Memory import overview.
What changes for builders: “memory” becomes more interchangeable across products, but also more prompt-shaped: Gemini is effectively standardizing an import format that other assistants may start targeting (or defending against) in export UX.

Gemini can import chat history from other AI apps via ZIP upload

Gemini (Google): Gemini is also adding chat history import: users can export their data from another AI provider and upload a .ZIP to Gemini, which then processes and organizes past threads so they’re searchable and continuable, according to the Chat history ZIP import thread. The hard detail is the size limit: Gemini supports uploads up to 5GB, per the same thread.

Chat ZIP import flow
Video loads on view

Workflow mechanics: the process is “export → upload ZIP → Gemini indexes and organizes,” as laid out in the Chat history ZIP import thread.
Product implication: this is a portability layer for long-lived assistant relationships; it also implies Gemini is willing to ingest competitor-export schemas (or at least tolerate arbitrary text/JSON archives) as a migration path.
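Mechanically, the migration side is just an archive of thread files. A minimal packaging sketch with an invented JSON layout (real provider exports vary) and a check against the 5GB cap cited above:

```python
# Hypothetical packaging step for a chat-history migration: bundle
# exported threads into a .zip and check the 5GB upload cap cited above.
# The JSON layout is invented; real exports vary by provider.
import json
import zipfile
from pathlib import Path

threads = [
    {"title": "example thread", "messages": [
        {"role": "user", "content": "original question"},
        {"role": "assistant", "content": "original answer"},
    ]},
]

out = Path("chat_export.zip")
with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
    for i, thread in enumerate(threads):
        zf.writestr(f"threads/{i:05d}.json", json.dumps(thread, indent=2))

size_gb = out.stat().st_size / 1024**3
assert size_gb <= 5, f"export is {size_gb:.2f} GB, over the 5GB cap"
print(f"{out} ready ({size_gb:.6f} GB)")
```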


🏢 Enterprise adoption & capital: open-model in-house shift, agent startups, and change-management reality

Business signals center on adoption patterns: more companies say they’re training open models in-house for cost/control, new agent startups raise large rounds, and leaders stress that enterprise AI success is constrained by change management and integration work.

More companies say open models in-house are beating APIs on cost and speed

Open-model adoption (Hugging Face): Clement Delangue says Intercom, Pinterest, Airbnb, Notion, and others are finding it “better, cheaper, faster” to use and train open models in-house rather than rely on APIs for many tasks, and he expects “the majority of AI workflows” to move in that direction, per the In-house shift claim. This is an enterprise signal about cost/control and data residency, not model taste.

The claim is directionally consistent with more teams investing in internal post-training + serving pipelines, because once a workflow is stable, the marginal benefit of API convenience can be outweighed by predictable unit economics and tighter governance.
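That tradeoff reduces to a unit-economics comparison. A toy break-even sketch in which every number is a placeholder assumption, not a reported figure:

```python
# Toy break-even model for "API vs in-house" on a stable workflow.
# Every number below is a placeholder assumption.
api_cost_per_mtok = 2.00    # $ per million tokens via API
gpu_hourly = 4.00           # $ per GPU-hour, self-hosted
tok_per_s_per_gpu = 5_000   # sustained self-hosted throughput per GPU
fixed_monthly = 20_000      # $ for evals, on-call, infra staff share

self_cost_per_mtok = gpu_hourly / (tok_per_s_per_gpu * 3600) * 1e6
savings_per_mtok = api_cost_per_mtok - self_cost_per_mtok
breakeven_mtok_per_month = fixed_monthly / savings_per_mtok

print(f"self-hosted: ${self_cost_per_mtok:.3f}/Mtok vs API ${api_cost_per_mtok:.2f}/Mtok")
print(f"break-even: ~{breakeven_mtok_per_month:,.0f} Mtok/month")
```

Under these placeholder numbers, self-hosting wins once a workflow clears roughly 11B tokens a month; the point is the shape of the calculation, not the specific threshold.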

Box CEO: enterprise AI automation is mostly change management work

Enterprise transformation (Box): Box CEO Aaron Levie argues most knowledge-work automation is constrained by change management (data stuck in legacy systems, missing APIs, incomplete context, and less-technical teams), so the opportunity is building “software bridges” and services to make adoption real, per the Change management reality check.

This frames a lot of agent/product work as integration engineering and organizational plumbing, not prompt craft.

Intercom’s pitch: the moat moves from features to models you own

Full-stack differentiation (Intercom): Delangue highlights Intercom’s argument that winners “must and will become full stack AI companies,” because as features get cheap to build, durable differentiation shifts to “the AI under the hood” rather than surface-level functionality, per the Full-stack AI quote.

Evidence of follow-through shows up in the note that Fin (English chat + email) has moved to a custom Intercom-built model “as of last week,” per the Fin custom model note. The open question is how many teams can sustain the operational burden (evals, regressions, infra) that comes with that moat.

OpenAI-backed Isara raises $94M to coordinate thousands of agents for finance

Isara (OpenAI-backed): A report claims Isara raised $94M at a $650M valuation to sell predictive modeling tools to finance firms, with an example of using ~2,000 agents to forecast gold prices, per the Funding report.

If accurate, this is a capital-market bet on “many-agent coordination” as a product category—less about a single frontier model, more about orchestration + evaluation + reliability at scale.

Consumer AI is early: claim that only 10% use ChatGPT weekly

Consumer adoption economics: A consumer-market take claims only ~10% of the global population uses ChatGPT weekly, while current consumer AI is still dominated by paid subscriptions; it also points to a real $200/month power-user segment and expects other business models (including ads) to emerge over time, per the Consumer adoption metric.

If the 10% number is directionally right, distribution and pricing models remain a bigger swing factor than incremental feature differentiation.


🎓 Hackathons, events, and community distribution for agent builders

Community activity is itself a signal today: hackathons centered on voice/infra APIs, meetups/panels on voice agents, and vendor-led community calls. This category is for distribution/learning moments rather than product releases.

ElevenHacks #2 starts: Cloudflare + ElevenLabs build challenge with $130k in credits

ElevenHacks #2 (ElevenLabs): ElevenLabs opened a new hackathon round focused on combining Cloudflare with ElevenLabs APIs, with a stated $130k prize pool in credits, per the Hackathon kickoff. This is aimed at agent builders shipping voice + infra workflows.

Submission mechanics: entries are positioned as demo-driven; the callout emphasizes building and submitting a working integration rather than writing up ideas, as described in the Hackathon kickoff.

ARC Prize 2026 opens as ARC-AGI-3 launch events kick off

ARC Prize 2026 (ARC Prize Foundation): The ARC Prize team announced the ARC Prize 2026 competition is now open, following up on ARC Prize 2026 (competition launch) with the new call to participate in the Competition open post. This is part of the benchmark-community activation around ARC-AGI-3.

Community activation: the same post frames this as “let the games begin,” signaling a push to get more teams building harnesses/agents against ARC-AGI-3 rather than only debating scores, as stated in the Competition open post.

OpenHands schedules a community call for demos and contributor updates

OpenHands (OpenHandsDev): OpenHands promoted a community meeting focused on feature demos, project updates, and onboarding contributors, while also flagging LiteLLM supply-chain exposure as a dev-install concern in the Call announcement and Meeting card. This matters if you maintain agent stacks where “dev environment” is effectively production-adjacent.

AI Engineer events expand: AIE Europe lineup plus Miami and Singapore alternatives

AI Engineer events (aiDotEngineer): AIE organizers posted that OpenAI will appear at AIE Europe (April 8–10, London), and noted that AIE Europe tickets were sold out while AI Engineer Miami and AI Engineer Singapore still have tickets, as linked in the Event roundup via the Miami event page and Singapore event page. This is distribution for agent builders who want direct access to vendor teams and implementation workshops rather than announcement threads.

Cursor announces an in-person co-working meetup in Lisbon on April 16

Cursor (Cursor community): Cursor-affiliated accounts are promoting an in-person Lisbon co-working day on April 16, framing it as a lightweight distribution channel for the tool (meet the team, build together) in the Lisbon coworking invite (and echoed via related reposts). This is a clear “community as adoption loop” move rather than a product release.

SF event spotlights voice agents, realtime speech models, and infra

Voice agents panel (SF): A live event was promoted as a discussion on voice agents, realtime speech models, and agent infrastructure, with panelists from Deepgram and others listed in the Panel details. A livestream link was also shared in the Panel stream link.

Why it matters: this is one of the few places in the day’s feed where builders are explicitly comparing notes on “realtime stack” choices (models + infra) in a public forum, per the Panel details.

ThursdAI drops a March 26 recap and spins up an after-hours Spaces

ThursdAI (community news show): The show published its March 26 recap page in the Episode post (covering model/tool news discussed on-air) and also launched an “after hours” X Spaces format to continue discussion beyond the news block, as stated in the After-hours Spaces.

Distribution angle: this is effectively a community channel for builders to compare experiences across tooling and infra shifts on the day they land, per the framing in the Episode post.

WorkOS promotes an MCP Demo Night ahead of MCP Dev Summit in NYC

MCP Demo Night (WorkOS): A WorkOS-hosted MCP Demo Night was promoted as an in-person event during the MCP Dev Summit week in NYC, per the repost in the Event share. It’s a distribution moment for MCP server builders and teams trying to standardize tool access across agents.


📋 Operating agents in orgs: accountability, process discipline, and ‘context as code’

A smaller but important beat: operators emphasize governance and workflow discipline as agents act in production—who is accountable, how to keep context portable/auditable, and how to avoid automating the wrong work.

Make one human accountable for every agent action

Angle/Theme: As orgs give agents more tool access, a governance norm is being restated bluntly: every agent action, tool call, and decision needs a single, named human accountable, as argued in the Accountability reminder.

This frames “agent autonomy” as a delegation mechanism, not a liability transfer; it also implies audit trails and permissioning should point back to an owner, not just a workspace or team.

Agents raise the cost of bad product process

Angle/Theme: Teams are reporting a new failure mode: coding agents make it easier to ship quickly in the wrong direction, so “good to bad faster than ever” becomes a product-process problem, not a model-quality problem, per the Process warning.

The core claim is that agent capability defaults to throughput unless the org explicitly counters it with tighter prioritization, review gates, and clearer definitions of what “done” means.

Context is becoming a first-class artifact, but tooling lags

Angle/Theme: “Context is the new code” is being treated as an ops gap: the context agents run on (instructions, memories, retrieved docs, runbooks) still isn’t reliably version-controlled, portable across tools, or auditable, as summarized in the Context as code gap note.

That framing pushes context management toward the same expectations as code: diffs, provenance, rollback, and review—especially once agents act in production.
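One concrete starting point is treating context files like locked dependencies. A minimal sketch, with an invented manifest format and file layout:

```python
# Minimal "context as code" sketch: pin every context artifact an agent
# runs on to a content hash, so changes show up as diffs in review and
# can be rolled back. File layout and manifest format are invented.
import hashlib
import json
from pathlib import Path

CONTEXT_FILES = ["AGENTS.md", "runbooks/deploy.md", "memories/team.json"]

manifest = {
    rel: hashlib.sha256(Path(rel).read_bytes()).hexdigest()
    for rel in CONTEXT_FILES
    if Path(rel).exists()
}
Path("context.lock.json").write_text(json.dumps(manifest, indent=2))
# Commit context.lock.json alongside code: a changed hash in review means
# the agent's operating context changed, not just its code.
```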

A “source-or-it’s-BS” habit is emerging among builders

Angle/Theme: A pragmatic trust posture is being voiced: treat any LLM response as likely wrong unless it cites or quotes a confirming source, and note that non-dev stakeholders often don’t have this reflex, per the Skepticism habit post.

This implicitly raises the bar for agent UX: citations, traceable evidence, and “show your work” become required features, not niceties.

Agent management mistake: over-specifying the how

Angle/Theme: A management pattern is being applied directly to agents: the common failure is telling an agent exactly how to do the work instead of defining what good looks like (clear outcomes), with the claim that corrections then “compound,” as stated in the Outcomes over procedures post.

In practice, this maps to specifying acceptance criteria, invariants, and non-goals—then letting the agent search within those constraints rather than following brittle scripts.
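In code terms, that looks like a spec object rather than a script. A hypothetical sketch with invented field names:

```python
# Hypothetical "outcomes, not procedures" task spec for an agent:
# acceptance criteria, invariants, and non-goals, with the "how" left open.
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    goal: str
    acceptance_criteria: list[str] = field(default_factory=list)  # what "done" means
    invariants: list[str] = field(default_factory=list)           # must never break
    non_goals: list[str] = field(default_factory=list)            # explicitly out of scope

spec = TaskSpec(
    goal="Reduce p95 latency of /search below 300 ms",
    acceptance_criteria=["load test shows p95 < 300 ms", "error rate unchanged"],
    invariants=["public API schema unchanged", "no new external services"],
    non_goals=["rewriting the ranking model"],
)
```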
