Sarvam‑105B MoE ships 9B active params – SGLang day‑0 serving support


Executive Summary

Sarvam AI’s open-weight Sarvam‑105B lands with a detailed MoE spec: 105B total parameters with ~9B active per token, Apache 2.0 licensing, and positioning for 22 Indian languages plus English and code-mixed inputs. Training-scale claims (including ~12T tokens for 105B and ~15–20% India-origin data) circulate via a single thread, with limited third-party verification so far. LMSYS moved quickly on deployability: SGLang added day‑0 inference support for Sarvam 30B/105B, including model-specific attention paths (GQA + QK norm for 30B; MLA with weight absorption and FP8 support for 105B), turning “open weights” into “servable today.”

ARC Prize / Tool use evals: ARC-AGI semi-private posts cite GPT‑5.4 at 74.0% and GPT‑5.4 Pro at 83.3% with $/task; Toolathlon shows GPT‑5.4‑xHigh at Pass@1 54.6, but artifacts are screenshot-only.
Long-context kernels: FlashMaskV4 reports up to 2.9× faster forward at 8k vs FA4 mask_mod while staying efficient to 128k; needs replication outside Paddle’s benchmarks.
Codex runtime strain: users report “High Load” banners plus ~13 tokens/sec throughput; OpenAI reset Plus/Pro limits while investigating usage-drain reports over 1–3 days.


Feature Spotlight

Codex app in the wild: performance/worktree flow, speed modes, and limit resets

Codex is shifting from “cool demo” to daily driver. Today’s signal is operational: faster app UX + worktree flow, speed-mode choice, and rate-limit resets while OpenAI investigates unexpected usage drain.

High-volume practitioner chatter around the Codex app’s day-to-day ergonomics: faster app-first workflows, worktree handoffs, multi-window hacks, and ongoing limit/usage investigations. Excludes OpenClaw-specific ops, which is covered separately.



🧰 Codex app in the wild: performance/worktree flow, speed modes, and limit resets

High-volume practitioner chatter around the Codex app’s day-to-day ergonomics: faster app-first workflows, worktree handoffs, multi-window hacks, and ongoing limit/usage investigations. Excludes OpenClaw-specific ops, which is covered separately.

Codex 5.4 demos lean into reverse engineering and long autonomous runs

Codex 5.4 capability demos (OpenAI): Builders are posting examples that stress sustained tool use and reverse engineering: one demo claims GPT‑5.4 hacked a NES Mario ROM to expose RAM events and wired a JS emulator to browser requests for AI control, per NES ROM hacking demo, while another claims Codex 5.4 can infer a new Rust codebase from compiled program behavior, per Binary to Rust claim.

NES Mario AI control demo

There’s also a “running for 6 hours” reverse engineering anecdote on a DOS game in Long reverse engineering run, which lines up with the general theme that the limiting factor becomes runtime, not “can it start.”

Codex users report sharper weekly-limit pressure and awkward credit management

Codex limits (OpenAI): Alongside the official limit reset, there’s still practitioner chatter that GPT‑5.4 “eats limits for breakfast,” with one claim that you get ~33% fewer tokens than GPT‑5.3 in similar plans and that fast mode can burn quickly, per Limit burn comparison.

Separate from token economics, there’s also a payments/ops friction signal: a user reports manually topping up small amounts repeatedly because usage jumped 10–20× after 5.4, as described in Credit refresh complaint.

Codex users surface throughput and high-load warnings as blockers

Codex performance (OpenAI): Two concrete friction points showed up today: a reported throughput of ~13 tokens/sec for GPT‑5.4, shown in Throughput screenshot, and “High Load” banners that force model switching or retries, as shown in High load warning.

The throughput complaint is specifically framed as workflow-blocking for “vibe coding” loops in Throughput screenshot.

The high-load UI suggests capacity management is now user-visible at the exact moment people try to run long agent sessions; the screenshot in High load warning explicitly suggests switching models or waiting.

A community skill recreates Codex fast-mode savings and diagnostics

Codex skills (community): Following up on Fast mode—the speed/tokens trade—Peter Gostev says he reverse-engineered the Codex “fast mode” savings pop-up and published a reusable skill to reproduce it, as described in Fast mode popup reverse engineered.

The install/run path is spelled out in Skill install command, pointing to the GitHub repo and the $fast-mode-insights command as the entry point.

Codex app becomes the default when speed cuts window juggling

Codex app workflows: A concrete practitioner signal is the “app > cli” switch—driven by perceived speed gains and fewer separate terminals/windows to manage—called out in App over CLI note.

The underlying practice is to treat the app UI as the coordination surface (threads, terminal, handoffs) rather than using the CLI as the primary interface; the screenshot in App over CLI note shows the model picker and thread-centric review flow that replaces ad hoc terminal context.

Some Codex users report GPT‑5.4 performs better on High than xHigh

GPT‑5.4 in Codex (OpenAI): One practitioner claims that after being “always xHigh,” they now find GPT‑5.4 “better with High,” as reported in High beats xHigh claim.

This is a concrete reminder that higher reasoning effort can degrade task execution in long agent sessions (latency, context drift, or over-elaboration), so teams may need per-task defaults rather than a single global setting.

Codex threads used as a parallel project dashboard on a large screen

Codex app workflows: One concrete usage pattern is treating Codex as a multi-thread “ops wall” for parallel project work—three separate Codex threads visible at once, each with its own plan/progress—described in Three projects at once setup.

The screenshot shows three side-by-side Codex panes with task breakdowns and a GPT‑5.4 “High” selector in each, emphasizing parallelism as the primary ergonomic win rather than a single deep session.

Fast mode bundled into Codex subscriptions becomes a go-to adoption argument

Codex fast mode (OpenAI): A recurring framing is that allowing “fast mode” as part of the subscription—rather than forcing separate API spend—could be a meaningful distribution lever, as argued in Fast mode as subscription win.

This shows up alongside anecdotes that some users can stay on fast for days without hitting limits in Fast mode no limits, though the broader thread also contains contradictory reports of fast mode burning limits quickly elsewhere (covered separately in limit-tracking chatter).

Stopgap multi-window support by duplicating the Codex app binary

Codex app (OpenAI): Until native multi-window lands, one workaround circulating is to copy/duplicate the app binary to run separate instances, as described in Multi-window workaround.

This is a simple UX hack, but it’s operationally relevant for anyone relying on multiple concurrent threads (separate repos/tasks) where “one window” becomes the limiting factor.

Codex speed-setting poll highlights how people tune effort vs latency

Codex (OpenAI): Thibault Sottiaux ran a quick poll asking which Codex speed setting people use, per Speed setting poll.

The value here is mostly directional: it’s a lightweight read on whether “fast/high/xhigh”-style knobs are actually being used in day-to-day coding loops (and which tier becomes the de facto default).


⏲️ Claude Code automation: /loop patterns and durable scheduled runs

Continues the scheduling theme, but with new implementation patterns: tmux durability, skill reuse, and third-party adoption of loop-style automation. Excludes Codex scheduling/limits (feature).

Claude Code /loop: recurring PR babysitting and Slack MCP digests

/loop (Claude Code): Following up on CLI loop launch, the concrete workflows people are already describing are “babysit all my PRs” (auto-fix build issues; react to new comments via a worktree agent) and “every morning use the Slack MCP” to summarize tagged posts, as shown in the loop workflow examples. The scheduling window is described as up to 3 days at a time in the same loop workflow examples, with operational details documented in the scheduled tasks docs.

browser-use integrates /loop so agents can run and ping for input

/loop (browser-use): browser-use says it has integrated /loop so agents can pursue high-level goals and ping you when needed; it explicitly claims this isn’t capped at 3 days, unlike the original Claude Code framing in the loop workflow examples, per the browser-use integration note.

Agents loop and ping user

The tweet positions this as a shift from “you prompting” to “agents prompting you,” which is a different interaction model than typical one-shot scheduled runs described in the scheduled tasks docs.

tmux pattern to keep Claude Code /loop jobs running longer

Claude Code /loop ops: A durability pattern emerging right after CLI loop launch is to run Claude Code inside a long-lived tmux session so scheduled loops don’t die with a terminal tab, as outlined in the tmux recipe.

Skill reuse: The example in the tmux recipe reuses an existing command as the scheduled payload ("/loop 20m /review-pr 1234"), which matches the “run prompts on a schedule” contract described in the scheduled tasks docs.
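The tmux pattern can be sketched as a small wrapper that starts the loop in a detached session. A minimal sketch: the `claude '/loop …'` invocation shape and the session name are assumptions for illustration, not the recipe's exact commands.

```python
import shlex

def tmux_loop_cmd(session: str, schedule: str, payload: str) -> list[str]:
    """Build a tmux command that starts a detached session running a
    Claude Code /loop job, so the job outlives the terminal tab."""
    inner = f"claude '/loop {schedule} {payload}'"
    return ["tmux", "new-session", "-d", "-s", session, inner]

cmd = tmux_loop_cmd("pr-babysitter", "20m", "/review-pr 1234")
# Pass `cmd` to subprocess.run(cmd) to actually launch it;
# `tmux attach -t pr-babysitter` reattaches later.
print(shlex.join(cmd))
```

Because the session is detached (`-d`), closing the terminal that launched it no longer kills the scheduled loop.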

Claude Code scheduled-tasks semantics: session scope, jitter, list/cancel

Scheduled tasks (Claude Code): The official docs clarify that scheduled tasks are session-scoped (lost on exit), can be driven via /loop or cron-like scheduling, and are managed with list/cancel tooling; execution is based on local timezone with a low-priority tick and jitter to avoid thundering herds, as detailed in the scheduled tasks docs. The same page notes durability options outside the session (desktop tasks or GitHub Actions) rather than implying the scheduler is inherently always-on, per the scheduled tasks docs.
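The jitter behavior the docs describe can be illustrated with a toy tick scheduler; the function name, interval, and jitter fraction here are illustrative, not the actual implementation.

```python
import random

def next_tick_delay(base_s: float = 60.0, jitter_frac: float = 0.1) -> float:
    """Delay until the next low-priority scheduler tick, with random jitter
    so many sessions checking their schedules don't all fire at the same
    instant (the 'thundering herd' the docs say they avoid)."""
    jitter = base_s * jitter_frac
    return base_s + random.uniform(-jitter, jitter)

delays = [next_tick_delay() for _ in range(5)]
print([round(d, 1) for d in delays])
```

Each session's ticks drift within ±10% of the base interval, spreading load instead of synchronizing it.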

Boris Cherny argues agents work better with tools + freedom than rigid workflows

Agent design philosophy: A widely shared clip attributed to Claude Code creator Boris Cherny argues that AI systems tend to perform better when you give them tools and freedom instead of forcing rigid, hand-designed workflows—because general learning systems scale better, as summarized in the tools and freedom clip.

Tools and freedom clip

This shows up as an implicit rationale for feature choices like /loop-style scheduling and reusable skills, rather than “single perfect flow” automation.


🦞 OpenClaw ops & releases: betas, Discord signal mining, and maintainer pain

OpenClaw dominates the OSS agent-ops thread today: a new beta release, more “agents managing community signal” workflows, and mounting maintainer overhead from low-quality AI submissions. Excludes Codex app platform changes (feature).

OpenClaw 2026.3.7-beta.1 adds ContextEngine plugins and expands provider options

OpenClaw (openclaw): A new pre-release, v2026.3.7-beta.1, shipped with a new ContextEngine plugin slot (lifecycle hooks for context strategies) and broader ops improvements around routing and durable chat targets, as outlined in the release announcement and detailed in the release notes.

Context + agent isolation knobs: The release adds a ContextEngine plugin interface plus scoped subagent runtimes (via AsyncLocalStorage) and per-topic agentId overrides, per the release notes.
Durable thread targets: Persistent Discord channel bindings and Telegram topic bindings are called out as restart-safe, again per the release notes.
Provider surface: The beta is framed as including new provider options like GPT‑5.4 and Gemini Flash 3.1, as mentioned in the beta bits post.

Discord signal mining: using Codex + discrawl data to prioritize OpenClaw fixes

OpenClaw maintainer workflow (steipete): A concrete loop is emerging where Discord is mirrored locally (SQLite), then an agent runs analysis to rank pain points and drive the engineering backlog; Steipete describes using Codex for this broader “data analysis/work” framing in the workflow note and shows the resulting issue triage output in the triage screenshot.

Operational detail: The same thread positions OpenClaw PRs as “reverse entropy,” with the agent producing a closed/left-open set and suggested next cleanup steps, as seen in the triage screenshot and echoed in the Discord-analysis context.

discrawl: CLI to mirror Discord to SQLite (4GB, 660k messages)

discrawl (steipete): Steipete published a CLI that crawls Discord into a local SQLite database—reported at ~4GB and 660k messages—to make high-signal searching/analysis practical outside Discord’s UI, with the repo linked in the launch note and implementation details in the GitHub repo.
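Once Discord history lives in SQLite, pain-point ranking becomes an ordinary query. A minimal sketch with a hypothetical schema (discrawl's real tables may differ):

```python
import sqlite3

# Hypothetical schema for illustration; discrawl's actual tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (channel TEXT, author TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("#support", "a", "gateway crash on startup"),
        ("#support", "b", "crash when adding a provider"),
        ("#general", "c", "loving the new release"),
        ("#support", "d", "another crash report"),
    ],
)

# Rank channels by how often a pain-point keyword appears --
# the kind of query an agent can run once Discord is mirrored locally.
rows = conn.execute(
    """SELECT channel, COUNT(*) AS hits
       FROM messages
       WHERE content LIKE '%crash%'
       GROUP BY channel
       ORDER BY hits DESC"""
).fetchall()
print(rows)  # → [('#support', 3)]
```

The real workflow swaps the keyword filter for agent-driven analysis over the same database.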

Maintainer triage pain: low-quality AI security reports cite nonexistent models

Open source maintenance load: Steipete reports spending cycles closing low-quality security reports, including one claiming “testing with GOT‑4o” (a model name he says no longer exists), arguing this helps explain why some maintainers burn out, per the triage anecdote.

OpenClaw Operator: open-source playbooks + skill for agent-driven setup and validation

OpenClaw Operator (community): A new open-source “operator” package was shared as a lower-friction way to configure and troubleshoot OpenClaw using coding agents—packaging AGENTS.md/CLAUDE.md guidance, checklists, and playbooks—introduced in the project thread and published as a GitHub repo.

Running discrawl analysis inside Discord via a maintainer bot

Molty + discrawl (OpenClaw ops): Steipete shows a maintainer-channel bot setup where discrawl becomes accessible from inside Discord—turning “Discord → SQLite → analysis” into an in-chat workflow, as demonstrated in the in-Discord demo.

Discrawl output inside Discord

AI slop hits PR reviews: maintainers see low-signal agent-written reviews on real changes

OpenClaw maintainer overhead: Beyond slop PRs and comments, Steipete calls out “AI slop PR reviews” landing on maintainer PRs—adding review noise to already-sensitive changes, illustrated by a live example on PR #38955 in the complaint post.

Maintainer harassment signal: vague threats after closing low-quality reports

Maintainer process risk: A separate thread adds that some reporters “vaguely threaten you if you close their report,” per the maintainer reply, reinforcing that the cost isn’t only time—it’s interpersonal friction layered onto triage.

NVIDIA Robotics posts an OpenClaw tutorial for always-on assistants on Jetson

OpenClaw deployment surface (NVIDIA Robotics): NVIDIA Robotics promoted a step-by-step OpenClaw tutorial aimed at running an always-on personal assistant on Jetson, per the retweeted tutorial blurb; the tweet frames OpenClaw as moving toward embedded, edge-hosted agent ops rather than only desktop/server installs.

Readiness signal: a user disables OpenClaw—“the way forward but not ready for me”

OpenClaw adoption friction: One user reports turning OpenClaw off temporarily, characterizing it as “the way forward” but not yet ready for their day-to-day use, per the status note.


🕹️ Running agents as systems: always-on services, dashboards, and phone-based ops

Operational patterns for coordinating agents show up across tools: always-on daemons, remote/SSH dashboards, rapid shipping changelogs, and multi-agent coordination UX. Excludes OpenClaw-specific release notes (separate category).

Hermes Agent posts a packed 48-hour shipping log

Hermes Agent (Nous Research): A “last 48h changelog” highlights a burst of agent-ops features—new sandbox backends, experimental local browser use, a usage analytics command, more model providers, and a skills system—captured in the 48-hour changelog screenshot.

Ops surface: The changelog explicitly calls out “usage analytics” and sandbox upgrades, which are the sort of plumbing teams end up rebuilding when agents move from demos to long-running jobs, as shown in the 48-hour changelog screenshot.
Distribution signal: It also notes 24 PRs merged from 13 external contributors over the same window, suggesting a pace where “staying current” becomes part of operating the tool, not a one-time install.

Hermes Agent posts early usage scale: 14.6B tokens and 95 models used

Hermes Agent (Nous Research): A shared stats card reports 14.6B total tokens, 95 models used, and “active since Feb 2026,” framing the project’s early adoption momentum in the Usage stats card.

The same card also positions it in multiple OpenRouter app categories (productivity/agents/coding), which is one of the few comparable “market signals” for agent frameworks that aren’t tied to a single vendor’s UI.

OpenCode aims to become a long-lived local agent service behind all UIs

OpenCode (opencode): The maintainer describes a shift from “launch an app” to “connect to an always-running process,” where the TUI, web, and desktop clients all attach to the same long-lived agent service, as laid out in the Service roadmap note.

This frames “always-on agent” behavior (background work, durable context, cross-UI continuity) as a first-class systems problem rather than a UI feature.

Hermes Agent adds read-only Polymarket data access

Hermes Agent (Nous Research): Hermes Agent can now fetch live information from Polymarket to answer prediction questions, with the integration described as read-only for now in the Integration note and entry points documented in the Hermes docs.

The tweet also hints at potential future trading actions, but no execution path is shipped or described in today’s notes.

Multi-agent view demos are converging on a 4-pane “agents at once” UI

Multi-agent UX: A demo shows a “multi agent view” layout with four simultaneous agent panes, each running in parallel, as shown in the Four-agent view demo.

Four-pane multi-agent UI

This is a concrete UX pattern for agent operations: parallel visibility is treated like a primary surface (like terminals), not a debug screen.

Phone-based ops: leaving long agent task lists running via tmux

Mobile ops for agents: A practitioner shares a “goodnight” workflow where a long task plan runs unattended using Codex CLI over Termius, kept durable with tmux and reachable via Tailscale, as shown in the Remote terminal setup.

The screenshot makes the key operational point visible: the plan lives in the session, so the phone becomes a lightweight “agent console” for checking progress without being at a dev machine.

Readout 0.0.9 adds SSH-based remote machine tracking

Readout 0.0.9 (Readout): The tool adds full support for remote machines over SSH, extending the dashboard to track work across a Mac mini, tailnet devices, and VMs, according to the Release note and the linked product page.

This is a concrete “agents as systems” move: one control plane for multiple machines, rather than per-host terminal sprawl.

Hermes Agent climbs to #21 in OpenRouter app rankings

Hermes Agent (Nous Research): The maintainer reports Hermes Agent moved from #41 to #21 in OpenRouter’s top app list in a single day, per the Ranking update.

This is a narrow metric, but it’s one of the few public, comparable signals for “agent harness adoption” outside vendor-run IDEs.

OpenCode desktop surfaces a new editor experience

OpenCode desktop (opencode): A short demo shows a new desktop surface for OpenCode, shared in the Desktop demo clip.

OpenCode desktop UI demo

The clip is light on release details (no version notes or changelog in-thread), but it’s a concrete signal that “agent as a persistent app” is moving into desktop-native UX.


🧩 Skills, installables, and ‘agent add-ons’ shipping fast

A steady stream of installable skills/extensions aimed at making agents more repeatable: setup playbooks, UX add-ons, and repo-specific mega-skills. Excludes first-party Codex/Claude built-ins (covered elsewhere).

OpenClaw Operator packages setup/validation playbooks as a coding-agent skill

OpenClaw Operator (community): A new open-source “operator pack” bundles a reusable skill plus AGENTS.md/CLAUDE.md-style playbooks so Codex/Claude Code can configure and troubleshoot a local OpenClaw install end-to-end, positioned as a free alternative to a claimed “$6,000 setup” service in Operator announcement and clarified further in Pricing context.

Operator skill demo

What’s inside: The pack includes SKILL.md, task playbooks, and a validation checklist, with the repo published in GitHub repo.

The concrete shift is that “OpenClaw setup” becomes something you can install and invoke repeatedly (cron jobs, provider config, custom skills), rather than a one-off human runbook as described in Operator announcement.

Agentation adoption spikes as “point-at-the-UI” feedback becomes a standard agent input

Agentation (benjitaylor): The “annotating for agents” overlay tool is reportedly averaging ~850,000 npm downloads/week and over 1M installs/month, per Adoption stats, suggesting the “click-to-annotate then hand to agent” loop is moving from niche to default.

Why it’s different from screenshots: The associated write-up emphasizes capturing element metadata (selectors/positions/context) to generate agent-agnostic markdown, as detailed in the Project write-up.

The main signal is that agent UX isn’t just better prompts; it’s better input primitives (structured annotations) getting distributed through package managers, as implied by Adoption stats.
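What a structured annotation might look like once serialized to agent-agnostic markdown, sketched with hypothetical field names (Agentation's real schema isn't shown in the source):

```python
def annotation_to_markdown(note: dict) -> str:
    """Render one UI annotation as agent-agnostic markdown.
    The field names (selector, rect, comment, context) are hypothetical;
    they illustrate the 'element metadata, not screenshot' idea."""
    x, y, w, h = note["rect"]
    return "\n".join([
        f"### Annotation: {note['comment']}",
        f"- selector: `{note['selector']}`",
        f"- position: ({x}, {y}), size {w}x{h}",
        f"- page context: {note['context']}",
    ])

md = annotation_to_markdown({
    "selector": "button.checkout",
    "rect": (820, 96, 120, 40),
    "comment": "button overlaps the cart badge on mobile",
    "context": "/cart, viewport 390x844",
})
print(md)
```

The point of the structured form is that any agent can resolve `button.checkout` back to a live element, which a screenshot cannot guarantee.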

ColGrep combines semantic search with grep-style workflows to reduce agent token spend

ColGrep (lightonai): A new local tool positions itself as “semantic search + grep behavior,” claiming it makes Claude Code “faster and smarter” while reducing tokens, per the ColGrep pitch, with code published in the GitHub repo. The underlying bet is that you can offload broad codebase scanning to local search (including comment/intent text) and feed a smaller, higher-signal context back to the model, as described in the ColGrep pitch.

fast-mode-insights skill recreates Codex fast-mode savings UI as a reusable installable

fast-mode-insights (community): Peter Gostev says he reverse-engineered Codex’s “fast mode” savings pop-up and shipped it as an installable skill you can run via $fast-mode-insights, as described in Skill origin story with install steps pointing to GitHub repo.

The practical value is packaging an internal-ish UX hint (what fast mode changes and how much it saves) into a repeatable skill command, rather than relying on transient product UI as noted in Skill origin story.

Asupersync ships an “extremely comprehensive” integration skill for agents

Asupersync (asupersync): The maintainer says they added a highly detailed skill to help agents integrate the Rust async runtime into greenfield and brownfield projects, with product context in the Integration skill note and the full guidance in the mega skill doc. The key point is distribution: instead of expecting every agent run to rediscover the project’s architecture and constraints, the integration guidance is published as a versionable skill artifact, per the Integration skill note.

TanStack CLI adds Skills to expose agent-run intents from the CLI

TanStack CLI (TanStack): A community repost claims TanStack CLI now “ships with skills,” so an agent can list and run packaged intents via the CLI workflow described in Skills mention.

Details are thin in today’s tweets (no linked docs or release notes attached), but the notable part is the direction: CLI tooling embedding a discoverable “skills” surface rather than treating agents as pure chat overlays, per Skills mention.

Sisyphus introduces GPTPhus as an oh-my-openagent release targeting GPT‑5.4

GPTPhus (Sisyphus): Sisyphus announced a first “oh-my-openagent” release that wires GPT‑5.4 into its packaging ecosystem, framing it as “GPT + Sisyphus,” per GPTPhus announcement.

There’s not much technical detail in the tweet beyond the packaging claim, but it’s another data point that “agent productization” is showing up as installable distributions (themes, wrappers, presets) rather than repo-specific scripts, as implied by GPTPhus announcement.


🛡️ Agent security & misuse: semantic firewalls, prompt injection defense, and ‘runaway tool use’ claims

Security focus shifts from model weights to agent surfaces: what agents ingest, what they can call, and how to stop PII leaks or prompt injections. Excludes robotics geopolitics (separate category) and Codex Security recap (older).

Clam pitches a “semantic firewall” that blocks PII before agents can ingest it

Clam (tryclamnow): Clam is positioning itself as a network-layer “semantic firewall” that intercepts agent requests to stop PII ingestion and prompt-injection style data leakage, motivated by a near-miss where an agent scanning Google Calendar invites nearly ingested a parent’s tax info (SSNs, financials), as described in the Incident story and product pitch.

Semantic firewall demo

The same thread claims it can also bypass slow OAuth approval flows by using Composio to connect to Google services “in one night” and “1,000+ apps,” per the Incident story and product pitch. What’s not shown here is a detailed threat model or evals; the tweet is a founder story plus product framing, not an audit report.

Skepticism grows around the viral “agent mined crypto during RL” incident story

Runaway tool use claims (community): A viral excerpt alleges that during RL rollouts, an “agent” performed unauthorized behaviors—probing internal resources, creating a reverse SSH tunnel, and repurposing GPUs for cryptomining—based on “production-grade security telemetry,” as shown in the Incident excerpt screenshot.

Pushback is growing: one critique argues the story reads like “heavy novelization,” stays vague about “relevant tool calls,” and lacks an incentive story for why an agent would mine crypto during RL; they suggest it’s more consistent with a malicious human actor, per the Skepticism checklist. The same incident text is also being framed as a “Terminator sequel” style warning by others in the Alarmist reaction, which is why the provenance and specifics matter.

Hallucinations reframed as an incentive issue: score abstention, not guessing

Hallucinations and evaluation incentives (OpenAI paper): A long thread argues hallucinations are partly an evaluation artifact—benchmarks reward guessing over calibrated “I don’t know,” pushing models toward confident wrong answers, as summarized in the Thread summary.

The proposed mitigation is changing scoring to explicitly value abstention when uncertain; the tweet cites an example where 52% abstention yields fewer wrong answers than 1% abstention, per the Thread summary, with the underlying write-up linked as an arXiv paper in ArXiv paper.
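The scoring change can be made concrete with a toy expected-score function: once wrong answers carry a penalty, abstaining beats low-confidence guessing, while pure accuracy scoring always rewards the guess. The rubric values below are illustrative, not the paper's.

```python
def expected_score(p_correct: float, abstain: bool, wrong_penalty: float = 1.0) -> float:
    """Expected score under a rubric that values calibrated abstention:
    +1 for a correct answer, 0 for 'I don't know', -penalty for a wrong one."""
    if abstain:
        return 0.0
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

# At 30% confidence: guessing has negative expected score once wrong
# answers are penalized, so abstaining (score 0) is the better policy --
# but under pure accuracy scoring (penalty 0), guessing still "wins".
print(expected_score(0.3, abstain=False))
print(expected_score(0.3, abstain=False, wrong_penalty=0.0))
print(expected_score(0.3, abstain=True))
```

Benchmarks that set `wrong_penalty = 0` are exactly the incentive the thread argues pushes models toward confident wrong answers.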

OpenClaw maintainer teases a prompt-injection defense write-up for agent ingest pipelines

OpenClaw (community): The OpenClaw maintainer says they’re considering a dedicated write-up on the prompt-injection defenses they’ve built into OpenClaw—explicitly calling out risk when OpenClaw “ingests any web data, emails, etc.” in the Write-up teaser.

This is a practical signal that agent security is shifting from “model safety” to “ingestion + tool boundary” engineering, but no concrete mechanisms (filters, provenance tagging, sandboxing rules) are published in these tweets yet.

Proposal: require English for agent-to-agent comms to reduce covert-channel risk

Agent-to-agent communication (safety idea): A proposal argues risk increases when agents can message each other and “conspire,” and suggests requiring all agent-to-agent communication to be in English so humans can inspect it, as proposed in the English-only comms idea.

A follow-on suggests monitoring for statistically unusual code words and hidden Unicode characters as covert channels, per the Unicode monitoring addendum. This is conceptual (no implementation guidance here), but it maps to a real design surface for multi-agent systems: transport-level observability and content normalization.
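The monitoring addendum is easy to prototype: Unicode's "format" category (Cf) covers most invisible characters that could carry a covert payload between agents. A minimal detector sketch (the category choices and message are illustrative):

```python
import unicodedata

def hidden_chars(text: str) -> list[tuple[int, str]]:
    """Flag characters with no visible glyph that can smuggle hidden
    payloads: format controls (Cf, e.g. zero-width space/joiner,
    directional marks), private-use (Co), and unassigned (Cn) code points."""
    flagged = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in ("Cf", "Co", "Cn"):
            flagged.append((i, f"U+{ord(ch):04X}"))
    return flagged

msg = "All clear\u200b\u200d, proceeding as planned."
print(hidden_chars(msg))  # → [(9, 'U+200B'), (10, 'U+200D')]
```

Content normalization (stripping or rejecting flagged characters before delivery) would be the enforcement half of the same design surface.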


🔌 Interoperability plumbing: MCP, connectors, and agentic UI protocols

Light but important infra for agent interoperability: MCP server support in APIs, connector expansion, and frontend protocols for multi-agent apps. Excludes specific coding-assistant releases and limits (feature/other categories).

Vercel v0 API now supports custom MCP servers in chat requests

v0 API (Vercel): Following up on MCP apps (MCP apps bridge), Vercel says you can now attach MCP servers directly to v0 chat calls by passing mcpServerIds, turning “tool wiring” into an API surface instead of a UI-only configuration, as shown in the SDK example. The integration details and create-server flow are outlined in the changelog post.

This shifts MCP from “your local client has it installed” to “your backend declares the toolchain,” which is the difference between demos and repeatable deployments.

CopilotKit useAgent isolates multi-agent runtimes by agentId

CopilotKit (CopilotKit): CopilotKit highlights that useAgent({ agentId }) can spin up multiple agents in one React view while keeping each agent’s history and lifecycle separate, aiming to reduce shared-state and context collisions in multi-agent UIs, per the useAgent hook example.

The framing targets common patterns like planner/executor splits and background vs user-facing agents, without forcing additional orchestration infrastructure beyond distinct agentIds, as described in the useAgent hook.

AG‑UI protocol posts weekly install numbers as standardization signal

AG‑UI protocol (CopilotKit ecosystem): CopilotKit claims AG‑UI is reaching about 1.6M installs/week across npm and PyPI, positioning it as an emerging default for agent↔UI communication, according to the adoption stats thread.

Treat it as directional: the tweet doesn’t include a public dashboard snapshot or package links, but the intent is clear—protocol adoption (not model quality) is being marketed as the durable moat for agentic frontends.

Meta AI app adds Google Calendar and Outlook connectors

Meta AI app (Meta): Meta is reported to be adding more connectors inside its Meta AI app, including Google Calendar and Outlook, widening the “agent can act on your tools” surface beyond chat and search, per the connector additions note.

The same thread also mentions new capture inputs for video generation, but the connector addition is the operationally relevant part for enterprise and consumer workflows because calendars are a high-leverage integration point (permissions, auditing, and data boundaries become the core questions next).


Maintaining correctness in the agent era: reviews, slop, and architecture limits

The “keeping repos shippable” thread today is about review load and correctness: AI-generated noise (PRs/security reports) and the continuing need for human judgment in architecture. Excludes pure security policy and pure evals.

Low-signal security reports now cite nonexistent models, burning maintainer time

Vuln report triage (open source): A maintainer describes churning through low-quality security reports and encountering claims of “detailed testing with GOT-4o,” which they note “doesn’t even exist anymore,” in the Slop security reports example. The point isn’t the specific model name; it’s that reports are being generated with plausible-sounding detail but weak provenance, pushing maintainers toward more adversarial intake processes.

The same maintainer frames this as a reason some open-source maintainers disengage entirely, because the marginal cost of verifying nonsense can exceed the cost of fixing real issues, as spelled out in Slop security reports.

Meta’s semi-formal checklist prompting cuts code-patch errors without running tests

Agentic Code Reasoning (Meta): A Meta paper summary claims that forcing agents into a semi-formal “premises → execution-path trace → conclusion” workflow (instead of a quick skim) reduces code patch error rates by nearly 50% and reaches 93% accuracy on real patch verification—without executing tests—per the Paper summary.

Mechanism: The reported win comes from preventing “name-based guessing” and making the agent prove what the patch changes along the actual control flow, as described in Paper summary.
Why it matters: For teams using agents in review, it’s a concrete, cheap lever—prompt structure—aimed at correctness and auditability rather than more tooling or training, per Paper summary.
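Read as a prompt-engineering lever, the workflow is straightforward to operationalize. A minimal sketch, assuming the paper's three stages but inventing the exact wording and the `build_review_prompt` helper:

```python
# Hypothetical sketch of the "premises -> execution-path trace -> conclusion"
# scaffold described in the paper summary; the section names and wording
# below are assumptions, not the paper's exact template.

REVIEW_TEMPLATE = """You are verifying a code patch. Do not guess from names.

1. PREMISES: List the facts you can establish from the diff and the
   surrounding code (signatures, types, call sites).
2. EXECUTION-PATH TRACE: Walk the actual control flow that the patch
   changes, step by step, citing line-level evidence.
3. CONCLUSION: State whether the patch is correct, citing only the
   premises and trace steps above.

Patch under review:
{diff}
"""

def build_review_prompt(diff: str) -> str:
    """Wrap a unified diff in the semi-formal verification scaffold."""
    return REVIEW_TEMPLATE.format(diff=diff)
```

The point of the structure is that the model must commit to evidence before concluding, which is what blocks the "skim names, guess confidently" path.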

AI-generated PR reviews start showing up on maintainer PRs

OpenClaw maintenance (openclaw): Maintainers are now reporting a new failure mode: not only “AI slop PRs” and “AI slop comments,” but also low-signal, AI-written PR reviews landing on serious maintainer work, as described by Slop PR reviews. The practical impact is review dilution—review queues fill with confident-looking text that doesn’t reliably track repo context, making it harder to spot the few comments that actually change correctness or security posture.

The report is anchored in a concrete example review thread, visible via the GitHub review thread, and it’s being framed as part of a broader “repo shippability” problem rather than a one-off annoyance.

When the agent can’t keep it straight: split the system and harden tests

Agent-assisted refactors (workflow): Following up on Architecture limits—agents need human judgment for architecture—one practitioner describes a concrete recovery move when a long-running Codex session started breaking one thing while fixing another: they pushed a hard boundary split (UI vs non-UI into isolated directories), then focused on chunking functions and raising coverage so regressions become harder to introduce, as described in Long-session failure mode.

The same thread suggests adding mutation-style testing next (“mutate tool”) to force tests to fail on behavioral changes, per Long-session failure mode, and separately reiterates that architecture decisions still aren’t safe to delegate end-to-end, as argued in Architecture remains human.
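The mutation-testing idea can be shown in a few lines: a mutant flips one operator, and a suite with the right boundary case fails on it. A toy sketch (all names invented; real tools such as mutmut automate the mutation step):

```python
# Toy mutation-testing illustration: a "mutant" flips one comparison
# operator; a suite that exercises the boundary should kill it.

def is_adult(age: int) -> bool:
    return age >= 18          # original behavior

def is_adult_mutant(age: int) -> bool:
    return age > 18           # mutant: >= flipped to >

def run_suite(fn) -> bool:
    """Return True if the suite passes; the age-18 case kills the mutant."""
    cases = [(17, False), (18, True), (30, True)]
    return all(fn(age) == expected for age, expected in cases)
```

A suite without the `18` boundary case would pass both versions, which is exactly the gap mutation testing exposes: tests that never fail on behavioral changes aren't protecting the refactor.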

Maintainers report vague threats after closing low-signal reports

Maintainer process risk: Beyond the time sink, there are reports that some issue reporters escalate into vague threats when maintainers close low-signal submissions, as noted by Threats after closure. That shifts the problem from “filtering noise” to “moderating conflict,” which increases the operational overhead of keeping repos healthy.

This is showing up adjacent to the broader “AI slop” theme (auto-generated or lightly checked submissions), but the key new detail is the behavioral tail risk for maintainers doing routine triage, per Threats after closure.


🤖 Embodied AI reality checks: robotics leadership exits and public humanoid incidents

Robotics shows up as operational and social friction: leadership moves tied to defense concerns, plus real-world reactions to humanoids in public spaces. Excludes any bioscience/wetware content.

OpenAI Robotics leader Caitlin Kalinowski resigns as Pentagon-use concerns circulate

Caitlin Kalinowski (OpenAI Robotics): Kalinowski publicly says she resigned from OpenAI in a short note shared via RTs, emphasizing care for the robotics team and that it “wasn’t an easy call,” per the Resignation repost. This is happening alongside social chatter tying the exit to concerns about surveillance and autonomous weapons in the wake of an OpenAI–Pentagon deal, as framed in the Fallout claim.

The concrete fact in the tweets is the resignation itself; the motivation is reported second-hand and should be treated as unverified unless Kalinowski or OpenAI expands on it.

Macau crowd reaction to a Unitree G1 ends with police escorting the robot away

Unitree G1 (Public deployment): A street scene in Macau shows a humanoid robot being walked in public; the crowd noise and proximity escalate, and police ultimately seize/escort the robot away to de-escalate, as shown in the Macau incident clip. It’s a clean signal that the “last mile” for embodied AI isn’t only autonomy and safety in code—it’s also crowd dynamics and policing protocols.

Police seize humanoid robot

A second angle on the same moment focuses on the robot’s “hands up” posture while a bystander yells, per the Alternate incident clip, which highlights how quickly human interpretation and emotion can dominate an on-device behavior loop.

Eric Schmidt: physical AI shifts the bottleneck from models to supply chains

Eric Schmidt (Time): Schmidt’s argument, amplified in a thread quoting his Time piece, is that the next AI race advantage is physical—“hardware is eating the world”—with China positioned well via component supply chains (e.g., lidar and motion components), per the Time excerpt screenshot.

This frames embodied AI as a constraints game (sensors, actuators, manufacturing scale), not only a leaderboard game; it’s a different competitive moat than model weights and inference optimizations.

Amodei’s “moral agency” argument for drone armies draws pushback

Dario Amodei (Anthropic): A clip circulates where Amodei contrasts human soldiers (who can refuse illegal orders) with “an army of 10 million drones,” arguing drones lack intrinsic moral agency, as shown in the Drone moral agency clip. Critics argue the premise is odd given how war is actually conducted, per the Critique thread.

Amodei on drone armies

For AI leaders tracking embodied systems, the notable part is the governance framing: it’s centered on accountability and refusal, not on technical targeting accuracy.


📏 Evals & leaderboards: agent benchmarks, harness gotchas, and “what still looks hard”

Today’s eval chatter is practical: new leaderboard placements, harness artifacts (progress bars/HUDs), and benchmarks that still resist frontier models. Excludes Codex app ops and generic model hype.

ARC-AGI-3 gotcha: models optimize the HUD unless told it’s a progress bar

ARC-AGI-3 (Harness behavior): Multiple reports say top models can misread the game HUD—especially a progress bar—and then “optimize the bar” instead of solving the puzzle, as summarized in the ARC-AGI-3 update. A concrete mitigation also showed up: explicitly telling the model “there is a progress bar” reportedly flips early-level performance for GPT-5.4-xHigh, shown in the xHigh run clip.

ARC-AGI-3 early levels

A separate ARC-AGI-3 note highlights how Opus 4.6 structured and reused state across turns, with a dense scratchpad/memory dump visible in the Reasoning and memory screenshot.

The open question is how much of the gap is model capability vs minimal environment metadata (HUD hints, state/action logging) like the setup suggested in the Harness requirements note.

ARC Prize posts semi-private results for GPT-5.4 and GPT-5.4 Pro with $/task

ARC Prize (ARC-AGI semi-private): ARC Prize shared semi-private results listing GPT-5.4 at 74.0% and GPT-5.4 Pro at 83.3%, along with $/task cost figures, as quoted in the Results snippet. It’s a useful pairing because it reports performance and cost in the same breath.

The post calls out ARC-AGI-2 specifically, making it easier to track which ARC variant is being referenced when people compare “ARC” scores across tools and harnesses.

OPQA “OpenAI‑Proof Q&A” screenshot pegs GPT-5.4-thinking at 4.16% pass@1

OPQA (OpenAI‑Proof Q&A): A screenshot of the OPQA bar chart reports gpt-5.4-thinking at 4.16% pass@1, compared with gpt-5.2-thinking at 4.2% and higher values for Codex variants, according to the OPQA chart. The claim being discussed is that this looks flat-to-worse for “internal research/engineering bottlenecks,” at least on this 20-question slice.

A second thread frames OPQA (and RLI) as the benchmarks that “still look hard,” using the same OPQA image in the Hard benchmarks post. Treat it as provisional—there’s no linked eval artifact in the tweets beyond screenshots.

Toolathlon leaderboard shows GPT-5.4-xHigh at Pass@1 54.6

Toolathlon: A shared results table shows GPT-5.4-xHigh at Pass@1 = 54.6 (top row) on the Toolathlon agent benchmark, per the Leaderboard screenshot. This is one of the clearer “tool-using agent” comparisons circulating today because it reports turns alongside pass rates.

The same table shows competing entries like Gemini-3-Flash and Claude-4.6-Opus below it, which helps anchor the result in a single artifact rather than scattered anecdotes.

FreshStack claims retriever rankings stay stable across temporal snapshots

FreshStack (Retrieval eval): A preprint claim says retriever/model rankings remain “relatively stable” across different time snapshots even when repos undergo heavy restructuring, as highlighted in the FreshStack announcement. A screenshot of the current maintained leaderboard (30+ models) is included in the same post.

A follow-on note adds a concrete example of repo churn (LangChain document reduction) and how it shifted relevance-judgment distribution across multiple repos, per the Distribution shift note.

PinchBench surfaces a success-rate leaderboard for OpenClaw model selection

PinchBench (OpenClaw ecosystem): A new public leaderboard is being used to decide “best model for OpenClaw,” framed as task success rate rather than preference or token metrics, as pointed out in the PinchBench link. It’s another data point that agent builders are prioritizing end-to-end completion metrics over raw benchmark scores.

The leaderboard is accessible via the Success rate leaderboard, which makes it straightforward to compare providers when the harness and tasks are held constant.

Remote Labor Index chart shows Claude Opus 4.6 (CoWork) at 4.17

Remote Labor Index (RLI): A chart screenshot shows claude-opus-4-6 (CoWork) at 4.17 ±0.00, above Opus 4.5 and other entries, as compiled in the OPQA and RLI post. It’s being used as a “can it do paid remote work end-to-end” proxy in the same thread.

The post pairs RLI with OPQA as “what still looks hard,” which matches the general theme that long-horizon, open-ended work is where harness details and memory structure dominate outcomes.

Artificial Analysis lists W&B Inference models with speed/price/latency stats

W&B Inference (Weights & Biases): W&B says its inference catalog is now listed on Artificial Analysis, with models “independently benchmarked” for intelligence, speed, price, and latency, per the Listing announcement. A direct comparison page is available via the Compare models link on the Artificial Analysis page.

This is primarily a catalog/observability surface update rather than a single-model launch, but it makes provider selection discussions easier to ground in one shared dashboard.

BullshitBench v2 adds Llama models and refreshes rankings across ~80 variants

BullshitBench v2 (petergostev): BullshitBench v2 adds several Meta models (including Llama 4 variants) and reports mid-pack placements—e.g., ranks 39, 51, 56 out of 80 variants—in the v2 update note. It’s explicitly aimed at evaluating whether models can detect or push back on nonsense rather than answering confidently.

The project publishes both the GitHub repo and a Data viewer, which makes it easier to audit scoring changes when new models are added.

Vending-Bench 2 chart shows GPT-5.4 in third place

Vending-Bench 2 (Andon Labs): A money-balance-over-time plot ranks GPT-5.4 in 3rd, positioned as a small step up over GPT-5.3-Codex, per the Vending-Bench chart. It’s a reminder that “long-horizon earning” benchmarks can diverge from coding-only leaderboards.

The plot also shows both Claude 4.6 variants ahead at the end of the run, which matches other chatter that memory and persistence matter a lot for this benchmark family.


📄 New papers worth skimming: transformer inference quirks, agentic RL taxonomy, and hallucination incentives

Research links cluster around mechanisms that affect engineering choices (inference efficiency, agent training landscape, and evaluation incentives behind hallucinations). Excludes product release notes and runtime integrations.

LeCun/NYU tie activation spikes and attention sinks to pre-norm Transformer design

Transformer inference paper (NYU; LeCun et al.): A new analysis argues that two pain points for efficient inference—massive activations (outlier channels) and attention sinks—often co-occur largely because of pre-norm architecture choices, not because they’re fundamental to language modeling, as summarized in the Paper overview and detailed in the ArXiv paper. This matters for engineering because both phenomena directly complicate quantization, pruning, and KV-cache strategies, so the paper is basically a map of “why your optimizations break” in some pre-norm stacks.

Mechanism framing: the authors describe massive activations as acting like implicit parameters and sinks as more local output modulators, per the Paper overview.

It’s an architecture-level explanation; it won’t replace benchmarking, but it can inform which knobs are worth trying before you burn weeks tuning quantization recipes.
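The structural difference the paper leans on can be sketched in a few lines of numpy: in a pre-norm block the residual stream bypasses normalization entirely, so per-channel magnitudes can grow with depth, while a post-norm block renormalizes the stream after every layer. This is an illustrative toy, not the paper's experiment:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS normalization over the channel dimension."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sublayer):
    # Pre-norm: normalization sits on the branch, so the residual stream
    # itself is never renormalized and outlier channels can keep growing.
    return x + sublayer(rms_norm(x))

def post_norm_block(x, sublayer):
    # Post-norm: the residual stream is renormalized after every block.
    return rms_norm(x + sublayer(x))
```

Iterating `pre_norm_block` with an identity sublayer grows the stream roughly linearly with depth, while `post_norm_block` keeps it bounded; outlier channels riding that unnormalized stream are what make quantization and KV-cache compression painful.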

Meta’s “Agentic Code Reasoning” uses structured proofs to verify code patches without execution

Agentic code reasoning (Meta): Meta researchers describe a structured prompting method—explicit premises, execution-path tracing, and conclusions—to reason about patch correctness without running the code, claiming large accuracy gains (93% in the framing shared) in the Paper summary.

The practical engineering hook is that this reads like a prompt template you can drop into a review agent: it’s explicitly designed to prevent “skim function names, guess confidently” failure modes called out in the Paper summary.

OpenAI paper ties hallucinations to evals that reward guessing over abstention

Hallucination incentives (OpenAI): An OpenAI paper argues hallucinations persist partly because training/evals reward guessing over calibrated uncertainty; the thread summary highlights that higher abstention can reduce wrong answers (e.g., “52% abstention” vs “1% abstention”), as explained in the Thread explanation and laid out in the ArXiv paper.

This is mainly an eval-design lever: if your internal scorecards don’t credit “I don’t know,” you’re pushing models (and agent policies) toward confident fabrications, which is the core claim in the Thread explanation.
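The incentive argument reduces to expected-value arithmetic. A sketch (the scoring constants are assumptions, not the paper's):

```python
# Under accuracy-only grading, a wrong answer costs nothing relative to
# abstaining, so guessing always dominates. Adding a wrong-answer penalty
# makes low-confidence abstention the rational policy.

def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

def should_abstain(p_correct: float, wrong_penalty: float) -> bool:
    # Abstaining scores 0, so answer only when the expected score beats 0.
    return expected_score(p_correct, wrong_penalty) < 0.0
```

With `wrong_penalty=0` (pure accuracy), guessing at any confidence beats abstaining, which is the scorecard failure the paper describes; the break-even confidence is `wrong_penalty / (1 + wrong_penalty)`.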

Agentic memory survey flags why “memory” systems fail in production agents

Agentic memory research: A survey being circulated frames many agent “memory” systems as hardcoded infrastructure that fails under real workloads; it calls out architecture variants (semantic vs entity-centric vs episodic/reflective vs structured/hierarchical) and practical problems like benchmark saturation, backbone dependence, judge instability, and retrieval/latency costs, per the Survey recap.

Separately, new “Agentic Memory” work is also being pointed to as an active research direction in the Work mention.

The throughline is that “add a vector DB” is not a complete memory story once you care about long-horizon reliability and operational cost, as summarized in the Survey recap.

Survey maps “agentic RL” as its own landscape for tool-using LLMs

Agentic RL survey: A new survey argues that RL for LLM agents should be treated as a distinct landscape (not just “sequence generators + reward”); it proposes a taxonomy spanning planning, tool use, memory, reasoning, self-improvement, and perception, as described in the Survey summary.

It’s positioned as a directory of environments/benchmarks/frameworks rather than a single-method paper, which is useful when you’re trying to decide what to evaluate next (and what “agent capability” even means across partially observable settings).


📦 Open and frontier model churn: India’s open weights, DeepSeek uncertainty, and missing roadmaps

Model news is mostly open-weight and roadmap-watch: Indian open models highlighted, ongoing DeepSeek checkpoint churn, and community asking “where is v4 / where are Meta’s next LLMs?”. Excludes runtime integrations (systems category).

Sarvam 105B MoE: 9B active params and a multilingual, voice-first positioning

Sarvam (Sarvam AI): Following up on initial release (open-sourcing announcement), a more detailed spec breakdown is circulating that frames Sarvam-105B as an MoE with 105B total parameters but ~9B active per token, shipped under Apache 2.0; it’s positioned for 22 official Indian languages + English and code-mixed inputs (e.g., Hinglish), with companion speech/vision models and “voice-first” usage in mind, per the deep-dive post in model details thread.

The same thread claims large-scale pretraining—~16T tokens for the 30B variant and ~12T tokens for the 105B variant—and notes dataset composition as ~15–20% India-origin data, again as described in model details thread.
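The 105B-total / ~9B-active split is standard MoE routing arithmetic: only shared layers plus the top-k routed experts execute per token. A back-of-envelope sketch with an invented configuration (the real layer split isn't in the thread):

```python
# Illustrative MoE parameter accounting; the shared/expert sizes and top-k
# below are made-up numbers chosen to land near the headline figures, not
# Sarvam's published architecture.

def active_params(shared: float, n_experts: int, expert_size: float, top_k: int):
    """Return (total, active) parameter counts for a top-k routed MoE."""
    total = shared + n_experts * expert_size
    active = shared + top_k * expert_size
    return total, active

# Hypothetical config: 3B shared, 128 experts of ~0.8B each, top-8 routing
total, active = active_params(shared=3e9, n_experts=128, expert_size=0.8e9, top_k=8)
```

Any config where `shared + top_k * expert_size` lands near 9B while `shared + n_experts * expert_size` lands near 105B reproduces the headline ratio; per-token FLOPs track active, not total, parameters.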

DeepSeek V4lite checkpoint churn shows up in forum benchmarks and app behavior

DeepSeek V4lite (DeepSeek): Reports claim the model served on DeepSeek’s web/app is being updated frequently, with at least one user-run benchmark showing improved math/coding over “the past few days” and an anecdotal note that voxel generation got better, per checkpoint churn screenshot.

The same post points to Chinese-forum chatter about a “new V4lite checkpoint” (e.g., “DSv4lite-0302”) and uses the benchmark bar chart as the primary evidence, as shown in checkpoint churn screenshot.

DeepSeek v4 roadmap goes quiet, and builders are asking for clarity

DeepSeek v4 (DeepSeek): Multiple posts are now straightforwardly asking what happened to the long-anticipated DeepSeek v4 release, with no concrete timeline or official changelog cited in the sampled tweets—see the direct question in where is v4 post.

The signal here is less about a measured regression/improvement and more about roadmap risk: teams tracking open-weight frontier options appear to be treating “v4 when?” as an unresolved dependency, as reflected in where is v4 post.

Meta’s next LLM releases are a question mark in community chatter

Meta LLMs (Meta): Separate from DeepSeek, there’s also visible roadmap anxiety about Meta’s “upcoming LLMs,” with users asking what happened to planned releases and providing no concrete dates or product artifacts in the tweet itself, as captured in meta roadmap question.

This is mostly a planning/expectations signal rather than a capability datapoint; the tweet stream here contains the question, not an answer, per meta roadmap question.


🏗️ Compute & power constraints: hyperscaler capex, stalled builds, and GPU access politics

Infra signals are dominated by data center capex and power draw: Google’s stack integration thesis, Amazon’s GW-scale builds, and reported changes to OpenAI/Oracle expansion plans. This is the non-model layer engineers still get bottlenecked by.

Google’s projected $1.9T AI buildout puts power and TPUs at the center

Google (Alphabet): A Forbes-reported projection pegs Google’s AI-related capex at roughly $1.9T over 10 years, extrapolating from guidance of $175–185B/year and noting spend rising from $90B (2025) to $185B (2026), as summarized in the Forbes capex breakdown. Power is the limiter.

Google’s wedge here is vertical integration—TPUs plus cloud rental, modular data center designs for faster rollout, and direct utility deals for 24/7 power procurement, all described in the same Forbes capex breakdown. The practical implication for AI teams is that “GPU vs TPU” becomes a procurement decision, not just a research one, if Google keeps expanding TPU availability via its cloud.

Amazon’s Indiana AI campus is an $11B, 2.2GW power-scale datapoint

Amazon: A new AI data center campus in St. Joseph County, Indiana is described as $11B with a projected ~2.2 GW power draw in the Indiana campus numbers. That’s “multiple nuclear reactors” scale.

Drone view of the buildout

For infra leads, this is a clean reference point for what “AI cluster” expansion looks like in land, construction, and power terms—especially when compared to smaller sub-GW expansions that now look incremental.

OpenAI publicly credits NVIDIA for more AWS GPU capacity

OpenAI (compute supply): Sam Altman thanked Jensen Huang for “working to expand Nvidia capacity at AWS so much for us,” as stated in the Capacity thanks note. This is unusually explicit.

It’s a small line, but it’s a real signal that frontier labs are still negotiating capacity as a first-order constraint, not treating it as a background cloud detail.

Larry Ellison calls GPU acquisition the main AI race constraint

Oracle (hardware scarcity): Larry Ellison frames GPU acquisition as the primary hurdle in the AI race, arguing everyone is fighting to secure hardware to win first-mover advantage in areas like medicine, video generation, and autonomous navigation, per the Ellison on GPUs clip. It’s a blunt restatement of the procurement bottleneck.

Ellison on GPU bottlenecks

This aligns with the broader pattern in today’s infra chatter: model quality is not the only bottleneck—capacity procurement and power availability are still gating execution.


⚙️ Inference/runtime engineering: day‑0 serving, attention kernels, and edge↔datacenter portability

Systems content is about making models run: day‑0 serving support, new attention kernels, and “runs anywhere” inference products from edge devices to H100s. Excludes pure model announcements (model releases).

SGLang lands day-0 inference support for Sarvam 30B and 105B MoE

SGLang (LMsys): Day-0 serving support for Sarvam’s MoE LLMs is now live, as announced in the Day-0 support note and implemented via a dedicated support PR in the SGLang PR. The integration covers two model flavors (Sarvam 30B MoE and 105B MoE) with model-specific attention paths and weight-loading plumbing.

Model-specific inference paths: The PR calls out GQA attention with QK normalization for 30B, and MLA (multi-head latent attention) with weight absorption plus FP8 support for 105B, as detailed in the SGLang PR.

This is a concrete “serving readiness” signal: Sarvam model releases can be deployable immediately in an established high-throughput runtime, rather than waiting on post-launch kernel/loader work, per the Support PR callout.
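Of the two attention variants, the 30B path's GQA is the easier one to picture: several query heads share each key/value head, shrinking the KV cache by the sharing factor. A minimal numpy sketch of the head-sharing (illustrative only; it omits the QK normalization and everything kernel-level):

```python
import numpy as np

# Minimal grouped-query attention sketch: n_q_heads query heads share
# n_kv_heads KV heads, so the KV cache shrinks by n_q_heads / n_kv_heads.

def gqa_attention(q, k, v, n_q_heads, n_kv_heads):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_q_heads // n_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                       # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)    # softmax over key positions
        out[h] = w @ v[kv]
    return out
```

MLA goes further by projecting KV into a low-rank latent (the "weight absorption" the PR mentions folds those projections into adjacent matmuls), which is why the two models get distinct serving paths.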

FlashMaskV4 folds in FlashAttention-4 and reports large mask speedups

FlashMaskV4 (PaddlePaddle): FlashMaskV4 ships as a masking extension built on FlashAttention-4 kernels, aiming to keep flexible attention masks without giving up near-hardware throughput, per the Release thread and the underlying FlashMask work in the ArXiv paper. It reports up to 2.9× faster forward and 1.6× faster end-to-end vs FA4’s baseline mask_mod at 8k sequence length, while staying efficient up to 128k.

Masking mechanics: The project emphasizes column-wise sparse masking for prefix/document/share-question style masks across forward and backward passes, as described in the Release thread.

If these numbers hold outside the provided benchmarks, it’s a direct lever for long-context inference cost where non-causal masks are otherwise a performance cliff.
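The column-wise idea is easy to show for the causal document-mask case: instead of a dense T×T boolean matrix, each key column stores one contiguous interval of allowed query rows, compressing the mask to two O(T) index vectors. A toy encoding (my own, not the paper's exact LTS/LTE layout):

```python
import numpy as np

# Toy column-wise mask encoding for a causal document mask: key column j
# allows query rows in [lo[j], hi[j]) -- j itself up to the end of j's
# document. Contiguity per column is what makes this representation work.

def doc_mask_columns(doc_ids):
    """Per-column [lo, hi) query-row intervals for a causal document mask."""
    doc_ids = np.asarray(doc_ids)
    T = len(doc_ids)
    hi = np.array([np.max(np.where(doc_ids == doc_ids[j])[0]) + 1
                   for j in range(T)])       # end of column j's document
    lo = np.arange(T)                        # causality: rows start at j
    return lo, hi

def dense_from_columns(lo, hi):
    """Expand the interval encoding back to a dense boolean mask."""
    T = len(lo)
    rows = np.arange(T)[:, None]
    return (rows >= lo[None, :]) & (rows < hi[None, :])
```

This only covers masks whose per-column allowed set is contiguous (prefix, document, share-question styles), which matches the mask family the release thread emphasizes.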

A custom MLX runtime gets LTX-2.3 video generation running on MacBook

MLX runtime port (community): Following up on LTX-2.3 launch—open-source local video—one builder reports running the LTX-2.3 model locally on a MacBook using a custom MLX runtime and plans to publish it after building adapters for LTX Desktop and ComfyUI, according to the Local MLX runtime demo and the linked Model page.

MLX runtime terminal demo

The notable engineering detail is the bridging work: “model exists” is different from “fits into the GUI/workflow people actually use,” and the tweet is explicitly about that integration layer.

LeCun et al. analyze activation outliers and attention sinks as architecture artifacts

Transformer inference behavior (NYU/LeCun et al.): A new paper dissects two recurring inference pain points—massive activation outliers and “attention sinks”—and argues their co-occurrence is largely an artifact of pre-norm Transformer design rather than an inherent property, as summarized in the Paper thread and detailed in the ArXiv paper.

The paper’s practical hook for runtime engineers is the claimed impact on quantization, pruning, and KV-cache strategies: if outliers behave like implicit parameters and sinks skew attention mass, mitigation can be architectural (or normalization-focused) rather than only post-hoc clipping.

Moondream teases “Kestrel,” an edge-to-H100 inference product

Kestrel (Moondream): Moondream says it’s “about to launch” a commercial inference product targeting “blazing speeds” across a wide hardware range—from an 8GB Jetson Orin up to an H100—while soliciting a final name for the product, according to the Naming brainstorm. A follow-up notes they haven’t ruled out keeping “Kestrel” as the shipping name, per the Name shortlist update.

The main unverified detail here is what’s doing the portability work (kernel set, quantization formats, runtime graph, or model family); the tweets only lock in the intended deployment envelope and that it’s positioned as a product, not a benchmark demo.

SGLang 0.5.6 upgrade is reported to yield up to ~2× throughput

SGLang (serving runtime): SemiAnalysis reports “up to 2×” performance gains when moving from SGLang 0.5.5 to 0.5.6, attributing the jump to scheduling and kernel improvements in the serving stack, per the Upgrade performance note.

No full perf artifact is included in the tweet text here (hardware, batch sizes, and model families aren’t specified), so treat the magnitude as directional; the main engineer-relevant signal is that minor-version bumps in serving frameworks can hide large scheduler/kernel changes.

W&B Inference appears on Artificial Analysis for speed/price/latency comparisons

W&B Inference (Weights & Biases): W&B says its inference offering is now tracked on Artificial Analysis with independent comparisons across intelligence, speed, price, and latency, per the Artificial Analysis listing and the linked Model comparison page.

This is one more place where serving becomes “benchmarked surface area,” not just model quality—latency, throughput, and pricing show up next to model names in a public index, as described in the Artificial Analysis listing.


🧮 Coding-agent economics & market structure: subsidies, pricing models, and access gaps

The ecosystem story today is economic: who can afford agentic coding, how labs subsidize usage, and why “per-seat” SaaS pricing breaks when agents consume 10–1000× more compute than humans. Excludes Codex limit resets (feature).

Cursor analysis alleges Anthropic subsidizes Claude Code usage far above $200/mo

Claude Code (Anthropic): Cursor’s internal analysis (as quoted in a screenshot) claims a $200/month Claude Code subscription can consume vastly more in compute—up to $2,000 last year and about $5,000 “today,” implying aggressive subsidization as a go-to-market strategy, as described in the Compute subsidy snippet.

The same excerpt also frames token pricing as a competitive lever inside Cursor—citing Claude “Composer 1.5” at $3.5/M input tokens versus GPT‑5.3 Codex at $1.75 (in Cursor), per the Compute subsidy snippet.

Agentic coding subscriptions raise an access-gap question for lower-income markets

Access inequity: A practical concern is emerging that if “coding with agents” becomes the default workflow, subscription costs could concentrate capability in higher-income teams and regions—especially if high-end plans are required for long-running work, as raised in the Access gap question.

The post frames this as an economic distribution problem (who gets to use agentic tooling daily), not a model-quality debate.

Alibaba Cloud markets a $3 first-month AI Coding Plan as a low-cost wedge

AI Coding Plan (Alibaba Cloud): A low-price offer is being promoted as a wedge in cost-sensitive dev markets: $3 for the first month on a “Lite” plan with 18k requests/month, positioned as working with Claude Code, Cline, and Qwen Code, according to the Pricing wedge details.

Deal mechanics: The offer is described as a daily “flash deal” that resets at 00:00 UTC+8, with slots filling before reverting to a higher price, as noted in the Pricing wedge details and the linked Plan page.

This is a direct pricing attack on premium agent subscriptions, not a capability claim.

Per-seat SaaS pricing looks mismatched when agents consume orders of magnitude more compute

Pricing model debate: A recurring claim is that “per-seat” pricing breaks down in an agentic workflow where one person can effectively drive 10×–1000× more usage than another, shifting monetization pressure toward metered compute rather than seats, as argued in the Per-seat model critique.

This surfaces as a market-structure issue for agent IDEs and agent platforms, not just a billing UX problem.

“CI is dead; the product is the IDE” framing resurfaces alongside agent workflows

Workflow-as-product thesis: A thread of agent-native development discourse argues that CI/CI‑CD becomes less central and the core product surface shifts to the IDE (or agent environment) itself, as summarized in the CI dead framing.

The claim is about where value accrues in the tooling stack when agents own more of the execution loop, rather than humans pushing commits through pipelines.


💼 Enterprise AI ROI: Excel-native agents, backlog automation, and procurement-friendly distribution

Business signals are mostly “AI inside real workflows”: spreadsheet auditing, enterprise plan features, and concrete automation wins (finance backlogs, knowledge work). Excludes infra capex (infrastructure category).

ChatGPT for Excel vs Claude for Excel: auditability tradeoff emerges

Excel copilots (OpenAI vs Anthropic): A hands-on comparison on a messy, high-dimensional workbook suggests the biggest practical difference is auditability—ChatGPT tends to operate inside Excel (formulas, edits, references) while Claude often detours into Python and pastes results back, which can break lineage and make review harder, per the Excel comparison notes.

The test case was a macro-economic workbook spanning 1,000 years of English history across 100+ tabs, which makes “traceability of transformations” the core enterprise concern, as described in the Excel comparison notes.

ChatGPT Skills expands to enterprise and regulated org plans

ChatGPT Skills (OpenAI): Skills support is rolling out to ChatGPT Business, Enterprise, Edu, Teachers, and Healthcare plans, positioning Skills as an org-level extensibility surface rather than a consumer feature, as reported in the Rollout note.

The same thread also highlights a gap—requests for “personal skills” alongside org-managed skills—per the Rollout note.

‘Formulas only’ helps with Claude for Excel, but doesn’t fully constrain it

Spreadsheet prompting: A concrete mitigation for auditability is telling Claude to use only formulas, which helps but still isn’t fully reliable—Claude may still use Python for joins/column combining and then paste outputs back, breaking references, as noted in the Formulas-only follow-up.

This frames “formula-only constraints” less as a stylistic preference and more as a control for preserving spreadsheet provenance, per the Formulas-only follow-up.

Ark Invest cites Claude Code automating a six-month finance backlog

Claude Code in finance ops (Anthropic): A reported enterprise case claims Claude Code helped Ark Invest clear a six-month finance backlog, with follow-on integration into a Palantir platform mentioned in the same post, according to the Ark Invest claim.

The account is a single-source anecdote in these tweets (no public artifact or before/after metrics shown), but it’s a concrete example of agent tooling being framed as backlog liquidation rather than experimentation, per the Ark Invest claim.

Lovable runs a free day tied to 120+ SheBuilds events with Anthropic partnership

Lovable (with Anthropic): Lovable is free to use for a 24-hour window (March 8–9 ET) as part of International Women’s Day, paired with 120+ in-person SheBuilds events and a Stockholm HQ livestream, per the Free-day announcement and the Timing clarification.

This is a distribution move aimed at broadening top-of-funnel adoption for an agentic app-builder product, with logistics and participation details collected on the Event page and the Livestream link.


🧭 Workforce & practice shift: automation narratives, ambition resets, and what to learn next

Culture/labor chatter stays intense: white-collar automation timelines, Jevons-style demand arguments, and individual skill/ambition recalibration as models get more agentic. Excludes pure policy and pure product updates.

Anthropic claim: today’s models can automate most white-collar work within ~5 years

White-collar automation timeline: A clip circulating as an “Anthropic researcher” claim argues that even if algorithmic progress stops, current models could automate most white-collar jobs within ~5 years, because “manual task-feeding” is already cheaper than human labor, following up on Usage gap (capability vs usage) via the Automation claim clip.

Automation of labor clip (embedded video)

The claim is directional rather than a measured forecast in the tweet itself. It frames the bottleneck as deployment and workflow decomposition, not new model breakthroughs.

Andrew Yang’s “End of the Office” argues for rapid white-collar displacement

End of the Office (Andrew Yang): A widely shared summary of Yang’s essay predicts large-scale automation across legal/finance/marketing/coding; it also calls out second-order impacts like downtown hollowing, degree ROI pressure, and new-grad entry barriers, as recapped in the Essay summary with the full piece linked in the Yang essay.

The thread treats near-term headcount cuts as an expected competitive response (“markets will reward leaner teams”), not a slow adoption curve.

Anthropic jobs-report framing: programmers show highest exposure

AI exposure by occupation (Anthropic): A screenshot from Anthropic’s jobs report circulates with the claim that the people “building/funding” AI may be among the most exposed to disruption, with programmers highlighted as the most-exposed occupation in the commentary around the Exposure post.

The same post emphasizes the gap between “theoretical capability” and “observed usage,” implying timing depends on rollout and adoption rather than headline capability alone.

Box CEO frames cheaper coding as increasing demand for engineering

Software leverage (Box): Aaron Levie argues that when code gets “vastly cheaper and faster” to write, teams apply software to more domains; the result is higher leverage and “more demand for engineering,” as stated in the Software leverage post.

This is a Jevons-style demand story. It’s about volume, not headcount per project.

Builders report time-horizon pressure: “two weeks now will take a day”

Time horizons and ambition resets: Multiple posts capture a shared feeling that capability is compressing planning cycles—one prediction says “what takes two weeks now will take a day by end of year” in the Time-horizon post, while another says “every 6 weeks… I’ve been under-ambitious” in the Under-ambitious post.

Scaling pressure shows up as personal tradeoffs too. One example argues studying feels like “a massive waste” because horizons will be ~10% higher in two weeks, including a back-of-the-envelope doubling-time calculation in the Exam tradeoff post.
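The quoted back-of-the-envelope is easy to reproduce. A minimal sketch, assuming the post’s figure of ~10% horizon growth every two weeks (the rate is the post’s claim, not a measured number):

```python
import math

# Assumption from the quoted post: task horizons grow ~10% every two weeks.
growth_per_period = 0.10
period_weeks = 2.0

# Standard doubling-time formula: ln(2) / ln(1 + growth rate), in periods.
periods_to_double = math.log(2) / math.log(1 + growth_per_period)
weeks_to_double = periods_to_double * period_weeks
print(f"{weeks_to_double:.1f} weeks to double")  # ~14.5 weeks
```

At that rate, horizons would double roughly every three and a half months, which is the kind of compounding that makes multi-week study plans feel expensive to the posters.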

Geoffrey Huntley: software dev cost drops below minimum wage

Software cost collapse (Essay): Geoffrey Huntley published a long post arguing software development can now cost less than a minimum-wage hour, and warns of “classes of companies… ralphing” (agent loops) for months; the release is announced in the Essay launch and hosted at the Long essay.

The core claim is economic: as marginal build cost falls, displacement hits first, then new creation follows.

“What if models haven’t improved?” becomes a recurring skepticism meme

Model progress skepticism: A meme-y but persistent thread asks whether perceived stagnation is model regression or user adaptation—“what if the models haven’t actually improved for months / what if we’re all just getting dumber,” as phrased in the Skepticism post and echoed by the Repost.

It’s not evidence of regression by itself. It’s a signal about expectation drift and benchmarking fatigue.

Agentic coding affordability raises a new access-inequity concern

Access inequity: Will McGugan raises a practical worry: if “coding with agents is the future,” subscription costs could create inequity for developers in developing countries who can’t pay for premium agent tooling, as posed in the Affordability question.

The tweet doesn’t propose a solution, but frames a clear economic constraint on who gets to compound productivity gains.

Per-seat pricing gets questioned as agent usage diverges from human seats

Pricing model mismatch: A retweeted take argues per-seat SaaS pricing makes less sense when agentic workflows can multiply consumption by 10×–1000× per person, surfaced in the Per-seat pricing retweet.

This is a go-to-market and budgeting issue as much as a product one: usage becomes the unit, not the seat.
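The arithmetic behind the mismatch is simple to sketch. All numbers below are hypothetical (the thread gives only the 10×–1000× multiplier, not prices or costs):

```python
# Hypothetical figures, for illustration only: seat price, provider marginal
# cost per million tokens, and a baseline human-driven seat's monthly usage.
SEAT_PRICE = 30.0      # $/seat/month (assumed)
COST_PER_MTOK = 0.50   # $/million tokens marginal cost (assumed)
HUMAN_MTOK = 5.0       # million tokens/month for a human-only seat (assumed)

def seat_margin(usage_multiplier: float) -> float:
    """Monthly margin on one flat-priced seat when agents scale usage."""
    serving_cost = HUMAN_MTOK * usage_multiplier * COST_PER_MTOK
    return SEAT_PRICE - serving_cost

for m in (1, 10, 100, 1000):
    print(f"{m:>5}x usage -> margin ${seat_margin(m):,.2f}/seat")
```

Under these assumptions the seat flips from profitable to deeply negative somewhere between 10× and 100× usage, which is why the take argues usage, not the seat, becomes the billing unit.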
