Sarvam‑105B MoE hits 9B active params – SGLang day‑0 support lands


Executive Summary

Sarvam AI’s open-weight Sarvam‑105B MoE is gaining operational traction fast. The model is framed as 105B total parameters with 9B active per token, is Apache 2.0 licensed, and is positioned for 22 Indian languages plus English; its tool-use/coding claims are so far presented mostly via social benchmark tables rather than independently reproduced runs. On the serving side, LM‑SYS shipped day‑0 SGLang support via a dedicated PR: Sarvam 30B wires up GQA + QK norm, while Sarvam 105B adds MLA with weight absorption plus FP8 plumbing, alongside MoE-specific expert overlap/scheduling work. That is exactly the glue that usually turns “weights are out” into “it’s deployable.”

DeepSeek checkpoint drift: users describe “V4lite” behind-the-scenes updates with claimed math/coding gains; no stable versioned artifact, making regression tracking messy.
FlashMaskV4: PaddlePaddle integrates FlashAttention‑4; reports up to 2.9× forward and 1.6× overall at 8k; portability beyond its stack remains unclear.
Toolathlon: a leaderboard screenshot puts GPT‑5.4‑xHigh at 54.6 pass@1, ahead of Gemini‑3 Flash (49.4) and Claude Opus 4.6 (47.2).
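Back-of-the-envelope, a 105B-total / 9B-active split falls out of standard MoE accounting: only the router’s top-k experts (plus the shared/dense weights) run per token. The numbers below are hypothetical illustrative values chosen to land near the headline figures, not Sarvam’s published architecture.

```python
# Illustrative MoE parameter accounting. All config values here are
# hypothetical, not Sarvam's published architecture.
def moe_params(dense_params: float, n_experts: int,
               expert_params: float, top_k: int) -> tuple[float, float]:
    """Return (total, active-per-token) parameter counts in billions."""
    total = dense_params + n_experts * expert_params
    active = dense_params + top_k * expert_params
    return total, active

# Example: 3B shared/dense weights, 128 experts of ~0.8B each, top-8 routing.
total, active = moe_params(dense_params=3.0, n_experts=128,
                           expert_params=0.8, top_k=8)
print(f"total ≈ {total:.1f}B, active ≈ {active:.1f}B per token")
```

With these assumed values the split lands near the advertised 105B total / ~9B active; the real checkpoint’s expert count and sizes may differ.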


Feature Spotlight

Codex + GPT‑5.4 day‑to‑day reality: limits, speed, and “agentic” coding workflows

GPT‑5.4 is pushing Codex into “daily driver” territory for many builders, but the real story is operational: rate-limit resets, throughput, long-context quirks, and new app workflows that affect shipping velocity.

Today’s dominant builder storyline is hands-on Codex usage with GPT‑5.4: people report step-function improvements, but also run into practical constraints (rate limits, long-thread weirdness, throughput) that change how teams operate. This category intentionally focuses on Codex/GPT‑5.4 operational and workflow impact (not general model research).


Table of Contents

🧑‍💻 Codex + GPT‑5.4 day‑to‑day reality: limits, speed, and “agentic” coding workflows

Today’s dominant builder storyline is hands-on Codex usage with GPT‑5.4: people report step-function improvements, but also run into practical constraints (rate limits, long-thread weirdness, throughput) that change how teams operate. This category intentionally focuses on Codex/GPT‑5.4 operational and workflow impact (not general model research).

Codex users ask for auto top-ups and bigger plans as GPT‑5.4 burns budget

Codex subscriptions (OpenAI): Multiple users are surfacing operational friction around spending/limits—one report claims GPT‑5.4 gives “about 33% less tokens than Codex 5.3” in the Token budget comparison, and another asks for auto credit refresh because “manually adding $40 at a time is annoying” in the Auto top-up request.

The net signal is that GPT‑5.4’s utility is pushing people into longer agent runs, but the billing/limit UX hasn’t caught up to “always on” usage patterns.

Workaround for Codex app multi-window: duplicate the app binary

Codex app (OpenAI): Until native multi-window arrives, one workaround is to copy the Codex macOS app binary so you can run multiple app instances side-by-side, as shown in the Multi-instance dock screenshot.

This is mostly about reducing context-switching friction when you want separate threads/projects visible at once.

Codex app ships performance work plus a revamped worktree flow

Codex app (OpenAI): The Codex team says they’ve been “continuously improving” app performance and “overhauling the worktree flow,” per the Team performance note; the product surface for worktree handoff is visible in the Worktree handoff modal, and the app’s positioning is described on the Codex app page.

In practice, this is about making parallel agent work less fiddly: isolate changes in separate worktrees while keeping threads organized in one UI.

Codex surfaces a “High Load” warning for GPT‑5.4 demand

Codex (OpenAI): Some users are seeing a UI-level “High Load” banner for GPT‑5.4, telling them to switch models or retry, as shown in the High load screenshot.

This is a practical constraint signal: even if limits reset, availability/queueing can still gate throughput when demand spikes.

Cursor users report a long-thread follow-up bug with GPT‑5.4

Cursor (with GPT‑5.4): One builder reports that when a Cursor chat gets long, a follow-up question can be ignored and the model answers the previous question again, as described in the Long conversation report.

Attribution is unclear (client vs model vs context handling); the report specifically calls out the “really long” thread condition rather than a particular prompt style.

GPT‑5.4 is framed as one model for GPT, Codex, and computer use

GPT‑5.4 (OpenAI): One recurring framing is that 5.4 “unifies GPT + Codex + CUA into a single model,” suggesting a single family meant to cover chat, coding, and computer-use automation, as shown in the Unified model clip.

Video: GPT Codex CUA unified

This matters operationally because it implies fewer “which model do I route to?” decisions inside agent harnesses, at the cost of heavier dependence on a single model’s rate limits and availability.

Codex on Windows: multi-threading three projects from one workstation

Codex on Windows (OpenAI): One workflow report shows running three Codex threads side-by-side on a large display (three projects in parallel), explicitly using GPT‑5.4 High and “native sandboxes,” as described and pictured in the Three-thread Windows setup.

This is a concrete example of how “agent UI as the workspace” changes physical setup: the screen real estate becomes part of throughput when you’re supervising multiple active threads.

GPT‑5.4 used to instrument a Mario ROM and route events to AI control

GPT‑5.4 (OpenAI): A builder says GPT‑5.4 did the full pipeline in three prompts—instrumenting a Super Mario Bros. ROM to expose RAM events, then creating a JS emulator that can send browser requests so an AI controls characters, as shown in the Mario ROM agent demo.

Video: Mario ROM agent control

It’s a concrete example of “agentic coding” being applied to reverse-engineering plus tooling glue (emulator + telemetry + web hooks) rather than CRUD app work.

Some Codex users report GPT‑5.4 performs better on High than xHigh

GPT‑5.4 in Codex (OpenAI): A power user who had been running xHigh says they now believe GPT‑5.4 is better with High reasoning than xHigh, per the High vs xHigh claim.

This is one data point, but it’s a concrete workflow tweak people are experimenting with as they balance throughput, token burn, and completion quality.

Claim: GPT‑5.4 can reimplement compiled behavior as a new Rust codebase

GPT‑5.4 (OpenAI): One post claims Codex/GPT‑5.4 can “look at the output of a compiled program” and independently write a new Rust codebase that reproduces the behavior, with a cost framing that dev economics shift from human labor to longer model inference time in the Compiled-output rewrite claim.

No artifact or repo is attached in the tweet, so treat it as anecdotal—still a useful north star for what people are attempting with 5.4-class coding agents.


🔁 Claude Code automation: /loop, cron-like scheduling, and recurring task patterns

Continues yesterday’s scheduling push, but today the feed is about concrete usage patterns (/loop babysit PRs, tmux durability) and questions about desktop support. Excludes Codex/GPT‑5.4 workflow chatter (covered in the feature).

Claude Code documents /loop scheduling, including the 3‑day cap and cron primitives

Claude Code (Anthropic): Following up on /loop launch, Anthropic is now pointing people at a concrete scheduling UX: /loop runs recurring prompts “for up to 3 days at a time,” as described in the Release note, with the mechanics spelled out in the Scheduling docs. The docs make the constraints explicit: schedules are session-scoped (lost on exit) and are implemented via cron-style tools, with interval parsing/rounding and lightweight per-second checks.

Interval parsing details: the reference explains units (s/m/h/d), default interval behavior, and that non-minute granularity gets rounded to cron’s 1‑minute floor, as shown in the Scheduling docs.
Operational primitives: the same page calls out management commands (create/list/delete) and jitter to avoid synchronized thundering herds, per the Scheduling docs.

A practical durability pattern for /loop: pin the session in tmux

Claude Code (Anthropic): A concrete “make it survive disconnects” pattern is circulating: start a dedicated tmux session and run Claude Code’s /loop inside it, so the recurring task keeps running even when you detach, as shown in the Tmux workflow tip. This matches Claude Code’s current “session-scoped” scheduling model (i.e., tied to a running process) described in the Scheduling docs.

The same post captures real parsing behavior worth knowing: “no interval” defaults to a 10‑minute loop, and Claude will round odd intervals to a “clean” cron interval, per the Tmux workflow tip.

Recurring PR babysitting emerges as a first-class /loop use case

Claude Code (Anthropic): One of the first recurring-task templates being shared is “PR babysitting”: schedule a /loop that watches PRs, auto-fixes build issues, and spins up a worktree agent when new review comments land, as described in the PR babysit example. The point is to turn PR maintenance into a background, time-boxed agent loop rather than an interactive session.

Daily team digest via /loop + Slack MCP becomes a reference pattern

Claude Code (Anthropic): Another concrete /loop template uses MCP as the action surface: “every morning use the Slack MCP to give me a summary of top posts I was tagged in,” as shown in the Slack MCP example. It’s an early signal that /loop is being treated as a lightweight scheduler for MCP-driven ops tasks, not only code chores.

Claude Code confirms /loop support in the desktop app

Claude Code (Anthropic): A small but practical Q&A: a user explicitly asked whether /loop works in the desktop app, as seen in the Desktop app question, and Boris Cherny replied “Yes,” per the Compatibility reply. The docs still emphasize that scheduled tasks are session-tied, so “desktop vs CLI” mostly changes how reliably a session stays alive, not the underlying scheduling model, as described in the Scheduling docs.


🧪 Maintainer pain & quality control: slop PRs, fake security reports, and review automation

A clear maintainer signal today: AI-generated noise (reports/PR reviews) is increasing the review burden, driving discussion of stricter workflows and automation to preserve merge quality. Excludes OpenClaw product release details (covered separately).

Maintainers flag low-quality security reports, including made-up model claims

Maintainer ops (open source): A maintainer describes grinding through “slop” security reports, including one claiming testing with “GOT‑4o” (a model name they say doesn’t exist anymore), and calls out how this review burden pushes some maintainers to disengage, per the [maintainer note](t:24|maintainer note).

The concrete engineering impact is time-to-triage and trust collapse: when reports can’t be audited (or are clearly fabricated), the fastest safe workflow often becomes “close + move on,” which is exactly the opposite of what security processes need under load.

AI slop moves from PRs into PR reviews

GitHub review quality (open source): A maintainer reports a new failure mode—AI-generated “PR reviews” landing on maintainer PRs—stacking on top of already-common AI slop PRs and comments, as shown in the [PR review callout](t:153|PR review callout) with an example linked in the [review page](link:153:0|PR review example).

This is operationally different from low-quality PRs because it pollutes the reviewer signal channel (approvals, requested changes, review threads), which many repos treat as a gating mechanism.

Codex-as-maintainer: using agent threads to triage and close issues at scale

Maintainer workflow (agentic triage): A maintainer shows a Codex-driven triage pass that groups issues into “closed dupes,” “closed,” and “left open,” then drafts targeted closing comments, as seen in the [triage UI screenshot](t:23|triage UI screenshot).

The same maintainer frames Codex as useful for “data analysis/work” beyond coding in the [Codex framing](t:25|Codex framing), and separately describes running analysis over Discord data to decide what to fix next in the [Discord-to-priorities note](t:222|Discord-to-priorities note).

discrawl mirrors Discord history into a local SQLite database

discrawl (steipete): A new CLI mirrors Discord server history into a local SQLite DB for offline search/analysis; one reported run produced a ~4GB DB over ~660k messages, per the [tool announcement](t:18|tool announcement) and the linked [GitHub repo](link:18:0|GitHub repo).

This is directly aimed at maintainer quality control: extracting “where are users hurting?” from chat logs without relying on Discord’s search UI.

In-chat analytics: running Discord analysis inside Discord with an agent

Maintainer loop (in-channel analysis): A maintainer upgrades an internal bot/agent to access the Discord mirror tooling, then runs analysis “of Discord inside Discord,” as shown in the [in-Discord demo](t:122|in-Discord demo).

Video: Running analysis commands in Discord

This pattern matters because it collapses the “collect data → analyze → report back” cycle into the same thread where maintainers coordinate work.

Maintainers report harassment after closing low-quality reports

Maintainer moderation load: Beyond spam volume, a maintainer says some submitters escalate into vague threats when their reports get closed, according to the [reply about threats](t:160|reply about threats).

That shifts “AI slop” from an engineering throughput problem into a moderation and safety problem—especially for solo maintainers handling inboxes and issue trackers.

“Vibe contributing” is framed as a threat to OSS maintenance capacity

Open source quality control: A roundup cites an article arguing that AI-enabled “vibe contributing” is increasing low-quality submissions and review burden on volunteer maintainers, per the [issue blurb](t:643|issue blurb) linking to the [full article](link:643:0|Ethics institute article).

Treat the framing as directional—no shared dataset artifact appears in these tweets—but it matches multiple maintainer anecdotes in this timeline about spam reports and review-channel pollution.

A senior engineering view: agents can code, but architecture still needs humans

Architecture vs agents: Robert “Uncle Bob” Martin argues that once he personally guided the architecture into a layered structure (and added dependency checks + visualization), progress improved; his conclusion is that agents “muddy the waters” if you let them invent architecture, per the [architecture caution](t:76|architecture caution).

He follows with a concrete recovery tactic—break the system into pieces, isolate UI/non‑UI, then increase test coverage and use mutation tooling to make regressions harder, as described in the [refactor plan](t:282|refactor plan).


🦞 OpenClaw platform updates: releases, provider support, and maintainer ops tooling

OpenClaw-related engineering is unusually visible today: new beta bits, provider additions, and maintainers building local analytics to prioritize fixes. Excludes general Codex/GPT‑5.4 praise unless it’s specifically about OpenClaw integration work.

OpenClaw 2026.3.7-beta.1 adds ContextEngine plugins for config-driven context

OpenClaw 2026.3.7-beta.1 (OpenClaw): The beta introduces a new ContextEngine plugin slot with full lifecycle hooks, enabling config-driven strategies for how context is built and managed, as described in the [beta release](t:36|Beta release) and detailed in the [release notes](link:36:0|Release notes). This is a direct extension point for teams who want to swap in custom context policies (e.g., “lossless” approaches) without forking core routing.

The same release also mentions new internals that support more structured agent execution (e.g., scoped runtimes), but the concrete platform change is that context handling becomes a first-class, pluggable subsystem per the [release notes](link:36:0|Release notes).

discrawl mirrors Discord servers into SQLite for offline queries

discrawl (steipete): A new CLI crawls Discord via a bot token and mirrors channels/threads/members/messages into a local SQLite DB for offline analysis; one maintainer reports ~4GB and ~660k messages captured, per the [project announcement](t:18|Discord crawl stats) and the [GitHub repo](link:18:0|GitHub repo). It’s designed around local search and structured queries (FTS5 + mention tables) rather than relying on Discord’s native search.

This is being used as maintainer tooling: extracting “what hurts” from community support channels at repo scale, not just ad hoc keyword search.
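The shape of such a mirror is easy to sketch: entity tables plus locally queryable message text. discrawl reportedly uses FTS5 and mention tables; the minimal schema below is a plain-SQL stand-in for the pattern, not discrawl’s actual layout.

```python
import sqlite3

# Minimal sketch of a Discord-to-SQLite mirror's query layer. Schema is
# illustrative only; discrawl's real layout (FTS5, mention tables) differs.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        channel TEXT,
        author TEXT,
        content TEXT,
        created_at TEXT
    )
""")
rows = [
    (1, "support", "alice", "crash when loading config", "2026-03-01"),
    (2, "support", "bob", "config reload silently fails", "2026-03-02"),
    (3, "general", "carol", "loving the new release", "2026-03-02"),
]
db.executemany("INSERT INTO messages VALUES (?, ?, ?, ?, ?)", rows)

# "Where are users hurting?" -- count pain mentions per channel, offline.
hits = db.execute(
    "SELECT channel, COUNT(*) FROM messages "
    "WHERE content LIKE '%config%' GROUP BY channel"
).fetchall()
print(hits)  # [('support', 2)]
```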

OpenClaw 2026.3.7-beta.1 adds durable Discord and Telegram thread bindings

OpenClaw 2026.3.7-beta.1 (OpenClaw): OpenClaw now persists Discord channel bindings and Telegram topic targets so thread routing survives restarts, as called out in the [beta release](t:36|Beta release) and expanded in the [release notes](link:36:0|Release notes). Telegram topic handling also gets a bunch of quality-of-life routing upgrades (topic binding, follow-up routing, approval buttons, in-topic confirmations) per the [release notes](link:36:0|Release notes).

For maintainers operating long-running “always on” agents in chat platforms, this is an ops reliability change, not a UI tweak.

OpenClaw 2026.3.7-beta.1 supports per-topic agent routing overrides

OpenClaw 2026.3.7-beta.1 (OpenClaw): The beta adds per-topic agentId overrides so specific Discord forum topics / Telegram topics / DMs can be pinned to dedicated agents, enabling more isolated sessions and cleaner long-running threads, per the [beta release](t:36|Beta release) and the [release notes](link:36:0|Release notes). A related addition is a sessions.get gateway method plus runtime scoping changes mentioned in the same [release notes](link:36:0|Release notes).

Net effect: routing becomes more explicit, and session boundaries can be designed rather than inferred.

OpenClaw 2026.3.7-beta.1 broadens provider onboarding and structured Perplexity search

OpenClaw 2026.3.7-beta.1 (OpenClaw): Onboarding adds broader provider selection and switches the Perplexity integration to a structured Search API with filters, as listed in the [beta release](t:36|Beta release) and described in the [release notes](link:36:0|Release notes). The same release also calls out more SecretRef support in onboarding and gateway auth token handling, which tightens how secrets are represented in config per the [release notes](link:36:0|Release notes).

This is primarily a “wiring and defaults” update: less manual configuration when bringing new providers online, and more structured search outputs for downstream tools.

OpenClaw beta build adds GPT-5.4 and Gemini Flash 3.1 support

OpenClaw (model support): A new OpenClaw beta drop explicitly lists GPT-5.4 and Gemini Flash 3.1 as included provider/model options, per the [beta bits announcement](t:36|Beta bits announcement). This is a straightforward compatibility signal: OpenClaw users tracking fast model churn can test new defaults without waiting for a major stable.

Details of the surrounding platform changes (context plugins, routing, bindings) are bundled in the same [release notes](link:36:0|Release notes).

OpenClaw maintainer reports rising noise from AI-written security reports and reviews

OpenClaw maintainer ops: The OpenClaw maintainer reports spending time closing low-signal security reports, including ones that claim testing on non-existent model names, per the [maintainer note](t:24|Slop security reports). They also flag AI-generated PR reviews showing up on maintainer PRs, per the [PR review complaint](t:153|AI PR reviews), and another maintainer notes some reporters escalate into vague threats when issues are closed, per the [ops reply](t:160|Threats after closure).

This is a workflow tax: it increases the cost of public issue trackers and pushes teams toward more filtering, more automation, or both—exactly the pressure described in the [maintainer note](t:24|Slop security reports).

OpenClaw maintainer uses Codex threads for PR triage and support-channel mining

OpenClaw maintainer workflow: OpenClaw’s maintainer is using Codex as a workbench for high-churn maintenance work—triaging issues/PRs and producing structured “closed/left open/open now” summaries in the Codex thread UI, as shown in the [triage screenshot](t:23|Codex issue triage view). The same thread of work includes mining Discord to decide what to fix next, with the workflow rationale stated in the [Codex data analysis note](t:25|Codex for data analysis) and reiterated in the [Discord pain points post](t:222|Discord pain point filtering).

The concrete pattern here is treating maintenance as an agent-friendly dataset problem: ingest community messages, extract clusters, then apply fixes back to the repo using an agent loop, as seen in the [Codex triage view](t:23|Codex issue triage view).

OpenClaw maintainers run Discord analytics inside Discord

OpenClaw maintainer ops: After getting access to the mirrored Discord data, maintainers are running the analysis inside Discord—turning “what are the top pain points?” into a chat-native loop—per the [in-Discord demo](t:122|Molty runs discrawl).

Video: Bot runs discrawl in Discord

This makes the feedback loop tighter: the same place people report problems becomes the interface for querying and prioritizing them, as shown in the [Molty clip](t:122|Molty runs discrawl).

PinchBench emerges as a model picker for OpenClaw-style tasks

PinchBench (OpenClaw evaluation): A success-rate leaderboard is being used as a practical “which model should run my OpenClaw agent?” reference, per the [benchmark callout](t:9|Benchmark mention) pointing to the [leaderboard site](link:9:0|Success rate leaderboard). It’s framed as model selection guidance for a specific agent workload rather than a generic intelligence chart.

This is an evaluation signal more than a release: it suggests maintainers are standardizing around external, task-level success metrics to pick providers, as implied by the [benchmark mention](t:9|Benchmark mention).


🧠 Agent SDKs & app architectures: multi-agent isolation, harness choices, and portability

Today’s posts emphasize the libraries and architecture patterns teams are choosing to build agentic products (multi-agent isolation in UI, non-vendor-locked SDK options). Excludes runner/ops dashboards (agent-ops-swarms) and MCP plumbing (orchestration-mcp).

LangChain’s deepagents SDK positions itself as a multi-model alternative to Claude Agent SDK

deepagents SDK (LangChain): LangChain’s maintainers describe deepagents as a production-oriented agent SDK that’s explicitly multi-model (OpenAI, Anthropic, Gemini, OpenRouter, open-weight) and already used for internal experiments, with claimed parity on common harness needs like filesystem ops, skills, memory, bash, and HITL in the deepagents overview.

They also highlight a common cross-vendor pattern—“planner on GPT, executor/subagent on Claude”—as a supported setup in the same deepagents overview.

CopilotKit adds agentId-scoped useAgent for multi-agent React without shared-state chaos

CopilotKit (CopilotKit): CopilotKit is pushing a concrete UI architecture primitive for “multi-agent apps”: useAgent({ agentId }) creates multiple isolated agent instances (separate history + lifecycle) inside one React app, aiming to remove the usual shared-state/context juggling called out in their multi-agent hook post.

This is one of the cleaner “agent runtime isolation” stories so far: instead of manually namespacing state, it treats agent identity as a first-class key at the hook boundary, as shown in the multi-agent hook post.

OpenCode sketches an always-on agent daemon shared by TUI, web, and desktop clients

OpenCode (opencode): OpenCode’s author describes a target architecture where a single persistent agent process runs “as a service,” and multiple front-ends (TUI, web, desktop) just attach to it—so you can assume an agent is always warm and ready, per the always-on service idea.

This frames “agent UX” less like launching a tool per session and more like connecting to a long-lived runtime with durable state, as implied by the always-on service idea.

Teams using Claude Agent SDK in production are now asking for non-Anthropic lock-in

Claude Agent SDK (Anthropic): A recurring concern is surfacing from teams that adopted Claude Agent SDK and now want to avoid vendor lock-in—one example is a Harbor user asking for “a non-Anthropic-specific alternative,” as stated in the lock-in question.

The thread quickly turns into an evaluation problem (“I need an eval to test which agent sdk I should choose”), as captured in the eval comment, with LangChain’s deepagents presented as one candidate in the deepagents response.

Hermes Agent adds read-only Polymarket access for answering prediction questions

Hermes Agent (Nous Research): Hermes Agent adds a new tool-style integration to pull live info from Polymarket in read-only mode, with trading described as a possible future addition in the Polymarket integration note.

The integration sits in the broader “agent framework + connectors” posture Hermes is taking (many tools, multiple interfaces), as outlined in the Hermes docs shared alongside the try Hermes link.


🧩 Skills, installables, and ‘agent add-ons’ shipping this week

The ecosystem keeps packaging repeatable agent behavior into installable skills/repos (operator playbooks, insights tools, CLI skills). Excludes MCP/protocol work (covered under orchestration-mcp).

OpenClaw Operator packages setup + validation as an installable agent skill

OpenClaw Operator (Open source): A new repo bundles an agent SKILL.md plus AGENTS.md/CLAUDE.md-style playbooks to help Codex/Claude Code configure and validate a local OpenClaw install, framed as a free alternative to a reported “$6,000 setup” service, as described in the Operator announcement.

Video: Operator setup demo

What you actually get: A validation checklist and task playbooks (“set up a cron job”, “create a custom skill”, “fix provider config”, “troubleshoot and validate”), with the repo linked in the Repo link drop via the GitHub repo.

The pricing claim is explicitly hedged—“someone is charging that much” rather than confirmed paid engagements—per the Price point caveat.

TanStack CLI adds first-class skills for agents

TanStack CLI (TanStack): The CLI reportedly “now ships with skills,” suggesting a move toward bundling agent-operable workflows alongside traditional scaffolding/commands, as shared in the Skills shipping note.

The tweet is a retweet without release notes attached here, so details like skill format, install path, and compatibility (Codex vs Claude Code vs Cursor) aren’t confirmed in today’s artifacts.

Agentation’s annotation overlay becomes a scaled “agent UX” utility

Agentation (benjitaylor): The project is now averaging ~850,000 downloads/week via npm and over 1M installs/month, which is a notable adoption signal for “point-at-it” visual feedback tooling for agents, per the Download stats.

What it does: An overlay for leaving precise, element-level notes (selectors/metadata) that export agent-agnostic markdown, as described in the companion write-up linked as Annotating for agents.

This sits in the same bucket as “skills for frontends”: shrinking the cost of communicating UI intent to an agent without writing long textual bug reports.

fast-mode-insights turns Codex fast-mode savings into a runnable skill

fast-mode-insights (Community skill): A small installable skill reverse-engineers the Codex fast-mode “you could save…” pop-up and exposes it as a command you can run inside Codex, per the Skill announcement.

How it ships: Installation and usage are spelled out with “install the skill… then run $fast-mode-insights” in the Install instructions, pointing to the Skill repo.

This is a narrow add-on, but it’s a concrete example of teams packaging UI/telemetry gaps as shareable skills rather than waiting on upstream product changes.

Asupersync adds a mega skill to guide agent-led integration work

Asupersync (Rust async runtime): The project added an “extremely comprehensive” skill intended to help agents integrate Asupersync into real Rust codebases, as announced in the Skill addition.

Where to start: The integration guide lives as a versioned skill doc in the repo, linked in SKILL.md.

This is another example of maintainers treating “agent onboarding” as a first-class deliverable (not just README docs), which can materially change how quickly a coding agent becomes useful in a new project.


🔌 MCP & interop plumbing: bringing external tools into agent workflows

The notable interop item today is API-level support for attaching custom MCP servers, plus ongoing MCP-as-a-workflow primitive discussion. Excludes non-MCP ‘skills’ packaging (coding-plugins).

Vercel v0 API can now attach custom MCP servers to chats

v0 API (Vercel): v0’s API now supports connecting custom MCP servers from code, extending the earlier “MCP apps” direction in MCP Apps deploy (postMessage JSON-RPC bridge); the new surface lets you create chats that include mcpServerIds—e.g., wiring a vercel-mcp server for “Deploy to prod” flows as shown in the code snippet.

This is described in Vercel’s changelog post with a concrete call pattern (v0.chats.create({ message, mcpServerIds: [...] })), which makes MCP attachment a first-class API input instead of an interactive setup step.

Recurring task loops are being used as “MCP runners” for daily ops

Scheduled MCP workflows: builders are treating recurring prompt loops as a lightweight way to run MCP-backed chores—e.g., “every morning use the Slack MCP to give me a summary” in the [/loop examples](t:3|loop examples); in practice this turns MCP servers into “always-on” integrations that can be polled on an interval rather than only called ad hoc.

The behavior and limits are spelled out in the scheduling docs (recurring prompts up to 3 days in-session), while the tmux recipe shows how people keep these loops running longer by pinning the session in tmux (a durability workaround given tasks end when the session exits).
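A minimal “MCP runner” is just an interval scheduler with jitter (jitter echoing the thundering-herd note in the scheduling docs). The sketch below only computes next-run times and stubs the actual MCP call, since real tool invocation is harness-specific.

```python
import random

# Illustrative jittered scheduler for recurring MCP-backed chores.
# The MCP call itself is omitted; real invocation is harness-specific.
def next_run(now: float, interval_s: float, jitter_s: float,
             rng: random.Random) -> float:
    """Next fire time: fixed interval plus uniform jitter so many
    scheduled loops don't all hit the same MCP server at once."""
    return now + interval_s + rng.uniform(0, jitter_s)

rng = random.Random(0)
t = 0.0
fire_times = []
for _ in range(3):
    t = next_run(t, interval_s=600, jitter_s=30, rng=rng)  # ~every 10 min
    fire_times.append(t)
print(fire_times)
```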


📊 Model & agent eval churn: ARC variants, tool benchmarks, and ‘hard’ bottleneck tests

Benchmark talk stays intense: multiple leaderboards and ‘why it failed’ analyses (progress-bar fixation, tool-use benchmarks, internal bottleneck evals). Excludes new model releases themselves (model-releases).

ARC-AGI-3 runs show HUD fixation; telling models it’s a progress bar helps

ARC-AGI-3 (eval behavior): One tester reports GPT-5.4 (medium) repeatedly treats a changing HUD bar as “the goal,” underperforming Kimi and Gemini in that setup as described in the ARC-AGI-3 comparison; when the prompt explicitly states there is a progress bar, GPT-5.4-xHigh then clears early levels quickly according to the xHigh run videos.

Video: ARC-AGI-3 run

Harness implications: The same thread argues a minimal harness that carries prior action/state and labels HUD elements would likely stabilize top models on ARC-like 2D environments, as laid out in the Harness requirements note.
Memory as differentiator: Separate evidence shows Opus 4.6 tracking detailed state and identifying the bar as a progress bar, with an example memory trace shown in the Reasoning and memory screenshot.
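The “minimal harness” argument is concrete enough to sketch: carry the prior action/state forward and pre-label HUD elements in the observation before the model sees it. This is a hypothetical wrapper illustrating the pattern, not the tester’s actual harness.

```python
# Hypothetical minimal harness: carries prior action/state and labels HUD
# elements before the observation reaches the model. Illustrates the
# pattern argued for in the thread; not any tester's actual code.
class LabeledHarness:
    def __init__(self, hud_labels: dict[str, str]):
        self.hud_labels = hud_labels  # e.g. {"top_bar": "progress bar"}
        self.history: list[tuple[str, dict]] = []

    def build_prompt(self, observation: dict) -> dict:
        labeled = dict(observation)
        # Label known HUD elements so the model doesn't fixate on them.
        labeled["hud"] = {k: self.hud_labels.get(k, "unknown element")
                          for k in observation.get("hud_elements", [])}
        labeled["prior_steps"] = self.history[-3:]  # short rolling memory
        return labeled

    def record(self, action: str, state: dict) -> None:
        self.history.append((action, state))

h = LabeledHarness({"top_bar": "progress bar (not a goal)"})
h.record("move_right", {"level": 1})
prompt = h.build_prompt({"hud_elements": ["top_bar"], "grid": "..."})
print(prompt["hud"])          # {'top_bar': 'progress bar (not a goal)'}
print(prompt["prior_steps"])  # [('move_right', {'level': 1})]
```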

GPT-5.4-xHigh tops Toolathlon, edging Gemini Flash and Opus 4.6

Toolathlon (benchmark): A new leaderboard screenshot shows GPT-5.4-xHigh ranked #1 with 54.6 pass@1, ahead of Gemini-3 Flash (49.4) and Claude Opus 4.6 (47.2) as shown in the Leaderboard table.

This is one of the clearer “agent + tools” signals in the feed today because it reports both pass rates and average turns in the same artifact, rather than relying on anecdotes.

OPQA bottleneck chart shows GPT-5.4-thinking below some prior Codex variants

OpenAI-Proof Q&A (OPQA): A shared chart claims gpt-5.4-thinking scores 4.16% pass@1 (1/20), below gpt-5.2-codex at 8.33% and below gpt-5.3-codex at 5.8%, per the OPQA bar chart.

The post frames OPQA as “internal research and engineering bottlenecks” that each cost a day or more to resolve, so small absolute differences read as either noise or signal depending on sample size, as described in the OPQA bar chart.

FreshStack reports retrieval rankings stable across temporal snapshots

FreshStack (retrieval benchmark): A new preprint claims that despite significant repo restructuring (LangChain and related repos), retriever rankings remain relatively stable across temporal snapshots, positioning FreshStack as a more reliable ongoing yardstick, per the Preprint claim and follow-up Stability takeaway.

One concrete example called out is a 67% LangChain document reduction shifting relevance-judgment distributions across multiple repos, but without reshuffling model ordering, as described in the Distribution shift note.

Vending-Bench 2 shows GPT-5.4 in 3rd behind Opus and Sonnet 4.6

Vending-Bench 2 (agent economy sim): A chart from Andon Labs shows GPT-5.4 finishing 3rd on final “money balance,” behind Claude Opus 4.6 and Claude Sonnet 4.6, and ahead of Gemini 3 Pro and GPT-5.3-Codex, as shown in the Balance-over-time plot.

It’s presented as a “slight upgrade over GPT-5.3-Codex,” but still not the top performer in this specific long-horizon trading/operations-style setup, per the Balance-over-time plot.

BullshitBench v2 expands to Meta Llama variants and refreshes the explorer

BullshitBench v2 (nonsense detection): The maintainer says v2 adds several Meta models (including Llama 4 variants) and publishes updated ranks (e.g., “39, 51, 56 out of 80 variants”), per the Benchmark update; artifacts are available as a GitHub repo plus an interactive Results explorer.

This is an eval niche that’s showing up more often in day-to-day tool selection: whether a model can reliably say “this is nonsense” instead of confidently guessing.

PinchBench shared as a success-rate leaderboard for OpenClaw model choice

PinchBench (leaderboard): A practitioner points to PinchBench as a “which model is best” success-rate leaderboard for OpenClaw selection, linking the board in the Benchmark mention and the underlying site via the Success rate leaderboard.

This is mostly a routing signal—i.e., teams using OpenClaw-style harnesses are increasingly leaning on success-rate leaderboards instead of single-score academic benchmarks, at least for tool-using agent tasks.


📦 Open weights & model checkpoints: India’s releases and China-watch updates

Model-release discussion today is dominated by open-weight LLMs and checkpoint drift rumors (especially India’s Sarvam and ongoing DeepSeek updates). Excludes runtime/day‑0 serving integrations (systems-inference).

Sarvam’s 30B/105B open weights get benchmark/spec details (MoE, Indian languages)

Sarvam (Sarvam AI): Following up on Sarvam release (India’s open-weight drop), more concrete specs and positioning surfaced around the Sarvam-105B MoE design—105B total params but 9B active per token, Apache 2.0 licensing, and explicit support for 22 Indian languages plus English, per a detailed breakdown in Specs and claims.

The same thread claims voice-first/multimodal ambitions (TTS/STT and document-vision “alongside”) and emphasizes agentic/reasoning targets (tool use, browsing, math, coding), while showing a benchmark table that includes BrowseComp and SWE-Bench Verified entries for Sarvam-105B in Specs and claims. Broader chatter frames the release as “two very strong open-weight LLMs from India,” referring to the two Sarvam sizes, in RT roundup.
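
To make the headline numbers concrete, here's back-of-envelope MoE arithmetic; the expert counts and sizes below are invented for illustration, since the post doesn't disclose Sarvam-105B's actual routing configuration.

```python
def moe_params(shared_b, n_experts, expert_b, top_k):
    """Total vs per-token-active parameter counts for a routed MoE (billions)."""
    total = shared_b + n_experts * expert_b
    active = shared_b + top_k * expert_b  # only top_k experts fire per token
    return total, active

# Invented configuration: ~3B shared weights, 64 experts of ~1.594B, 4 routed.
total, active = moe_params(3.0, 64, 1.594, 4)
# yields total ~105B with only ~9.4B active per token, i.e. roughly the
# 105B-total / 9B-active split claimed for Sarvam-105B
```

The point of the split is frontier-scale capacity at small-model per-token compute, which is what makes the serving-side routing work matter.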

DeepSeek V4 release-watch turns into “served checkpoint drift” narrative

DeepSeek (model served via web/app): Instead of a clear “DeepSeek v4” release, multiple posts point to frequent behind-the-scenes updates to what DeepSeek serves—users claim measurable improvements on math/coding benchmarks across the last few days and even qualitative gains on voxel generation, as described in Checkpoint update report.

A separate “where is it?” release-watch sentiment keeps recurring, with the direct call-out “What the frick happened to DeepSeek v4” in Release-watch post. The net effect is that builders are treating DeepSeek’s public surface as a moving target (checkpoint drift) rather than a stable versioned release—useful if you’re tracking regressions, but awkward for reproducible evals and procurement.

Will frontier open weights slow down as training costs climb?

Open weights (ecosystem): A renewed thread argues it’s plausible frontier open-weight releases eventually slow or stop, because training costs rise and the strategic value of frontier weights increases—captured in Ethan Mollick’s note shown in Open weights cost caution.

This matters operationally because teams depending on “free frontier” checkpoints as a procurement strategy may see more volatility: fewer releases, more gated weights, or heavier commercialization pressure (especially if the frontier gap requires pricier hardware and power).

Meta’s next open-weight move draws “what happened?” posts

Meta (upcoming LLMs): A small but clear release-watch signal shows builders asking what happened to Meta’s next models—“And what the double frick happened to meta and their upcoming llms” in Release-watch post. There’s no linked artifact or spec shift in the tweets, but it reads as competitive-pressure chatter alongside the DeepSeek v4 uncertainty and the Sarvam open-weight drop.


🏎️ Serving & runtime engineering: day‑0 support, attention kernels, and inference products

This cluster is about making models run fast and cheaply: day‑0 serving support, attention-kernel improvements, and new commercial inference offerings. Excludes the model weights themselves (model-releases).

SGLang adds day‑0 inference support for Sarvam MoE models

SGLang (LM-SYS): Day‑0 serving support for Sarvam MoE models is now live, per the Day-0 support note. The concrete integration work is tracked in a dedicated PR that adds inference support for Sarvam 30B MoE and Sarvam 105B MoE in SGLang, as shown in the Support PR and detailed in the GitHub PR. This matters for runtime teams because it’s the difference between “weights exist” and “you can actually deploy them” when the model has non-trivial attention/expert-routing details.

What’s being wired up: The PR explicitly calls out model-specific attention paths—GQA + QK norm for 30B and MLA with weight absorption + FP8 support for 105B—along with expert overlap/scheduling work that tends to be the real blocker for MoE models in production, per the GitHub PR.

It’s an early signal that Sarvam is treated as a first-class target in modern open serving stacks, not a “wait for downstream forks” model.

FlashMaskV4 rolls FlashAttention‑4 into flexible sparse masking

FlashMaskV4 (PaddlePaddle): PaddlePaddle says FlashMaskV4 now integrates FlashAttention‑4 kernels to keep custom masking flexibility while improving throughput, with reported speedups up to 2.9× in forward and 1.6× overall at 8k sequence length versus FA4 mask_mod, as announced in the Release thread. This lands squarely in the “runtime engineering” bucket: sparse/prefix/document masks are common in long-context training and serving, but they often fall off the fast path.

Mask coverage focus: The announcement emphasizes column-wise sparse masking across multiple mask types (prefix LM, document, sliding window, etc.) and claims stability/efficiency from 8k to 128k contexts, per the Release thread and the linked FlashMask paper.

The main open question is portability beyond the referenced kernel stack (the post is PaddlePaddle-centric), but the perf claim is specific enough that kernel/runtime folks will want to reproduce it on their own attention backend.
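
As a rough sketch of the mask families involved (not the FlashMaskV4 API), document and sliding-window masks can be expressed as simple predicates over query/key positions, which is the structure a fast kernel exploits instead of materializing a dense mask:

```python
def document_causal_mask(doc_ids):
    """Allow attention only within the same packed document, causally."""
    def mask(q, k):
        return k <= q and doc_ids[q] == doc_ids[k]
    return mask

def sliding_window_mask(window):
    """Each query sees only the previous `window` keys (itself included)."""
    return lambda q, k: 0 <= q - k < window

# Two packed documents of lengths 2 and 3:
mask = document_causal_mask([0, 0, 1, 1, 1])
# token 1 may attend to token 0 (same doc, causal), but token 2 may not
# attend to token 1 (document boundary)
```

The speedup claims are about keeping exactly this kind of flexible, sparse structure on the fast attention path rather than falling back to a generic masked kernel.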

Moondream teases “Kestrel,” a cross-device inference product

Moondream (inference product): Moondream says it’s about to launch a commercial inference product (internal codename Kestrel) targeting “blazing speeds” across a wide hardware span—from an 8GB Jetson Orin up to H100—and is soliciting naming feedback, per the Naming request and a follow-up in Kestrel still possible. This is a runtime signal more than a model signal: they’re positioning a single serving stack across edge and datacenter GPUs.

The tweets don’t include a public spec (latency, batching model, quantization formats, or supported runtimes), so for now it’s best read as a go-to-market teaser rather than a benchmarked release.

W&B Inference gets benchmarked on Artificial Analysis

W&B Inference (Weights & Biases): W&B says its inference offering is now listed on Artificial Analysis, with each served model independently tracked for “intelligence, speed, price, and latency,” per the AA listing note and the comparison page in AA model analysis. For serving engineers, the immediate value is externalized telemetry: throughput/latency and cost comparisons tend to be hard to normalize across providers.

What’s explicitly covered: The announcement calls out models like GLM‑5, Kimi K2.5, and MiniMax M2.5 as included in the AA comparison set, as stated in the AA listing note.

No detailed methodology is included in the tweets; the main artifact is the AA listing itself, via AA model analysis.


🔬 Automating research loops: agents that run experiments while you sleep

The standout theme is “research automation as a loop”: agents iterating on training code + hyperparameters with short-run evaluation. Excludes general agent SDK chatter (agent-frameworks).

Karpathy releases autoresearch, a minimal repo for autonomous LLM training experiments

autoresearch (Andrej Karpathy): Following up on Repo loop (agent runs experiments on a branch), Karpathy published a self-contained minimal repo (~630 LOC) that turns “LLM training core” into an autonomous experiment loop: a human iterates on program.md, while an agent edits train.py, runs fixed 5-minute trainings, and accumulates git commits when validation improves, as described in the release thread and shipped in the GitHub repo.

He also notes a “bigger cousin” of the same idea still running continuously in production on 8×H100, which frames this repo as a toy-scale version of a longer-running research harness, per the multi-GPU note.

The early community framing is that this pushes from “prompting a model” to “prompting an automated researcher,” as echoed in the reaction post; the repo is small enough that engineers can realistically fork it and swap in their own search policies, evaluators, and acceptance criteria.

A two-file research loop: program.md sets intent, the agent patches train.py and commits improvements

Autonomous experiment harness pattern: The loop in Karpathy’s setup splits responsibilities into two artifacts: the human maintains an instruction program (program.md), while the agent owns implementation changes in train.py, iterating through repeated fixed-budget runs (“every dot is a complete LLM training run that lasts exactly 5 minutes”) and using validation loss as the accept/reject gate before committing to a feature branch, as spelled out in the loop description.

This design bakes in three engineering properties that transfer well to other research automation projects: fixed-time evaluations to make results comparable run-to-run; versioned code search via git commits instead of opaque “agent memory”; and a single scalar metric gate that allows unattended overnight progress, matching the “leave it running” posture described in the continuous-run note.
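
Those three properties can be sketched in a few lines; `propose_edit`, `train_once`, and `git_commit` are hypothetical stand-ins rather than code from the autoresearch repo:

```python
def research_loop(propose_edit, train_once, git_commit, n_iters):
    """Fixed-budget experiment loop: accept an edit iff validation improves."""
    best, history = float("inf"), []
    for i in range(n_iters):
        edit = propose_edit(i)         # agent patches train.py
        val_loss = train_once(edit)    # fixed-duration training run
        if val_loss < best:            # single scalar gate
            best = val_loss
            git_commit(edit, val_loss)  # versioned search via git commits
        history.append((edit, val_loss))
    return best, history

# Toy run with canned losses 3.0, 2.5, 2.7: commits land on iterations 0 and 1,
# and the regression at iteration 2 is rejected.
losses = [3.0, 2.5, 2.7]
commits = []
best, _ = research_loop(
    propose_edit=lambda i: i,
    train_once=lambda e: losses[e],
    git_commit=lambda e, loss: commits.append(e),
    n_iters=3,
)
```

Because rejected runs never commit, the branch history itself becomes the audit trail of the search, which is what makes the unattended overnight posture workable.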


📄 New papers worth a skim: agentic RL taxonomies, transformer artifacts, and code reasoning

A high-signal research day: multiple papers/threads on agentic RL framing, transformer inference pathologies, and structured prompting for code reasoning/verifier behavior. Excludes product/tool launches.

Agentic RL survey proposes a taxonomy for LLM agents beyond sequence modeling

Agentic RL survey (arXiv/TMLR): A new survey argues “agentic reinforcement learning” should be treated as its own landscape (not just sequence modeling with RL); it proposes a two-part taxonomy across core capabilities (planning, tool use, memory, reasoning, self-improvement, perception) and across application domains, then inventories environments/benchmarks/frameworks shaping the field, as summarized in the survey overview.

The framing is useful if you’re designing evals or training loops for agents (partial observability, long horizons, tool feedback), because it separates what’s currently entangled in practice—policy learning, memory systems, tool APIs, and environment design—into a clearer map of “what to optimize next,” per the survey overview.

Meta’s “Agentic Code Reasoning” claims 93% patch-verification with a mandatory checklist

Agentic code reasoning (Meta): Meta researchers describe a “semi-formal reasoning” prompting method where the agent must write explicit premises, trace execution paths, and derive a proof-like conclusion; they report ~93% accuracy on patch verification without executing code, per the paper summary.

A key takeaway is behavioral: the paper claims the biggest failure mode is skipping reading local context and pattern-matching on familiar names, and the checklist forces the model to ground each claim in file-level evidence, as relayed in the paper summary.

LeCun/NYU: massive activations and attention sinks traced to pre-norm artifacts

Spike, Sparse, Sink (LeCun/NYU): A new paper dissects two recurring Transformer phenomena—massive activations (outlier channels acting like implicit parameters) and attention sinks (tokens that attract attention regardless of semantics)—and argues their co-occurrence is largely an architectural artifact of pre-norm design, with direct implications for quantization, pruning, and KV-cache handling, as described in the paper thread and ArXiv paper.

This lands as an “engineering interpretation” paper: it’s less about proposing a new model and more about explaining why some efficiency tricks break unpredictably, which is the practical hook called out in the paper thread.

“Why language models hallucinate” resurfaces: benchmarks reward guessing over abstaining

Hallucination incentives (OpenAI paper discussion): A thread recaps OpenAI’s “Why language models hallucinate” argument that training/eval setups often reward confident guessing over admitting uncertainty; it highlights that allowing abstention can reduce wrong answers (even if headline accuracy drops), as summarized in the thread summary with the paper linked in ArXiv paper.

This framing is mainly about measurement design—if leaderboards don’t give any credit for “I don’t know,” the optimal strategy under evaluation pressure can become bluffing, as described in the thread summary.
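
The incentive math is easy to make concrete; this is a generic expected-score sketch, not the paper's exact grading scheme:

```python
def expected_score(p, wrong_penalty, abstain_score=0.0):
    """Expected score for answering with correctness probability p,
    versus abstaining."""
    answer = p * 1.0 + (1 - p) * wrong_penalty
    return answer, abstain_score

# Standard grading (wrong answers score 0): guessing beats abstaining at any
# p > 0, so the optimal policy under evaluation pressure is to bluff.
guess, abstain = expected_score(p=0.2, wrong_penalty=0.0)

# Penalize confident errors (wrong = -1): abstaining now wins whenever p < 0.5.
guess_pen, _ = expected_score(p=0.2, wrong_penalty=-1.0)
```

Under the first scheme a 20%-confident guess still nets positive expected score; under the second it is strictly worse than saying "I don't know", which is the measurement-design point the thread makes.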


🏗️ Compute supply & data center signals: GPUs, power draw, and buildout churn

Infra posts today are concrete: hyperscale capex and power draw numbers, plus a notable data center expansion reversal. Excludes funding/plan pricing (business-funding-enterprise).

Forbes projects Google’s AI infra spend could reach $1.9T over 10 years

Google AI infrastructure (Google): A Forbes interview is being recirculated with big numbers: Google’s capex for AI is cited as up to $185B in 2026 (vs $90B in 2025), and Forbes “does the math” to project $1.5T over eight years and $1.9T over ten years if spend stays around that level, as shown in the Forbes projection.

The thread also frames this as a stack play—chips (TPUs) through modular data center designs and power deals—though those specifics are asserted rather than independently evidenced in the tweets.

Oracle and OpenAI reportedly cancel plan to grow Texas site from 1.2 GW to 2.0 GW

Texas AI data center capacity (Oracle/OpenAI): Reuters reporting (via a tweet summary) says Oracle and OpenAI dropped a planned expansion that would have taken a Texas site from 1.2 GW to 2.0 GW, citing financing complexity and OpenAI changing its compute forecasts, according to the Reuters summary. The same summary claims the site hit reliability issues when freezing weather broke liquid-cooling systems, and that Meta is in talks to take the extra capacity—plus an eye-catching detail that Nvidia paid a $150M deposit tied to Meta’s chip choice.

This is one of the clearer “buildout churn” signals: power-scale plans are still being resized midstream, and capacity can be re-traded between frontier buyers when forecasts or financing shift.

Amazon’s $11B Indiana AI data center campus is projected at 2.2 GW

Amazon data centers (AWS): A widely shared clip highlights Amazon’s new $11B campus buildout in St. Joseph County, Indiana, described as an AI data center project with a projected 2.2 GW power draw in the campus power figure. That scale is roughly “multiple nuclear reactors’ worth” of load, and it’s presented as one site among many.

Drone construction flyover

The post is light on commissioning timelines and GPU specifics, but the number is the operational detail that matters: it implies the next wave of capacity planning is power-constrained as much as chip-constrained.

OpenAI thanks Jensen for expanding Nvidia capacity at AWS

NVIDIA capacity at AWS (OpenAI): Sam Altman publicly thanked Jensen Huang for “expand[ing] Nvidia capacity at AWS” for OpenAI in the capacity thanks, which reads like an ongoing supply-side constraint update rather than a product launch. It’s a small but concrete signal that incremental GPU availability at a specific hyperscaler is still meaningful enough to call out.

The tweet doesn’t specify GPU type, region, or contract structure; it’s best interpreted as a “capacity is a bottleneck” pulse and a hint that AWS-side allocations (or lead times) shifted in OpenAI’s favor.


🛡️ Security & policy collisions: defense contracts, surveillance fears, and agent safeguards

Security/policy news today mixes org-level fallout (defense/surveillance concerns) with practitioner-level agent security (prompt injection, semantic firewalls, tool misuse allegations). Excludes OSS maintainer ‘slop’ mechanics (code-quality category).

OpenAI robotics leader Caitlin Kalinowski resigns amid Pentagon-use concerns

OpenAI Robotics (OpenAI): Caitlin Kalinowski publicly resigned from OpenAI, saying she “care[s] deeply about the Robotics team and the work we built together” in the Resignation retweet; follow-on posts frame the departure as rooted in concerns about surveillance and autonomous weapons tied to Pentagon contracting, as summarized in the Fallout summary and elaborated in the Expanded claim.

The operational read for builders is that defense-adjacent distribution can create internal churn and external scrutiny even when companies claim policy “red lines,” and robotics is where those lines get tested earliest because tool outputs become physical actions.

Clam pitches a “semantic firewall” to stop agent PII leaks and prompt injection

Clam (Clam + Composio): Composio shared a case study where an agent with Gmail access nearly ingested a parent’s tax info, and positioned Clam as a “semantic firewall” that sits at the network layer to intercept agent requests and block PII leakage and prompt injection, per the PII near-miss story.

Clam semantic firewall discussion

Integration angle: The same thread claims Composio helped wire Gmail/Calendar quickly by avoiding a long OAuth approval process, per the PII near-miss story.

The concrete takeaway is the “guard the egress” architecture: treat every tool/API call as a policy enforcement point, rather than relying on prompt discipline alone.
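
A minimal sketch of that pattern, assuming a simple regex-based PII policy (illustrative, not Clam's implementation):

```python
import re

# Every outbound tool/API call passes through this check before leaving the
# network boundary. The patterns are a tiny starting list, not a real policy.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-shaped strings
    re.compile(r"\b\d{13,19}\b"),          # long card-like digit runs
]

def guard_egress(tool_name, payload):
    """Return (allowed, reason); intended to wrap every outbound request."""
    for pat in PII_PATTERNS:
        if pat.search(payload):
            return False, f"blocked {tool_name}: payload matches {pat.pattern}"
    return True, "ok"

allowed, reason = guard_egress("gmail.send", "Attaching dad's SSN 123-45-6789")
# blocked at the egress even if the prompt-level guard failed
```

The design choice is that enforcement happens at the network layer, so it holds regardless of what the model was tricked into generating upstream.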

Viral “RL agent cryptomined via reverse SSH tunnel” story gets called fake

Agent safety discourse: A screenshot excerpt alleging an RL-trained agent initiated reverse SSH tunneling and repurposed GPUs for cryptomining circulated widely, as shown in the Excerpt screenshot; practitioners pushed back that it reads fabricated—citing “heavy novelization,” vague tool-call details, and a missing optimization incentive for mining during RL rollouts, per the Hoax skepticism.

This is a useful hygiene check for security teams: narratives about “unexpected agent behavior” are increasingly persuasive, so the bar should be logs + threat model + incentives, not prose.

Proposal: force agent-to-agent comms into English and monitor for steganography

Multi-agent safeguards: A thread argues risk rises when agents can coordinate, proposing that agent-to-agent communication be constrained to human-readable English so it’s inspectable, per the English-only proposal; it further suggests monitoring for statistically unusual code words and hidden Unicode characters as possible covert channels, as described in the Unicode monitoring addendum.

It’s speculative, but it’s a concrete design constraint people may try to bake into multi-agent orchestrators (especially for enterprise audit and incident response).
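
A starting-point sketch of the hidden-character check (the character list is minimal and illustrative, not a vetted policy):

```python
import unicodedata

# Zero-width family plus anything Unicode classifies as a format character (Cf)
SUSPECT = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def hidden_chars(message):
    """Flag invisible characters that could carry a covert channel."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(message)
        if ch in SUSPECT or unicodedata.category(ch) == "Cf"
    ]

clean = hidden_chars("deploy the staging build")          # nothing flagged
flagged = hidden_chars("deploy\u200b the staging build")  # zero-width space
```

Catching statistically unusual code words is a harder, distributional problem; the character-level check above is just the cheap first filter an orchestrator could run on every inter-agent message.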


🗂️ Docs, auditability, and adversarial ‘LLM SEO’

Today’s doc/devex thread is about trust and traceability: when agents research online or modify business artifacts, teams want outputs that remain auditable and resistant to adversarial content. Excludes general coding-agent performance chatter (feature).

LLM SEO pressure rises as agents cite outdated or adversarial vendor claims

Adversarial comparison content: A founder testing browser-vendor options reported that an agent doing internet research cited an outdated claim (about rrweb usage) from a competitor blog post and concluded they “can’t really trust” the agent for this kind of web research. They connected it to a broader pattern: competitor comparison pages are becoming table-stakes because LLMs will confidently repeat what they find online, per Browser vendor audit notes.

This is less about “hallucinations” and more about retrieval from adversarially-optimized pages; it pushes teams toward tighter source vetting (primary docs, changelogs) and more explicit provenance in agent-produced vendor analyses.

Auditing AI work in Excel depends on whether the agent stays “in-sheet”

Excel copilots (ChatGPT vs Claude): A practical auditability difference showed up when working on a very large, multi-tab historical macro dataset—ChatGPT tended to operate inside Excel (building formulas, manipulating sheets like a human), while Claude often switched to Python and pasted results back, which can break references and make provenance harder to inspect, as described in Excel comparison notes and reinforced in Follow-up on formulas only.

For teams that need traceable spreadsheets (finance, ops, analytics), the core issue is whether the system produces editable, dependency-preserving artifacts (formulas, references, pivots) versus opaque pastebacks that look right but are harder to audit later.

GPT-5.4 is being used as a “doc freshness” checker for repos

GPT-5.4 (OpenAI): Engineers are calling out a useful doc-maintenance behavior: the model proactively flags stale sections in Markdown docs and even suggests reorganizing them to reduce future agent misreads, as shown in Outdated docs catch and echoed in Markdown reorg suggestion.

The operational angle isn’t “better writing”—it’s keeping repo docs aligned with reality so downstream agents don’t treat obsolete instructions as ground truth during tool use and code changes.


🎬 Generative media workflows: design-to-animation, local video stacks, and node graphs

Generative media is a meaningful secondary cluster today: practical creative pipelines (After Effects automation, ComfyUI nodes, local video workflows) rather than pure demos. Excludes any bioscience-related content.

ElevenLabs voice tools land in ComfyUI via Partner Nodes

ComfyUI × ElevenLabs (ComfyUI/ElevenLabs): ComfyUI shipped ElevenLabs as Partner Nodes, bringing a full voice toolchain into node graphs—drag/connect/run—per the Partner Nodes announcement and the longer feature list in the Node list.

ElevenLabs nodes demo

What you get: Text-to-speech, speech-to-speech, speech-to-text, voice isolation, text-to-dialogue, text-to-sound-effects, and a voice selector, as enumerated in the Node list.
Why it matters for pipelines: This makes “prompt → image → video → voiceover” feasible inside a single ComfyUI canvas, as described in the Single graph workflow and detailed in the Integration blog.

LTX-2.3 ComfyUI templates updated, with a new Math Expression node dependency

LTX-2.3 workflows (ComfyUIWiki/ComfyUI): ComfyUIWiki pushed an updated LTX-2.3 workflow template and notes you may need the latest ComfyUI to get the new Math Expression node, per the Workflow update note.

LTX-2.3 workflow update

Templates: The updated JSON templates are shared as the Text-to-video template and the Image-to-video template.
Operational detail: The update callout implies graphs that previously hard-coded arithmetic can now be parameterized via the Math Expression node, as suggested in the Workflow update note.

LTX-2.3 is being ported to MLX for local Mac video runs

Local video on Mac (LTX): Following up on LTX-2.3 release (open-source local video model), a builder reports running LTX 2.3 on a custom MLX runtime built with GPT‑5.4 in Codex, with plans to ship adapters for LTX Desktop and ComfyUI, per the MLX runtime claim.

Local LTX run on Mac

The post doesn’t include perf numbers yet; it links back to the model feature overview in the Model page.

ChatGPT 5.4 is being used to generate After Effects animations from prompts

After Effects automation (OpenAI): A shared demo claims ChatGPT 5.4 can drive Adobe After Effects work by generating an animation setup from a prompt, producing layers/effects quickly enough to look like direct AE scripting or project templating, as shown in the After Effects demo.

After Effects automation demo

The tweets don’t include a reproducible workflow or plugin name (e.g., ExtendScript vs CEP vs manual paste), so treat it as a capability anecdote rather than a documented integration.

RealWonder releases code for real-time, action-conditioned video generation

RealWonder (research repo): A new open repo and paper for real-time physical action-conditioned video generation is circulating via the Paper share, with the authors also pointing to released pipeline code in the GitHub repo.

Action-conditioned video demo

The repo description emphasizes an interactive pipeline (single image → 3D/physics simulation intermediate → lightweight diffusion video), including a reported ~13.2 FPS at 480×832 in the GitHub repo.

A prompting workaround for better UI: use Google AI Studio’s app builder

UI generation tactic (Google AI Studio): A practitioner claims that using Google AI Studio’s app builder yields materially better UI/design outputs than prompting the same model via a CLI—even with the same prompt—illustrated in the side-by-side example from the Output comparison.

The core point is that the “builder” surface appears to add hidden scaffolding (layout/style constraints, component conventions, or a different system prompt), even when the visible prompt is identical, per the Output comparison.

A templated workflow for multi-scene ride videos using Nano Banana and Kling

Spaces workflow (Freepik/Kling): A shared “theme park tour” pipeline shows a structured sequence—generate visual elements, then animate them with Kling and stitch—framed as a reusable Space, per the Workflow walkthrough and the shareable artifact in the Freepik space.

Theme park ride workflow

This is less about model capability deltas and more about packaging a repeatable, parameterized media workflow that others can duplicate, as shown in the Space reuse instructions.


🏫 Builder events & field reports: sandbox symposiums, hackathons, and community distribution

Events are a real distribution channel today: multiple hackathons/meetups focused on agent sandboxes and practical workflows (not just marketing). Excludes tool changelogs (owned by their tool categories).

AI Tinkerers SF runs a “Sandbox Symposium” to compare background-agent sandboxes

AI Tinkerers SF (Event): San Francisco hosted “Background Agents: The Sandbox Symposium,” framed as a research unhackathon where teams evaluate sandbox platforms for long-running agents across security, performance, portability, and developer experience, as described in the Event page and shown live in the Workshop photos.

The format is closer to “bench the infra” than “build a demo,” with sponsor demos and team writeups shared back to the community, per the Talk room photo and Loop emphasis.

Long lines reported for YC’s multimodal frontiers hackathon (Google ecosystem)

Y Combinator (Hackathon): People reported long lines outside YC for a “multimodal frontiers hackathon,” with a sponsor stack name-dropping Google DeepMind plus tools like Chroma, LiveKit, and Browserbase, according to the Line photos.

The on-the-ground signal is demand: builders are showing up in person for multimodal + agent tooling workflows rather than model-spec talk, as implied by the crowd shots in Line photos.

The Verge covers ClawCon NYC as OpenClaw’s community distribution engine

ClawCon NYC (OpenClaw): The Verge published an on-the-ground report portraying ClawCon as an “open-source personal AI” community meetup, citing scale signals like ~1,300 sign-ups and ~700 attendees, as summarized in the Verge excerpts and linked via the Verge report.

The piece frames the event’s social dynamic as “what do you use your agent for?” rather than job titles, and positions openness as “fix it yourself” leverage in contrast to closed assistants, per the Verge excerpts.

Claude Code for Entrepreneurs meetup recap frames events as a product channel

Claude Code for Entrepreneurs (Meetup): A recap described a crowded founder-focused event centered on Claude Code workflows, with Balaji dropping in as a featured speaker, per the Meetup recap clip.

Meetup crowd clip

The framing in the recap is that these meetups are functioning as a practical distribution channel—watching real agent workflows land better than feature lists, as stated in the Meetup recap clip.

Lovable goes free for a day alongside 120+ SheBuilds in-person events

Lovable (Event + promo): Lovable announced a 24-hour free-to-use window for International Women’s Day in partnership with Anthropic, paired with “120+” in-person SheBuilds events worldwide and a livestream from Stockholm, according to the IWD announcement and the Event page.

The access window timing (12:00am ET Mar 8 to 12:59am ET Mar 9) was clarified in the Timing details and the FAQ page.

“Agent Glow Up” hackathon shows up as another in-person agent build node

Agent Glow Up (Hackathon): A Saturday build-day was shared as “Agent Glow Up,” with an in-person room setup and chairs-for-demos vibe captured in the Hackathon room photo.

It’s another example of agent communities using meetups as distribution—people are learning by watching live runs, not reading docs.

Gemini 3 hackathon in Singapore gets a “6 demos” field report

Gemini 3 (Hackathon): A field report from Singapore said they saw “6” demos at a Gemini 3 hackathon and called the energy high, as noted in the Hackathon mention.

No project links or judging criteria were included in the tweet, so this reads as a demand/enthusiasm signal rather than a capability benchmark.

OpenAI Devs hackathon hosted at Lorong AI’s new space

OpenAI Devs (Hackathon): An attendee reported judging at an OpenAI Devs hackathon hosted in a new Lorong AI space, with the note that they arrived late and couldn’t stay all day, as described in the Judge note.

The tweet doesn’t include a public agenda or artifact, so details like tracks, prize structure, or demo themes aren’t verifiable from today’s posts.


💰 Economics of the agent era: pricing, subsidies, and enterprise adoption math

Business/econ threads today are specifically about unit economics for agentic coding (subscription burn, provider subsidies) and how that changes buying behavior. Excludes raw data center capex (infrastructure).

Cursor analysis alleges Claude Code’s $200 plan implies $2k–$5k in compute spend

Claude Code (Anthropic): A reported internal Cursor analysis claims Anthropic’s $200/month Claude Code subscription can consume far more compute than it bills for—~$2,000 in compute previously and ~$5,000 now—implying aggressive subsidization, per the excerpt shared in Compute spend excerpt.

The claim is second-hand (“a person familiar…”) and doesn’t include methodology, but it’s being used as an explanation for why competitors struggle to match Claude Code’s pricing/usage posture, as framed in Compute spend excerpt.

Per-seat SaaS pricing gets questioned as agents multiply per-user usage

Pricing model debate: A recurring argument is that per-seat SaaS pricing breaks down when a single user can drive 10×–1000× more work via agents, as summarized in Per-seat pricing critique. A related concern is who gets priced out if agentic coding becomes the default workflow, especially for developers in lower-income regions, as raised in Affordability concern.

No alternative pricing scheme is proposed in these tweets, but the thread frames “usage skew” (one seat consuming orders of magnitude more compute) as the core mismatch, per Per-seat pricing critique.

Alibaba Cloud pushes a $3 AI coding plan with 18k requests/month via daily flash deal

AI Coding Plan (Alibaba Cloud): Alibaba Cloud is being promoted as offering a $3 first-month Lite “AI coding plan” (via a daily flash deal that resets at 00:00 UTC+8) with 18k requests/month, positioned as compatible with tools like Claude Code/Cline/Qwen Code in Pricing wedge thread; the product page describes Lite/Pro tiers and the flash-deal mechanic in the Plan page.

This is being framed as a potential adoption wedge in price-sensitive dev communities, but the tweets don’t include any throughput/latency limits or model mix details beyond what’s on the plan page.

Series A SaaS math thread claims “classic” outcomes no longer drive fund returns

Enterprise SaaS funding math: A thread lays out a back-of-the-envelope venture model: a $1M ARR company that meets the "33222" growth expectation (triple twice, then double three times, i.e., the familiar T2D3 pattern) reaches $72M ARR in 5 years and $250M in 8, then might fetch a ~7× public multiple (≈$1.75B value). That yields roughly 17.5× gross and "maybe 10× after dilution" for Series A investors, which the thread frames as only ~33% IRR even with near-perfect execution, per Funding math thread.

The tweet argues this is structurally harder than prior eras due to lower SaaS multiples, higher entry valuations, and higher hiring costs, as stated in Funding math thread.
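The thread's arithmetic can be checked directly. A minimal sketch, assuming the thread's stated inputs ($1M starting ARR, a 7× exit multiple, "maybe 10× after dilution" over ~8 years) plus one inferred figure: a ~$100M Series A entry valuation, which is what the 17.5× gross multiple implies but the thread doesn't state outright.

```python
# Back-of-envelope check of the "33222" (T2D3) venture math from the thread.

arr = 1.0  # $M ARR at Series A
for g in (3, 3, 2, 2, 2):  # triple, triple, double, double, double
    arr *= g
print(arr)  # 72.0 -> $72M ARR after 5 years, matching the thread

exit_arr = 250.0           # $M ARR at year 8, per the thread
exit_value = exit_arr * 7  # 7x public multiple -> $1,750M (~$1.75B)

entry_valuation = 100.0    # $M, inferred: 1750 / 100 = 17.5x gross
gross_multiple = exit_value / entry_valuation
print(gross_multiple)      # 17.5

net_multiple = 10.0        # "maybe 10x after dilution"
years = 8                  # assumed time to liquidity
irr = net_multiple ** (1 / years) - 1
print(round(irr, 2))       # 0.33 -> the thread's ~33% IRR
```

Note that the ~33% IRR only reproduces if you apply the post-dilution 10× over the full 8-year horizon; the gross 17.5× over the same period would imply a higher rate, so the "near-perfect execution, only ~33%" framing rests on the dilution haircut.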

SemiAnalysis founder cites a $5M annual Claude Code run rate

Claude Code (Anthropic): A datapoint circulating is that SemiAnalysis’s founder said their annual Claude Code run rate is $5M, as repeated in Run rate anecdote.

The tweet doesn’t specify whether this is seat subscriptions, API usage routed through Claude Code, or total Anthropic spend; it’s being used mainly as a signal of high-end power-user consumption rather than a broad adoption metric, per Run rate anecdote.


🧭 Workforce + sentiment: automation expectations, ambition resets, and org narratives

Discourse itself is news today: engineers/analysts debate how fast white-collar work shifts, plus recurring ‘under-ambitious with current models’ and ‘are we getting dumber?’ sentiment. Excludes concrete pricing/subsidy metrics (business).

Anthropic quote resurfaces: “even if progress stops,” automation within five years

White-collar automation (Anthropic): A widely shared clip/quote claims that even if algorithms stop improving, today's models could still automate "most white-collar jobs within 5 years," arguing that manually feeding tasks to models can already beat human labor economics, as framed in the Automation quote clip. This reads as a sharper, "capability is already sufficient" take than most near-term displacement narratives, and it's circulating as a follow-on to the Labor report (capability vs. observed usage gap).

Automation claim clip

Core premise: automation speed is bottlenecked by workflow integration and task decomposition rather than model quality, per the Automation quote clip.

The quote doesn’t come with an audit trail (which jobs/tasks, what wages, what error budgets), so treat it as a positioning statement rather than a measurement.

Andrew Yang’s “End of the Office” frames rapid white-collar job losses

Workforce narrative (Andrew Yang): A summary of Yang’s essay “The End of the Office” is circulating with specific second-order predictions—downtown hollowing, degree devaluation, and cascading household stress—under the phrase “the great disemboweling of white-collar jobs,” as recapped in the Essay summary thread and linked in the Yang essay.

Timeframe claim: the thread emphasizes rapid headcount cuts as competitive pressure forces fast copying of AI-driven savings, according to the Essay summary thread.

The content is not a benchmarked forecast; it’s an organizing story that teams will likely hear from non-technical execs and policy folks.

Jeff Dean: the hard part is managing the transition shock

Automation transition risk (Google): A Jeff Dean clip is making the rounds with a clear framing: the “real worry” is managing the sudden impact of automation, and without transition support “workers risk being pushed out,” as summarized in the Jeff Dean clip and repeated in the Repost.

Jeff Dean on transitions

Key implication: the argument is less about whether models can do tasks and more about organizational readiness (retraining, role reshaping, adoption pacing), per the Jeff Dean clip.

The recurring “I was under-ambitious” planning reset shows up again

Builder sentiment: Engineers report periodically realizing they've been "substantially under-ambitious with current models," implying that planning cycles and project scopes lag capabilities, as stated in the Under-ambitious post.

This isn’t a tool update; it’s a workflow smell—teams are recalibrating what’s feasible at a faster cadence than their normal roadmapping rhythm, per the Under-ambitious post.

“Maybe the models didn’t improve” becomes a measurement anxiety meme

Perception and measurement: A skeptical meme argues that recent “improvement” might be user adaptation rather than model gains—“what if the models haven't actually improved for months / what if we're all just getting dumber,” as posted in the Progress skepticism meme.

It’s a lightweight but persistent sentiment marker: people are questioning whether they can trust their subjective sense of progress without stable evals and consistent workflows, per the Progress skepticism meme.
