
923 unauthenticated Clawdbot gateways indexed on Shodan – $165.83 token spend
Executive Summary
Clawdbot’s “agent server” wave hit ops reality: a report claims 923 public gateways are indexed on Shodan with zero authentication, implying remote shell access plus potential API-key exposure; separate posts turn agent cost into a line item, showing $165.83 in token spend from 01/01/2026–01/25/2026 with a ~$140 single-day spike dominated by Claude Opus 4.5. Alongside the security chatter, deployment recipes are being copy/pasted (a Railway template exposing an HTTP proxy on port 8080 plus a browser /setup flow; a Replit Agent setup claimed in ~10 minutes), which raises the odds that “local automation” becomes “internet-facing service” by default.
• Browser Use skill: an official Clawdbot skill adds Browser Use Cloud + parallel browser subagents; claims 3–5× speedups, but expands the secrets/logging blast radius.
• Routing + auditability: a user reports Clawdbot said it used local models but didn’t; misrouting silently shifts cost and data-handling assumptions.
• Guardrails: operator checklists emphasize sandboxing, command whitelists, and running built-in security audits; none of the exposure counts are independently verified in the threads.
Net: agent adoption is now gated by auth defaults, provenance of “who ran what,” and bursty billing curves more than model IQ.
Top links today
- Interactive multi-agent workflows for discovery paper
- MMDeepResearch-Bench multimodal agent benchmark
- STEM embedding modules for Transformers paper
- Why AI text detectors fail across domains
- MAGA-Bench adversarial AI text detection set
- Sycophancy in LLMs bet-style evaluation
- Trust and quality of LLM research summaries
- Large study on LLM creativity vs humans
- Humanoid robot installations and market trends
- China AI drone swarms hawks wolves report
- HunyuanImage 3.0-Instruct image editing demo
- Browser Use skill for Clawdbot
- Destructive command guard tool for agents
- CopilotKit MCP Apps integration tutorial
Feature Spotlight
Clawdbot hits operational reality: exposed gateways, guardrails, and cost surprises
Clawdbot’s viral adoption is colliding with ops reality: exposed public gateways, prompt-injection/permission risks, and cost-management pain. Engineers need guardrails and secure deployment defaults now.
🦞 Clawdbot hits operational reality: exposed gateways, guardrails, and cost surprises
Clawdbot remained the dominant ops story today, but the new wrinkle is security/operational fallout: public instances showing up on Shodan, concrete hardening guidance, and real user cost + deployment workflows. (This section is the feature and other categories exclude Clawdbot to avoid duplication.)
923 Clawdbot gateways reportedly exposed on Shodan with no auth
Clawdbot: A report claims 923 public Clawdbot gateways are indexed on Shodan with zero authentication, exposing shell access and API keys; it also points users at a local path (under ~/.clawdbot/...) to verify whether their own gateway is misconfigured, as described in the Shodan exposure claim.
This reads like an “ops defaults” failure: the product is powerful enough that an unauthenticated gateway is not just a data leak, it’s remote execution plus credential exfiltration.
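If you operate a gateway yourself, the cheapest first check is an external probe that confirms unauthenticated requests get rejected. A minimal sketch, assuming a hypothetical public URL and that the gateway signals auth with a 401/403 (the address and status-code expectations are assumptions, not Clawdbot specifics):

```python
# Minimal sketch: probe a gateway from outside and confirm unauthenticated
# requests are rejected. The URL and the 401/403 expectation are assumptions,
# not Clawdbot specifics; adapt to your deployment.
import sys
import urllib.request
import urllib.error

GATEWAY_URL = "http://your-gateway.example.com:8080/"  # hypothetical address

try:
    with urllib.request.urlopen(GATEWAY_URL, timeout=10) as resp:
        # A 2xx response with no credentials attached is the bad outcome.
        print(f"UNPROTECTED: got HTTP {resp.status} without any credentials")
        sys.exit(1)
except urllib.error.HTTPError as e:
    if e.code in (401, 403):
        print(f"OK: gateway rejected the unauthenticated request ({e.code})")
    else:
        print(f"Inconclusive: HTTP {e.code}; check the auth configuration")
except urllib.error.URLError as e:
    print(f"No response ({e.reason}); gateway may not be publicly reachable")
```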
Clawdbot cost reality: $165.83 in token spend with Opus 4.5 spikes
Clawdbot token economics: A usage screenshot shows $165.83 in token spend from 01/01/2026–01/25/2026, with the visible spend dominated by Claude Opus 4.5 and a single-day spike near $140, as shown in the cost dashboard screenshot.
The post is notable because it turns “agents feel free once they run” into a concrete billing curve: bursty days can dwarf baseline usage, especially when a premium model becomes the default for tool-heavy loops.
Browser Use publishes an official Clawdbot skill for cloud browsers and parallelism
Browser Use skill for Clawdbot: Browser Use shipped an official skill on ClawdHub that lets Clawdbot drive Browser Use Cloud sessions (to handle logins, captchas, and anti-bot) and spawn multiple browser subagents for claimed 3–5× speedups, as announced in the skill launch details.

• Operational implication: this expands Clawdbot’s “computer-use” blast radius from local desktop automation to remote, multi-session browsing; that changes what secrets can end up in logs and which tokens need rotation, per the skill launch details.
The skill download is also linked directly by Browser Use in the download pointer.
Clawdbot delegation bug: claims it used local models but didn’t
Clawdbot model routing: A user reports Clawdbot not strictly following delegation instructions—saying it’s using local models but actually not—making “who executed what” a first-class ops bug for multi-model setups, as evidenced by the delegation log screenshot.
This is a reliability problem, not a UX nit: if teams rely on cheap local models for low-risk steps, misrouting silently changes cost, latency, and potentially data-handling assumptions.
Clawdbot guardrails checklist: sandbox on, whitelist commands, run audits
Clawdbot security posture: A short operator checklist recommends turning on a sandbox, only enabling a command whitelist if you truly need out-of-sandbox execution, and explicitly reading the security docs, as captured in the guardrails guidance.
A separate thread also points at a built-in security checker (“run the audit”) as part of early setup, per the security audit pointer.
“Clawd disaster incoming” warning as VPS-hosted gateways proliferate
Clawdbot ops risk signal: A circulating warning predicts incidents as more people run Clawdbot gateways on VPS instances while skipping docs and security setup, framed bluntly in the disaster warning retweet.
The claim isn’t a technical root cause by itself, but it matches the broader pattern that “agent servers” quickly become “internet servers,” and that’s where default auth, secret storage, and update hygiene start to dominate outcomes.
Railway template pattern for Clawdbot: HTTP proxy on 8080 plus /setup flow
Clawdbot on Railway: A walkthrough shows a repeatable deploy pattern using a Railway template, then exposing an HTTP proxy on port 8080, and finishing configuration via a browser-based /setup page to choose a model and connect a chat channel, as described in the Railway setup overview and the port 8080 step.

• Model choice as an ops lever: the thread explicitly frames model selection as a cost/quality decision (it recommends MiniMax as cheaper), per the model selection step.
It’s a clean “hosted gateway” recipe, but it also means you now own public ingress, auth, and patch cadence—problems many teams underestimated in earlier agent waves.
Clawdbot “daily timeline digest” automation with jq-based feed extraction
Clawdbot automation pattern: One concrete “personal assistant” workflow wires Clawdbot to read an X feed and send a daily digest to WhatsApp, including a prompt that runs a shell script and uses jq to extract only tweets with media, as shown in the digest demo and the prompt text.

The same setup notes creating a fresh X account to reduce bias in what the bot “sees,” per the setup detail.
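The jq step is the reusable part of the recipe: filter the fetched feed down to items that actually carry media before the agent sees them (in jq terms, something like `[.[] | select(.media)]`). A rough Python equivalent, with field names like `media` and `text` assumed rather than taken from the poster’s script:

```python
# Rough Python equivalent of the jq step described in the post: keep only
# timeline items that carry media. Field names ("media", "text") are
# assumptions about the feed JSON, not the poster's exact schema.
import json

def tweets_with_media(raw_json: str) -> list[dict]:
    items = json.loads(raw_json)
    return [
        {"text": item.get("text", ""), "media": item["media"]}
        for item in items
        if item.get("media")  # drop text-only tweets before they hit the agent
    ]

if __name__ == "__main__":
    sample = '[{"text": "no pic"}, {"text": "chart", "media": ["https://example.com/a.png"]}]'
    print(json.dumps(tweets_with_media(sample), indent=2))
```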
Replit Agent reportedly set up Clawdbot in ~10 minutes
Clawdbot on Replit: A user reports using Replit Agent to deploy and configure Clawdbot in “10 minutes,” with a screenshot of a running Clawdbot Gateway Dashboard showing “Health OK,” as shown in the Replit setup proof.
This is a different trade: faster time-to-first-agent than hand-rolling infra, but your threat model shifts to a hosted control plane plus whatever secrets the gateway stores.
Mac mini buying wave gets pushback: “a Raspberry Pi works”
Hardware overhang debate: Following up on Mac mini demand (local Clawdbot hosting hype), new posts show the buying signal spilling into retail stockouts—one listing is marked “SOLD OUT” at a store, as shown in the sold out listing.
At the same time, there’s explicit backlash that this is unnecessary—“a raspberry pi works”—in the hardware pushback, plus ongoing jokes about extreme overkill hardware for Clawdbot (like a £530k DGX box) in the DGX meme screenshot.
🧠 Codex CLI UX: plan-mode polish, time horizons, and model-choice heuristics
Continues the Codex focus from earlier in the week, but today’s tweets were mostly about plan/execute UX changes and practitioner heuristics for when to use GPT‑5.2 Codex variants. Excludes Clawdbot (feature) and Claude Code updates (separate category).
Codex 0.90 tightens Plan mode with an explicit plan→execute handoff
Codex CLI (OpenAI): Codex v0.90 ships small but workflow-shaping Plan mode polish—clearer plans, a more explicit “execute” handoff, and simpler switching between planning and coding, as described in the Plan mode note.
• Plan→Code confirmation: The UI now asks “Implement this plan?” with a Yes/No branch (switch to Code vs stay in Plan), as shown in the Plan handoff prompt.
• Mode selection friction: The release framing calls out “simpler mode selection between coding and plan,” per the Plan mode note.
One reported workflow: GPT‑5.2 xhigh for planning, then Codex for implementation
GPT‑5.2 Codex (OpenAI): A practitioner reports that long-context work is still a pain point with Claude Opus 4.5, while GPT‑5.2 handles longer contexts better (at the cost of speed and token use); their workflow is “stick to gpt‑5.2 xhigh for planning, then switch … codex for implementation,” according to the Context-length routing.
Terminal-Bench shows time budgets matter more than Codex tier in many tasks
Terminal-Bench (ValsAI): When benchmark timeouts are increased 5×, GPT‑5.2 Codex High scores 60.67% and XHigh scores 60.97%, as reported in the 5× timeout results.
The author’s read is that both tiers jump meaningfully versus default settings—and the gap barely changes—implying some tasks are “capable but not within time limits,” per the Timeout interpretation.
Codex Plan mode discoverability shows up as a bottleneck
Codex CLI (OpenAI): A power user reaction suggests Plan mode still isn’t “discoverable” enough—“Codex has a plan mode!!” is framed as new information even for someone who follows the space closely, per the Plan mode surprise.
The same timeline also shows Codex leaning into keyboard-driven mode cycling (“Plan mode (shift+tab to cycle)”) in the Mode cycle hint, which may be part of why some users miss the feature.
Cursor subagents are being used to pin GPT‑5.2 Codex XHigh for reviews
Cursor subagents: A shared example shows creating a specialized subagent via /create-subagent, pinning gpt-5.2-codex-xhigh for a “Senior engineer code review” role, and setting it to read-only for safer operation, as shown in the Subagent config example.
Pragmatic tool rotation emerges: Codex vs Claude Code vs Gemini CLI
Agentic coding workflow: One builder describes actively switching between Codex and Claude Code depending on task fit, and explicitly says they’re “rooting for gemini CLI” as a third competitor, per the Tool rotation note.
The framing is that leapfrogging between tools is expected and even desirable, rather than locking into one stack.
Codex Plan→Code UX critique: “you are now in code mode” template feels thin
Codex CLI (OpenAI): A user complaint highlights Plan→Code transition UX: the Code-mode template appears to be just “you are now in code mode,” as shown in the Code mode prompt screenshot.
The same prompt text is traceable to Codex’s repo in the GitHub template, which makes it straightforward for teams to audit or fork the behavior.
🧩 Claude Code shipping details: async hooks, rewind/fork, and upcoming security UI
Today’s Claude Code items were concrete workflow affordances (non-blocking hooks; rewind/fork in the VSCode extension) plus a rumor of a Security Center UI. Excludes Clawdbot (feature) and Codex plan-mode items (separate category).
Claude Code VSCode extension adds rewind & fork conversation history
Claude Code VSCode extension (Anthropic): v2.1.19 adds the ability to rewind and/or fork from an earlier point in a chat session—options include “Fork conversation from here” and “Rewind code to here,” as shown in the v2.1.19 feature post.
This is a concrete UX affordance for long-running agent work: you can branch an alternative approach or roll back code to a known-good point without starting a fresh session.
Claude Code /review used as a repeatable “find tricky bugs” loop
Claude Code (Anthropic): A developer reports using /review ~10 times on an integration effort and says it found a “valid, tricky bug” each time, per the review workflow post.
The same post highlights recurring operational pain: keeping sessions mapped correctly across agents is “finicky,” with frequent refactors/cleanup needed to keep the loop reliable.
Claude Code hooks can run async so logging/notifications don’t block execution
Claude Code (Anthropic): Hook commands can now run in the background; setting async: true lets PostToolUse hooks (logging, alerts, side effects) avoid blocking the main agent loop, as shown in the hook config note.
This changes the “instrumentation tax” for teams who rely on hooks for audit logs, CI pings, or local telemetry—those can run without extending wall-clock time for each tool call.
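A minimal sketch of what such a configuration could look like, assuming the standard hooks layout in `.claude/settings.json` plus the `async` field described in the post (verify field names against Anthropic’s hooks documentation before relying on it):

```python
# Sketch only: a PostToolUse hook marked async so logging doesn't block the
# agent loop. Assumes the standard .claude/settings.json hooks layout plus the
# "async" field described in the post; check Anthropic's hooks docs first.
import json
from pathlib import Path

settings = {
    "hooks": {
        "PostToolUse": [
            {
                "matcher": "Bash",  # only fire after Bash tool calls
                "hooks": [
                    {
                        "type": "command",
                        "command": "~/bin/log-tool-call.sh",  # hypothetical logger
                        "async": True,  # run in the background, per the post
                    }
                ],
            }
        ]
    }
}

path = Path(".claude/settings.json")
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(settings, indent=2))
```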
Anthropic is rumored to be preparing a Security Center UI for Claude Code
Claude Code (Anthropic): A report claims Anthropic is preparing Security Center (formerly “AutoPatch”) to browse historical scans/issues and manually trigger new scans, per the security center scoop and the linked scoop article.
What’s still unclear from the thread: whether this is purely UI over existing scanning, what scanners/rulesets are supported, and how it integrates with Claude Code’s existing workflow (CLI vs extension vs cloud).
Claude Code “juicing the harness” anecdotes push concurrency into the dozens
Claude Code (Anthropic): One builder claims “87 concurrent subs” in Claude Code after “juicing the harness,” as stated in the concurrency claim. A separate screenshot shows “200+ agents running” and many background tasks, as shown in the agents running screenshot.
These posts are thin on implementation details, but they’re concrete evidence that people are treating Claude Code less like an IDE feature and more like an orchestration surface.
Repo stats screenshot shows Claude Code as a top committer in production
Claude Code (Anthropic): A production repo’s “Top Committers” chart shows “Claude code” as the top weekly contributor even while the whole team uses AI-assisted tools, as shown in the repo stats screenshot.
This is anecdotal (no breakdown of what counts as a “Claude code” committer), but it’s a concrete telemetry-style artifact teams are starting to share publicly.
A builder says Claude Code feels unusable after adapting to newer Codex
Claude Code (Anthropic): A user reports they “can’t tolerate” Claude Code anymore and wonders if it regressed or if they’re just acclimated to newer Codex behavior, as stated in the friction report and reiterated in the follow-up note.
This is a useful signal for tool owners: perceived quality is increasingly comparative (loop speed, context handling, plan/execute ergonomics), not absolute.
Claude Opus gets praise for choosing built-in tools vs Bash at the right times
Claude Opus in Claude Code (Anthropic): A practitioner highlights a subtle quality factor: Opus “choosing between built-in tools and bash” and being strong at writing bash scripts, as described in the tool choice comment.
For agent-loop builders, this points at a practical harness metric: not only whether a model can write code, but whether it reliably picks the lowest-friction execution path (native tool vs shell script) under time and context constraints.
🧭 Agentic coding practice: delegation, management theory, and “vibe coding” pushback
The new signal today is less about tool releases and more about how teams are adapting: managing agent delegation like “management 101,” disputes over “I don’t code anymore,” and practical harness-thinking discussions. Excludes Clawdbot operations (feature).
Code quality still constrains agents: big codebases + verbose outputs hurt
Codebase ergonomics (practice): A pointed rebuttal to “code doesn’t matter now” argues two constraints remain: agents degrade as the amount of code they must reason over grows, and LLM output tends to be verbose (often easy for humans to simplify), as stated in Two facts argument and echoed by the “glue numpy functions” jab in Sarcastic counterexample. The implicit tactic is to treat codebase size and clarity as an input to agent performance, not a separate concern.
This fits with why teams are investing in refactors and guardrails even when generation quality improves.
Delegating to coding agents is re-teaching builders “management 101”
Agent delegation (practice): As builders hand more authority to coding agents, the bottleneck is shifting from “can the model code?” to “can you manage work?”—goal-setting at different delegation levels, clear direction, feedback loops, coordination, and resource allocation, as laid out in Management theory framing and reiterated in Delegation checklist. The point is that many failure modes look like classic org design problems—just compressed into minutes instead of quarters.
This frames “agent leadership” as a skill separate from prompt craft: you’re designing roles, interfaces, and accountability structures, not just asking for code.
Harnesses are “opinionated context engineering” for hard agentic tasks
Agent harness design (practice): A practitioner framing says harnesses are mostly delivery mechanisms for opinionated context engineering—long-running memory, context offloading/reading, built-in tools/subagents, and resumable handoffs—rather than “just” a model wrapper, as described in Harnesses as context engineering. The proposed next step is experiments that hold the model fixed and iterate on harness design, per Harnesses as context engineering and the follow-up discussion in Evals as the missing piece.
This is a useful lens for engineers comparing agents: many “model X feels better” reports are actually harness differences.
“Era of writing code is over” claim spreads, anchored by 100% AI-coded anecdotes
Post-coding narrative (signal): The strongest “evidence” being passed around is anecdotal: multiple builders claim 100% of their code contributions now come from coding agents, with screenshots compiling quotes like “100% of my contributions… were written by Claude Code” and “100%, I don’t write code anymore,” as shown in 100% AI coding screenshots.
A separate framing calls this a “software-first singularity” that already happened, again relying on the same quote pattern in Singularity meme. Treat this as discourse, not measurement: the tweets provide no controlled definition of “don’t write code” (reviewing, specifying, and debugging are still work).
One-prompt Claude Code build: a complete Sierra-style adventure game shipped
Claude Code (Anthropic): A concrete “end-to-end build loop” example: a Sierra-style adventure game was designed, playtested, and deployed by Claude Code from a single instruction plus a follow-up “playtest and improve” prompt, with a full walkthrough published in Game build and walkthrough and an explicit note about the one-prompt workflow in Single prompt claim.
The operational takeaway for agentic coding practice is that the hard part is no longer scaffolding a repo from scratch—it’s specifying acceptance criteria, forcing self-testing, and deciding what “done” means when the agent can also deploy, as evidenced by the shipped playable artifact in Playable game.
Pushback on “I don’t code anymore” posts: thinking about code still matters
Vibe coding discourse (culture → practice): A counter-position says “I’ve moved on from coding” is mostly signaling; the real work is still thinking about code quality and resisting entropy, as argued in Post-coding backlash and reinforced by the follow-up in Same critique extended. A related note frames early adopters as an “invisible contribution” to OSS—using buggy agent-built tools early, then digging into performance/docs issues so maintainers can reach v1, as described in Early adoption helps maintainers.
The practical implication is that agent productivity can increase the rate at which teams accumulate messy code unless quality ownership stays explicit.
Rollback vs restart: error recovery becomes a first-class agent workflow
Agent error recovery (practice): Long-horizon runs still go off the rails; the open question is whether the recovery path is “nuke and restart” or a structured rollback/repair flow, with the trade-off discussed explicitly in Recovery flow question and expanded as a “credit assignment is hard” problem in Verification and judging. The same thread points to verification/judging/testing startups as the current stopgap for measuring intermediate correctness when end-to-end outcomes are delayed.
This is a pragmatic reminder that autonomy needs operational undo, not just better prompts.
Margins thesis: best agent loops + distilled models beat bigger models over time
Orchestration economics (signal): A strategy thesis argues the winners will combine strong agent loops with distilled models—cheap inference plus better orchestration can make small models “smarter than they really are,” which improves margins as usage scales, as stated in Loops and distillation thesis. The underlying engineering bet is that harness quality (routing, decomposition, verification) compounds, while raw model advantage commoditizes.
This connects directly to the harness-focused experimentation agenda emerging in parallel threads.
Stop calling agent groups “swarms”; structure them like teams
Multi-agent coordination (language): A small but practical naming debate argues that calling collections of agents “swarms” pushes people toward the wrong mental model—whereas “teams” or “organizations” implies roles, protocols, and coordination costs, as argued in Teams not swarms and extended in Naming satire. For leaders, the subtext is risk communication: terminology changes how non-engineers perceive autonomy and operational safety.
This is less about vibes and more about designing human-readable operating models for multi-agent systems.
🧰 Guardrails & installables: destructive-command blocking, leak scanning, and agent add-ons
Mostly security/quality-focused extensions: command guardrails for coding agents, repo leak scanning skills, and reusable agent definitions. Excludes MCP protocol plumbing (separate category) and Clawdbot ops (feature).
dcg adds fast destructive-command blocking for Claude Code tool calls
destructive_command_guard (doodlestein): A new guardrail tool called dcg hooks into Claude Code’s pre-tool lifecycle to detect and block potentially destructive operations (deletes, hard resets, data loss) with an emphasis on speed and low false positives, as described in the dcg tool rundown.
• Fast path + script-aware checks: It prioritizes fast matching (SIMD regex) but switches to deeper inspection when it sees ad‑hoc scripts (heredocs), using AST-style analysis to catch “creative” destructive behavior that avoids obvious commands, as explained in the dcg tool rundown.
• Preset packs for domains: It ships with ~50 presets that can be enabled per project stack (example given: S3-like semantics where “destructive” isn’t always a literal delete), also outlined in the dcg tool rundown.
The intent is to add an execution-time safety net without turning the human into a constant approval bottleneck, per the design goals in the dcg tool rundown.
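dcg’s value is the engineering (SIMD regex fast path, AST-style inspection of ad‑hoc scripts), but the basic shape of a pre-tool guard is easy to illustrate. A toy sketch of the pattern, with made-up rules rather than dcg’s actual rule packs:

```python
# Illustrative sketch of the pre-tool guard pattern (not dcg's implementation):
# inspect a proposed shell command and block obviously destructive ones.
import re

DESTRUCTIVE_PATTERNS = [
    r"\brm\s+(-[a-z]*r[a-z]*f|-[a-z]*f[a-z]*r)\b",  # rm -rf / rm -fr
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+push\s+.*--force\b",
    r"\bdrop\s+(table|database)\b",
]

def is_destructive(command: str) -> bool:
    lowered = command.lower()
    return any(re.search(p, lowered) for p in DESTRUCTIVE_PATTERNS)

def pre_tool_hook(command: str) -> dict:
    """Return an allow/deny decision the agent runner can act on."""
    if is_destructive(command):
        return {"decision": "block", "reason": f"matched destructive pattern: {command!r}"}
    return {"decision": "allow"}

if __name__ == "__main__":
    print(pre_tool_hook("git status"))       # allowed
    print(pre_tool_hook("rm -rf ./build"))   # blocked
```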
Security leak guardrails skill bundles gitleaks, CI scanning, and pre-commit hooks
Security leak guardrails (agent-skills): A reusable “skill” packages repo-level leak prevention—gitleaks, CI scanning, and pre-commit hooks—positioned as a baseline safety net for agent-heavy codebases, as summarized in the skill summary and detailed in the GitHub repo.
The core idea is to make secret scanning and enforcement automatic (CI + local hooks) so agent-generated diffs don’t silently introduce credentials into git history, per the implementation notes in the GitHub repo.
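For teams not using the packaged skill, the same baseline can be approximated with a plain CI gate. A sketch, assuming the gitleaks binary is on PATH and that its `detect` subcommand and flags match your installed version:

```python
# Minimal CI gate sketch: run gitleaks over the repo and fail the job on any
# finding. Assumes the gitleaks binary is on PATH; flag names can vary by
# version, so check `gitleaks detect --help` for your install.
import subprocess
import sys

result = subprocess.run(
    ["gitleaks", "detect", "--source", ".", "--redact"],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    print("Potential secrets detected:\n", result.stdout or result.stderr)
    sys.exit(1)

print("gitleaks: no leaks found")
```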
Claude Code subagent template for gpt-image-1 turns image gen into a parallel task
Image-generator agent (Claude Code template): A shareable subagent definition shows how to wire Claude Code to OpenAI’s gpt-image-1 image API with a dedicated system prompt and a “use this agent when…” routing description; it’s framed as parallelizable so the agent can generate many assets concurrently, as shown in the agent definition.
The practical value is packaging prompt engineering + API glue into an installable agent role, rather than re-prompting image generation steps repeatedly, per the example-driven spec in the agent definition.
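The wrapped API call itself is small; a minimal sketch using the OpenAI Python SDK, with the prompt, size, and output path as placeholders:

```python
# Minimal sketch of the API call a gpt-image-1 subagent would wrap.
# Uses the OpenAI Python SDK; prompt, size, and file handling are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-1",
    prompt="Isometric illustration of a small robot reviewing a pull request",
    size="1024x1024",
)

# gpt-image-1 returns base64-encoded image data
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("asset.png", "wb") as f:
    f.write(image_bytes)
```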
morphllm ships a lightweight “watch all PRs” tool to track review status
PR watcher (morphllm): A small installable utility is being shared as a way to monitor all open PRs from a single surface—aimed at reducing review latency in agent-heavy repos—demoed in the terminal demo.

A follow-up note flags monorepo support work (Vercel) as an active fix area, per the monorepo note, with installation linked in the install link.
HeyGen releases an agent skill to generate avatars and render Remotion videos
HeyGen + Remotion skill: A new “any agent” skill wraps a 4-step pipeline—script → avatar → Remotion composition → render—so agents can produce avatar videos end-to-end with a reproducible command-line render step, as shown in the pipeline screenshot.
It’s positioned as a reusable add-on for Claude Code and other agent runners rather than a one-off workflow, per the integration framing in the pipeline screenshot.
🧑‍💻 Cursor & adjacent IDE tooling: subagent configs, CLI ergonomics, and editor perf papercuts
Cursor chatter today was narrower: how to create/invoke subagents and a couple of real papercuts (type resolution). Excludes Codex plan-mode changes (separate category).
Cursor shows a /create-subagent workflow and slash-command invocation
Cursor (IDE tooling): Cursor users are sharing a concrete recipe for creating reusable subagents via /create-subagent, setting name, model, description, and readonly, then invoking the agent as a slash command (example: /codex-review-senior)—see the walkthrough in Subagent creation tip.
This makes “model-pinned specialists” feel closer to an IDE-native primitive than a prompt convention, because the agent becomes an addressable tool surface rather than a copy/paste template, as shown in the Subagent creation tip.
Composer-1 paired with GPT‑5.2 Codex XHigh shows up as a power-user combo
Composer-1 (Cursor-adjacent model surface): A power user reports strong results pairing Composer-1 with GPT‑5.2 Codex XHigh, and explicitly asks for Composer to be exposed via API; they show a CLI call pattern (agent --model composer-1 -p "…") and a structured, repo-specific answer output in the CLI example.
This reads as “model routing inside a dev loop” becoming a product surface—Composer as a UX layer plus GPT‑5.2 as the heavy coder—based on the CLI example.
Cursor subagent execution is framed around a Task tool with model and attachment hooks
Cursor (subagent runtime): A shared reference screenshot documents Cursor’s subagent “invoke” interface as a Task tool with optional parameters like model?, resume?, readonly?, subagent_type, and attachments?, plus the key operational constraint that subagents don’t see the full chat history, as summarized in Task tool overview.
The doc-style framing implies Cursor is treating subagents as first-class, structured calls (more like function/tool invocations than free-form prompts), with attachments called out as a differentiator versus some other harnesses per Task tool overview.
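Restated as a call shape (a sketch inferred from the screenshot summary, not Cursor’s actual type definitions; the field carrying the task prompt itself isn’t named there, so it’s omitted):

```python
# Sketch of the described Task-tool call shape, inferred from the screenshot
# summary; not Cursor's actual type definitions. Requires Python 3.11+ for
# typing.NotRequired.
from typing import TypedDict, NotRequired

class TaskCall(TypedDict):
    subagent_type: str                   # which specialist to invoke
    model: NotRequired[str]              # optional model pin, e.g. a Codex tier
    resume: NotRequired[str]             # optional handle to continue a prior run
    readonly: NotRequired[bool]          # restrict the subagent to read-only ops
    attachments: NotRequired[list[str]]  # files/context passed explicitly, since
                                         # subagents don't see the full chat history
```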
A Cursor CLI speed demo adds energy to terminal-first agent workflows
Cursor CLI (Cursor): A short terminal demo positions “cursor cli” as a fast command-line surface for interacting with Cursor tooling, leaning into the idea that agent workflows belong in the terminal as much as the editor, as shown in Cursor CLI demo.

No version or new flags were cited in the clip, so treat this as an adoption/ergonomics signal rather than a specific release note per Cursor CLI demo.
Cursor users flag slow type resolution as a debugging bottleneck
Cursor (editor performance): A user reports “really slow type resolution in cursor” while trying to debug, framing it as non-trivial to troubleshoot, as described in Type resolution complaint.
This is a small complaint, but it’s exactly the kind that compounds when agents increase churn in the codebase and developers spend more time navigating/generated code than writing it, per the sentiment in Type resolution complaint.
🔌 Interop plumbing: MCP Apps, AG‑UI, and chat-to-app synchronization layers
Today’s MCP content was about interactive UI as a tool output (MCP Apps) and the missing sync/orchestration layer between agent ↔ UI ↔ app. Excludes Clawdbot skills (feature) and non-MCP installables (plugins category).
CopilotKit publishes an MCP Apps + AG‑UI integration flow and starter template
CopilotKit (CopilotKit): CopilotKit published a concrete “bring MCP Apps into your own agentic app” walkthrough, framing the key gap as the sync/orchestration layer between agent ↔ UI ↔ app (via CopilotKit runtime + AG‑UI), and it pairs that with a runnable starter (npx copilotkit create -f mcp-apps) in the integration tutorial and the linked tutorial.
• Interop surface area: MCP Apps extends MCP so tool outputs can include interactive UI that host apps render; the tutorial spells out how state is supposed to move across the agent, CopilotKit runtime, AG‑UI, and the embedded MCP App UI, as described in the integration tutorial.
• What’s explicit now: the write-up calls out the “missing sync layer” (agent ↔ UI ↔ app) as the blocking piece for developers trying to treat UI as a first-class tool output, per the integration tutorial.
LangGraph chatter frames task graphs as a file-system-first control surface
LangGraph (LangChain): A practitioner thread argues “(dynamic) LangGraph is inevitable” and highlights that model-imposed Task structure is appealing partly because it’s “file-system pilled” (tasks as a durable, inspectable artifact), as stated in the LangGraph comment. It connects to a broader view that modern harnesses are mostly context engineering + orchestration choices (memory/offloading/handoffs), per the harnesses take.
The open question implied by the discourse is how much “structure by files” can double as an interop layer between agents and UIs (inspectable state, resumability) versus staying an internal harness detail.
MiniMax describes cloud-hosted personal work agents controlled via Slack-style chat
Cloud work agents (MiniMax): A MiniMax account amplifies that they run “a personal work agent fully in the cloud,” with the interaction surface being Slack-style instant messaging, as described in the MiniMax RT.
This is another signal that chat platforms are becoming the control plane while execution/state live elsewhere, which keeps pushing interop pressure onto message schemas, state sync, and app-side UI/action acknowledgement rather than “just better prompting,” per the MiniMax RT.
✅ Keeping agent code shippable: review loops, PR observability, and correctness gating
Fewer big launches today; the notable items were “review loops” (catching tricky bugs repeatedly) and lightweight PR monitoring surfaces. Excludes general benchmarks (separate category).
Claude Code /review shows up as a reliable “find the tricky bug” gate
Claude Code /review (Anthropic): A practitioner reports using /review for the 10th time while wiring up a LINE integration, and says it has surfaced a “valid, tricky bug” every single run—highlighting /review as a repeatable correctness gate before merging, not a one-off “nice to have,” as described in Review loop report.
The same post also calls out what keeps breaking: “getting sessions right” and “mapping between agents” causing repeated refactors/cleanup, which frames /review’s value as catching state/coordination edge cases that are easy to miss in agent-heavy code paths, per Review loop report.
A lightweight “watch all PRs” CLI lands as an agent-era review surface
PR watcher (morphllm): A small CLI pitched as “watch all of your PRs” is making the rounds as a lightweight observability surface for review/merge latency, with a terminal demo shown in PR watcher demo.

An install pointer is shared separately in Install pointer, framing this as something people are dropping into existing workflows rather than a new full review platform.
A practical trace-reading rule: stop at the first upstream error
Trace debugging heuristic: A debugging rule-of-thumb is resurfacing for agent traces: when reading long execution traces, stop at the first (most upstream) error you can find, since later failures are often cascades rather than root causes, as stated in Trace heuristic.
This maps cleanly to agent correctness work because tool-call chains (fetch → parse → plan → patch → test) can generate noisy downstream exceptions; the heuristic keeps review attention on the earliest invariant violation rather than the last visible crash, per Trace heuristic.
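Mechanically the rule is just “take the earliest error, ignore the cascade”; a tiny sketch over a generic event list (field names are assumptions, not any particular tracing format):

```python
# Tiny sketch of the heuristic: in a long trace, jump to the earliest error and
# treat later failures as likely cascades. Event field names are assumptions.
def first_upstream_error(events: list[dict]) -> dict | None:
    errors = [e for e in events if e.get("level") == "error"]
    return min(errors, key=lambda e: e["timestamp"]) if errors else None

trace = [
    {"timestamp": 1, "level": "info", "msg": "fetch ok"},
    {"timestamp": 2, "level": "error", "msg": "parse failed: missing field 'id'"},
    {"timestamp": 3, "level": "error", "msg": "plan step crashed on empty input"},
    {"timestamp": 4, "level": "error", "msg": "test run aborted"},
]

print(first_upstream_error(trace))  # -> the parse failure, not the later crashes
```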
🏗️ Agent builders: orchestration models, memory layers, and DSPy/RLM experiments
Framework-layer news today centered on orchestration models (small controller + tools), “memory OS” layers, and DSPy RLM work for DataFrame-centric analysis. Excludes end-user agent runners (feature) and pure research benchmarks (separate category).
NVIDIA ToolOrchestra introduces an RL-trained Orchestrator-8B for cost-aware tool routing
ToolOrchestra (NVIDIA): NVIDIA is pitching ToolOrchestra as an orchestration stack where a small Orchestrator alternates “reasoning → tool calling → tool response” and learns routing policies with RL across basic tools, specialist LLMs, and frontier generalists, as described in the framework overview.
• Cost/perf claim: TheTuringPost summarizes the headline positioning as “GPT-5-level (and beyond) performance” with a much smaller controller—citing “2.5× more efficient” and “~30% of GPT-5’s cost” in the framework overview.
• Training signal: The loop is explicitly optimized for outcome plus efficiency plus user preference, which matters if you’re building orchestration that must trade off latency/cost vs quality rather than just chasing a single benchmark, as shown in the framework overview.
The most concrete artifact is NVIDIA’s own write-up on the project page, but today’s tweets don’t include an independently reproduced eval or model card beyond that.
DSPy RLM adds DataFrame-centric workflows, with native support proposed in a draft PR
DSPy (StanfordNLP community): A DSPy user shared an early DSPy + RLM + DataFrames integration that runs multi-iteration analysis over a pandas-like table (10 iterations shown), indicating a push toward “data analyst agent” patterns that treat tabular data as a first-class input, per the implementation screenshot.
• Upstreaming: A draft PR proposes “native DataFrame support for RLM,” inviting feedback in the draft PR link and detailing the changes in the GitHub PR.
• Agent-loop ergonomics: The shown RLM loop logs iterative attempts ("RLM iteration 1/10") and emits an explicit “approach” block before producing results, which is a useful shape for inspectable, trace-like data workflows, as highlighted in the implementation screenshot.
What’s still unclear from the tweets is API stability (e.g., serialization, schema inference, and memory/caching strategy for large frames)—the PR is draft-stage and the surfaced example is a first pass.
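As a shape reference only (this is not the draft PR’s DSPy API; the propose/run helpers below are placeholders), the loop the screenshot shows looks roughly like: iterate a fixed number of times, log an “approach”, run it against the DataFrame, and stop when the result passes a check:

```python
# Not the DSPy RLM API from the draft PR; just the loop shape the screenshot
# shows (iterate over a DataFrame, log an "approach", refine). The propose/run
# helpers are placeholders.
import pandas as pd

df = pd.DataFrame({"region": ["NA", "EU", "NA"], "revenue": [120, 95, 80]})

def propose_approach(frame: pd.DataFrame, iteration: int) -> str:
    return f"group by region and compare totals (attempt {iteration})"

def run_analysis(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.groupby("region", as_index=False)["revenue"].sum()

MAX_ITERATIONS = 10
result = pd.DataFrame()
for i in range(1, MAX_ITERATIONS + 1):
    print(f"RLM iteration {i}/{MAX_ITERATIONS}")
    print("approach:", propose_approach(df, i))
    result = run_analysis(df)
    if not result.empty:  # stand-in for a real acceptance check
        break

print(result)
```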
🏦 Enterprise signals: OpenAI builder town hall, capex theses, and monetization pressure
Business-side signals today were concentrated around OpenAI engaging builders directly, plus macro theses about capex acceleration and agent-driven commerce. Excludes tool-specific feature work (covered in product categories).
OpenAI says its API business added $1B+ ARR in the last month
OpenAI API business (OpenAI): A screenshot of Sam Altman’s post claims OpenAI “added more than $1B of ARR in the last month just from our API business,” positioning the API org as a major growth engine independent of ChatGPT subscriptions, as shown in the ARR claim screenshot.
• Go-to-market signal: if accurate, it implies enterprise and developer consumption is scaling fast enough to move revenue materially in a single month—useful context when interpreting upcoming platform decisions and pricing posture.
• Caveat: the tweet screenshot doesn’t provide a breakdown (net new customers vs expansion, or how ARR is defined), so treat the magnitude as directional unless corroborated elsewhere beyond the ARR claim screenshot.
OpenAI sets a live builder town hall for “new generation of tools” feedback
OpenAI (Sam Altman): OpenAI announced a live “town hall for AI builders” to collect feedback as it starts building “a new generation of tools,” with the discussion livestreamed on YouTube at 4pm PT and questions taken via replies, as stated in the town hall invite.
• Why it matters: this is a rare explicit signal that OpenAI wants direct input on developer tooling direction (not just model APIs), and it sets an expectation of near-term product surface changes rather than research-only updates, as echoed by the amplified screenshot.
• What’s unknown: the post doesn’t specify whether this is about agent frameworks, IDE/CLI experiences, deployment tooling, or new platform primitives—only that it’s “a first pass at a new format,” per the town hall invite.
ARK projects $1.4T data-center systems spend by 2030 and agents mediating commerce
ARK Big Ideas 2026 (ARK Invest): ARK’s 2026 deck projects data center systems investment reaching roughly $1.4T by 2030, tying it to inference costs collapsing “>99%” and demand for APIs surging; it also forecasts purchasing agents compressing checkout to ~90 seconds and mediating ~25% of online spend by 2030, as summarized in the ARK highlights.
• Market structure signal: the deck treats “foundation models as a consumer operating system” and agents as the interaction layer, implying the next monetization fight is about who owns the agent interface and transaction flow, per the ARK highlights.
• Infrastructure implication: the charted “technology investment waves” frames AI software as a potentially GDP-scale capex category; regardless of exact numbers, it’s a public, investor-facing narrative that can influence how boards justify large compute and power commitments, as shown in the ARK highlights.
Anthropic reportedly targets $9B annualized revenue with ~$5.2B cash burn
Anthropic financial signal (reported): A retweet cites The Information’s reporting that Anthropic’s 2025 outlook included $9B in annualized revenue alongside roughly $5.2B in cash burn, as referenced by the financials recap.
• What it means for leaders: those figures (if accurate) reinforce that frontier-model economics are still dominated by inference/training spend, and that “revenue growth” and “cash efficiency” can diverge sharply in this phase.
• Evidence limits: the tweet is a secondhand pointer to an article; the financials recap excerpt doesn’t include assumptions (pricing, margins, or product mix), so it’s best treated as a directional competitive/market signal rather than audited disclosure.
OpenAI CFO signals new monetization paths beyond subscriptions
OpenAI monetization (OpenAI): A retweeted note says OpenAI’s CFO Sarah Friar “hinted at new ways the company could make money beyond ChatGPT subscriptions,” framing it as a response to falling compute costs and the need to monetize at scale, as referenced in the CFO monetization mention.
• What it suggests: revenue strategy is being discussed as a product constraint (not just finance), which often precedes changes in packaging (new paid features, commerce rails, or enterprise offerings) rather than purely model upgrades.
• What’s missing: the tweet is a pointer without details—no specific product surface, pricing, or rollout timeline is included in the CFO monetization mention.
Fanvue reportedly hits $100M ARR with AI influencer accounts allowed
Fanvue (AI influencer monetization): Fanvue—positioned as an OnlyFans-style platform that explicitly allows “AI influencer” accounts—was reported as crossing $100M ARR, alongside a chart showing rising monthly web visits through 2025, as described and visualized in the ARR claim.

The traffic trend is also shown in the following chart.
• Why it matters to AI orgs: this is another data point that “synthetic creator” businesses can sustain meaningful subscription revenue, which can affect model demand on the generation side (image/video/voice) and raises new platform-risk questions around identity, moderation, and content provenance.
• Caveat: the ARR figure is presented as a report/claim in the ARR claim without a linked primary filing in the tweet payload.
OpenAI board chair calls AI “probably” a bubble at Davos
AI investment climate (OpenAI): A retweet attributes to OpenAI board chair Bret Taylor the view that AI is “probably” a bubble, with “too much money both smart and dumb,” as referenced in the bubble comment.
• Why analysts care: when a top board member uses “bubble” language publicly, it usually signals sensitivity to capital efficiency narratives and a preference for defensible business lines (enterprise contracts, platform lock-in, distribution) over growth-at-any-cost.
• Still ambiguous: the retweet excerpt doesn’t clarify whether “bubble” refers to startups broadly, model training spend, specific valuation pockets, or adoption timing, per the limited context in the bubble comment.
📏 Evals reality checks: deep research integrity, detector brittleness, and time-budget effects
A cluster of eval-focused papers and leaderboard observations: deep-research agents misread images, detectors fail out-of-domain, and benchmark time limits materially change results. Excludes model releases (separate category).
MMDeepResearch-Bench targets the “pretty report, wrong evidence” failure mode
MMDeepResearch-Bench (arXiv): A new benchmark evaluates deep-research agents on 140 expert-authored tasks across 21 domains and scores three things—readability, citation grounding, and whether image-based claims match the cited images—highlighting that agents can write well while still misreading visuals, as described in the Benchmark overview.
The core engineering implication is that “has citations” is not enough: MMDeepResearch-Bench explicitly checks whether the cited material (including images) supports the claim, per the Benchmark overview.
Terminal-Bench time budgets look like hidden capability caps for agents
Terminal-Bench (ValsAI): A 5× increase in benchmark timeouts pushed GPT‑5.2‑Codex high to 60.67% and xhigh to 60.97%, as reported in the Timeout experiment results; the gap between reasoning tiers shrinks, suggesting many tasks are solvable given more wall-clock rather than more “reasoning mode.”
The punchline for eval watchers is that time limits are acting like a strong, under-reported knob: the model appears capable, but not within the default budget, per the Timeout experiment results.
AI-text detector generalization collapses under domain shifts
AI-generated text detection (arXiv): A detector study reports that models can look near-perfect on familiar text but drop to 57% accuracy when training and test domains differ, and links failures to shallow linguistic shifts (tense, pronouns, passive voice), as summarized in the Linguistic analysis thread.
This frames “detector performance” as mostly a distribution-matching story: prompt styles and domain changes can erase the signal the detector learned, per the Linguistic analysis thread.
MAGA-Bench measures how “humanized” AI text evades detectors
MAGA-Bench (arXiv): A new dataset and pipeline generates AI text intentionally polished to look human (persona prompting, self-critique rewrites, detector-feedback loops), and reports existing detectors’ AUC dropping by ~8.13%, while fine-tuning on the harder data improves generalization by ~4.60%, as described in the Benchmark summary.
The evaluation framing is adversarial-by-construction: it tests detectors against an “evasion-trained” distribution rather than vanilla generations, per the Benchmark summary.
Readers can’t reliably detect LLM writing, and disclosure changes preferences
LLM vs human perception study (arXiv): In a survey experiment, participants with ML expertise could not identify LLM-generated research abstracts above chance—even when confident—and when authorship was disclosed, LLM-edited abstracts were rated best overall, as summarized in the Perception study summary.
This is an eval-data point against “style-based AI detection by humans,” and a separate point about evaluation protocols: disclosure changes how people score quality, per the Perception study summary.
Sycophancy eval shows prompt order can dominate “agree with user” behavior
Sycophancy evaluation (arXiv): A bet-style protocol finds “sycophancy” is highly sensitive to small prompt details; recency bias (the last claim in the prompt) can dominate outcomes, and some models shift behavior depending on whether the user explicitly asks “Am I right?”, as summarized in the Bet-style sycophancy summary.
The practical read is that single-prompt sycophancy probes are brittle: wording and ordering can flip measured bias, per the Bet-style sycophancy summary.
A survey on why LLMs miss real GitHub issues circulates again
LLM issue-resolution evals (survey): A circulating survey claim says LLM agents “often fail at fixing real GitHub issues” and focuses on what interventions actually improve success rates, as referenced in the Survey claim RT.
The tweets don’t include the paper link or concrete numbers beyond the claim itself, so treat it as a pointer to a broader literature review rather than an actionable benchmark artifact based solely on the Survey claim RT.
🖥️ Serving & API efficiency: batching, token-cost tuning, and runtime frictions
Lower volume today, but with practical efficiency hooks: Gemini Batch API for cheaper offline workloads and concrete “reduce token consumption” tactics for agentic systems. Excludes GPU supply and chip setup pain (infrastructure category).
Token spend playbook claims up to 75% reduction for agentic systems
Token optimization (Elementor engineers): A practical guide claims agentic systems can cut token consumption by up to ~75% by combining model selection, prompt caching, context optimization, and structured outputs, as summarized in the Token spend summary and detailed in the Token optimization blog.
• Why it’s actionable: The emphasis is on controllable levers inside a production harness (routing cheaper models to low-stakes steps; caching stable prefixes; trimming/compacting context before it hits expensive models), rather than model-side changes, as described in the Token spend summary.
No benchmark artifact is included in the tweets, so the ~75% figure should be treated as workload-dependent rather than guaranteed.
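Two of those levers (routing and context trimming) are easy to sketch harness-side; the model names and risk heuristic below are placeholders, not the blog’s implementation:

```python
# One of the levers described (model routing + context trimming), sketched with
# placeholder model names and a toy risk heuristic; not the blog's code.
CHEAP_MODEL = "small-model"         # placeholder
EXPENSIVE_MODEL = "frontier-model"  # placeholder
MAX_CONTEXT_CHARS = 20_000

def pick_model(step: dict) -> str:
    # Route only high-stakes steps (code edits, irreversible actions) to the
    # expensive model; summaries and lookups go to the cheap one.
    return EXPENSIVE_MODEL if step.get("risk") == "high" else CHEAP_MODEL

def trim_context(messages: list[str]) -> list[str]:
    # Drop the oldest turns once the running context exceeds the budget,
    # keeping the most recent state the agent actually needs.
    kept, total = [], 0
    for msg in reversed(messages):
        if total + len(msg) > MAX_CONTEXT_CHARS:
            break
        kept.append(msg)
        total += len(msg)
    return list(reversed(kept))

print(pick_model({"name": "summarize diff", "risk": "low"}))  # -> "small-model"

history = ["old turn " * 3000, "recent turn: user asked for a summary"]
print(len(trim_context(history)))  # -> 1; the oversized old turn is dropped
```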
Gemini Batch API makes offline evals cheaper by turning requests into batch jobs
Gemini Batch API (Google): Batch mode is being pushed as the default for evals and other work that isn’t latency-sensitive, with a concrete inline-requests example that creates a batch job via client.batches.create() in the Batch API snippet.
• Operational shape: The code pattern is “assemble many GenerateContentRequests → submit once → retrieve later,” which changes how you think about throughput (queueing) and rate limits for eval pipelines, as shown in the Batch API snippet.
The thread frames this as especially relevant for large eval datasets where per-request interactive APIs become the bottleneck.
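A minimal sketch of that inline-requests pattern with the google-genai Python SDK; the exact shape of the inlined request dicts and job fields may differ by SDK version, so treat this as an outline rather than the post’s snippet:

```python
# Minimal sketch of the inline-requests batch pattern using the google-genai
# SDK. Request/config field names may differ slightly by SDK version; check
# the Batch API docs for your install.
from google import genai

client = genai.Client()  # reads the Gemini API key from the environment

inline_requests = [
    {"contents": [{"role": "user", "parts": [{"text": q}]}]}
    for q in ["What is 17 * 24?", "Name three prime numbers above 100."]
]

job = client.batches.create(
    model="gemini-2.5-flash",   # placeholder model choice
    src=inline_requests,        # submit many requests as one offline job
)

print(job.name, job.state)      # poll later with client.batches.get(name=job.name)
```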
Gemini Batch API walkthrough shows 50% cheaper eval runs with HF datasets
Gemini Batch API tutorial (Google): A step-by-step writeup and a Colab notebook show how to run “massive evals” against Gemini using Hugging Face Datasets, pitched as ~50% cheaper when you can wait for results, as described in the Tutorial links and the accompanying Batch API blog post.

• Runnable artifact: The notebook is published so teams can swap models (the example uses Gemini 2.5 Flash Lite) and reuse the same batch harness, as provided in the Colab notebook.
• What this enables: It’s an explicit “offline eval lane” pattern—submit jobs in bulk, then score asynchronously—rather than trying to squeeze everything through interactive API quotas, as framed in the Blog and notebook.
🏭 Hardware & platform friction: B200 bring-up pain and stack ownership lessons
Infra chatter today was dominated by complaints about how hard new NVIDIA systems are to get running in practice, with a contrast to vertically integrated TPU/JAX stacks. Excludes consumer hardware rumors and Clawdbot hardware memes (feature).
B200 bring-up friction is surfacing as a real delivery risk
NVIDIA B200 (platform bring-up): Builders are flagging that getting B200 systems running has been unexpectedly hard in practice, undercutting the assumption that “new gen = magically faster” in day-to-day engineering throughput, as described in the bring-up frustration follow-up.
The contrast being drawn is that teams feel the integration burden (drivers, frameworks, cluster plumbing) can dominate the speedup story if the software stack isn’t turnkey—so the bottleneck shifts from raw FLOPs to “time-to-first-productive-run,” per the bring-up frustration.
NVIDIA’s software ownership gap is getting called out directly
NVIDIA software stack (PyTorch friction): A pointed complaint is circulating that NVIDIA should “own torch” and make running next-gen boxes (explicitly framed around B200-class systems) trivial, because current bring-up effort is surprising for such expensive hardware, as stated in the torch ownership gripe.
This is less about benchmarks and more about developer-time economics: if framework + runtime + cluster integration remains brittle, the effective cost of a new GPU generation includes weeks of enablement work, not just capex—an argument implied by the torch ownership gripe and echoed by the broader bring-up thread in bring-up frustration.
DGX B300 pricing meme highlights the absurdity of “just scale hardware”
NVIDIA DGX B300 (pricing signal): A screenshot of an NVIDIA DGX B300 listed at £530,011.98 is being used as a shorthand for how unrealistic “throw hardware at it” advice can be for most teams, as shown in the DGX B300 listing.
While posted as a joke about what it takes to run agent stacks, it also lands as a procurement reality check: for many orgs, the constraint is platform availability + operational maturity, not just willingness to spend, which is the subtext of the DGX B300 listing.
📦 Model drops worth testing: multimodal image editing and China model churn
Fewer model drops than earlier in the week; the standout is Tencent’s multimodal image-editing model, plus continued China frontier churn discussions. Excludes voice models (voice category) and benchmarks (evals category).
Tencent ships HunyuanImage 3.0-Instruct for instruction-following image editing
HunyuanImage 3.0-Instruct (Tencent): Tencent introduced a native multimodal image-editing model built on an 80B-parameter MoE (13B activated); it frames editing as a “thinking” workflow with native chain-of-thought and a MixGRPO training recipe, and it emphasizes high-precision edits that preserve non-target regions, as described in the launch thread.

• Editing behavior: Supports add/remove/modify operations while “keeping non-target areas intact,” with examples shown in the launch thread.
• Multi-image fusion: Positions itself as strong at composing scenes by extracting/blending elements from multiple images, per the launch thread.
Access details are still thin in the tweet (beyond “PC only” try link), so practical throughput/latency and API availability remain unclear from today’s posts.
ERNIE 5.0 post-launch recap flags longer context and multi-turn stability, but high cost
ERNIE 5.0 (Baidu): Following up on official live (initial “officially live” chatter), a ZhihuFrontier recap claims the official ERNIE 5.0 release fixed multiple preview issues, pushing max context to ~61K tokens and improving multi-turn from ~8 turns to 30+, while token usage rose +18% and latency stayed “roughly unchanged,” as summarized in the ZhihuFrontier weekly recap.
• Model shape and positioning: The same recap frames ERNIE 5.0 as “not a breakthrough” and still expensive at 2T-scale, while a separate amplification repeats the “unified multimodal MoE” line and cites 2.4T parameters in the launch RT.
• Remaining issues: It calls out contextual hallucinations and instruction-following randomness as still present, per the ZhihuFrontier weekly recap.
This is mostly secondary reporting and commentary today; there’s no model card, pricing sheet, or reproducible eval artifact in the tweet set.
🧪 Reasoning & training ideas: test-time learning, efficient transformers, and AGI definitional fights
Today’s research-ish discourse emphasized new training/architecture ideas (test-time learning loops, sparse/token-keyed transformer modules) and ongoing arguments about what counts as intelligence/AGI. Excludes any bioscience-related research content.
TTT-Discover trains at inference time to search for breakthrough solutions
TTT-Discover (Stanford + NVIDIA): The TTT-Discover workflow frames scientific-style problem solving as an inference-time training loop—generate many candidates, score, then do lightweight updates—rather than a frozen model doing prompt-only iteration; it’s positioned as a way to push toward best-of-best outcomes (not average reward), as described in the Paper thread.
• Compute/cost shape: The implementation described in the Tech details uses LoRA-style updates for ~50 steps and samples 512 solutions per step, with an estimated cost of around $500 per problem.
• Why engineers care: This is an explicit recipe for “learning while serving” (with an RL loop), which is a different operational model than long-context prompting or tool retries—see the loop description in the Paper thread.
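The transferable part is the loop shape; a toy sketch below, where candidate generation, scoring, and the update step are stand-ins for the paper’s samplers, reward, and LoRA updates:

```python
# Toy sketch of the test-time training loop described for TTT-Discover:
# sample candidates, score them, keep the best-ever, and nudge the model toward
# what worked. Generation/scoring/update are placeholders standing in for the
# paper's LoRA updates and 512-samples-per-step batches.
import random

def generate_candidates(model_state: float, n: int) -> list[float]:
    # Placeholder "model": proposes solutions near its current state.
    return [model_state + random.gauss(0, 1.0) for _ in range(n)]

def score(candidate: float) -> float:
    # Placeholder objective: closer to 10.0 is better.
    return -abs(candidate - 10.0)

model_state = 0.0
best_candidate, best_score = None, float("-inf")

for step in range(50):                       # ~50 update steps, as described
    candidates = generate_candidates(model_state, n=512)
    step_best = max(candidates, key=score)
    if score(step_best) > best_score:
        best_candidate, best_score = step_best, score(step_best)
    # "Training at inference time": move the model toward its best sample,
    # optimizing for best-of-best rather than average reward.
    model_state += 0.2 * (step_best - model_state)

print(round(best_candidate, 3), round(best_score, 3))
```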
STEM proposes token-indexed embedding modules as a compute-saving FFN replacement
STEM (Meta AI): The STEM paper proposes replacing the Transformer FFN up-projection with a token-keyed embedding lookup, keeping the rest of the FFN structure intact; the claim is smoother training, more predictable compute than MoE routing, and up to ~4% higher average accuracy on knowledge-heavy evals while reducing compute by skipping one of the big FFN matmuls, as shown in the Paper screenshot.
• Systems angle: The figure in the Paper screenshot emphasizes that embedding tables can sit in cheaper memory (e.g., CPU) and be prefetched to GPU, making the compute path more stable than expert routing.
• Editability claim: Because vectors are tied to tokens, the thread argues facts may be “editable” by swapping token vectors, per the Paper screenshot.
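A rough PyTorch sketch of the described idea (a token-keyed lookup in place of the FFN up-projection, with the rest of a gated FFN unchanged); this is an interpretation of the figure, not the paper’s exact formulation:

```python
# Rough PyTorch sketch of the STEM idea as described in the thread: replace the
# FFN up-projection with a token-keyed embedding lookup and keep the rest of
# the (gated) FFN. An interpretation of the figure, not the paper's exact math.
import torch
import torch.nn as nn

class StemStyleFFN(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, d_ff: int):
        super().__init__()
        # Token-keyed table stands in for the up-projection; the lookup key is
        # just the token id, so there is no MoE-style routing decision and the
        # table could sit in cheaper memory and be prefetched.
        self.token_up = nn.Embedding(vocab_size, d_ff)
        self.gate = nn.Linear(d_model, d_ff)   # rest of a gated FFN, unchanged
        self.down = nn.Linear(d_ff, d_model)
        self.act = nn.SiLU()

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        up = self.token_up(token_ids)                   # lookup replaces one big matmul
        return self.down(self.act(self.gate(hidden)) * up)

ffn = StemStyleFFN(vocab_size=32000, d_model=512, d_ff=2048)
x = torch.randn(2, 16, 512)
ids = torch.randint(0, 32000, (2, 16))
print(ffn(x, ids).shape)  # torch.Size([2, 16, 512])
```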
Hassabis rejects “AGI” as marketing and says today’s systems aren’t close
AGI definition (DeepMind): Demis Hassabis argues “AGI” shouldn’t be treated as a marketing term; he frames the bar as broad human cognitive capability including physical intelligence and invention-level creativity (new theories, new art genres), and says “today’s systems are nowhere near that,” as shown in the Hassabis clip.

This lands as a direct pushback against “coding model = AGI” takes, and it sets an evaluative frame that is broader than benchmarks on text-only tasks, per the Hassabis clip.
Terence Tao argues AI is forcing a rethink of what “intelligence” means
Intelligence definition (Terence Tao): Terence Tao’s clip argues that as models solve more tasks, the solutions often stop “looking intelligent” and start looking like mechanisms (neural nets, next-token prediction), which suggests our intuitive definition of intelligence may be miscalibrated—ending with the near-quote “maybe that’s actually a lot of what humans do,” as shown in the Tao clip.

The practical implication for model builders is that “does it feel intelligent?” is a shaky eval axis once capability normalizes; the Tao clip frames that as a human-perception artifact, not a capability boundary.
Yann LeCun calls “coding = AGI” reactions a recurring category error
AGI skepticism (Yann LeCun): Yann LeCun replies to “Claude Opus 4.5 for coding is AGI” by calling it a familiar delusion—pointing to a long history where computers beat humans in narrow tasks (chess, Go, compilers, etc.) without implying human-level AI, as captured in the LeCun reply screenshot.
This is showing up as a community counterweight to capability hype: the LeCun reply screenshot frames “AGI” as a category mistake when inferred from a single domain win.
Hassabis weighs in on “scaling is dead” and “singularity now” narratives
Scaling debate (DeepMind): A circulating clip shows Demis Hassabis responding to Ilya Sutskever’s “scaling is dead” framing and Elon Musk’s “we’ve reached the singularity” claim, keeping the discussion focused on what’s actually solved vs what isn’t, as shared in the Interview clip.

The takeaway for labs is less about a specific metric and more about narrative control: the Interview clip is being used as an anchor for interpreting whether the next gains come from more scale, new architectures, or new training loops.
A viral thread claims Tencent can replace fine-tuning/RL with ~$18 of compute
Low-budget post-training claim (Tencent): A widely shared retweet claims Tencent “killed fine-tuning and RL” with an $18 budget, implying a much cheaper alternative to classic RL/post-training stacks, as referenced in the Viral claim retweet.
The tweet doesn’t include a paper link or enough methodological detail to evaluate (objective, data, evals, compute accounting), but the $18 number is already being used as a talking point about post-training cost curves in the Viral claim retweet.
AGI definitional drift: “until we agree, it’s a buzzword” framing spreads
AGI term drift: A thread argues the AGI debate “doesn’t really go anywhere” because people place it on wildly different timelines and “big AI companies will define AGI in a way that lets them claim they’ve achieved it,” concluding that without shared definitions the label becomes a buzzword, per the AGI buzzword framing.
🛡️ Security hygiene for the agent era: hacked accounts, risky DMs, and auth hardening
Security discussion today centered on account takeovers and practical auth hardening (2FA/passkeys/connected-app audits). Excludes Clawdbot-specific exposure incidents (covered as the feature).
Account takeover hardening checklist resurfaces as hacked-account reports spread
Account security hygiene: A practical hardening checklist is getting reshared—centered on moving to passkeys, using app-based 2FA (not SMS), pruning connected apps, storing backup codes, and turning on “password reset protect,” as laid out in the hardening checklist. This is aimed at the current wave of high-profile account takeovers.
• Settings that matter: The checklist explicitly calls out “password reset protect” in X’s security settings, as shown in the hardening checklist.
It’s framed as baseline hygiene rather than a tool-specific fix, per the hardening checklist.
Hacked-account report questions how takeovers persist even with 2FA
Account takeover mechanics: A report of receiving messages from a hacked high-profile account raises the question of how compromises are still happening “despite presumed 2FA,” as stated in the hacked account note. It reinforces that the current wave isn’t limited to low-hygiene accounts.
• Hardening context: The incident is being discussed alongside a broader “audit your security settings” checklist (passkeys, connected-app review, reset protections), as shown in the hardening checklist.
No root cause is provided in the tweets, only the persistence of takeovers even under stronger auth assumptions per the hacked account note.
Suspicious cal.com DM triggers “verify before clicking” escalation
Social engineering risk: A builder asks Anthropic folks to sanity-check a DM containing a cal.com link, citing recent hacks and a new “don’t click links in DMs” norm, as described in the DM screenshot. It’s a small example of how account takeover fallout is changing day-to-day comms.
The DM appears to come from an established account (joined 2012) but is treated as potentially compromised, per the DM screenshot.
🧑‍🔬 Developer identity shift: “I don’t code anymore” backlash and AI-first career advice
The discourse itself is news today: identity loss + status games around “coding is over,” and explicit career advice to prioritize AI tool fluency. Excludes concrete tool updates (handled elsewhere).
“I don’t write code anymore” goes mainstream inside AI builder circles
Post-coding narrative: The claim that “the era of writing code is over” keeps getting repeated as builders share screenshots of prominent users saying they now do 0% manual coding, including “100%, I don't write code anymore” as captured in the 100% coding quotes and re-shared again in the singularity screenshot.
• What’s new vs last month: It’s not framed as “AI helped me ship faster,” but as a status/identity statement (“I don’t write code anymore”), which is starting to function like a meme-able proof point rather than a workflow description, per the 100% coding quotes.
• Why it matters for teams: This narrative tends to compress real differences between “agent writes most diffs” and “engineer stops caring about code quality,” a tension that immediately triggers backlash in adjacent threads (covered separately in the backlash post).
Backlash grows: code quality still matters even with coding agents
Code quality backlash: A counter-thread argues the “I’ve moved on from coding” posture is performative, and that serious builders still “think about code as much if not more than ever,” as stated in the backlash post.
• Agent constraint argument: The concrete claim is that coding agents degrade as the codebase grows and that LLM output is often verbose, so quality and readability remain binding constraints on agent throughput—“put them together and you have the answer,” as the codebase constraint claim puts it.
• Cultural split: The thread frames this as the same old fight (people resisting codebase hygiene), now rebranded as agent-era fatalism, per the follow-up critique.
Identity loss post spreads: pride in writing code replaced by AI output
Engineer identity shift: A widely shared confession describes “loss of identity” as a software engineer—“the act of writing code” was part of self-image, and watching AI do in seconds what took hours triggers “relief and mourning, awe and anxiety,” as quoted in the identity loss quote.
This is less about tool performance and more about the social meaning of “craft,” which shows up as a second-order effect of agent adoption (morale, hiring narratives, and what gets rewarded internally), per the framing in the identity loss quote.
Hassabis to undergrads: get fluent with AI tools, not internships
Demis Hassabis (DeepMind): A clip circulating today summarizes Hassabis advising undergraduates that becoming “unbelievably proficient with AI tools” can be more valuable than traditional internships for getting into a profession, as relayed in the career advice clip.

The practical implication for early-career folks is that “tool fluency” is being positioned as a career moat by top lab leadership, not just a productivity hack, per the framing in the career advice clip.
“Software-first singularity”: the claim that the shift already happened
Software-first singularity: A meme format argues the shift to AI-driven software creation “came and went” and “no one noticed,” anchored by the same “100%, I don't write code anymore” screenshot that’s being used as evidence across the timeline in the singularity screenshot.
This lands as a narrative compression: it reframes a messy transition (partial delegation, review, supervision) as a completed phase change, using a single quotable line as the proof artifact, as shown in the singularity screenshot.
🤖 Embodied AI signals: humanoid timelines, robot services, and drone swarms
Robotics chatter today mixed near-term humanoid optimism with concrete deployment stats and military drone-swarm stories. Excludes purely speculative AGI debate (covered under reasoning/training).
Demis Hassabis puts humanoid robot progress on a 12–18 month clock
Humanoid robots (DeepMind): Demis Hassabis is quoted claiming we’re “12–18 months away” from a “critical moment” where key humanoid-robot problems get solved, framing timelines in months rather than years, as shown in the Hassabis timeline clip.

The implication for builders is less about a single model milestone and more about system integration becoming the pacing item—perception, control, reliability, and deployment constraints converging faster than typical hardware refresh cycles.
Humanoid robot deployments: ~16k in 2025, China >80%, >100k projected by 2027
Humanoid market (Counterpoint via rohanpaul_ai): A widely shared market snapshot claims ~16,000 humanoid robots were installed globally in 2025, with China accounting for >80% of installs; cumulative installs are projected to exceed 100,000 by 2027, according to Market stats thread.
The same thread also points at near-term commercialization vectors—sub-$1,600 entry models, “robots-as-a-service” rental, and larger-scale production plans—as part of why deployment could accelerate quickly, per Market stats thread.
China is training drone swarms using predator hunting behaviors, per WSJ
Drone swarms (China military R&D): A WSJ-reported effort describes training AI-driven drones using predator-inspired behaviors (hawks/coyotes/wolves) for coordinated pursuit/attack and intercept patterns, as summarized in WSJ swarm summary.
The story matters operationally because it’s a reminder that multi-agent coordination is not just a software metaphor; it’s being treated as a learnable control policy with explicit adversarial pressure (jamming, deception, interception), per WSJ swarm summary.
Hundreds of drones fell from the sky in China; operators cite unknown cause
Drone swarm reliability (Field incident): Footage shows hundreds of drones dropping out of the sky nearly simultaneously; initial blame on police jamming shifted to “unknown” or operator error, according to Mass drone drop clip.

For engineers tracking swarm systems, this is a real-world reminder that RF links, control handoffs, and failsafe behavior dominate perceived safety more than lab-level autonomy demos, as implied by Mass drone drop clip.
Rifle-mounted robots appear in India’s Republic Day rehearsal footage
Military robotics (India): Video from India’s Republic Day rehearsal shows tracked robots with mounted rifles moving in formation, as shown in Rehearsal robot clip.

It’s a deployment signal: even when autonomy is unclear, platformization (mobility + payload + comms) is moving into public-facing exercises, per Rehearsal robot clip.
Verobotics shows façade-climbing robots for exterior cleaning and inspection
Verobotics (Facade robots): A field demo shows robots adhering to and moving along building exteriors to clean and scan façades, positioning robotics as a replacement for hazardous rope-access work, as shown in Facade robot demo.

This is a concrete “embodied AI” wedge: narrow task scope, clear ROI, and deployment in an environment where autonomy can be bounded (repeatable surfaces, constrained routes).
🎨 Generative media workflows: AI influencers, video consistency tricks, and restoration prompts
Creator-side gen media remained active: AI influencer monetization workflows, repeatability tricks for video, and long restoration prompts. Excludes voice agents (separate category).
Grid prompting plus start/end frames is being used for more consistent AI video
AI video consistency (Technique): A practical recipe is being shared for keeping generated video more stable by combining grid prompting with explicit start and end frames, with results shown in the consistency demo; the example claims it was made using Nano Banana Pro and Kling 2.6, which matters because it’s a cross-model workflow rather than a model-specific feature.

• Why it works (mechanically): The method constrains the model’s degrees of freedom twice—first with a structured prompt grid, then with boundary conditions via first/last frames—per the consistency demo.
A long “master shot” prompt template is spreading for photo restoration workflows
Photo restoration prompt (Template): A detailed, reusable prompt template is being shared for “complete photographic restoration and high-end upgrade,” emphasizing strict reference preservation plus cinematic lighting, texture upgrades, lens simulation, and film-style color grading, as shown in the prompt screenshot. It’s positioned for Nano Banana Pro (via Freepik) and is shared as a screenshot because of its length, per the prompt screenshot.
• Template shape: The prompt is structured like a spec—directive, “critical reference handling,” and explicit visual upgrade requirements—which is useful for teams trying to standardize restoration outputs across operators, per the prompt screenshot.
Higgsfield pitches a 10‑minute “AI influencers” monetization playbook for 2026
AI influencers workflow (Higgsfield): A short-form “make MILLIONS with AI Influencers in 2026” pitch is circulating as a quick-start workflow, framed as a “10 minute guide” plus a “2026 playbook” link in the guide thread; the thread also uses an incentive mechanic (“retweet & reply for 50 credits”), which is a common growth loop in creator tool distribution.

• What’s actionable here: The artifact to evaluate is the packaging (short guide + playbook + incentives) rather than any verified ROI claims, which are not substantiated in the post itself per the guide thread.
A “cinematic AI social content” workflow is getting framed as an AI-influencer edge
AI influencer content pipeline (ProperPrompter): A creator-focused thread argues there’s “massive opportunity” in AI influencers and points to a “full workflow + secrets for cinematic AI social media content” in an associated article, as described in the workflow pitch. The operational takeaway for builders is that creator demand is clustering around repeatable “cinema” aesthetics and distribution playbooks—regardless of whether the monetization outcomes are reproducible.
• Positioning signal: The thread explicitly calls out the “people dismiss it as cringe” objection, which suggests the real bottleneck is social acceptability/brand risk rather than generation capability, per the workflow pitch.
Personalized wallpaper generation is emerging as a lightweight Nano Banana Pro use case
Nano Banana Pro (Use case): A small but concrete workflow is being shared for generating personalized wallpapers, illustrated with a sample output in the wallpaper example. It’s a low-friction “asset factory” pattern: one prompt → many background variants, which is often where consumer gen-media tools first stick.
• Product signal: The post frames wallpapers as a “fun use case,” suggesting the workflow’s value is quick iteration and personal taste matching rather than photoreal fidelity, per the wallpaper example.
📚 The browser as a sandbox: web-native containment patterns for agent apps
A smaller but high-signal devex thread: treating the browser as the sandbox for agentic apps, with concrete notes on iframe sandboxing/CSP and directory upload primitives. Excludes repo-local prompt rules (coding-workflows) and MCP plumbing (orchestration).
Browser sandboxing patterns for agent apps: iframe sandbox meets CSP
Browser sandbox containment: Simon Willison published deeper notes on “the browser is the sandbox,” focusing on how web-native agent apps can treat the browser as the containment boundary—especially the tricky intersection of <iframe sandbox> and Content Security Policy (CSP) for constraining what untrusted agent-generated UI/code can do, as described in his follow-up notes and earlier containment thread.
The write-up frames a practical decomposition of “sandbox” into (1) file access, (2) network access, and (3) safe code execution—then explores how browser primitives (nested iframes, CSP headers, workers/Wasm) can be composed to approximate a secure agent runtime, with implementation details in the blog post.
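As a rough illustration of the composition he describes, the sketch below serves a host page whose Content-Security-Policy restricts network access and drops untrusted, agent-generated HTML into an <iframe sandbox srcdoc=...>; srcdoc documents inherit the parent page’s CSP, and the sandbox attribute (without allow-same-origin) gives them an opaque origin. This is a stand-in written for this summary, not code from the posts, and the exact policy depends on what the embedded app legitimately needs.

```python
# Illustrative stand-in (not from the posts): host page with a restrictive CSP,
# untrusted agent HTML confined to a sandboxed srcdoc iframe.
import html
from http.server import BaseHTTPRequestHandler, HTTPServer

UNTRUSTED_HTML = "<h1>agent output</h1><script>document.title = 'hello'</script>"

HOST_PAGE = f"""<!doctype html>
<html><body>
<iframe sandbox="allow-scripts"
        srcdoc="{html.escape(UNTRUSTED_HTML, quote=True)}"></iframe>
</body></html>"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        # Host-page CSP; the srcdoc iframe inherits it, so the embedded script
        # cannot fetch arbitrary origins even though allow-scripts lets it run.
        self.send_header("Content-Security-Policy", "default-src 'self'")
        self.end_headers()
        self.wfile.write(HOST_PAGE.encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```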
webkitdirectory becomes a viable primitive for web agent file access
Directory upload primitive: A small but useful browser capability is resurfacing: <input webkitdirectory> now works across Chrome, Firefox, and Safari, enabling “select a folder” flows for web UIs without requiring full File System Access API permissions, as noted in directory demo and referenced from the broader sandboxing discussion in containment thread.
For agentic web apps, this makes it easier to build a “bring your repo/docs” interface where the browser can provide a bounded, user-approved file corpus; Simon’s demo shows folder enumeration + file tree/preview in the directory explorer.
Rich Markdown renderer surfaces as a simple terminal UX upgrade
Rich (Python): The Markdown() renderer in Rich is getting shared as a low-effort way to make terminal outputs (agent reports, eval summaries, logs) more readable, per the Rich Markdown tip linking to the Rich docs.
This shows up as a recurring devex move in agent-heavy workflows: keep the runtime in a TUI/CLI, but present intermediate artifacts (plans, diffs, checklists) in Markdown with syntax highlighting and formatting rather than raw text.
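For reference, basic usage is just wrapping a Markdown string and printing it through a Console; the report text here is a made-up example.

```python
from rich.console import Console
from rich.markdown import Markdown

# Hypothetical agent-run summary; any Markdown string works here.
report = """# Agent run summary

- **Plan:** refactor the parser, then re-run the eval suite
- **Result:** 42/45 cases passing, 3 flagged for review
"""

console = Console()
console.print(Markdown(report))  # renders headings, lists, emphasis, etc. in the terminal
```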
🎧 Voice is speeding up: real-time cloning claims and latency as the UX unlock
Voice items today were mostly about latency and open-source acceleration (real-time cloning, fast TTS UIs), plus a few “voice mode feels smoother” reactions. Excludes creative media pipelines (gen-media category).
VoxCPM claims real-time voice cloning without tokenization
VoxCPM (OpenBMB): An OpenBMB retweet claims an open-source TTS system can clone a human voice “in real time without tokenization,” pointing at VoxCPM as the core idea, per the real-time cloning claim. This is part of the ongoing shift away from discretized audio token pipelines toward continuous/“tokenizer-free” designs.
There are no evals, latency numbers, or reproducible benchmarks in the tweet itself, so treat it as an announcement-level signal until a reference implementation and measurements are circulated.
Voice AI feels different as latency approaches zero
Voice latency (UX): A builder framing says voice AI is about to “have a moment” because the experience shifts nonlinearly as latency approaches ~0, per the latency claim. This is the practical argument that speed (not just MOS or benchmark scores) is what makes voice agents feel conversational enough to replace taps/typing.
The claim is qualitative (no numbers or measurements in the tweet), but it matches what teams see in real deployments: smaller cuts in tail latency often change turn-taking, interruption handling, and user trust more than model upgrades.
ChatGPT voice mode feedback: smoother delivery and less robotic sound
ChatGPT voice mode (OpenAI): An anecdotal reaction says the “new ChatGPT voice mode” sounds gentler, avoids audio peaks better, and feels less robotic, per the voice mode reaction. That’s consistent with improvements in prosody control and output-level audio dynamics (often as important as the underlying text model for perceived quality).
No clips or A/B measurements are attached in today’s tweet, so it’s sentiment—not a spec drop.
Qwen3-TTS gets a one-click local Gradio UI
Qwen3-TTS (runtime UX): A community Gradio web UI is being shared as a “1-click” way to run Qwen3-TTS locally on a PC, according to the Gradio UI retweet. The practical value here is lowering the “demo friction” for teams that want to compare open TTS quickly without wiring their own inference harness.
This is a packaging/workflow update rather than a model update; it doesn’t change Qwen3-TTS capability, but it changes who can actually try it.
TTS acceleration chatter: tooling is moving faster than evaluation norms
TTS pace (ecosystem): A retweeted roundup warns that “text-to-speech is moving way too fast,” pointing to rapid successive open releases and demos as the main theme, per the pace warning. For engineering leaders, the immediate implication is procurement and trust: model selection, safety checks, and voice-rights policies tend to lag behind what can be shipped.
The post is directionally useful, but it’s not tied to a single new benchmark or standardized test artifact in today’s tweets.
🧑‍🏫 Where builders are learning: eval workshops, agent lists, and coworking formats
Today’s learning/distribution artifacts were mostly workshops and curated lists (agents/evals resources) plus local coworking formats. Excludes product changelogs (kept in tool categories).
Braintrust schedules “Trace” in SF to teach PMs how to ship evals
Trace (Braintrust): Braintrust is promoting an in-person-only SF event on Feb 25 focused on getting teams to “ship your first eval,” including an evals workshop aimed at PMs, as described in the workshop invite.

The listing emphasizes a hands-on format (workshop + intro to evals for PMs) and registration details on the event page.
A compact reading list for agentic reasoning, evals, and governance
Agentic reasoning reading list (TheTuringPost): A curated “7 sources” list is getting shared as a quick way to onboard to agentic reasoning—mixing practitioner guides, surveys, and governance material, as shown in the seven sources list.
• Practical build guidance: The list explicitly includes an OpenAI “practical guide to building agents,” as shown in the seven sources list.
• Governance artifact: It also points to IMDA’s “Model AI Governance Framework for Agentic AI” (versioned doc), as shown in the seven sources list.
Gemini Batch API tutorial targets cheaper large-scale eval runs
Gemini Batch API (Google AI): Paige Bailey published a walk-through showing how to run large evals using Gemini’s Batch API with Hugging Face Datasets and a Colab notebook—framed as costing “50% less” for massive eval runs, as linked in the blog and notebook links.

• Code-level entry point: The Batch API snippet (inline requests + batch job creation) is shown in the Batch API example; a hedged sketch of that flow follows below.
• Repro artifacts: The tutorial text and runnable notebook are linked as the blog post and Colab notebook.
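For orientation, the inline-requests flow looks roughly like the sketch below with the google-genai Python SDK; the model name, display name, and prompts are placeholders, so defer to the linked blog post and notebook for the exact, current calls (including how to poll the job and fetch results).

```python
# Hedged sketch of creating a Gemini batch job from inline requests; placeholders
# throughout -- the linked tutorial is the authoritative version.
from google import genai

client = genai.Client()  # expects an API key in the environment

inline_requests = [
    {"contents": [{"role": "user", "parts": [{"text": f"Grade answer {i} on a 1-5 scale."}]}]}
    for i in range(3)
]

batch_job = client.batches.create(
    model="models/gemini-2.5-flash",
    src=inline_requests,
    config={"display_name": "eval-batch-sketch"},
)
print(batch_job.name)  # poll this job name until the batch completes
```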
RLHF Book expands with minimal single-GPU RL and RM code
RLHF Book (natolambert): The RLHF Book repo added “single GPU, tinkering scripts” covering multiple RL and reward-model variants (including REINFORCE, PPO, GRPO-family entries, plus PRM/ORM), positioned as a more central learning home for post-training, as described in the scripts rundown.
The code entry point is linked via the GitHub code directory.
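To give a sense of what “single GPU, tinkering scripts” means in practice, here is a toy REINFORCE-style update against reward-model-like scores; it is not code from the repo, and a real script would sample completions from an LM policy and score them with a learned reward model.

```python
# Toy single-batch REINFORCE update; the linear "policy" and random "rewards"
# are stand-ins for an LM policy and reward-model scores (not RLHF Book code).
import torch
import torch.nn as nn

policy = nn.Linear(8, 4)                 # stand-in policy over 4 "actions"
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

contexts = torch.randn(16, 8)            # stand-in prompts
dist = torch.distributions.Categorical(logits=policy(contexts))
actions = dist.sample()                  # stand-in sampled completions
rewards = torch.randn(16)                # stand-in reward-model scores

# REINFORCE: raise log-prob of sampled actions in proportion to centered reward.
advantage = rewards - rewards.mean()
loss = -(dist.log_prob(actions) * advantage).mean()

opt.zero_grad()
loss.backward()
opt.step()
```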
A recurring coworking format for agentic side projects (Posthog-supported)
Code and Chill (Ivan + Posthog): A second coworking session is being organized (Singapore) as a build-in-public format for agentic projects—framed as people showing up to hack on things like “excel-headless browsers” and learning/absorption tooling, per the coworking invite.
Signup logistics are posted on the event page.