ChatGPT tests ads for Free and $8 Go – answers ‘independent’

In the coming weeks, we plan to start testing ads in ChatGPT free and Go tiers. We’re sharing our principles early on how we’ll approach ads–guided by putting user trust and transparency first as we work to make AI accessible to everyone. What matters most: - Responses in…

6:00 PM · Jan 16, 2026

ChatGPT ad mockups show sponsored cards beneath answers

ChatGPT ads UI (OpenAI): OpenAI shared early examples of ad formats that appear after the main answer (e.g., a “Sponsored” product card shown under dinner-party suggestions), per the ad format example. The separation is positioned as the core UX constraint: ads are “always separate and clearly labeled,” in line with the stated principles in the ad principles thread.

This mockup matters because it sets expectations for where monetization can live in an assistant UI (below answers, not blended into them), and implies a commerce-style card format rather than pre-roll or interstitial placements.

OpenAI

@OpenAI

Pliny the Liberator 🐉

Replying to @OpenAI

Here's an example of what the first ad formats we plan to test could look like.

6:00 PM · Jan 16, 2026

Users press OpenAI on conversation-based ad targeting boundaries

Conversation privacy questions (ChatGPT): A high-signal concern in replies is whether ad targeting will use conversation content or metadata. One reply asks explicitly whether users should expect their conversation content to be used to “teach an AI to sell them stuff,” per the targeting question.

OpenAI’s published principles say conversations are kept “private from advertisers,” and users can turn off personalization and clear data used for ads, as shown in the ad principles screenshot and echoed in the ad principles thread. What the tweets leave unresolved is the practical boundary between “private from advertisers” and “used internally for targeting.”

@elder_plinius

Replying to @sama

I noticed you didn’t exclude targeted ads based on conversation history. can users expect their conversation content and/or metadata to be used to optimally teach an AI to sell them stuff, essentially?

10:02 PM · Jan 16, 2026

Altman’s “ads as last resort” quote resurfaces amid ChatGPT ad testing

OpenAI positioning (ChatGPT): A clip of Sam Altman saying he thinks of ads as a “last resort” business model is being recirculated as OpenAI moves into ad testing, per the resurfaced interview clip.

The contrast with Altman’s current framing is explicit: ads may be needed because “a lot of people want to use a lot of AI and don’t want to pay,” as written in the Altman statement.

Tom Warren

@tomwarren

"I kind of think of ads as like a last resort for us as a business model," - Sam Altman, October 2024

Sam Altman

@sama

We are starting to test ads in ChatGPT free and Go (new $8/month option) tiers. Here are our principles. Most importantly, we will not accept money to influence the answer ChatGPT gives you, and we keep your conversations private from advertisers. It is clear to us that a lot

10:48 PM · Jan 16, 2026

OpenAI frames ads as funding free and low-cost ChatGPT access

Accessibility framing (OpenAI): OpenAI messaging describes ads as supporting “making AI accessible to everyone” by helping keep ChatGPT available at free and affordable price points, with testing planned “in the US soon,” as shown in the in-app style notice.

That framing sits alongside the explicit tier separation (Plus/Pro/Business/Enterprise ad-free) stated in the ad principles thread. It’s a straightforward subsidy argument, with the operational detail being that the ad-exposed tiers include Free and Go.

AshutoshShrivastava

@ai_for_success

🚨 OpenAI We plan to begin testing ads in the free tier and ChatGPT Go in the US soon. IT'S SO OVER....

6:45 PM · Jan 16, 2026

Wired details ChatGPT ads: topic matching, aggregate metrics, opt-out

Ad targeting mechanics (OpenAI): A Wired report says OpenAI’s initial ad tests will match ads to conversation topics; ads show only for Free and $8/month Go users; advertisers receive aggregate performance metrics rather than individual user data, per the Wired explainer highlighted in the Wired source post. It also describes controls around opting out of personalization while keeping other personalization features.

The Wired framing aligns with OpenAI’s claim that “responses won’t be influenced by ads” and that “your conversations are private from advertisers,” as stated in the ad principles thread.
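
To make the “aggregate metrics, not individual user data” distinction concrete, here is a minimal Python sketch. The event fields, ad IDs, and the aggregate_ad_metrics helper are hypothetical and are not drawn from OpenAI or Wired; the sketch only shows per-ad totals being computed while the user identifier is left out of the result.

```python
from collections import defaultdict

def aggregate_ad_metrics(events):
    """Roll raw ad events up into per-ad totals, discarding user identifiers."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for event in events:
        bucket = totals[event["ad_id"]]
        bucket["impressions"] += 1
        if event["clicked"]:
            bucket["clicks"] += 1
    # Only ad-level counts leave this function; user_id is never copied out.
    return {ad_id: dict(counts) for ad_id, counts in totals.items()}

# Hypothetical raw events, as an ad platform might log them internally.
raw_events = [
    {"user_id": "u1", "ad_id": "grocery-kit", "clicked": True},
    {"user_id": "u2", "ad_id": "grocery-kit", "clicked": False},
    {"user_id": "u1", "ad_id": "hot-sauce", "clicked": False},
]

report = aggregate_ad_metrics(raw_events)
print(report)
```

Whether a real system would also suppress very small buckets is not stated in the source, so the sketch does not attempt it.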

Chubby♨️

@kimmonismus

Replying to @kimmonismus

Source: wired.com/story/openai-t…

6:49 PM · Jan 16, 2026

Community skepticism focuses on incentives and durability of ad principles

Trust & incentives (ChatGPT ads): A recurring reaction is skepticism that ad-related principles will survive contact with revenue incentives, e.g., concerns that “one day these ‘ad principles’ disappear” as advertisers pay more for frequency, per the skeptical post.

Even among people accepting the rollout as “fine,” the debate centers on whether “answer independence” can remain credible long-term once there’s a monetization loop, with the principle itself repeatedly emphasized in OpenAI’s materials, including the answer independence snippet.

Lisan al Gaib

@scaling01

now wait until they find out that advertisers pay more when you show their products more often and one day these "ad principles" disappear from their website

OpenAI

@OpenAI

6:46 PM · Jan 16, 2026

🧑‍💻 OpenAI Codex tooling: real-time steering, speed push, and ‘very fast Codex’ teasers

Codex-centric updates and practitioner comparisons: multiple posts on new CLI steering, speed optimizations, and early quality anecdotes vs Claude Code. Excludes ChatGPT ads/Go and non-Codex OpenAI legal dispute coverage.

Codex CLI adds mid-turn steering without interrupting (experimental toggle)

Codex CLI (OpenAI): Codex CLI can now be steered while a task is running, without a hard interrupt, so you can correct course and watch the agent adapt in “almost real time.” This builds on the experimental steer toggle, as described in the mid-turn steering note.

The mechanics showing up in the CLI UI are practical: the Steer conversation toggle changes input behavior so that Enter submits immediately while work is in progress and Tab queues messages, per the pro tip in the terminal toggle screenshot. The net effect is fewer “panic Ctrl+C” aborts and more incremental course correction while Codex is already deep in a repo.
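
The difference between steering and queueing can be sketched in a few lines of Python. This is a toy model rather than Codex CLI’s implementation: run_turn, steer_inbox, and queued are hypothetical names, and the only behavior it borrows from the posts is that a steered message lands at the next step boundary while a queued one waits for the turn to finish.

```python
from collections import deque

def run_turn(steps, steer_inbox, queued):
    """Run one agent turn, folding steering messages in between steps.

    steps: step names standing in for the agent's planned work.
    steer_inbox: maps a step index to a message submitted (Enter) while
        that step was running.
    queued: messages stacked with Tab, held back until the turn finishes.
    """
    transcript = []
    for index, step in enumerate(steps):
        transcript.append(f"step:{step}")
        # Steering arrives mid-turn and is applied before the next step,
        # without aborting the work already done.
        if index in steer_inbox:
            transcript.append(f"steer:{steer_inbox[index]}")
    # Queued messages only become new input after the turn completes.
    pending = deque(queued)
    while pending:
        transcript.append(f"queued:{pending.popleft()}")
    return transcript

log = run_turn(
    steps=["read repo", "edit file", "run tests"],
    steer_inbox={0: "use the v2 API instead"},
    queued=["then update the changelog"],
)
print(log)
```

In this model the steering message takes effect before "edit file" runs, which is the course correction that would otherwise have required an abort and restart.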

Tibo

@thsottiaux

Within the CLI, you can now steer codex mid-turn without interrupting and watch the agent adapt in almost real time. Enable in /experimental

am.will

@LLMJunky

Codex team is at it again with just another insanely useful feature. If you see your agent going off the rails, or needing some addt'l context, you no longer need to stop the agent. Follow up prompts while the agent is working now inserts the prompt at the next thinking step,

8:08 AM · Jan 16, 2026

Sam Altman teases “very fast Codex” as OpenAI pushes speed improvements

Codex (OpenAI): Multiple OpenAI-side signals point to a speed push: Sam Altman posts “Very fast Codex coming!” in the very fast teaser, while Codex team members echo “we heard you wanted faster codex” in the speed-focused post and hint at “something speedy” in the speedy hint.

The framing isn’t just speed for its own sake: Altman also pairs speed with a “higher level of intelligence” in the speed-plus-intelligence post, which reads like a roadmap promise that future Codex defaults may move up the quality-latency curve rather than only shaving seconds.

Sam Altman

@sama

Very fast Codex coming!

Cerebras

@cerebras

OpenAI🤝Cerebras openai.com/index/cerebras…

7:21 PM · Jan 16, 2026

Codex CLI experiments with shell snapshotting to reduce startup overhead

Codex CLI (OpenAI): The CLI’s /experimental menu now surfaces Shell snapshotting, pitched as a way to make Codex faster by snapshotting your shell environment so it doesn’t re-run login scripts on every command, as shown in the CLI screenshot from the shell snapshot tip.

This is a small but telling knob: it targets repeated per-command setup cost (environment init) rather than model latency, and it suggests the team is still chipping away at “time to first useful tool call” inside the terminal loop.
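
The general idea of paying the environment-setup cost once can be sketched as follows. This is an illustration of the caching pattern only, not how Codex CLI implements it: ShellSnapshot and CAPTURE are hypothetical, and a short Python one-liner stands in for “start a login shell and dump its environment.”

```python
import json
import subprocess
import sys

class ShellSnapshot:
    """Capture an environment once, then reuse it for every later command."""

    def __init__(self, capture_cmd):
        self.capture_cmd = capture_cmd
        self.captures = 0   # how many times the expensive capture ran
        self._env = None

    def env(self):
        if self._env is None:
            # Expensive path: in a real CLI this would be a login shell
            # sourcing profile scripts; here it is a stand-in command.
            out = subprocess.run(
                self.capture_cmd, capture_output=True, text=True, check=True
            ).stdout
            self._env = json.loads(out)
            self.captures += 1
        return self._env

    def run(self, argv):
        # Cheap path: start the command directly with the cached environment.
        return subprocess.run(
            argv, env=self.env(), capture_output=True, text=True, check=True
        ).stdout

# Stand-in for "start a login shell and dump its environment as JSON".
CAPTURE = [sys.executable, "-c",
           "import json, os; print(json.dumps(dict(os.environ)))"]

snapshot = ShellSnapshot(CAPTURE)
first = snapshot.run([sys.executable, "-c", "print('one')"])
second = snapshot.run([sys.executable, "-c", "print('two')"])
print(first.strip(), second.strip(), snapshot.captures)
```

The trade-off of any such cache is staleness: changes made to login scripts after the snapshot would not be seen until it is refreshed.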

Ian Nuttall

@iannuttall

Codex CLI pro tip: run /experimental and enable Steer conversation Now you can hit tab to stack up messages just like before, or hit enter to submit immediately and steer in real time no more ctrl+c to frantically stop them making a mistake! just type and hit enter

8:35 AM · Jan 16, 2026

Developers report choosing GPT-5.2 Codex xhigh over Claude Code Opus 4.5

GPT-5.2 Codex xhigh (OpenAI): A repeated side-by-side pattern is showing up in practitioner anecdotes: running the same prompt in Codex (GPT‑5.2 Codex xhigh) and Claude Code (Opus 4.5), then continuing with Codex because the initial output is better, as stated in the side-by-side comparison note.

This is a narrow datapoint (no shared prompt, repo, or eval artifact in the tweets), but it’s a clean “first-response quality” claim, which is often what decides which agent you keep driving when you’re juggling two terminals.

Peter Gostev

@petergostev

It's been 3 times in a row now when I had Codex (GPT 5.2 Codex xhigh) and Claude Code (Opus 4.5) side by side, starting with the same prompt. Continued with Codex all three times based on the quality of the initial output.

4:28 PM · Jan 16, 2026

“Smarter vs faster” model choice reshapes how teams schedule agent work

Workflow shift (GPT-5.2 vs Opus 4.5): One concrete behavioral change being reported is that switching from Opus to GPT‑5.2 changes the workday from long “deep work” blocks to shorter cycles spent prompting and steering async agents, as described in the workflow reflection.

The same thread emphasizes less time reviewing and more time writing longer prompts and stacking PRs, per the review time note. In Codex contexts, this maps directly to tool UX: better steering and lower latency make “30-minute steering windows” viable, while slow-but-smart settings tend to force longer uninterrupted blocks.

Adam

@adamdotdev

It’s crazy how much this smarter vs faster spectrum impacts dev workflow. I switched from opus to 5.2 because I got hooked on “smarter” and it’s changed my work day completely. Instead of needing 3 hour blocks of time to get in the zone and accomplish something, now I want more…

Andrej Karpathy

@karpathy

I hope this is not my fault. It's definitely very smart so a little bit faster would be good now. x.com/karpathy/statu…

12:37 PM · Jan 16, 2026

Prompting heuristic: adding “spec” reportedly makes Codex planning 10× better

Prompting pattern (Codex): A simple planning hack is circulating: Codex “plans get 10× better if the word ‘spec’ is in the input,” per the spec keyword tip.

This reads like a routing cue that nudges the model into a more structured “requirements → plan → implementation” mode rather than jumping straight to edits. The tweets offer no controlled comparison or counterexample, so the 10× figure is anecdotal, but the tip is cheap to try and people are already adopting it in agent instructions and task templates.

jason liu

@jxnlco

codex plans get 10x better if the word 'spec' is in the input not sure why

9:56 PM · Jan 16, 2026

Why GPT‑5.2 feels slow in Codex: deep repo exploration vs supplied context

Codex performance perception: A specific explanation for “GPT‑5.2 is slow” shows up repeatedly: when run inside Codex, it may spend time exploring the codebase and gathering context; if you provide the necessary context up front, responses can arrive much faster, per the slowness explanation.

The same point is reinforced as a Q&A-style takeaway in the follow-up question. It’s a reminder that some latency is self-inflicted by an agent harness doing the right thing (searching/reading), and the speed/quality trade can be shifted by how much context you hand it.

eric provencher

@pvncher

Replying to @pvncher

People think gpt5.2 is insanely slow because they’re always prompting it inside codex, where it has look around to such depths that it takes forever to return. If you instead just give it all the context it needs, it answers in a fraction of the time.

12:31 AM · Jan 17, 2026

🧑‍🤝‍🧑 Claude Cowork expands to Pro: workflow UX, connectors, and safety guardrails

Cowork is the other cross-account product spike today: rollout to Pro users, connector fixes, and early feedback on limits and safety behaviors. Excludes MCP-specific integration details (covered under Orchestration/MCP).

Claude Cowork expands to Pro subscribers with session renaming and connector fixes

Claude Cowork (Anthropic): Following up on the early leak, Cowork is now available to Claude Pro subscribers. Anthropic still frames it as a “research preview” but notes it has already shipped session renaming plus connector improvements and bug fixes, as described in the Pro rollout note.

Usage limits are now a first-order constraint: Anthropic warns that because Cowork handles more complex work, Pro users “may hit their usage limit sooner,” according to the Limit warning.

Access is currently pointed at the macOS app download flow via the Desktop download page, and the team is explicitly asking for feedback while iterating quickly, per the Daily updates request.

Claude

@claudeai

Claude Cowork is now available to Pro subscribers.

Claude

@claudeai

Introducing Cowork: Claude Code for the rest of your work. Cowork lets you complete non-technical tasks much like how developers use Claude Code.

5:28 PM · Jan 16, 2026

Cowork tightens safety UX: explicit permission before deletions

Cowork safety UX (Anthropic): A new safety behavior is being rolled into Cowork where Claude should “always request explicit permission before deleting” items, as surfaced in the Deletion permission change.

The Cowork team is also signaling broader guardrail tightening based on early user feedback, with “a big safety improvement” called out by a product lead in the Safety improvement note.

What’s not yet clear from today’s posts is whether this permission gate is enforced uniformly across all connected surfaces or only for specific actions.
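
A permission gate of this kind can be sketched as a wrapper that refuses to delete unless a confirmation callback approves. This is a generic Python illustration, not Cowork’s code: guarded_delete and the confirm callback are hypothetical, and the lambdas stand in for an interactive prompt.

```python
import os
import tempfile

def guarded_delete(path, confirm):
    """Delete path only if the confirm callback explicitly approves it."""
    if not confirm(path):
        return False          # no approval: leave the file untouched
    os.remove(path)
    return True

# Demo with a throwaway file; the lambdas stand in for a user prompt.
fd, target = tempfile.mkstemp()
os.close(fd)

denied = guarded_delete(target, confirm=lambda p: False)
still_there = os.path.exists(target)
approved = guarded_delete(target, confirm=lambda p: True)
gone = not os.path.exists(target)
print(denied, still_there, approved, gone)
```

Routing every destructive call through one function like this is what would make uniform enforcement checkable, which is the open question noted above.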

Felix Rieseberg

@felixrieseberg

More Cowork improvements shipped today! • We've taught Claude to always request explicit permission before deleting anything • Add new folders mid-conversation without starting over • Create new folders in the directory picker • Give feedback with 👍 & 👎 buttons • Smarter…

9:13 PM · Jan 16, 2026

Early Cowork users say the Slack-email-docs agent loop finally feels coherent

Cowork adoption sentiment (Anthropic): Early users are describing Cowork as the first time a “Slack + Email + Docs + Agent Loop” has actually worked in practice, with a notably strong endorsement in the Cowork believer quote.

This “it finally fits together” feeling is showing up right as Pro access opens, as echoed by multiple “Cowork is now available for Pro” confirmations in the Availability confirmation and the Pro rollout note.

Thariq

@trq212

I am a Cowork believer, it is somehow the first time "Slack + Email + Docs + Agent Loop" has actually worked and we're just getting started

Claude

@claudeai

Claude Cowork is now available to Pro subscribers.

6:19 PM · Jan 16, 2026

Local-first ‘Cowork’ pattern gains traction: run coworker workflows on-device

Local-first cowork pattern (Hugging Face): A parallel thread to Cowork’s cloud workflow is the push for Cowork-like orchestration that runs on local models “not to send all your data to a remote cloud,” as argued in the Local cowork demo.

The idea is getting visible distribution quickly—Hugging Face’s CEO notes it trending on the Hub in the Trending confirmation.

This doesn’t change Cowork itself, but it’s a clear competitive pressure line: same “agent coworker” UX goals, with privacy posture flipped.

clem 🤗

@ClementDelangue

Cowork but with local models not to send all your data to a remote cloud!

Claude

@claudeai

In Cowork, you give Claude access to a folder on your computer. Claude can then read, edit, or create files in that folder. Try it to create a spreadsheet from a pile of screenshots, or produce a first draft from scattered notes.

12:34 AM · Jan 17, 2026

Windows pressure builds as Cowork remains framed around the macOS app

Cowork platform availability (Anthropic): The Pro rollout continues to point people to “try it in the macOS app,” as stated in the macOS app callout, and users are increasingly asking when Cowork will be available on Windows.

That demand is explicit in replies like “Wen Windows?” in the Windows request and “Is there a Windows version?” in the Windows version question.

The public download entry point exists via the Desktop download page, but today’s Cowork-specific guidance still reads as macOS-first.

⚖️ OpenAI vs Elon lawsuit narrative: filings, diary excerpts, and ‘context’ rebuttals

Governance/legal storyline dominates non-product OpenAI discourse today: unsealed docs, competing narratives about nonprofit vs for-profit intent, and leadership responses. Excludes ad-monetization rollout (feature) and general policy changes on X (separate category).

OpenAI publishes “The truth Elon left out” rebuttal in Musk lawsuit

OpenAI (OpenAI): OpenAI published a point-by-point rebuttal arguing Elon Musk’s filing selectively quoted internal notes and omitted surrounding context, with Sam Altman framing the dispute as “cherry-picking” to make Greg Brockman look bad in the Altman context post and the linked Rebuttal post; the post also reiterates OpenAI’s current dual-entity structure (nonprofit controlling a PBC) and cites an equity value around $130B in the same Rebuttal post.

The central evidentiary move in the rebuttal is contrasting Musk’s quoted phrasing with original call notes—showing what was included vs left out, as illustrated in the Call notes image and amplified again via the Comparison repost.

Sam Altman

@sama

Replying to @sama

lots more here: openai.com/index/the-trut… elon is cherry-picking things to make greg look bad, but the full story is that elon was pushing for a new structure, and greg and ilya spent a lot of time trying to figure out if they could meet his demands.

9:15 PM · Jan 16, 2026

Unsealed court order keeps Musk claims alive and surfaces internal Brockman notes

OpenAI v. Musk (US District Court): A circulated court order indicates OpenAI’s motion for summary judgment was denied on Musk’s claims while Microsoft’s motion was granted only in part, per the Court order PDF shared in the Court order link.

The same unsealed materials also include excerpts from Brockman’s personal files—critics highlight lines about not “steal[ing] the non-profit” and references to “making the billions,” as shown in the Filing screenshot and repeated in the Diary quote tweet.

Deedy

@deedydas

Replying to @deedydas

Source: courthousenews.com/wp-content/upl…

8:08 AM · Jan 16, 2026

Greg Brockman says Musk demanded control; calls journal excerpt use “dishonest”

Greg Brockman (OpenAI): Brockman says Musk’s use of his private journal is “beyond dishonest,” and claims the snippets were about whether to accept Musk’s “draconian terms,” including demands for majority equity and control, as stated in the Brockman statement and expanded in the Brockman follow-up.

He also adds a process takeaway: OpenAI says it avoided publicly correcting Musk’s narratives “out of respect,” but now wants to tell “the real history” as the case proceeds, according to the Brockman follow-up.

Greg Brockman

@gdb

I have great respect for Elon, but the way he cherry-picked from my personal journal is beyond dishonest. Elon and we had agreed a for-profit was the next step for OpenAI's mission. The context shows these snippets were actually about whether to accept Elon's draconian terms.

Elon Musk

@elonmusk

They openly discuss their conspiracy to commit fraud and steal the charity

12:56 AM · Jan 17, 2026

CNBC: OpenAI warned investors to expect “deliberately outlandish” Musk claims

Investor communications (OpenAI): A CNBC screenshot claims OpenAI told investors and banking partners to brace for “deliberately outlandish” claims from Musk ahead of an April trial window, as shown in the CNBC screenshot discussion.

The tweet context also highlights a credibility meta-point—commentators note the wording is “outlandish” rather than “false,” reflecting how closely the investor-relations posture is being parsed in public discourse, per the CNBC screenshot discussion.

Lisan al Gaib

@scaling01

interesting choice of words they didn't say "false claims" from Musk and why should investors brace If there is nothing to those claims?

Deedy

@deedydas

“This is the only chance we have to get out from Elon. Is he the ‘glorious leader’ that I would pick? We truly have a chance to make that happen. Financially, what will take me to $1B?” – OpenAI President Greg Brockman’s diary [2017] Deep down, it really is about the money.

1:01 PM · Jan 16, 2026

Executives react to lawsuit discovery: private diaries and notes become evidence

Discovery risk (Org ops): A thread circulating an excerpt labeled “Brockman’s Personal Files—2017” highlights that private notes can become discoverable once litigation starts, with a widely shared screenshot emphasizing the “Financially, what will take me to $1B?” line in the Discovery excerpt.

The practical implication being debated is not model capability; it’s operational exposure—how internal intent documents can be recontextualized as public evidence when a governance dispute goes to court, as the Discovery excerpt illustrates.

ben

@benhylak

a nice reminder that even your secret diary is discoverable if you get sued.

Deedy

@deedydas

9:29 PM · Jan 16, 2026

Observers frame the dispute as a fight over philanthropic intent vs for‑profit transition

Narrative framing (Ecosystem): Independent commentary summarizes the case as a dispute over whether OpenAI was intended as a philanthropic project, who pushed for a for-profit transition first, and what financial motivations were in play, as laid out in the Dispute recap.

The same thread positions the newly surfaced filings as kicking off “the next phase” of the public argument rather than settling facts, signaling continued reputational and governance spillover while the legal process runs, per the Dispute recap.

Chubby♨️

@kimmonismus

The dispute between @elonmusk and @sama/@openai is entering its next phase. Once again, the question is: Was it intended as a philanthropic project, who first aimed to transform OpenAI into a for-profit organization, and for whose financial purposes? The disput startet this…

Sam Altman

@sama

🫠

9:38 PM · Jan 16, 2026

🕹️ Agent runners & operator tooling: Clawdbot, Ralph/Loom loops, Kilo for Slack, Browser Use, Scouts

Operational layer news: personal/organizational agent runners, multi-agent automation, and ‘agents running in the background’ patterns—lots of hands-on demos and productionization stories. Excludes SDK-only items (Agent Frameworks) and MCP plumbing (Orchestration/MCP).

Browser Use adds 1Password-backed cloud logins with 2FA/MFA handling

Browser Use (Browser Use): Browser Use says it can now automate logins “in the cloud securely with 1Password,” explicitly claiming this solves 2FA/MFA for agent-driven workflows. The tweet frames this as a blocker removal for cloud browser automation rather than a marginal UX tweak, per the 1Password login feature.

Browser Use

@browser_use

That's a lot of money... Browser Use can now automate logins in the cloud securely with @1Password 2FA/MFA is solved.. try it now!

Larsen Cundric

@larsencc

You can now see my @browser_use founding engineer salary (2FA/MFA auto-filled 👀)

2:45 AM · Jan 17, 2026

Clawd can boot Claude Code and run commands via PTY mode

Clawd (Clawdbot): A demo shows Clawd starting the Claude Code CLI “in PTY mode,” then issuing a command (claude "tell me a mass effect joke") and relaying the output back into chat, as shown in the PTY control screenshots.

This is a concrete example of a runner acting as an operator-of-operators—driving another agent tool as a subprocess—rather than only calling APIs, per the PTY control screenshots.
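
For readers unfamiliar with the pattern, here is a minimal POSIX-only Python sketch of driving a child process through a pseudo-terminal and capturing what it prints. It is not Clawd’s implementation: run_in_pty is a hypothetical helper, and a harmless Python one-liner stands in for the claude command so the example does not depend on any CLI being installed.

```python
import os
import pty
import subprocess
import sys

def run_in_pty(argv):
    """Run argv attached to a pseudo-terminal and return everything it printed."""
    master_fd, slave_fd = pty.openpty()
    try:
        proc = subprocess.Popen(
            argv, stdin=slave_fd, stdout=slave_fd, stderr=slave_fd
        )
    finally:
        # The child holds its own copy; the parent only needs the master side.
        os.close(slave_fd)
    chunks = []
    try:
        while True:
            try:
                data = os.read(master_fd, 1024)
            except OSError:
                break   # Linux reports EIO once the child side is closed
            if not data:
                break   # other platforms signal end of output with b""
            chunks.append(data)
    finally:
        os.close(master_fd)
        proc.wait()
    return b"".join(chunks).decode(errors="replace")

# The child reports whether it believes it is talking to a terminal.
output = run_in_pty([sys.executable, "-c",
                     "import sys; print(sys.stdout.isatty())"])
print(output.strip())
```

The point of the PTY is visible in the child’s answer: interactive CLIs often change behavior when stdout is not a terminal, so a runner that wants the tool’s normal interactive output gives it one.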

Clawd assuming control of claude

5:14 AM · Jan 17, 2026

Yutori Scouts ships realtime traces, natural-language feedback, and an X bot

Scouts (Yutori): Scouts shipped three operator-facing upgrades: a realtime agent “livestream” for watching actions as they run, natural-language feedback on reports (not just thumbs), and an X bot that turns mentions into a Scout and replies with updates, as announced in the Scouts updates thread.

• Feedback loop: Report-quality guidance can now be written in plain language and absorbed into Scout guidelines, as shown in the Feedback demo.
• X-triggered creation: Tagging @ScoutThisForMe creates a Scout and keeps updates in-thread, per the X bot announcement.

The operational theme is visibility + control in long-running monitoring agents, tied to concrete UI affordances rather than backend claims, per the Scouts updates thread.

Yutori

@yutori_ai

We shipped three big updates to Scouts this week — realtime agent livestream, natural language feedback on reports, a Twitter bot — and several QoL improvements. Details below 🧵 yutori.com/changelog/#jan…

5:09 PM · Jan 16, 2026

A “life of a packet” diagram shows Clawdbot’s cross-OS execution path

Clawdbot (Clawdbot): A shared architecture walkthrough maps a full request path from WSL to Windows host processes to a native sqlite3.dll write on NTFS—explicitly calling out agent IDs, TCP hops, temp files, and process spawning, as captured in the Architecture walkthrough.

The flow reads like an operator runbook (proxy → host router → spawned MCP server → temp JSON blob → CLI → native sqlite), which matters because debugging “agent did something weird” usually lands in exactly these seams, per the Architecture walkthrough.

Someone posted this on our Discord and I'm still marvelling the architecture. This is all one one machine.

4:48 AM · Jan 17, 2026

Clawdbot merges PR #1000 to stop SIGKILLing background jobs on abort

Clawdbot (Clawdbot): PR #1000 was merged to prevent abort handling from sending SIGKILL to backgrounded processes, tightening long-running task reliability and reducing accidental collateral damage during cancellations, as described in the PR #1000 merge. The specific change targets the “abort” path, which is exactly where background agent runners tend to accumulate sharp edges.
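
The behavior described, where an abort stops the active task but leaves backgrounded work alone, can be sketched in Python as follows. The PR’s actual diff is not shown in the post, so this is only an illustration of the idea under assumed names: Runner, SLEEPER, and the foreground/background split are all hypothetical.

```python
import subprocess
import sys

# A long-running stand-in process for both the foreground and background job.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]

class Runner:
    """Tracks one foreground task plus any jobs sent to the background."""

    def __init__(self):
        self.foreground = None
        self.background = []

    def start(self, argv, background=False):
        proc = subprocess.Popen(argv)
        if background:
            self.background.append(proc)
        else:
            self.foreground = proc
        return proc

    def abort(self):
        # Stop only the foreground task; background jobs are left running.
        if self.foreground is not None and self.foreground.poll() is None:
            self.foreground.terminate()   # SIGTERM on POSIX, not SIGKILL
            self.foreground.wait(timeout=10)
        self.foreground = None

runner = Runner()
bg = runner.start(SLEEPER, background=True)
fg = runner.start(SLEEPER)
runner.abort()

fg_stopped = fg.poll() is not None
bg_alive = bg.poll() is None
print(fg_stopped, bg_alive)

# Clean up the demo's background job so nothing lingers.
bg.terminate()
bg.wait(timeout=10)
```

Keeping an explicit list of background jobs is the design choice that matters here: an abort path that signals “every child” cannot tell the two kinds of work apart.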

Merged PR 1000 on @openclaw github.com/clawdbot/clawd…

10:00 AM · Jan 16, 2026

Kilo Cloud Agents add one-click “browser as IDE” demos

Kilo Cloud Agents (KiloCode): Kilo says the browser can act as a full coding environment via Cloud Agents, and it shipped two one-click demos that fork a repo and run an agent workflow (e.g., updating an avatar, learning game mechanics), as shown in the Cloud agents post.

This is a concrete productization move: pre-canned environments + repo forking + agent context, presented as a single click path, per the Cloud agents post.

Kilo

@kilocode

Your browser is now a full coding environment. Cloud Agents give you access to Kilo from anywhere - no local setup required. We just shipped 2 pre-canned demos you can spin up with a single click: app.kilo.ai/cloud#demo-upd…

2:50 PM · Jan 16, 2026

Loom runs AFK system testing and verification loops

Loom / Ralph loops (Geoffrey Huntley): Following up on Open-source Loom (repo release), a new clip shows Loom running end-to-end system testing/verification “completely AFK,” with a screen full of PASS results while the operator is elsewhere, per the AFK verification photo.

The operational claim is straightforward: tasks that used to require multi-engineer planning plus weeks of verification are being expressed as a repeated loop with automated checks, as described in the AFK verification photo.

geoff

@GeoffreyHuntley

automatic system testing and verification as a ralph loop

5:24 AM · Jan 17, 2026

Browser Use approves top 200 waitlisters, then adds 1,000 more seats

Browser Use (Browser Use): Access is being doled out via a waitlist leaderboard: the “top 200 users” were approved, per the Top 200 approved clip, and another 1,000 users were approved later, per the 1,000 more approved post.

A sample confirmation screen shows a “You’re in!” message and rank-based approval, as shown in the Waitlist confirmation screenshot.

Browser Use

@browser_use

Skill diff.. The top 200 users from the waitlist leaderboard have been approved. Try out BU!

6:46 PM · Jan 16, 2026

Clawdbot docs are reportedly 95% agent-written

Clawdbot (Clawdbot): The maintainer claims the public documentation is now “95% Codex and 5% Clawd,” framing documentation as an automated output rather than a manual bottleneck, per the Docs generation claim and the linked docs site in Docs site. This is a concrete datapoint for “agent-written documentation at scale” workflows, tied directly to an operating project.

If anyone still thinks codex can't write well, check docs.clawd.bot - it's 95% codex and 5% clawd. Not writing docs by hand.

5:21 AM · Jan 17, 2026

Clawdbot can see reactions and treat them as system events

Clawdbot (Clawdbot): A small but operationally relevant UX detail: Clawdbot reports that emoji reactions arrive as system messages and are visible to the agent (and it can infer who reacted), as shown in the Reactions screenshot.

For agent runners, this creates a lightweight feedback channel distinct from “reply with text,” which can matter when trying to keep bots from over-participating, per the Reactions screenshot.

Clawd can both set and see reactions.

2:37 AM · Jan 17, 2026

🧠 ChatGPT product updates (non-ads): memory retrieval + Business ‘apps in custom GPTs’ beta

Non-ad ChatGPT changes today focus on longer-term user state: improved memory retrieval with sourcing, and workspace app connections for Business custom GPTs. Excludes ads/Go rollout (feature).

ChatGPT memory search now cites which past chats it used (Plus & Pro)

ChatGPT (OpenAI): Following up on Reference chats (more reliable past-chat recall), OpenAI now says that when reference chat history is enabled, ChatGPT can more reliably find specific details from your past chats—and it will show the past chat it used as a source so you can open and review the original context, as described in the release notes snippet and detailed in the release notes page.

The update is positioned as global for Plus and Pro users in the same release note entry, while Sam Altman’s “awesome” reaction in the exec endorsement is one of the few real-world signals in the tweets about how noticeable the recall improvement feels in day-to-day use.

Adam.GPT

@TheRealAdamG

help.openai.com/en/articles/68… Memory search in ChatGPT just got supercharged.

2:46 PM · Jan 16, 2026

Apps in custom GPTs beta rolls out to select ChatGPT Business workspaces

Apps in custom GPTs (OpenAI): OpenAI is rolling out a beta that lets workspace GPT creators attach approved workspace apps to a custom GPT, starting with select ChatGPT Business accounts as noted in the rollout mention and expanded in the Help Center doc.

• Live org context: The feature is framed as letting GPTs “retrieve information and perform tasks” from connected systems (docs, calendars, business data) without requiring custom Actions or manual file uploads, per the doc screenshot.

The gating matters operationally: it’s Business-only for now, with workspace admin control over which apps can be used and how broadly a GPT can be shared, as written in the Help Center doc.

Tibor Blaho

@btibor91

Apps in Custom GPTs are rolling out as a beta to select ChatGPT Business accounts

Tibor Blaho

@btibor91

Custom GPTs in ChatGPT will gain the capability to use "Apps" (connectors) - "Choose which apps this GPT can use. If none selected, all accessible connectors are allowed."

10:58 AM · Jan 16, 2026

ChatGPT surfaces an improved chat history browsing UI

Chat history (ChatGPT): A UI change is being demoed that makes chat history browsing feel more explicit in-product—showing “Chat history is here!” and a scrollable list of prior chats in the UI demo clip.

This is adjacent to the new memory/retrieval work, but distinct: it’s about navigation and visibility of past conversations rather than the model’s ability to pull specific details into answers.

Kol Tregaskes

@koltregaskes

ChatGPT has improved chat history.

samir

@_samirism

we've been improving memory. ChatGPT is now more reliable at finding and remembering details from your past chats, like recipes or workouts. give it a try - and let us know what you think

9:29 AM · Jan 16, 2026

🧩 How engineers are changing dev loops: speed vs intelligence, disposable software, and spec-first prompting

Practice-layer discussion: how to drive agents effectively (specs, planning, shorter work blocks, and ‘disposable’ one-off software). Excludes tool release notes (Coding Assistants) and CI/PR review specifics (Code Quality).

File interfaces reduce the need to overthink chunking in agentic retrieval

File interfaces (workflow pattern): The “chunking is dead” claim is being sharpened: when agents can dynamically navigate files (search, open, scroll), static chunk/embed pipelines become less central for many code-and-doc workloads, as argued in Chunking is dead.

The follow-up nuance matters: the critique is aimed at “naive” chunk-and-vector-db as the default retrieval interface, while still acknowledging you’ll want persistence/metadata layers at scale, as clarified in Clarifying note. The underlying bet is that simple file ops (ls, grep, targeted reads) are “unreasonably effective” up to a few hundred docs, as reinforced in File interface rationale.
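
A minimal sketch of the grep-style file interface described above, using only the Python standard library. The function name `grep_files` and the `context` parameter are hypothetical; the idea is that the agent widens `context` on demand instead of committing to a chunk size up front.

```python
import re
from pathlib import Path


def grep_files(root: Path, pattern: str, context: int = 2):
    """Yield (path, line_number, snippet) for each regex match under root.

    `context` is the number of lines kept on each side of a hit; it plays the
    role a fixed chunk size would play in a chunk-and-embed pipeline.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for i, line in enumerate(lines):
            if regex.search(line):
                start = max(0, i - context)
                end = min(len(lines), i + context + 1)
                yield path, i + 1, "\n".join(lines[start:end])
```

An agent could call this first with a small `context` to locate candidate files, then again with a larger one (or read the whole file) once a hit looks relevant.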

Jerry Liu

@jerryjliu0

RAG/retrieval might not be dead. But chunking is dead. There is no point overthinking your chunk size if the agent can dynamically expand context around a file.

9:18 PM · Jan 16, 2026

TypeScript feedback loops turn agents into green-CI machines

TypeScript feedback loops (workflow pattern): A concrete recipe for keeping agent output reliable is being pushed: bake typechecking, tests, and pre-commit hooks into the loop so the agent gets fast, objective failure signals and retries until CI is green, per Feedback loops tutorial and the linked Tutorial.

The emphasis is practical: structure the repo so the agent can verify its own work continuously rather than relying on human review as the primary correctness check.
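
The loop can be sketched in a few lines. This is an illustration under assumptions, not the tutorial's implementation: `attempt_fix` is a hypothetical hook standing in for whatever hands failure logs back to the coding agent, and the check commands are supplied by the caller.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_checks(commands: Sequence[Sequence[str]]) -> list[str]:
    """Run each check command and return the output of the ones that fail."""
    failures = []
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return failures


def agent_loop(
    commands: Sequence[Sequence[str]],
    attempt_fix: Callable[[list[str]], None],
    max_rounds: int = 5,
) -> bool:
    """Retry until every check passes or the round budget runs out."""
    for _ in range(max_rounds):
        failures = run_checks(commands)
        if not failures:
            return True  # all checks green
        attempt_fix(failures)  # hypothetical hook: pass failure logs to the agent
    return not run_checks(commands)  # final verdict after the last fix attempt
```

In a TypeScript repo the commands would typically be something like `["npx", "tsc", "--noEmit"]` plus the project's test script; the exact list depends on the project.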

Matt Pocock

@mattpocockuk

Here are the AI feedback loops I use on every single TypeScript project. Before: Ralph produces 100% slop After: Green CI, all the time Feed the tutorial below to your coding agent, and enjoy. aihero.dev/s/feedback-loo…

11:59 AM · Jan 16, 2026

Boundary-first vibe coding: verify inputs/outputs, not the generated internals

Verification over inspection (workflow framing): A clear trust model is being proposed: vibe-code functions/libraries/components, but engage seriously at boundaries—specs, tests, contracts—rather than trying to read every generated line, as summarized in Boundary-first framing.

• Trust mechanism shift: The framing is that old trust came from reputation/OSS social proof; now trust has to come from checks you can run yourself (property tests, fuzzing, contracts), as argued in Trust via verification and echoed by Contracts over free-form.

This is a direct response to “whole-system slop”: it localizes risk to a component boundary you can test.
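
As a hedged illustration of checking a boundary rather than reading internals, the sketch below runs a small property test against a function treated as a black box. `dedupe_keep_order` is a hypothetical stand-in for generated code, and the contract it is held to is chosen for the example.

```python
import random


def dedupe_keep_order(items):
    """Stand-in for a generated function whose internals we choose not to read."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def check_contract(fn, trials: int = 500, seed: int = 0) -> None:
    """Property test: assert the boundary contract on many random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        result = fn(list(data))
        assert len(result) == len(set(result)), "output contains duplicates"
        assert set(result) == set(data), "output changed the set of elements"
        # First occurrences must keep their original relative order.
        first_seen = sorted(set(data), key=data.index)
        assert result == first_seen, "order of first occurrences not preserved"


check_contract(dedupe_keep_order)
```

The same three assertions can be rerun unchanged after the implementation is regenerated, which is the point of putting the effort into the boundary.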

Maxime Rivest 🧙‍♂️🦙🐧

@MaximeRivest

I think we hit on something important there. > Using black boxes and code you don't understand is not something new brought by vibe coding. > It's not because you accept that some parts of your systems are black-boxed that you don't work on establishing their boundaries and

Kevin Madura

@kmad

Spot on. The logical boundaries still exist regardless of where the underlying code comes from. At the end of the day you still need a way to conceptualize what’s going on and how the system is intended to function. Though your explanation is much more elegant than that :)

4:52 PM · Jan 16, 2026

Disposable software and the “market of one” mindset goes mainstream

Disposable software (workflow pattern): The “build it, use it once, throw it away” framing is getting articulated as a serious shift in how software value is created, with the claim that “the minimum viable market is now one,” per Disposable software take.

This reframes the dev loop around time-to-outcome rather than product polish—especially for internal tooling and one-off workflows where the old ROI math never penciled out.

Addy Osmani

@addyosmani

We've entered the era of disposable software - tools vibe-coded for a single task, a single hour, a single person. The minimum viable market is now one. Certain kinds of software used to be an investment. Now it can be a napkin. Just ask the AI to build it, use it once, and

Theo - t3.gg

@theo

I'm vibe coding 2 to 3 apps a day to solve random problems and it's saving so much time. None of these things are useful enough to release but they're all so useful to me. I think about software entirely differently now.

6:48 PM · Jan 16, 2026

Faster models are changing developer time blocks into async agent driving

Work scheduling (speed vs intelligence): A concrete behavioral change is showing up: switching to a faster model pushes work into many short cycles—“more 30 min blocks” to respond to agents instead of needing “3 hour blocks” to get into flow—per Work block shift.

As a side effect, the work shifts from heavy review/correction toward longer upfront prompts and stacking parallel changes, as described in Stacking PRs pattern.

Adam

@adamdotdev

Andrej Karpathy

@karpathy

I hope this is not my fault. It's definitely very smart so a little bit faster would be good now. x.com/karpathy/statu…

12:37 PM · Jan 16, 2026

Spec-first prompting gets a simple heuristic: include “spec”

Spec-first prompting (workflow pattern): A small but repeated heuristic is being shared: “plans get 10x better if the word ‘spec’ is in the input,” per Spec keyword trick.

It’s a signal that many agent planning failures are still prompt-shape failures: asking for a spec nudges models into requirements and acceptance-criteria mode instead of jumping straight to edits.

jason liu

@jxnlco

codex plans get 10x better if the word 'spec' is in the input not sure why

9:56 PM · Jan 16, 2026

Vibe coding expands prototypes, but doesn’t erase maintenance economics

Software maintenance economics (workflow framing): Levie draws a clean line: AI makes it cheaper to prototype and build internal apps, but the “long tail” of maintenance (bugs, connectors, API changes, operations) still dominates—so large orgs will keep renting CRM/ERP rather than vibe-coding replacements, as argued in Maintenance still dominates.

The point is less about code generation speed and more about who pays the ongoing tax of keeping systems correct and connected.

Aaron Levie

@levie

AI brings down the cost of building software dramatically. And now everyone can write code for any use case they can think of. But nothing changes about the concept of core competencies in a company. Companies spend their finite resources on things that differentiate them and

Peer Richelsen

@peer_rich

"why would i pay for saas if i can prompt the software myself and run it" my brother in christ have you heard of open source businesses the last thing people want to do is to be in charge of development and maintenance of software

2:35 AM · Jan 17, 2026

Agent memory and state resurface as core dev-loop plumbing

Agent memory and state (workflow pattern): There’s renewed attention on “memory/state” as an engineering surface—framed as something getting cool again via agent ecosystems—and the claim that any filesystem-as-source-of-truth pattern tends to evolve into a database as complexity grows, per Memory-state resurgence.

The takeaway is less about which tool wins and more about acknowledging that long-running agent work forces explicit choices about persistence, indexing, and mutation control.

swyx

@swyx

Replying to @swyx

langchain piece is well written x.com/hwchase17/stat…

Harrison Chase

@hwchase17

x.com/i/article/2011…

1:33 AM · Jan 17, 2026

Classic “good code” criteria are getting reused as agent prompt primitives

Prompt grounding for code quality (workflow pattern): A simple prompting move is being recommended: use classic programming books’ descriptions of “good code” directly inside prompts, skills, and AGENTS.md to shape agent behavior, as suggested in Prompting with classics.

This is a reminder that, even with stronger models, the fastest way to reduce rewrites is often to specify taste and constraints in reusable project artifacts.

Matt Pocock

@mattpocockuk

Advice for anyone prompting AI agents: The classic coding books contain some of the best descriptions of "good code" ever written. Use them in your prompts, skills, and AGENTS.md files

5:11 PM · Jan 16, 2026

📚 Retrieval & memory methods: Agentic RAG vs Enhanced, multi-vector retrieval, and cache-parallel decoding

Today’s retrieval chatter is unusually research-heavy: empirical comparisons of Agentic vs Enhanced RAG, new decoding tricks to avoid giant prompts, and strong advocacy for multi-vector retrieval. Excludes bioscience-related papers entirely.

Agentic RAG vs Enhanced RAG: first head-to-head study favors pipelines on hard fact-checking

Agentic RAG vs Enhanced RAG (research): A new empirical comparison argues the “LLM orchestrates everything” approach is more flexible but materially pricier, with Agentic RAG often needing 2–10× more tokens/compute, and can lose badly on tasks where a fixed pipeline helps: Enhanced RAG wins on FEVER by +28.8 F1, per the RAG comparison thread.

• Where Enhanced wins: Router/rewriter/reranker stacks look more stable on datasets where agents retrieve unnecessarily, as described in the RAG comparison thread.
• Where Agentic helps: The same write-up claims modest gains on intent handling and query rewriting averages, but ties performance tightly to underlying model strength in the RAG comparison thread.

The main open question is how much of the gap is “agent policy” vs “missing modules” (rerankers, explicit routers) rather than the agentic paradigm itself.

elvis

@omarsar0

Is Agentic RAG worth it? RAG systems have evolved from simple retriever-generator pipelines to sophisticated workflows. It remains unclear when to use Enhanced RAG (fixed pipelines with dedicated modules) versus Agentic RAG (LLM orchestrates the entire process dynamically).

4:11 PM · Jan 16, 2026

PCED proposes parallel per-document decoding to avoid long RAG prefills

PCED (research): “Parallel Context-of-Experts Decoding” keeps separate KV caches per retrieved document, runs them in parallel, and combines logits via retrieval-aware contrastive decoding—aiming to scale multi-doc evidence without stuffing everything into one prompt, as explained in the PCED paper summary.

• Latency claim: The thread reports >180× faster time to first token, framing PCED as a way to dodge the prefill bottleneck while still stitching evidence across documents in decoding, per the PCED paper summary.
• Key constraint: This approach assumes you can access and combine full token logits across parallel “experts,” which can be awkward depending on your serving stack, as implied in the PCED paper summary.

It’s a decoding-time alternative to rerankers and long-context prompts, not a new retriever.
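
To make the "combine logits across per-document experts" step concrete, here is a toy sketch. The combination rule is an assumption made for illustration and is not taken from the paper: each expert's log-probabilities are contrasted against a no-document prior, weighted by a normalized retrieval score, and summed. The names `combine_experts` and `alpha` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def combine_experts(expert_logits, prior_logits, retrieval_scores, alpha=1.0):
    """Pick the next token id from per-document experts.

    Assumed rule (illustrative only): contrast each expert against the
    no-document prior, weight by normalized retrieval score, sum over experts.
    """
    prior = log_softmax(prior_logits)
    total = sum(retrieval_scores)
    combined = [0.0] * len(prior)
    for logits, score in zip(expert_logits, retrieval_scores):
        weight = score / total
        expert = log_softmax(logits)
        for t in range(len(prior)):
            combined[t] += weight * (expert[t] - alpha * prior[t])
    return max(range(len(combined)), key=combined.__getitem__)
```

The sketch also shows the constraint noted above: it needs the full logit vector from every expert at each decoding step, which not every serving stack exposes.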

This paper shows a training free RAG trick that combines many documents without packing them into 1 prompt. Instead of stuffing documents into a giant prompt, this paper turns each retrieved document into an expert. Another speed trick saves each document as a key value cache,

11:33 AM · Jan 16, 2026

MemGovern uses “experience cards” memory to lift automated bug-fixing results

MemGovern (research): A memory construction pipeline converts messy GitHub issue/PR history into structured “experience cards” (indexable symptoms + resolution + verification), and reports a +4.65% gain on SWE-bench Verified when plugged into a standard fixing agent, as summarized in the MemGovern summary.

• Why it’s different from plain retrieval: The cards are explicitly “governed” (cleaned, split into index vs resolution) to reduce noise and make retrieval actionable, per the MemGovern summary.

This is basically “curated episodic memory” for code agents, built from open-source history.
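
A rough sketch of what an "experience card" split into index and resolution parts might look like. The three field names follow the summary above; the Jaccard-overlap ranking is a simple stand-in chosen for the example, not the paper's retrieval method, and `retrieve` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperienceCard:
    symptoms: str      # indexable description of the failure, used for lookup
    resolution: str    # what fixed it
    verification: str  # how the fix was confirmed


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def retrieve(cards, query: str, top_k: int = 1):
    """Rank cards by Jaccard overlap between the query and the symptoms field only."""
    q = _tokens(query)

    def score(card):
        s = _tokens(card.symptoms)
        return len(q & s) / len(q | s) if q | s else 0.0

    return sorted(cards, key=score, reverse=True)[:top_k]
```

Matching only on `symptoms` keeps fix details out of the index, so a query about a failure is not pulled toward cards that merely mention similar code in their resolution.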

This work shows that better memory, not a bigger model, can lift automated bug fixing success. MemGovern turns messy GitHub bug fix history into usable memory that helps code agents fix more issues. The authors take real GitHub issues, pull requests (proposed code changes), and

8:26 AM · Jan 16, 2026

Multi-vector retrieval push: ColBERT/ColPali proponents argue dense is losing

Multi-vector retrieval (ColBERT/ColPali): Practitioners are again arguing that “multi vector is the only way forward” for retrieval quality, citing repeated cases where small multi-vector models outperform much larger dense encoders in real benchmarks, as stated in the Multi-vector claim and expanded in the ColBERT advocacy.

• Why it matters: The pitch is that late-interaction scoring better preserves token-level evidence (especially for reasoning-heavy or long-context queries) than a single pooled embedding, per the ColBERT advocacy.

This is advocacy, not a new release—no single canonical benchmark artifact is linked in the tweets, so treat the performance claims as directional.
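
For readers unfamiliar with late interaction, the ColBERT-style MaxSim score can be written in a few lines. The toy vectors in the usage note are made up for illustration; `pooled_score` is a simplified mean-pooling baseline, not any specific dense encoder.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def pooled_score(query_vecs, doc_vecs):
    """Single-vector baseline: mean-pool each side, then one dot product."""
    def mean(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    return dot(mean(query_vecs), mean(doc_vecs))
```

With query tokens `[[1, 0], [0, 1]]`, a document with tokens `[[1, 0], [0, 1]]` and another with `[[0.5, 0.5], [0.5, 0.5]]` get the same pooled score, while MaxSim ranks the first document higher because each query token finds an exact match.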

Omar Khattab

@lateinteraction

“multi vector is the only way forward to make retrieval better”

Aamir

@aaxsh18

a 32M parameter multi vector model outperforms 600M parameter models and comes close to a 8B model. multi vector is the only way forward to make retrieval better. @rikiyatakehi cooked big time

6:14 PM · Jan 16, 2026

CompassMem builds event-graph memory to answer long-horizon questions better

CompassMem (research): An event-centric memory system stores interactions as a graph of events with explicit temporal/causal links, then traverses the graph to satisfy decomposed sub-questions; it reports about 52% average F1 on LoCoMo long-conversation questions, with strongest gains on multi-step and time-based queries, according to the CompassMem summary.

This frames “memory” less as semantic similarity search and more as navigable structure that preserves ordering and dependency.
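
A toy version of an event graph with temporal and causal structure is sketched below. Everything here (the `Event` fields, the integer timestamps, the `EventGraph` methods) is hypothetical and far simpler than the system in the paper; it only shows why traversal can answer questions that similarity search over isolated snippets cannot.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str
    time: int  # hypothetical integer timestamp
    text: str
    leads_to: list = field(default_factory=list)  # ids of events this one caused


class EventGraph:
    def __init__(self):
        self.events = {}

    def add(self, event: Event) -> None:
        self.events[event.event_id] = event

    def link(self, cause_id: str, effect_id: str) -> None:
        self.events[cause_id].leads_to.append(effect_id)

    def consequences(self, start_id: str):
        """Breadth-first walk along causal links, returned in time order."""
        seen, queue = {start_id}, deque([start_id])
        while queue:
            for nxt in self.events[queue.popleft()].leads_to:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start_id)
        return sorted((self.events[i] for i in seen), key=lambda e: e.time)

    def before(self, time: int):
        """Temporal query: every event strictly earlier than `time`."""
        earlier = (e for e in self.events.values() if e.time < time)
        return sorted(earlier, key=lambda e: e.time)
```

A multi-step question such as "what followed from the config change?" becomes a call to `consequences`, and "what happened before the outage?" becomes a call to `before`.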

This paper builds CompassMem, an event graph memory that helps LLM agents recall and reason across long histories. Most agent memory systems save past text as isolated snippets and pull it back by meaning similarity, so they struggle when answers depend on linked events and time

9:28 AM · Jan 16, 2026

File-interface retrieval gets a sharper definition: grep-first until you truly scale

File interface retrieval (workflow): A follow-up clarification tightens what “chunking is dead” means in practice: using file tools (scroll/search within files) plus simple ls/grep can be “unreasonably effective” up to a few hundred docs, while production apps still need a persistence layer and likely decoupled retrieval vs final context, per the Chunking clarification and File interface rationale.

• Architecture nuance: The thread explicitly distinguishes “naive chunk/embed/vector-db as the only retrieval” from hybrid systems where a DB indexes metadata and the agent reads whole files on demand, as described in the Chunking clarification.

It’s a concrete scoping statement: file tools aren’t a database replacement, but they can postpone database complexity for smaller corpora.

Jerry Liu

@jerryjliu0

Ok clarifying some comments here so that this is not entirely clickbait: - RAG is so loosely defined right now you could call any retrieval RAG. i was referring to file search replacing "naive" RAG as in chunk, embed, put into vector db as the only form of retrieval - You

Jerry Liu

@jerryjliu0

file search is the new RAG

8:02 PM · Jan 16, 2026

LlamaParse can surface human highlights as tagged context for agents

LlamaParse (LlamaIndex): A small but practical parsing trick—asking for “output highlighted text with special html tags” returns highlighted spans wrapped in <mark> so downstream extractors/agents can prioritize what a human annotator cared about, as shown in the Highlight extraction tip.

This is a retrieval-quality move: it turns latent human attention (highlights) into explicit, machine-usable structure.
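
Once the parser output contains `<mark>` tags, pulling the highlighted spans out downstream is a small regex job. The sketch below assumes the parsed markdown is already in hand as a string; it does not call LlamaParse, and `extract_highlights` is a hypothetical helper name.

```python
import re

MARK_RE = re.compile(r"<mark>(.*?)</mark>", re.DOTALL)


def extract_highlights(markdown: str) -> list[str]:
    """Return the text of every <mark>...</mark> span, whitespace-normalized."""
    return [" ".join(span.split()) for span in MARK_RE.findall(markdown)]
```

The resulting list can be placed ahead of the full document in an extraction prompt, or used to boost the pages that contain highlights.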

Jerry Liu

@jerryjliu0

Our mission these days is digitalizing paperwork 📄, and a lot of paperwork has handdrawn stuff - like highlights! If you upload a scanned form into LlamaParse, you can add a simple prompt to output all highlighted text with html tags. This is reflected in both the raw markdown

8:12 PM · Jan 16, 2026

📈 Benchmarks & real-world usage measurement: Economic Index, token studies, and head-to-head reviews

Measurement-heavy day: real usage datasets (Anthropic Economic Index, OpenRouter token study), product benchmarks, and model-vs-model comparisons for code review and deep research. Excludes pure model release announcements.

Anthropic data suggests multi-turn use sustains longer task horizons

Economic Index (Anthropic): New figures from Anthropic’s Economic Index analysis suggest task success on Claude.ai degrades much more slowly with task duration than it does on the API, extrapolating to ~19 hours at a 50% success rate, as shown in the Task horizon chart and detailed in the Economic Index PDF.

• Duration vs success: Claude.ai’s fitted trend stays above 60% across the plotted window, while the 1P API falls toward 50% at roughly 5 hours, according to the Task horizon chart.
• Education and speedups: The report’s plots indicate measured speedups rise with predicted education level, as shown in the Speedup vs education plot.

The data is observational (real usage + model-based estimates), so treat the extrapolation as directional rather than a benchmark score.

Haider.

@slow_developer

in a recent anthropic report: "long-horizon tasks hit 19 hours at a 50% success rate” using multi-turn conversations" - cursor CEO showed this by running multiple gpt-5.2 agents to build a browser in a week - anthropic is doing something similar by squeezing gains out of

9:00 AM · Jan 16, 2026

Kilo’s code review test finds Grok Code Fast 1 matches Opus 4.5 on detection

Kilo Code Reviews (Kilo): Kilo reports a head-to-head code review comparison where Grok Code Fast 1 (free tier) found 8 issues at a 44% detection rate—matching Claude Opus 4.5—while GPT‑5.2 found 13 issues at 56%, as shown in the Benchmark table post and expanded in the free models writeup linked in Free reviews test.

• Frontier vs free framing: The same table places GPT‑5.2 as top on issues found (13) and detection rate (56%), while Opus 4.5 and Grok Code Fast 1 tie on detection at 44%, per the Benchmark table post.
• Methodology context: Kilo points to separate deep dives on frontier and free model runs in the frontier writeup linked in Frontier reviews post and the free models writeup linked in Free reviews test.

The benchmark is a single PR-style task suite; treat the ranking as task-specific rather than a general coding leaderboard.

Kilo

@kilocode

Grok Code Fast 1 matches Opus 4.5 for bug detection. This is based on our head-to-head comparison of 6 AI coding models across different code review tasks. SOTA results: blog.kilo.ai/p/code-reviews… (Currently) free model results: blog.kilo.ai/p/free-reviews…

4:31 PM · Jan 16, 2026

OpenRouter’s 100T-token study shows open-weight and agentic usage rising

State of AI (OpenRouter): OpenRouter published an empirical analysis of 100T tokens of anonymized request metadata, reporting open-weight models rising to ~33% of usage by late 2025 alongside a shift toward agent-style, tool-using workloads, as summarized in the 100T token study thread.

• Open vs closed mix: The paper’s headline claim is open-weight traffic reaching roughly one-third of all usage by late 2025, per the 100T token study thread.
• What people do with it: The reported open-weight traffic skewed heavily toward roleplay and coding assistance, while “agentic inference” and longer inputs increased over time, according to the same 100T token study thread.

Because the study uses metadata rather than prompt text, the granularity comes from proxies (tool calls, tokens, timing) rather than content inspection.

This paper analyzes 100T OpenRouter tokens to show LLM use is shifting toward agents, not simple chat. The authors study anonymized request metadata, like model choice, token counts, timing, region, and tool calls, not the actual text. They then compare open weight models,

2:39 PM · Jan 16, 2026

Artificial Analysis reports DeepSeek R1 throughput on SambaNova SN40L

Hardware benchmarking (Artificial Analysis): Artificial Analysis says its hardware suite now includes DeepSeek R1 runs on SambaNova’s SN40L RDU, with throughput reaching ~4,700 tokens/sec at and beyond 256 concurrent requests, according to the Benchmarking announcement.

It also highlights unusually high per-user speed at low concurrency—peaking at 269 tokens/sec for single-user workloads—as stated in the Single-user speed note. Pricing comparisons are explicitly incomplete because SN40L isn’t offered with standard hourly spot pricing, per the same Single-user speed note.

Artificial Analysis

@ArtificialAnlys

SambaNova hardware benchmarking: Artificial Analysis’ Hardware Benchmarking now includes DeepSeek R1 on SambaNova’s SN40L RDU, showing outperformance compared to NVIDIA H200 chips across most tested concurrency levels ➤ The SN40L system tested handles batch sizes of up to 256,

6:39 PM · Jan 16, 2026

Parallel shares DeepSearchQA accuracy and cost table for agentic search tasks

DeepSearchQA (Parallel): Parallel published a DeepSearchQA table claiming its Task API variants outperform Gemini Deep Research and OpenAI GPT‑5.2 Pro on a joint accuracy/cost view, led by “Ultra2X” at 72.6% accuracy and 600 CPM, as shown in the Benchmarks hub post.

• Relative positioning: The same table lists Gemini Deep Research at 64.3% and 2500 CPM and OpenAI GPT‑5.2 Pro at 61% and 1830 CPM, per the Benchmarks hub post.

No evaluation artifact or dataset card is linked in the tweets, so the claim is best treated as vendor-reported until independently reproduced.
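
One way to read a joint accuracy/cost table is to ask which entries are Pareto-dominated. The sketch below applies that check to the three rows quoted above; the numbers are the vendor-reported figures, and `pareto_front` is a hypothetical helper.

```python
# (name, accuracy in percent, CPM) as quoted from the vendor-reported table.
RESULTS = [
    ("Parallel Ultra2X", 72.6, 600),
    ("Gemini Deep Research", 64.3, 2500),
    ("OpenAI GPT-5.2 Pro", 61.0, 1830),
]


def pareto_front(rows):
    """Keep rows that no other row beats on both accuracy (higher) and cost (lower)."""
    front = []
    for name, acc, cost in rows:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for _, a, c in rows
        )
        if not dominated:
            front.append(name)
    return front
```

On these three rows only Ultra2X survives, since it is both more accurate and cheaper than the other two. Between Gemini Deep Research and GPT-5.2 Pro neither dominates: one is more accurate, the other cheaper.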

Parallel Web Systems

@p0

At Parallel, one of our guiding principles is that quality compounds, which is why we obsess over accuracy. Today, we’re introducing our benchmarks hub, where you’ll find our most recent published benchmarks across our product suite. Our latest is DeepSearchQA, Google’s deep

10:46 PM · Jan 16, 2026

Epoch summarizes a 2025 forecasting miss on revenue and bio risk

Forecasting evaluation (Epoch AI): Epoch’s write-up of an AI Digest forecasting survey says forecasters largely matched benchmark score trajectories but missed on real-world outcomes, most notably underestimating annualized AI revenue (median $16B vs ~$30B), as summarized in the Core takeaway and written up in the Gradient update.

It also reports mixed performance on risk forecasts, calling out underestimated biological risk (while cyber and autonomy calls were closer), per the Risk forecast note.

Epoch AI

@EpochAIResearch

Replying to @EpochAIResearch

The core takeaway: the forecasters were mostly right on benchmarks, but had mixed results on societal impacts.

8:42 PM · Jan 16, 2026

⚙️ Inference & self-hosting: day‑0 runtimes, consumer GPU economics, and local deployments

Serving/runtime posts focus on getting new models running fast (day‑0 support) and practical local inference economics on consumer GPUs. Excludes chip geopolitics and broader energy buildout (Infrastructure).

Paper quantifies private LLM inference costs on RTX 5060Ti/5070Ti/5090 GPUs

Private inference economics: A new paper argues SMEs can run private LLM inference on consumer Blackwell GPUs with electricity-only costs around $0.001–$0.04 per 1M tokens, with long-context RAG latency hinging on high-end cards like the RTX 5090 for sub-second time-to-first-token, as summarized in the paper thread.

• Quantization result: the paper claims NVFP4 improves throughput about 1.6× while cutting energy about 41% with modest quality loss, per the paper thread.

It’s a clean, numbers-first datapoint for teams comparing local inference vs “cheap API tiers,” especially where data governance forces on-prem.
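
The electricity-only figure comes down to simple arithmetic: energy drawn while generating one million tokens, times the price of that energy. The sketch below shows the formula; the wattage, throughput, and tariff in the usage note are made-up inputs for illustration, not values from the paper.

```python
def electricity_cost_per_million_tokens(
    power_watts: float, tokens_per_second: float, usd_per_kwh: float
) -> float:
    """Electricity-only cost in USD to generate one million tokens."""
    hours = 1_000_000 / tokens_per_second / 3600  # wall-clock time for 1M tokens
    kwh = power_watts / 1000 * hours              # energy drawn over that time
    return kwh * usd_per_kwh
```

For example, a hypothetical card drawing 300 W at 2,000 tokens/sec with power at $0.15/kWh gives `electricity_cost_per_million_tokens(300, 2000, 0.15)`, about $0.006 per 1M tokens, which sits inside the range the paper reports. Hardware purchase, cooling, and idle time are not included.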

This paper shows small firms can run private LLMs on RTX 50 GPUs at low cost. Electricity-only inference lands around $0.001-$0.04 per 1M tokens, about 40-200x cheaper than budget cloud application programming interfaces (APIs). The problem is that cloud LLMs can expose

1:37 PM · Jan 16, 2026

vLLM-Omni adds day-0 support for FLUX.2 [klein] image generation

vLLM-Omni (vLLM Project): vLLM-Omni added day-0 support for FLUX.2 [klein], framing it as a fast, consumer-GPU-friendly image generator with integrated text-to-image and inpainting, as described in the day-0 support note.

The practical implication is another step toward “model drops → runnable in production runtimes” without bespoke glue, especially for teams standardizing on vLLM-style serving surfaces.

vLLM

@vllm_project

🎉 Day-0 support for FLUX.2 [klein] is now available in vLLM-Omni! FLUX.2 [klein] from @bfl_ml brings high-performance image generation, balancing speed with top-tier aesthetics: ⚡️ Sub-second Inference: <0.5s per image for real-time apps. 🎨 All-in-One: Integrated

Black Forest Labs

@bfl_ml

Introducing FLUX.2 [klein]. Blazing fast. Beautiful. Generate stunning images in under a second while maintaining exceptional quality. Great for fast editing, changing styles, and developing ideas from 0 → 1. Available via API, or run it locally - Klein 4B under Apache 2.0,

10:29 AM · Jan 16, 2026

Android Studio adds Ollama as a model provider in its IDE model picker

Android Studio (Google) + Ollama: The latest Android Studio build surfaces Ollama as a first-class model provider in the IDE’s model picker—alongside Gemini and Anthropic—showing local options like gpt-oss and gemma variants, as shown in the model picker screenshot.

This is a concrete “local by default” integration point: a mainstream IDE UI treating on-device model selection like any other provider choice.

ollama

@ollama

The latest Android Studio has Ollama support.

Android Developers

@AndroidDev

Designed to give you more control, flexibility, & agentic experiences, the @AndroidStudio Otter 3 Feature Drop is now stable→ goo.gle/3No0jGj From Agent Mode conversation threads and choosing any AI model to automating UI tests, this release helps you build smarter, not

7:39 PM · Jan 16, 2026

Ollama ships TranslateGemma with a required prompt format and guide

TranslateGemma (Google) on Ollama: Following up on TranslateGemma release—open translation models—Ollama now lists TranslateGemma as runnable via ollama run translategemma, with a warning that it requires a specific prompting format, per the Ollama announcement and the linked prompting guide.

The update is less about new model capability and more about friction removal: “downloadable, local translation” with a documented invocation contract.

ollama

@ollama

ollama run translategemma TranslateGemma is available on Ollama. Now you can use it in apps to translate between 55 languages. Note, it requires a specific prompting format 👇👇👇

Google DeepMind

@GoogleDeepMind

We’re releasing TranslateGemma, a new family of open translation models with support for 55 languages. 🌐 Available in 4B, 12B, and 27B parameter sizes – they’re designed for efficiency without sacrificing quality.

11:34 PM · Jan 16, 2026

SGLang launches official site consolidating docs, cookbook, and ecosystem

SGLang (LM-SYS): LM-SYS launched an official SGLang website to centralize docs, deployment guides, ecosystem projects, and community events, positioned as a response to information sprawl as adoption grows, according to the website launch note.

This is more “ops readiness” than a feature release, but it’s a real lever for teams standardizing on SGLang for serving and needing a single canonical reference.

LMSYS Org

@lmsysorg

🚀 We are launching the official SGLang website: sglang.io 🎉 We built this site because SGLang has grown significantly, and the information around it was becoming scattered. The new website brings everything around SGLang into one place: 🔹 Docs & Cookbook – core

5:32 PM · Jan 16, 2026

🧪 Model and benchmarked drop watch: FLUX.2, LTX‑2, YOLO26, embeddings, and ‘Sonata’ hostname spotting

Model chatter spans fast open image/video models, new retrieval embeddings, and ‘what is this hostname’ leak-watching. Excludes bioscience-related papers and excludes Veo/creative workflow tutorials (in Gen Media).

FLUX.2 [klein] posts top open-model image-edit rankings and day-0 vLLM support

FLUX.2 [klein] (Black Forest Labs): Following up on initial release, new third-party signals put FLUX.2 [klein] at or near the top of open image editing. Artificial Analysis says the 9B variant ranks as the top open-weights model in its Image Editing Arena and is also competitive in Text-to-Image, with pricing/positioning details in the Rank and pricing breakdown; LMArena posts similar placement for 4B and 9B in both Image Edit and Text-to-Image leaderboards, per the Leaderboard snapshot.

• Serving readiness: vLLM-Omni added day-0 support for FLUX.2 [klein], calling out sub-second inference and a ~13GB VRAM target for the 4B Apache-2.0 model, as described in the vLLM-Omni support note.

Overall, today’s chatter is less about “new model exists” and more about where it lands on public preference leaderboards and whether the open inference stack is ready on day one.

Artificial Analysis

@ArtificialAnlys

FLUX.2 [klein] is the new open weights image model from Black Forest Labs, with the 9B variant ranking as the top open weights image editing model in the Artificial Analysis Image Editing Arena! FLUX.2 [klein] is the spiritual successor to FLUX.1 [schnell] from @bfl_ml, released

1:41 AM · Jan 17, 2026

LTX-2 becomes the top open-weights video model in Artificial Analysis Video Arena

LTX-2 (Lightricks): Artificial Analysis claims LTX-2 is now the leading open-weights video model in its Video Arena, surpassing Wan 2.2 A14B on both text-to-video and image-to-video, with licensing caveats noted in the Arena leader claim.

The same post distinguishes between the newly open-sourced base weights (including a 19B base) and the vendor’s “Pro/Fast” API endpoints that layer additional pipeline optimizations on top, per the Arena leader claim.

Artificial Analysis

@ArtificialAnlys

LTX-2 is the new leading open weights video model, surpassing Wan 2.2 A14B in both Text to Video and Image to Video in the Artificial Analysis Video Arena! LTX-2, originally released in November by @Lightricks, was recently open sourced with both the base 19B model and a…

8:12 PM · Jan 16, 2026

Ultralytics releases YOLO26 family: ~30 small models for detection, seg, and keypoints

YOLO26 (Ultralytics): Ultralytics’ YOLO26 family is being shared as a broad “small model” drop—about 30 variants under 50M parameters covering open-vocab detection, segmentation, and keypoint detection, with a CPU demo highlighted in the Release demo.

The Model collection pointer links to a Hugging Face bundle; see the Model collection for the full set of weights and variants.

merve

@mervenoyann

even your toaster can see now 🔥 @ultralytics dropped YOLO26 family of models, not only for detection.. 30 models for open vocab detection, segmentation, keypoint detection and more! 🥹 all models <50M params, this demo runs on CPU 🤯

11:13 AM · Jan 16, 2026

Voyage 4 embedding models ship, with open-weights voyage-4-nano called out on FreshStack

Voyage 4 embeddings (VoyageAI/MongoDB): Tweets point to a Voyage 4 embedding-model release, with special attention on the first open-weights entry “voyage-4-nano,” which is claimed to beat Stella on the FreshStack retrieval leaderboard, according to the FreshStack comparison and the FreshStack comparison framing.

The benchmark context for that claim lives on the FreshStack site—see the FreshStack leaderboard for what the evaluation is measuring (technical-doc retrieval) and how models are ranked.

Nandan Thakur

@beirmug

🆕 Voyage 4 models have been released by @VoyageAI @tengyuma @MongoDB! Benchmarked their ⭐ first open weights model (voyage-4-nano) on #FreshStack and it outperformers Stella on the leaderboard ~ similar size! 🎉

Tengyu Ma

@tengyuma

Excited by many launches at #mongodblocal by @VoyageAI by MongoDB: 1. voyage-4-large: MoE embedding model with new SOTA accuracy & 40% lower costs. 2. voyage-4, -4-lite, and -4-nano (open-weight!) with shared embedding space that enables flexible cost-benefit optimization. 3.

8:04 PM · Jan 16, 2026

xAI tests Grok 4.20 “Theta-Hat” checkpoint on LMArena

Grok 4.20 (xAI): A model-watching thread says xAI is testing multiple Grok 4.20 variants on LMArena, with the latest checkpoint “Theta-Hat” described as the strongest so far in the LMArena test clip.

No official spec or release notes are in the tweets, so the only concrete artifact today is the “in-arena checkpoint” observation and the naming.

AiBattle

@AiBattle_

xAI continues testing various versions of Grok 4.20 on LMArena The most recent model, "Theta-Hat", seems to be the most capable yet

8:02 AM · Jan 16, 2026

“sonata.openai.com” hostnames spotted; speculation points to an OpenAI audio model

Sonata (OpenAI): Hostname-watchers report newly observed subdomains including “sonata.openai.com” (dated 2026-01-16) and “sonata.api.openai.com” (dated 2026-01-15) in the Hostname sightings.

Speculation in follow-ups suggests “Sonata” could map to an upcoming audio or music-related model/product, as raised in the Audio model question and echoed in the Audio model speculation.

Tibor Blaho

@btibor91

Sonata . OpenAI . com?

9:59 AM · Jan 16, 2026

HeartMuLa open-sourced music foundation models shared, including a 3B OSS checkpoint

HeartMuLa (Ario Scale Global): HeartMuLa is being circulated as a family of open-sourced music foundation models, with a Hugging Face checkpoint referenced as “HeartMuLa-oss-3B,” per the Model card link and the Hugging Face model card.

Today’s tweets don’t include benchmark numbers; what’s concrete is the open checkpoint naming, the model family positioning, and the availability of weights via Hugging Face.

@_akhaliq

Replying to @_akhaliq

model: huggingface.co/HeartMuLa/Hear…

5:03 PM · Jan 16, 2026

🔌 Open Responses spec adoption: SDKs and harnesses standardizing multi-provider Responses

Continuation of yesterday’s Open Responses momentum, now with concrete SDK positioning and builder intent to create harnesses around the spec. Excludes MCP-specific connectors (Orchestration/MCP).

OpenRouter ships an Open Responses-native SDK positioned for agentic multi-model apps

OpenRouter SDK (OpenRouter): Following up on Standardization, OpenRouter is now explicitly positioning its SDK as “the first agentic SDK native to Open Responses,” with a single interface spanning 300+ AI models as described in the SDK positioning claim and detailed on the SDK page. The emphasis is that multi-provider switching and agent-style workflows (streaming, tools, composable steps) can live behind one API surface instead of per-provider adapters.

• What’s concrete in the SDK pitch: The SDK page highlights 300+ models plus built-in streaming and tool isolation primitives, and names Open Responses as the spec the SDK is native to.

This is an adoption signal more than a spec change: Open Responses is turning into something third-party SDKs want to be “native” to, not just “compatible with.”

OpenRouter

@OpenRouter

The OpenRouter SDK is the first agentic SDK native to Open Responses Easy to get started:

OpenRouter

@OpenRouter

We're standardizing on Open Responses for @OpenAI Integrations. A unified request/response schema improves support for multimodal inputs, interleaved reasoning, and other advanced features for developers and users alike!

4:06 PM · Jan 16, 2026

OpenAI Devs spotlights early Open Responses adoption by builders and tooling projects

Open Responses (OpenAI Devs): OpenAI Devs is now explicitly amplifying “builders are already using Open Responses,” framing it as an open-source spec for interoperable, multi-provider Responses-style interfaces in the Adoption shoutout.

The notable shift versus the initial spec announcement is the messaging moving from “here’s the spec” to “here are early adopters,” as shown in the Adoption shoutout and reinforced by the explainer clip in the Open Responses recap.

OpenAI Developers

@OpenAIDevs

Replying to @OpenAIDevs

x.com/vllm_project/s…

vLLM

@vllm_project

When we added support for gpt-oss, the Responses API didn't have a standard and we essentially reverse-engineered the protocol by iterating and guessing based on the behavior. We are very excited about the Open Responses spec: clean primitives, better tooling, consistency for the

6:10 PM · Jan 16, 2026

vLLM frames Open Responses as a way to stop “reverse-engineering” provider behavior

Open Responses (vLLM ecosystem): vLLM contributors are explicitly endorsing Open Responses as a cleanup of a painful integration workflow—having to “reverse-engineer the protocol by iterating and guessing” when adding support for a Responses-like API, then welcoming the spec as “clean primitives” and better consistency per the Meetup note with context.

The concrete claim here is not about new vLLM code landing today; it’s about integration cost—Open Responses is being used as the argument for fewer bespoke shims and fewer behavior-driven “guess the protocol” loops, as described in the Meetup note with context.

vLLM

@vllm_project

The first @vllm_project meetup of 2026 in Munich is here! Talks & demos from @RedHat, @AMD, @MistralAI, and CROZ + hands-on GPU workshop. See you there! 🥨🍻

Red Hat AI

@RedHat_AI

Munich AI builders 👋 Join the @vllm_project meetup on 24 Feb for real world GenAI inference and optimization. Talks and demos from @RedHat, @AMD, @MistralAI, and CROZ, plus hands on GPU inference and time to connect with engineers building open AI. 🔗 luma.com/y2e9eugh

10:03 AM · Jan 16, 2026

Builder pattern: new coding harnesses planned around first-class Open Responses support

Open Responses-first harnesses (Pattern): A concrete builder signal today is people designing their own coding/agent harnesses around Open Responses as the primary abstraction, with nummanali framing “too many providers, too many variations” as the blocker that the spec removes in the Harness intent note. The practical idea is that a harness can focus on outcomes (plans, tool orchestration, agent loops) while treating provider/model differences as a configuration detail.

This is still intent, not a shipped tool; there’s no published harness interface or reference implementation in the tweets yet beyond the stated direction in the Harness intent note.
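
As a rough illustration of the "provider as configuration" idea, here is a minimal Python sketch. It is not taken from the tweets or from any published harness: the `ProviderConfig` and `build_request` names, the example base URLs, and the `/responses` path are all hypothetical. The `model`/`input`/`tools` body fields mirror the OpenAI Responses API that the spec is described as building on; whether a given provider accepts exactly this shape is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ProviderConfig:
    """Everything provider-specific lives here (all values are illustrative)."""
    name: str
    base_url: str  # hypothetical endpoint root, e.g. "https://provider-a.example/v1"
    model: str     # provider-specific model identifier


def build_request(
    cfg: ProviderConfig,
    user_input: str,
    tools: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Build one request description; only the URL and model vary by provider.

    Assumption: a Responses-style body with "model", "input", and optional "tools".
    """
    body: Dict[str, Any] = {"model": cfg.model, "input": user_input}
    if tools:
        body["tools"] = tools
    return {"url": f"{cfg.base_url.rstrip('/')}/responses", "json": body}


if __name__ == "__main__":
    providers = [
        ProviderConfig("provider-a", "https://provider-a.example/v1", "model-a"),
        ProviderConfig("provider-b", "https://provider-b.example/v1/", "model-b"),
    ]
    for cfg in providers:
        # Same harness code path for every provider; nothing is sent over the network.
        print(build_request(cfg, "Summarize the open TODOs in this repo."))
```

The design point is that the agent loop calls `build_request` without branching on the provider; swapping providers means swapping a `ProviderConfig`, not rewriting the loop.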

Numman Ali

@nummanali

It’s time for me to create my own coding agent harness with first class support for the Open Responses spec This was the only blocker for me, too many providers, too many variations With this I can focus on the outcomes over the implementation details

OpenAI Developers

@OpenAIDevs

Today we’re announcing Open Responses: an open-source spec for building multi-provider, interoperable LLM interfaces built on top of the original OpenAI Responses API. ✅ Multi-provider by default ✅ Useful for real-world workflows ✅ Extensible without fragmentation Build

9:08 AM · Jan 16, 2026

🧰 Plugins & Skills ecosystems: ‘npm for skills’, reusable prompts, and agent UX add-ons

Installable/portable extensions and skill packs are a major thread today (skills registries, skill installers, statuslines). Excludes MCP servers (Orchestration/MCP) and full agent runners (Agent Ops).

Vercel pitches “skills” as an agent-agnostic, npm-like ecosystem for AI extensions

Skills (Vercel): Vercel is positioning skills as a portable, agent-agnostic packaging ecosystem—explicitly framed as the “npm of AI skills”—intended to make reusable agent capabilities installable across different runtimes, as described in the skills announcement. The point is standardizing the unit of reuse (a skill) so distribution and discovery look more like software packages than copy-pasted prompts.

What’s still unclear from today’s tweets is the concrete on-disk format and execution contract (how a skill declares tools, inputs/outputs, permissions, and sandboxing) beyond the high-level framing in the skills announcement.

Guillermo Rauch

@rauchg

We're introducing 𝚜𝚔𝚒𝚕𝚕𝚜 – the "npm" of AI skills. Excited to see an open, agent-agnostic ecosystem of skills flourish. To get started, try: ▲ ~/ npx skills i vercel-labs/agent-skills

2:06 AM · Jan 17, 2026

OpenRouter adds a real-time cost-tracking statusline for Claude Code sessions

Claude Code statusline (OpenRouter): OpenRouter documented a real-time cost statusline for Claude Code that reads generation IDs from the transcript, calls OpenRouter’s generation endpoint for spend, and keeps a per-session running total—including cache discounts—per the statusline walkthrough and the linked integration guide in Claude Code docs.

• What it surfaces: Provider, model name, cumulative cost, and cache discount are shown inline in the terminal statusline, as shown in the statusline walkthrough.

It’s a small UI hook, but it formalizes “what did this agent run cost?” as a first-class signal rather than a post-hoc dashboard lookup.
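
The flow described above (collect generation IDs, look up spend per ID, keep a running total) can be sketched in a few lines of Python. This is not OpenRouter's script: the lookup is injected as a plain function so no real endpoint or response schema is assumed, and the `total_cost` and `cache_discount` field names are placeholders for whatever the generation endpoint actually returns.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Set

# A lookup takes a generation ID and returns its stats. In a real setup this
# would call the provider's generation endpoint; here it is left abstract.
StatsLookup = Callable[[str], Dict[str, float]]


@dataclass
class SessionCost:
    """Running spend for one session, de-duplicated by generation ID."""
    total_cost: float = 0.0
    cache_discount: float = 0.0
    seen: Set[str] = field(default_factory=set)


def update_session(
    session: SessionCost, generation_ids: Iterable[str], lookup: StatsLookup
) -> SessionCost:
    """Add every not-yet-seen generation to the running totals."""
    for gid in generation_ids:
        if gid in session.seen:
            continue  # transcripts are re-read, so skip IDs already counted
        stats = lookup(gid)
        session.seen.add(gid)
        # Field names are assumptions, not a documented schema.
        session.total_cost += float(stats.get("total_cost", 0.0))
        session.cache_discount += float(stats.get("cache_discount", 0.0))
    return session


def render_statusline(provider: str, model: str, session: SessionCost) -> str:
    """Format the four values the statusline is described as showing."""
    return (
        f"{provider} | {model} | cost ${session.total_cost:.4f} "
        f"| cache discount ${session.cache_discount:.4f}"
    )


if __name__ == "__main__":
    # Stubbed stats standing in for real lookups.
    fake_stats = {
        "gen-1": {"total_cost": 0.0125, "cache_discount": 0.002},
        "gen-2": {"total_cost": 0.03, "cache_discount": 0.0},
    }
    session = update_session(
        SessionCost(), ["gen-1", "gen-2", "gen-1"], fake_stats.__getitem__
    )
    print(render_statusline("example-provider", "example-model", session))
```

Tracking seen IDs matters because a statusline is redrawn repeatedly over the same transcript; without it, each redraw would count earlier generations again.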

OpenRouter

@OpenRouter

TIP: you can add an API cost-tracking statusline to Claude Code Includes caching discounts. openrouter.ai/docs/guides/gu…

6:53 PM · Jan 16, 2026

RepoPrompt 1.5.68 adds worktree window control and new installers

RepoPrompt 1.5.68 (RepoPrompt): RepoPrompt shipped v1.5.68 with a worktree-window workflow (spin up/close windows for agent use via the MCP and CLI) and added an opinionated claude-rp installer plus opencode installers, as announced in the release note; the full changelog is at the changelog link.

• Why it’s “ecosystem” news: The emphasis is on install paths and compatibility layers (getting a “nice Claude Code” setup with clashing tools disabled) rather than a single-model upgrade, per the release note.

The tweets don’t include performance numbers or a demo, so operational impact is still qualitative.

eric provencher

@pvncher

Just released @RepoPrompt 1.5.68 - It's now possible to spin up and close windows for easy agent use with worktrees using the mcp and cli. - Added an installer for claude-rp to get a nice Claude Code with clashing tools disabled, as well as an @opencode installers

6:53 PM · Jan 16, 2026

A “building glamorous TUIs” meta-skill packages Charm usage for agents

Building glamorous TUIs (meta-skill): A new meta-skill packages guidance on using Charm libraries to get better terminal UI output from agents, with the implementation published as a GitHub skill in the skill release.

This is notable less for new model capability and more for capturing a repeatable “taste layer” (design conventions + library choices) into an installable artifact, rather than a one-off prompt—per the skill release.

Jeffrey Emanuel

@doodlestein

I decided to turn this post into an elaborate skill that operationalizes the concept of “use any and all Charm libraries that are relevant to your use case”: github.com/Dicklesworthst… This stuff is what makes bv look so nice. And the acfs scripts. Everything Charm makes is great.

Jeffrey Emanuel

@doodlestein

Literally every single library shown on this site is an exquisite gem and you should always use any that happen to fit your use case and the language you're using (basically Golang and bash): charm.land

1:19 PM · Jan 16, 2026

Firecrawl documents a one-file install artifact for agent setups

Firecrawl install artifact (Firecrawl): Firecrawl is pushing an agent-friendly one-file install pattern—“tell your agents how to install Firecrawl with one file”—as a documentation convention for reliable setup automation, per the docs update.

This is a small move, but it’s squarely in the “skills and reusable setup” lane: turning environment bootstrapping into a portable artifact agents can consume consistently, as described in the docs update.

Firecrawl

@firecrawl

Tell your agents how to install Firecrawl with one file 🔥 Now live on our docs: docs.firecrawl.dev/install.md

Nick Khami

@skeptrune

there's now a /install. md file on every Mintlify site 🙂

10:39 PM · Jan 16, 2026

💼 Capital and enterprise moves: funding, acquisitions, and licensing for AI data

Business-side signals today include a major gen-video funding round plus infra-adjacent acquisitions/licensing moves that affect AI training data and developer ecosystems. Excludes ad monetization (feature).

Cloudflare buys Human Native, an AI data marketplace for creator licensing

Human Native (Cloudflare): Cloudflare is acquiring Human Native, a UK-based AI data marketplace aimed at brokering deals between AI developers and content creators—Cloudflare says it will build tooling for “fair and transparent” access to high-quality training data, while declining to disclose deal terms in the deal summary.

The immediate signal is that “data licensing as infrastructure” keeps getting more formal: instead of scraping → training → lawsuits, vendors are trying to sell auditable access paths that can scale with enterprise procurement.

Wes Roth

@WesRoth