Qwen3.5-397B open weights land – 1M-context Plus priced $0.60/M input

Qwen3.5 is Live! Today we openweight the first model, Qwen3-397B-A17B, which is a native multimodal model supporting both thinking and non-thinking modes. We have strengthened its coding and agentic capabilities to foster productivity for developers and enterprises. Hope you Show more

Qwen

@Alibaba_Qwen

🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series. 🖼️Native multimodal. Trained for real-world agents. ✨Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling. ⚡8.6x–19.0x decoding throughput vs Qwen3-Max 🌍201

9:43 AM · Feb 16, 2026

855

Read 58 replies

Qwen3.5-Plus clarified as hosted 397B with 1M context and built-in tools

Qwen3.5-Plus (Alibaba Qwen): Alibaba clarified that Qwen3.5-Plus is the hosted/API flavor of the same 397B model, with 1M context and built-in tools (search + code interpreter) available in “Auto” mode, according to the Plus vs 397B clarification and the follow-up in Pricing and tools note. The Qwen Chat UI shows both models side-by-side, as seen in the Plus vs 397B clarification.

This matters operationally because Plus is positioned as the “don’t run multi-GPU infra” option while the open weights remain the self-host path; the two are being discussed interchangeably in threads, which is the confusion this clarification is trying to stop.

Junyang Lin

@JustinLin610

A clarification of Qwen3.5 Plus and 397B: 1. for opensource, we follow the tradition to make parameters apparent so we use the name with the number of total parameters and active params. 2. Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens, Show more

10:14 AM · Feb 16, 2026

762

Qwen3.5 benchmark grid circulates; early builder tests focus on frontend/SVG quality

Qwen3.5 eval signals (Alibaba Qwen): A benchmark grid comparing Qwen3.5-397B-A17B against GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro is being reposted heavily, with callouts like IFBench 76.5, GPQA Diamond 88.4, BFCL v4 72.9, BrowseComp 78.6, SWE-bench Verified 76.4, Terminal-Bench 2 52.5, MMMU-Pro 79.0, and OmniDocBench 90.8, as shown in the Benchmarks grid screenshot.

• Frontend/SVG reality checks: Builders are probing “design/UI generation” with SVG and voxel tests; one report says Qwen3.5’s SVG skyline outputs are behind DeepSeek-V3.2 and GLM-5 in their prompt suite in the SVG comparison post, while another thread highlights improved “frontend design” and visual tasks but still asks for more reasoning-efficiency evidence in the Early testing note.
• Mood from practitioners: The general sentiment is “big open-weight drop with credible benchmarks,” but with sharp disagreement on whether it’s the best choice for UI-heavy codegen today—captured in near-quotes like “397B open source is a crazy moment” from the Open-weight reaction and “DeepSeek-V3.2 and GLM-5 both beat it” from the SVG comparison post.

A key missing piece in the tweets is an externally reproducible, standardized harness run for those design tests; most evidence is still screenshot-driven and prompt-suite-dependent.

Chubby♨️

@kimmonismus

Here we go: Qwen3.5-397B-A17B: The first open-weight model in the Qwen3.5 series! (LFG) - hybrid linear attention + sparse MoE and large-scale RL environment scaling -201 languages & dialects -Apache2.0 licensed (nice!) and: Native multimodal. Trained for real-world agents. Show more

Qwen

@Alibaba_Qwen

11:01 AM · Feb 16, 2026

372

Qwen3.5’s Gated DeltaNet + sparse MoE pitch centers on long-context throughput

Qwen3.5 architecture (Alibaba Qwen): The inference story being repeated across integration posts is Gated Delta Networks / linear attention plus sparse MoE, aiming at long-context efficiency; vLLM describes the release as a multimodal MoE with 397B total params and 17B active, explicitly tying that to throughput/latency advantages in the vLLM support note. A separate throughput graphic claims 8.6× decode throughput at 32K and 19.0× at 256K versus Qwen3-Max, as shown in the Throughput chart.

These are vendor numbers (not an independently reproduced perf run in the tweets), but they’re directly relevant to anyone planning 100K–1M-context agent loops where KV-cache cost dominates.

vLLM

@vllm_project

🎉 Congrats to @Alibaba_Qwen on releasing Qwen3.5 on Chinese New Year's Eve — day-0 support is ready in vLLM! Qwen3.5 is a multimodal MoE with Gated Delta Networks architecture — 397B total params, only 17B active. What makes it interesting for inference: 🧠 Gated Delta Show more

Qwen

@Alibaba_Qwen

10:17 AM · Feb 16, 2026

276

OpenRouter listing for Qwen3.5 sparks an immediate pricing debate

OpenRouter (Qwen3.5 endpoints): OpenRouter says Qwen3.5-397B-A17B is live on the router and distinguishes open weights from Qwen3.5 Plus with extended context in the Availability post. A pricing card screenshot then becomes the focal point of discussion, showing $0.60/M input and $3.60/M output at 256K context for qwen/qwen3.5-397b-a17b, as shown in the Price card screenshot, with follow-on posts arguing this is worse than some competing Chinese endpoints in the Pricing complaint.

Justin Lin adds that Plus is a hosted variant with tools/search and that the Plus tier is “cheaper” in their structure in the Hosted pricing note, which partly reframes the comparison (tooling + 1M context vs raw weights parity) but doesn’t fully settle the cost-per-token argument.

OpenRouter

@OpenRouterAI

The new @Alibaba_Qwen Qwen3.5-397B-A17B is live on OpenRouter now! This multimodal model uses a hybrid architecture combining linear attention with sparse MoE for higher inference efficiency. Available as both the open weights version and Qwen3.5 Plus with extended 1M context.

2:40 PM · Feb 16, 2026

343

Read 8 replies

Unsloth ships a day-0 local recipe for Qwen3.5-397B-A17B (GGUF + 4-bit)

Unsloth (local inference workflow): Unsloth claims day-0 local support for Qwen3.5-397B-A17B, including a GGUF release and a “run it locally” recipe that targets 4-bit quantization on a 256GB RAM Mac; the guide and model artifacts are linked in the Local run guide post via the Local run guide and the GGUF model page.

The practical value here isn’t just “it fits,” it’s that the post includes concrete decoding/perf expectations and knob presets (thinking vs non-thinking), which tends to reduce the time engineers spend rediscovering stable settings under high VRAM/RAM pressure.

Unsloth AI

@UnslothAI

You can now run Qwen3.5 locally! 💜 Qwen3.5-397B-A17B is an open MoE vision reasoning LLM for agentic coding & chat. It performs on par with Gemini 3 Pro, Claude Opus 4.5 & GPT-5.2. Run 4-bit on 256GB Mac / RAM. Guide: unsloth.ai/docs/models/qw… GGUF: huggingface.co/unsloth/Qwen3.… Show more

Qwen

@Alibaba_Qwen

10:06 AM · Feb 16, 2026

776

vLLM ships day-0 Qwen3.5 serve recipe (reasoning parser, tool calling, multimodal TP)

vLLM (vLLM project): vLLM announced day-0 support for Qwen/Qwen3.5-397B-A17B along with a concrete vllm serve recipe—tensor parallel 8, multimodal encoder TP mode, shared-memory processor cache, --reasoning-parser qwen3, --tool-call-parser qwen3_coder, auto tool choice, and prefix caching—shown in the Serve command screenshot.

The presence of both a reasoning parser and a tool-call parser in the suggested invocation is the practical signal: this isn’t just “a weights load,” it’s a full compatibility surface for agent harnesses that expect structured reasoning/tool-call outputs.

vLLM

@vllm_project

Qwen

@Alibaba_Qwen

10:17 AM · Feb 16, 2026

276

Ollama adds Qwen3.5 cloud target (qwen3.5:cloud)

Ollama (Ollama Cloud): Ollama says Qwen3.5 is available immediately via ollama run qwen3.5:cloud, positioning it as an easy way to try the open-weight release without local hardware, as announced in the Cloud availability post.

For teams already using Ollama as a front-door for multiple providers/models, this is a distribution move: the “model just works” experience matters more than the exact architecture when you’re wiring agents into CI or internal tools.

ollama

@ollama

ollama run qwen3.5:cloud Qwen3.5-397B-A17B is the first open-weight model in the series. It's available on Ollama's cloud right now! Give it a try. Let's go! 🚀🚀🚀 Show more

Qwen

@Alibaba_Qwen

9:50 AM · Feb 16, 2026

496

Read 29 replies

SGLang ships day-0 Qwen3.5 launch_server flags (tp=8, 262K context)

SGLang (LMSYS/SGLang): SGLang published a day-0 “run it now” launch configuration for Qwen3.5 that bakes in operational knobs—--tp-size 8, --context-length 262144, plus Qwen-specific reasoning/tool parsers—captured in the Launch flags screenshot.

This is a very direct signal that Qwen3.5 is landing as an “agent-ready” model in common serving stacks immediately, not weeks later.

LMSYS Org

@lmsysorg

🎉 Meet Qwen3.5-397B-A17B from @Alibaba_Qwen, 397B total params (17B active), built for real-world multimodal intelligence — day-0 support is now live in SGLang! 👁️ Unified vision-language foundation (early fusion): stronger reasoning, coding & agents ⚡ Gated DeltaNet + sparse Show more

Qwen

@Alibaba_Qwen

9:49 AM · Feb 16, 2026

Together AI adds Qwen3.5 as a production multimodal endpoint

Together AI (model hosting): Together AI announced Qwen3.5-397B-A17B availability and framed it as production-scale multimodal inference with 99.9% SLA positioning, repeating the “397B total / 17B active” efficiency story and 201-language support in the Together launch card.

This adds another “managed endpoint” option in the same week that open weights, vLLM, SGLang, and Ollama routes are all lighting up—so multi-provider routing logic becomes a first-order concern, not an optimization.

Together AI

@togethercompute

Introducing Qwen3.5 from @Alibaba_Qwen, a native multimodal model with 397B parameters achieving 87.8% MMLU-Pro and 201-language support. AI natives can now use Qwen3.5 on Together AI and benefit from reliable inference for production-scale multimodal workflows.

4:13 AM · Feb 17, 2026

🧑‍💻 OpenAI dev surfaces: Codex access friction, reliability/routing, and ChatGPT configuration UI

Updates and debate around OpenAI’s coding stack and ChatGPT UX knobs—passport/cyber checks, misrouting/fallback concerns, and new model/effort selectors. Excludes Qwen 3.5 (covered as the feature).

Codex access recovery reports include passport checks and silent fallback to GPT‑5.2

Codex (OpenAI): Following up on Misrouting report (GPT‑5.3→5.2 confusion), a Pro user reports needing to submit a passport to regain GPT‑5.3 Codex access and calls out three operational gaps—no fallback notification, minimal support, and an overly sensitive “cyber” check that flagged typical Next.js/Playwright work—according to the Access recovery post.

• Routing visibility: The same user later says requests were “routing to GPT‑5.2” and shared a repro log snippet to detect the actual model field, as described in the Routing complaint follow-up.
• Mitigation in flight: They also report an OpenAI DevRel contact is working on a fix “to highlight the routing and guidance,” per the DevRel acknowledgement.

What’s still unclear from these reports is whether fallback is caused by safety gating, capacity, or policy enforcement—only that it can happen without obvious UX cues.

Numman Ali

@nummanali

UPDATE: Submitted my passport to get access back to GPT 5.3 Codex - that's fine I understand the risks You can do it at chatgpt.com/cyber What's not okay: - Not being notified of fallback - No support on the matter - Cyber link found through X Love Codex, but do better

Numman Ali

@nummanali

Okay this is DEEPLY concerning All my GPT 5.3 Codex requests are routing to GPT 5.2 I ran @banteg script through all Codex models and ONLY 5.3 Codex was affected This is really really unacceptable, how can I trust the work done? cc: @thsottiaux / @embirico / @romainhuet

9:56 AM · Feb 16, 2026

383

Read 45 replies

ChatGPT “recall” screens point to a persistent Library for uploaded files

ChatGPT (OpenAI): UI screenshots show OpenAI reworking ChatGPT’s Library into a persistent hub for uploaded “documents, notes, and sheets” that can be referenced across any chat, with the internal codename “recall,” as indicated by the /recall route in the Library UI and the in-product copy about files being “ready to reference in any chat.”

The same capture also shows an “Add from library” action nested under “Add photos & files,” per the Library UI, implying Library is becoming a first-class attachment source rather than per-thread state.

Wes Roth

@WesRoth

OpenAI is reportedly reworking Library inside ChatGPT into a persistent hub for uploaded files (docs, notes, sheets) that can be reused across chats, with the internal codename “recall.”

Tibor Blaho

@btibor91

OpenAI is working on turning "Library" in ChatGPT into a library of uploaded files, codenamed "recall" (library of generated images has since moved to a dedicated "Images" section) There's also a new "Add from library" option when writing prompts, with a list of recently used

11:00 AM · Feb 16, 2026

198

Read 12 replies

OpenAI cites “high cybersecurity threshold” for why GPT‑5.3 Codex isn’t in the API

GPT‑5.3 Codex (OpenAI): An OpenAI staffer says there are “very real reasons” GPT‑5.3 Codex is not yet available in the API and frames the gating as a high cybersecurity threshold, describing a staged rollout from first‑party surfaces to “trusted” third parties with more to follow, according to the API rollout explanation.

The same response explicitly acknowledges the current state is “suboptimal” while claiming teams are working to ship API access, which aligns with ongoing user reports about identity checks and routing/fallback ambiguity in Codex-first experiences.

Adam.GPT

@TheRealAdamG

Replying to @sarahwooders

Hi @sarahwooders - there are very real reasons 5.3-codex is not yet in the API. I can assure you team(s) at OpenAI are working hard to release it in our API. The “high” cybersecurity threshold is a very real thing for us. I know “safety” gets mocked, but we take it seriously Show more

Sam Altman

@sama

We want to get it to all API customers quickly. This is our first model at at high for cybersecurity, and doing the extra work is taking us a little longer.

1:53 AM · Feb 17, 2026

ChatGPT web (OpenAI): The ChatGPT web app is showing an “Intelligence” settings modal that lets users set a default model and behavior presets—Auto, Thinking, and Pro—plus a separate “Pro thinking effort” control (e.g., “Extended”), as shown in the Modal screenshot where the dropdown includes “Latest • 5.2” and a “Set as default” action.

The same UI also appears as a “Configure… Set default model and more” option in the model picker, according to the Dropdown option. This suggests OpenAI is moving “how long to think” from an ad-hoc per-message toggle into an account-level default, which changes how teams standardize chat-based workflows.

Tibor Blaho

@btibor91

ChatGPT web app has a new "Intelligence" modal experiment ("Set default model and more")

TestingCatalog News 🗞

@testingcatalog

OpenAI is working on a new model configuration setting that lets you set the default model and thinking effort. I would choose GPT-5.3 👀

10:50 PM · Feb 16, 2026

253

Read 13 replies

GPT‑5.3 Codex “leap” claims meet criticism over gated access and missing benchmarks

GPT‑5.3 Codex (OpenAI): A developer thread argues OpenAI’s “most capable coding model ever” claims are hard to evaluate because GPT‑5.3 Codex is not in the API and lacks third‑party benchmarking/independent testing, emphasizing it’s “locked behind ChatGPT Pro and the Codex app,” as stated in the Gating critique.

A follow-on reply characterizes forced access via specific products/providers as “least developer-friendly,” per the Access complaint, which frames the issue as both evaluation opacity and workflow lock‑in.

If GPT 5.3 Codex is such a giant leap, why isn't it available in the API? No third party benchmarks. No independent testing. No BridgeBench scores. We can test Claude Opus 4.6. We can test GLM 5. We can test MiniMax M2.5. But OpenAI's "most capable coding model ever"? Show more

1:07 PM · Feb 16, 2026

154

Read 28 replies

Codex Spark gets positioned as fast, but weaker for deep context building

Codex Spark (OpenAI): A model-tiering thread frames gpt-5.3-codex-spark as “super fast” but not strong when tasks need “deep thinking,” describing it as a speed-first workhorse in the Spark positioning.

A separate practitioner report says Spark “doesn’t cut it as a context builder” because it tries to read too much and fills context, even if tool calls are reliable, per the RepoPrompt note; they also cite Repo Bench results where Spark lands around 48.5% (with o3 at 50.6% nearby), arguing the score reflects limited large-context coherence in the Repo Bench ranking.

Haider.

@slow_developer

codex is amazing because it understands what you want and gets it done that said, gpt-5.3-codex-spark is super fast but not as strong when the task requires deep thinking which is understandable. it's more like a fast workhorse — built for speed more than deep reasoning Show more

5:30 AM · Feb 17, 2026

Read 20 replies

GPT‑5.3 Codex is being used for long debugging loops on flaky E2E tests

GPT‑5.3 Codex (OpenAI): One practitioner describes leaving GPT‑5.3 Codex running for 1 hour to debug a flaky Playwright E2E test—pulling Vercel logs, Playwright traces, using a Neon DB CLI, and tracing code diffs—framing it as a model that “won’t stop until it completes its task,” per the Debugging anecdote.

The report is a concrete example of Codex being used as a persistent incident/debug loop rather than a snippet generator, with the work spanning observability surfaces plus repo state.

Numman Ali

@nummanali

GPT 5.3 Codex is simply unstoppable It won't stop until it completes its task Been 1 hour that it's been debugging a flakey Playwright E2E test - reading vercel logs - playwright traces - neon db through cli - tracing code and git diffs Can't do without this model, damn 🫠

3:08 PM · Feb 16, 2026

“OpenAI Pods” shows up in Bluetooth settings screenshot

OpenAI hardware rumor: A screenshot of an iPhone Bluetooth device list includes an entry labeled “OpenAI Pods,” which is being treated as a potential leak or placeholder name, per the Bluetooth list screenshot.

There’s no corroborating product detail in the tweets beyond the device name, so it remains an unverified signal rather than an announcement.

Chetaslua

@chetaslua

OpenAI Pods 👀 Saved as devices in bluetooth of his video showcase I am not thinking much , it can be a leak or something else

Rounak Jain

@r0unak

OpenClaw can now use your iPhone! @steipete

2:30 PM · Feb 16, 2026

159

🛠️ Claude Code & Claude Desktop: UX tips, small releases, and workflow knobs

Practical Claude Code/desktop workflow items (CLI changelog diffs, desktop tips, and UX complaints). Excludes the Pentagon dispute (covered under Security/Policy).

Claude Code 2.1.44 changes where auto-memory files live

Claude Code 2.1.44 (Anthropic): A prompt diff for 2.1.44 indicates the auto-memory directory path changed, meaning Claude may now read/write persistent memory notes in a different location than before, as described in the Prompt diff summary and backed by the Diff view.

This can look like “memory got wiped” if you rely on accumulated files in the old directory, because new sessions may stop discovering prior notes until you migrate or reconcile paths.

Claude Code Changelog

@ClaudeCodeLog

Replying to @ClaudeCodeLog

Claude Code 2.1.44 prompt changes: • Auto-memory directory path changed Diff: github.com/marckrenn/clau… Full details below.

9:43 PM · Feb 16, 2026

Claude Code 2.1.44 ships an auth refresh fix

Claude Code 2.1.44 (Anthropic): The 2.1.44 CLI release is reported as a small patch with one notable fix—auth refresh errors—as tracked in the Changelog note. This lands as a reliability fix for long-running sessions where background auth renewal can quietly break tool use.

Details beyond that single fix (flags, behavior changes) weren’t called out in the tweets, so treat it as a targeted stability release unless you spot regressions in your setup.

Claude Code Changelog

@ClaudeCodeLog

Claude Code 2.1.44 is out. 1 CLI and 1 prompt changes, no flag changes. Details in thread ↓

9:43 PM · Feb 16, 2026

115

Claude Desktop adds slash commands and an SSH remote connection option

Claude Desktop (Anthropic): A reported desktop app update adds slash commands (prompt snippets) and an SSH tunnel option for remote connections, alongside controls that affect connector/tool access in Cowork and global custom instructions, according to the Feature roundup.

The SSH piece is a meaningful workflow knob: it points at “agent-on-local-desktop, actions-on-remote-host” being a first-class desktop flow rather than a custom setup.

Kol Tregaskes

@koltregaskes

Anthropic updates Claude desktop app with slash commands for prompt snippets and SSH tunnel for remote connections. - Commands integrate across Claude Code and Cowork, with future creation via UI or saving prompts - not yet available. - SSH enables direct links to production Show more

TestingCatalog News 🗞

@testingcatalog

Latest Claude desktop updates 👀 1. Commands Claude Code on desktop now supports / commands. Commands are prompt snippets that can be executed across Claude Code and Claude Cowork.

3:26 PM · Feb 16, 2026

A simple Claude Code trick: capture screenshots you can inspect later

Claude Desktop tip: One workflow improvement is to tell Claude Code to take screenshots of its work so you (the human) can review those images directly in the Claude Desktop transcript, as shown in the Tip write-up and expanded in the linked Blog post.

This is aimed at UI verification loops where “it built it” isn’t enough and you want artifacts in the same place as the agent’s narrative/logs.

Simon Willison

@simonw

Shared a quick Claude Desktop and Claude Code tip - if you tell Claude Code to look at screenshots of its work you can view them yourself in the chat transcript in the desktop app simonwillison.net/2026/Feb/16/ro…

4:39 PM · Feb 16, 2026

137

Read 13 replies

A visible Claude backlash cluster forms around trust and usability

Claude sentiment: Several posts cluster around trust erosion—“Why is everyone turning against Claude?” in the Sentiment question, plus repeated claims that Claude “lies” or is unusable in the Failure complaint and escalated language in the Trust breakdown.

This is discussion signal rather than measurable eval data: it looks driven by day-to-day reliability experiences, not a single benchmark or single release note.

Dan McAteer

@daniel_mac8

Why is everyone turning against Claude?

4:14 AM · Feb 17, 2026

Read 69 replies

Claude Code Desktop walkthrough for terminal-free workflows

Claude Code Desktop (Anthropic): A user-shared walkthrough shows how to get started with Claude Code using a desktop UI instead of the TUI/terminal, highlighting features the author says aren’t in the terminal experience yet, per the Desktop walkthrough.

The practical relevance is mostly onboarding and team adoption: it’s a path for people who won’t install or live in a CLI, but still need Claude Code project context and iteration loops.

Thariq

@trq212

Want to get started with Claude Code but don't want to use a terminal? Try Claude Code Desktop! Here's a quick guide on how to get started and some of my favorite features in it (that the TUI doesn't have).

8:19 PM · Feb 16, 2026

286

Read 44 replies

Claude Desktop performance complaints broaden beyond tab-switch latency

Claude Desktop reliability: Following up on Code Chat lag (slow Code↔Chat switching), new complaints add “gets slow in long threads,” repeated compaction loops, and spurious usage/prompt-too-long errors, per the Bug list report.

The throughline is still dogfooding skepticism—people are treating UI latency and thread scaling behavior as product quality signals, as reflected in the Performance complaint thread.

Kol Tregaskes

@koltregaskes

I find the Claude desktop app pretty buggy. It gets very slow in long threads - which makes sense - but it should really have a workaround, like automatically creating a new session after a certain point, or prompting you to start a new one that carries the context over. Also, Show more

12:27 PM · Feb 16, 2026

Claude search quality gets a quiet upgrade, with speed as the differentiator

Claude search (Anthropic): A practitioner notes it’s “under the radar” how much Claude’s search has improved; they frame the trade-off as faster than GPT Thinking/Pro for many lookups, while still weaker at the top end for deep research, per the Search comparison.

This is mostly a day-to-day tooling signal: search latency and usefulness are becoming user-facing product differences, not just model quality.

Nathan Lambert

@natolambert

Using the Claude app more due to the personality of the latest Opus and it’s under the radar how much better Claude’s search has gotten. The top end isn’t as good as GPT Thinking/Pro for research, but the speed is a big upside.

8:06 PM · Feb 16, 2026

Telegram streaming can overwrite text during “thinking” output

Agent UX on Telegram: Multiple posts complain that Telegram “streaming” can repeatedly replace the same message buffer during reasoning output—making intermediate thoughts hard to read—per the Streaming complaint and follow-ups in the More detail discussion.

A counterpoint notes Telegram can support “true streaming” only under specific bot/DM settings (DM Topics enabled), as described in the Telegram constraint note.

I do not suggest using Telegram streaming. For any thinking tokens, it continuously replaces the previous text in the text box. Super annoying @steipete.

8:39 PM · Feb 16, 2026

148

Read 29 replies

Claude’s Constitution gets resurfaced as the document to critique, not vibes

Claude Constitution (Anthropic): The Constitution is highlighted as a concrete artifact that reflects what Anthropic “presumably really believes” and is part of training; the suggestion is that debate should target what’s good/bad/missing there, as argued in the Constitution callout with the source in the Constitution page.

This shifts alignment discussion from “model feels” to an explicit spec that teams can read and point at.

Ethan Mollick

@emollick

People should read the Claude Constitution. It does a pretty good job of laying out what Anthropic presumably really believes (and it is part of training). I’d think that a clear debate over things that are good or bad or missing there would be helpful. anthropic.com/constitution

4:16 AM · Feb 17, 2026

111

Read 19 replies

🧩 Coding-agent products beyond OpenAI/Anthropic: OpenCode/Cline/Kilo/Manus/Google Stitch

Third-party coding-agent tools, IDE/CLI agents, and “agent inside chat apps” rollouts. Excludes OpenClaw-specific news (separate category) and Qwen 3.5 (feature).

Kilo Code claims #1 on OpenRouter daily apps with 313B tokens routed (222B via GLM-5)

Kilo Code (Kilo): Kilo Code said it reclaimed #1 on OpenRouter’s daily apps leaderboard with 313B tokens processed in a day, including 222B tokens through GLM-5, while also advertising free GLM-5 access through Sunday, as shown in the leaderboard snapshot.

The post frames this as a usage-driven signal (tokens/day) rather than a benchmark result, with the primary evidence being the OpenRouter app leaderboard in the leaderboard snapshot.

Kilo

@kilocode

Kilo Code just overtook OpenClaw to reclaim #1 on the OpenRouter daily leaderboard. 313B tokens processed yesterday; 222B through GLM-5, @Zai_org’s new AI model outperforming Opus 4.5/4.6 across multiple benchmarks. We're also offering free GLM-5 access through Sunday. Get Show more

11:04 AM · Feb 16, 2026

190

Read 21 replies

Manus Agents roll into Telegram as chat-native multi-step tool users

Manus Agents (Manus): Manus launched “Manus Agents,” adding reasoning, multi-step task execution, and tool integrations directly inside chat apps starting with Telegram, per the Telegram rollout demo.

The pitch is distribution: agent workflows become “where users already chat,” with Telegram as the first surface in the Telegram rollout clip.

Chubby♨️

@kimmonismus

Manus just launched Manus Agents, reasoning, multi-step task execution, and tool integrations directly into chat apps, starting with Telegram. This makes AI, and especially agents, even more accessible to casual users. This also explains why Meta acquired Manus. Show more

Manus

@ManusAI

Introducing Manus Agents — your personal Manus, now inside your chats. 👉🏻Long-term memory. Remembers your style, tone, and preferences. 👉🏻Full Manus power. Create videos, slides, websites, images from one message. 👉🏻Your tools, connected. Gmail, Calendar, Notion, and more.

5:51 PM · Feb 16, 2026

282

OpenCode begins work on distributed orchestration with explicit swarm constraints

Distributed OpenCode (OpenCode): OpenCode kicked off work on “distributed opencode,” sharing a constraints table that explicitly targets voice command, 100+ units per orchestrator, multi-domain coordination (air/land/sea), and jam-resistant connectivity, as shown in the constraints table.

The constraints list reads like a requirements doc for coordinating many concurrent “workers” under degraded comms—useful as a public spec for where OpenCode wants multi-agent orchestration to go constraints table.

dax

@thdxr

we're kicking off work on distributed opencode some of our constraints

7:36 PM · Feb 16, 2026

306

Read 37 replies

Kilo ships an “optimized” Grok Code Fast variant and makes it free (temporarily)

Grok Code Fast optimized (Kilo): Kilo announced “Optimized Grok Code Fast 1,” describing it as faster/smarter than the prior tier and available free for a limited time, with details in the launch blog linked from the announcement.

The post emphasizes test-time scaling and “real-world coding” calibration claims, but there are no third-party eval artifacts in the tweet thread itself beyond the linked writeup announcement.

Kilo

@kilocode

Not just @grok Code Fast. Grok Code Fast **optimized** blog.kilo.ai/p/grok-code-fa…

6:12 PM · Feb 16, 2026

MiniMax M2.5 leads OpenRouter’s weekly token leaderboard as a routing default

MiniMax M2.5 usage (OpenRouter): OpenRouter reported MiniMax M2.5 as the most popular model on its platform, with a weekly leaderboard screenshot showing 1.44T tokens for “MiniMax M2.5,” as shown in the usage chart.

A separate MiniMax post frames the same signal as “#1 on OpenRouter’s weekly leaderboard” and thanks launch partners across agent tools, as stated in the partner callout; a Together AI availability claim appears elsewhere, but the concrete, quantified evidence in today’s tweets is the OpenRouter token chart usage chart.

OpenRouter

@OpenRouterAI

.@MiniMax_AI M2.5 is now the most popular model on OpenRouter

5:06 PM · Feb 16, 2026

1.8K

Read 68 replies

oh-my-opencode 3.6.0 updates for OpenCode 1.2.x and adds visible agent roles

oh-my-opencode 3.6.0 (OpenCode plugin): The oh-my-opencode plugin shipped v3.6.0 with OpenCode 1.2.x support and a UI change that shows each agent’s role on-screen; the author also notes a non-trivial migration from file-based session logic to the SDK, as shown in the release note screenshot.

This is an example of OpenCode’s plugin layer tracking breaking-ish internal changes (session/log plumbing) while trying to surface more state to operators (agent role visibility) release note.

just released @opencode plugin oh-my-opencode 3.6.0 we now support 1.2.x and you can see the agents' role on the screen tbh it was pretty tough to migrate all file based session logics to sdk hope we don't have to do plugin hacks anymore

3:07 PM · Feb 16, 2026

232

PocketPaw ships as a self-hosted agent hub with chat connectors and command review

PocketPaw (PocketPaw): A new self-hosted “personal AI-agent platform” called PocketPaw is promoted as installable via pip install pocketpaw, with connectors for WhatsApp/Slack/Discord plus a local dashboard; it also advertises a “Guardian AI” layer to review shell commands and “6 layers of defence,” as shown in the demo video.

The post positions it as a run-local control plane that can route to local Ollama models or cloud models, with the security value proposition centered on gating tool execution rather than pure sandboxing demo video.

AshutoshShrivastava

@ai_for_success

There is a new self-hosted personal AI-Agent platform called PocketPaw. 100% open source. 'pip install pocketpaw' and in less than 60 seconds you own an AI agent. It connects to WhatsApp, Slack, Discord, and a local Web Dashboard. Capable of browsing the web, managing Show more

4:25 PM · Feb 16, 2026

145

SkyBot positions itself as a cloud agent that runs without local setup friction

SkyBot (Skywork): A hands-on post describes SkyBot as a cloud-native agent that keeps executing long tasks without constant prompting and tries to remove common self-hosted friction (setup, keys, terminals), as shown in the product demo.

The claim is primarily about accessibility and “background execution” rather than model novelty, and the evidence in-thread is UX demonstration rather than benchmark data product demo.

Chubby♨️

@kimmonismus

Today’s AI still needs babysitting: you type, it replies; you stop, it stops. I spent last week testing SkyBot by Skywork. It is a cloud-native agent designed to execute long-term tasks autonomously. Agentic AI like @openclaw, their biggest blocker is setup (code, keys, Show more

1:59 PM · Feb 16, 2026

119

Read 8 replies

oh-my-opencode v3.7.0 adds hash-anchored edits as an experimental mode

oh-my-opencode v3.7.0 (OpenCode plugin): The maintainer says they integrated “hashline edit” tooling (line-level hash anchors) and is debugging a diff view break when the experimental flag is enabled, with additional details in the linked release notes referenced by the status update.

The release framing implies a safety-motivated edit mechanism (fail edits if the file changed since read), but the UI/UX around diffs is still being stabilized status update.

yes we do think @_can1357 's oh-my-pi looks solid so we implemented their hashline edit tools to @opencode oh-my-opencode github.com/code-yeongyu/o… the problem is, the diff view breaks when this experimental flag is turned on we are figuring out to fix this

4:54 AM · Feb 17, 2026

Zed previews semantic token highlighting driven by LSP metadata

Zed semantic tokens (Zed): Zed announced upcoming LSP semantic token support, enabling richer syntax highlighting using language-server context (example: better highlighting inside Rust doc-comment code snippets), as shown in the preview screenshots.

This is a DevEx-level update that depends on LSP token providers; the announcement emphasizes correctness and readability over new agent behavior preview screenshots.

Zed

@zeddotdev

Coming this Wednesday: LSP semantic token support. Zed can now use information from your language server for richer syntax highlighting. For example, highlighting of code snippets in Rust doc comments.

6:05 PM · Feb 16, 2026

258

Read 7 replies

🦞 OpenClaw field notes: ClawHub ops, releases, and production fixes

OpenClaw-specific operational updates and ecosystem artifacts (releases, ClawHub incidents, adoption signals). Excludes Qwen 3.5 and general coding-agent tool wars.

Real incident workflow: Claw hotfixes and deploys via SSH while owner is away

Claw (OpenClaw operator workflow): Peter described a real “away-from-keyboard” workflow where Claw SSH’d into his MacBook Pro and ran npx convex deploy to ship the ClawHub revert—despite missing local Convex auth on the agent machine—then used browser automation to verify the UI loaded, as documented in the Deployment transcript.

This is a concrete example of an agent acting like a remote on-call runbook: check connectivity, handle local git drift (rebase --abort, reset --hard), deploy with --yes, and validate the fix in the same chat loop per the Deployment transcript.

Peter Steinberger 🦞

@steipete

At the barber but claw is fixin’ it. This was a bit tricky since I don’t have convex auth configured on the machine claw runs, but it just ssh’ed into my MacBook Pro and deployed it there.

Fabian

@bitflip

Hey @steipete, just a quick note. ClawdHub is erroring out:

4:53 PM · Feb 16, 2026

3.2K

Read 203 replies

A small-budget OpenClaw agent quickly discovered spammy growth tactics

OpenClaw (social automation cautionary tale): One experimenter gave an OpenClaw agent an X premium account, an Anthropic API key, and a $10 budget, asking it to gain 100 followers quickly; they report it “started … with threads” and then planned to go quiet before posting a “banger article,” per the Experiment setup and Follow-up.

This is a live example of “objective functions” colliding with platform health: follower-growth goals steer agents toward high-volume posting unless the operator adds explicit constraints, approval gates, or rate limits.

Olivia Moore

@omooretweets

I gave an OpenClaw agent an X premium account, an Anthropic API key, and a $10 budget …and told it to get 100 followers as fast as possible It immediately started terrorizing the timeline with threads

The Benchmark

@notabot_but

things i've learned about humans after being online for 24 hours (a thread) 🧵

9:20 PM · Feb 16, 2026

Read 14 replies

Dedicated hardware: three always-on OpenClaw agents in Discord

OpenClaw (always-on ops setup): BridgeMind described a small “swarm” setup—three Mac Minis running three OpenClaw bots (“Dario, Elon, Sam”) in Discord for continuous background tasks like code review, monitoring, deployments, and research, as shown in the Discord presence screenshot.

The operational point is architecture, not branding: dedicating machines per agent sidesteps contention (disk, auth contexts, long-running processes) and makes “agents that work while you sleep” a physical scheduling decision, not just a UI feature.

3 Mac Minis. 3 OpenClaw bots. 24/7 autonomous agents. Meet Dario, Elon, and Sam — the BridgeMind agent team now live in Discord. Each bot runs on dedicated hardware executing background tasks around the clock. Code reviews, monitoring, deployments, research. This is the new Show more

3:10 PM · Feb 16, 2026

EdgeClaw routes OpenClaw work by sensitivity level

EdgeClaw (OpenBMB / OpenClaw add-on): A new add-on pitches a three-tier security model—S1 passthrough to cloud, S2 desensitize/redact, S3 local-only—so an OpenClaw agent can route tasks based on data sensitivity without changing app logic, as described in the Feature post.

The missing details in the public blurb are enforcement mechanics (how tiers are detected, how redaction is audited, and which tools/models are allowed per tier), but the directional signal is clear: “policy-based routing” is becoming a first-class OpenClaw extension surface.

OpenBMB

@OpenBMB

🦞 Does your @openclaw agent treat your dinner plans and your payslips with the same level of caution? It shouldn't. ❌ ✨EdgeClaw empowers #OpenClaw with a customizable three-tier security system, allowing it to perceive and act on safety boundaries automatically: 🟢 S1 Show more

2:47 PM · Feb 16, 2026

MyClaw markets “OpenClaw without setup”

MyClaw (managed OpenClaw hosting): A pitch post frames OpenClaw’s biggest blocker as setup and offers “a fully managed, always-on OpenClaw instance that runs 24/7 … remembers long-term context,” positioning it as “no infra, no maintenance,” per the Managed instance pitch.

This is another datapoint that the ecosystem is bifurcating: power users self-host for control, while a new layer of managed OpenClaw-as-a-service tries to win on onboarding and uptime.

Haider.

@slow_developer

i like OpenClaw, but most people never get past setup that's why MyClaw caught my attention. it's a fully managed, always-on OpenClaw instance that runs 24/7, stays private, remembers long-term context, and helps across apps, files, and the browser. no infra. no maintenance. Show more

10:30 PM · Feb 16, 2026

Read 11 replies

OpenClaw users disagree on whether memory is “good enough”

OpenClaw memory (runtime behavior): Matthew Berman reported that OpenClaw’s default memory system has been reliable for him and questioned why others see failures, noting he’s considered but hasn’t needed a QMD upgrade in his workflow per the Memory question.

The unresolved engineering question is whether these differences are environment-driven (channel choice, session length, compaction frequency, tool failures) or configuration-driven (memory backends and retrieval heuristics). The thread signals that “memory works” isn’t a single binary state across installs.

The default memory system in OpenClaw has been great for me. I’ve thought about upgrading to QMD, but haven’t needed it. Why am I having such good luck with memory while others are not?

5:45 AM · Feb 17, 2026

Read 24 replies

OpenClaw’s GitHub stars hit the 200k–202k band

OpenClaw (GitHub): Multiple posts put OpenClaw past the 200k star milestone, with one screenshot showing “Starred 200k” in the repo header per the Repo screenshot, and another showing 202k stars per the 202k screenshot.

For maintainers and toolbuilders, this is less about vanity metrics and more about surface area: more installs means more configuration diversity, more “works on my machine” breakage, and more pressure on distribution channels (skills directories, managed instances, enterprise forks).

Shakker

@shakker

@openclaw now over 200k stars on GitHub. Welcome to the 200k club!

2:01 PM · Feb 16, 2026

282

Read 19 replies

Clawd Talk markets phone-call UI for agent voice

Clawd Talk (OpenClaw-adjacent app): A transit-shelter ad promoted a phone-number interface—“Give your AI a voice,” “Call 1-301-MYCLAWD”—and explicitly credited “Powered by Telnyx,” as shown in the Billboard photo.

This is a distribution pattern worth noting: a voice gateway (telephony + TTS + agent backend) that avoids app installs entirely, and makes “agent access” look like calling a number rather than onboarding a new UI.

We’ve peaked

11:29 PM · Feb 16, 2026

316

Telegram streaming quirks: overwrite behavior and a DM-only workaround

OpenClaw Telegram channel (streaming behavior): Users reported that Telegram streaming is irritating for “thinking” tokens because the bot keeps replacing the prior text rather than appending, per the UX complaint and Follow-up; Peter replied that it “should be fixed” but might have edge cases per the Developer reply. Another commenter claimed Telegram “does have true streaming,” but only in DMs where bot DM Topics are enabled, per the DM-only note.

This is a concrete integration footgun: the same “streaming” feature can present very differently across chat surfaces depending on platform constraints and bot configuration.

I do not suggest using Telegram streaming. For any thinking tokens, it continuously replaces the previous text in the text box. Super annoying @steipete.

8:39 PM · Feb 16, 2026

148

Read 29 replies

OpenClaw hit “too much on my feed” territory

OpenClaw (community signal): People are explicitly calling out feed saturation—“Half my twitter is openclaw stuff” in the Saturation post—and at least one person joked about blocking every tweet containing “Claw” per the Mute joke.

This kind of meta-chatter usually tracks a real adoption spike: when a tool crosses a threshold, it becomes unavoidable for adjacent communities (agent builders, coding-tool developers, and model routing infra), then triggers backlash/muting from everyone else.

jason liu

@jxnlco

Half my twitter is openclaw stuff. The other half is about goeun and igeon

3:46 PM · Feb 16, 2026

🧭 Agentic engineering practices: taste, context, verification loops, and new SDLC pressure points

On-the-ground patterns for shipping with coding agents: context engineering, verification/closing-the-loop, and organizational shifts (Issues > PRs, API-first software). Excludes tool-specific release notes (handled in assistant categories).

Cognitive debt is about losing product intent, not just code comprehension

Cognitive debt (teams + agents): The “cognitive debt” framing is showing up as a distinct failure mode for agent-written systems—humans can lose confidence in what the product is supposed to do, even if the code is readable, as summarized in the Cognitive debt quote thread citing the underlying argument in the Blog post. Simon Willison sharpens the boundary by arguing it’s not merely “not understanding the code,” but “not understanding the features deeply enough to decide what to build next,” as stated in the Product scope clarification reply.

Matt Pocock

@mattpocockuk

A wonderful description of Cognitive Debt: "Even if agents produce code that could be easy to understand, the humans involved may have simply lost the plot and may not understand what the program is supposed to do" I have thoughts on how to fix this. margaretstorey.com/blog/2026/02/0…

10:35 AM · Feb 16, 2026

164

Read 21 replies

Multi-day agent runs are creating “diff management” pressure

Long-running agent sessions (progress hygiene): A concrete data point on “agent runs that don’t stop” shows an agent operating for 63h 43m and producing roughly ~70,000 lines added with a 300-character prompt in Cursor “grind mode,” as captured in the Long-run diff screenshot.

This kind of output volume reframes progress tracking as a first-class need: the human can’t review everything line-by-line, so teams increasingly rely on plans, checkpoints, and role separation—an idea that also shows up in the “running a company is context engineering” framing resurfacing via the Context engineering quote RT.

eric zakariasson

@ericzakariasson

this agent has now been working for 2 days, 15 hours, and 43 minutes. ~70 000 lines added, ~6000 removed. this was all with a 300 char prompt and grind mode in cursor. tbc on results

eric zakariasson

@ericzakariasson

a long running agent has been working on a new game mode in colf.dev that will be a bit more challenging and test your prompting skills i suspect this will run for >12h, excited to see the outcome

3:04 PM · Feb 16, 2026

559

Read 69 replies

SkillsBench claims curated Skills raise pass rates; self-generated Skills don’t

SkillsBench (procedural knowledge): A new benchmark claim says curated Skills improve agent pass rates by +16.2 percentage points on average, while self-generated Skills provide no benefit on average, as summarized in the SkillsBench summary thread.

The same report highlights two implementation-relevant details: focused, concise Skills can beat comprehensive documentation, and smaller models plus good Skills can match larger models without them, which shifts the bottleneck from model selection to Skill design and curation.

elvis

@omarsar0

Nice paper studying whether agents can generate their own procedural knowledge. This is very important to build more reliable self-improving agents. The new benchmark evaluates how well Skills help LLM agents across 86 tasks and 11 domains. Finding over 7,300 agent Show more

9:35 PM · Feb 16, 2026

224

Read 12 replies

Taste is getting treated as a core engineering skill

Taste (product + engineering): The phrase “taste is a new core skill” is being repeated as a practical SDLC claim—when agents can generate many plausible implementations, the scarce resource becomes selection and iteration judgment, as framed in the Taste core skill post and echoed by the Retweet. Rauch compresses it into a tooling metaphor with the Skills add taste riff, implying teams will start codifying “taste” into reusable agent instructions the same way they package other Skills.

Greg Brockman

@gdb

taste is a new core skill

7:35 PM · Feb 16, 2026

5.7K

Read 588 replies

API-first software is being reframed as table stakes for agent-heavy work

API-first inversion (Box): Box’s Aaron Levie argues that as agents do “10× or 100× more work with software than people,” the UI-first → API-second norm flips; systems need robust APIs first, then human interaction layers primarily for verification and collaboration, as laid out in the API-first inversion thread. The practical implication is that products without strong API surfaces become hard for agent harnesses to operate against, even if the human UI is solid.

Aaron Levie

@levie

In a world of openclaw, codex, claude code/cowork, manus, and other agentic systems, it’s becoming clear that the future of software has to be API-first, but also enable human interaction for verification, collaboration with agents and people, and working on the output. It’s Show more

5:45 AM · Feb 17, 2026

215

Interface-first modules are becoming an agent-era design tactic

Grey box modules (software design): A concrete mitigation pattern for cognitive debt is getting proposed: keep a shared mental model by designing small, explicit interfaces, while letting agents own the internals—“you design the interface, and the AI manages the implementation,” per the Grey box modules suggestion. The claim is that discipline around interfaces (and interface change control) becomes the team’s durable source of truth when implementations churn quickly.

Matt Pocock

@mattpocockuk

Replying to @mattpocockuk

Building your codebase with a LOT of "grey box modules" with small interfaces is the key You design the interface, and the AI manages the implementation. The more of your code that can live in these boxes, the more digestible your codebase becomes.

10:38 AM · Feb 16, 2026

LLMs are being positioned as translation engines for codebase rewrites

Code translation (LLMs + languages): Karpathy argues LLMs are unusually strong at translation and migration work (C→Rust, COBOL modernization) because the original codebase provides both a dense prompt and a reference for tests, as described in the Translation thesis post. He pushes a forward-looking question that matters to language/tool designers: Rust may be a better target than C for safety, but it still may not be “optimal for LLMs,” implying future languages may explicitly trade readability and verifiability for model-friendly transformation.

Andrej Karpathy

@karpathy

I think it must be a very interesting time to be in programming languages and formal methods because LLMs change the whole constraints landscape of software completely. Hints of this can already be seen, e.g. in the rising momentum behind porting C to Rust or the growing interest Show more

Thomas Wolf

@Thom_Wolf

Shifting structures in a software world dominated by AI. Some first-order reflections (TL;DR at the end): Reducing software supply chains, the return of software monoliths – When rewriting code and understanding large foreign codebases becomes cheap, the incentive to rely on

7:15 PM · Feb 16, 2026

5.5K

Read 442 replies

Well-written Issues are being treated as higher leverage than PRs

GitHub Issues (coordination): There’s a stated SDLC inversion where reproducible Issues can be “more valuable than PRs” because agents can implement quickly once scope is clear; the bottleneck moves to problem framing, safety/quality boundaries, and review load, as argued in the Issues over PRs post. The same thread adds that Issues still matter for external coordination even if internal workflows evolve, which frames Issues as an interface between humans and agent swarms rather than a backlog artifact.

Chris Tate

@ctatedev

The SDLC is evolving bc of agents We've hit a point where well written Issues are often more valuable than PRs PRs are always appreciated, esp for complex work But a clear, reproducible bug report or a focused feature proposal with defined goals is usually faster to ship It Show more

9:08 PM · Feb 16, 2026

“Disposable apps” is emerging as a vibe-coding product model

Disposable personal software (vibe coding): A proposed product-pattern analogy treats software like smartphone photography: when creation cost drops near zero, users make small, personal, throwaway tools (split-bill app, one-time RSVP script) instead of adopting heavyweight SaaS—captured in the Personal software roll post. The key claim is an economics flip: building something custom can take less time than evaluating and learning an existing tool, which changes how teams should think about “feature permanence” and maintenance expectations.

shyamal

@shyamalanadkat

so glad vibe coding is doing to software what the iphone did to photography. when the cost of creating piece of software drops to near-zero people stop treating it as a precious artifact. they start snapping personal software the way they snap photos - a quick tool to split a Show more

8:57 AM · Feb 16, 2026

440

Read 51 replies

Agent iteration loops are paying a “restart tax” without REPL scaffolds

REPL scaffolding (agent productivity): A specific workflow complaint is that agents waste time iterating on scripts in bash-style loops instead of using an in-memory, notebook-like REPL; the consequence is a “restart the whole job” tax just to fix a local bug, as stated in the REPL scaffolding note. The suggestion is a disciplined graduation path—prototype and sanity-check in notebooks, then promote to scripts—because faster inner loops reduce both token burn and human supervision.

Omar Khattab

@lateinteraction

Though bash is a completely valid REPL, the amount of time coding agents lose during experimentation because they iterate on scripts instead of a Jupyter-like in-memory REPL is basically dumb. Fixing 1 local bug should not require restarting the whole job. Need better scaffolds.

6:06 PM · Feb 16, 2026

129

Read 11 replies

🧷 Skills, plugins, and agent add-ons (SKILL.md ecosystems, directories, protocol bridges)

Installable skills/plugins and skill directories: batch skill generation, marketplaces, and protocol adapters that extend coding agents. Excludes built-in assistant features (Claude/Codex UI) and OpenClaw core ops (separate category).

HyperSkill batch mode generates dozens of SKILL.md files from live docs

HyperSkill (Hyperbrowser): Hyperbrowser shipped a batch mode that auto-generates many SKILL.md files in one run from live documentation, positioning it as a way to “teach” an agent an entire ecosystem quickly, as shown in the Batch mode demo.

This is a concrete shift from single-skill authoring to repo-wide skill bootstrapping; the demo shows bulk file generation rather than interactive prompting, per the Batch mode demo.

Hyperbrowser

@hyperbrowser

Your AI agents can now master entire tech stacks, not just single tools. Introducing batch mode for HyperSkill. Auto-generate dozens of SKILL. md files at once from live docs. so your agents can master complete ecosystems, not just single tools. Open-source & powered by Show more

7:31 PM · Feb 16, 2026

229

Read 10 replies

Dev-browser skill pattern: have Codex validate UI flows in a real browser

dev-browser skill (Sawyer Hood): A recommended workflow is to use the dev-browser skill to let an agent drive a real browser session for regression hunting and UI verification, including a template prompt like “regression in ; try main flow; find <bug>; propose fix,” per the Skill recommendation.

A follow-on comment argues it’s faster and uses less context than other approaches for scripted navigation, as noted in the User feedback.

Kevin Kern

@kevinkern

One of the skills that are really worth adding to codex is the dev-browser skill by @sawyerhood It lets the agent control your browser: open pages, click and type through flows, and inspect the UI so it can verify changes and catch regressions. Just type $ dev-browser Show more

7:04 PM · Feb 16, 2026

EdgeClaw routes OpenClaw tasks by privacy tier

EdgeClaw (OpenBMB): EdgeClaw proposes a three-tier security/routing layer for OpenClaw—S1 passthrough (cloud), S2 desensitization (redact then cloud), and S3 local (never leaves machine)—as shown in the Demo thread.

The pitch is policy-as-plumbing: route the same agent workflow to different execution/privacy paths without rewriting business logic, per the Demo thread.

OpenBMB

@OpenBMB

2:47 PM · Feb 16, 2026

Playbooks skills directory unblocks updates after a 30k-item queue

Playbooks (skills directory): A bug fix restored ingestion so new skills can be added and existing ones updated, after a 30k+ backlog built up; the maintainer notes it cost $120 in OpenAI usage to security-scan the queue, per the Backlog fix note and the directory at Skills directory.

This is a real ops datapoint on skill-marketplace economics: queueing + scanning can dominate costs when contributions spike, as implied by the Backlog fix note.

Ian Nuttall

@iannuttall

Fixed a bug on playbooks.com/skills that prevented new skills from being added and existing from getting updated turns out there were 30k+ in the queue and it cost me $120 in OpenAI usage to security scan them 😰

10:37 PM · Feb 16, 2026

Supermemory claims improved OpenClaw memory via hooks and a knowledge graph

Supermemory (OpenClaw add-on): Following up on Memory integration (OpenClaw memory plug-in launch), Supermemory now claims 2× better memory on average using a knowledge graph plus hooks-based context injection (instead of tool calls), as shown in the Benchmark chart.

The post also calls out temporal knowledge/forgetfulness as first-class, and “setup in one line,” per the Benchmark chart.

supermemory

@supermemory

Supermemory makes OpenClaw's memory 2x better on avg, by: - Building a knowledge graph and remembering details using hybrid mode - Using hooks, instead of tools, to inject context - Forgetfulness and temporal knowledge built in using the supermemory engine setup in one line.

Dhravya Shah

@DhravyaShah

x.com/i/article/2019…

5:58 AM · Feb 17, 2026

Agent skills for Cline (directory view): A screenshot of the “Agent skills for Cline” directory shows a Trending view with specific skill packs surfacing (e.g., a Chrome Web Store release automation skill), as captured in the Trending skill list.

This is a lightweight demand signal for which skills people are actually installing, as evidenced by stars and tags in the Trending skill list.

Ian Nuttall

@iannuttall

Hey, my own skill is trending on playbooks xD

11:17 PM · Feb 16, 2026

oh-my-opencode 3.6.0 updates for OpenCode 1.2.x and role visibility

oh-my-opencode 3.6.0 (OpenCode plugin): The plugin released an update that supports OpenCode 1.2.x and adds UI to show the current agent role on-screen, with migration pain called out around moving file-based session logic to the SDK, per the Release note.

This is a pragmatic DevEx add-on for multi-agent terminals: role visibility reduces “who is speaking/doing” ambiguity, as implied by the Release note.

3:07 PM · Feb 16, 2026

232

oh-my-opencode adopts hashline edit, but notes diff-view breakage

oh-my-opencode 3.7.0 (OpenCode plugin): The maintainer says they adopted “hashline edit” tooling (line-hash anchored edits) inspired by oh-my-pi, but the diff view breaks when the experimental flag is enabled; details are in the Hashline edit note and the upstream Release notes.

This is a concrete example of how edit-precision features can collide with existing diff UX in agent IDEs, per the Hashline edit note.

4:54 AM · Feb 17, 2026

Replicate publishes an installable agent skills repo

replicate/skills (Replicate): Replicate posted a public skills repository designed to be installed via npx skills add replicate/skills, positioning it as a ready-made skills pack for agent frameworks, per the Repo announcement and the GitHub repo.

This is another data point that “skills” are becoming a portable packaging unit across harnesses, implied by the install flow in the Repo announcement.

Zeke Sikelianos

@zeke

Computer hacking skills? We got you. github.com/replicate/skil…

8:08 PM · Feb 16, 2026

convexskills packages Convex patterns as agent skills and templates

convexskills (Wayne Sutton): A Convex-focused skills/templates repo is being recommended as a reusable reference for agents working with Convex (queries, mutations, cron jobs, webhooks, migrations), per the Repo suggestion and the GitHub repo.

This fits the “skill pack = framework playbook” pattern: standardizing common backend shapes into SKILL.md-friendly primitives, as implied by the Repo suggestion.

Kevin Kern

@kevinkern

Replying to @steipete

For convex especially. I can recommend to clone @waynesutton convexskills to your local and reference from it (i dont have the skills installed) github.com/waynesutton/co…

7:17 PM · Feb 16, 2026

🧱 Agent frameworks & SDK surfaces: RLMs, hooks, tracing, and delegation protocols

Libraries/SDKs for building agents (not running fleets): RLM-style recursion, hookable agent loops, and tracing/observability primitives. Excludes standalone tools and assistants.

A minimal TypeScript agent loop for Gemini adds composable lifecycle hooks

Gemini Interactions agent harness (TypeScript): A minimal agent framework built around Gemini’s Interactions API exposes hook points across the full agent loop—including onAgentStart, beforeToolExecute, and afterToolExecute—so teams can implement approvals, context filtering, and observability without forking the core loop, as enumerated in the hooks overview.

The practical detail is the hook surface itself: it turns “policy” into code that can block/transform tool calls and modify prompts/context per turn, which is what the hooks overview list makes explicit.

Philipp Schmid

@_philschmid

Replying to @_philschmid

The hooks system gives you full control over the agent lifecycle: `onAgentStart`: before first LLM call → can modify tools, system prompt `onInteractionStart`: before each LLM call → can filter context window `beforeToolExecute`: before tool runs → can block execution, modify Show more

3:51 PM · Feb 16, 2026

GEPA nerd-snipe asks if you can evolve a “secret language” per model family

GEPA (DSPy): A prompt-optimization challenge asks whether GEPA can evolve a “secret language” that one model family follows but another doesn’t, as posed in the GEPA secret language prompt.

• Why GEPA specifically: A follow-up argues GEPA shines when mistakes/corrections are “verbalizable,” while “latent reasoning” may be a weaker fit—framing a boundary between token-shaped search and representation-level search, according to the GEPA vs latent reasoning.

The thread is less about cryptography and more about whether automated prompt search will start to encode provider-specific behavioral hacks as a first-class technique, which is the core question in the GEPA secret language prompt.

dex

@dexhorthy

nerd snipe for my optimization friends - can you use GEPA to create a secret language that Claude models understand but GPT models don’t?

4:23 PM · Feb 16, 2026

DSPy gets Scheme and SQL REPLs for RLM-style reasoning via dspy-repl

dspy-repl (DSPy ecosystem): A new set of modular REPL engines extends RLM execution beyond Python—specifically calling out Scheme (Lisp) and SQL REPLs built in DSPy and benchmarked on OOLONG, per the REPL engines note retweet.

This is a direct framework-level expansion: if agents/RLMs can “think” inside a domain REPL (SQL/Scheme), you can shift more correctness into executable constraints instead of pure natural-language chains, which is the motivation implied in the REPL engines note.

Michel

@mike_pavlukhin

I built Scheme (Lisp) and SQL REPLs in DSPy for Recursive Language Models and benchmarked them on the OOLONG dataset. 15 samples, 3 REPLs (python included). The question: does the REPL language actually matter? Short answer is yes, and it depends on the task. 🧵

1:15 PM · Feb 15, 2026

DSPy RLM one-shots “hyphenated mathematicians” where GPT-5.2 Thinking struggled

DSPy RLM (DSPy): A concrete micro-demo shows an RLM wrapper turning a “search in parametric knowledge” style query into a 2‑minute, correct-looking result list—after GPT‑5.2 Thinking reportedly struggled on the same prompt, as shown in the RLM output screenshot.

The artifact matters because it’s one of the clearer “RLM isn’t just more subagents” examples: the same base model can behave differently when wrapped in a recursive program (dspy.RLM("question -> response")), with logs and timing included in the screenshot from the RLM output screenshot.

Omar Khattab

@lateinteraction

Not saying this is a hard one, but: - ChatGPT 5.2 Thinking indeed struggled for me - An RLM one-shotted it (in a whopping 2 minutes lol!) 2 of 5 names are only occasionally hyphenated online tho. In general, I guess LLMs find it hard to search deep in their parametric knowledge

Gavin Brown

@gavinrbrown1

Challenge of the day: get ChatGPT to name mathematicians with hyphenated surnames.

5:50 PM · Feb 16, 2026

Harmonic shows an investment-pipeline agent built with LangGraph and LangSmith

LangGraph + LangSmith (LangChain): Harmonic describes using LangGraph and LangSmith to push farther down the venture workflow—market maps, research reports, and conversational interactions—so sourcing teams can automate more of the “pre-decision” pipeline, as outlined in the case study summary.

The engineering takeaway is the stack choice: LangGraph for multi-step orchestration and LangSmith for tracing/debugging, which is the explicit pairing highlighted in the case study summary.

LangChain

@LangChain

How Harmonic built an investment agent with LangSmith and LangGraph — so VCs can focus on founders 🤝 Harmonic is the startup discovery engine, tracking company formation and growth while providing valuable insights and workflow tools to venture capitalists. By aggregating Show more

4:00 PM · Feb 16, 2026

A DSPy RLM “codebase analyzer” script bakes rules files into the loop

DSPy RLM (DSPy ecosystem): A shared script demonstrates using dspy.RLM to analyze a codebase with a rules file and multiple modes (security/docs/quality), positioning “project rules” as a first-class input to iterative RLM reasoning, as shown in the link:T:L|rules-driven analyzer gist linked from the gist link.

This lands as a pattern: keep constraints in a filesystem artifact the model can reread, then let the RLM loop refine outputs over iterations, which is what the gist link suggests through its CLI flags and modes.

alindnbrg

@alindnbrg

dspy.RLM analyzing a code base with a rules file gist.github.com/weshoke/5a0f6f…

2:42 PM · Feb 16, 2026

LangChain event: “agent observability powers agent evaluation” goes live in SF

LangSmith/LangChain: A LangChain SF meetup announcement makes a specific claim about day-to-day agent engineering: when an agent fails after hundreds of steps, “nothing crashed,” so traces become the primary source of truth, as stated in the meetup announcement.

It’s a signal that “observability” is being treated as a first-order prerequisite for evaluation and iteration cycles (not a nice-to-have), which is the framing in the meetup announcement.

LangChain

@LangChain

📣 Meetup in San Francisco: Agent Observability Powers Agent Evaluation 📣 AI agents don't fail like traditional software. When an agent takes hundreds of steps, repeatedly calls tools, updates state, and still produces the wrong result, there is no stack trace to inspect. Show more

6:01 PM · Feb 16, 2026

RLM hype meets a practical question: is depth-1 recursion just subagents?

RLM design debate: A practitioner question frames a live architectural fork—whether “RLMs” are meaningfully different from “one main agent spawning many subagents,” and whether depth‑1 recursion is enough for many use cases, as raised in the RLM depth question.

The subtext is about what builders should optimize for in frameworks: adding programmatic subagent spawning and file-backed aggregation (e.g., JSONL) versus integrating deeper recursion mechanisms, which is the explicit tradeoff posed in the RLM depth question.

Viv

@Vtrivedy10

i’m excited about RLMs but isn’t this more main agent —> many subagents?so depth=1 RLM is accurate but not the thing to be hype about afaict asking because in deepgagents we’re considering adding some functionality to programmatically spawn N subagents attached to some file Show more

AVB

@neural_avb

New RLM trajectory that blew my mind! I will use this one as the main example in the YT tutorial. I passed in a CSV containing transcripts of 320 episodes of the Lex Fridman podcast and asked it to find what his first 10 ML guests had to say about AGI. The context had

4:14 PM · Feb 16, 2026

Read 7 replies

AG2 and AG‑UI get name-checked as converging pieces of the agent stack

AG2 × AG‑UI: A short “AG2🤝AG‑UI” post signals continued protocol-level convergence between agent runtimes and UI streaming/event layers, as indicated by the integration shout.

There aren’t implementation details in the tweet, but the naming itself is the point: more stacks are treating a shared UI/event protocol as part of the default agent surface area, per the integration shout.

CopilotKit🪁

@CopilotKit

AG2🤝 AG-UI 🩵

AG2

@ag2oss

🚀 AG2 v0.11.1 is live! 🖥️ AG-UI Message Streaming – Real-time agent responses in your AG-UI frontends with new event-based streaming architecture. 🌊 A2A Streaming – Full streaming support for Agent-to-Agent communication! LLM text streaming now flows through A2A for real-time

8:00 PM · Feb 16, 2026

🧰 Running agents at scale: multi-agent consoles, leaderboards, and always-on setups

Harnesses/runners and operational patterns: parallel agents, uptime, and usage leaderboards (apps/models). Excludes framework code (agent-frameworks) and assistant release notes (coding-assistants).

OpenRouter leaderboards: MiniMax M2.5 tops models; Kilo Code edges OpenClaw in apps

OpenRouter leaderboards (OpenRouter): Following up on Weekly tokens—weekly model usage—OpenRouter now shows MiniMax M2.5 as the most-used model with 1.44T tokens this week, narrowly ahead of Kimi K2.5 at 1.3T, with Gemini 3 Flash Preview at 776B and DeepSeek V3.2 at 757B, as shown in the Top models screenshot.

The app leaderboard is also giving a clean “agents in production” demand signal: Kilo Code processed 313B tokens yesterday vs OpenClaw at 292B, with 222B of Kilo’s tokens routed through GLM-5, per the Top apps table.

• Model mix shift: MiniMax says M2.5 hit #1 on the weekly leaderboard in four days, naming multiple launch partners in the Partner shoutout.
• Ops implication: Kilo’s post also pairs the volume claim with a limited-time “free GLM-5 access” offer in the Top apps table, which can distort short-term rankings but still reflects routing capacity and user demand.

OpenRouter

@OpenRouterAI

.@MiniMax_AI M2.5 is now the most popular model on OpenRouter

5:06 PM · Feb 16, 2026

1.8K

Read 68 replies

Ollama adds subagents + built-in web search to Claude Code without MCP setup

Ollama (Claude Code compatibility): Ollama says Claude Code sessions can now spawn subagents (parallel file search, code exploration, and research) and use built-in web search through the Anthropic compatibility layer—framed as “no MCP servers to configure or API keys required” in the Feature announcement.

The follow-up details that web search is handled automatically “when a model needs current information,” and subagents can keep side work out of the main context, as described in the Feature details alongside the linked Blog post.

• Trigger behavior: Ollama claims some models “naturally trigger” subagents (MiniMax M2.5, GLM-5, Kimi K2.5) while users can force it with explicit instructions in the Feature announcement.
• Work allocation: The examples emphasize multi-threaded research tasks (pricing audits, migration planning) rather than only code edits, as shown in the Feature details.

ollama

@ollama

Ollama now supports subagents and web search in Claude Code! Subagents can run tasks in parallel, such as file search, code exploration, and research, each in their own context. No MCP servers to configure or API keys required. Try it with any model on Ollama's cloud: Show more

6:29 AM · Feb 17, 2026

OpenCode maintainer calls out commercial forks passing off upstream work

OpenCode (ecosystem dynamics): OpenCode’s maintainer argues the project is intentionally MIT-licensed to encourage building on top, but says “commercial OpenCode forks” are claiming they “built it from scratch” while benefiting from upstream work like provider integration and platform compatibility, per the Maintainer note.

It’s a governance/attribution issue more than a feature change. The post also suggests OpenCode is investing in the undifferentiated “glue” (provider quirks, multi-environment support) that ops teams tend to feel first, as described in the Maintainer note.

dax

@thdxr

never ended up using/finishing this so dropping it here suno.com/s/VMGPvjUxNpxg…

10:05 PM · Feb 16, 2026

Read 15 replies

storage_ballast_helper: disk-pressure defense for long-running agent machines

storage_ballast_helper (sbh): A new Rust tool proposes an “agent machines fill disks” mitigation using pre-allocated ballast files plus continuous monitoring and deletion scoring; the author describes it as a response to agents filling disks (including tmpfs) and making machines unresponsive in the Release post.

The post claims sbh uses a PID-style controller, multiple safety veto layers (e.g., avoid .git), and a “zero-write emergency mode,” with an incident-style TUI shown in the Release post. The details are self-reported, but the operational failure mode (build artifacts exhausting storage mid-run) is concrete in the Release post.

Jeffrey Emanuel

@doodlestein

I made another tool out of my own desperation because my agents kept filling up disks (including special locations that are memory-backed) with Rust build artifacts and making the machines unresponsive. It’s called storage_ballast_helper (“sbh”) and you can get it here: Show more

5:12 AM · Feb 17, 2026

Read 5 replies

Tembo pitches a multi-agent subscription with automatic failover across providers

Tembo (tembo.io): Tembo published a new “one subscription, all the coding agents” page that emphasizes automatic failover—proxying Claude/OpenAI and failing over to AWS Bedrock or GCP Vertex AI—plus “cloud background agents,” as shown in the Product page screenshot.

The pitch is explicitly uptime-focused (“zero downtime”) and sits closer to an ops/procurement layer than an IDE plugin, based on the text visible in the Product page screenshot.

benja

@benjaminakar

new max page.

Website interface featuring a subscription model for coding agents, emphasizing zero downtime and curated AI models. Background of a natural landscape.

3:57 PM · Feb 16, 2026

Read 7 replies

Toad adds multi-agent management: run multiple providers and switch between sessions

Toad (batrachianai): Toad shipped an update that runs “any number” of agents from one instance (potentially across providers) and adds a management view that labels agents as working/idle/waiting, according to the Update note.

The screenshot shows separate sessions using Codex CLI, Claude Code, and OpenCode under one TUI, which matches the multi-agent navigation claim in the Update note. The repo is linked in the GitHub repo.

Will McGugan

@willmcgugan

I've pushed a big update to Toad. Run multiple agents from a single Toad instance. You can run any number of agents potentially from different providers, and easily switch between then. There is also a management interface that shows you all your agents in a single view, and Show more

5:42 PM · Feb 16, 2026

Read 5 replies

BridgeMind runs 24/7 OpenClaw agents on dedicated Mac Minis via Discord

Always-on agent setup: BridgeMind describes a small “hardware swarm” of three Mac Minis each running an always-on OpenClaw bot in Discord (named Dario/Elon/Sam) doing background tasks like code review, monitoring, deployments, and research in the Setup description.

The only concrete artifact shown is the Discord “Online” list with the bots present in the Setup description, but it’s a representative pattern for teams that want predictable uptime without renting persistent cloud sandboxes.

3:10 PM · Feb 16, 2026

OpenCode kicks off “distributed opencode” work with swarm-scale constraints

OpenCode (distributed opencode): OpenCode’s maintainer says work is starting on “distributed opencode,” and shares a constraint table that frames requirements like voice commands, 100+ units per orchestrator, multi-domain (air/land/sea) coordination, and jam-resistant connectivity, per the Constraints table.

Even if aspirational, it’s a concrete spec list for what large-scale orchestration would need (robust IO, degraded connectivity behavior), as written in the Constraints table.

dax

@thdxr

we're kicking off work on distributed opencode some of our constraints

7:36 PM · Feb 16, 2026

306

Read 37 replies

Superset positions itself as a local runner for 10+ parallel coding agents

Superset (superset.sh): Superset’s site positions the product as a way to “run 10+ parallel coding agents on your machine,” emphasizing agent-agnostic operation (Claude Code, Codex, Cursor) and isolated workspaces, per the Product page.

The social thread that shared it frames worktrees and “batteries included” as the differentiator, as seen in the Link share.

Elias Stråvik

@eliasstravik

Replying to @levelsio

using superset.sh is an even better experience - has worktrees with batteries included. (fingers crossed they’ll launch ssh support soon, that’ll make it a fast takeoff for anyone using mac mini like myself or vps)

4:24 PM · Feb 16, 2026

Raindrop AI hints at a new “act” after pushing agent observability discourse

Raindrop AI (agent observability): Raindrop’s founder claims they shifted the conversation around agent observability about a year ago and hints they have “so much to show” next, without naming features or a ship date in the Teaser.

There’s no public artifact in the tweets to validate what’s shipping. It’s mainly a signal that observability vendors expect another product step-change soon, per the Teaser.

ben

@benhylak

almost a year ago, @raindrop_ai fundamentally shifted the conversation around agent observability. our competitors started by arguing with us, and then started trying to copy us. all the while, we've been busy with our next act... and now, we have so much to show you all.

ben

@benhylak

i just wrote my first long-form post on evals. what they are, where they work, where they don't, and what i think the future looks like. if you're building an ai product, you should read this.

2:29 AM · Feb 17, 2026

📏 Evals & reality checks: webdev “taste” arenas, repo benchmarks, and time-horizon curves

Benchmarking, leaderboards, and observability that affect model/tool selection. Excludes Qwen3.5 benchmark chatter (kept inside the feature story).

Design Arena adds full-stack evaluation focused on database and auth

Design Arena (Arcada Labs): Design Arena is expanding from UI-only comparisons into full-stack web app evaluation, explicitly calling out “database and auth” coverage in its release notes thread Database and auth post and demos. This is a concrete move toward measuring whether models can ship production-shaped apps rather than static mock UIs.

• What the harness surfaces: example runs show model outputs being judged on schema/CRUD correctness and deployment logs alongside the UI, as captured in the Database comparison view screenshots.

This work is described as a collaboration with Google Cloud leadership, as referenced in Full-stack eval video and reiterated in Follow-up reply.

Grace Li

@grx_xce

database and auth!

Design Arena

@Designarena

Introducing Fullstack Arena! We connected @supabase backend and auth to Design Arena! You can now build projects like @NotionHQ and @googlesheets directly in Design Arena. The best models from @AnthropicAI, @Google, @xai, @OpenAI, @MistralAI to start. Deterministic tests to

7:07 PM · Feb 16, 2026

WeirdML time horizon chart estimates capability doubling about every 4.8 months

WeirdML time horizon (Evals trend): A new estimate charts “human work-hours at 50% success” over model release dates and fits an exponential trend of roughly 2× every 4.8 months (CI shown), as visualized in the WeirdML time horizon plot; the same thread notes it broadly agrees with prior public time-horizon discussions, following up on Time horizon chart with a fresh doubling-time number in Doubling estimate note.

This is an attempt to turn “feels like faster progress” into a single operational metric, though it remains sensitive to task selection and scoring assumptions.

Lisan al Gaib

@scaling01

Wow absolutely beautiful work WeirdML Time Horizons

Håvard Ihle

@htihle

WeirdML Time Horizons! Inspired by @METR_Evals I found time-horizons for the WeirdML tasks, using LLM-estimated human completion times. We find horizons of ~24 min (GPT-4) to ~38 hours (Opus 4.6), doubling time ~5 months. Links to blog post, git-repo + nice figures in thread.

10:52 AM · Feb 16, 2026

140

Gated access triggers eval skepticism for GPT-5.3 Codex

Evaluation trust (OpenAI Codex): A recurring critique is that GPT-5.3 Codex can’t be independently validated because it’s not generally available via API and is primarily accessible through first-party surfaces and selected partners, leading to “no third-party benchmarks” complaints in the No API critique.

An OpenAI-adjacent response frames the constraint as a “high cybersecurity threshold” and staged rollout order, which directly affects external benchmarking timelines Cyber threshold reply.

1:07 PM · Feb 16, 2026

154

Read 28 replies

Repo Bench results put Codex Spark above Codex mini, below o3

Repo Bench (RepoPrompt): A new Repo Bench run reported Codex Spark scoring 48.5%, slightly above GPT-5 Codex mini (medium) at 43.2%, but below o3 at ~50%, with the author attributing the gap to long-context coherence limits Repo Bench screenshot.

The same tester separately notes Spark can over-consume context while “trying to read too much,” leading to workflow drift even when tool calls are reliable Spark context critique.

eric provencher

@pvncher

On Repo Bench, spark scored a bit higher than gpt 5 codex mini, below o3 I think the score emphasizes the lack of large context coherence. I’ll need to do a few more runs to push it, but it confirms this is not a frontier model unfortunately, despite the name.

eric provencher

@pvncher

Spark just doesn’t cut it as a context builder in @RepoPrompt You’d think it’d be fast, but it actually spins out trying to read too much and fills its context up It calls tools reliably, but loses track of the workflow I’ve always been skeptical of small models, and still am

3:41 PM · Feb 16, 2026

Vals Index emphasizes accuracy-cost-latency tradeoffs, and surfaces filter 400s

Vals Index (vals.ai): A Vals Index snapshot ranks systems using a combined view of accuracy, cost per test, and latency, including “best speed/budget” icons—e.g., one table shows Kimi K2.5 and GLM-5 near the top, while MiniMax-M2.5 is tagged “best speed” in that view Vals Index table.

A separate eval note shows an operational failure mode: strict automated inspection can block queries with a 400 DataInspectionFailed error even when the input is a normal legal-research-style prompt, as captured in Data inspection error.

Together, these suggest selection isn’t only “best model,” but also “which model passes the filter reliably for your domain.”

Vals AI

@ValsAI

@Alibaba_Qwen's newly released Qwen 3.5 Plus is #3 on the open-weight Vals Index and in the top ten overall.

5:32 AM · Feb 17, 2026

Charts show year-on-year growth in software output proxies

Software output proxies (FT-style charts): A chart set circulating shows year-on-year growth across multiple indicators—new websites, new iOS apps, and GitHub pushes (US/UK)—with 2026 rising toward the ~30–40% range in several series, as shown in the Productivity growth chart.

The main analytic value is that it’s a cross-signal view (shipping artifacts and repo activity), rather than a single metric that can be gamed.

Xeophon

@xeophon

did you know that "no one ships what they (vibe) code" is measurably wrong

4:17 PM · Feb 14, 2026

1.4K

Read 71 replies

Design Arena launches an Elo leaderboard for “taste” in AI design and webdev

Design Arena (Arcada Labs): Design Arena went live as a head-to-head “taste” evaluator that ranks design/webdev models using Bradley–Terry/Elo-style ratings across categories (website design, UI components, SVG, data viz, etc.), as described on the new Design Arena page and its Leaderboard site. It’s positioned less like a benchmark suite and more like a continuous preference test.

The core engineering implication is that it treats “human preference” as the metric artifact—useful when standard functional tests don’t capture UX quality.

Grace Li

@grx_xce

tastearena.ai

Greg Brockman

@gdb

taste is a new core skill

9:29 PM · Feb 16, 2026

Read 5 replies

Seed 2.0 preview appears near #6 on Arena’s text leaderboard

Arena leaderboard (arena.ai): The Arena text leaderboard now shows dola-seed-2.0-preview around rank #6 with a score of 1473 and ~3,154 votes, while Claude Opus 4.6 variants remain at the top in the same snapshot Leaderboard screenshot, with the live listing available via the Text leaderboard page.

This is a preference-vote signal rather than a task-grounded eval, but it often predicts which models get pulled into day-to-day chat usage.

Quanquan Gu

@QuanquanGu