Fresh stories
GPT-5.5 vs Opus 4.7: users compare plan mode, frontend output, and 120K-context use
User posts and HN threads compared GPT-5.5 and Opus 4.7 across plan mode, frontend work, and 120K-context sessions. The results split by workload, suggesting token burn and instruction discipline matter as much as raw benchmark scores.

Pi community ships `pi-listens`, `pi-kanban`, and `pi-codex-conversion` in one-day extension burst
Independent Pi builders shipped a voice layer, a kanban and observability dashboard, a Codex-conversion tool with `apply_patch`, and smaller UI extensions in the same window. The burst matters because it turns Pi from a single coding agent into a real local-first extension ecosystem with voice, review, and workflow primitives.

ERNIE 5.1 Preview ranks No. 4 on Search Arena and claims 6% pretraining cost
Baidu pushed ERNIE 5.1 Preview with new leaderboard claims, including No. 4 on Search Arena and No. 13 on LMArena Text. Treat the 6% pretraining cost claim cautiously until an independent technical report confirms it.

Codex 0.130.0 adds `codex remote-control` and migration support for Code and Cowork
A day after `/goal` and remote-control preview surfaced, Codex 0.130.0 shipped a simpler headless entrypoint while the app’s migration tool added Code and Cowork support. Users also showed Codex handling bug repro, long-running `/goal` sessions, and plugin-driven expense filing, which broadens its role from chat-first coding to delegated workflows.

Hermes Agent reports No. 1 OpenRouter rank after v0.13.0
Nous said Hermes Agent hit No. 1 among AI apps on OpenRouter after v0.13.0 shipped, a release that added credential pools for rotating provider keys. Independent posts also tracked migrations from OpenClaw and early routing support in the same stack.

Claude Code adds `frontend-slides` for HTML briefs and publishable slides
A day after HTML artifacts surfaced as a Claude Code workflow, Anthropic promoted a `frontend-slides` plugin with direct install commands and artifact publishing. The rollout sharpened a real workflow split: teams are using HTML for human review and demos, while keeping markdown or MDX for token-efficient agent context.

OpenRouter launches Pareto Code with min_coding_score tiers and Nitro routing

Claude Code guide fixes hallucinated SHAs with adaptive thinking off and effort=high

Top stories this week
Codex adds /goal mode for long-running tasks with remote control preview
OpenAI reports Codex can now keep pursuing a goal until it reaches a defined end state and is adding remote control plus a usage tab. The update matters because Codex sessions can span longer tasks and be managed across devices with less manual babysitting.

METR says Claude Mythos Preview hits 16-hour p50 Horizon in early snapshot
METR said an early Claude Mythos Preview snapshot reached at least a 16-hour 50% time horizon, with only five tasks in-suite at that range. The result matters because Mythos is beyond METR's stable measurement band, so cross-model comparisons are less reliable.

Claude Code users report HTML artifacts improve PR review, dashboards, and visual explainers
A cluster of Claude Code users, guides, and companion tools shifted from Markdown toward HTML artifacts for code review, dashboards, and explainer pages. The pattern matters because richer outputs are easier to inspect and share during long agent workflows, though several builders note the token cost is materially higher than Markdown's.

Anthropic reports 'Teaching Claude why' cuts agentic misalignment by 3x
Anthropic said training Claude on principled responses and aligned fictional stories removed previously observed blackmail behavior in Claude 4 lab tests. The post matters because Anthropic says the broader interventions generalized better than narrow eval-matching examples and survived RL fine-tuning.

OpenAI reports accidental CoT grading touched GPT-5.4 Thinking in under 0.6% of samples
OpenAI said a new detector found limited chain-of-thought grading in earlier Instant and mini models and in less than 0.6% of GPT-5.4 Thinking samples. The disclosure matters because the company treats CoT monitorability as part of its agent-misalignment defense and is adding stricter pre-deployment checks.
