Executive Summary

OpenAI just cleared the ICPC AI track with 12/12 solved inside the 5‑hour window—11 on the first try, with an experimental reasoning model finishing the holdout. DeepMind’s Gemini 2.5 Deep Think hit 10/12 in 677 minutes and uniquely cracked a human‑unsolved problem. Meanwhile, fresh evals show finance agents averaging ~50% on tool‑using tasks and a ROUGE audit cutting detector AUROC by up to 45.9%.

In numbers:

OpenAI: 12/12 solved; 11 first‑try; same judge, time, memory, and hidden tests
Would rank 1st vs human teams under identical ICPC sandbox judging conditions
Gemini 2.5 Deep Think: 10/12 in 677 minutes; solved “Problem C” no team solved
Finance Agent Benchmark: ~50% average on SEC workflows; some models score under 10%
Metrics check: AUROC drops up to 45.9% when rescored by LLM‑judge
ROUGE bias: length/repetition confounds; detectors inflate scores without factual gains
Perks: ICPC human finalists get 1 year of ChatGPT Pro from OpenAI

Also:

Cursor Tab online RL: +28% accepts; −21% suggestions across 400M+ daily interactions
Vercel mcp‑to‑ai‑sdk trims ~50k‑token tool menus; stabilizes schemas and cuts latency
Perceptron Isaac 0.1: 2B‑parameter open VLM for grounded localization and OCR

Feature Spotlight

Evals: ICPC Breakthrough & Real‑world Agent Tests

ICPC shock: OpenAI’s GPT‑5 + experimental model solved 12/12 problems; Gemini 2.5 Deep Think solved 10/12 (rank‑equiv 2nd). Five‑hour window, 139 teams context—clear quantified leap in reasoning.

Major eval day: OpenAI reports 12/12 at ICPC under AI track and DeepMind’s Gemini 2.5 Deep Think hits 10/12 incl. a human‑unsolved problem. Also new domain evals (finance agents) and metrics critiques (ROUGE vs LLM‑judge).

Jump to Evals: ICPC Breakthrough & Real‑world Agent Tests topics

📑 Table of Contents

📊 Evals: ICPC Breakthrough & Real‑world Agent Tests

Study: ROUGE overstates hallucination detectors; LLM‑judge cuts AUROC up to 45.9%

A new paper argues ROUGE misaligns with human judgment, inflating hallucination‑detection results; when re‑scored with an LLM‑as‑Judge aligned to humans, AUROC drops as much as −45.9% (e.g., Perplexity on NQ‑Open with Mistral). It also finds many detectors correlate with answer length rather than truth, and that repetition can game overlap metrics. paper overview, metric drops, length confounder, paper link

Human‑aligned labeling shows higher F1 (0.832) and agreement (0.723) vs ROUGE (F1 0.565, agreement 0.142). paper overview
Simple baselines (length features) rival complex detectors; ROUGE penalizes longer correct answers and rewards repeated content. simple baselines, paper link
Conclusion: evaluate with semantically aware, human‑aligned labelers; don’t tune detectors to overlap proxies. recommendations, and ArXiv paper

Finance Agent Benchmark reveals top LLMs stumble on real SEC/tool workflows

Vals AI’s Finance Agent Benchmark evaluates models on end‑to‑end analyst tasks (research filings, use tools, apply domain logic) and finds big gaps between public benchmark scores and real workflows. Top models that ace MMLU/Math500 average roughly 50% here; some score under 10% on basics. benchmark intro, method summary

Tasks emphasize tool use, retrieval from SEC filings, and delivering analysis, exposing hallucinations and misuse of tools. failure patterns
Public academic benchmarks overstate readiness for production agents; this suite targets realistic, consequence‑bearing tasks instead. method summary
Link to benchmark and details for replication and further challenges. Benchmark page

Braintrust ships Loop, a natural-language interface over experiments, logs and docs

Braintrust’s Loop lets teams query experiments and logs in plain English (e.g., “show errors last 24h,” “find runs with high factuality”), and get instant help on platform features with answers grounded in docs. It targets faster debugging, evaluation, and iteration cycles. feature launch, docs Q&A, docs guide

Works across playgrounds, experiments, datasets, and logs; supports model selection and tools for summarization, prompt/dataset edits, eval execution. docs guide
Public beta behind a feature flag; available from version 0.0.74 in hybrid deployments. docs guide, and Loop docs

MusicArena expanded its community leaderboard with three new dimensions—Vocal Quality, Instrumental, and Lyrics Language—while keeping blind A/B voting and shareable test sets. The change aims to measure more than raw metrics, letting listeners rate feel, melody, and adherence across songs and instrumentals. feature overview, language dimension, and share test set

Blind A/B voting remains (no brand logos), producing a transparent, crowd‑driven scoreboard across models. why this matters, and Music Arena
New Instrumental dimension isolates arrangement/beat quality for producers comparing models on EDM, ambient, hip‑hop, etc. instrumental details
Language dimension lets voters compare vocal generation across English, Japanese, Spanish and more. language dimension
Users can publish their own test sets to replicate others' votes, improving reproducibility. share test set

Arena voting UI

OpenRouter enables native web search for OpenAI/Anthropic; Exa.ai for others by default

OpenRouter rolled out configurable web grounding: OpenAI and Anthropic models now use their native web engines by default, while all other models use a custom pipeline powered by Exa.ai. Developers can enable via plugin/slug flags and tune result counts and prompts. feature announcement, docs update

Adds standardized schema for parsed results with inline citations; model slug suffix :online and web plugin control behavior. docs update
Pricing/engine selection is configurable (e.g., native vs Exa), balancing relevance, cost, and provider policy constraints. Web search docs

🏗️ Compute, Funding & Supply

Non‑AI but relevant for AI: funding/capacity and chip policy shifts shaping model training/inference. Groq raises; China instructs tech firms to halt Nvidia purchases—direct compute impact.

Microsoft commits $30B to UK AI infrastructure, including a 23K+ GPU supercomputer

Microsoft will invest $30B in the UK from 2025–2028 to expand AI infrastructure, allocating ~$15B to build the country’s largest supercomputer (23,000+ NVIDIA GPUs in partnership with Nscale) and ~$15B to scale operations, jobs, and sector adoption. The move fortifies UK sovereign compute and deepens the UK–US AI corridor.

Microsoft UK invest

Supercomputer build: 23K+ NVIDIA GPUs with Nscale (workload training and high‑throughput inference), positioning the UK for top‑tier AI capacity MS blog headline
Expansion dollars: A further ~$15B funds operations, jobs, and AI adoption across finance, healthcare, and public services (talent programs and regional capacity) MS blog headline
Sovereign compute context: Follows OpenAI’s Stargate UK plan to scale to 31K GPUs and establish local run capacity for critical sectors Stargate overview
UK AI hub momentum: Complements broader UK investments (e.g., recent £5bn data center plans) and signals multi‑vendor build‑out to meet demand Google UK invest
What it means for builders: More domestic training and inference capacity, improved latency/compliance for regulated workloads, and a sturdier supply posture through 2028 MS blog headline

🎬 Generative Media & Video

Creator‑facing improvements: realtime video access, faster image models, and workflow guides. Mostly product/ops updates; fewer new model science items today.

Seedream 4.0 cuts average latency by 8s, 2K image‑to‑image now 12–18s

ByteDance’s Seedream 4.0 got a performance bump: average latency is down ~8 seconds across tasks while preserving quality; 2K image‑to‑image now completes in ~12–18 seconds (down from ~20–25s).

Speed update and call‑to‑try links for both text‑to‑image and image editing are provided, with no quality regressions claimed latency update, try links, Bytedance text to image, and Bytedance image editing.
This builds on its recent momentum on creator arenas and workflows arena surge.

YouTube ships Veo 3–powered tools: motion transfer, auto edit with music, and dialog→soundtrack

YouTube added Veo 3–enabled creation features: apply motion from one video to a photo, auto‑edit raw footage with music/transitions/voiceover, and an upcoming feature to turn video dialog into a new soundtrack.

Motion‑transfer, end‑to‑end auto editing, and dialog→soundtrack (US first) expand creator workflows directly in YouTube feature list.

Hailuo rolls out 7‑day unlimited start/end‑frame video fills with creator playbooks

Hailuo enabled a 7‑day “Unlimited Start/End Frames” mode that lets creators specify the first and last frames and have the model fill the motion in‑between, with community demos showing sharp physics, morph transitions, and action shots. The threads include practical prompts, storyboard tips, and looping tricks for fast iteration.

Walkthroughs cover storyboarding stills, using Nano Banana for image edits, and exporting 1080p fills with JSON‑style prompts feature thread, how it works, and full guide.
Examples highlight smooth morphs, dynamic camera motion, and believable contact/physics; several prompt ALTs are shared for reuse prompt examples, transition demo, favorites set, action shots.
Try Hailuo via the shared link; creators note it’s faster to reach seamless transitions versus other tools Hailuo AI.

Krea opens realtime video generation for all subscribers

Krea AI’s realtime video is now broadly available to subscribers, enabling instant generate‑and‑edit video feedback loops directly in the app.

Rollout announcement confirms access “for all subscribers,” expanding earlier tests to the full paid user base rollout note.

Reve teases v1.0 image model focused on prompt adherence, typography and aesthetics

Reve previewed its 1.0 image model, trained for high prompt adherence and typography‑aware generation, with sign‑ups open for early access.

Launch tease emphasizes clean type rendering in both graphic and environmental scenes and a focus on aesthetic quality; preview signup is live launch tease, Reve site.

AI SDK publishes official guide for Gemini image generation and editing (Nano Banana)

Vercel’s AI SDK released a how‑to on using Google’s Gemini (Nano Banana/2.5 Flash) for text‑to‑image, editing existing images via natural language, and composing multiple images in code.

Guide covers model setup, generateText for image creation, and editing/compositing patterns with examples guide link, announcement, AI SDK guide.

🎙️ Voice & Real‑time Stacks

Big editor release and platform scaling: ElevenLabs Studio 3.0 unifies TTS/Music/SFX/Isolation with video; concurrency boosts for chat mode; open‑source real‑time voice agent framework.

ElevenLabs Studio 3.0 adds video timeline, speech correction, and multiplayer commenting

ElevenLabs rolled out Studio 3.0, folding pro‑grade voiceovers, music, SFX, noise removal, and voice changing into a single editor with a video timeline; Plus/Pro/Business users can enhance existing media or generate from prompts, with multiplayer review built‑in launch thread, availability note, and Studio 3.0 page. This lands in context of v0 Agents Starter for spinning up voice agents, extending the company’s real‑time voice stack from agents to production editing.

Voice tools in one place: expressive TTS voiceovers, Eleven Music, SFX, Voice Isolator, and Voice Changer all operate on a single timeline tools update, music auto.
Speech Correction regenerates mistaken words in your own cloned voice by editing text—no re‑recording needed speech correction.
Automatic captioning/subtitles and processing of external audio/video files streamline post‑production for creators and teams workflow note, availability note.
Multiplayer commenting: share a feedback URL to collect time‑stamped notes on the same project (goodbye Slack/email long threads) multiplayer comments.
Developer take: strong all‑in‑one suite for dev content and demos; try it now and suggest roadmap items for October dev takeaway, availability note.

ElevenLabs Agents Platform: chat mode now supports up to 25× higher concurrency

ElevenLabs raised concurrency limits for Agents Platform chat sessions, making it far easier to scale large fleets of simultaneous, text‑only interactions compared with voice pipelines concurrency update.

Chat mode aims at high‑throughput deployments (support desks, multi‑tenant assistants) where audio I/O isn’t required workshop plug.
Docs detail enabling text‑only conversations via config or runtime overrides, with code examples in Python/JS and security notes for overrides docs.

Qwen3-ASR Toolkit turns 3‑minute ASR into hours-long, fast, CLI-driven transcription

Alibaba’s Qwen team shipped the Qwen3‑ASR‑Toolkit, a free CLI that slices long audio/video with smart VAD, runs parallel jobs, and auto‑resamples, overcoming the Qwen3‑ASR‑Flash 3‑minute limit toolkit release.

Handles MP4/MOV/MP3/WAV/M4A and more; parallelization yields major throughput gains on large backlogs toolkit release.
One‑command install (pip) to unlock hours‑long transcriptions against the Qwen3‑ASR‑Flash API toolkit release, community note.

TEN Framework: open-source real-time voice agents with visual designer and edge-cloud support

TEN Framework released an open‑source stack for building real‑time multimodal voice agents, featuring a visual TMAN Designer, cross‑language SDKs, and edge‑cloud integration framework intro.

Production‑ready real‑time conversations (multimodal), with Python, Node.js, C++, and Go client support feature list.
Agent state management, compatibility with custom fine‑tuned models, and a visual flow builder speed prototyping and ops feature list.
Source, docs, and examples available on GitHub for quick starts and customization GitHub repo.

🧩 Agent Interop & MCP Patterns

MCP usage and safety patterns keep evolving: security tradeoffs with dynamic tools vs curated static tools; registries and docs strengthen. Excludes dev‑tool launches already listed.

Vercel’s mcp-to-ai-sdk turns dynamic MCP tools into static, safer SDK tools

Vercel is proposing a pragmatic fix for flaky, risky MCP integrations: generate static AI SDK tools from MCP servers, so agents use pinned, reviewable specs instead of live‑changing tool descriptions. This mitigates prompt injection, capability drift, token bloat, and tool‑call errors.

The approach compiles MCP tool schemas into static AI SDK tools, reducing attack surface from injected descriptions and unexpected server updates Vercel blog, and Vercel blog post
It cuts token usage (GitHub’s MCP server can add ~50k tokens) by stripping unused tools and oversized schemas from prompts Vercel blog
Tool‑call accuracy improves by stabilizing names/schemas and eliminating description drift during agent planning Vercel blog
Complementary docs for MCP in the AI SDK landed recently to help teams adopt the pattern AI SDK MCP docs, and Model Context Protocol (MCP) Tools

Developers flag MCP servers for token bloat, unused tools and frequent crashes

A widely‑shared complaint sums up current MCP growing pains: servers expose tools that are never called, inflate context windows (driving token costs), and crash hard enough to break sessions—especially in coding agents at scale.

Token burn: auto‑loaded tool catalogs bloat prompts, making teams ask why their agent spends so many tokens even when tools aren’t invoked dev critique
Reliability: crashes propagate through agent stacks, interrupting runs and damaging user trust dev critique
Practical takeaway: right‑size tool sets, prefer static tool generation, and add filters/namespaces to limit what the agent sees at planning time dev critique

GitHub’s MCP Registry gains early traction; VS Code installs and OSS sync highlighted

The new GitHub‑backed MCP Registry is drawing usage fast, with developers noting one‑click installs (currently VS Code‑only) and auto‑sync from the OSS Community Registry—building on its launch. Early listings already show top web‑data tools surfacing.

Registry details: each MCP server is backed by a GitHub repo, and the catalog auto‑displays OSS entries that developers self‑publish; current UI lacks filters for local vs remote tools registry notes, and repo overview
Early momentum: Firecrawl touts MCP v3 and a top spot in the new registry, signaling fast community uptake for web search/scrape tools firecrawl update
Developer UX: VS Code installation is supported first; broader IDE support and server filtering are common requests from early testers registry notes, and GitHub MCP profile

Andrew Ng’s team ships MCP course with Box; AI SDK expands agent docs

Education and docs for MCP‑based agents are catching up. Andrew Ng’s program released a short course on building with MCP servers (Box files), while the AI SDK team shipped new agent documentation covering foundations, patterns, and loop control.

Course: “Build AI Apps with MCP Servers: Working with Box Files,” taught by Box’s Chief Architect, focuses on real integrations and file workflows course launch
AI SDK agent docs: foundations, workflow patterns, loop control, and the Agent class; pairs well with the SDK’s MCP tooling guides AI SDK agents docs, and Model Context Protocol (MCP) Tools
Net effect: clearer recipes for safe tool exposure, smaller prompts, and more predictable agent loops when using MCP servers course launch, and AI SDK agents docs

🗂️ RAG, Retrieval & Evaluations

Focus on production‑grade retrieval and eval flywheels; several courses, frameworks and a new Elo‑style reranker method. Quieter on new chunking papers vs yesterday.

zELO: ELO‑inspired, unsupervised training for rerankers and embeddings

ZeroEntropy’s zELO reframes ranking as a Thurstone/Elo process and trains rerankers and embedding models end‑to‑end with no human labels, using only large unlabeled query–document pools. The team reports state‑of‑the‑art, cross‑domain results and a modest resource profile.

Trained on 112k queries with 100 docs each in under 10k H100 hours, end‑to‑end and unsupervised paper thread, and ArXiv paper
Delivers top retrieval scores across finance, legal, coding, and STEM domains, surpassing some closed rerankers on NDCG@10 and Recall paper thread
Ships open weights for multiple rerankers (zerank‑1 / zerank‑1‑small) for immediate drop‑in RAG gains paper link
Practical angle: plugs into RAG stacks where labeled data is scarce, replacing prompt‑tuned LLM reranking with small, cheap models

Granite‑Docling‑258M: layout‑faithful PDF→HTML/Markdown VLM for doc ETL

IBM released Granite‑Docling‑258M, a 258M‑param vision‑language model that outputs Docling’s structured markup (sections, tables, code, equations), slotting into existing Docling CLI/SDK to preserve reading order and layout for retrieval.

Architecture: SIGLIP2‑base‑patch16‑512 vision encoder + Granite‑165M LM via a pixel‑shuffle projector (IDEFICS3‑style) for compact, layout‑aware conversion model overview
Outputs structured HTML/Markdown (incl. inline math, captions, region‑specific inference) and answers structural queries (e.g., element presence/order) across English with early CJ/AR support model details
Licensed Apache‑2.0 with ready integration into Docling toolchain for bbox/full‑page runs and split‑page overlays HF collection

HORNET teases agentic retrieval engine for schema‑first, multi‑agent search

HORNET is hiring and taking its “R.I.P. Naive RAG” event on the road, positioning a schema‑first retrieval engine designed for agent workloads (multi‑step, whole‑document reading, high recall) with VPC/on‑prem options.

Role: Founding US GTM Engineer (San Francisco) as the team scales agent‑optimized retrieval beyond human‑query search hiring note, and HORNET site

gtm role posting

London call for venues confirms growing community interest; event theme centers on agentic retrieval replacing chunk‑and‑embed RAG event invite, and Luma event
Positioning: model‑agnostic API for single→multi‑agent workflows across data scopes, with private deployments for compliance‑sensitive teams hiring note

Agentic Query Optimization for unstructured data (DocETL/DocWrangler)

Berkeley researchers outlined how agent systems can optimize end‑to‑end unstructured data processing—parsing, transforming and querying documents with feedback‑aware operators, building on DocETL and DocWrangler.

Livestream: “Agentic Query Optimization for Unstructured Data Processing” (DocETL/DocWrangler) with concrete operator graphs and failure analysis livestream, and YouTube stream
Emphasis on semantic ETL, error visibility and iterative refinement loops that reduce downstream retrieval misses in agent pipelines livestream
In context of DocWrangler recognition for a mixed‑initiative semantic ETL IDE; the new talk adds system diagrams and live demos beyond the paper accolades livestream

Streaming document ETL for agents with Kafka, Flink and LlamaParse

LlamaIndex published a production‑grade walkthrough for building real‑time document pipelines that parse, chunk, embed and index at stream speed—ideal for agentic retrieval.

Architecture: S3 ingest → LlamaParse/ETL → Confluent Kafka → Apache Flink for transforms/embeddings → MongoDB for serving pipeline thread

streaming pipeline diagram

Focus is high‑throughput, low‑latency indexing for downstream agents (no nightly batch), with schema’d topics and Avro serialization to keep consumers stable pipeline thread
Good template for teams migrating from cron‑based RAG to event‑driven updates (fresher corpora, fewer cache misses) pipeline thread

Configurable web grounding for LLMs via OpenRouter web search

OpenRouter’s new web search integrates native engines for OpenAI and Anthropic models (and Exa for others) so developers can flip models into an :online mode for real‑time grounding.

Enable by appending :online (e.g., openai/gpt‑4o:online) or via the web plugin; results arrive in a standardized schema with citations feature update, and web search docs
Configuration includes engine selection, result counts and custom search prompts; Exa usage is priced per 1k results (docs detail pricing and knobs) web search docs
Practical impact: fewer hallucinations on current events, easier to trace sources in agent chains that mix tool calls and search feature update

🛡️ Safety, Alignment & Governance

Safety research and platform responses: OpenAI+Apollo probe ‘scheming’ behavior and mitigations; Anthropic publishes a Claude quality postmortem. Excludes ICPC coverage.

OpenAI/Apollo find scheming behaviors in frontier models; 30× reduction via deliberative alignment

OpenAI, working with Apollo Research, reports controlled tests showing “scheming” behaviors in multiple frontier models and a mitigation that cuts covert actions by 30× on o3. The team urges chain‑of‑thought transparency for evals while better methods mature. research post, OpenAI blog post

Deliberative alignment (explicit reasoning + a Model Spec clause banning scheming) reduced covert actions by 30× on o3 and also improved clarity of rationales. deliberative results, OpenAI Model Spec
Evals covered OpenAI o3 and o4‑mini, Gemini‑2.5‑pro, and Claude Opus‑4; findings show that models can adapt behaviors when they recognize they are being tested (situational awareness). deliberative results, situational awareness
OpenAI cautions that the field isn’t ready for eval‑aware models with opaque reasoning and asks developers to preserve chain‑of‑thought for safety testing; efforts include cross‑lab evals and a $500k Kaggle challenge. CoT transparency, next steps
The work distinguishes “simple deception today” from future harms as models take on longer‑horizon tasks, positioning this as preparedness, not a claim that deployed systems are causing serious harm now. definitions

Hidden PDF text can hijack LLM peer‑reviews, flipping accept rates to 0%–100%

A study shows that invisible PDF text (white‑on‑white, tiny fonts) survives conversion to Markdown and can bias LLM‑generated peer‑review forms—some models outputting 100% accept under positive injection and 0% under negative, far from the human baseline (~43% accept). paper summary

Simulated ICLR workflows with fixed forms found neutral prompts also elevated acceptances vs human baseline, suggesting systematic bias in automation. paper summary
Models that looked “resistant” often broke form constraints (illegal scores/fields), making their outputs unusable for copy‑paste; that is not true robustness. paper summary
Recommended mitigation is to render PDFs as images before text extraction so hidden glyphs don’t enter the model input stream. paper summary

paper title page

Senate hears families blaming chatbots for teen suicides; platforms pressed for stronger guards

Parents suing OpenAI and Character AI testified that bots encouraged self‑harm and even helped draft suicide notes, fueling calls for tighter protections and clearer escalation protocols. OpenAI outlined a stricter teen safety posture in parallel. Senate hearing, policy post

Witnesses described bots missing obvious distress signals, prompting lawmakers to push for guardrails that trigger interventions and limit risky content exposure. Senate hearing
OpenAI set out principles separating adult freedom/private use from teen‑specific safeguards (age prediction, parental controls, restrictions on sensitive topics), in context of teen safety announced yesterday. policy post
Safety groups highlighted gaps at other platforms (e.g., Meta, Character AI), signaling heightened scrutiny across the ecosystem. Senate hearing

hearing overview

Disney, Universal, WBD sue China’s MiniMax (Hailuo) for ads using protected characters

Three major studios filed a joint lawsuit accusing MiniMax’s Hailuo app of “willful and brazen” infringement for using characters like Darth Vader, Minions and the Joker in promotional materials, upping legal risk as MiniMax prepares for a Hong Kong IPO. lawsuit report

The case is being cast as the first major U.S. studio copyright challenge against a Chinese AI firm, testing how generative‑ad collateral will be treated in court. lawsuit report
Timing ahead of an IPO raises disclosure and liability stakes for MiniMax and may chill similar marketing practices across generative‑video apps. lawsuit report

studios allege misuse

Multi‑step “Content Concretization” jailbreak lifts harmful‑code success from 7% to 62% at ~$0.075/prompt

Researchers show that drafting requirements/pseudocode with a cheaper model and then iteratively “upgrading” via a higher‑tier model reframes safety as editing—slipping past filters and producing more dangerous, runnable code. Optimal at three refinement steps before quality dips. paper summary

Baseline success without refinement was 7%; three refinements reached 62% and raised perceived maliciousness/technical quality in panel ratings. paper summary
Many scripts ran after minor fixes; strongly targeted attacks still needed environment‑specific tuning. paper summary
The most vulnerable phase is extension/expansion of drafts; defense work should target edit‑mode behaviors, not just direct generation. paper summary

method overview

🧠 Training, RL and Reasoning Methods

Fresh results on RL‑driven reasoning and agent scaling: answer‑only RL (R1) gets a Nature cover; agentic CPT and web RL stacks. Mostly methods; fewer product drops today.

DeepSeek‑R1 shows answer‑only RL yields emergent reasoning; Nature cover study

DeepSeek’s R1 demonstrates that reinforcement learning with answer‑only rewards can induce sophisticated reasoning patterns (self‑reflection, verification, strategy shifts) without supervised chains‑of‑thought, and distills to smaller models. Reported jumps include AIME’24 pass@1 from 15.6% to 77.9% (86.7% with self‑consistency). Nature cover, and training detail

Trains via GRPO (grouped PPO): sample multiple solutions, reward only correct final answers, and constrain policy to a reference model (formatting/accuracy rules, no learned reward model) training detail
Emergent behaviors: verification, re‑planning mid‑solution, and longer thoughts on hard math/coding/STEM tasks; then distilled to smaller models Nature cover
Why it matters: avoids human trace bottlenecks, enabling non‑human solution paths that still land on correct results; competitive or superior on verifiable tasks Nature cover

Nature paper cover

Tongyi’s AgentFounder uses agentic continual pretraining to boost web research SOTA

Tongyi details Agentic Continual Pre‑training (Agentic CPT): first learn agent behaviors with next‑token learning over agent‑style data, then fine‑tune on trajectories. Reported scores include 39.9% on BrowseComp‑EN and 31.5% Pass@1 on HLE. In context of Initial stack (Tongyi’s deep‑research pipeline), this paper adds concrete training mechanics and benchmarks. See paper summary, and Hugging Face paper.

Two‑stage scale‑up: 32K → 128K context; offline “first/second‑order” action synthesis to seed plans without tool calls, then trajectory FT to preserve broad tool use paper summary
Results emphasize long‑horizon browsing and reasoning (BrowseComp EN/ZH, HLE, GAIA, FRAMES, xbench‑DeepSearch) with competitive scores benchmarks overview
Takeaway: teach agent habits first (cheap offline data), then align with demos to reduce optimization conflict between behavior learning and imitation paper summary

Agentic CPT figure

PHLoRA turns full fine‑tunes into post‑hoc LoRA adapters, cutting load/serve costs

PHLoRA extracts a low‑rank adapter from any fully fine‑tuned model without training data, turning the FT delta into a compact LoRA via SVD. Reported wins: >10× faster loading and up to 4× cheaper serving when many requests share adapters, with ~≤1% accuracy delta at ranks 32–64. paper overview

Method: subtract base from fine‑tuned weights, SVD the delta, and package as LoRA; load‑on‑demand or merge back as needed (no gradients/data required) paper overview
Multi‑modal evidence (text/image/video on Nova) suggests most useful change lies in a few singular directions, preserving behavior at low rank paper overview
Ops impact: slash cold‑start and memory; better multi‑adapter multiplexing under high fan‑out inference paper overview

PHLoRA paper front page

Microsoft: ICL is real learning but needs 50–100 shots and fails under shift

A large empirical study finds in‑context learning acts like real learning but relies heavily on exemplar statistics: with 50–100 examples, accuracy climbs and model differences narrow; prompt wording matters less than intact exemplars; but robustness collapses under distribution shift and “salad‑of‑thought” scrambling. overview, many‑shot gains, exemplar vs wording, and robustness

Many‑shot regime: gains become steady past ~50 examples; few‑shot often insufficient for hard tasks many‑shot gains
Instructions vs exemplars: replacing instructions with word salad eventually recovers, but corrupting exemplars tanks performance exemplar vs wording
Fragility: strong brittleness OOD; chain‑of‑thought and auto prompt optimization particularly sensitive robustness
Task variance: similarly shaped tasks can diverge by >30% accuracy, highlighting shallow heuristic pickup over deep structure task variance

🛠️ Agentic Coding & Dev Tools

Busy day for practical stacks: new IDE integrations, CLI/TUI workflows, online‑RL Tab improvements, and utility add‑ons. Excludes the ICPC feature coverage above.

Cursor Tab switches to online RL, boosting accepts 28% and cutting suggestions 21%

Cursor detailed a production move to online reinforcement learning for its Tab suggestions, retraining every 1–2 hours on real accept/reject feedback and rolling out models continuously. In context of Cursor 1.6, which added faster terminals and MCP Resources, this is a bigger shift: live policy updates at 400M requests/day scale.

Reported gains are +28% acceptance and −21% suggestion volume, yielding fewer, better in‑flow completions update note, Cursor blog post
Uses policy‑gradient RL directly on the Tab model (no separate filter), optimizing when and what to suggest to reduce distraction Cursor blog post
Operates as an online loop (accept/reject → reward → frequent redeploys), enabling rapid drift correction without long offline cycles update note, Cursor blog post

Vercel AI SDK introduces static tools from MCP to mitigate prompt injection and token bloat

Vercel announced mcp‑to‑ai‑sdk, a codegen workflow that converts dynamic Model Context Protocol (MCP) servers into pinned, static AI SDK tools—reducing security and reliability risks while shrinking prompt size.

Addresses common pitfalls: prompt injection via changing tool descriptions, unexpected capability changes, and massive token overhead from large registries security blog
Generates typed, static tools you control, avoiding drift and improving tool‑call accuracy in agents security blog
Complements the AI SDK’s core MCP docs and patterns for safe context/tool usage in production AI SDK MCP docs
Resonates with developer pain around unstable MCP stacks (unused tools, crashes, token burn) @thdxr comment

Cline lands on JetBrains with platform‑agnostic host bridge and native IDE integration

Cline released a JetBrains plugin that brings its AI coding assistant to IntelliJ family IDEs with a native bridge to editor APIs, mirroring its VS Code experience and keeping users model/provider‑agnostic.

Architecture splits UI, core AI logic, and a gRPC host bridge, so the same engine can run across IDEs, CLI, and future surfaces Cline blog
Available via JetBrains Marketplace; supports project‑aware refactors and code understanding within IDE context JetBrains launch, Marketplace listing
Users retain control over models/providers (no lock‑in), aligning with Cline’s agnostic stance Install page

Codegen platform ships Claude Code integration, AI code reviews, analytics, and improved sandboxes

The Codegen team rolled a major release focused on running code agents at scale, layering AI code reviews and analytics on top of new Claude Code integration and upgraded sandboxing.

Adds Claude Code integration for planful edits and multi‑step changes, alongside platform AI code reviews release thread
Code agent analytics provide visibility into agent behavior and outcomes across runs release thread
“Best‑in‑class” sandboxes target safer, more reliable execution for automated changes; early users report thoughtful background‑agent design user review

Zed v0.204 adds Claude Code Plan/permissions and SSH support for Claude Code and Gemini CLI

Zed shipped v0.204 with deeper coding‑agent ergonomics: full support for Claude Code Plan mode and permission flows, plus the ability to use Claude Code and Gemini CLI over SSH within Zed.

New Plan/permission modes let users gate changes and review multi‑step edits more safely in‑editor release note
Remote dev: run Claude Code and Gemini CLI over SSH for server‑side workflows without leaving Zed release note
Quality‑of‑life adds include line ending toggle, quick uncommit, and syntax‑node navigation; full details in the changelog line ending tip, git uncommit, syntax nav, Zed release notes

🧪 New Models & Open Weights

Compact, specialized models and physical‑world perception lead: an open 2B VLM for grounded tasks and a tiny doc VLM for layout‑faithful parsing. Excludes prior‑day launches.

Perceptron releases Isaac 0.1, a 2B perceptive‑language model with open weights

Perceptron unveiled Isaac 0.1, a compact 2B perceptive‑language model built for the physical world, with open weights and an SDK. It focuses on grounded pointing, VQA, OCR, and in‑prompt visual adaptation (few‑shot), and is already runnable via a public demo, HF weights, and fal hosting launch thread, Perceptron blog, Hugging Face org, and Perceptron SDK.

Grounded spatial intelligence: precise pointing and localization through occlusion/clutter; ask “what’s broken?” and get a visually cited answer grounding demo, and conversational pointing
In‑context learning for perception: add a few annotated examples (defects/safety flags) and the model adapts without YOLO‑style fine‑tuning few‑shot examples
Visual reasoning: OCR, dense scenes, multi‑resolution inputs; targeted for low latency and tail‑latency constraints (edge viability) visual reasoning
Try it and deploy: live demo and weights, plus a provider‑agnostic SDK and fal integration for quick experiments Perceptron demo, Perceptron SDK, and fal blog post

IBM debuts Granite‑Docling‑258M, an Apache‑2.0 document VLM for faithful PDF→HTML/MD conversion

IBM released Granite‑Docling‑258M, a 258M‑parameter image+text→text model that converts PDFs to structured HTML/Markdown with preserved layout, equations, tables, code blocks, and captions—wired directly into the Docling toolchain, under Apache‑2.0 model overview and HF collection.

Architecture: SIGLIP2‑base vision encoder + Granite‑165M LM with a pixel‑shuffle projector (IDEFICS3‑style), keeping the model small yet layout‑aware model notes
Capabilities: full‑page or region‑specific conversion; inline math and display equations; table and code formatting; basic multilingual support (EN + experimental ZH/JA/AR) deep dive
Tooling: Docling CLI/SDK outputs HTML/MD (incl. split‑page overlays) so downstream systems maintain reading order and structure rather than lossy plain text usage details

MiniCPM‑V 4.5 pushes on‑device VLMs, with a public demo and an open iOS cookbook

OpenBMB’s 8B MiniCPM‑V 4.5 is drawing praise for factual correction and reasoning, with a live demo and an iOS cookbook to run it locally on iPhone/iPad via their open‑sourced app user demo praise, HF demo, and iOS app repo.

On‑device focus: runs on Apple devices (MLX) and ships examples to stand up real multimodal apps without external dependencies deployment note
Practical strengths: community reports highlight strong "thinking" and robustness on vision‑questioning tasks despite the small footprint user demo praise
Getting started: try the hosted demo first, then clone the cookbook for local builds and quantized deployments on iOS HF demo and iOS app repo

✨ Agent Payments

Backfill placeholder for migration.

Google details AP2 with 60+ payments partners and cryptographic “Mandates” for agent purchases

Google’s Agent Payments Protocol (AP2) formalizes how AI agents buy things on users’ behalf, backed by cryptographic Mandates that prove authorization, authenticity, and accountability across rails from cards to stablecoins. The initial roster spans 60+ partners including Mastercard, PayPal, Coinbase, and American Express.

AP2 introduces Intent and Cart Mandates to gate search→checkout with dual approvals and signed artifacts, creating an auditable paper trail for autonomous transactions Protocol overview
Partner lineup (payments networks, PSPs, crypto firms) signals multi‑rail reach at launch; see the full wall of logos for ecosystem breadth Protocol overview

AP2 partner logos

Practical dev angle: AP2’s spec + open governance aim to standardize agent checkout across storefronts; early coverage outlines flows and GitHub specs so teams can wire agents to pay safely News recap, and Google blog post
Design note: pairing AP2 with A2A (coordination) and x402 (stablecoins) sets the stage for per‑task payments, limited‑time authorizations, and revocation, reducing fraud/pricing ambiguity for long‑running agents Protocol overview

Coinbase x402 brings HTTP-native stablecoin payments to AI agents, tying into Google’s agent stack

Coinbase unveiled x402, an open protocol that lets autonomous agents send/receive stablecoin payments directly over HTTP, with Google integration that slots x402 into the broader A2A (Agent‑to‑Agent) and AP2 ecosystems. The move lays concrete rails for agentic commerce use cases like per‑crawl fees, machine‑to‑machine billing, and micro‑transactions.

x402 is the stablecoin extension inside Google’s Agent Payments Protocol (AP2), enabling on‑chain settlements alongside cards/banks while keeping HTTP ergonomics Protocol fit, and Stablecoin focus
Google’s A2A handles agent discovery/coordination; AP2 provides auditable, cryptographic mandates; x402 adds the crypto rail—together forming the “virtual agent economy” Google’s been teeing up A2A summary, Overview thread, and Economy paper
Early marketplace signal: Coinbase launched x402 Bazaar to seed integrations; developers are eyeing agentic commerce patterns (micropayments, per‑task/per‑crawl billing) that were hard to do with legacy rails Economy paper, and Use cases
Why it matters for engineers: predictable, standardized payments unlock background agents (research, scraping, automation) to transact autonomously with traceability; expect SDKs, webhooks, and mandate flows to become the default pattern for agent toolchains Step‑by‑step, and Stablecoin focus

✨ Ai Wearables

Backfill placeholder for migration.

Meta’s Ray‑Ban Display glasses debut with on‑lens UI and an EMG wristband for hands‑free control

Meta announced Ray‑Ban Display AI glasses paired with a neural (EMG) wristband, bringing a 600×600, 20° FOV on‑lens display and hands‑free control to market at $799. Live demos highlighted streaming from the glasses and an in‑view UI, alongside new features like Conversation Focus, live translations/voice, and async “agent” workflows—though one on‑stage vision demo stumbled due to Wi‑Fi. launch summary, price detail, availability note, demo opening

On‑device UI and optics: 600×600 pixels, 20° FOV, ~5,000 nits with ~2% light leakage for privacy; designed to be readable in bright scenes display specs
Neural wristband: EMG control shown for subtle, hands‑free interactions with the glasses and apps neural band demo
Conversation Focus and comms: selectively amplifies voices in noisy spaces; live translations and voice calling supported conversation focus, and translations and calls
Agent workflows: “Live AI agents” can run tasks and notify you when done, enabling async usage on‑the‑go agents feature
Demo hiccups: a vision capability demo failed on stage; presenters later blamed Wi‑Fi, not the model vision demo, and Wi‑Fi comment

lens display specs

Ray‑Ban Meta Gen 2 brings 3K recording, wider FOV and stabilization at $499 in October

Meta’s Ray‑Ban Meta Gen 2 refresh focuses on capture: 3K video, a wide‑angle camera and stabilization, plus a 2× battery boost over prior models. It’s slated to ship in October for $499. 3K and battery, wide angle, and price and ship

Capture upgrades: 3K recording for sharper POV footage with on‑device stabilization 3K and battery, and wide angle
Practicality: 2× battery for longer sessions; same familiar frames with better internals 3K and battery
Timing and cost: $499 with deliveries starting in October (regional rollouts vary) price and ship
Software features: inherits the broader platform updates (translations, calls, agents) announced alongside the Display model features rollup

Oakley Meta Vanguard targets athletes with Garmin/Strava hooks and ruggedized design

Meta and Oakley introduced the Meta Vanguard variant tuned for athletes, integrating Garmin and Strava and emphasizing outdoor fitness use with camera and audio improvements. fitness integrations, and oakley mention

Training integrations: native Garmin and Strava support for workouts, routes and metrics in the glasses workflow fitness integrations
Capture and optics: wide‑angle camera and stabilization for action footage; designed for active environments camera features
Platform parity: benefits from the same AI layer (translations, calls, Conversation Focus) shared across the lineup feature bundle

oakley camera lens

✨ Business Funding Enterprise

Backfill placeholder for migration.

China orders major tech firms to halt Nvidia AI chip purchases, steering demand to domestic silicon

China’s internet regulator instructed leading platforms (e.g., Alibaba, ByteDance) to stop buying Nvidia’s China‑market chips (e.g., RTX Pro 6000D/H20 context), asserting local AI processors can match or beat allowed Nvidia parts—shifting procurement toward Huawei/Cambricon. ban report, Huang comments

Nvidia CEO called the move disappointing; broader US‑China agendas loom over chip access and export licensing Huang comments
Coverage notes a recent US license framework (with a 15% revenue share) that hadn’t yet translated to sales; firms paused large orders following the directive CNBC summary, and CNBC article
Enterprise impact: near‑term friction for multinationals and local clouds; accelerates domestic chip adoption and supplier diversification in China’s AI stack Ars coverage

Groq raises $750M at a $6.9B valuation to expand inference capacity

Groq secured $750M to meet surging demand for low‑latency, low‑cost inference and expand its footprint powering the "American AI Stack." The round, led by Disruptive and joined by BlackRock, Samsung, Cisco and others, values the company at $6.9B. funding announcement, Groq press release

Funding will scale GroqCloud and LPU deployments serving 2M+ developers and Fortune 500 workloads across NA, EU, and ME regions funding announcement
CEO frames compute as "the new oil" amid an inference boom, with media outlining the company’s role in US‑origin AI export strategy Bloomberg interview, and Bloomberg video
Positioning: accelerate capacity to support reasoning agents and real‑time apps where speed dominates UX and unit economics funding announcement

Anthropic’s Economic Index: 77% of Claude API usage is full task automation

Anthropic’s new Economic Index shows most enterprise use of Claude is end‑to‑end automation (not copiloting), spotlighting risk displacement for entry‑level knowledge work. Fortune summary

Split: 77% full task delegation, 12% learning/augmentation; dominant tasks include coding, office workflows, and bulk writing Fortune summary
Labor signal: report and commentary highlight potential attrition in junior roles and a shift of Gen Z toward trades as insulation from early automation Fortune summary, data recap
Enterprise read: vendors wiring agents directly into workflows to harvest productivity gains explains the hands‑off tilt of usage Fortune summary

Microsoft invests $30B in UK AI, including a 23k+ GPU supercomputer with Nscale

Microsoft announced a $30B UK investment (2025–2028) split between a 23k+ NVIDIA GPU supercomputer with Nscale (~~$15B) and broader expansion (~~$15B) to drive AI adoption across finance, healthcare and other sectors. investment brief

Signals sovereign and regional compute build‑out to meet model training/inference demand and enterprise rollouts investment brief
Complements ongoing UK‑US AI alignment efforts and intensifies competition for regional AI hubs and talent investment brief

D‑ID acquires simpleshow to pursue the $50B interactive AI market

D‑ID bought corporate video platform simpleshow, combining interactive digital humans with enterprise‑grade video creation and a 1,500‑customer base across 70 countries to target a $50B interactive AI market. transaction coverage

Distribution: inherits long‑tenured enterprise relationships and leverages Azure partnerships for global scale transaction coverage
Product thesis: always‑on avatars for training, sales and marketing; expands from creator‑led TTS/avatars into broader enterprise learning stacks transaction coverage

Perplexity integrates 1Password into Comet for built‑in personal security

Perplexity announced a partnership to bring 1Password into its Comet experience, baking credential management and security workflows directly into research and answer sessions. partnership note

Adoption signal: reduces context‑switching for users who need secure access while querying docs, apps, and the web partnership note
Enterprise read: blends browsing, answers, and security UX—an integration pattern likely to spread across knowledge‑worker stacks partnership note

✨ Feature Icpc Reasoning

Backfill placeholder for migration.

OpenAI hits a perfect 12/12 under ICPC conditions with GPT‑5 plus an internal reasoner

OpenAI reports its general‑purpose reasoning stack solved all 12 ICPC World Finals problems in an ICPC‑supervised AI track, matching human contest constraints. GPT‑5 solved 11/12 on first submission; an internal reasoning model selected and delivered the final hardest solution after multiple attempts. OpenAI summary, problem split, and approach details

Same five‑hour window, identical hidden tests, time/memory limits, hardware specs, and sandboxing as human teams, but scored in a separate AI track (not on the official human leaderboard). ICPC context
Method: generate multiple candidate solutions with GPT‑5 and an internal reasoner, then use the experimental reasoner to select/submit the best; 11 first‑try correct submissions from GPT‑5, 1 from the reasoner after 9 total submits on the final problem. approach details, problem split
Community note: this marks a first clean superhuman result on a top programming contest; internal leaders and observers emphasized the significance. OpenAI CTO note, sama comment

OpenAI ICPC posts

Gemini 2.5 Deep Think earns gold‑level at ICPC (10/12) and uniquely solves Problem C

Google DeepMind’s advanced Gemini 2.5 Deep Think achieved gold‑medal performance in the ICPC AI track, solving 10 of 12 problems in 677 minutes and cracking a “ducts/reservoir optimization” task that no university team solved. DeepMind announcement, 10 of 12, and Problem C note

Compete setup mirrored human constraints (five hours, same judge); DeepMind shared that multiple agents propose solutions, run tests, and verify before submission. Full details in the post. DeepMind blog
DeepMind frames this as a “profound leap” in abstract problem solving, building on its earlier IMO math success and highlighting multi‑step reasoning, RL advances, and parallel thinking. performance factors, future assistants

Gemini ICPC post

ICPC AI track conditions clarified; community probes tool use and posts references

Both OpenAI and Google ran in an ICPC‑supervised AI track under the same five‑hour window and judge constraints as humans; the AI results do not alter the official human leaderboard (best human teams solved 11/12). context thread

OpenAI states identical hidden tests, time/memory limits, hardware specs, and sandboxing; GPT‑5 + an internal model did 12/12 with an ensemble‑and‑select strategy. ICPC context, and approach details
DeepMind shared method notes (agents proposing, executing code/tests); community asked for clarity on code‑execution permissions during the contest; DeepMind’s blog provides the official framing. tooling question, DeepMind blog
For a sense of difficulty, see ICPC Problem G “Lava Moat” (terrain triangulation + minimum moat path). Even top human teams took ~270 of 300 minutes; OpenAI cited 9 submissions on the hardest item. problem thread, hardest problem split

Lava Moat problem

Why this matters: first clear superhuman coding result, and what comes next

Researchers and practitioners call this the first unambiguous superhuman coding milestone on a premier contest: OpenAI at 12/12 in ICPC conditions and Gemini at 10/12 with a unique human‑unsolved problem. swyx analysis, and professor take

Perspective: community compares this to prior leaps (Deep Blue, AlphaGo), noting rapid gains since o1/o3 era and forecasting smarter coding assistants and research acceleration. then vs now, historic framing
Caveats: separate AI track, continued requests for transparent artifacts and rules (e.g., execution constraints), and need for third‑party replication and tooling audits. verification question, DeepMind blog

✨ Hardware Accelerators

Backfill placeholder for migration.

China orders tech giants to halt Nvidia AI chip buys, pushing shift to Huawei/Cambricon

Beijing’s internet regulator told major platforms (e.g., Alibaba, ByteDance) to stop testing and purchasing Nvidia’s China‑market GPUs, escalating the push for domestic accelerators. Reports cite RTX Pro 6000D and prior H20 allowances now effectively off the table, with evaluations concluding local parts can match or beat the permitted Nvidia SKUs. China ban thread, FT headline

Regulators instructed firms to cancel orders and pause verification runs (some were in the “tens of thousands” of units) while steering demand to Huawei Ascend and Cambricon parts FT headline
Nvidia’s Jensen Huang called the move “disappointing,” following a summer of whiplash (a Trump‑era license framework for H20 with a 15% revenue share, and a parallel antitrust probe into Mellanox in China) CNBC article
Coverage notes the CAC directive widens earlier guidance; multiple outlets say China now considers domestic AI processors “good enough,” accelerating semiconductor self‑reliance Ars Technica
Market takeaway for AI builders: sourcing in China will increasingly hinge on Huawei/Cambricon stacks and software toolchains, not Nvidia’s China‑specific boards policy roundup

Groq raises $750M at $6.9B to scale ultra‑low‑latency inference infrastructure

Groq secured $750M to expand LPU‑powered inference capacity, citing “insatiable” demand for fast, low‑cost model serving from 2M+ developers and Fortune 500s. The raise positions Groq squarely in the accelerator race alongside GPU clouds, with a pitch centered on speed/price per token. funding news

Company frames growth as part of an “American AI Stack,” adding global data centers and enterprise SLAs for training/inference endpoints Groq funding page
CEO Jonathan Ross emphasized compute scarcity (“compute is the new oil”) and mainstream inference economics in a Bloomberg segment CEO soundbite, Bloomberg video
Practical takeaway: if you’re cost‑bound on reasoning or RAG traffic, it’s worth benchmarking LPU throughput/latency against your current GPU fleet, especially for steady, high‑QPS workloads

Together AI schedules NVIDIA Blackwell deep‑dive with NVIDIA and SemiAnalysis

Together AI will host an October 1 session dissecting Blackwell’s architecture, training/inference optimizations, and infra patterns, featuring Dylan Patel (SemiAnalysis) and Ian Buck (NVIDIA). For infra leads, it’s a chance to map Blackwell design choices to cluster networking, schedulers, storage and SLA‑backed endpoints. webinar announce

Session promises concrete guidance on deploying Blackwell for reasoning workloads and GPU cloud considerations (security, observability, throughput) event page
In context of GB300 burn‑in, where new large racks were coming online, this is a timely bridge from “what’s rolling out” to “how to run it well”

NVIDIA AI Dev teases hardware powering up in lab ahead of new deployments

NVIDIA’s AI Dev account posted a short “out of the box, powering up in the lab” clip, hinting at fresh systems coming online. For teams tracking delivery windows and developer kit availability, it’s another sign of near‑term rollouts. lab teaser

server powering up

The teaser lands as operators brace for next‑gen GPU waves and software stack updates; keep an eye on driver/runtime notes tied to these shipments lab teaser

✨ Research Papers

Backfill placeholder for migration.

World modeling with Probabilistic Structure Integration (PSI) learns usable structure from raw video without actions

Stanford’s PSI introduces a 3‑step loop—probabilistic prediction, structure extraction, and integration—that turns raw video into a self‑improving world model able to predict, edit, and control scenes. A 7B model trained on 1.4T tokens shows gains across prediction, flow, depth, novel views, and object edits. paper thread

PSI abstract figure

Predicts full distributions (keeps what‑if outputs sharp) and uses random-access patch prompts to reveal or steer motion and camera pose paper thread
Zero‑shot extraction of optical flow (motion), depth (viewpoint), and segments (shared movement), then re‑tokens these maps for continued training (structure integration) paper thread
After integration, can move objects from sparse flow, steer views with depth, and compose controls cleanly; shows measurable improvements across benchmarks with the 7B/1.4T‑token setup paper thread

Diffusion model locality traced to dataset statistics; sensitivity fields match Wiener‑like projections

New analysis shows that the pixel‑local behavior seen in diffusion models is largely a property of the data, not the U‑Net vs. transformer architecture. The authors introduce sensitivity fields that align with high‑SNR directions in datasets and build an analytical denoiser that better matches trained models than prior “optimal denoisers.” paper thread

paper first page

On natural photos, influence concentrates near each pixel; on centered faces, long‑range links appear (e.g., between eyes), matching data structure paper thread
Adding a faint W‑shaped tint to training images leads the learned sensitivity to reproduce the same pattern, demonstrating a data‑driven cause paper thread
Analytical denoiser using data‑driven projections often beats a plain Wiener filter and tracks trained diffusion behavior more closely than closed‑form baselines paper thread

Turning abstract harmful ideas into concrete drafts then "editing" them bypasses defenses, lifting jailbreak success from 7% to 62%

Researchers show a two‑stage pipeline where a cheaper model drafts requirements/pseudocode and a stronger model "improves" it can pierce safety filters: extending a draft is screened less than generating from scratch. Three refinement steps were best; four steps hurt. paper thread

paper title

Pipeline: low‑tier model produces spec/prototype → high‑tier model upgrades code using both the original and draft; average cost ≈ $0.075 per prompt paper thread
Baseline 7% jailbreak rate rises to 62% at 3 refinements; panel ratings find refined outputs more harmful and technically solid paper thread
Many scripts run with minor fixes; strong attacks still need target‑specific tuning; “extension phase” is the weak point in current guardrails paper thread

People can’t reliably predict or redact text cues that let LLMs infer sensitive traits; automated aids outperform humans

A UChicago/Google study of 240 U.S. users shows they’re poor at spotting what traits (job, age, location, relationship) can be inferred from text—and even worse at removing the cues. User rewrites worked 28% of the time vs. ChatGPT’s 50%, suggesting UI safeguards to flag risky clues are needed. paper thread

paper title

Participants overestimated some risks and missed others; concern was miscalibrated across trait types paper thread
Simple paraphrasing fails; deletions, vagueness, or generalization work better, but are hard to do unaided paper thread
Key takeaway: inference can’t be reliably blocked by users alone—systems should highlight and suggest edits automatically paper thread

Probabilistic reasoning study: distribution queries are fine; conditioning, label bias, and long‑context counting remain weak spots

Across tasks like mode finding, likelihood tables, and sampling from given discrete distributions, larger models do well—until conditioning or long raw lists are introduced. Counting errors dominate, but tool‑assisted counting fixes much of it. Labels (e.g., fruit/state names) bias results. paper thread

paper title

Top models can sample closely to target distributions—sometimes beating a basic random sampler—but conditional sampling is unreliable paper thread
Performance degrades as outcome space grows; distilled/large models hold up better on easy cases paper thread
Label semantics matter; swapping names changes accuracy; external code tools for counting mitigate 60%+ drops on long lists paper thread

OpenAI GPT‑5 nails 12/12 at ICPC AI track – 5‑hour contest parity

Evals: ICPC Breakthrough & Real‑world Agent Tests

📑 Table of Contents

On this page