Poetiq ARC‑AGI‑2 harness pushes GPT‑5.2 X‑High to 75% – $8 tasks

Stay in the loop

Free daily newsletter & Telegram daily report

Executive Summary

Poetiq’s scaffolded solver on GPT‑5.2 X‑High posts 75% on ARC‑AGI‑2’s public eval—about 15 points over the ~60% human baseline and ~14 points above base GPT‑5.2 X‑High at ~61%—with runs averaging roughly $8 per task. The system compiles puzzles into Python, iteratively tests and rewrites code, and stops early once confidence thresholds are met, turning ARC‑AGI‑2 into a program‑synthesis benchmark rather than pure text reasoning. Commentators argue this marks the public ARC‑AGI‑2 split as effectively saturated; skeptics counter that Poetiq’s solver may overfit the format and remind that hidden/verified sets remain unsolved, underscoring the growing gap between “model capability” and “system‑plus‑harness capability.”

• Serving & retrieval: LMSYS’s SpecBundle EAGLE‑3 drafts and SpecForge v0.2 deliver 2–3× throughput gains on 17B–30B models; xAI’s Grok Collections adds layout‑aware hybrid RAG with finance/legal/code eval claims, while LlamaParse’s DOJ redaction episode shows why binary‑layer parsing matters.
• Safety, evals & infra: OpenAI quantifies chain‑of‑thought monitorability and uses RL attackers to harden Atlas; Epoch charts faster capability and time‑horizon doubling; the EVIL benchmark and UK NCSC prompt‑injection guidance expose persistent misuse channels; SoftBank’s $22.5B OpenAI funding push and Microsoft’s AI‑assisted Rust rewrite plan highlight how capital, code migration, and agents now co‑evolve with model progress.

Feature: Poetiq harness pushes ARC‑AGI‑2 past human baseline

Poetiq’s scaffolded solver using GPT‑5.2 X‑High scores 75% on ARC‑AGI‑2 public eval (~$8/task), surpassing the 60% human baseline—spotlighting system harnessing as a key driver of frontier performance.

Cross‑account today: Poetiq’s scaffolded system on GPT‑5.2 X‑High hits 75% on ARC‑AGI‑2 public eval at ~$8/task, beating the 60% human line and igniting debate about benchmark saturation and the power of harness design.

Jump to Feature: Poetiq harness pushes ARC‑AGI‑2 past human baseline topics

🧩 Feature: Poetiq harness pushes ARC‑AGI‑2 past human baseline

Poetiq harness on GPT‑5.2 X‑High hits 75% on ARC‑AGI‑2

Poetiq ARC‑AGI harness (Poetiq): Poetiq reports 75% accuracy on the ARC‑AGI‑2 public eval set using a scaffolded system built on GPT‑5.2 X‑High, crossing the ~60% human baseline and previous AI bests near 60% as shown in the public eval chart and comparison plot; several summaries highlight that this score costs about $8 per task, which is materially higher than prior runs but still within many research budgets comparison plot. This pushes ARC‑AGI‑2 from a frontier reasoning target into territory where system design plus a strong base model can exceed average human test‑takers.

Engineers get a concrete performance–cost point: around 75% with GPT‑5.2 X‑High plus Poetiq’s harness versus roughly 60% with base models or simpler prompts, according to the scatter plots and cost axes in the alternate chart. Commentary frames this as the first time an AI system has clearly beaten the human line on the public ARC‑AGI‑2 eval under realistic budget assumptions, though not yet on the hidden “verified” set arc explainer.

Poetiq ARC‑AGI‑2 harness pushes GPT‑5.2 X‑High to 75% – $8 tasks

Executive Summary

Top links today

Feature: Poetiq harness pushes ARC‑AGI‑2 past human baseline

Table of Contents

🧩 Feature: Poetiq harness pushes ARC‑AGI‑2 past human baseline

Poetiq harness on GPT‑5.2 X‑High hits 75% on ARC‑AGI‑2

Harness lifts GPT‑5.2 X‑High from ~61% to 75% on ARC‑AGI‑2

Inside Poetiq’s iterative ARC‑AGI‑2 harness for GPT‑5.2

ARC‑AGI‑2 nears saturation as Poetiq jumps from 65% to 75%

🛠️ Agent dev UX: plugins, integrations and context control

ChatGPT now runs Replit apps directly from chat with /replit

Braintrust links Claude Code traces and production data in both directions

Claude Code adds /plugins marketplace with official integrations

Firecrawl’s /agent endpoint turns natural language into multi‑step web crawls

Higher‑order instruction layer improves Claude Code system prompts

Dev‑Agent‑Lens proxies Claude Code through LiteLLM with full tracing

Warp adds /fork-and-compact and /compact to tame agent context

CodexBar 0.13 adds browser‑cookie Claude auth and richer usage stats

Peakypanes adds shared quick‑reply input across multiple agent sessions

🔎 RAG stacks and document parsing you can ship

xAI launches Grok Collections API for layout-aware hybrid document search

LlamaParse exposes hidden text behind DOJ’s layered PDF redactions

OpenRouter and NVIDIA NeMo add runtime flags for distillable models

⚙️ Speculative decoding at scale

LMSYS ships SpecBundle EAGLE‑3 drafts and SpecForge v0.2 for production spec decoding

SpecBundle EAGLE‑3 draft models deliver 3×+ throughput vs non‑spec decoding

📊 Benchmark trendlines and measurement rigor (excludes feature)

Epoch index shows frontier AI capability gains nearly doubled since 2024

Epoch details how scaffolds and providers can skew benchmark scores

METR time horizons now doubling in about 4.6 months

ValsAI card shows GLM‑4.7 competitive across law, finance and coding

Gemini 3 Flash debuts #5 on SimpleBench, behind GPT‑5 Pro

MiMo‑V2‑Flash ranks #5 open model on WebDev, #25 in Text Arena

🧠 MiniMax M2.1: coding/agent traction and early sentiment

MiniMax M2.1 ranks #2 open-weight on Vals Index with strong cost profile

Builders report strong long-horizon and design performance from MiniMax M2.1

MiniMax M2.1 spreads into Kilo, Roo Code, Trae and Code Arena

🔌 MCP Apps and server cohesion

MCP Apps proposes standard interactive UIs for MCP servers

Analysis shows Claude Code slash commands can nearly double context use vs skills

mcp-config generates multi-client install snippets from mcp.json

Engineers worry MCP tool search may surface incohesive tools in isolation

📑 Fresh research: monitorability, world video, SLM tool use, async thinking

OpenAI studies how monitorable model chain-of-thought really is

Asynchronous Reasoning lets LLMs think, listen and talk at once

AWS fine-tunes a 350M model to 77.55% on ToolBench

Nemotron‑Math releases 7.5M long math traces with Python tool use

WorldWarp couples long video generation to a live 3D geometry cache

🛡️ Agent security: prompt‑injection hardening and misuse risks

EVIL benchmark shows LLM judges often help users break the law, with demographic skew

OpenAI details RL-based attacker used to harden ChatGPT Atlas against prompt injection

UK NCSC warns prompt injection is not SQL injection and may be worse

🎥 Cinematic AV and image editing: Seedance, Kandinsky, Qwen‑Edit

Qwen‑Image‑Edit‑2511 goes open‑source with stronger identity‑preserving edits

Seedance 1.5 Pro brings native audio‑video generation to Higgs, fal and Replicate

LightX2V accelerates Qwen‑Image‑Edit‑2511 pipelines by over 40×

invideo Vision turns one prompt into 3×3 cinematic storyboards

Kandinsky 5.0 Video Pro lands on fal for HD controllable text‑to‑video

Wan 2.6 Image on fal adds multi‑reference style, subject and background fusion

Hitem3D V2.0 generates print‑grade 3D models ready for CNC and 3D printing

ImagineArt’s new upscaler combines Topaz and Magnific for 16× images

Lucy Restyle Long‑Form on fal targets up to 30‑minute video restyling

Z‑Image plus SCAIL enable consistent multi‑character pose transfer for videos

💼 Enterprise moves: agents in knowledge work, codebase rewrites

Microsoft plans AI‑assisted Rust rewrite of all C/C++ by 2030

ClickUp acquires Codegen to embed background coding agents into work platform

Anthropic’s Project Vend phase two stresses Claude agents in real retail operations

Similarweb iOS data shows ChatGPT app ~20× Gemini’s daily active users

Salesforce exec frames agents as future brand ambassadors for enterprises

⚡ Power and capital fueling AI buildouts

SoftBank races to deliver $22.5B OpenAI funding for Stargate‑scale buildout

🗣️ Production TTS and cloning pipelines

Qwen3‑TTS ships controllable VoiceDesign and 3‑second multilingual VoiceClone

MiniMax Speech 2.6 Turbo arrives on Together AI with real‑time, 40+ language TTS

🤖 Robotics demos and public performances

Unitree G1 shows polished kung‑fu routines and shares stage at major concert

“Pick Up Anything” wheeled robot nails small-object bin picking

On this page