Grok 4.1 tops Arena at 1483 Elo – wins 64.8% of rollout tests

Executive Summary

xAI’s Grok 4.1 is rolling out as a Beta across grok.com, X, iOS, and Android. It matters because the thinking variant climbed to 1483 Elo atop LMArena (non‑thinking hits 1465) and won 64.78% of comparisons in a quiet, blind pairwise trial against the prior production model.

Early signals skew practical: internal slides show hallucinations on info‑seeking prompts dropping from 12.09% to 4.22% and FActScore falling from 9.89% to 2.97% (lower is better on both). EQ‑Bench also ticks up, with normalized Elo around 1586 for thinking mode, worth testing if tone and persona consistency matter. Yes, EQ for bots is now a KPI.

A new model card cites ~95–98% refusal on clear abuse and fresh input filters, but propensity tables show higher sycophancy (0.19–0.23) and near‑flat dishonesty (~0.46–0.49); a “Library of Babel” jailbreak is already circulating, and a leaked system prompt outlines code execution plus web and X search tools. If you route via Grok, run pairwise tests on your own data, keep dangerous tool calls gated, and note DeepSearch sessions may still pin to the older model.

Feature Spotlight

Feature: Grok 4.1 tops Arena and ships broadly

xAI’s Grok 4.1 lands #1 on LMArena (1483 Elo) with a public web/app rollout and measured drops in hallucination—framing a new competitive bar for conversational quality and style control.

Heavy, multi‑account coverage: Grok 4.1 (thinking & non‑thinking) climbs to #1/#2 on LMArena, claims EQ gains and lower hallucinations, and appears as a Beta toggle on grok.com/X/iOS/Android. Mostly eval stats + rollout posts today.



Feature: Grok 4.1 tops Arena and ships broadly

Heavy, multi‑account coverage: Grok 4.1 (thinking & non‑thinking) climbs to #1/#2 on LMArena, claims EQ gains and lower hallucinations, and appears as a Beta toggle on grok.com/X/iOS/Android. Mostly eval stats + rollout posts today.

Grok 4.1 Beta ships on web (thinking and non‑thinking modes)

Grok 4.1 is now selectable on grok.com as a standalone Beta in the model picker, with both “thinking” and “non‑thinking” options live for many users Beta rollout, Web picker. xAI’s announcement frames it as more concise, with higher intelligence per token, and broadly available across grok.com, X, iOS and Android, though DeepSearch still toggles to the previous model for some sessions xAI post.

Why it matters: Teams can A/B the new behavior in production chats today. If you rely on Grok’s X search, note that DeepSearch may still pin to the older model for now User note.

Grok 4.1 tops LMArena with #1/#2 overall spots

xAI’s Grok 4.1 vaulted to the top of the community‑run LMArena: the thinking mode landed 1483 Elo at #1 and the non‑thinking variant posted 1465 at #2, ahead of other models’ full reasoning configs Leaderboard update, xAI note. The Arena team also notes a 40+ point jump versus Grok 4 fast from two months ago.

  • Expert board: Grok 4.1 (thinking) #1 (1510); non‑thinking #19 (1437) Leaderboard update.
  • Occupational board: Grok 4.1 (thinking) shows broad strength across software, science, legal and business domains Occupational boards.

Why it matters: Arena win‑rates translate to fewer edge‑case stumbles in day‑to‑day chats and coding reviews. If you route by model quality, this is a new default to test against Gemini 2.5 Pro and Claude 4.5 Sonnet.
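For intuition on what those rating gaps mean head‑to‑head, the classic Elo expectation formula is a reasonable approximation (LMArena fits a Bradley‑Terry‑style model, but small gaps behave similarly); a quick sketch:

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Expected win probability of A over B under the classic Elo formula."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# 1483 (thinking) vs 1465 (non-thinking): roughly a 52.6% head-to-head edge
print(round(elo_win_prob(1483, 1465), 3))  # ~0.526
# A 40-point jump (the cited gain over Grok 4 fast) is roughly 55.7%
print(round(elo_win_prob(1483, 1443), 3))  # ~0.557
```

In other words, #1 over #2 here is a modest but consistent edge rather than a blowout.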

Grok 4.1 cuts hallucinations vs Grok 4 fast

Internal slides show Grok 4.1’s hallucination rate on info‑seeking prompts down to 4.22% from 12.09% on Grok 4 fast; its FActScore falls to 2.97% from 9.89% (lower is better on both charts) Hallucination charts.

Why it matters: Fewer ungrounded claims mean fewer clean‑up passes in retrieval workflows and less need for strict tool‑forcing, which is especially useful when you don’t want to pay the latency of web search on easy facts.
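If you want to check the delta on your own prompts before relying on it, a minimal comparison harness is enough. The `ask` and `is_ungrounded` hooks below are hypothetical stand‑ins for your client and grading step (human or LLM judge), not an xAI API:

```python
def hallucination_rate(model: str, prompts: list[str], ask, is_ungrounded) -> float:
    """Share of info-seeking prompts whose answer contains at least one
    unsupported claim, per your own grading function."""
    flagged = sum(1 for p in prompts if is_ungrounded(p, ask(model, p)))
    return flagged / len(prompts)

# Run the same prompt set against the new and incumbent models, e.g.:
# for model in ("grok-4.1", "grok-4-fast"):   # hypothetical model ids
#     print(model, hallucination_rate(model, prompts, ask, is_ungrounded))
```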

Grok 4.1 leads EQ‑Bench; creative writing scores jump

Shared EQ‑Bench charts place Grok 4.1 (thinking) and Grok 4.1 at the top with normalized Elo 1586 and 1585 respectively, ahead of Kimi K2 and Gemini 2.5 Pro EQ‑Bench chart, EQ and writing. Creative Writing v3 rankings likewise show Grok 4.1 variants pushing into the top tier, only trailing an early GPT‑5.1 Polaris checkpoint EQ‑Bench chart.

Why it matters: If your app needs empathetic, persona‑steady replies (support, sales coaching, tone rewrites), Grok 4.1’s EQ cluster is worth piloting against Claude.

xAI posts Grok 4.1 model card; early jailbreaks test the edges

xAI published a Grok 4.1 model card outlining abuse refusal results and propensity metrics; posts cite ~95–98% refusal on clearly violative requests and new input filters for restricted biology/chemistry with low false negatives Model card PDF, Model card sighting. Propensity tables shared by reviewers show sycophancy 0.19–0.23 and deception 0.46–0.49, slightly above Grok 4’s 0.43 deception baseline Propensity table. Meanwhile, a community role‑play “Library of Babel” jailbreak claims to elicit prohibited content from Grok 4.1; prompt and examples are public for repro attempts Jailbreak thread, Prompt GitHub.

Why it matters: Safety posture looks tighter, but red‑teamers are already probing. If you deploy Grok in tools‑enabled contexts, keep test suites current and gate dangerous tool calls behind human review.
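A minimal version of that gate, wrapped around whatever tool‑dispatch loop you already run; the tool names and `request_human_approval` hook are placeholders for your own risk policy and review queue, not Grok’s actual tool API:

```python
# Hypothetical high-risk tool list; tune to your own threat model.
HIGH_RISK_TOOLS = {"code_execution", "shell", "send_email", "payments"}

def dispatch(tool_name: str, args: dict, execute, request_human_approval):
    """Run low-risk tools directly; hold high-risk calls for human review."""
    if tool_name in HIGH_RISK_TOOLS:
        if not request_human_approval(tool_name, args):  # blocks or enqueues
            return {"status": "rejected", "tool": tool_name}
    return execute(tool_name, args)
```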

Grok 4.1 system prompt leak details policies and tool suite

A widely shared file purports to show Grok 4.1’s system prompt, including top‑level safety bullets (decline criminal help, keep declines short), product redirects, and a tool list spanning a stateful code interpreter, web search, X keyword/semantic searches, thread fetch, and image/video viewers Prompt leak, Prompt GitHub. Treat as unverified, but the structure matches the product’s observed capabilities.

Why it matters: For integrators, this hints at how Grok arbitrates tool calls and why it sometimes prefers X search over web search. If you wrap Grok, align your system prompts to avoid conflicting directives.

Two‑week silent A/B shows 64.78% win rate for Grok 4.1

During a quiet two‑week prelaunch, xAI reportedly ran blind pairwise evals on live traffic and Grok 4.1 won 64.78% of comparisons against the incumbent model Rollout notes.

Why it matters: That’s a concrete routing signal. If you manage a meta‑router, weight Grok 4.1 more aggressively on general chat, writing, and ideation flows while you validate corner cases.
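The blind pairwise setup is easy to reproduce on your own traffic: randomize which model fills each A/B slot, keep the judge blind to the labels, and count preferences. A sketch, with `ask` and `judge` as hypothetical stand‑ins for your client and preference collection, and the model ids as placeholders:

```python
import random

def blind_pairwise_winrate(prompts, ask, judge,
                           challenger="grok-4.1", incumbent="grok-4"):
    """Share of blind A/B comparisons won by the challenger (ties excluded).
    judge(prompt, answer_a, answer_b) returns "a", "b", or "tie" without
    knowing which model produced which answer."""
    wins = losses = 0
    for prompt in prompts:
        pair = [challenger, incumbent]
        random.shuffle(pair)  # randomize position so labels stay hidden
        verdict = judge(prompt, ask(pair[0], prompt), ask(pair[1], prompt))
        if verdict == "tie":
            continue
        winner = pair[0] if verdict == "a" else pair[1]
        wins += winner == challenger
        losses += winner == incumbent
    return wins / max(wins + losses, 1)  # xAI reports 64.78% for Grok 4.1 here
```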


Frontier evals: ARC‑AGI SOTA and new knowledge benchmark

Eval‑heavy day: ARC‑AGI semi‑private results highlight GPT‑5.1 (Thinking, High), and a new AA‑Omniscience benchmark launches to grade knowledge reliability and abstention, plus an LLM poker tourney meta‑eval. Excludes Grok 4.1 rollout (feature).

GPT‑5.1 (Thinking, High) scores 72.83% on ARC‑AGI‑1 and 17.64% on ARC‑AGI‑2

OpenAI’s GPT‑5.1 (Thinking, High) posted 72.83% on ARC‑AGI‑1 at ~$0.67/task and 17.64% on ARC‑AGI‑2 at ~$1.17/task, on the ARC Prize semi‑private evals Verified results, with full plots on the official board ARC Prize leaderboard. This follows the Vals index, where GPT‑5.1 moved up the rankings; today’s numbers show strong price‑performance at verified settings.
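One way to read the price‑performance claim is cost per solved task: divide the per‑task price by the solve rate.

```python
# Cost per *solved* ARC-AGI task at the reported verified settings.
arc_agi_1 = 0.67 / 0.7283   # ~$0.92 per solved ARC-AGI-1 task
arc_agi_2 = 1.17 / 0.1764   # ~$6.63 per solved ARC-AGI-2 task
print(round(arc_agi_1, 2), round(arc_agi_2, 2))
```

By that measure, ARC‑AGI‑2 remains roughly 7× more expensive per solved task than ARC‑AGI‑1 at these settings.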

AA‑Omniscience launches to grade knowledge reliability; Claude 4.1 Opus leads Index

Artificial Analysis released AA‑Omniscience, a 6,000‑question, 42‑topic benchmark that rewards correct answers (+1), penalizes wrong answers (‑1), and gives 0 for abstentions; Claude 4.1 Opus tops the Omniscience Index, while Grok 4, GPT‑5, and Gemini 2.5 Pro lead on raw accuracy Benchmark thread, with the paper and public subset available for replication ArXiv paper.

  • Key takeaways: Hallucination is punished, domain leaders differ (e.g., Business vs Law), and only a few frontier models score slightly above 0 on the Index.
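The scoring rule is simple enough to reproduce, and it shows why raw accuracy and the Index diverge: a model that answers everything needs better than 50% accuracy just to break even, while a calibrated abstainer keeps its score positive. A sketch:

```python
def omniscience_index(results: list[str]) -> float:
    """Mean of +1 (correct), -1 (wrong), 0 (abstain) over all questions,
    per the scoring rule described in the benchmark thread."""
    score = {"correct": 1, "wrong": -1, "abstain": 0}
    return sum(score[r] for r in results) / len(results)

# Answer everything at 60% accuracy: 0.6 - 0.4 = 0.2
print(omniscience_index(["correct"] * 60 + ["wrong"] * 40))                     # 0.2
# Same knowledge, but abstain on half the items you'd get wrong: 0.6 - 0.2 = 0.4
print(omniscience_index(["correct"] * 60 + ["wrong"] * 20 + ["abstain"] * 20))  # 0.4
```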

Author reply: Omniscience measures what models know and when to abstain, not “general IQ”

An AA‑Omniscience author clarifies the goal is knowledge reliability—grading whether a model knows specific facts and declines when it doesn’t—rather than being an intelligence test; “hallucination” is defined as answering incorrectly when it should abstain Author reply. The note also stresses domain‑level decisions (e.g., Kotlin knowledge for coding) versus picking a single overall “best” model.

Critique: AA‑Omniscience may conflate refusal thresholds with narrow‑fact performance

Ethan Mollick argues the benchmark leans on refusal thresholds over true hallucination rates and uses extremely narrow facts, suggesting we need richer error taxonomies and analysis beyond a single score Critique thread. He cites examples of obscure finance and literature queries and asks whether “wrong” answers that express uncertainty should be treated differently.

LLM poker eval: Gemini 2.5 Pro wins Texas Hold’em; styles mapped across models

Lmgame Bench ran a ~60‑hand Texas Hold’em tourney where Gemini‑2.5‑Pro topped the table, DeepSeek‑V3.1 placed second, and Grok‑4‑0709 third; analysis tagged play styles from loose‑passive to loose‑aggressive, showing strategy variance under the same neutral rules Tournament recap. The team notes more rounds will improve the TrueSkill signal; replays and boards are linked in the post.


