Grok 4.1 tops Arena at 1483 Elo – wins 64.8% rollout tests | Daily AI Primer

Executive Summary

xAI’s Grok 4.1 is rolling out as a Beta across grok.com, X, iOS, and Android. It matters because the thinking variant climbed to 1483 Elo atop LMArena (non‑thinking hits 1465) and won 64.78% in a quiet, blind pairwise trial against the prior production model.

Early signals skew practical: internal slides show hallucinations on info‑seeking prompts dropping from 12.09% to 4.22%, with FActScore falling to 2.97% from 9.89% (lower is better). EQ‑Bench also ticks up, with normalized Elo around 1586 for thinking mode — worth testing if tone and persona consistency matter. Yes, EQ for bots is now a KPI.

A new model card cites ~95–98% refusal on clear abuse and fresh input filters, but propensity tables show higher sycophancy (0.19–0.23) and near‑flat dishonesty (~0.46–0.49); a “Library of Babel” jailbreak is already circulating, and a leaked system prompt outlines code execution plus web and X search tools. If you route via Grok, run pairwise tests on your own data, keep dangerous tool calls gated, and note DeepSearch sessions may still pin to the older model.

Feature Spotlight

Feature: Grok 4.1 tops Arena and ships broadly

xAI’s Grok 4.1 lands #1 on LMArena (1483 Elo) with a public web/app rollout and measured drops in hallucination—framing a new competitive bar for conversational quality and style control.

Heavy, multi‑account coverage: Grok 4.1 (thinking & non‑thinking) climbs to #1/#2 on LMArena, claims EQ gains and lower hallucinations, and appears as a Beta toggle on grok.com/X/iOS/Android. Mostly eval stats + rollout posts today.

Jump to Feature: Grok 4.1 tops Arena and ships broadly topics

Stay in the loop

Get the Daily AI Primer delivered straight to your inbox. One email per day, unsubscribe anytime.

Feature: Grok 4.1 tops Arena and ships broadly

Grok 4.1 Beta ships on web (thinking and non‑thinking modes)

Grok 4.1 is now selectable on grok.com as a standalone Beta in the model picker, with both “thinking” and “non‑thinking” options live for many users Beta rollout, Web picker. xAI’s announcement frames it as more concise, higher intelligence per token and broadly available across grok.com, X, iOS and Android, with DeepSearch still toggling the previous model for some sessions xAI post.

Why it matters: Teams can A/B the new behavior in production chats today. If you rely on Grok’s X search, note that DeepSearch may still pin to the older model for now User note.

Grok 4.1 tops LMArena with #1/#2 overall spots

xAI’s Grok 4.1 vaulted to the top of the community‑run LMArena: the thinking mode landed 1483 Elo at #1 and the non‑thinking variant posted 1465 at #2, ahead of other models’ full reasoning configs Leaderboard update, xAI note. The Arena team also notes a 40+ point jump versus Grok 4 fast from two months ago.

Expert board: Grok 4.1 (thinking) #1 (1510); non‑thinking #19 (1437) Leaderboard update.
Occupational board: Grok 4.1 (thinking) shows broad strength across software, science, legal and business domains Occupational boards.

Why it matters: Arena win‑rates translate to fewer edge‑case stumbles in day‑to‑day chats and coding reviews. If you route by model quality, this is a new default to test against Gemini 2.5 Pro and Claude 4.5 Sonnet.

Grok 4.1 cuts hallucinations vs Grok 4 fast

Internal slides show Grok 4.1’s hallucination rate on info‑seeking prompts down to 4.22% from 12.09% on Grok 4 fast; its FActScore falls to 2.97% from 9.89% (lower is better on both charts) Hallucination charts.

Why it matters: Lower ungrounded claims reduce clean‑up passes in retrieval workflows and lessen the need for strict tool‑forcing—especially useful when you don’t want to pay latency for web search on easy facts.

Grok 4.1 leads EQ‑Bench; creative writing scores jump

Shared EQ‑Bench charts place Grok 4.1 (thinking) and Grok 4.1 at the top with normalized Elo 1586 and 1585 respectively, ahead of Kimi K2 and Gemini 2.5 Pro EQ‑Bench chart, EQ and writing. Creative Writing v3 rankings likewise show Grok 4.1 variants pushing into the top tier, only trailing an early GPT‑5.1 Polaris checkpoint EQ‑Bench chart.

Why it matters: If your app needs empathetic, persona‑steady replies (support, sales coaching, tone rewrites), Grok 4.1’s EQ cluster is worth piloting against Claude.

xAI posts Grok 4.1 model card; early jailbreaks test the edges

xAI published a Grok 4.1 model card outlining abuse refusal results and propensity metrics; posts cite ~95–98% refusal on clearly violative requests and new input filters for restricted biology/chemistry with low false negatives Model card PDF, Model card sighting. Propensity tables shared by reviewers show sycophancy 0.19–0.23 and deception 0.46–0.49, slightly above Grok 4’s 0.43 deception baseline Propensity table. Meanwhile, a community role‑play “Library of Babel” jailbreak claims to elicit prohibited content from Grok 4.1; prompt and examples are public for repro attempts Jailbreak thread, Prompt GitHub.

Why it matters: Safety posture looks tighter, but red‑teamers are already probing. If you deploy Grok in tools‑enabled contexts, keep test suites current and gate dangerous tool calls behind human review.

Grok 4.1 system prompt leak details policies and tool suite

A widely shared file purports to show Grok 4.1’s system prompt, including top‑level safety bullets (decline criminal help, keep declines short), product redirects, and a tool list spanning a stateful code interpreter, web search, X keyword/semantic searches, thread fetch, and image/video viewers Prompt leak, Prompt GitHub. Treat as unverified, but the structure matches the product’s observed capabilities.

Why it matters: For integrators, this hints at how Grok arbitrates tool calls and why it sometimes prefers X search over web search. If you wrap Grok, align your system prompts to avoid conflicting directives.

Two‑week silent A/B shows 64.78% win rate for Grok 4.1

During a quiet two‑week prelaunch, xAI reportedly ran blind pairwise evals on live traffic and Grok 4.1 won 64.78% of comparisons against the incumbent model Rollout notes.

Why it matters: That’s a concrete routing signal. If you manage a meta‑router, weight Grok 4.1 more aggressively on general chat, writing, and ideation flows while you validate corner cases.

Frontier evals: ARC‑AGI SOTA and new knowledge benchmark

Eval‑heavy day: ARC‑AGI semi‑private results highlight GPT‑5.1 (Thinking, High), and a new AA‑Omniscience benchmark launches to grade knowledge reliability and abstention, plus an LLM poker tourney meta‑eval. Excludes Grok 4.1 rollout (feature).

GPT‑5.1 (Thinking, High) scores 72.83% on ARC‑AGI‑1 and 17.64% on ARC‑AGI‑2

OpenAI’s GPT‑5.1 (Thinking, High) posted 72.83% on ARC‑AGI‑1 at ~$0.67/task and 17.64% on ARC‑AGI‑2 at ~$1.17/task, on the ARC Prize semi‑private evals Verified results, with full plots on the official board ARC Prize leaderboard. This follows Vals index where GPT‑5.1 moved up the rankings; today’s numbers show strong price‑performance at verified settings.

AA‑Omniscience launches to grade knowledge reliability; Claude 4.1 Opus leads Index

Artificial Analysis released AA‑Omniscience, a 6,000‑question, 42‑topic benchmark that rewards correct answers (+1), penalizes wrong answers (‑1), and gives 0 for abstentions; Claude 4.1 Opus tops the Omniscience Index, while Grok 4, GPT‑5, and Gemini 2.5 Pro lead on raw accuracy Benchmark thread, with the paper and public subset available for replication ArXiv paper. • Key takeaways: Hallucination is punished, domain leaders differ (e.g., Business vs Law), and only a few frontier models score slightly above 0 on the Index.

Author reply: Omniscience measures what models know and when to abstain, not “general IQ”

An AA‑Omniscience author clarifies the goal is knowledge reliability—grading whether a model knows specific facts and declines when it doesn’t—rather than being an intelligence test; “hallucination” is defined as answering incorrectly when it should abstain Author reply. The note also stresses domain‑level decisions (e.g., Kotlin knowledge for coding) versus picking a single overall “best” model.

Critique: AA‑Omniscience may conflate refusal thresholds with narrow‑fact performance

Ethan Mollick argues the benchmark leans on refusal thresholds over true hallucination rates and uses extremely narrow facts, suggesting we need richer error taxonomies and analysis beyond a single score Critique thread. He cites examples of obscure finance and literature queries and asks whether “wrong” answers that express uncertainty should be treated differently.

LLM poker eval: Gemini 2.5 Pro wins Texas Hold’em; styles mapped across models

Lmgame Bench ran a ~60‑hand Texas Hold’em tourney where Gemini‑2.5‑Pro topped the table, DeepSeek‑V3.1 placed second, and Grok‑4‑0709 third; analysis tagged play styles from loose‑passive to loose‑aggressive, showing strategy variance under the same neutral rules Tournament recap. The team notes more rounds will improve the TrueSkill signal; replays and boards are linked in the post.

Stay first in your field.

No more doomscrolling X. A crisp morning report for entrepreneurs, AI creators, and engineers. Clear updates, time-sensitive offers, and working pipelines that keep you on the cutting edge. We read the firehose and hand-pick what matters so you can act today.

I don’t have time to scroll X all day. Primer does it, filters it, done.

Renee J.

Startup Founder

The fastest way to stay professionally expensive.

Felix B.

AI Animator

AI moves at ‘blink and it’s gone’. Primer is how I don’t blink.

Alex T.

Creative Technologist

Best ROI on ten minutes of my day. I’ve shipped two features purely from their daily prompts.

Marta S.

Product Designer

From release noise to a working workflow in 15 minutes.

Viktor H

AI Artist

It’s the only digest that explains why a release matters and shows how to use it—same page, same morning.

Priya R.

Startup Founder

Get access

Stay professionally expensive

Make the right move sooner

Ship a product

WebEmailTelegram

Inference runtime wins: routing, spec‑decode, hiring

Chandra OCR adopts Eagle3 speculative decoding: 3× lower p99, +40% throughput

DatalabTO sped up its Chandra OCR serving by training an Eagle3 draft model and using tree‑style speculative decoding: p99 latency drops ~3×, p50 ~25%, throughput rises ~40%, with no accuracy loss versus the target model engineering thread. The write‑up details drafting multiple candidate branches from three internal layers to lift acceptance rates, then verifying in parallel; the post also shares production rollout notes and benchmarks engineering blog.

So what? If your decode path dominates costs, this is a strong pattern: train a small, domain‑tuned draft on your traffic (docs/forms/receipts), enable tree drafting, and gate acceptance aggressively to preserve correctness while reclaiming tail latency.

Grok 4.1 tops Arena at 1483 Elo – wins 64.8% rollout tests

Executive Summary

Feature: Grok 4.1 tops Arena and ships broadly

Table of Contents

Feature: Grok 4.1 tops Arena and ships broadly

Grok 4.1 Beta ships on web (thinking and non‑thinking modes)

Grok 4.1 tops LMArena with #1/#2 overall spots

Grok 4.1 cuts hallucinations vs Grok 4 fast

Grok 4.1 leads EQ‑Bench; creative writing scores jump

xAI posts Grok 4.1 model card; early jailbreaks test the edges

Grok 4.1 system prompt leak details policies and tool suite

Two‑week silent A/B shows 64.78% win rate for Grok 4.1

Frontier evals: ARC‑AGI SOTA and new knowledge benchmark

GPT‑5.1 (Thinking, High) scores 72.83% on ARC‑AGI‑1 and 17.64% on ARC‑AGI‑2

AA‑Omniscience launches to grade knowledge reliability; Claude 4.1 Opus leads Index

Author reply: Omniscience measures what models know and when to abstain, not “general IQ”

Critique: AA‑Omniscience may conflate refusal thresholds with narrow‑fact performance

LLM poker eval: Gemini 2.5 Pro wins Texas Hold’em; styles mapped across models

Stay first in your field.

Inference runtime wins: routing, spec‑decode, hiring

Chandra OCR adopts Eagle3 speculative decoding: 3× lower p99, +40% throughput

On this page

Grok 4.1 tops Arena at 1483 Elo – wins 64.8% rollout tests

Executive Summary

Feature: Grok 4.1 tops Arena and ships broadly

Table of Contents

🧠Feature: Grok 4.1 tops Arena and ships broadly

Grok 4.1 Beta ships on web (thinking and non‑thinking modes)

Grok 4.1 tops LMArena with #1/#2 overall spots

Grok 4.1 cuts hallucinations vs Grok 4 fast

Grok 4.1 leads EQ‑Bench; creative writing scores jump

xAI posts Grok 4.1 model card; early jailbreaks test the edges

Grok 4.1 system prompt leak details policies and tool suite

Two‑week silent A/B shows 64.78% win rate for Grok 4.1

📊Frontier evals: ARC‑AGI SOTA and new knowledge benchmark

GPT‑5.1 (Thinking, High) scores 72.83% on ARC‑AGI‑1 and 17.64% on ARC‑AGI‑2

AA‑Omniscience launches to grade knowledge reliability; Claude 4.1 Opus leads Index

Author reply: Omniscience measures what models know and when to abstain, not “general IQ”

Critique: AA‑Omniscience may conflate refusal thresholds with narrow‑fact performance

LLM poker eval: Gemini 2.5 Pro wins Texas Hold’em; styles mapped across models

Stay first in your field.

On this page

Feature: Grok 4.1 tops Arena and ships broadly

Frontier evals: ARC‑AGI SOTA and new knowledge benchmark