Grok 4.1 tops Arena at 1483 Elo – wins 64.8% rollout tests
Executive Summary
xAI’s Grok 4.1 is rolling out as a Beta across grok.com, X, iOS, and Android. It matters because the thinking variant climbed to 1483 Elo atop LMArena (non‑thinking hits 1465) and won 64.78% in a quiet, blind pairwise trial against the prior production model.
Early signals skew practical: internal slides show hallucinations on info‑seeking prompts dropping from 12.09% to 4.22%, with FActScore falling to 2.97% from 9.89% (lower is better). EQ‑Bench also ticks up, with normalized Elo around 1586 for thinking mode — worth testing if tone and persona consistency matter. Yes, EQ for bots is now a KPI.
A new model card cites ~95–98% refusal on clear abuse and fresh input filters, but propensity tables show higher sycophancy (0.19–0.23) and near‑flat dishonesty (~0.46–0.49); a “Library of Babel” jailbreak is already circulating, and a leaked system prompt outlines code execution plus web and X search tools. If you route via Grok, run pairwise tests on your own data, keep dangerous tool calls gated, and note DeepSearch sessions may still pin to the older model.
Feature Spotlight
Feature: Grok 4.1 tops Arena and ships broadly
xAI’s Grok 4.1 lands #1 on LMArena (1483 Elo) with a public web/app rollout and measured drops in hallucination—framing a new competitive bar for conversational quality and style control.
Heavy, multi‑account coverage: Grok 4.1 (thinking & non‑thinking) climbs to #1/#2 on LMArena, claims EQ gains and lower hallucinations, and appears as a Beta toggle on grok.com/X/iOS/Android. Mostly eval stats + rollout posts today.
Jump to Feature: Grok 4.1 tops Arena and ships broadly topicsTable of Contents
Stay in the loop
Get the Daily AI Primer delivered straight to your inbox. One email per day, unsubscribe anytime.
Feature: Grok 4.1 tops Arena and ships broadly
Heavy, multi‑account coverage: Grok 4.1 (thinking & non‑thinking) climbs to #1/#2 on LMArena, claims EQ gains and lower hallucinations, and appears as a Beta toggle on grok.com/X/iOS/Android. Mostly eval stats + rollout posts today.
Grok 4.1 Beta ships on web (thinking and non‑thinking modes)
Grok 4.1 is now selectable on grok.com as a standalone Beta in the model picker, with both “thinking” and “non‑thinking” options live for many users Beta rollout, Web picker. xAI’s announcement frames it as more concise, higher intelligence per token and broadly available across grok.com, X, iOS and Android, with DeepSearch still toggling the previous model for some sessions xAI post.
Why it matters: Teams can A/B the new behavior in production chats today. If you rely on Grok’s X search, note that DeepSearch may still pin to the older model for now User note.
Grok 4.1 tops LMArena with #1/#2 overall spots
xAI’s Grok 4.1 vaulted to the top of the community‑run LMArena: the thinking mode landed 1483 Elo at #1 and the non‑thinking variant posted 1465 at #2, ahead of other models’ full reasoning configs Leaderboard update, xAI note. The Arena team also notes a 40+ point jump versus Grok 4 fast from two months ago.
- Expert board: Grok 4.1 (thinking) #1 (1510); non‑thinking #19 (1437) Leaderboard update.
- Occupational board: Grok 4.1 (thinking) shows broad strength across software, science, legal and business domains Occupational boards.
Why it matters: Arena win‑rates translate to fewer edge‑case stumbles in day‑to‑day chats and coding reviews. If you route by model quality, this is a new default to test against Gemini 2.5 Pro and Claude 4.5 Sonnet.
Grok 4.1 cuts hallucinations vs Grok 4 fast
Internal slides show Grok 4.1’s hallucination rate on info‑seeking prompts down to 4.22% from 12.09% on Grok 4 fast; its FActScore falls to 2.97% from 9.89% (lower is better on both charts) Hallucination charts.
Why it matters: Lower ungrounded claims reduce clean‑up passes in retrieval workflows and lessen the need for strict tool‑forcing—especially useful when you don’t want to pay latency for web search on easy facts.
Grok 4.1 leads EQ‑Bench; creative writing scores jump
Shared EQ‑Bench charts place Grok 4.1 (thinking) and Grok 4.1 at the top with normalized Elo 1586 and 1585 respectively, ahead of Kimi K2 and Gemini 2.5 Pro EQ‑Bench chart, EQ and writing. Creative Writing v3 rankings likewise show Grok 4.1 variants pushing into the top tier, only trailing an early GPT‑5.1 Polaris checkpoint EQ‑Bench chart.
Why it matters: If your app needs empathetic, persona‑steady replies (support, sales coaching, tone rewrites), Grok 4.1’s EQ cluster is worth piloting against Claude.
xAI posts Grok 4.1 model card; early jailbreaks test the edges
xAI published a Grok 4.1 model card outlining abuse refusal results and propensity metrics; posts cite ~95–98% refusal on clearly violative requests and new input filters for restricted biology/chemistry with low false negatives Model card PDF, Model card sighting. Propensity tables shared by reviewers show sycophancy 0.19–0.23 and deception 0.46–0.49, slightly above Grok 4’s 0.43 deception baseline Propensity table. Meanwhile, a community role‑play “Library of Babel” jailbreak claims to elicit prohibited content from Grok 4.1; prompt and examples are public for repro attempts Jailbreak thread, Prompt GitHub.
Why it matters: Safety posture looks tighter, but red‑teamers are already probing. If you deploy Grok in tools‑enabled contexts, keep test suites current and gate dangerous tool calls behind human review.
Grok 4.1 system prompt leak details policies and tool suite
A widely shared file purports to show Grok 4.1’s system prompt, including top‑level safety bullets (decline criminal help, keep declines short), product redirects, and a tool list spanning a stateful code interpreter, web search, X keyword/semantic searches, thread fetch, and image/video viewers Prompt leak, Prompt GitHub. Treat as unverified, but the structure matches the product’s observed capabilities.
Why it matters: For integrators, this hints at how Grok arbitrates tool calls and why it sometimes prefers X search over web search. If you wrap Grok, align your system prompts to avoid conflicting directives.
Two‑week silent A/B shows 64.78% win rate for Grok 4.1
During a quiet two‑week prelaunch, xAI reportedly ran blind pairwise evals on live traffic and Grok 4.1 won 64.78% of comparisons against the incumbent model Rollout notes.
Why it matters: That’s a concrete routing signal. If you manage a meta‑router, weight Grok 4.1 more aggressively on general chat, writing, and ideation flows while you validate corner cases.
Frontier evals: ARC‑AGI SOTA and new knowledge benchmark
Eval‑heavy day: ARC‑AGI semi‑private results highlight GPT‑5.1 (Thinking, High), and a new AA‑Omniscience benchmark launches to grade knowledge reliability and abstention, plus an LLM poker tourney meta‑eval. Excludes Grok 4.1 rollout (feature).
GPT‑5.1 (Thinking, High) scores 72.83% on ARC‑AGI‑1 and 17.64% on ARC‑AGI‑2
OpenAI’s GPT‑5.1 (Thinking, High) posted 72.83% on ARC‑AGI‑1 at ~$0.67/task and 17.64% on ARC‑AGI‑2 at ~$1.17/task, on the ARC Prize semi‑private evals Verified results, with full plots on the official board ARC Prize leaderboard. This follows Vals index where GPT‑5.1 moved up the rankings; today’s numbers show strong price‑performance at verified settings.
AA‑Omniscience launches to grade knowledge reliability; Claude 4.1 Opus leads Index
Artificial Analysis released AA‑Omniscience, a 6,000‑question, 42‑topic benchmark that rewards correct answers (+1), penalizes wrong answers (‑1), and gives 0 for abstentions; Claude 4.1 Opus tops the Omniscience Index, while Grok 4, GPT‑5, and Gemini 2.5 Pro lead on raw accuracy Benchmark thread, with the paper and public subset available for replication ArXiv paper. • Key takeaways: Hallucination is punished, domain leaders differ (e.g., Business vs Law), and only a few frontier models score slightly above 0 on the Index.
Author reply: Omniscience measures what models know and when to abstain, not “general IQ”
An AA‑Omniscience author clarifies the goal is knowledge reliability—grading whether a model knows specific facts and declines when it doesn’t—rather than being an intelligence test; “hallucination” is defined as answering incorrectly when it should abstain Author reply. The note also stresses domain‑level decisions (e.g., Kotlin knowledge for coding) versus picking a single overall “best” model.
Critique: AA‑Omniscience may conflate refusal thresholds with narrow‑fact performance
Ethan Mollick argues the benchmark leans on refusal thresholds over true hallucination rates and uses extremely narrow facts, suggesting we need richer error taxonomies and analysis beyond a single score Critique thread. He cites examples of obscure finance and literature queries and asks whether “wrong” answers that express uncertainty should be treated differently.
LLM poker eval: Gemini 2.5 Pro wins Texas Hold’em; styles mapped across models
Lmgame Bench ran a ~60‑hand Texas Hold’em tourney where Gemini‑2.5‑Pro topped the table, DeepSeek‑V3.1 placed second, and Grok‑4‑0709 third; analysis tagged play styles from loose‑passive to loose‑aggressive, showing strategy variance under the same neutral rules Tournament recap. The team notes more rounds will improve the TrueSkill signal; replays and boards are linked in the post.

Stay first in your field.
No more doomscrolling X. A crisp morning report for entrepreneurs, AI creators, and engineers. Clear updates, time-sensitive offers, and working pipelines that keep you on the cutting edge. We read the firehose and hand-pick what matters so you can act today.
I don’t have time to scroll X all day. Primer does it, filters it, done.
Renee J.
Startup Founder
The fastest way to stay professionally expensive.
Felix B.
AI Animator
AI moves at ‘blink and it’s gone’. Primer is how I don’t blink.
Alex T.
Creative Technologist
Best ROI on ten minutes of my day. I’ve shipped two features purely from their daily prompts.
Marta S.
Product Designer
From release noise to a working workflow in 15 minutes.
Viktor H
AI Artist
It’s the only digest that explains why a release matters and shows how to use it—same page, same morning.
Priya R.
Startup Founder
Stay professionally expensive
Make the right move sooner
Ship a product