xAI released Grok 4.20 Beta in the API with reasoning, non-reasoning, and multi-agent variants, a 2M-token window, and lower pricing than Grok 4. Test it for long-context and speed-sensitive workloads, but compare coding performance against top rivals on your own evals.

xAI's launch adds three API variants: grok-4.20-multi-agent-beta-0309, grok-4.20-beta-0309-reasoning, and grok-4.20-beta-0309-non-reasoning. The launch table in the API pricing post shows all three at 2M context with the same $2/$6 per-million-token price, and the attached screenshot also lists 4M TPM and 607 RPM rate limits.
OpenRouter exposed the release almost immediately, which matters for teams already routing across providers. The OpenRouter listing shows both the base and multi-agent variants, and a follow-up post says the base model includes "agentic tool calling capabilities," a built-in reasoning toggle, and $5 per 1K web searches. Artificial Analysis adds the clearest before-and-after comparison: according to its model breakdown, Grok 4.20 moves from Grok 4's 256K context to 2M and cuts API pricing from $3/$15 to $2/$6.
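The pricing delta is easy to make concrete. A back-of-envelope sketch in Python, using the cited per-million-token prices (the workload mix below is illustrative, not from the source):

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost in USD for one request; prices are per million tokens."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Example workload: 1M input tokens (e.g. a large repo dump), 100K output tokens.
grok4_cost = request_cost(1_000_000, 100_000, 3.00, 15.00)    # Grok 4: $3/$15
grok420_cost = request_cost(1_000_000, 100_000, 2.00, 6.00)   # Grok 4.20: $2/$6
print(f"Grok 4: ${grok4_cost:.2f}, Grok 4.20: ${grok420_cost:.2f}")
# → Grok 4: $4.50, Grok 4.20: $2.60
```

On this input-heavy mix the new pricing cuts cost by roughly 42%, and the savings grow as the input share of the 2M window grows.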
The strongest early signal is reliability under uncertainty. Artificial Analysis says in its highlights thread that Grok 4.20 hallucinates only 22% of the time when it should refuse, calling that "the lowest hallucination rate of any model we have tested." Its fuller writeup in the AA breakdown reports a 78% AA-Omniscience non-hallucination score, the best result in that benchmark so far.
The same source says Grok 4.20 also leads IFBench at 82.9%, a +29.2-point jump over Grok 4, and runs at about 265-267 output tokens per second on xAI's API, per the AA highlights thread and full breakdown. That combination of long context, lower price, and high throughput is the practical engineering story here. OpenRouter screenshots cited in the throughput post even show the multi-agent variant reaching 1,122 tokens per second through xAI on one routed path, though that figure is provider-path-specific rather than a universal model speed guarantee.
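To translate those throughput figures into wall-clock latency, a quick illustrative calculation (the 2,000-token answer size is a made-up example):

```python
def seconds_for_output(tokens, tokens_per_second):
    """Approximate generation time for a response, ignoring time-to-first-token."""
    return tokens / tokens_per_second

# A 2,000-token answer at the quoted speeds:
base = seconds_for_output(2000, 265)     # ~7.5 s at ~265 tok/s on xAI's API
routed = seconds_for_output(2000, 1122)  # ~1.8 s on the fastest routed path
print(f"base: {base:.1f}s, routed: {routed:.1f}s")
```

Note this ignores time-to-first-token and queueing, so real end-to-end latency will be somewhat higher than either figure.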
The coding story is less clear-cut. Vals says in its benchmark post that the Grok 4.20 Beta reasoning snapshot lands at #13 overall, but it also "shines on the SWE Bench split" at #4 with 72.55% and improves Terminal Bench 2 by 10 percentage points versus earlier Grok models. That suggests meaningful progress, especially for long-context and repo-style tasks, without breaking into the very top tier overall.
Artificial Analysis is more cautious on agentic and coding-adjacent work. Its full breakdown says Grok 4.20 scores 42 on the Artificial Analysis Coding Index and 1,062 on GDPval-AA, which it describes as "well behind frontier peers" on general agent performance. BridgeBench results from a benchmark post are more favorable for the new multi-agent mode, with a 4-agent setup taking #1 at 96.1 overall and 100% completion, while the single-model Grok 4.20 Beta sits lower on the same leaderboard. For engineers, that makes the release look strongest where long context, speed, instruction following, and low hallucination matter most, with coding quality still needing task-specific evals against GPT-5.4, Claude, and Gemini.
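The task-specific evals recommended above don't need heavy tooling to start. A minimal harness might look like this sketch; the model callables here are stubs (in practice each would wrap a vendor's API client), and the task format is an assumption, not any benchmark's real schema:

```python
from typing import Callable, Dict, List

def run_eval(models: Dict[str, Callable[[str], str]],
             tasks: List[dict]) -> Dict[str, float]:
    """Score each model as the fraction of tasks whose check() passes.

    `models` maps a label to any callable taking a prompt and returning text.
    Each task is {"prompt": str, "check": callable(output) -> bool}.
    """
    scores = {}
    for name, ask in models.items():
        passed = sum(1 for t in tasks if t["check"](ask(t["prompt"])))
        scores[name] = passed / len(tasks)
    return scores

# Stubbed example; real use would wrap grok-4.20-*, GPT, Claude, etc.
tasks = [
    {"prompt": "Return the word OK",  "check": lambda out: "OK" in out},
    {"prompt": "Return the number 7", "check": lambda out: "7" in out},
]
models = {
    "model-a": lambda p: "OK",    # passes 1 of 2 tasks
    "model-b": lambda p: "OK 7",  # passes 2 of 2 tasks
}
print(run_eval(models, tasks))  # → {'model-a': 0.5, 'model-b': 1.0}
```

The point is that the checks encode your own repo-style tasks; leaderboard ranks like the ones above are a starting filter, not a substitute.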
Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
Release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
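Cursor hasn't published Instant Grep's internals beyond that description, but the core trigram-index idea can be sketched as follows. Everything here is hypothetical: the class, the file contents, and a plain Python set standing in for the Bloom filters the announcement mentions:

```python
from collections import defaultdict

def trigrams(text: str) -> set:
    """All 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Toy inverted index: each trigram maps to the set of files containing it.
    A Bloom filter would make these membership sets cheaper but lossy;
    a plain set plays that role in this sketch."""
    def __init__(self):
        self.postings = defaultdict(set)
        self.files = {}

    def add(self, path: str, text: str):
        self.files[path] = text
        for g in trigrams(text):
            self.postings[g].add(path)

    def search(self, literal: str) -> list:
        """Find files containing `literal`: intersect trigram posting lists
        to get candidates, then verify each with a real substring scan."""
        grams = trigrams(literal)
        if not grams:  # query shorter than 3 chars: no pruning possible
            candidates = set(self.files)
        else:
            candidates = set.intersection(*(self.postings[g] for g in grams))
        return sorted(p for p in candidates if literal in self.files[p])

idx = TrigramIndex()
idx.add("a.py", "def instant_grep(): pass")
idx.add("b.py", "print('hello world')")
print(idx.search("grep"))  # → ['a.py']
```

The speedup comes from the intersection step: only files whose trigram sets cover the query are ever scanned, so the expensive verification (ripgrep's job) runs on a handful of candidates instead of the whole repo. A real regex query first extracts required literals from the pattern before hitting the index.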
Breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.