Breaking · March 13, 2026

Vals benchmarks Grok 4.20 Beta: ProofBench rises to 14% while legal tasks regress

Vals published a benchmark pass for Grok 4.20 Beta showing gains on coding, math, and multimodal evals, including Terminal Bench 2, alongside weaker legal-task results. Check task-level results before adopting the model, especially if legal workflows matter more to you than headline benchmark gains.

Multimodal Evals Benchmarks

3 min read

TL;DR

Vals says Grok 4.20 Beta (Reasoning) is an overall step up from Grok 4.1 Fast (Reasoning), with gains across coding, math, and multimodal evals in its latest benchmark pass.
The biggest task-level jump Vals called out was on ProofBench, where Grok 4.20 Beta reached 14% versus 4% for Grok 4.1 Fast. Coding-oriented suites, including LiveCodeBench, SWE-Bench, Terminal Bench 2, and Vibe Code Bench, also improved.
Multimodal results also moved up: Vals reports 83.47% and a #9 rank on MMMU, versus #31 for Grok 4.1 Fast, alongside an improvement on SAGE for grading handwritten work.
The same run showed weaker legal performance and came with beta-release caveats: Vals ranked the model #30 on CaseLaw and #62 on LegalBench, and said the snapshot may change in later iterations.