New third-party tests put MiniMax M2.7 at a 34% hallucination rate, roughly 65 tps, and 27.04% on Vibe Code Bench while users pushed it through physics-heavy web demos. It looks increasingly viable for agent workflows, but performance still swings by task and harness.

The cleanest change is hallucination handling. In the AA-Omniscience chart, M2.7 drops to 34%, down from 89% for M2.5, and that places it ahead of GPT-5.4 on that specific benchmark. Cedric Chee's voxel pagoda test lines up with the same direction of travel: he says M2.5 sometimes hallucinated the surrounding garden scene, while M2.7 does so less often.
The coding picture is more mixed but still stronger than the last release. The benchmark roundup claims 56.22% on SWE-Pro, matching GPT-5.3 Codex, plus gains on Terminal Bench 2, VIBE-Pro, Toolathlon, and GDPval-AA. A separate Vibe Code Bench leaderboard is much harsher, placing M2.7 at 27.04% ± 4.18 for end-to-end app generation, but it also tags the run at $2.82 per test and a latency of 1,377 seconds, which is a materially cheaper profile than the top GPT and Claude entries in that table.
The best independent review here is less bullish than the launch chatter. According to the Zhihu review thread, M2.7 improved “direct/indirect instruction execution” and context hallucination, but stability is still uneven: it can score full marks on long code derivation, then fall to “unusable” on medium-complexity tasks because of misread instructions or repeated fixes. The same review says there is “no substantial upgrade” in high-level engineering design, even if the model now more often writes SPEC.md and README.md to track project logic.
Cedric Chee's thread summary makes a similar distinction. He calls out better “real-world engineering” and “professional office delivery,” but frames M2.7 as an early step in model self-evolution rather than a broad jump in general intelligence. The Zhihu review thread is explicit that hard reasoning regressed slightly, with 50%-100% higher token use from excessive enumeration and more max-token failures on complex tasks.
MiniMax shipped M2.7 directly into production surfaces instead of keeping it as a paper release. The launch coverage says it is available immediately in MiniMax Agent and via API, and MiniMax's release post is the main product reference. Teknium's Hermes Agent support adds that it landed in Hermes Agent through the MiniMax provider on day one.
The practical usage pattern emerging around M2.7 is agent orchestration plus browser-native coding. A MaxClaw session slide describes a “100,000+ scalable” agent system with components including an LLM gateway, MCP server, and MicroVM sandboxing. On the application side, user demos show M2.7 generating self-contained HTML and physics-heavy web experiences: one receipt physics demo builds a draggable cloth-style receipt simulation in a single HTML file, and another water physics demo extends that to a temperature-controlled water simulation. Those demos are anecdotal, but they fit the narrower claim in the benchmark roundup that M2.7 is strongest in multi-round edits and agentic delivery rather than pure reasoning peaks.
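The draggable cloth-style receipt in those demos is the kind of effect usually built from a grid of point masses linked by distance constraints and stepped with Verlet integration. A minimal sketch of that technique (illustrative only; the demos' actual code is not published here, and all names and constants below are assumptions):

```python
# Mass-spring cloth sketch: a W x H grid of point masses, top row pinned
# like a hung receipt, stepped with Verlet integration plus constraint
# relaxation. Hypothetical values throughout.

W, H = 6, 8          # grid of point masses
REST = 1.0           # rest length between neighbors
GRAVITY = 0.02
ITERS = 5            # constraint-relaxation passes per step

# state: current and previous positions (Verlet integration needs both)
pos = {(x, y): [x * REST, y * REST] for x in range(W) for y in range(H)}
prev = {k: v[:] for k, v in pos.items()}
pinned = {(x, 0) for x in range(W)}   # top row fixed

def neighbors(x, y):
    if x + 1 < W:
        yield (x, y), (x + 1, y)
    if y + 1 < H:
        yield (x, y), (x, y + 1)

def step():
    # 1. Verlet integration: new = pos + (pos - prev) + acceleration
    for k, p in pos.items():
        if k in pinned:
            continue
        vx, vy = p[0] - prev[k][0], p[1] - prev[k][1]
        prev[k] = p[:]
        p[0] += vx
        p[1] += vy + GRAVITY
    # 2. relax each spring back toward its rest length
    for _ in range(ITERS):
        for x in range(W):
            for y in range(H):
                for a, b in neighbors(x, y):
                    ax, ay = pos[a]
                    bx, by = pos[b]
                    dx, dy = bx - ax, by - ay
                    d = (dx * dx + dy * dy) ** 0.5 or 1e-9
                    corr = (d - REST) / d / 2
                    if a not in pinned:
                        pos[a][0] += dx * corr
                        pos[a][1] += dy * corr
                    if b not in pinned:
                        pos[b][0] -= dx * corr
                        pos[b][1] -= dy * corr

for _ in range(30):
    step()
```

In a single-file HTML demo, the same loop would run per animation frame, with dragging implemented by temporarily pinning the grabbed point to the pointer position.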
OpenHands introduced EvoClaw, a benchmark that reconstructs milestone DAGs from repo history to test continuous software evolution instead of isolated tasks. The first results show agents can clear single tasks yet still collapse under regressions and technical debt over longer runs.
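The milestone-DAG idea can be sketched as follows: treat tagged commits as milestone nodes, collapse the intervening history into ancestry edges, and hand the agent tasks in topological order. This is an assumption-laden illustration of the technique, not EvoClaw's actual format; the repo, milestone names, and structure are invented.

```python
# Collapse a toy commit history into a DAG of milestone-to-milestone
# dependencies, then derive the order in which an agent must clear them.
from graphlib import TopologicalSorter

# toy history: commit -> parent commits (merge commits have two parents)
history = {
    "init":      [],
    "parser":    ["init"],
    "lexer":     ["init"],
    "typecheck": ["parser", "lexer"],   # merge: depends on both branches
    "codegen":   ["typecheck"],
}
milestones = {"init", "parser", "lexer", "typecheck", "codegen"}

def milestone_dag(history, milestones):
    """Keep only milestone commits; edges point to nearest milestone ancestors."""
    dag = {m: set() for m in milestones}
    for commit, parents in history.items():
        if commit not in milestones:
            continue
        stack = list(parents)
        while stack:                    # walk back past non-milestone commits
            p = stack.pop()
            if p in milestones:
                dag[commit].add(p)
            else:
                stack.extend(history[p])
    return dag

# predecessors-first ordering = the task sequence for a continuous-evolution run
order = list(TopologicalSorter(milestone_dag(history, milestones)).static_order())
```

The interesting evaluation then happens between nodes: after each milestone, earlier milestones are re-tested, which is where the reported regressions and technical-debt collapses show up.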
Release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
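The core trick behind this kind of index can be sketched in a few lines: extract trigrams from every file, build an inverted index, and answer a query by intersecting the posting lists of the trigrams the query must contain, scanning only the surviving files. This is a minimal illustration of the n-gram prefilter idea, not Cursor's implementation; plain sets stand in where a production index would use Bloom filters and regex literal extraction.

```python
# Trigram inverted index with intersection-based prefiltering.
from collections import defaultdict

files = {
    "a.py": "def instant_grep(query): ...",
    "b.py": "class BloomFilter: pass",
    "c.py": "print('hello world')",
}

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

# inverted index: trigram -> set of files containing it
index = defaultdict(set)
for name, text in files.items():
    for g in trigrams(text):
        index[g].add(name)

def search(literal):
    """Trigram prefilter, then confirm with a real scan of the candidates."""
    grams = trigrams(literal)
    if not grams:                       # query too short to prefilter
        candidates = set(files)
    else:
        # a file lacking any required trigram cannot match, so intersect
        candidates = set.intersection(*(index[g] for g in grams))
    return sorted(n for n in candidates if literal in files[n])
```

For a regex rather than a literal, the same prefilter works on any literal substrings the pattern is guaranteed to contain, which is why the speedup is largest when the expensive full scan only has to touch a handful of candidate files.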
Breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Breaking: Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
One more MiniMax M2.7 result, per the Vibe Code Bench team: it has broken 25% on Vibe Code Bench, their in-house benchmark testing a model's ability to write an application entirely from scratch, and it is the only Chinese model to do so so far.