Z.ai GLM-4.7-Flash hits 59.2% SWE-bench – 198K context local agents

Stay in the loop

Free daily newsletter & Telegram daily report

Executive Summary

Z.ai released open-weights GLM-4.7-Flash, pitched as a local coding + agent backend with an API; the model is described as a 30B-A3B MoE and is being circulated on a 59.2% SWE-bench Verified claim; Z.ai also advertises a free API tier capped at 1 concurrency, plus a faster paid FlashX variant. The benchmark table is getting retold as “30B-class that’s actually usable,” but comparisons remain provisional without a single canonical eval artifact.

• vLLM/SGLang day-0: both shipped GLM-specific tool-call parsing (glm47) and reasoning parsing (glm45); both expose speculative decoding knobs (SGLang highlights EAGLE settings), signaling expected use in structured agent loops.
• Ollama runtime: ollama run glm-4.7-flash lands in v0.14.3+ (pre-release); the model page lists a 198K context window and multiple quantizations.

The adjacent coding-agent discourse keeps converging on latency + ergonomics: OpenAI tees up Codex “5.3” feedback and faster runs, while eval chatter (e.g., time-budgeted Terminal Bench 2) suggests more “thinking” can lose to timeouts when wall-clock is part of capability.

GLM-4.7-Flash makes local coding models feel “frontier-adjacent”

GLM-4.7-Flash ships open weights with strong SWE-bench results plus day‑0 local/serving support, making “run a serious coding agent locally” materially more practical this week.

High-volume story: Z.ai’s open-weights GLM-4.7-Flash (30B-A3B MoE) drops with unusually strong coding/agent benchmarks and immediate ecosystem support for running locally (Ollama) and serving (vLLM/SGLang).

Jump to GLM-4.7-Flash makes local coding models feel “frontier-adjacent” topics

⚡️ GLM-4.7-Flash makes local coding models feel “frontier-adjacent”

GLM-4.7-Flash launches as an open-weights coding/agent model with free API tier

GLM-4.7-Flash (Z.ai): Z.ai released GLM-4.7-Flash, positioning it as a local coding + agentic assistant with downloadable weights and an API offering, including a free tier limited to 1 concurrency as described in the launch thread and detailed on the [pricing page](link:6:0|pricing page); the model is described as a 30B-A3B MoE in the MoE detail and has a higher-speed paid variant (FlashX) in the same launch thread.

The deployment story is intentionally “run it yourself or call it”: the [model card](link:6:1|model card) emphasizes local serving paths (vLLM/SGLang) alongside API access, which is the core lever for teams comparing hosted agent loops vs on-prem costs.

GLM-4.7-Flash benchmark claims put a 30B-class open model in the coding conversation

GLM-4.7-Flash (Z.ai): The release is getting circulated primarily on coding/agent evals, especially 59.2% on SWE-bench Verified plus strong showings on several reasoning and browsing benchmarks, as shown in the benchmarks chart and echoed in the release recap.

• Coding + agent evals: The same chart highlights τ²-Bench 79.5 and BrowseComp 42.8, alongside GPQA 75.2 and HLE 14.4, as shown in the benchmarks chart.
• Positioning vs nearby open baselines: Multiple accounts summarize it as “strongest 30B class” and emphasize local deployability, as in the community summary.

Treat the exact comparisons as provisional until a single canonical eval artifact is shared, but the consistent retelling is that GLM-4.7-Flash is being judged as a practical SWE-bench tier option rather than a “toy local model.”

Ollama adds GLM-4.7-Flash in v0.14.3+ pre-release

Ollama (Ollama): GLM-4.7-Flash can now be launched via ollama run glm-4.7-flash in Ollama v0.14.3+ (pre-release), as stated in the Ollama announcement with download pointers in the pre-release links.

The key operational details are in the Ollama [model page](link:249:0|model page), which lists a 198K context window and multiple quantization variants (a common “try it locally first” entry point for teams evaluating whether a 30B-class MoE is usable on their hardware).

SGLang adds day-0 GLM-4.7-Flash support with EAGLE speculative config

SGLang (LMSYS): SGLang announced day-0 support for GLM-4.7-Flash and shared a launch_server command that includes glm47 tool-call parsing and glm45 reasoning parsing, plus EAGLE speculative decoding settings, as shown in the SGLang support post.

The concrete server flags in the SGLang support post (tp-size, speculative steps/topk/draft tokens, memory fraction) make it straightforward to reproduce a production-ish serving setup rather than treating this as a “download weights and hope” release.

vLLM ships day-0 GLM-4.7-Flash support with tool-call parsing flags

vLLM (vLLM): vLLM merged “day-0 support” for GLM-4.7-Flash, including a serving recipe that wires up a dedicated tool-call parser (glm47) and reasoning parser (glm45), as shown in the vLLM support post.

The example vllm serve line in the vLLM support post also includes speculative decoding knobs (--speculative-config.*) and --enable-auto-tool-choice, which is a concrete signal that GLM-4.7-Flash is being treated as a tool-using agent backend, not only a chat model.

Early tool-use sentiment: GLM-4.7-Flash gets praise for tool calling reliability

GLM-4.7-Flash tool use: One early practitioner takeaway is that “glm47 is pretty damn good at tool calling,” according to the tool-calling comment, and that lines up with infra projects explicitly shipping GLM-specific tool parsing flags in the vLLM serve example.

The combination of (a) human sentiment in the tool-calling comment and (b) first-class parsers in vLLM/SGLang suggests the ecosystem expects GLM-4.7-Flash to be used in agent loops that depend on structured tool calls, not only “chat with a local model.”

GLM-4.7-Flash shows up in OpenCode via Hugging Face Inference Providers

OpenCode + Hugging Face Inference Providers: A community RT reports GLM-4.7-Flash is available inside OpenCode through Hugging Face Inference Providers, as shown in the OpenCode terminal screenshot.

This matters for evaluation workflows because it creates a third path besides “self-host via vLLM/SGLang” or “use Z.ai’s API”: a hosted-inference surface that can be swapped into an existing agent TUI, per the OpenCode integration.

GLM-4.7-Flash gets framed as a cheap local workhorse, not a demo model

Local deployment economics: Several posts frame GLM-4.7-Flash as a “try it at home” option that’s cheap enough to run locally, while still being strong enough to matter for coding/agents—see the home deployment take and a separate “best cost to quality” claim in the workhorse claim.

The home deployment take also includes a screenshot showing API concurrency backpressure (“High concurrency usage… reduce concurrency”), which is a small but concrete indicator of early demand and the fact that people are testing it in parallel agent setups, not only single-chat sessions.

🧠 Codex product cadence: speed, UX asks, and planning ergonomics

Continues the Codex-speed narrative, but today’s tweets are mostly practitioner UX requests and product-cadence signals around Codex rather than new benchmark drops. Excludes GLM-4.7-Flash (covered as the feature).

Codex CLI shows a downgrade prompt when nearing limits

Codex CLI (OpenAI): A screenshot shows a “usage limit” interruption that offers to switch to gpt-5.1-codex-mini for lower credit usage and includes a concrete retry time (“Jan 22nd, 2026 6:41 PM”), as shown in the [limit prompt](t:688|Limit prompt).

This indicates Codex CLI is actively nudging users toward a cheaper/faster fallback model under backpressure, instead of failing silently.

Codex power-user wishlist focuses on speed, memory, and long-task autonomy

Codex (OpenAI): A detailed wishlist lays out what heavy users want next: faster responses “keep the quality,” version-aware web search, fewer “continue/what’s next” stalls on long tasks, persistent project memory, better default test strategy, tool/turn status events, an automatic review pass, and screen/video replay debugging—see the [Codex wishlist](t:179|Codex wishlist). It’s a tight snapshot of where Codex friction still shows up in day-to-day shipping.

A smaller but notable add-on is that Codex can “reject creative ideas,” requiring persuasion, per the [behavior note](t:536|Behavior note).

Altman says OpenAI will deliver faster and smarter soon

Codex/ChatGPT speed (OpenAI): Sam Altman claims OpenAI will deliver “a higher level of intelligence while also being much faster soon,” as captured in the [Altman speed quote](t:70|Altman speed quote).

Some observers are already attributing the “faster” part to infrastructure partnerships (e.g., the [Cerebras speculation](t:75|Cerebras speculation)), but there’s no confirmed mechanism in the tweets.

Builders argue speed matters more than extra intelligence for sync coding

Speed vs intelligence (coding agents): One argument claims synchronous coding hits diminishing returns at a level where “>95%” of quick queries won’t improve much with smarter models, and that more intelligence mostly helps async, multi-hour tasks—see the [speed threshold post](t:84|Speed threshold post). Short sentence.

This framing lines up with Codex-centric UX asks that prioritize iteration latency (faster loops, fewer stalls) over pushing only for higher reasoning depth.

Conductor brings back Codex thinking levels control

Conductor (Codex UI): Conductor says it “brought back codex thinking levels,” showing a three-step selector—Basic, Advanced, Expert—in the [thinking levels clip](t:352|Thinking levels clip).

This is a concrete ergonomics knob for teams balancing latency vs depth during interactive coding sessions.

OpenAI starts public feedback intake for Codex 5.3

Codex (OpenAI): Sam Altman explicitly asks what’s working and what should improve in “5.3,” using a direct reply prompt to gather user feedback at scale, as shown in the [5.3 question](t:9|5.3 question). It’s a concrete signal that Codex 5.3 is being scoped with practitioner input rather than only internal benchmarks.

The request is broad (no specific features named), so it’s more “roadmap intake” than an announced change.

Codex can over-index on spec writing when prompted to “build the spec”

Codex (workflow pitfall): A practitioner reports that asking Codex to “Build the spec” resulted in ~20 minutes spent expanding spec markdown rather than writing code, as described in the [spec drift report](t:269|Spec drift report). Short sentence. It’s a reminder that “spec-first” prompts can accidentally become the goal, not the means.

The failure mode is less about correctness and more about objective misalignment: the agent optimizes for producing a better document because that’s the clearest completion target.

OpenAI leadership frames Codex progress as compounding

Codex (OpenAI): Sam Altman posts a cadence signal—“hard to imagine what it’s going to look like at the end of this year if things keep compounding,” as written in the [team execution note](t:7|Team execution note). It reads like an internal velocity update made public.

This is qualitative (no release details), but it aligns with broader “Codex is moving fast” framing seen elsewhere today.

DesignArena leak suggests how OpenAI models surface before release

DesignArena model IDs (OpenAI): A screenshot of DesignArena-style config shows GPT-5.2 (High) as “honeycomb” and GPT-5.2 (XHigh) as “candycane,” both pointing at gpt-5.2-2025-12-11, as shown in the [DesignArena config screenshot](t:100|DesignArena config screenshot).

The same post suggests future 5.x variants may appear there first, but today’s evidence is limited to 5.2’s already-known mapping.

Codex 5.2 hype starts showing user fatigue

Codex (adoption sentiment): A small but real counter-signal appears as one user says “everyone is telling me i need to try 5.2 codex. i’m tired,” in the [adoption fatigue post](t:188|Adoption fatigue post). Short sentence.

It’s not a product issue by itself, but it’s a reminder that “try the new model” churn can become a tax for power users when the evaluation burden shifts onto them.

Today’s Claude discourse is mostly operational pain (rate limits, logins, freezes) and workflow tweaks, not new Claude product primitives. Excludes the Assistant Axis research (covered separately).

Claude Code account switching request targets multi-session usage limits

Claude Code (Anthropic): A power user asked for cross-session account propagation—when one Claude Code session hits the 5‑hour or weekly limit and you run /login in another session, every other in‑flight session on that machine should automatically switch over the next time it hits a limit, as described in the Account switching request. It’s framed around avoiding context loss and manual re-auth when running “10 agents going in the same project,” building on Account switching session-limit coping patterns.

• Session UX nits that compound at scale: The same request asks for shorter auth URLs (terminal clickability) and printing the session ID on exit for unambiguous resume, as spelled out in the Account switching request.

Z.ai GLM-4.7-Flash hits 59.2% SWE-bench – 198K context local agents

Executive Summary

Top links today

GLM-4.7-Flash makes local coding models feel “frontier-adjacent”

Table of Contents

⚡️ GLM-4.7-Flash makes local coding models feel “frontier-adjacent”

GLM-4.7-Flash launches as an open-weights coding/agent model with free API tier

GLM-4.7-Flash benchmark claims put a 30B-class open model in the coding conversation

Ollama adds GLM-4.7-Flash in v0.14.3+ pre-release

SGLang adds day-0 GLM-4.7-Flash support with EAGLE speculative config

vLLM ships day-0 GLM-4.7-Flash support with tool-call parsing flags

Early tool-use sentiment: GLM-4.7-Flash gets praise for tool calling reliability

GLM-4.7-Flash shows up in OpenCode via Hugging Face Inference Providers

GLM-4.7-Flash gets framed as a cheap local workhorse, not a demo model

🧠 Codex product cadence: speed, UX asks, and planning ergonomics

Codex CLI shows a downgrade prompt when nearing limits

Codex power-user wishlist focuses on speed, memory, and long-task autonomy

Altman says OpenAI will deliver faster and smarter soon

Builders argue speed matters more than extra intelligence for sync coding

Conductor brings back Codex thinking levels control

OpenAI starts public feedback intake for Codex 5.3

Codex can over-index on spec writing when prompted to “build the spec”

OpenAI leadership frames Codex progress as compounding

DesignArena leak suggests how OpenAI models surface before release

Codex 5.2 hype starts showing user fatigue

🧩 Claude Code & Cowork friction points: limits, login, and “hype vs reality”

Claude Code account switching request targets multi-session usage limits

CodexBar 0.18 beta adds a pace indicator for Claude/Codex usage

Claude Code freezing reports show up in Warp terminal workflows

Some Claude Code users are overriding search to load full files into context

A code review stance shift: review specs and tests, not generated diffs

Cursor friction on WSL drives some users back to VSCode despite agent coding

Claude Code hype vs reality becomes a visible sentiment thread

Some builders report moving from coding to “requirements → inputs” work

Authentication friction is becoming a bottleneck for agent-heavy setups

Power-user fatigue shows up around “try Codex 5.2” pressure

🛠️ IDE/harness wars: OpenCode, Cursor regressions, and builder platforms

OpenCode fixes long-session UI memory bloat caused by message retention bug

Cursor on WSL complaints: 90-second load times and broken multi-line edits

SuperDesign adds “tree search” to fork AI design conversations and explore flows

Lovable adds public profiles for publishing projects and gaining followers

OpenCode TUI praised as “high taste,” but curl-to-bash install feels slow

Replit Design Mode ranks #1 on DesignArena’s app-builder leaderboard

Factory announces NYC office expansion and hiring for sales/CS/solutions engineering

🧑‍✈️ Running agent fleets: orchestration dashboards, Ralph loops, and remote execution

ntm swarm management pattern adds a controller hierarchy for dozens of agents

Vibe Kanban open-sources a dashboard to orchestrate multiple coding agents

BLACKBOX launches an Agents API to run multiple SWE agents on remote VMs

Plan sync, implement async: align fast then hand off to cloud agents with validation envs

Wreckit 1.0.0 ships a TUI for running Ralph Wiggum loops over a roadmap

Athas IDE prototypes a tab view to open new agents and browser tabs inside the editor

Conductor shows DAU growth while hiring for orchestration UI roles

Sandbox-maximalist thesis: classic Unix VMs as the substrate for agent loops

Slack as a control plane for long-running role-based Claude Code agents

Compute literacy framing: running swarms shifts the job to oversight and context injection hooks

🧰 Developer-side tooling around agents: menubar, CLIs, and doc ingestion hacks

CodexBar 0.18.0-beta.1 adds pace view and expands provider support

parseout CLI batch converts PDFs to markdown for Claude Cowork ingestion

RepoPrompt teases /rp-review for context-builder-based PR review

Transition-matrix visualizations for summarizing agent traces and failures

TanStack AI teaser shows model spend “what-if” cost comparison UI

Trimmy flattens multiline shell snippets and reformats markdown for clean pastes

Vibe Browse exports browser sessions into OpenAI/Anthropic fine-tuning format

lane: small open-source CLI utility published by benhylak

🧭 How teams are actually shipping with agents: context, specs, and “vibe” failure modes

Vibe-coded apps can feel stable but behave brittle in the “boring” parts

Model swapping isn’t free: prompts and harnesses co-evolve with the model

Plan sync, implement async: treat planning as the high-bandwidth step

Software engineering shifts from syntax to turning ambiguity into clarity

The wait equation: some software work was worth delaying until agents matured

“Don’t treat oil like fresco”: agent coding as a new medium

Speed may matter more than extra intelligence for most “in-the-loop” coding

Thinking-token overhead is the current tax on production-grade codegen

Why some dev workflows break with coding agents: nbdev as a case study

🖥️ Computer-use agents & MCP surface area expands (Notion, Comet, xAI)

Notion Agents tests custom MCPs, connectors, and computer-use automation

Comet browser rolls out “Act for me” screen takeover in the sidebar

X confirms work on a promptable ranking algorithm

xAI pitches “human emulators” as software-agnostic UI automation

Google Stitch may add PRD generation and API key-based usage