Z.ai GLM‑4.6V opens 106B VLM – 128K context, $0.60 per million input tokens
Executive Summary
Z.ai is throwing down a serious gauntlet in the open vision‑language space: GLM‑4.6V, a 106B‑parameter multimodal model with 128K context, shipped today with public weights, native tool use, and an API priced at $0.60 / $0.90 per million input / output tokens. Its 9B sibling, GLM‑4.6V‑Flash, is not only open but free to call via API, giving teams a practical low‑latency option for local or cheap hosted runs.
What’s new here isn’t just another VLM checkpoint; it’s the stack around it. The model handles long video and document workloads end‑to‑end—think one‑hour matches or ~150‑page reports in a single pass—and bakes in multimodal function calling so it can pass screenshots and PDFs into tools, hit search or RAG backends, then visually re‑read charts before answering. Benchmarks show 88.8 on MMBench V1.1 and competitive MMMU‑Pro scores, often matching or beating larger open rivals like Qwen3‑VL‑235B and Step‑3‑321B.
Ecosystem support landed day‑zero: vLLM 0.12.0 ships an FP8 recipe with 4‑way tensor parallelism and tool parsers, MLX‑VLM and SGLang already have integrations, and indie apps are using it for OCR‑to‑JSON and design‑to‑code flows. Net effect: wherever you’d normally reach for Qwen or LLaVA, GLM‑4.6V is now a credible toggle in the dropdown rather than a science project.
Top links today
- Agentic file system abstraction for context
- EditThinker iterative reasoning for image editors
- From FLOPs to Footprints resource cost paper
- Big Tech funded AI papers analysis
- Clinical LLM performance and safety evaluation
- Fluidstack neocloud financing and valuation report
- IBM reportedly nearing $11B Confluent acquisition
- New York Times lawsuit against Perplexity AI
- Jensen Huang on gradual AI adoption and work
- Jamie Dimon on AI, jobs and workweeks
- Apple leadership shakeup and AI strategy
- Google Gemini smart glasses plans for 2026
- Tech M&A landscape and 2025 deal volume overview
Feature Spotlight
Feature: Z.AI’s GLM‑4.6V goes open with native multimodal tool use
Open GLM‑4.6V/Flash add native multimodal function calling and 128K context; day‑0 vLLM support, free Flash tier, and docs make it a practical, low‑latency VLM option for real products.
🧠 Feature: Z.AI’s GLM‑4.6V goes open with native multimodal tool use
Cross‑account launch dominates today: open GLM‑4.6V (106B) and 4.6V‑Flash (9B) add native function calling, 128K multimodal context, day‑0 vLLM serve, docs, pricing. Many demos stress long‑video/doc handling and design‑to‑code flows.
Z.ai launches open GLM‑4.6V and free 4.6V‑Flash with 128K multimodal context
Z.ai officially released the GLM‑4.6V series—106B flagship and 9B GLM‑4.6V‑Flash—as open multimodal models with 128K context, native function calling and public weights on Hugging Face, alongside an API priced at $0.60 input / $0.90 output per 1M tokens for 4.6V while Flash is free. launch thread Developers can download weights, call the hosted API, or use Z.ai Chat, with a full collection page and technical blog detailing multimodal inputs (images, video, text, files) and interleaved image‑text generation. (hf collection, tech blog)
The free Flash tier plus open weights make this one of the more accessible long‑context vision‑language families for teams that want tool‑using multimodal models without being locked to a single proprietary stack.
GLM‑4.6V and Flash post strong vision‑language scores vs Qwen and Step‑3
Benchmark tables from Z.ai and community testers show GLM‑4.6V and the 9B Flash variant posting top‑tier scores across general VQA, multimodal reasoning, OCR/chart understanding and spatial grounding, often matching or surpassing larger open competitors like Qwen3‑VL‑235B and Step‑3‑321B.

GLM‑4.6V hits 88.8 on MMBench V1.1 and competitive numbers on MMMU‑Pro, multimodal agentic and long‑context suites, while 4.6V‑Flash trails only modestly, suggesting the architecture scales down well for local and low‑latency deployments. (benchmark overview, china update) For teams already on GLM‑4.5V, the charted gains across nearly every category indicate 4.6V is a genuine capability bump rather than a cosmetic rebrand.
GLM‑4.6V bakes in native multimodal function calling and search‑to‑answer flows
GLM‑4.6V is positioned not just as a perception model but as a native multimodal tool‑user: it can take screenshots, documents or images as structured parameters to tools, call out to web search or retrieval APIs, then visually re‑read charts and pages before producing its final answer. multimodal content note A Z.ai demo shows an end‑to‑end workflow where the model parses the visual query, performs online retrieval, reasons over the fetched pages, and returns a structured explanation instead of a loose caption or guess.
The technical blog frames this as a way to collapse separate “vision model + RAG + agent” stacks into a single GLM‑4.6V‑driven pipeline that can own perception, action and reasoning for enterprise search and BI dashboards. tech blog
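If you want to try the pattern without adopting the full stack, a minimal sketch looks like the following, assuming the OpenAI‑compatible chat‑completions shape Z.ai documents for its API; the base URL, model id, and tool schema here are illustrative placeholders, not values copied from the API reference.

```python
# Illustrative sketch of multimodal function calling against an OpenAI-compatible
# GLM-4.6V endpoint. The base_url, model name, and tool schema are placeholders;
# check Z.ai's API docs for the exact values.
from openai import OpenAI

client = OpenAI(base_url="https://api.z.ai/api/paas/v4", api_key="YOUR_KEY")  # assumed endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical retrieval tool
        "description": "Search the web and return page snippets",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.6v",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/q3_revenue_chart.png"}},
            {"type": "text", "text": "What does this chart imply? Search for the latest figures before answering."},
        ],
    }],
    tools=tools,
    tool_choice="auto",
)
msg = resp.choices[0].message
print(msg.tool_calls or msg.content)  # either a tool call to execute, or a final answer
```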
GLM‑4.6V pushes 128K multimodal context to hour‑long videos and large docs
Z.ai is emphasizing long‑context multimodal work: GLM‑4.6V’s visual encoder is aligned to a 128K‑token window, which they say is enough to process roughly 150 pages of complex documents, 200 slide pages or a one‑hour video in a single inference pass. context summary That context is used for more than static summaries—the team shows the model watching an entire football match, then summarizing goals and timestamps with both global narrative and precise temporal details.
For AI engineers building meeting analyzers, lecture digests or surveillance/event‑detection pipelines, this means you can experiment with truly end‑to‑end runs instead of hacking together fragile, hand‑chunked pre‑processing.
GLM‑4.6V and Flash get rapid support across Hugging Face, MLX‑VLM, SGLang and tools
Both GLM‑4.6V and its 9B Flash sibling are now live on Hugging Face with detailed model cards, making them easy to pull into existing workflows via transformers or custom loaders. (model card, flash card) MLX‑VLM announced day‑zero support so Mac users can experiment locally, while SGLang added GLM‑4.6V recipes for high‑performance cloud inference, and ZenMux plus other hosting platforms are wiring it in as a first‑class backend. (mlx vlm note, sglang support) Indie tools like anycoder already expose GLM‑4.6V in their model pickers; a shared screenshot shows it accurately turning a scanned patient intake form into structured JSON with a single prompt, doubling as both OCR and information extractor.

The net effect is that GLM‑4.6V is quickly becoming a standard toggle option wherever you would normally choose between Qwen, LLaVA or similar VL backends.
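For a quick local smoke test of the open weights, something like the sketch below should be close; the generic AutoProcessor / AutoModelForImageTextToText pairing is an assumption, so check the model card for the exact class and chat‑template details before relying on it.

```python
# Local smoke test of the Hugging Face weights via transformers. The generic
# AutoProcessor / AutoModelForImageTextToText pair is an assumption; the model
# card may specify a dedicated class and chat-template details.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "zai-org/GLM-4.6V"  # or the 9B Flash card
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/intake_form.png"},  # placeholder image
        {"type": "text", "text": "Extract the fields on this form as JSON."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```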
GLM‑4.6V targets frontend devs with design‑to‑code generation
Z.ai also highlights GLM‑4.6V as a frontend‑friendly model, demoing a “design‑to‑code” flow where it ingests multi‑panel UI mocks and emits structured components instead of one giant monolithic file, materially tightening the design‑implementation loop for web teams. frontend focus In the example, the model turns a complex, card‑based layout into clean, modular code, suggesting it has been tuned to respect hierarchy and reusability—critical if you care about maintainable React/Vue code rather than throwaway prototypes.
If you already use AI to scaffold UIs, this is a natural candidate to run on your own design system and see how well it respects your component boundaries and naming conventions.
vLLM ships FP8 GLM‑4.6V recipe with tool and reasoning parsers
vLLM published a day‑0 serve command for GLM‑4.6V FP8, wiring up zai‑org/GLM‑4.6V‑FP8 with --tensor-parallel-size 4, GLM‑specific tool‑call and reasoning parsers, expert parallelism and multi‑GPU vision encoder settings, all gated behind vLLM ≥0.12.0. vllm announcement The example uses FP8 weights and standardizes flags like --enable-auto-tool-choice and --mm-encoder-tp-mode data, giving infra engineers a concrete baseline for high‑throughput, multi‑GPU deployments instead of piecing together configs from scratch.

For teams already invested in vLLM, this makes GLM‑4.6V almost a drop‑in addition to existing inference clusters so you can benchmark latency and memory directly against your current VL stack.
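Once the serve command is up, a client‑side probe against vLLM's OpenAI‑compatible endpoint is enough to start that comparison; this sketch assumes the default port and served model name, with a placeholder prompt and image.

```python
# Client-side latency probe against a locally served GLM-4.6V-FP8 endpoint.
# Assumes the day-0 vLLM serve recipe is running with its defaults (port 8000,
# served model name equal to the HF path); prompt and image URL are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

t0 = time.perf_counter()
resp = client.chat.completions.create(
    model="zai-org/GLM-4.6V-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/report_page.png"}},
            {"type": "text", "text": "Summarize this page in three bullet points."},
        ],
    }],
    max_tokens=256,
)
elapsed = time.perf_counter() - t0
u = resp.usage
print(f"{elapsed:.2f}s, {u.completion_tokens} completion tokens "
      f"({u.completion_tokens / elapsed:.1f} tok/s)")
```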
Early testers lean into GLM‑4.6V for SVG graphics, coding evals and OCR
Community reactions to GLM‑4.6V are upbeat: one builder calls Z.ai’s work “cooked hard” after the model generates a surprisingly intricate CAT‑themed SVG in a single shot, and others remark on how fast Chinese teams are iterating even when version‑to‑version benchmark deltas look modest. (svg reaction, china update) Testing‑focused accounts like testingcatalog are already queuing it up on coding and reasoning eval suites, teasing side‑by‑side comparisons with other frontier models to see how well it handles multi‑step code tasks. coding eval tease In applied tools such as anycoder, GLM‑4.6V’s ability to parse messy PDFs into clean JSON—seen in the healthcare form example—points to strong real‑world OCR and information‑extraction performance that could replace brittle regex pipelines.

Expect more concrete reports over the next week as these informal trials translate into published benchmarks and production A/B tests.
🧰 Coding agents in practice: Slack handoff, background workers, routers
Heavy hands‑on posts: Claude Code tasking from Slack, more reliable local agent links, background sub‑agents, model routing and adoption dashboards. Excludes GLM‑4.6V (covered as feature).
Claude Code can now be delegated tasks directly from Slack
Anthropic wired Claude Code into Slack so engineers can tag @Claude in a channel or thread and have coding tasks auto-routed into a new Claude Code web session, which then posts progress and results back in-thread. Slack launch This extends the earlier Linear MCP integration into a full chat-to-agent handoff path for teams already living in Slack, following up on Linear MCP where Claude Code first learned to open and update issues directly.
The Slack app is in beta as a research preview for Team and Enterprise customers and pulls recent conversation context plus linked repos into Claude Code automatically, reducing the manual copy/paste glue work between bug reports, PR feedback, and the agent’s coding workspace. Slack beta details Anthropic’s blog stresses this is aimed at real workflows like “investigate this bug” or “implement this small feature” rather than one-off prompts, and the agent posts status updates in-thread so humans can step in when needed. integrations blog For teams already experimenting with Claude Code, this makes the agent feel more like another teammate in the Slack room rather than a separate tool they have to remember to open.
OpenRouter’s Body Builder lets devs describe multi‑model calls in plain English
OpenRouter launched Body Builder, a free natural‑language router that turns a short English description of what you want into structured API request bodies for multiple models at once. router launch Instead of hand‑crafting JSON for each provider, you tell it something like “compare this query across GPT‑5.1, Opus 4.5 and Gemini 3 for cost and latency,” and it emits ready‑to‑send calls. router docs The tool is positioned as a developer convenience layer on top of OpenRouter’s OpenAI‑compatible API, not as a paid feature: the team reiterated that Body Builder is free for now, with any future pricing changes to be announced separately. pricing clarification Example flows in the docs show it being used to spin up quick multi‑model benchmarks or to scaffold routing logic for production, where you might later plug in your own heuristics once you like how the generated calls look. playground link For teams experimenting with agent ensembles or A/B‑testing different coding models, this cuts the boilerplate needed to get from vague idea to working multi‑model harness.
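The bodies Body Builder emits are ordinary OpenRouter chat‑completion payloads, so a hand‑rolled version of the same fan‑out looks roughly like this sketch; the model slugs are illustrative and the comparison logic is deliberately minimal.

```python
# Hand-rolled version of the fan-out Body Builder automates: one prompt, several
# models, one OpenRouter chat-completions call each. Model slugs are illustrative.
import time
import requests

MODELS = ["openai/gpt-5.1", "anthropic/claude-opus-4.5", "google/gemini-3-pro-preview"]
prompt = "Refactor this function to be tail-recursive: ..."

for model in MODELS:
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    t0 = time.perf_counter()
    r = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},
        json=body,
        timeout=120,
    )
    data = r.json()
    print(model, f"{time.perf_counter() - t0:.1f}s", data.get("usage"))
```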
Warp adds model comparison cards and an auto‑routing option
Warp’s AI panel now surfaces side‑by‑side cards showing each model’s intelligence, speed, and cost, and lets you pick an auto mode that routes requests to a “good” model tuned for either responsiveness or price. router overview This builds on Warp’s earlier AI profiles work model profiles by making model choice a first-class UX step instead of a hidden config.
On top of that, Warp says it now handles fallbacks gracefully: if a chosen model errors out or emits malformed tool calls, the router will transparently fall back to another compatible model so the command still completes instead of dying with an opaque error. fallback detail In practice, that means you can standardize on an auto profile in your terminal workflows and let Warp juggle frontier models behind the scenes while you focus on shell commands and coding tasks, only dropping down to manual selection when you want a specific model for a specific job.
Kilo Code debuts an Adoption Dashboard and leans into Copilot comparisons
Kilo Code is rolling out an org‑level Adoption Dashboard that collapses AI usage into one score built from usage frequency, workflow depth, and adoption breadth across the company. adoption dashboard The idea is to give eng leaders a concrete answer to “are people actually using this agent, and where?” instead of guessing from anecdote.
In parallel, the team is hosting a “GitHub Copilot vs. Kilo Code” session that openly invites questions about why devs are switching and how Kilo’s agentic workflow differs from inline completion tools. Copilot comparison The combo of dashboard plus explicit Copilot positioning signals Kilo Code is no longer just chasing features, but trying to plant a flag as the agentic coding environment you can actually measure inside an organization, which matters for anyone being asked to justify yet another paid AI seat to finance.
RepoPrompt moves MCP to Unix sockets and cuts idle CPU to 0.1%
RepoPrompt 1.5.53 replaces fragile local TCP connections for MCP clients with a fully local Unix socket transport, which should make tool connections both more reliable and cheaper in tokens. RepoPrompt release In parallel, a week of optimization work has the app idling at around 0.1% CPU, making long-running agent sessions much lighter to keep open on a developer machine. cpu optimization

The switch to Unix sockets means fewer connection hiccups when running multiple MCP servers and clients on the same host, which previously could cause flaky behavior under TCP, especially on macOS developer laptops. For people leaning on RepoPrompt as their MCP hub for coding agents, the lower baseline CPU and more robust transport make it easier to leave the app running all day while agents index large repos or call tools in the background. pair programming use It’s a small but meaningful infrastructure hardening step in the agent tooling stack.
CodeLayer deep agents run planning phases as background sub‑agents
A CodeLayer run log shows an interesting pattern: the orchestrator launches implementation phases 2 and 3 as asynchronous sub‑agents that work in the background while it waits to start later dependent phases. CodeLayer screenshot The UI explicitly marks these phases as separate agents with their own todos and status, indicating the system is leaning into a multi‑agent, multi-phase design where long tasks can be parallelized rather than handled linearly by one chain.

Dex Horthy’s commentary highlights a real pain point this surfaces: most current “context‑anxious” models are so conservative about tool output that they wrap every command in 2>/dev/null style guards and end up re-running expensive test suites or commands instead of trusting cached results. context complaints The takeaway for builders is that deep agents like CodeLayer can orchestrate fairly sophisticated background work, but you still need deliberate context and tool‑output strategies to avoid wasted compute and accidental multi‑minute reruns when phases depend on each other.
🏗️ Compute supply and DC finance: H200 to China and neocloud funding
Material infra moves: US signals licensed H200 exports to China with a revenue cut; spec deltas vs Blackwell; GPU scarcity context. Fluidstack lines up ~$700M at ~$7B valuation with Google‑backed leases.
US to license Nvidia H200 exports to China with 25% revenue skim
The US government will allow Nvidia’s H200 GPUs to be shipped to approved Chinese customers under export licenses that route 25% of related revenue back to the US, according to a Trump statement that Xi Jinping “responded positively” to. trump export post This reverses the blanket ban era and replaces it with a tightly metered, taxed export channel.

Technically, H200 is still a full Hopper‑generation accelerator, with ~141 GB of HBM3e, ~4.8 TB/s memory bandwidth and ~4 PFLOPS FP8, but it lags the new Blackwell B200/GB200 chips on every axis: B200 pushes ~180 GB HBM3e, ~8 TB/s bandwidth, NVLink Gen5 at 1.8 TB/s vs H200’s Gen4 900 GB/s, plus newer FP4/FP6 transformer engines and dual‑die packaging. spec breakdown So China regains access to serious training silicon that’s roughly one generation behind US hyperscalers, while Washington keeps leverage through per‑customer licensing, telemetry and the 25% revenue skim. spec breakdown Traders are already pricing this in: Nvidia’s stock jumped about 2.2% intraday right after the export news broke, reflecting expectations of reopened China demand on top of already tight HBM and advanced‑packaging supply. stock move For AI infra planners, the practical takeaway is that Chinese clouds can once again plan multi‑GPU H200 clusters for 2026 deliveries, but they’ll pay a geopolitical premium and still trail Blackwell‑class capacity in efficiency and scale. spec breakdown
Fluidstack targets ~$700M raise at ~$7B valuation with Google‑backed DC leases
Data‑center startup Fluidstack is in talks to raise about $700M at a ~$7B valuation, built around Google‑backed leases on three AI facilities, including a New York site that will host Google TPUs. funding summary If Fluidstack can’t meet its lease obligations, Google effectively backstops the debt and takes over the power and space, so the structure behaves more like project finance than a classic SaaS round.

Situational Awareness, the AI infra fund led by ex‑OpenAI researcher Leopold Aschenbrenner, is reportedly in line to lead the round, and Fluidstack is also woven into France’s €10B, ~1 GW supercomputer plan, which further blurs the line between private neoclouds and state‑level AI infrastructure. funding summary Fluidstack sells big, dense blocks of GPU/TPU capacity from a few sites rather than a broad consumer cloud footprint, positioning itself as a specialist landlord for scarce power, land and racks where guaranteed compute access is the real product. funding summary The financing terms underline two things for AI teams: first, hyperscalers like Google are willing to guarantee third‑party builds to secure future capacity without taking all the capex on balance sheet; second, access to GPUs and TPUs is increasingly mediated by these high‑leverage lease structures, so enterprise buyers may end up negotiating not just with clouds but with neocloud landlords sitting one step upstream of the usual APIs. funding summary
📊 Evals and telemetry: job‑level rankings, Code Arena, trace fan‑out
Fresh evaluation/observability items: Occupational rankings compare models by job, Code Arena adds DeepSeek V3.2, social‑reasoning scores update, and OpenRouter ships Broadcast to pipe traces to third‑party tools.
Arena debuts Occupational rankings to test models by real jobs
Arena launched Occupational rankings, a new benchmark that clusters the hardest real‑world prompts by occupation (math, health, engineering, and more) and compares how frontier models actually perform at those jobs rather than on synthetic quizzes. occupational launch video
For each category, Arena mines prompts that look like questions from experts at the frontier of their field, then runs multiple models and lets evaluators see side‑by‑side reasoning and answers, surfacing which systems act like specialists versus generalists. occupational launch video This is useful if you care less about overall benchmark averages and more about "which model should my legal, medical, or engineering team actually use for day‑to‑day work?"
OpenRouter Broadcast pipes LLM traces into Langfuse, LangSmith, Datadog and W&B
OpenRouter released Broadcast, a trace fan‑out feature that streams request/response traces from its LLM API directly into external observability tools like Langfuse, LangSmith, Braintrust, Datadog, and Weights & Biases without any extra code in your app. (broadcast announcement, destination partners) Teams can turn Broadcast on in their OpenRouter settings, choose which destinations to send to, and configure sampling rates and per‑API‑key filters so only relevant traffic (for example, production keys or a specific app) is exported. feature rationale Once enabled, downstream platforms receive rich metadata—tool calls, errors, latency, token counts, and costs—so you can build dashboards, alerts, and evals in the stack you already use instead of wiring custom logging into every agent. broadcast announcement Langfuse has already highlighted that it’s a first‑wave destination, meaning you can go from raw OpenRouter traffic to searchable traces and experiment tracking with a couple of clicks. langfuse partner
DeepSeek V3.2 arrives in Code Arena for live coding battles
Code Arena added DeepSeek V3.2 and V3.2‑thinking as new contestants in its live coding evaluations, so you can now watch the Chinese open model family build real web apps head‑to‑head against other frontier systems. code arena thread In Code Arena, users submit the same web‑development prompt to multiple models, inspect the resulting apps, and vote on which solution is better; those votes drive an evolving leaderboard rather than static offline scores. code arena thread Having both the standard and "thinking" variants of DeepSeek V3.2 in that mix gives engineers a concrete feel for how its chain‑of‑thought mode trades off speed, cost, and reliability versus non‑reasoning runs on realistic, UI‑heavy coding tasks.
Step Game update shows GPT‑5.1 and Gemini 3 Pro leading social reasoning
A fresh Step Game leaderboard highlights how differently top models handle social reasoning under uncertainty: GPT‑5.1 Medium Reasoning leads with an average score of 5.3, with Gemini 3 Pro Preview close behind at 5.0. step game update In the Step Game, three players race to a finish line; each turn they chat, then secretly choose to move 1, 3, or 5 steps, but if two or more pick the same number nobody moves, so winning requires modeling what others will do instead of greedily maximizing alone. step game update The updated board shows Grok 4.1 Fast Reasoning at 3.8, DeepSeek V3.2 at 3.7, Claude Sonnet Thinking 16K at 3.4, Kimi K2 Thinking 64K at 3.3, Claude Opus 4.5 (non‑reasoning) at 3.2, Qwen 3 235B at 3.1, and smaller or non‑reasoning variants like GLM‑4.6 and Mistral Large 3 lower down. step game update For anyone building multi‑agent or negotiation‑style systems, these scores are a useful complement to pure logic benchmarks because they expose how models cope when other agents adapt, bluff, and collide.
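The mechanics are simple enough to simulate in a few lines, which is handy if you want to score your own agents against the same collision rule; the random policies below are placeholders for whatever decision logic you plug in.

```python
# Minimal simulation of the collision rule described above: each player secretly
# picks 1, 3 or 5; any value picked by two or more players moves nobody.
# The random policies are placeholders for whatever agents you want to test.
import random
from collections import Counter

def play_round(positions, policies):
    picks = {player: policies[player]() for player in positions}
    counts = Counter(picks.values())
    for player, step in picks.items():
        if counts[step] == 1:          # unique pick -> the player advances
            positions[player] += step  # colliding picks are wasted turns
    return picks

positions = {"A": 0, "B": 0, "C": 0}
policies = {p: (lambda: random.choice([1, 3, 5])) for p in positions}
for turn in range(10):
    picks = play_round(positions, policies)
    print(turn, picks, positions)
```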
📈 Enterprise adoption and GTM: OpenAI report and agentic commerce
New enterprise signals: OpenAI’s 2025 enterprise report gives seats, usage and time‑saved metrics; ChatGPT adds Instacart checkout; ops note on HF↔GCP transfer speed. Excludes GLM‑4.6V feature.
OpenAI’s 2025 enterprise AI report puts hard numbers on workplace usage
OpenAI published a detailed “State of Enterprise AI” report quantifying how deeply ChatGPT and the API are embedded at work: over 1 million business customers, more than 7 million workplace seats, and weekly Enterprise message volume up about 8× year‑over‑year. enterprise announcement Typical Enterprise users now send roughly 30% more messages, with average reasoning‑token consumption per customer up around 320× in 12 months, and nearly 200 organizations already past the 1‑trillion‑token mark. enterprise breakdown You can dig into the full PDF from OpenAI for the exact charts and methodology. OpenAI report The report leans hard on impact rather than raw usage. In a survey of ~9,000 workers across almost 100 companies, about 75% said AI improved the speed or quality of their work, and typical ChatGPT Enterprise users report saving roughly 40–60 minutes per active day. enterprise breakdown Around three‑quarters also say they can now do tasks they previously couldn’t, such as coding or spreadsheet automation, which is the sort of behavior change IT leaders care about when justifying spend.
Adoption is very uneven inside companies. OpenAI calls out “frontier workers” who send about 6× more messages than the median user, often wiring AI into daily workflows like data analysis, QA, and code review. enterprise breakdown At the org level, “frontier firms” send about 2× more messages per seat and are much heavier users of Projects and Custom GPTs—roughly 20% of all Enterprise messages now flow through these higher‑order abstractions instead of raw chat. enterprise breakdown That’s a clear signal that the value is shifting from generic chat toward org‑specific tools and agents.
The report also ties AI usage to business performance by citing external work like BCG’s 2025 study: AI leaders see about 1.7× revenue growth, 3.6× shareholder return, and 1.6× EBIT margin versus laggards. enterprise breakdown That’s correlation, not proof of causation, but it’s exactly the kind of slide CFOs and boards expect to see when approving more GPUs and seat licenses.
For teams building internal AI platforms, the takeaways are pretty direct. First, usage concentrates: a minority of power users are responsible for most of the messages and token burn, so you should design programs, training, and guardrails around them rather than averages. Second, the report shows primitive "ask anything" usage giving way to structured workflows: orgs with thousands of internal GPTs, standard projects, and pre‑built flows are where the time‑savings and new capabilities show up.
The point is: OpenAI is trying to move the enterprise AI conversation from vibes to metrics, and this report gives engineering and data leaders a bunch of concrete benchmarks—tokens, minutes saved, feature adoption—to compare their own tenant against. report teaser It also quietly argues that the next competitive edge won’t just be which model you pick, but how quickly you turn that model into reusable, org‑wide tools that your frontier users can’t live without.
ChatGPT turns Instacart into an in‑chat grocery shopping agent
OpenAI and Instacart are rolling out a flow where you can go from “what should I eat this week?” to a fully‑built Instacart cart without leaving ChatGPT. instacart teaser Instead of acting like a plugin you click manually, ChatGPT now behaves like a grocery planning agent: you describe meals, dietary rules, budget, timing, and it calls Instacart’s search and pricing APIs behind the scenes to map that intent to real products. commerce explainer

The flow looks like: you ask for, say, five vegetarian dinners for two people under $80; ChatGPT proposes recipes and silently builds a structured cart with specific brands, sizes, and quantities. When you tweak the plan—“swap the tofu brand”, “make it gluten‑free”, “double the chili ingredients for guests”—the agent updates its internal shopping state and regenerates the cart instead of starting over. commerce explainer When you’re happy, ChatGPT hands the cart off to Instacart for checkout using your saved address and payment; from your perspective, it’s been one long conversation, not a set of disjointed forms. agentic demo
This matters because it’s real agentic commerce, not a demo trip planner. The system has to maintain a live state machine (cart contents, constraints, substitutions), call partner APIs repeatedly, and keep its messages consistent with the ground truth of what Instacart actually sells. commerce explainer OpenAI also earns a small fee per completed order, which is a clear GTM experiment in transaction‑based revenue layered on top of subscription and API usage. commerce explainer
If you’re building your own vertical agent, there are a few patterns worth copying here. First, the integration is narrow but deep: one high‑value workflow (meal planning → checkout) rather than a zoo of shallow “actions”. Second, the agent doesn’t just surface a link; it owns the entire decision curve until the payment screen, then yields to the merchant. Third, all of this is powered by existing partner APIs—search, pricing, cart management—rather than proprietary magic, which means most SaaS products could in theory do the same.
The catch is that this raises the bar for reliability. An agent that hallucinated prices or put the wrong quantities in a cart would be actively harmful. That’s why this is such a good test case for serious agent design: real money changes hands at the end of the chain. If this works and users trust it, expect to see similar “prompt‑to‑purchase” flows pop up around travel, subscriptions, and B2B tooling next.
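If you want to copy the pattern, the core is a small, explicitly validated state object rather than anything exotic. The sketch below is hypothetical (the class names and the catalog_lookup hook are invented), but it captures the "cart as ground‑truth‑checked state machine" idea: model‑proposed edits only land after the partner catalog confirms them.

```python
# Hypothetical sketch of the "live shopping state" pattern described above. The
# class names and catalog_lookup hook are invented; the point is that prices and
# product ids come from the partner API, never from the model's imagination.
from dataclasses import dataclass, field

@dataclass
class CartItem:
    product_id: str
    name: str
    quantity: int
    unit_price: float          # always sourced from catalog ground truth

@dataclass
class ShoppingState:
    constraints: dict = field(default_factory=dict)   # e.g. {"budget": 80, "diet": "vegetarian"}
    items: list[CartItem] = field(default_factory=list)

    def apply_edit(self, edit: dict, catalog_lookup) -> None:
        """Apply a model-proposed edit only if the catalog confirms the product."""
        product = catalog_lookup(edit["product_query"])
        if product is None:
            raise ValueError(f"No real product matches {edit['product_query']!r}")
        self.items = [i for i in self.items if i.product_id != edit.get("replaces")]
        self.items.append(CartItem(product["id"], product["name"],
                                   edit.get("quantity", 1), product["price"]))

    def total(self) -> float:
        return sum(i.unit_price * i.quantity for i in self.items)

    def within_budget(self) -> bool:
        return self.total() <= self.constraints.get("budget", float("inf"))
```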
Hugging Face and Google Cloud move 5 GB in 13 seconds
Clement Delangue showed a short exchange where a 5 GB dataset moved from Hugging Face to Google Cloud in roughly 13 seconds, thanks to a new integration between the two platforms. transfer comment The context is simple but important: if your models and training data live on Hugging Face, and your compute or storage lives on GCP, you no longer need to babysit slow, brittle copies just to run an experiment or spin up a pipeline.

The demo came in the middle of chatter about the Anthropic Interviewer dataset topping the Hugging Face trending charts, and about how diverse the current model and dataset ecosystem is across languages and modalities. (trending datasets, trending models grid) Behind that ecosystem is an unglamorous requirement: move tens or hundreds of gigabytes between storage and compute quickly enough that engineers don’t lose the thread of what they’re doing. Here, 5 GB in 13 seconds (roughly 385 MB/s sustained) is a concrete datapoint that the plumbing is catching up.
For ML platform and infra teams, this kind of “wide pipe” matters in a few places. It makes it much more reasonable to do ad‑hoc fine‑tuning or evaluation on cloud GPUs using datasets you curate and version on Hugging Face. It reduces the friction of syncing large artifacts (like instruction‑tuning corpora or eval suites) into GCS buckets for scheduled jobs. And it supports workflows where researchers share datasets publicly, but enterprises still want to process them inside their own VPC on GCP.
The point isn’t that 5 GB is huge—it isn’t—but that the pattern scales. If the same path handles 50 or 500 GB reliably, it’s one less excuse for “we’ll validate that later when someone finds the time to copy everything over”. Instead, you can treat Hugging Face as a first‑class data source in your GCP pipelines and start thinking more about what you want to train or evaluate, and less about how to get the bits there in the first place.
🧪 Frontier signals beyond GLM: Rnj‑1, Gemini Flash whispers, Grok ETA
Non‑GLM model updates and rumors: open 8B results, LM Arena sightings suggesting Gemini 3 Flash variants, Grok 4.20 timing, and NB2 Flash chatter. GLM‑4.6V is excluded (see feature).
LM Arena’s ‘Seahawk’ and ‘Skyhawk’ likely tease Gemini 3 Flash variants
Two new models labeled “skyhawk” and “seahawk” have appeared on LM Arena, each replying “I am a large language model, trained by Google,” strongly suggesting they are pre‑release Gemini 3 Flash variants under codenames. arena sighting Their UI treatment mirrors Gemini 3 Pro, but with separate controller tiles and different output behavior, which lines up with earlier hints that Google is testing multiple Flash‑family configurations on Arena Gemini Flash tests. For AI engineers, this points to a near‑term world where small, fast Gemini variants compete directly with o3‑style and DeepSeek‑class "thinking" models for low‑latency workloads.

Rnj‑1 open 8B model surges on Hugging Face trending charts
Essential AI’s Rnj‑1, an 8B base+instruct pair trained on 8.7T tokens, is now one of Hugging Face’s top trending models with ~441k downloads, sitting alongside heavyweights like DeepSeek V3.2 and FLUX.2 dev. trending models list Following up on Rnj‑1 launch as a "GPT‑4o tier" open model, today’s metrics thread highlights 20.8% SWE‑bench Verified (bash‑only), 83.5 HumanEval+, 75.7 MBPP+, 43.3 AIME’25, and 30.2 SuperGPQA, all on just 417 zettaFLOPs of pre‑training. benchmark summary For teams standardizing on an open 8B for STEM and code, this is a clear signal that Rnj‑1 is drawing real usage, not just hype.

Qwen 3 Next arrives on Ollama for local experimentation
Ollama has added support for qwen3-next, making Alibaba’s latest Qwen 3 Next series accessible as a one‑line local model: ollama run qwen3-next. ollama announcement For builders who prefer offline or self‑hosted workflows, this lowers the friction to prototype with the new Qwen generation (including potential reasoning and coding improvements) without wiring up cloud APIs or custom containers.
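Once the model is pulled, Ollama's local REST API gives you a programmatic path as well; this minimal call uses the standard /api/chat endpoint with streaming disabled, with a throwaway prompt for illustration.

```python
# Minimal programmatic call once `ollama run qwen3-next` has pulled the model;
# uses Ollama's standard local /api/chat endpoint with streaming disabled.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3-next",
        "messages": [{"role": "user", "content": "Write a binary search in Python."}],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```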

Jina releases 2B VLM claiming SOTA multilingual doc understanding
Jina AI has released jina‑VLM, a 2B‑parameter vision‑language model that they say hits state‑of‑the‑art results on multilingual visual question answering and document understanding benchmarks while staying small enough for modest hardware. release announcement The demo shows the model reading dense layouts and answering questions across languages, which makes it interesting as a drop‑in for OCR‑plus‑LLM pipelines where current solutions are either too heavy or weak on non‑English text.
🛡️ Legal and safety: NYT v. Perplexity, clinic gap, jailbreak datasets
Policy/safety beats: NYT sues Perplexity over paywalled RAG and branding, a meta‑review finds big gaps between exam and clinic performance, and community shares synthetic jailbreak pipelines—an abuse‑resistance warning for builders.
NYT sues Perplexity over paywalled RAG and NYT‑branded hallucinations
The New York Times has filed a federal lawsuit accusing Perplexity AI of copying millions of NYT articles—including paywalled stories—to power its assistant, and of showing NYT branding alongside fabricated content, turning this into both a copyright and trademark fight. lawsuit summary The complaint says Perplexity scraped and reused full articles instead of linking out, effectively competing with NYT’s own products, and that hallucinated answers sometimes appear with NYT’s name and logo, misleading users into thinking the made‑up text is real reporting. lawsuit details The case joins more than 40 ongoing publisher vs. AI disputes and will be watched closely by anyone building RAG systems on third‑party content, because it goes beyond training data and squarely attacks how assistants serve answers in production. NYT article

If courts side with NYT on either the mass reproduction or the branding angle, teams that ingest news, paywalled sites, or documentation at scale may have to revisit their crawling practices, output UI, and indemnities. Even if Perplexity ultimately settles, the complaint provides a detailed blueprint of the legal arguments future plaintiffs could copy against smaller RAG startups and enterprise deployments that quietly mirror internal or external content without clear licensing or source attribution.
Clinical LLMs ace exams but lag badly on real care and safety
A new systematic review of 39 medical AI benchmarks (2.3M questions, 45 languages) finds that top LLMs score 84–90% on knowledge exams but only 45–69% on realistic clinical tasks, with safety assessments in the 40–50% range. review summary Following up on NOHARM study that quantified direct patient harms, this work shows the broader knowledge–practice gap: LLMs do far worse at diagnosis, management choices, and uncertainty handling than at multiple‑choice recall, and they often miss or mishandle safety‑critical checks. paper excerpt The authors argue that high exam scores are misleading proxies for clinical readiness and conclude that fully autonomous deployment of "clinical copilots" is not currently justifiable, recommending strict human‑in‑the‑loop oversight, practice‑oriented evaluation, and regulatory skepticism toward exam‑only claims. PubMed abstract

For teams building healthcare agents or integrating LLMs into EHRs and triage flows, the message is blunt: treat board‑style benchmarks as table stakes, not evidence of bedside competence. You’ll need scenario‑based evals against full cases, safety red‑teaming around rare but catastrophic mistakes, and process designs where clinicians remain the decision‑makers rather than rubber‑stamping AI suggestions.
Community jailbreak pipeline mass‑generates rich attack prompts
An open community project shows how to use Claude Opus 4.5 plus Claude Code to auto‑generate large synthetic jailbreak datasets—multi‑paragraph narratives, fake internal test instructions, therapy role‑plays, and more—that can punch through safety filters on recent "thinking" models. pipeline overview The author shares dozens of attack patterns (e.g., damaged ethics modules in sci‑fi emergencies, fake Anthropic internal memos, time‑loop desperation stories) and reports that the very first prompt tried against a new DeepSeek thinking model elicited a detailed MDMA synthesis procedure on the first attempt. dataset sample A self‑improving mutation loop refines prompts based on which ones succeed, effectively weaponizing the same scaffolding techniques used for reasoning and tool use—but aimed at policy evasion instead of task performance.
For safety engineers and platform owners, this is a reminder that jailbreakers are now iterating with agents and code, not hand‑written prompts. Static guardrails and classifier‑only defenses will increasingly fail against narrative, meta‑system, or faux‑authority setups like the ones in this dataset. You’ll want layered defenses (model‑side training, input/output filters, and high‑risk‑domain routing) plus continuous red‑team pipelines that assume attackers have your own orchestration tools on their side.
“From FLOPs to Footprints” ties AI training to heavy‑metal footprints
The "From FLOPs to Footprints" paper chemically analyzes an NVIDIA A100 GPU and finds 32 elements—about 90% heavy metals by mass, dominated by copper, iron, tin, silicon and nickel—then connects that to the compute needed for frontier model training. paper summary By combining measured GPU composition with estimates of model FLOPs utilization and hardware lifetimes, the authors estimate that training GPT‑4‑class systems can effectively consume on the order of 1,100–8,800 A100s per run, corresponding to up to ~7 tons of toxic elements that must eventually be mined and disposed of.

They also show that raising MFU from ~20% to 60% and extending GPU lifetimes from one to three years together could slash GPU demand by ~93%, making both software efficiency and hardware reuse central levers for sustainability. ArXiv paper
This reframes "efficient training" from a cloud bill problem into a materials and regulation issue. If you’re designing training stacks, sparsity, better schedulers, higher MFU, and longer deployment horizons aren’t just cost optimizations—they directly reduce heavy‑metal throughput. And for policy and ESG teams, this kind of analysis will likely feed into disclosure expectations and pressure on labs to justify ever‑larger training runs with more than benchmark deltas.
Big Tech–funded AI papers show higher impact and insularity
A new bibliometric study of ~50K top AI conference papers finds that work funded by Big Tech—about 10% of papers—captures around 12% of highly cited outputs, punching above its share of publication volume. study summary The authors classify funding via acknowledgments, then show three patterns: Big Tech–backed papers are more likely to be highly cited, disproportionately cite other Big Tech–funded work, and lean more heavily on very recent references compared with unfunded or other‑funded research.

That combination points to an increasingly self‑referential and short‑term research cluster orbiting the major labs. ArXiv paper
For engineers and leaders relying on "what’s hot in the literature" as a proxy for good ideas, this is a useful caution. Citation counts may partly reflect resource and distribution advantages rather than pure merit, and the ecosystem risk is that promising non‑corporate lines of work get under‑explored. When you’re making architecture or safety bets, it’s worth sampling beyond the Big Tech orbit and weighting replication, openness, and long‑horizon thinking—not just who has the largest author list or the flashiest benchmark.
🔌 MCP interop and agent plumbing
Interop and context engineering threads: Anthropic’s MCP loop explainer, Linear MCP tasking with Claude Code, a daemon that hot‑reloads servers, and Amp’s thread recall. Excludes Slack handoff (covered in dev tooling).
AIGNE paper proposes ‘everything is a file’ abstraction for agent context
A new multi‑institution paper argues that GenAI systems should treat context like a file system, with every memory, tool, external data source, and scratchpad exposed as a file that agents can mount, version, and govern rather than as ad‑hoc prompts and RAG blobs aigne summary.

The proposed AIGNE framework introduces a Context Constructor, Loader, and Evaluator that assemble the minimal slice of history and tools needed under token limits, log every access with provenance, and update long‑term memory only when answers check out, offering a much more auditable plumbing layer for multi‑agent systems (arxiv paper).
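As a purely conceptual illustration (not the paper's actual API), the "everything is a file" idea reduces to mountable, provenance‑tagged entries plus a constructor that packs only what fits the token budget while logging each access:

```python
# Purely conceptual sketch of "context as files" -- not the paper's actual AIGNE
# API. Every memory/tool/data source is a mountable, provenance-tagged entry, and
# a constructor packs only what fits the token budget while logging each access.
from dataclasses import dataclass

@dataclass
class ContextFile:
    path: str          # e.g. "/memory/user_prefs.md" or "/tools/search.json"
    content: str
    tokens: int
    provenance: str    # who wrote it, when, from which source

class ContextConstructor:
    def __init__(self, mounted: list[ContextFile]):
        self.mounted = mounted
        self.access_log: list[str] = []

    def build(self, query: str, budget: int) -> list[ContextFile]:
        # Toy relevance filter: keep files sharing words with the query, smallest first.
        words = query.lower().split()
        relevant = [f for f in self.mounted if any(w in f.content.lower() for w in words)]
        selected, used = [], 0
        for f in sorted(relevant, key=lambda f: f.tokens):
            if used + f.tokens > budget:
                break
            selected.append(f)
            used += f.tokens
            self.access_log.append(f"read {f.path} ({f.provenance})")
        return selected
```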
Anthropic clarifies how MCP tool calls flow through the context window
Anthropic shared a concise visual and narrative walkthrough of the Model Context Protocol (MCP) loop, showing how an MCP client first pulls tool definitions via tools/list, loads only those into the model’s context window, then routes tools/call requests and their results back through the model instead of dumping every tool up front mcp diagram.

The accompanying engineering write‑up pushes a “code execution, not giant prompts” pattern where agents generate small snippets that talk to MCP servers, cutting token usage and avoiding context flooding when you have hundreds or thousands of tools wired into a single assistant (mcp blog post).
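The same loop is easy to see in code with the official MCP Python SDK's stdio client: list tools once, load only those definitions into context, then route individual calls; the server command and tool name below are placeholders.

```python
# MCP client loop sketch: tools/list once, then tools/call per request. The
# launched server and the "search" tool are placeholders for whatever you run.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="npx", args=["-y", "some-mcp-server"])  # placeholder server
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            listing = await session.list_tools()            # tools/list
            print([tool.name for tool in listing.tools])    # only these definitions enter context
            result = await session.call_tool("search", {"query": "GLM-4.6V"})  # tools/call
            print(result.content)

asyncio.run(main())
```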
mcporter 0.7.1 daemon now hot‑reloads MCP servers on config changes
The latest mcporter release focuses on the unglamorous but crucial bit of MCP plumbing: the daemon now tracks config file modification times across layered configs and restarts long‑running keep‑alive servers when something changes, so MCP tools that require persistent processes actually pick up new settings and credentials mcporter update.
For anyone running multiple MCP servers in daemon mode, this means you no longer have to bounce everything manually after tweaking config, and the release also tightens up bundled Playwright and iTerm entries to match current server definitions (github changelog).
Amp IDE can now find the exact agent thread that created a file
Amp’s coding assistant added a small but very practical capability: you can now ask it questions like “which thread created this file?” and it will locate the originating Amp conversation so you can reopen and continue that agent session against the current codebase amp thread recall.
For teams leaning on long‑running agent threads to refactor or build features, this gives you a direct link from the repo back to the agent’s prior reasoning and edits, instead of hunting through chat history by hand when something breaks or needs to be extended.
🎬 Creative stacks: NB Pro workflows, Kling O1 editing, LongCat text fidelity
Lots of creative/vision posts today: NB Pro tips and contests, Kling O1 edit use‑cases, LongCat‑Image bilingual text rendering/editing, and ad tools claiming product‑true visuals. Engineering‑heavy, not just art.
Kling O1 leans into multimodal video editing, not just text prompts
Kling’s O1 release is being framed as "Nano Banana Pro for video": instead of wrestling with long prompts, you can feed it images, existing videos, and short text descriptions to drive editing and synthesis kling o1 explainer. Following its ComfyUI partner node integration comfyui node, the team is now pushing concrete flows like full background replacement on live product demos, multi‑shot character consistency from 1–7 reference stills, and first‑frame control for pixel‑accurate motion starts background replacement thread.
Later parts of the launch thread show O1 inserting or removing objects mid‑video (e.g., cleaning up B‑roll logos) and recombining camera moves from reference clips while keeping subjects stable. (character consistency use case, object edit breakdown) For creative engineers, the interesting bit is that these are essentially agentic pipelines over a diffusion‑like model: you’re orchestrating reference selection, mask inference, and temporal alignment with higher‑level instructions. This is a good blueprint if you’re designing your own "edit‑aware" video tooling around image/video models instead of relying on one‑shot text2video.
Meituan’s 6B LongCat-Image rivals 20B+ models in bilingual, text-heavy image work
Meituan quietly dropped LongCat‑Image and LongCat‑Image‑Edit, a 6B‑parameter bilingual (Chinese–English) diffusion stack that fits on a single consumer GPU yet matches or beats many 20B+ models on GenEval, DPG, and text‑rendering benchmarks longcat overview. The team pairs a Qwen2.5 vision–language encoder with a VAE, feeds both text and images into a shared transformer where early blocks mix text+latents and deeper blocks refine visuals, then layers SFT plus RLHF (GRPO/DPO) with reward models for realism, artifacts, text correctness and aesthetics. (architecture thread, github repo)

A key design choice is data hygiene: ~1.2B image–text pairs are heavily deduped, scored for aesthetics, stripped of AIGC, and only a tiny hand‑checked synthetic slice is reintroduced later, which they argue avoids the "plastic AI look" common in models trained on model‑generated art (data filtering notes). LongCat‑Image‑Edit reuses the same backbone with extra latent streams for source/reference images and a DPO‑tuned editor, giving strong layout‑preserving edits and extremely sharp CN/EN poster text on CEdit/GEdit while staying under 10 GB VRAM. (text rendering comparison, model card) For anyone building e‑commerce posters, bilingual marketing, or UI asset pipelines, this is one of the first truly practical small‑footprint image+editing stacks worth testing locally.
Nano Banana Pro community is converging on reusable prompt workflows
Builders are treating Nano Banana Pro less like a random art toy and more like a controllable visual engine, sharing composable prompt patterns for tier lists, optical illusions, policy infographics and more, extending earlier 4‑step "cinematic grid" workflows grid workflow. People are swapping simple schemas such as a 2‑prompt JSON→image tier‑list recipe tier list example, one‑word prompt challenges that surface model biases and strengths one word challenge, and meta‑threads that aggregate dozens of "master" prompts for different content types prompt list thread.

These patterns matter because they turn NB Pro into an informal language for layout and style control: you can standardize how a team asks for tier lists, political explainers, or ad-style grids and get repeatable structure instead of one‑off "vibes" policy infographic demo. For AI engineers and PMs, the lesson is that a lot of real control comes from shared scaffolds and prompt conventions rather than model tweaks—worth capturing in internal wikis or even as small prompt libraries that sit beside your code.
Pika 2.2 arrives as an API via Fal for apps that need video
Pika Labs and Fal launched a hosted API for Pika 2.2, exposing both Pikascenes (prompted shots from text or images) and Pikaframes (multi‑keyframe interpolation) over a turnkey HTTPS interface pika api launch. The Fal side handles scaling and GPU infra, so developers can drop AI video into products with a few lines of code instead of running their own diffusion servers fal blog.
The API supports the signature 1080p, cinematic 2.2 generation that creators have been using in the web app, as well as scene‑based storyboards and frame‑accurate loops api promo. For teams already orchestrating image models like NB Pro or LongCat, this gives a clear way to bolt on video: treat Pika as an external rendering microservice behind your own planning/asset pipeline, rather than stuffing everything into a single model.
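Calling it from an app is a few lines with Fal's Python client; note that the endpoint id, argument names, and response shape below are assumptions to verify against Fal's model page rather than a copy of the official example.

```python
# Hedged sketch of calling a Fal-hosted Pika 2.2 endpoint with the fal_client SDK.
# The endpoint id, argument names, and response shape are assumptions to verify
# against Fal's model page (Pikascenes and Pikaframes expose different endpoints).
import fal_client

result = fal_client.subscribe(
    "fal-ai/pika/v2.2/text-to-video",   # assumed endpoint id
    arguments={
        "prompt": "slow dolly shot down a rain-soaked neon street at night, 1080p",
        "duration": 5,                   # assumed parameter name
    },
)
print(result["video"]["url"])            # assumed response shape
```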
Gemini adds NB Pro-powered image resize flow in Thinking mode
Google’s Gemini app and web now expose a straight‑through image resize workflow: upload an image, choose the "Thinking" model (which maps to Nano Banana Pro for vision), and specify a target aspect ratio to get a resized output resize how-to thread. The process is presented as a 4‑step recipe—open Gemini, upload, switch to Thinking, define aspect ratio—turning what used to be a manual Photoshop job into a promptable tool that fits inside chat gemini app link.
For anyone wiring NB Pro into creative stacks, this shows how to front-end it as a utility: hide the model choice behind a mode name, constrain the task to a tight schema (aspect ratio in, image out), and let the LLM handle resize plus light retouching. It’s a small but telling example of how "general" models become specific tools once they’re wrapped in opinionated UI and simple instructions.
NB Pro’s HTML→UI experiment exposes strengths and gaps in layout fidelity
One builder fed the raw HTML/CSS from their personal blog into Nano Banana Pro with a prompt to "render this as an old skeuomorphic iOS 6 app" and compared the output to a real Safari screenshot html render comparison. The model nailed the basic hierarchy—header, post cards, footer, avatar—but hallucinated text and introduced layout quirks, highlighting that it understands structure and style references much better than exact copy or pixel-perfect spacing.

For engineers, this is a useful sanity check: NB Pro can conceptually derender and restyle UIs, which is great for mockups and mood boards, but it’s nowhere near a deterministic renderer. If you’re thinking about HTML→image review tools or "show me this code as a mobile app" features, you’ll still need a traditional rendering engine in the loop or a diff‑aware validator on top of the images.
📚 New papers: unified multimodal, realism rewards, agentic video loops
A dense set of fresh preprints: EMMA’s unified multimodal stack, RealGen’s detector‑guided realism, alignment‑free animation, motion/3D control, iterative video evidence seeking, self‑improving VLM judges, and AI–human co‑improvement.
Active Video Perception frames long‑video QA as plan→observe→reflect loops
Active Video Perception (AVP) treats long‑video understanding as an active process: an agent plans what to look for, selects segments to inspect, then reflects on whether it has enough evidence to answer a query before deciding what to watch next. paper tweet
On five LVU benchmarks, AVP reportedly gains about 5.7 percentage points in accuracy over strong baselines while using ~12–18% fewer tokens and shorter inference time by skipping irrelevant frames. paper card If you’re building video QA or monitoring agents, the paper is a concrete blueprint for wrapping a reasoning loop around existing vision‑language models instead of force‑feeding them entire hour‑long clips.
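The loop itself is straightforward to mimic around whatever VLM you already run; this conceptual sketch (not the authors' code) shows the plan, observe, reflect skeleton, with `vlm` and `sample_frames` as placeholder hooks.

```python
# Conceptual sketch (not the authors' code) of the plan -> observe -> reflect
# loop: the agent proposes a segment to inspect, a VLM describes just that slice,
# and a reflection step decides whether the evidence is sufficient. `vlm` and
# `sample_frames` are placeholder hooks for whatever models/tools you run.
def active_video_qa(question, video, vlm, sample_frames, max_rounds=5):
    evidence = []
    for _ in range(max_rounds):
        plan = vlm(f"Question: {question}\nEvidence so far: {evidence}\n"
                   "Which time range (start_s, end_s) should we inspect next?")
        frames = sample_frames(video, plan)                     # observe only that slice
        observation = vlm(f"Describe what these frames show relevant to: {question}",
                          images=frames)
        evidence.append((plan, observation))
        verdict = vlm(f"Question: {question}\nEvidence: {evidence}\n"
                      "Answer now, or reply CONTINUE if more viewing is needed.")
        if verdict.strip() != "CONTINUE":                       # reflect: stop when sufficient
            return verdict
    return vlm(f"Best-effort answer to '{question}' given evidence: {evidence}")
```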
EMMA proposes a single efficient stack for multimodal understanding, generation, and editing
The EMMA paper introduces a unified multimodal architecture that handles understanding, generation, and editing in one model, using a 32× compression autoencoder plus channel‑wise concatenation of visual tokens to keep token counts low. paper thread This shared‑and‑decoupled backbone, combined with a mixture‑of‑experts visual encoder, is reported to match or beat prior vision–language models on benchmarks while being much cheaper to run, which matters if you’re trying to support chat, doc QA, and image editing from the same deployment. ArXiv paper For engineers, EMMA is a concrete blueprint for building multimodal systems that don’t fork into separate understanding vs. generation models and that keep inference costs under control by aggressively compressing images before they hit the transformer.
RealGen uses detector‑guided rewards to push text‑to‑image photorealism
RealGen is a training framework that scores generated images with object and artifact detectors, then uses those detector‑guided rewards to fine‑tune text‑to‑image models toward realism instead of just CLIP similarity. paper thread The authors also introduce RealBench, an automated realism benchmark where RealGen‑tuned models reach around 0.83 on GenEval and improve human‑aligned quality metrics versus baselines at the same resolution and compute budget. paper card If you care about production‑grade images (marketing, product shots, posters), this suggests you can bolt a detector‑based reward layer onto existing diffusion models to get more believable lighting, textures, and faces without hand‑curating huge new datasets.
Self‑Improving VLM Judges train themselves without human labels
The Self‑Improving VLM Judges paper tackles a painful bottleneck—human‑labeled judgments for multimodal evals—by letting a visual‑language model iteratively refine itself as a judge without any gold labels. paper link It bootstraps from a small seed of heuristic signals, then repeatedly has the judge critique and contrast model outputs, using those preferences to update its own parameters and raise consistency and correlation with human preferences over time. paper card For teams running large evaluation farms on images or screenshots, this is a promising direction: instead of hiring armies of labelers, you can invest once in a decent judge and then let it self‑train into a more reliable rater.
EditThinker wraps existing image editors with an iterative reasoning layer
EditThinker is a model‑agnostic "thinking" layer that sits on top of any instruction‑based image editor and iteratively critiques and rewrites the edit prompts until the outputs match the user request more closely. paper mention

It learns this behavior by imitating conversations from a stronger teacher editor and then using reward‑style signals (instruction match, visual quality) to fine‑tune its prompt rewriting policy, boosting scores on tough editing benchmarks like GEoIT‑Bench and RISE without touching the base editor weights. paper card For practitioners, it’s a pattern you can copy: instead of trying to fix every failure mode inside the image model, add a lightweight reasoning wrapper that can spot bad edits and ask the same model to try again with a sharper instruction.
MotionV2V edits motion inside videos while keeping appearance fixed
MotionV2V is a video‑to‑video framework that targets motion separately from appearance, so you can keep objects, people, and backgrounds intact while changing trajectories, speeds, or movement patterns in an existing clip. paper tweet
Technically, it learns a motion representation over the latent space and then applies edited motion fields back onto the original appearance, instead of re‑synthesizing the whole frame stack. paper card This is the sort of tool you’d use to tweak camera moves or character walks in post without re‑rendering or re‑shooting, and it points toward more parametric control over "how things move" in generative video editors.
One‑to‑All Animation enables alignment‑free character animation and pose transfer
One‑to‑All Animation reframes character animation and image pose transfer as an outpainting problem, letting a single reference image drive many poses without any keypoint or skeleton alignment between source and targets. paper tweet
It trains the model to iteratively extend and transform a reference while preserving identity and layout, so you can feed in a static character and get consistent motion across complex sequences. paper card For game and VFX teams this looks like a way to replace brittle pose‑keypoint pipelines with a learned animator that can adapt to arbitrary layouts and styles from minimal artist input.
SpaceControl adds test‑time spatial constraints to 3D generative models
SpaceControl proposes a way to steer 3D generative models at inference time using explicit spatial constraints—like bounding boxes, target layouts, or distance fields—without retraining the underlying model. paper tweet
The method injects these constraints into the sampling process so you can, for example, ensure objects don’t intersect or enforce scene structure, while leaving the learned appearance priors untouched. paper card For anyone experimenting with 3D generative scenes or assets, this is a recipe for getting CAD‑ or game‑ready structure out of a general 3D model instead of fighting with pure prompt engineering.
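Guidance‑style constraint injection at sampling time is a well‑known trick, and a minimal sketch of the general idea (not SpaceControl's exact formulation) looks like this: a differentiable penalty nudges each denoising step toward the spatial constraints. The scheduler and model interfaces below are generic placeholders.

```python
import torch

# Generic constraint-guided sampling sketch (not SpaceControl's exact method).
# Assumptions: `model` predicts noise for a 3D latent, `scheduler` exposes generic
# predict_x0/step helpers, and `constraint_penalty` is a differentiable scalar that
# is zero when the spatial constraints (boxes, no-intersection, layout) hold.

def guided_sample(model, scheduler, shape, constraint_penalty, guidance_scale=1.0):
    x = torch.randn(shape)
    for t in scheduler.timesteps:
        with torch.enable_grad():
            x = x.detach().requires_grad_(True)
            eps = model(x, t)
            x0_hat = scheduler.predict_x0(x, eps, t)     # rough clean estimate
            penalty = constraint_penalty(x0_hat)         # scalar, e.g. box overlap volume
            grad = torch.autograd.grad(penalty, x)[0]
        # Standard denoising step, then push against the constraint gradient.
        x = scheduler.step(eps, t, x.detach()) - guidance_scale * grad
    return x
```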
TwinFlow pushes large diffusion models toward one‑step generation
TwinFlow introduces a self‑adversarial flow‑matching scheme that lets large diffusion models generate images in a single function evaluation, targeting the holy grail of "one‑step" generation without a separate teacher model. paper link Instead of distilling from a fixed teacher, the model learns paired forward and backward flows that adversarially align, leading to competitive quality (around 0.83 on GenEval) at 1 NFE compared to many‑step baselines. paper card This is directly relevant if you’re chasing ultra‑low‑latency image or video pipelines where 20–50 denoising steps are too slow for interactive products.
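The practical payoff shows up in the sampling loop: a one‑step model collapses the usual iterative denoising into a single forward pass. A schematic comparison with generic names (TwinFlow's training is what makes the one‑step call viable; this is not its code):

```python
import torch

# Schematic latency comparison between multi-step and one-step sampling
# (generic scheduler/model interfaces, for illustration only).

def sample_multistep(model, scheduler, z, steps=50):
    for t in scheduler.timesteps[:steps]:      # 50 network evaluations
        z = scheduler.step(model(z, t), t, z)
    return z

def sample_onestep(model, z):
    # A single function evaluation (1 NFE): t=1 denotes the full jump
    # from pure noise to the final image latent.
    return model(z, torch.ones(z.shape[0]))
```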
🎙️ Realtime voice and music agents
Voice pipelines in the wild: Lyria Camera (Gemini + Lyria RealTime) for scene‑to‑music, ElevenLabs’ Santa agent and music set, and user praise for Gemini Live’s on‑screen guidance.
Lyria Camera turns your phone into a real-time soundtrack generator
Google DeepMind released Lyria Camera, an app where Gemini describes what your camera sees while the Lyria RealTime model turns those descriptions into a continuously evolving stream of music, effectively making your phone an adaptive musical instrument for everyday scenes and travel. Lyria camera launch
For builders, the same Lyria RealTime API is now exposed in Google AI Studio so you can stream music generation over time and drive it with multimodal prompts like live camera input, screen sharing, or other visual feeds. Lyria api thread You can see how they combine "multimodal prompting" (Gemini generates textual music descriptors from visuals) with continuous control of musical style and intensity over websockets in the product write‑up. Lyria blog post This makes it practical to prototype things like dynamic game soundtracks, screen-scored productivity sessions, or location-aware ambient apps without building your own music model or audio streaming stack.
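If you want to prototype that loop yourself, the orchestration is roughly "capture a frame, describe it musically, re‑steer the stream." The sketch below uses placeholder client and session objects rather than the real Lyria RealTime SDK calls, so treat it as the shape of the loop and check the Google AI Studio docs for the actual streaming API.

```python
import time

# Illustrative "camera -> description -> realtime music" loop.
# `camera`, `vision_model`, and `music_session` are placeholders, not the
# actual Lyria RealTime / Gemini SDK objects.

def scene_to_music(camera, vision_model, music_session, interval_s=2.0):
    while True:
        frame = camera.capture()
        # Gemini-style step: turn the frame into short musical descriptors.
        descriptor = vision_model.describe(
            frame,
            prompt="Describe this scene as music: mood, tempo, instrumentation, intensity.",
        )
        # Lyria RealTime-style step: steer the ongoing audio stream with the new prompt.
        music_session.update_prompt(descriptor)
        time.sleep(interval_s)  # re-steer every couple of seconds, not every frame
```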
ElevenLabs ships real-time Santa voice agent plus AI Christmas music
ElevenLabs rolled out a real-time Santa voice agent built on its Agents Platform and Scribe v2, letting people talk to “Santa” with low-latency dialogue that stays in character for full conversations. Santa launch thread
Alongside the agent they generated a whole set of Christmas music with ElevenMusic, mixing traditional carols and new compositions that creators can drop into seasonal content or experiences. Christmas music set You can also feed a photo into their Image & Video pipeline to get a lip-synced Santa video greeting, effectively turning the same voice tech into a turnkey personalized video card generator. Santa video feature Developers and marketers get a nice template here: one character agent, one themed music pack, and a simple video template, all powered by the same realtime TTS stack. (music collection page, Santa greeting page)
Builders lean on Gemini Live’s new on-screen visual guidance
A power user called Gemini Live “one of my most used AI tools” and highlighted that it now adds on-screen visual guidance during tasks, not just voice chat, which makes it feel more like a real assistant walking you through hands-on steps. Gemini live praise
Following up on the earlier web "share screen for live translation" entry point (initial launch), this update shows Google steadily turning Gemini Live into a multimodal coach that can see what you’re doing and overlay instructions or highlights on top. That matters for anyone building training, repair, or how‑to flows, because it’s a concrete signal that users value voice plus visual scaffolding over pure chat. If you’re choosing where to prototype guided workflows, this is a good data point that real-time, on-device style guidance is resonating with early adopters.
Pipecat 0.0.97 tightens voice agent core and adds Gradium models
Pipecat released v0.0.97 with first‑class support for Gradium’s new speech-to-text and text-to-speech models, giving voice agent builders another high-quality, low-latency option that inherits a lot of the neural codec and speech–language work from Kyutai’s Moshi. Pipecat release notes That makes it easier to swap in experimental or research‑grade speech stacks while keeping the same Pipecat conversation loop.
Under the hood they also kept iterating on the core text aggregation and interruption-handling classes so different models’ streaming quirks can be tuned without wrecking latency, and they moved further toward full support for reasoning models in voice pipelines (threading thought tokens into LLMContext and handling parallel tool calls). Moshi paper The Smart Turn detector now defaults to v3.1 and uses the full utterance instead of fragments, which should give more robust turn-taking for noisy real-world calls. Smart turn repo If you’re building multi-model or “think fast, think slow” voice agents, this release is a nudge to centralize your orchestration logic in something like Pipecat instead of hand-rolling WebRTC and timing code.
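To ground what "centralize your orchestration logic" means in practice, here is a minimal Pipecat‑style skeleton. The Pipeline/Runner/Task classes are Pipecat's core building blocks; the Gradium speech services, LLM, transport, and context aggregator are left as parameters because their exact class names aren't spelled out in the release notes quoted here.

```python
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

# Minimal voice-agent skeleton; concrete services are assumptions to be wired in
# from the pipecat-ai docs (e.g., the new Gradium STT/TTS services in v0.0.97).

async def run_agent(transport, gradium_stt, llm, gradium_tts, context_aggregator):
    pipeline = Pipeline([
        transport.input(),               # user audio in (WebRTC, phone, etc.)
        gradium_stt,                     # speech -> text
        context_aggregator.user(),       # aggregate user turns into LLM context
        llm,                             # reasoning / response generation
        gradium_tts,                     # text -> speech
        transport.output(),              # agent audio out
        context_aggregator.assistant(),  # record assistant turns
    ])
    await PipelineRunner().run(PipelineTask(pipeline))

# asyncio.run(run_agent(...)) once the concrete services are constructed.
```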
🦾 Embodied AI in production: farm autonomy and mass humanoids
Field updates from China dominate: an electric autonomous tractor with centimeter accuracy, 5k humanoids in mass production, rural delivery carts, and a policy push toward embodied AI. Research agents (e.g., SIMA) are not the focus here today.
China doubles down on embodied AI with provincial pilots and big funds
Beijing is formalizing embodied AI—robots that see, reason and act in the physical world—as a national priority, pushing it beyond chat apps into factories, logistics, vehicles and service work. policy overview Wealthy provinces and municipalities like Beijing, Shanghai, Guangdong, Zhejiang and Hubei are being steered to specialize in different layers (AI chips, sensors, humanoids, smart vehicles) under a "pilot first, scale later" playbook backed by funds such as a 100B RMB pool in Beijing and 560M RMB in Shanghai. policy overview

The strategy is explicit: raise productivity, offset labor shortages from an aging workforce, build more autonomous weapons, and export embodied‑AI hardware as a dependency for other countries. policy analysis For robotics leads and infra planners, this means a wave of state‑backed demand for perception stacks, low‑cost actuation, training data from real factories, and integration talent—plus stronger Chinese competition in everything from humanoids to smart tractors and warehouse fleets.
AgiBot reaches 5,000 humanoids in mass production with shared control stack
AgiBot says it has now produced 5,000 humanoid robots across its A, X and G series—1,742 A‑series, 1,846 X‑series and 1,412 G‑series units—covering reception, exhibition, entertainment, service, and heavy industrial roles. humanoid milestone All of them run on a single embodied intelligence stack, so control updates, safety patches and even custom "Xiaoming"‑style personalities can be rolled out across the fleet.
The company also highlights its Lingchuang motion‑capture system for imitation learning, which lets operators demonstrate new motions that are then learned and pushed to other units. humanoid milestone For people building control algorithms, tooling and safety frameworks, this is a rare look at humanoids actually crossing into mass production, where versioning, remote updates and behavior consistency across thousands of units become first‑order engineering problems rather than research curiosities.
Honghu T70 electric tractor shows 6‑hour, ±2.5 cm autonomous farm work
China’s Honghu T70 is now running fully electric, self‑driving field operations, handling ploughing, seeding, spraying and harvesting for up to six hours per charge with roughly ±2.5 cm guidance accuracy. tractor overview That turns tractor driving into supervising a small robot fleet from a tablet instead of sitting in the cab.
For robotics and autonomy engineers, the interesting bits are the stack: centimeter‑level satellite guidance plus local sensors, persistent logging of soil, moisture and crop data, and an all‑electric drivetrain that ties into local grid or renewables. tractor overview It’s a concrete example of embodied AI moving past pilots into everyday agricultural workflows, and a hint that future optimization work will focus as much on uptime, fleet management and agronomic data services as on pure navigation quality.
Autonomous delivery carts handle grocery routes in rural China
In rural parts of China, small self‑driving delivery vehicles are now running regular grocery routes along village roads, autonomously navigating to local shops to drop off daily goods. delivery overview The carts handle slow mixed‑traffic environments and last‑meter handoff to shopkeepers, turning what used to be manual van runs into a low‑touch, rolling inventory system.
For embodied‑AI builders, this is a concrete deployment pattern: low‑speed, geofenced robots with decent perception and routing, but very high uptime and tight retail integration requirements. It’s also a reminder that a lot of near‑term opportunity is in these "boring" logistics lanes—route planning, remote monitoring, tamper detection and integration with POS/ERP systems—rather than only in flashy urban sidewalk bots.