Inference & Infrastructure — Explore AI Tools & Stories

Fresh stories

New

Flash-MoE claims Qwen3.5-397B runs on iPhone at 0.6 tokens/sec via SSD streaming

Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.

Inference Optimization·1d ago

Breaking

Miles adds ROCm support on AMD Instinct and raises AIME to 0.729

Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including an AIME score that rose from 0.665 to 0.729. It matters if you are evaluating rollout-heavy RL jobs on non-NVIDIA hardware and want concrete throughput and step-time numbers before porting.

Reinforcement Learning·1d ago·3 min read

Top stories this week

Breaking

Flash-MoE claims 4.4 tokens/sec on Qwen3.5-397B on 48GB M3 Max

A pure C and Metal engine streams 209GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.

LLM Serving·2d ago·2 min read

New

OpenAI adds container pools to Responses API for 10x faster agent spin-up

OpenAI says Responses API requests can reuse warm containers for skills, shell, and code interpreter, cutting startup times by about 10x. Faster execution matters more now that Codex is spreading to free users, students, and subagent-heavy workflows.

Release·Codex·3d ago

New

Unsloth releases Studio: local training UI for 500+ models with 70% less VRAM

Unsloth Studio launched as an open-source web UI to run, fine-tune, compare, and export local models, with file-to-dataset workflows and sandboxed code execution. Try it if you want to move prototype training and evaluation off cloud notebooks and onto local or rented boxes.

Release·Developer Experience·1w ago

New

Hao AI Lab launches Dreamverse: 30s 1080p video in 4.5s on one GPU

Dreamverse paired Hao AI Lab's FastVideo stack with an interface for editing video scenes in a faster-than-playback loop, using quantization and fused kernels to keep latency below viewing time. The stack is interesting if you are building real-time multimodal generation or multi-user video serving.

Release·Realtime AI·1w ago

New

Morph launches FlashCompact: 33k tok/s compaction from 200k to 50k in 1.5s

Morph released FlashCompact, a specialized compaction model and SDK for coding agents, claiming 33k tokens per second and near-invisible long-context compression. Use it or copy the approach if compaction latency and noisy tool output are blocking longer agent runs.

Release·Context Engineering·1w ago

New

Together releases Mamba-3 with MIMO decoding and the fastest prefill plus decode at 1.5B

Together introduced Mamba-3 and open-sourced kernels for a new MIMO state-space variant that targets decode efficiency and beats Mamba-2, GDN, and Llama 3.2 1B at 1.5B scale. Test it when deployment speed matters more than chasing another generic Transformer baseline.

Release·Inference Optimization·1w ago
