Qwen3-VL 4B/8B launch FP8 builds that approach 72B performance
Executive Summary
Alibaba just shrunk its flagship-style VLM into laptop-friendly sizes: Qwen3-VL arrives in 4B and 8B "Instruct" and "Thinking" variants with FP8 builds and low-VRAM footprints. Why this matters: the team claims these compact models beat Gemini 2.5 Flash Lite and GPT-5 Nano across STEM, VQA, OCR, video, and agent tasks, and often approach Qwen2.5-VL-72B quality, bringing near-flagship vision to everyday hardware without a data-center bill.
The rollout nails developer ergonomics. Day-0 support in MLX-VLM and LM Studio means Apple silicon users can run dense and MoE variants locally on NPU/GPU/CPU, while vLLM adds a production-grade serving path with strong JSON-structured outputs and high throughput, according to early adopters. The 8B variants are already battling on LMArena in both Text and Vision modes, a timely public read after Qwen's strong visual standings last week. Cookbooks ship with task-level recipes for OCR, grounding, multi-image/video, and agent loops, and TRL notebooks show SFT/GRPO fine-tuning on the 4B model in a free Colab, plus Hugging Face Spaces and a lightweight in-browser app to poke the models fast. Kaggle entries round out quick benchmarking.
If you need a bigger yardstick, Ollama Cloud now offers Qwen3-VL-235B for free, making it easy to compare footprint vs capability before standardizing on a tier.
Feature Spotlight
Feature: Qwen3-VL goes compact (4B/8B) with near-flagship VLM
Qwen3-VL 4B/8B deliver near-flagship multimodal ability at a fraction of the VRAM, with FP8, cookbooks, and day-0 ecosystem support, pushing serious VLMs onto laptops/phones and enabling agentic vision at edge costs.
Cross-account launch dominates today: Alibaba's Qwen3-VL 4B/8B (Instruct/Thinking) lands with low VRAM, FP8, cookbooks, MLX/LM Studio support, and leaderboard exposure; mostly model release + developer enablement coverage.
🧩 Feature: Qwen3-VL goes compact (4B/8B) with near-flagship VLM
Cross-account launch dominates today: Alibaba's Qwen3-VL 4B/8B (Instruct/Thinking) lands with low VRAM, FP8, cookbooks, MLX/LM Studio support, and leaderboard exposure; mostly model release + developer enablement coverage.
Qwen3-VL 4B/8B launch: compact, FP8-ready models rival larger VLMs
Alibaba released Qwen3-VL in 4B and 8B "Instruct" and "Thinking" variants that run in lower VRAM while retaining full capabilities, with FP8 builds for efficient deployment. The team claims the compact models beat Gemini 2.5 Flash Lite and GPT-5 Nano across STEM, VQA, OCR, video and agent tasks, and often approach Qwen2.5-VL-72B performance release thread.
Hugging Face and ModelScope collections, API docs, and cookbooks are live for immediate use Hugging Face collection, ModelScope collection, Thinking API docs, Instruct API docs, Cookbooks. A Kaggle entry rounds out access for benchmarking and demos Kaggle models.
Cookbooks, TRL notebooks and Spaces shorten time-to-value for Qwen3-VL
Developer cookbooks cover OCR, object grounding, multi-image/video, and agent workflows, with the GitHub README pointing straight to task templates cookbook note, GitHub readme. Community notebooks show day-0 fine-tuning (SFT/GRPO) on the 4B model in free Colab, plus a Hugging Face Space to compare against moondream3 with object detection hf fine-tune, Colab notebook, HF Space demo. A lightweight Hugging Face app for the 4B Instruct model offers a quick way to poke the compact VLM in-browser hf app.
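As a rough sketch of what such a day-0 SFT run looks like with TRL (the checkpoint id Qwen/Qwen3-VL-4B-Instruct is the released one; the example dataset and hyperparameters are assumptions, not the notebook's contents):

```python
# Hedged SFT sketch with TRL; a starting point, not the linked notebook's code.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any vision-chat dataset in TRL's "messages" format works; this id is a
# placeholder drawn from TRL's VLM examples.
dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3-VL-4B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-vl-4b-sft",
        per_device_train_batch_size=1,   # sized for a free Colab GPU
        gradient_accumulation_steps=8,
        bf16=True,
    ),
)
trainer.train()
```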
Day-0 Apple silicon path: MLX-VLM and LM Studio run Qwen3-VL locally
Qwen3-VL (Dense & MoE) is now supported in MLX-VLM, with maintainers noting optional devices (MLX-CUDA, CPU) and on-device readiness; community reports LM Studio + MLX runs on Apple silicon as well mlx release, mac usage. Alibaba amplified the "day-0 on NPU, GPU, CPU" positioning, signaling a practical local dev loop for Mac users npu support.
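For a feel of the local loop, a minimal MLX-VLM sketch; the 4-bit community conversion id and the exact generate() keywords are assumptions that may vary across mlx-vlm versions:

```python
# Minimal local inference sketch on Apple silicon with MLX-VLM.
from mlx_vlm import load, generate

# Assumption: a community MLX conversion exists under this id.
model, processor = load("mlx-community/Qwen3-VL-4B-Instruct-4bit")
out = generate(
    model,
    processor,
    prompt="Read this receipt and extract the total.",
    image="receipt.png",   # local path or URL
    max_tokens=256,
)
print(out)
```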
vLLM adds Qwen3-VL, signaling production-grade serving path
vLLM highlighted Qwen3-VL as one of its most popular multimodal models, making the compact VLMs easy to serve at scale in existing Python inference stacks vLLM note. Early adopters also point out strong JSON-structured outputs and high throughput, useful for API-style multimodal pipelines json claim.
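A sketch of that serving path, using vLLM's OpenAI-compatible server and its guided_json structured-output extension (the model id, image URL, and schema here are assumptions):

```python
# Start the server first (shell):  vllm serve Qwen/Qwen3-VL-8B-Instruct
# Then constrain outputs to a JSON schema via vLLM's guided_json extension.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
schema = {
    "type": "object",
    "properties": {"store": {"type": "string"}, "total": {"type": "number"}},
    "required": ["store", "total"],
}
resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
            {"type": "text", "text": "Extract the store name and total as JSON."},
        ],
    }],
    extra_body={"guided_json": schema},   # vLLM-specific structured output
)
print(resp.choices[0].message.content)
```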
Qwen3-VL-235B free on Ollama Cloud complements compact family
Ollama made the 235B Qwen3-VL available free on its cloud with ready-to-copy examples (OCR on menus, math with image context), and says smaller models and full on-device runs are coming soon cloud launch, Ollama blog. This provides a large-model counterpart to Alibaba's new compact 4B/8B releases for teams testing capability vs. footprint trade-offs.
Qwen3-VL-8B joins LMArena's Text & Vision battles
The 8B Thinking and Instruct variants are now live in LMArena's Text and Vision modes for head-to-head prompting and community voting arena entry, arena listing. This comes after Qwen3-VL's strong standing on visual tracks, providing a lighter option for side-by-side comparisons Arena standing.
Search models: GPT-5 web stack lands in the API
Fresh model release focused on web search in dev workflows; excludes the Qwen3-VL compact launch, which is covered as the feature.
OpenAI ships gpt-5-search-api: 60% cheaper web search in Chat Completions with domain filtering
OpenAI added a dedicated web search model (gpt-5-search-api) to Chat Completions, priced at $10 per 1,000 calls, about 60% lower than before, and it supports domain filtering like the Responses tool pricing update. The updated docs outline three modes (non-reasoning lookups, agentic search with reasoning, and deep research) with copy-paste snippets for JS, Python, Bash and C# OpenAI docs.
Engineers are already seeing both the undated and dated variants ("gpt-5-search-api-2025-10-14") in the platform model selector, confirming availability beyond docs model list screenshot. Third-party trackers also flagged the new IDs surfacing across the ecosystem, reinforcing the rollout signal model finder alert.
- Endpoint parity means domain allowlists/blocks now work the same in Chat Completions as in Responses, simplifying agent plumbing for grounded search.
- The 60% price drop encourages broader use of agentic and deep-research traces; teams should still watch token budgets and throttle strategies in long sessions.
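A hedged request sketch: the model id matches the announcement, and web_search_options is the field earlier search-preview models use in Chat Completions; the filters shape mirrors the Responses tool and is an assumption until the docs confirm it:

```python
# Hedged sketch of a Chat Completions call to the new search model.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5-search-api",
    web_search_options={
        # Assumed to mirror the Responses web_search tool's domain filters.
        "filters": {"allowed_domains": ["openai.com", "arxiv.org"]},
    },
    messages=[{"role": "user", "content": "Summarize today's OpenAI API pricing changes."}],
)
print(resp.choices[0].message.content)
```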
🛠️ Agentic coding: Codex, sub-agents, and end-to-end delegation
Heavy practitioner activity today: Codex CLI education, plan-mode previews, Claude Code sub-agents, Cursor long-run build, and Factory 1.8 integrations.
Claude Code sub-agents emerge as a best practice for deep repository work
Practitioners report big quality and speed wins by orchestrating Claude Code sub-agents (repo mappers, analyzers, API readers, writers) in parallel, then merging their outputs, outperforming single-loop agents on codebase mapping and authoring subagent demo. The broader "Deep Agents" pattern emphasizes structured planning, orchestrator + specialist sub-agents, agentic search, hybrid memory, and verification to tackle multi-hour tasks design blueprint.
Takeaway: split concerns, keep each agent's context clean, and compose the results, especially for docs generation, migrations, and new feature scaffolding.
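In code terms, the pattern is fan-out/fan-in; run_subagent below is a hypothetical stand-in for however you spawn a Claude Code sub-agent (Task tool, SDK session, or CLI process), not a real API:

```python
# Illustrative fan-out/fan-in orchestration; not Claude Code's actual API.
import asyncio

async def run_subagent(role: str, prompt: str) -> str:
    # Hypothetical: replace with a real sub-agent spawn (Task tool, SDK, CLI).
    await asyncio.sleep(0)
    return f"[{role}] findings for: {prompt}"

async def map_repo(repo: str) -> str:
    specialists = {
        "mapper": f"Build a file/module map of {repo}.",
        "analyzer": f"Trace the core data flows in {repo}.",
        "api_reader": f"List public APIs and contracts in {repo}.",
    }
    # Each specialist runs in parallel with its own clean context.
    reports = await asyncio.gather(
        *(run_subagent(role, task) for role, task in specialists.items())
    )
    # A writer agent merges the specialist outputs into one artifact.
    return await run_subagent("writer", "Merge these reports:\n" + "\n".join(reports))

print(asyncio.run(map_repo("acme/backend")))
```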
Factory 1.8: delegate Linear tickets, spin headless "droid exec", and close incidents via Sentry×Linear
Factory's 1.8 release plugs Droids into more of the dev stack: assign Linear issues for end-to-end completion release thread, mention @Factory in Slack threads to spin up a headless session and track progress Slack hand-off, run non-interactive batch jobs with droid exec for large refactors/tests headless mode, and loop incidents through Sentry with fixes handed back into Linear incident flow. Release notes outline autonomy controls and output formats for CI/CD use release notes. A just-in-time permissions prompt in the terminal tightens safety for file edits and reads permission UI.
Net: Factory is evolving from a chat agent to a delegatable dev assistant embedded across planning, code, and on-call.
OpenAI kicks off Codex video series and CLI how-to for GPT-5-Codex
OpenAI Developers launched a multi-part video series to help teams get productive with the Codex CLI and GPT-5-Codex, including install and usage patterns (npm i -g @openai/codex) series intro, with a mirrored post reiterating the CLI workflow install recap. For broader context on how Codex is used across real products and events, see OpenAI's behind-the-scenes write-up on using Codex to run DevDay demos and tooling DevDay blog, and the central docs hub for deeper guides and examples Codex hub.
Anthropic's code_execution tool lands in the AI SDK with typed I/O for bash and text edits
A new Anthropic code_execution_20250825 tool with bash execution and text editing is being wired into the AI SDK, with fully typed inputs/outputs to simplify UI surface creation and safer agent loops feature brief. A companion example shows the typed schema intended to streamline frontends and logging typed I/O.
Implication: agent builders can give Claude more reliable, auditable action surfaces for repo ops, scripts, and patching, without bespoke adapters.
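The same tool is also callable from Anthropic's own Python SDK; a sketch follows, where the beta flag and tool strings follow Anthropic's code-execution docs as I read them (treat the exact strings as assumptions if your SDK version differs):

```python
# Sketch: invoking the upgraded code-execution tool via Anthropic's Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
resp = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    betas=["code-execution-2025-08-25"],  # beta header for the 20250825 tool
    tools=[{"type": "code_execution_20250825", "name": "code_execution"}],
    messages=[{"role": "user", "content": "List the files in the sandbox and summarize them."}],
)
print(resp.content)
```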
Codex "Plan Mode" preview surfaces with iterative planning and planner model toggle
A Plan Mode PR shows read-only, iterative planning (Shift+Tab), a configurable planner model (/plan-model), and checklisted plan updates, hinting at a first-class planning pass before code edits PR details. Although that PR was later closed without merge, maintainers signal Plan Mode is still coming PR status note. This lands as teams double down on planning-first loops, following up on plan mode adoption in Cursor that cut back-and-forth and spawned background jobs.
Expectations: a dedicated planner improves long-horizon tasks (refactors, migrations) and lets orgs standardize plan quality independently of the coding model.
Codex CLI v0.45 trims tokens by ~10% at equal quality; new usage videos share workflows
OpenAI engineers said Codex CLI v0.45 made interactions ~10% more token-efficient at the same result quality efficiency note. A senior IC also shared a personal walkthrough on day-to-day usage workflow video, and the DevDay write-up details broader Codex applications across stage demos and app development DevDay blog. Teams shipping with Codex can expect lower cost per loop and clearer patterns for multi-tab, multi-task workflows.
Cursor agent runs 24 hours straight to ship a working project management app
A developer ran Cursor's agent in a continuous loop with Sonnet 4.5 (CLI), minimal seed context, and autonomous progress tracking until "completed," yielding a functional PM app and a surprisingly navigable code structure overnight run, with notes on minimal upfront scope and what improved with a bit more initial context run notes. Screens show the resulting UI and repo layout app snapshot.
Why it matters: end-to-end feasibility for greenfield apps is crossing from demo to practice; the next frontier is hardening (tests, security, and production CI/CD).
Vercel ships a Slack Agent Template and Bolt library to build agents inside Slack
Vercel demonstrated v0 running directly in Slack at Dreamforce and released a Slack Bolt library plus a starter template, letting teams query data, build dashboards, and ship agent actions without leaving chat stage demo, with a template to scaffold Slack agents quickly agent starter.
This reduces "glue code" to bring LLM agents where work already happens (channels, threads, and DMs), with a path to production deploys on Vercel.
- Resources: Slack Bolt library and Agent template.
Braintrust adds remote evals so you can benchmark local agents without moving workloads
Braintrust showed how to hook locally running agents to remote evaluations, so teams can iterate on tasks, datasets, and scorers without redeploying infrastructure how-to guide. This is useful when you want reproducible evals (goldens, regressions) while keeping heavy data and tooling on your box or VPC.
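The general shape of such an eval, sketched with Braintrust's Python SDK; the project name, dataset, and scorer are placeholders, and the remote-eval wiring itself follows the linked guide rather than this snippet:

```python
# Minimal Braintrust eval sketch: task() stands in for your locally running agent.
from braintrust import Eval

def local_agent(question: str) -> str:
    return "42"  # call your on-box agent or tool stack here

def exact_match(input, output, expected):
    # Custom scorer: 1.0 when the agent output matches the golden answer.
    return 1.0 if output == expected else 0.0

Eval(
    "my-agent-project",   # placeholder project name
    data=lambda: [{"input": "What is 6*7?", "expected": "42"}],
    task=local_agent,
    scores=[exact_match],
)
```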
Vercel AI SDK adds Anthropic memory tool integration for agent state management
Vercel noted it is shipping support for Anthropic's memory tool in the AI SDK, making it easier to persist, inspect, and version agent memories in product UIs memory support. For coding agents, this reduces prompt bloat and gives users recovery points when long sessions meander or break.
Serving speed: GB200 tokens/sec and adaptive speculation
Runtime perf stories concentrate on Blackwell throughput and spec-decoding; mostly systems knobs rather than silicon announcements.
SGLang hits ~26k input / 13k output tok/s per GPU on GB200 NVL72
LMSYS reports SGLang sustaining ~26K input and ~13K output tokens per GPU per second on NVIDIA GB200 NVL72, with up to 4× generation speedups over Hopper in InferenceMAX runs and SGLang set as the default DeepSeek engine on both NVIDIA and AMD. Following up on tokens per MW lead, GB200's efficiency story now comes with concrete per-GPU throughput measurements. benchmarks overview LMSYS blog post
These results reflect joint system-level optimizations (prefill/decode disaggregation, expert parallelism, FP8 attention, NVFP4 GEMMs) and Blackwell's higher-bandwidth interconnects. The adoption as the default engine in the SemiAnalysis InferenceMAX benchmark underlines runtime maturity beyond a single hardware stack. collaboration note
Together AI's ATLAS speculator delivers up to 4× faster inference and ~500 TPS on DeepSeek-V3.1
Together AI unveiled ATLAS, an adaptive speculative decoding system that learns from live traffic to accelerate inference; they show up to 4× faster throughput vs baseline and around 500 TPS on DeepSeek-V3.1, nearly 2× faster than their prior Turbo speculator. results overview ATLAS blog post
Because ATLAS adapts to workload acceptance rates and latency profiles at runtime, it avoids static speculator brittleness and continues improving under production mixes; relevant for teams pushing long-context and multi-turn agent workloads where token budgets dominate cost and latency.
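Why acceptance rate is the lever ATLAS tunes: a generic draft-and-verify toy (not Together's implementation) showing how many tokens get amortized per expensive verifier pass:

```python
# Toy speculative decoding: a drafter proposes k tokens, the target model
# verifies them; the accepted prefix plus one corrected token land per pass.
import random

def tokens_per_verify_pass(k: int, acceptance: float) -> int:
    accepted = 0
    for _ in range(k):
        if random.random() < acceptance:
            accepted += 1
        else:
            break  # first rejection truncates the speculated prefix
    return accepted + 1  # the verifier always emits one token itself

for rate in (0.5, 0.8, 0.95):
    avg = sum(tokens_per_verify_pass(8, rate) for _ in range(100_000)) / 100_000
    print(f"acceptance {rate:.2f} -> ~{avg:.2f} tokens per target-model pass")
```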
DGX Spark runtime clarified: read vs generation speeds, and where it lands vs 5090/M4 Pro
Simon Willison updated his DGX Spark notes to separate token read (prefill) from token generation speeds, highlighting early-ecosystem wrinkles on ARM64 CUDA and containers for inference workflows. blog update hands-on notes
Independent charts circulating today place Spark's output tokens/sec close to Apple's Mac Mini M4 Pro and below RTX 5090/5080 desktop GPUs; at ~$4,000, it targets small-model local serving rather than top-end desktop cards. This framing helps teams choose hardware based on actual decode throughput rather than mixed prefill+decode numbers. benchmarks chart
🧱 Desktops and racks: DGX Spark and AMD Helios
Concrete hardware updates span desktop Blackwell boxes and rack-scale MI450 systems; excludes runtime perf, which sits under Serving speed.
DGX Spark lands on desks: 1-PFLOP Blackwell box ~$4k, early buyers flag ARM64 gaps
NVIDIA's DGX Spark is now in customers' hands (Jensen Huang even hand-delivered a unit), with users calling out ~1 PFLOP in a tiny form factor and an expected ~$4,000 price bracket, following up on early runtime coverage that focused on capability demos. OpenAI's Greg Brockman highlighted the size-to-compute leap, while a community review details 128 GB RAM, ~3.7 TB NVMe, and an ARM64 stack that works best today via NVIDIA's official containers as the CUDA-on-ARM ecosystem matures hand delivery and blog post. A comparative chart situates Spark near Apple's Mac Mini M4 Pro on smaller models yet below RTX 5090-class boards for raw headroom, framing Spark as a developer desktop rather than a training rig comparison chart.
For engineers, the takeaway is clear: strong local prototyping hardware, but plan on containerized toolchains and watch for fast-moving ARM64 library support as the ecosystem catches up update note.
Oracle to deploy 50,000 AMD MI450 GPUs on OCI from Q3 2026
Oracle Cloud Infrastructure will roll out 50,000 AMD Instinct MI450 accelerators starting in Q3 2026, with further expansion in 2027+, giving enterprises a sizable non-NVIDIA public cloud option for training and inference. The deployment uses AMD's Helios rack design (MI450 + next-gen EPYC + Pensando networking), offering pre-wired rack blocks that target scale and serviceability deployment plan.
For AI platform teams, this strengthens a multi-vendor strategy: OCI's MI450 footprint could improve supply diversification and pricing leverage while testing the maturity of ROCm-based toolchains at rack scale; evaluation should include memory-bound workloads that benefit from MI450's HBM4 capacity and bandwidth.
AMD Helios racks: 72 MI450s, ~1.4 exaFLOPS FP8 and 31 TB HBM4 per cabinet
AMD's new Helios rack-scale platform packages 72 Instinct MI450 GPUs into a serviceable cabinet with quick-disconnect liquid cooling, delivering roughly 1.4 exaFLOPS FP8 and ~31 TB of HBM4 per rack. Each MI450 pairs 432 GB HBM4 with ~19.6 TB/s bandwidth, with UALink for in-node scale-up and UEC Ethernet for cross-rack scale-out; the design follows Open Rack Wide conventions to simplify deployment and field service platform details.
For leaders scoping non-NVIDIA capacity, Helios' emphasis on memory capacity per GPU, open fabrics, and rack-level serviceability is the notable angle; it positions AMD's stack as a credible alternative for large training and high-context inference footprints.
🏭️ AI factories and demand math
Macro infra signals today are investment and usage economics; few pure capex filings, but meaningful scale commitments and token-demand curves.
Hidden "thinking tokens" push usage to ~1.3 quadrillion per month even as unit prices fall
Usage is exploding while price per token drops: WSJ highlights that models' hidden reasoning traces ("thinking tokens") are swelling tokens per answer, driving spend despite cheaper rates WSJ analysis. The ecosystem view shows monthly usage jumping to roughly 1,300T tokens (~1.3 quadrillion) by Oct '25, up from ~9.7T in May, as deep inference and self-checking take hold ecosystem map.
For AI leaders, the economics bifurcate into quick vs deep inference. Capacity planning now hinges on acceptance rates for speculative decoding, tool-use loops, and explicit "thought budgets," not just list prices.
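A back-of-envelope illustration of why cheaper tokens can still mean bigger bills (the numbers are illustrative stand-ins, not WSJ's):

```python
# Illustrative only: price per token falls 60%, but hidden reasoning grows
# tokens per answer 8x, so spend per answer still rises ~3.2x.
tokens_old, price_old = 1_000, 10 / 1e6   # 1k-token answer at $10/M tokens
tokens_new, price_new = 8_000, 4 / 1e6    # 8k tokens (with thinking) at $4/M

cost_old = tokens_old * price_old
cost_new = tokens_new * price_new
print(f"${cost_old:.4f} -> ${cost_new:.4f} per answer "
      f"({cost_new / cost_old:.1f}x despite cheaper tokens)")
```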
OpenAI's compute roadmap now totals ~26 GW across Nvidia, Broadcom and AMD
OpenAI's disclosed arrangements add up to roughly 26 GW of data-center capacity equivalents: a 10 GW Broadcom custom-accelerator program, a 10 GW Nvidia systems LOI (with an anticipated ~$100B Nvidia investment as rollouts happen), and 6 GW of AMD capacity paired with a warrant for up to 160M AMD shares at $0.01 tied to deployments compute summary, following up on 10 GW term sheet.
The plan targets first racks shipping in 2H 2026 and multi-year build-outs through 2029, implying a full-stack bet (silicon, memory hierarchy, compiler/runtime) to raise perf/W and reduce cost per token at scale.
Google to build $15B AI hub in India with gigawatt-scale data center and subsea landing
Google will invest $15B from 2026-2030 to stand up its first AI hub in India (Visakhapatnam), including a gigawatt-scale AI data center built with AdaniConneX and Airtel and a new subsea cable landing that ties into Google's global network announcement recap.
This adds multi-GW-class capacity in South Asia and shortens latency paths for India's fast-growing AI workloads. The bundle (DC, subsea, and "full AI stack") signals a vertically integrated approach that can insulate critical AI services from regional power and network bottlenecks.
Oracle to deploy 50,000 AMD MI450s on OCI starting Q3 2026, expanding in 2027+
Oracle Cloud Infrastructure will add 50,000 AMD Instinct MI450 GPUs beginning in Q3 2026, with further expansion planned into 2027 and beyond deployment note. The move strengthens a non-Nvidia public-cloud lane for large-scale training and inference and offers enterprises vendor diversification during a supply-constrained cycle.
Expect customers to weigh ROCm maturity, cluster fabrics, and model portability alongside price/perf as they evaluate multi-vendor pipelines.
AMD unveils Helios rack: 72 MI450s, 31 TB HBM4 and ~1.4 exaFLOPS FP8 per cabinet
AMD's Helios rack aims to simplify serviceability at AI-factory scale: 72 MI450 GPUs per cabinet, ~31 TB of HBM4 and ~1.4 exaFLOPS FP8, with UALink for in-node GPU memory sharing, UEC Ethernet for scale-out, and quick-disconnect liquid cooling platform brief.
Helios' memory-heavy design targets throughput on memory-bound training and large-context inference while leaning on open fabrics, a useful counterweight to proprietary interconnect stacks.
Brookfield commits up to $5B to Bloom Energy to finance fuel-cell AI data centers
Infrastructure financier Brookfield Asset Management will provide up to $5B to Bloom Energy to fund fuel-cell-powered AI data centers, expanding on-site generation options beyond grid ties and traditional PPAs funding update.
Fuel cells can reduce interconnect delays, cut transmission risk, and improve siting flexibility for AI factories, though cost curves will hinge on fuel pricing and fleet-scale deployment economics.
AI is lifting GDP via investment before productivity, says WSJ
WSJ reports that AI's contribution is showing up first in capex, not broad productivity gains: roughly two-thirds of U.S. GDP growth in early 2025 came from software and compute investment, while measured worker productivity impacts remain mixed; about 10% of firms report AI use WSJ summary.
For AI planners, this implies near-term demand will keep tracking capex cycles (chips, racks, power, data) even as productivity dividends arrive more slowly and unevenly across sectors.
🛡️ Well-being guardrails, age-gating and privacy signals
Policy/safety moves dominate discourse: OpenAI's well-being council and adult-content stance; California's one-click privacy control. Excludes model launches.
OpenAI will relax mental-health refusals and allow erotica for verified adults in December
Sam Altman said ChatGPT's conservative stance on mental-health topics will be eased as new safeguards mature, and that erotica will be permitted for verified adults starting in December as age-gating rolls out broadly Altman thread. The shift emphasizes user choice ("treat adult users like adults") and optional, more expressive personalities, while maintaining crisis protections for at-risk users.
OpenAI creates Expert Council on Well-Being and AI to inform youth safeguards and product design
OpenAI introduced an eight-member Expert Council spanning psychology, psychiatry, and HCI to guide ChatGPT and Sora on healthy interactions across age groups, building on prior work like parental controls and teen distress notifications OpenAI announcement, and outlining scope and members in detail OpenAI blog post. The council is intended to continuously advise on features that affect emotions, motivation, and mental health.
California AB566 mandates one-time browser "do not sell/share" signal and statewide deletion system
California approved AB566 requiring browsers to ship a built-in privacy control that broadcasts a one-click "do not sell/share" signal, plus a statewide data-broker deletion request flow law details.
- Browser signal deadline: mandatory by Jan 2027; deletion system begins Jan 2026 with 45-day broker checks thereafter law details.
- Practical impact: cross-site ad targeting and broker feeds will face wider opt-outs by default, tightening data available to train and target AI-powered ad systems.
California signs first U.S. chatbot law requiring safeguards; enables lawsuits for harms
Governor Newsom signed SB 243, billed as the first U.S. law regulating AI chatbots, mandating operator safeguards for vulnerable users and allowing lawsuits when failures cause harm news coverage, following up on SB243 that required bots to disclose non-human status and address self-harm protocols. The signature moves the mandate from passage to enforceable law, raising compliance stakes for AI assistants deployed at scale.
🎬 Video/3D tools: Sora 2 vs Veo 3.1, deflicker, and image-3D
Large creative stack chatter today (video arena moves, deflicker tool, ComfyUI workflows, image-3D). Separate from the Qwen VLM feature.
Sora 2 Pro ties Veo 3 for #1 on Video Arena; Sora 2 (with audio) ranks #3
LMArena's organizers added OpenAI's Sora 2 and Sora 2 Pro to their Text-to-Video leaderboard, noting Sora 2 Pro is the first to tie Veo 3 variants for the top spot while Sora 2 lands at #3 and is praised for synchronized audio generation leaderboard update.
This is the first broad, head-to-head signal that Sora's audio+video pipeline is competitive with Veo 3 on overall quality, and it raises the bar for integrated sound in T2V workflows (see the invite for direct prompting and voting in Discord) Discord invite.
Gemini UI surfaces Veo 3.1 banners and a "fast" model, pointing to imminent release
Screenshots show "New! Video generation just got better with Veo 3.1" banners in Gemini with a Try Now entry point, and a Model Finder card listing veo-3.1 and veo-3.1-fast preview endpoints; one user notes availability appears limited to the U.S. for now Gemini banner, model finder card, UI screenshot. This follows yesterday's third-party endpoint sightings without Google confirmation Veo 3.1 hint.
If this ships as indicated, expect longer durations and better motion consistency to narrow gaps vs Sora 2 on multi-scene coherence (and give Gemini Studio users a native path for video+audio).
Higgsfield launches Enhancer to kill flicker; adds Sora 2 MAX/Pro MAX and a free-run promo
Higgsfield introduced Enhancer, a universal deflicker that cleans up AI-generated and legacy footage, alongside Sora 2 MAX (higher fidelity) and Sora 2 Pro MAX (multi-scene at MAX quality), plus a 1080p Upscale Preview, available with "Unlimited" usage through the end of the week release thread. Creators are already pushing teaser workflows on the platform and highlighting the unlimited window for Ultimate/Creator plans creator promo.
Image-3D in ~2-5 minutes: Hitem3D cuts modeling to ~$0.30-$1.40 per asset with watertight meshes
Hitem3D, built atop Sparc3D geometry and ULTRA3D speed, turns one or a few reference photos into studio-ready, watertight meshes (up to 1536³ detail) in about 2-3 minutes per view, with typical jobs completing in ~2-5 minutes and costing roughly $0.30-$1.40 per model feature breakdown, ArXiv paper. A follow-up thread walks through the view expansion, alignment, triangulation, and texturing path, and calls out a faces/hair variant for character-centric assets how-to thread, official note.
Runway debuts Apps for everyday video work: remove, reshoot, add dialogue, upscale to 4K, restyle
Runway released a set of five AI video Apps: Remove from Video, Reshoot Product, Add Dialogue (with tone), Upscale Video (one-click to 4K), and Change Image Style (incl. relighting/retexturing), rolling out on the web with more coming weekly and an open call for ideas apps overview, feature list. For production teams, this bundles common cleanup and repurposing tasks without leaving the browser, potentially reducing round-trips to NLEs or specialized plugins.
ComfyUI's 3-minute guide to WAN 2.2 Animate and a new character replacement workflow
For creators assembling animation pipelines in nodes, ComfyUI shared a concise WAN 2.2 Animate tutorial to get character motion running fast, plus a separate walkthrough on character replacement inside the same graph tutorial, character replacement. The team is soliciting feedback on the workflow ergonomics to refine defaults and examples feedback request.
Leaderboards and puzzlers (non-video)
Smaller eval pulse today: coding/vibe arena standings and NYT Connections deltas. Excludes Qwen leaderboard items (covered in the feature) and video arena (in Media).
DeepSeek-V3.2-Exp leads new Vibe Coding Arena
DeepSeek-V3.2-Exp now tops the BigCodeArena "vibe coding" board, reflecting execution-first, user-voted preferences; following up on execution evals that introduced the arena's run-and-judge method, today's snapshot shows a clear pecking order driven by real usage. See the board preview in leaderboard image.
- Top 3 today: DeepSeek-V3.2-Exp (~1107), DeepSeek-V3.1-Terminus (~1080.9), and qwen3-235b-a22b-instruct-2507 (~1069.8), all based on raw head-to-head user votes and execution outcomes leaderboard image.
NYT Connections update: GPT-5 Pro 83.9; DeepSeek V3.2 Exp 59.4
A fresh NYT Connections snapshot (last 100 puzzles) puts GPT-5 Pro at 83.9, with DeepSeek V3.2 Exp at 59.4; Claude Sonnet 4.5 Thinking 16K rises to 48.2 while its non-reasoning mode reaches 46.1; GLM-4.6 sits at 24.2. The update also notes GPT-5 Pro trails an earlier o3-pro marker (87.3) in this cut score rundown, and confirms the scope is the most recent 100 puzzles method note, with the benchmark and scripts available in the maintainer's repo GitHub repo.
💼 Enterprise moves: funding, assistants and commerce apps
Notable GTM signals span a large Series B, Salesforce+OpenAI/Slack integrations, a Walmart ChatGPT app, and AI-driven layoffs commentary.
Salesforce brings Agentforce 360 into ChatGPT; $100M support savings cited
Salesforce expanded its OpenAI partnership so Agentforce 360 apps can run inside ChatGPT, alongside deeper Slack and Codex tie-ins shown at Dreamforce partnership summary, integration graphic, and a Salesforce post company update. Internally, Salesforce says AI support agents are saving ~$100M annually, and Reddit reports 84% faster resolution (46% deflection, 8.9 → 1.4 min) savings stats. Full partner details are in Salesforce's release Salesforce press release, following up on Slack apps which shipped ChatGPT for Slack.
For enterprises, this tightens the loop between CRM data, Slack workflows, and ChatGPT distribution, lowering friction to deploy agentic flows where users already work.
Goldman Sachs signals job cuts as AI adoption accelerates under "OneGS 3.0"
Goldman told staff it will constrain headcount growth and make limited reductions this year as part of an AI-driven efficiency push ("OneGS 3.0"), even as overall headcount remains above 2024 levels ai plan memo, bloomberg snapshot.
For AI leaders and HR, it's a visible example of large-bank operational redesign around automation in onboarding, compliance, lending, and vendor flows, shifting workforce mix while scalability improves.
Walmart launches a shopping app inside ChatGPT
Walmart rolled out a ChatGPT app that lets users browse and buy across Walmart and Sam's Club assortments directly in ChatGPT, signaling mainstream retail's move into agentic commerce walmart details, Bloomberg report, deal roundup.
For commerce leaders, this tests conversion in conversational channels and sets a template for catalog search, bundling, and checkout flows embedded in general-purpose assistants.
Reducto raises $75M Series B, surpasses 1B pages processed
Document AI startup Reducto closed a $75M Series B led by a16z, taking total funding to $108M, and says it has now processed 1B+ pages (~6× since its Series A), with plans to double down on model research and production-grade tooling funding note.
For AI leaders, this marks accelerating enterprise demand for high-accuracy OCR+VLM pipelines in regulated workflows (charts, tables, multi-page docs) and a well-funded competitor in the document automation stack.
Slack turns Slackbot into a full AI assistant with private AWS processing
Slack is piloting a revamped Slackbot that plans meetings, summarizes threads, finds files via natural language, and coordinates calendars (Google/Outlook), with AI running in a private AWS environment and wider rollout targeted by end-2025 product brief.
This positions Slack's native assistant as an enterprise option alongside third-party bots, important for CIOs weighing data residency, privacy, and change management for knowledge-work assistants.
Vercel ships Slack Agent Template and Bolt library to build agents inside Slack
Vercel released a Slack Agent Template plus a Bolt library to quickly build, test and deploy Slack agents, complementing the growing wave of agentic enterprise workflows showcased at Dreamforce template link, slack demo.
For platform teams, this lowers the integration cost to embed RAG, approvals, and code actions directly in Slack with CI/CD-friendly scaffolding.
🧪 Reasoning, RL and long-horizon methods
Dense research day: diffusion guidance, chain robustness, dynamic context windows, memory-driven agents, overthinking control, KV compression, and AI peer review.
Deep search agents with sliding context hit ~33.5%
A sliding window that retains assistant thoughts, elides older tool dumps, and preserves the latest tool outputs lets agents reason over ~100 turns within 32k, reaching ~33.5% on a tougher multi-page benchmark, without external summarizers paper abstract.
- Training mirrors the runtime view (tool calls replaced by placeholders), followed by RL that rewards only correct final answers, stabilizing long multi-step sessions paper abstract.
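A sketch of that sliding view as the abstract describes it (the field names and elision format are my assumptions):

```python
# Keep assistant thoughts, keep the newest tool outputs verbatim, and
# collapse older tool dumps into short placeholders.
def sliding_view(turns: list[dict], keep_recent_tools: int = 2) -> list[dict]:
    tool_turns = [i for i, t in enumerate(turns) if t["role"] == "tool"]
    recent = set(tool_turns[-keep_recent_tools:])
    view = []
    for i, t in enumerate(turns):
        if t["role"] == "tool" and i not in recent:
            view.append({"role": "tool", "content": f"[tool output elided at turn {i}]"})
        else:
            view.append(t)  # assistant thoughts and fresh tool outputs survive
    return view
```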
MUSE learns on the job to set new TAC SOTA
An experience-driven loop (plan, execute, reflect, store) writes hierarchical memories (strategic notes, sub-task SOPs, tool tips) and reuses them across tasks and models, pushing TAC to 51.78% with Gemini-2.5 Flash paper overview.
- A separate reflect agent blocks false success, proposes retries, and distills new SOPs; plain-language memories transfer to other LLMs without retraining paper overview.
RL finds the heads that matter for reasoning
A tiny gate per attention head mixes a short sliding window with full history; RL raises gates for heads that improve correctness. High-gate heads keep full KV while others get trimmed, preserving chain integrity and saving ~20-50% KV with near-lossless (or better) accuracy on math/coding paper title page.
- On a tough math set, the compressed policy even beats the full-cache baseline by focusing memory where it counts paper title page.
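A toy version of the trimming step (shapes, the 0.5 threshold, and the window size are illustrative; in the paper the gates are learned with RL):

```python
# After training, high-gate heads keep full KV; the rest keep a recent window.
import numpy as np

def trim_kv(kv: np.ndarray, gates: np.ndarray, window: int = 128,
            threshold: float = 0.5):
    """kv: [heads, seq, dim]; gates: per-head values in [0, 1]."""
    kept = [kv[h] if g >= threshold else kv[h][-window:]
            for h, g in enumerate(gates)]
    saved = 1 - sum(k.shape[0] for k in kept) / (kv.shape[0] * kv.shape[1])
    return kept, saved

kv = np.zeros((8, 1024, 64), dtype=np.float16)
gates = np.array([0.9, 0.1, 0.2, 0.8, 0.1, 0.1, 0.95, 0.3])
_, saved = trim_kv(kv, gates)
print(f"KV memory saved: {saved:.0%}")  # ~55% with 3 of 8 heads kept full
```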
TAG diffusion guidance trims steps and hallucinations
Tangential Amplifying Guidance (TAG) amplifies the tangential component of each diffusion update to keep samples near high-probability regions, reducing semantic drift; in tests, ~30 TAG steps surpass 100-step classifier-free guidance on quality without extra model calls paper thread.
- Methodologically, TAG keeps the radial (noise) part fixed while modestly boosting the tangential (content) component, improving likelihood and text-image alignment on standard samplers paper recap.
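The geometry in a few lines of numpy (illustrative, not the authors' code): project the update onto the sample direction, keep that radial part, and scale only the tangential remainder:

```python
# Decompose a diffusion update g at sample x into radial + tangential parts,
# then amplify only the tangential (content) component.
import numpy as np

def tag_update(x: np.ndarray, g: np.ndarray, tangential_scale: float = 1.5) -> np.ndarray:
    x_hat = x / (np.linalg.norm(x) + 1e-8)   # unit vector along the sample
    g_radial = np.dot(g, x_hat) * x_hat      # noise/magnitude part: unchanged
    g_tangential = g - g_radial              # content part: boosted
    return g_radial + tangential_scale * g_tangential

rng = np.random.default_rng(0)
x, g = rng.standard_normal(8), rng.standard_normal(8)
print(tag_update(x, g))
```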
R-Horizon: RL improves long-chain reasoning
New details show reinforcement learning on composed chains lifts chain accuracy and yields roughly +7.5 on AIME24, addressing early stopping and format breaks that crater long sequences paper update. This builds on R-Horizon, which introduced breadth-and-depth chaining to stress long-horizon reasoning.
- Gap analysis finds real chain scores fall below independent-trial expectations as chain length grows; verifiable rewards and GRPO reduce error accumulation paper summary.
Reasoning shaping curbs overthinking
Group Relative Segment Penalization (GRSP) supervises at the step level: it detects step boundaries (keywords or confidence dips) and penalizes choppy, short segments that correlate with wrong answers, yielding shorter outputs with stable accuracy on math and RL setups paper details.
- Larger bases still benefit, compressing steps better and stabilizing training versus token-level penalties paper details.
AI metareview nears human accept/reject accuracy
An ensemble of reviewer personas (empiricist, theorist, pedagogical) plus a metareviewer achieves ~81.8% accept/reject accuracy on 1,963 submissions, close to the ~83.9% human average; AI excels at fact/literature checks but lags on novelty/theory paper summary.
- Rebuttals can over-sway agents, so humans stay in the loop; ensembles reduce persona bias versus single reviewers paper summary.
MPO co-optimizes words and visuals for MLLMs
Treating prompts as paired text+cue (image/video/molecule) and updating both after single-note feedback improves MLLM answers while slashing trial budgets by ~70% compared to text-only optimizers; strong parents are warm-started to explore efficiently paper summary.
- Visual cues are created, edited, or mixed to align attention on the right details, with wins across images, videos, and molecular tasks paper summary.



