Together GPU Clusters added autoscaling, RBAC, observability, and self-healing controls to its managed cluster product. Use it if your team is moving from ad hoc GPU pools to production training or inference and needs more platform controls out of the box.

Together framed this as a move from "experimental GPU infrastructure" to "production-ready AI platforms" in its launch thread. The new control plane features cover the usual gaps teams hit when bare GPU access turns into shared internal infrastructure: elasticity, permissions, debugging, and failure recovery.
The most implementation-relevant addition is autoscaling via the Kubernetes Cluster Autoscaler, which Together's capabilities post describes as scaling GPU capacity with real-time demand. The same post says observability is exposed through Grafana dashboards for GPU, networking, and storage telemetry, while RBAC adds project isolation for multi-team use. On reliability, Together highlights active health checks and "3-click node repair" to reduce mean time to repair (MTTR).
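For a concrete sense of what RBAC-backed project isolation looks like at the Kubernetes layer, here is a minimal sketch using the official Python client. The namespace, group, role name, and verb list are all hypothetical illustrations of the general pattern, not Together's actual configuration or API.

```python
# Sketch: per-project isolation with Kubernetes RBAC, the mechanism
# Together's post names. All names below are hypothetical examples.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
rbac = client.RbacAuthorizationV1Api()

namespace = "team-ml-research"  # hypothetical per-team namespace

# Role: what this project's members may do, scoped to their namespace only.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="gpu-job-runner", namespace=namespace),
    rules=[client.V1PolicyRule(
        api_groups=["batch", ""],
        resources=["jobs", "pods", "pods/log"],
        verbs=["get", "list", "watch", "create", "delete"],
    )],
)
rbac.create_namespaced_role(namespace, role)

# RoleBinding: attach the role to the team's group from the identity
# provider. (RbacV1Subject is named V1Subject on older client versions.)
binding = client.V1RoleBinding(
    metadata=client.V1ObjectMeta(name="gpu-job-runner-binding", namespace=namespace),
    subjects=[client.RbacV1Subject(
        kind="Group", name="ml-research",
        api_group="rbac.authorization.k8s.io",
    )],
    role_ref=client.V1RoleRef(
        kind="Role", name="gpu-job-runner",
        api_group="rbac.authorization.k8s.io",
    ),
)
rbac.create_namespaced_role_binding(namespace, binding)
```

The point of the pattern is that a team's credentials grant verbs only inside its own namespace, which is what makes sharing one GPU estate across multiple teams tractable.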
Together is aiming this at teams running either large distributed training jobs or variable production inference traffic, per the product announcement. That matters because those two workloads usually force different infrastructure tradeoffs: training clusters need coordinated capacity and failure handling, while inference fleets care more about demand swings and cost control.
The announcement post says these additions are meant to address static provisioning, brittle permission management, observability gaps, and hardware failures inside managed GPU environments. Together's product page also ties the cluster offer to NVIDIA GB200, B200, H200, and H100-based deployments, so the update is less about new silicon than about making the managed layer more usable for platform teams operating shared GPU estates.
Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729. This matters if you are evaluating moving rollout-heavy RL jobs off NVIDIA hardware and want concrete throughput and step-time numbers before porting.
[release] OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
[release] Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
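To make that concrete, below is a toy version of the trigram-index idea behind Instant Grep (the same technique underpinned Google Code Search). The class and its literal-extraction step are invented for illustration; the real implementation adds Bloom filters for cheap membership rejection, incremental updates, and an on-disk layout, none of which are modeled here.

```python
import re
from collections import defaultdict

class TrigramIndex:
    """Toy trigram inverted index: find files that must contain the
    regex's literal substrings, then run the real regex on only those."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of files containing it
        self.files = []                   # file id -> (path, contents)

    def add(self, path, text):
        fid = len(self.files)
        self.files.append((path, text))
        for i in range(len(text) - 2):
            self.postings[text[i:i + 3]].add(fid)

    def search(self, pattern):
        # Crude literal extraction: split the regex on metacharacters and
        # keep runs long enough to yield trigrams. A production engine
        # derives required literals from the parsed regex tree instead.
        literals = [s for s in re.split(r'[\\.*+?\[\](){}|^$]+', pattern) if len(s) >= 3]
        if literals:
            candidates = None
            for lit in literals:
                for i in range(len(lit) - 2):
                    ids = self.postings.get(lit[i:i + 3], set())
                    candidates = ids if candidates is None else candidates & ids
        else:
            # No usable literal (e.g. ".*"): fall back to scanning everything.
            candidates = set(range(len(self.files)))
        rx = re.compile(pattern)
        hits = []
        for fid in sorted(candidates):  # the regex runs only on survivors
            path, text = self.files[fid]
            m = rx.search(text)
            if m:
                hits.append((path, m.start()))
        return hits

idx = TrigramIndex()
idx.add("main.py", "def handle_request(req):\n    return dispatch(req)\n")
idx.add("util.py", "def parse_config(path):\n    return read(path)\n")
print(idx.search(r"handle_\w+"))  # only main.py is regex-checked
```

The win is that the expensive regex engine touches only files whose trigram posting lists intersect across the pattern's required literals; every other file is rejected with a few set lookups, which is what turns a full-repo scan into a candidate check.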
[breaking] ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
[breaking] Epoch AI reports that GPT-5.4 Pro produced a publishable solution to a 2019 conjecture from its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
Together GPU Clusters now has autoscaling, RBAC, full-stack observability, and self-healing operations built in. Move from experimental GPU infrastructure to production-ready AI platforms with elastic capacity, multi-team governance, and automated failure recovery.
Built for teams running distributed training at scale and production inference workloads.
Get started with Together GPU Clusters: together.ai/gpu-clusters
Read the announcement: together.ai/blog/new-in-to…