Andrej Karpathy open-sourced autoresearch, a minimal agent loop for automated ML research, and reported roughly 20 additive changes that reduced nanochat’s Time to GPT-2 from 2.02 hours to 1.80 hours. Research teams can use it as a concrete recipe for closed-loop experimentation on any metric that admits cheap proxy evaluations.

Karpathy's release thread positions autoresearch less as a polished product than as a reusable loop: give an agent a measurable objective, let it modify the training code, run full experiments, score the result, and preserve wins. The repo is available on GitHub, and a widely shared early walkthrough distilled the operating model to "~630 lines of code," "single GPU," and short training cycles.
That matters because the contribution is procedural. Instead of promising autonomous science in the abstract, autoresearch packages the bread-and-butter ML tuning workflow Karpathy describes doing manually for "2 decades" into an agentic closed loop that can keep iterating while humans refine prompts and constraints.
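Stripped to its essentials, the closed loop described above can be sketched in a few lines. This is a toy illustration, not the repo's actual API: `propose_change` and `train_and_eval` are hypothetical stand-ins for the LLM agent's code edits and a short real training run.

```python
import random

def propose_change(history):
    # Placeholder: in autoresearch an LLM agent reads the history of
    # results and emits a code edit; here we just perturb one knob.
    lr = history[-1]["lr"] * random.choice([0.5, 0.8, 1.25, 2.0])
    return {"lr": lr}

def train_and_eval(cfg):
    # Placeholder for a short training run scored on validation loss;
    # a toy convex bowl around lr=3e-4 stands in. Lower is better.
    return abs(cfg["lr"] - 3e-4) / 3e-4

def research_loop(budget=50):
    best = {"lr": 1e-3}
    best_loss = train_and_eval(best)
    history = [{"lr": best["lr"], "loss": best_loss}]
    for _ in range(budget):
        cfg = propose_change(history)
        loss = train_and_eval(cfg)
        history.append({"lr": cfg["lr"], "loss": loss})
        if loss < best_loss:          # preserve only the wins
            best, best_loss = cfg, loss
    return best, best_loss
```

The essential structure is the `if loss < best_loss` gate: candidate changes are cheap to generate, and only measured improvements survive onto the running-best path.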
The strongest evidence is the nanochat run itself. Karpathy says a roughly two-day run on a depth-12 model found about 20 validation-loss improvements, and that every one he tested was additive and transferred to larger depth-24 models. Stacked together, those changes moved Time to GPT-2 from 2.02 hours to 1.80 hours, which he says becomes the new leaderboard entry.
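As a quick sanity check on the headline numbers, the 2.02-to-1.80-hour move works out to roughly an 11% wall-clock reduction:

```python
baseline, improved = 2.02, 1.80   # hours, Time to GPT-2
saved = baseline - improved       # 0.22 hours, about 13 minutes
speedup = baseline / improved     # throughput-style speedup factor
reduction = saved / baseline      # fraction of wall-clock time removed
print(f"{speedup:.3f}x speedup, {reduction:.1%} reduction")
# → 1.122x speedup, 10.9% reduction
```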
The results plot shows 276 experiments with 29 kept improvements on the running-best path, while the thread says the broader process worked through about 700 autonomous changes. The retained fixes included sharper attention from adding a missing QK-norm scaler, regularization for value embeddings, less conservative banded attention, corrected AdamW betas, a tuned weight-decay schedule, and improved initialization.
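To make the first of those fixes concrete: QK norm normalizes queries and keys before the dot product, bounding attention logits, and the reported fix adds back a learnable scaler on top. This is my own minimal single-head sketch in NumPy, not the nanochat code:

```python
import numpy as np

def qk_norm_attention(q, k, v, scale=1.0, eps=1e-6):
    """Single-head attention with RMS-normalized queries and keys.

    Normalizing q and k keeps the logits bounded, which stabilizes
    training; `scale` plays the role of the learnable scaler the fix
    reportedly restores. Shapes: q, k, v are (seq_len, head_dim).
    """
    q = q / (np.sqrt((q ** 2).mean(-1, keepdims=True)) + eps)
    k = k / (np.sqrt((k ** 2).mean(-1, keepdims=True)) + eps)
    logits = scale * (q @ k.T)
    logits -= logits.max(-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v
```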
Karpathy also says the agent "looked at the sequence of results of experiments and used that to plan the next ones," which is the more important engineering claim than raw benchmark movement: the loop is doing sequential experimental design, not just grid search. Meanwhile, the result spread quickly, with one reposted copy passing 1,000 reposts, signaling that this specific benchmark delta landed as more than a niche repo drop.
Karpathy's framing is blunt: "All LLM frontier labs will do this," and scaling it is "just engineering." His proposed path is a swarm model: agents tune smaller systems cheaply, promising ideas get promoted to larger scales, and humans stay on the edges for supervision and problem selection.
The practical boundary condition is also clear in the thread. This works best where the target metric is cheap to score directly, or where a smaller model or proxy objective gives a fast signal. That's why nanochat is a plausible first target and why the same pattern could extend to inference, training, or system-level metrics that can be evaluated repeatedly without expensive human review.
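The promotion pattern this implies is simple to state in code. In this sketch, `cheap_proxy_score` and `expensive_full_score` are hypothetical stand-ins for, say, a depth-12 validation run and a depth-24 confirmation run; neither name comes from the repo:

```python
def promote_if_promising(changes, cheap_proxy_score, expensive_full_score,
                         baseline_proxy, baseline_full, margin=0.0):
    """Gate expensive evaluations behind a cheap proxy metric.

    Each candidate change is first scored on the fast proxy (e.g. a
    small model's validation loss); only changes that beat the proxy
    baseline by `margin` earn a full-scale run. Lower is better.
    """
    kept = []
    for change in changes:
        if cheap_proxy_score(change) < baseline_proxy - margin:
            if expensive_full_score(change) < baseline_full:
                kept.append(change)   # an additive win at full scale too
    return kept
```

The economics only work when the proxy is much cheaper than the full run and correlated enough with it that few promotions are wasted.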
A useful read from practitioners is that the hard part may shift from execution to research design. In one engineer's reaction, the interesting work becomes setting hypotheses, building verification methods, and using "contracts" so longer-horizon agents improve systems without drifting off-task.
Claude can now drive macOS apps, browser tabs, the keyboard, and the mouse from Claude Cowork and Claude Code, with permission prompts when it needs direct screen access. That makes legacy desktop workflows automatable, and Anthropic is pairing the push with more background-task support for longer agent loops.
Release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
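The general trick behind such indexes can be sketched independently of Cursor's implementation: extract the literal trigrams a query requires, intersect their posting lists to get a small candidate set, then run the real regex only on those files. A toy version, assuming literal queries:

```python
import re
from collections import defaultdict

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    """Toy inverted index mapping trigram -> set of file ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.files = {}

    def add(self, file_id, text):
        self.files[file_id] = text
        for g in trigrams(text):
            self.postings[g].add(file_id)

    def search_literal(self, pattern):
        # Every trigram of a literal pattern must appear in a matching
        # file, so intersecting posting lists prunes the search space;
        # the regex then confirms true matches on the survivors only.
        grams = trigrams(pattern)
        if not grams:
            candidates = set(self.files)      # too short to prefilter
        else:
            candidates = set.intersection(*(self.postings[g] for g in grams))
        rx = re.compile(re.escape(pattern))
        return sorted(f for f in candidates if rx.search(self.files[f]))
```

Production systems layer Bloom filters and n-gram statistics on top of this idea to keep the index compact and the false-positive rate low, but the intersect-then-verify shape is the same.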
breakingChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
breakingEpoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, …
Andrej Karpathy just dropped something absurdly insane. An open-source repo where an AI agent runs its own ML research loop. While you sleep. The setup is almost absurdly simple: -~630 lines of code -single GPU -5-minute training runs But here’s the twist. The human …
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: - the human iterates on the
"All LLM frontier labs will do this. It's the final boss battle... Doing it is 'just engineering' and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and …