Kosmos AI Scientist posts 79.4% accuracy, 1,500‑paper runs – Google tests Co‑Scientist
Executive Summary
Edison Scientific launched Kosmos, an autonomous “AI Scientist” that turns long‑horizon literature‑to‑code research into auditable runs tied to code and citations. It reports 79.4% accuracy on audited conclusions while synthesizing ~1,500 papers and writing ~42,000 lines of analysis code per run, the kind of throughput that turns compute into publishable work.
Beta users say a 20‑step run replaced months of expert effort, scaling linearly with depth. And Google is pushing the same pattern: Gemini Enterprise is piloting a “Co‑Scientist” that tournament‑ranks ~100 ideas in ~40 minutes against an explicit rubric, while NotebookLM’s new Deep Research browses hundreds of pages and compiles a cited report.
A timely 94‑page survey argues for closed‑loop agents that plan experiments, call tools, and grade their own steps. If you pilot this wave, set budget guardrails and log every step.
Feature Spotlight
Feature: AI‑accelerated science and research agents
AI research agents arrive: Kosmos claims single‑run synthesis of ~1.5k papers + 42k LOC with auditable outputs, while Google tests a 40‑min multi‑agent Co‑Scientist that ranks ~100 ideas per run; NotebookLM adds Deep Research reports.
Cross‑account surge around autonomous research: Kosmos “AI Scientist,” Google’s Gemini Enterprise Co‑Scientist, and NotebookLM’s Deep Research. Engineers care because these systems operationalize long‑horizon workflows with auditable traces and tournament‑style idea selection.
Feature: AI‑accelerated science and research agents
Kosmos “AI Scientist” debuts with audited outputs and expert‑level throughput
Edison Scientific unveiled Kosmos, an autonomous research system that can synthesize ~1,500 papers and write ~42,000 lines of analysis code in a single run, with 79.4% conclusion accuracy and full traceability to code and citations Altman endorsement, Launch article. The team highlights seven example discoveries and a structured world‑model approach that lets the agent stay on‑objective over millions of tokens.
- Beta users report a single 20‑step run replaced about 6.14 months of expert work, with perceived work scaling linearly with run depth scaling chart.
Why this matters: Kosmos packages long‑horizon research into repeatable, auditable workflows. That’s the piece lab leads and R&D heads need to justify compute and compliance at the same time.
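For teams wiring up similar audit trails, here is a minimal sketch of a conclusion record tied to its code and citations; the class names and fields are hypothetical illustrations, not Kosmos’s actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema: each conclusion carries pointers to the exact analysis code
# and citations that produced it, so a reviewer can audit the run step by step.
@dataclass
class Conclusion:
    claim: str                                            # natural-language finding
    code_refs: list[str] = field(default_factory=list)    # paths/commits of analysis code
    citations: list[str] = field(default_factory=list)    # DOIs or URLs of supporting papers
    audited: bool = False                                 # True once a reviewer verifies the claim

@dataclass
class RunRecord:
    objective: str
    conclusions: list[Conclusion] = field(default_factory=list)

    def accuracy(self) -> float:
        """Share of conclusions that passed audit (analogous to a reported 79.4%)."""
        if not self.conclusions:
            return 0.0
        return sum(c.audited for c in self.conclusions) / len(self.conclusions)

run = RunRecord(objective="metabolite X vs disease Y")
run.conclusions.append(Conclusion(
    claim="Pathway Z is upregulated in cohort A",
    code_refs=["analysis/step_17_diffexpr.py@abc123"],
    citations=["doi:10.1000/example"],
    audited=True,
))
print(f"audited accuracy: {run.accuracy():.1%}")
```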
Gemini Enterprise “Co‑Scientist” runs tournament rankings to refine research ideas
Internal strings and demos show Google piloting two multi‑agent flows inside Gemini Enterprise: Idea Generation and a Co‑Scientist that, per run, spends ~40 minutes to generate and tournament‑rank ~100 ideas against user‑set criteria feature leak, Feature brief. The 3‑step loop takes a research goal + data, spawns specialist agents to explore, then evaluates and ranks based on an explicit rubric.
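For intuition on the pattern (not Google’s implementation), a minimal tournament‑ranking sketch in Python; the judge function is a placeholder where an LLM call would compare two ideas against the user’s rubric.

```python
import random

def judge(idea_a: str, idea_b: str, rubric: str) -> str:
    # Placeholder heuristic; in practice an LLM compares the pair against the rubric.
    return idea_a if len(idea_a) >= len(idea_b) else idea_b

def tournament_rank(ideas: list[str], rubric: str, rounds: int = 3) -> list[str]:
    """Run pairwise matches and rank ideas by wins (Elo-style refinements omitted)."""
    ideas = list(ideas)                      # avoid mutating the caller's list
    wins = {idea: 0 for idea in ideas}
    for _ in range(rounds):
        random.shuffle(ideas)
        for a, b in zip(ideas[::2], ideas[1::2]):
            wins[judge(a, b, rubric)] += 1
    return sorted(wins, key=wins.get, reverse=True)

ideas = [f"idea {i:03d}: ..." for i in range(100)]   # ~100 candidates per run
ranked = tournament_rank(ideas, rubric="novelty, feasibility, impact")
print(ranked[:5])                                     # top candidates to refine further
```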
Why this matters: Teams get a repeatable front‑end for directed ideation with built‑in evaluation, which is the bottleneck for scaling literature triage and hypothesis pruning across orgs.
NotebookLM “Deep Research” turns broad web sweeps into structured, cited reports
Google rolled out a Deep Research mode in NotebookLM that can autonomously browse hundreds of pages, synthesize findings into a structured report, and attach an annotated source list; it also expands supported source types (e.g., Drive URLs, Sheets, images) for mixed‑media research sets feature demo, Google blog post. Early user tests call it an “outstanding learning tool,” noting integrated mind maps, flashcards, and quizzes for follow‑up study hands‑on notes.
Why this matters: This is a ready‑to‑try research assistant with long‑running retrieval and auditable outputs—useful for product reviews, policy scans, and backgrounders that used to take days.
Survey catalogs scientific LLMs and argues for agent loops tied to real evidence
A comprehensive survey of scientific LLMs compiles 270 datasets and 190 benchmarks, proposes a taxonomy spanning raw observations→theory, and tracks a shift from single‑turn quizzes to process‑based grading of steps, tools, and intermediate results paper thread, ArXiv paper. The authors advocate closed‑loop agents that plan experiments, call simulators or labs, validate outcomes, and update shared knowledge—framing how to train and evaluate systems beyond static corpora.
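A minimal sketch of the closed loop the survey argues for, with plan, tool call, validation, and knowledge update as stubs; none of this is the paper’s code, only the shape of the loop.

```python
# Sketch of a closed-loop science agent: plan -> run tool -> validate -> update knowledge.
# Every component below is a stub; a real system would back plan() with an LLM,
# run_tool() with a simulator or lab API, and validate() with a process-based grader.
knowledge: list[dict] = []   # shared store that only validated results enter

def plan(goal: str, knowledge: list[dict]) -> dict:
    return {"experiment": f"next test for: {goal}", "params": {"n": 10}}

def run_tool(experiment: dict) -> dict:
    return {"result": 0.42, "logs": "simulator output ..."}

def validate(experiment: dict, outcome: dict) -> bool:
    # Grade the step itself (inputs, tool use, outputs), not just a final answer.
    return outcome.get("result") is not None

def closed_loop(goal: str, max_steps: int = 5) -> list[dict]:
    for _ in range(max_steps):
        exp = plan(goal, knowledge)
        outcome = run_tool(exp)
        if validate(exp, outcome):
            knowledge.append({"experiment": exp, "outcome": outcome})
    return knowledge

print(len(closed_loop("does compound A bind target B?")), "validated steps recorded")
```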
Why this matters: It’s a roadmap for engineers stitching models, tools, and evaluators into credible pipelines for scientific work, with benchmarks that reward the process—not just the final answer.
AI factories, datacenters and ops wins
Infra stayed hot: NVIDIA’s Jensen framed custom ASICs vs ‘AI factories’, Groq opened a 4.5MW Sydney site, and OpenAI reclaimed ~30k CPU cores via a logging tweak. Also posted: H200/B200 price trends and DRAM/VRAM squeeze. Excludes research‑agent launches (covered as feature).
NVIDIA’s Jensen dismisses custom ASICs as “science projects,” touts AI factories
At a UBS Q&A during GTC, Jensen Huang argued that custom ASICs can’t match NVIDIA’s full‑stack “AI factory” approach, citing an internal roadmap claiming up to ~40× beyond Hopper and the end‑to‑end systems and supply‑chain confidence needed to take $100B‑scale POs transcript highlights. For infra leads, the message is clear: buyers will be sold on time‑to‑revenue, not chip lists.
This frames procurement around platform certainty and execution risk. If you’re modeling long‑lead data center bets, build scenarios where ASIC options don’t materially lower TCO once software, networking, power, and delivery timelines are included.
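A back‑of‑the‑envelope scenario model makes that concrete; every number below is a placeholder to be replaced with your own quotes, and the only point is that chip capex is one line item among several.

```python
# Placeholder TCO comparison: values are illustrative, not vendor pricing.
def tco(chip_capex: float, software_eng: float, networking: float,
        power_annual: float, years: float, delay_months: float,
        revenue_per_month: float) -> float:
    """Total cost of ownership including the opportunity cost of delayed time-to-revenue."""
    opportunity_cost = delay_months * revenue_per_month
    return chip_capex + software_eng + networking + power_annual * years + opportunity_cost

platform = tco(chip_capex=500, software_eng=20, networking=80,
               power_annual=60, years=4, delay_months=0, revenue_per_month=30)
custom_asic = tco(chip_capex=350, software_eng=120, networking=90,
                  power_annual=55, years=4, delay_months=9, revenue_per_month=30)
print(f"platform: {platform}  custom ASIC: {custom_asic}")   # units: $M, purely illustrative
```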
OpenAI frees ~30,000 CPU cores by disabling a costly Fluent Bit path
OpenAI’s observability team profiled node‑level Fluent Bit and found fstatat64 calls (triggered by inotify) burning ~35% of CPU; turning that path off returned ~30,000 CPU cores to Kubernetes clusters processing nearly 10 PB/day of logs talk recap, with methodology and impact shared in the KubeCon session KubeCon talk. This is a big ops win: same log volume, with roughly a third of the agents’ CPU handed back.
If you run Fluent Bit, replicate the perf tracing, test inotify behavior under heavy appenders, and stage a rollout behind feature flags. Savings at this scale can fund more inference capacity immediately.
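A rough way to check whether your own nodes show the same pattern, assuming strace, coreutils timeout, and ptrace permissions on the host; adjust the process name and sampling window to your setup.

```python
import subprocess, sys

# Find the oldest fluent-bit process on this node (assumes it runs as a native process
# you are allowed to trace; containerized setups may need nsenter or a privileged pod).
pid = subprocess.run(["pgrep", "-o", "fluent-bit"],
                     capture_output=True, text=True).stdout.strip()
if not pid:
    sys.exit("no fluent-bit process found")

# Attach for 30s and collect a per-syscall summary; strace -c prints the table on detach,
# so we stop it with SIGINT via timeout. Requires root / CAP_SYS_PTRACE.
proc = subprocess.run(["timeout", "-s", "INT", "30", "strace", "-c", "-f", "-p", pid],
                      capture_output=True, text=True)

# The summary goes to stderr; surface the stat/inotify family the costly path hammers.
for line in proc.stderr.splitlines():
    if any(tok in line for tok in ("% time", "newfstatat", "fstatat", "statx", "inotify")):
        print(line)
```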
Groq opens 4.5MW Sydney site to serve APAC with local inference
Groq lit up a 4.5MW data center in Sydney in partnership with Equinix Fabric, bringing low‑latency token serving to Australia and the wider APAC region launch note, with details in the company’s release press post. For teams in Australia, this cuts cross‑ocean latency and can lower per‑request costs when routing to closer endpoints.
Expect regional routing policies and capacity reservations to matter. If you’re piloting Groq, test latency deltas from Sydney versus US/EU regions and adjust traffic shaping accordingly.
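A simple latency probe is enough to quantify the delta; the endpoint URLs below are placeholders, so substitute whatever regional hosts or gateways your account is actually routed to.

```python
import statistics, time, urllib.request

# Placeholder endpoints: replace with the real regional hosts you are testing.
ENDPOINTS = {
    "sydney": "https://sydney.example-inference.net/health",
    "us":     "https://us.example-inference.net/health",
}

def probe(url: str, n: int = 10) -> float:
    """Median round-trip time in milliseconds for a small GET request."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

for region, url in ENDPOINTS.items():
    print(region, f"{probe(url):.0f} ms")
```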
H200/B200 pricing spikes at launch, steps down later but stays elevated
Morgan Stanley exhibits circulating today show rental pricing for 8× H200 and early B200 nodes surging at launch, then stepping down as supply ramps—yet not returning to prior baselines chart thread. The takeaway for capacity planners: scarcity premiums ease, but structural demand keeps floor prices higher than last gen.
Model budgets around staged price relief, not a full reversion. Lock short terms for the peak window; renegotiate as additional capacity lands.
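A toy price path helps frame that budgeting exercise; the decay rate and floor below are explicit assumptions, not forecasts.

```python
# Toy rental-price path: launch premium decays toward a floor above the last-gen baseline.
# All parameters are assumptions for budgeting exercises, not market data.
def monthly_price(month: int, launch_price: float = 100.0,
                  floor: float = 65.0, decay: float = 0.12) -> float:
    """Exponential step-down from launch price toward a structural floor (index units)."""
    return floor + (launch_price - floor) * (1 - decay) ** month

budget_12mo = sum(monthly_price(m) for m in range(12))
print([round(monthly_price(m), 1) for m in (0, 3, 6, 12)], round(budget_12mo, 1))
```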
RAM/VRAM prices reportedly tripling in months amid AI server demand
A widely shared Gamers Nexus breakdown reports DRAM pricing up ~3× in recent months, with knock‑on effects for NAND and GPU VRAM as AI servers absorb supply; production cuts made during the prior oversupply and potential manufacturer coordination are cited as drivers video note, echoed by community commentary flagging lab lock‑ins market note. This affects both server buildouts and on‑device edge AI plans.
Budget buffers for memory should widen. When speccing clusters or local inference nodes, watch lead times and consider pre‑buys on DIMMs/VRAM‑heavy SKUs before the next allocation bump.
