Kosmos AI Scientist posts 79.4% accuracy, 1,500‑paper runs – Google tests Co‑Scientist | Daily AI Primer

Executive Summary

Edison Scientific launched Kosmos, an autonomous “AI Scientist” that turns long‑horizon literature‑to‑code research into auditable runs tied to code and citations. It reports 79.4% audited conclusion accuracy — the kind of throughput that turns compute into publishable work.

Beta users say a 20‑step run replaced months of expert effort, scaling linearly with depth. And Google is pushing the same pattern: Gemini Enterprise is piloting a “Co‑Scientist” that tournament‑ranks ~100 ideas in ~40 minutes against an explicit rubric, while NotebookLM’s new Deep Research browses hundreds of pages and compiles a cited report.

A timely 94‑page survey argues for closed‑loop agents that plan experiments, call tools, and grade their own steps. If you pilot this wave, set budget guardrails and log every step.

Feature Spotlight

Feature: AI‑accelerated science and research agents

AI research agents arrive: Kosmos claims single‑run synthesis of ~1.5k papers + 42k LOC with auditable outputs, while Google tests a 40‑min multi‑agent Co‑Scientist that ranks ~100 ideas per run; NotebookLM adds Deep Research reports.

Cross‑account surge around autonomous research: Kosmos “AI Scientist,” Google’s Gemini Enterprise Co‑Scientist, and NotebookLM’s Deep Research. Engineers care because these systems operationalize long‑horizon workflows with auditable traces and tournament‑style idea selection.

Jump to Feature: AI‑accelerated science and research agents topics

Stay in the loop

Get the Daily AI Primer delivered straight to your inbox. One email per day, unsubscribe anytime.

Feature: AI‑accelerated science and research agents

Kosmos “AI Scientist” debuts with audited outputs and expert‑level throughput

Edison Scientific unveiled Kosmos, an autonomous research system that can synthesize ~1,500 papers and write ~42,000 lines of analysis code in a single run, with 79.4% conclusion accuracy and full traceability to code and citations Altman endorsement, Launch article. The team highlights seven example discoveries and a structured world‑model approach that lets the agent stay on‑objective over millions of tokens.

Beta users report a single 20‑step run replaced about 6.14 months of expert work, with perceived work scaling linearly with run depth scaling chart.

Why this matters: Kosmos packages long‑horizon research into repeatable, auditable workflows. That’s the piece lab leads and R&D heads need to justify compute and compliance at the same time.

Gemini Enterprise “Co‑Scientist” runs tournament rankings to refine research ideas

Internal strings and demos show Google piloting two multi‑agent flows inside Gemini Enterprise: Idea Generation and a Co‑Scientist that, per run, spends ~40 minutes to generate and tournament‑rank ~100 ideas against user‑set criteria feature leak, Feature brief. The 3‑step loop takes a research goal + data, spawns specialist agents to explore, then evaluates and ranks based on an explicit rubric.

Why this matters: Teams get a repeatable front‑end for directed ideation with built‑in evaluation, which is the bottleneck for scaling literature triage and hypothesis pruning across orgs.

NotebookLM “Deep Research” turns broad web sweeps into structured, cited reports

Google rolled out a Deep Research mode in NotebookLM that can autonomously browse hundreds of pages, synthesize findings into a structured report, and attach an annotated source list; it also expands supported source types (e.g., Drive URLs, Sheets, images) for mixed‑media research sets feature demo, Google blog post. Early user tests call it an “outstanding learning tool,” noting integrated mind maps, flashcards, and quizzes for follow‑up study hands‑on notes.

Why this matters: This is a ready‑to‑try research assistant with long‑running retrieval and auditable outputs—useful for product reviews, policy scans, and backgrounders that used to take days.

Survey catalogs scientific LLMs and argues for agent loops tied to real evidence

A comprehensive survey of scientific LLMs compiles 270 datasets and 190 benchmarks, proposes a taxonomy spanning raw observations→theory, and tracks a shift from single‑turn quizzes to process‑based grading of steps, tools, and intermediate results paper thread, ArXiv paper. The authors advocate closed‑loop agents that plan experiments, call simulators or labs, validate outcomes, and update shared knowledge—framing how to train and evaluate systems beyond static corpora.

Why this matters: It’s a roadmap for engineers stitching models, tools, and evaluators into credible pipelines for scientific work, with benchmarks that reward the process—not just the final answer.

AI factories, datacenters and ops wins

Infra stayed hot: NVIDIA’s Jensen framed custom ASICs vs ‘AI factories’, Groq opened a 4.5MW Sydney site, and OpenAI reclaimed ~30k CPU cores via a logging tweak. Also posted: H200/B200 price trends and DRAM/VRAM squeeze. Excludes research‑agent launches (covered as feature).

NVIDIA’s Jensen dismisses custom ASICs as “science projects,” touts AI factories

At a UBS Q&A during GTC, Jensen Huang argued that customer ASICs can’t match NVIDIA’s full‑stack “AI factory” approach, citing an internal roadmap claiming up to ~40× beyond Hopper and the ability to place $100B‑scale POs with end‑to‑end systems and supply chain confidence transcript highlights. For infra leads, the message is clear: buyers will be sold on time‑to‑revenue, not chip lists.

This frames procurement around platform certainty and execution risk. If you’re modeling long‑lead data center bets, build scenarios where ASIC options don’t materially lower TCO once software, networking, power, and delivery timelines are included.

OpenAI frees ~30,000 CPU cores by disabling a costly Fluent Bit path

OpenAI’s observability team profiled node‑level Fluent Bit and found fstatat64 calls (triggered by inotify) burning ~35% CPU; turning that path off returned ~30,000 CPU cores to Kubernetes clusters processing nearly 10 PB/day of logs talk recap, with methodology and impact shared in the KubeCon session KubeCon talk. This is a big ops win: same workload, half the CPU.

If you run Fluent Bit, replicate the perf tracing, test inotify behavior under heavy appenders, and stage a rollout behind feature flags. Savings at this scale can fund more inference capacity immediately.

Groq opens 4.5MW Sydney site to serve APAC with local inference

Groq lit up a 4.5MW data center in Sydney in partnership with Equinix Fabric, bringing low‑latency token serving to Australia and the wider APAC region launch note, with details in the company’s release press post. For teams in Australia, this cuts cross‑ocean latency and can lower per‑request costs when routing to closer endpoints.

Expect regional routing policies and capacity reservations to matter. If you’re piloting Groq, test latency deltas from Sydney versus US/EU regions and adjust traffic shaping accordingly.

H200/B200 pricing spikes at launch, steps down later but stays elevated

Morgan Stanley exhibits circulating today show rental pricing for 8× H200 and early B200 nodes surging at launch, then stepping down as supply ramps—yet not returning to prior baselines chart thread. The takeaway for capacity planners: scarcity premiums ease, but structural demand keeps floor prices higher than last gen.

Model budgets around staged price relief, not a full reversion. Lock short terms for the peak window; renegotiate as additional capacity lands.

RAM/VRAM prices reportedly tripling in months amid AI server demand

A widely shared Gamers Nexus breakdown reports DRAM pricing up ~3× in recent months, with knock‑on effects for NAND and GPU VRAM as AI servers absorb supply; prior oversupply cuts and potential manufacturer coordination are cited as drivers video note, echoed by community commentary flagging lab lock‑ins market note. This affects both server buildouts and on‑device edge AI plans.

YouTube analysis

Budget buffers for memory should widen. When speccing clusters or local inference nodes, watch lead times and consider pre‑buys on DIMMs/VRAM‑heavy SKUs before the next allocation bump.

Stay first in your field.

No more doomscrolling X. A crisp morning report for entrepreneurs, AI creators, and engineers. Clear updates, time-sensitive offers, and working pipelines that keep you on the cutting edge. We read the firehose and hand-pick what matters so you can act today.

I don’t have time to scroll X all day. Primer does it, filters it, done.

Renee J.

Startup Founder

The fastest way to stay professionally expensive.

Felix B.

AI Animator

AI moves at ‘blink and it’s gone’. Primer is how I don’t blink.

Alex T.

Creative Technologist

Best ROI on ten minutes of my day. I’ve shipped two features purely from their daily prompts.

Marta S.

Product Designer

From release noise to a working workflow in 15 minutes.

Viktor H

AI Artist

It’s the only digest that explains why a release matters and shows how to use it—same page, same morning.

Priya R.

Startup Founder

Get access

Stay professionally expensive

Make the right move sooner

Ship a product

WebEmailTelegram

Agentic dev tooling and coding workflows

Claude Code gets a one‑line Windows installer (no WSL)

Anthropic’s coding CLI runs on Windows with a single command, shown installing Claude Code v2.0.35 without WSL: curl -fsSL https://claude.ai/install.cmd -o install.cmd && install.cmd && del install.cmd Windows install. This lowers setup friction for enterprise laptops and lab machines; grab the script directly if you need to audit it first installer script.

Kosmos AI Scientist posts 79.4% accuracy, 1,500‑paper runs – Google tests Co‑Scientist

Executive Summary

Feature: AI‑accelerated science and research agents

Table of Contents

Feature: AI‑accelerated science and research agents

Kosmos “AI Scientist” debuts with audited outputs and expert‑level throughput

Gemini Enterprise “Co‑Scientist” runs tournament rankings to refine research ideas

NotebookLM “Deep Research” turns broad web sweeps into structured, cited reports

Survey catalogs scientific LLMs and argues for agent loops tied to real evidence

AI factories, datacenters and ops wins

NVIDIA’s Jensen dismisses custom ASICs as “science projects,” touts AI factories

OpenAI frees ~30,000 CPU cores by disabling a costly Fluent Bit path

Groq opens 4.5MW Sydney site to serve APAC with local inference

H200/B200 pricing spikes at launch, steps down later but stays elevated

RAM/VRAM prices reportedly tripling in months amid AI server demand

Stay first in your field.

Agentic dev tooling and coding workflows

Claude Code gets a one‑line Windows installer (no WSL)

On this page

Kosmos AI Scientist posts 79.4% accuracy, 1,500‑paper runs – Google tests Co‑Scientist

Executive Summary

Feature: AI‑accelerated science and research agents

Table of Contents

🔬Feature: AI‑accelerated science and research agents

Kosmos “AI Scientist” debuts with audited outputs and expert‑level throughput

Gemini Enterprise “Co‑Scientist” runs tournament rankings to refine research ideas

NotebookLM “Deep Research” turns broad web sweeps into structured, cited reports

Survey catalogs scientific LLMs and argues for agent loops tied to real evidence

🏭AI factories, datacenters and ops wins

NVIDIA’s Jensen dismisses custom ASICs as “science projects,” touts AI factories

OpenAI frees ~30,000 CPU cores by disabling a costly Fluent Bit path

Groq opens 4.5MW Sydney site to serve APAC with local inference

H200/B200 pricing spikes at launch, steps down later but stays elevated

RAM/VRAM prices reportedly tripling in months amid AI server demand

Stay first in your field.

On this page

Feature: AI‑accelerated science and research agents

AI factories, datacenters and ops wins