Kosmos AI Scientist posts 79.4% accuracy, 1,500‑paper runs – Google tests Co‑Scientist

Executive Summary

Edison Scientific launched Kosmos, an autonomous “AI Scientist” that turns long‑horizon, literature‑to‑code research into auditable runs tied to code and citations. It reports 79.4% accuracy on audited conclusions while synthesizing ~1,500 papers and writing ~42,000 lines of analysis code per run: the kind of throughput that turns compute into publishable work.

Beta users say a 20‑step run replaced months of expert effort, scaling linearly with depth. And Google is pushing the same pattern: Gemini Enterprise is piloting a “Co‑Scientist” that tournament‑ranks ~100 ideas in ~40 minutes against an explicit rubric, while NotebookLM’s new Deep Research browses hundreds of pages and compiles a cited report.

A timely 94‑page survey argues for closed‑loop agents that plan experiments, call tools, and grade their own steps. If you pilot this wave, set budget guardrails and log every step.

Feature Spotlight

Feature: AI‑accelerated science and research agents

AI research agents arrive: Kosmos claims single‑run synthesis of ~1.5k papers + 42k LOC with auditable outputs, while Google tests a 40‑min multi‑agent Co‑Scientist that ranks ~100 ideas per run; NotebookLM adds Deep Research reports.

Cross‑account surge around autonomous research: Kosmos “AI Scientist,” Google’s Gemini Enterprise Co‑Scientist, and NotebookLM’s Deep Research. Engineers care because these systems operationalize long‑horizon workflows with auditable traces and tournament‑style idea selection.

Feature: AI‑accelerated science and research agents

Kosmos “AI Scientist” debuts with audited outputs and expert‑level throughput

Edison Scientific unveiled Kosmos, an autonomous research system that can synthesize ~1,500 papers and write ~42,000 lines of analysis code in a single run, with 79.4% conclusion accuracy and full traceability to code and citations Altman endorsement, Launch article. The team highlights seven example discoveries and a structured world‑model approach that lets the agent stay on‑objective over millions of tokens.

  • Beta users report a single 20‑step run replaced about 6.14 months of expert work, with perceived work scaling linearly with run depth scaling chart.

Why this matters: Kosmos packages long‑horizon research into repeatable, auditable workflows, which is what lab leads and R&D heads need to justify compute spend and satisfy compliance at the same time.

Gemini Enterprise “Co‑Scientist” runs tournament rankings to refine research ideas

Internal strings and demos show Google piloting two multi‑agent flows inside Gemini Enterprise: Idea Generation and a Co‑Scientist that spends ~40 minutes per run generating and tournament‑ranking ~100 ideas against user‑set criteria feature leak, Feature brief. The three‑step loop takes a research goal plus data, spawns specialist agents to explore, then evaluates and ranks against an explicit rubric.

Why this matters: Teams get a repeatable front‑end for directed ideation with built‑in evaluation, which is the bottleneck for scaling literature triage and hypothesis pruning across orgs.
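
Google hasn’t published the Co‑Scientist’s ranking mechanism, so here is a minimal sketch of tournament‑style idea selection assuming Elo‑style pairwise judging; judge_pair is a hypothetical stand‑in for an LLM call that compares two ideas against the rubric.

```python
import itertools
import random

def judge_pair(idea_a: str, idea_b: str, rubric: str) -> str:
    """Hypothetical judge: return whichever idea better fits the rubric.
    A real system would prompt an LLM with both ideas plus the rubric."""
    return random.choice([idea_a, idea_b])  # placeholder decision

def tournament_rank(ideas: list[str], rubric: str, k: float = 32.0) -> list[str]:
    """Round-robin pairwise matches with standard Elo rating updates.
    100 ideas -> 4,950 comparisons; a real system would likely subsample."""
    ratings = {idea: 1000.0 for idea in ideas}
    for a, b in itertools.combinations(ideas, 2):
        winner = judge_pair(a, b, rubric)
        loser = b if winner == a else a
        # Expected score of the winner, then shift both ratings by k * surprise.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return sorted(ideas, key=ratings.get, reverse=True)

ideas = [f"idea-{i}" for i in range(100)]
top_10 = tournament_rank(ideas, rubric="novelty, feasibility, impact")[:10]
```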

NotebookLM “Deep Research” turns broad web sweeps into structured, cited reports

Google rolled out a Deep Research mode in NotebookLM that can autonomously browse hundreds of pages, synthesize findings into a structured report, and attach an annotated source list; it also expands supported source types (e.g., Drive URLs, Sheets, images) for mixed‑media research sets feature demo, Google blog post. Early user tests call it an “outstanding learning tool,” noting integrated mind maps, flashcards, and quizzes for follow‑up study hands‑on notes.

Why this matters: This is a ready‑to‑try research assistant with long‑running retrieval and auditable outputs—useful for product reviews, policy scans, and backgrounders that used to take days.

Survey catalogs scientific LLMs and argues for agent loops tied to real evidence

A comprehensive survey of scientific LLMs compiles 270 datasets and 190 benchmarks, proposes a taxonomy spanning raw observations→theory, and tracks a shift from single‑turn quizzes to process‑based grading of steps, tools, and intermediate results paper thread, ArXiv paper. The authors advocate closed‑loop agents that plan experiments, call simulators or labs, validate outcomes, and update shared knowledge—framing how to train and evaluate systems beyond static corpora.

Why this matters: It’s a roadmap for engineers stitching models, tools, and evaluators into credible pipelines for scientific work, with benchmarks that reward the process—not just the final answer.
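
A minimal skeleton of the closed‑loop pattern the survey advocates; the stage names and signatures here are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Knowledge:
    """Shared memory the loop updates only after a step validates."""
    facts: list[str] = field(default_factory=list)

def plan_experiment(goal: str, kb: Knowledge) -> str:
    # In practice an LLM proposes the next experiment from goal + known facts.
    return f"experiment for {goal} given {len(kb.facts)} known facts"

def execute(plan: str) -> dict:
    # In practice: call a simulator, code runner, or lab-instrument API.
    return {"plan": plan, "observation": 0.42}  # placeholder outcome

def validate(result: dict) -> bool:
    # Process-based grading: check the step itself, not just a final answer.
    return result.get("observation") is not None

def closed_loop(goal: str, budget: int = 5) -> Knowledge:
    kb = Knowledge()
    for _ in range(budget):          # budget guardrail: bounded steps
        plan = plan_experiment(goal, kb)
        result = execute(plan)
        if validate(result):         # only validated outcomes update memory
            kb.facts.append(f"{plan} -> {result['observation']}")
    return kb
```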


AI factories, datacenters and ops wins

Infra stayed hot: NVIDIA’s Jensen framed custom ASICs vs ‘AI factories’, Groq opened a 4.5MW Sydney site, and OpenAI reclaimed ~30k CPU cores via a logging tweak. Also posted: H200/B200 price trends and DRAM/VRAM squeeze. Excludes research‑agent launches (covered as feature).

NVIDIA’s Jensen dismisses custom ASICs as “science projects,” touts AI factories

At a UBS Q&A during GTC, Jensen Huang argued that custom ASICs can’t match NVIDIA’s full‑stack “AI factory” approach, citing an internal roadmap claiming up to ~40× beyond Hopper and the ability to place $100B‑scale POs with end‑to‑end systems and supply‑chain confidence transcript highlights. For infra leads, the message is clear: buyers will be sold on time‑to‑revenue, not chip lists.

This frames procurement around platform certainty and execution risk. If you’re modeling long‑lead data center bets, build scenarios where ASIC options don’t materially lower TCO once software, networking, power, and delivery timelines are included.
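
A toy scenario model with purely hypothetical numbers (plug in your own quotes); the point is that chip capex is only one term once software effort, networking, power, and delayed revenue are included.

```python
def tco_musd(capex, sw_eng_years, net_capex, mw, years, months_delay,
             revenue_per_month=10.0, eng_cost=0.5, power_per_mw_year=1.0):
    """Total cost of ownership in $M. Every default here is illustrative."""
    power = mw * years * power_per_mw_year
    software = sw_eng_years * eng_cost
    delay = months_delay * revenue_per_month  # revenue forgone while waiting
    return capex + net_capex + power + software + delay

# Hypothetical scenarios: cheaper ASIC silicon vs. an integrated platform.
platform = tco_musd(capex=500, sw_eng_years=20, net_capex=50, mw=30,
                    years=4, months_delay=0)
asic = tco_musd(capex=350, sw_eng_years=200, net_capex=80, mw=25,
                years=4, months_delay=9)
print(f"platform: ${platform:.0f}M  asic: ${asic:.0f}M")
```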

OpenAI frees ~30,000 CPU cores by disabling a costly Fluent Bit path

OpenAI’s observability team profiled node‑level Fluent Bit and found fstatat64 calls (triggered by inotify) burning ~35% of CPU; turning that path off returned ~30,000 CPU cores to Kubernetes clusters processing nearly 10 PB/day of logs talk recap, with methodology and impact shared in the KubeCon session KubeCon talk. This is a big ops win: same workload, roughly a third less CPU.

If you run Fluent Bit, replicate the perf tracing, test inotify behavior under heavy appenders, and stage a rollout behind feature flags. Savings at this scale can fund more inference capacity immediately.
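
One starting point for that tracing, assuming strace and GNU timeout are on the node and you have ptrace permission; the PID is hypothetical, and the parsing targets strace’s -c summary table.

```python
import re
import subprocess

def syscall_counts(pid: int, seconds: int = 10) -> dict[str, int]:
    """Attach strace to `pid` for `seconds`, then parse the -c summary
    (strace prints it to stderr when SIGINT makes it detach)."""
    proc = subprocess.run(
        ["timeout", "--signal=INT", str(seconds),
         "strace", "-c", "-f", "-p", str(pid)],
        capture_output=True, text=True,
    )
    counts = {}
    # Summary rows look like: " 35.12  1.234567  12  98765  123 fstatat64"
    for line in proc.stderr.splitlines():
        m = re.match(r"\s*[\d.]+\s+[\d.]+\s+\d+\s+(\d+)\s+\d*\s*(\w+)$", line)
        if m and m.group(2) != "total":
            counts[m.group(2)] = int(m.group(1))
    return counts

counts = syscall_counts(pid=12345)  # hypothetical Fluent Bit PID
total = sum(counts.values()) or 1
for name, n in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name:>12}: {n} calls ({n / total:.1%} of syscalls)")
```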

Groq opens 4.5MW Sydney site to serve APAC with local inference

Groq lit up a 4.5MW data center in Sydney in partnership with Equinix Fabric, bringing low‑latency token serving to Australia and the wider APAC region launch note, with details in the company’s release press post. For teams in Australia, this cuts cross‑ocean latency and can lower per‑request costs when routing to closer endpoints.

Expect regional routing policies and capacity reservations to matter. If you’re piloting Groq, test latency deltas from Sydney versus US/EU regions and adjust traffic shaping accordingly.
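
A minimal latency probe for that comparison; the regional URLs, model name, and key below are placeholders, since Groq’s actual routing isn’t described in these posts.

```python
import statistics
import time
import requests  # third-party: pip install requests

# Hypothetical regional endpoints; substitute what your account routes to.
ENDPOINTS = {
    "us": "https://us.api.example.com/v1/chat/completions",
    "sydney": "https://syd.api.example.com/v1/chat/completions",
}
PAYLOAD = {"model": "placeholder-model", "max_tokens": 1,
           "messages": [{"role": "user", "content": "ping"}]}
HEADERS = {"Authorization": "Bearer YOUR_KEY"}  # placeholder credential

def median_latency_ms(url: str, n: int = 10) -> float:
    """Wall-clock time for a tiny completion request, median over n tries."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        requests.post(url, json=PAYLOAD, headers=HEADERS, timeout=30)
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

for region, url in ENDPOINTS.items():
    print(f"{region}: {median_latency_ms(url):.0f} ms median")
```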

H200/B200 pricing spikes at launch, steps down later but stays elevated

Morgan Stanley exhibits circulating today show rental pricing for 8× H200 and early B200 nodes surging at launch, then stepping down as supply ramps—yet not returning to prior baselines chart thread. The takeaway for capacity planners: scarcity premiums ease, but structural demand keeps floor prices higher than last gen.

Model budgets around staged price relief, not a full reversion. Lock short terms for the peak window; renegotiate as additional capacity lands.

RAM/VRAM prices reportedly tripling in months amid AI server demand

A widely shared Gamers Nexus breakdown reports DRAM pricing up ~3× in recent months, with knock‑on effects for NAND and GPU VRAM as AI servers absorb supply; production cuts after earlier oversupply and potential manufacturer coordination are cited as drivers video note, echoed by community commentary flagging lab lock‑ins market note. This affects both server buildouts and on‑device edge AI plans.

Budget buffers for memory should widen. When speccing clusters or local inference nodes, watch lead times and consider pre‑buys on DIMMs/VRAM‑heavy SKUs before the next allocation bump.

On this page

Executive Summary
Feature Spotlight: AI‑accelerated science and research agents
🔬 Feature: AI‑accelerated science and research agents
Kosmos “AI Scientist” debuts with audited outputs and expert‑level throughput
Gemini Enterprise “Co‑Scientist” runs tournament rankings to refine research ideas
NotebookLM “Deep Research” turns broad web sweeps into structured, cited reports
Survey catalogs scientific LLMs and argues for agent loops tied to real evidence
🏭 AI factories, datacenters and ops wins
NVIDIA’s Jensen dismisses custom ASICs as “science projects,” touts AI factories
OpenAI frees ~30,000 CPU cores by disabling a costly Fluent Bit path
Groq opens 4.5MW Sydney site to serve APAC with local inference
H200/B200 pricing spikes at launch, steps down later but stays elevated
RAM/VRAM prices reportedly tripling in months amid AI server demand
🛠️ Agentic dev tooling and coding workflows
Claude Code gets a one‑line Windows installer (no WSL)
LangChain formalizes “Deep Agents” with planning, sub‑agents and memory
Amp CLI adds --mode to steer how the agent executes
mcporter compiles a Remote MCP server into a ready‑to‑run CLI
NVIDIA shows a Bash “computer‑use” agent built with LangGraph
OpenCode previews a full‑stack agent TUI with plugins and a web console
LangGraph “Swarm” demo ships an Article Explainer multi‑agent tool
🔭 Gemini 3 watch: pre‑release signals and strings
App strings tie Gemini 3 Pro image creation to Nano Banana 2
Signals converge on Gemini 3 next week; gemini‑cli is updating daily
Early tester: Gemini 3 links the “perfect” YouTube Short to answer a query
Community asks what backs Gemini 3 hype versus GPT‑5 Pro
📊 Benchmarks and how to measure agentic work
Calls grow to benchmark agents for brittleness, doom loops and tool use
A practical eval recipe: criteria, application, and automation for verifiability
GPT‑5.1 variants dethrone Claude on Design Arena
🔋 Local inference efficiency: Intelligence‑per‑Watt
IPW study: local LLMs cover 88.7% of queries; 5.3× efficiency gain, hybrid saves ~60%
🗂️ Retrieval & document AI pipelines
Gemini File Search docs land with code for stores, uploads, and grounded answers
OlmOCR‑2 uses deterministic unit tests (RLVR) to score parsing runs at scale
HF’s OCR guide adds new models and when‑to‑finetune guidance for document AI
TeaRAG’s agentic RAG keeps accuracy while cutting tokens by ~60%
💼 Enterprise adoption, pricing and ROI
OpenAI reclaimed ~30,000 CPU cores by disabling a hot Fluent Bit path
Local–cloud routing shows up to ~74% compute cost cuts and ~80% energy savings
Study: AI‑written proposals erode signals; contractor wages drop ~5%
AI agent pricing should track ROI, not SaaS seat caps
Calls grow to benchmark agentic work, not just one‑shot answers
Gmail adds context‑aware scheduling that proposes times and auto‑books
🧠 Reasoning dynamics and verifiability
Karpathy: Software 2.0 automates what you can verify, not what you can specify
RL for reasoning: entropy collapses; 600 curated problems can match ~17k
A practical framework: verifiability = rubric, application, automation
Agent eval gap: call for WHY‑failure diagnostics beyond single‑answer scores
OlmOCR2 turns parsing into RLVR with LLM‑generated unit tests as rewards
TeaRAG trims ~60% tokens while nudging EM up via fact‑graph and process DPO
Survey: scale agents by growing tasks, tools, and verifiers in one G‑E‑F loop
🎨 Creative media: relighting, style LoRAs and demos
NVIDIA’s ChronoEdit‑14B “Paint‑Brush” LoRA lands with rapid cinematic restyles
Qwen‑Edit Multi‑Angle Lighting LoRA ships controllable relighting presets
AI‑native game demo: one world model drives assets, lighting and camera
ImagineArt v1.5 release praised for sharper, more lifelike people
Grok Imagine micro‑clip shows high‑fidelity macro detail on an opal spider
🛡️ Safety, identity and governance signals
Report: Yann LeCun to leave Meta; calls LLMs a dead end, backs world models
Moonshot AI warns of Kimi impersonators; confirms official handles
Fei‑Fei Li says AGI is more marketing than science; parts exist, whole doesn’t
🤖 Embodied AI: biped agility and authenticity debate
UBTech humanoid “warehouse army” clip labeled CGI by critics; community asks for proof
LimX TRON 1 biped shows agile locomotion; community asks for voice/assistant I/O