PRIMALABS INSIDE

AI inference,
built for the agentic era.

tokens / GPU

GPUs / watt

tokens / watt

Maximized

The metric that defines the agentic era. Two variables, controlled as one.
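One way to read the three ratios above, offered as a sketch rather than the page's own notation: tokens per watt factors into the other two, so the "two variables" are tokens per GPU and GPUs per watt, and the headline metric is their product.

```latex
\[
\frac{\text{tokens}}{\text{watt}}
  \;=\;
\frac{\text{tokens}}{\text{GPU}}
  \times
\frac{\text{GPUs}}{\text{watt}}
\]
```

Raising either factor while holding the other fixed raises the product; raising both compounds.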

A self-improving inference control plane for the full host stack — engine, sampling, model, CPU, and GPU. Runs beside existing serving infrastructure — vLLM, SGLang, TensorRT-LLM. Finds the operating point no human can hold — then holds it as workloads shift.

From exascale supercomputing to agentic inference. By the team that trained the largest AI models on Frontier — the world's first exascale supercomputer.

R&D 100 2025 Winner

2× ACM Gordon Bell Finalist

$50M+ DOE Funding

87% Sustained Efficiency

§ 01 — THE AGENTIC ERA

Agents make inference non-stationary.
Static tuning cannot follow a moving optimum.

Agentic workloads are bursty, long-running, and mixed: prefill-heavy planning, decode-heavy execution, tool calls, retries, and memory. Multi-step agents consume 5–30× the tokens of a chat turn; coding and deep-research agents push 100–1,000×. Multiplied by adoption, total inference demand is accelerating faster than chips and megawatts can follow — against a GPU supply that grows in 12–24 month cycles and a power grid that grows in years.

THE INFLECTION

Token demand is accelerating. Supply is not.

Token demand climbs with agent adoption × tokens-per-task — a compounding curve. GPU shipments grow with fab capacity, in 12–24 month cycles. Grid power grows in multi-year transformer and substation timelines. The gap between the curves is where inference economics live — or die.

Chart series, indexed to 2024: token demand (agentic × adoption) · GPU supply (accelerator FLOPS) · grid power (AI data centers)

5–30×

More Tokens per Task

Agentic models consume 5–30× the tokens of a chatbot turn. Coding agents push 100–1,000× when retries and context reloads are included.

Sources: Gartner · arXiv 2604.22750

100–300 MW

Per Hyperscale Facility

A single hyperscale AI data center now draws 100–300 MW of continuous power — equivalent to the demand of a mid-sized city.

Source: IEA · Energy and AI

11 GW

Stalled, Power-Limited

Up to 11 GW of announced 2026 data center capacity remains unbuilt due to power, transformer, and grid interconnection constraints.

Source: Sightline Climate · Data Center Outlook 2026

PRE-AGENTIC

Chat. RAG. Summarization.

~2023 — 2024

Workload: Prefill-heavy · GPU-dominant · short outputs
Pattern: Predictable batch shapes
Tolerance: Latency tails forgiven · single-turn
Tuned by: Static configs · vendor defaults

AGENTIC

Coding agents. Computer-use. Deep research.

2025 →

Workload: Prefill + decode-heavy · CPU + GPU coordination · long reasoning
Pattern: Bursty · variable shape · tool-use
Tolerance: Every tail on the critical path
Tuned by: Closed-loop control · adapts as load shifts

The math is unsentimental. Token demand grows faster than chips, faster than megawatts. The only exit is up: more tokens from every GPU, more GPUs inside every watt — both, simultaneously. A static configuration cannot keep up with traffic whose shape changes by the request. An autonomous controller that senses the workload signature and re-tunes the stack in real time isn't a faster version of the old approach — it's the architecture this era requires.

§ 02 — THE MATH

More throughput AND less power per token.
Measured. Not asserted.

Performance and efficiency are the same control problem. Same model, same silicon, head-to-head against vLLM Default and the leading commercial inference providers.

THROUGHPUT

+60%

vs vLLM Default, same silicon

On gpt-oss-120b across 4× H100 Dedicated, concurrency = 256. The controller holds the fleet at its true ceiling in real time — not the conservative cap that leaves real tokens on the table.

VS NEXT-BEST DEDICATED

+27%

vs the leading commercial dedicated provider

Compared head-to-head against the fastest commercial dedicated inference on the same model and hardware. The controller reads hardware telemetry and acts through hardware and software — SLA-grade latency, no thermal throttling under load.

ACROSS THE PORTFOLIO

27–97%

throughput gain across five frontier models

Hardware-, model-, and workload-aware maximization delivers double-digit to near-doubling throughput gains over vLLM Default — from 14B dense to 675B sparse MoE — with no hardware changes and no quality regression.

Throughput vs vLLM Default · gpt-oss-120b

% faster than vLLM Default baseline · higher is better

4× H100 Dedicated · Concurrency = 256 · Decode-heavy

Source: PrimaLabs Performance Brief · May 2026 · gpt-oss-120b, 4× H100, decode-heavy, concurrency = 256

Full methodology & cross-model results on request →

Tokens per watt

Throughput is the headline. Power is what scales it.

EFFICIENCY

Tokens / watt

More useful work, same power draw.

The controller pushes throughput up and brings the power required to produce it down. The same fleet returns more revenue tokens at the same power draw — efficiency that flows directly to gross margin.
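As a rough illustration of how the metric can be computed, the sketch below divides throughput by mean power over the same measurement window. The function name, its parameters, and all sample numbers are hypothetical placeholders, not PrimaLabs measurements or APIs.

```python
from statistics import fmean


def tokens_per_watt(tokens_generated: int, window_s: float,
                    power_samples_w: list[float]) -> float:
    """Throughput (tokens/s) divided by mean power (W) over one window.

    Dimensionally this is tokens per joule; "tokens per watt" is the
    common shorthand.
    """
    if window_s <= 0 or not power_samples_w:
        raise ValueError("need a positive window and at least one power sample")
    throughput = tokens_generated / window_s
    return throughput / fmean(power_samples_w)


# Hypothetical numbers: same mean power draw, more tokens in the same window.
baseline = tokens_per_watt(1_200_000, 60.0, [2800.0, 2800.0])
tuned = tokens_per_watt(1_500_000, 60.0, [2800.0, 2800.0])
improvement = tuned / baseline  # ratio of tuned to baseline efficiency
```

Because power is held equal in this example, the efficiency ratio equals the throughput ratio; in practice both terms move and should be sampled over the same window.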

DENSITY

Nodes / megawatt

More serving capacity inside the same envelope.

More tokens per watt means more nodes fit inside the same contracted megawatt. Neoclouds win the buildout; enterprise AI platforms turn power constraints into capacity instead of capex.
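The density claim rests on simple arithmetic: if each node draws less power for the same work, more nodes fit under a fixed contract. The sketch below assumes a flat per-node draw and ignores cooling and distribution overhead; the function name and the kilowatt figures are illustrative assumptions only.

```python
def nodes_per_megawatt(node_power_kw: float, budget_mw: float = 1.0) -> int:
    """Whole nodes that fit inside a power budget at a given per-node draw."""
    if node_power_kw <= 0:
        raise ValueError("per-node power must be positive")
    return int((budget_mw * 1000.0) // node_power_kw)


# Hypothetical per-node draws, before and after tuning.
before = nodes_per_megawatt(10.0)
after = nodes_per_megawatt(8.0)
```

With these placeholder values, lowering per-node draw from 10 kW to 8 kW raises the count from 100 to 125 nodes in the same megawatt.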

ENVELOPE

Always in-budget

The controller stays inside power & thermal limits.

Throughput, latency, and tokens-per-watt are all improved without ever leaving the rack's power and thermal budget. The controller holds the operating point where every objective gains — and stays there as load shifts.

One closed loop. Reads hardware telemetry. Acts through hardware and software controls. Stays inside SLA and rack envelope at every step.

§ 03 — PRODUCT

Deploy as a control plane.
Keep the inference stack already running.

PrimaLabs is the autonomous control plane that sits above the serving stack. Orchestrators like NVIDIA Dynamo coordinate engines (vLLM, SGLang, TensorRT-LLM) within a deployment; PrimaLabs continuously tunes the operating point across the full host stack — engine, sampling, model, CPU, GPU, power, and thermal — and learns across deployments, models, and time. The two are complementary: orchestrators handle distributed serving; PrimaLabs holds the fleet at its actual ceiling.

Applications & agents: unchanged APIs

PrimaLabs control plane: observe · optimize · act

Orchestration plane: NVIDIA Dynamo · Triton

Inference engines: vLLM · SGLang · TensorRT-LLM

Kubernetes / host stack: policies & placement

GPU fleet: power · thermal · utilization

What changes after PrimaLabs is installed.

Inference stops being a configuration problem. PrimaLabs continuously searches the safe operating space, evaluates workload signatures, and applies the next-best configuration — while respecting latency, quality, power, and thermal constraints.

The result is a measurable lift in throughput and tokens-per-watt on infrastructure already in production — with benchmark evidence that can be shared with platform, finance, and customer teams.

Integrates with: vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo, Triton, Kubernetes, GPU telemetry
Optimizes: batching, scheduling, placement, power, runtime knobs
Governed by: SLA, power, thermal, and quality guardrails
Delivered as: offline benchmark → shadow-mode validation → production control
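To make "applies the next-best configuration while respecting guardrails" concrete, here is a minimal sketch of guardrail-filtered selection. It is not PrimaLabs' implementation: the `Measurement` and `Guardrails` types, the `pick_next` function, and every number are hypothetical, and a quality guardrail is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    name: str
    tokens_per_s: float
    p99_latency_ms: float
    power_w: float
    gpu_temp_c: float


@dataclass(frozen=True)
class Guardrails:
    max_p99_latency_ms: float
    max_power_w: float
    max_gpu_temp_c: float


def within(m: Measurement, g: Guardrails) -> bool:
    """True when a measured configuration violates no guardrail."""
    return (m.p99_latency_ms <= g.max_p99_latency_ms
            and m.power_w <= g.max_power_w
            and m.gpu_temp_c <= g.max_gpu_temp_c)


def efficiency(m: Measurement) -> float:
    return m.tokens_per_s / m.power_w


def pick_next(current: Measurement, candidates: list[Measurement],
              g: Guardrails) -> Measurement:
    """Return the most efficient safe candidate, or keep the current config."""
    safe = [m for m in candidates if within(m, g)]
    best = max(safe, key=efficiency, default=current)
    return best if efficiency(best) > efficiency(current) else current


# Hypothetical measurements for three candidate configurations.
limits = Guardrails(max_p99_latency_ms=500.0, max_power_w=2500.0, max_gpu_temp_c=85.0)
current = Measurement("default", 1000.0, 400.0, 2000.0, 70.0)
candidates = [
    Measurement("aggressive", 1600.0, 900.0, 2400.0, 82.0),  # breaks latency
    Measurement("balanced", 1300.0, 450.0, 2100.0, 74.0),    # safe
    Measurement("hot", 1500.0, 420.0, 2900.0, 88.0),         # breaks power, thermal
]
chosen = pick_next(current, candidates, limits)
```

The design choice worth noting is that guardrails act as a hard filter before ranking, so a faster but unsafe configuration can never win on score alone.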

For platform teams

Higher utilization and fewer emergency tuning cycles as workloads shift.

For finance teams

More revenue tokens from the same GPU and power envelope.

For customers

Better throughput under load without sacrificing latency targets.

For operators

Continuous control with explicit guardrails instead of one-time configs.

§ 04 — HOW IT WORKS

A self-improving control system.
Per GPU. Per millisecond. Better the longer it runs.

One loop, always running. Reads telemetry from CPU and GPU. Acts across the full host stack — engine, sampling, model placement, CPU and GPU. Learns from every cycle. Stays inside SLA and rack envelope at every step — installed in hours, invisible to workloads, no operator intervention required.

A closed-loop control discipline perfected at exascale, now applied to AI inference at every scale.

01 SENSE

Telemetry across CPU and GPU.

Engine config, sampling parameters, model state, CPU scheduling, GPU utilization, memory tiers — sampled in real time and modeled as a live signature of the fleet.

→

02 ACT

Autonomous control across the host stack.

The controller acts across the full stack — engine, sampling, model, CPU, GPU — closing the loop between what the workload demands and what the hardware can deliver. No static configurations. No human in the loop.

→

03 LEARN

Gains that compound, across the fleet.

The system learns the signature of every workload, every model, every deployment — and carries that knowledge forward. Performance compounds: every cycle, every customer, every release. Not a one-time tuning exercise. A platform that gets better the longer it runs.
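The page does not describe the controller's algorithm, so the following is only a toy stand-in for a sense, act, learn loop: an epsilon-greedy bandit, a standard technique swapped in for illustration. The class name, configuration names, and reward values are all hypothetical, and `sense` fakes telemetry with fixed numbers.

```python
import random


class EpsilonGreedyTuner:
    """Toy loop: try configs, track mean reward, mostly exploit the best."""

    def __init__(self, configs, epsilon=0.1, seed=0):
        self.configs = list(configs)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {c: 0 for c in self.configs}
        self.means = {c: 0.0 for c in self.configs}

    def act(self):
        """Pick the next config: untried first, then explore or exploit."""
        untried = [c for c in self.configs if self.counts[c] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.configs)
        return max(self.configs, key=lambda c: self.means[c])

    def learn(self, config, reward):
        """Fold one observation into that config's running mean."""
        self.counts[config] += 1
        self.means[config] += (reward - self.means[config]) / self.counts[config]


def sense(config):
    """Stand-in for telemetry: a fixed tokens-per-watt value per config."""
    return {"batch_32": 5.0, "batch_64": 8.0, "batch_128": 6.0}[config]


tuner = EpsilonGreedyTuner(["batch_32", "batch_64", "batch_128"])
for _ in range(200):
    cfg = tuner.act()              # ACT: apply a configuration
    tuner.learn(cfg, sense(cfg))   # SENSE the result, then LEARN from it
best = max(tuner.means, key=tuner.means.get)
```

A production controller would also need guardrail checks and a response to non-stationary rewards, which a plain running mean does not provide.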

§ 05 — BUILT FOR

Two customers.
One performance ceiling.

Purpose-built for the operators whose economics are changing fastest as workloads shift to agentic. Different customers, same performance ceiling — and the same closed-loop platform that holds them there.

Neoclouds

A performance edge customers benchmark.

As customer workloads shift to agentic — prefill- and decode-heavy, bursty, latency-sensitive — performance on existing fleets becomes the differentiator. PrimaLabs Inside is the control system that holds those fleets at measurably higher throughput and tokens-per-watt, in real time. A spec sheet number customers benchmark and sales teams can sell.

A per-GPU number that wins benchmarks — and the customers who run them.

See how a Neocloud customer benchmarked this →

Enterprise AI

An always-on performance team, embedded.

As internal AI shifts from chat to agents, tokens-per-task multiply by 10–100× and inference economics rewrite themselves overnight. PrimaLabs runs as an always-on, self-improving control plane across the fleet — extracting and holding the throughput and tokens-per-watt that internal teams would otherwise chase manually for quarters. Gains compound; engineers focus on what only humans can do.

A fleet that holds at its design point as agentic workloads multiply.

Read the enterprise pilot results →

§ 06 — THE TEAM FOR THIS MOMENT

The discipline of exascale.
Applied to agentic inference.

For two decades, the DOE national labs perfected one discipline: extracting maximum throughput from fixed hardware under hard power and thermal constraints. That is precisely the problem inference economics now faces. PrimaLabs is that discipline, applied — by the team behind ORBIT and the largest AI models ever trained on Frontier, the world's first exascale supercomputer.

Prasanna Balaprakash, PhD

Co-Founder & CEO · Technical Vision

Director of AI Programs, Oak Ridge National Laboratory
Led ORBIT: 113B-parameter foundation model · 1.6 exaFLOPS sustained on 49,152 GPUs
ACM Gordon Bell Prize Finalist 2024 (ORBIT) & 2025 (ORBIT-2 · 4.1 exaFLOPS, 98% scaling)
White House panels · Tennessee AI Advisory Council

Chaitanya Hiremath

Co-Founder, President & COO

Four-time founder · zero-to-one across enterprise SaaS, AI, and healthtech
Built and scaled a venture studio incubating early-stage startups
Entrepreneur Magazine — Top 25 People in Tech
Operator background across GTM, fundraising, and product — pre-seed to growth

The Team

Ankita Singh

Product Strategy and Ops

Dual master's from MIT — MBA + ME
Previously Bosch Ventures, BCG, Bose
Global Venturing Rising Star 2025

Romain Egele, PhD

AutoML at HPC Scale

PhD Université Paris-Saclay · 8+ years at Argonne National Lab
Co-author, trillion-parameter LLM training on Frontier (ORNL × Paris-Saclay)
Lead developer of DeepHyper · Best Paper Award, IEEE eScience 2023

Thomas Randall, PhD

HPC & GPU Optimization

PhD Clemson · GPU performance autotuning
Best Paper Award · ACM ICS 2021 (FULL-W2V)
DOE Exascale Computing Project · Argonne & Oak Ridge collaborator

98%

Strong scaling efficiency · ORBIT-2 on 65,536 Frontier GPUs

2×

ACM Gordon Bell Finalist · 2024 (ORBIT) & 2025 (ORBIT-2)

R&D 100

2025 Winner · the search engine inside PrimaLabs

$50M+

DOE research funding · 30+ peer-reviewed publications

Backed By

TheGP · Ritual Capital

§ 07 — GET STARTED

Prove your tokens-per-watt ceiling.
Start with a benchmark pilot.

Every GPU fleet is leaving tokens on the table. PrimaLabs benchmarks the workload, quantifies the upside, validates guardrails in shadow mode, and turns the winning configuration into a production control plane.

01 · Baseline

Replay the workload.

Capture prompt mix, concurrency, model, hardware, latency targets, and power envelope.

02 · Optimize

Find the frontier.

Search safe runtime, scheduler, placement, and hardware policies against the chosen objectives.

03 · Validate

Prove the guardrails.

Compare throughput, tokens-per-watt, latency distribution, and thermal behavior — head-to-head.

04 · Deploy

Control in production.

Move from benchmark report → shadow mode → bounded production optimization.

Buyer-safe by design: benchmark first, prove methodology, validate in shadow mode, then enable production controls inside explicit SLA, quality, power, and thermal guardrails.
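The staged path above can be pictured as a small promotion gate. This is a sketch under assumptions, not the product's rollout logic: the `Stage` enum, the `next_stage` function, and the `min_gain_pct` threshold are hypothetical names introduced here.

```python
from enum import Enum


class Stage(Enum):
    BENCHMARK = 1
    SHADOW = 2
    PRODUCTION = 3


def next_stage(stage: Stage, guardrails_ok: bool, gain_pct: float,
               min_gain_pct: float = 0.0) -> Stage:
    """Advance one stage only when guardrails hold and the gain clears the bar.

    Otherwise stay at the current stage; PRODUCTION is the final stage.
    """
    if stage is Stage.PRODUCTION:
        return stage
    if not guardrails_ok or gain_pct <= min_gain_pct:
        return stage
    return Stage(stage.value + 1)


# Hypothetical pilot: the benchmark shows a gain, then shadow mode confirms it.
after_benchmark = next_stage(Stage.BENCHMARK, guardrails_ok=True, gain_pct=12.0)
after_shadow = next_stage(after_benchmark, guardrails_ok=True, gain_pct=9.0)
blocked = next_stage(Stage.SHADOW, guardrails_ok=False, gain_pct=30.0)
```

Promotion is one step at a time, so a large measured gain cannot skip shadow-mode validation, and a guardrail failure holds the rollout regardless of the gain.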

Review benchmark methodology
