PRIMALABS INSIDE

AI inference,
built for the agentic era.

A self-improving inference control plane for the full host stack — engine, sampling, model, CPU, and GPU. Runs beside existing serving infrastructure — vLLM, SGLang, TensorRT-LLM. Finds the operating point no human can hold — then holds it as workloads shift.

From exascale supercomputing to agentic inference. By the team that trained the largest AI models on Frontier — the world's first exascale supercomputer.
R&D 100 2025 Winner
ACM Gordon Bell Finalist
$50M+ DOE Funding
87% Sustained Efficiency
§ 01 — THE AGENTIC ERA

Agents make inference non-stationary.
Static tuning cannot follow a moving optimum.

Agentic workloads are bursty, long-running, and mixed: prefill-heavy planning, decode-heavy execution, tool calls, retries, and memory. Multi-step agents consume 5–30× the tokens of a chat turn; coding and deep-research agents push 100–1,000×. Multiplied by adoption, total inference demand is accelerating faster than chips and megawatts can follow — against a GPU supply that grows in 12–24 month cycles and a power grid that grows in years.

THE INFLECTION
Token demand is compounding. Chip and power supply are not.

Token demand climbs with agent adoption × tokens-per-task — a compounding curve. GPU shipments grow with fab capacity, in 12–24 month cycles. Grid power grows in multi-year transformer and substation timelines. The gap between the curves is where inference economics live — or die.

FIG. 1 · Indexed projections, 2024–2030 (indexed to 2024, log scale): AI inference token demand vs GPU supply vs grid power for AI data centers.
  • Token demand (agentic × adoption): ~100× by 2030
  • GPU supply (accelerator FLOPS): ~8× by 2030
  • Grid power (AI data centers): 2.5× by 2030
  • The gap between token demand and GPU supply: ~12× by 2030
Sources: Goldman Sachs · JPMorgan · IEA · NVIDIA
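As a rough check on the figure, the short sketch below converts the three endpoint multiples into constant yearly growth rates and recomputes the demand-to-supply gap. It assumes smooth compounding between 2024 and 2030, which the figure does not state; the function name `annual_growth` is illustrative.

```python
# Endpoint multiples read from Fig. 1 (2024 = 1x, projected to 2030).
YEARS = 2030 - 2024

def annual_growth(multiple: float, years: int = YEARS) -> float:
    """Constant yearly growth rate that compounds to `multiple` over `years`."""
    return multiple ** (1 / years) - 1

token_demand, gpu_supply, grid_power = 100.0, 8.0, 2.5

# Ratio of the demand index to the GPU supply index in 2030.
gap = token_demand / gpu_supply

for name, multiple in [
    ("token demand", token_demand),
    ("GPU supply", gpu_supply),
    ("grid power", grid_power),
]:
    print(f"{name}: {multiple:g}x by 2030, about {annual_growth(multiple):.0%} per year")
print(f"demand / GPU supply gap: {gap:.1f}x")
```

The ratio works out to 12.5, in line with the ~12× gap labelled on the figure.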
5–30×
More Tokens per Task
Agentic models consume 5–30× the tokens of a chatbot turn. Coding agents push 100–1,000× when retries and context reloads are included.
100–300 MW
Per Hyperscale Facility
A single hyperscale AI data center now draws 100–300 MW of continuous power — equivalent to the demand of a mid-sized city.
11 GW
Stalled, Power-Limited
Up to 11 GW of announced 2026 data center capacity remains unbuilt due to power, transformer, and grid interconnection constraints.
PRE-AGENTIC
Chat. RAG. Summarization.
~2023 — 2024
  • Workload: Prefill-heavy · GPU-dominant · short outputs
  • Pattern: Predictable batch shapes
  • Tolerance: Latency tails forgiven · single-turn
  • Tuned by: Static configs · vendor defaults
AGENTIC
Coding agents. Computer-use. Deep research.
2025 →
  • Workload: Prefill + decode-heavy · CPU + GPU coordination · long reasoning
  • Pattern: Bursty · variable shape · tool-use
  • Tolerance: Every tail on the critical path
  • Tuned by: Closed-loop control · adapts as load shifts

The math is unsentimental. Token demand grows faster than chips, faster than megawatts. The only exit is up: more tokens from every GPU and more GPUs inside every megawatt, both at once. A static configuration cannot keep up with traffic whose shape changes by the request. An autonomous controller that senses the workload signature and re-tunes the stack in real time isn't a faster version of the old approach; it's the architecture this era requires.

§ 02 — THE MATH

More throughput AND less power per token.
Measured. Not asserted.

Performance and efficiency are the same control problem. Same model, same silicon, head-to-head against vLLM Default and the leading commercial inference providers.

THROUGHPUT
+60%

vs vLLM Default, same silicon

On gpt-oss-120b across 4× H100 Dedicated, concurrency = 256. The controller holds the fleet at its true ceiling in real time — not the conservative cap that leaves real tokens on the table.

VS NEXT-BEST DEDICATED
+27%

vs the leading commercial dedicated provider

Compared head-to-head against the fastest commercial dedicated inference on the same model and hardware. The controller reads hardware telemetry and acts through hardware and software — SLA-grade latency, no thermal throttling under load.

ACROSS THE PORTFOLIO
27–97%

throughput gain across five frontier models

Hardware-, model-, and workload-aware maximization delivers double-digit to near-doubling throughput gains over vLLM Default — from 14B dense to 675B sparse MoE — with no hardware changes and no quality regression.

Throughput vs vLLM Default · gpt-oss-120b
% faster than vLLM Default baseline · higher is better
4× H100 Dedicated · Concurrency = 256 · Decode-heavy
Figure: Throughput comparison vs vLLM Default baseline on gpt-oss-120b, 4× H100 Dedicated (horizontal bar chart; axis: output tokens / second, 0 to 25k).
  • PrimaLabs · Dedicated · 4× H100: +60% faster
  • Fireworks · Dedicated · 4× H100: +26% faster
  • vLLM Default · Dedicated · 4× H100: baseline
  • Fireworks · Serverless: 35% slower
  • Together · Serverless: 81% slower
Source: PrimaLabs Performance Brief · May 2026 · gpt-oss-120b, 4× H100, decode-heavy, concurrency = 256
Full methodology & cross-model results on request →
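The +27% figure against the next-best dedicated provider can be reproduced from the two chart values, assuming both percentages are measured against the same vLLM Default baseline and that the headline number is derived this way. The helper name `relative_gain` is illustrative.

```python
def relative_gain(a_vs_base: float, b_vs_base: float) -> float:
    """Gain of system A over system B, given each one's gain over a shared baseline."""
    return (1 + a_vs_base) / (1 + b_vs_base) - 1

primalabs_dedicated = 0.60   # +60% vs vLLM Default (from the chart)
fireworks_dedicated = 0.26   # +26% vs vLLM Default (from the chart)

gain = relative_gain(primalabs_dedicated, fireworks_dedicated)
print(f"PrimaLabs vs next-best dedicated: {gain:+.0%}")
```

Percentages over a shared baseline divide rather than subtract, which is why the result is about 27% and not 60 − 26 = 34%.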
Tokens per watt

Throughput is the headline. Power is what scales it.

EFFICIENCY
Tokens / watt
More useful work, same power draw.
The controller pushes throughput up and brings the power required to produce it down. The same fleet returns more revenue tokens at the same power draw — efficiency that flows directly to gross margin.
DENSITY
Nodes / megawatt
More serving capacity inside the same envelope.
More tokens per watt means more nodes fit inside the same contracted megawatt. Neoclouds win the buildout; enterprise AI platforms turn power constraints into capacity instead of capex.
ENVELOPE
Always in-budget
The controller stays inside power & thermal limits.
Throughput, latency, and tokens-per-watt are all improved without ever leaving the rack's power and thermal budget. The controller holds the operating point where every objective gains — and stays there as load shifts.
One closed loop. Reads hardware telemetry. Acts through hardware and software controls. Stays inside SLA and rack envelope at every step.
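To make the two metrics above concrete, here is a minimal sketch of how tokens per watt and nodes per megawatt might be computed. Every number in it is hypothetical and chosen only for round arithmetic; none is a PrimaLabs measurement. It also assumes density is limited by per-node power draw alone.

```python
def tokens_per_watt(tokens_per_second: float, power_watts: float) -> float:
    """Tokens per second per watt, which is the same as tokens per joule."""
    return tokens_per_second / power_watts

def nodes_per_megawatt(node_power_watts: float) -> int:
    """Whole nodes that fit inside a 1 MW power envelope."""
    return int(1_000_000 // node_power_watts)

# Hypothetical per-node figures, for illustration only.
baseline_tps, baseline_watts = 10_000.0, 10_000.0
tuned_tps, tuned_watts = 16_000.0, 8_000.0  # assumed: more throughput, lower draw

print(tokens_per_watt(baseline_tps, baseline_watts), nodes_per_megawatt(baseline_watts))
print(tokens_per_watt(tuned_tps, tuned_watts), nodes_per_megawatt(tuned_watts))
```

In this sketch the node count per megawatt rises only because the assumed per-node draw falls; at an unchanged draw, the gain shows up as tokens per watt alone.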
§ 03 — PRODUCT

Deploy as a control plane.
Keep the inference stack already running.

PrimaLabs is the autonomous control plane that sits above the serving stack. Orchestrators like NVIDIA Dynamo coordinate engines (vLLM, SGLang, TensorRT-LLM) within a deployment; PrimaLabs continuously tunes the operating point across the full host stack — engine, sampling, model, CPU, GPU, power, and thermal — and learns across deployments, models, and time. The two are complementary: orchestrators handle distributed serving; PrimaLabs holds the fleet at its actual ceiling.

What changes after PrimaLabs is installed.

Inference stops being a configuration problem. PrimaLabs continuously searches the safe operating space, evaluates workload signatures, and applies the next-best configuration — while respecting latency, quality, power, and thermal constraints.

The result is a measurable lift in throughput and tokens-per-watt on infrastructure already in production — with benchmark evidence that can be shared with platform, finance, and customer teams.

  • Integrates with: vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo, Triton, Kubernetes, GPU telemetry
  • Optimizes: batching, scheduling, placement, power, runtime knobs
  • Governed by: SLA, power, thermal, and quality guardrails
  • Delivered as: offline benchmark → shadow-mode validation → production control
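One way to picture "apply the next-best configuration while respecting guardrails" is a filter-then-maximize step. The sketch below is an illustration under stated assumptions, not the PrimaLabs implementation: the `Measurement` and `Guardrails` types, their fields, and the thresholds are all hypothetical, and a quality check is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    config: str
    tokens_per_second: float
    p99_latency_ms: float
    power_watts: float
    gpu_temp_c: float

@dataclass(frozen=True)
class Guardrails:
    max_p99_latency_ms: float
    max_power_watts: float
    max_gpu_temp_c: float

def is_safe(m: Measurement, g: Guardrails) -> bool:
    """A configuration is safe only if it stays inside every limit."""
    return (
        m.p99_latency_ms <= g.max_p99_latency_ms
        and m.power_watts <= g.max_power_watts
        and m.gpu_temp_c <= g.max_gpu_temp_c
    )

def next_best(current: Measurement, candidates: list[Measurement], g: Guardrails) -> Measurement:
    """Pick the highest-throughput safe candidate, or keep the current config."""
    safe = [m for m in candidates if is_safe(m, g)]
    best = max(safe, key=lambda m: m.tokens_per_second, default=current)
    return best if best.tokens_per_second > current.tokens_per_second else current

limits = Guardrails(max_p99_latency_ms=200.0, max_power_watts=10_000.0, max_gpu_temp_c=85.0)
current = Measurement("default", 10_000.0, 150.0, 9_000.0, 70.0)
candidates = [
    Measurement("aggressive", 20_000.0, 250.0, 9_000.0, 70.0),  # breaks the latency limit
    Measurement("balanced", 14_000.0, 180.0, 9_500.0, 75.0),
]
print(next_best(current, candidates, limits).config)
```

The faster candidate is rejected because it violates the latency limit, so the guardrails take priority over raw throughput.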
For platform teams
Higher utilization and fewer emergency tuning cycles as workloads shift.
For finance teams
More revenue tokens from the same GPU and power envelope.
For customers
Better throughput under load without sacrificing latency targets.
For operators
Continuous control with explicit guardrails instead of one-time configs.
§ 04 — HOW IT WORKS

A self-improving control system.
Per GPU. Per millisecond. Better the longer it runs.

One loop, always running. Reads telemetry from CPU and GPU. Acts across the full host stack — engine, sampling, model placement, CPU and GPU. Learns from every cycle. Stays inside SLA and rack envelope at every step — installed in hours, invisible to workloads, no operator intervention required.

A closed-loop control discipline perfected at exascale, now applied to AI inference at every scale.

01 · SENSE

Telemetry across CPU and GPU.

Engine config, sampling parameters, model state, CPU scheduling, GPU utilization, memory tiers — sampled in real time and modeled as a live signature of the fleet.

02 · ACT

Autonomous control across the host stack.

The controller acts across the full stack — engine, sampling, model, CPU, GPU — closing the loop between what the workload demands and what the hardware can deliver. No static configurations. No human in the loop.

03 · LEARN

Gains that compound, across the fleet.

The system learns the signature of every workload, every model, every deployment — and carries that knowledge forward. Performance compounds: every cycle, every customer, every release. Not a one-time tuning exercise. A platform that gets better the longer it runs.
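The sense, act, learn cycle can be sketched as a toy loop over a single knob. Everything here is an assumption made for illustration: the `sense` function is a made-up stand-in for real telemetry, batch size is the only setting tuned, and the search is a plain hill climb rather than whatever method PrimaLabs uses.

```python
def sense(batch_size: int) -> tuple[float, float]:
    """Toy telemetry model: returns (tokens_per_second, p99_latency_ms)."""
    throughput = 1000.0 * batch_size / (1 + batch_size / 64)
    latency = 20.0 + 1.5 * batch_size
    return throughput, latency

def run(signature: str, latency_slo_ms: float, memory: dict[str, int],
        default_batch: int = 8, step: int = 8, max_steps: int = 50) -> tuple[int, int]:
    """Hill-climb batch size under a latency limit; returns (batch, steps_taken)."""
    batch = memory.get(signature, default_batch)   # LEARN: warm-start from past runs
    steps_taken = 0
    for _ in range(max_steps):
        tps, _latency = sense(batch)               # SENSE: current operating point
        cand_tps, cand_latency = sense(batch + step)
        if cand_latency > latency_slo_ms or cand_tps <= tps:
            break                                  # next step is unsafe or no better
        batch += step                              # ACT: move to the better setting
        steps_taken += 1
    memory[signature] = batch                      # LEARN: remember what worked
    return batch, steps_taken

memory: dict[str, int] = {}
print(run("decode-heavy", 200.0, memory))  # cold start: climbs step by step
print(run("decode-heavy", 200.0, memory))  # warm start: already at the remembered setting
```

The second call takes no steps because the setting found earlier is stored under the workload signature, a small-scale analogue of carrying knowledge forward.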

§ 05 — BUILT FOR

Two customers.
One performance ceiling.

Purpose-built for the operators whose economics are changing fastest as workloads shift to agentic. Different customers, same performance ceiling — and the same closed-loop platform that holds them there.

Neoclouds

A performance edge customers benchmark.

As customer workloads shift to agentic — prefill- and decode-heavy, bursty, latency-sensitive — performance on existing fleets becomes the differentiator. PrimaLabs Inside is the control system that holds those fleets at measurably higher throughput and tokens-per-watt, in real time. A spec sheet number customers benchmark and sales teams can sell.

A per-GPU number that wins benchmarks — and the customers who run them.
See how a Neocloud customer benchmarked this →
Enterprise AI

An always-on performance team, embedded.

As internal AI shifts from chat to agents, tokens-per-task multiply by 10–100× and inference economics rewrite themselves overnight. PrimaLabs runs as an always-on, self-improving control plane across the fleet — extracting and holding the throughput and tokens-per-watt that internal teams would otherwise chase manually for quarters. Gains compound; engineers focus on what only humans can do.

A fleet that holds at its design point as agentic workloads multiply.
Read the enterprise pilot results →
§ 06 — THE TEAM FOR THIS MOMENT

The discipline of exascale.
Applied to agentic inference.

For two decades, the DOE national labs perfected one discipline: extracting maximum throughput from fixed hardware under hard power and thermal constraints. That is precisely the problem inference economics now faces. PrimaLabs is that discipline, applied — by the team behind ORBIT and the largest AI models ever trained on Frontier, the world's first exascale supercomputer.

Co-Founder & CEO · Technical Vision
Prasanna Balaprakash
  • Director of AI Programs, Oak Ridge National Laboratory
  • Led ORBIT: 113B-parameter foundation model · 1.6 exaFLOPS sustained on 49,152 GPUs
  • ACM Gordon Bell Prize Finalist 2024 (ORBIT) & 2025 (ORBIT-2 · 4.1 exaFLOPS, 98% scaling)
  • White House panels · Tennessee AI Advisory Council
Co-Founder, President & COO
Chaitanya Hiremath
  • Four-time founder · zero-to-one across enterprise SaaS, AI, and healthtech
  • Built and scaled a venture studio incubating early-stage startups
  • Entrepreneur Magazine — Top 25 People in Tech
  • Operator background across GTM, fundraising, and product — pre-seed to growth
The Team
Product Strategy and Ops
Ankita Singh
  • Dual master's from MIT — MBA + ME
  • Previously Bosch Ventures, BCG, Bose
  • Global Venturing Rising Star 2025
AutoML at HPC Scale
Romain Egele
  • PhD Université Paris-Saclay · 8+ years at Argonne National Lab
  • Co-author, trillion-parameter LLM training on Frontier (ORNL × Paris-Saclay)
  • Lead developer of DeepHyper · Best Paper Award, IEEE eScience 2023
HPC & GPU Optimization
Thomas Randall
  • PhD Clemson · GPU performance autotuning
  • Best Paper Award · ACM ICS 2021 (FULL-W2V)
  • DOE Exascale Computing Project · Argonne & Oak Ridge collaborator
98%
Strong scaling efficiency · ORBIT-2 on 65,536 Frontier GPUs
ACM Gordon Bell Finalist · 2024 (ORBIT) & 2025 (ORBIT-2)
R&D 100
2025 Winner · the search engine inside PrimaLabs
$50M+
DOE research funding · 30+ peer-reviewed publications
Backed By
TheGP · Ritual Capital
§ 07 — GET STARTED

Prove your tokens-per-watt ceiling.
Start with a benchmark pilot.

Every GPU fleet is leaving tokens on the table. PrimaLabs benchmarks the workload, quantifies the upside, validates guardrails in shadow mode, and turns the winning configuration into a production control plane.

01 · Baseline

Replay the workload.

Capture prompt mix, concurrency, model, hardware, latency targets, and power envelope.

02 · Optimize

Find the frontier.

Search safe runtime, scheduler, placement, and hardware policies against the chosen objectives.

03 · Validate

Prove the guardrails.

Compare throughput, tokens-per-watt, latency distribution, and thermal behavior — head-to-head.

04 · Deploy

Control in production.

Move from benchmark report → shadow mode → bounded production optimization.

Buyer-safe by design: benchmark first, prove methodology, validate in shadow mode, then enable production controls inside explicit SLA, quality, power, and thermal guardrails.
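The staged rollout can be read as a simple gate: a deployment moves forward one stage at a time, and only when the current stage's checks pass. The sketch below is a hypothetical illustration of that idea; the `Stage` names mirror the text, while `advance` and its single boolean input are assumptions.

```python
from enum import Enum

class Stage(Enum):
    BENCHMARK = "benchmark report"
    SHADOW = "shadow mode"
    PRODUCTION = "bounded production optimization"

ORDER = [Stage.BENCHMARK, Stage.SHADOW, Stage.PRODUCTION]

def advance(stage: Stage, guardrails_passed: bool) -> Stage:
    """Move one stage forward only when the current stage's checks passed."""
    if not guardrails_passed or stage is Stage.PRODUCTION:
        return stage
    return ORDER[ORDER.index(stage) + 1]

stage = Stage.BENCHMARK
stage = advance(stage, guardrails_passed=True)    # benchmark accepted
stage = advance(stage, guardrails_passed=False)   # shadow run failed a guardrail
print(stage.value)
```

A failed check leaves the deployment where it is, so production controls are never enabled by skipping shadow mode.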
Review benchmark methodology