AI inference,
built for the agentic era.
A self-improving inference control plane for the full host stack — engine, sampling, model, CPU, and GPU. Runs beside existing serving infrastructure — vLLM, SGLang, TensorRT-LLM. Finds the operating point no human can hold — then holds it as workloads shift.
Agents make inference non-stationary.
Static tuning cannot follow a moving optimum.
Agentic workloads are bursty, long-running, and mixed: prefill-heavy planning, decode-heavy execution, tool calls, retries, and memory. Multi-step agents consume 5–30× the tokens of a chat turn; coding and deep-research agents push 100–1,000×. Multiplied by adoption, total inference demand is accelerating faster than chips and megawatts can follow — against a GPU supply that grows in 12–24 month cycles and a power grid that grows in years.
Token demand climbs with agent adoption × tokens-per-task — a compounding curve. GPU shipments grow with fab capacity, in 12–24 month cycles. Grid power grows in multi-year transformer and substation timelines. The gap between the curves is where inference economics live — or die.
Chart: token demand (agentic × adoption), GPU supply (accelerator FLOPS), and grid power (AI data centers), each series indexed to 2024.
- Workload: Prefill-heavy · GPU-dominant · short outputs
- Pattern: Predictable batch shapes
- Tolerance: Latency tails forgiven · single-turn
- Tuned by: Static configs · vendor defaults
- Workload: Prefill + decode-heavy · CPU + GPU coordination · long reasoning
- Pattern: Bursty · variable shape · tool-use
- Tolerance: Every tail on the critical path
- Tuned by: Closed-loop control · adapts as load shifts
The math is unsentimental. Token demand grows faster than chips, faster than megawatts. The only exit is up: more tokens from every GPU and more GPUs inside every megawatt, both at once. A static configuration cannot keep up with traffic whose shape changes from request to request. An autonomous controller that senses the workload signature and re-tunes the stack in real time isn't a faster version of the old approach; it's the architecture this era requires.
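To make "workload signature" concrete, here is a minimal sketch of one way such a signature could be computed from a window of recent requests. It is an illustration only, not the PrimaLabs method: `RequestSample`, `workload_signature`, and the three chosen features (prefill fraction, burstiness, mean output length) are all hypothetical names and assumptions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, Sequence


@dataclass(frozen=True)
class RequestSample:
    """Hypothetical record of one request seen by the serving engine."""
    arrival_s: float      # arrival time in seconds
    prompt_tokens: int    # tokens processed during prefill
    output_tokens: int    # tokens generated during decode


def workload_signature(samples: Sequence[RequestSample]) -> Dict[str, float]:
    """Summarize a window of requests as a few illustrative features."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to measure arrival gaps")

    prompt = sum(s.prompt_tokens for s in samples)
    output = sum(s.output_tokens for s in samples)
    total = prompt + output

    # Burstiness as the coefficient of variation of inter-arrival gaps:
    # 0 for perfectly even traffic, larger values for burstier traffic.
    arrivals = sorted(s.arrival_s for s in samples)
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean_gap = mean(gaps)
    burstiness = pstdev(gaps) / mean_gap if mean_gap > 0 else 0.0

    return {
        "prefill_fraction": prompt / total if total else 0.0,
        "burstiness": burstiness,
        "mean_output_tokens": output / len(samples),
    }
```

A controller could recompute these features over a sliding window and treat a large shift in any of them as a cue to re-evaluate the current configuration.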
More throughput AND less power per token.
Measured. Not asserted.
Performance and efficiency are the same control problem. Same model, same silicon, head-to-head against vLLM Default and the leading commercial inference providers.
vs vLLM Default, same silicon
On gpt-oss-120b across 4× H100 Dedicated, concurrency = 256. The controller holds the fleet at its true ceiling in real time — not the conservative cap that leaves real tokens on the table.
vs the leading commercial dedicated provider
Compared head-to-head against the fastest commercial dedicated inference on the same model and hardware. The controller reads hardware telemetry and acts through hardware and software — SLA-grade latency, no thermal throttling under load.
throughput gain across five frontier models
Hardware-, model-, and workload-aware tuning delivers throughput gains over vLLM Default that range from double-digit percentages to nearly 2×, on models from 14B dense to 675B sparse MoE, with no hardware changes and no quality regression.
Throughput is the headline. Power is what scales it.
Deploy as a control plane.
Keep the inference stack already running.
PrimaLabs is the autonomous control plane that sits above the serving stack. Orchestrators like NVIDIA Dynamo coordinate engines (vLLM, SGLang, TensorRT-LLM) within a deployment; PrimaLabs continuously tunes the operating point across the full host stack — engine, sampling, model, CPU, GPU, power, and thermal — and learns across deployments, models, and time. The two are complementary: orchestrators handle distributed serving; PrimaLabs holds the fleet at its actual ceiling.
What changes after PrimaLabs is installed.
Inference stops being a configuration problem. PrimaLabs continuously searches the safe operating space, evaluates workload signatures, and applies the next-best configuration — while respecting latency, quality, power, and thermal constraints.
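One step of such a guardrailed search might look like the sketch below. It assumes each candidate configuration has already been measured, and it uses hypothetical names throughout (`Guardrails`, `Candidate`, `is_safe`, `next_best`). A real controller would collect these metrics live and search a much larger space.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical limits every applied configuration must respect."""
    max_p99_latency_ms: float
    max_power_w: float
    max_gpu_temp_c: float
    min_quality_score: float


@dataclass(frozen=True)
class Candidate:
    """Hypothetical measured result for one candidate configuration."""
    name: str
    tokens_per_s: float
    p99_latency_ms: float
    power_w: float
    gpu_temp_c: float
    quality_score: float


def is_safe(c: Candidate, g: Guardrails) -> bool:
    """True only if the candidate stays inside every guardrail."""
    return (
        c.p99_latency_ms <= g.max_p99_latency_ms
        and c.power_w <= g.max_power_w
        and c.gpu_temp_c <= g.max_gpu_temp_c
        and c.quality_score >= g.min_quality_score
    )


def next_best(candidates: Iterable[Candidate], g: Guardrails,
              current: Candidate) -> Candidate:
    """Pick the highest-throughput safe candidate, else keep the current one."""
    safe = [c for c in candidates if is_safe(c, g)]
    if not safe:
        return current
    best = max(safe, key=lambda c: c.tokens_per_s)
    return best if best.tokens_per_s > current.tokens_per_s else current
```

Treating the guardrails as a hard filter rather than a penalty term means a faster configuration that breaks latency, power, thermal, or quality limits is never selected, however large its throughput gain.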
The result is a measurable lift in throughput and tokens-per-watt on infrastructure already in production — with benchmark evidence that can be shared with platform, finance, and customer teams.
- Integrates with: vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo, Triton, Kubernetes, GPU telemetry
- Optimizes: batching, scheduling, placement, power, runtime knobs
- Governed by: SLA, power, thermal, and quality guardrails
- Delivered as: offline benchmark → shadow-mode validation → production control
A self-improving control system.
Per GPU. Per millisecond. Better the longer it runs.
One loop, always running. Reads telemetry from CPU and GPU. Acts across the full host stack — engine, sampling, model placement, CPU and GPU. Learns from every cycle. Stays inside SLA and rack envelope at every step — installed in hours, invisible to workloads, no operator intervention required.
— a closed-loop control discipline perfected at exascale, now applied to AI inference at every scale.
Telemetry across CPU and GPU.
Engine config, sampling parameters, model state, CPU scheduling, GPU utilization, memory tiers — sampled in real time and modeled as a live signature of the fleet.
Autonomous control across the host stack.
The controller acts across the full stack — engine, sampling, model, CPU, GPU — closing the loop between what the workload demands and what the hardware can deliver. No static configurations. No human in the loop.
Gains that compound, across the fleet.
The system learns the signature of every workload, every model, every deployment — and carries that knowledge forward. Performance compounds: every cycle, every customer, every release. Not a one-time tuning exercise. A platform that gets better the longer it runs.
Two customers.
One performance ceiling.
Purpose-built for the operators whose economics are changing fastest as workloads shift to agentic. Different customers, same performance ceiling — and the same closed-loop platform that holds them there.
A performance edge customers benchmark.
As customer workloads shift to agentic — prefill- and decode-heavy, bursty, latency-sensitive — performance on existing fleets becomes the differentiator. PrimaLabs Inside is the control system that holds those fleets at measurably higher throughput and tokens-per-watt, in real time. A spec sheet number customers benchmark and sales teams can sell.
An always-on performance team, embedded.
As internal AI shifts from chat to agents, tokens per task grow 10–100× and inference economics rewrite themselves overnight. PrimaLabs runs as an always-on, self-improving control plane across the fleet, extracting and holding the throughput and tokens-per-watt that internal teams would otherwise chase manually for quarters. Gains compound; engineers focus on what only humans can do.
The discipline of exascale.
Applied to agentic inference.
For two decades, the DOE national labs perfected one discipline: extracting maximum throughput from fixed hardware under hard power and thermal constraints. That is precisely the problem inference economics now faces. PrimaLabs is that discipline, applied — by the team behind ORBIT and the largest AI models ever trained on Frontier, the world's first exascale supercomputer.
- Director of AI Programs, Oak Ridge National Laboratory
- Led ORBIT: 113B-parameter foundation model · 1.6 exaFLOPS sustained on 49,152 GPUs
- ACM Gordon Bell Prize Finalist 2024 (ORBIT) & 2025 (ORBIT-2 · 4.1 exaFLOPS, 98% scaling)
- White House panels · Tennessee AI Advisory Council
- Four-time founder · zero-to-one across enterprise SaaS, AI, and healthtech
- Built and scaled a venture studio incubating early-stage startups
- Entrepreneur Magazine — Top 25 People in Tech
- Operator background across GTM, fundraising, and product — pre-seed to growth
- Dual master's from MIT — MBA + ME
- Previously Bosch Ventures, BCG, Bose
- Global Venturing Rising Star 2025
- PhD Université Paris-Saclay · 8+ years at Argonne National Lab
- Co-author, trillion-parameter LLM training on Frontier (ORNL × Paris-Saclay)
- Lead developer of DeepHyper · Best Paper Award, IEEE eScience 2023
- PhD Clemson · GPU performance autotuning
- Best Paper Award · ACM ICS 2021 (FULL-W2V)
- DOE Exascale Computing Project · Argonne & Oak Ridge collaborator
Prove your tokens-per-watt ceiling.
Start with a benchmark pilot.
Every GPU fleet is leaving tokens on the table. PrimaLabs benchmarks the workload, quantifies the upside, validates guardrails in shadow mode, and turns the winning configuration into a production control plane.
Replay the workload.
Capture prompt mix, concurrency, model, hardware, latency targets, and power envelope.
Find the frontier.
Search safe runtime, scheduler, placement, and hardware policies against the chosen objectives.
Prove the guardrails.
Compare throughput, tokens-per-watt, latency distribution, and thermal behavior — head-to-head.
Control in production.
Move from benchmark report → shadow mode → bounded production optimization.
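The head-to-head comparison in the steps above can be sketched as a small metrics summary. This is an illustration under stated assumptions, not the PrimaLabs benchmark harness: `RunResult`, `percentile`, `summarize`, and `relative_change` are hypothetical names, and "tokens-per-watt" is computed as tokens per joule, since tokens per second divided by watts reduces to tokens per joule.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class RunResult:
    """Hypothetical record of one benchmark run on a replayed workload."""
    label: str
    total_tokens: int
    duration_s: float
    energy_j: float                 # energy drawn over the run, in joules
    latencies_ms: Sequence[float]   # per-request latencies


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(run: RunResult) -> Dict[str, float]:
    """Throughput, energy efficiency, and latency distribution for one run."""
    return {
        "tokens_per_s": run.total_tokens / run.duration_s,
        "tokens_per_joule": run.total_tokens / run.energy_j,
        "p50_ms": percentile(run.latencies_ms, 50),
        "p99_ms": percentile(run.latencies_ms, 99),
    }


def relative_change(baseline: RunResult, tuned: RunResult) -> Dict[str, float]:
    """Fractional change of each metric, tuned versus baseline."""
    b, t = summarize(baseline), summarize(tuned)
    return {key: (t[key] - b[key]) / b[key] for key in b}
```

Reporting the latency percentiles next to throughput and efficiency keeps the comparison honest: a gain in tokens per second that comes with a worse p99 shows up in the same table.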