Agentic state space search for optimal inference performance

Agentic AI that systematically and exhaustively profiles, optimizes, and validates your models on your compute platform, with your data and your traffic patterns, within your accuracy and real-time constraints.

The Market Reality

🏦

Financial Services

Inference costs at major financial institutions exceed $450K per month. Risk models, fraud detection, and trading systems run continuously on NVIDIA infrastructure. Production deployments with variable sequence lengths and domain-specific fine-tuning typically run 30-40% below published benchmarks. A 10% optimization represents over $500K in annual savings, yet most teams lack visibility into the performance characteristics of their inference layer.

🏭

Manufacturing

48% of manufacturers report difficulty filling high-skill GPU optimization roles. Quality control, predictive maintenance, and production planning increasingly depend on real-time inference. The specialized knowledge required to optimize these workloads—GPU architecture, memory hierarchies, kernel optimization—is scarce. Most manufacturers lack internal expertise and must choose between overpaying for compute or accepting suboptimal performance.

🛡️

Defense & Autonomous Systems

Defense applications require real-time inference in disconnected, disrupted, and limited (DDIL) environments where cloud offloading isn't an option. Autonomous platforms must run dozens of neural networks in parallel on embedded GPUs. Every inefficiency reduces safety margins or requires more expensive hardware. Performance optimization is an architectural necessity, not an optional enhancement.

How It Works

An AI agent operates in your staging environment with your actual data and traffic patterns. It profiles systematically, identifies bottlenecks, proposes optimizations, validates accuracy, and iterates. Work that would require months of specialized engineering proceeds methodically over days.

Hardware-Level Profiling

The agent deploys in your staging environment and profiles your inference using NVIDIA NCU to capture hardware-level metrics. It works with your data, your models, and your traffic patterns—not synthetic benchmarks. No code changes are required. The profiling methodology is informed by 25 years of experience identifying which metrics matter and how to interpret them.
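As one illustration, hardware-counter capture of this kind can be driven by wrapping NVIDIA's `ncu` command line. The sketch below only assembles the invocation; the serving script name and report path are hypothetical, and the flag choices are a common starting point that may need adjusting for your Nsight Compute version:

```python
from typing import List

def build_ncu_command(app_cmd: List[str], report_path: str) -> List[str]:
    """Assemble an `ncu` invocation that profiles every process the
    application spawns and writes a full hardware-metric report."""
    return [
        "ncu",
        "--set", "full",               # capture the full metric set
        "--target-processes", "all",   # follow worker subprocesses too
        "-f",                          # overwrite an existing report
        "-o", report_path,             # report file (.ncu-rep)
        *app_cmd,                      # the inference workload itself
    ]

# Hypothetical serving entry point -- substitute your own.
cmd = build_ncu_command(["python", "serve.py", "--model", "llama-70b"],
                        "staging_profile")
print(" ".join(cmd))
```

Building the command as a list (rather than a shell string) keeps paths with spaces safe if it is later handed to `subprocess.run`.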

Bottleneck Identification

The agent analyzes comprehensive NCU performance profiles to identify architectural constraints. Is your workload memory-bound, compute-bound, cache-limited, bank-conflicted, or scheduling-constrained? Different bottlenecks require different solutions. The agent's analysis is grounded in understanding GPU architecture at the level required to distinguish between these constraint types.
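The memory-bound versus compute-bound distinction is, at its core, a roofline comparison: a kernel whose arithmetic intensity (FLOPs per byte moved from DRAM) falls below the machine balance point (peak FLOPs divided by peak bandwidth) cannot be compute-limited. A minimal sketch, using illustrative H100-class peak numbers rather than measured ones:

```python
def classify_kernel(flops: float, dram_bytes: float,
                    peak_flops: float, peak_bw: float) -> str:
    """Roofline-style first cut: compare the kernel's arithmetic
    intensity against the machine's balance point."""
    intensity = flops / dram_bytes   # FLOPs per byte of DRAM traffic
    balance = peak_flops / peak_bw   # (FLOPs/s) / (bytes/s)
    return "memory-bound" if intensity < balance else "compute-bound"

# Illustrative peaks only (order of magnitude for an H100-class GPU):
PEAK_FLOPS = 1.0e15   # ~1 PFLOP/s dense BF16
PEAK_BW = 3.35e12     # ~3.35 TB/s HBM bandwidth

# Decode-phase GEMV: roughly 2 FLOPs per weight byte read.
print(classify_kernel(flops=2e9, dram_bytes=1e9,
                      peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW))
```

With these numbers the balance point is roughly 300 FLOPs/byte, so the decode-phase example at 2 FLOPs/byte lands firmly on the memory-bound side; cache, bank-conflict, and scheduling constraints require finer-grained counters than this first cut uses.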

Configuration Space Search

The agent systematically explores the configuration space: batch sizes, precision settings, KV-cache strategies, kernel selection, sequence length management. It proposes optimizations, measures performance and accuracy, validates results, documents changes, and iterates. The search continues until no further improvements are found within your specified constraints.
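The search loop just described can be sketched as a constrained sweep. Everything here is schematic: `measure` stands in for a real benchmark run against your traffic, and the configuration axes, constraint numbers, and stub model are placeholders, not the product's actual search strategy:

```python
import itertools
from typing import Callable, Dict, Optional, Tuple

def search_configs(measure: Callable[[Dict], Tuple[float, float]],
                   max_latency_ms: float,
                   min_accuracy: float) -> Optional[Dict]:
    """Exhaustively score configurations; keep the highest-throughput
    one that satisfies the latency and accuracy constraints."""
    axes = {
        "batch_size": [1, 4, 16, 64],
        "precision": ["fp16", "fp8"],
        "kv_cache": ["full", "paged"],
    }
    best, best_tput = None, 0.0
    for values in itertools.product(*axes.values()):
        cfg = dict(zip(axes.keys(), values))
        tput, accuracy = measure(cfg)                 # tokens/s, eval score
        latency_ms = cfg["batch_size"] / tput * 1000  # crude latency proxy
        if latency_ms <= max_latency_ms and accuracy >= min_accuracy:
            if tput > best_tput:
                best, best_tput = cfg, tput
    return best

# Stub benchmark: larger batches raise throughput, fp8 raises it further
# but costs a little accuracy. Replace with a real staging-environment run.
def fake_measure(cfg):
    tput = 20.0 * cfg["batch_size"] * (1.3 if cfg["precision"] == "fp8" else 1.0)
    acc = 0.90 - (0.02 if cfg["precision"] == "fp8" else 0.0)
    return tput, acc

print(search_configs(fake_measure, max_latency_ms=100.0, min_accuracy=0.89))
```

In practice an exhaustive product over all axes grows quickly, which is why the agent iterates from profiling evidence rather than enumerating blindly; the sketch only shows the constraint-filtered selection step.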

Accuracy Validation

Every optimization is validated against perplexity loss and task-specific metrics. You define the tradeoff surface: some applications require zero accuracy loss, others can accept measured degradation for throughput gains. The agent optimizes within your boundaries and provides complete visibility into the speed-accuracy tradeoff for each configuration tested.
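Perplexity-based validation of this kind reduces to a small amount of arithmetic: perplexity is the exponential of the mean negative log-likelihood per token, and a candidate configuration passes if its perplexity stays within a user-defined fraction of the baseline. A sketch, with arbitrary example numbers for the log-probabilities and tolerance:

```python
import math
from typing import List

def perplexity(token_logprobs: List[float]) -> float:
    """exp(mean negative log-likelihood) over a held-out token stream."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def within_budget(baseline_ppl: float, candidate_ppl: float,
                  max_rel_increase: float) -> bool:
    """Accept the optimization only if perplexity degrades by no more
    than the caller's relative budget (0.0 means zero loss allowed)."""
    return candidate_ppl <= baseline_ppl * (1.0 + max_rel_increase)

baseline = perplexity([-1.2, -0.8, -1.0, -1.5])
quantized = perplexity([-1.25, -0.85, -1.05, -1.55])
print(within_budget(baseline, quantized, max_rel_increase=0.06))
```

Setting `max_rel_increase=0.0` encodes the zero-accuracy-loss policy mentioned above; task-specific metrics would plug into the same accept/reject shape.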

Profile
→
Analyze
→
Optimize
→
Test
→
Validate
→
Repeat
Systematic iteration until saturation within specified constraints

Graded Recommendations

Optimization opportunities exist at multiple levels of complexity and time horizon. We provide a tiered system of recommendations, ranging from immediate configuration wins that can be implemented within days to a structured roadmap for long-term architectural improvements and hardware evolution.

Each tier is validated against your accuracy requirements and constraints. You decide which optimizations to pursue based on engineering resources, risk tolerance, and expected return. The recommendations provide clear visibility into the tradeoff between implementation effort and performance gains at each level.
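One way to picture the tiers: each recommendation carries an effort estimate, an expected gain, and its validation status, and the report groups these records by tier. The structure and example entries below are an invented illustration, not the actual report schema:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    tier: int            # 1 = config change (days) ... 3 = architectural (quarters)
    description: str
    effort_days: int
    expected_gain_pct: float
    accuracy_validated: bool

recs = [
    Recommendation(1, "Enable paged KV-cache", 2, 8.0, True),
    Recommendation(2, "FP8 quantization of attention blocks", 15, 22.0, True),
    Recommendation(3, "Migrate decode to a dedicated hardware pool", 90, 35.0, False),
]

# Surface quick wins first: validated items, lowest effort per point of gain.
quick_wins = sorted(
    (r for r in recs if r.accuracy_validated),
    key=lambda r: r.effort_days / r.expected_gain_pct,
)
print([r.description for r in quick_wins])
```

The effort-per-point-of-gain ordering is one simple way to make the effort/performance tradeoff in the text concrete; risk tolerance would add a second filter on top.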

Why This Matters

$16.9B
Enterprise inference spending across financial services, manufacturing, defense, autonomous systems, and telecom. Organizations deploying large models on NVIDIA infrastructure face a consistent gap between published benchmarks and production performance.
58→95
Tokens per second. Published benchmarks claim Llama 2 70B achieves 95 tokens/sec on H100. Production deployments with variable sequence lengths, heterogeneous batch sizes, and domain-specific fine-tuning typically measure 58 tokens/sec. That 37-token gap represents measurable inefficiency.
Scarce
GPU performance engineering expertise. Engineers who understand memory hierarchy, roofline analysis, kernel optimization, and GPU architecture are rare. The knowledge takes years to develop. Most organizations cannot hire them, which is why inference optimization remains deprioritized despite significant cost impact.

Who We Are

Sujatha Kashyap is a systems performance engineer with 25 years of experience at the hardware-software interface. She holds a Ph.D. in distributed systems and has been granted dozens of patents in systems architecture across memory, cache optimization, virtualization, and resource orchestration.

At IBM, she led performance optimization across POWER4 through POWER10, spending decades on post-silicon validation, memory hierarchy tuning, and cache optimization. At Meta, she architected network-on-chip performance for AR SoCs under extreme thermal and power constraints. She has worked with enterprise workloads across diverse hardware and software stacks, addressing the same fundamental performance challenges in different contexts.

Across thousands of deployments, she has identified the recurring patterns: which metrics matter, when memory bandwidth is the constraint versus L2 cache efficiency, how to distinguish between compute-bound and memory-bound workloads. She has optimized for single-thread latency, aggregate throughput, SMT scheduling, NUMA effects, and interconnect congestion.

From direct experience, she understands that organizations of all sizes forfeit significant value because they lack access to performance engineering expertise. Varcas encodes 25 years of pattern recognition into an agentic system that makes this knowledge accessible to any organization.

The Philosophy of Varcas

Varcas means refined fire in Sanskrit—what emerges when you subject scattered thought to disciplined focus. It is the distilled result of effortful discipline: raw potential transformed through systematic application into something precise and useful.

Our logo embodies this transformation: kundalini energy coiled at the root chakra, ascending through the spine to the crown chakra. Raw energy transformed into intelligence: a precise metaphor for what we do. Scattered compute resources move through disciplined optimization to become realized performance. The transformation requires guidance through each level, each constraint, each architectural layer.

Performance optimization is fundamentally an exercise in refinement. You begin with scattered observations: metrics, profiles, bottlenecks. Through disciplined analysis, you distill signal from noise, identify root causes, and systematically remove inefficiency. What remains is closer to the theoretical optimum—compute, realized.

This philosophy informs both the methodology and the choice of who should do this work. Would you hand a toddler the keys to your Mercedes? Performance optimization at this level requires not just technical knowledge but the judgment that comes from decades of making consequential decisions about production systems. The agent is effective because it is informed by someone who has spent 25 years learning when to optimize, what to optimize, and when to stop—someone who understands that the goal is not perfection but disciplined improvement within real-world constraints.

Let's Talk

We are working across Austin, Bangalore, and Silicon Valley, seeking design partners to validate our agentic optimization approach across different industries and hardware configurations.