Agentic state space search for optimal inference performance
Agentic AI that systematically and exhaustively profiles, optimizes, and validates your models on your compute platform, with your data and traffic patterns, within your accuracy and real-time constraints.
The Market Reality
Financial Services
Inference costs at major financial institutions exceed $450k per month. Risk models, fraud detection, and trading systems run continuously on NVIDIA infrastructure. Production deployments with variable sequence lengths and domain-specific fine-tuning typically achieve throughput 30-40% below published benchmarks. A 10% optimization on that spend represents over $540k in annual savings, yet most teams lack visibility into the performance characteristics of their inference layer.
Manufacturing
48% of manufacturers report difficulty filling high-skill GPU optimization roles. Quality control, predictive maintenance, and production planning increasingly depend on real-time inference. The specialized knowledge required to optimize these workloads—GPU architecture, memory hierarchies, kernel optimization—is scarce. Most manufacturers lack internal expertise and must choose between overpaying for compute or accepting suboptimal performance.
Defense & Autonomous Systems
Defense applications require real-time inference in denied, disconnected, intermittent, and limited (DDIL) environments where cloud offloading isn't an option. Autonomous platforms must run dozens of neural networks in parallel on embedded GPUs. Every inefficiency reduces safety margins or requires more expensive hardware. Performance optimization is an architectural necessity, not an optional enhancement.
How It Works
An AI agent operates in your staging environment with your actual data and traffic patterns. It profiles systematically, identifies bottlenecks, proposes optimizations, validates accuracy, and iterates. Work that would require months of specialized engineering proceeds methodically over days.
The agent deploys in your staging environment and profiles your inference using NVIDIA Nsight Compute (ncu) to capture hardware-level metrics. It works with your data, your models, and your traffic patterns—not synthetic benchmarks. No code changes are required. The profiling methodology is informed by 25 years of experience identifying which metrics matter and how to interpret them.
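To make this concrete, here is a minimal sketch of the kind of profiling pass the agent automates, assuming ncu is on the path; the entry point serve_batch.py is a hypothetical placeholder, and the metric list is a representative sample rather than the agent's full set.

```python
# Sketch: drive Nsight Compute (ncu) against an inference entry point and
# collect a few hardware counters. serve_batch.py is a hypothetical placeholder.
import subprocess

METRICS = ",".join([
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",    # compute utilization
    "dram__throughput.avg.pct_of_peak_sustained_elapsed",  # memory bandwidth
    "lts__t_sector_hit_rate.pct",                          # L2 hit rate
    "sm__warps_active.avg.pct_of_peak_sustained_active",   # achieved occupancy
])

def profile(entrypoint: str = "serve_batch.py") -> str:
    """Run one profiling pass and return ncu's CSV output for analysis."""
    cmd = [
        "ncu",
        "--metrics", METRICS,
        "--csv",                      # machine-readable output for the agent
        "--target-processes", "all",  # follow worker subprocesses too
        "python", entrypoint,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```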
The agent analyzes comprehensive NCU performance profiles to identify architectural constraints. Is your workload memory-bound, compute-bound, cache-limited, bank-conflicted, or scheduling-constrained? Different bottlenecks require different solutions. The agent's analysis is grounded in understanding GPU architecture at the level required to distinguish between these constraint types.
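A toy illustration of that constraint taxonomy follows; the thresholds are made up for the example, and the agent's actual decision rules are richer and per-kernel (bank-conflict detection, for instance, needs shared-memory counters not shown here).

```python
# Sketch: classify the dominant architectural constraint from profile metrics.
# Thresholds are illustrative assumptions, not the agent's actual rules.
def classify_bottleneck(m: dict[str, float]) -> str:
    sm = m["sm_throughput_pct"]       # % of peak SM throughput
    dram = m["dram_throughput_pct"]   # % of peak DRAM bandwidth
    l2_hit = m["l2_hit_rate_pct"]     # L2 sector hit rate
    occupancy = m["occupancy_pct"]    # achieved warp occupancy

    if dram > 80 and sm < 50:
        return "memory-bound: reduce bytes moved (quantization, fusion)"
    if sm > 80:
        return "compute-bound: lower-precision math, better kernel selection"
    if l2_hit < 50 and dram < 60:
        return "cache-limited: improve locality, tile or reorder accesses"
    if occupancy < 30:
        return "scheduling-constrained: occupancy limited by registers/shared mem"
    return "no single dominant constraint: inspect per-kernel profiles"
```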
The agent systematically explores the configuration space: batch sizes, precision settings, KV-cache strategies, kernel selection, sequence length management. It proposes optimizations, measures performance and accuracy, validates results, documents changes, and iterates. The search continues until no further improvement can be found within your specified constraints.
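In spirit, the search resembles the sketch below; the measure() and accuracy_delta() hooks are hypothetical stand-ins for the agent's benchmarking and validation stages, and the candidate values are examples only.

```python
# Sketch: exhaustive sweep over a small configuration space, keeping the
# fastest configuration that stays within the accuracy budget.
from itertools import product

BATCH_SIZES = [1, 4, 16, 64]
PRECISIONS = ["fp16", "fp8", "int8"]
KV_CACHE = ["full", "paged", "quantized"]

def search(measure, accuracy_delta, max_loss: float = 0.005):
    best_cfg, best_tps = None, 0.0
    for batch, prec, kv in product(BATCH_SIZES, PRECISIONS, KV_CACHE):
        cfg = {"batch": batch, "precision": prec, "kv_cache": kv}
        if accuracy_delta(cfg) > max_loss:  # validate before accepting
            continue
        tps = measure(cfg)                  # tokens/sec under real traffic
        if tps > best_tps:
            best_cfg, best_tps = cfg, tps
    return best_cfg, best_tps
```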
Every optimization is validated against perplexity loss and task-specific metrics. You define the tradeoff surface: some applications require zero accuracy loss, others can accept measured degradation for throughput gains. The agent optimizes within your boundaries and provides complete visibility into the speed-accuracy tradeoff for each configuration tested.
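A minimal sketch of such a gate, using perplexity computed from mean per-token negative log-likelihood on a held-out slice of your own data; the budget value is yours to set.

```python
# Sketch: perplexity-based accuracy gate for a candidate configuration.
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity from mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

def within_budget(baseline_nll: float, optimized_nll: float,
                  max_ppl_increase_pct: float) -> bool:
    """True if the optimized config's perplexity regression is acceptable."""
    base, opt = perplexity(baseline_nll), perplexity(optimized_nll)
    return (opt - base) / base * 100.0 <= max_ppl_increase_pct

# Example: a zero-loss application sets max_ppl_increase_pct = 0.0;
# a throughput-hungry one might tolerate 1-2%.
```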
Graded Recommendations
Optimization opportunities exist at multiple levels of complexity and time horizon. We provide a tiered system of recommendations, ranging from immediate configuration wins that can be implemented within days to a structured roadmap for long-term architectural improvements and hardware evolution.
Each tier is validated against your accuracy requirements and constraints. You decide which optimizations to pursue based on engineering resources, risk tolerance, and expected return. The recommendations provide clear visibility into the tradeoff between implementation effort and performance gains at each level.
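As a sketch, a recommendation report might be structured like this; the tiers, gains, and effort figures below are illustrative placeholders, not real results.

```python
# Sketch: one possible shape for a graded recommendation report.
from dataclasses import dataclass

@dataclass
class Recommendation:
    tier: str               # "config" (days), "code" (weeks), "architecture" (quarters)
    change: str             # what to modify
    est_gain_pct: float     # expected throughput improvement
    accuracy_delta: float   # measured perplexity change, in percent
    effort: str             # implementation effort estimate

report = [
    Recommendation("config", "enable paged KV-cache", 18.0, 0.0, "1 day"),
    Recommendation("code", "fuse attention epilogue kernels", 9.0, 0.0, "2 weeks"),
    Recommendation("architecture", "migrate to an FP8 serving path", 35.0, 0.4, "1 quarter"),
]
```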
Who We Are
Sujatha Kashyap is a systems performance engineer with 25 years of experience at the hardware-software interface. She holds a Ph.D. in distributed systems and has been granted dozens of patents in systems architecture across memory, cache optimization, virtualization, and resource orchestration.
At IBM, she led performance optimization across POWER4 through POWER10, spending decades on post-silicon validation, memory hierarchy tuning, and cache optimization. At Meta, she architected network-on-chip performance for AR SoCs under extreme thermal and power constraints. She has worked with enterprise workloads across diverse hardware and software stacks, addressing the same fundamental performance challenges in different contexts.
Across thousands of deployments, she has identified the recurring patterns: which metrics matter, when memory bandwidth is the constraint versus L2 cache efficiency, how to distinguish between compute-bound and memory-bound workloads. She has optimized for single-thread latency, aggregate throughput, SMT scheduling, NUMA effects, and interconnect congestion.
From direct experience, she understands that organizations of all sizes forfeit significant value because they lack access to performance engineering expertise. Varcas encodes 25 years of pattern recognition into an agentic system that makes this knowledge accessible to any organization.
The Philosophy of Varcas
Varcas means refined fire in Sanskrit—what emerges when you subject scattered thought to disciplined focus. It is the distilled result of effortful discipline: raw potential transformed through systematic application into something precise and useful.
Our logo embodies this transformation: kundalini energy coiled at the root chakra, ascending through the spine to the crown chakra. Raw energy transformed into intelligence, a precise metaphor for what we do. Scattered compute resources move through disciplined optimization to become realized performance. The transformation requires guidance through each level, each constraint, each architectural layer.
Performance optimization is fundamentally an exercise in refinement. You begin with scattered observations: metrics, profiles, bottlenecks. Through disciplined analysis, you distill signal from noise, identify root causes, and systematically remove inefficiency. What remains is closer to the theoretical optimum—compute, realized.
This philosophy informs both the methodology and the choice of who should do this work. Would you hand a toddler the keys to your Mercedes? Performance optimization at this level requires not just technical knowledge but the judgment that comes from decades of making consequential decisions about production systems. The agent is effective because it is informed by someone who has spent 25 years learning when to optimize, what to optimize, and when to stop—someone who understands that the goal is not perfection but disciplined improvement within real-world constraints.
Let's Talk
We are working across Austin, Bangalore, and Silicon Valley, seeking design partners to validate our agentic optimization approach across different industries and hardware configurations.