NVIDIA
Senior Technical Program Manager
As a Senior Technical Program Manager with a passion for data-driven operations, you will lead the DGX Cloud Fleet Health reporting program — delivering real-time, actionable insights on the availability and reliability of our GPU fleet. A core focus of this role is advancing Mean-Time-Between-Interruption (MTBI): understanding the root causes of fleet interruptions, surfacing patterns in the data, and driving cross-functional programs to measurably extend fleet uptime. You will partner closely with Capacity Operations, Infrastructure, SRE, and Engineering teams to translate complex fleet signals into decisions that directly improve customer experience. Join us in making a significant impact on the world's most powerful AI infrastructure.
What You’ll Be Doing:
Define and own the metrics framework for measuring fleet health, reliability, and MTBI across a diverse and rapidly scaling GPU fleet.
Lead hands-on data investigations — querying telemetry, correlating failure signals, and building statistical models — to identify the root causes of interruptions and quantify their impact.
Own cross-functional MTBI improvement programs end to end: translate analytical findings into a prioritized roadmap, hold teams accountable to milestones, and deliver measurable reliability gains.
Build and maintain dashboards, automated anomaly detection, and alerting frameworks that surface gaps in fleet health reporting in real time.
Anticipate and close fleet health reporting gaps for new cloud providers and hardware platforms by working closely with Infrastructure bring-up teams.
Communicate complex data findings and program status clearly to senior leadership, turning raw signals into crisp narratives and recommendations.
What We Need to See:
8+ years of Technical Program Management experience, with at least 3 years in infrastructure, platform, or reliability-focused domains.
Strong hands-on data analytics skills — comfortable writing SQL, working with large telemetry datasets, and building dashboards (Grafana, Superset, Databricks, or equivalent).
Demonstrated ability to define and operationalize reliability metrics (MTBI, MTTR, availability SLAs) and drive engineering teams toward measurable improvements.
Proven ability to lead deep-dive investigations across ambiguous, multi-system problems and translate findings into long-term solutions.
Excellent executive communication skills — able to distill complex technical findings into clear, decision-ready narratives for senior leadership.
MS in EE, CS, or equivalent experience.
Ways to stand out from the crowd:
Familiarity with NVIDIA GPU architectures and DGX/HGX infrastructure.
Experience with Databricks, Apache Spark, or other large-scale data processing platforms.
Hands-on experience with Grafana, Superset, or similar observability/BI tooling.
Background in cloud-native infrastructure, Kubernetes, or large-scale distributed systems.