Nvidia

Senior Technical Program Manager

2 Locations

Found: Today

As a Senior Technical Program Manager with a passion for data-driven operations, you will lead the DGX Cloud Fleet Health reporting program — delivering real-time, actionable insights on the availability and reliability of our GPU fleet. A core focus of this role is advancing Mean-Time-Between-Interruption (MTBI): understanding the root causes of fleet interruptions, surfacing patterns in the data, and driving cross-functional programs to measurably extend fleet uptime. You will partner closely with Capacity Operations, Infrastructure, SRE, and Engineering teams to translate complex fleet signals into decisions that directly improve customer experience. Join us in making a significant impact on the world's most powerful AI infrastructure.

What You’ll Be Doing:

  • Define and own the metrics framework for measuring fleet health, reliability, and MTBI across a diverse and rapidly scaling GPU fleet.

  • Lead hands-on data investigations — querying telemetry, correlating failure signals, and building statistical models — to identify the root causes of interruptions and quantify their impact.

  • Own and drive execution of cross-functional MTBI improvement programs end-to-end — from translating analytical findings into a prioritized roadmap, to holding teams accountable to milestones and delivering measurable reliability gains.

  • Build and maintain dashboards, automated anomaly detection, and alerting frameworks that surface gaps in fleet health reporting in real time.

  • Anticipate and close reporting gaps with new cloud providers and hardware platforms by working closely with Infrastructure bring-up teams.

  • Communicate complex data findings and program status clearly to senior leadership, turning raw signals into crisp narratives and recommendations.

What We Need to See:

  • 8+ years of Technical Program Management experience, with at least 3 years in infrastructure, platform, or reliability-focused domains.

  • Strong hands-on data analytics skills — comfortable writing SQL, working with large telemetry datasets, and building dashboards (Grafana, Superset, Databricks, or equivalent).

  • Demonstrated ability to define and operationalize reliability metrics (MTBI, MTTR, availability SLAs) and drive engineering teams toward measurable improvements.

  • Proven ability to lead deep-dive investigations across ambiguous, multi-system problems and translate findings into long-term solutions.

  • Excellent executive communication skills — able to distill complex technical findings into clear, decision-ready narratives for senior leadership.

  • MS in EE, CS, or equivalent experience.

Ways to stand out from the crowd:

  • Familiarity with NVIDIA GPU architectures and DGX/HGX infrastructure.

  • Experience with Databricks, Apache Spark, or other large-scale data processing platforms.

  • Hands-on experience with Grafana, Superset, or similar observability/BI tooling.

  • Background in cloud-native infrastructure, Kubernetes, or large-scale distributed systems.
