NVIDIA

Senior Site Reliability Engineer - Datacenter Automation

India, Bengaluru

Found: February 25, 2026

This role is based in Bengaluru, India.

What you will be doing:

You will be part of a DGX Cloud team responsible for the production systems that enable large, scalable GPU clusters for a variety of AI workloads.
Implement monitoring and health management capabilities for GPU assets to ensure reliability and scalability.
Collaborate with teams across NVIDIA to maintain production AI clusters and improve services based on incident management processes.

What we need to see:

5+ years in a DevOps/SRE role with experience in large-scale production systems.
Strong communication skills and ability to work with multi-functional teams.
Technical knowledge of systems programming languages (Go, Python) and an understanding of data structures and algorithms.

Ways to stand out from the crowd: