NVIDIA
Senior Site Reliability Engineer - Datacenter Automation
This role is based in Bengaluru, India.
What you will be doing:
- You will be part of a DGX Cloud team responsible for the production systems that enable large, scalable GPU clusters for a variety of AI workloads.
- Implement monitoring and health management capabilities for GPU assets to ensure reliability and scalability.
- Collaborate with teams across NVIDIA to maintain production AI clusters and improve services based on incident management processes.
What we need to see:
- 5+ years in a DevOps/SRE role with experience in large-scale production systems.
- Strong communication skills and the ability to work with multi-functional teams.
- Technical knowledge of systems programming languages (Go, Python) and an understanding of data structures and algorithms.
Ways to stand out from the crowd:
- Experience in managing and automating large-scale distributed systems.
- Proven operational excellence in maintaining reliable AI infrastructure.