Nvidia
Manager, Site Reliability Engineer - DGX Cloud
Found: January 7, 2026
Location: India, Remote
What you'll be doing:
- Recruit and mentor a team of Site Reliability Engineers.
- Establish SRE practices including SLOs, SLIs, and incident management.
- Collaborate with engineering teams to design scalable cloud services.
- Drive automation across the service lifecycle.
- Implement monitoring and alerting solutions.
- Oversee incident response and lead post-mortems.
What we need to see:
- Bachelor's or Master's degree in a related field.
- 10+ years of experience in Site Reliability Engineering or DevOps, including 5 years in a leadership role.
- Experience with cloud environments (AWS, GCP, Azure).
- Expertise in Kubernetes, containerization, and microservices.
- Strong understanding of SRE principles and infrastructure automation tools.
- Proficiency in a programming language such as Python or Go.