Nvidia
Manager, Site Reliability Engineer - DGX Cloud
Found: January 8, 2026
This role is based in India, with remote work options available.
What you'll do:
- Recruit and mentor a team of Site Reliability Engineers, fostering collaboration and technical excellence.
- Establish SRE practices, including SLOs, SLIs, and incident management processes.
- Collaborate with engineering teams to design and deploy scalable cloud services.
- Drive automation across the service lifecycle to eliminate toil.
- Implement monitoring and alerting solutions for system health.
- Oversee incident response and lead post-mortems to improve processes.
What we need to see:
- Bachelor's or Master's degree in Computer Science or a related field.
- 10+ years of experience in Site Reliability Engineering or DevOps, with 5 years in a leadership role.
- Experience with cloud environments (AWS, GCP, Azure) and Kubernetes.
- Strong understanding of SRE principles and infrastructure automation tools.
- Excellent communication and problem-solving skills.