Nvidia
Senior DevOps Service Reliability Operations Engineer - DGX Cloud
Found: November 15, 2025
This role is based in Santa Clara, CA or can be performed remotely.
Compensation:
The base salary range is 144,000 USD - 230,000 USD for Level 3 and 168,000 USD - 270,250 USD for Level 4.
Responsibilities:
- Design, develop, and implement a global Service Reliability Operations Center.
- Provide 24/7 support using a follow-the-sun model.
- Collaborate with development teams to create monitoring and alerting systems.
- Perform systems and network administration tasks.
- Develop runbooks and manage incident procedures.
Requirements:
- 5+ years of experience with large-scale production systems.
- Expertise in Linux administration and automation using Ansible/Python.
- Strong troubleshooting skills and experience with cloud environments.
- BS in Computer Science or equivalent experience.
Tech stack:
Linux, Ansible, Python, Kubernetes, SLURM, cloud platforms (AWS, Azure, GCP).