NVIDIA
Senior DevOps Service Reliability Operations Engineer - DGX Cloud
Found: November 15, 2025
Location:
US, CA, Santa Clara or Remote
Compensation:
$144,000 - $270,250/year
Responsibilities:
- Design, develop, and implement a Service Reliability Operations Center.
- Provide 24/7 support, working with global teams.
- Develop monitors, alarms, and alerts to enhance service reliability.
- Perform systems and network administration tasks.
- Collaborate with developers to create and update runbooks.
- Manage incidents and improve service quality.
Requirements:
- 5+ years of experience with large-scale production systems.
- Expertise in Linux, Ansible, Python, and networking.
- BS in Computer Science or equivalent experience.
- Experience with Kubernetes and cloud environments is a plus.