Nvidia

Senior DevOps Service Reliability Operations Engineer - DGX Cloud

2 Locations, Remote

Found: November 15, 2025

This role is based in Santa Clara, CA or can be performed remotely.

Compensation:

Base salary range is 144,000 USD - 230,000 USD for Level 3, and 168,000 USD - 270,250 USD for Level 4.

Responsibilities:

  • Design, develop, and implement a global Service Reliability Operations Center.
  • Provide 24/7 support using a follow-the-sun model.
  • Collaborate with development teams to create monitoring and alerting systems.
  • Perform systems and network administration tasks.
  • Develop runbooks and manage incident procedures.

Requirements:

  • 5+ years of experience with large-scale production systems.
  • Expertise in Linux administration and automation using Ansible/Python.
  • Strong troubleshooting skills and experience with cloud environments.
  • BS in Computer Science or equivalent experience.

Tech stack:

Linux, Ansible, Python, Kubernetes, SLURM, cloud platforms (AWS, Azure, GCP).
