Nvidia

Senior Site Reliability Engineer - Datacenter Automation

India, Bengaluru

This role is based in Bengaluru, India.

What you will be doing:

  • Join a DGX Cloud team responsible for the production systems that enable large-scale GPU clusters for a variety of AI workloads.
  • Implement monitoring and health management capabilities for GPU assets to ensure reliability and scalability.
  • Collaborate with teams across NVIDIA to maintain production AI clusters and improve services through incident management processes.

What we need to see:

  • 5+ years in a DevOps/SRE role with experience in large-scale production systems.
  • Strong communication skills and the ability to work with cross-functional teams.
  • Technical knowledge of systems programming languages (Go, Python) and an understanding of data structures and algorithms.

Ways to stand out from the crowd:

  • Experience in managing and automating large-scale distributed systems.
  • Proven operational excellence in maintaining reliable AI infrastructure.
