NVIDIA
Senior Platform and EngOps Engineer - Cluster Operations
Found: December 10, 2025
This role is based in Bengaluru, India.
What you'll do:
- Develop automated tools for deploying and maintaining GPU clusters interconnected via NVLink and InfiniBand.
- Implement DevOps tools for software updates, maintenance tasks, and monitoring cluster availability.
- Troubleshoot day-to-day cluster failures to maintain optimal performance.
- Manage software and firmware updates for clusters.
- Collaborate with engineering and product teams across time zones.
What we need to see:
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
- 5+ years of experience in deploying and managing clusters and infrastructure.
- Expertise in Ansible, Python, and shell scripting.
- Deep understanding of operating systems and high-performance applications.
- Proficiency with Linux fundamentals.
Ways to stand out:
- Familiarity with resource scheduling managers like Slurm.
- Experience with alerting tools and emergency response practices.
- Hands-on experience with GPU hardware and software.
- Proficiency in designing large-scale networking technologies.