Microsoft
Principal Supercomputing Operations Engineering Manager
This role is based in the United States with remote work options available.
Compensation:
USD $139,900 - $274,800 per year
Overview:
As a Principal Supercomputing Operations Engineering Manager, you will own the operational strategy for interconnect fabric reliability across AI supercomputing environments, ensuring GPU availability and SLA compliance.
Responsibilities:
- Drive end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability.
- Lead and manage a team of senior engineers responsible for fabric operations.
- Provide technical leadership during high-severity fabric incidents.
- Ensure high-quality incident response and systemic prevention mechanisms.
- Partner with various teams to improve systemic reliability.
Qualifications:
- Bachelor's Degree in Computer Science or related field with 6+ years of technical engineering experience.
- 4+ years of people management experience.
- Experience operating large-scale distributed systems or HPC infrastructure.
- Strong hands-on background in operating and debugging interconnect fabrics.