Google

Staff Site Reliability Engineer, Cloud Reliability Intelligence

Location: Sunnyvale, CA, USA

Found: Today

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services, both our internally critical and our externally visible systems, have reliability and uptime appropriate to customers' needs and a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on our systems' capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating work through automation. On the SRE team, you'll have the opportunity to manage the complex challenges of scale that are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis, and large-scale system design.

SRE's culture of intellectual curiosity, problem solving, and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

The Reliability Outcome Enablement team develops the products, core infrastructure, and datasets that drive and sustain Google Cloud Platform's (GCP's) reliability promises. We build the Evergreen intelligence platform, the core system that automates resilience across the GCP ecosystem. Every product team at Google (from BigQuery to Spanner) relies on our infrastructure and integrated data lake to keep their services bulletproof. We are currently expanding our platform to integrate Generative AI and LLM-driven workflows, moving from reactive tracking to a predictive system that catches failures and automates risk mitigation.

Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running. From developing and maintaining our data centers to building the next generation of Google platforms, we make Google's product portfolio possible. We're proud to be our engineers' engineers and love voiding warranties by taking things apart so we can rebuild them. We keep our networks up and running, ensuring our users have the best and fastest experience possible.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training. US: $207,000 - $301,000 (USD) + 20% bonus target + equity + benefits.

Minimum qualifications:

  • Bachelor's degree in Computer Science or a related technical field or equivalent practical experience.
  • 8 years of experience with data structures and algorithms.
  • 3 years of experience leading projects and designing, analyzing, and troubleshooting distributed systems.
  • 3 years of experience in a technical leadership role, overseeing projects.
  • Experience overseeing full-stack architectures, ensuring cohesion between backend data automation layers and engineering frontend.

Preferred qualifications:

  • Experience in applying LLMs or Generative AI to automate workflows.
  • Familiarity with large-scale reliability analysis or policy conformance frameworks.

Responsibilities:

  • Own the technical roadmap and long-term architecture for the Evergreen platform, including a unified data model for promise delivery across GCP.
  • Design and scale high-performance backend pipelines (Go, Java) and data-rich user interfaces (TypeScript, Angular) used by more than 10,000 Google engineers.
  • Prototype and productionize LLM-based features to parse unstructured incident data, automatically file risk tickets, and suggest reliability fixes.
  • Partner closely with Product Management, Data Science, and leadership to align multiple organizations on a unified approach to policy measurement and enforcement.
