Reddit
Senior SRE, Ads
The Ads organization powers Reddit's advertising platform, enabling advertisers to reach highly engaged communities while helping Reddit grow its business. The reliability of our Ads systems directly impacts advertiser success, revenue generation, and user experience.
The Ads Reliability team partners closely with Ads Engineering to improve reliability, scalability, operational excellence, and developer productivity across Reddit’s advertising ecosystem. We help build and operate highly available services that drive revenue and maintain advertiser trust. We’re looking for a Senior Site Reliability Engineer to build, operate, and scale the critical systems behind Reddit Ads.
Responsibilities:
- Partner with Ads Engineering teams to improve reliability, scalability, and operational excellence of ad-serving, auction, targeting, measurement, and billing systems.
- Design, build, and maintain infrastructure, tooling, and automation that improve service reliability and engineering productivity.
- Improve observability through monitoring, alerting, tracing, logging, and dashboards.
- Participate in on-call rotations and lead incident response efforts for critical production systems.
- Run root cause analysis and drive corrective actions following incidents.
- Collaborate with software engineers throughout the service lifecycle, from design reviews through production operations.
- Drive adoption of SRE best practices including SLIs, SLOs, error budgets, capacity planning, and operational readiness reviews.
- Reduce operational toil through automation and self-service tooling.
- Help define and measure advertiser-critical user journeys such as campaign creation, ad delivery, reporting, and billing.
- Scale Ads systems to support continued traffic growth, increased advertiser demand, and evolving business requirements.
Required Qualifications:
- 5+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems.
- Strong experience supporting high-traffic, user-facing production environments.
- Good understanding of distributed systems, networking, Linux systems, and cloud-native architectures.
- Good programming skills in languages such as Go, Python, or similar.
- Demonstrated ability to troubleshoot complex issues across applications, infrastructure, networking, and services.
- Experience with observability platforms, monitoring systems, alerting, and incident response.
- Experience driving automation and operational improvements.