Sr. Site Reliability Engineer
About Backblaze
Backblaze is the object storage leader in the open cloud movement, fueling businesses with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we’re helping customers break free from the restrictive, overpriced legacy solutions that hold them back, and blaze forward with the full power of the open cloud in their hands.
Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we did a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $100M in revenue and is the leading specialized storage cloud, managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries, including businesses, developers, IT professionals, and individuals. But while there is a lot to celebrate in our past, there is almost as much opportunity ahead of us. We’re seeking a Sr. Site Reliability Engineer to join us!
About the Role
We are seeking a Senior Site Reliability Engineer (SRE) to help ensure the stability, scalability, and reliability of our services and infrastructure. This role focuses on building automation, maintaining observability, and supporting incident response to keep customer-facing systems performing at their best. The SRE will collaborate with engineering, product, and operations teams to embed reliability practices into day-to-day development and operations while contributing to tools and processes that improve efficiency and reduce manual effort.
What You'll Do:
Service Reliability & Operations
- Own and drive the availability, durability, and performance of critical services across all production environments.
- Lead and champion projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
- Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
- Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
- Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, and problem management).
Automation & Tooling
- Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
- Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
- Build, maintain, and secure advanced CI/CD pipelines, configuration management, and infrastructure as code solutions (Terraform, Ansible, Jenkins).
- Write production-grade code (Bash, Python, Go, etc.) to build new reliability tools and enhance existing systems.
Collaboration
- Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
- Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational readiness for all new services and features.
- Lead planning and disaster recovery strategy across critical infrastructure components.
- Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
- Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.
Continuous Improvement
- Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
- Proactively identify systemic, recurring issues and architect and drive the implementation of long-term improvements and strategic design action plans.
- Be a leading voice in promoting and embedding reliability-focused practices across development and operations teams.
Qualifications:
Education & Experience
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
- 8+ years of progressive experience in site reliability, systems engineering, or operations.
- Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
Technical Skills
- Expert-level Linux systems administration and advanced troubleshooting skills.
- Security-minded operations, focusing on system-wide patching, hardening, and proactive vulnerability identification.
- Deep mastery of service reliability concepts, including advanced monitoring, alerting strategy, leading incident response, and in-depth root cause analysis.
- Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
- Expert knowledge of incident response methodologies and operational best practices.
- Proven experience designing and operating container orchestration (e.g., Kubernetes) and working knowledge of microservices concepts required.
- Expert experience with HashiCorp products (e.g., Vault, Terraform) in a production environment.
Preferred Attributes
- Significant experience in a SaaS, service provider, or hyper-scale distributed systems environment.
- Deep familiarity with ITIL/OSS practices and experience defining and enforcing SLOs and SLAs.
- Exceptional problem-solving skills and a strong drive to learn and apply new technologies.
- Advanced experience with cloud platforms (AWS, GCP, or Azure) in a production setting.
Backblaze Perks
- Healthcare for family, including dental and vision
- Competitive compensation and 401K
- RSU grants for full-time employees
- ESPP program
- Flexible vacation policy
- Maternity & paternity leave
- MacBook Pro to use for work, plus a generous stipend to personalize your workstation
- Childcare bonus (human children only)
- Fertility treatment and support
- Learning & development program
- Commuter benefits
- Culture that supports a healthy work-life balance
To provide greater transparency to candidates, we share base pay ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar-stage growth companies. Final offer amounts are determined by multiple factors, including candidate location, skills, depth of work experience, and relevant licenses/credentials, and may vary from the amounts listed below.
The expected salary range for this role is $150,000 - $200,000.
At Backblaze, we value being fair and good to our customers, partners, and employees. That’s why diversity, equity, and inclusion are at the core of our values. We are committed to fostering a workforce where all employees feel a sense of belonging regardless of race, ethnicity, nationality, gender, sexual orientation, age, religion, socio-economic status, ability, veteran status, and education. We believe that our dedication to cultivating a diverse workspace not only allows us to better serve our customers in over 175 countries but further reinforces our commitment to doing the right thing. We are proud to be an Equal Opportunity Employer.
To understand more about the data we collect and process as part of your application, please view our Backblaze Employee Privacy Notice.