[Remote] Principal Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. The company is an industry leader in designing and delivering government software to improve efficiency and citizen engagement. The Principal Site Reliability Engineer will be responsible for the reliability, scalability, performance, and operational excellence of the company's Civic Platform, working closely with various engineering teams on infrastructure to ensure high availability and security of its SaaS offerings.
Responsibilities
- Serve as a technical leader for reliability engineering, operational excellence, and platform modernization across the Civic Platform
- Drive platform modernization initiatives, including the evolution from VM-based architectures toward containerized and cloud-native services, in partnership with DevOps Engineering, Database Engineering, Security, and Development teams
- Lead efforts that improve and sustain the availability, performance, scalability, security, and cost efficiency of the company's SaaS offerings
- Define, implement, and operate service level objectives (SLOs), service level agreements (SLAs), and error budgets for critical platform services, using data to drive prioritization and risk-based decision making
- Lead observability initiatives across metrics, distributed tracing, logging, and monitoring platforms to improve system visibility and accelerate issue detection and resolution
- Drive Root Cause Analysis (RCA) efforts for production incidents, facilitate blameless postmortems, and ensure corrective actions are implemented and tracked to completion
- Design, build, and maintain automation, tooling, and software solutions that improve reliability, operational efficiency, scalability, and developer productivity
- Serve as a senior technical escalation point during production incidents and for platform changes that impact availability, performance, security, or compliance
- Partner with Security and Compliance teams to ensure platform operations meet regulatory and compliance requirements, including SOC 2, HIPAA, FedRAMP, StateRAMP, and PCI DSS
- Translate operational metrics, reliability trends, and platform health data into actionable insights for engineering leadership and executive stakeholders
- Mentor engineers across the Cloud Engineering organization and influence engineering best practices through technical leadership and collaboration
Skills
- 8+ years of experience in Site Reliability Engineering, Software Engineering, Cloud Infrastructure, or related disciplines in a SaaS environment, including experience leading technical initiatives
- Demonstrated technical leadership driving platform modernization in containerized and orchestrated environments, including Kubernetes or equivalent technologies
- Hands-on experience operating and supporting large-scale SaaS platforms on Microsoft Azure
- Experience developing automation and operational tooling using Python, PowerShell, Bash, or similar scripting languages
- Deep expertise designing, operating, analyzing, and troubleshooting distributed systems across the application, infrastructure, networking, and operating system layers
- Strong experience with modern observability platforms, including monitoring, logging, metrics, and distributed tracing
- Demonstrated success leading incident response, Root Cause Analysis, and improvement initiatives
- Experience establishing and maturing Incident, Problem, and Change Management practices
- Strong written and verbal communication skills with the ability to effectively communicate technical concepts to engineering leadership and executive stakeholders
- Experience using Git and GitHub-based development workflows
- Experience with Infrastructure-as-Code practices and tooling, particularly Terraform
- Experience with configuration management platforms such as Ansible
- Experience supporting SaaS platforms subject to public-sector compliance frameworks, including SOC 2, HIPAA, FedRAMP, StateRAMP, and PCI DSS
- Experience implementing GitOps deployment methodologies using tools such as Argo CD or Flux
- Experience implementing and operating OpenTelemetry-based observability solutions
- Cloud FinOps experience, including cost optimization and resource efficiency initiatives in Microsoft Azure environments
- Strong Linux systems administration experience alongside Windows expertise
- Experience leveraging AI-assisted engineering tools such as GitHub Copilot, Claude Code, or similar technologies to improve engineering productivity, incident response, automation, and operational efficiency
Benefits
- Annual bonus
- Flexible time off
- Comprehensive medical, dental, and vision plans
- Family planning benefits
- 401(k) retirement savings plan with company match
- Health savings account with company contributions
- Flexible spending account
- Life, accident, and disability coverage
- Business travel insurance
- Employee assistance programs
- Other well-being benefits