[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. The company helps innovators turn their ideas into reality through software. It is seeking a Senior Site Reliability Engineer to build and operate reliable, secure, and scalable cloud services for its GovCloud products, with a focus on improving production services and establishing operational excellence practices.
Responsibilities
- Serve as a primary owner for the reliability, availability, performance, operability, and security of one or more production services
- Deploy, operate, maintain, and continuously improve production services running in the company's GovCloud environments
- Partner with engineering teams to ensure services are designed with reliability, scalability, security, and operability in mind
- Define and operate reliability practices such as SLOs/SLIs, error budgets, production readiness reviews, service reviews, and operational health reviews
- Build automation to improve deployment safety, operational efficiency, incident response, and service recovery
- Design, develop, and maintain software, automation, and tooling that improve the reliability, scalability, and efficiency of production systems
- Implement and improve monitoring, alerting, logging, tracing, and observability capabilities across supported services
- Lead and participate in incident response, troubleshooting, and post-incident reviews focused on learning and continuous improvement
- Write and maintain operational documentation, runbooks, and recovery procedures
- Scale and enhance resilience testing and Gameday practices to validate system behavior, recovery capabilities, and operational readiness
- Continuously identify and eliminate operational toil through software engineering, automation, and process improvement
- Ensure supported services remain compliant with company security, privacy, and regulatory requirements, including FedRAMP and related controls where applicable
- Participate in a 24x7 on-call rotation for production services
- Function effectively in a fast-paced environment while helping establish and mature operational excellence practices for the company's GovCloud products
Skills
- B.S. or higher in Computer Science, Engineering, or a related technical discipline, or equivalent practical experience
- 7+ years of experience in Site Reliability Engineering, Software Engineering, Platform Engineering, Cloud Infrastructure, or Production Operations
- Experience operating and supporting customer-facing production services in large-scale cloud environments
- Strong understanding of reliability engineering principles, including SLOs/SLIs, observability, incident management, capacity planning, production readiness, and automation
- Experience with AWS, Azure, or other public cloud platforms
- Experience developing automation using languages such as Python, Go, Java, PowerShell, Bash, or similar
- Experience with Infrastructure as Code, CI/CD pipelines, deployment automation, and modern cloud operations practices
- Understanding of security, compliance, and operational risk management in production environments
- Strong written and verbal communication skills
- 10+ years of experience operating highly available, customer-facing production systems
- Experience with AWS GovCloud, FedRAMP, IL4/IL5, or other regulated cloud environments
- Experience supporting services with stringent availability, reliability, and security requirements
- Experience with containers, Kubernetes, cloud-native architectures, APIs, load balancing, networking, DNS, and distributed systems
- Experience with observability platforms such as Splunk, CloudWatch, or similar technologies
- Experience operating databases, storage platforms, messaging systems, and caching technologies
- Experience designing and implementing operational automation at scale
- Experience leading or participating in Gamedays, disaster recovery exercises, resilience testing, or operational readiness reviews
- Strong incident management experience, including technical leadership during major incidents and stakeholder communication
- Strong collaboration skills and ability to work effectively across engineering, security, compliance, and operations teams
- Passion for building reliable, secure, and scalable systems that customers can trust
Benefits
- Annual cash bonuses
- Commissions for sales roles
- Stock grants
- A comprehensive benefits package
Company Overview