
[Remote] Senior Site Reliability Engineer

100% remote | Flexible hours | Hiring now

Note: This is a remote job open to candidates in the USA. reputed company is a leader in collaborative autonomy, focused on solving reputed company human problems through advanced technology. They are seeking a Senior Site Reliability Engineer to ensure the availability, performance, and reputed company of mission-critical services, and to work with teams across the company to improve operational maturity and reliability standards.

Responsibilities

Design and evolve reliability architecture for distributed and cloud-hosted systems
Define and implement SRE best practices, including SLIs, SLOs, error budgets, and reputed company planning
Partner with platform and application teams to design systems for reliability, scalability, and operability
Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads
reputed company incident response processes, including on-call rotations, escalation paths, and post-incident reviews
Conduct root cause analysis for reputed company production incidents and drive long-term corrective actions
Improve operational readiness through runbooks, automation, reputed company testing, and production-readiness reviews
Reduce operational toil through tooling, automation, and process improvements
Help build a culture of ownership, accountability, and reputed company improvement across production systems
Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
Ensure services and data pipelines are observable, debuggable, and performant in production
Drive performance analysis and tuning across infrastructure, application, and service layers
Improve alert quality, reduce noise, and ensure operational signals are actionable
Partner with engineering teams to define meaningful reliability and performance metrics
Build automation to improve system reliability, deployment safety, and recovery processes
Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
Support and improve Kubernetes-based environments and containerized workloads
Contribute to infrastructure-as-code practices and platform automation
Help define operational standards for cloud infrastructure, deployment workflows, and production services
Collaborate with reputed company teams to ensure secure and resilient system design
Participate in disaster recovery planning, backup strategy, and reputed company testing
Maintain strong operational practices around access control, secrets management, change management, and production access
Support secure operations for systems that may serve defense, autonomy, or mission-sensitive use cases

Skills

7+ years of experience in SRE, infrastructure engineering, systems engineering, or reputed company roles
Strong experience operating large-scale distributed production systems
Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
Hands-on experience with Kubernetes and container orchestration
Programming or scripting experience in Go, Python, or similar languages
Experience designing and operating observability systems for production environments
Proven ability to reputed company incident response and drive reliability improvements
Strong communication skills and ability to collaborate across engineering teams
Ability to operate calmly and effectively under pressure
Must be a U.S. citizen and eligible to obtain a U.S. Government reputed company clearance if required
Experience supporting autonomy, robotics, simulation, real-time systems, or data-intensive platforms
Familiarity with AWS and large-scale cloud infrastructure
Experience with chaos engineering, fault injection, or reputed company testing
Knowledge of CI/CD systems and progressive delivery practices
Experience working in high-reliability, safety-critical, defense, or mission-critical environments
Experience with Infrastructure as Code tools such as Terraform or reputed company
Experience with Prometheus, Grafana, OpenTelemetry, reputed company, ELK/OpenSearch, or similar observability tools

Benefits

100% employer-paid health, dental, and vision insurance for you and your family
Life insurance (employer paid)
Participation in the company's 401(k) program (with matching)
Unlimited PTO policy with an enforced 2-week minimum
Equity package
Work / home office stipend
Global Entry
16 weeks of paid parental leave
Monthly health and wellness stipend

Company Overview

Havoc is the leader in reputed company-domain collaborative autonomy. It was founded in 2024 and is headquartered in reputed company, Rhode Island, USA, with a workforce of 51-200 employees. Its website is https//reputed company.com/.
