Senior Site Reliability Engineer - San Francisco, CA, US
Senior Site Reliability Engineer (Payments Infrastructure)
Kody is seeking a Senior Site Reliability Engineer to ensure the reliability, availability, scalability, and operational excellence of our global payment platform. You will own production observability, incident response, service-level management, and cloud infrastructure reliability across mission-critical payment processing systems operating in Europe, Asia, and North America.
Responsibilities
- Participate in a follow-the-sun production on-call rotation as a primary incident responder.
- Diagnose, triage, mitigate, and coordinate resolution of production incidents across payment services, Kubernetes platforms, databases, messaging systems, and cloud infrastructure.
- Define and maintain SLOs, SLIs, error budgets, alerting standards, and operational readiness processes.
- Drive reliability improvements through automation, observability, capacity planning, performance optimization, and post-incident reviews.
- Partner with engineering teams to improve operational maturity in PCI-regulated environments.
- Lead incident management during SEV1/SEV2 events and improve response effectiveness and MTTR.
Requirements
- 5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Cloud Infrastructure roles supporting mission-critical production systems.
- Strong hands-on experience with AWS, Kubernetes (EKS), Terraform, PostgreSQL, Kafka, Linux, networking, and modern observability platforms.
- Deep understanding of distributed systems, cloud-native architectures, high availability, disaster recovery, capacity planning, and performance optimization.
- Proven experience operating payment, banking, fintech, or other highly regulated systems with stringent security, compliance, and uptime requirements.
- Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, incident management, alert governance, and operational excellence.
Leadership & Operational Excellence
- Demonstrates strong ownership and accountability, taking end-to-end responsibility for service reliability and customer impact.
- Possesses a strong sense of urgency during production incidents while maintaining sound judgment and structured decision-making under pressure.
- Applies a systematic and methodical approach to troubleshooting, root-cause analysis, and incident resolution in complex distributed environments.
- Takes a data-driven approach, using metrics, telemetry, trends, and service-level indicators to prioritize reliability investments and operational improvements.
- Continuously drives engineering excellence through iterative improvement, automation, standardization, and elimination of operational toil.
- Proven ability to lead cross-functional incident response efforts, coordinate stakeholders, and communicate effectively during high-severity production events.
- Champions a culture of operational readiness, continuous learning, post-incident improvement, and blameless accountability.
- Demonstrates strong mentoring and technical leadership skills, influencing engineering teams to build reliable, scalable, and resilient systems by design.
Benefits
- Competitive packages in line with California market standards
- Join a dynamic and innovative team in a rapidly growing company
- Collaborative, inclusive environment where your contributions are recognized and valued