Senior Site Reliability Engineer - San Francisco, CA, US

100% remote · Flexible hours · Hiring now

Senior Site Reliability Engineer (Payments Infrastructure)

Kody is seeking a Senior Site Reliability Engineer to ensure the reliability, availability, scalability, and operational excellence of our global payment platform. You will own production observability, incident response, service-level management, and cloud infrastructure reliability across mission-critical payment processing systems operating in Europe, Asia, and North America.

Responsibilities

  • Participate in a follow-the-sun production on-call rotation as a primary incident responder.
  • Diagnose, triage, mitigate, and coordinate resolution of production incidents across payment services, Kubernetes platforms, databases, messaging systems, and cloud infrastructure.
  • Define and maintain SLOs, SLIs, error budgets, alerting standards, and operational readiness processes.
  • Drive reliability improvements through automation, observability, capacity planning, performance optimization, and post-incident reviews.
  • Partner with engineering teams to improve operational maturity in PCI-regulated environments.
  • Lead incident management during SEV1/SEV2 events and improve response effectiveness and MTTR.

Requirements

  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Cloud Infrastructure roles supporting mission-critical production systems.
  • Strong hands-on experience with AWS, Kubernetes (EKS), Terraform, PostgreSQL, Kafka, Linux, networking, and modern observability platforms.
  • Deep understanding of distributed systems, cloud-native architectures, high availability, disaster recovery, capacity planning, and performance optimization.
  • Proven experience operating payment, banking, fintech, or other highly regulated systems with stringent compliance and uptime requirements.
  • Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, incident management, alert governance, and operational excellence.

Leadership & Operational Excellence

  • Demonstrates strong ownership and accountability, taking end-to-end responsibility for service reliability and customer impact.
  • Possesses a strong sense of urgency during production incidents while maintaining sound judgment and structured decision-making under pressure.
  • Applies a systematic and methodical approach to troubleshooting, root-cause analysis, and incident resolution in complex distributed environments.
  • Data-driven, with the ability to use metrics, telemetry, trends, and service-level indicators to prioritize reliability investments and operational improvements.
  • Continuously drives engineering excellence through iterative improvement, automation, standardization, and elimination of operational toil.
  • Proven ability to lead cross-functional incident response efforts, coordinate stakeholders, and communicate effectively during high-severity production events.
  • Champions a culture of operational readiness, continuous learning, post-incident improvement, and blameless accountability.
  • Demonstrates strong mentoring and technical leadership skills, influencing engineering teams to build reliable, scalable, and resilient systems by design.

Benefits

  • Competitive packages aligned with California market standards
  • Join a dynamic and innovative team in a rapidly growing company
  • Collaborative, inclusive environment where your contributions are recognized and valued

Apply to this job