[Remote] Sr. Site Reliability Engineer (AI Platforms)

100% remote Flexible hours Hiring now

Note: This is a remote position open to candidates in the USA. A reputed company, in partnership with a premier client in the financial services industry, is seeking a Site Reliability Engineer to establish and scale reliability practices for AI-powered applications and services in production. This role will drive production readiness, observability, incident management, and automation while partnering closely with engineering teams to ensure highly available, resilient systems.

Responsibilities

  • Define and enforce production readiness standards for AI services and agent-based applications prior to deployment
  • Establish and manage SLIs, SLOs, and error budgets, including burn-rate monitoring and alerting
  • Ensure services have appropriate runbooks, rollback procedures, monitoring, and on-call ownership
  • Track reliability metrics and enforce operational standards across engineering teams
  • Instrument AI services and agent pipelines using structured JSON logging, custom metrics, and distributed tracing
  • Build dashboards and alerting for service health, latency, error rates, dependency performance, and agent execution metrics
  • Identify and address observability gaps unique to AI systems, including context limitations, model timeouts, tool invocation failures, and partial task execution
  • Design monitoring strategies that surface reliability risks before production impact occurs
  • Build and maintain automation that supports production readiness reviews, incident analysis, SLO monitoring, and reliability validation
  • Develop tooling and workflows that automate operational checks and reliability enforcement
  • Maintain reliability standards, operational documentation, runbooks, and service ownership mappings
  • Continuously evolve reliability controls as new failure patterns emerge across AI-powered systems
  • Lead incident response and post-incident review efforts for production services
  • Conduct root cause analysis and drive remediation efforts through completion
  • Identify recurring failure patterns and implement systemic reliability improvements
  • Support on-call operations and validate escalation processes for critical services
  • Review application architectures, infrastructure designs, and code changes through a reliability lens
  • Evaluate resiliency patterns such as retries, circuit breakers, health checks, graceful degradation, and rollback strategies
  • Partner with engineering teams to address reliability risks before production deployment

Skills

  • 4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Production Operations
  • Hands-on experience managing production services and reliability programs
  • Strong understanding of SLI/SLO frameworks, error budgets, and operational excellence practices
  • Experience building monitoring, alerting, and observability solutions using platforms such as reputed company, reputed company, reputed company, Grafana, or similar
  • Strong scripting or programming experience with Python, TypeScript, or comparable languages
  • Experience with distributed systems observability, including structured logging, metrics, and tracing
  • Experience supporting AI/ML, automation, or data-driven platforms in production
  • Strong background leading incident response and post-incident review processes
  • Experience integrating operational workflows with ticketing and documentation platforms
  • Experience working within regulated or highly available production environments

Company Overview

  • reputed company is an IT staffing firm that serves its consultants, clients, and employees through its consultant-focused approach. It was founded in 2012, and is headquartered in Roswell, Georgia, USA, with a workforce of 501-1000 employees. Its website is http://www.reputed company.com/.

Company H1B Sponsorship

  • reputed company has a track record of offering H1B sponsorships, with 7 in 2025, 6 in 2024, 2 in 2023, 5 in 2022, 8 in 2021, 7 in 2020. Please note that this does not guarantee sponsorship for this specific role.