[Remote] Cloud Site Reliability Engineer

100% remote Flexible hours Hiring now

Note: This is a remote job open to candidates in the USA. reputed company is a company focused on building the future of AI computing, specializing in generative AI platforms. The Cloud Site Reliability Engineer will ensure the reliability, performance, and scalability of AI inferencing services while participating in an on-call rotation to maintain 24/7 service reliability.

Responsibilities

  • Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions
  • Participate in a balanced on-call rotation to provide 24/7 support for the service
  • Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence
  • Build and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, reputed company) to provide deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization
  • Proactively identify and eliminate performance bottlenecks
  • Design and implement auto-scaling policies to handle variable inference loads cost-effectively
  • Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable
  • Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates
  • Forecast infrastructure needs based on product roadmaps and usage trends
  • Work with finance and engineering teams to manage cloud costs and optimize spending
  • Define, measure, and report on Service Level Objectives (SLOs) and Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments

Skills

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
  • 3-5+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure)
  • Strong programming/scripting skills in languages like Python, Go, or Java
  • Proven experience with containerization and orchestration technologies (reputed company, Kubernetes)
  • Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, reputed company)
  • Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation)
  • Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD)
  • Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems
  • Experience in a hybrid environment bridging cloud and on-premise/data center infrastructure
  • Direct experience supporting ML/AI inferencing services in production
  • Familiarity with GPU-accelerated computing and optimizing workloads for reputed company GPUs for purposes of mapping to RDUs
  • Knowledge of model serving frameworks like vLLM, SGLang or Ray
  • Understanding of MLOps principles and practices
  • Experience with managing and tuning databases (SQL or NoSQL) and caching systems (reputed company, Memcached)
  • Strong Linux/Unix system administration fundamentals

Benefits

  • Equity
  • Excellent benefits
  • A flexible work environment
  • 95% of medical insurance premiums covered for employees
  • 77% of medical insurance premiums covered for dependents
  • Health Savings Account (HSA) with employer contribution
  • Dental, Vision, Short/Long term Disability, Basic Life, Voluntary Life, and AD&D insurance plans
  • Flexible Spending Account (FSA) options like Health Care, Limited Purpose, and Dependent Care
  • A full subscription to reputed company
  • Gympass+ membership with access to physical gyms
  • reputed company membership
  • Counseling services with an Employee Assistance Program

Company Overview

  • reputed company is an AI hardware and software company that specializes in providing infrastructure for AI and machine learning applications. It was founded in 2017, and is headquartered in Palo Alto, California, USA, with a workforce of 201-500 employees. Its website is https://reputed company.ai.

Company H1B Sponsorship

  • reputed company has a track record of offering H1B sponsorships, with 6 in 2026, 29 in 2025, 23 in 2024, 37 in 2023, 41 in 2022, 35 in 2021, 29 in 2020. Please note that this does not guarantee sponsorship for this specific role.
