Staff Site Reliability Engineer

100% remote Flexible hours Hiring now

The Wikimedia Foundation is looking for a Staff Site Reliability Engineer (SRE) focused on Machine Learning Infrastructure. You will join a distributed team working across UTC-5 to UTC+3 (Eastern Americas, Europe, and Africa) and report directly to the Director of Machine Learning, Chris Albon.

As a Staff SRE specializing in ML infrastructure, your primary responsibility is designing, developing, maintaining, and scaling the foundational infrastructure that enables Wikimedia's Machine Learning Engineers and Researchers to train, deploy, and monitor machine learning models in production.

You will be responsible for:

  • Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
  • Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
  • Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
  • Proactively monitoring and optimizing system performance to maintain high service quality.
  • Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
  • Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

Skills and Experience:

  • 7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
  • Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, GPU acceleration, distributed training systems).
  • Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
  • Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
  • Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).
  • Strong English communication skills and comfort working asynchronously across global teams.

Qualities that are important to us:

  • Collaborative, proactive, and independently motivated.
  • Comfortable working with diverse, remote teams.
  • Committed to open-source software and volunteer communities.
  • Systematic thinker focused on operational excellence and reliability.

Additionally, ideal candidates will have deep expertise in at least one of these areas:

  • Scalable ML Infrastructure: Deep understanding of scalable infrastructure design for high-performance machine learning training and inference workloads.
  • Reliability and Operations: Proven track record ensuring high reliability and robust operations of distributed ML systems at scale.
  • Tooling and Automation: Demonstrated expertise creating robust tooling and automation solutions that simplify the deployment, management, and monitoring of ML infrastructure.

About the Wikimedia Foundation

The Wikimedia Foundation is the nonprofit organization that operates Wikipedia and the other Wikimedia free knowledge projects. Our vision is a world in which every single human can freely share in the sum of all knowledge. We believe that everyone has the potential to contribute something to our shared knowledge, and that everyone should be able to access that knowledge freely. We host Wikipedia and the Wikimedia projects, build software experiences for reading, contributing, and sharing Wikimedia content, support the volunteer communities and partners who make Wikimedia possible, and advocate for policies that enable Wikimedia and free knowledge to thrive.

The Wikimedia Foundation is a charitable, not-for-profit organization that relies on donations. We receive donations from millions of individuals around the world, with an average donation of about $15. We also receive donations through institutional grants and gifts. The Wikimedia Foundation is a United States 501(c)(3) tax-exempt organization with offices in San Francisco, California, USA.

As an equal opportunity employer, the Wikimedia Foundation values having a diverse workforce and continuously strives to maintain an inclusive and equitable workplace. We encourage people with a diverse range of backgrounds to apply. We do not discriminate against any person based upon their race, traits historically associated with race, religion, color, national origin, sex, pregnancy or related medical conditions, parental status, sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, genetic information, or any other legally protected characteristics.

The Wikimedia Foundation is a remote-first organization with staff members, including contractors, based in 40+ countries*. Salaries at the Wikimedia Foundation are set in a way that is competitive, equitable, and consistent with our values and culture. The anticipated annual pay range of this position for applicants based within the United States is US$129,347 to US$200,824, with multiple individualized factors, including cost of living in the location, being the determinants of the offered pay. For applicants located outside of the US, the pay range will be adjusted to the country of hire. We neither ask for nor take into consideration the salary history of applicants. The compensation for a successful applicant will be based on their skills, experience, and location.

*Please note that we are currently able to hire in the following countries: Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Kingdom, United States of America and Uruguay. Our non-US employees are hired through a local third party Employer of Record (EOR).

We periodically review this list to ensure alignment with our hiring requirements.

All applicants can reach out to their recruiter to understand more about the specific pay range for their location during the interview process.

If you are a qualified applicant requiring assistance or an accommodation to complete any part of the application process due to a disability, you may contact us at [email protected] or +1 (415) 839-6885.

More information

U.S. Benefits & Perks

Applicant Privacy Policy

Wikimedia Foundation

What does the Wikimedia Foundation do?

What makes Wikipedia different from social media platforms?

Our Projects

Our Tech Stack

News from across the Wikimedia movement

Wikimedia Blog

Wikimedia 2030

Originally posted on Himalayas
