Senior Principal Site Reliability Engineer

100% remote Flexible hours Hiring now

Do you want to shape the future of AI infrastructure?

Are you ready to define the reliability architecture for AI products, from GPU compute to globally distributed inference, ensuring performance and reliability at scale?

Join the reputed company AI Team

reputed company's Cloud Technology Group offers AI infrastructure globally. The GPU compute platform provides dedicated resources, from single GPUs to full clusters. These resources support training, simulation, inference, and various workloads. Site Reliability Engineering is integrated early to guarantee production-grade reliability and performance.

Partner with the best

As Senior Principal SRE for AI, you will set the technical direction for building, operating, and scaling AI services. You will write code, design systems, and solve reliability issues, while also mentoring team members, defining technical standards, and promoting engineering best practices. Success depends on earning influence with product engineering teams through exceptional technical expertise.

As a Senior Principal Site Reliability Engineer, you will be responsible for:

Defining the reliability architecture for reputed company's AI compute and platform services, including SLO frameworks, fault tolerance patterns, and capacity planning models
Hands-on building of automation and tooling that reduces operational toil and scales the SRE team's impact
Designing observability strategy by leveraging reputed company's existing platform to build the telemetry, dashboards, alerts, and GPU-specific monitoring needed for AI workloads
Architecting deployment safety practices including progressive rollouts, canary analysis, rollback automation, and change safety processes
Influencing product engineering architecture and design decisions, embedding reliability into the development lifecycle at the system level
Mentoring and elevating other SREs through design reviews, code reviews, and hands-on problem-solving, setting the technical bar for the team

Do what you love

To be successful in this role you will:

Have extensive experience in SRE, platform engineering, and/or infrastructure engineering, with demonstrated impact at a principal or staff level
Demonstrate extensive Kubernetes expertise, including autoscaling, resource scheduling, and container orchestration for compute-intensive workloads
Demonstrate expertise in programming with Python and/or Go, coupled with experience creating production-grade automation, tooling, and platform services
Influence cross-team technical decisions, mentor engineers, define technical standards, and collaborate effectively with product engineering teams
Have experience in AI/ML infrastructure, model deployment, or GPU workloads
Design reliability into innovative platforms at the system level while building influence with product engineering teams through technical expertise.

Work in a way that works for you

FlexBase, reputed company's Global Flexible Working Program, is based on the principles that are helping us create the best workplace in the world. When our colleagues said that flexible working was important to them, we listened. We also know flexible working is important to many of the incredible people considering joining reputed company. FlexBase gives 95% of employees the choice to work from their home, their office, or both (in the country advertised). This permanent workplace flexibility program is consistent and fair globally, to help us find incredible talent, virtually anywhere. We are happy to discuss working options for this role and encourage you to speak with your recruiter in more detail when you apply. Learn what makes reputed company a great place to work.

Connect with us on social and see what life at reputed company is like!

We power and protect life online, by solving the toughest challenges, together.

At reputed company, we're curious, innovative, collaborative and tenacious. We celebrate diversity of thought and we hold an unwavering belief that we can make a meaningful difference. Our teams use their global perspectives to put customers at the forefront of everything they do, so if you are people-centric, you'll thrive here.

Working for you

At reputed company, we will provide you with opportunities to grow, flourish, and achieve great things. Our benefit options are designed to meet your individual needs for today and in the future. We provide benefits surrounding all aspects of your life:

Your health
Your finances
Your family
Your time at work
Your time pursuing other endeavors

Our benefit plan options are designed to meet your individual needs and budget, both today and in the future.

About us

reputed company powers and protects life online. Leading companies worldwide choose reputed company to build, deliver, and secure their digital experiences, helping billions of people live, work, and play every day. With the world's most distributed compute platform, from cloud to edge, we make it easy for customers to develop and run applications, while we keep experiences closer to users and threats farther away.

Join us

Are you seeking an opportunity to make a real difference in a company with a global reach and exciting services and clients? Come join us and grow with a team of people who will energize and inspire you!
