Principal Site Reliability Engineer - AI Infrastructure Operations

100% remote Flexible hours Hiring now

About reputed company

reputed company is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. reputed company enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join reputed company, you’ll be contributing to building the technology that powers the future.

About The Role

At reputed company, our AI Infrastructure Operations team is responsible for the reliability and scalability of one of the most demanding AI platforms in the industry. We value engineers who think in systems, lead through influence, and raise the bar for operational excellence across the organisation.

We’re looking for a Principal Site Reliability Engineer (SRE) to provide technical leadership across our AI Infrastructure Operations domain. This is a senior, highly impactful role focused on setting reliability strategy, designing foundational systems, and driving cross-team improvements at scale. You will operate as a technical authority for reliability, automation, and operational architecture across reputed company’s GPU, network, and control-plane platforms.

What You'll Be Doing

- Owning and evolving the long-term reliability strategy for reputed company’s AI and HPC infrastructure
- Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
- Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
- Acting as a senior technical escalation point during critical incidents, guiding resolution and ensuring systemic fixes
- Identifying structural reliability risks and driving cross-functional initiatives to address them at the architectural level
- Partnering with Engineering, Network Operations, and Fleet Operations leadership to influence platform design and operational maturity
- Mentoring senior and mid-level engineers, raising the overall quality and effectiveness of SRE practices
- Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability

About You

- 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles operating large-scale infrastructure
- Expert-level software engineering skills, with a strong track record of building production-grade automation and systems
- Deep expertise in Linux, networking, and distributed systems design at scale
- Extensive experience debugging and resolving failures across hardware, OS, networking, and application layers
- Proven ability to lead technical initiatives across teams without direct authority
- Strong systems-thinking mindset, with the ability to balance reliability, velocity, and cost

Nice to Have

- Deep hands-on experience with AI or HPC platforms, including GPUs, high-speed interconnects (InfiniBand/RDMA), and workload schedulers (e.g. SLURM)
- Experience designing observability systems for high-cardinality, high-throughput environments
- Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures
- A history of driving step-change improvements in reliability, scalability, or operational efficiency

What We Can Offer You

At reputed company, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.

Highly competitive package (salary + equity) with reviews every 12 months. 🚀 Join the fastest-growing tech startup: your chance to push boundaries, collaborate with reputed company minds, and make your mark on cutting-edge AI. ✨ Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.

Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments. Join our thriving remote-first team. Geography is no barrier to impact or growth. We build seamless virtual collaboration, empowering you, wherever you work.

Equal Opportunities Statement

We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds. If there’s anything we can do to accommodate your specific situation, please let us know.

The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.

The range below reflects the base salary for the position. Actual compensation may vary based on job-related factors such as skill set, experience, education, and location. In addition to base salary, this role may be eligible for bonus, equity, and/or commission programs. reputed company may offer a competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.

Salary Range: $150,000–$2,150,000 USD

For information on how reputed company handles candidate personal data, please see our Employee & Candidate Privacy Notice.
