Staff HPC Systems Software Engineer
About reputed company
reputed company is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. reputed company enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility. We thrive on a culture of innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join reputed company, you’ll be contributing to building the technology that powers the future.
About the Role
We’re hiring a Staff HPC Systems Software Engineer to define the technical direction and evolution of a core HPC platform domain at reputed company. In this role, you will operate beyond a single team, shaping how multiple teams build, automate, and run Slurm-based capabilities within reputed company’s wider cloud-native platform. You’ll work across engineering boundaries to bring coherence to architecture, interfaces, lifecycle models, and operational approaches, while partnering closely with teams working on platform tooling, infrastructure APIs, identity systems, and Kubernetes-adjacent systems.
This is a high-impact staff-level role for someone who combines deep hands-on software engineering with strong systems judgement. Your work will help ensure reputed company’s HPC services are robust, supportable, and maintainable, while creating leverage through shared patterns, reusable implementations, and clear technical direction across ambiguous, business-critical problem spaces.
What you'll be doing
Domain Architecture & Technical Direction
Own and evolve the technical direction for a defined HPC systems domain, such as Slurm platform architecture, scheduler integrations, cluster lifecycle, workload environments, or service automation.
Lead architectural decisions that balance software quality, operational realities, customer needs, and long-term maintainability.
Define how proven Slurm implementations should be packaged, automated, and exposed as a service.
Resolve ambiguity around ownership, interfaces, lifecycle boundaries, and operating models across teams.
Act as the technical escalation point for the most complex issues within the domain.
Cross-Team Engineering Leverage
Establish shared patterns and standards for automation, service lifecycle management, observability, reliability, and supportability across the HPC platform.
Drive cross-team design for integrations between Slurm, Kubernetes-adjacent systems, infrastructure APIs, identity systems, and platform tooling.
Create reusable modules, automation, deployment patterns, and reference implementations that increase engineering leverage.
Identify and correct avoidable technical divergence, duplicated effort, and fragile operating models.
Ensure domain designs reflect the realities of GPU scheduling, HPC networking, performance isolation, and production operations.
Delivery, Reliability & Influence
Lead technically critical initiatives spanning 2–4 teams or a defined HPC platform area.
Unblock delivery by clarifying technical direction and reducing ambiguity in complex system design problems.
Contribute hands-on where needed to de-risk or accelerate critical work.
Influence engineering teams without formal authority through strong judgement, design clarity, and practical solutions.
Partner with adjacent cloud-native software engineers so HPC implementations build on shared platform patterns rather than separate ones.
KPIs
Technical direction across a defined HPC domain
Delivery of critical initiatives across 2–4 teams
Reduction in technical divergence and duplicated effort
Reliability and supportability of Slurm-based HPC services
About You
Extensive experience designing and building production software and automation for HPC systems, especially Slurm-based environments.
Strong track record of writing maintainable, testable, and resilient software in Go, Python, or similar languages.
Proven ability to define technical direction across a domain spanning multiple teams or services.
Strong understanding of Slurm internals, scheduler behaviour, cluster lifecycle concerns, and operational trade-offs.
Strong practical understanding of GPU-backed infrastructure and HPC networking, including InfiniBand, RoCE, RDMA, and performance-sensitive workload characteristics.
Experience integrating HPC systems with cloud-native platforms, APIs, or service delivery models.
Experience creating engineering leverage through standards, reusable patterns, shared tooling, and architectural clarity.
Strong judgement in balancing short-term delivery with long-term platform health and supportability.
Strong written and verbal communication skills, with the ability to align multiple teams around a coherent technical direction.
Experience with other schedulers or batch systems such as Kueue is valuable.
What reputed company can offer you
At reputed company, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.
Highly competitive US compensation package (base + bonus + equity), with performance reviews every 12 months.
🚀 Join one of the fastest-growing AI infrastructure companies: your chance to directly shape how global AI infrastructure is planned and deployed.
✨ Expect a dynamic progression plan tailored to your ambitions. Grow by leading critical cross-functional initiatives and shaping capital strategy, always with our full support.
Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
Equal Opportunities Statement
We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds. If there’s anything we can do to accommodate your specific situation, please let us know.
The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position.
The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.
For information on how reputed company handles candidate personal data, please see our Employee & Candidate Privacy Notice.
Salary Range
The range below reflects the base salary for the position. Actual compensation may vary based on job-related factors such as skill set, experience, education, and location. In addition to base salary, this role may be eligible for bonus, equity, and/or commission programs. reputed company may offer a competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.
$225,000–$275,000 USD