Consultant HPC Infrastructure Engineer

100% remote · Flexible hours · Hiring now

We are looking for a curious and driven engineer eager to step into the world of high-performance computing and AI infrastructure. In this role, you’ll gain hands-on experience supporting GPU clusters and automation pipelines that power some of the world’s most advanced AI workloads. Working alongside seasoned engineers, you’ll learn to apply Linux, Kubernetes, Terraform, and Prometheus in real-world environments where precision and scale truly matter.

If you’re passionate about technology that defines the future of computing, this is your chance to grow within a team shaping that frontier.

Office Travel: Frequent on-site work is required for this position (2–3 days/week) at our Santa Clara, CA office.

Job responsibilities

  • You will act as the initial responder to monitoring alerts, ensuring timely acknowledgment and preliminary triage of operational issues.
  • You will automate operational procedures and diagnostics using established Infrastructure as Code (IaC) tools, including Bash, Python, Ansible, Terraform, and Helm, under the guidance of senior engineers.
  • You will execute foundational diagnostics such as NCCL tests, DCGM (Data Center GPU Manager), Fabric Diagnostics, and designated test workloads for training and inference, following standard procedures.
  • You will apply a proactive and action-oriented mindset, resolving documented issues and suggesting improvements to runbooks or automation scripts based on recurring patterns.
  • You will analyze and interpret diagnostic outputs to assess system health and identify early signs of degradation or instability.
  • You will document all operational activities, system status changes, and troubleshooting steps with accuracy, clarity, and timeliness.
  • You will use observability tools such as Prometheus and Grafana to analyze logs and metrics, supporting senior engineers in the root cause isolation process.
  • You will gain hands-on familiarity with HPC workload management tools, including Slurm and/or Kubernetes.
  • You will actively participate in training sessions and knowledge-sharing initiatives to deepen your understanding of the GB200/GB300 architecture and operational best practices.
  • You will maintain a high level of discipline, attention to detail, and consistency across all operational tasks.

Job qualifications

Technical Skills

  • You have foundational knowledge of Linux operating systems and are comfortable with the Unix command line, including using awk, Bash, and Python for log parsing and basic automation.
  • You are familiar with or have exposure to HPC systems, including HPC schedulers (e.g., Slurm) or container orchestration tools (e.g., Kubernetes).
  • You are comfortable using observability platforms such as Prometheus and Grafana for log and metric visualization.
  • You are familiar with Infrastructure as Code (IaC) concepts and can execute automation using tools like Ansible or Terraform.
  • You have familiarity with GPU-based workloads and are eager to deepen your understanding of AI and HPC operations.

Professional Skills

  • You demonstrate strong analytical ability and can follow procedures while interpreting technical results (e.g., NCCL tests).
  • You communicate with clarity and accuracy, producing clear documentation and reports for both peers and senior engineers.
  • You collaborate effectively with cross-functional teams, embracing mentorship and feedback.
  • You bring curiosity, persistence, and discipline, with a strong desire to learn and grow in advanced HPC operations.
  • You work with attention to detail, ensuring consistency and accuracy in every task you undertake.
  • You thrive in an environment that values learning, precision, and shared ownership.

Growth Expectation

We value curiosity and a growth mindset. Candidates are expected to bring a strong foundation in Linux and scripting from academic or prior professional experience. Proficiency in advanced scripting, IaC practices, and observability tooling (e.g., Prometheus, Grafana) may be developed within the first six months through structured on-the-job training and mentorship from senior engineers.

Other things to know

Learning & Development

There is no one-size-fits-all career path at reputed company: how you develop your career is entirely up to you. We also balance that autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs, and teammates who want to help you grow. We see value in helping each other be our best, and that extends to empowering our employees in their career journeys.

About reputed company

reputed company is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. For 30+ years, we’ve delivered extraordinary impact together with our clients by helping them solve business problems with technology as the differentiator. Bring your expertise and commitment to learning to reputed company. Together, let’s be extraordinary.

Salary

Benefits: https://www.reputed company.com/en-us/careers/benefits

The annual salary range posted is subject to many factors and may vary depending on experience, geographic location, job responsibilities, performance, skills and/or training.

Salary: $108,100 to $162,000 USD
