Sr Manager, Cloud Infrastructure Engineer, Scientific Computing and HPC

100% remote Flexible hours Hiring now

About the position

Role summary

The company is committed to applying computational science to drug discovery and development. As part of this mission, we have recently embarked on a large-scale migration of our computational infrastructure to the cloud. This role draws on extensive experience in cloud engineering and DevOps and requires a hands-on approach to designing and delivering robust High Performance Computing (HPC) solutions that support computational workloads across the organization. We are seeking an individual to drive architecture, infrastructure automation, migration, and operational excellence. You will collaborate with HPC engineers and scientific computing specialists to build scalable cloud-native infrastructure that underpins modernization of the scientific computing platform. Responsibilities span three areas: platform architecture and engineering, automation and DevOps, and monitoring and reliability.

Responsibilities

  • Design, implement, operate, and own robust and dependable infrastructure for HPC and ML/AI workloads in a cloud environment (AWS/GCP).
  • Lead containerization, deployment, and operation of user- and admin-facing HPC platforms (Slurm, Open OnDemand, Prometheus/Grafana, batch and distributed computing platforms) across cloud environments.
  • Translate stakeholder input into robust, high-performance, scalable, cost-effective computing platforms.
  • Partner with HPC specialists (engineers, administrators, and users) to capture institutional knowledge and manual processes in IaC workflows, transforming deployment practices into reproducible, version-controlled, automated procedures.
  • Build and maintain infrastructure automation using IaC tools such as Terraform and CloudFormation to ensure repeatable environment provisioning and scaling.
  • Create reusable Terraform modules.
  • Define and enforce standards.
  • Drive the implementation and maintenance of cloud infrastructure using IaC tools.
  • Operationalize containerized solutions using Docker and Kubernetes.
  • Own the full lifecycle of infrastructure management, from provisioning to operations, support, updating, and teardown of production computing platforms.
  • Perform troubleshooting, system analysis, and benchmarking to resolve issues and maintain a high-performance environment.
  • Build and maintain monitoring, logging, and alerting for the infrastructure (e.g., CloudWatch, Prometheus/Grafana).
  • Design new dashboards, workflows, and utilities to improve observability, cost monitoring, workload efficiency, and the user and administrator experience.
  • Document architecture, deployment processes, and operational procedures.
  • Partner closely with team members to support delivery of scientific computing services including user support, Linux administration, operations, job scheduling, application management, and resource optimization.

Requirements

  • B.S. in computer science, life science, data science, or a similar field.
  • 6+ years of experience in cloud infrastructure engineering with a proven track record of developing and supporting robust IaC deployments.
  • Experience managing scientific computing workloads in an enterprise environment.
  • Advanced experience with AWS or GCP (at least one), including knowledge of core compute and storage services relevant to HPC.
  • Solid understanding of cloud networking, identity, and security controls.

Nice-to-haves

  • Prior experience with HPC deployment utilities including AWS ParallelCluster, AWS Parallel Computing Service, and Google Cloud Cluster Toolkit.
  • Proficiency with distributed computing environments, especially EKS/GKE/Kubernetes.
  • Familiarity with HPC environments, job schedulers (Slurm), HPC application containers (Docker, Singularity, Apptainer), and GPU computing.
  • A breadth of leadership experience and capabilities, including the ability to influence and collaborate with peers, develop and coach others, and oversee and guide the work of other colleagues to achieve meaningful outcomes and create business impact.

Benefits

  • participation in the company's Global Performance Plan with a bonus target of 17.5% of base salary, and eligibility to participate in our share-based long-term incentive program
  • 401(k) plan with company matching contributions and an additional company retirement savings contribution
  • paid vacation, holiday and personal days
  • paid caregiver/parental and medical leave
  • health benefits, including medical, prescription drug, dental, and vision coverage
