[Remote] Senior Cloud DevOps & Infrastructure Engineer
Note: This is a remote position open to candidates in the USA. A reputed company is seeking a Senior Cloud DevOps & Infrastructure Engineer with a focus on GCP and AI. The role involves designing, deploying, and maintaining secure and scalable cloud infrastructure on a multi-cloud platform with GCP as the primary cloud, while implementing GitOps best practices and supporting AI/ML workloads.
Responsibilities
- Infrastructure as Code (IaC): Architect and provision production-grade infrastructure using Terraform. Manage state files and modules, and ensure infrastructure immutability
- AI/ML: Experience with LLMs in a multi-cloud environment
- Kubernetes & Containerization: Design and manage clusters. Create and optimize Dockerfiles (multi-stage builds, distroless/hardened images). Manage deployments using Helm charts
- CI/CD & GitOps: Build end-to-end CI/CD pipelines using GitLab CI. Implement GitOps workflows to synchronize infrastructure and application state
- Design, configure, and manage scalable and secure cloud infrastructure for MLOps
- AI Infrastructure Support: Configure and maintain environments suitable for AI/ML workloads (GPU node pools, LLM integration, large model serving, high-performance storage)
- Production Support & Troubleshooting: Act as the primary escalation point for deployment failures and for network and infrastructure issues. Perform Root Cause Analysis (RCA)
- Security & Compliance: Implement 'Secure by Design' principles
- Good knowledge of network security, identity and privileged access management, and landing zone concepts for cloud platforms (Azure, AWS)
- Multi-Cloud Strategy: While GCP is primary, maintain and support secondary environments in AWS (and potentially Azure) to ensure business continuity
Skills
- 6–8 years of experience in Cloud Infrastructure & DevOps Engineering
- Expert in Kubernetes, Terraform, and GitLab CI/CD
- Experience supporting AI/ML workloads
- Architect and provision production-grade infrastructure using Terraform
- Experience with LLMs in a multi-cloud environment
- Design and manage Kubernetes clusters
- Create and optimize Dockerfiles (multi-stage builds, distroless/hardened images)
- Manage deployments using Helm charts
- Build end-to-end CI/CD pipelines using GitLab CI
- Implement GitOps workflows to synchronize infrastructure and application state
- Design, configure, and manage scalable and secure cloud infrastructure for MLOps
- Configure and maintain environments suitable for AI/ML workloads (GPU node pools, LLM integration, large model serving, high-performance storage)
- Act as the primary escalation point for deployment failures and for network and infrastructure issues
- Perform Root Cause Analysis (RCA)
- Implement 'Secure by Design' principles
- Good knowledge of network security, identity and privileged access management, and landing zone concepts for cloud platforms (Azure, AWS)
- Maintain and support secondary environments in AWS (and potentially Azure)
- Deep expertise in GCP (Compute Engine, GKE, Cloud Storage, IAM)
- Strong working knowledge of AWS (EC2, EKS, S3, IAM)
- Knowledge of programming languages (Python required; Java, C#, or JavaScript is a plus)
- Advanced proficiency in Kubernetes
- Ability to write and manage custom Helm charts
- Experience with Ingress Controllers (Nginx), Service Mesh, and Autoscaling (HPA/VPA/Cluster Autoscaler)
- Expert-level knowledge of GitLab CI/CD (writing .gitlab-ci.yml, runners, artifacts, caching)
- Understanding of GitOps principles
- Strong hands-on experience with Terraform for provisioning cloud resources across multiple environments (Dev/Stage/Prod)
- Proficiency in Bash/shell scripting and Python
- Strong Linux administration skills
- Experience setting up monitoring with cloud-native tools such as Prometheus and Grafana
- Experience with Azure Cloud infrastructure
- Knowledge of Identity Providers (Keycloak, Azure AD/Entra ID) and OIDC integration
- Experience with Service Mesh
- Understanding of ITIL processes (Incident/Change Management) and tools like ServiceNow and JIRA
- Basic understanding of Python/Flask/FastAPI applications to assist developers in troubleshooting