[Remote] Platform Engineer
Note: This is a remote position open to candidates in the USA. A reputed company is seeking a Senior Infrastructure Architect / Platform Engineer for its AI/ML platform to provide technical leadership for cloud platforms that support enterprise-scale generative AI applications. The role involves defining infrastructure architecture, leading platform standards, and collaborating with engineering teams to raise operational maturity across AI platforms.
Responsibilities
- Define and drive the technical strategy for AI/ML platform infrastructure supporting generative AI applications, LLM integrations, model routing, and enterprise AI services
- Architect, build, and operate scalable cloud platforms using AWS services such as EKS, ECS Fargate, DynamoDB, S3, OpenSearch, Secrets Manager, CloudWatch, ALB, and MWAA
- Establish reusable infrastructure patterns using CloudFormation, Helm, and Terraform to support reliable multi-environment and multi-region deployments
- Lead CI/CD architecture using GitHub Actions, reusable workflows, OIDC-based AWS authentication, automated quality gates, deployment promotion, and environment approvals
- Design and improve observability across AI platforms, including CloudWatch dashboards, logs, alarms, Prometheus/Grafana, OpenSearch, Langfuse, and LLM-specific operational metrics
- Build platform capabilities for GenAI workloads, including model availability monitoring
- Partner with software engineering teams to improve deployment reliability, rollback strategies, health checks, autoscaling, load testing, and runtime performance
- Define and enforce security and compliance practices for infrastructure, including IAM permission boundaries, Secrets Manager usage, secret scanning, audit logging, tagging standards, and change-management controls
- Provide technical leadership for cost optimization, capacity planning, environment standardization, and operational excellence across development, test, production, and sandbox environments
- Mentor engineers, review architecture and infrastructure designs, and influence platform engineering practices across teams
- Troubleshoot complex production issues across cloud infrastructure, networking, containers, serverless workloads, CI/CD systems, and observability platforms
- Translate enterprise requirements for security, compliance, reliability, and governance into pragmatic engineering standards and automation
Skills
- Bachelor's degree in Computer Science, Engineering, Information Technology, or a related technical field, or equivalent practical experience
- 7+ years of experience in DevOps, platform engineering, cloud infrastructure, site reliability engineering, or software engineering roles
- Strong hands-on experience with AWS/Azure/GCP infrastructure and services, including container, serverless, networking, storage, observability, and security services
- Experience designing and operating production systems on Kubernetes, ECS/Fargate, or comparable container orchestration platforms
- Proficiency with infrastructure-as-code, especially CloudFormation, Terraform, Helm, or similar tooling
- Strong CI/CD experience with GitHub Actions or similar platforms, including reusable workflows, automated testing, deployment gates, and cloud authentication
- Experience building and operating observability solutions using CloudWatch, Prometheus/Grafana, OpenSearch, or similar tools
- Strong understanding of cloud security practices, IAM, secrets management, least-privilege access, audit logging, and compliance requirements
- Experience supporting distributed systems, microservices, APIs, asynchronous workloads, and multi-environment deployments
- Demonstrated ability to lead technical design, mentor engineers, and influence engineering practices across teams
- Experience supporting AI/ML or generative AI platforms, including LLM gateways, model routing, observability, token metering, or model failover
- Experience operating platforms in regulated enterprise environments, ideally healthcare, pharmaceutical, finance, or life sciences
- Experience with multi-account, multi-region AWS architectures and enterprise governance patterns
- Experience with cost optimization, autoscaling strategies, capacity planning, and cloud budget monitoring
- Experience with load testing and performance validation using tools such as Locust or comparable frameworks
- Strong Python or scripting skills for platform automation, operational tooling, and CI/CD extensions
- Ability to communicate complex technical decisions clearly to engineering, security, operations, and leadership audiences
Company Overview