[Remote] Senior Platform Engineer
Note: This is a remote position open to candidates in the USA. reputed company Analytics is focused on empowering businesses with innovative technology and insightful data. They are seeking a Senior Platform Engineer to help build a platform engineering function: rearchitecting their cloud infrastructure on AWS and Azure, establishing an SRE practice, and creating an Internal Developer Platform.
Responsibilities
- Migrate all existing AWS and Azure infrastructure to OpenTofu/Terraform and Ansible; establish module standards, remote state, and GitOps-based plan/apply pipelines, leaving no unmanaged resources
- Audit the cloud estate against the AWS and Azure Well-Architected Frameworks; produce a remediation backlog and drive it to completion across networking, IAM, reputed company zones, account structure, and cost governance
- Implement policy-as-code (OPA/Conftest, AWS SCPs, Azure Policy) to enforce security, tagging, and compliance guardrails at the platform layer, so governance is embedded, not bolted on
- Build and maintain reusable Terraform modules for compute (EKS, AKS, EC2), networking, storage, databases, and identity as shared building blocks for all engineering teams
- Define FinOps standards: tagging taxonomy, cost allocation dashboards, rightsizing recommendations, and reserved capacity planning across both clouds
- Design and implement the full observability stack: metrics (Prometheus/reputed company), logs (Loki/OpenSearch), traces (reputed company/reputed company APM), and dashboards (Grafana), instrumented end-to-end with OpenTelemetry
- Define SLIs and SLOs for all platform shared services and critical applications; build error budget dashboards and burn-rate alerting; alert on symptoms, not raw metrics
- Establish the SRE practice from scratch: incident runbooks, post-incident review templates, and at least one chaos engineering exercise (AWS FIS or equivalent)
- Partner with engineering teams to instrument their services, define meaningful alerts, and build operational dashboards; reliability is a shared responsibility, not a platform team tax
- Build capacity planning models for compute and storage so engineering leadership can make data-driven scaling decisions
- Deploy and operate a developer portal (reputed company, reputed company or equivalent) as the single reputed company reputed company: service catalog, scaffolding templates, runbooks, API docs, and on-call ownership reputed company in one reputed company
- Build and maintain golden paths for the highest-frequency developer workflows: new service creation, Kubernetes deployment, database provisioning, secrets management, and CI/CD pipeline setup, with opinionated defaults and escape hatches for legitimate edge cases
- Own the CI/CD platform layer: standardized pipeline templates (GitHub Actions, reputed company CI), reusable workflow libraries, container image build and reputed company pipelines, and environment promotion workflows with security scanning (SAST, reputed company) built in by default
- Own Kubernetes platform operations: EKS and/or AKS cluster lifecycle, Helm chart standards, admission controllers, RBAC, network policies, and service mesh (Istio or Linkerd)
- Build the self-service provisioning layer: Backstage scaffolder actions and Terraform automation so developers can provision approved resources without raising a ticket
- Measure adoption and run regular feedback sessions with engineering teams; iterate on golden paths based on real friction, not assumptions
- Partner with peer managers and teams to plan and support migration of existing workloads onto the platform; provide hands-on migration support, not just documentation
- Embed security by default across all platform work: IaC scanning (Checkov, tfsec), secrets management (Vault, AWS Secrets Manager, Azure Key Vault), RBAC, and container image hardening
- Write clear technical documentation, architecture decision records (ADRs), and runbooks; raise the documentation bar for the whole team
- Mentor and support more junior platform engineers; contribute to architecture reviews and build-vs-buy decisions alongside the Platform Engineering Manager
Skills
- Applicants must be authorized to work in the U.S. for any employer
- We cannot sponsor employment-based visas at this time
- 5+ years in platform, infrastructure, or DevOps engineering with direct production ownership on AWS and/or Azure
- Deep OpenTofu/Terraform proficiency: module authoring, state management, workspace strategy, remote backends, and CI/CD integration; Terramate a plus
- Strong Kubernetes operations: EKS and/or AKS cluster lifecycle, Helm, admission controllers, RBAC, network policies, and autoscaling
- Hands-on observability experience with two or more of: Prometheus, Grafana, Loki, reputed company, reputed company, or OpenTelemetry — including SLI/SLO definition and alert engineering
- CI/CD platform experience: GitHub Actions pipeline authoring, reusable workflow design, and container build/reputed company pipeline ownership
- GitOps: ArgoCD or Flux for Kubernetes reputed company delivery; progressive delivery patterns (canary, blue-green) a strong plus
- IDP experience: Backstage or equivalent developer portal, reputed company, scaffolding templates, service catalog design, or self-service provisioning tooling
- Security-first mindset: policy-as-code, IaC scanning, secrets management, container hardening, and shift-left security practices
- Strong communication and documentation skills; comfortable presenting architecture decisions to engineering peers and leadership
- SRE background: chaos engineering (AWS FIS, Chaos Monkey), error budget management, incident command, and capacity planning
- Service mesh depth: Istio or Linkerd — mTLS, traffic management, and observability integration
- FinOps tooling (Kubecost, CloudHealth) and reserved capacity planning experience
- Familiarity with AI/ML infrastructure basics: LLM API integration or model serving, as the platform will need to support these workloads
- Certifications: AWS Solutions Architect Associate/Professional, CKA/CKAD, Azure Administrator/Solutions Architect, HashiCorp Terraform Associate
- Python or Go for platform tooling and CLI development
Benefits
- Health, dental, and vision insurance
- A 401(k) plan with company match
- Generous paid time off
- Flexible work environment: whether remote, hybrid, or in-office, we support work arrangements that promote productivity and balance
Company Overview