Staff DevOps Engineer — Developer Infrastructure - US (Remote)
The Role
We're hiring a Staff DevOps Engineer to build the infrastructure that powers our development pipeline: the environments, quality gates, and internal tooling that enable both engineers and AI agents to ship validated, production-ready code with high confidence.
You will shape and build pre-production environment infrastructure, automated quality gates (including load testing and performance validation), agent-friendly local development environments, and internal tools that make the pipeline observable and operable. You'll work across EKS, core AWS services, and our CI/CD stack to create infrastructure that proves code is correct.
This role sits at the intersection of platform engineering and developer tooling. Tools like Claude Code are part of the daily workflow for infrastructure operations, debugging, and automation. You'll collaborate closely with architects and product teams to reduce the friction of validating work, expand automated test coverage, and build the internal tools that let us operate our pipelines at increasing scale and autonomy.
What You’ll Do
Agent and developer environment infrastructure: Design and operate ephemeral, pre-warmed development environments that agents and engineers can spin up on demand. Extend our internal CLI (luxp) so that a new engineer or an AI agent can run luxp local start and have a working, validated environment in minutes, with service discovery, dependency management, and local configuration handled automatically. Build environment drift monitoring to ensure dev environments match production behavior.
Pre-production quality gates: Own the infrastructure-level gates that prove a deploy is safe before it reaches production. Build and operate automated load testing, performance benchmarking, and security scanning gates in the pipeline. Partner with QA and engineering to expand test coverage across services. The gates apply equally to all contributions, regardless of author (or agent).
Pre-PR validation infrastructure: Build containerized mock services (generated from OpenAPI specs) so contributors can validate integration code against realistic third-party dependencies locally. Stand up Playwright-based UI validation in agent and CI loops. Create the infrastructure that supports iterative self-refinement — where an agent or engineer can run their output, capture what failed, and iterate before opening a PR.
Internal tooling and dashboards: Build the review tooling, metrics dashboards, and operational controls that make our pipelines observable and improvable (especially at increased throughput). Surface scoring signals, approval trends, pass rates, and common failure modes. Create the policy layer that defines the approval requirements per component or per task type.
6–10+ years in DevOps, Platform Engineering, or SRE roles building and operating production systems at scale.
Active user of AI development tools (Claude Code, etc.) in your infrastructure workflow. We use AI assistants daily for Terraform changes, Kubernetes debugging, automation scripting, and operational investigations. You should be someone who reaches for these tools naturally and has opinions about where they help and where they don't.
Expertise with Kubernetes (EKS) and AWS (IAM, VPC, ECR, SSM/Secrets Manager, S3, SQS, RDS).
Strong IaC experience (Terraform preferred) and GitOps workflows (Argo CD or similar).
Proven track record building ephemeral environments, developer tooling, or internal platforms (CLIs, scaffolding tools, developer portals).
Experience with load testing frameworks (k6, Locust, Gatling, or similar) and automating performance gates in CI/CD pipelines.
Examples of building mock or stub infrastructure for integration testing at scale — containerized services, API mocking, dependency isolation.
CI/CD depth (GitHub Actions or similar), including caching/parallelism, artifact management, test reliability, and pipeline observability.
Experience with release strategies (canary/blue-green, automated rollbacks) and progressive delivery.
Observability fundamentals (OpenTelemetry) with the ability to define SLIs/SLOs and tie them to delivery decisions.
Excellent cross-team communicator who can translate platform constraints into developer-friendly solutions and documentation.
Infrastructure: AWS, EKS, Terraform, Argo CD, Vault
CI/CD: Argo CD (GitOps), GitHub Actions
Messaging: Kafka (Confluent Cloud)
Observability: OpenTelemetry
Languages/Apps: Node.js/TypeScript microservices, Python jobs, React front-ends
You think about infrastructure as a product — you talk to the engineers using your tools, measure adoption, and iterate based on what you learn.
You're energized by building systems that multiply other people's output, not just keeping the lights on.
You bias toward automation, reproducibility, and measurable outcomes. If a human is doing it repeatedly, you build a tool to do it instead.
You operate with high ownership across team boundaries: Infrastructure, DevEx, QA, and product engineering are all your collaborators.
You use AI tools to move faster without sacrificing rigor. You know when to trust the output and when to verify, and you help the team develop good patterns for AI-assisted infrastructure work.