[Remote] SRE Platform Engineer

100% remote | Flexible hours | Hiring now

Note: This is a remote job open to candidates in the USA. A reputed company is seeking a Platform System Reliability Engineer to manage its EKS Kubernetes environment, which supports global grid software SaaS products. This role involves ensuring the security, scalability, and reliability of the infrastructure while overseeing the full lifecycle of production clusters.

Responsibilities

Help design and deploy hardened EKS clusters across multiple AWS regions, ensuring consistent security baselines
Build and maintain reusable Terraform and Ansible modules for automated provisioning of cloud infrastructure services, including networking, compute, storage, queues, and caches
Implement "Policy as Code" guardrails and secure network perimeters (Electronic Security Perimeters, ESPs) in alignment with NERC CIP and IEC 62443 standards
Standardize runbooks and operating processes required to run critical infrastructure with the highest reliability
Define and enforce Kubernetes resource quotas, limit ranges, and Pod priority classes to ensure mission-critical services receive prioritized compute resources
Manage the ingress strategy and service mesh architecture to facilitate secure, performant connectivity between distributed microservices
Conduct platform-level smoke tests, load tests, and disaster recovery exercises to validate that the infrastructure can meet 99.99% uptime targets
Partner with application teams to right-size containerized workloads, optimizing for both performance and cloud cost (FinOps)
Act as the highest technical escalation point for Kubernetes internals, troubleshooting issues such as failed pods, memory leaks, and network partitions
Conduct root cause analysis (RCA) for platform-level outages, implementing systemic fixes to prevent recurring failures
Proactively identify and automate repetitive operational tasks—such as cluster upgrades and OS patching—to ensure the team spends at least 50% of their time on engineering improvements
Institutionalize platform monitoring using Prometheus and Grafana, creating dashboards that surface the "Golden Signals" of cluster health

Skills

5 years of experience operating production-grade Kubernetes clusters at scale
Expert-level knowledge of multi-cluster management and performance tuning, and experience implementing observability tools such as Prometheus/Grafana, Splunk, etc.
Deep hands-on experience with AWS core services (EKS, EC2, ALB, S3, RDS, MSK)
Proficiency in Terraform, Ansible, and Python or Go for infrastructure automation, and in deployment tools like ArgoCD or Flux
Strong understanding of and hands-on experience with cloud networking concepts such as VPCs, routing, and load balancing, and security configurations such as encryption and certificate management
Bachelor's degree in Computer Science or a STEM major (Science, Technology, Engineering, and Math) with advanced experience
6–8 years in SRE or Platform Engineering roles supporting mission-critical, 24/7 cloud environments
Proven track record as a structured incident responder who can handle production-down and break-glass scenarios in mission-critical applications
Practical knowledge of NERC CIP, SOC 2, ISO 27001, or IEC 62443 compliance standards in a SaaS context
AWS Certified DevOps Engineer – Professional, CKA (Certified Kubernetes Administrator), or SRE Practitioner Certification
Experience supporting mission-critical systems in energy, utilities, or other high-stakes industrial sectors

Benefits

Relocation Assistance Provided: Yes

Company Overview

The company provides energy consulting, gas power, and grid solutions. It was founded in 2024 and is headquartered in Boston, Massachusetts, USA, with a workforce of more than 10,000 employees. Its website is https://www.gevernova.com.
