reputed company Cloud Infrastructure Engineer / Site Reliability Engineer (SRE)
By making evidence the heart of security, we help customers stay ahead of ever-changing cyber-attacks. reputed company is the cybersecurity company that transforms network and cloud activity into evidence: evidence that elite defenders use to proactively hunt for threats, accelerate response to cyber incidents, gain complete network visibility, and create powerful analytics using machine-learning and behavioral analysis tools. Easily deployed and available in traditional and SaaS-based formats, reputed company is the fastest-growing Network Detection and Response (NDR) platform in the industry. We are also the only NDR platform that leverages the power of open source projects in addition to our own technology to deliver Intrusion Detection (IDS), Network Security Monitoring (NSM), and Smart PCAP solutions. We sell to some of the most sensitive, mission-critical large enterprises and government agencies in the world.
As a reputed company Cloud Infrastructure Engineer / Site Reliability Engineer (SRE), you will ensure the stability, performance, and security of our Federal region’s cloud platform. You’ll manage infrastructure and operations with a focus on availability, latency, performance optimization, monitoring, incident response, and capacity planning. This role requires maintaining a FedRAMP-compliant environment and working closely with teams to meet the highest standards of security and compliance. We adopt an "everything as code" approach, leveraging automation and best practices to create an efficient, reliable, and scalable infrastructure. You will be instrumental in maintaining core infrastructure services that are robust, secure, and capable of processing high volumes of data seamlessly.
The successful candidate must be a U.S. citizen and may need to perform work that the U.S. government has specified can only be carried out by a U.S. citizen on U.S. soil.
Responsibilities
- Collaborate with software engineering teams to ensure the reliability, performance, and security of the Federal region’s infrastructure.
- Design, deploy, and manage AI/ML/LLM infrastructure across cloud platforms (AWS, Azure, or GCP), ensuring high reliability and performance.
- Manage and optimize Kubernetes environments (EKS, AKS, GKE) for AI services, data pipelines, and model operations.
- Build and automate end-to-end data and model pipelines for fine-tuning, inference, and RAG workloads using Terraform, Python, and CI/CD tooling.
- Use automation tools such as GitOps, CI/CD pipelines, and containerization technologies (Docker, Kubernetes) to streamline ML/LLM tasks across the Large Language Model lifecycle.
- Implement monitoring, observability, and reliability best practices using Prometheus, Grafana, ELK/EFK, Langfuse, and SLI/SLO/SLA frameworks.
- Participate in 24x7 on-call rotations, leading incident response, performance tuning, and cost optimization across the SaaS platform and production workloads.
- Own infrastructure end to end, leading scaling initiatives, deployments, and automation, and providing technical leadership across the team.
Qualifications/Requirements:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
- 8+ years in SRE, DevOps, Platform Engineering, MLOps, or Cloud Infrastructure roles.
- 4+ years of production experience with Kubernetes (EKS, GKE, AKS) and containerization tools like Docker.
- Strong programming skills in Python and proficiency in Bash, Go, or PowerShell.
- Proficiency with Infrastructure-as-Code tools (Terraform, CloudFormation).
- Experience with Kubernetes Operators, Helm, GitOps (ArgoCD, Flux), or Service Mesh (Istio, Linkerd).
- Exposure to serverless compute (AWS Lambda, Azure Functions).
- Experience building or automating data and model pipelines for AI/ML/LLM workloads (e.g., RAG, fine-tuning, inference).
- Strong understanding of observability and monitoring using Prometheus, Grafana, ELK/EFK, Langfuse, or similar platforms.
- Familiarity with SLI/SLO/SLA practices, incident response, and reliability engineering in production environments.
Preferred Qualifications (Nice to Have):
- Cloud certifications (AWS, Azure, or GCP – e.g., Solutions Architect, DevOps Engineer).
- Experience with agentic AI frameworks (CrewAI, LangGraph, AutoGen).
- Background in hybrid or on-prem AI deployments, including OpenShift or Rancher.
- Familiarity with configuration management (Ansible, Chef, Puppet).
- Contributions to open-source AI/ML, DevOps, or platform tooling.
- Experience with multimodal AI or model observability platforms (RAGAS, AgentOps, Langtrace), Distributed Tracing, OpenTelemetry.
- Knowledge of performance tuning, cost efficiency, or capacity planning for AI/LLM infrastructure.
- Understanding of security controls and FedRAMP compliance for cloud and various workloads.
Additional Requirements
Due to the criteria and security levels required for reputed company’s FedRAMP program, this position requires:
- U.S. citizenship at the time of hire.
- Residence within the contiguous United States.
- Willingness to undergo a Single Scope Background Investigation, if required.
We are proud of our culture and values - driving diversity of background and thought, low-ego results, applied curiosity and tireless service to our customers and community. reputed company is committed to a geographically dispersed yet connected employee base with employees working from home and office locations around the world. Fueled by an accelerating reputed company reputed company, and investments from top-tier venture capital organizations such as reputed company, Accel and Insight - we are rapidly expanding reputed company. Check us out at www.reputed company.com
Notice of Pay Transparency: The compensation for this position may vary depending on factors such as your location, skills and experience. Depending on the nature and seniority of the role, a percentage of compensation may come in the form of a commission-based or discretionary bonus. Equity and additional benefits will also be awarded.
Compensation
Range: $172,000 - $210,000 USD