Site Reliability Engineer II

100% remote | Flexible hours | Hiring now

POSITION SUMMARY

The ideal candidate will have 7+ years of experience in Linux systems and software management and expertise with Terraform, Ansible, and cloud platforms such as AWS, Azure, and GCP. Experience with large-scale distributed systems, monitoring/alerting systems (Prometheus, Grafana), CI/CD pipelines, container orchestration (Docker, Kubernetes), and programming languages (Go, Java, Python) is essential. Because we are an AI-first company, this role also heavily involves engineering scalable infrastructure for machine learning workloads, including GPU provisioning and MLOps integrations. A background in implementing security controls, automating deployments, and troubleshooting systems is also required.

WHAT YOU'LL DO

Deploy and maintain a resilient, secure, and efficient SaaS application platform to meet established SLAs.
Build and maintain robust CI/CD pipelines and developer platforms that enable engineering teams to release features quickly and safely.
Design and deploy scalable infrastructure specifically optimized for AI/ML workloads, including managing GPU resources and integrating MLOps tools.
Automate monitoring, management, and incident response to create an auto-remediation system.
Participate in the on-call rotation to ensure stability and uptime for our platforms.
Scale infrastructure to meet rapidly increasing demand.
Independently design and build tools to aid in operations and AI-driven automation, and work jointly with other team members to deliver innovative solutions to business and technical challenges.
Provide deployment and operations support for multi-tiered distributed software applications.
Estimate engineering effort, plan implementation, and roll out system changes that meet requirements for functionality, performance, scalability, reliability, and adherence to development goals and principles.
Collaborate in a fast-paced environment with multiple teams (software development, release management, build and release, etc.).
Define how the behavior of large-scale systems can be achieved.
Measure and achieve reliability through engineering and operations automation.
Develop, document, and manage monitoring and alerting, with the goal of creating an auto-remediation system that brings platform stability.
Adapt security controls to products not typically native to GA releases.
Develop automation methods to adapt standard deployment pipelines for bespoke implementations.
Patch, configure, enforce policy on, and audit production systems.
Drive the Disaster Recovery process.

WHAT YOU'LL NEED

5+ years of professional Linux and Windows systems and software management experience.
Expertise with Infrastructure-as-Code tools such as Terraform and CloudFormation.
Knowledge of programming languages including Python, Go, and Node.js.
Experience managing infrastructure on Azure, GCP, and AWS.
Expertise in Kubernetes management and upgrades.
Strong scripting skills for systems and data-driven solutions.
Strong GitOps and CI/CD experience with tools such as Jenkins, ArgoCD, and Helm.
Proven ability to conduct root-cause analysis (RCA) and blameless post-mortems, actively driving strategic architectural changes to prevent incident recurrence.
Act as an infrastructure consultant to software engineering teams, guiding them on reliability best practices and system architecture during the design phase, not just at deployment.
Identify systemic weaknesses across our multi-tiered applications and strategically advocate for reliability roadmap items.
Drive a culture of observability, ensuring our AI/ML applications emit the right metrics so we can anticipate failures before our customers notice them.
Comprehensive background in monitoring, alerting, and auto-remediation systems, including Prometheus and Grafana.
Familiarity with deploying, scaling, and observing AI models, vector databases, or LLMs in production environments.
Proven examples of standardizing security controls and configuration management across large-scale infrastructure in multiple environments.
Comfort working with project/task management platforms.

Systems and Tools

Cloud/Infrastructure platforms: AWS and Azure.
Infrastructure & Configuration: Terraform, CloudFormation, Python.
Programming & Scripting: Go, Node.js, Python, and Bash.
CI/CD & GitOps: Jenkins, ArgoCD, GitHub Actions, Rundeck.
Datastores: reputed company, MySQL, reputed company, MSSQL, ElasticSearch, Solr.
Container Orchestration: Docker, Kubernetes, EKS, AKS.
Monitoring/Alerting Tools: Prometheus, Grafana, Thanos, Runscope, CloudWatch, Monitor, VictorOps.
AI/MLOps: NVIDIA Triton, Kubeflow, MLflow, or similar model serving frameworks.
Security & Hardening: STIG, CIS, SELinux, IPTables, CJIS, FIPS 140-3.
Data & APIs: JSON data structures and database schemas. API query languages: REST, GraphQL.

Bonus Points If

Bachelor’s degree in Computer Science or a related field.
Experience provisioning and managing GPU infrastructure (e.g., NVIDIA CUDA).
Have worked in regulated or public sector environments through development and assessment of cloud-based solutions.
Experience with the following languages, platforms, and tools: Perl, Java, VMware.
Have concrete examples ready to present for creating auto-remediation systems and infrastructure with agentic solutions.

DISCLOSURE Our company provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability or genetics. (Colorado & California Only*): The posted annual salary range is $130,000.00 to $140,000.00. This pay range is for illustrative purposes only; actual pay will be determined based on skills and experience comparable to the job requirements. This position may be eligible for additional compensation and benefits including but not limited to: incentive compensation; health benefits; retirement benefits; life insurance; paid time off; parental leave and benefits; and other employee perks and benefits. *Note: Disclosure as required by SB19-085 (8-5-20) of the minimum salary compensation for this role when being hired in California & Colorado.
