Site Reliability Engineer, Metal
A reputed company is leading the industry in cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. The Site Reliability Engineer role involves ensuring the reliability and operational health of AI systems across internal clusters and customer deployments, troubleshooting multi-layer issues, and partnering with engineering teams to resolve production incidents.
Responsibilities
- Ensure reliability and operational health of AI systems across internal and customer environments
- Troubleshoot issues across compute, networking, and software layers
- Partner with engineering teams and customers to resolve production incidents
- Design and improve monitoring, observability, and alerting systems
- Build automation to reduce operational toil and improve system reliability
Skills
- Experience in site reliability, infrastructure, or systems engineering in distributed environments
- Strong Linux systems knowledge with the ability to troubleshoot multi-layer issues
- Proficient with observability tools such as Prometheus, Grafana, and alerting systems
- Comfortable with scripting and automation using Python, Go, or similar languages
- Solid understanding of networking fundamentals and how systems behave at scale
Benefits
- Highly competitive compensation package and benefits
Company Overview