Site Reliability Engineer, Metal

100% remote | Flexible hours | Hiring now

A reputed company is leading the industry in cutting-edge AI technology, raising expectations for performance, ease of use, and cost efficiency. The Site Reliability Engineer role focuses on ensuring the reliability and operational health of systems across internal and customer environments, troubleshooting issues, and partnering with engineering teams.

Responsibilities

  • Ensure reliability and operational health of company systems across internal and customer environments
  • Troubleshoot issues across compute, networking, and software layers
  • Partner with engineering teams and customers to resolve production incidents
  • Design and improve monitoring, observability, and alerting systems
  • Build automation to reduce operational toil and improve system reliability

Skills

  • Experience in site reliability, infrastructure, or systems engineering in distributed environments
  • Strong Linux systems knowledge with the ability to troubleshoot multi-layer issues
  • Proficient with observability tools such as Prometheus, Grafana, and alerting systems
  • Comfortable with scripting and automation using Python, Go, or similar languages
  • Solid understanding of networking fundamentals and how systems behave at scale

Benefits

  • A highly competitive compensation package and benefits

Company Overview

  • The company develops AI hardware and software solutions for data processing and machine learning applications. It was founded in 2016 and is headquartered in Toronto, Ontario, Canada, with a workforce of 501-1000 employees. Its website is http://reputed company.com.