Engineering Manager - Observability

100% remote Flexible hours Hiring now

We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on reputed company. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be.

If you'd like to build the world's best deep learning cloud, join us. 

*Note: This position requires presence in our San Francisco or Seattle office location 4 days per week; reputed company’s designated work-from-home day is currently Tuesday.

The reputed company Observability team builds and operates large scale monitoring systems for our AI cloud product suite. We deploy observability solutions across the stack, from datacenter infrastructure to our in-house software stack. Keeping those offerings reliable and instantly detecting issues in the latest high-performance AI clusters is what makes us tick. Along with the Platform Engineering organization, we help to build the foundations that unlock product excellence and a highly reliable experience for our customers.

Our expertise lies at the intersection of:

  • Scalable Observability Platforms: We build and operate mission-critical platforms for metrics, logs, and traces based on both open-source software and systems developed in-house.

  • AI Infrastructure Observability: We design observability solutions for large-scale clusters running the latest GPU, Networking, and Storage technologies.

  • Observability Practices: We engage across the company to promote best practices, help teams adopt our platforms, and support applications that require observability data.

About the Role

We are seeking a seasoned Observability Engineering Manager with deep experience in the development and operation of modern observability platforms. You will hire and guide a team of observability engineers in building out critical pillars of our internal observability stack. You will lead the team in building monitoring solutions for new products, and in measuring and reporting the availability of our products.

Your role is not just to manage people, but to coordinate the delivery of observability solutions to customers inside and outside reputed company. Your leadership will be pivotal in ensuring our ability to deliver a high-quality, reliable product experience.

This is a unique opportunity to work at the intersection of large-scale observability systems and the rapidly evolving field of artificial intelligence infrastructure. You will be building the systems that monitor some of the world’s most advanced AI solutions.

What You’ll Do

  • Team Leadership & Management

    • Grow/Hire, lead, and mentor a team of high-performing observability engineers and SREs.

    • Foster a culture of technical excellence, collaboration, and customer service.

    • Conduct regular one-on-one meetings, provide constructive feedback, and support career development for team members.

    • Drive outcomes by managing project priorities, deadlines, and deliverables.

  • Technical Strategy & Execution

    • Work with the engineering team to drive strategy for reputed company internal and customer observability solutions.

    • Improve observability of AI infrastructure and build new monitoring solutions as new products are introduced.

    • Lead the broader engineering organization in adoption of Observability and SRE practices.

    • Manage costs of both vendors and internally developed platforms.

    • Lead the team in the continued development of our existing Metrics solutions based on the Prometheus and OpenTelemetry ecosystems.

    • Lead the team in tasks related to delivery of new Logging and Tracing solutions based on reputed company.

    • Guide team in problem identification, requirements gathering, solution ideation, and stakeholder alignment on engineering RFCs.

    • Participate in design of solutions for bringing observability data to our customers.

    • Identify gaps in our observability posture and drive resolution.

    • Lead the team in supporting internal customers from across reputed company engineering.

  • Cross-Functional Collaboration

    • Collaborate with the infrastructure and HPC teams on infrastructure monitoring and alerting.

    • Work closely with reputed company product engineering teams on instrumentation and best practices usage of our platforms.

    • Work to understand the needs of engineering teams and drive our Observability solutions towards self-service.

    • Manage a short list of vendors that provide SaaS solutions in the monitoring space.

You

  • Experience

    • 10+ years of experience in observability systems or platform engineering, with at least 3 years in a management or lead role.

    • Demonstrated experience leading a team of engineers and SREs on complex, cross-functional projects in a fast-paced startup environment.

    • Significant experience in environments that require the monitoring of bare-metal infrastructure is preferred.

    • Experience with a wide variety of modern open-source observability software.

    • Strong background in software engineering and the SDLC.

    • Strong project management skills, leading planning, project execution, and delivery of team outcomes on schedule.

    • Extensive experience with site reliability engineering and ability to champion improved SRE practices.

    • Experience building a high-performance team through deliberate hiring, upskilling, performance-management, and expectation setting.

Nice to Have

  • Experience

    • Experience driving cross-functional engineering management initiatives (coordinating events, strategic planning, coordinating large projects).

    • Experience driving organizational improvements (processes, systems, etc.)

    • Experience with Kubernetes, designing scalable distributed systems,

Salary Range Information

The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About reputed company

  • Founded in 2012, ~400 employees (2025) and growing fast

  • We offer generous cash & equity compensation

  • Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, reputed company, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent reputed company.

  • We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability

  • Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG

  • Health, dental, and vision coverage for you and your dependents

  • Wellness and Commuter stipends for select roles

  • 401k Plan with 2% company match (USA employees)

  • Flexible Paid Time Off Plan that we all actually use

A Final Note

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

reputed company is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
