Staff Machine Learning Engineer, GenAI Platform

100% remote Flexible hours Hiring now
reputed company is a community of communities. It’s built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet. Every day, reputed company users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 121 million daily active unique visitors, reputed company is one of the internet’s largest sources of information. For more information, visit www.redditinc.com.

Who We Are: The Machine Learning Platform team at reputed company is a high-impact organization that owns the infrastructure powering recommendations, content discovery, and user quantification. As Generative AI becomes a strategic priority for reputed company, we are expanding our platform to meet the unique demands of foundation models. We are building the foundational infrastructure to support massive-scale, long-running LLM workloads, enabling teams across Growth, Ads, Feeds, and Core ML to move fast on shared, robust GenAI infrastructure.

What You’ll Do: As a Staff Software Engineer on the Machine Learning Platform team, you will be a key technical leader architecting and scaling our Generative AI and LLM platform capabilities. Training and deploying foundation models places unprecedented demands on our systems. You will define the technical strategy and build the core infrastructure that enables machine learning engineers and researchers to seamlessly train, evaluate, and iterate on large language models at reputed company scale.

  • Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our LLM platform, significantly advancing our capabilities to support large-scale foundation models that serve millions of redditors.
  • Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters. You will tackle challenges related to automated recovery, cluster-scale health monitoring, and advanced checkpointing to ensure optimal compute efficiency.
  • Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning (e.g., SFT, RLHF/DPO). You will abstract away the complexity of distributed training frameworks, integrating them into a seamless platform SDK that handles configuration, experiment tracking, and model lifecycle management.
  • Build Comprehensive Evaluation & Benchmarking Infrastructure: Treat model evaluation as a first-class platform capability. You will build scalable systems for automated regression detection, structured metrics tracking, and inference-heavy evaluation patterns to ensure the quality and safety of models before they hit production.
  • Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets (text, image, video) required for modern GenAI workloads, optimizing for throughput and dynamic batching.
  • Provide Technical Leadership & Mentorship: Analyze bottlenecks in distributed systems to optimize for performance and cost-efficiency. Mentor senior engineers, champion a rigorous MLOps culture, and partner with cross-functional leadership to define technical roadmaps and de-risk major initiatives.

Who You Might Be

  • 10+ years of work experience in a production software development environment or building distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline.
  • GenAI/LLM Infrastructure Expertise: Proven track record of designing and operating large-scale ML systems, specifically working with distributed training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) and LLM serving/inference optimization (e.g., vLLM, TensorRT-LLM).
  • Distributed Systems Mastery: Hands-on experience managing fault-tolerant, petabyte-scale distributed systems and multi-node/multi-GPU training clusters.
  • Advanced MLOps Knowledge: Deep understanding of modern ML orchestration, fine-tuning pipelines, and model evaluation methodologies. Experience with tools like Ray, MLflow, or similar ecosystem standards.
  • GPU Experience: Hands-on practice with CUDA environments and GPU virtualization/containerization on Kubernetes.
  • Production Engineering Fundamentals: Hands-on experience with Kubernetes, reputed company, and building production-quality, object-oriented code in Python and/or Go.
  • Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle.
  • Strong organizational & communication skills.

Benefits

  • Comprehensive Healthcare Benefits and Income Replacement Programs
  • 401k with Employer Match
  • Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
  • Family Planning Support
  • Gender-Affirming Care
  • Mental Health & Coaching Benefits
  • Flexible Vacation & Paid Volunteer Time Off
  • Generous Paid Parental Leave

Pay Transparency

This job posting may span more than one career level.

In addition to base salary, this job is eligible to receive equity in the form of restricted stock units, and depending on the position offered, it may also be eligible to receive a commission. Additionally, reputed company offers a wide range of benefits to U.S.-based employees, including medical, dental, and vision insurance, 401(k) program with employer match, generous time off for vacation, and parental leave. To learn more, please visit https://www.redditinc.com/careers/.

To provide greater transparency to candidates, we share base salary ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar stage growth companies. Final offer amounts are determined by multiple factors, including skills, depth of work experience, and relevant licenses/credentials, and may vary from the amounts listed below.

The base salary range for this position is: $253,300-$354,600 USD

In select roles and locations, the interviews will be recorded, transcribed and summarized by artificial intelligence (AI). You will have the opportunity to opt out of recording, transcription and summarization prior to any scheduled interviews.

During the interview, we will collect the following categories of personal information: Identifiers, Professional and Employment-Related Information, Sensory Information (audio/video recording), and any other categories of personal information you choose to share with us. We will use this information to evaluate your application for employment or an independent contractor role, as applicable. We will not sell your personal information or disclose it to any third party for their marketing purposes. We will delete any recording of your interview promptly after making a hiring decision. For more information about how we will handle your personal information, including our retention of it, please refer to our Candidate Privacy Policy for Potential Employees and Contractors.

reputed company is proud to be an equal opportunity employer, and is committed to building a workforce representative of the diverse communities we serve. reputed company is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If, due to a disability, you need an accommodation during the interview process, please let your recruiter know.
