Senior Software Engineer - Model Training & AI Evals

100% remote | Flexible hours | Hiring now

Job Description

ABOUT THE ROLE

We are looking for a Senior Engineer to join our AI team at the intersection of evaluation science, post-training, and foundation model development. You will own our end-to-end eval and benchmarking infrastructure, the critical feedback loop that drives every major model improvement, while contributing hands-on to post-training pipelines for industry-specific vertical foundation models.

This role is ideal for someone who has worked directly inside an LLM lab and understands what rigorous evaluation looks like at scale: designing the taxonomy of skills being tested, identifying failure modes, engineering synthetic data to close capability gaps, and translating eval signals into actionable training decisions.

WHAT YOU'LL DO

Evaluation & Benchmarking

Design and own task-level evaluation frameworks for LLM agents and models, covering multi-step reasoning, tool/API use, instruction following, and domain knowledge, grounded in real user failure modes rather than off-the-shelf benchmark suites.
Build comparative benchmarking pipelines to assess leading frontier models (GPT-4o, Claude, Llama, Mistral, etc.) against each other and against internal models, with structured analysis of where each model family fails, regresses, or excels across subjects, topics, and task types.
Produce capability gap reports that quantify performance deltas across dimensions such as subject-matter accuracy, reasoning depth, factual consistency, and refusal behaviour.
Track model version regressions across provider releases to maintain living competitive intelligence.
Build domain-specific benchmarks tailored to vertical use cases (e.g., STEM tutoring, legal, finance, healthcare), including problem taxonomy design, rubric definition, and inter-annotator agreement pipelines.
Define and drive synthetic data generation strategies to systematically address model shortcomings in specific subjects, topics, and areas:
  Identify low-performance clusters from eval results and translate them into targeted data generation prompts and pipelines.
  Design LLM-assisted pipelines for generating high-quality, diverse synthetic training and evaluation data at scale.
  Validate synthetic data quality through auto-eval, human review, and model performance lift experiments.
Build automated regression suites integrated into CI/CD workflows to detect capability degradation across fine-tuning runs and model updates.
Partner with product, curriculum, and research teams to translate eval insights into prioritized post-training and data flywheel decisions.

Post-Training & Fine-Tuning

Lead or directly contribute to SFT, RLHF, RLAIF, and DPO training runs on industry-specific vertical foundation models, from dataset design through training execution and eval-gated release.
Curate and engineer high-quality instruction-tuning and preference datasets for domain adaptation, with hands-on experience distinguishing signal from noise in annotation pipelines.
Define data quality criteria, rejection sampling strategies, and deduplication pipelines for SFT corpora.
Design preference pair construction methodologies and reward model training setups grounded in domain-specific quality rubrics.
Implement and experiment with alignment techniques including reward modelling, process reward models (PRMs), and constitutional/RLAIF approaches.
Run ablation studies and controlled experiments to attribute model behaviour changes to specific data or training interventions, not just report final numbers.
Contribute to continual pre-training and domain-adaptive fine-tuning pipelines for vertical models, including domain data sourcing, mixing strategies, and curriculum design.
Infrastructure & Tooling

Build scalable eval pipelines that run automatically on every training checkpoint and integrate into CI/CD for continuous model quality tracking.
Maintain model cards, eval leaderboards, and internal dashboards providing visibility across experiments for both technical and non-technical stakeholders.
Ensure reproducibility through rigorous experiment tracking (W&B, MLflow, or equivalent), versioned datasets, and documented training configs.

WHO YOU ARE

Required

5+ years of ML/AI engineering experience, with at least 2–3 years focused on large language models.
Lab pedigree: direct, hands-on experience at an LLM lab, AI research organization, or equivalent frontier AI team; you have shipped models, not just called APIs.
Familiarity with the full model lifecycle: pre-training data, post-training alignment, eval, and production deployment.
Deep practical expertise in post-training methods (SFT, RLHF, RLAIF, DPO, PPO), from dataset construction through training and eval-gated release.
Experience with reward modeling, preference data curation, and quality control for alignment pipelines.
Demonstrated experience designing LLM evaluation frameworks beyond standard benchmarks, including task-level evals for agentic or multi-step workflows.
Hands-on experience building synthetic data generation pipelines to address specific model capability gaps:
  Designing targeted generation prompts based on eval failure analysis.
  Validating synthetic data quality through model performance experiments.
Proven track record of comparative benchmarking across leading foundation models, with structured analysis of capability shortcomings by subject, topic, or task type.
Experience training or fine-tuning vertical/industry-specific foundation models: domain data curation, continual pre-training, or domain-adaptive SFT.
Strong software engineering fundamentals: Python, PyTorch or JAX, distributed training.

Preferred

Publications or applied research contributions in LLM evaluation, alignment, or post-training.
Experience with multi-modal models or agents with external tool/API use.
Exposure to red-teaming, adversarial evaluation, or safety benchmarking.
Model distillation, speculative decoding, or inference optimization experience.
Prior experience in an education, STEM, legal, biomedical, or enterprise software vertical.

WHAT SUCCESS LOOKS LIKE

30 Days: Fully onboarded into training infra and eval repos. Running existing benchmarks end-to-end and producing a written gap analysis identifying missing coverage.
60 Days: Shipped at least one new domain-specific benchmark and one synthetic data generation pipeline addressing a model gap. CI-integrated eval running on every checkpoint.
3 Months: Standardize model evaluation for foundation models. Own golden dataset strategies for fine-tuning with measurable subject-accuracy gains.
6 Months: Recognized internally as the authority on model quality and competitive benchmarking. Eval insights are directly driving roadmap prioritization.

Why do we exist?

Students are working harder than ever before to stabilize their future. Our recent research study called State of the Student shows that nearly 3 out of 4 students are working to support themselves through college and 1 in 3 students feel pressure to spend more than they can afford. We founded our business on providing affordable textbook rental options to address these issues. Since then, we’ve expanded our offerings to supplement many facets of higher educational learning through Chegg Study, Chegg Math, Chegg Writing, Chegg Internships, Chegg Skills, and more to support students beyond their college experience.
These offerings lower financial concerns for students by modernizing their learning experience. We exist so students everywhere have a smarter, faster, more affordable way to student.

Video Shorts

Life at Chegg: http://youtu.be/Fwf90zgaOLA
Chegg Corporate Career Page: https://jobs.chegg.com/
Chegg India: http://www.cheggindia.com/
Chegg Israel: http://www.chegg.com/about/working-at-chegg/israel/
Chegg Skills: https://www.chegg.com/skills
Chegg out our culture and benefits! http://www.chegg.com/about/working-at-chegg/benefits/

Chegg is an equal opportunity employer.
