[Remote] reputed company
Note: This is a remote job open to candidates in the USA. reputed company is a company focused on building technology to help families manage their routines and navigate transitions. They are seeking an engineer to maintain and optimize their AI infrastructure, run self-hosted inference stacks, and build user-facing features that help families coordinate their daily activities.
Responsibilities
- Run and optimize our self-hosted inference stack
- Run the inference serving layer on our own GPU hardware: choose and tune the serving stack (vLLM, SGLang, TensorRT-LLM) for high throughput and low latency
- Optimize aggressively: tensor parallelism, quantization (FP8, AWQ, GPTQ), KV-cache and prefix caching, continuous batching, speculative decoding, concurrency tuning
- Serve multiple models and features off shared hardware: multi-LoRA, routing, and request scheduling that balances internal workloads against latency-sensitive product traffic
- Keep our AI workloads efficient: improve latency, throughput, and GPU utilization so we get the most out of the hardware we run
- Build the visibility: track performance and usage across our AI surfaces so there's clear data on how everything is running
- Surface the technical tradeoffs (performance, latency, efficiency) so the people making the calls have what they need to make them
- Ship the in-app agent layer that helps families coordinate: proactive nudges, smart suggestions, agents that summarize, draft, schedule, and act for busy parents
- Build the substrate underneath: tools, memory, orchestration, guardrails, and evaluation harnesses, integrated cleanly with production APIs alongside our architecture team
- Work in nimble pairs with feature owners, standing up whatever's needed to test an idea, including a vibe-coded UI when that's the fastest path to a real customer. Ship rough, learn fast, harden what works.
Skills
- 5+ years shipping production software, including meaningful applied AI or ML work
- Demonstrated experience running and optimizing self-hosted LLMs on dedicated multi-GPU hardware: a serving stack (vLLM, SGLang, or TensorRT-LLM) and the optimization that comes with it (tensor parallelism, quantization, batching, KV cache)
- A track record of optimizing inference performance and efficiency (latency, throughput, GPU utilization)
- Strong Python and engineering fundamentals, with the full-stack range to stand up a quick UI and a genuine desire to work on app-layer features, not only infra
- Hands-on with agent frameworks (Claude Agent SDK, LangGraph, or similar), LLM APIs, embeddings, and RAG
- Comfortable with AWS and the devops this role owns: reputed company, CI/CD, monitoring, and observability
- Experience building internal tooling or platforms others depend on. Bonus for reputed company apps, MCP, or agent orchestration at team scale
Benefits
- Medical: reputed company pays 100% of the premium for employees and 99% for additional family members
- 401k: Up to a 4% match with immediate vesting
- Paid leave for new parents
- Learning & Development stipend for employees
- Paid Time Off: 11 Holidays + Winter Break (3 Days) + Volunteer Time Off (1 Day) + Floating Holiday (1 Day)
- Personal Time Off: 15 days for 0-1 years of employment; 20 days for 1-3 years of employment
- Supportive and flexible working environment: work from anywhere!
Company Overview