[Remote] Senior Platform & Reliability Engineer (SRE)
Note: This is a remote position open to candidates in the USA. Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. They are seeking a Senior Platform & Reliability Engineer to own service reliability and keep the platform fast and resilient as it scales.
Responsibilities
- Set and enforce SLIs/SLOs/error budgets for critical user flows
- Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access
- Define probes, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety
- Own poison pill containment and workload isolation
- Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution)
- Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline
- Block risky deploys and enforce reliability guardrails when production health is at risk
Skills
- Experience with Kubernetes-based production infrastructure
- Proficiency in setting and enforcing SLIs/SLOs/error budgets
- Ability to drive failure isolation across API, workers, queues, and dependencies
- Experience defining probes, rollout/rollback standards, and graceful shutdown behavior
- Knowledge of queue and job safety, specifically with BullMQ
- Experience leading incident response for Sev1/Sev2 incidents
- Strong skills in observability, on-call effectiveness, and postmortem discipline
- Ability to block risky deploys and enforce reliability guardrails
- Calm and structured incident commander under pressure
- Ability to think in failure modes and blast radius
- Pragmatic approach to stabilizing systems quickly and implementing durable fixes
- High ownership and strong written communication skills