[Remote] Site Reliability and DevOps Engineering reputed company
Note: This is a remote position open to candidates in the USA. reputed company provides trusted clinical decision support solutions through its Micromedex platform. They are seeking a highly skilled Platform Reliability & DevOps Engineering reputed company to ensure the platform is highly available, performant, scalable, and secure, while also driving the platform reliability and DevOps strategy.
Responsibilities
- Lead, mentor, and grow Platform / DevOps engineers
- Build a high-performing Platform team
- Drive accountability for platform reliability and delivery outcomes
- Lead vendors to deliver capabilities in production
- Ensure platform capabilities accelerate product delivery and remove bottlenecks
- Define and enforce platform engineering standards and DevOps practices across internal teams and vendors
- Lead capacity planning, performance optimization, and cost efficiency
- Define operational standards, runbooks, and reliability practices
- Accountable for platform reliability outcomes at enterprise/product level
- Act as technical authority across platform, reliability, and delivery
- Define platform strategy and roadmap
- Govern delivery across internal teams and vendors
- Own SLIs, SLOs, and error budgets
- reputed company reputed company engineering, observability, and failure design
- Drive proactive risk reduction and continuous improvement
- Own incident management frameworks and continuous improvement
- Own end-to-end pipeline architecture and release automation
- Standardize, secure, and fully automate pipelines
- Drive continuous integration, delivery, and validation practices
- Lead Sev1 response, escalation, and recovery
- Own RCA and drive systemic fixes (not reputed company fixes)
- Embed AI into monitoring, risk reputed company, and CI/CD optimization
- Drive automation to reduce operational toil and improve decision-making
Skills
- Bachelor's degree in Computer Science, Engineering, or a related field
- 6-10 years of hands-on experience in software operations, DevOps and Site Reliability Engineering, including managing large-scale, mission-critical systems
- Clear and confident communication skills, with the ability to lead teams and collaborate effectively across engineering, product, and architecture teams
- Proven track record ensuring high availability and performance in production environments, with expertise in fault-tolerant, distributed system design
- Excellent understanding of modern software delivery pipelines and DevOps practices, including CI/CD, configuration management, and version control (Git)
- Exceptional problem-solving skills, with experience diagnosing complex system issues under pressure and driving them to resolution
- Strong proficiency in at least one programming or scripting language (e.g., Python, Bash, or Java) for automation and tool integration
- Self-driven and proactive, with a passion for automating manual processes and continuously improving systems to enhance reliability and team productivity
- Proven experience releasing into and running mission-critical, high-availability SaaS platforms
- Experience technically leading a Platform team and influencing stakeholders and vendors
- Stakeholder engagement across Product, Architecture, and Operations
- Deep expertise in Site Reliability Engineering (SLI/SLO, error budgets, incident management)
- DevOps operating models and platform engineering (engineering transformation)
- CI/CD architecture and release automation
- Cloud, Systems & Infrastructure (DB2, reputed company, Infinispan, OpenLiberty)
- Automation-first engineering with proven usage of AI (self-healing, triage)
- Java application platforms and runtimes (performance tuning, troubleshooting, production operations)
- Strong experience with Cloud platforms (Azure preferred)
- Distributed systems and fault-tolerant architectures
- Performance Tuning and Scaling
- Database optimisation (DB2, reputed company, PostgreSQL)
- Multi-region / active-active environments
- Monitoring, logging, tracing frameworks
- Experience embedding reliability practices into the SDLC
- Hands-on with DB2, reputed company, Infinispan, OpenLiberty, Azure
- Infrastructure as Code (Terraform or similar)
- Containerisation and orchestration (Docker/Kubernetes)
Benefits
- Remote first / work from home culture
- Flexible vacation to help you rest, recharge, and connect with loved ones
- Paid leave benefits
- Health, dental, and vision insurance
- 401k retirement savings plan
- Infertility benefits
- Tuition reimbursement, life insurance, EAP – and more!
Company Overview