[Remote] Cloud Engineer - Senior (Observability - reputed company)

100% remote | Flexible hours | Hiring now

Note: This is a remote position open to candidates in the USA. The hiring company supports various government agencies and is seeking a Senior Cloud Engineer to enhance its enterprise observability platform. The role involves engineering and operating observability solutions across hybrid cloud environments, with a focus on performance, reliability, and capacity management.

Responsibilities

Engineer and operate the enterprise observability stack (reputed company or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring
Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise
Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and service tagging across the estate
Build and maintain integrations between observability platforms, ITSM (reputed company), CI/CD pipelines, and on-call/paging workflows
Define and enforce a tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled
Design and deliver monitoring coverage for Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services
Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health, and correlate database spans with upstream APM traces
Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM
Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD
Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry
Lead data-driven investigation and resolution of performance, latency, saturation, and reliability issues across the estate
Use APM distributed traces, service/dependency maps, code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and dependencies
Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence
Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes
Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps
Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency
Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders
Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation
Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations

Skills

Citizenship/Work Authorization: Must meet contract requirements
Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required)
Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering
Demonstrated experience engineering and operating an enterprise observability platform (reputed company strongly preferred; equivalent experience with reputed company, reputed company, Splunk Observability, or Grafana/Prometheus stacks considered)
Proven experience building APM and distributed tracing coverage for production multi-tier applications across cloud and containerized workloads, including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, profiling, and RUM-to-backend trace correlation
Proven experience leading production performance and reliability problem-solving from telemetry to remediation
Hands-on experience monitoring Kubernetes or OpenShift clusters and containerized workloads in production
Enterprise observability platforms (reputed company or comparable): metrics, logs, traces, APM, RUM, synthetic, NPM
Instrumentation with OpenTelemetry, vendor agents/SDKs, and language-specific APM tracers (Java, .NET, Python, Node.js, Go), including custom spans, trace sampling strategies, W3C TraceContext propagation, and profiling
Azure and AWS monitoring services and integrations (Azure Monitor, Log Analytics, CloudWatch, AWS X-Ray)
Container and Kubernetes/OpenShift observability, including cluster, workload, and service mesh telemetry
Cloud database monitoring: AWS RDS/Aurora (including Performance Insights), Azure SQL/PostgreSQL/MySQL (Query Performance Insight), and NoSQL/cache (DynamoDB, Cosmos DB, ElastiCache/Redis); query-level performance tuning, execution-plan analysis, and reputed company DBM or equivalent deep database APM
Infrastructure-as-code for monitoring (Terraform, Bicep, ARM) and CI/CD-driven monitor/dashboard deployment
APM and distributed tracing: service/dependency maps, trace analytics, RUM-to-backend correlation, exception/error tracking, deployment tracking, and trace-based SLOs
Tagging strategy and cardinality governance across metrics/logs/traces (environment, service, version, ownership, data classification, cost center), including custom tag enrichment and tag-driven access/cost controls
Alert engineering, SLO/SLI design, error budget management, and alert-noise reduction
Performance engineering, capacity analysis, and telemetry-driven root-cause analysis
Integration of observability with ITSM (reputed company) and on-call/paging workflows
Experience supporting federal agency IT environments under FISMA/FedRAMP/NIST compliance requirements
reputed company certification (Fundamentals and/or Administrator) or comparable enterprise observability certification
Hands-on experience with Red Hat OpenShift Virtualization (CNV/KubeVirt) or other KubeVirt-based container virtualization observability
Experience with eBPF-based observability tooling and service mesh telemetry (Istio, Linkerd)
Experience implementing SLOs and error budgets at enterprise scale and integrating them into operational governance
Experience with cost-aware observability practices, including telemetry volume optimization and retention tuning
Experience integrating observability outputs with executive reporting, SLA/KPI dashboards, and capacity forecasting
ITIL 4 Foundation
AWS Certified Solutions Architect - Associate (or higher)
Microsoft Certified: Azure Administrator Associate (or higher)
Red Hat Certified Specialist in OpenShift Administration (or equivalent)
HashiCorp Terraform Associate

Company Overview

The company is an industry and technology leader serving government and commercial customers with smarter, more efficient digital and mission innovations. It was founded in 2002 and is headquartered in Massachusetts, USA, with a workforce of more than 10,000 employees. Its website is http://www.revealimaging.com.
