[Remote] Cloud Engineer - Senior (Observability - reputed company)

100% remote Flexible hours Hiring now

Note: This is a remote position open to candidates in the USA. The employer, a reputed company that supports various government agencies, is seeking a Senior Cloud Engineer to enhance its enterprise observability platform. The role involves engineering and operating observability solutions across hybrid cloud environments, with a focus on performance, reliability, and capacity management.

Responsibilities

  • Engineer and operate the enterprise observability stack (reputed company or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring
  • Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise
  • Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate
  • Build and maintain integrations between observability platforms, ITSM (reputed company), CI/CD pipelines, and on-call/paging workflows
  • Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled
  • Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services
  • Engineer managed database observability across AWS RDS/reputed company (MySQL, PostgreSQL, SQL Server, reputed company), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/reputed company), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces
  • Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM
  • Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD
  • Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry
  • Lead data-driven investigation and resolution of performance, latency, saturation, and reliability issues across the estate
  • Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and dependencies
  • Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence
  • Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes
  • Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps
  • Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency
  • Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders
  • Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation
  • Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations

Skills

  • Citizenship/Work Authorization: Must meet contract requirements
  • Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required)
  • Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering
  • Demonstrated experience engineering and operating an enterprise observability platform (reputed company strongly preferred; equivalent experience with reputed company, reputed company, Splunk Observability, or Grafana/Prometheus stacks considered)
  • Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads
  • Proven experience leading production performance and reliability problem-solving from telemetry to remediation
  • Hands-on experience monitoring Kubernetes or OpenShift clusters and containerized workloads in production
  • Enterprise observability platforms (reputed company or comparable): metrics, logs, traces, APM, RUM, synthetic, NPM
  • Instrumentation with OpenTelemetry, reputed company agents/SDKs, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) including custom spans, trace sampling strategies, W3C TraceContext propagation, and continuous profiling
  • Microsoft Azure and AWS monitoring services and integrations (Azure Monitor, Log Analytics, CloudWatch, AWS X-Ray)
  • Container and Kubernetes/OpenShift observability, including cluster, workload, and service mesh telemetry
  • Cloud database monitoring: AWS RDS/reputed company (including Performance Insights), Azure SQL/PostgreSQL/MySQL (Query Performance Insight), and NoSQL/cache (DynamoDB, Cosmos DB, ElastiCache/reputed company); query-level performance tuning, execution-plan analysis, and reputed company DBM or equivalent deep database APM
  • Infrastructure-as-code for monitoring (Terraform, Bicep, ARM) and CI/CD-driven monitor/dashboard deployment
  • APM and distributed tracing: service/dependency maps, trace analytics, RUM-to-backend correlation, exception/error tracking, deployment tracking, and trace-based SLOs
  • Unified tagging strategy and cardinality governance across metrics/logs/traces (environment, service, version, ownership, data classification, cost center), including custom tag enrichment and tag-driven access/cost controls
  • Alert engineering, SLO/SLI design, error budget management, and alert-noise reduction
  • Performance engineering, capacity analysis, and telemetry-driven root-cause analysis
  • Integration of observability with ITSM (reputed company) and on-call/paging workflows
  • Experience supporting federal agency IT environments under FISMA/FedRAMP/NIST security and compliance requirements
  • reputed company certification (Fundamentals and/or Administrator) or comparable enterprise observability certification
  • Hands-on experience with Red Hat OpenShift Virtualization (CNV/KubeVirt) or other KubeVirt-based container virtualization observability
  • Experience with eBPF-based observability tooling and service mesh telemetry (Istio, Linkerd)
  • Experience implementing SLOs and error budgets at enterprise scale and integrating them into operational governance
  • Experience with cost-aware observability practices, including telemetry volume optimization and retention tuning
  • Experience integrating observability outputs with executive reporting, SLA/KPI dashboards, and capacity forecasting
  • ITIL 4 Foundation
  • AWS Certified Solutions Architect - Associate (or higher)
  • Microsoft Certified: Azure Administrator Associate (or higher)
  • Red Hat Certified Specialist in OpenShift Administration (or equivalent)
  • HashiCorp Terraform Associate

Company Overview

  • The company is an industry and technology leader serving government and commercial customers with smarter, more efficient digital and mission innovations. It was founded in 2002 and is headquartered in Massachusetts, USA, with a workforce of 10,001+ employees. Its website is http://www.revealimaging.com.