Infrastructure Engineer – Platform

Company

Orcrist is building a data intelligence platform using cutting-edge technologies. We handle petabyte-scale data with sub-second queries. Our product is a Kubernetes-based platform delivered as B2B SaaS or as a self-hosted on-prem solution, including air-gapped deployments. We help customers across defense, law enforcement, and enterprise turn mission-critical data into actionable intelligence. Our Platform team owns the infrastructure that powers every deployment, from the metal up.

Role

Kubernetes runs on something, and that something is yours. You'll own the layer beneath our platform: bare-metal GPU servers, operating systems, networking, and storage across on-prem and fully air-gapped sites. You'll design, build, and operate GPU server fleets and the NVIDIA software stack, then partner with our SRE and ML teams to deliver fast, reliable on-prem inference. Some of this work is hands-on at customer sites, where you size, rack, and commission self-contained server environments that run with no internet uplink.

What you'll do

- Design, size, provision, and operate bare-metal GPU server fleets across on-prem and air-gapped environments (firmware/BIOS, BMC via Redfish/IPMI, OS, drivers) with zero-touch provisioning (PXE/iPXE, MAAS/Metal3/Tinkerbell) and automation (Ansible/Salt, Terraform).
- Own the NVIDIA GPU stack end to end: drivers, CUDA, GPU Operator, Container Toolkit, MIG, and DCGM, tuned for inference throughput, latency, and utilization.
- Build the bare-metal substrate Kubernetes runs on: node lifecycle, container runtime, GPU device plugins, node feature discovery, and kernel tuning.
- Engineer data-center networking and resilient storage (VLANs/switching, RDMA, Ceph/ZFS/NVMe) with encryption at rest, sized to scale without replacing the core.
- Partner with ML and MLOps on on-prem inference serving (Triton, KServe, vLLM): model deployment, GPU scheduling and sharing, and performance tuning.
- Plan and run on-site build-outs: rack integration, power and cooling sizing, commissioning, planning, runbooks, and operator handover.

About You

- 5+ years in bare-metal, HPC/GPU, data-center, or systems infrastructure engineering, with hands-on ownership of physical and compute infrastructure.
- Strong bare-metal Linux skills (RHEL/Rocky/Ubuntu): firmware, BMC, PXE, kernel and storage tuning, plus solid networking and storage fundamentals.
- Real experience with the NVIDIA GPU stack (drivers, CUDA, GPU Operator, MIG, DCGM) and with serving GPU models in production.
- Comfortable operating in air-gapped or on-prem environments and traveling to customer sites for builds and deployments.
- Documentation-focused, methodical, and calm during hardware incidents.
- Eligible to work in Germany.

Nice-to-haves

- German language (B1+), NVIDIA DGX/HGX or Slurm experience, InfiniBand/RDMA fabrics, and inference optimization (TensorRT-LLM, vLLM, quantization).
- Certifications such as NVIDIA NCP-AIO, Red Hat RHCSA/RHCE, or CKA/CKS.
- Field-engineering experience and familiarity with secure or regulated deployment environments.

We Offer

- Modern architecture and stack.
- Remote-first in Germany with occasional team events in Berlin.
- Home office budget and great equipment.
- 30 days vacation.
- Direct impact on critical missions across private and public-sector customers.