[Remote] Principal Network Architect - AI Infrastructure
Note: This is a remote position open to candidates in the USA. The company is a GPU cloud provider built for AI, offering high-performance infrastructure to AI startups and enterprises. They are seeking a Principal Network Architect to lead the development and operational excellence of their global AI networking infrastructure, focusing on RDMA and InfiniBand technologies to enhance the company's outcomes.
Responsibilities
- Own the technical direction and operational lifecycle management of the company's high-performance RDMA network fabrics
- Define long-term architecture, reliability strategy, and operational standards for AI interconnect networks
- Lead availability and performance improvement initiatives across globally distributed GPU clusters
- Act as a technical authority (SME) across networking, influencing platform-wide decisions
- Support the design, build, and evolution of large-scale InfiniBand and RoCE fabrics
- Drive deep debugging and resolution of complex cross-layer issues (hardware, firmware, kernel, distributed workloads)
- Lead incident response and postmortems, ensuring systemic fixes and long-term improvements
- Define and enforce standards across congestion control and traffic engineering; routing (BGP, ECMP, fabric-level routing strategies); firmware lifecycle and change management; and network observability and telemetry
- Lead and scale automation frameworks for network provisioning, validation, and operations
- Build tooling to support high-reliability, low-touch network operations at scale
- Improve operational efficiency across hundreds of thousands of endpoints and high-throughput links
- Lead complex technical initiatives across Network, SRE, Compute, and Platform teams
- Serve as technical lead on critical programs, coordinating engineers and stakeholders
- Influence product and infrastructure roadmaps based on operational insights and customer needs
- Mentor senior engineers and raise the bar for technical rigor and execution
Skills
- 10+ years of experience in network engineering in hyperscale, AI, or HPC environments
- Deep expertise in RDMA, InfiniBand, and/or large-scale RoCE fabrics
- Strong understanding of RDMA internals and performance tuning
- Strong understanding of congestion control and fabric failure modes
- Strong understanding of distributed system communication patterns
- Expert-level knowledge of data center networking protocols (BGP, OSPF, ECMP)
- Proven ability to debug multi-layer issues across network, system, and application layers
- Strong programming/scripting skills for automation (Python, Go, etc.)
- Experience designing high-scale, highly available network systems
- Demonstrated ability to lead complex technical programs without direct authority
- Experience acting as a senior escalation point for critical production issues
- Strong ability to drive cross-team alignment and execution
- Systems-level thinking balancing performance, reliability, scalability, and cost
- Experience with NVIDIA / Mellanox networking platforms
- Familiarity with distributed training frameworks and GPU communication patterns
- Experience building network observability systems at scale
- Background influencing infrastructure strategy in high-growth environments
Benefits
- Highly competitive package (salary + equity) with reviews every 12 months.
- Join the fastest-growing tech startup: your chance to push boundaries, collaborate with leading minds, and make your mark on cutting-edge AI.
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or opportunity. We build seamless virtual collaboration, empowering you wherever you work.
Company Overview