HPC Network Engineering Manager - AI Infrastructure (Argentina)

May 24 | EPAM Systems | Argentina

We are seeking an HPC Network Engineering Manager - AI Infrastructure to guide architecture and technical direction for AI research and Kubernetes-based GPU infrastructure. You will steer standards for InfiniBand/RDMA, Ethernet, Kubernetes networking, SmartNIC/DPU, and observability across large programs while mentoring senior engineers. Join us to shape reliable, scalable network platforms for massive distributed AI workloads. Apply now.

Responsibilities

Define and own a multi-year architectural vision and roadmap for InfiniBand/RDMA and high-speed Ethernet fabrics supporting massive GPU clusters and distributed AI/LLM workloads across the client portfolio

Govern evaluation and standardization of cluster network topologies such as Fat-tree, Clos, Rail-optimized, and Dragonfly, and set decision frameworks aligned to scale, performance, and cost constraints

Establish and enforce engineering standards for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths

Drive strategic performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training, and oversee resolution of the hardest systemic performance issues

Define the reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and lead adoption across programs

Own strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align rollout with the broader infrastructure roadmap

Define enterprise network observability strategy, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methods

Provide technical leadership and mentorship to lead and principal engineers across networking, Kubernetes, storage, GPU infrastructure, observability, and AI research teams to drive cross-functional alignment

Represent the principal technical authority in executive stakeholder forums by shaping direction, negotiating program trade-offs, and ensuring delivery of reliable, scalable network platforms across engagements

Contribute to the engineering community through thought leadership, internal practice building, and representation at industry events

Requirements

9+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 5+ years focused on HPC, AI/ML, or GPU cluster networking, including 3+ years of demonstrated technical leadership at the program or portfolio level

Proven track record defining architecture and governing delivery for InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-sensitive distributed compute environments

Authoritative expertise in host-side networking (NICs, drivers, firmware) plus PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with demonstrated ability to set enterprise standards and uplift engineering practices

Deep understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, with ability to drive workload-network co-design at scale

Authoritative knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration, with experience defining reference architectures

Expert-level mastery of RDMA networking, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at very large scale

Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics, with ability to define repeatable diagnostic methodologies for broader teams

Demonstrated ownership of network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers

Outstanding leadership, mentoring, stakeholder management, and executive communication skills, with proven experience leading multiple engineering teams, influencing C-level client architecture decisions, and driving alignment across research and platform stakeholders

English language proficiency at an Advanced level (C1)

Nice to have

Hands-on architectural and strategic experience with Azure Networking, Ethernet, and GPGPU/GPU technologies

Authoritative command of Grafana and Prometheus, plus Network Administration experience defining observability standards across an engineering organization

Proven ability to set strategy, govern, and scale Infrastructure as Code practices across multiple teams and programs

Proficiency in Python and UNIX shell scripting for automation, tooling, and improving engineering productivity

Track record of thought leadership through conference talks, publications, patents, or open-source contributions in the HPC/AI networking domain

We offer

International projects with top brands

Work with global teams of highly skilled, diverse peers

Healthcare benefits

Employee financial programs

Paid time off and sick leave

Upskilling, reskilling and certification courses

Unlimited access to the LinkedIn Learning library and 22,000+ courses

Global career opportunities

Volunteer and community involvement opportunities

EPAM Employee Groups

Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
