About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you'll be contributing to building the technology that powers the future.

About The Role

Within Nscale, the Network Operations team is responsible for the performance and reliability of the high-speed interconnect fabrics that underpin our AI and HPC platforms. These networks are critical to distributed training and inference workloads and demand a deep operational focus.

We're looking for a Senior Network Engineer – AI Infrastructure to join our Network Operations team.

In this role, you will be responsible for the day-to-day health, stability, and performance of Nscale's large-scale InfiniBand and RDMA over Converged Ethernet (RoCE) fabrics. You'll bring deep operational expertise from high-performance or hyperscale environments and play a key role in incident response, performance tuning, and continuous improvement of latency-sensitive AI networking systems.

What You'll Be Doing

Owning the operational health, configuration consistency, and performance tuning of large-scale InfiniBand and RoCE fabrics supporting AI and HPC workloads

Leading the diagnosis and resolution of complex network incidents (P0/P1), spanning firmware, kernel drivers, switch hardware, and application or middleware layers
Driving blameless postmortems and implementing preventative fixes to improve long-term fabric stability and availability
Partnering with SREs to define requirements for automation and tooling, and contributing where appropriate to network provisioning, validation, and monitoring systems
Collaborating with Network Architecture and Engineering teams to validate fabric designs and enforce standards for routing, congestion control, and firmware baselines
Monitoring fabric utilisation and performance, identifying bottlenecks, and tuning for congestion, microbursts, and predictable latency
Acting as a subject matter expert for cross-functional teams on high-speed networking, RDMA behaviour, and fabric-level performance characteristics

Participating in an on-call rotation supporting mission-critical, customer-facing infrastructure

About You

5+ years of experience in network engineering, with at least 3 years operating HPC or large-scale AI interconnect networks

Deep, hands-on operational experience with InfiniBand and/or modern RoCE deployments
Expert understanding of RDMA concepts, protocols, and troubleshooting techniques
Strong fundamentals in data centre networking, including TCP/IP, BGP, OSPF, and leaf-spine architectures
Proven ability to troubleshoot complex network issues using Linux-based tooling and fabric diagnostics
Proficiency in Python, Go, or shell scripting for automation, data analysis, or configuration management

Experience working in a 24/7 operational environment with a strong focus on reliability and toil reduction

Nice to Have

Experience operating NVIDIA/Mellanox Spectrum switches and ConnectX NICs at scale

Familiarity with AI or ML training workflows and the impact of network performance on distributed frameworks
Experience with network observability and telemetry systems such as streaming telemetry, sFlow, Prometheus, or Grafana

Knowledge of GPU communication libraries such as NVIDIA NCCL

What We Can Offer You

At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.

Highly competitive package (base + equity) with reviews every 12 months.
Join the fastest-growing tech startup: this is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI. ✨
Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
Human-First Flexibility: We treat our people as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.

Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.

Equal Opportunities Statement

We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.

If there's anything we can do to accommodate your specific situation, please let us know.

The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.

For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.


Senior Back-End Network Engineer - AI Infrastructure Operations
