About Bitdeer:
Bitdeer is a world-leading technology company for Bitcoin mining and AI cloud.
Bitdeer is committed to providing comprehensive Bitcoin mining solutions for its customers. Apart from designing industry-leading ASIC chips and manufacturing mining rigs, the Group handles complex processes involved in computing across the value chain. This includes equipment procurement, transport logistics, datacenter design and construction, equipment management, and network and facility operations. Bitdeer also offers advanced cloud capabilities to customers with a high demand for artificial intelligence.
Headquartered in Singapore, Bitdeer operates globally with a diversified 3 GW energy portfolio, and deploys Bitcoin mining and HPC datacenters in the United States, Bhutan, Norway, Canada, Malaysia, and Ethiopia.
What you will be responsible for:
- Network Architecture Design: Architect high-availability network solutions for AI Cloud Data Centers, covering DCN (Data Center Network), DCI (Data Center Interconnect), WAN, and backbone networks.
- Cluster Operations & Optimization: Orchestrate daily monitoring, troubleshooting, and performance tuning for large-scale GPU clusters based on InfiniBand or RoCEv2 fabrics.
- Deep-Dive Troubleshooting: Lead investigations into complex network issues affecting AI training and inference performance, such as RDMA packet loss, latency, link flapping, and NCCL communication timeouts.
- Fabric Management: Manage NVIDIA Quantum series switches, NVIDIA Spectrum series switches, ConnectX NICs, NetQ, and the UFM (Unified Fabric Manager) platform to ensure the overall health and stability of the network fabric.
- Change Management: Lead network architecture changes, capacity expansions, cutovers, and firmware upgrades, ensuring smooth execution with zero incidents.
- Incident Response: Provide rapid response to critical network incidents, implementing immediate mitigation measures and conducting thorough Root Cause Analysis.
- Monitoring & Observability: Build and maintain network monitoring and observability platforms (utilizing Zabbix, Prometheus, Grafana, or Telegraf) to enable real-time monitoring and automated alerting.
- Automation: Develop network automation tools or platforms using Python, Go, Ansible, or Terraform to improve operational efficiency and standardize operational workflows.
How you will stand out:
- Education: Bachelor’s degree or above in Computer Science, Telecommunications, or a related field.
- Experience: 10+ years of experience in large-scale network operations or architecture.
- Core Networking: Proficient in the TCP/IP protocol stack, with a strong command of routing protocols (BGP, OSPF, IS-IS) and data center technologies (EVPN-VXLAN).
- HPC/AI Networking: Familiarity with InfiniBand (IB) architecture and Subnet Manager (SM) principles, with hands-on experience operating NVIDIA (Mellanox) switches; or proficiency in Ethernet-based RoCEv2 (lossless network) technology, with a deep understanding of PFC and ECN mechanisms, adaptive routing, and congestion control schemes such as Spectrum-X CC, ZTR-RTT CC, and DCQCN.
- System Skills: Proficient in Linux system operations.
Preferred Qualifications:
- Large-Scale Cluster Experience: Hands-on experience in the construction or operations of large-scale GPU clusters (1,000+ GPUs), specifically utilizing NVIDIA H100 or GB200 platforms.
- AI Training Knowledge: Familiarity with the principles of AI distributed training communication libraries (e.g., NCCL, MPI) and the ability to diagnose/infer network issues by analyzing application-level training logs.
- Advanced InfiniBand Skills: Proficiency with advanced features of NVIDIA UFM (Unified Fabric Manager), such as SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) and network telemetry.
- Certifications: Professional certifications such as CCIE, JNCIE, NVIDIA-Certified Professional AI Networking (NCP-AIN), or Advanced Networking Specialty certifications from major public clouds.
- Optical/Backbone Experience: Experience with Optical Transmission equipment (DWDM/DCI) or managing global backbone networks.
- Automation: Mastery of at least one scripting language (Python or Go) with experience in developing network automation platforms or writing operational scripts.
What you will experience working with us:
- A culture that values authenticity and diversity of thought and background;
- An inclusive and respectful environment with open workspaces and an exciting start-up spirit;
- A fast-growing company with the chance to network with industry pioneers and enthusiasts;
- Ability to contribute directly and make an impact on the future of the digital asset industry;
- Involvement in new projects, developing processes/systems;
- Personal accountability, autonomy, fast growth, and learning opportunities;
- Attractive welfare benefits and developmental opportunities such as training and mentoring.
--------------------------------------------------------------------
Bitdeer is committed to providing equal employment opportunities in accordance with country, state, and local laws. Bitdeer does not discriminate against employees or applicants based on characteristics such as race, colour, gender identity and/or expression, sexual orientation, marital and/or parental status, religion, political opinion, nationality, ethnic background or social origin, social status, disability, age, indigenous status, or union membership.