Is InfiniBand's "Infinite Bandwidth" Truly Infinite? 🤔

Preface#

Recently, my friend and I decided to upgrade the storage for our ‘DataCenter’ (essentially a cluster of servers). Our goal was clear: deploy a centralized storage system featuring both All-Flash and Hybrid-Flash tiers to provide mount services for virtualization nodes, aiming for an “enterprise-grade” feel.

While purchasing a turnkey solution like Huawei’s OceanStor Dorado would be the simplest route, this setup is primarily for personal tinkering and occasional testing. Therefore:

Open-source solutions are our destiny!

Why open source? Don't ask. If you must ask, the answer is "limited budget" (read: broke)! # tongue firmly in cheek

⚡️ Calculating Theoretical All-Flash Throughput#

Assuming a single SSD sustains ~500MB/s writes and ~550MB/s reads (roughly 4Gbps per disk), a 6-disk RAID 5 array theoretically yields:

  • Write: 500MB/s * (6-1) ≈ 2.5GB/s (parity overhead may reduce actual performance).
  • Read: 550MB/s * 6 ≈ 3.3GB/s (well beyond 10Gbps).

While a RAID 0 array would be even faster, our “retired” storage server is limited by a 12Gbps SAS RAID card. To go higher, we would need to move to direct PCIe paths (NVMe), which is currently beyond our budget. Thus, we set 10Gbps as our baseline requirement for the transmission medium.
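The back-of-the-envelope estimates above can be sketched in a few lines. This is only the idealized upper bound from the text's numbers; real results depend on the controller, stripe size, and parity computation cost:

```python
# Theoretical RAID 5 sequential throughput (idealized; parity and
# controller overhead will lower real-world numbers).

def raid5_write(disk_mb_s: float, disks: int) -> float:
    """Write upper bound: one disk's worth of capacity goes to parity."""
    return disk_mb_s * (disks - 1)

def raid5_read(disk_mb_s: float, disks: int) -> float:
    """Read upper bound: all disks contribute."""
    return disk_mb_s * disks

write_mb_s = raid5_write(500, 6)   # 2500 MB/s ≈ 2.5 GB/s
read_mb_s = raid5_read(550, 6)     # 3300 MB/s ≈ 3.3 GB/s

# Convert MB/s to Mbps (x8) to compare against a 10 Gbps link.
print(f"write ≈ {write_mb_s / 1000:.1f} GB/s ({write_mb_s * 8 / 1000:.0f} Gbps)")
print(f"read  ≈ {read_mb_s / 1000:.1f} GB/s ({read_mb_s * 8 / 1000:.0f} Gbps)")
```

Both figures comfortably exceed a single 10Gbps link, which is what motivates the baseline below.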

Transmission Media#

10Gbps media is now ubiquitous: Fiber optics, DAC (Direct Attach Copper) cables, or even high-quality Cat6A Ethernet for short distances. We previously stockpiled some “gray market” gear, such as Mellanox CX341 40Gbps NICs. We initially considered a Peer-to-Peer (P2P) connection, but with three compute nodes, the NIC requirements for the storage server became impractical.

To remain cost-effective, we acquired a used Cisco Nexus 3064PQ-10GX (48x10G SFP+ ports + 4x40G QSFP ports). A dedicated switch perfectly solves the multi-node connectivity issue.

Storage System#

Consumer-grade NAS OSs aren’t quite built for this level of performance. We settled on TrueNAS—it’s open-source, free, and provides robust support for All-Flash/Hybrid-Flash tiers backed by the powerful ZFS file system.

TrueNAS

Network Architecture:#

  • Compute Nodes: Connected via 10Gbps NICs.
  • Storage Server: Connected via 40Gbps NICs.

While this creates a high-speed storage fabric, a new problem arises: with 50-60 VMs running simultaneously, concurrent I/O could easily saturate the 10Gbps bandwidth. Furthermore, the low-power CPUs we chose for the storage server would struggle with the TCP/IP stack overhead.
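To see why concurrency is the real worry, a quick fair-share calculation (VM counts and link speed taken from the text; bursty traffic and scheduling are ignored) shows how little bandwidth each VM gets if all of them hit storage at once:

```python
# Fair-share bandwidth per VM on a single 10 Gbps link,
# assuming all VMs generate I/O simultaneously (worst case).
link_gbps = 10
for vms in (50, 60):
    per_vm_mbps = link_gbps * 1000 / vms
    print(f"{vms} VMs -> ~{per_vm_mbps:.0f} Mbps (~{per_vm_mbps / 8:.0f} MB/s) each")
```

Roughly 20-25MB/s per VM under full contention: slower than a single laptop HDD, before even counting TCP/IP processing on the storage server's low-power CPU.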

The Solution? RDMA.

Mellanox NICs are renowned for their support of RDMA (Remote Direct Memory Access). This technology bypasses the CPU, moving data directly between the NIC and memory with extreme efficiency and ultra-low latency.

Introduction to RDMA & InfiniBand#

Before discussing RDMA, we must address its foundation: InfiniBand (IB).

InfiniBand Trade Association

InfiniBand is a high-performance, low-latency communication protocol widely used in Supercomputing, Data Centers, and AI training clusters. The “magic” of RDMA is primarily realized through this fabric.

The name “InfiniBand” suggests “Infinite Bandwidth.” While technically hyperbole, it refers to its massive throughput and extreme scalability, which effectively shatters the performance ceilings of traditional Ethernet. At SC24, NVIDIA showcased the ConnectX-8, boasting single-port speeds of 800Gbps—enough to transfer a Blu-ray movie in one second.
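The "Blu-ray in one second" claim checks out with simple arithmetic (assuming a 50GB dual-layer disc and decimal units):

```python
# Time to push one Blu-ray disc over an 800 Gbps link
# (assumes a 50 GB dual-layer disc; decimal GB/Gb units).
link_gbps = 800
disc_gb = 50
seconds = disc_gb * 8 / link_gbps
print(f"{disc_gb} GB over {link_gbps} Gbps ≈ {seconds:.2f} s")
```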

ConnectX-8 [1]

✅ Advantages of InfiniBand:#

1. Ultra-High Bandwidth#

Ideal for HPC and AI model training where Terabytes of data are moved constantly.

2. Ultra-Low Latency + Zero-Copy#

This is the primary differentiator from standard Ethernet:

  • RDMA Support: Direct memory-to-memory transfer.
  • Kernel Bypass: Data bypasses the CPU and OS kernel.
  • Minimal CPU Overhead: Latency falls into the sub-microsecond range, compared with the tens of microseconds typical of traditional Ethernet.

RDMA

🧱 Disadvantages:#

RoCE (RDMA over Converged Ethernet) allows IB-style transport to run over conventional Ethernet hardware, which mitigates cost, but it still requires RDMA-capable NICs and software support.

1. Closed Ecosystem and Cost#

IB isn’t just a cable; it’s a proprietary ecosystem requiring:

  1. Dedicated IB Switches.
  2. IB HCA (Host Channel Adapters).
  3. Specific software stacks (OpenFabrics Enterprise Distribution - OFED).

🧾 High cost is a major barrier, and the price/performance ratio is especially poor for small and medium-scale deployments.
| Metric | InfiniBand | iWARP (TCP-based) | RoCE (Ethernet-based) |
| --- | --- | --- | --- |
| Performance | Best | Lower (TCP overhead) | Comparable to IB |
| Cost | High | Medium | Low |
| Switch Type | Dedicated IB Switch | Ethernet Switch | Ethernet Switch |

The data in the table above comes from Huawei Support [2].

2. Lack of Native Interoperability#

InfiniBand does not use the standard IP protocol suite. It has its own addressing and flow control. To connect an IB fabric to an Ethernet network, you need an IB-Ethernet Gateway for protocol translation.

✍️ Final Thoughts#

RDMA and InfiniBand are “performance-at-any-cost” solutions. However, with the rise of AI and Cloud Computing, many enterprise scenarios are hitting the limits of traditional Ethernet. In applications requiring High Bandwidth + High Concurrency + Ultra-Low Latency, IB is becoming essential.

From a technical perspective, IB isn’t for everyone due to:

  1. High Capital Expenditure (CAPEX): Switches and HCAs are expensive.
  2. Operational Complexity: It requires specialized knowledge to maintain.

When should you consider InfiniBand?

  • Deploying AI training/inference clusters with multiple GPU nodes (A100/H100).
  • Using distributed storage (e.g., Ceph) where latency is critical.
  • Massive data distribution systems requiring maximum throughput.

🔗 Links:

[1] https://nvdam.widen.net/s/pxsjzhgw6j/connectx-datasheet-connectx-8-supernic-3231505

[2] https://support.huawei.com/enterprise/zh/doc/EDOC1100203347

https://fuwari.vercel.app/posts/f235a39f-8bb0-4ca6-abd6-c615eed26ae6/
Author: Ryan Zhang
Published at: 2025-07-06
License: CC BY-NC-SA 4.0
This content has been translated with the assistance of AI tools, including ChatGPT, Gemini, and Qwen. While efforts have been made to ensure accuracy and clarity, minor discrepancies may exist. Please refer to the original text for authoritative interpretation if needed.