As compute end points built with CPUs and/or Accelerators evolve in large-scale data centers, the network can become a serious choke point in scaling the applications that run on them. Figure 1 shows the key infrastructure building blocks of a large-scale data center. This blog covers how compute end points are evolving to meet the growing demands of new application and storage architectures. To avoid bottlenecks, large-scale data center customers are already moving their network core rapidly to PAM4 (50G SerDes/lane) based leaf-spine infrastructure running at 100-400G. Top-of-rack (ToR) switches, the critical elements that connect these compute/storage end points to the network core, need to evolve as well to avoid becoming infrastructure choke points. We focus on how ToR switches are advancing and why Innovium TERALYNX 5, a modern, optimized 6.4Tbps ToR switch silicon, can dramatically unlock the innovation and performance of compute and storage end points.
Figure 1: Large-scale Data Center Infrastructure Building Blocks
Compute End Point Evolution
Compute end points are servers designed using CPUs and/or Accelerators, along with other components such as memory, PCIe-based IO and network adapters. As shown in Figure 1, these servers are deployed in racks connected to the network core using top-of-rack (ToR) switches.
Figure 2 shows the compute end point market landscape and its evolution. We include examples of CPUs and Accelerators from a few providers to highlight overall trends. As seen in the figure, CPUs and Accelerators continue to offer more cores and performance, higher memory capacity & bandwidth, more IO with next-gen PCIe and increased PCIe lane counts, and higher AI training and inference performance.
Figure 2: Compute End Point Landscape & Evolution
As customers use these CPUs & Accelerators in scale-out architectures with applications built from microservices (including serverless/lambda services), they need higher speed network connectivity from these servers and a highly scalable, lowest-latency network fabric to get the best application performance.
As a result, these compute end points need higher bandwidth network adapters to keep a healthy balance between compute, memory and network IO. These adapters can offload network and security functions, freeing CPU cores to run more applications and drive more bandwidth. With growing demand for higher bandwidth, all leading network adapter vendors now offer 100-200G NICs (Network Interface Cards or Network Adapters) with support for both NRZ and PAM4 connectivity.
Traditionally, customers have used direct-attached storage (DAS) or storage-area networks (SAN) for their storage needs. Disaggregated storage, a form of scale-out storage solution built using pools of storage (disks or NVMe flash) connected over the network, is the preferred storage architecture for data center customers now. It delivers a number of advantages – it helps scale storage and compute resources independently, delivers higher storage utilization, provides performance of local storage with flexibility of SANs and offers more efficiency.
Figure 3: Disaggregated Storage Architecture
Disaggregated storage requires higher capacity network adapters and a scalable network fabric. Customers often use RDMA or RoCE (RDMA over Converged Ethernet) to connect the compute racks to the storage racks. The network fabric needs to deliver high performance and highly reliable RoCE transport for these customers.
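To see why disaggregated storage drives NIC bandwidth requirements, a back-of-the-envelope calculation helps. The per-drive throughput below is an illustrative assumption (~3.5 GB/s, typical of a PCIe Gen4 x4 NVMe SSD), not a figure from this article:

```python
# Back-of-the-envelope: aggregate NVMe read bandwidth vs. NIC capacity.
# Per-drive throughput is an illustrative assumption (~3.5 GB/s,
# typical of a PCIe Gen4 x4 NVMe SSD).

GBPS_PER_DRIVE = 3.5 * 8        # 3.5 GB/s -> 28 Gb/s per drive

def drives_to_gbps(num_drives: int) -> float:
    """Aggregate sequential-read bandwidth of a flash pool, in Gb/s."""
    return num_drives * GBPS_PER_DRIVE

for drives in (4, 8, 16):
    need = drives_to_gbps(drives)
    fit = "exceeds" if need > 100 else "fits in"
    print(f"{drives:2d} drives -> {need:6.1f} Gb/s ({fit} a 100G NIC)")
```

Even a small flash pool can exceed a single 100G link, which is one reason disaggregated storage racks pair naturally with 100-200G PAM4 NICs.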
Compute End Point & Storage Connectivity: Upgrading to higher bandwidth NICs with PAM4
Based on the discussion above, compute/storage end points are expected to connect to the network using 100-200GbE PAM4 connectivity in the future. Industry analysts are predicting this as well.
Seamus Crehan, Crehan Research, April 2020
“With the recent arrival of PAM-4 based NICs, we expect another strong cloud service provider server networking upgrade centering around 100 gigabit Ethernet to start sometime next year.”
Baron Fung, Dell’Oro Group, Feb 2020
“100 Gbps is currently deployed for specialized applications, such as accelerated computing, high performance computing, and storage. However, as the Tier 1 Cloud service providers upgrade their networks from NRZ to PAM-4 SerDes interfaces, transitioning server network connectivity to 100 Gbps and beyond is expected to follow.”
Vlad Galabov, Omdia, May 2020
“100GE Ethernet adapters are increasingly deployed by cloud service providers and enterprises running high-performance computing clusters ….. Cloud service providers are leading the transition to faster networks as they run multi-tenant servers with a large number of virtual machines and/or containers per server. This is driving high traffic and bandwidth needs.”
Figure 4 shows network adapter usage at top cloud customers, based on publicly available information, which aligns with these expectations.
Figure 4: Network Adapter Usage at Top Cloud Customers
Data Center Customer Deployments
Customers deploy hundreds of racks of compute and storage in their data centers. As they deploy these racks, they connect the compute/storage to top-of-rack (ToR) switches. The configuration and specification of these ToR switches are influenced primarily by the following:
– Type of network adapters deployed by servers and storage, defined by:
- NIC bandwidth (25, 50, 100 or 200G)
- NRZ or PAM4 SerDes, and
- NIC interface type (SFP28 for 10/25G, QSFP28 for 100G, SFP56 for 10/25/50G, SFP-DD for 100G or QSFP56 for 100/200G)
– Power & cooling capacity in the rack: This critical factor dictates the number of servers/storage units in a rack and hence the port density of the ToR switch. We don’t expect the number of servers per rack to increase, as power consumption per server continues to go up with higher performance CPUs/Accelerators and larger memory/IO configurations.
– Use of DAC/copper connectivity within the rack: Copper/DAC is the most cost-effective way to connect servers/storage to the ToR today, and that is expected to continue. The limited reach of copper also caps the port density of a ToR switch.
Today, the two most popular ToR switch configurations deployed by DC customers are:
– 48 x SFP28 (10/25G) + 8 x QSFP28 (100G) in 1 rack-unit (1RU)
– 32 x QSFP28 (100G) in 1 rack-unit (1RU)
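The downlink-to-uplink oversubscription of these configurations can be sketched with simple arithmetic. Note that the 24-downlink/8-uplink split of the 32 x QSFP28 box is a hypothetical example for illustration, not a split prescribed above:

```python
# Oversubscription ratio = total downlink bandwidth / total uplink bandwidth.
# Each argument is a list of (port_count, gbps_per_port) tuples.

def oversubscription(downlinks, uplinks):
    down = sum(n * g for n, g in downlinks)
    up = sum(n * g for n, g in uplinks)
    return down, up, down / up

# 48 x SFP28 running at 25G down + 8 x QSFP28 (100G) up
print(oversubscription([(48, 25)], [(8, 100)]))   # (1200, 800, 1.5)

# 32 x QSFP28 (100G), hypothetically split 24 down / 8 up
print(oversubscription([(24, 100)], [(8, 100)]))  # (2400, 800, 3.0)
```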
Next-Gen Top-of-rack (ToR) Switches
Higher speed PAM4 NICs support SFP56 for 25/50G and QSFP56 for 100/200G as interface types. We find broader support for the QSFP56 interface among network adapter vendors – QSFP56 is more flexible and supports a broader range of higher network speeds. Based on our recent engagements with customers, we’ve found significant interest in ToR switches that support QSFP56 (100/200G) ports to connect to next-gen server NICs.
Figure 5: 32 x QSFP56 (100/200G) switch in 1RU
Hence, 32 x QSFP56 (100/200G) in 1RU, as shown in Figure 5, is a natural evolution for customers using 32 x QSFP28 (100G) switches today, since the number of servers per rack is not increasing. It can connect to servers that use NRZ-based network adapters today, and gives customers investment protection and a seamless migration path to servers with higher speed PAM4-based network adapters in the future. The uplinks can also migrate from 100G to 200G.
Innovium’s TERALYNX 5 is highly optimized for next-generation ToR switch configurations such as 32 x QSFP56 (100/200G) in 1RU with 200G uplinks. TERALYNX 5 also supports today’s popular ToR configurations as well as other future ToR configurations, such as:
- Today’s popular configurations with 100G uplinks (using NRZ):
- 32 x QSFP28 (100G)
- 48 x SFP28 (10/25G) + 8 x QSFP28 (100G)
- Other next generation ToRs with 400G uplinks (using PAM4)
- 24 x QSFP56 (100/200G) + 4 x QSFP-DD (400G)
- 48 x SFP56 (25/50G) + 6 x QSFP-DD (400G)
- 48 x SFP-DD (100G) + 4 x QSFP-DD (400G)
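As a quick sanity check on these configurations, each one's aggregate port bandwidth can be compared against the 6.4Tbps capacity of TERALYNX 5 (a minimal sketch using the port counts listed above, with each port counted at its maximum speed):

```python
# Verify that each listed ToR configuration stays within the
# 6.4 Tbps (6400 Gb/s) switching capacity of TERALYNX 5.
# Each config is a list of (port_count, gbps_per_port) tuples.

CAPACITY_GBPS = 6400

configs = {
    "32 x QSFP56 (200G)":                      [(32, 200)],
    "24 x QSFP56 (200G) + 4 x QSFP-DD (400G)": [(24, 200), (4, 400)],
    "48 x SFP56 (50G) + 6 x QSFP-DD (400G)":   [(48, 50), (6, 400)],
    "48 x SFP-DD (100G) + 4 x QSFP-DD (400G)": [(48, 100), (4, 400)],
}

for name, ports in configs.items():
    total = sum(n * g for n, g in ports)
    assert total <= CAPACITY_GBPS, f"{name} exceeds capacity"
    print(f"{name}: {total} Gb/s of {CAPACITY_GBPS}")
```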
To enable modern, high-performance data centers, TERALYNX 5 delivers a range of compelling advantages:
– Both NRZ and PAM4 SerDes, required by next-gen ToR switches to support higher bandwidth NICs and uplinks
– 50 MB of on-chip buffer, the largest in a switch ASIC of this class: this large buffer, along with our highly robust RDMA/RoCE implementation, delivers the best network quality and application performance
– Best-in-class low latency required by data centers to provide the best application performance in today’s microservices driven applications architecture
– Unmatched telemetry and analytics needed to address the toughest troubleshooting scenarios like network microbursts and packet drops
– Comprehensive set of features, including quality of service, multicast, scalable ACLs and VXLAN tunnels for EVPNs
– Optimal cost and power to deliver the best performance/ $ and performance/ W
Please contact us at s[email protected] for information on our products.