What is Driving Data Center Networks?

Today’s businesses are embracing IT and digital technologies at a faster pace than ever before. Companies are offering more products & services online, collaborating using web conferencing tools, creating and analyzing more data to gain business insights, and embracing new technologies like AI/ML to improve business productivity. All of this is possible because of investments in data centers, where data is stored, processed and analyzed. Data centers have become the nerve center of this new digital economy.

Data center networks connect the key building blocks that run these businesses, linking consumers and online users running various applications to the data and compute resources inside these data centers. Let us look at the factors driving data center networks: new application architectures, the evolution of compute and storage, the rapid adoption of AI & ML, a focus on power, and the size and scale of these data centers. At Innovium, we track each of these drivers closely so that we can enable our customers with the best networking capabilities available for running their online businesses from these data centers.

Figure 1: Key Drivers of Data Center Networks

New Application Architectures

As shown in figure 2 below, cloud native applications are increasingly developed using a microservices architecture. These applications drive a lot of machine-to-machine traffic between containers, VMs, or serverless functions. In 2020, Facebook reported that machine-to-machine traffic had grown roughly 10X compared to machine-to-user traffic. Each of those machines can sit in a different rack, in a different corner of the data center, which means the traffic has to traverse multiple tiers of the network. Each network hop adds latency, which impacts application performance. Hence, new application architectures require higher network bandwidth and lower latency for the best application performance.
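
To illustrate how hop count compounds latency, the short Python sketch below sums a per-hop switch latency over different east-west paths. The hop counts and the 1-microsecond per-hop figure are assumptions chosen only to show the arithmetic, not measured values.

```python
# Illustrative only: per-hop latency and hop counts are assumed values.
PER_HOP_LATENCY_US = 1.0  # assumed switch forwarding latency per hop, in microseconds

def network_latency_us(hops: int) -> float:
    """Latency contributed by switch hops alone, ignoring cables and NICs."""
    return hops * PER_HOP_LATENCY_US

# Same rack: a single top-of-rack hop.
print(network_latency_us(1))  # 1.0 us
# Cross-pod: ToR -> leaf -> spine -> leaf -> ToR.
print(network_latency_us(5))  # 5.0 us
```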

Further, customers are seeing huge growth in structured and unstructured data. That data is aggregated in data lakes and used to run analytics and machine learning to draw business insights. More data and more data movement require higher network bandwidth and scale. Innovium’s TERALYNX® switch silicon can be used to build high-bandwidth, high-radix switches, reducing the number of switch elements and lowering operational costs.

There is also a strong focus on the availability and performance of applications as businesses move to the cloud, whether private or public. This requires highly robust and resilient networks in the data centers that host these applications.

Figure 2: New Application Architectures Driving Network Requirements

Evolution of Compute

Moore’s law has hit a wall, and compute has therefore moved to a distributed model. Large compute tasks are split up and run in a distributed fashion, using the microservices architecture mentioned above. This results in a large amount of IO between the distributed compute nodes, and that IO often uses RDMA (Remote Direct Memory Access) technology so data moves between nodes without burdening the processors.
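
As a rough illustration of this split-and-distribute pattern, the sketch below fans a large task out across worker processes on a single machine. The workers are a local stand-in for separate compute nodes, and the workload and chunk sizes are assumptions; in a real deployment each chunk's inputs and results would move over the data center network, often via RDMA.

```python
# Local stand-in for distributed compute: worker processes substitute for
# separate compute nodes. The workload and chunk sizes are illustrative.
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    """Per-node work; in practice each chunk's data crosses the network."""
    return sum(x * x for x in chunk)

def run_distributed(data, num_workers=4):
    # Split the task into roughly equal chunks, one per worker "node".
    chunk_size = (len(data) + num_workers - 1) // num_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(process_chunk, chunks))

if __name__ == "__main__":
    print(run_distributed(list(range(1_000_000))))
```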

Further, compute processors are offloaded using accelerators and NIC offloads for specialized functions such as networking, storage, and security. These are often referred to as SmartNICs or DPUs (Data Processing Units). As a result, each compute node can now generate more IO. There are many examples in the industry, including AWS Nitro, VMware Project Monterey, and GCP gVNIC.

Also, processors are becoming more powerful and now offer many PCIe IO lanes, as seen in recent announcements in the mainstream x86 processor market, including AMD’s EPYC Milan and Intel’s Ice Lake processors. In addition, as GPUs and AI accelerators become mainstream, they generate a lot of IO, and efforts to disaggregate GPUs from compute for flexibility and higher utilization will drive IO even further.

As compute evolves, it is evident that networks need to evolve to deliver higher performance/bandwidth, lower latency and robust RDMA.

Figure 3: Compute Evolution Driving Network Requirements

Evolution of Storage

Massive amounts of data are being created every day by IoT devices, web clicks and online browsing, mobile photos and videos, autonomous cars, medical scans, and more. This includes both structured and unstructured data. Businesses and organizations want to process and mine that data to gain business insights.

Increasingly, customers are moving toward disaggregated storage that is connected to compute over Ethernet, and investment in Fibre Channel-based storage for modern data centers has largely stopped. This new storage model offers better economics, higher utilization, more scalability, and greater flexibility, and leading cloud providers such as Azure, GCP, and AWS have adopted it. Storage building blocks are moving to higher-performance media with faster access, such as flash and SSDs, and Ethernet-based SSDs are beginning to arrive on the market.

More data and more analysis mean more data movement, and hence requirements for higher performance and bandwidth, lower latency, and robust RDMA.

Figure 4: Storage Evolution Driving Network Requirements

Rapid Adoption of AI & Machine Learning

Artificial intelligence and machine learning are being rapidly adopted by businesses and are expected to be ubiquitous in future data centers. These machine learning workloads increasingly consume bigger data sets and use more complex AI models, which requires bigger AI infrastructure, often spanning multiple racks. Connecting that AI infrastructure together requires highly scalable networking with low latency and robust RDMA capabilities. Innovium’s TERALYNX Silicon incorporated low-latency and RDMA requirements from the start in a ground-up design approach.

AI & ML infrastructure is typically connected using two different networks. The first is a traditional Ethernet network, which is critical for ingesting data stored in data lakes. The second is an AI/ML network used for iterative gradient exchange and synchronization between the GPUs/AI accelerators hosted inside the AI infrastructure. ML tasks involve computation on each GPU/AI accelerator followed by an exchange of gradients. As GPUs/AI accelerators are scaled out, the AI/ML network must scale with them so it does not become an IO bottleneck. These AI/ML networks have traditionally used a range of fabrics: proprietary interconnects, PCIe, NVLink, and RoCE (RDMA over Converged Ethernet). Because a highly scalable network is required, many solutions are increasingly adopting RoCE, which offers a standards-based, scalable, and cost-effective solution available from multiple vendors.
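
To make the gradient-exchange step concrete, here is a minimal sketch of data-parallel synchronization using PyTorch's torch.distributed. The backend choice, process-group setup, and model are assumptions for illustration only; the point is that every all_reduce call below is traffic on the AI/ML network fabric (for example, RoCE), so fabric bandwidth and latency directly bound how quickly each training iteration completes.

```python
# Minimal sketch of data-parallel gradient exchange; setup details are assumed.
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average every gradient across workers (one process per GPU/accelerator).

    Each all_reduce below travels over the AI/ML network fabric.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Typical per-process setup (launched with one process per accelerator):
#   dist.init_process_group(backend="nccl")   # or "gloo" for CPU-only tests
#   loss.backward()
#   sync_gradients(model)
#   optimizer.step()
```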

Figure 5: Rapid Adoption of AI & ML Driving Network Requirements

Focus on Power

Climate change is driving green initiatives across all industries. Modern data centers consume a lot of energy, so data center customers have aggressive goals to become carbon neutral or carbon negative. This includes the large cloud providers like AWS, Google, and Microsoft, as well as private cloud operators.

The power and cooling available in each data center rack limit the number of servers and switches that can be placed in that rack. This influences the configuration and specs of the top-of-rack switch used to connect those servers. And when customers stack a number of switches within a rack for the upper tiers of the data center network, it also constrains the number of switches and the power each can consume.
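
A back-of-the-envelope version of this constraint is sketched below; the rack power budget and per-device wattages are assumed figures chosen only to show the arithmetic, not vendor specifications.

```python
# Assumed figures for illustration only; real budgets vary by facility.
RACK_POWER_BUDGET_W = 15_000   # usable power per rack
SERVER_POWER_W = 450           # draw per server
TOR_SWITCH_POWER_W = 350       # draw of the top-of-rack switch

# Servers that fit once the ToR switch has been budgeted for.
servers_per_rack = (RACK_POWER_BUDGET_W - TOR_SWITCH_POWER_W) // SERVER_POWER_W
print(f"Servers per rack within the power budget: {servers_per_rack}")  # 32
```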

Further, customers have seen networking grow as a percentage of total data center power. That can be seen in the figure below, where Microsoft Azure shows data for different server link bandwidths. Data center operators view power consumed by networking as lost business opportunity, since that power is not available for the revenue-generating servers. Hence, there is an increasing focus on reducing power, especially as connectivity bandwidth increases in the future, which means networking infrastructure must be highly power-efficient.

Figure 6: Big Focus on Power Driving Power-Efficient Networks

Size and Scale

As adoption of cloud computing increases, data center operators are building mega-scale data centers that often host over a hundred thousand servers and many thousands of network switches. At that scale, the focus on cost and power is magnified quickly. To simplify operations, these operators also need extensive telemetry to troubleshoot and resolve issues quickly, and automation based on network telemetry is critical.

Approximately 88-90% of the cost of a data center is driven by servers and storage, while networking accounts for about 10-12%. To maximize utilization of servers and storage, data center operators want a non-blocking network so that any compute or storage resource within the data center can be used. This implies no oversubscription and therefore a higher-bandwidth, higher-scale network.
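
A simple way to see what "non-blocking" means is the oversubscription check below; the port counts and speeds are assumed values for a hypothetical leaf switch, not a specific product configuration.

```python
# Hypothetical leaf switch; port counts and speeds are assumptions.
downlink_gbps = 48 * 25    # 48 x 25G server-facing ports
uplink_gbps = 6 * 100      # 6 x 100G uplinks toward the spine

ratio = downlink_gbps / uplink_gbps
print(f"Oversubscription ratio: {ratio:.1f}:1")  # 2.0:1
# A ratio of 1.0:1 (or lower) is non-blocking: any server can reach any
# compute or storage resource at full bandwidth.
```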

Figure 7: Size and Scale of Data Centers is Driving Networking Requirements

Summary

At Innovium, we track key drivers and trends that impact data center network requirements very closely. Our goal is to deliver the best data center network fabric so customers can get the best experience for their applications and businesses that run inside these modern data centers. Innovium’s TERALYNX® switch silicon portfolio delivers the industry’s best telemetry, lowest latency and most power-efficient network solutions for 1 Tbps to 25.6 Tbps today, with an architecture that can scale to 100 Tbps+.

If you are architecting modern data centers, please contact us at: [email protected] to learn more about our breakthrough networking solutions.