Innovium TERALYNX Supercharges Next Epoch of Data-centric Computing

The Next Epoch of Data-centric Computing

Data is being created at an exponential pace in today’s digital world from sources such as web clicks, cell phone cameras, security devices, IoT devices and other sensors. With the increased availability of inexpensive compute, storage and network infrastructure, that data is being processed and analyzed to draw insights and derive business and consumer benefits. As Moore’s law slows down, general-purpose processors can no longer scale processing and performance at the rate they used to. Hence, specialized accelerators such as GPUs, TPUs and AI processors are being designed and deployed by customers to process and analyze the massive amounts of data being created. These accelerators use artificial intelligence (AI) and machine learning (ML) techniques to drive the next epoch of data-centric computing. This trend was recently articulated by Google’s Amin Vahdat in his SIGCOMM Lifetime Achievement Award keynote in August 2020.

Artificial Intelligence (AI) and Machine Learning (ML) are being used pervasively across a wide range of areas, including:

  • Entertainment: Individually tailored movie or shopping recommendations based on collected data
  • Smartphones: Speech recognition, intelligent assistants and language translation
  • Security: Fraud detection, intrusion detection and home surveillance
  • Medicine: Rapid disease diagnosis (cancer/stroke detection) and drug research & development
  • Transportation: Self-driving cars and traffic congestion management
  • Oil & Gas: Seismic exploration and analysis

AI / ML Deployments

Multiple Stages in Machine Learning

Traditional computer applications are programmed with a set of explicit, pre-defined rules to deliver an answer. In AI & ML, an answer is delivered by analyzing and learning from the vast amounts of data fed to the model. AI and ML applications typically need lots of data, so the first part of the process is data collection. This data is often cleaned, organized and prepared as two separate datasets: training and test. The training dataset is used by a training model for learning until the model reaches a desired accuracy. The test dataset is used to validate the trained model’s accuracy on unseen data. Once results are satisfactory, the trained model can be deployed for inference, where it is used to make predictions. As seen here, a lot of data is ingested and processed, and this is done in a highly distributed fashion. That requires high network bandwidth, low network latency and data movement using RDMA so that the CPU is not burdened.

Figure 1: Multiple Stages in Machine Learning
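
As a rough illustration of these stages, here is a minimal sketch (using scikit-learn purely for illustration; the dataset and model are arbitrary assumptions, not part of any Innovium product) that prepares training and test datasets, trains a model, validates accuracy on unseen data, and then runs inference:

    # Minimal sketch of the ML stages described above: data preparation,
    # training, validation on unseen data, and inference. A production AI/ML
    # pipeline would be distributed across many accelerators over the network.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Data collection / preparation: split into training and test datasets.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Training: fit the model on the training dataset until accuracy is acceptable.
    model = LogisticRegression(max_iter=2000)
    model.fit(X_train, y_train)

    # Validation: check the trained model's accuracy on unseen (test) data.
    print("test accuracy:", model.score(X_test, y_test))

    # Inference: the trained model is deployed to make predictions on new data.
    print("prediction:", model.predict(X_test[:1]))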

AI / ML Infrastructure: Ethernet will win over Proprietary Networks and Topologies

Training in machine learning is a process where massive datasets are fed into a model (algorithm) to train it. Computation often involves large amounts of matrix arithmetic and is done in parallel using specialized accelerators such as GPUs, TPUs or AI processors. Each model starts with an initial set of weights and biases. With each iteration of compute, the weights and biases are updated and synchronized across all the accelerators in a process called gradient exchange. The training time associated with gradient exchange can be reduced using a very high bandwidth, low latency network that enables faster weight exchange and synchronization. With increasingly large datasets, more layers in neural network models and more parameters, larger-scale training infrastructure is necessary to cut down training time. In this large-scale AI / ML infrastructure, a high-performance, highly scalable network plays a critical role in getting training results faster, and hence delivering business outcomes faster.
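
To make the gradient-exchange step concrete, below is a simplified, single-process sketch (using NumPy; the worker count, model size and learning rate are illustrative assumptions, not a real distributed framework such as NCCL over RoCE) in which each worker computes a local gradient and an all-reduce averages them so every worker applies the identical update. In a real cluster this exchange crosses the network on every iteration, which is why bandwidth and latency matter so much:

    # Simplified illustration of gradient exchange (all-reduce) across workers.
    # Real deployments use collective communication between accelerators over
    # the network; this single-process NumPy sketch only shows the math.
    import numpy as np

    NUM_WORKERS = 4
    rng = np.random.default_rng(0)

    # Each worker starts from the same weights and computes a gradient on its
    # own shard of the training data.
    weights = np.zeros(8)
    local_gradients = [rng.normal(size=8) for _ in range(NUM_WORKERS)]

    # Gradient exchange: all-reduce (sum, then average) so every worker sees
    # the same global gradient. On a cluster, this is the traffic that crosses
    # the AI/ML network on every iteration.
    global_gradient = np.mean(local_gradients, axis=0)

    # Every worker applies the identical update, keeping the replicas in sync.
    learning_rate = 0.1
    weights -= learning_rate * global_gradient
    print("synchronized weights:", weights)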

As seen in Figure 2, machine learning infrastructure consists of compute, network and storage elements. Datasets are typically stored in storage arrays/solutions connected to the network and accessible to all the compute nodes. Compute typically consists of a number of accelerators inside a compute chassis; most solutions in the market today have eight accelerators in a chassis. Some examples are the DGX A100 from Nvidia and the HLS-1 from Habana/Intel. These compute chassis also have some CPUs and local storage. The compute chassis are connected to each other and to the storage arrays through the AI / ML network using multiple 100-200G NICs to provide sufficient IO bandwidth. While some solutions use proprietary connectivity between accelerators within a chassis, others use standards-based RoCE Ethernet. For connectivity across compute chassis, customers have used both RoCE Ethernet and InfiniBand in the past; however, standards-based RoCE Ethernet is becoming the de facto choice as ubiquitous Ethernet offers higher-radix, scalable 100-400G switching solutions with very low latencies and robust RDMA capabilities.

Figure 2: Large-scale Machine Learning Infrastructure
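
As a rough back-of-envelope (the one-billion-parameter model, FP16 gradients, ring all-reduce factor and NIC configurations below are illustrative assumptions, not measurements), the following sketch estimates how long a single gradient exchange takes at different per-chassis network bandwidths; doubling link speed roughly halves the exchange time, which is why 100-400G connectivity directly shortens training iterations:

    # Back-of-envelope estimate of gradient-exchange time per iteration.
    # Illustrative assumptions only: a 1-billion-parameter model, FP16
    # gradients, a ring all-reduce (each node moves roughly 2x the gradient
    # volume), and the per-chassis NIC configurations shown below.
    PARAMETERS = 1_000_000_000          # model size (assumed)
    BYTES_PER_PARAM = 2                 # FP16 gradients
    RING_FACTOR = 2.0                   # ~2x traffic per node for ring all-reduce

    gradient_bytes = PARAMETERS * BYTES_PER_PARAM

    for nics, gbps in [(8, 100), (8, 200), (8, 400)]:
        chassis_bw_bytes = nics * gbps * 1e9 / 8   # aggregate NIC bandwidth, bytes/s
        seconds = RING_FACTOR * gradient_bytes / chassis_bw_bytes
        print(f"{nics} x {gbps}G NICs: ~{seconds * 1e3:.1f} ms per gradient exchange")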

TERALYNX Ethernet Switches Supercharge AI / ML Deployments

Innovium designed the TERALYNX switch architecture from the ground up with patented, customer-focused innovation to address customers’ computing needs in deployments such as AI & ML. The TERALYNX switch family delivers the highest-performance and most scalable networking solutions with unmatched features & telemetry, the best power efficiency and the lowest latency. These capabilities are critical for a high-performance AI/ML network infrastructure. Let us look at the key advantages of using TERALYNX switches in an AI/ML deployment.

  • Highest Radix and Performance
    • 12.8T TERALYNX 7 volume-production switches deliver the highest-radix switch solutions to customers today, including 32 x 400G, 64 x 200G and 128 x 100G. These systems are already being deployed by the world’s largest cloud providers to scale their data center networks. Further, 25.6T TERALYNX 8 will soon provide customers even higher-radix solutions such as 32 x 800G, 64 x 400G and 128 x 200G.
    • Alternative InfiniBand solutions today only offer 200G connectivity with HDR technology, and the maximum-radix solution is 40 x 200G with 8T switch silicon. This implies more switches and more network tiers to support an AI & ML cluster of a given size, which means higher latency, cost and complexity (see the back-of-envelope sketch after this list). Besides, Ethernet technology offers a standards-based solution that is easier to deploy and manage.
  • Best-in-class Low Latency
    • The TERALYNX switch family delivers best-in-class low latency in the market, with a ~2x latency advantage compared to alternative Ethernet switch solutions. AI & ML applications benefit significantly from low latency.
  • Largest Buffers & Robust RDMA
    • The TERALYNX switch family offers the largest on-chip buffer at each performance point. This, along with our highly robust RDMA/RoCE feature set and implementation, delivers the best network quality and AI/ML performance. As we saw above, AI/ML applications require a lot of data-centric computation and data movement, so the robust RDMA/RoCE capability in TERALYNX helps the solution significantly.
  • Programmability
    • The TERALYNX switch family is programmable, which future-proofs the network infrastructure.
  • Unmatched Telemetry
    • FLASHLIGHT telemetry and analytics, available in TERALYNX switches, address the toughest troubleshooting scenarios, such as network microbursts, that may arise in the AI/ML network.
  • Power Efficiency
    • Most customers have major initiatives to be highly power efficient and carbon neutral or negative. TERALYNX switches are the most power-efficient switches in their class, which helps customers design the most compact switches, often in 1RU, and lowers OPEX significantly.
  • Proven & Deployed at scale
    • The TERALYNX switch family has gone through extensive validation and testing, and is now being deployed at scale by the world’s top cloud customers for both general-purpose and AI/ML computing needs.
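
To illustrate the radix argument above, here is a back-of-envelope sketch (the 512-endpoint cluster size and the simple non-blocking two-tier leaf/spine arithmetic are simplifying assumptions, not a validated network design) comparing how many switches a fabric needs when built from 128 x 100G switches versus 40 x 200G switches:

    import math

    # Back-of-envelope comparison of switch counts for a non-blocking two-tier
    # (leaf/spine) fabric. The 512-endpoint cluster size and the simple Clos
    # arithmetic are illustrative assumptions, not a validated network design.
    ENDPOINTS = 512   # accelerator-facing ports needed (assumed)

    def two_tier_switches(ports_per_switch, endpoints):
        # Leaves use half their ports for endpoints, half as uplinks to spines.
        down_ports = ports_per_switch // 2
        leaves = math.ceil(endpoints / down_ports)
        spines = math.ceil(leaves * down_ports / ports_per_switch)
        return leaves, spines

    for label, ports in [("128 x 100G Ethernet (TERALYNX 7)", 128),
                         ("40 x 200G InfiniBand HDR", 40)]:
        leaves, spines = two_tier_switches(ports, ENDPOINTS)
        print(f"{label}: {leaves} leaves + {spines} spines = {leaves + spines} switches")

Under these assumptions, the higher-radix switch builds the same cluster with far fewer boxes; fewer switches and fewer tiers also mean fewer hops per flow, which is where the latency and cost advantages come from.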

Please contact us at [email protected] for information on our products.