Written by Arbitrage • 2022-01-27
On Monday, Facebook's parent company Meta said it has created what it believes is among the fastest artificial intelligence supercomputers running today. The social media giant said it hopes the machine will help lay the groundwork for building the metaverse, a virtual-reality construct intended to supplant the internet as we know it.
Facebook said it believes the computer will be the fastest AI supercomputer in the world once it is fully built around the middle of the year. Supercomputers are extremely fast and powerful machines built to do complex calculations that are not possible on a regular home computer. Meta did not disclose where the computer is located or how much it costs to build, but given the hardware involved, the total looks to be upwards of $120 million.
AI supercomputers are built by combining multiple GPUs into compute nodes, which are then connected by a high-performance network fabric to allow fast communication between those GPUs. RSC today comprises a total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs, with each A100 GPU being more powerful than the V100 GPUs used in Meta's previous research cluster. The DGX nodes communicate over an NVIDIA Quantum 1,600 Gb/s InfiniBand two-level Clos fabric with no oversubscription. RSC's storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
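To make that node-and-fabric picture concrete, here is a minimal sketch of how a multi-node training job typically initializes GPU-to-GPU communication with PyTorch's NCCL backend. The environment variables and the torchrun launcher are standard PyTorch conventions, not details Meta has published about RSC's software stack.

```python
import os
import torch
import torch.distributed as dist

def init_worker():
    # One process per GPU; a launcher such as torchrun sets RANK, WORLD_SIZE,
    # LOCAL_RANK, MASTER_ADDR, and MASTER_PORT. Values here are illustrative.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # NCCL carries intra-node traffic over NVLink and inter-node traffic over
    # the InfiniBand fabric, so the same collective call works either way.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

if __name__ == "__main__":
    rank, world_size, local_rank = init_worker()
    # A single all-reduce: every GPU contributes a tensor and gets back the sum.
    x = torch.ones(1024, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(f"all-reduce across {world_size} GPUs complete")
    dist.destroy_process_group()
```

Launched with, for example, `torchrun --nnodes=2 --nproc_per_node=8 script.py`, each of the 16 processes drives one GPU, and the fabric handles cross-node hops transparently.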
Early benchmarks on the AI Research SuperCluster (RSC), compared with Meta's legacy production and research infrastructure, show that it runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster. That means a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before. The computer is already up and running but is still being expanded. Meta says it will help its AI researchers build "new and better" artificial intelligence models that can learn from "trillions" of examples, work across hundreds of different languages simultaneously, and analyze text, images, and video together.
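For context on what the NCCL figure measures, the sketch below times repeated all-reduce operations and converts the result into an approximate bus bandwidth. It assumes the NCCL process group has already been initialized as in the previous sketch, and the message size and iteration count are arbitrary illustrative choices, not Meta's benchmark settings.

```python
import time
import torch
import torch.distributed as dist

def allreduce_bus_bandwidth(size_mb: int = 256, iters: int = 20) -> float:
    """Approximate all-reduce bus bandwidth in GB/s across all ranks."""
    n = size_mb * 1024 * 1024 // 4          # number of float32 elements
    x = torch.randn(n, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    # A ring all-reduce moves roughly 2 * (N - 1) / N bytes per byte of payload.
    world = dist.get_world_size()
    bytes_moved = 2 * (world - 1) / world * n * 4 * iters
    return bytes_moved / elapsed / 1e9

# Usage (after initializing the NCCL process group as in the previous sketch):
#   print(f"approx. all-reduce bus bandwidth: {allreduce_bus_bandwidth():.1f} GB/s")
```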
The way Meta defines the power of its computer differs from how conventional, and in some respects more technically powerful, supercomputers are measured, because it relies on the performance of graphics-processing chips, which are well suited to running "deep learning" algorithms that can understand what's in an image, analyze text, and translate between languages, said Tuomas Sandholm, a computer science professor and co-director of the AI center at Carnegie Mellon University. "We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together," Meta said in a blog post. The company said its supercomputer will incorporate "real-world examples" from its own systems into training its AI; its previous efforts used only open-source and other publicly available data sets. "They are going to, for the first time, put their customer data on their AI research computer," Sandholm said. "That would be a really big change to give AI researchers and algorithms access to all that data."
What's next? Through 2022, Meta will work to increase the number of GPUs from 6,080 to 16,000, which it says will increase AI training performance by more than 2.5x. The InfiniBand fabric will expand to support 16,000 ports in a two-layer topology with no oversubscription. The storage system will have a target delivery bandwidth of 16 TB/s and exabyte-scale capacity to meet increased demand. Meta expects such a step-function change in compute capability to let it not only create more accurate AI models for its existing services but also enable completely new user experiences, especially in the metaverse. Long-term investments in self-supervised learning and in building next-generation AI infrastructure with RSC are helping the company create the foundational technologies that will power the metaverse and advance the broader AI community as well.
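The headline figures in that plan are easy to sanity-check: 16,000 GPUs is roughly 2.6 times 6,080, which lines up with the "more than 2.5x" performance claim only if training throughput scales nearly linearly with GPU count. A quick back-of-envelope sketch (the linear-scaling assumption is ours, not Meta's):

```python
# Back-of-envelope check of the RSC expansion figures quoted above.
current_gpus = 6_080
planned_gpus = 16_000

scale = planned_gpus / current_gpus
print(f"GPU count scale-up: {scale:.2f}x")            # ~2.63x

# Under an (assumed) ideal linear-scaling regime, a training run that takes
# three weeks on today's cluster would take roughly:
weeks_today = 3
print(f"ideal time at full scale: {weeks_today / scale:.1f} weeks")  # ~1.1 weeks
```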