Facebook parent company Meta to retool data centers as AI slips into more services

Millions of people use AI every month across Meta platforms, including Facebook, and the company is upgrading data center equipment to handle the increasing computing load that AI requires.

That is according to Alexis Black Bjorlin, vice president of infrastructure at Meta, who spoke in a keynote at an artificial intelligence summit held in Santa Clara, California.

“It gives us deeper insights. It gives us a better ability to predict user behavior, and therefore a better ability to deliver meaningful and relevant content to our nearly 3 billion daily active users,” Black Bjorlin said during the keynote on Wednesday.

The hardware upgrades will also push AI into more apps and services, and will help Meta pursue its long-term business strategy around the metaverse, which is still in the pipeline. Black Bjorlin said nearly 700 million people use augmented reality across Meta's platforms each month.

“In particular, AI can detect and remove more than 95% of objectionable content before you see it,” Black Bjorlin said.

Alexis Black Bjorlin presents at the AI Hardware Summit in Santa Clara, California.

By 2025, Black Bjorlin said, Meta plans to build massive clusters containing more than 4,000 accelerators. The network of cores will be organized as a grid, with a bandwidth of 1 terabyte per second between accelerators. Black Bjorlin did not say what kind of accelerators the company plans to use, but Meta makes extensive use of Nvidia GPUs and plans to build an AI supercomputer based on them.

“Sometimes you’ll see us talking about scale size in terms of thousands of accelerators. What we really have to design is megawatts,” Black Bjorlin said.

Meta has data centers across 20 regions around the world, with each region having about five data center buildings. Black Bjorlin said the company has more than 50 million square feet of data center footprint worldwide.

A typical small AI training cluster draws about eight megawatts, but Meta sees the need to scale that up to a total power envelope of 64 megawatts.

“A large portion of this energy budget will be allocated to the grid,” Black Bjorlin said. AI workloads typically need ultra-fast network bandwidth to move data between compute units, memory, and storage for machine learning.

Working within that budget means understanding the system as a whole, identifying what adds value, and stripping out unnecessary components. The idea is to shrink hardware at both the system and chip level, Black Bjorlin said. She gave the example of optical interfaces, which Meta is pursuing for use in its data centers.

“It gives us an important way to reduce the power consumption of the optics. And when I talk about this, it’s not just about switching to the switch on the higher-level network. It’s actually the optical links that go to the accelerators themselves,” said Black Bjorlin.

She praised the work of the CXL Consortium, which last month released version 3.0 of the Compute Express Link specification. CXL defines an interconnect for communication between chips, memory, and storage within systems.

Meta’s current data center infrastructure handles 3.65 billion monthly active users across its services, including 2.91 billion users on Facebook. In addition to removing more than 95% of objectionable content, its AI systems can translate 200 languages. The company uses the OPT-175B natural language processing model, which contains 175 billion parameters and which it has released to developers.

The company is building its AI infrastructure around PyTorch, a machine learning framework that is emerging as a preferred choice for AI alongside TensorFlow. There are over 150,000 PyTorch projects on GitHub from over 2,400 authors.

Meta this week spun off its PyTorch project into the newly formed PyTorch Foundation, which will be managed by the Linux Foundation. Foundation members also include the major cloud providers Amazon Web Services, Google Cloud, and Microsoft Azure.

Meta’s new AI operating model relies on how quickly models can move to production, which in some cases is more important than traditional system metrics such as performance per watt.

“We are trying to find a way to capture the best of both worlds – to maintain developer efficiency, use fast production time and achieve high performance. Ideally, we will have devices that support native Ethernet,” said Black Bjorlin.
