Dec 24, 2023

Machine Learning Drives Changing Disaster Recovery At Facebook

Hyperscalers have billions of users who get access to their services for free, but the funny thing is that these users act like they are paying for them and expect these services to always be available, no excuses.

Organizations and consumers also rely on Facebook, Google, Microsoft, Amazon, Alibaba, Baidu, and Tencent for services that they pay for, too, and they reasonably expect that their data will always be immediately accessible and secure, the services always available, their search returns always popping up milliseconds after their queries are entered, and the recommendations that come to them personalized for them. These hyperscalers have built networks of massive datacenters, spanning the globe, to ensure the data and services are close to their customers and that latency doesn't become a problem.

Given all this, disaster recovery becomes a critical part of the business. Hyperscale companies need to make sure business can continue as usual even if a datacenter goes down. They use multiple availability zones located within geographical regions to ensure that data, services and workloads can be accessed through other datacenters if one becomes unavailable. Hyperscalers like Microsoft – which makes Azure available in 140 countries – also have other disaster recovery plans in place, ranging from management of roles across fault domains, to automated failover of user traffic to another region if the user's region fails, to letting users geo-replicate Azure Storage to secondary regions.

For Facebook, with its 2.1 billion users and global datacenters in places ranging from Santa Clara, California and Ashburn, Virginia to Lulea, Sweden and Odense, Denmark, disaster recovery is not only crucial to its operations, but it's something the giant social networking company works on constantly.

"The ability to seamlessly handle the loss of a portion of Facebook's global compute, storage, and network footprint has been a long-standing goal of Facebook Infrastructure," a group of Facebook engineers wrote in a recent paper about the company's infrastructure. "Internally, our disaster recovery team regularly performs drills to identify and remedy the weakest links in our global infrastructure and software stacks. Disruptive actions include taking an entire datacenter offline with little to no notice in order to confirm that the loss of any of our global datacenters results in minimal disruption to the business."

Ensuring high availability – while always critical to operations – has become even more so as artificial intelligence (AI) and machine learning have become more prevalent within the company's operations. Facebook is leveraging machine learning in a broad array of services, from rankings in the News Feed and searches to displaying ads aimed at specific users and Facer for facial recognition, as well as language translation, speech recognition and internal operations like Sigma for anomaly detection. The company also uses multiple machine learning models, including deep neural networks, logistic regression and support vector machines. There are deep learning frameworks like Caffe2 and PyTorch and internal machine learning-as-a-service capabilities like FBLearner Feature Store, FBLearner Flow, and FBLearner Prediction.

As we’ve noted in The Next Platform, much of Facebook's distributed and scalable machine learning infrastructure is based on systems designed in-house, such as the Big Basin GPU server, and relies heavily on both CPUs from Intel and GPUs from Nvidia for training and inference. The growth of machine learning capabilities throughout Facebook's operations puts an even greater premium on disaster recovery, according to the paper's authors.

"For both the training and inference portions of machine learning, the importance of disaster-readiness cannot be underestimated," they wrote. "While the importance of inference to drive several key projects is unsurprising, there is a potentially surprising dependency on frequent training before noticing a measurable degradation in several key products."

To measure that importance, Facebook engineers ran tests to determine what would happen to three services – News Feed, Ads, and Community Integrity – if they were unable to train their models for a week, a month, and six months.

"The first obvious impact was engineer efficiency, as machine learning progress is often tied to frequent experimentation cycles," they wrote. "While many models can be trained on CPUs, training on GPUs often enables notable performance improvement over CPUs for certain use cases. These speedups offer faster iteration times, and the ability to explore more ideas. Therefore, the loss of GPUs would result in a net productivity loss for these engineers. Furthermore, we identified a substantial impact to Facebook products, particularly for products that rely heavily on frequent refreshes of their models."

In the Community Integrity service, which is aimed at identifying and removing objectionable content, being unable to continuously train models would mean a degradation of content, the authors wrote. The content in the News Feed would become stale. For Ads, where the result is essentially an inability to keep pushing relevant ads to the right users, the impact of not being able to train models can be measured in hours: using a one-day-old model is significantly worse than using a one-hour-old model.

"Overall, our investigation served to underscore the importance of machine learning training for many Facebook products and services," the authors wrote. "Disaster readiness of that large and growing workload should not be underestimated."

The rise of AI and machine learning in Facebook's operations also forced the company to change the way it housed its GPU resources. Facebook had compute servers with CPUs for training and inference in almost every datacenter region, a move to compensate should the largest region go down for whatever reason. However, the authors noted that the need for similar redundancy for GPU resources for training was at first underestimated. Computer vision applications were the first workloads that used GPUs for training, and the data used to train the models was replicated globally.

"When GPUs were new to Facebook Infrastructure, rolling them out in a single region seemed to be a smart option for manageability until the designs matured and we could build internal expertise on their service and maintenance requirements," they wrote. "These two factors led to the decision to physically isolate all production GPUs to one datacenter region."

However, new demands on the GPUs changed that thinking.

"Due to the increased adoption of Deep Learning across multiple products, including ranking, recommendation, and content understanding, locality between the GPU compute and big data increased in importance," the authors wrote. "And complicating that need for compute-data colocation was a strategic pivot toward a mega-region approach for storage. The notion of a mega-region means that a small number of data center regions will house the bulk of Facebook's data. Incidentally, the region housing the entire GPU fleet did not reside in the storage mega-region."

Given all that, and beyond the importance of locating compute resources together with the data, Facebook says that "it quickly became important to consider what might happen if we were to ever lose the region housing the GPUs entirely. And the outcome of that consideration drove the need to diversify the physical locations of the GPUs used for ML training."
