Matthew Singer, Twitter

Overview

Twitter is an online news and social networking service created in March 2006 by Jack Dorsey, Noah Glass, Biz Stone and Evan Williams, and launched in July of that year. The service rapidly gained worldwide popularity. As of 2018, Twitter had more than 321 million monthly active users. Twitter, Inc. is based in San Francisco, California, and has more than 25 offices around the world.

Supernova Award Category

Tech Optimization and Modernization

The Problem

There are hundreds of millions of Tweets every day, which turn into over 1 trillion events per day. That is a lot of data. Twitter uses Hadoop to store that data and to perform advanced analytics that generate important business insights. Twitter's Hadoop clusters collectively have more than an exabyte of physical storage. Because of their affordability, hard disk drives are the workhorses of Twitter's Hadoop clusters. A typical Hadoop cluster at Twitter can have over 100,000 hard disk drives, which translates into nearly 100 PB of logical storage. But while hard drive capacities have increased over time, their IOPS have remained essentially flat, so each drive delivers fewer I/O operations per terabyte stored. That has resulted in a storage bottleneck.
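The effect of flat IOPS on ever-larger drives can be illustrated with a small calculation. This is a minimal sketch only: the per-drive IOPS figure and the drive capacities are assumed round numbers chosen for illustration, not values reported by Twitter or Intel.

```python
# Hypothetical figure for illustration only; it does not come from the case study.
ASSUMED_IOPS_PER_DRIVE = 100.0  # assumed to stay roughly constant across drive generations


def iops_per_tb(capacity_tb: float, iops: float = ASSUMED_IOPS_PER_DRIVE) -> float:
    """Return the I/O operations per second available for each terabyte stored."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return iops / capacity_tb


if __name__ == "__main__":
    # Assumed drive capacities, doubling each generation.
    for capacity_tb in (2, 4, 8, 16):
        print(f"{capacity_tb:>2} TB drive: {iops_per_tb(capacity_tb):.2f} IOPS per TB")
```

Under these assumptions, every doubling of capacity halves the I/O available per terabyte, which is the kind of squeeze the paragraph above describes.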

The Solution

To look at how they might overcome this bottleneck, Twitter worked in collaboration with an Intel engineering team to conduct a series of experiments. Typical data flow into and out of a Hadoop cluster includes two key components: the Hadoop Distributed File System (HDFS), which is the way that data is stored in Hadoop, and temporary data managed by Yet Another Resource Negotiator (YARN). These two data flows often occur at the same time, causing contention for access to the hard drives. The engineering teams discovered that by selectively caching the YARN temporary files on a fast Intel SSD, the two data streams were no longer competing for the same resources, so hard drive utilization dropped and Hadoop could process data faster.
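The contention argument can be sketched as a toy model. The function name, the IOPS demands, and the drive budget below are all hypothetical, and the model assumes the SSD cache absorbs all YARN temporary I/O, which is a simplification of what a real caching layer does.

```python
def hdd_utilization(hdfs_iops: float, yarn_temp_iops: float,
                    drive_iops_budget: float, cache_yarn_on_ssd: bool) -> float:
    """Return the fraction of a hard drive's IOPS budget in use (1.0 means saturated).

    Toy model: when YARN temporary files are cached on an SSD, only HDFS
    traffic is assumed to reach the hard drive.
    """
    if drive_iops_budget <= 0:
        raise ValueError("drive_iops_budget must be positive")
    demand = hdfs_iops if cache_yarn_on_ssd else hdfs_iops + yarn_temp_iops
    return min(demand / drive_iops_budget, 1.0)


if __name__ == "__main__":
    # Hypothetical per-drive demands, not measurements from the experiments.
    shared = hdd_utilization(60, 50, 100, cache_yarn_on_ssd=False)
    cached = hdd_utilization(60, 50, 100, cache_yarn_on_ssd=True)
    print(f"both streams on HDD: {shared:.0%} utilized")
    print(f"YARN temp on SSD:    {cached:.0%} utilized")
```

With these made-up numbers the shared drive is saturated, while moving the temporary data to the SSD leaves headroom on the hard drive for HDFS traffic.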

The Results

Removing the storage I/O bottleneck enabled Twitter to reduce the total number of racks in its cluster, which gives the company a smaller data center footprint. Twitter achieved that density by moving from 12 smaller hard drives to 8 larger hard drives per system, which reduces the number of hard drives in a cluster without negatively impacting performance. With the hard drive performance bottleneck removed, the clusters could also take advantage of more CPU horsepower, so Twitter moved from 4-core processors to 24-core processors. The combination of larger hard drives and higher-core-count processors means Twitter can reduce the number of systems, hard drives, and racks in each Hadoop cluster in its data center, which leads to reduced maintenance costs and much better energy efficiency. Producing the same results with about 75% less energy is also good for the environment. Caching temp data and increasing processor core counts is expected to result in up to 50 percent faster runtimes, and the increased density results in 30 percent lower TCO. This prepares Twitter for continued growth of its data while still delivering the experience its users expect.
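The per-system drive change (12 to 8) can be combined with the 75% hard drive reduction quoted under Disruptive Factor to estimate how many systems remain. This is a back-of-the-envelope sketch, assuming the 75% figure refers to the total drive count of a cluster; the case study does not state system counts, and the function name is hypothetical.

```python
def implied_system_fraction(drives_per_system_before: int,
                            drives_per_system_after: int,
                            total_drive_reduction: float) -> float:
    """Return the fraction of systems remaining, implied by the drive figures.

    remaining_drives = (drives_after / drives_before) * remaining_systems,
    solved for remaining_systems.
    """
    if drives_per_system_before <= 0 or drives_per_system_after <= 0:
        raise ValueError("drive counts must be positive")
    if not 0 <= total_drive_reduction < 1:
        raise ValueError("total_drive_reduction must be in [0, 1)")
    remaining_drives = 1 - total_drive_reduction
    return remaining_drives * drives_per_system_before / drives_per_system_after


if __name__ == "__main__":
    # 12 and 8 drives per system and the 75% reduction are quoted in the case study;
    # treating 75% as a cluster-wide drive count reduction is an assumption.
    fraction = implied_system_fraction(12, 8, 0.75)
    print(f"implied systems remaining: {fraction:.1%}")
```

Under that assumption, roughly three eighths of the original systems would remain, which is consistent with the reported drop in racks but is an inference rather than a published figure.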

Metrics

75% energy reduction for Twitter's Hadoop clusters
50 percent faster runtimes
30 percent lower TCO

The Technology

Intel Xeon Gold 6262V processors; Intel SSD DC P4610; Intel Cache Acceleration Software; Hadoop.

Disruptive Factor

A key breakthrough for Twitter was its willingness to challenge long-held assumptions, which freed the team to make unexpected discoveries. For example, the Twitter team never anticipated that there was a way to remove 75% of the hard drives from the system without harming performance.

Shining Moment

Matt Singer shared Twitter's story at Intel's Data-Centric Day on April 2, 2019, in San Francisco (keynote replay available at https://newsroom.intel.com/video-archive/video-intels-data-centric-innovation-day-keynote-replay/).

Related links:

https://twitter.com/mattbytes/status/1113129097553514502

https://www.intel.com/content/www/us/en/service-providers/enhanced-hadoop-performance-paper.html
