
We encourage comments and look forward to hearing from you. Please note that Yahoo! may, in our sole discretion, remove comments if they are off topic, inappropriate, or otherwise violate our Terms of Service.
Hadoop is a trademark of the Apache Software Foundation.
We recently ran Hadoop on what we believe is the single largest Hadoop installation, ever:
• 4000 nodes
• 2 quad core Xeons @ 2.5ghz per node
• 4x1TB SATA disks per node
• 8G RAM per node
• 1 gigabit ethernet on each node
• 40 nodes per rack
• 4 gigabit ethernet uplinks from each rack to the core (unfortunately a misconfiguration, we usually do 8 uplinks)
• Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
• Sun Java JDK 1.6.0_05-b13
• So that's well over 30,000 cores with nearly 16PB of raw disk!
The exercise was primarily an effort to see how Hadoop works at this scale and gauge areas for improvements as we continue to push the envelope. We ran Hadoop trunk (post Hadoop 0.18.0) for these experiments.
Scaling has been a constant theme for Hadoop: we, at Yahoo!, ran a modestly sized Hadoop cluster of 20 nodes in early 2006; currently Yahoo! has several clusters around the 2000 node mark.
HDFS
The scaling issues have always been the main focus in designing any HDFS feature. Despite these efforts, attempts to scale the cluster up in the past sometimes resulted in some unpredictable effects. One of the most memorable examples was the cascading crash described in HADOOP-572, when failure of just a handful of data-nodes made the whole cluster completely dysfunctional in a matter of minutes.
This time the testing went smoothly and we observed quite decent file system performance. We did not see any startup problems; the name-node did not drown in self-serving heartbeats and block reports. Note, that heartbeat and block intervals were configured with the default values of 3 seconds and 1 hour respectively.
We ran a series of standard DFSIO benchmarks on the experimental cluster. The main purpose of this was to test how HDFS handles load of 14,000 clients performing writes or reads simultaneously.
Capacity : 14.25 PB
DFS Remaining : 10.61 PB
DFS Used : 233.44 TB
DFS Used% : 1.6 %
Live Nodes : 4049
Dead Nodes : 226
Nodes: 3561
Map Slots: 4 slots per node
Reduce Slots: 4 slots per node
DFSIO benchmark is a map-reduce job where each map task opens a file and writes to or reads from it, closes it, and measures the i/o time. There is only one reduce task, which aggregates and averages individual times and sizes. The result is the average throughput of a single i/o that is how many bytes per second was written or read on average by a single client.
In the test performed each of the 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job.
The table below compares the 4,000-node cluster performance with one of our 500-node clusters.
| 500-node cluster | 4000-node cluster | |||
|---|---|---|---|---|
| write | read | write | read | |
| number of files | 990 | 990 | 14,000 | 14,000 |
| file size (MB) | 320 | 320 | 360 | 360 |
| total MB processes | 316,800 | 316,800 | 5,040,000 | 5,040,000 |
| tasks per node | 2 | 2 | 4 | 4 |
| avg. throughput (MB/s) | 5.8 | 18 | 40 | 66 |
The 4000-node cluster throughput was 7 times better than 500’s for writes and 3.6 times better for reads even though the bigger cluster carried more (4 v/s 2 tasks) per node load than the smaller one.
Map-Reduce
The primary area of concern was the JobTracker and how it would react to this scale (we had never subjected the JobTracker to heartbeats flowing in from 4000 tasktrackers since it isn't a very common use-case when we use HoD). We were also concerned about the JobTracker's memory usage as it serviced thousands of user-jobs.
The initial results were slightly worrisome - GridMix, the standard benchmark, took nearly 2 hours to complete and we lost a fairly large number of tasktrackers since the JobTracker couldn't handle them. For good measure, we couldn't run a 6TB sort either; we kept losing tasktrackers. (We routinely run sort benchmarks which sort 1TB, 5TB and 9TB of data.)
Of course, brand-new hardware didn't help since we kept losing disks, neither did the fact that we had a misconfigured network which let us use only 4 out of the 8 uplinks available from each rack to the backbone (effectively cutting the available bandwidth in half). On the bright side memory usage didn't seem to be a problem and the JobTracker stood up to thousands of user-jobs without problems.
We then went in armed with the YourKit(TM) profiler - we needed to peek into the JobTracker's guts while it was faltering. This basically meant going through the CPU/Memory/Monitors profiles of the JobTracker with a fine-toothed comb. To cut a long story short, here are some of the curative actions we took based those observations:
• HADOOP-3863 - Fixed a bug which caused extreme contention for a single, global lock during serialization of Java strings for Hadoop RPCs.
• HADOOP-3848 - Cut down wasteful RPCs during task-initialization.
• HADOOP-3864 - Fixed locks in the JobTracker during job-initialization to prevent starvation of tasktrackers' heartbeats which caused the huge number of 'lost tasktrackers'.
• HADOOP-3875 - Fixed tasktrackers to gracefully scale heartbeat intervals when the JobTracker is under duress.
• HADOOP-3136 - Assign multiple tasks to the tasktrackers during each heartbeat, this significantly cuts down the number of heartbeats in the system.
The result of these improvements (sans HADOOP-3136 which wasn't ready in time):
1. GridMix came through slightly under an hour - a significant improvement from where we started.
2. The sort of 6TB of data completed in 37 minutes.
3. We had the cluster run more than 5000 Map-Reduce jobs in a window of around 6 hours and the JobTracker came through without any issues.
Overall, the results are very reassuring with respect to the ability of Hadoop to scale out. Of course we have only scratched the surface and have miles to go!
Konstantin V Shvachko
Arun C Murthy
Yahoo!
Apache Hadoop 0.18 was released on 8/22. This is the largest Hadoop release to date in terms of the number of patches committed (266). It also has the largest percentage of patches (20%) from contributors outside of Yahoo!. This is a great indicator of both the growth of the Hadoop community and their increasing involvement in the projects progress. The size of the release resulted in a very large number of blocking bugs in the code base. Unfortunately, this created a big delay between the feature freeze on 6/4 and the final release.
Hadoop 0.18 has many improvements in the areas of performance, scalability and reliability in addition to new features. Some of the performance improvements contributed to Hadoop’s first place in the terabyte sort benchmark. Hadoop 0.18 runs the grid mix benchmark in ~45% of the time taken by Hadoop 0.15. Lots of cool new stuff in this release, some of which is briefly described below.
• Namespace auto-recovery
The HDFS Namenode can store the filesystem image and journal in multiple locations. Upon startup it automatically consults all configured locations of its state and reads the most up to date image and journal. If all of the Namenodes copies of data are unavailable state can be (mostly) recovered from the secondary Namenode using the ‘¬-importCheckpoint’ switch. More details can be found in HADOOP-2585.
• Fast restart
Namenode re-start, particularly for large clusters has been slow. It has until Hadoop 0.17 taken, for instance, over an hour to bring up a Namenode on a 2000 node cluster. Problems included inefficiencies in block report processing, getting stuck in safe mode etc. Most of these have been addressed and Namenode startup on up to 3000 node clusters happens in <15 minutes. Discussion can be found in HADOOP-3022.
• Namespace quotas and archives
HDFS now has directory-based quotas for namespace management. A quota set on a directory limits the number of entries in that sub-tree to the quota value. Only the super-user may set or change quotas. Quotas can be manipulated programmatically or via command line utilities. HADOOP-3187 describes quotas in detail.
• RPC performance and scaling improvements
This release comes with a significant re-vamp of the RPC subsystem in the form of HADOOP-2188, HADOOP-2909 and HADOOP-2910. This includes the use of pings instead of timeouts, improvements in the management of idle connections and client throttling when the server is under load. These improvements will have the greatest effect of large clusters (>1000 nodes) and prevent jobs from failing when the Namenode or Jobtracker are under load.
• Read/write performance improvements
HADOOP-1702 reduces buffer copies while writing to HDFS and brings down Datanode CPU usage during writes by 30%. HADOOP-3164 used sendfile on the Datanodes for reads from HDFS. This results in an 80% reduction in CPU usage by the Datanodes.
• Audit logging
HADOOP-3336 introduces audit logging for HDFS. The Namenode logs all file and directory accesses. An audit log entry includes the originating IP, action requested, pathname accessed, client user and group id and existing permissions on the accessed pathname.
• Append … almost there
Lots more work to support append which unfortunately did not make the cut for Hadoop 0.18. Notable changes include lease recovery for append (HADOOP-3310), datanode generation stamp upgrades (HADOOP-3283) and lease management when open files get renamed (HADOOP-3176).
• Mounting via FUSe
The oldest open Hadoop bug, HADOOP-4 is now closed. This work enables HDFS mounting via FUSE.
• Intermediate compression that just works
Compression of intermediate outputs in Hadoop Map/Reduce has long been a source of grief. Intermediate compression, when enabled would frequently induce job failure by causing tasktrackers to run out of memory, run slow, cause disk thrash etc. HADOOP-3366 and HADOOP-2095 address these problems by introducing a different file format for shuffle data and improving memory management in reduce tasks. Intermediate compression may now be enabled with the supported codecs (gzip and lzo).
• (Single) reduce optimizations
Many important optimizations in sort/merge in reduce tasks. HADOOP-3297 improves the fetching of many small outputs. HADOOP-3365 eliminates unnecessary buffer copies during the merge phase. HADOOP-3429 improves Hadoop streaming performance by buffering the i/o paths to streaming processes.
• Archive tool
Quotas are complemented by ‘Hadoop archives’, which are a tool for users to manage their namespace consumption. A large number of files can be converted into a Hadoop archive using a Map/Reduce utility. A Hadoop archive is basically an HDFS directory with a small number of data files that consist of files from the original set concatenated together. An index stores the location of each file from the original set. Individual files in an archive can be accessed using a special URI with the ‘har’ schema. Archives and their use are discussed in HADOOP-3188.
Sameer Paranjpye
Yahoo!
One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100 byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won. Yahoo is both the largest user of Hadoop with 13,000+ nodes running hundreds of thousands of jobs a month and the largest contributor, although non-Yahoo usage and contributions are increasing rapidly.
The cluster statistics were:
The benchmark was run with Hadoop trunk (pre-0.18) with a couple of optimization patches to remove intermediate writes to disk. The sort used 1800 maps and 1800 reduces and allocated enough memory to buffers to hold the intermediate data in memory. All of the code for the benchmark has been checked in as a Hadoop example.
Owen O'Malley
Yahoo! Grid Computing Team
Apache Hadoop 0.17 is due for release any day now. Feature freeze for the release was on April 4th. The Hadoop dev community is currently actively fixing blocking issues discovered by users that have tried it out. This is a release we’re very excited about as it introduces many long awaited performance fixes to the platform. We’ve observed on the order of 30%(!) improvement in the runtime of some of the Hadoop benchmarks. As always, user feedback is invaluable and we urge folks to kick the tires on the release and help close it out. Here is a quick rundown of the important changes in the release.
Sameer Paranjpye
Yahoo! Grid Computing Team
I joined the Yahoo! Research Engineering group a few weeks ago, and I was literally blown away with the possibilities that Hadoop and Pig open for me. Immediately, I wanted to hack up something good to say thank you to all smart people that build and support such a great software.
I am convinced that Pig deserves more respect from the major text editors, so I wrote a small vim script that adds syntax highlighting for Pig files.
You can download it from vm.org site.
To install, follow instructions on the web page, and don't forget to vote! :-)
Emacs version is coming up soon (yes, I use both vim *and* emacs). It will be my project for the upcoming Yahoo! Hack Day.
Sergiy Matusevych
Yahoo! Research Engineer
It's been a few weeks since the Hadoop Summit in Santa Clara, and we hope everyone had a good time and learned a lot. Feedback has been quite good so far, but don't be shy about sending us comments.
The Yahoo! Research team has assembled a single page containing links to all the presentation slides and video from both the Hadoop Summit and the Data Intensive Computing Symposium.
As a sample, here's the opening presentation that Doug and Eric gave:
Update: Videos are currently unavailable outside of Yahoo! We're working on the problem...