Hadoop CRC32 vs Non CRC32 Study
Test the performance benefit of Ed Nevill's CRC32 patch on Hadoop benchmarks (TeraSort, TestDFSIO, HDFS copy)
- Set up CRC32 and non-CRC32 builds on a 2-node cluster
Build the Hadoop 2.7.1 release, the latest stable release at the time. Build a second 2.7.1 tree, this time with Ed Nevill's CRC32 patch applied.
- Builds should use the same HDFS
From Steve Capper: for stock Hadoop 2.7.1, Ed's CRC patch can be applied as follows:
    git checkout release-2.7.1 -b test-crc-patch
    git cherry-pick d9ac5ee2c4dcd4a108ca892af501618caaea450c
This raises a merge conflict, so run:
    git status
which shows that the CHANGES.txt file has a conflict. We do not need this file, so:
    git reset hadoop-common-project/hadoop-common/CHANGES.txt
    git checkout hadoop-common-project/hadoop-common/CHANGES.txt
These two commands unstage CHANGES.txt and then revert it to its state before the cherry-pick. Now the cherry-pick can be completed:
    git cherry-pick --continue
This leaves our own branch, test-crc-patch, with Ed's patch applied on top of 2.7.1.
- Create script for easy switching between two builds
- A basic script was created to detect which build is currently active and switch to the other. It is attached to this page (switchHadoopBuilds.sh).
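The switching logic can be sketched roughly as below. This is not the actual switchHadoopBuilds.sh attached to the page; the install paths and the use of HADOOP_HOME as the selector are assumptions for illustration.

```shell
#!/bin/sh
# Minimal sketch of the build-switching logic, assuming the two builds
# live at fixed paths and HADOOP_HOME selects the active one.
# Paths below are placeholders, not the cluster's real layout.
CRC_BUILD=/opt/hadoop-2.7.1-crc32     # assumed install path of patched build
PLAIN_BUILD=/opt/hadoop-2.7.1-plain   # assumed install path of stock build

# Flip HADOOP_HOME to point at whichever build is not currently active.
switch_build() {
    if [ "$HADOOP_HOME" = "$CRC_BUILD" ]; then
        HADOOP_HOME=$PLAIN_BUILD
    else
        HADOOP_HOME=$CRC_BUILD
    fi
    export HADOOP_HOME
    echo "Active build: $HADOOP_HOME"
}
```

After a switch, the Hadoop daemons would need to be restarted from the new HADOOP_HOME for the change to take effect.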
This script reads an environment variable to determine which build is running and starts TeraGen and TeraSort runs on that build accordingly. Options specify the input sizes to test, the number of reducers, and the number of iterations per reducer setting. The script is attached to this page (runTeraSort.sh).
Determine a file size (>100GB) and numbers of mappers and reducers that run optimally on this cluster in a short time. Run TeraSort with this configuration on both builds multiple times.
- For this study, various input sizes were tested and 128GB was chosen.
- The number of mappers was set to 8 and the number of reducers to 4.
For more details about the configuration, refer to https://wiki.linaro.org/Internal/People/NachiketBhoyar/HadoopTuningGuide
- Once the configuration was selected, it was run multiple times on both builds using the script.
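One iteration with the chosen configuration (128GB input, 8 mappers, 4 reducers) can be sketched as follows. The examples jar path and the HDFS directory names are assumptions, and the commands are echoed as a dry run rather than executed; the attached runTeraSort.sh may differ.

```shell
#!/bin/sh
# Sketch of one TeraGen/TeraSort iteration with the configuration chosen
# above. TeraGen writes 100-byte rows, so 1.28e9 rows is ~128GB.
# Jar path and HDFS paths are assumptions.
EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar
ROWS=1280000000
MAPPERS=8
REDUCERS=4

terasort_cmds() {
    # Commands are echoed (dry run); remove 'echo' to run on a live cluster.
    echo "$HADOOP_HOME/bin/hadoop jar $EXAMPLES_JAR teragen" \
         "-Dmapreduce.job.maps=$MAPPERS $ROWS /teraInput"
    echo "$HADOOP_HOME/bin/hadoop jar $EXAMPLES_JAR terasort" \
         "-Dmapreduce.job.reduces=$REDUCERS /teraInput /teraOutput"
}

terasort_cmds
```

The driver script would wrap these two commands in a loop over the requested iteration count, deleting or renaming the HDFS output directory between runs.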
- Create scripts for running Test DFSIO write and read
A script similar to the TeraSort one was created. It is attached to this page (runDFSIO.sh).
- Determine run configuration
- In DFSIO, we specify the number of files to create, and one mapper is spawned per file. Keeping the same configuration of 8 mappers means 8 files; each file is 16GB, for a total of 128GB.
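Those settings translate into TestDFSIO invocations along these lines. The jobclient test jar path is an assumption, and the commands are echoed as a dry run rather than executed; the attached runDFSIO.sh may differ.

```shell
#!/bin/sh
# Sketch of the TestDFSIO write and read runs described above
# (8 files of 16GB each, 128GB total). Jar path is an assumption.
TEST_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar

dfsio_cmds() {
    # Echoed as a dry run; remove 'echo' to run on a live cluster.
    echo "$HADOOP_HOME/bin/hadoop jar $TEST_JAR TestDFSIO -write -nrFiles 8 -size 16GB"
    echo "$HADOOP_HOME/bin/hadoop jar $TEST_JAR TestDFSIO -read -nrFiles 8 -size 16GB"
}

dfsio_cmds
```

TestDFSIO appends its throughput and average IO rate figures to a local log file, which is where the read/write comparisons between the two builds would come from.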
HDFS Copy Test
- This is a simple test in which the Hadoop source tree was copied from a local directory to HDFS using both builds. The large number of small files in the source tree makes it a good candidate for this CRC test, since a CRC is computed for every file.
/usr/bin/time $HADOOP_HOME/bin/hadoop fs -put <path-to-local-hadoop-src-dir> /user/hdadmin
- TeraSort
- The TeraSort study was inconclusive: there was no significant difference between the run times of the two builds. Sometimes the CRC32 build was slightly faster, sometimes the non-CRC32 build was.
- Test DFSIO
- It was observed that the CRC32 build was ~16% faster than the non-CRC32 build for both read and write tests.
- HDFS Copy Test
- The CRC32 build was ~14% faster than the non-CRC32 build in this case.
LEG/Engineering/BigData/CRC32vsNonCRC32Study (last modified 2016-03-21 23:11:48)