Hadoop CRC32 vs Non CRC32 Study

Goal

Measure the performance benefit of Ed Nevill's CRC32 patch (hardware-accelerated CRC32 checksumming for AArch64) to Hadoop, using the TeraSort, TestDFSIO, and HDFS copy workloads

  1. Set up CRC32 and non-CRC32 builds on a 2-node cluster
    • Build the Hadoop 2.7.1 release, which was the latest stable release at the time. Build a second copy of 2.7.1, this time with Ed Nevill's CRC32 patch applied.

    • Builds should use the same HDFS
      • From Steve Capper:
                   
              For normal Hadoop 2.7.1, one can add Ed's CRC patch via:
              
              git checkout release-2.7.1 -b test-crc-patch
              git cherry-pick d9ac5ee2c4dcd4a108ca892af501618caaea450c
              
              This will raise a merge conflict, so we....
              git status
              This will show that the CHANGES.txt file has a conflict. We do not
              need this file so we...
              
              git reset hadoop-common-project/hadoop-common/CHANGES.txt
              git checkout hadoop-common-project/hadoop-common/CHANGES.txt
              
              These two lines remove CHANGES.txt from our staged changes, then
              revert CHANGES.txt to its state before the cherry-pick.
              
              Now we can complete the cherry-pick via:
              git cherry-pick --continue
              
              
              Now we'll have our own branch, test-crc-patch, that has Ed's patch
              applied on top of 2.7.1.
  2. Create script for easy switching between the two builds
    • A basic script was created to detect which build is currently active and switch to the other one. It is attached to this page (switchHadoopBuilds.sh); a rough sketch of the idea follows.
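
      A minimal sketch of the switching idea, assuming both builds are unpacked side by side and HADOOP_HOME resolves through a symlink. The install paths and symlink location below are illustrative assumptions, not taken from the attached script.

        #!/bin/bash
        # Sketch of switchHadoopBuilds.sh: detect the active build via a
        # symlink and flip to the other one. Paths are illustrative.
        CRC_BUILD=/opt/hadoop-2.7.1-crc32   # build with Ed Nevill's patch
        PLAIN_BUILD=/opt/hadoop-2.7.1       # unpatched 2.7.1 build
        LINK=/opt/hadoop                    # HADOOP_HOME points here

        current=$(readlink "$LINK")
        if [ "$current" = "$CRC_BUILD" ]; then
            target=$PLAIN_BUILD
        else
            target=$CRC_BUILD
        fi

        # Stop the daemons of the old build, flip the link, start the new
        # build. Both builds use the same HDFS data directories, so the
        # filesystem contents carry over across the switch.
        "$current"/sbin/stop-yarn.sh
        "$current"/sbin/stop-dfs.sh
        ln -sfn "$target" "$LINK"
        "$target"/sbin/start-dfs.sh
        "$target"/sbin/start-yarn.sh
        echo "Active Hadoop build: $target"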

Workloads

TeraSort

  1. Create a script for running TeraGen and TeraSort with user-specified file sizes and numbers of mappers and reducers.

    • This script reads an environment variable to determine which build is active and starts the TeraGen and TeraSort runs on that build. Options control which input sizes to test, the number of reducers, and how many iterations to run for each reducer count. The script is attached to this page (runTeraSort.sh).

  2. Determine a file size (>100GB) and mapper and reducer counts that run well on this cluster in a reasonably short time. Run TeraSort with this configuration on both builds multiple times; the underlying commands are sketched below.
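
      One TeraGen + TeraSort iteration, as wrapped by runTeraSort.sh, looks roughly like the sketch below. The row count, mapper/reducer counts, and HDFS paths are illustrative assumptions (TeraGen rows are 100 bytes each, so 10^9 rows is ~100GB); the attached script remains the authoritative version.

        # Sketch of one TeraGen + TeraSort iteration.
        EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar

        # Generate ~100GB of input (1,000,000,000 rows x 100 bytes) with 8 mappers
        $HADOOP_HOME/bin/hadoop jar "$EXAMPLES_JAR" teragen \
            -Dmapreduce.job.maps=8 1000000000 /user/hdadmin/teraInput

        # Sort it with 8 reducers; this job's run time is what we compare
        $HADOOP_HOME/bin/hadoop jar "$EXAMPLES_JAR" terasort \
            -Dmapreduce.job.reduces=8 /user/hdadmin/teraInput /user/hdadmin/teraOutput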

TestDFSIO

  1. Create scripts for running TestDFSIO write and read
    • A script similar to the TeraSort one was created. It is attached to this page (runDFSIO.sh).

  2. Determine run configuration
    • TestDFSIO takes the number of files to create and spawns one mapper per file. We use the same configuration of 8 mappers, i.e. 8 files of 16GB each, for a total of 128GB; the commands are sketched below.
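
      The TestDFSIO invocations behind runDFSIO.sh are roughly as follows; the jar path and option names follow stock Hadoop 2.7.1, while the wrapper logic itself lives in the attached script.

        # Sketch of the TestDFSIO runs: 8 files x 16GB each = 128GB total.
        TEST_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar

        # Write test first; it creates the files the read test will use
        $HADOOP_HOME/bin/hadoop jar "$TEST_JAR" TestDFSIO -write -nrFiles 8 -size 16GB

        # Read test over the files written above
        $HADOOP_HOME/bin/hadoop jar "$TEST_JAR" TestDFSIO -read -nrFiles 8 -size 16GB

        # Remove benchmark data between iterations
        $HADOOP_HOME/bin/hadoop jar "$TEST_JAR" TestDFSIO -clean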

HDFS Copy Test

  1. This is a simple test in which the Hadoop source tree was copied from a local directory to HDFS using each build in turn. The large number of small files in the source tree makes it a good candidate for this CRC test, since a CRC is computed for every file.
    • /usr/bin/time $HADOOP_HOME/bin/hadoop fs -put <path-to-local-hadoop-src-dir> /user/hdadmin
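
      A rough sketch of timing the copy under each build, assuming the switch script from earlier and a local source tree at ~/hadoop-src (an illustrative path):

        # Time the same copy under each build; delete the copied tree
        # between runs so both builds write identical data.
        for run in 1 2; do
            /usr/bin/time $HADOOP_HOME/bin/hadoop fs -put ~/hadoop-src /user/hdadmin
            $HADOOP_HOME/bin/hadoop fs -rm -r /user/hdadmin/hadoop-src
            ./switchHadoopBuilds.sh   # flip to the other build for the next pass
        done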

Observations

  1. TeraSort

    • The TeraSort study was deemed inconclusive, as there was no significant difference between the run times of the two builds: sometimes the CRC32 build was slightly faster, sometimes the non-CRC32 build was.

  2. TestDFSIO
    • The CRC32 build was ~16% faster than the non-CRC32 build in both the read and write tests.
  3. HDFS Copy Test
    • The CRC32 build was ~14% faster than the non-CRC32 build on this test.

Attachments

  • switchHadoopBuilds.sh (build switching script)
  • runTeraSort.sh (TeraGen/TeraSort driver)
  • runDFSIO.sh (TestDFSIO driver)