LEG-12: Feasibility of Big Data Tasks

Status assessment for Hadoop and Spark on ARM: what is working vs. what is failing

OpenJDK

We have come across the following issues in OpenJDK:

  • The Linaro-supplied OpenJDK ships with an empty cacerts file.
    • This can cause SSL connections initiated by Java programs to fail, as there is no chain of trust.
    • This is expected, as Linaro's OpenJDK is not part of a distro and so does not carry a certificate bundle.
    • The simple fix is to set up a symlink from jdk-directory/jre/lib/security/cacerts to a distro-supplied cacerts file (typically something like /etc/pki/java/cacerts).
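
A quick way to confirm the problem (or that the symlink fixed it) is to load the keystore and count its entries. The following is a minimal sketch, assuming the conventional default keystore password "changeit"; the class name and default path are illustrative only:

    import java.io.FileInputStream;
    import java.security.KeyStore;

    public class CheckCacerts {
        public static void main(String[] args) throws Exception {
            // Default to the running JDK's cacerts; a path may be given as arg 0.
            String path = args.length > 0 ? args[0]
                    : System.getProperty("java.home") + "/lib/security/cacerts";
            KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
            try (FileInputStream in = new FileInputStream(path)) {
                // "changeit" is the conventional default cacerts password.
                ks.load(in, "changeit".toCharArray());
            }
            System.out.println(path + " holds " + ks.size() + " trusted certificates");
            // A count of 0 explains the SSL failures: with no trust anchors,
            // no certificate chain can be validated.
        }
    }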

Hadoop

Upstream Hadoop, both the most recent stable version (2.7.1) and the latest trunk (3.0.0), appears to build and run on AArch64. There is one issue with OpenJDK, currently under investigation, whereby null pointer exceptions are thrown in the javac compiler.

Hortonworks is a popular Hadoop distribution that, along with some extra functionality, provides a tested and easy-to-deploy platform. We are investigating the feasibility of running Hortonworks on AArch64. Whilst I am happy to report that AArch64-specific problems have very rarely turned up, the packaging logic for the HDP distro is not fully public, making it tricky to rebuild the HDP RPMs.

The majority of the AArch64 problems that I have run into include:

  • Lack of availability of older node.js releases on AArch64. Node.js is used to build webapps.
    • This is due to Google’s V8 engine only recently getting AArch64 support.
    • Some webapps have modules that contain things like syntax errors that one “gets away with” on older versions of node.js.
    • A lot of build scripts download node.js binaries pre-compiled for x86 from random places on the internet and run them. This requires some build-logic tweaking (see the sketch after this list).
    • The node.js ecosystem also contains modules (namely “phantomjs”) that attempt to download x86 code from the internet and run it.
  • Lack of availability of Docker, which is used by trunk Hadoop.
    • Thankfully, for now, one can just bypass Docker and build Hadoop with Maven.
    • Docker is being introduced on AArch64; some work is required to allow one to select AArch64 images to pull down.
  • Some native Hadoop components (e.g. Hadoop-lzo from HDP) attempt to compile code with -m32 and -m64, flags that the AArch64 toolchain does not accept; this is easily corrected.
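
As an illustration of the build-logic tweak for pre-compiled node.js binaries, the fix ultimately comes down to mapping the JVM's os.arch value onto the architecture label used in node.js tarball names rather than hardcoding an x86 one. The sketch below is hypothetical: the mapping table, tarball naming scheme, and version string are assumptions, and it presumes an AArch64 binary exists to download (locally built or from a newer upstream release):

    public class NodeTarballName {
        // Hypothetical mapping from the JVM's os.arch value to the
        // architecture label used in node.js release tarball names.
        static String nodeArch(String jvmArch) {
            switch (jvmArch) {
                case "aarch64":              return "arm64";
                case "amd64": case "x86_64": return "x64";
                default:                     return jvmArch; // pass through unknowns
            }
        }

        public static void main(String[] args) {
            String version = "v4.4.0"; // illustrative version, not a tested one
            String arch = nodeArch(System.getProperty("os.arch"));
            // On AArch64 this yields node-v4.4.0-linux-arm64.tar.gz instead of
            // the x86 name many build scripts hardcode.
            System.out.println("node-" + version + "-linux-" + arch + ".tar.gz");
        }
    }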

Spark

The Spark source consists mostly of Scala code coupled with some Java. It has one piece of native code: an R plugin written in C that computes hashes in a similar manner to Java.
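
For context, Java's string hash is the recurrence h = 31*h + c over the characters, and any C code that mirrors it has to reproduce the 32-bit wrap-around exactly so that hashes computed natively agree with those computed in the JVM on every architecture. A minimal sketch of the Java side (class and method names are illustrative):

    public class JavaStyleHash {
        // Java's String.hashCode(): h = 31*h + c, evaluated left to right
        // over the UTF-16 code units, with 32-bit integer wrap-around.
        static int hash(String s) {
            int h = 0;
            for (int i = 0; i < s.length(); i++) {
                h = 31 * h + s.charAt(i);
            }
            return h;
        }

        public static void main(String[] args) {
            String key = "example";
            // A native reimplementation must match this on AArch64 and x86 alike.
            System.out.println(hash(key) + " == " + key.hashCode());
        }
    }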
