LEG-12: Feasibility of Big Data Tasks
A status assessment of Hadoop and Spark on ARM: what is working versus what is failing.
We have come across the following issues in OpenJDK:
The Linaro-supplied OpenJDK ships with an empty cacerts file.
- This can cause SSL connections initiated by Java programs to fail, as there is no trusted certificate chain.
- This is expected, as it is not a distribution build.
A simple fix is to symlink jdk-directory/jre/lib/security/cacerts to a distro-supplied cacerts file (typically something like /etc/pki/java/cacerts).
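A minimal sketch of that fix, using temporary stand-in paths so it can run anywhere; on a real system JDK_HOME would be the Linaro OpenJDK install directory and the link target would be the distro file:

```shell
# Sketch of the symlink fix using temporary stand-in paths. On a real system
# JDK_HOME would be the Linaro OpenJDK install directory and the link target
# would be the distro-supplied file (e.g. /etc/pki/java/cacerts).
JDK_HOME=$(mktemp -d)
mkdir -p "$JDK_HOME/jre/lib/security"
: > "$JDK_HOME/jre/lib/security/cacerts"       # the empty Linaro-shipped cacerts

DISTRO_CACERTS=$(mktemp)                       # stands in for /etc/pki/java/cacerts
printf 'distro CA bundle\n' > "$DISTRO_CACERTS"

# Replace the empty file with a symlink to the distro-supplied one.
ln -sf "$DISTRO_CACERTS" "$JDK_HOME/jre/lib/security/cacerts"
```

After this, any Java program using the JDK's default truststore picks up the distro's CA bundle.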
Upstream Hadoop, both the most recent stable release (2.7.1) and the latest trunk (3.0.0), appears to build and run on AArch64. There is one OpenJDK issue, currently under investigation, whereby null pointer exceptions are thrown in the javac compiler.
Hortonworks is a popular Hadoop distribution that, along with some extra functionality, provides a tested and easy-to-deploy platform. We are investigating the feasibility of running Hortonworks on AArch64. I am happy to report that AArch64-specific problems have very rarely turned up; however, the packaging logic for the HDP distro is not fully public, which makes rebuilding the HDP RPMs tricky.
The majority of the AArch64 problems that I have run into include:
- Older versions of node.js are unavailable on AArch64. Node.js is used to build webapps.
- This is because Google’s V8 engine only recently gained AArch64 support.
- Some webapps have modules containing things like syntax errors that one “gets away with” on older versions of node.js.
- Many build scripts arbitrarily download node.js binaries pre-compiled for x86 from the internet and run them. This requires some tweaking of the build logic.
- Some node.js modules (notably “phantomjs”) likewise attempt to download x86 binaries from the internet and run them.
- Docker is not yet available on AArch64. It is used by trunk Hadoop.
- Thankfully, for now, one can simply bypass Docker and build Hadoop directly with Maven.
- Docker support for AArch64 is being introduced; some work is still required to allow one to select AArch64 images to pull down.
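A recurring symptom of the node.js/phantomjs problem above is x86 ELF binaries landing in the build tree on an AArch64 host. A small hypothetical helper for spotting them (the function name is my own; it only inspects the ELF magic and the e_machine field):

```shell
# Return success if a file is an ELF binary built for x86 (EM_386 = 0x03)
# or x86-64 (EM_X86_64 = 0x3e) -- e.g. a pre-compiled node.js or phantomjs
# blob that a build script downloaded onto an AArch64 host.
is_x86_elf() {
  # Bytes 0-3: ELF magic; bytes 18-19: e_machine (little-endian).
  magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
  [ "$magic" = "7f454c46" ] || return 1
  machine=$(od -An -tx1 -j18 -N2 "$1" | tr -d ' \n')
  [ "$machine" = "3e00" ] || [ "$machine" = "0300" ]
}
```

Running `is_x86_elf` over everything `find` turns up under a webapp's `node_modules` gives a quick list of binaries that need replacing with AArch64 builds.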
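The Maven bypass mentioned above is roughly the following build invocation; the flags are as documented in Hadoop's BUILDING.txt, the checkout directory name is hypothetical, and -Pnative additionally needs the usual native toolchain (gcc, cmake, zlib headers):

```shell
# Build a Hadoop distribution tarball directly with Maven, skipping the
# Docker-based build environment.
cd hadoop          # hypothetical trunk checkout
mvn package -Pdist,native -DskipTests -Dtar
```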
Some native Hadoop components (e.g. hadoop-lzo from HDP) attempt to compile code with -m32 and -m64; this is easily corrected.
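The usual correction is simply to drop those flags on AArch64, where GCC rejects them outright. A hypothetical sketch of the kind of sanitisation involved (the real fix edits the affected component's build files):

```shell
# Remove the x86-only -m32/-m64 flags from a flag string; GCC on AArch64
# rejects them outright. Illustrative helper -- the real fix edits the
# affected component's build files directly.
strip_mflags() {
  printf '%s\n' "$1" | sed -e 's/-m32//g; s/-m64//g; s/  */ /g; s/^ //; s/ $//'
}

strip_mflags '-O2 -m64 -fPIC'    # prints -O2 -fPIC
```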
The Spark source consists mostly of Scala code coupled with some Java. It has one piece of native code: an R plugin written in C that computes hashes in a similar manner to Java.
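For reference, the hash in question follows Java's String.hashCode recurrence, h = 31*h + c accumulated over the characters with 32-bit signed wraparound. A sketch of that recurrence (the function name is mine):

```shell
# Java's String.hashCode recurrence: h = 31*h + char, accumulated with
# 32-bit signed wraparound, so that hashes agree across the JVM and the
# native side.
java_string_hash() {
  s=$1; h=0; i=1
  while [ "$i" -le "${#s}" ]; do
    c=$(printf '%s' "$s" | cut -c "$i")                        # i-th character
    h=$(( (31 * h + $(printf '%d' "'$c")) & 4294967295 ))      # keep 32 bits
    i=$((i + 1))
  done
  if [ "$h" -ge 2147483648 ]; then h=$((h - 4294967296)); fi   # sign-extend
  echo "$h"
}

java_string_hash hello    # prints 99162322, matching "hello".hashCode() in Java
```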
LEG/Engineering/BigData/Feasibility (last modified 2016-03-22 17:39:18)