Sparkling Water (Spark + H2O)

Getting started

Requirements:

  • Spark 1.6 (Built for Hadoop 2.4 or above)
  • JDK 1.6 or above (64-bit)

To build and install Spark and JDK 1.8, follow the instructions on this page: https://wiki.linaro.org/LEG/Engineering/BigData/Building_Spark_1_6

Once you have installed Spark:

  1. Set the SPARK_HOME environment variable to the Spark install path:
    • $ export SPARK_HOME=/path/to/spark
  2. Make sure you have passwordless SSH configured for hduser:

        $ su - hduser
        $ ssh-keygen -t rsa -P ""
        $ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
        $ ssh localhost
        $ exit

  3. Verify the build with the SparkPi example
    • $ $SPARK_HOME/bin/run-example SparkPi 100

  4. Start Spark processes:

        $ $SPARK_HOME/sbin/start-all.sh
        $ $SPARK_HOME/sbin/start-history-server.sh

  • The Spark history server tries to write its log files to "/tmp/spark-events", which is owned by root. If you are running as hduser, create a directory owned by hduser:hadoop, similar to the following, and use that for logs:

        $ mkdir -p /home/hduser/spark/tmp/spark-events
        $ export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/home/hduser/spark/tmp/spark-events"

Steps to install Sparkling Water

Download the latest stable release of Sparkling Water for Spark 1.6: http://www.h2o.ai/download/sparkling-water/spark16

  1. Install Sparkling Water in the same directory where Spark is installed:
  2. $ unzip sparkling-water-1.6.3.zip
  3. $ cd sparkling-water-1.6.3/
  4. Export the cluster configuration:
    • Sparkling Water uses the MASTER environment variable to specify the cluster configuration. The default is the following:

        $ export MASTER="local-cluster[3,2,2048]"
  • Here, local-cluster[3,2,2048] points to an embedded cluster of 3 worker nodes, each with 2 cores and 2 GB of memory. If you are not sure, you can use the local wildcard instead:

        $ export MASTER="local[*]"

Run Sparkling Water examples:

  1. run-example.sh:

        $ cd sparkling-water-1.6.3/
        $ ./bin/run-example.sh

  2. sparkling-shell (see the sketch below)
  3. Craigslist demo with sparkling-shell and H2O Flow [VIDEO]
  4. Sparkling Water + YARN (see the MultiNode Setup section below)
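
For the sparkling-shell example, a minimal session looks like the following sketch. It assumes the Sparkling Water 1.6.x Scala API, where H2OContext.getOrCreate(sc) starts the embedded H2O cloud; older releases used new H2OContext(sc).start() instead.

        // Inside the REPL started by ./bin/sparkling-shell
        // (sc is the SparkContext that the shell already provides).
        import org.apache.spark.h2o._

        // Start (or attach to) the H2O cloud on top of the Spark cluster.
        val h2oContext = H2OContext.getOrCreate(sc)

        // Printing the context lists the H2O node IPs and the H2O Flow URL.
        println(h2oContext)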

MultiNode Setup

  • Set up another node by following the steps above with ODPi Hadoop, Spark 1.6 and Sparkling Water.

    To set up a Hadoop multinode cluster, follow this guide: https://wiki.linaro.org/LEG/Engineering/BigData/ODPiHadoopMultinodeClusterSetup

    To set up a Spark multinode cluster:

    1. Copy slaves.template to a slaves file (in $SPARK_HOME/conf) on both master and slave nodes
      • Add the hostnames of all the slaves
    2. On each node, set SPARK_LOCAL_IP, SPARK_MASTER_IP, SPARK_WORKER_CORES and SPARK_WORKER_MEMORY in spark-env.sh (SPARK_LOCAL_IP is that node's own IP), e.g.:

                 SPARK_LOCAL_IP=10.XX.XX.90
                 SPARK_MASTER_IP=10.XX.XX.150
                 SPARK_WORKER_CORES=2
                 SPARK_WORKER_MEMORY=8000m
    3. Start Spark from the master node by running the script $SPARK_HOME/sbin/start-all.sh
    4. Stop Spark from the master node by running the script $SPARK_HOME/sbin/stop-all.sh
    5. Go to $SPARK_HOME and enter this command:

          bin/spark-submit --class water.SparklingWaterDriver --master yarn-client \
            --num-executors 2 --driver-memory 4g --executor-memory 2g --executor-cores 4 \
            ../sparkling-water-1.6.3/assembly/build/libs/*.jar

    6. Once the above command runs successfully, open H2O Flow in a browser (the port number is printed in the command's stdout log).
    7. Click on the Admin tab and then open the water meter. Check that the water meter shows both nodes (their IPs will be shown). Alternatively, see the check below.
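
As an alternative to the UI check, the cloud size can also be queried from sparkling-shell. This is a sketch using H2O's internal water.H2O.CLOUD descriptor, which is not a stable public API and may change between releases:

        // Run inside sparkling-shell, after H2OContext.getOrCreate(sc) has started H2O.
        // The reported cloud size should match the number of nodes (2 in this setup).
        println(water.H2O.CLOUD.size())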

Troubleshooting:

Error 1:

FIX: The build used OpenJDK 8 but the runtime was OpenJDK 7, which caused the error. Switching the runtime to JDK 8 fixed it.

Error 2:

        Caused by: java.io.FileNotFoundException: /home/nbhoyar/openjdk1.8_1603/jre/lib/ext/sunpkcs11.jar (Permission denied)
                at java.util.zip.ZipFile.open(Native Method)

FIX: The JDK 8 installation directory needs to be owned by hduser:hadoop, otherwise you will get the above error.

Error 3:

        16/05/17 13:51:05 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
        16/05/17 13:51:05 WARN Worker: Failed to connect to master 10.XXX.XXX.XXX:41254
        java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@43718424 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@2136a26f[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

    and

        16/05/17 13:51:05 WARN ExecutorRunner: RpcEnv already stopped.
        java.lang.IllegalStateException: RpcEnv already stopped.

FIX: The MASTER environment variable was not set. Refer to the Sparkling Water setup steps above for details:

        $ export MASTER="local[*]"

Error 4:

FIX: The protobuf installed on the system had only the .so files and the protoc binary, but no protobuf jar. Downloading protobuf-java-2.6.1.jar from the following link made the ProtocolStringList exception go away: http://search.maven.org/remotecontent?filepath=com/google/protobuf/protobuf-java/2.6.1/protobuf-java-2.6.1.jar

Add this jar to SPARK_CLASSPATH in $SPARK_HOME/conf/spark-env.sh.

Error 5:

FIX: Add the following line (provided by Michal Malohlava from H2O) at the end of sparkling-water-1.6.3/examples/scripts/craigslistJobTitles.script.scala, before opening Flow. It replaces the "target" column with a categorical version so that H2O treats the problem as classification:

        table.replace(table.find("target"), table.vec("target").toCategoricalVec).remove()

Error 6:

        Could not find or load main class org.codehaus.plexus.classworlds.launcher.Launcher

FIX: Install Maven 3.3.9 and add M2_HOME to ~/.bash_profile (adding it to ~/.bashrc did not work).
