Building Apache Spark App with Maven

15 July 2014 by Artur Czeczko & Mateusz Fedoryszak

Recently we've been working on building Spark apps with Maven.

We've found the existing write-ups on this topic very valuable. However, we wanted something more: deployment of a Spark app that depends on additional libraries. In this note we'll show you how to do that. The complete code is available on GitHub; here we'll just highlight some snippets.

Two elements are needed. First, appropriate settings in the dependencies section of the POM file. Note that Spark and the Scala library are marked as provided, since they are already available on the cluster at runtime and shouldn't be bundled into the app jar.

<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.3</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>0.9.1</version>
    <scope>provided</scope>
  </dependency>
  <!-- that's our additional dependency -->
  <dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>14.0</version>
  </dependency>
</dependencies>
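The `mvn ... assembly:single` invocation below additionally relies on the maven-assembly-plugin being configured in the POM's build section. The full configuration is in the repo; a minimal sketch (the main-class manifest entry is illustrative, assuming your entry point is called SimpleApp) might look like this:

```xml
<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <!-- produces the *-jar-with-dependencies.jar used by submit.sh -->
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
          <manifest>
            <mainClass>SimpleApp</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>
```

Because Spark and Scala are scoped as provided, they are excluded from the assembled jar; only the extra dependencies (here, Guava) get bundled in.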

Second, a submit.sh script that makes building the classpath easier.

#!/bin/bash

# Example usage:
# ./submit.sh target/spark-intro-1.0-SNAPSHOT-jar-with-dependencies.jar -Dspark.master=local SimpleApp

JAR_FILE=$1
REMAINDER=${@:2}

# If needed, alter these paths to conform to your configuration

source /etc/spark/conf/spark-env.sh

SPARK_HOME=/usr/lib/spark; export SPARK_HOME
HADOOP_HOME=/usr/lib/hadoop; export HADOOP_HOME

# system jars:
CLASSPATH=/etc/hadoop/conf
CLASSPATH=$CLASSPATH:$HADOOP_HOME/*:$HADOOP_HOME/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-mapreduce/*:$HADOOP_HOME/../hadoop-mapreduce/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-yarn/*:$HADOOP_HOME/../hadoop-yarn/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-hdfs/*:$HADOOP_HOME/../hadoop-hdfs/lib/*
CLASSPATH=$CLASSPATH:$SPARK_HOME/assembly/lib/*

# app jar:
CLASSPATH=$CLASSPATH:"$JAR_FILE"

# ship the app jar (with its bundled dependencies) to the workers
CONFIG_OPTS="-Dspark.jars=$JAR_FILE"

java -cp "$CLASSPATH" $CONFIG_OPTS $REMAINDER

To run everything, just type:

git clone https://github.com/CeON/spark-intro.git
cd spark-intro  
mvn clean compile assembly:single
./submit.sh target/spark-intro-1.0-SNAPSHOT-jar-with-dependencies.jar -Dspark.master=local SimpleApp
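The actual SimpleApp class lives in the repo; as a hypothetical sketch (names and data are illustrative, not the repo's code), an entry point that exercises the extra Guava dependency against the Spark 0.9 API could look like this:

```scala
import scala.collection.JavaConverters._

import org.apache.spark.{SparkConf, SparkContext}
import com.google.common.base.Splitter // Guava: the extra dependency

object SimpleApp {
  def main(args: Array[String]) {
    // spark.master and spark.jars are picked up from the -D system
    // properties passed on the command line by submit.sh
    val sc = new SparkContext(new SparkConf().setAppName("SimpleApp"))

    val lines = sc.parallelize(Seq("a,b,c", "d,e"))
    // Guava's Splitter runs inside the closure on the workers; this only
    // works because spark.jars ships our jar (with Guava inside) to them
    val tokens = lines.flatMap(line => Splitter.on(',').split(line).asScala)
    println(tokens.count())
  }
}
```

The point of the exercise is the flatMap closure: without Guava on the workers' classpath it would throw a ClassNotFoundException, which is exactly what the spark.jars setting in submit.sh prevents.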

Happy coding!
