Getting started with Spark on Ubuntu in VirtualBox


In this post, we are going to set up a single-node Spark cluster.

Step 1: Install VirtualBox 6.0 or higher (by launching the .EXE file as administrator)

Step 2: Download the .ISO file for the latest "Ubuntu Desktop" (version used for this post: Ubuntu 18.04.2 LTS) from here "https://ubuntu.com/download/desktop"

Step 3: Install Ubuntu as shown in this post "https://survival8.blogspot.com/p/demonstrating-shared-folder-feature-for.html"

Step 4. Installing Java

To get started, we'll update our package list:

sudo apt-get update

Next, we'll install OpenJDK, the default Java Development Kit on Ubuntu.

sudo apt-get install default-jdk

Once the installation is complete, let's check the version.

java -version

Sample output (on Ubuntu 18.04 the default JDK is OpenJDK 11, so your output will likely differ; see the Java-version notes near the end of this post):

openjdk version "1.8.0_91"

OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)

OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

This output verifies that OpenJDK has been successfully installed.

=============================================

Step 5. Install Scala.

$ sudo apt-get install scala

=============================================

Step 6.a. Retrieve the Spark archive from here "https://spark.apache.org/downloads.html"

ashish:~/Desktop$ wget https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz

Step 6.b. Extract the archive.

ashish:~/Desktop$ tar -xzf spark-2.4.3-bin-hadoop2.7.tgz

OR

$ tar xvf spark-2.4.3-bin-hadoop2.7.tgz

=============================================

Step 7. Move the extracted files into /usr/local, the appropriate place for locally installed software.

(base) ashish@ashish-VirtualBox:~/Desktop$ sudo mv spark-2.4.3-bin-hadoop2.7 /usr/local/spark

(base) ashish@ashish-VirtualBox:~/Desktop$ ls /usr/local

bin etc games hadoop include lib man sbin share spark src

=============================================

Step 8. Next, we have to set the "JAVA_HOME" path in the "~/.bashrc" file.

To find the default Java path, fire this command:

readlink -f /usr/bin/java | sed "s:bin/java::"

Sample output (use the path this command prints on your system):

/usr/lib/jvm/java-8-openjdk-amd64/jre/

(base) ashish@ashish-VirtualBox:~/Desktop$ sudo nano ~/.bashrc

OR

(base) ashish@ashish-VirtualBox:~/Desktop$ sudo gedit ~/.bashrc

We appended the line "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/" at the end of the file (or replaced it if it was already present). Use the JDK path that "readlink" reports on your system.

=============================================

Step 9. Including the Spark binaries in the PATH.

(base) ashish@ashish-VirtualBox:~/Desktop$ sudo gedit ~/.bashrc

Append this line at the end of the file:

export PATH=$PATH:/usr/local/spark/bin

Then run "source ~/.bashrc" (or open a new terminal) for the changes to take effect.
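For reference, after Steps 8 and 9 the end of "~/.bashrc" should contain two lines along these lines (the JAVA_HOME path depends on which JDK is installed on your system):

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

export PATH=$PATH:/usr/local/spark/bin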

=============================================

Step 10. Running Spark's Scala shell.

Step 10.1. Prepare the input file.

(base) ashish@ashish-VirtualBox:~/Desktop$ cat input.txt

one two two three three three
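If you do not have this file yet, one quick way to create it with the same sample text:

echo "one two two three three three" > input.txt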

Step 10.2. Launch the Scala shell and load the input file.

$ spark-shell

scala> val inputfile = sc.textFile("input.txt")

Step 10.3. Compute the word counts.

scala> val counts = inputfile.flatMap(line=>line.split(" ")).map(word=>(word, 1)).reduceByKey(_+_)

Output:

counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:25

Step 10.4. Save the result.

scala> counts.saveAsTextFile("output")

Step 9.5 "sys.exit" or ":q" both should work. Simply "exit" was deprecated in version 2.10.x.

scala> sys.exit

Step 10.6. Viewing the output.

(base) ashish@ashish-VirtualBox:~/Desktop$ cat output/*

(one,1)

(two,2)

(three,3)

=============================================

Step 11. Running Spark's Python shell.

Note: For the Python shell, you will need Anaconda (a Python distribution) installed. Refer to this page: https://survival8.blogspot.com/p/setting-up-anaconda-on-ubuntu-in.html

(base) ashish@ashish-VirtualBox:~$ pyspark

Python 3.7.3 (default, Mar 27 2019, 22:11:17)

[GCC 7.3.0] :: Anaconda, Inc. on linux

Type "help", "copyright", "credits" or "license" for more information.

19/07/30 10:36:55 WARN Utils: Your hostname, ashish-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)

19/07/30 10:36:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address

WARNING: An illegal reflective access operation has occurred

WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/ashish/anaconda3/lib/python3.7/site-packages/pyspark/jars/spark-unsafe_2.11-2.4.3.jar) to method java.nio.Bits.unaligned()

WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform

WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations

WARNING: All illegal access operations will be denied in a future release

19/07/30 10:36:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Welcome to Spark version 2.4.3

Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)

SparkSession available as 'spark'.

>>> lines = sc.textFile("input.txt") # Create an RDD called lines

>>> lines.count() # Count the number of items in this RDD

127

>>> lines.first() # First item in this RDD, i.e. first line of input.txt

u'one two two three three three'

>>> exit()
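The word count from Step 10.3 can also be reproduced in the Python shell. A minimal sketch, assuming "input.txt" is in the working directory and pyspark is running so that "sc" is available:

>>> lines = sc.textFile("input.txt")

>>> counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

>>> counts.collect()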

Notes:

(base) ashish@slave01:~/Desktop$ cd /usr/local/spark/conf

(base) ashish@slave01:/usr/local/spark/conf$ ls

docker.properties.template log4j.properties.template slaves.template spark-env.sh

fairscheduler.xml.template metrics.properties.template spark-defaults.conf.template spark-env.sh.template

(base) ashish@slave01:/usr/local/spark/conf$ sudo gedit spark-env.sh

TO RUN SPARK IN STANDALONE MODE, MAKE THESE CHANGES IN "/usr/local/spark/conf/spark-env.sh" (see the sketch after this list):

1. Remove the commented "#SPARK_MASTER_HOST" line.

2. Set "SPARK_LOCAL_IP=127.0.0.1".

3. Export JAVA_HOME as "/usr/lib/jvm/java-8-openjdk-amd64/" (provided that you have OpenJDK 8 installed).
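After these edits, the relevant lines in "spark-env.sh" look roughly like this (a sketch; the JAVA_HOME path assumes OpenJDK 8 is installed, as covered below):

SPARK_LOCAL_IP=127.0.0.1

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/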

============================================

If you get this exception:

pyspark.sql.utils.IllegalArgumentException: 'Unsupported class file major version 55'

...then you need to downgrade the Java version to Java 8. Class file major version 55 corresponds to Java 11, and Spark 2.4.x does not support Java 11 (as of July 2019).

(base) ashish@slave01:~/Desktop$ java -version

openjdk version "11.0.3" 2019-04-16

OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)

OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)

============================================

(base) ashish@slave01:~/Desktop$ sudo apt install openjdk-8-jdk

Use 'sudo apt autoremove' to remove automatically installed packages that are no longer required.

The following NEW packages will be installed:

openjdk-8-jdk openjdk-8-jdk-headless openjdk-8-jre openjdk-8-jre-headless

============================================

(base) ashish@slave01:~/Desktop$ which java

/usr/bin/java

(base) ashish@slave01:~/Desktop$ readlink -f /usr/bin/java

/usr/lib/jvm/java-11-openjdk-amd64/bin/java

(base) ashish@slave01:~/Desktop$ sudo update-alternatives --config java

There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                             Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111       auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111       manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081       manual mode

Press [enter] to keep the current choice[*], or type selection number: 2

update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
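After switching, re-run "java -version" in a new terminal; it should now report OpenJDK 1.8.

$ java -version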

============================================

Testing >>>

Restart the terminal before testing the Spark installation, and make sure you have a file "input.txt" in the current working directory.

$ pyspark

>>> lines=sc.textFile("input.txt")

>>> lines.count()

6

============================================
