In this post, we are going to set up a single-node Spark cluster.
Step 1: Install VirtualBox 6.0 or higher (by launching the .EXE file as administrator)
Step 2: Download the .ISO file for the latest "Ubuntu Desktop" (version used for this post: Ubuntu 18.04.2 LTS) from here "https://ubuntu.com/download/desktop"
Step 3: Install Ubuntu as shown in this post "https://survival8.blogspot.com/p/demonstrating-shared-folder-feature-for.html"
Step 4. Installing Java
To get started, we'll update our package list:
sudo apt-get update
Next, we'll install OpenJDK, the default Java Development Kit on Ubuntu 18.04.
sudo apt-get install default-jdk
Once the installation is complete, let's check the version.
java -version
Output
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
This output verifies that OpenJDK has been successfully installed. (On Ubuntu 18.04 the default-jdk package pulls in a newer OpenJDK release, so your version string will differ; see the note on "Unsupported class file major version 55" towards the end of this post for why Spark 2.4.3 needs Java 8.)
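To confirm that the full JDK (compiler included), and not just the runtime, is available, you can also check javac; it should report a matching version:
javac -version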
=============================================
Step 5. Install Scala.
$ sudo apt-get install scala
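To verify the Scala installation, check the version (the exact version reported depends on the Ubuntu package):
$ scala -version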
=============================================
Step 6.a. Retrieve the Spark archive from here "https://spark.apache.org/downloads.html"
ashish:~/Desktop$ wget https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
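Note: the "closer.lua" link is a mirror-selection page, so if wget ends up saving an HTML file instead of the .tgz, copy the direct mirror URL shown on the downloads page. Older releases are also kept on the Apache archive; assuming 2.4.3 is still hosted there, the direct download would look like this:
ashish:~/Desktop$ wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz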
Step 6.b. Extract the archive.
ashish:~/Desktop$ tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
OR
$ tar xvf spark-2.4.3-bin-hadoop2.7.tgz
=============================================
Step 7. Move the extracted files into /usr/local, the appropriate place for locally installed software.
(base) ashish@ashish-VirtualBox:~/Desktop$ sudo mv spark-2.4.3-bin-hadoop2.7 /usr/local/spark
(base) ashish@ashish-VirtualBox:~/Desktop$ ls /usr/local
bin etc games hadoop include lib man sbin share spark src
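To double-check the move, list the Spark directory; you should see the usual distribution layout with directories such as bin, conf, jars, python and sbin:
(base) ashish@ashish-VirtualBox:~/Desktop$ ls /usr/local/spark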
=============================================
Step 8. Next, we have to set the "JAVA_HOME" path in the "~/.bashrc" file.
To find the default Java path, run this command:
readlink -f /usr/bin/java | sed "s:bin/java::"
Output
/usr/lib/jvm/java-8-openjdk-amd64/jre/
(base) ashish@ashish-VirtualBox:~/Desktop$ sudo nano ~/.bashrc
OR
(base) ashish@ashish-VirtualBox:~/Desktop$ sudo gedit ~/.bashrc
We appended this line at the end of the file (or replaced it if it was already present):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
Note that Spark 2.4.3 runs on Java 8, so JAVA_HOME should point at the Java 8 installation (the path returned by readlink, without the trailing "jre/") rather than at OpenJDK 11; see the note on "Unsupported class file major version 55" towards the end of this post.
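After saving the file, reload it and confirm that the variable is set; echo should print the path you just exported:
(base) ashish@ashish-VirtualBox:~/Desktop$ source ~/.bashrc
(base) ashish@ashish-VirtualBox:~/Desktop$ echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64/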
=============================================
Step 9. Adding the Spark binaries to the PATH.
(base) ashish@ashish-VirtualBox:~/Desktop$ sudo gedit ~/.bashrc
Append this line at the end of the file:
export PATH=$PATH:/usr/local/spark/bin
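Reload ~/.bashrc once more and check that the Spark binaries are now resolved from the new PATH entry:
$ source ~/.bashrc
$ which spark-shell
/usr/local/spark/bin/spark-shell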
=============================================
Step 10. Running Spark's Scala shell.
Step 10.1. Prepare the input file.
(base) ashish@ashish-VirtualBox:~/Desktop$ cat input.txt
one two two three three three
Step 10.2. Start the Scala shell and load the input file.
$ spark-shell
scala> val inputfile = sc.textFile("input.txt")
Step 10.3. Run the word count.
scala> val counts = inputfile.flatMap(line=>line.split(" ")).map(word=>(word, 1)).reduceByKey(_+_)
Output:
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:...
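If you want to inspect the result before writing it to disk, collect() brings the pairs back into the shell (fine here because the dataset is tiny):
scala> counts.collect().foreach(println)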
Step 10.4. Save the result.
scala> counts.saveAsTextFile("output")
Step 10.5. Exit the shell. Both "sys.exit" and ":q" work; plain "exit" was deprecated in version 2.10.x.
scala> sys.exit
Step 10.6. Viewing the output.
(base) ashish@ashish-VirtualBox:~/Desktop$ cat output/*
(one,1)
(two,2)
(three,3)
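Note that saveAsTextFile creates a directory, not a single file: "output" contains a _SUCCESS marker plus one part-NNNNN file per partition, which is why the cat command above uses the "output/*" wildcard.
(base) ashish@ashish-VirtualBox:~/Desktop$ ls output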
=============================================
Step 11. Running Spark's Python shell.
Note: For the Python shell, you will need Anaconda (a Python distribution) installed. Refer to this page: https://survival8.blogspot.com/p/setting-up-anaconda-on-ubuntu-in.html
(base) ashish@ashish-VirtualBox:~$ pyspark
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
19/07/30 10:36:55 WARN Utils: Your hostname, ashish-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
19/07/30 10:36:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/ashish/anaconda3/lib/python3.7/site-packages/pyspark/jars/spark-unsafe_2.11-2.4.3.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/07/30 10:36:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to Spark version 2.4.3
Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)
SparkSession available as 'spark'.
>>> lines = sc.textFile("input.txt") # Create an RDD called lines
>>> lines.count() # Count the number of items in this RDD
127
>>> lines.first() # First item in this RDD, i.e. first line of input.txt
u'one two two three three three'
>>> exit()
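For symmetry with the Scala word count in Step 10, the same computation can be done in the Python shell; a minimal sketch, run inside pyspark and reusing the "lines" RDD defined above:
>>> words = lines.flatMap(lambda line: line.split(" "))
>>> counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
>>> counts.collect()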
Notes:
(base) ashish@slave01:~/Desktop$ cd /usr/local/spark/conf
(base) ashish@slave01:/usr/local/spark/conf$ ls
docker.properties.template log4j.properties.template slaves.template spark-env.sh
fairscheduler.xml.template metrics.properties.template spark-defaults.conf.template spark-env.sh.template
(base) ashish@slave01:/usr/local/spark/conf$ sudo gedit spark-env.sh
To run Spark in standalone mode, make the following changes in "/usr/local/spark/conf/spark-env.sh" (a minimal sketch of the resulting file is shown below):
1. Uncomment the "SPARK_MASTER_HOST" entry (remove the leading "#").
2. And set "SPARK_LOCAL_IP=127.0.0.1".
3. Export JAVA_HOME as "/usr/lib/jvm/java-8-openjdk-amd64/" (provided that you have OpenJDK 8 installed).
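For reference, a minimal spark-env.sh based on the three notes above might look like this; the master host is set to the loopback address since this is a single-node setup, and the JAVA_HOME path assumes OpenJDK 8 is installed at the usual Ubuntu location:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export SPARK_MASTER_HOST=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1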
============================================
If you get this exception:
pyspark.sql.utils.IllegalArgumentException: 'Unsupported class file major version 55'
...then you need to switch to Java 8: class file major version 55 corresponds to Java 11, and Spark 2.4.x does not support Java 11 as of July 2019.
(base) ashish@slave01:~/Desktop$ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)
============================================
(base) ashish@slave01:~/Desktop$ sudo apt install openjdk-8-jdk
Use 'sudo apt autoremove' to remove automatically installed packages that are no longer required.
The following NEW packages will be installed:
openjdk-8-jdk openjdk-8-jdk-headless openjdk-8-jre openjdk-8-jre-headless
============================================
(base) ashish@slave01:~/Desktop$ which java
/usr/bin/java
(base) ashish@slave01:~/Desktop$ readlink -f /usr/bin/java
/usr/lib/jvm/java-11-openjdk-amd64/bin/java
(base) ashish@slave01:~/Desktop$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority Status
------------------------------------------------------------
* 0 /usr/lib/jvm/java-11-openjdk-amd64/bin/java 1111 auto mode
1 /usr/lib/jvm/java-11-openjdk-amd64/bin/java 1111 manual mode
2 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java 1081 manual mode
Press [enter] to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
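After switching the alternative, java -version should now report a 1.8.x build. Also make sure JAVA_HOME (in ~/.bashrc and spark-env.sh) points at the Java 8 path so that the shell and Spark agree:
(base) ashish@slave01:~/Desktop$ java -version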
============================================
Testing >>>
Restart the terminal before testing the Spark installation, and make sure you have a file "input.txt" in the current working directory.
$ pyspark
>>> lines=sc.textFile("input.txt")
>>> lines.count()
6
============================================