Saturday, October 8, 2022

Installing PySpark on Ubuntu and Basic Testing (2022 Oct 8)

Contents of env.yml File

name: mh
channels:
  - conda-forge
dependencies:
  - python==3.9
  - pandas
  - pyspark
  - pip

I am keeping the number of packages in dependencies to a bare minimum.

With the original 13 dependencies I had tried earlier, conda took over two hours to resolve the environment; this pared-down list avoids that.

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ conda env create -f env.yml
(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ conda activate mh
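Once the environment is active, a quick sanity check (my own addition, not part of the original session) confirms the minimal dependency set is importable:

# Hypothetical sanity check for the freshly created "mh" environment.
# Run inside `conda activate mh`; all imports come straight from env.yml.
import sys
import pandas as pd
import pyspark

print(sys.version)          # expect Python 3.9.x
print(pd.__version__)       # pandas pulled in from conda-forge
print(pyspark.__version__)  # 3.3.0 at the time of writing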

Testing

Error Prior to Java Installation

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ python
Python 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import pyspark
>>> pyspark.__version__
'3.3.0'
>>> import os
>>> os.environ['PYTHONPATH']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'PYTHONPATH'
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'SparkContext' is not defined
>>> from pyspark import SparkContext
>>> sc = SparkContext.getOrCreate()
JAVA_HOME is not set
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/context.py", line 483, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/context.py", line 195, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/context.py", line 417, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/java_gateway.py", line 106, in launch_gateway
    raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
>>>
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ java
Command 'java' not found, but can be installed with:
sudo apt install default-jre              # version 2:1.11-72build2, or
sudo apt install openjdk-11-jre-headless  # version 11.0.16+8-0ubuntu1~22.04
sudo apt install openjdk-17-jre-headless  # version 17.0.3+7-0ubuntu0.22.04.1
sudo apt install openjdk-18-jre-headless  # version 18~36ea-1
sudo apt install openjdk-8-jre-headless   # version 8u312-b07-0ubuntu1
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ sudo apt install openjdk-8-jre-headless
...
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ java -version
openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~22.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ echo $JAVA_HOME
(empty output: JAVA_HOME is not set)
(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ which java
/usr/bin/java
(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ readlink -f /usr/bin/java
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
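The root cause of the "Java gateway process exited" error is simply that PySpark cannot find a JVM. A minimal pre-flight check along these lines (my own sketch, not part of the original session) would surface the problem before SparkContext is even attempted:

# Pre-flight check for the "Java gateway process exited" error.
# Spark launches a JVM, so either JAVA_HOME must point at a Java
# install or a `java` binary must be on PATH.
import os
import shutil

java_home = os.environ.get("JAVA_HOME")
java_bin = shutil.which("java")

print("JAVA_HOME:", java_home or "<not set>")
print("java on PATH:", java_bin or "<not found>")

if not (java_home or java_bin):
    raise SystemExit("No Java runtime found; install one first, e.g. "
                     "sudo apt install openjdk-8-jre-headless")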

Update JAVA_HOME

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ sudo nano ~/.bashrc

Add the following line at the end of the file:

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ source ~/.bashrc
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ python
Python 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/08 13:29:50 WARN Utils: Your hostname, ashish-Lenovo-ideapad-130-15IKB resolves to a loopback address: 127.0.1.1; using 192.168.1.129 instead (on interface wlp2s0)
22/10/08 13:29:50 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/08 13:29:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  warnings.warn()
>>> sdf = sqlCtx.createDataFrame(df)
/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():
/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():
>>> sdf.show()
+----+----+
|col1|col2|
+----+----+
|val1|val2|
+----+----+

>>> exit()
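The SQLContext route works, but the FutureWarning above already names its replacement. The same round-trip test with the modern SparkSession API would look like this (a sketch of the recommended approach; the app name and variable names are mine):

# Same pandas -> Spark -> show() round trip via SparkSession,
# which supersedes SQLContext as of Spark 3.x.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("basic-test").getOrCreate()

df = pd.DataFrame({"col1": ["val1"], "col2": ["val2"]})
sdf = spark.createDataFrame(df)  # convert the pandas DataFrame
sdf.show()

spark.stop()

The iteritems FutureWarning, for its part, comes from PySpark 3.3.0 internally calling a pandas method that pandas has deprecated; it is harmless for this test and goes away with newer PySpark/pandas combinations.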
Tags: Technology, Spark
