Monday, December 12, 2022

Installing PySpark on Ubuntu And Basic Testing (2022 Dec 13)

Contents of pyspark.yml

name: pyspark
channels:
  - conda-forge
dependencies:
  - python==3.9
  - pandas
  - pyspark
  - pip

Installation

Create and activate the environment:

$ conda env create -f pyspark.yml
$ conda activate pyspark

PySpark needs a Java runtime. Check whether one is installed:

(pyspark) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ java
Command 'java' not found, but can be installed with:
sudo apt install openjdk-11-jre-headless  # version 11.0.17+8-1ubuntu2, or
sudo apt install default-jre              # version 2:1.11-72build2
sudo apt install openjdk-18-jre-headless  # version 18.0.1+10-1
sudo apt install openjdk-17-jre-headless  # version 17.0.5+8-2ubuntu1
sudo apt install openjdk-19-jre-headless  # version 19.0.1+10-1
sudo apt install openjdk-8-jre-headless   # version 8u352-ga-1~22.10

Install Java 8 and confirm the version:

$ sudo apt install openjdk-8-jre-headless

(pyspark) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ java -version
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (build 1.8.0_352-8u352-ga-1~22.10-b08)
OpenJDK 64-Bit Server VM (build 25.352-b08, mixed mode)

Find where the java binary actually lives:

(pyspark) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ which java
/usr/bin/java

(pyspark) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ readlink -f /usr/bin/java
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

Set JAVA_HOME to the installation directory found above:

$ sudo nano ~/.bashrc

Add the following line at the end of the file:

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

Reload the shell configuration (for example with "source ~/.bashrc", or by opening a new terminal) so that the variable takes effect, then verify it:

(pyspark) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64

Install JupyterLab and register the environment as a Jupyter kernel:

(pyspark) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ conda install ipykernel jupyterlab -c conda-forge

(pyspark) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ python -m ipykernel install --user --name pyspark
Installed kernelspec pyspark in /home/ashish/.local/share/jupyter/kernels/pyspark
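The JAVA_HOME value used above is the resolved java path with the trailing jre/bin/java removed. As a rough sketch of that derivation, the helper below (java_home_from_binary is a hypothetical name, not part of any library) takes the output of readlink -f and returns the directory to export. It assumes the usual OpenJDK layout of .../bin/java, optionally nested under a jre/ directory as on Java 8.

```python
from pathlib import PurePosixPath


def java_home_from_binary(resolved_java):
    """Strip the trailing bin/java (and a jre/ level, if present) from a resolved java path."""
    path = PurePosixPath(resolved_java)
    if path.name != "java" or path.parent.name != "bin":
        raise ValueError(f"not a .../bin/java path: {resolved_java}")
    home = path.parent.parent  # drop bin/java
    if home.name == "jre":     # Java 8 packages nest the runtime under jre/
        home = home.parent
    return str(home)


# Path reported by `readlink -f /usr/bin/java` in the transcript above.
print(java_home_from_binary("/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java"))
# prints /usr/lib/jvm/java-8-openjdk-amd64
```

On newer JDK packages that have no jre/ level, the same function simply drops bin/java.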

Basic Testing

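Confirm the installed PySpark version:

(pyspark) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop$ pip show pyspark
Name: pyspark
Version: 3.3.1
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: dev@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /home/ashish/anaconda3/envs/pyspark/lib/python3.9/site-packages
Requires: py4j
Required-by:

A minimal test: convert a pandas DataFrame to a Spark DataFrame and display it.

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

# A one-row pandas DataFrame to convert.
df = pd.DataFrame({
    "col1": ["val1"],
    "col2": ["val2"]
})

# Start (or reuse) a local Spark context.
sc = SparkContext.getOrCreate()
# SQLContext works here but has been deprecated since Spark 3.0;
# SparkSession.builder.getOrCreate() is the current entry point.
sqlCtx = SQLContext(sc)

# Convert the pandas DataFrame to a Spark DataFrame and print it.
sdf = sqlCtx.createDataFrame(df)
sdf.show()

Output:

+----+----+
|col1|col2|
+----+----+
|val1|val2|
+----+----+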
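If the test above fails to start, the cause is often a missing prerequisite rather than Spark itself. The following is a small pre-flight sketch, using only the Python standard library, that checks the pieces this post sets up: the Python version pinned in pyspark.yml, a java binary on the PATH, JAVA_HOME, and whether pyspark and pandas can be found. The function name preflight and the check names are made up for illustration.

```python
import importlib.util
import os
import shutil
import sys


def preflight():
    """Return a dict of simple True/False checks for the setup in this post."""
    java_home = os.environ.get("JAVA_HOME", "")
    return {
        # pyspark.yml pins python==3.9
        "python_is_3_9": sys.version_info[:2] == (3, 9),
        # Spark launches a JVM, so a java binary must be on the PATH
        "java_on_path": shutil.which("java") is not None,
        "java_home_set": bool(java_home),
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
        # find_spec locates a package without importing it
        "pyspark_found": importlib.util.find_spec("pyspark") is not None,
        "pandas_found": importlib.util.find_spec("pandas") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'check this'}")
```

Run it inside the activated pyspark environment (or the Jupyter kernel registered for it); any line marked "check this" points at the step to revisit.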
