1: Checking Environment Variables For Java, Scala and Python
Java
(base) C:\Users\ashish>java -version
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (Temurin)(build 25.322-b06, mixed mode)
(base) C:\Users\ashish>where java
C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin\java.exe
C:\Program Files\Zulu\zulu-17-jre\bin\java.exe
C:\Program Files\Zulu\zulu-17\bin\java.exe
(base) C:\Users\ashish>echo %JAVA_HOME%
C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot
Scala
(base) C:\Users\ashish>scala -version
Scala code runner version 3.2.0 -- Copyright 2002-2022, LAMP/EPFL
Python
(base) C:\Users\ashish>python --version
Python 3.9.12
(base) C:\Users\ashish>where python
C:\Users\ashish\Anaconda3\python.exe
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe
(base) C:\Users\ashish>echo %PYSPARK_PYTHON%
C:\Users\ashish\Anaconda3\python.exe
(base) C:\Users\ashish>echo %PYSPARK_DRIVER_PYTHON%
C:\Users\ashish\Anaconda3\python.exe
(base) C:\Users\ashish>echo %PYTHONPATH%
C:\Users\ashish\Anaconda3\envs\mh
Spark Home and Hadoop Home
(base) C:\Users\ashish>echo %SPARK_HOME%
D:\progfiles\spark-3.3.1-bin-hadoop3
(base) C:\Users\ashish>echo %HADOOP_HOME%
D:\progfiles\spark-3.3.1-bin-hadoop3\hadoop
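To sanity-check these values programmatically, here is a minimal Python sketch (the variable names are the ones checked above; the paths are from my machine, adjust for yours) that confirms each variable is set and points to an existing location:

import os

# Variables the rest of this setup relies on (names from the checks above).
required = ["JAVA_HOME", "SPARK_HOME", "HADOOP_HOME",
            "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"]

for name in required:
    value = os.environ.get(name)
    if value is None:
        print(f"{name} is NOT set")
    else:
        print(f"{name} = {value} (exists on disk: {os.path.exists(value)})")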
2: Checking Properties Like Hostname, Spark Workers, and PATH Variable Values
(base) D:\progfiles\spark-3.3.1-bin-hadoop3\conf>hostname
CS3L
(base) D:\progfiles\spark-3.3.1-bin-hadoop3\conf>type workers
CS3L
Note the following entry in the PATH value; it is added by Coursier and is used for Scala (a quick cross-check script follows the full PATH listing below):
C:\Users\ashish\AppData\Local\Coursier\data\bin;
(base) D:\progfiles\spark-3.3.1-bin-hadoop3\conf>echo %PATH%
C:\Users\ashish\Anaconda3;
C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;
C:\Users\ashish\Anaconda3\Library\usr\bin;
C:\Users\ashish\Anaconda3\Library\bin;
C:\Users\ashish\Anaconda3\Scripts;
C:\Users\ashish\Anaconda3\bin;
C:\Users\ashish\Anaconda3\condabin;
C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin;
C:\Program Files\Zulu\zulu-17-jre\bin;
C:\Program Files\Zulu\zulu-17\bin;
C:\windows\system32;C:\windows;
C:\windows\System32\Wbem;
C:\windows\System32\WindowsPowerShell\v1.0;
C:\windows\System32\OpenSSH;
C:\Program Files\Git\cmd;
C:\Users\ashish\Anaconda3;
C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;
C:\Users\ashish\Anaconda3\Library\usr\bin;
C:\Users\ashish\Anaconda3\Library\bin;
C:\Users\ashish\Anaconda3\Scripts;
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps;
C:\Users\ashish\AppData\Local\Programs\Microsoft VS Code\bin;
D:\progfiles\spark-3.3.1-bin-hadoop3\bin;
C:\Users\ashish\AppData\Local\Coursier\data\bin;
.
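As a quick cross-check, the sketch below (an illustrative example; it assumes SPARK_HOME is set as shown earlier) verifies that %SPARK_HOME%\bin is on PATH and that the machine's hostname matches the entry in conf\workers:

import os
import socket

spark_home = os.environ["SPARK_HOME"]
spark_bin = os.path.join(spark_home, "bin")

# Is the Spark bin folder on PATH? (case-insensitive compare on Windows)
path_entries = [p.strip().lower() for p in os.environ["PATH"].split(os.pathsep)]
print("SPARK_HOME\\bin on PATH:", spark_bin.lower() in path_entries)

# Does the hostname match what conf\workers lists?
with open(os.path.join(spark_home, "conf", "workers")) as f:
    workers = [line.strip() for line in f if line.strip()]
print("hostname:", socket.gethostname(), "| workers file:", workers)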
3: Turn off Windows Defender Firewall (for SSH to work)
4.1: Create inbound rule for allowing SSH connections on port 22
4.2: Create outbound rule for allowing SSH connections on port 22
5: Checking SSH Properties
C:\Users\ashish\.ssh>type known_hosts
192.168.1.151 ecdsa-sha2-nistp256 A***EhMzjgo=
ashishlaptop ecdsa-sha2-nistp256 A***EhMzjgo=
C:\Users\ashish\.ssh>ipconfig
Windows IP Configuration
Ethernet adapter Ethernet:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . : ad.itl.com
Ethernet adapter Ethernet 2:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . : ad.itl.com
Wireless LAN adapter Wi-Fi:
Connection-specific DNS Suffix . :
IPv6 Address. . . . . . . . . . . : 2401:4900:47f5:1737:b1b2:6d59:f669:1b96
Temporary IPv6 Address. . . . . . : 2401:4900:47f5:1737:88e0:bacc:7490:e794
Link-local IPv6 Address . . . . . : fe80::b1b2:6d59:f669:1b96%13
IPv4 Address. . . . . . . . . . . : 192.168.1.101
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : fe80::44da:eaff:feb6:7061%13
192.168.1.1
Ethernet adapter Bluetooth Network Connection:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . :
C:\Users\ashish\.ssh>
C:\Users\ashish\.ssh>dir
Volume in drive C is OSDisk
Volume Serial Number is 88CC-6EA2
Directory of C:\Users\ashish\.ssh
10/26/2022 03:45 PM <DIR> .
10/26/2022 03:45 PM <DIR> ..
10/26/2022 03:45 PM 574 authorized_keys
10/26/2022 03:27 PM 2,635 id_rsa
10/26/2022 03:27 PM 593 id_rsa.pub
10/26/2022 03:46 PM 351 known_hosts
4 File(s) 4,153 bytes
2 Dir(s) 78,491,791,360 bytes free
C:\Users\ashish\.ssh>type authorized_keys
ssh-rsa A***= ashish@ashishlaptop
C:\Users\ashish\.ssh>
6: Error We Faced While Trying to Run Spark's start-all.sh in Git Bash
$ ./sbin/start-all.sh
ps: unknown option -- o
Try `ps --help' for more information.
hostname: unknown option -- f
starting org.apache.spark.deploy.master.Master, logging to D:\progfiles\spark-3.3.1-bin-hadoop3/logs/spark--org.apache.spark.deploy.master.Master-1-CS3L.out
ps: unknown option -- o
...
Try `ps --help' for more information.
failed to launch: nice -n 0 D:\progfiles\spark-3.3.1-bin-hadoop3/bin/spark-class org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080
ps: unknown option -- o
Try `ps --help' for more information.
Spark Command: C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin\java -cp D:\progfiles\spark-3.3.1-bin-hadoop3/conf\;D:\progfiles\spark-3.3.1-bin-hadoop3\jars\* -Xmx1g org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080
========================================
"C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin\java" -cp "D:\progfiles\spark-3.3.1-bin-hadoop3/conf\;D:\progfiles\spark-3.3.1-bin-hadoop3\jars\*" -Xmx1g org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080
D:\progfiles\spark-3.3.1-bin-hadoop3/bin/spark-class: line 96: CMD: bad array subscript
full log in D:\progfiles\spark-3.3.1-bin-hadoop3/logs/spark--org.apache.spark.deploy.master.Master-1-CS3L.out
ps: unknown option -- o
Try `ps --help' for more information.
hostname: unknown option -- f
Try 'hostname --help' for more information.
ps: unknown option -- o
Try `ps --help' for more information.
CS3L: ssh: connect to host cs3l port 22: Connection refused
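The last line above ("Connection refused" on port 22) means no SSH server was reachable on the worker host. A small, hedged Python check like the following (hostname CS3L is the one from the workers file) tells you whether an SSH daemon is actually listening before you retry start-all.sh:

import socket

host, port = "CS3L", 22  # the worker hostname and SSH port used in this setup

try:
    with socket.create_connection((host, port), timeout=5):
        print(f"Port {port} on {host} is open; an SSH server appears to be listening.")
except OSError as exc:
    print(f"Could not connect to {host}:{port} -> {exc}")
    print("Start the OpenSSH Server service on Windows and re-check the firewall rules.")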
7: Spark-submit failing in CMD without HADOOP_HOME set
(base) D:\progfiles\spark-3.3.1-bin-hadoop3\bin>spark-submit --master local examples/src/main/python/pi.py 100
22/11/01 11:59:42 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/11/01 11:59:42 INFO ShutdownHookManager: Shutdown hook called
22/11/01 11:59:42 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-521c5e3c-beea-4f67-a1e6-71dd4c5c308c
8: Instructions for Configuring HADOOP_HOME for Spark on Windows
Installing winutils
Let’s download winutils.exe and configure our Spark installation to find it.
a) Create a hadoop\bin folder inside the SPARK_HOME folder.
b) Download the winutils.exe for the Hadoop version against which your Spark installation was built. In my case the Hadoop version (3) is mentioned in the Spark package name, so I downloaded the winutils.exe for Hadoop 3.0.0 and copied it to the hadoop\bin folder inside the SPARK_HOME folder.
c) Create a system environment variable in Windows called SPARK_HOME that points to the Spark installation folder (here, D:\progfiles\spark-3.3.1-bin-hadoop3).
d) Create another system environment variable in Windows called HADOOP_HOME that points to the hadoop folder inside the SPARK_HOME folder.
Since the hadoop folder is inside the SPARK_HOME folder, it is better to set HADOOP_HOME to the value %SPARK_HOME%\hadoop. That way you don’t have to change HADOOP_HOME if SPARK_HOME is updated. A quick verification sketch is shown after the reference below.
Ref: Download winutils.exe From GitHub
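After these steps, a quick Python check (a sketch based on the layout described above) confirms that HADOOP_HOME is set and that winutils.exe is where Spark expects it:

import os

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)

if hadoop_home:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe present:", os.path.isfile(winutils))
else:
    print("HADOOP_HOME is not set; Spark will keep logging the winutils.exe warning.")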
9: Error if PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are not set
A standard way of setting environment variables, including PYSPARK_PYTHON, is the conf/spark-env.sh file. Spark ships with a template (conf/spark-env.sh.template) that explains the most common options.
It is a normal Bash script, so you can use it the same way you would use .bashrc. On Windows, the equivalent file is conf\spark-env.cmd.
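Another option on Windows is to set the interpreters from Python itself before the SparkSession is created. A minimal sketch, assuming the Anaconda interpreter path shown in the 'where python' output earlier:

import os

# Point both the workers and the driver at the same interpreter
# (path taken from the 'where python' output above; adjust for your machine).
os.environ["PYSPARK_PYTHON"] = r"C:\Users\ashish\Anaconda3\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\ashish\Anaconda3\python.exe"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
# Run a tiny RDD action so a Python worker process is actually launched.
print(spark.sparkContext.parallelize(range(5)).sum())
spark.stop()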
9.1
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/11/01 12:21:02 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:21:02 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-18bce9f2-f7e8-4d04-843b-8c04a27675a7
9.2
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>where python
C:\Users\ashish\Anaconda3\python.exe
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
9.3
IN CMD:
D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md
Exception in thread "main" java.io.IOException: Cannot run program "C:\Users\ashish\Anaconda3": CreateProcess error=5, Access is denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: CreateProcess error=5, Access is denied
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:453)
at java.lang.ProcessImpl.start(ProcessImpl.java:139)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 14 more
22/11/01 12:32:10 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:32:10 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-46c29762-efc3-425a-98fd-466b1500aa5b
9.4
IN ANACONDA:
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md
Exception in thread "main" java.io.IOException: Cannot run program "C:\Users\ashish\Anaconda3": CreateProcess error=5, Access is denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: CreateProcess error=5, Access is denied
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:453)
at java.lang.ProcessImpl.start(ProcessImpl.java:139)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 14 more
22/11/01 12:34:25 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:34:25 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-7480dad3-92f5-41cb-98ec-e02fee0eaff5
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>
10: Successful Run of Word Count Program
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md
22/11/01 12:37:22 INFO SparkContext: Running Spark version 3.3.1
22/11/01 12:37:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/01 12:37:23 INFO ResourceUtils: ==============================================================
22/11/01 12:37:23 INFO ResourceUtils: No custom resources configured for spark.driver.
22/11/01 12:37:23 INFO ResourceUtils: ==============================================================
22/11/01 12:37:23 INFO SparkContext: Submitted application: PythonWordCount
22/11/01 12:37:23 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/11/01 12:37:23 INFO ResourceProfile: Limiting resource is cpu
22/11/01 12:37:23 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/11/01 12:37:23 INFO SecurityManager: Changing view acls to: ashish
22/11/01 12:37:23 INFO SecurityManager: Changing modify acls to: ashish
22/11/01 12:37:23 INFO SecurityManager: Changing view acls groups to:
22/11/01 12:37:23 INFO SecurityManager: Changing modify acls groups to:
22/11/01 12:37:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set()
22/11/01 12:37:24 INFO Utils: Successfully started service 'sparkDriver' on port 52785.
22/11/01 12:37:24 INFO SparkEnv: Registering MapOutputTracker
22/11/01 12:37:24 INFO SparkEnv: Registering BlockManagerMaster
22/11/01 12:37:24 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/11/01 12:37:24 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/11/01 12:37:24 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/11/01 12:37:24 INFO DiskBlockManager: Created local directory at C:\Users\ashish\AppData\Local\Temp\blockmgr-09454369-56f1-4fae-a0d9-b5f19b6a8bd1
22/11/01 12:37:24 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
22/11/01 12:37:24 INFO SparkEnv: Registering OutputCommitCoordinator
22/11/01 12:37:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/11/01 12:37:25 INFO Executor: Starting executor ID driver on host CS3L.ad.itl.com
22/11/01 12:37:25 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
22/11/01 12:37:25 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52828.
22/11/01 12:37:25 INFO NettyBlockTransferService: Server created on CS3L.ad.itl.com:52828
22/11/01 12:37:25 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/11/01 12:37:25 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, CS3L.ad.itl.com, 52828, None)
22/11/01 12:37:25 INFO BlockManagerMasterEndpoint: Registering block manager CS3L.ad.itl.com:52828 with 366.3 MiB RAM, BlockManagerId(driver, CS3L.ad.itl.com, 52828, None)
22/11/01 12:37:25 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, CS3L.ad.itl.com, 52828, None)
22/11/01 12:37:25 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, CS3L.ad.itl.com, 52828, None)
22/11/01 12:37:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/11/01 12:37:26 INFO SharedState: Warehouse path is 'file:/D:/progfiles/spark-3.3.1-bin-hadoop3/spark-warehouse'.
22/11/01 12:37:27 INFO InMemoryFileIndex: It took 51 ms to list leaf files for 1 paths.
22/11/01 12:37:31 INFO FileSourceStrategy: Pushed Filters:
22/11/01 12:37:31 INFO FileSourceStrategy: Post-Scan Filters:
22/11/01 12:37:31 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
22/11/01 12:37:31 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 349.6 KiB, free 366.0 MiB)
22/11/01 12:37:31 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 33.9 KiB, free 365.9 MiB)
22/11/01 12:37:31 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on CS3L.ad.itl.com:52828 (size: 33.9 KiB, free: 366.3 MiB)
22/11/01 12:37:31 INFO SparkContext: Created broadcast 0 from javaToPython at NativeMethodAccessorImpl.java:0
22/11/01 12:37:31 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
22/11/01 12:37:32 INFO SparkContext: Starting job: collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38
22/11/01 12:37:32 INFO DAGScheduler: Registering RDD 6 (reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35) as input to shuffle 0
22/11/01 12:37:32 INFO DAGScheduler: Got job 0 (collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) with 1 output partitions
22/11/01 12:37:32 INFO DAGScheduler: Final stage: ResultStage 1 (collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38)
22/11/01 12:37:32 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
22/11/01 12:37:32 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
22/11/01 12:37:32 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[6] at reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35), which has no missing parents
22/11/01 12:37:32 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 19.4 KiB, free 365.9 MiB)
22/11/01 12:37:32 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 10.2 KiB, free 365.9 MiB)
22/11/01 12:37:32 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on CS3L.ad.itl.com:52828 (size: 10.2 KiB, free: 366.3 MiB)
22/11/01 12:37:32 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1513
22/11/01 12:37:32 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[6] at reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35) (first 15 tasks are for partitions Vector(0))
22/11/01 12:37:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
22/11/01 12:37:32 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (CS3L.ad.itl.com, executor driver, partition 0, PROCESS_LOCAL, 4914 bytes) taskResourceAssignments Map()
22/11/01 12:37:32 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/11/01 12:37:34 INFO FileScanRDD: Reading File path: file:///D:/progfiles/spark-3.3.1-bin-hadoop3/README.md, range: 0-4585, partition values: [empty row]
22/11/01 12:37:34 INFO CodeGenerator: Code generated in 316.0539 ms
22/11/01 12:37:34 INFO PythonRunner: Times: total = 1638, boot = 1131, init = 504, finish = 3
22/11/01 12:37:34 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1928 bytes result sent to driver
22/11/01 12:37:34 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2463 ms on CS3L.ad.itl.com (executor driver) (1/1)
22/11/01 12:37:34 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/11/01 12:37:34 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 52829
22/11/01 12:37:34 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35) finished in 2.675 s
22/11/01 12:37:34 INFO DAGScheduler: looking for newly runnable stages
22/11/01 12:37:34 INFO DAGScheduler: running: Set()
22/11/01 12:37:34 INFO DAGScheduler: waiting: Set(ResultStage 1)
22/11/01 12:37:34 INFO DAGScheduler: failed: Set()
22/11/01 12:37:34 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[9] at collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38), which has no missing parents
22/11/01 12:37:35 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 9.5 KiB, free 365.9 MiB)
22/11/01 12:37:35 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 5.7 KiB, free 365.9 MiB)
22/11/01 12:37:35 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on CS3L.ad.itl.com:52828 (size: 5.7 KiB, free: 366.3 MiB)
22/11/01 12:37:35 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1513
22/11/01 12:37:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (PythonRDD[9] at collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) (first 15 tasks are for partitions Vector(0))
22/11/01 12:37:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks resource profile 0
22/11/01 12:37:35 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (CS3L.ad.itl.com, executor driver, partition 0, NODE_LOCAL, 4271 bytes) taskResourceAssignments Map()
22/11/01 12:37:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
22/11/01 12:37:35 INFO ShuffleBlockFetcherIterator: Getting 1 (3.2 KiB) non-empty blocks including 1 (3.2 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
22/11/01 12:37:35 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 20 ms
22/11/01 12:37:36 INFO PythonRunner: Times: total = 1154, boot = 1133, init = 20, finish = 1
22/11/01 12:37:36 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 6944 bytes result sent to driver
22/11/01 12:37:36 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1323 ms on CS3L.ad.itl.com (executor driver) (1/1)
22/11/01 12:37:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/11/01 12:37:36 INFO DAGScheduler: ResultStage 1 (collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) finished in 1.351 s
22/11/01 12:37:36 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/11/01 12:37:36 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
22/11/01 12:37:36 INFO DAGScheduler: Job 0 finished: collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38, took 4.172433 s
#: 1
Apache: 1
Spark: 15
...
project.: 1
22/11/01 12:37:36 INFO SparkUI: Stopped Spark web UI at http://CS3L.ad.itl.com:4040
22/11/01 12:37:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/11/01 12:37:36 INFO MemoryStore: MemoryStore cleared
22/11/01 12:37:36 INFO BlockManager: BlockManager stopped
22/11/01 12:37:36 INFO BlockManagerMaster: BlockManagerMaster stopped
22/11/01 12:37:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/11/01 12:37:36 INFO SparkContext: Successfully stopped SparkContext
22/11/01 12:37:37 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:37:37 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-d74fb7a9-5e3a-417e-b815-7f3de7efb44b
22/11/01 12:37:37 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-d74fb7a9-5e3a-417e-b815-7f3de7efb44b\pyspark-0c5fc2b0-e2ab-4e28-a021-2f57c08d8d8c
22/11/01 12:37:37 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-20206684-dcc9-4326-976b-d9499bc5e483
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>
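For reference, the bundled wordcount.py is essentially the following (a simplified sketch of the example script, not a verbatim copy): read the input file, split each line into words, map every word to a count of 1, and reduce by key to sum the counts.

import sys
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()

    # Read the input file passed on the command line (e.g. README.md).
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    for word, count in counts.collect():
        print(f"{word}: {count}")

    spark.stop()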
11: Missing PyArrow Package Issue and Resolution
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit C:\Users\ashish\Desktop\mh\Code\pandas_to_pyspark\pyinstaller\script.py
Traceback (most recent call last):
File "D:\progfiles\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\pandas\utils.py", line 53, in require_minimum_pyarrow_version
ModuleNotFoundError: No module named 'pyarrow'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\ashish\Desktop\mh\Code\pandas_to_pyspark\pyinstaller\script.py", line 1, in <module>
from pyspark import pandas as pd
File "<frozen zipimport>", line 259, in load_module
File "D:\progfiles\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\pandas\__init__.py", line 34, in <module>
File "D:\progfiles\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\pandas\utils.py", line 60, in require_minimum_pyarrow_version
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
22/11/01 12:41:57 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:41:57 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-3293d857-5ddc-4fcf-8dcc-8d7b39a8f8cc
Resolution
CMD> pip install pyarrow
(base) C:\Users\ashish>pip show pyarrow
Name: pyarrow
Version: 10.0.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author:
Author-email:
License: Apache License, Version 2.0
Location: c:\users\ashish\anaconda3\lib\site-packages
Requires: numpy
Required-by:
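With pyarrow installed, the import that failed above should now succeed. A quick smoke test (a hedged sketch; it re-uses the same import as the failing script):

# Re-run the import that previously failed with ModuleNotFoundError.
import pyarrow
print("pyarrow version:", pyarrow.__version__)

from pyspark import pandas as pd  # pandas API on Spark; requires PyArrow >= 1.0.0

df = pd.DataFrame({"a": [1, 2, 3]})
print(df.sum())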