survival8

Thursday, October 13, 2022

Spark installation on 3 RHEL based nodes cluster (Issue Resolution in Apr 2020)

Configurations:
  Hostname and IP mappings:
    Check the "/etc/hosts" file by opening it in both NANO and VI.

192.168.1.12 MASTER master
192.168.1.3  SLAVE1 slave1
192.168.1.4  SLAVE2 slave2
  
  Software configuration:
    (base) [admin@SLAVE2 downloads]$ java -version
      openjdk version "1.8.0_181"
      OpenJDK Runtime Environment (build 1.8.0_181-b13)
      OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
    
    (base) [admin@MASTER ~]$ cd /opt/ml/downloads
    (base) [admin@MASTER downloads]$ ls
      Anaconda3-2020.02-Linux-x86_64.sh  
  	    hadoop-3.2.1.tar.gz  
  	    scala-2.13.2.rpm  
  	    spark-3.0.0-preview2-bin-hadoop3.2.tgz 
   
    # Scala can be downloaded from here.
    # Installation command: sudo rpm -i scala-2.13.2.rpm
   
    (base) [admin@MASTER downloads]$ echo JAVA_HOME
      /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-7.b13.el7.x86_64/jre/
  
    PATH: /usr/local/hadoop/etc/hadoop/hadoop-env.sh
      JAVA_HOME ON 'master': /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-7.b13.el7.x86_64/jre/
      JAVA_HOME on 'slave1': /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre

~ ~ ~

In the case of no internet connectivity, installation of 'openssh-server' and 'openssh-client' is not straightforward. These packages have nested dependencies that are hard resolve.

 (base) [admin@SLAVE2 downloads]$ sudo rpm -i openssh-server-8.0p1-4.el8_1.x86_64.rpm
  warning: openssh-server-8.0p1-4.el8_1.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID 8483c65d: NOKEY
  error: Failed dependencies:
    crypto-policies >= 20180306-1 is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libc.so.6(GLIBC_2.25)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libc.so.6(GLIBC_2.26)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypt.so.1(XCRYPT_2.0)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypto.so.1.1()(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypto.so.1.1(OPENSSL_1_1_0)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypto.so.1.1(OPENSSL_1_1_1b)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    openssh = 8.0p1-4.el8_1 is needed by openssh-server-8.0p1-4.el8_1.x86_64

~ ~ ~

Doing SSH setup:
  1) sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT
  2) sudo reboot
  3) ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""
  4) ssh-copy-id -i ~/.ssh/id_rsa.pub admin@SLAVE2
  5) ssh-copy-id -i ~/.ssh/id_rsa.pub admin@MASTER
  6) ssh-copy-id -i ~/.ssh/id_rsa.pub admin@SLAVE1

COMMAND FAILURE ON RHEL:
  [admin@MASTER ~]$ sudo service ssh stop
    Redirecting to /bin/systemctl stop ssh.service
    Failed to stop ssh.service: Unit ssh.service not loaded.
    
  [admin@MASTER ~]$ sudo service ssh start
    Redirecting to /bin/systemctl start ssh.service
    Failed to start ssh.service: Unit not found.

Testing of SSH is through: ssh 'admin@SLAVE1'

~ ~ ~

To activate Conda 'base' environment at the start up of system, following code snippet goes at the end of "~/.bashrc" file.

  # >>> conda initialize >>>
  # !! Contents within this block are managed by 'conda init' !!
  __conda_setup="$('/home/admin/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
  if [ $? -eq 0 ]; then
      eval "$__conda_setup"
  else
      if [ -f "/home/admin/anaconda3/etc/profile.d/conda.sh" ]; then
          . "/home/admin/anaconda3/etc/profile.d/conda.sh"
      else
          export PATH="/home/admin/anaconda3/bin:$PATH"
      fi
  fi
  unset __conda_setup
  # conda initialize

~ ~ ~

CHECKING THE OUTPUT OF 'start-dfs.sh' ON MASTER:
 (base) [admin@MASTER sbin]$ ps aux | grep java
   admin     7461 40.5  1.4 6010824 235120 ?      Sl   21:57   0:07 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre/bin/java -Dproc_secondarynamenode -Djava.net.preferIPv4Stack=true -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=/usr/local/hadoop/logs -Dyarn.log.file=hadoop-admin-secondarynamenode-MASTER.log -Dyarn.home.dir=/usr/local/hadoop -Dyarn.root.logger=INFO,console -Djava.library.path=/usr/local/hadoop/lib/native -Dhadoop.log.dir=/usr/local/hadoop/logs -Dhadoop.log.file=hadoop-admin-secondarynamenode-MASTER.log -Dhadoop.home.dir=/usr/local/hadoop -Dhadoop.id.str=admin -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml o.a.h.hdfs.server.namenode.SecondaryNameNode
   
   ...

OR
  $ ps -aux | grep java | awk '{print $12}'
    -Dproc_secondarynamenode
    ...

~ ~ ~

CREATING THE 'DATANODE' AND 'NAMENODE' DIRECTORIES:

  (base) [admin@MASTER logs]$ cd ~
  (base) [admin@MASTER ~]$ pwd
      /home/admin
  (base) [admin@MASTER ~]$ cd ..
  (base) [admin@MASTER home]$ sudo mkdir hadoop
  (base) [admin@MASTER home]$ sudo chmod 777 hadoop
  (base) [admin@MASTER home]$ cd hadoop
  (base) [admin@MASTER hadoop]$ sudo mkdir data
  (base) [admin@MASTER hadoop]$ sudo chmod 777 data
  (base) [admin@MASTER hadoop]$ cd data
  (base) [admin@MASTER data]$ sudo mkdir dataNode
  (base) [admin@MASTER data]$ sudo chmod 777 dataNode
  (base) [admin@MASTER data]$ sudo mkdir nameNode
  (base) [admin@MASTER data]$ sudo chmod 777 nameNode
  (base) [admin@MASTER data]$ pwd
      /home/hadoop/data
  (base) [admin@SLAVE1 data]$ sudo chown admin *
  (base) [admin@MASTER data]$ ls -lrt
      total 0
      drwxrwxrwx. 2 admin root 6 Apr 27 22:24 dataNode
      drwxrwxrwx. 2 admin root 6 Apr 27 22:37 nameNode

# Error example with the NameNode execution if 'data/nameNode' folder is not accessible:

File: /usr/local/hadoop/logs/hadoop-admin-namenode-MASTER.log:

2019-10-17 21:45:39,714 WARN o.a.h.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
o.a.h.hdfs.server.common.InconsistentFSStateException: Directory /home/hadoop/data/nameNode is in an inconsistent state: storage directory does not exist or is not accessible.
	...
  at o.a.h.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1692)
	at o.a.h.hdfs.server.namenode.NameNode.main(NameNode.java:1759)
	
# Error example with the DameNode execution if 'data/dataNode' folder is not accessible:

File: /usr/local/hadoop/logs/hadoop-admin-datanode-SLAVE1.log

2019-10-17 22:30:49,302 WARN o.a.h.hdfs.server.datanode.checker.StorageLocationChecker: Exception checking StorageLocation [DISK]file:/home/hadoop/data/dataNode
java.io.FileNotFoundException: File file:/home/hadoop/data/dataNode does not exist
        ...
2019-10-17 22:30:49,307 ERROR o.a.h.hdfs.server.datanode.DataNode: Exception in secureMain
o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
        ...
        at o.a.h.hdfs.server.datanode.DataNode.main(DataNode.java:2924)
2019-10-17 22:30:49,310 INFO o.a.h.util.ExitUtil: Exiting with status 1: o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
2019-10-17 22:30:49,335 INFO o.a.h.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at SLAVE1/192.168.1.3
************************************************************/

~ ~ ~

If 'data/dataNode' is not writable by other nodes on the cluster, following failure logs came on SLAVE1:
File: /usr/local/hadoop/logs/hadoop-admin-datanode-MASTER.log

2019-10-17 22:37:33,820 WARN o.a.h.hdfs.server.datanode.checker.StorageLocationChecker: Exception checking StorageLocation [DISK]file:/home/hadoop/data/dataNode
EPERM: Operation not permitted
        ...
        at java.lang.Thread.run(Thread.java:748)
2019-10-17 22:37:33,825 ERROR o.a.h.hdfs.server.datanode.DataNode: Exception in secureMain
o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
        at o.a.h.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:231)
        ...
        at o.a.h.hdfs.server.datanode.DataNode.main(DataNode.java:2924)
2019-10-17 22:37:33,829 INFO o.a.h.util.ExitUtil: Exiting with status 1: o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
2019-10-17 22:37:33,838 INFO o.a.h.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at SLAVE1/192.168.1.3
************************************************************/ 

~ ~ ~

Success logs if "DataNode" program comes up successfully on slave machines:

SLAVE1 SUCCESS MESSAGE FOR DATANODE:

	2019-10-17 22:49:47,572 INFO o.a.h.hdfs.server.datanode.DataNode: STARTUP_MSG:
	/************************************************************
	STARTUP_MSG: Starting DataNode
	STARTUP_MSG:   host = SLAVE1/192.168.1.3
	STARTUP_MSG:   args = []
	STARTUP_MSG:   version = 3.2.1
	...
	STARTUP_MSG:   build = https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842; compiled by 'rohithsharmaks' on 2019-09-10T15:56Z
	STARTUP_MSG:   java = 1.8.0_171
	...
	2019-10-17 22:49:49,489 INFO o.a.h.hdfs.server.datanode.DataNode: Starting DataNode with maxLockedMemory = 0
	2019-10-17 22:49:49,543 INFO o.a.h.hdfs.server.datanode.DataNode: Opened streaming server at /0.0.0.0:9866
	2019-10-17 22:49:49,549 INFO o.a.h.hdfs.server.datanode.DataNode: Balancing bandwidth is 10485760 bytes/s
	2019-10-17 22:49:49,549 INFO o.a.h.hdfs.server.datanode.DataNode: Number threads for balancing is 50 
	...

ALSO:
	(base) [admin@SLAVE1 logs]$ ps -aux | grep java | awk '{print $12}'
		...
		-Dproc_datanode
		...

MASTER SUCCESS MESSAGE FOR DATANODE:
	(base) [admin@MASTER sbin]$ ps -aux | grep java | awk '{print $12}'
		-Dproc_datanode
		-Dproc_secondarynamenode
		...

~ ~ ~

FAILURE LOGS FROM MASTER FOR ERROR IN NAMENODE:
(base) [admin@MASTER logs]$ cat hadoop-admin-namenode-MASTER.log
	2019-10-17 22:49:56,593 ERROR o.a.h.hdfs.server.namenode.NameNode: Failed to start namenode.
	java.io.IOException: NameNode is not formatted.
			at o.a.h.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:252)
			...
			at o.a.h.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1692)
			at o.a.h.hdfs.server.namenode.NameNode.main(NameNode.java:1759)
	2019-10-17 22:49:56,596 INFO o.a.h.util.ExitUtil: Exiting with status 1: java.io.IOException: NameNode is not formatted.
	2019-10-17 22:49:56,600 INFO o.a.h.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
	/************************************************************
	SHUTDOWN_MSG: Shutting down NameNode at MASTER/192.168.1.12
	************************************************************/ 

FIX:
	Previously: "hadoop namenode -format" 
	On Hadooop 3.X: "hdfs namenode format"

	Hadoop namenode directory contains the fsimage and configuration files that hold the basic information about Hadoop file system such as where is data available, which user created the files, etc.

	If you format the NameNode, then the above information is deleted from NameNode directory which is specified in the "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" as "dfs.namenode.name.dir"

	After formatting you still have the data on the Hadoop, but not the NameNode metadata.

SUCCESS AFTER THE FIX ON MASTER:
	(base) [admin@MASTER sbin]$ ps -aux | grep java | awk '{print $12}'
		-Dproc_namenode
		-Dproc_datanode
		-Dproc_secondarynamenode
		...

~ ~ ~

MOVING ON TO SPARK:
WE HAVE YARN SO WE WILL NOT MAKE USE OF '/usr/local/spark/conf/slaves' FILE.

(base) [admin@MASTER conf]$ cat slaves.template
# A Spark Worker will be started on each of the machines listed below.
... 
		
~ ~ ~

FAILURE LOGS FROM 'spark-submit':
2019-10-17 23:23:03,832 INFO ipc.Client: Retrying connect to server: 192.168.1.12/192.168.1.12:8032. Already tried 0 time(s); maxRetries=45
2019-10-17 23:23:23,836 INFO ipc.Client: Retrying connect to server: 192.168.1.12/192.168.1.12:8032. Already tried 1 time(s); maxRetries=45
2019-10-17 23:23:43,858 INFO ipc.Client: Retrying connect to server: 192.168.1.12/192.168.1.12:8032. Already tried 2 time(s); maxRetries=45 

THE PROBLEM IS IN CONNECTING WITH THE RECOURCE MANAGER AS DESCRIBED IN PROPERTIES FILE YARN-SITE.XML ($HADOOP_HOME/etc/hadoop/yarn-site.xml):
	LOOK FOR THIS: yarn.resourcemanager.address
	FIX: SET IT TO MASTER IP
		
~ ~ ~

SUCCESS LOGS FOR STARTING OF SERVICES AFTER INSTALLATION OF HADOOP AND SPARK:
	(base) [admin@MASTER hadoop/sbin]$ start-all.sh
		Starting namenodes on [master]
		
		Starting datanodes
		master: This system is restricted to authorized users. 
		slave1: This system is restricted to authorized users. 
		
		Starting secondary namenodes [MASTER]
		MASTER: This system is restricted to authorized users. 
		
		Starting resourcemanager
		
		Starting nodemanagers
		master: This system is restricted to authorized users. 
		slave1: This system is restricted to authorized users. 
		
		(base) [admin@MASTER sbin]$

	(base) [admin@MASTER sbin]$ ps aux | grep java | awk '{print $12}'
		-Dproc_namenode
		-Dproc_datanode
		-Dproc_secondarynamenode
		-Dproc_resourcemanager
		-Dproc_nodemanager
		...

ON SLAVE1:
	(base) [admin@SLAVE1 ~]$ ps aux | grep java | awk '{print $12}'
		-Dproc_datanode
		-Dproc_nodemanager
		...

~ ~ ~

FAILURE LOGS FROM SPARK-SUBMIT ON MASTER:
	2019-10-17 23:54:26,189 INFO cluster.YarnScheduler: Adding task set 0.0 with 100 tasks
	2019-10-17 23:54:41,247 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
	2019-10-17 23:54:56,245 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
	2019-10-17 23:55:11,246 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Reason:	Spark master doesn't have any resources allocated to execute the job. Resources like worker node or slave node.
Fix for setup: changes in the /usr/local/hadoop/etc/hadoop/yarn-site.xml
Ref: StackOverflow

~ ~ ~

CONNECTIVITY (OR PORT) RELATED ISSUE INSTANCE 1:
	ISSUE WITH DATANODE ON SLAVE1:
		(base) [admin@SLAVE1 logs]$ pwd
			/usr/local/hadoop/logs
			
		(base) [admin@SLAVE1 logs]$cat hadoop-admin-datanode-SLAVE1.log
		
		(base) [admin@SLAVE1 logs]$
			2019-10-17 22:50:40,384 WARN o.a.h.hdfs.server.datanode.DataNode: Problem connecting to server: master/192.168.1.12:9000
			2019-10-17 22:50:46,416 INFO o.a.h.ipc.Client: Retrying connect to server: master/192.168.1.12:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
			
CONNECTIVITY (OR PORT) RELATED ISSUE INSTANCE 2:
	(base) [admin@MASTER logs]$ cat hadoop-admin-nodemanager-MASTER.log
		2019-10-18 00:24:17,473 INFO o.a.h.ipc.Client: Retrying connect to server: MASTER/192.168.1.12:8031. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

FIX: All connectivity between IPs of nodes on the cluster, and bring down the firewall on the nodes on the cluster.
    sudo /sbin/iptables -A INPUT -p tcp -s 192.168.1.12 -j ACCEPT
    sudo /sbin/iptables -A OUTPUT -p tcp -d 192.168.1.12 -j ACCEPT
    sudo /sbin/iptables -A INPUT -p tcp -s 192.168.1.3 -j ACCEPT
    sudo /sbin/iptables -A OUTPUT -p tcp -d 192.168.1.3 -j ACCEPT
    
    sudo systemctl stop iptables
    sudo service firewalld stop

Also, check port (here 80) connectivity as shown below:
1. lsof -i :80
2. netstat -an | grep 80 | grep LISTEN

~ ~ ~

ISSUE IN SPARK-SUBMIT LOGS ON MASTER:
    Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

FIX IS TO BE DONE ON ALL THE NODES ON THE CLUSTER:
	(base) [admin@SLAVE1 bin]$ ls -lrt /home/admin/anaconda3/bin/python3.7
	-rwx------. 1 admin wheel 12812592 May  6  2019 /home/admin/anaconda3/bin/python3.7

	(base) [admin@MASTER spark]$ pwd
	/usr/local/spark/conf
	
	(base) [admin@MASTER conf]$ ls
	fairscheduler.xml.template  log4j.properties.template  metrics.properties.template  slaves  slaves.template  spark-defaults.conf.template  spark-env.sh.template
	
	(base) [admin@MASTER conf]$ cp spark-env.sh.template spark-env.sh

	PUT THESE PROPERTIES IN THE FILE "/usr/local/spark/conf/spark-env.sh":
		export PYSPARK_PYTHON=/home/admin/anaconda3/bin/python3.7
		export PYSPARK_DRIVER_PYTHON=/home/admin/anaconda3/bin/python3.7

~ ~ ~

ERROR LOGS IF 'EXECUTOR-MEMORY' ARGUMENT OF SPARK-SUBMIT ASKS FOR MORE MEMORY THAN DEFINED IN YARN CONFIGURATION:

FILE INSTANCE 1:
  $HADOOP_HOME: /usr/local/hadoop
  
  (base) [admin@MASTER hadoop]$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
  
  <configuration>
    <property>
      <name>yarn.acl.enable</name>
      <value>0</value>
    </property>
    
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>192.168.1.12</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
  
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>4000</value>
    </property>
    
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8000</value>
    </property>
    
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>128</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>
  </configuration>

ERROR INSTANCE 1:

	(base) [admin@MASTER sbin]$ ../bin/spark-submit --master yarn --executor-memory 12G ../examples/src/main/python/pi.py 100
	
	2019-10-18 13:59:07,891 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
	2019-10-18 13:59:09,502 INFO spark.SparkContext: Running Spark version 3.0.0-preview2
	2019-10-18 13:59:09,590 INFO resource.ResourceUtils: ==============================================================
	2019-10-18 13:59:09,593 INFO resource.ResourceUtils: Resources for spark.driver:

	2019-10-18 13:59:09,594 INFO resource.ResourceUtils: ==============================================================
	2019-10-18 13:59:09,596 INFO spark.SparkContext: Submitted application: PythonPi
	2019-10-18 13:59:09,729 INFO spark.SecurityManager: Changing view acls to: admin
	2019-10-18 13:59:09,729 IN

	2019-10-18 13:59:13,927 INFO spark.SparkContext: Successfully stopped SparkContext
	Traceback (most recent call last):
	  File "/usr/local/spark/sbin/../examples/src/main/python/pi.py", line 33, in [module]
		.appName("PythonPi")\
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 183, in getOrCreate
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 370, in getOrCreate
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 130, in __init__
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 192, in _do_init
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 309, in _initialize_context
	  File "/usr/local/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1554, in __call__
	  File "/usr/local/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
	py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
	: java.lang.IllegalArgumentException: Required executor memory (12288 MB), offHeap memory (0) MB, overhead (1228 MB), and PySpark memory (0 MB) is above the max threshold (4000 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
			...
			at java.lang.Thread.run(Thread.java:748)

	2019-10-18 13:59:14,005 INFO util.ShutdownHookManager: Shutdown hook called
	2019-10-18 13:59:14,007 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-fbead587-b1ae-4e8e-acd4-160e585a6f34
	2019-10-18 13:59:14,012 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-3331bae2-e2d1-47f6-886c-317be6c98339 

FILE INSTANCE 2:

  <configuration>
    <property>
      <name>yarn.acl.enable</name>
      <value>0</value>
    </property>
    
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>192.168.1.12</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>12000</value>
    </property>
    
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>10000</value>
    </property>
    
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>128</value>
    </property>
  </configuration>
  
ERROR INSTANCE 2:
  (base) [admin@MASTER sbin]$ ../bin/spark-submit --master yarn ../examples/src/main/python/pi.py 100
    py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
    : java.lang.IllegalArgumentException: Required executor memory (12288 MB), offHeap memory (0) MB, overhead (1228 MB), and PySpark memory (0 MB) is above the max threshold (10000 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'. 

Related Articles:
% Getting started with Hadoop on Ubuntu in VirtualBox
% Setting up three node Hadoop cluster on Ubuntu using VirtualBox
% Getting started with Spark on Ubuntu in VirtualBox
% Setting up a three node Spark cluster on Ubuntu using VirtualBox (Apr 2020)
% Notes on setting up Spark with YARN three node cluster

Monday, October 10, 2022

What about medications (Propranolol, Benzodiazepines and Anti-psychotics) for treatment of trauma

People have always used drugs to deal with traumatic stress. Each culture and each generation has its preferences—gin, vodka, beer, or whiskey; hashish, marijuana, cannabis, or ganja; cocaine; opioids like oxycontin; tranquilizers such as Valium, Xanax, and Klonopin. When people are desperate, they will do just about anything to feel calmer and more in control.

Mainstream psychiatry follows this tradition. Over the past decade the Departments of Defense and Veterans Affairs combined have spent over $4.5 billion on antidepressants, antipsychotics, and antianxiety drugs. A June 2010 internal report from the Defense Department’s Pharmacoeconomic Center at Fort Sam Houston in San Antonio showed that 213,972, or 20 percent of the 1.1 million active-duty troops surveyed, were taking some form of psychotropic drug: antidepressants, antipsychotics, sedative hypnotics, or other controlled substances.

However, drugs cannot “cure” trauma; they can only dampen the expressions of a disturbed physiology. And they do not teach the lasting lessons of self-regulation. They can help to control feelings and behavior, but always at a price—because they work by blocking the chemical systems that regulate engagement, motivation, pain, and pleasure. Some of my colleagues remain optimistic: I keep attending meetings where serious scientists discuss their quest for the elusive magic bullet that will miraculously reset the fear circuits of the brain (as if traumatic stress involved only one simple brain circuit). I also regularly prescribe medications.

Selective Serotonin Reuptake Inhibitors (SSRIs)

Just about every group of psychotropic agents has been used to treat some aspect of PTSD. The serotonin reuptake inhibitors (SSRIs) such as Prozac, Zoloft, Effexor, and Paxil have been most thoroughly studied, and they can make feelings less intense and life more manageable. Patients on SSRIs often feel calmer and more in control; feeling less overwhelmed often makes it easier to engage in therapy. Other patients feel blunted by SSRIs—they feel they’re “losing their edge.” I approach it as an empirical question: Let’s see what works, and only the patient can be the judge of that. On the other hand, if one SSRI does not work, it’s worth trying another, because they all have slightly different effects. It’s interesting that the SSRIs are widely used to treat depression, but in a study in which we compared Prozac with eye movement desensitization and reprocessing (EMDR) for patients with PTSD, many of whom were also depressed, EMDR proved to be a more effective antidepressant than Prozac.

Propranolol

Medicines that target the autonomic nervous system, like propranolol or clonidine, can help to decrease hyperarousal and reactivity to stress.
This family of drugs works by blocking the physical effects of adrenaline, the fuel of arousal, and thus reduces nightmares, insomnia, and reactivity to trauma triggers.
Blocking adrenaline can help to keep the rational brain online and make choices possible: “Is this really what I want to do?” Since I have started to integrate mindfulness and yoga into my practice, I use these medications less often, except occasionally to help patients sleep more restfully.

Benzodiazepines

Traumatized patients tend to like tranquilizing drugs, benzodiazepines like Klonopin, Valium, Xanax, and Ativan.
In many ways, they work like alcohol, in that they make people feel calm and keep them from worrying. (Casino owners love customers on benzodiazepines; they don’t get upset when they lose and keep gambling.) But also, like alcohol, benzos weaken inhibitions against saying hurtful things to people we love.
Most civilian doctors are reluctant to prescribe these drugs, because they have a high addiction potential and they may also interfere with trauma processing.
Patients who stop taking them after prolonged use usually have withdrawal reactions that make them agitated and increase posttraumatic symptoms. I sometimes give my patients low doses of benzodiazepines to use as needed, but not enough to take on a daily basis. They have to choose when to use up their precious supply, and I ask them to keep a diary of what was going on when they decided to take the pill. That gives us a chance to discuss the specific incidents that triggered them.

A few studies have shown that anticonvulsants and mood stabilizers, such as lithium or valproate, can have mildly positive effects, taking the edge off hyperarousal and panic.

Second-generation antipsychotic agents

The most controversial medications are the so-called second-generation antipsychotic agents, such as Risperdal (Salt: Risperidone) and Seroquel, the largest-selling psychiatric drugs in the United States ($14.6 billion in 2008). Low doses of these agents can be helpful in calming down combat veterans and women with PTSD related to childhood abuse.

Using these drugs is sometimes justified, for example when patients feel completely out of control and unable to sleep or where other methods have failed. But it’s important to keep in mind that these medications work by blocking the dopamine system, the brain’s reward system, which also functions as the engine of pleasure and motivation.

Antipsychotic medications such as Risperdal, Abilify, or Seroquel can significantly dampen the emotional brain and thus make patients less skittish or enraged, but they also may interfere with being able to appreciate subtle signals of pleasure, danger, or satisfaction. They also cause weight gain, increase the chance of developing diabetes, and make patients physically inert, which is likely to further increase their sense of alienation.

These drugs are widely used to treat abused children who are inappropriately diagnosed with bipolar disorder or mood dysregulation disorder. More than half a million children and adolescents in America are now taking antipsychotic drugs, which may calm them down but also interfere with learning age-appropriate skills and developing friendships with other children. A Columbia University study recently found that prescriptions of antipsychotic drugs for privately insured two- to five-year-olds had doubled between 2000 and 2007.61 Only 40 percent of them had received a proper mental health assessment.

Until it lost its patent, the pharmaceutical company Johnson & Johnson doled out LEGO blocks stamped with the word “Risperdal” for the waiting rooms of child psychiatrists. Children from low-income families are four times as likely as the privately insured to receive antipsychotic medicines.

In one year alone Texas Medicaid spent $96 million on antipsychotic drugs for teenagers and children—including three unidentified infants who were given the drugs before their first birthdays. There have been no studies on the effects of psychotropic medications on the developing brain. Dissociation, self-mutilation, fragmented memories, and amnesia generally do not respond to any of these medications.

The Prozac study that I discussed in chapter 2 was the first to discover that traumatized civilians tend to respond much better to medications than do combat veterans. Since then other studies have found similar discrepancies. In this light it is worrisome that the Department of Defense and the Veteran Affairs (VA) prescribe enormous quantities of medications to combat soldiers and returning veterans, often without providing other forms of therapy.

Between 2001 and 2011 the VA spent about $1.5 billion on Seroquel and Risperdal, while Defense spent about $90 million during the same period, even though a research paper published in 2001 showed that Risperdal was no more effective than a placebo in treating PTSD. Similarly, between 2001 and 2012 the VA spent $72.1 million and Defense spent $44.1 million on benzodiazepines — medications that clinicians generally avoid prescribing to civilians with PTSD because of their addiction potential and lack of significant effectiveness for PTSD symptoms.

Reference: Chapter 13 of 'Body Keeps The Score' (by Bessel van red Kolk)

Saturday, October 8, 2022

Four Ways to Read a CSV in PySpark (v3.3.0)

Download Code


import seaborn as sns
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark import SparkContext
from pyspark.sql import SQLContext # Main entry point for SQL based DataFrame (other is Pandas based DataFrame) and SQL functionality.

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

import pyspark
print(pyspark.__version__)
    

3.3.0

Our input data looks like this:




with open('./input/student.csv', mode = 'r', encoding = 'utf8') as f:
    data = f.readlines()

import pandas as pd
df_student = pd.read_csv('./input/student.csv')

data



['sno,FirstName,LASTNAME\n',
'one,Ram,\n',
'two,,Sharma\n',
'three,Shyam,NA\n',
'four,Kabir,\n',
'five,NA,Singh\n']


df_student.head()



When you load a Pandas DataFrame by reading from a CSV, blank values and 'NA' values are converted to 'NaN' values by default as shown above.

Way 1

Also, PySpark's sqlCtx.createDataFrame() results in error on Pandas dataframe with null values.

df_student = pd.read_csv('./input/student.csv')
sdf = sqlCtx.createDataFrame(df_student)    


TypeError: field FirstName: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>


def clean_data(df):
    df.fillna('Not Applicable', inplace = True) # Handles blank and 'NA' values both. 
    
    df = df.apply(lambda x: x.str.strip())
    df.columns = df.columns.str.lower()
    return df

df_student = clean_data(df_student)

df_student.fillna('Not Applicable', inplace = True) # Handles blank and 'NA' values both. 

sdf = sqlCtx.createDataFrame(df_student)
type(sdf)


pyspark.sql.dataframe.DataFrame

sdf.show()



Way 2

New feature in 3.2.1 [ Ref ]

df = pyspark.pandas.read_csv('./input/student.csv') 

# Error if 'pandas' package is not there in your version of 'pyspark'.
# AttributeError: module 'pyspark' has no attribute 'pandas'


from pyspark import pandas as ppd
df_student_pyspark = ppd.read_csv('./input/student.csv')
type(df_student_pyspark)


pyspark.pandas.frame.DataFrame

df_student_pyspark



Way 3
[ Ref ]


from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# A CSV dataset is pointed to by path.
# The path can be either a single CSV file or a directory of CSV files
# path = "examples/src/main/resources/people.csv"

df = spark.read.option("header", True).csv('./input/student.csv')
df.show()



type(df)

pyspark.sql.dataframe.DataFrame

Way 4: Using the plain old RDD

Shane works in data analytics project and needs to process Users event data (UserLogs.csv file). Which of the below code snippet can be used to split the fields with a comma as a delimiter and fetch only the first two fields from it? 

logsRDD = sc.textFile("/HDFSPATH/UserLogs.csv"); 
FieldsRDD = logsRDD.map(lambda r : r.split(",")).map(lambda r: (r[0],r[1]))

Installing PySpark on Ubuntu And Basic Testing (2022 Oct 8)

Contents of env.yml File


name: mh
channels:
  - conda-forge
dependencies:
  - python==3.9
  - pandas
  - pyspark
  - pip


Keeping the number of packages in dependencies to a bare minimum.
Takes over two hours to process the otherwise tried original 13 dependencies.


(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ conda env create -f env.yml
(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ conda activate mh


Testing
Error Prior to Java Installation


(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ python
Python 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import pyspark
>>> pyspark.__version__
'3.3.0'
>>> import os
>>> os.environ['PYTHONPATH']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'PYTHONPATH'
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'SparkContext' is not defined
>>> from pyspark import SparkContext
>>> sc = SparkContext.getOrCreate()
JAVA_HOME is not set
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/context.py", line 483, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/context.py", line 195, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/context.py", line 417, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/java_gateway.py", line 106, in launch_gateway
    raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
>>> 



(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ java


Command 'java' not found, but can be installed with:
sudo apt install default-jre              # version 2:1.11-72build2, or
sudo apt install openjdk-11-jre-headless  # version 11.0.16+8-0ubuntu1~22.04
sudo apt install openjdk-17-jre-headless  # version 17.0.3+7-0ubuntu0.22.04.1
sudo apt install openjdk-18-jre-headless  # version 18~36ea-1
sudo apt install openjdk-8-jre-headless   # version 8u312-b07-0ubuntu1  

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ sudo apt install openjdk-8-jre-headless
... 

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ java -version


openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~22.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)


(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ echo $JAVA_HOME
EMPTY
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ which java
/usr/bin/java

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ readlink -f /usr/bin/java
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

Update the JAVA_HOME

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ sudo nano ~/.bashrc

Add the following line at the end of the file:

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ 
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ source ~/.bashrc

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ echo $JAVA_HOME

/usr/lib/jvm/java-8-openjdk-amd64

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ python
Python 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.


>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()


22/10/08 13:29:50 WARN Utils: Your hostname, ashish-Lenovo-ideapad-130-15IKB resolves to a loopback address: 127.0.1.1; using 192.168.1.129 instead (on interface wlp2s0)
22/10/08 13:29:50 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/08 13:29:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  


>>> sqlCtx = SQLContext(sc)


/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
warnings.warn()


>>> sdf = sqlCtx.createDataFrame(df)


/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():
/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():

>>> sdf.show()


+----+----+                                                                     
|col1|col2|
+----+----+
|val1|val2|
+----+----+


>>> 
>>> exit()

Friday, October 7, 2022

Spark Installation on Windows (2022-Oct-07, Status Failure, Part 2)

The Issue


(mh) C:\Users\ashish>python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:30:19) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 17:30:26 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 17:30:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>


FRESH INSTALLATION

Checking Java

(mh) C:\Users\ashish>java -version
openjdk version "17.0.4" 2022-07-19 LTS
OpenJDK Runtime Environment Zulu17.36+14-SA (build 17.0.4+8-LTS)
OpenJDK 64-Bit Server VM Zulu17.36+14-SA (build 17.0.4+8-LTS, mixed mode, sharing)

~ ~ ~

Checking Previous Installation of PySpark Through Its CLI

(mh) C:\Users\ashish>pyspark
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
The system cannot find the path specified.
The system cannot find the path specified.
(mh) C:\Users\ashish>

~ ~ ~

Checking JAVA_HOME

(base) C:\Users\ashish>echo %JAVA_HOME%
C:\Program Files\Zulu\zulu-17

~ ~ ~

Microsoft Windows [Version 10.0.19042.2006]
(c) Microsoft Corporation. All rights reserved.

C:\Users\ashish>pyspark
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
The system cannot find the path specified.
The system cannot find the path specified.

~ ~ ~

(base) C:\Users\ashish>where python
C:\Users\ashish\Anaconda3\python.exe
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe

File: C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark2.cmd


@echo off

rem
rem Licensed to the Apache Software Foundation (ASF) under one or more
rem contributor license agreements.  See the NOTICE file distributed with
rem this work for additional information regarding copyright ownership.
rem The ASF licenses this file to You under the Apache License, Version 2.0
rem (the "License"); you may not use this file except in compliance with
rem the License.  You may obtain a copy of the License at
rem
rem    http://www.apache.org/licenses/LICENSE-2.0
rem
rem Unless required by applicable law or agreed to in writing, software
rem distributed under the License is distributed on an "AS IS" BASIS,
rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
rem See the License for the specific language governing permissions and
rem limitations under the License.
rem

rem Figure out where the Spark framework is installed
call "%~dp0find-spark-home.cmd"

call "%SPARK_HOME%\bin\load-spark-env.cmd"
set _SPARK_CMD_USAGE=Usage: bin\pyspark.cmd [options]

rem Figure out which Python to use.
if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
    set PYSPARK_DRIVER_PYTHON=python
    if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
)

set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.5-src.zip;%PYTHONPATH%

set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py

call "%SPARK_HOME%\bin\spark-submit2.cmd" pyspark-shell-main --name "PySparkShell" %*    


(base) C:\Users\ashish>echo %PATH%
C:\Users\ashish\Anaconda3;C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;C:\Users\ashish\Anaconda3\Library\usr\bin;C:\Users\ashish\Anaconda3\Library\bin;C:\Users\ashish\Anaconda3\Scripts;C:\Users\ashish\Anaconda3\bin;C:\Users\ashish\Anaconda3\condabin;C:\Program Files\Zulu\zulu-17-jre\bin;C:\Program Files\Zulu\zulu-17\bin;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0;C:\windows\System32\OpenSSH;C:\Program Files\Git\cmd;C:\Users\ashish\Anaconda3;C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;C:\Users\ashish\Anaconda3\Library\usr\bin;C:\Users\ashish\Anaconda3\Library\bin;C:\Users\ashish\Anaconda3\Scripts;C:\Users\ashish\AppData\Local\Microsoft\WindowsApps;C:\Users\ashish\AppData\Local\Programs\Microsoft VS Code\bin;C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin;.



(base) C:\Users\ashish>echo %PYTHONPATH%
C:\Users\ashish\Anaconda3



(mh) C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin>pyspark
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:30:19) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
22/10/07 18:42:18 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 18:42:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
        ____              __
        / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
        /_/

Using Python version 3.10.6 (main, Aug 22 2022 20:30:19)
Spark context Web UI available at http://CHDSEZ344867L.ad.infosys.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1665148340837).
SparkSession available as 'spark'.
>>>









(base) C:\Users\ashish>where pyspark
C:\Users\ashish\Anaconda3\Scripts\pyspark
C:\Users\ashish\Anaconda3\Scripts\pyspark.cmd
C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark
C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark.cmd

(base) C:\Users\ashish>where pyspark
C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark
C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark.cmd

DELETE THE FILES:
# C:\Users\ashish\Anaconda3\Scripts\pyspark
# C:\Users\ashish\Anaconda3\Scripts\pyspark.cmd

THEN RUN AGAIN:

(base) C:\Users\ashish>pyspark
Python 3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
22/10/07 18:44:58 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 18:44:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
        ____              __
        / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
        /_/

Using Python version 3.9.12 (main, Apr  4 2022 05:22:27)
Spark context Web UI available at http://CHDSEZ344867L.ad.infosys.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1665148501551).
SparkSession available as 'spark'.
>>>

~ ~ ~

Microsoft Windows [Version 10.0.19042.2006]
(c) Microsoft Corporation. All rights reserved.

C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin>pyspark
Python 3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32

Warning:
This Python interpreter is in a conda environment, but the environment has
not been activated.  Libraries may fail to load.  To activate this environment
please see https://conda.io/activation

Type "help", "copyright", "credits" or "license" for more information.
22/10/07 18:54:48 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 18:54:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
        ____              __
        / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
        /_/

Using Python version 3.9.12 (main, Apr  4 2022 05:22:27)
Spark context Web UI available at http://CHDSEZ344867L.ad.infosys.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1665149091125).
SparkSession available as 'spark'.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\pyspark\sql\context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
    warnings.warn(
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 18:56:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708)
        at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:752)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:686)
        at java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:652)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:628)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:585)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:538)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)
        ... 29 more
22/10/07 18:56:26 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (CHDSEZ344867L.ad.infosys.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708)
        at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:752)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:686)
        at java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:652)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:628)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:585)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:538)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)
        ... 29 more

22/10/07 18:56:26 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):                                  (0 + 0) / 1]
    File "<stdin>", line 1, in <module>
    File "C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\pyspark\sql\dataframe.py", line 606, in show
    print(self._jdf.showString(n, 20, vertical))
    File "C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9.5-src.zip\py4j\java_gateway.py", line 1321, in __call__
    File "C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\pyspark\sql\utils.py", line 190, in deco
    return f(*a, **kw)
    File "C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o62.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (CHDSEZ344867L.ad.infosys.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708)
        at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:752)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:686)
        at java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:652)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:628)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:585)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:538)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)
        ... 29 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
        at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868)
        at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
        at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2863)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:3084)
        at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        ... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708)
        at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:752)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:686)
        at java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:652)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:628)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:585)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:538)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)
        ... 29 more

>>>


INSTALL SCALA

https://www.scala-lang.org/download/

~ ~ ~

C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin>echo %PYTHONPATH%
C:\Users\ashish\Anaconda3

(mh) C:\Users\ashish>python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:30:19) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

~ ~ ~

Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.

This error occurs because your python version is not compatible with pyspark version. So check your python version and update accordingly using the below given command. After that it will work. To know more about it, get your Pyspark certification today and become expert.07-Apr-2020

edureka

~ ~ ~

(base) C:\Users\ashish\Desktop>conda env create -f menv.yml
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages
debugpy-1.6.3        | 3.2 MB    | ### | 100%
kiwisolver-1.4.4     | 61 KB     | ### | 100%
jupyter_core-4.11.1  | 106 KB    | ### | 100%
regex-2022.9.13      | 331 KB    | ### | 100%
scikit-learn-1.1.2   | 7.5 MB    | ### | 100%
cffi-1.15.1          | 223 KB    | ### | 100%
typing_extensions-4. | 29 KB     | ### | 100%
argon2-cffi-bindings | 35 KB     | ### | 100%
scipy-1.9.1          | 28.3 MB   | ### | 100%
markupsafe-2.1.1     | 25 KB     | ### | 100%
click-8.1.3          | 146 KB    | ### | 100%
pandas-1.5.0         | 11.7 MB   | ### | 100%
unicodedata2-14.0.0  | 493 KB    | ### | 100%
sip-6.6.2            | 519 KB    | ### | 100%
python-3.8.0         | 18.8 MB   | ### | 100%
gensim-4.2.0         | 22.4 MB   | ### | 100%
statsmodels-0.13.2   | 10.3 MB   | ### | 100%
tornado-6.2          | 655 KB    | ### | 100%
importlib-metadata-4 | 33 KB     | ### | 100%
pywin32-303          | 6.9 MB    | ### | 100%
pyqt-5.15.7          | 4.7 MB    | ### | 100%
jupyter-1.0.0        | 7 KB      | ### | 100%
pyqt5-sip-12.11.0    | 82 KB     | ### | 100%
matplotlib-3.6.0     | 7 KB      | ### | 100%
matplotlib-base-3.6. | 7.5 MB    | ### | 100%
psutil-5.9.2         | 367 KB    | ### | 100%
pyrsistent-0.18.1    | 85 KB     | ### | 100%
pywinpty-2.0.8       | 234 KB    | ### | 100%
pillow-9.2.0         | 44.9 MB   | ### | 100%
pyarrow-6.0.0        | 2.4 MB    | ### | 100%
numpy-1.23.3         | 6.3 MB    | ### | 100%
fonttools-4.37.4     | 1.7 MB    | ### | 100%
contourpy-1.0.5      | 176 KB    | ### | 100%
python_abi-3.8       | 4 KB      | ### | 100%
sqlite-3.39.4        | 658 KB    | ### | 100%
pyzmq-24.0.1         | 461 KB    | ### | 100%
arrow-cpp-6.0.0      | 15.7 MB   | ### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Installing pip dependencies: \ Ran pip subprocess with arguments:
['C:\\Users\\ashish\\Anaconda3\\envs\\mh\\python.exe', '-m', 'pip', 'install', '-U', '-r', 'C:\\Users\\ashish\\Desktop\\condaenv.tl6wm33z.requirements.txt']
| Pip subprocess output:
Collecting rpy2==3.4.5
    Using cached rpy2-3.4.5.tar.gz (194 kB)
    Preparing metadata (setup.py): started
    Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: cffi>=1.10.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (1.15.1)
Requirement already satisfied: jinja2 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (3.1.2)
Requirement already satisfied: pytz in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (2022.4)
Collecting tzlocal
    Using cached tzlocal-4.2-py3-none-any.whl (19 kB)
Requirement already satisfied: pycparser in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from cffi>=1.10.0->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (2.21)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from jinja2->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (2.1.1)
Collecting backports.zoneinfo
    Downloading backports.zoneinfo-0.2.1-cp38-cp38-win_amd64.whl (38 kB)
Collecting pytz-deprecation-shim
    Using cached pytz_deprecation_shim-0.1.0.post0-py2.py3-none-any.whl (15 kB)
Collecting tzdata
    Using cached tzdata-2022.4-py2.py3-none-any.whl (336 kB)
Building wheels for collected packages: rpy2
    Building wheel for rpy2 (setup.py): started
    Building wheel for rpy2 (setup.py): finished with status 'done'
    Created wheel for rpy2: filename=rpy2-3.4.5-py3-none-any.whl size=198845 sha256=f7220847e02f729bd39188f16026ac01855f88cb2c10c3dd68cf5856fc560b6c
    Stored in directory: c:\users\ashish\appdata\local\pip\cache\wheels\57\e2\f0\64c7640f82ba9a23777a25c05d2552fa2991eee7ec2cf9b216
Successfully built rpy2
Installing collected packages: tzdata, backports.zoneinfo, pytz-deprecation-shim, tzlocal, rpy2
Successfully installed backports.zoneinfo-0.2.1 pytz-deprecation-shim-0.1.0.post0 rpy2-3.4.5 tzdata-2022.4 tzlocal-4.2

done
#
# To activate this environment, use
#
#     $ conda activate mh
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Retrieving notices: ...working... done

(base) C:\Users\ashish\Desktop>

~ ~ ~

(mh) C:\Users\ashish\Desktop>python
Python 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:04:36) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> pyspark.__version__
'3.3.0'
>>> exit()

~ ~ ~

(mh) C:\Users\ashish\Desktop>python
Python 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:04:36) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 20:04:54 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 20:04:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
    warnings.warn(
>>> sdf = sqlCtx.createDataFrame(df)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\pandas\conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
    for column, series in pdf.iteritems():
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\pandas\conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
    for column, series in pdf.iteritems():
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 20:05:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.

... ... ... 


Contents of File ENV.YML


name: mh
channels:
    - conda-forge
dependencies:
    - python==3.7
    - pandas
    - seaborn
    - scikit-learn
    - matplotlib
    - ipykernel
    - jupyter
    - pyspark
    - gensim
    - nltk
    - scipy
    - pip
    - pip:
    - rpy2==3.4.5


(base) C:\Users\ashish\Desktop>conda env create -f menv.yml
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages
libthrift-0.16.0     | 877 KB    | ### | 100%
pandas-1.3.5         | 10.9 MB   | ### | 100%
debugpy-1.6.3        | 3.2 MB    | ### | 100%
python-3.7.0         | 21.0 MB   | ### | 100%
argon2-cffi-bindings | 34 KB     | ### | 100%
aws-c-event-stream-0 | 47 KB     | ### | 100%
fonttools-4.37.4     | 1.7 MB    | ### | 100%
ipython-7.33.0       | 1.2 MB    | ### | 100%
gensim-4.2.0         | 22.4 MB   | ### | 100%
jupyter-1.0.0        | 7 KB      | ### | 100%
setuptools-59.8.0    | 1.0 MB    | ### | 100%
aws-checksums-0.1.11 | 51 KB     | ### | 100%
pillow-9.2.0         | 45.4 MB   | ### | 100%
libprotobuf-3.21.7   | 2.4 MB    | ### | 100%
regex-2022.9.13      | 343 KB    | ### | 100%
psutil-5.9.2         | 363 KB    | ### | 100%
pywinpty-2.0.8       | 235 KB    | ### | 100%
statsmodels-0.13.2   | 10.5 MB   | ### | 100%
glog-0.6.0           | 95 KB     | ### | 100%
matplotlib-3.5.3     | 7 KB      | ### | 100%
aws-c-cal-0.5.11     | 36 KB     | ### | 100%
aws-c-common-0.6.2   | 159 KB    | ### | 100%
pyarrow-9.0.0        | 2.8 MB    | ### | 100%
pyrsistent-0.18.1    | 84 KB     | ### | 100%
libgoogle-cloud-2.2. | 10 KB     | ### | 100%
aws-sdk-cpp-1.8.186  | 5.5 MB    | ### | 100%
aws-c-io-0.10.5      | 127 KB    | ### | 100%
pyzmq-24.0.1         | 457 KB    | ### | 100%
libabseil-20220623.0 | 1.6 MB    | ### | 100%
jupyter_core-4.11.1  | 105 KB    | ### | 100%
matplotlib-base-3.5. | 7.4 MB    | ### | 100%
arrow-cpp-9.0.0      | 19.7 MB   | ### | 100%
pywin32-303          | 7.0 MB    | ### | 100%
typing-extensions-4. | 8 KB      | ### | 100%
libcrc32c-1.1.2      | 25 KB     | ### | 100%
cffi-1.15.1          | 222 KB    | ### | 100%
grpc-cpp-1.47.1      | 28.0 MB   | ### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Installing pip dependencies: | Ran pip subprocess with arguments:
['C:\\Users\\ashish\\Anaconda3\\envs\\mh\\python.exe', '-m', 'pip', 'install', '-U', '-r', 'C:\\Users\\ashish\\Desktop\\condaenv.yn5zpyut.requirements.txt']
Pip subprocess output:
Collecting rpy2==3.4.5
    Using cached rpy2-3.4.5.tar.gz (194 kB)
    Preparing metadata (setup.py): started
    Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: cffi>=1.10.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (1.15.1)
Requirement already satisfied: jinja2 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (3.1.2)
Requirement already satisfied: pytz in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (2022.4)
Collecting tzlocal
    Using cached tzlocal-4.2-py3-none-any.whl (19 kB)
Requirement already satisfied: pycparser in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from cffi>=1.10.0->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (2.21)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from jinja2->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (2.1.1)
Collecting backports.zoneinfo
    Downloading backports.zoneinfo-0.2.1-cp37-cp37m-win_amd64.whl (38 kB)
Collecting pytz-deprecation-shim
    Using cached pytz_deprecation_shim-0.1.0.post0-py2.py3-none-any.whl (15 kB)
Collecting tzdata
    Using cached tzdata-2022.4-py2.py3-none-any.whl (336 kB)
Building wheels for collected packages: rpy2
    Building wheel for rpy2 (setup.py): started
    Building wheel for rpy2 (setup.py): finished with status 'done'
    Created wheel for rpy2: filename=rpy2-3.4.5-py3-none-any.whl size=198859 sha256=eb9ac7fe7a3a2109be582d2cae21640c03e1164a55bceda048c24047df75e945
    Stored in directory: c:\users\ashish\appdata\local\pip\cache\wheels\46\00\c5\a43320afe86e7540d16d7f07cf4d29547d98921e76ea9f2f7a
Successfully built rpy2
Installing collected packages: tzdata, backports.zoneinfo, pytz-deprecation-shim, tzlocal, rpy2
Successfully installed backports.zoneinfo-0.2.1 pytz-deprecation-shim-0.1.0.post0 rpy2-3.4.5 tzdata-2022.4 tzlocal-4.2

done
#
# To activate this environment, use
#
#     $ conda activate mh
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Retrieving notices: ...working... done

(base) C:\Users\ashish\Desktop>python
Python 3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

(base) C:\Users\ashish\Desktop>conda activate mh

(mh) C:\Users\ashish\Desktop>python
Python 3.7.0 | packaged by conda-forge | (default, Nov 12 2018, 20:47:31) [MSC v.1900 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 21:02:01 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 21:02:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\context.py:114: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
    FutureWarning,
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 21:02:49 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.

Checking the Environment variables through OS package

>>> import os
>>> os.environ['PATH']
'C:\\Users\\ashish\\Anaconda3\\envs\\mh;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\Library\\mingw-w64\\bin;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\Library\\usr\\bin;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\Library\\bin;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\Scripts;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\bin;C:\\Users\\ashish\\Anaconda3\\condabin;C:\\Program Files\\Zulu\\zulu-17-jre\\bin;C:\\Program Files\\Zulu\\zulu-17\\bin;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0;C:\\windows\\System32\\OpenSSH;C:\\Program Files\\Git\\cmd;C:\\Users\\ashish\\Anaconda3;C:\\Users\\ashish\\Anaconda3\\Library\\mingw-w64\\bin;C:\\Users\\ashish\\Anaconda3\\Library\\usr\\bin;C:\\Users\\ashish\\Anaconda3\\Library\\bin;C:\\Users\\ashish\\Anaconda3\\Scripts;C:\\Users\\ashish\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\ashish\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\ashish\\Desktop\\spark-3.3.0-bin-hadoop3\\bin;.'
>>> os.environ['PYTHONPATH']
'C:\\Users\\ashish\\Anaconda3'
>>> os.system("where python")
C:\Users\ashish\Anaconda3\envs\mh\python.exe
C:\Users\ashish\Anaconda3\python.exe
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe
0

(base) C:\Users\ashish>conda activate mh

(mh) C:\Users\ashish>python
Python 3.7.0 | packaged by conda-forge | (default, Nov 12 2018, 20:47:31) [MSC v.1900 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.


>>> import pandas as pd
>>> import os

>>> os.environ["PYTHONPATH"]    


'C:\\Users\\ashish\\Anaconda3\\envs\\mh'


>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()



22/10/07 21:20:00 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 21:20:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


>>> sqlCtx = SQLContext(sc)

C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\context.py:114: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
    FutureWarning,
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 21:20:45 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.

Installation of Elephas (for distributed deep learning) on Ubuntu through archives (Apr 2020)

Elephas is an extension of Keras, which allows you to run distributed deep learning models at scale with Spark. Elephas currently supports a number of applications, including:

% Data-parallel training of deep learning models
% Distributed hyper-parameter optimization
% Distributed training of ensemble models

Schematically, elephas works as follows.


We have listed packages that are required outside of Anaconda distribution.
The following code goes in a Shell (.sh) script for Ubuntu or .bat script Windows:
pip install Keras_Applications-1.0.8.tar.gz
pip install keras-team-keras-preprocessing-1.1.0-0-gff90696.tar.gz
pip install Keras-2.3.1.tar.gz
pip install hyperopt-0.2.4-py2.py3-none-any.whl
pip install hyperas-0.4.1-py3-none-any.whl
pip install tensorflow_estimator-2.1.0-py2.py3-none-any.whl 
pip install grpcio-1.28.1-cp37-cp37m-manylinux2010_x86_64.whl
pip install protobuf-3.11.3-cp37-cp37m-manylinux1_x86_64.whl
pip install gast-0.3.3.tar.gz
pip install opt_einsum-3.2.1.tar.gz
pip install astor-0.8.1.tar.gz
pip install absl-py-0.9.0.tar.gz
pip install cachetools-4.1.0.tar.gz
pip install pyasn1-0.4.8.tar.gz
pip install pyasn1-modules-0.2.8.tar.gz
pip install rsa-4.0.tar.gz
pip install google-auth-1.14.1.tar.gz
pip install oauthlib-3.1.0.tar.gz
pip install requests-oauthlib-1.3.0.tar.gz
pip install google-auth-oauthlib-0.4.1.tar.gz
pip install Markdown-3.2.1.tar.gz
pip install tensorboard-2.1.1-py3-none-any.whl
pip install google-pasta-0.2.0.tar.gz
pip install gast-0.2.2.tar.gz
pip install termcolor-1.1.0.tar.gz
pip install tensorflow-2.1.0-cp37-cp37m-manylinux2010_x86_64.whl
pip install pypandoc-1.5.tar.gz
pip install py4j-0.10.7.zip
pip install pyspark-2.4.5.tar.gz
pip install elephas-0.4.3-py3-none-any.whl 

Generated .whl files are stored in directory (here 'ashish' is my username): /home/ashish/.cache/pip/wheels

Few packages were not accepted for latest release labels:
# tensorflow-estimator [2.2.0, >=2.1.0rc0] (from tensorflow==2.1.0). Latest available is: 2.2.0
# pip install gast-0.3.3.tar.gz
# pip install py4j-0.10.9.tar.gz

Most of these packages are required by TensorFlow, except:
1. hyperopt-0.2.4-py2.py3-none-any.whl
2. hyperas-0.4.1-py3-none-any.whl
3. pypandoc-1.5.tar.gz
4. py4j-0.10.7.zip
5. pyspark-2.4.5.tar.gz

All the packages are present in this Google Drive link, except TensorFlow and PySpark due to their sizes.
PySpark size: 207 MB
TensorFlow size: 402 MB

Running the shell script second time, uninstalls and reinstalls the packages again.
Here is a Python script to avoid doing this and install only if packages are not installed already:

import sys
import subprocess
import pkg_resources

required = { 'pyspark', 'scipy', 'tensorflow' }
installed = { pkg.key for pkg in pkg_resources.working_set }
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL) 

References:
1. Elephas Documentation
2. GitHub Repository

Pages

Thursday, October 13, 2022

Spark installation on 3 RHEL based nodes cluster (Issue Resolution in Apr 2020)

Monday, October 10, 2022

What about medications (Propranolol, Benzodiazepines and Anti-psychotics) for treatment of trauma

Selective Serotonin Reuptake Inhibitors (SSRIs)

Propranolol

Benzodiazepines

Second-generation antipsychotic agents

Saturday, October 8, 2022

Four Ways to Read a CSV in PySpark (v3.3.0)

When you load a Pandas DataFrame by reading from a CSV, blank values and 'NA' values are converted to 'NaN' values by default as shown above.

Way 1

Also, PySpark's sqlCtx.createDataFrame() results in error on Pandas dataframe with null values.

Way 2

Way 3

Way 4: Using the plain old RDD

Installing PySpark on Ubuntu And Basic Testing (2022 Oct 8)

Contents of env.yml File

Keeping the number of packages in dependencies to a bare minimum.

Testing

Error Prior to Java Installation

Update the JAVA_HOME

Friday, October 7, 2022

Spark Installation on Windows (2022-Oct-07, Status Failure, Part 2)

The Issue

FRESH INSTALLATION

Checking Java

Checking Previous Installation of PySpark Through Its CLI

Checking JAVA_HOME

INSTALL SCALA

Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.

Contents of File ENV.YML

Checking the Environment variables through OS package

Installation of Elephas (for distributed deep learning) on Ubuntu through archives (Apr 2020)