Configurations:

Hostname and IP mappings: check the "/etc/hosts" file (open it in nano or vi).

192.168.1.12 MASTER master
192.168.1.3 SLAVE1 slave1
192.168.1.4 SLAVE2 slave2

Software configuration:

(base) [admin@SLAVE2 downloads]$ java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

(base) [admin@MASTER ~]$ cd /opt/ml/downloads
(base) [admin@MASTER downloads]$ ls
Anaconda3-2020.02-Linux-x86_64.sh  hadoop-3.2.1.tar.gz  scala-2.13.2.rpm  spark-3.0.0-preview2-bin-hadoop3.2.tgz

# Scala can be downloaded from here.
# Installation command: sudo rpm -i scala-2.13.2.rpm

(base) [admin@MASTER downloads]$ echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-7.b13.el7.x86_64/jre/

File: /usr/local/hadoop/etc/hadoop/hadoop-env.sh
JAVA_HOME on 'master': /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-7.b13.el7.x86_64/jre/
JAVA_HOME on 'slave1': /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre

~ ~ ~

In the case of no internet connectivity, installation of 'openssh-server' and 'openssh-client' is not straightforward. These packages have nested dependencies that are hard to resolve.

(base) [admin@SLAVE2 downloads]$ sudo rpm -i openssh-server-8.0p1-4.el8_1.x86_64.rpm
warning: openssh-server-8.0p1-4.el8_1.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID 8483c65d: NOKEY
error: Failed dependencies:
    crypto-policies >= 20180306-1 is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libc.so.6(GLIBC_2.25)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libc.so.6(GLIBC_2.26)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypt.so.1(XCRYPT_2.0)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypto.so.1.1()(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypto.so.1.1(OPENSSL_1_1_0)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypto.so.1.1(OPENSSL_1_1_1b)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    openssh = 8.0p1-4.el8_1 is needed by openssh-server-8.0p1-4.el8_1.x86_64

~ ~ ~

SSH setup:

1) sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT
2) sudo reboot
3) ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""
4) ssh-copy-id -i ~/.ssh/id_rsa.pub admin@SLAVE2
5) ssh-copy-id -i ~/.ssh/id_rsa.pub admin@MASTER
6) ssh-copy-id -i ~/.ssh/id_rsa.pub admin@SLAVE1

COMMAND FAILURE ON RHEL:

[admin@MASTER ~]$ sudo service ssh stop
Redirecting to /bin/systemctl stop ssh.service
Failed to stop ssh.service: Unit ssh.service not loaded.
[admin@MASTER ~]$ sudo service ssh start
Redirecting to /bin/systemctl start ssh.service
Failed to start ssh.service: Unit not found.

(On RHEL the OpenSSH service unit is named 'sshd', not 'ssh'.)

Test SSH with: ssh 'admin@SLAVE1'

~ ~ ~

To activate the Conda 'base' environment at system startup, the following snippet goes at the end of the "~/.bashrc" file.

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/admin/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/admin/anaconda3/etc/profile.d/conda.sh" ]; then
        . "/home/admin/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/admin/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

~ ~ ~

CHECKING THE OUTPUT OF 'start-dfs.sh' ON MASTER:

(base) [admin@MASTER sbin]$ ps aux | grep java
admin 7461 40.5 1.4 6010824 235120 ?
Sl 21:57 0:07 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre/bin/java -Dproc_secondarynamenode -Djava.net.preferIPv4Stack=true -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=/usr/local/hadoop/logs -Dyarn.log.file=hadoop-admin-secondarynamenode-MASTER.log -Dyarn.home.dir=/usr/local/hadoop -Dyarn.root.logger=INFO,console -Djava.library.path=/usr/local/hadoop/lib/native -Dhadoop.log.dir=/usr/local/hadoop/logs -Dhadoop.log.file=hadoop-admin-secondarynamenode-MASTER.log -Dhadoop.home.dir=/usr/local/hadoop -Dhadoop.id.str=admin -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml o.a.h.hdfs.server.namenode.SecondaryNameNode
...

OR

$ ps -aux | grep java | awk '{print $12}'
-Dproc_secondarynamenode
...

~ ~ ~

CREATING THE 'DATANODE' AND 'NAMENODE' DIRECTORIES:

(base) [admin@MASTER logs]$ cd ~
(base) [admin@MASTER ~]$ pwd
/home/admin
(base) [admin@MASTER ~]$ cd ..
(base) [admin@MASTER home]$ sudo mkdir hadoop
(base) [admin@MASTER home]$ sudo chmod 777 hadoop
(base) [admin@MASTER home]$ cd hadoop
(base) [admin@MASTER hadoop]$ sudo mkdir data
(base) [admin@MASTER hadoop]$ sudo chmod 777 data
(base) [admin@MASTER hadoop]$ cd data
(base) [admin@MASTER data]$ sudo mkdir dataNode
(base) [admin@MASTER data]$ sudo chmod 777 dataNode
(base) [admin@MASTER data]$ sudo mkdir nameNode
(base) [admin@MASTER data]$ sudo chmod 777 nameNode
(base) [admin@MASTER data]$ pwd
/home/hadoop/data

(base) [admin@SLAVE1 data]$ sudo chown admin *

(base) [admin@MASTER data]$ ls -lrt
total 0
drwxrwxrwx. 2 admin root 6 Apr 27 22:24 dataNode
drwxrwxrwx. 2 admin root 6 Apr 27 22:37 nameNode

# Error example from the NameNode if the 'data/nameNode' folder is not accessible:

File: /usr/local/hadoop/logs/hadoop-admin-namenode-MASTER.log:
2019-10-17 21:45:39,714 WARN o.a.h.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
o.a.h.hdfs.server.common.InconsistentFSStateException: Directory /home/hadoop/data/nameNode is in an inconsistent state: storage directory does not exist or is not accessible.
...
at o.a.h.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1692)
at o.a.h.hdfs.server.namenode.NameNode.main(NameNode.java:1759)

# Error example from the DataNode if the 'data/dataNode' folder is not accessible:

File: /usr/local/hadoop/logs/hadoop-admin-datanode-SLAVE1.log
2019-10-17 22:30:49,302 WARN o.a.h.hdfs.server.datanode.checker.StorageLocationChecker: Exception checking StorageLocation [DISK]file:/home/hadoop/data/dataNode
java.io.FileNotFoundException: File file:/home/hadoop/data/dataNode does not exist
...
2019-10-17 22:30:49,307 ERROR o.a.h.hdfs.server.datanode.DataNode: Exception in secureMain
o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
...
at o.a.h.hdfs.server.datanode.DataNode.main(DataNode.java:2924)
2019-10-17 22:30:49,310 INFO o.a.h.util.ExitUtil: Exiting with status 1: o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
2019-10-17 22:30:49,335 INFO o.a.h.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at SLAVE1/192.168.1.3
************************************************************/

~ ~ ~

If 'data/dataNode' is not writable by the other nodes on the cluster, the following failure logs appear:

File: /usr/local/hadoop/logs/hadoop-admin-datanode-MASTER.log
2019-10-17 22:37:33,820 WARN o.a.h.hdfs.server.datanode.checker.StorageLocationChecker: Exception checking StorageLocation [DISK]file:/home/hadoop/data/dataNode
EPERM: Operation not permitted
...
at java.lang.Thread.run(Thread.java:748)
2019-10-17 22:37:33,825 ERROR o.a.h.hdfs.server.datanode.DataNode: Exception in secureMain
o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
at o.a.h.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:231)
...
at o.a.h.hdfs.server.datanode.DataNode.main(DataNode.java:2924)
2019-10-17 22:37:33,829 INFO o.a.h.util.ExitUtil: Exiting with status 1: o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
2019-10-17 22:37:33,838 INFO o.a.h.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at SLAVE1/192.168.1.3
************************************************************/

~ ~ ~

Success logs when the "DataNode" program comes up on the slave machines:

SLAVE1 SUCCESS MESSAGE FOR DATANODE:

2019-10-17 22:49:47,572 INFO o.a.h.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = SLAVE1/192.168.1.3
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.2.1
...
STARTUP_MSG: build = https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842; compiled by 'rohithsharmaks' on 2019-09-10T15:56Z
STARTUP_MSG: java = 1.8.0_171
...
2019-10-17 22:49:49,489 INFO o.a.h.hdfs.server.datanode.DataNode: Starting DataNode with maxLockedMemory = 0
2019-10-17 22:49:49,543 INFO o.a.h.hdfs.server.datanode.DataNode: Opened streaming server at /0.0.0.0:9866
2019-10-17 22:49:49,549 INFO o.a.h.hdfs.server.datanode.DataNode: Balancing bandwidth is 10485760 bytes/s
2019-10-17 22:49:49,549 INFO o.a.h.hdfs.server.datanode.DataNode: Number threads for balancing is 50
...

ALSO:

(base) [admin@SLAVE1 logs]$ ps -aux | grep java | awk '{print $12}'
...
-Dproc_datanode
...

MASTER SUCCESS MESSAGE FOR DATANODE:

(base) [admin@MASTER sbin]$ ps -aux | grep java | awk '{print $12}'
-Dproc_datanode
-Dproc_secondarynamenode
...

~ ~ ~

FAILURE LOGS FROM MASTER FOR ERROR IN NAMENODE:

(base) [admin@MASTER logs]$ cat hadoop-admin-namenode-MASTER.log
2019-10-17 22:49:56,593 ERROR o.a.h.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.IOException: NameNode is not formatted.
at o.a.h.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:252)
...
at o.a.h.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1692)
at o.a.h.hdfs.server.namenode.NameNode.main(NameNode.java:1759)
2019-10-17 22:49:56,596 INFO o.a.h.util.ExitUtil: Exiting with status 1: java.io.IOException: NameNode is not formatted.
2019-10-17 22:49:56,600 INFO o.a.h.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at MASTER/192.168.1.12
************************************************************/

FIX:
Previously: "hadoop namenode -format"
On Hadoop 3.X: "hdfs namenode -format"

The Hadoop NameNode directory contains the fsimage and configuration files that hold the basic information about the Hadoop file system, such as where data is available, which user created the files, etc. If you format the NameNode, this information is deleted from the NameNode directory, which is specified in "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" as "dfs.namenode.name.dir". After formatting you still have the data on Hadoop, but not the NameNode metadata.

SUCCESS AFTER THE FIX ON MASTER:

(base) [admin@MASTER sbin]$ ps -aux | grep java | awk '{print $12}'
-Dproc_namenode
-Dproc_datanode
-Dproc_secondarynamenode
...

~ ~ ~

MOVING ON TO SPARK:

WE HAVE YARN, SO WE WILL NOT MAKE USE OF THE '/usr/local/spark/conf/slaves' FILE.

(base) [admin@MASTER conf]$ cat slaves.template
# A Spark Worker will be started on each of the machines listed below.
...

~ ~ ~

FAILURE LOGS FROM 'spark-submit':

2019-10-17 23:23:03,832 INFO ipc.Client: Retrying connect to server: 192.168.1.12/192.168.1.12:8032. Already tried 0 time(s); maxRetries=45
2019-10-17 23:23:23,836 INFO ipc.Client: Retrying connect to server: 192.168.1.12/192.168.1.12:8032. Already tried 1 time(s); maxRetries=45
2019-10-17 23:23:43,858 INFO ipc.Client: Retrying connect to server: 192.168.1.12/192.168.1.12:8032. Already tried 2 time(s); maxRetries=45

THE PROBLEM IS IN CONNECTING WITH THE RESOURCE MANAGER AS DESCRIBED IN THE PROPERTIES FILE YARN-SITE.XML ($HADOOP_HOME/etc/hadoop/yarn-site.xml):

LOOK FOR THIS: yarn.resourcemanager.address
FIX: SET IT TO THE MASTER IP

~ ~ ~

SUCCESS LOGS FOR STARTING OF SERVICES AFTER INSTALLATION OF HADOOP AND SPARK:

(base) [admin@MASTER hadoop/sbin]$ start-all.sh
Starting namenodes on [master]
Starting datanodes
master: This system is restricted to authorized users.
slave1: This system is restricted to authorized users.
Starting secondary namenodes [MASTER]
MASTER: This system is restricted to authorized users.
Starting resourcemanager
Starting nodemanagers
master: This system is restricted to authorized users.
slave1: This system is restricted to authorized users.
(base) [admin@MASTER sbin]$
(base) [admin@MASTER sbin]$ ps aux | grep java | awk '{print $12}'
-Dproc_namenode
-Dproc_datanode
-Dproc_secondarynamenode
-Dproc_resourcemanager
-Dproc_nodemanager
...

ON SLAVE1:

(base) [admin@SLAVE1 ~]$ ps aux | grep java | awk '{print $12}'
-Dproc_datanode
-Dproc_nodemanager
...
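As a side note, the same `ps aux | grep java | awk '{print $12}'` check used throughout this post can also be scripted. A minimal Python sketch (mine, not part of the original setup) that collects the -Dproc_* markers Hadoop attaches to each of its Java daemons:

# Sketch: list the -Dproc_* flags of all running Java processes,
# which is how Hadoop tags its daemons (namenode, datanode, nodemanager, ...).
import subprocess

def running_hadoop_daemons():
    # 'ps axww -o args' prints the full command line of every process (Linux/procps).
    ps_output = subprocess.run(["ps", "axww", "-o", "args"],
                               capture_output=True, text=True).stdout
    daemons = []
    for line in ps_output.splitlines():
        if "java" in line:
            daemons.extend(tok for tok in line.split() if tok.startswith("-Dproc_"))
    return daemons

if __name__ == "__main__":
    print(running_hadoop_daemons())  # e.g. ['-Dproc_namenode', '-Dproc_datanode', ...]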
~ ~ ~

FAILURE LOGS FROM SPARK-SUBMIT ON MASTER:

2019-10-17 23:54:26,189 INFO cluster.YarnScheduler: Adding task set 0.0 with 100 tasks
2019-10-17 23:54:41,247 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-10-17 23:54:56,245 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-10-17 23:55:11,246 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Reason: the Spark master does not have any resources (worker/slave nodes) allocated to execute the job.
Fix for this setup: changes in /usr/local/hadoop/etc/hadoop/yarn-site.xml
Ref: StackOverflow

~ ~ ~

CONNECTIVITY (OR PORT) RELATED ISSUE INSTANCE 1:

ISSUE WITH DATANODE ON SLAVE1:

(base) [admin@SLAVE1 logs]$ pwd
/usr/local/hadoop/logs
(base) [admin@SLAVE1 logs]$ cat hadoop-admin-datanode-SLAVE1.log
2019-10-17 22:50:40,384 WARN o.a.h.hdfs.server.datanode.DataNode: Problem connecting to server: master/192.168.1.12:9000
2019-10-17 22:50:46,416 INFO o.a.h.ipc.Client: Retrying connect to server: master/192.168.1.12:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

CONNECTIVITY (OR PORT) RELATED ISSUE INSTANCE 2:

(base) [admin@MASTER logs]$ cat hadoop-admin-nodemanager-MASTER.log
2019-10-18 00:24:17,473 INFO o.a.h.ipc.Client: Retrying connect to server: MASTER/192.168.1.12:8031. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

FIX: Allow connectivity between the IPs of the nodes on the cluster, and bring down the firewall on the cluster nodes.

sudo /sbin/iptables -A INPUT -p tcp -s 192.168.1.12 -j ACCEPT
sudo /sbin/iptables -A OUTPUT -p tcp -d 192.168.1.12 -j ACCEPT
sudo /sbin/iptables -A INPUT -p tcp -s 192.168.1.3 -j ACCEPT
sudo /sbin/iptables -A OUTPUT -p tcp -d 192.168.1.3 -j ACCEPT
sudo systemctl stop iptables
sudo service firewalld stop

Also, check port (here 80) connectivity as shown below:
1. lsof -i :80
2. netstat -an | grep 80 | grep LISTEN

~ ~ ~

ISSUE IN SPARK-SUBMIT LOGS ON MASTER:

Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

FIX IS TO BE DONE ON ALL THE NODES ON THE CLUSTER:

(base) [admin@SLAVE1 bin]$ ls -lrt /home/admin/anaconda3/bin/python3.7
-rwx------.
1 admin wheel 12812592 May 6 2019 /home/admin/anaconda3/bin/python3.7 (base) [admin@MASTER spark]$ pwd /usr/local/spark/conf (base) [admin@MASTER conf]$ ls fairscheduler.xml.template log4j.properties.template metrics.properties.template slaves slaves.template spark-defaults.conf.template spark-env.sh.template (base) [admin@MASTER conf]$ cp spark-env.sh.template spark-env.sh PUT THESE PROPERTIES IN THE FILE "/usr/local/spark/conf/spark-env.sh": export PYSPARK_PYTHON=/home/admin/anaconda3/bin/python3.7 export PYSPARK_DRIVER_PYTHON=/home/admin/anaconda3/bin/python3.7 ~ ~ ~ ERROR LOGS IF 'EXECUTOR-MEMORY' ARGUMENT OF SPARK-SUBMIT ASKS FOR MORE MEMORY THAN DEFINED IN YARN CONFIGURATION: FILE INSTANCE 1: $HADOOP_HOME: /usr/local/hadoop (base) [admin@MASTER hadoop]$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml <configuration> <property> <name>yarn.acl.enable</name> <value>0</value> </property> <property> <name>yarn.resourcemanager.hostname</name> <value>192.168.1.12</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>4000</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>8000</value> </property> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>128</value> </property> <property> <name>yarn.nodemanager.vmem-check-enabled</name> <value>false</value> </property> </configuration> ERROR INSTANCE 1: (base) [admin@MASTER sbin]$ ../bin/spark-submit --master yarn --executor-memory 12G ../examples/src/main/python/pi.py 100 2019-10-18 13:59:07,891 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2019-10-18 13:59:09,502 INFO spark.SparkContext: Running Spark version 3.0.0-preview2 2019-10-18 13:59:09,590 INFO resource.ResourceUtils: ============================================================== 2019-10-18 13:59:09,593 INFO resource.ResourceUtils: Resources for spark.driver: 2019-10-18 13:59:09,594 INFO resource.ResourceUtils: ============================================================== 2019-10-18 13:59:09,596 INFO spark.SparkContext: Submitted application: PythonPi 2019-10-18 13:59:09,729 INFO spark.SecurityManager: Changing view acls to: admin 2019-10-18 13:59:09,729 IN 2019-10-18 13:59:13,927 INFO spark.SparkContext: Successfully stopped SparkContext Traceback (most recent call last): File "/usr/local/spark/sbin/../examples/src/main/python/pi.py", line 33, in [module] .appName("PythonPi")\ File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 183, in getOrCreate File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 370, in getOrCreate File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 130, in __init__ File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 192, in _do_init File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 309, in _initialize_context File "/usr/local/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1554, in __call__ File "/usr/local/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. 
: java.lang.IllegalArgumentException: Required executor memory (12288 MB), offHeap memory (0) MB, overhead (1228 MB), and PySpark memory (0 MB) is above the max threshold (4000 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'. ... at java.lang.Thread.run(Thread.java:748) 2019-10-18 13:59:14,005 INFO util.ShutdownHookManager: Shutdown hook called 2019-10-18 13:59:14,007 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-fbead587-b1ae-4e8e-acd4-160e585a6f34 2019-10-18 13:59:14,012 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-3331bae2-e2d1-47f6-886c-317be6c98339 FILE INSTANCE 2: <configuration> <property> <name>yarn.acl.enable</name> <value>0</value> </property> <property> <name>yarn.resourcemanager.hostname</name> <value>192.168.1.12</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>12000</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>10000</value> </property> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>128</value> </property> </configuration> ERROR INSTANCE 2: (base) [admin@MASTER sbin]$ ../bin/spark-submit --master yarn ../examples/src/main/python/pi.py 100 py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.IllegalArgumentException: Required executor memory (12288 MB), offHeap memory (0) MB, overhead (1228 MB), and PySpark memory (0 MB) is above the max threshold (10000 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'. Related Articles: % Getting started with Hadoop on Ubuntu in VirtualBox % Setting up three node Hadoop cluster on Ubuntu using VirtualBox % Getting started with Spark on Ubuntu in VirtualBox % Setting up a three node Spark cluster on Ubuntu using VirtualBox (Apr 2020) % Notes on setting up Spark with YARN three node cluster
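Coming back to the executor-memory errors above, the arithmetic behind them can be checked with a small sketch (mine; it assumes Spark's default 10% memory-overhead factor with a 384 MB floor, which reproduces the 1228 MB overhead shown in the error):

# Sketch: how much memory a YARN container request amounts to for a given --executor-memory.
def yarn_request_mb(executor_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead

# --executor-memory 12G => 12288 MB + 1228 MB overhead = 13516 MB requested.
# The request must fit within yarn.scheduler.maximum-allocation-mb and
# yarn.nodemanager.resource.memory-mb for a container to be granted.
print(yarn_request_mb(12288))  # 13516 -> rejected by both configurations above
print(yarn_request_mb(2048))   # 2432  -> would fit even the 4000 MB limit of file instance 1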
Thursday, October 13, 2022
Spark installation on a 3-node RHEL-based cluster (Issue Resolution in Apr 2020)
Monday, October 10, 2022
What about medications (Propranolol, Benzodiazepines and Antipsychotics) for the treatment of trauma?
People have always used drugs to deal with traumatic stress. Each culture and each generation has its preferences—gin, vodka, beer, or whiskey; hashish, marijuana, cannabis, or ganja; cocaine; opioids like oxycontin; tranquilizers such as Valium, Xanax, and Klonopin. When people are desperate, they will do just about anything to feel calmer and more in control. Mainstream psychiatry follows this tradition. Over the past decade the Departments of Defense and Veterans Affairs combined have spent over $4.5 billion on antidepressants, antipsychotics, and antianxiety drugs. A June 2010 internal report from the Defense Department’s Pharmacoeconomic Center at Fort Sam Houston in San Antonio showed that 213,972, or 20 percent of the 1.1 million active-duty troops surveyed, were taking some form of psychotropic drug: antidepressants, antipsychotics, sedative hypnotics, or other controlled substances.

However, drugs cannot “cure” trauma; they can only dampen the expressions of a disturbed physiology. And they do not teach the lasting lessons of self-regulation. They can help to control feelings and behavior, but always at a price—because they work by blocking the chemical systems that regulate engagement, motivation, pain, and pleasure. Some of my colleagues remain optimistic: I keep attending meetings where serious scientists discuss their quest for the elusive magic bullet that will miraculously reset the fear circuits of the brain (as if traumatic stress involved only one simple brain circuit). I also regularly prescribe medications.

Tags: Medicine, Psychology

Selective Serotonin Reuptake Inhibitors (SSRIs)
Just about every group of psychotropic agents has been used to treat some aspect of PTSD. The serotonin reuptake inhibitors (SSRIs) such as Prozac, Zoloft, Effexor, and Paxil have been most thoroughly studied, and they can make feelings less intense and life more manageable. Patients on SSRIs often feel calmer and more in control; feeling less overwhelmed often makes it easier to engage in therapy. Other patients feel blunted by SSRIs—they feel they’re “losing their edge.” I approach it as an empirical question: Let’s see what works, and only the patient can be the judge of that. On the other hand, if one SSRI does not work, it’s worth trying another, because they all have slightly different effects. It’s interesting that the SSRIs are widely used to treat depression, but in a study in which we compared Prozac with eye movement desensitization and reprocessing (EMDR) for patients with PTSD, many of whom were also depressed, EMDR proved to be a more effective antidepressant than Prozac.

Propranolol
Medicines that target the autonomic nervous system, like propranolol or clonidine, can help to decrease hyperarousal and reactivity to stress. This family of drugs works by blocking the physical effects of adrenaline, the fuel of arousal, and thus reduces nightmares, insomnia, and reactivity to trauma triggers. Blocking adrenaline can help to keep the rational brain online and make choices possible: “Is this really what I want to do?” Since I have started to integrate mindfulness and yoga into my practice, I use these medications less often, except occasionally to help patients sleep more restfully.

Benzodiazepines
Traumatized patients tend to like tranquilizing drugs, benzodiazepines like Klonopin, Valium, Xanax, and Ativan. In many ways, they work like alcohol, in that they make people feel calm and keep them from worrying. (Casino owners love customers on benzodiazepines; they don’t get upset when they lose and keep gambling.) But also, like alcohol, benzos weaken inhibitions against saying hurtful things to people we love. Most civilian doctors are reluctant to prescribe these drugs, because they have a high addiction potential and they may also interfere with trauma processing. Patients who stop taking them after prolonged use usually have withdrawal reactions that make them agitated and increase posttraumatic symptoms. I sometimes give my patients low doses of benzodiazepines to use as needed, but not enough to take on a daily basis. They have to choose when to use up their precious supply, and I ask them to keep a diary of what was going on when they decided to take the pill. That gives us a chance to discuss the specific incidents that triggered them. A few studies have shown that anticonvulsants and mood stabilizers, such as lithium or valproate, can have mildly positive effects, taking the edge off hyperarousal and panic.

Second-generation antipsychotic agents
The most controversial medications are the so-called second-generation antipsychotic agents, such as Risperdal (Salt: Risperidone) and Seroquel, the largest-selling psychiatric drugs in the United States ($14.6 billion in 2008). Low doses of these agents can be helpful in calming down combat veterans and women with PTSD related to childhood abuse. Using these drugs is sometimes justified, for example when patients feel completely out of control and unable to sleep or where other methods have failed. But it’s important to keep in mind that these medications work by blocking the dopamine system, the brain’s reward system, which also functions as the engine of pleasure and motivation.

Antipsychotic medications such as Risperdal, Abilify, or Seroquel can significantly dampen the emotional brain and thus make patients less skittish or enraged, but they also may interfere with being able to appreciate subtle signals of pleasure, danger, or satisfaction. They also cause weight gain, increase the chance of developing diabetes, and make patients physically inert, which is likely to further increase their sense of alienation. These drugs are widely used to treat abused children who are inappropriately diagnosed with bipolar disorder or mood dysregulation disorder. More than half a million children and adolescents in America are now taking antipsychotic drugs, which may calm them down but also interfere with learning age-appropriate skills and developing friendships with other children. A Columbia University study recently found that prescriptions of antipsychotic drugs for privately insured two- to five-year-olds had doubled between 2000 and 2007. Only 40 percent of them had received a proper mental health assessment. Until it lost its patent, the pharmaceutical company Johnson & Johnson doled out LEGO blocks stamped with the word “Risperdal” for the waiting rooms of child psychiatrists. Children from low-income families are four times as likely as the privately insured to receive antipsychotic medicines. In one year alone Texas Medicaid spent $96 million on antipsychotic drugs for teenagers and children—including three unidentified infants who were given the drugs before their first birthdays. There have been no studies on the effects of psychotropic medications on the developing brain.

Dissociation, self-mutilation, fragmented memories, and amnesia generally do not respond to any of these medications. The Prozac study that I discussed in chapter 2 was the first to discover that traumatized civilians tend to respond much better to medications than do combat veterans. Since then other studies have found similar discrepancies. In this light it is worrisome that the Department of Defense and the Veterans Affairs (VA) prescribe enormous quantities of medications to combat soldiers and returning veterans, often without providing other forms of therapy. Between 2001 and 2011 the VA spent about $1.5 billion on Seroquel and Risperdal, while Defense spent about $90 million during the same period, even though a research paper published in 2001 showed that Risperdal was no more effective than a placebo in treating PTSD. Similarly, between 2001 and 2012 the VA spent $72.1 million and Defense spent $44.1 million on benzodiazepines — medications that clinicians generally avoid prescribing to civilians with PTSD because of their addiction potential and lack of significant effectiveness for PTSD symptoms.

Reference: Chapter 13 of 'The Body Keeps The Score' (by Bessel van der Kolk)
Saturday, October 8, 2022
Four Ways to Read a CSV in PySpark (v3.3.0)
Download Code
import seaborn as sns
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark import SparkContext
from pyspark.sql import SQLContext # Main entry point for SQL based DataFrame (other is Pandas based DataFrame) and SQL functionality.

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

import pyspark
print(pyspark.__version__)
3.3.0

Our input data looks like this:

with open('./input/student.csv', mode = 'r', encoding = 'utf8') as f:
    data = f.readlines()

import pandas as pd
df_student = pd.read_csv('./input/student.csv')

data
['sno,FirstName,LASTNAME\n', 'one,Ram,\n', 'two,,Sharma\n', 'three,Shyam,NA\n', 'four,Kabir,\n', 'five,NA,Singh\n']

df_student.head()

Tags: Technology, Spark

When you load a Pandas DataFrame by reading from a CSV, blank values and 'NA' values are converted to 'NaN' values by default as shown above.
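As a side note, this default can be switched off. A minimal sketch (mine, using pandas' documented keep_default_na flag) if you prefer to keep the raw strings:

# Sketch: keep '' and the literal string 'NA' instead of letting pandas turn them into NaN.
import pandas as pd

df_raw = pd.read_csv('./input/student.csv', keep_default_na=False)  # blanks and 'NA' stay as strings
df_nan = pd.read_csv('./input/student.csv')                         # default behaviour: both become NaN
print(df_raw.head())
print(df_nan.head())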
Way 1
PySpark's sqlCtx.createDataFrame() also results in an error on a Pandas DataFrame with null values.
df_student = pd.read_csv('./input/student.csv')
sdf = sqlCtx.createDataFrame(df_student)
TypeError: field FirstName: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

def clean_data(df):
    df.fillna('Not Applicable', inplace = True) # Handles blank and 'NA' values both.
    df = df.apply(lambda x: x.str.strip())
    df.columns = df.columns.str.lower()
    return df

df_student = clean_data(df_student)
df_student.fillna('Not Applicable', inplace = True) # Handles blank and 'NA' values both.
sdf = sqlCtx.createDataFrame(df_student)

type(sdf)
pyspark.sql.dataframe.DataFrame

sdf.show()

Way 2
New feature in 3.2.1 [ Ref ]

df = pyspark.pandas.read_csv('./input/student.csv')
# Error if 'pandas' package is not there in your version of 'pyspark'.
# AttributeError: module 'pyspark' has no attribute 'pandas'

from pyspark import pandas as ppd
df_student_pyspark = ppd.read_csv('./input/student.csv')

type(df_student_pyspark)
pyspark.pandas.frame.DataFrame

df_student_pyspark

Way 3
[ Ref ]

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# A CSV dataset is pointed to by path.
# The path can be either a single CSV file or a directory of CSV files
# path = "examples/src/main/resources/people.csv"
df = spark.read.option("header", True).csv('./input/student.csv')
df.show()

type(df)
pyspark.sql.dataframe.DataFrame

Way 4: Using the plain old RDD
Shane works on a data analytics project and needs to process users' event data (a UserLogs.csv file). Which of the below code snippets can be used to split the fields with a comma as a delimiter and fetch only the first two fields?

logsRDD = sc.textFile("/HDFSPATH/UserLogs.csv")
FieldsRDD = logsRDD.map(lambda r : r.split(",")).map(lambda r: (r[0], r[1]))
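For this post's own input file, the same RDD approach looks like the sketch below (mine; the header handling is added by me, since student.csv has a header row while the snippet above assumes a file without one):

# Sketch: RDD-based read of student.csv, keeping only the first two fields.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
lines = sc.textFile('./input/student.csv')
header = lines.first()
rows = (lines.filter(lambda r: r != header)   # drop the header line
             .map(lambda r: r.split(","))     # split on comma
             .map(lambda r: (r[0], r[1])))    # keep only sno and FirstName
print(rows.collect())
# Expected, given the data shown above:
# [('one', 'Ram'), ('two', ''), ('three', 'Shyam'), ('four', 'Kabir'), ('five', 'NA')]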
Installing PySpark on Ubuntu And Basic Testing (2022 Oct 8)
Tags: Technology, Spark

Contents of env.yml File
name: mh
channels:
  - conda-forge
dependencies:
  - python==3.9
  - pandas
  - pyspark
  - pip

Keeping the number of packages in dependencies to a bare minimum.
Conda takes over two hours to resolve the originally tried list of 13 dependencies.

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ conda env create -f env.yml
(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ conda activate mh

Testing
Error Prior to Java Installation
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ python
Python 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import pyspark
>>> pyspark.__version__
'3.3.0'
>>> import os
>>> os.environ['PYTHONPATH']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'PYTHONPATH'
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'SparkContext' is not defined
>>> from pyspark import SparkContext
>>> sc = SparkContext.getOrCreate()
JAVA_HOME is not set
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/context.py", line 483, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/context.py", line 195, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/context.py", line 417, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/java_gateway.py", line 106, in launch_gateway
    raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
>>>

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ java
Command 'java' not found, but can be installed with:
sudo apt install default-jre              # version 2:1.11-72build2, or
sudo apt install openjdk-11-jre-headless  # version 11.0.16+8-0ubuntu1~22.04
sudo apt install openjdk-17-jre-headless  # version 17.0.3+7-0ubuntu0.22.04.1
sudo apt install openjdk-18-jre-headless  # version 18~36ea-1
sudo apt install openjdk-8-jre-headless   # version 8u312-b07-0ubuntu1

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ sudo apt install openjdk-8-jre-headless
...

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ java -version
openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~22.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ echo $JAVA_HOME
EMPTY

(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$
(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ which java
/usr/bin/java
(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ readlink -f /usr/bin/java
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

Update the JAVA_HOME
(mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ sudo nano ~/.bashrc Add the following line at the end of the file: export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64" (mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ (mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ source ~/.bashrc (mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ echo $JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64 (mh) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ python Python 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39) [GCC 9.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pandas as pd >>> from pyspark import SparkContext >>> from pyspark.sql import SQLContext >>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] }) >>> sc = SparkContext.getOrCreate() 22/10/08 13:29:50 WARN Utils: Your hostname, ashish-Lenovo-ideapad-130-15IKB resolves to a loopback address: 127.0.1.1; using 192.168.1.129 instead (on interface wlp2s0) 22/10/08 13:29:50 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 22/10/08 13:29:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable >>> sqlCtx = SQLContext(sc) /home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead. warnings.warn() >>> sdf = sqlCtx.createDataFrame(df) /home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead. for column, series in pdf.iteritems(): /home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead. for column, series in pdf.iteritems(): >>> sdf.show() +----+----+ |col1|col2| +----+----+ |val1|val2| +----+----+ >>> >>> exit()
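As an aside, if editing ~/.bashrc is not an option, JAVA_HOME can also be set for the current process from Python itself, as long as it happens before the first SparkContext is created. A minimal sketch (mine), using the path found above via readlink:

# Sketch: set JAVA_HOME in-process before the JVM gateway is launched.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()
print(spark.version)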
Friday, October 7, 2022
Spark Installation on Windows (2022-Oct-07, Status Failure, Part 2)
Tags: Technology, Spark

The Issue
(mh) C:\Users\ashish>python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:30:19) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 17:30:26 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 17:30:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>

FRESH INSTALLATION
Checking Java
(mh) C:\Users\ashish>java -version
openjdk version "17.0.4" 2022-07-19 LTS
OpenJDK Runtime Environment Zulu17.36+14-SA (build 17.0.4+8-LTS)
OpenJDK 64-Bit Server VM Zulu17.36+14-SA (build 17.0.4+8-LTS, mixed mode, sharing)

~ ~ ~

Checking Previous Installation of PySpark Through Its CLI
(mh) C:\Users\ashish>pyspark
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
The system cannot find the path specified.
The system cannot find the path specified.
(mh) C:\Users\ashish>

~ ~ ~

Checking JAVA_HOME
(base) C:\Users\ashish>echo %JAVA_HOME% C:\Program Files\Zulu\zulu-17 ~ ~ ~ Microsoft Windows [Version 10.0.19042.2006] (c) Microsoft Corporation. All rights reserved. C:\Users\ashish>pyspark Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases. The system cannot find the path specified. The system cannot find the path specified. ~ ~ ~ (base) C:\Users\ashish>where python C:\Users\ashish\Anaconda3\python.exe C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe File: C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark2.cmd @echo off rem rem Licensed to the Apache Software Foundation (ASF) under one or more rem contributor license agreements. See the NOTICE file distributed with rem this work for additional information regarding copyright ownership. rem The ASF licenses this file to You under the Apache License, Version 2.0 rem (the "License"); you may not use this file except in compliance with rem the License. You may obtain a copy of the License at rem rem http://www.apache.org/licenses/LICENSE-2.0 rem rem Unless required by applicable law or agreed to in writing, software rem distributed under the License is distributed on an "AS IS" BASIS, rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. rem See the License for the specific language governing permissions and rem limitations under the License. rem rem Figure out where the Spark framework is installed call "%~dp0find-spark-home.cmd" call "%SPARK_HOME%\bin\load-spark-env.cmd" set _SPARK_CMD_USAGE=Usage: bin\pyspark.cmd [options] rem Figure out which Python to use. if "x%PYSPARK_DRIVER_PYTHON%"=="x" ( set PYSPARK_DRIVER_PYTHON=python if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON% ) set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH% set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.5-src.zip;%PYTHONPATH% set OLD_PYTHONSTARTUP=%PYTHONSTARTUP% set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py call "%SPARK_HOME%\bin\spark-submit2.cmd" pyspark-shell-main --name "PySparkShell" %* (base) C:\Users\ashish>echo %PATH% C:\Users\ashish\Anaconda3;C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;C:\Users\ashish\Anaconda3\Library\usr\bin;C:\Users\ashish\Anaconda3\Library\bin;C:\Users\ashish\Anaconda3\Scripts;C:\Users\ashish\Anaconda3\bin;C:\Users\ashish\Anaconda3\condabin;C:\Program Files\Zulu\zulu-17-jre\bin;C:\Program Files\Zulu\zulu-17\bin;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0;C:\windows\System32\OpenSSH;C:\Program Files\Git\cmd;C:\Users\ashish\Anaconda3;C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;C:\Users\ashish\Anaconda3\Library\usr\bin;C:\Users\ashish\Anaconda3\Library\bin;C:\Users\ashish\Anaconda3\Scripts;C:\Users\ashish\AppData\Local\Microsoft\WindowsApps;C:\Users\ashish\AppData\Local\Programs\Microsoft VS Code\bin;C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin;. (base) C:\Users\ashish>echo %PYTHONPATH% C:\Users\ashish\Anaconda3 (mh) C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin>pyspark Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:30:19) [MSC v.1929 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. 22/10/07 18:42:18 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. 
-see https://wiki.apache.org/hadoop/WindowsProblems Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 22/10/07 18:42:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.3.0 /_/ Using Python version 3.10.6 (main, Aug 22 2022 20:30:19) Spark context Web UI available at http://CHDSEZ344867L.ad.infosys.com:4040 Spark context available as 'sc' (master = local[*], app id = local-1665148340837). SparkSession available as 'spark'. >>> (base) C:\Users\ashish>where pyspark C:\Users\ashish\Anaconda3\Scripts\pyspark C:\Users\ashish\Anaconda3\Scripts\pyspark.cmd C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark.cmd (base) C:\Users\ashish>where pyspark C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin\pyspark.cmd DELETE THE FILES: # C:\Users\ashish\Anaconda3\Scripts\pyspark # C:\Users\ashish\Anaconda3\Scripts\pyspark.cmd THEN RUN AGAIN: (base) C:\Users\ashish>pyspark Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32 Type "help", "copyright", "credits" or "license" for more information. 22/10/07 18:44:58 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 22/10/07 18:44:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.3.0 /_/ Using Python version 3.9.12 (main, Apr 4 2022 05:22:27) Spark context Web UI available at http://CHDSEZ344867L.ad.infosys.com:4040 Spark context available as 'sc' (master = local[*], app id = local-1665148501551). SparkSession available as 'spark'. >>> ~ ~ ~ Microsoft Windows [Version 10.0.19042.2006] (c) Microsoft Corporation. All rights reserved. C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin>pyspark Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32 Warning: This Python interpreter is in a conda environment, but the environment has not been activated. Libraries may fail to load. To activate this environment please see https://conda.io/activation Type "help", "copyright", "credits" or "license" for more information. 22/10/07 18:54:48 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 22/10/07 18:54:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.3.0 /_/ Using Python version 3.9.12 (main, Apr 4 2022 05:22:27) Spark context Web UI available at http://CHDSEZ344867L.ad.infosys.com:4040 Spark context available as 'sc' (master = local[*], app id = local-1665149091125). SparkSession available as 'spark'. >>> import pandas as pd >>> from pyspark import SparkContext >>> from pyspark.sql import SQLContext >>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] }) >>> sc = SparkContext.getOrCreate() >>> sqlCtx = SQLContext(sc) C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\pyspark\sql\context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead. warnings.warn( >>> sdf = sqlCtx.createDataFrame(df) >>> sdf.show() Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases. 22/10/07 18:56:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:833) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708) at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:752) at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:686) at 
java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:652) at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:628) at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:585) at java.base/java.net.ServerSocket.accept(ServerSocket.java:538) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more 22/10/07 18:56:26 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (CHDSEZ344867L.ad.infosys.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:833) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708) at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:752) at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:686) at java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:652) at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:628) at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:585) at java.base/java.net.ServerSocket.accept(ServerSocket.java:538) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 
29 more
22/10/07 18:56:26 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\pyspark\sql\dataframe.py", line 606, in show
    print(self._jdf.showString(n, 20, vertical))
  File "C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9.5-src.zip\py4j\java_gateway.py", line 1321, in __call__
  File "C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\pyspark\sql\utils.py", line 190, in deco
    return f(*a, **kw)
  File "C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o62.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (CHDSEZ344867L.ad.infosys.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
  at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
  at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
  at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
  at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
  ... (repeated RDD.computeOrReadCheckpoint / RDD.iterator / MapPartitionsRDD.compute frames)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:136)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
  at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.SocketTimeoutException: Accept timed out
  at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708)
  at java.base/java.net.ServerSocket.accept(ServerSocket.java:538)
  at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)
  ... 29 more
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
  ... (DAGScheduler / SparkContext.runJob / SparkPlan.executeTake / Dataset.showString / py4j Gateway frames)
  at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
  at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
  at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
  ... (same executor-side frames as above)
Caused by: java.net.SocketTimeoutException: Accept timed out
  ... 29 more
INSTALL SCALA:
https://www.scala-lang.org/download/
~ ~ ~
C:\Users\ashish\Desktop\spark-3.3.0-bin-hadoop3\bin>echo %PYTHONPATH%
C:\Users\ashish\Anaconda3

(mh) C:\Users\ashish>python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:30:19) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
~ ~ ~
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
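The "Python was not found" message above is typically printed by the Microsoft Store alias stub (the python.exe under AppData\Local\Microsoft\WindowsApps), which suggests the Spark worker is not launching the Anaconda interpreter at all. Before considering the version-compatibility explanation quoted below, it can help to confirm which interpreters Windows actually resolves; the same check is repeated interactively later in this log. A minimal sketch (run from any working Python prompt; the paths it prints on this machine are only for illustration):

# Sketch: list the python.exe entries Windows will resolve, in PATH order.
# If the WindowsApps alias stub appears before the Anaconda interpreter,
# disable it under Settings > Apps > App execution aliases, or reorder PATH.
import shutil
import subprocess

print(shutil.which("python"))        # the first python.exe found on PATH
subprocess.run(["where", "python"])  # all matches, in resolution order (Windows only)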
According to an explanation from edureka (07-Apr-2020), this error occurs when the installed Python version is not compatible with the PySpark version; the suggested fix is to check the Python version and align it with what PySpark supports. The conda environment created below was an attempt to test that theory.
~ ~ ~
(base) C:\Users\ashish\Desktop>conda env create -f menv.yml
Collecting package metadata (repodata.json): done
Solving environment: done
Downloading and Extracting Packages
debugpy-1.6.3 | 3.2 MB | ### | 100% kiwisolver-1.4.4 | 61 KB | ### | 100% jupyter_core-4.11.1 | 106 KB | ### | 100% regex-2022.9.13 | 331 KB | ### | 100% scikit-learn-1.1.2 | 7.5 MB | ### | 100% cffi-1.15.1 | 223 KB | ### | 100% typing_extensions-4. | 29 KB | ### | 100% argon2-cffi-bindings | 35 KB | ### | 100% scipy-1.9.1 | 28.3 MB | ### | 100% markupsafe-2.1.1 | 25 KB | ### | 100% click-8.1.3 | 146 KB | ### | 100% pandas-1.5.0 | 11.7 MB | ### | 100% unicodedata2-14.0.0 | 493 KB | ### | 100% sip-6.6.2 | 519 KB | ### | 100% python-3.8.0 | 18.8 MB | ### | 100% gensim-4.2.0 | 22.4 MB | ### | 100% statsmodels-0.13.2 | 10.3 MB | ### | 100% tornado-6.2 | 655 KB | ### | 100% importlib-metadata-4 | 33 KB | ### | 100% pywin32-303 | 6.9 MB | ### | 100% pyqt-5.15.7 | 4.7 MB | ### | 100% jupyter-1.0.0 | 7 KB | ### | 100% pyqt5-sip-12.11.0 | 82 KB | ### | 100% matplotlib-3.6.0 | 7 KB | ### | 100% matplotlib-base-3.6. | 7.5 MB | ### | 100% psutil-5.9.2 | 367 KB | ### | 100% pyrsistent-0.18.1 | 85 KB | ### | 100% pywinpty-2.0.8 | 234 KB | ### | 100% pillow-9.2.0 | 44.9 MB | ### | 100% pyarrow-6.0.0 | 2.4 MB | ### | 100% numpy-1.23.3 | 6.3 MB | ### | 100% fonttools-4.37.4 | 1.7 MB | ### | 100% contourpy-1.0.5 | 176 KB | ### | 100% python_abi-3.8 | 4 KB | ### | 100% sqlite-3.39.4 | 658 KB | ### | 100% pyzmq-24.0.1 | 461 KB | ### | 100% arrow-cpp-6.0.0 | 15.7 MB | ### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Installing pip dependencies: \ Ran pip subprocess with arguments: ['C:\\Users\\ashish\\Anaconda3\\envs\\mh\\python.exe', '-m', 'pip', 'install', '-U', '-r', 'C:\\Users\\ashish\\Desktop\\condaenv.tl6wm33z.requirements.txt']
Pip subprocess output:
Collecting rpy2==3.4.5
Using cached rpy2-3.4.5.tar.gz (194 kB)
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: cffi>=1.10.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (1.15.1)
Requirement already satisfied: jinja2 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (3.1.2)
Requirement already satisfied: pytz in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (2022.4)
Collecting tzlocal
Using cached tzlocal-4.2-py3-none-any.whl (19 kB)
Requirement already satisfied: pycparser in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from cffi>=1.10.0->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (2.21)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from jinja2->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.tl6wm33z.requirements.txt (line 1)) (2.1.1)
Collecting backports.zoneinfo
Downloading backports.zoneinfo-0.2.1-cp38-cp38-win_amd64.whl (38 kB)
Collecting pytz-deprecation-shim
Using cached pytz_deprecation_shim-0.1.0.post0-py2.py3-none-any.whl (15 kB)
Collecting tzdata
Using cached tzdata-2022.4-py2.py3-none-any.whl (336 kB)
Building wheels for collected packages: rpy2
Building wheel for rpy2 (setup.py): started
Building wheel for rpy2 (setup.py): finished with status 'done'
Created wheel for rpy2: filename=rpy2-3.4.5-py3-none-any.whl size=198845 sha256=f7220847e02f729bd39188f16026ac01855f88cb2c10c3dd68cf5856fc560b6c
Stored in directory: c:\users\ashish\appdata\local\pip\cache\wheels\57\e2\f0\64c7640f82ba9a23777a25c05d2552fa2991eee7ec2cf9b216
Successfully built rpy2
Installing collected packages: tzdata, backports.zoneinfo, pytz-deprecation-shim, tzlocal, rpy2
Successfully installed backports.zoneinfo-0.2.1 pytz-deprecation-shim-0.1.0.post0 rpy2-3.4.5 tzdata-2022.4 tzlocal-4.2
done
#
# To activate this environment, use
#
#     $ conda activate mh
#
# To deactivate an active environment, use
#
#     $ conda deactivate
Retrieving notices: ...working... done
(base) C:\Users\ashish\Desktop>
~ ~ ~
(mh) C:\Users\ashish\Desktop>python
Python 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:04:36) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> pyspark.__version__
'3.3.0'
>>> exit()
~ ~ ~
(mh) C:\Users\ashish\Desktop>python
Python 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:04:36) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 20:04:54 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 20:04:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  warnings.warn(
>>> sdf = sqlCtx.createDataFrame(df)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\pandas\conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\pandas\conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 20:05:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
...
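As an aside, the FutureWarning above notes that SQLContext has been deprecated since Spark 3.0. Below is a minimal sketch of the SparkSession-based equivalent of the session above, using the same toy DataFrame; this is an illustration, not part of the original log, and sdf.show() will still fail until the Python-worker issue itself is resolved.

import pandas as pd
from pyspark.sql import SparkSession

# SparkSession replaces the deprecated SQLContext entry point in Spark 3.x.
spark = SparkSession.builder.appName("pandas-to-spark-demo").getOrCreate()

df = pd.DataFrame({"col1": ["val1"], "col2": ["val2"]})
sdf = spark.createDataFrame(df)   # same conversion, no SQLContext needed
sdf.show()                        # still fails here until the worker Python is fixed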
CONTENTS OF FILE 'menv.yml':
name: mh
channels:
  - conda-forge
dependencies:
  - python==3.7
  - pandas
  - seaborn
  - scikit-learn
  - matplotlib
  - ipykernel
  - jupyter
  - pyspark
  - gensim
  - nltk
  - scipy
  - pip
  - pip:
    - rpy2==3.4.5

(base) C:\Users\ashish\Desktop>conda env create -f menv.yml
Collecting package metadata (repodata.json): done
Solving environment: done
Downloading and Extracting Packages
libthrift-0.16.0 | 877 KB | ### | 100% pandas-1.3.5 | 10.9 MB | ### | 100% debugpy-1.6.3 | 3.2 MB | ### | 100% python-3.7.0 | 21.0 MB | ### | 100% argon2-cffi-bindings | 34 KB | ### | 100% aws-c-event-stream-0 | 47 KB | ### | 100% fonttools-4.37.4 | 1.7 MB | ### | 100% ipython-7.33.0 | 1.2 MB | ### | 100% gensim-4.2.0 | 22.4 MB | ### | 100% jupyter-1.0.0 | 7 KB | ### | 100% setuptools-59.8.0 | 1.0 MB | ### | 100% aws-checksums-0.1.11 | 51 KB | ### | 100% pillow-9.2.0 | 45.4 MB | ### | 100% libprotobuf-3.21.7 | 2.4 MB | ### | 100% regex-2022.9.13 | 343 KB | ### | 100% psutil-5.9.2 | 363 KB | ### | 100% pywinpty-2.0.8 | 235 KB | ### | 100% statsmodels-0.13.2 | 10.5 MB | ### | 100% glog-0.6.0 | 95 KB | ### | 100% matplotlib-3.5.3 | 7 KB | ### | 100% aws-c-cal-0.5.11 | 36 KB | ### | 100% aws-c-common-0.6.2 | 159 KB | ### | 100% pyarrow-9.0.0 | 2.8 MB | ### | 100% pyrsistent-0.18.1 | 84 KB | ### | 100% libgoogle-cloud-2.2. | 10 KB | ### | 100% aws-sdk-cpp-1.8.186 | 5.5 MB | ### | 100% aws-c-io-0.10.5 | 127 KB | ### | 100% pyzmq-24.0.1 | 457 KB | ### | 100% libabseil-20220623.0 | 1.6 MB | ### | 100% jupyter_core-4.11.1 | 105 KB | ### | 100% matplotlib-base-3.5. | 7.4 MB | ### | 100% arrow-cpp-9.0.0 | 19.7 MB | ### | 100% pywin32-303 | 7.0 MB | ### | 100% typing-extensions-4. | 8 KB | ### | 100% libcrc32c-1.1.2 | 25 KB | ### | 100% cffi-1.15.1 | 222 KB | ### | 100% grpc-cpp-1.47.1 | 28.0 MB | ### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Installing pip dependencies: | Ran pip subprocess with arguments: ['C:\\Users\\ashish\\Anaconda3\\envs\\mh\\python.exe', '-m', 'pip', 'install', '-U', '-r', 'C:\\Users\\ashish\\Desktop\\condaenv.yn5zpyut.requirements.txt']
Pip subprocess output:
Collecting rpy2==3.4.5
Using cached rpy2-3.4.5.tar.gz (194 kB)
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: cffi>=1.10.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (1.15.1)
Requirement already satisfied: jinja2 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (3.1.2)
Requirement already satisfied: pytz in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (2022.4)
Collecting tzlocal
Using cached tzlocal-4.2-py3-none-any.whl (19 kB)
Requirement already satisfied: pycparser in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from cffi>=1.10.0->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (2.21)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from jinja2->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.yn5zpyut.requirements.txt (line 1)) (2.1.1)
Collecting backports.zoneinfo
Downloading backports.zoneinfo-0.2.1-cp37-cp37m-win_amd64.whl (38 kB)
Collecting pytz-deprecation-shim
Using cached pytz_deprecation_shim-0.1.0.post0-py2.py3-none-any.whl (15 kB)
Collecting tzdata
Using cached tzdata-2022.4-py2.py3-none-any.whl (336 kB)
Building wheels for collected packages: rpy2
Building wheel for rpy2 (setup.py): started
Building wheel for rpy2 (setup.py): finished with status 'done'
Created wheel for rpy2: filename=rpy2-3.4.5-py3-none-any.whl size=198859 sha256=eb9ac7fe7a3a2109be582d2cae21640c03e1164a55bceda048c24047df75e945
Stored in directory: c:\users\ashish\appdata\local\pip\cache\wheels\46\00\c5\a43320afe86e7540d16d7f07cf4d29547d98921e76ea9f2f7a
Successfully built rpy2
Installing collected packages: tzdata, backports.zoneinfo, pytz-deprecation-shim, tzlocal, rpy2
Successfully installed backports.zoneinfo-0.2.1 pytz-deprecation-shim-0.1.0.post0 rpy2-3.4.5 tzdata-2022.4 tzlocal-4.2
done
#
# To activate this environment, use
#
#     $ conda activate mh
#
# To deactivate an active environment, use
#
#     $ conda deactivate
Retrieving notices: ...working... done

(base) C:\Users\ashish\Desktop>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

(base) C:\Users\ashish\Desktop>conda activate mh

(mh) C:\Users\ashish\Desktop>python
Python 3.7.0 | packaged by conda-forge | (default, Nov 12 2018, 20:47:31) [MSC v.1900 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 21:02:01 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 21:02:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\context.py:114: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  FutureWarning,
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 21:02:49 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
CHECKING THE ENVIRONMENT VARIABLES THROUGH THE 'os' PACKAGE:
>>> import os
>>> os.environ['PATH']
'C:\\Users\\ashish\\Anaconda3\\envs\\mh;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\Library\\mingw-w64\\bin;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\Library\\usr\\bin;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\Library\\bin;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\Scripts;C:\\Users\\ashish\\Anaconda3\\envs\\mh\\bin;C:\\Users\\ashish\\Anaconda3\\condabin;C:\\Program Files\\Zulu\\zulu-17-jre\\bin;C:\\Program Files\\Zulu\\zulu-17\\bin;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0;C:\\windows\\System32\\OpenSSH;C:\\Program Files\\Git\\cmd;C:\\Users\\ashish\\Anaconda3;C:\\Users\\ashish\\Anaconda3\\Library\\mingw-w64\\bin;C:\\Users\\ashish\\Anaconda3\\Library\\usr\\bin;C:\\Users\\ashish\\Anaconda3\\Library\\bin;C:\\Users\\ashish\\Anaconda3\\Scripts;C:\\Users\\ashish\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\ashish\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\ashish\\Desktop\\spark-3.3.0-bin-hadoop3\\bin;.'
>>> os.environ['PYTHONPATH']
'C:\\Users\\ashish\\Anaconda3'
>>> os.system("where python")
C:\Users\ashish\Anaconda3\envs\mh\python.exe
C:\Users\ashish\Anaconda3\python.exe
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe
0

(base) C:\Users\ashish>conda activate mh

(mh) C:\Users\ashish>python
Python 3.7.0 | packaged by conda-forge | (default, Nov 12 2018, 20:47:31) [MSC v.1900 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import os
>>> os.environ["PYTHONPATH"]
'C:\\Users\\ashish\\Anaconda3\\envs\\mh'
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 21:20:00 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 21:20:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\context.py:114: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  FutureWarning,
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 21:20:45 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
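The output above shows that the Microsoft Store alias stub (the WindowsApps python.exe) is still on PATH, and Spark is evidently not launching the conda interpreter for its worker processes. A commonly suggested remedy for "Python worker failed to connect back" on Windows is to point Spark explicitly at the interpreter of the active environment through the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables before the SparkContext/SparkSession is created. The sketch below is not taken from the original log; it simply assumes the 'mh' environment is active and reuses the same toy DataFrame:

import os
import sys
import pandas as pd
from pyspark.sql import SparkSession

# Tell Spark which Python to launch for its workers (and for the driver).
# sys.executable is the interpreter of the currently active conda env ('mh' here).
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = SparkSession.builder.appName("worker-python-check").getOrCreate()
sdf = spark.createDataFrame(pd.DataFrame({"col1": ["val1"], "col2": ["val2"]}))
sdf.show()  # workers should now use the conda interpreter, not the WindowsApps stub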
Installation of Elephas (for distributed deep learning) on Ubuntu through archives (Apr 2020)
Elephas is an extension of Keras that allows you to run distributed deep learning models at scale with Spark. Elephas currently supports a number of applications, including:
- Data-parallel training of deep learning models
- Distributed hyper-parameter optimization
- Distributed training of ensemble models
(A minimal, hedged sketch of what data-parallel training with Elephas looks like is given after the references at the end of this section.)
Schematically, Elephas works as illustrated by the architecture diagram in the original post (not reproduced here).
The packages listed below are required on top of the Anaconda distribution. The following commands go in a shell (.sh) script on Ubuntu or a .bat script on Windows:
pip install Keras_Applications-1.0.8.tar.gz
pip install keras-team-keras-preprocessing-1.1.0-0-gff90696.tar.gz
pip install Keras-2.3.1.tar.gz
pip install hyperopt-0.2.4-py2.py3-none-any.whl
pip install hyperas-0.4.1-py3-none-any.whl
pip install tensorflow_estimator-2.1.0-py2.py3-none-any.whl
pip install grpcio-1.28.1-cp37-cp37m-manylinux2010_x86_64.whl
pip install protobuf-3.11.3-cp37-cp37m-manylinux1_x86_64.whl
pip install gast-0.3.3.tar.gz
pip install opt_einsum-3.2.1.tar.gz
pip install astor-0.8.1.tar.gz
pip install absl-py-0.9.0.tar.gz
pip install cachetools-4.1.0.tar.gz
pip install pyasn1-0.4.8.tar.gz
pip install pyasn1-modules-0.2.8.tar.gz
pip install rsa-4.0.tar.gz
pip install google-auth-1.14.1.tar.gz
pip install oauthlib-3.1.0.tar.gz
pip install requests-oauthlib-1.3.0.tar.gz
pip install google-auth-oauthlib-0.4.1.tar.gz
pip install Markdown-3.2.1.tar.gz
pip install tensorboard-2.1.1-py3-none-any.whl
pip install google-pasta-0.2.0.tar.gz
pip install gast-0.2.2.tar.gz
pip install termcolor-1.1.0.tar.gz
pip install tensorflow-2.1.0-cp37-cp37m-manylinux2010_x86_64.whl
pip install pypandoc-1.5.tar.gz
pip install py4j-0.10.7.zip
pip install pyspark-2.4.5.tar.gz
pip install elephas-0.4.3-py3-none-any.whl
Generated .whl files are cached in this directory (here 'ashish' is the username):
/home/ashish/.cache/pip/wheels
A few packages could not be installed at their latest release labels:
# tensorflow-estimator [2.2.0, >=2.1.0rc0] (required by tensorflow==2.1.0); latest available is 2.2.0
# pip install gast-0.3.3.tar.gz
# pip install py4j-0.10.9.tar.gz
Most of these packages are required by TensorFlow, except:
1. hyperopt-0.2.4-py2.py3-none-any.whl
2. hyperas-0.4.1-py3-none-any.whl
3. pypandoc-1.5.tar.gz
4. py4j-0.10.7.zip
5. pyspark-2.4.5.tar.gz
All of the packages are available in the accompanying Google Drive link, except TensorFlow and PySpark, which were left out due to their sizes (PySpark: 207 MB, TensorFlow: 402 MB).
Running the shell script a second time uninstalls and reinstalls every package. Here is a Python script that avoids this by installing only the packages that are not already present:
import sys
import subprocess
import pkg_resources

# Top-level packages to check for; the set can be extended to match the archive script.
required = {'pyspark', 'scipy', 'tensorflow'}
# pkg_resources.working_set lists every distribution visible to this interpreter.
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

References:
1. Elephas Documentation
2. GitHub Repository
Tags: Technology, Deep Learning, Machine Learning, Big Data
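For completeness, this is roughly what the data-parallel training mentioned above looks like with Elephas 0.4.x. The snippet follows the example in the Elephas documentation; the exact function and parameter names (to_simple_rdd, SparkModel, frequency, mode) should be verified against the installed version, and the random training data is only a stand-in:

import numpy as np
from pyspark import SparkConf, SparkContext
from keras.models import Sequential        # standalone Keras 2.3.1, as installed above
from keras.layers import Dense
from elephas.utils.rdd_utils import to_simple_rdd
from elephas.spark_model import SparkModel

sc = SparkContext(conf=SparkConf().setAppName("elephas-sketch"))

# Dummy data standing in for a real dataset.
x_train = np.random.rand(1000, 20)
y_train = (np.random.rand(1000) > 0.5).astype("float32")

# A small Keras model to be trained in data-parallel fashion on the Spark workers.
model = Sequential([Dense(64, activation="relu", input_dim=20),
                    Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

rdd = to_simple_rdd(sc, x_train, y_train)                       # distribute the training data
spark_model = SparkModel(model, frequency="epoch", mode="asynchronous")
spark_model.fit(rdd, epochs=10, batch_size=32, verbose=0, validation_split=0.1)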