Sunday, October 23, 2022

Creating a Two-Node Spark Cluster With Two Worker Nodes and One Master Node Using Spark's Standalone Resource Manager on Ubuntu Machines

1.1 Setting up Java

(base) ashish@ashishdesktop:~$ java
Command 'java' not found, but can be installed with:
sudo apt install default-jre              # version 2:1.11-72build2, or
sudo apt install openjdk-11-jre-headless  # version 11.0.16+8-0ubuntu1~22.04
sudo apt install openjdk-17-jre-headless  # version 17.0.3+7-0ubuntu0.22.04.1
sudo apt install openjdk-18-jre-headless  # version 18~36ea-1
sudo apt install openjdk-8-jre-headless   # version 8u312-b07-0ubuntu1

(base) ashish@ashishdesktop:~$ sudo apt install openjdk-8-jre-headless

1.2 Setting environment variable JAVA_HOME

(base) ashish@ashishdesktop:~$ readlink -f /usr/bin/java
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

(base) ashish@ashishdesktop:~$ sudo nano ~/.bashrc
(base) ashish@ashishdesktop:~$ tail ~/.bashrc
        . "/home/ashish/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/ashish/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

(base) ashish@ashishdesktop:~$ source ~/.bashrc
(base) ashish@ashishdesktop:~$ echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64

2. Setting up Scala

(base) ashish@ashishdesktop:~$ sudo apt-get install scala

3. Checking The Previous Standalone PySpark Installation on Laptop

(base) ashish@ashishlaptop:~$ pyspark
pyspark: command not found
(base) ashish@ashishlaptop:~$ conda activate mh
(mh) ashish@ashishlaptop:~$ pyspark
Python 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 22:45:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.9.0 (default, Nov 26 2020 07:57:39)
Spark context Web UI available at http://ashishlaptop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666458948998).
SparkSession available as 'spark'.
>>>

4. Download Spark Archive

The one we are using: dlcdn.apache.org: spark-3.3.0-bin-hadoop3.tgz
Link to the broader download site: spark.apache.org

In terminal:
$ wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
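Optionally, the archive can be verified before extraction. Apache publishes a .sha512 checksum file next to each release artifact; the URL below is an assumption derived from the archive URL above, shown only as a sketch:

$ wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.sha512
$ sha512sum spark-3.3.0-bin-hadoop3.tgz   # compare this hash against the contents of the .sha512 file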

5. Setting up Spark From The Archive

5.1. Extracting the software from archive

(base) ashish@ashishlaptop:~/Desktop$ tar xvf spark-3.3.0-bin-hadoop3.tgz

5.2. Moving the software to the local installation directory

(base) ashish@ashishlaptop:~/Desktop$ sudo mv spark-3.3.0-bin-hadoop3 /usr/local/spark
[sudo] password for ashish:
(base) ashish@ashishlaptop:~/Desktop$ cd /usr/local
(base) ashish@ashishlaptop:/usr/local$ ls -l
total 36
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 bin
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 etc
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 games
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 include
drwxr-xr-x  4 root   root   4096 Oct  8 11:10 lib
lrwxrwxrwx  1 root   root      9 Aug 26 09:40 man -> share/man
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 sbin
drwxr-xr-x  7 root   root   4096 Aug  9 17:21 share
drwxr-xr-x 13 ashish ashish 4096 Jun 10 02:07 spark
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 src

(base) ashish@ashishlaptop:/usr/local$ cd spark
(base) ashish@ashishlaptop:/usr/local/spark$ ls
bin  conf  data  examples  jars  kubernetes  LICENSE  licenses  NOTICE  python  R  README.md  RELEASE  sbin  yarn
(base) ashish@ashishlaptop:/usr/local/spark$

5.3. Including Spark binaries in the environment.

$ sudo gedit ~/.bashrc

(base) ashish@ashishlaptop:/usr/local/spark$ tail ~/.bashrc
        export PATH="/home/ashish/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
export PATH="/home/ashish/.local/bin:$PATH"
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export PATH="$PATH:/usr/local/spark/bin"

(base) ashish@ashishlaptop:/usr/local/spark$ source ~/.bashrc
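To summarize, these are roughly the lines both machines end up needing in ~/.bashrc. SPARK_HOME and the sbin entry are not set in the transcript above; they are shown only as optional conveniences:

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export SPARK_HOME="/usr/local/spark"                              # optional, not set in the transcript above
export PATH="$PATH:/usr/local/spark/bin:/usr/local/spark/sbin"    # sbin is optional; it puts start-all.sh etc. on PATH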

6. Checking installation on laptop (and then on desktop after proper setup on it)

6.1. spark-shell (Scala based shell) on Laptop

(base) ashish@ashishlaptop:/usr/local/spark$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 23:44:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://ashishlaptop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666462455694).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_342)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sys.exit

6.2. pyspark (Python based shell) on Laptop

(base) ashish@ashishlaptop:/usr/local/spark$ pyspark
Python 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 23:46:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.9.12 (main, Apr 5 2022 06:56:58)
Spark context Web UI available at http://ashishlaptop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666462561785).
SparkSession available as 'spark'.
>>> exit()

6.3. spark-shell (Scala based shell) on Desktop

(base) ashish@ashishdesktop:~$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 23:54:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://ashishdesktop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666463078583).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_342)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sys.exit

6.4. pyspark (Python based shell) on Desktop

(base) ashish@ashishdesktop:~$ pyspark
Python 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 23:55:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://ashishdesktop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666463110370).
SparkSession available as 'spark'.
>>> exit()
(base) ashish@ashishdesktop:~$

7. Configure Worker Nodes in Spark's Configuration

(base) ashish@ashishlaptop:/usr/local/spark/conf$ ls -l
total 36
-rw-r--r-- 1 ashish ashish 1105 Jun 10 02:07 fairscheduler.xml.template
-rw-r--r-- 1 ashish ashish 3350 Jun 10 02:07 log4j2.properties.template
-rw-r--r-- 1 ashish ashish 9141 Jun 10 02:07 metrics.properties.template
-rw-r--r-- 1 ashish ashish 1292 Jun 10 02:07 spark-defaults.conf.template
-rwxr-xr-x 1 ashish ashish 4506 Jun 10 02:07 spark-env.sh.template
-rw-r--r-- 1 ashish ashish  865 Jun 10 02:07 workers.template

(base) ashish@ashishlaptop:/usr/local/spark/conf$ cat workers.template
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
localhost

(base) ashish@ashishlaptop:/usr/local/spark/conf$ cp workers.template workers
(base) ashish@ashishlaptop:/usr/local/spark/conf$ ls -l
total 40
-rw-r--r-- 1 ashish ashish 1105 Jun 10 02:07 fairscheduler.xml.template
-rw-r--r-- 1 ashish ashish 3350 Jun 10 02:07 log4j2.properties.template
-rw-r--r-- 1 ashish ashish 9141 Jun 10 02:07 metrics.properties.template
-rw-r--r-- 1 ashish ashish 1292 Jun 10 02:07 spark-defaults.conf.template
-rwxr-xr-x 1 ashish ashish 4506 Jun 10 02:07 spark-env.sh.template
-rw-r--r-- 1 ashish ashish  865 Jun 10 02:07 workers
-rw-r--r-- 1 ashish ashish  865 Oct 23 12:58 workers.template

(base) ashish@ashishlaptop:/usr/local/spark/conf$ nano workers
(base) ashish@ashishlaptop:/usr/local/spark/conf$ cat workers
ashishdesktop
ashishlaptop

(base) ashish@ashishlaptop:/usr/local/spark/conf$ cp spark-env.sh.template spark-env.sh
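For reference, after this step the two files under /usr/local/spark/conf look roughly as below. The spark-env.sh lines are optional additions (JAVA_HOME and SPARK_MASTER_HOST are standard spark-env.sh variables, but the transcript above leaves the file as a plain copy of the template):

# conf/workers
ashishdesktop
ashishlaptop

# conf/spark-env.sh (optional additions, not made above)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_MASTER_HOST=ashishlaptop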

8. Starting Driver/Master and Worker Nodes Using Script 'start-all.sh'

(base) ashish@ashishlaptop:/usr/local/spark/sbin$ source start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out
ashishlaptop: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishlaptop.out
ashishdesktop: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishdesktop.out

(base) ashish@ashishlaptop:/usr/local/spark/logs$ ls -l
total 12
-rw-rw-r-- 1 ashish ashish 2005 Oct 23 13:54 spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out
-rw-rw-r-- 1 ashish ashish 2340 Oct 23 13:47 spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out.1
-rw-rw-r-- 1 ashish ashish 2485 Oct 23 13:53 spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishlaptop.out
(base) ashish@ashishlaptop:/usr/local/spark/logs$
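A quick way to confirm the cluster came up as expected (hostnames and ports match the logs below): check the daemon processes on each machine, open the master web UI, and attach a shell to the master URL. A minimal sketch:

$ ps -ef | grep org.apache.spark.deploy | grep -v grep   # Master + Worker on ashishlaptop, Worker on ashishdesktop
$ spark-shell --master spark://ashishlaptop:7077         # the application should then appear on http://ashishlaptop:8080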

Master logs

(base) ashish@ashishlaptop:/usr/local/spark/logs$ cat spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host ashishlaptop --port 7077 --webui-port 8080 ======================================== Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties 22/10/23 13:53:51 INFO Master: Started daemon with process name: 9224@ashishlaptop 22/10/23 13:53:51 INFO SignalUtils: Registering signal handler for TERM 22/10/23 13:53:51 INFO SignalUtils: Registering signal handler for HUP 22/10/23 13:53:51 INFO SignalUtils: Registering signal handler for INT 22/10/23 13:53:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/10/23 13:53:51 INFO SecurityManager: Changing view acls to: ashish 22/10/23 13:53:51 INFO SecurityManager: Changing modify acls to: ashish 22/10/23 13:53:51 INFO SecurityManager: Changing view acls groups to: 22/10/23 13:53:51 INFO SecurityManager: Changing modify acls groups to: 22/10/23 13:53:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set() 22/10/23 13:53:52 INFO Utils: Successfully started service 'sparkMaster' on port 7077. 22/10/23 13:53:52 INFO Master: Starting Spark master at spark://ashishlaptop:7077 22/10/23 13:53:52 INFO Master: Running Spark version 3.3.0 22/10/23 13:53:52 INFO Utils: Successfully started service 'MasterUI' on port 8080. 22/10/23 13:53:52 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://ashishlaptop:8080 22/10/23 13:53:53 INFO Master: I have been elected leader! New state: ALIVE 22/10/23 13:53:56 INFO Master: Registering worker 192.168.1.142:43143 with 4 cores, 10.6 GiB RAM 22/10/23 13:54:00 INFO Master: Registering worker 192.168.1.106:44471 with 2 cores, 1024.0 MiB RAM (base) ashish@ashishlaptop:/usr/local/spark/logs$

Worker Logs From 'ashishlaptop'

(base) ashish@ashishlaptop:/usr/local/spark/logs$ cat spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishlaptop.out Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ashishlaptop:7077 ======================================== Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties 22/10/23 13:53:54 INFO Worker: Started daemon with process name: 9322@ashishlaptop 22/10/23 13:53:54 INFO SignalUtils: Registering signal handler for TERM 22/10/23 13:53:54 INFO SignalUtils: Registering signal handler for HUP 22/10/23 13:53:54 INFO SignalUtils: Registering signal handler for INT 22/10/23 13:53:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/10/23 13:53:55 INFO SecurityManager: Changing view acls to: ashish 22/10/23 13:53:55 INFO SecurityManager: Changing modify acls to: ashish 22/10/23 13:53:55 INFO SecurityManager: Changing view acls groups to: 22/10/23 13:53:55 INFO SecurityManager: Changing modify acls groups to: 22/10/23 13:53:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set() 22/10/23 13:53:55 INFO Utils: Successfully started service 'sparkWorker' on port 43143. 22/10/23 13:53:55 INFO Worker: Worker decommissioning not enabled. 22/10/23 13:53:56 INFO Worker: Starting Spark worker 192.168.1.142:43143 with 4 cores, 10.6 GiB RAM 22/10/23 13:53:56 INFO Worker: Running Spark version 3.3.0 22/10/23 13:53:56 INFO Worker: Spark home: /usr/local/spark 22/10/23 13:53:56 INFO ResourceUtils: ============================================================== 22/10/23 13:53:56 INFO ResourceUtils: No custom resources configured for spark.worker. 22/10/23 13:53:56 INFO ResourceUtils: ============================================================== 22/10/23 13:53:56 INFO Utils: Successfully started service 'WorkerUI' on port 8081. 22/10/23 13:53:56 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://ashishlaptop:8081 22/10/23 13:53:56 INFO Worker: Connecting to master ashishlaptop:7077... 22/10/23 13:53:56 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 46 ms (0 ms spent in bootstraps) 22/10/23 13:53:56 INFO Worker: Successfully registered with master spark://ashishlaptop:7077 (base) ashish@ashishlaptop:/usr/local/spark/logs$

Worker Logs From 'ashishdesktop'

(base) ashish@ashishdesktop:/usr/local/spark/logs$ cat spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishdesktop.out Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ashishlaptop:7077 ======================================== Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties 22/10/23 13:53:56 INFO Worker: Started daemon with process name: 19475@ashishdesktop 22/10/23 13:53:56 INFO SignalUtils: Registering signal handler for TERM 22/10/23 13:53:56 INFO SignalUtils: Registering signal handler for HUP 22/10/23 13:53:56 INFO SignalUtils: Registering signal handler for INT 22/10/23 13:53:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/10/23 13:53:57 INFO SecurityManager: Changing view acls to: ashish 22/10/23 13:53:57 INFO SecurityManager: Changing modify acls to: ashish 22/10/23 13:53:57 INFO SecurityManager: Changing view acls groups to: 22/10/23 13:53:57 INFO SecurityManager: Changing modify acls groups to: 22/10/23 13:53:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set() 22/10/23 13:53:58 INFO Utils: Successfully started service 'sparkWorker' on port 44471. 22/10/23 13:53:58 INFO Worker: Worker decommissioning not enabled. 22/10/23 13:53:59 INFO Worker: Starting Spark worker 192.168.1.106:44471 with 2 cores, 1024.0 MiB RAM 22/10/23 13:53:59 INFO Worker: Running Spark version 3.3.0 22/10/23 13:53:59 INFO Worker: Spark home: /usr/local/spark 22/10/23 13:53:59 INFO ResourceUtils: ============================================================== 22/10/23 13:53:59 INFO ResourceUtils: No custom resources configured for spark.worker. 22/10/23 13:53:59 INFO ResourceUtils: ============================================================== 22/10/23 13:53:59 INFO Utils: Successfully started service 'WorkerUI' on port 8081. 22/10/23 13:54:00 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://ashishdesktop:8081 22/10/23 13:54:00 INFO Worker: Connecting to master ashishlaptop:7077... 22/10/23 13:54:00 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 157 ms (0 ms spent in bootstraps) 22/10/23 13:54:00 INFO Worker: Successfully registered with master spark://ashishlaptop:7077

9. Issue Resolution

Prompt for password when launching worker nodes using script 'start-all.sh'

(base) ashish@ashishlaptop:/usr/local/spark/sbin$ source start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out
ashishlaptop: Warning: Permanently added 'ashishlaptop' (ED25519) to the list of known hosts.
ashish@ashishlaptop's password:
ashishdesktop: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishdesktop.out
ashishlaptop: Connection closed by 192.168.1.142 port 22
(base) ashish@ashishlaptop:/usr/local/spark/sbin$

Debugging

DOING SSH FROM 'ASHISHLAPTOP' TO 'ASHISHLAPTOP' (NOT A TYPO) STILL RESULTS IN A PASSWORD PROMPT.

(base) ashish@ashishlaptop:/usr/local/spark/conf$ ssh ashishlaptop
ashish@ashishlaptop's password:

DO THIS TO RESOLVE THE ISSUE

(base) ashish@ashishlaptop:/usr/local/spark/conf$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashishlaptop
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
ashish@ashishlaptop's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'ashish@ashishlaptop'"
and check to make sure that only the key(s) you wanted were added.

(base) ashish@ashishlaptop:/usr/local/spark/conf$ ssh ashish@ashishlaptop
Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

9 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

Last login: Sat Oct 22 22:00:44 2022 from 192.168.1.106
(base) ashish@ashishlaptop:~$
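In short: the machine running start-all.sh must be able to SSH into every host listed in conf/workers (including itself) without a password. A minimal sketch of the full fix, assuming the key pair from the SSH setup section already exists:

$ ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""              # only if the key pair does not exist yet
$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashishlaptop
$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashishdesktop
$ ssh ashish@ashishdesktop 'hostname'                   # should print "ashishdesktop" without a password prompt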

For more details on SSH setup: [ Link ]

Tags: Technology,Spark,

Thursday, October 20, 2022

Python Interview Questions (2022 Oct, Week 3)

Ques 1: Count the range of each value in Python
I have dataset of student's scores for each subject.

StuID  Subject Scores                
1      Math    90
1      Geo     80
2      Math    70
2      Geo     60
3      Math    50
3      Geo     90
Now I want to count the scores of each subject by range, like 0 < x <= 20, 20 < x <= 40, and get a dataframe like this:

Subject  0-20  20-40 40-60 60-80 80-100                 
Math       0     0     1     1     1
Geo        0     0     0     1     2    
How can I do it?

Ans 1:

import pandas as pd

df = pd.DataFrame({
    "StuID": [1,1,2,2,3,3],
    "Subject": ['Math', 'Geo', 'Math', 'Geo', 'Math', 'Geo'],
    "Scores": [90, 80, 70, 60, 50, 90]
})

bins = list(range(0, 100+1, 20))
# [0, 20, 40, 60, 80, 100]
labels = [f'{a}-{b}' for a,b in zip(bins, bins[1:])]
# ['0-20', '20-40', '40-60', '60-80', '80-100']

out = (pd.crosstab(df['Subject'], pd.cut(df['Scores'], bins=bins,
                                            labels=labels, ordered=True, right=False))
            .reindex(labels, axis=1, fill_value=0)
        )

print(out)
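For comparison, a minimal equivalent sketch that uses groupby plus unstack instead of crosstab, reusing the bins and labels defined above:

# Group by Subject and the binned Scores, count rows, then pivot the bins into columns
out2 = (df.groupby(['Subject', pd.cut(df['Scores'], bins=bins, labels=labels, right=False)])
          .size()
          .unstack(fill_value=0))
print(out2)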

Ques 2: Split one column into 3 columns in python or PySpark

I have:
Customerkeycode:
B01:B14:110083

I want:
PlanningCustomerSuperGroupCode, DPGCode, APGCode
B01,B14,110083

Ans 2:

In pyspark, first split the string into an array, and then use the getItem method to split it into multiple columns.

import pyspark.sql.functions as F
...
cols = ['PlanningCustomerSuperGroupCode', 'DPGCode', 'APGCode']
arr_cols = [F.split('Customerkeycode', ':').getItem(i).alias(cols[i]) for i in range(3)]
df = df.select(*arr_cols)
df.show(truncate=False)
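For completeness, a self-contained sketch of the same PySpark approach; the SparkSession creation and the sample rows are assumptions added only to make the snippet runnable end to end:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Hypothetical session and sample data for illustration
spark = SparkSession.builder.appName("split_customerkeycode").getOrCreate()
df = spark.createDataFrame([("B01:B14:110083",), ("B02:B15:110084",)], ["Customerkeycode"])

cols = ['PlanningCustomerSuperGroupCode', 'DPGCode', 'APGCode']
arr_cols = [F.split('Customerkeycode', ':').getItem(i).alias(cols[i]) for i in range(3)]
df.select(*arr_cols).show(truncate=False)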

Using Plain Pandas:


import pandas as pd

df = pd.DataFrame(
    {
        "Customerkeycode": [
            "B01:B14:110083",
            "B02:B15:110084"
        ]
    }
)

df['PlanningCustomerSuperGroupCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[0])
df['DPGCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[1])
df['APGCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[2])

df_rep = df.drop("Customerkeycode", axis = 1)

print(df_rep)


   PlanningCustomerSuperGroupCode DPGCode APGCode
0                            B01     B14  110083
1                            B02     B15  110084

Ref

Ques 3: Create a new list of dict, from a dict with a key that has a list value

I have a dict with one of the keys have a value of list

example = {"a":1,"b":[1,2]}
I am trying to unpack example["b"] and create a list of the same dict with separate example["b"] value.

output = [{"a":1,"b":1},{"a":1,"b":2}]
I have tried to use a for loop to understand the unpacking and reconstruction of the list of dict but I am seeing a strange behavior:


iter = example.get("b")

new_list = []

for p in iter:
    print(f"p is {p}")
    tmp_dict = example
    tmp_dict["b"] = p
    print(tmp_dict)
    new_list.append(tmp_dict)

print(new_list)

Output:


p is 1
{'a': 1, 'b': 1}

p is 2
{'a': 1, 'b': 2}

[{'a': 1, 'b': 2}, {'a': 1, 'b': 2}]

Why does the first dict in the list get assigned example["b"] = 2 even though the first print() shows that p is 1?

Answer 3.1:

Here's a general approach that works for all cases without hardcoding any keys. Let's first create a temporary dictionary where all values are lists.

temp = {k: v if isinstance(v, list) else [v] for k, v in example.items()}
This allows us to then obtain the list of all the values in our temp dict as a list of lists.

We want the product of all the values of this temp dictionary. To do this, we can use the itertools.product function, and unpack our list of lists to its arguments.

In each iteration, the resulting tuple will have one value per key of the temp dictionary, so all we need to do is zip it with the keys and create a dict out of those key-value pairs. That gives us our list element!


import itertools

keys = list(temp.keys())
vals = list(temp.values())

result = []

for vals_product in itertools.product(*vals):
    d = dict(zip(keys, vals_product))
    result.append(d) 
Which gives the required result:

[{'a': 1, 'b': 1}, {'a': 1, 'b': 2}]

This even works for an example with more keys:

example = {'a': 1, 'b': [1, 2], 'c': [1, 2, 3]}
which gives:


[{'a': 1, 'b': 1, 'c': 1},
 {'a': 1, 'b': 1, 'c': 2},
 {'a': 1, 'b': 1, 'c': 3},
 {'a': 1, 'b': 2, 'c': 1},
 {'a': 1, 'b': 2, 'c': 2},
 {'a': 1, 'b': 2, 'c': 3}]

Answer 3.2:
Just one minor correction: use dict() to copy the dictionary instead of reusing the same object.


example = {"a":1,"b":[1,2]}

iter = example.get("b")

new_list = []

for p in iter:
    print(f"p is {p}")
    tmp_dict = dict(example)
    tmp_dict["b"] = p
    print(tmp_dict)
    new_list.append(tmp_dict)

print(new_list)

Output is as given below: [{'a': 1, 'b': 1}, {'a': 1, 'b': 2}]

Ref

Question 4: Classification of sentences

I have a list of sentences. Examples:


${INS1}, Watch our latest webinar about flu vaccine
Do you think patients would like to go up to 250 days without an attack?
Watch our latest webinar about flu vaccine
??? See if more of your patients are ready for vaccine
Important news for your invaccinated patients
Important news for your inv?ccinated patients
...

By 'good' I mean sentences with no strange characters or sequences of characters such as '${INS1}', '???', or a '?' inside a word, etc. Otherwise the sentence is considered 'bad'. I need to find 'good' patterns so I can identify and exclude 'bad' sentences later, as the list of sentences will grow and new 'bad' sentences might appear.

Is there any way to identify 'good' sentences?

Answer 4:

This solution is based on the character examples specifically given in the question. If more characters should be used to distinguish good and bad sentences, they should also be added to the RegEx in the code below.


import re 

sents = [
    "${INS1}, Watch our latest webinar about flu vaccine",
    "Do you think patients would like to go up to 250 days without an attack?",
    "Watch our latest webinar about flu vaccine",
    "??? See if more of your patients are ready for vaccine",
    "Important news for your invaccinated patients",
    "Important news for your inv?ccinated patients"
]

good_sents = []
bad_sents = []

for i in sents:
    x = re.findall("[?}{$]", i)
    if(len(x) == 0):
        good_sents.append(i)

    else:
        bad_sents.append(i)

print(good_sents)

Question 5: How to make the sequence result in one line? 

Need to print 'n' Fibonacci numbers.

Here is the expected output as shown below:


Enter n: 10
Fibonacci numbers = 1 1 2 3 5 8 13 21 34 55

Here is my current code:


n = input("Enter n: ")

def fib(n):
    cur = 1
    old = 1
    i = 1
    while (i < n):
        cur, old, i = cur+old, cur, i+1
    return cur

Answer 5:
To do the least calculation, it is more efficient to have a fib function generate a list of the first n Fibonacci numbers.


def fib(n):
    fibs = [0, 1]
    for _ in range(n-2):
        fibs.append(sum(fibs[-2:]))
    return fibs
We know the first two Fibonacci numbers are 0 and 1. For the remaining count we can add the sum of the last two numbers to the list.

>>> fib(10)
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

You can now:

print('Fibonacci numbers = ', end='')
print(*fib(10), sep=' ', end='\n')

Question 6: Maximum occurrences in a list / DataFrame column

I have a dataframe like the one below.


import pandas as pd

data = {'Date': ['2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05'],
        'Runner': ['Runner A', 'Runner A', 'Runner A', 'Runner A', 'Runner A','Runner B', 'Runner B', 'Runner B', 'Runner B', 'Runner B','Runner C', 'Runner C', 'Runner C', 'Runner C', 'Runner C'],
        'Training Time': ['less than 1 hour', 'less than 1 hour', 'less than 1 hour', 'less than 1 hour', '1 hour to 2 hour','less than 1 hour', '1 hour to 2 hour', 'less than 1 hour', '1 hour to 2 hour', '2 hour to 3 hour', '1 hour to 2 hour ', '2 hour to 3 hour' ,'1 hour to 2 hour ', '2 hour to 3 hour', '2 hour to 3 hour']
        }

df = pd.DataFrame(data)

I have counted the occurrence for each runner using the below code

s = df.groupby(['Runner','Training Time']).size()

The problem is on Runner B. It should show "1 hour to 2 hour" and "less than 1 hour". But it only shows "1 hour to 2 hour". How can I get this expected result:
Answer 6.1:

import pandas as pd

data = {'Date': ['2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05'],
        'Runner': ['Runner A', 'Runner A', 'Runner A', 'Runner A', 'Runner A','Runner B', 'Runner B', 'Runner B', 'Runner B', 'Runner B','Runner C', 'Runner C', 'Runner C', 'Runner C', 'Runner C'],
        'Training Time': ['less than 1 hour', 'less than 1 hour', 'less than 1 hour', 'less than 1 hour', '1 hour to 2 hour','less than 1 hour', '1 hour to 2 hour', 'less than 1 hour', '1 hour to 2 hour', '2 hour to 3 hour', '1 hour to 2 hour ', '2 hour to 3 hour' ,'1 hour to 2 hour ', '2 hour to 3 hour', '2 hour to 3 hour']
        }

df = pd.DataFrame(data)

# Count occurrences per (Runner, Training Time) pair
s = df.groupby(['Runner', 'Training Time'], as_index=False).size()
s.columns = ['Runner', 'Training Time', 'Size']

# Maximum count per runner
r = s.groupby(['Runner'], as_index=False)['Size'].max()

# Keep every (Runner, Training Time) row whose count equals that runner's maximum
df_list = []
for index, row in r.iterrows():
    temp_df = s[(s['Runner'] == row['Runner']) & (s['Size'] == row['Size'])]
    df_list.append(temp_df)

df_report = pd.concat(df_list)
print(df_report)
df_report.to_csv('report.csv', index = False)

Answer 6.2:

import collections

def agg_most_common(vals):
    # Keep all values tied for the highest count
    matches = []
    for i in collections.Counter(vals).most_common():
        if not matches or matches[0][1] == i[1]:
            matches.append(i)
        else:
            break
    return [x[0] for x in matches]

print(df.groupby('Runner')['Training Time'].agg(agg_most_common))
Tags: Python,Technology,

Wednesday, October 19, 2022

HDFC may be excluded from Nifty 50 by Dec-end; these stocks front-runners to replace mortgage lender (2022-Oct-19)

The National Stock Exchange may exclude Housing Development Finance Corporation (HDFC) from the NSE Nifty 50 index in the next few weeks as the HDFC and HDFC Bank merger looks likely to be completed a few months earlier than expected, according to analysts. 

Adhesive maker Pidilite Industries is among the front-runners to replace HDFC in the 50-share index. “The (HDFC) stock could be then excluded from all of the Nifty Indices earliest by end of December 2022 or max by middle of January 2023,” said Abhilash Pagaria, head of alternative and quantitative research at Nuvama Wealth Management. The housing finance company with a 5.5% weight in the Nifty could see an outflow of around $1.3-1.5 billion from the passive funds once it’s excluded from the benchmark.

This will be an ad hoc inclusion on account of the mortgage lender’s expulsion from the index on account of its merger with HDFC Bank, which has entered final stages. The shareholders’ meeting to approve the proposed merger of HDFC with HDFC Bank has been scheduled for 25 November. Nifty Indices Methodology book states, “In case of a merger, spin-off, capital restructuring or voluntary delisting, equity shareholders’ approval is considered as a trigger to initiate the replacement of such stock from the index through additional index reconstitution.”

As the weights of the indices are calculated based on free-float market capitalisation, the exclusion of HDFC will lead to a redistribution of most weightages in the top-10 Nifty heavyweights. According to Nuvama, HDFC’s stock could see outflows worth more than $1.5 billion because of exiting the benchmark indices, which could also have a bearing on the stock performance of HDFC Bank. 

Aside from (1) Pidilite:
(2) Ambuja Cements (ACEM), 
(3) Tata Power (TPWR) and 
(4) SRF (SRF) are among the front-runners to replace HDFC in Nifty 50. Pagaria said that HDFC’s large 5.5% weight in the Nifty 50 index means that the stock could see heavy selling from passive funds that closely track the benchmark index.

Currently, over 22 ETFs track the Nifty 50 index as the benchmark with cumulative assets under management (AUM) of $30 billion (Rs 2.4 lakh crore). 
It is worth noting that HDFC’s shareholders are set to receive 42 shares of HDFC Bank for every 25 shares of HDFC held by them. Given the swap ratio, a sharp decline in shares of HDFC in the run-up to the exit from the Nifty indices is likely to trigger a sell-off in HDFC Bank. In the Nuvama note, Pagaria dismissed concerns among some market participants over HDFC Bank being excluded from the NSE indices after the merger. “In fact, once the merged company starts trading with a higher free float market capitalization, HDFC Bank will see an increase in weightage,” he said.

[ Ref ]
Tags: Investment,

Monday, October 17, 2022

ARIMA forecast for timeseries is one step ahead. Why? (Solved Interview Problem)

Question from interviewer

I'm trying to forecast timeseries with ARIMA (Autoregressive Integrated Moving Average (ARIMA) model). As you can see from the plot, the forecast (red) is one step ahead of the expected values (input value colored in blue). How can I synchronize the two?
The code I used:

history = [x for x in train]
predictions = list()
for t in range(len(test)):
    model = ARIMA(history, order=(2, 2, 1))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast(alpha=0.05)
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))

rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)

# plot
plt.plot(test, color='blue')
plt.plot(predictions, color='red')
plt.show()

Answer

It appears you are using Python's statsmodels ARIMA. There are two options:

1) Change the order= argument. The p, d, or q hyperparameters might be causing that result.

2) Manually lag the predictions: predictions = predictions[1:]

What you are seeing in the plot can also be interpreted differently if you consider the time dimension on the x-axis, with the future on the right and the past on the left. Your predicted curve (the one colored in red) is FOLLOWING your input (blue) curve: at time t, if blue goes down, then at time t+1 red goes down. You can see this behavior very clearly twice in the image, once around t = 10 and a second time around t = 20. Also, the magnitude of change in the predicted curve is not as large as in the input curve.
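A minimal sketch of option 2, reusing the question's variable names (test, predictions); this only changes how the series are plotted, not the model itself:

import matplotlib.pyplot as plt

# Drop the first forecast so that prediction t lines up with observation t
aligned_predictions = predictions[1:]
plt.plot(test[:len(aligned_predictions)], color='blue', label='expected')
plt.plot(aligned_predictions, color='red', label='predicted')
plt.legend()
plt.show()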
Tags: Technology,Machine Learning,

Check in Python if a key is present in a JSON object (Solved Interview Problem)


jsonstreng = {
    "kjoretoydataListe":
    [
        {
            "kjoretoyId": {"kjennemerke":"EK 17058","understellsnummer":"5YJXCCE29GF009633"},
            "forstegangsregistrering": {"registrertForstegangNorgeDato":"2016-09-07"},
            "kjennemerke": [
                {"fomTidspunkt":"2016-09-07T00:00:00+02:00","kjennemerke":"EK 17058","kjennemerkekategori":"KJORETOY",
                "kjennemerketype":{"kodeBeskrivelse":"Sorte tegn p- hvit bunn","kodeNavn":"Ordin-re kjennemerker","kodeVerdi":"ORDINART","tidligereKodeVerdi":[]}},{"fomTidspunkt":"2022-08-19T17:04:04.334+02:00","kjennemerke":"GTD","kjennemerkekategori":"PERSONLIG"}
            ]
        }
    ]
}

def checkIfKeyExistsInDict(in_dict, i_key):
    if(isinstance(in_dict, dict)):
        if(i_key in in_dict):
            print(i_key + " found in: " + str(in_dict))
            print()
        
        for j_key in in_dict:
            checkIfKeyExistsInDict(in_dict[j_key], i_key)
    elif(isinstance(in_dict, list)):
        for j_key in range(len(in_dict)):
            checkIfKeyExistsInDict(in_dict[j_key], i_key)

        
checkIfKeyExistsInDict(jsonstreng, "kjennemerke")


Logs

(base) $ python "Check if key is present in a JSON object.py"

kjennemerke found in: {'kjoretoyId': {'kjennemerke': 'EK 17058', 'understellsnummer': '5YJXCCE29GF009633'}, 'forstegangsregistrering': {'registrertForstegangNorgeDato': '2016-09-07'}, 'kjennemerke': [{'fomTidspunkt': '2016-09-07T00:00:00+02:00', 'kjennemerke': 'EK 17058', 'kjennemerkekategori': 'KJORETOY', 'kjennemerketype': {'kodeBeskrivelse': 'Sorte tegn p- hvit bunn', 'kodeNavn': 'Ordin-re kjennemerker', 'kodeVerdi': 'ORDINART', 'tidligereKodeVerdi': []}}, {'fomTidspunkt': '2022-08-19T17:04:04.334+02:00', 'kjennemerke': 'GTD', 'kjennemerkekategori': 'PERSONLIG'}]}

kjennemerke found in: {'kjennemerke': 'EK 17058', 'understellsnummer': '5YJXCCE29GF009633'}

kjennemerke found in: {'fomTidspunkt': '2016-09-07T00:00:00+02:00', 'kjennemerke': 'EK 17058', 'kjennemerkekategori': 'KJORETOY', 'kjennemerketype': {'kodeBeskrivelse': 'Sorte tegn p- hvit bunn', 'kodeNavn': 'Ordin-re kjennemerker', 'kodeVerdi': 'ORDINART', 'tidligereKodeVerdi': []}}

kjennemerke found in: {'fomTidspunkt': '2022-08-19T17:04:04.334+02:00', 'kjennemerke': 'GTD', 'kjennemerkekategori': 'PERSONLIG'}
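A small variant of the same recursive idea, in case a simple True/False answer is enough instead of printing every match (a sketch, not part of the original script; it reuses the jsonstreng object above):

def keyExistsAnywhere(obj, i_key):
    # Returns True if i_key appears as a key at any nesting level of dicts/lists
    if isinstance(obj, dict):
        if i_key in obj:
            return True
        return any(keyExistsAnywhere(v, i_key) for v in obj.values())
    if isinstance(obj, list):
        return any(keyExistsAnywhere(item, i_key) for item in obj)
    return False

print(keyExistsAnywhere(jsonstreng, "kjennemerke"))  # True
print(keyExistsAnywhere(jsonstreng, "noSuchKey"))    # False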
Tags: Python,Technology,

Hinglish to English Machine Translation Using Transformers

Download Code and Data
import numpy as np
from keras_transformer import get_model, decode


with open('../Dataset/English_Hindi_Hinglish.txt', mode = 'r') as f:
    data = f.readlines()

data = data[0:195] # 195 Because we have that many labeled data points for Hinglish to English translation.

source_tokens = [i.split(',')[1].strip().split(' ') for i in data]
target_tokens = [i.split('\t')[0].strip().split(' ') for i in data]



# Generate dictionaries
def build_token_dict(token_list):
    token_dict = {
        '<PAD>': 0,
        '<START>': 1,
        '<END>': 2,
    }
    for tokens in token_list:
        for token in tokens:
            if token not in token_dict:
                token_dict[token] = len(token_dict)
    return token_dict

source_token_dict = build_token_dict(source_tokens)
target_token_dict = build_token_dict(target_tokens)
target_token_dict_inv = {v: k for k, v in target_token_dict.items()}

# Add special tokens
encode_tokens = [['<START>'] + tokens + ['<END>'] for tokens in source_tokens]
decode_tokens = [['<START>'] + tokens + ['<END>'] for tokens in target_tokens]
output_tokens = [tokens + ['<END>', '<PAD>'] for tokens in target_tokens]

# Padding
source_max_len = max(map(len, encode_tokens))
target_max_len = max(map(len, decode_tokens))

encode_tokens = [tokens + ['<PAD>'] * (source_max_len - len(tokens)) for tokens in encode_tokens]
decode_tokens = [tokens + ['<PAD>'] * (target_max_len - len(tokens)) for tokens in decode_tokens]
output_tokens = [tokens + ['<PAD>'] * (target_max_len - len(tokens)) for tokens in output_tokens]

encode_input = [list(map(lambda x: source_token_dict[x], tokens)) for tokens in encode_tokens]
decode_input = [list(map(lambda x: target_token_dict[x], tokens)) for tokens in decode_tokens]
decode_output = [list(map(lambda x: [target_token_dict[x]], tokens)) for tokens in output_tokens]

# Build & fit model
model = get_model(
    token_num=max(len(source_token_dict), len(target_token_dict)),
    embed_dim=32,
    encoder_num=2,
    decoder_num=2,
    head_num=4,
    hidden_dim=128,
    dropout_rate=0.05,
    use_same_embed=False,  # Use different embeddings for different languages
)
model.compile('adam', 'sparse_categorical_crossentropy')
model.summary()


Number of Parameters to Train When There Are a Certain Number of Lines in Input

Number of Lines is 55

Model: "model_1" ... Total params: 65,827 Trainable params: 65,827 Non-trainable params: 0

Number of Lines is 115

Total params: 72,002 Trainable params: 72,002 Non-trainable params: 0

Number of Lines is 165

Total params: 77,787 Trainable params: 77,787 Non-trainable params: 0

Number of Lines is 195

Total params: 80,777
Trainable params: 80,777
Non-trainable params: 0

%%time
model.fit(
    x=[np.array(encode_input * 1024), np.array(decode_input * 1024)],
    y=np.array(decode_output * 1024),
    epochs=10,
    batch_size=32,
)

Training Logs When There is a Certain Number of Lines in Input

# Number of Lines in Input is 55

Number of Epochs: 10 CPU times: user 11min 31s, sys: 56 s, total: 12min 27s Wall time: 5min 48s <keras.callbacks.History at 0x7f8f347f69d0>

# Number of Lines in Input is 115

Number of Epochs: 10 CPU times: user 26min 55s, sys: 2min 7s, total: 29min 2s Wall time: 13min 33s

# Number of Lines in Input is 150

Number of Epochs: 10 CPU times: user 41min 26s, sys: 3min 12s, total: 44min 39s Wall time: 21min 1s

# Number of Lines in Input is 195

Number of Epochs: 10 Epoch 1/10 6240/6240 [==============================] - 165s 25ms/step - loss: 0.1641 Epoch 2/10 6240/6240 [==============================] - 163s 26ms/step - loss: 0.0049 Epoch 3/10 6240/6240 [==============================] - 151s 24ms/step - loss: 0.0043 Epoch 4/10 6240/6240 [==============================] - 150s 24ms/step - loss: 0.0038 Epoch 5/10 6240/6240 [==============================] - 150s 24ms/step - loss: 0.0043 Epoch 6/10 6240/6240 [==============================] - 153s 24ms/step - loss: 0.0036 Epoch 7/10 6240/6240 [==============================] - 153s 24ms/step - loss: 0.0036 Epoch 8/10 6240/6240 [==============================] - 151s 24ms/step - loss: 0.0036 Epoch 9/10 6240/6240 [==============================] - 150s 24ms/step - loss: 0.0038 Epoch 10/10 6240/6240 [==============================] - 152s 24ms/step - loss: 0.0037 CPU times: user 51min 23s, sys: 3min 52s, total: 55min 16s Wall time: 25min 39s

# Validation

decoded = decode(
    model,
    encode_input,
    start_token=target_token_dict['<START>'],
    end_token=target_token_dict['<END>'],
    pad_token=target_token_dict['<PAD>'],
)

for i in decoded:
    print(' '.join(map(lambda x: target_token_dict_inv[x], i[1:-1])))

...
Follow him.
I am tired.
I can swim.
I can swim.
I love you.

# Testing

test_sents = [
    'kaise ho?',
    'kya tum mujhse pyar karte ho?',
    'kya tum mujhe pyar karte ho?'
]

test_tokens = [i.split() for i in test_sents]
test_token_dict = build_token_dict(test_tokens)
test_token_dict_inv = {v: k for k, v in test_token_dict.items()}

test_enc_tokens = [['<START>'] + tokens + ['<END>'] for tokens in test_tokens]
test_enc_tokens = [tokens + ['<PAD>'] * (target_max_len - len(tokens)) for tokens in test_enc_tokens]
test_input = [list(map(lambda x: test_token_dict[x], tokens)) for tokens in test_enc_tokens]

decoded = decode(
    model,
    test_input,
    start_token=test_token_dict['<START>'],
    end_token=test_token_dict['<END>'],
    pad_token=test_token_dict['<PAD>'],
)

for i in decoded:
    print(' '.join(map(lambda x: target_token_dict_inv[x], i[1:-1])))
Tags: Technology,Natural Language Processing,

Sunday, October 16, 2022

Stories from 'How Not to be Wrong' by Jordan Ellenberg


    

Borrowed from 'How Not to be Wrong: The Hidden Maths of Everyday Life' by Jordan Ellenberg.

Importance of data versus intelligent analysis

During World War II, numerous fighter planes were getting hit by anti-aircraft guns. Air Force officers wanted to add some protective armor/shield to the planes. The question was: where? The planes could only support a few more kilos of weight. A group of mathematicians and engineers were called for a short consulting project. Fighter planes returning from missions were analyzed for bullet holes per square foot. They found 1.93 bullet holes/sq. foot near the tail of the planes whereas only 1.11 bullet holes/sq. foot close to the engine. The Air Force officers thought that since the tail portion had the greatest density of bullets, that would be the logical location for putting an anti-bullet shield. A mathematician named Abraham Wald said exactly the opposite: more protection is needed where the bullet holes aren't, that is, around the engines. His judgement surprised everyone. He said, "We are counting the planes that returned from a mission. Planes with lots of bullet holes in the engine did not return at all."

Debrief

If you go to the recovery room at the hospital, you'll see a lot more people with bullet holes in their legs than people with bullet holes in their chests. That's not because people don't get shot in the chest; it's because the people who get shot in the chest don't recover. Remember the words of Einstein: "Not everything that counts can be counted, and not everything that can be counted, counts."

The Parable of the Baltimore Stockbroker

Imagine getting a mail from a little known stockbroking firm. The mail predicts that a certain stock will rise this week. You leave the mail aside, as you have seen enough such mails. But the prediction turns out to be right.

Next week

The Baltimore Stockbroker mails again with another tip, this time of a stock going south. The message turns out right too, and you decide to mark the Baltimore Stockbroker as 'not spam'.

Week Three

Another hit. And your interest is piqued. This goes on for ten weeks. Ten accurate predictions from the Baltimore Stockbroker. You, the guy who recently retired with a substantial gratuity in the bank, are hooked.

Week eleven

The Baltimore Stockbroker sends you an offer to invest money with him, for a substantial fee of course. There is the usual caveat of past performances not guaranteeing future success, but the Baltimore Stockbroker nudges you to consider his ten week streak. You do the math. Every week, the Stockbroker had a 50% chance with his prediction. Either he would be right, or wrong. Combining the probabilities for ten weeks, the chances of the Baltimore Stockbroker to be right ten weeks in a row work out to... 1/2 x 1/2 x 1/2... ten times... = 1/1024 You consider. The Baltimore Stockbroker must be onto something. And it would be worthwhile to invest your nest egg with him. You go in for the offer.

Things, from the view of the Baltimore Stockbroker, are a bit different.

What he did, was start out with sending 10,240 newsletters! Of these, 5120 said a stock would go up, and 5120 said otherwise. The 5120 who got a dud prediction never heard from the Baltimore Stockbroker again. Week Two, the Baltimore Stockbroker sent 2560 newsletters, and the following week he again halved the number, based on who got his correct prediction. This way, at the end of week 10, he had ten people, convinced he was a financial genius. That's... The power of probabilities, cons, and the impact of mathematics on daily life... Just one aspect! This is how the Tip Providers work.
Tags: Book Summary,

Saturday, October 15, 2022

JavaScript RegEx Solved Interview Problem (Oct 2022)

Question

Update a JSON-like object that contains new lines and comments.

I'm working on a Node application where I make a call to an external API. I receive a response which looks exactly like this:

{
    "data": "{\r\n // comment\r\n someProperty: '*',\r\n\r\n // another comment\r\n method: function(e) {\r\n if (e.name === 'something') {\r\n onSomething();\r\n }\r\n }\r\n}"
}

So as you can see it contains some comments, new line characters etc. I would like to parse it somehow into proper JSON and then update the method property with a completely different function. What would be the best (working) way to achieve it? I've tried to use the comment-json npm package, however it fails when I execute parse(response.data).

Answer

<script>
function get_response() {
    return JSON.stringify({
        "data": "{\r\n // comment\r\n someProperty: '*',\r\n\r\n // another comment\r\n method: function(e) {\r\n if (e.name === 'something') {\r\n onSomething();\r\n }\r\n }\r\n}"
    });
}

var resp = get_response();
console.log("Response before formatting: " + resp);

// Drop the comments, then drop the remaining escaped newlines
resp = resp.replace(/\/.*?\\r\\n/g, "");
resp = resp.replace(/\\r\\n/g, "");

console.log("Response after formatting: " + resp);
console.log("And JSON.parse()");
console.log(JSON.parse(resp));
</script>
Tags: Technology,JavaScript,

MongoDB and Node.js Installation on Ubuntu (Oct 2022)

Part 1: MongoDB

(base) ashish@ashishlaptop:~/Desktop$ sudo apt-get install -y mongodb-org=6.0.2 mongodb-org-database=6.0.2 mongodb-org-server=6.0.2 mongodb-org-mongos=6.0.2 mongodb-org-tools=6.0.2
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mongodb-org-mongos : Depends: libssl1.1 (>= 1.1.1) but it is not installable
 mongodb-org-server : Depends: libssl1.1 (>= 1.1.1) but it is not installable
E: Unable to correct problems, you have held broken packages.

References
1) Install MongoDB (latest) on Ubuntu
2) Install MongoDB (v6.0) on Ubuntu
3) Install MongoDB (v5.0) on Ubuntu

Resolution

Dated: 2022-Oct-14

MongoDB has no official build for Ubuntu 22.04 at the moment. Ubuntu 22.04 has upgraded libssl to 3 and does not provide libssl1.1.

You can force the installation of libssl1.1 by adding the Ubuntu 20.04 (focal) security repository as a source:

$ echo "deb http://security.ubuntu.com/ubuntu focal-security main" | sudo tee /etc/apt/sources.list.d/focal-security.list
$ sudo apt-get update
$ sudo apt-get install libssl1.1

Then use your commands to install mongodb-org. Then delete the focal-security list file you just created:

$ sudo rm /etc/apt/sources.list.d/focal-security.list

[ Ref ]
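Once the install goes through, a quick way to check that the server actually runs (the mongodb-org packages register a systemd service named mongod; the commands below are the usual ones, shown here as a sketch):

$ sudo systemctl start mongod
$ sudo systemctl status mongod    # should report "active (running)"
$ mongod --version                # should print db version v6.0.2
$ mongosh                         # connects to mongodb://127.0.0.1:27017 by default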

Part 2: Node.js

(base) ashish@ashishlaptop:~/Desktop/node$ node
Command 'node' not found, but can be installed with:
sudo apt install nodejs

(base) ashish@ashishlaptop:~/Desktop/node$ sudo apt install nodejs
(base) ashish@ashishlaptop:~/Desktop/node$ node -v
v12.22.9

(base) ashish@ashishlaptop:~/Desktop/node$ npm
Command 'npm' not found, but can be installed with:
sudo apt install npm

$ sudo apt install npm
(base) ashish@ashishlaptop:~/Desktop/node$ npm -v
8.5.1

Issue when MongoDB client is not installed

(base) ashish@ashishlaptop:~/Desktop/node$ node
Welcome to Node.js v12.22.9.
Type ".help" for more information.
> var mongo = require('mongodb');
Uncaught Error: Cannot find module 'mongodb'
Require stack:
- <repl>
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:815:15)
    at Function.Module._load (internal/modules/cjs/loader.js:667:27)
    at Module.require (internal/modules/cjs/loader.js:887:19)
    at require (internal/modules/cjs/helpers.js:74:18) {
  code: 'MODULE_NOT_FOUND',
  requireStack: [ '<repl>' ]
}
>
(base) ashish@ashishlaptop:~/Desktop/node$ npm install mongodb

added 20 packages, and audited 21 packages in 35s
3 packages are looking for funding
  run `npm fund` for details
found 0 vulnerabilities

After mongoDB client has been installed

> var mongo = require('mongodb');
undefined
>
Tags: Technology,Database,JavaScript,Linux,

Friday, October 14, 2022

Node.js and MongoDB Solved Interview Problem (Oct 2022)

Question:

I'm making 2 queries (or a single query) to the database. 
What I want to achieve:

If one of them is null, I want to add the value 'nil'. Example:

field1: nil,
field2: 'value'

If both are null, then I want it to respond with the 'not found' message.

What's a good approach for this?

Answer:

First, we setup the local database.

Running a local instance

(base) ashish@ashishlaptop:~/Desktop$ mongosh Current Mongosh Log ID: 63498870b0f4b94029f7b626 Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.0 Using MongoDB: 6.0.2 Using Mongosh: 1.6.0 For mongosh info see: https://docs.mongodb.com/mongodb-shell/ To help improve our products, anonymous usage data is collected and sent to MongoDB periodically (https://www.mongodb.com/legal/privacy-policy). You can opt-out by running the disableTelemetry() command. ------ The server generated these startup warnings when booting 2022-10-14T21:33:50.958+05:30: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem 2022-10-14T21:33:52.643+05:30: Access control is not enabled for the database. Read and write access to data and configuration is unrestricted 2022-10-14T21:33:52.644+05:30: vm.max_map_count is too low ------ ------ Enable MongoDB's free cloud-based monitoring service, which will then receive and display metrics about your deployment (disk utilization, CPU, operation statistics, etc). The monitoring data will be available on a MongoDB website with a unique URL accessible to you and anyone you share the URL with. MongoDB may use this information to make product improvements and to suggest MongoDB products and deployment options to you. To enable free monitoring, run the following command: db.enableFreeMonitoring() To permanently disable this reminder, run the following command: db.disableFreeMonitoring() ------ test> test> db test test> show dbs admin 40.00 KiB config 12.00 KiB local 72.00 KiB test> test> test> db.createCollection("ccn1") { ok: 1 } test> db.ccn1.insertOne({"name":"Ashish Jain","address":"Delhi"}) { acknowledged: true, insertedId: ObjectId("634989627d163a41b75e1e13") } test> Next Command db.ccn1.insertMany([ {"name":"","address":"India"}, {"name":"","address":""}, {"name":"Ash","address":""} ]) test> db.ccn1.insertMany([ ... {"name":"","address":"India"}, ... {"name":"","address":""}, ... {"name":"Ash","address":""} ... ]) { acknowledged: true, insertedIds: { '0': ObjectId("634989cc7d163a41b75e1e14"), '1': ObjectId("634989cc7d163a41b75e1e15"), '2': ObjectId("634989cc7d163a41b75e1e16") } } test>

Our Database Looks Like This

[
  { _id: new ObjectId("634989627d163a41b75e1e13"), name: 'Ashish Jain', address: 'Delhi' },
  { _id: new ObjectId("634989cc7d163a41b75e1e14"), name: '', address: 'India' },
  { _id: new ObjectId("634989cc7d163a41b75e1e15"), name: '', address: '' },
  { _id: new ObjectId("634989cc7d163a41b75e1e16"), name: 'Ash', address: '' }
]

This is our Node.js code in file "script.js"

var MongoClient = require('mongodb').MongoClient;
var url = "mongodb://localhost:27017/";

MongoClient.connect(url, function(err, db) {
    if (err) throw err;
    var dbo = db.db("test");
    var all_docs = dbo.collection("ccn1").find({}).toArray(function(err, result) {
        if (err) throw err;
        for (i in result) {
            if (result[i]['name'] && result[i]['address']) {
                console.log("name: " + result[i]['name']);
                console.log("address: " + result[i]['address']);
            } else if (result[i]['name'] && !result[i]['address']) {
                console.log("name: " + result[i]['name']);
                console.log("address: nil");
            } else if (!result[i]['name'] && result[i]['address']) {
                console.log("name: nil");
                console.log("address: " + result[i]['address']);
            } else {
                console.log("Not Found");
            }
            console.log();
        }
        db.close();
    });
});

This is our output

(base) ashish@ashishlaptop:~/Desktop/software/node$ node script.js
name: Ashish Jain
address: Delhi

name: nil
address: India

Not Found

name: Ash
address: nil
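As a side note, part of this null/empty handling could be pushed into the database itself with an aggregation $project stage. A sketch in mongosh syntax; the 'Not Found' case for documents where both fields are empty would still be handled in the Node.js code:

db.ccn1.aggregate([
  { $project: {
      name:    { $cond: [ { $in: [ "$name",    [null, ""] ] }, "nil", "$name" ] },
      address: { $cond: [ { $in: [ "$address", [null, ""] ] }, "nil", "$address" ] }
  } }
])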
Tags: Technology,Database,JavaScript,

Thursday, October 13, 2022

SSH Setup (on two Ubuntu machines), Error Messages and Resolution

System 1: ashishlaptop

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ ifconfig enp1s0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 ether 9c:5a:44:09:35:ee txqueuelen 1000 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 inet6 ::1 prefixlen 128 scopeid 0x10<host> loop txqueuelen 1000 (Local Loopback) RX packets 375 bytes 45116 (45.1 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 375 bytes 45116 (45.1 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 wlp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 192.168.1.131 netmask 255.255.255.0 broadcast 192.168.1.255 inet6 fe80::5154:e768:24e1:aece prefixlen 64 scopeid 0x20<link> inet6 2401:4900:47f6:d7d1:b724:d299:1a51:567 prefixlen 64 scopeid 0x0<global> inet6 2401:4900:47f6:d7d1:239a:fc2d:c994:6e54 prefixlen 64 scopeid 0x0<global> ether b0:fc:36:e5:ad:11 txqueuelen 1000 (Ethernet) RX packets 7899 bytes 8775440 (8.7 MB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 5299 bytes 665165 (665.1 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 (base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ hostname ashish-Lenovo-ideapad-130-15IKB

To Change The Hostname

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ sudo nano /etc/hostname (base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ cat /etc/hostname ashishlaptop

A system restart is required at this point for the new hostname to be reflected everywhere.
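
As an alternative to editing /etc/hostname and rebooting, on a systemd-based Ubuntu release the hostname can usually be changed in place with hostnamectl; a hedged sketch (run on each machine with its own name; already-open terminals may still show the old name until a new shell is started):

$ sudo hostnamectl set-hostname ashishlaptop
$ hostnamectl    # the 'Static hostname' field should now show the new name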

To Setup Addressing of Connected Nodes and Their IP Addresses

Original File Contents

(base) ashish@ashishlaptop:~$ cat /etc/hosts 127.0.0.1 localhost 127.0.1.1 ashish-Lenovo-ideapad-130-15IKB # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters

File "/etc/hosts" After Change

(base) ashish@ashishlaptop:~$ sudo nano /etc/hosts (base) ashish@ashishlaptop:~$ cat /etc/hosts 192.168.1.131 ashishlaptop 192.168.1.106 ashishdesktop

Checking Connectivity With The Other Machine

(base) ashish@ashishlaptop:~$ ping 192.168.1.106 PING 192.168.1.106 (192.168.1.106) 56(84) bytes of data. 64 bytes from 192.168.1.106: icmp_seq=1 ttl=64 time=5.51 ms 64 bytes from 192.168.1.106: icmp_seq=2 ttl=64 time=115 ms 64 bytes from 192.168.1.106: icmp_seq=3 ttl=64 time=4.61 ms 64 bytes from 192.168.1.106: icmp_seq=4 ttl=64 time=362 ms 64 bytes from 192.168.1.106: icmp_seq=5 ttl=64 time=179 ms 64 bytes from 192.168.1.106: icmp_seq=6 ttl=64 time=4.53 ms ^C --- 192.168.1.106 ping statistics --- 6 packets transmitted, 6 received, 0% packet loss, time 5012ms rtt min/avg/max/mdev = 4.525/111.739/361.954/129.976 ms
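
The same check can be repeated by hostname, to confirm that the new /etc/hosts entries are actually being used for name resolution (getent is available on a stock Ubuntu install):

$ getent hosts ashishdesktop    # should print: 192.168.1.106  ashishdesktop
$ ping -c 2 ashishdesktop       # ping by name instead of IP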

System 2: ashishdesktop

(base) ashish@ashishdesktop:~$ ifconfig ens33: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 ether 00:e0:4c:3c:16:6b txqueuelen 1000 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 inet6 ::1 prefixlen 128 scopeid 0x10<host> loop txqueuelen 1000 (Local Loopback) RX packets 317 bytes 33529 (33.5 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 317 bytes 33529 (33.5 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 wlx00e02d420fcb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 192.168.1.106 netmask 255.255.255.0 broadcast 192.168.1.255 inet6 2401:4900:47f6:d7d1:3cc9:20f6:af75:bb28 prefixlen 64 scopeid 0x0<global> inet6 2401:4900:47f6:d7d1:73e6:fca0:4452:382 prefixlen 64 scopeid 0x0<global> inet6 fe80::1cdd:53e7:d13a:4f52 prefixlen 64 scopeid 0x20<link> ether 00:e0:2d:42:0f:cb txqueuelen 1000 (Ethernet) RX packets 42484 bytes 56651709 (56.6 MB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 28763 bytes 3324595 (3.3 MB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 (base) ashish@ashishdesktop:~$ hostname ashishdesktop

Original Contents of File "/etc/hosts"

(base) ashish@ashishdesktop:~$ cat /etc/hosts 127.0.0.1 localhost 127.0.1.1 ashishdesktop # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters

Modified Contents of "/etc/hosts"

(base) ashish@ashishdesktop:~$ sudo nano /etc/hosts (base) ashish@ashishdesktop:~$ cat /etc/hosts 192.168.1.106 ashishdesktop 192.168.1.131 ashishlaptop

SSH Commands

First: Follow steps 1 to 7 on every node.

1) sudo apt-get install openssh-server openssh-client
2) sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT
3) Use the network adapter 'NAT' in the Guest OS settings, and create a new port forwarding rule "SSH" for port 22.
4) sudo reboot
5) ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""
6) sudo service ssh stop
7) sudo service ssh start

Second: After 'First' is done, follow steps 8 to 10 on every node.

8) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@master
9) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave1
10) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave2
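
Once the keys are copied, a quick hedged check is that SSH no longer prompts for a password; for the two-machine setup described here, for example, from ashishlaptop:

$ ssh ashish@ashishdesktop hostname    # should print "ashishdesktop" without asking for a password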

Error Messages And Resolutions

Error 1: Port 22: Connection refused

(base) ashish@ashishlaptop:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashishdesktop /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub" /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed /usr/bin/ssh-copy-id: ERROR: ssh: connect to host ashishdesktop port 22: Connection refused

Resolution

First follow SSH steps 1 to 7 on both the machines.

Error 2: Could not resolve hostname ashishlaptop

(base) ashish@ashishdesktop:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashishlaptop /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub" /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed /usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname ashishlaptop: Temporary failure in name resolution

Resolution

Modify the contents of the two files "/etc/hostname" and "/etc/hosts" as shown above; this is the starting activity for this task.
Tags: Technology,Linux,SSH,

Setting up a three node Spark cluster on Ubuntu using VirtualBox (Apr 2020)

Setting hostname in three Guest OS(s):
$ sudo gedit /etc/hostname
    "to master, slave1, and slave2 on different machines"

-----------------------------------------------------------

ON MASTER (Host OS IP: 192.168.1.12):

$ cat /etc/hosts

192.168.1.12	master
192.168.1.3		slave1
192.168.1.4		slave2

Note: Mappings "127.0.0.1  master" and "127.0.1.1  master" should not be there.

$ cd /usr/local/spark/conf
$ sudo gedit slaves

slave1
slave2

$ sudo gedit spark-env.sh

# YOU CAN SET PORTS HERE IF A PORT-CONFLICT ISSUE COMES UP: SPARK_MASTER_PORT=10000 / SPARK_MASTER_WEBUI_PORT=8080
# REMEMBER TO ADD THESE PORTS ALSO IN THE VM SETTING FOR PORT FORWARDING.

export SPARK_WORKER_CORES=2
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

$ sudo gedit ~/.bashrc
Add the below line at the end of the file.
export SPARK_HOME=/usr/local/spark

Later, we will start the whole Spark cluster using the following commands:
$ cd /usr/local/spark/sbin
$ source start-all.sh
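
After start-all.sh finishes, one hedged way to confirm that the daemons came up (using plain ps, since jps may not be available with a headless JRE) is:

# On master: the standalone Master daemon should be running
$ ps aux | grep -v grep | grep org.apache.spark.deploy.master.Master

# On slave1 and slave2: a Worker daemon should be running
$ ps aux | grep -v grep | grep org.apache.spark.deploy.worker.Worker

The master web UI (port 8080 by default, or SPARK_MASTER_WEBUI_PORT if one was set above) should also list both workers as ALIVE.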

-----------------------------------------------------------

ON SLAVE2 (Host OS IP: 192.168.1.4):

$ cat /etc/hostname
slave2

$ cat /etc/hosts
192.168.1.12	master
192.168.1.3		slave1
192.168.1.4		slave2

Note: Localhost mappings are removed.
---

$ cd /usr/local/spark/conf
$ sudo gedit spark-env.sh
#Setting SPARK_LOCAL_IP to "192.168.1.4" (the Host OS IP) would be wrong and would result in port failure logs.
SPARK_LOCAL_IP=127.0.0.1

SPARK_MASTER_HOST=192.168.1.12
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
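
If a worker does not show up after start-all.sh is run on the master, it can also be started by hand on the slave to test connectivity; a hedged sketch, assuming the default master port 7077 (use spark://master:SPARK_MASTER_PORT instead if a custom port such as 10000 was configured above; the script is named start-worker.sh in newer Spark releases):

$ cd /usr/local/spark/sbin
$ ./start-slave.sh spark://master:7077
$ tail -n 20 ../logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out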

-----------------------------------------------------------

FOLLOW THE STEPS MENTIONED FOR SLAVE2 ALSO FOR SLAVE1 (Host OS IP: 192.168.1.3)

-----------------------------------------------------------

Configuring Key Based Login

Set up SSH on every node so that the nodes can communicate with one another without any password prompt.

First: Follow steps 1 to 7 on every node.

1) sudo apt-get install openssh-server openssh-client
2) sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT
3) Use the network adapter 'NAT' in the Guest OS settings, and create a new port forwarding rule "SSH" for port 22.
4) sudo reboot
5) ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""
6) sudo service ssh stop
7) sudo service ssh start

Second: After 'First' is done, follow steps 8 to 10 on every node.

8) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@master
9) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave1
10) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave2

LOGS:

(base) ashish@ashish-VirtualBox:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashish-VirtualBox /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub" The authenticity of host 'ashish-virtualbox (127.0.1.1)' can't be established. ECDSA key fingerprint is SHA256:FfT9M7GMzBA/yv8dw+7hKa91B1D68gLlMCINhbj3mt4. Are you sure you want to continue connecting (yes/no)? y Please type 'yes' or 'no': yes /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys ashish@ashish-virtualbox's password: Number of key(s) added: 1 Now try logging into the machine, with: "ssh 'ashish@ashish-VirtualBox'" and check to make sure that only the key(s) you wanted were added.

-----------------------------------------------------------

FEW SSH COMMANDS:

1) Checking the SSH status:
(base) ashish@ashish-VirtualBox:~$ sudo service ssh status [sudo] password for ashish: ● ssh.service - OpenBSD Secure Shell server Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled) Active: active (running) since Wed 2019-07-24 18:03:50 IST; 1h 1min ago Process: 953 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS) Process: 946 ExecReload=/usr/sbin/sshd -t (code=exited, status=0/SUCCESS) Process: 797 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=0/SUCCESS) Main PID: 819 (sshd) Tasks: 1 (limit: 4915) CGroup: /system.slice/ssh.service └─819 /usr/sbin/sshd -D Jul 24 18:03:28 ashish-VirtualBox systemd[1]: Starting OpenBSD Secure Shell server... Jul 24 18:03:50 ashish-VirtualBox sshd[819]: Server listening on 0.0.0.0 port 22. Jul 24 18:03:50 ashish-VirtualBox sshd[819]: Server listening on :: port 22. Jul 24 18:03:50 ashish-VirtualBox systemd[1]: Started OpenBSD Secure Shell server. Jul 24 18:04:12 ashish-VirtualBox systemd[1]: Reloading OpenBSD Secure Shell server. Jul 24 18:04:12 ashish-VirtualBox sshd[819]: Received SIGHUP; restarting. Jul 24 18:04:12 ashish-VirtualBox sshd[819]: Server listening on 0.0.0.0 port 22. Jul 24 18:04:12 ashish-VirtualBox sshd[819]: Server listening on :: port 22.

2) (base) ashish@ashish-VirtualBox:/etc/ssh$ sudo gedit ssh_config
You can change SSH port here.

3) Use the network adapter 'NAT' in the GuestOS settings, and create a new port forwarding rule "SSH" for port you are mentioning in Step 2.

4.A) sudo service ssh stop
4.B) sudo service ssh start

LOGS:

ashish@master:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave1 /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub" The authenticity of host 'slave1 (192.168.1.3)' can't be established. ECDSA key fingerprint is SHA256:+GsO1Q6ilqwIYfZLIrBTtt/5HqltZPSjVlI36C+f7ZE. Are you sure you want to continue connecting (yes/no)? yes /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys ashish@slave1's password: Number of key(s) added: 1 Now try logging into the machine, with: "ssh 'ashish@slave1'" and check to make sure that only the key(s) you wanted were added.
ashish@master:~$ ssh slave1 Welcome to Ubuntu 19.04 (GNU/Linux 5.0.0-21-generic x86_64) * Documentation: https://help.ubuntu.com * Management: https://landscape.canonical.com * Support: https://ubuntu.com/advantage 99 updates can be installed immediately. 0 of these updates are security updates.

-----------------------------------------------------------

LOGS FROM SPARK MASTER ON SUCCESSFUL START:

$ cd /usr/local/spark/sbin
$ source start-all.sh

ashish@master:/usr/local/spark/sbin$ cat /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.master.Master-1-master.out Spark Command: /usr/lib/jvm/java-8-openjdk-amd64//bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host master --port 60000 --webui-port 50000 ======================================== Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 19/08/05 17:32:09 INFO Master: Started daemon with process name: 1664@master 19/08/05 17:32:10 INFO SignalUtils: Registered signal handler for TERM 19/08/05 17:32:10 INFO SignalUtils: Registered signal handler for HUP 19/08/05 17:32:10 INFO SignalUtils: Registered signal handler for INT 19/08/05 17:32:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 19/08/05 17:32:11 INFO SecurityManager: Changing view acls to: ashish 19/08/05 17:32:11 INFO SecurityManager: Changing modify acls to: ashish 19/08/05 17:32:11 INFO SecurityManager: Changing view acls groups to: 19/08/05 17:32:11 INFO SecurityManager: Changing modify acls groups to: 19/08/05 17:32:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set() 19/08/05 17:32:13 INFO Utils: Successfully started service 'sparkMaster' on port 60000. 19/08/05 17:32:13 INFO Master: Starting Spark master at spark://master:60000 19/08/05 17:32:13 INFO Master: Running Spark version 2.4.3 19/08/05 17:32:13 INFO Utils: Successfully started service 'MasterUI' on port 50000. 19/08/05 17:32:13 INFO MasterWebUI: Bound MasterWebUI to 127.0.0.1, and started at http://master:50000 19/08/05 17:32:14 INFO Master: I have been elected leader! New state: ALIVE

-----------------------------------------------------------
Tags: Technology,Spark,Linux

Spark installation on 3 RHEL based nodes cluster (Issue Resolution in Apr 2020)

Configurations:
  Hostname and IP mappings:
    Check the "/etc/hosts" file by opening it in both NANO and VI.

192.168.1.12 MASTER master
192.168.1.3  SLAVE1 slave1
192.168.1.4  SLAVE2 slave2
  
  Software configuration:
    (base) [admin@SLAVE2 downloads]$ java -version
      openjdk version "1.8.0_181"
      OpenJDK Runtime Environment (build 1.8.0_181-b13)
      OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
    
    (base) [admin@MASTER ~]$ cd /opt/ml/downloads
    (base) [admin@MASTER downloads]$ ls
      Anaconda3-2020.02-Linux-x86_64.sh
      hadoop-3.2.1.tar.gz
      scala-2.13.2.rpm
      spark-3.0.0-preview2-bin-hadoop3.2.tgz
   
    # Scala can be downloaded from here.
    # Installation command: sudo rpm -i scala-2.13.2.rpm
   
    (base) [admin@MASTER downloads]$ echo $JAVA_HOME
      /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-7.b13.el7.x86_64/jre/
  
    PATH: /usr/local/hadoop/etc/hadoop/hadoop-env.sh
      JAVA_HOME ON 'master': /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-7.b13.el7.x86_64/jre/
      JAVA_HOME on 'slave1': /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre

~ ~ ~

In the case of no internet connectivity, installation of 'openssh-server' and 'openssh-client' is not straightforward. These packages have nested dependencies that are hard to resolve.

 (base) [admin@SLAVE2 downloads]$ sudo rpm -i openssh-server-8.0p1-4.el8_1.x86_64.rpm
  warning: openssh-server-8.0p1-4.el8_1.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID 8483c65d: NOKEY
  error: Failed dependencies:
    crypto-policies >= 20180306-1 is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libc.so.6(GLIBC_2.25)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libc.so.6(GLIBC_2.26)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypt.so.1(XCRYPT_2.0)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypto.so.1.1()(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypto.so.1.1(OPENSSL_1_1_0)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    libcrypto.so.1.1(OPENSSL_1_1_1b)(64bit) is needed by openssh-server-8.0p1-4.el8_1.x86_64
    openssh = 8.0p1-4.el8_1 is needed by openssh-server-8.0p1-4.el8_1.x86_64
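
One hedged workaround for such dependency chains is to download all the required RPMs on a machine that does have internet access, copy them over, and install them in a single transaction so that dependencies inside the set resolve against each other:

# From the directory holding the downloaded RPMs
$ sudo yum localinstall ./*.rpm
# or, with plain rpm:
$ sudo rpm -ivh ./*.rpm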

~ ~ ~

Doing SSH setup:
  1) sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT
  2) sudo reboot
  3) ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""
  4) ssh-copy-id -i ~/.ssh/id_rsa.pub admin@SLAVE2
  5) ssh-copy-id -i ~/.ssh/id_rsa.pub admin@MASTER
  6) ssh-copy-id -i ~/.ssh/id_rsa.pub admin@SLAVE1

COMMAND FAILURE ON RHEL:
  [admin@MASTER ~]$ sudo service ssh stop
    Redirecting to /bin/systemctl stop ssh.service
    Failed to stop ssh.service: Unit ssh.service not loaded.
    
  [admin@MASTER ~]$ sudo service ssh start
    Redirecting to /bin/systemctl start ssh.service
    Failed to start ssh.service: Unit not found.
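
The unit is simply named differently on RHEL/CentOS: the OpenSSH daemon runs as 'sshd', not 'ssh', so the equivalent commands are:

$ sudo systemctl status sshd
$ sudo systemctl restart sshd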

Testing of SSH is through: ssh 'admin@SLAVE1'

~ ~ ~

To activate the Conda 'base' environment at system startup, the following code snippet goes at the end of the "~/.bashrc" file.

  # >>> conda initialize >>>
  # !! Contents within this block are managed by 'conda init' !!
  __conda_setup="$('/home/admin/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
  if [ $? -eq 0 ]; then
      eval "$__conda_setup"
  else
      if [ -f "/home/admin/anaconda3/etc/profile.d/conda.sh" ]; then
          . "/home/admin/anaconda3/etc/profile.d/conda.sh"
      else
          export PATH="/home/admin/anaconda3/bin:$PATH"
      fi
  fi
  unset __conda_setup
  # <<< conda initialize <<<

~ ~ ~

CHECKING THE OUTPUT OF 'start-dfs.sh' ON MASTER:
 (base) [admin@MASTER sbin]$ ps aux | grep java
   admin     7461 40.5  1.4 6010824 235120 ?      Sl   21:57   0:07 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre/bin/java -Dproc_secondarynamenode -Djava.net.preferIPv4Stack=true -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=/usr/local/hadoop/logs -Dyarn.log.file=hadoop-admin-secondarynamenode-MASTER.log -Dyarn.home.dir=/usr/local/hadoop -Dyarn.root.logger=INFO,console -Djava.library.path=/usr/local/hadoop/lib/native -Dhadoop.log.dir=/usr/local/hadoop/logs -Dhadoop.log.file=hadoop-admin-secondarynamenode-MASTER.log -Dhadoop.home.dir=/usr/local/hadoop -Dhadoop.id.str=admin -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml o.a.h.hdfs.server.namenode.SecondaryNameNode
   
   ...

OR
  $ ps -aux | grep java | awk '{print $12}'
    -Dproc_secondarynamenode
    ...

~ ~ ~

CREATING THE 'DATANODE' AND 'NAMENODE' DIRECTORIES:

  (base) [admin@MASTER logs]$ cd ~
  (base) [admin@MASTER ~]$ pwd
      /home/admin
  (base) [admin@MASTER ~]$ cd ..
  (base) [admin@MASTER home]$ sudo mkdir hadoop
  (base) [admin@MASTER home]$ sudo chmod 777 hadoop
  (base) [admin@MASTER home]$ cd hadoop
  (base) [admin@MASTER hadoop]$ sudo mkdir data
  (base) [admin@MASTER hadoop]$ sudo chmod 777 data
  (base) [admin@MASTER hadoop]$ cd data
  (base) [admin@MASTER data]$ sudo mkdir dataNode
  (base) [admin@MASTER data]$ sudo chmod 777 dataNode
  (base) [admin@MASTER data]$ sudo mkdir nameNode
  (base) [admin@MASTER data]$ sudo chmod 777 nameNode
  (base) [admin@MASTER data]$ pwd
      /home/hadoop/data
  (base) [admin@SLAVE1 data]$ sudo chown admin *
  (base) [admin@MASTER data]$ ls -lrt
      total 0
      drwxrwxrwx. 2 admin root 6 Apr 27 22:24 dataNode
      drwxrwxrwx. 2 admin root 6 Apr 27 22:37 nameNode
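
As a hedged shortcut, the same layout can be created in one pass, giving ownership to the 'admin' user instead of relying on world-writable permissions:

$ sudo mkdir -p /home/hadoop/data/nameNode /home/hadoop/data/dataNode
$ sudo chown -R admin /home/hadoop/data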

# Error example with the NameNode execution if 'data/nameNode' folder is not accessible:

File: /usr/local/hadoop/logs/hadoop-admin-namenode-MASTER.log:

2019-10-17 21:45:39,714 WARN o.a.h.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
o.a.h.hdfs.server.common.InconsistentFSStateException: Directory /home/hadoop/data/nameNode is in an inconsistent state: storage directory does not exist or is not accessible.
	...
  at o.a.h.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1692)
	at o.a.h.hdfs.server.namenode.NameNode.main(NameNode.java:1759)
	
# Error example with the DataNode execution if 'data/dataNode' folder is not accessible:

File: /usr/local/hadoop/logs/hadoop-admin-datanode-SLAVE1.log

2019-10-17 22:30:49,302 WARN o.a.h.hdfs.server.datanode.checker.StorageLocationChecker: Exception checking StorageLocation [DISK]file:/home/hadoop/data/dataNode
java.io.FileNotFoundException: File file:/home/hadoop/data/dataNode does not exist
        ...
2019-10-17 22:30:49,307 ERROR o.a.h.hdfs.server.datanode.DataNode: Exception in secureMain
o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
        ...
        at o.a.h.hdfs.server.datanode.DataNode.main(DataNode.java:2924)
2019-10-17 22:30:49,310 INFO o.a.h.util.ExitUtil: Exiting with status 1: o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
2019-10-17 22:30:49,335 INFO o.a.h.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at SLAVE1/192.168.1.3
************************************************************/

~ ~ ~

If 'data/dataNode' is not writable by the user running the DataNode daemon, the following failure logs appeared on SLAVE1:
File: /usr/local/hadoop/logs/hadoop-admin-datanode-MASTER.log

2019-10-17 22:37:33,820 WARN o.a.h.hdfs.server.datanode.checker.StorageLocationChecker: Exception checking StorageLocation [DISK]file:/home/hadoop/data/dataNode
EPERM: Operation not permitted
        ...
        at java.lang.Thread.run(Thread.java:748)
2019-10-17 22:37:33,825 ERROR o.a.h.hdfs.server.datanode.DataNode: Exception in secureMain
o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
        at o.a.h.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:231)
        ...
        at o.a.h.hdfs.server.datanode.DataNode.main(DataNode.java:2924)
2019-10-17 22:37:33,829 INFO o.a.h.util.ExitUtil: Exiting with status 1: o.a.h.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
2019-10-17 22:37:33,838 INFO o.a.h.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at SLAVE1/192.168.1.3
************************************************************/ 

~ ~ ~

Success logs if "DataNode" program comes up successfully on slave machines:

SLAVE1 SUCCESS MESSAGE FOR DATANODE:

	2019-10-17 22:49:47,572 INFO o.a.h.hdfs.server.datanode.DataNode: STARTUP_MSG:
	/************************************************************
	STARTUP_MSG: Starting DataNode
	STARTUP_MSG:   host = SLAVE1/192.168.1.3
	STARTUP_MSG:   args = []
	STARTUP_MSG:   version = 3.2.1
	...
	STARTUP_MSG:   build = https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842; compiled by 'rohithsharmaks' on 2019-09-10T15:56Z
	STARTUP_MSG:   java = 1.8.0_171
	...
	2019-10-17 22:49:49,489 INFO o.a.h.hdfs.server.datanode.DataNode: Starting DataNode with maxLockedMemory = 0
	2019-10-17 22:49:49,543 INFO o.a.h.hdfs.server.datanode.DataNode: Opened streaming server at /0.0.0.0:9866
	2019-10-17 22:49:49,549 INFO o.a.h.hdfs.server.datanode.DataNode: Balancing bandwidth is 10485760 bytes/s
	2019-10-17 22:49:49,549 INFO o.a.h.hdfs.server.datanode.DataNode: Number threads for balancing is 50 
	...

ALSO:
	(base) [admin@SLAVE1 logs]$ ps -aux | grep java | awk '{print $12}'
		...
		-Dproc_datanode
		...

MASTER SUCCESS MESSAGE FOR DATANODE:
	(base) [admin@MASTER sbin]$ ps -aux | grep java | awk '{print $12}'
		-Dproc_datanode
		-Dproc_secondarynamenode
		...

~ ~ ~

FAILURE LOGS FROM MASTER FOR ERROR IN NAMENODE:
(base) [admin@MASTER logs]$ cat hadoop-admin-namenode-MASTER.log
	2019-10-17 22:49:56,593 ERROR o.a.h.hdfs.server.namenode.NameNode: Failed to start namenode.
	java.io.IOException: NameNode is not formatted.
			at o.a.h.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:252)
			...
			at o.a.h.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1692)
			at o.a.h.hdfs.server.namenode.NameNode.main(NameNode.java:1759)
	2019-10-17 22:49:56,596 INFO o.a.h.util.ExitUtil: Exiting with status 1: java.io.IOException: NameNode is not formatted.
	2019-10-17 22:49:56,600 INFO o.a.h.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
	/************************************************************
	SHUTDOWN_MSG: Shutting down NameNode at MASTER/192.168.1.12
	************************************************************/ 

FIX:
	Previously: "hadoop namenode -format" 
	On Hadoop 3.X: "hdfs namenode -format"

	The Hadoop NameNode directory contains the fsimage and configuration files that hold the basic information about the Hadoop file system, such as where data is available, which user created the files, etc.

	If you format the NameNode, then the above information is deleted from the NameNode directory, which is specified in "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" as "dfs.namenode.name.dir".

	After formatting, you still have the data blocks on the DataNodes, but not the NameNode metadata that maps files to those blocks.
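
	A hedged sequence for applying the fix on the master (run the scripts from $HADOOP_HOME/sbin if they are not on the PATH; formatting wipes the metadata under dfs.namenode.name.dir, so it is only appropriate for a fresh cluster):

	$ stop-dfs.sh              # stop the HDFS daemons first
	$ hdfs namenode -format    # recreate the NameNode metadata
	$ start-dfs.sh             # start HDFS again and re-check with 'ps aux | grep java'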

SUCCESS AFTER THE FIX ON MASTER:
	(base) [admin@MASTER sbin]$ ps -aux | grep java | awk '{print $12}'
		-Dproc_namenode
		-Dproc_datanode
		-Dproc_secondarynamenode
		...

~ ~ ~

MOVING ON TO SPARK:
WE HAVE YARN SO WE WILL NOT MAKE USE OF '/usr/local/spark/conf/slaves' FILE.

(base) [admin@MASTER conf]$ cat slaves.template
# A Spark Worker will be started on each of the machines listed below.
... 
		
~ ~ ~

FAILURE LOGS FROM 'spark-submit':
2019-10-17 23:23:03,832 INFO ipc.Client: Retrying connect to server: 192.168.1.12/192.168.1.12:8032. Already tried 0 time(s); maxRetries=45
2019-10-17 23:23:23,836 INFO ipc.Client: Retrying connect to server: 192.168.1.12/192.168.1.12:8032. Already tried 1 time(s); maxRetries=45
2019-10-17 23:23:43,858 INFO ipc.Client: Retrying connect to server: 192.168.1.12/192.168.1.12:8032. Already tried 2 time(s); maxRetries=45 

THE PROBLEM IS IN CONNECTING WITH THE RESOURCE MANAGER AS DESCRIBED IN THE PROPERTIES FILE YARN-SITE.XML ($HADOOP_HOME/etc/hadoop/yarn-site.xml):
	LOOK FOR THIS: yarn.resourcemanager.address
	FIX: SET IT TO MASTER IP
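
	After updating yarn-site.xml and restarting YARN, a quick hedged connectivity check against the ResourceManager RPC port from any node (assuming netcat is installed):

	$ nc -zv 192.168.1.12 8032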
		
~ ~ ~

SUCCESS LOGS FOR STARTING OF SERVICES AFTER INSTALLATION OF HADOOP AND SPARK:
	(base) [admin@MASTER hadoop/sbin]$ start-all.sh
		Starting namenodes on [master]
		
		Starting datanodes
		master: This system is restricted to authorized users. 
		slave1: This system is restricted to authorized users. 
		
		Starting secondary namenodes [MASTER]
		MASTER: This system is restricted to authorized users. 
		
		Starting resourcemanager
		
		Starting nodemanagers
		master: This system is restricted to authorized users. 
		slave1: This system is restricted to authorized users. 
		
		(base) [admin@MASTER sbin]$

	(base) [admin@MASTER sbin]$ ps aux | grep java | awk '{print $12}'
		-Dproc_namenode
		-Dproc_datanode
		-Dproc_secondarynamenode
		-Dproc_resourcemanager
		-Dproc_nodemanager
		...

ON SLAVE1:
	(base) [admin@SLAVE1 ~]$ ps aux | grep java | awk '{print $12}'
		-Dproc_datanode
		-Dproc_nodemanager
		...

~ ~ ~

FAILURE LOGS FROM SPARK-SUBMIT ON MASTER:
	2019-10-17 23:54:26,189 INFO cluster.YarnScheduler: Adding task set 0.0 with 100 tasks
	2019-10-17 23:54:41,247 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
	2019-10-17 23:54:56,245 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
	2019-10-17 23:55:11,246 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Reason:	The Spark application has no resources allocated to execute the job, i.e. no registered worker/slave nodes with sufficient memory and cores.
Fix for this setup: make changes in /usr/local/hadoop/etc/hadoop/yarn-site.xml
Ref: StackOverflow

~ ~ ~

CONNECTIVITY (OR PORT) RELATED ISSUE INSTANCE 1:
	ISSUE WITH DATANODE ON SLAVE1:
		(base) [admin@SLAVE1 logs]$ pwd
			/usr/local/hadoop/logs
			
		(base) [admin@SLAVE1 logs]$cat hadoop-admin-datanode-SLAVE1.log
		
		(base) [admin@SLAVE1 logs]$
			2019-10-17 22:50:40,384 WARN o.a.h.hdfs.server.datanode.DataNode: Problem connecting to server: master/192.168.1.12:9000
			2019-10-17 22:50:46,416 INFO o.a.h.ipc.Client: Retrying connect to server: master/192.168.1.12:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
			
CONNECTIVITY (OR PORT) RELATED ISSUE INSTANCE 2:
	(base) [admin@MASTER logs]$ cat hadoop-admin-nodemanager-MASTER.log
		2019-10-18 00:24:17,473 INFO o.a.h.ipc.Client: Retrying connect to server: MASTER/192.168.1.12:8031. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

FIX: Allow connectivity between the IPs of the nodes in the cluster, and bring down the firewall on the cluster nodes:
    sudo /sbin/iptables -A INPUT -p tcp -s 192.168.1.12 -j ACCEPT
    sudo /sbin/iptables -A OUTPUT -p tcp -d 192.168.1.12 -j ACCEPT
    sudo /sbin/iptables -A INPUT -p tcp -s 192.168.1.3 -j ACCEPT
    sudo /sbin/iptables -A OUTPUT -p tcp -d 192.168.1.3 -j ACCEPT
    
    sudo systemctl stop iptables
    sudo service firewalld stop

Also, check port (here 80) connectivity as shown below:
1. lsof -i :80
2. netstat -an | grep 80 | grep LISTEN

~ ~ ~

ISSUE IN SPARK-SUBMIT LOGS ON MASTER:
    Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

FIX IS TO BE DONE ON ALL THE NODES ON THE CLUSTER:
	(base) [admin@SLAVE1 bin]$ ls -lrt /home/admin/anaconda3/bin/python3.7
	-rwx------. 1 admin wheel 12812592 May  6  2019 /home/admin/anaconda3/bin/python3.7

	(base) [admin@MASTER spark]$ pwd
	/usr/local/spark/conf
	
	(base) [admin@MASTER conf]$ ls
	fairscheduler.xml.template  log4j.properties.template  metrics.properties.template  slaves  slaves.template  spark-defaults.conf.template  spark-env.sh.template
	
	(base) [admin@MASTER conf]$ cp spark-env.sh.template spark-env.sh

	PUT THESE PROPERTIES IN THE FILE "/usr/local/spark/conf/spark-env.sh":
		export PYSPARK_PYTHON=/home/admin/anaconda3/bin/python3.7
		export PYSPARK_DRIVER_PYTHON=/home/admin/anaconda3/bin/python3.7
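
	The same interpreters can also be pinned per job at submit time, instead of (or in addition to) spark-env.sh, using the standard configuration keys; a hedged example:

	$ ../bin/spark-submit --master yarn \
	    --conf spark.pyspark.python=/home/admin/anaconda3/bin/python3.7 \
	    --conf spark.pyspark.driver.python=/home/admin/anaconda3/bin/python3.7 \
	    ../examples/src/main/python/pi.py 100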

~ ~ ~

ERROR LOGS IF 'EXECUTOR-MEMORY' ARGUMENT OF SPARK-SUBMIT ASKS FOR MORE MEMORY THAN DEFINED IN YARN CONFIGURATION:

FILE INSTANCE 1:
  $HADOOP_HOME: /usr/local/hadoop
  
  (base) [admin@MASTER hadoop]$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
  
  <configuration>
    <property>
      <name>yarn.acl.enable</name>
      <value>0</value>
    </property>
    
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>192.168.1.12</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
  
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>4000</value>
    </property>
    
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8000</value>
    </property>
    
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>128</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>
  </configuration>

ERROR INSTANCE 1:

	(base) [admin@MASTER sbin]$ ../bin/spark-submit --master yarn --executor-memory 12G ../examples/src/main/python/pi.py 100
	
	2019-10-18 13:59:07,891 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
	2019-10-18 13:59:09,502 INFO spark.SparkContext: Running Spark version 3.0.0-preview2
	2019-10-18 13:59:09,590 INFO resource.ResourceUtils: ==============================================================
	2019-10-18 13:59:09,593 INFO resource.ResourceUtils: Resources for spark.driver:

	2019-10-18 13:59:09,594 INFO resource.ResourceUtils: ==============================================================
	2019-10-18 13:59:09,596 INFO spark.SparkContext: Submitted application: PythonPi
	2019-10-18 13:59:09,729 INFO spark.SecurityManager: Changing view acls to: admin
	2019-10-18 13:59:09,729 IN

	2019-10-18 13:59:13,927 INFO spark.SparkContext: Successfully stopped SparkContext
	Traceback (most recent call last):
	  File "/usr/local/spark/sbin/../examples/src/main/python/pi.py", line 33, in [module]
		.appName("PythonPi")\
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 183, in getOrCreate
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 370, in getOrCreate
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 130, in __init__
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 192, in _do_init
	  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 309, in _initialize_context
	  File "/usr/local/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1554, in __call__
	  File "/usr/local/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
	py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
	: java.lang.IllegalArgumentException: Required executor memory (12288 MB), offHeap memory (0) MB, overhead (1228 MB), and PySpark memory (0 MB) is above the max threshold (4000 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
			...
			at java.lang.Thread.run(Thread.java:748)

	2019-10-18 13:59:14,005 INFO util.ShutdownHookManager: Shutdown hook called
	2019-10-18 13:59:14,007 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-fbead587-b1ae-4e8e-acd4-160e585a6f34
	2019-10-18 13:59:14,012 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-3331bae2-e2d1-47f6-886c-317be6c98339 

FILE INSTANCE 2:

  <configuration>
    <property>
      <name>yarn.acl.enable</name>
      <value>0</value>
    </property>
    
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>192.168.1.12</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>12000</value>
    </property>
    
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>10000</value>
    </property>
    
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>128</value>
    </property>
  </configuration>
  
ERROR INSTANCE 2:
  (base) [admin@MASTER sbin]$ ../bin/spark-submit --master yarn ../examples/src/main/python/pi.py 100
    py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
    : java.lang.IllegalArgumentException: Required executor memory (12288 MB), offHeap memory (0) MB, overhead (1228 MB), and PySpark memory (0 MB) is above the max threshold (10000 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'. 
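
A hedged way past this error is to request an executor size that, together with the roughly 10% memory overhead, stays under yarn.scheduler.maximum-allocation-mb; passing the flag explicitly overrides whatever executor memory is configured elsewhere:

  $ ../bin/spark-submit --master yarn --executor-memory 8G ../examples/src/main/python/pi.py 100
  # 8192 MB + ~819 MB overhead ≈ 9011 MB, which is below the 10000 MB maximum allocation above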

Related Articles:
% Getting started with Hadoop on Ubuntu in VirtualBox
% Setting up three node Hadoop cluster on Ubuntu using VirtualBox
% Getting started with Spark on Ubuntu in VirtualBox
% Setting up a three node Spark cluster on Ubuntu using VirtualBox (Apr 2020)
% Notes on setting up Spark with YARN three node cluster
Tags: Technology,Spark,Linux