survival8

Sunday, October 23, 2022

spark-submit For Two Node Spark Cluster With Spark's Standalone RM For Pi Computation (2022 Oct 23)

Previously: Creating Two Node Spark Cluster With Two Worker Nodes and One Master Node Using Spark's Standalone Resource Manager on Ubuntu Machines

Issue

(base) ashish@ashishlaptop:/usr/local/spark$ spark-submit --master spark://ashishlaptop:7077 examples/src/main/python/pi.py 100


22/10/23 15:14:36 INFO SparkContext: Running Spark version 3.3.0
22/10/23 15:14:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/23 15:14:36 INFO ResourceUtils: ==============================================================
22/10/23 15:14:36 INFO ResourceUtils: No custom resources configured for spark.driver.
22/10/23 15:14:36 INFO ResourceUtils: ==============================================================
22/10/23 15:14:36 INFO SparkContext: Submitted application: PythonPi
22/10/23 15:14:36 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/10/23 15:14:36 INFO ResourceProfile: Limiting resource is cpu
22/10/23 15:14:36 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/10/23 15:14:36 INFO SecurityManager: Changing view acls to: ashish
22/10/23 15:14:36 INFO SecurityManager: Changing modify acls to: ashish
22/10/23 15:14:36 INFO SecurityManager: Changing view acls groups to: 
22/10/23 15:14:36 INFO SecurityManager: Changing modify acls groups to: 
22/10/23 15:14:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ashish); groups with view permissions: Set(); users  with modify permissions: Set(ashish); groups with modify permissions: Set()
22/10/23 15:14:37 INFO Utils: Successfully started service 'sparkDriver' on port 41631.
22/10/23 15:14:37 INFO SparkEnv: Registering MapOutputTracker
22/10/23 15:14:37 INFO SparkEnv: Registering BlockManagerMaster
22/10/23 15:14:37 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/10/23 15:14:37 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/10/23 15:14:37 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/10/23 15:14:37 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-9599974d-836e-482e-bcf1-5c6e15c29ce9
22/10/23 15:14:37 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
22/10/23 15:14:37 INFO SparkEnv: Registering OutputCommitCoordinator
22/10/23 15:14:37 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://ashishlaptop:7077...
22/10/23 15:14:38 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 45 ms (0 ms spent in bootstraps)
22/10/23 15:14:38 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20221023151438-0000
22/10/23 15:14:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44369.
22/10/23 15:14:38 INFO NettyBlockTransferService: Server created on ashishlaptop:44369
22/10/23 15:14:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/10/23 15:14:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ashishlaptop, 44369, None)
22/10/23 15:14:38 INFO BlockManagerMasterEndpoint: Registering block manager ashishlaptop:44369 with 366.3 MiB RAM, BlockManagerId(driver, ashishlaptop, 44369, None)
22/10/23 15:14:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ashishlaptop, 44369, None)
22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023151438-0000/0 on worker-20221023135355-192.168.1.142-43143 (192.168.1.142:43143) with 4 core(s)
22/10/23 15:14:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ashishlaptop, 44369, None)
22/10/23 15:14:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023151438-0000/0 on hostPort 192.168.1.142:43143 with 4 core(s), 1024.0 MiB RAM
22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023151438-0000/1 on worker-20221023135358-192.168.1.106-44471 (192.168.1.106:44471) with 2 core(s)
22/10/23 15:14:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023151438-0000/1 on hostPort 192.168.1.106:44471 with 2 core(s), 1024.0 MiB RAM
22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023151438-0000/0 is now RUNNING
22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023151438-0000/1 is now RUNNING
22/10/23 15:14:39 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
22/10/23 15:14:40 INFO SparkContext: Starting job: reduce at /usr/local/spark/examples/src/main/python/pi.py:42
22/10/23 15:14:41 INFO DAGScheduler: Got job 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) with 100 output partitions
22/10/23 15:14:41 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42)
22/10/23 15:14:41 INFO DAGScheduler: Parents of final stage: List()
22/10/23 15:14:41 INFO DAGScheduler: Missing parents: List()
22/10/23 15:14:41 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42), which has no missing parents
22/10/23 15:14:41 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.3 KiB, free 366.3 MiB)
22/10/23 15:14:41 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.5 KiB, free 366.3 MiB)
22/10/23 15:14:41 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ashishlaptop:44369 (size: 8.5 KiB, free: 366.3 MiB)
22/10/23 15:14:41 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1513
22/10/23 15:14:41 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
22/10/23 15:14:41 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks resource profile 0
22/10/23 15:14:43 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.142:37452) with ID 0,  ResourceProfileId 0
22/10/23 15:14:43 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.142:34419 with 366.3 MiB RAM, BlockManagerId(0, 192.168.1.142, 34419, None)
22/10/23 15:14:43 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.1.142, executor 0, partition 0, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:14:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (192.168.1.142, executor 0, partition 1, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:14:43 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2) (192.168.1.142, executor 0, partition 2, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:14:43 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (192.168.1.142, executor 0, partition 3, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:14:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.142:34419 (size: 8.5 KiB, free: 366.3 MiB)
22/10/23 15:14:46 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (192.168.1.142, executor 0, partition 4, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:14:46 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5) (192.168.1.142, executor 0, partition 5, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:14:46 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6) (192.168.1.142, executor 0, partition 6, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:14:46 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7) (192.168.1.142, executor 0, partition 7, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:14:46 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.106:44292) with ID 1,  ResourceProfileId 0
22/10/23 15:14:46 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) (192.168.1.142 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 540, in main
    raise RuntimeError(
RuntimeError: Python in worker has different version 3.10 than that in driver 3.9, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

...
22/10/23 15:14:47 INFO SparkContext: Invoking stop() from shutdown hook
22/10/23 15:14:47 INFO SparkUI: Stopped Spark web UI at http://ashishlaptop:4040
22/10/23 15:14:47 INFO StandaloneSchedulerBackend: Shutting down all executors
22/10/23 15:14:47 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
22/10/23 15:14:47 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/10/23 15:14:47 INFO MemoryStore: MemoryStore cleared
22/10/23 15:14:47 INFO BlockManager: BlockManager stopped
22/10/23 15:14:47 INFO BlockManagerMaster: BlockManagerMaster stopped
22/10/23 15:14:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/10/23 15:14:47 INFO SparkContext: Successfully stopped SparkContext
22/10/23 15:14:47 INFO ShutdownHookManager: Shutdown hook called
22/10/23 15:14:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-c60126be-f479-4617-8548-ad0ca7f00763/pyspark-40737be6-41de-4d50-859d-88e13123232b
22/10/23 15:14:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-0915b97c-253d-4807-9eb6-e8f3d1a7019c
22/10/23 15:14:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-c60126be-f479-4617-8548-ad0ca7f00763
(base) ashish@ashishlaptop:/usr/local/spark$ 
    
Debugging
(base) ashish@ashishlaptop:/usr/local/spark$ echo $PYSPARK_PYTHON

(base) ashish@ashishlaptop:/usr/local/spark$ echo $PYSPARK_DRIVER_PYTHON

(base) ashish@ashishlaptop:/usr/local/spark$ 

Both are empty.

Setting the environment variables

(base) ashish@ashishlaptop:/usr/local/spark$ which python

/home/ashish/anaconda3/bin/python

(base) ashish@ashishlaptop:/usr/local/spark$ /home/ashish/anaconda3/bin/python

Python 3.9.12 (main, Apr  5 2022, 06:56:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

(base) ashish@ashishlaptop:/usr/local/spark$ sudo nano ~/.bashrc
[sudo] password for ashish: 
(base) ashish@ashishlaptop:/usr/local/spark$ 
(base) ashish@ashishlaptop:/usr/local/spark$ tail ~/.bashrc

unset __conda_setup
# <<< conda initialize <<<

export PATH="/home/ashish/.local/bin:$PATH"
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export PATH="$PATH:/usr/local/spark/bin"
export PYSPARK_PYTHON="/home/ashish/anaconda3/bin/python"
export PYSPARK_DRIVER_PYTHON="/home/ashish/anaconda3/bin/python" 

(base) ashish@ashishlaptop:/usr/local/spark$ source ~/.bashrc
(base) ashish@ashishlaptop:/usr/local/spark$ echo $PYSPARK_PYTHON

/home/ashish/anaconda3/bin/python

Logs After Issue Resolution

(base) ashish@ashishlaptop:/usr/local/spark$ spark-submit --master spark://ashishlaptop:7077 examples/src/main/python/pi.py 100


22/10/23 15:30:51 INFO SparkContext: Running Spark version 3.3.0
22/10/23 15:30:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/23 15:30:52 INFO ResourceUtils: ==============================================================
22/10/23 15:30:52 INFO ResourceUtils: No custom resources configured for spark.driver.
22/10/23 15:30:52 INFO ResourceUtils: ==============================================================
22/10/23 15:30:52 INFO SparkContext: Submitted application: PythonPi
22/10/23 15:30:52 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/10/23 15:30:52 INFO ResourceProfile: Limiting resource is cpu
22/10/23 15:30:52 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/10/23 15:30:52 INFO SecurityManager: Changing view acls to: ashish
22/10/23 15:30:52 INFO SecurityManager: Changing modify acls to: ashish
22/10/23 15:30:52 INFO SecurityManager: Changing view acls groups to: 
22/10/23 15:30:52 INFO SecurityManager: Changing modify acls groups to: 
22/10/23 15:30:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ashish); groups with view permissions: Set(); users  with modify permissions: Set(ashish); groups with modify permissions: Set()
22/10/23 15:30:52 INFO Utils: Successfully started service 'sparkDriver' on port 41761.
22/10/23 15:30:52 INFO SparkEnv: Registering MapOutputTracker
22/10/23 15:30:52 INFO SparkEnv: Registering BlockManagerMaster
22/10/23 15:30:52 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/10/23 15:30:52 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/10/23 15:30:52 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/10/23 15:30:52 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ffa15e79-7af0-41f9-87eb-fce866f17ed8
22/10/23 15:30:53 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
22/10/23 15:30:53 INFO SparkEnv: Registering OutputCommitCoordinator
22/10/23 15:30:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/10/23 15:30:53 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://ashishlaptop:7077...
22/10/23 15:30:53 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 58 ms (0 ms spent in bootstraps)
22/10/23 15:30:53 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20221023153053-0001
22/10/23 15:30:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023153053-0001/0 on worker-20221023135355-192.168.1.142-43143 (192.168.1.142:43143) with 4 core(s)
22/10/23 15:30:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023153053-0001/0 on hostPort 192.168.1.142:43143 with 4 core(s), 1024.0 MiB RAM
22/10/23 15:30:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32809.
22/10/23 15:30:53 INFO NettyBlockTransferService: Server created on ashishlaptop:32809
22/10/23 15:30:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/10/23 15:30:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023153053-0001/1 on worker-20221023135358-192.168.1.106-44471 (192.168.1.106:44471) with 2 core(s)
22/10/23 15:30:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023153053-0001/1 on hostPort 192.168.1.106:44471 with 2 core(s), 1024.0 MiB RAM
22/10/23 15:30:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ashishlaptop, 32809, None)
22/10/23 15:30:53 INFO BlockManagerMasterEndpoint: Registering block manager ashishlaptop:32809 with 366.3 MiB RAM, BlockManagerId(driver, ashishlaptop, 32809, None)
22/10/23 15:30:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ashishlaptop, 32809, None)
22/10/23 15:30:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ashishlaptop, 32809, None)
22/10/23 15:30:54 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023153053-0001/0 is now RUNNING
22/10/23 15:30:54 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023153053-0001/1 is now RUNNING
22/10/23 15:30:54 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
22/10/23 15:30:55 INFO SparkContext: Starting job: reduce at /usr/local/spark/examples/src/main/python/pi.py:42
22/10/23 15:30:56 INFO DAGScheduler: Got job 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) with 100 output partitions
22/10/23 15:30:56 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42)
22/10/23 15:30:56 INFO DAGScheduler: Parents of final stage: List()
22/10/23 15:30:56 INFO DAGScheduler: Missing parents: List()
22/10/23 15:30:56 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42), which has no missing parents
22/10/23 15:30:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.4 KiB, free 366.3 MiB)
22/10/23 15:30:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.5 KiB, free 366.3 MiB)
22/10/23 15:30:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ashishlaptop:32809 (size: 8.5 KiB, free: 366.3 MiB)
22/10/23 15:30:56 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1513
22/10/23 15:30:56 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
22/10/23 15:30:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks resource profile 0
22/10/23 15:30:58 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.142:54146) with ID 0,  ResourceProfileId 0
22/10/23 15:30:59 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.142:46811 with 366.3 MiB RAM, BlockManagerId(0, 192.168.1.142, 46811, None)
22/10/23 15:30:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.1.142, executor 0, partition 0, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:30:59 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (192.168.1.142, executor 0, partition 1, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:30:59 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2) (192.168.1.142, executor 0, partition 2, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:30:59 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (192.168.1.142, executor 0, partition 3, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:30:59 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.142:46811 (size: 8.5 KiB, free: 366.3 MiB)
22/10/23 15:31:01 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.106:60352) with ID 1,  ResourceProfileId 0
22/10/23 15:31:01 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.106:41617 with 366.3 MiB RAM, BlockManagerId(1, 192.168.1.106, 41617, None)
22/10/23 15:31:01 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (192.168.1.106, executor 1, partition 4, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
...
22/10/23 15:31:09 INFO TaskSetManager: Finished task 93.0 in stage 0.0 (TID 93) in 344 ms on 192.168.1.142 (executor 0) (94/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 94.0 in stage 0.0 (TID 94) in 312 ms on 192.168.1.142 (executor 0) (95/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 95.0 in stage 0.0 (TID 95) in 314 ms on 192.168.1.142 (executor 0) (96/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 96.0 in stage 0.0 (TID 96) in 263 ms on 192.168.1.106 (executor 1) (97/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 98.0 in stage 0.0 (TID 98) in 260 ms on 192.168.1.142 (executor 0) (98/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 99.0 in stage 0.0 (TID 99) in 256 ms on 192.168.1.142 (executor 0) (99/100)
22/10/23 15:31:10 INFO TaskSetManager: Finished task 97.0 in stage 0.0 (TID 97) in 384 ms on 192.168.1.106 (executor 1) (100/100)
22/10/23 15:31:10 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
22/10/23 15:31:10 INFO DAGScheduler: ResultStage 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) finished in 13.849 s
22/10/23 15:31:10 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/10/23 15:31:10 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
22/10/23 15:31:10 INFO DAGScheduler: Job 0 finished: reduce at /usr/local/spark/examples/src/main/python/pi.py:42, took 14.106103 s
Pi is roughly 3.142880
22/10/23 15:31:10 INFO SparkUI: Stopped Spark web UI at http://ashishlaptop:4040
22/10/23 15:31:10 INFO StandaloneSchedulerBackend: Shutting down all executors
22/10/23 15:31:10 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
22/10/23 15:31:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/10/23 15:31:10 INFO MemoryStore: MemoryStore cleared
22/10/23 15:31:10 INFO BlockManager: BlockManager stopped
22/10/23 15:31:10 INFO BlockManagerMaster: BlockManagerMaster stopped
22/10/23 15:31:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/10/23 15:31:10 INFO SparkContext: Successfully stopped SparkContext
22/10/23 15:31:11 INFO ShutdownHookManager: Shutdown hook called
22/10/23 15:31:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-6be4655c-e59a-403a-92e8-582583fa3f7d/pyspark-c4d7588d-a23a-4393-b29b-6689d20e7684
22/10/23 15:31:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-f4436e38-d155-4763-bb57-461eb3793d13
22/10/23 15:31:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-6be4655c-e59a-403a-92e8-582583fa3f7d
(base) ashish@ashishlaptop:/usr/local/spark$

Creating Two Node Spark Cluster With Two Worker Nodes and One Master Node Using Spark's Standalone Resource Manager on Ubuntu machines

1.1 Setting up Java
(base) ashish@ashishdesktop:~$ java
Command 'java' not found, but can be installed with:
sudo apt install default-jre              # version 2:1.11-72build2, or
sudo apt install openjdk-11-jre-headless  # version 11.0.16+8-0ubuntu1~22.04
sudo apt install openjdk-17-jre-headless  # version 17.0.3+7-0ubuntu0.22.04.1
sudo apt install openjdk-18-jre-headless  # version 18~36ea-1
sudo apt install openjdk-8-jre-headless   # version 8u312-b07-0ubuntu1

(base) ashish@ashishdesktop:~$ sudo apt install openjdk-8-jre-headless

1.2 Setting environment variable JAVA_HOME

(base) ashish@ashishdesktop:~$ readlink -f /usr/bin/java
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

(base) ashish@ashishdesktop:~$ sudo nano ~/.bashrc
(base) ashish@ashishdesktop:~$ tail ~/.bashrc

        . "/home/ashish/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/ashish/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 

(base) ashish@ashishdesktop:~$ 

(base) ashish@ashishdesktop:~$ source ~/.bashrc
(base) ashish@ashishdesktop:~$ echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64

2. Setting up Scala

(base) ashish@ashishdesktop:~$ sudo apt-get install scala

3. Checking The Previous Standalone PySpark Installation on Laptop

(base) ashish@ashishlaptop:~$ pyspark
pyspark: command not found

(base) ashish@ashishlaptop:~$ conda activate mh
(mh) ashish@ashishlaptop:~$ pyspark
Python 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 22:45:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.9.0 (default, Nov 26 2020 07:57:39)
Spark context Web UI available at http://ashishlaptop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666458948998).
SparkSession available as 'spark'.
>>> 


4. Download Spark Archive

One we are using: dlcdn.apache.org: spark-3.3.0-bin-hadoop3.tgz

Link to broader download site: spark.apache.org

In terminal:

$ wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

5. Setting up Spark From The Archive

5.1. Extracting the software from archive

(base) ashish@ashishlaptop:~/Desktop$ tar xvf spark-3.3.0-bin-hadoop3.tgz

5.2. Moving the software to the local installation directory

(base) ashish@ashishlaptop:~/Desktop$ sudo mv spark-3.3.0-bin-hadoop3 /usr/local/spark
[sudo] password for ashish: 

(base) ashish@ashishlaptop:~/Desktop$ cd /usr/local

(base) ashish@ashishlaptop:/usr/local$ ls -l
total 36
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 bin
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 etc
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 games
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 include
drwxr-xr-x  4 root   root   4096 Oct  8 11:10 lib
lrwxrwxrwx  1 root   root      9 Aug 26 09:40 man -> share/man
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 sbin
drwxr-xr-x  7 root   root   4096 Aug  9 17:21 share
drwxr-xr-x 13 ashish ashish 4096 Jun 10 02:07 spark
drwxr-xr-x  2 root   root   4096 Aug  9 17:18 src

(base) ashish@ashishlaptop:/usr/local$ cd spark

(base) ashish@ashishlaptop:/usr/local/spark$ ls
bin  conf  data  examples  jars  kubernetes  LICENSE  licenses  NOTICE  python  R  README.md  RELEASE  sbin  yarn

(base) ashish@ashishlaptop:/usr/local/spark$ 

5.3. Including Spark binaries in the environment.

$ sudo gedit ~/.bashrc

(base) ashish@ashishlaptop:/usr/local/spark$ tail ~/.bashrc
        export PATH="/home/ashish/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

export PATH="/home/ashish/.local/bin:$PATH"
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export PATH="$PATH:/usr/local/spark/bin"

(base) ashish@ashishlaptop:/usr/local/spark$ source ~/.bashrc

6. Checking installation on laptop (and then on desktop after proper setup on it)

6.1. spark-shell (Scala based shell) on Laptop

(base) ashish@ashishlaptop:/usr/local/spark$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 23:44:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://ashishlaptop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666462455694).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
         
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_342)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sys.exit

6.2. pyspark (Python based shell) on Laptop

(base) ashish@ashishlaptop:/usr/local/spark$ pyspark
Python 3.9.12 (main, Apr  5 2022, 06:56:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 23:46:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.9.12 (main, Apr  5 2022 06:56:58)
Spark context Web UI available at http://ashishlaptop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666462561785).
SparkSession available as 'spark'.
>>> exit()

6.3. spark-shell (Scala based shell) on Desktop

(base) ashish@ashishdesktop:~$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 23:54:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://ashishdesktop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666463078583).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
         
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_342)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sys.exit

6.4. pyspark (Python based shell) on Desktop

(base) ashish@ashishdesktop:~$ pyspark
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/22 23:55:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://ashishdesktop:4040
Spark context available as 'sc' (master = local[*], app id = local-1666463110370).
SparkSession available as 'spark'.
>>> exit()
(base) ashish@ashishdesktop:~$ 

7. Configure Worker Nodes in Spark's Configuration

(base) ashish@ashishlaptop:/usr/local/spark/conf$ ls -l
total 36
-rw-r--r-- 1 ashish ashish 1105 Jun 10 02:07 fairscheduler.xml.template
-rw-r--r-- 1 ashish ashish 3350 Jun 10 02:07 log4j2.properties.template
-rw-r--r-- 1 ashish ashish 9141 Jun 10 02:07 metrics.properties.template
-rw-r--r-- 1 ashish ashish 1292 Jun 10 02:07 spark-defaults.conf.template
-rwxr-xr-x 1 ashish ashish 4506 Jun 10 02:07 spark-env.sh.template
-rw-r--r-- 1 ashish ashish  865 Jun 10 02:07 workers.template
(base) ashish@ashishlaptop:/usr/local/spark/conf$ 

(base) ashish@ashishlaptop:/usr/local/spark/conf$ cat workers.template 

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
localhost 

(base) ashish@ashishlaptop:/usr/local/spark/conf$ 

(base) ashish@ashishlaptop:/usr/local/spark/conf$ cp workers.template workers
(base) ashish@ashishlaptop:/usr/local/spark/conf$ ls -l
total 40
-rw-r--r-- 1 ashish ashish 1105 Jun 10 02:07 fairscheduler.xml.template
-rw-r--r-- 1 ashish ashish 3350 Jun 10 02:07 log4j2.properties.template
-rw-r--r-- 1 ashish ashish 9141 Jun 10 02:07 metrics.properties.template
-rw-r--r-- 1 ashish ashish 1292 Jun 10 02:07 spark-defaults.conf.template
-rwxr-xr-x 1 ashish ashish 4506 Jun 10 02:07 spark-env.sh.template
-rw-r--r-- 1 ashish ashish  865 Jun 10 02:07 workers
-rw-r--r-- 1 ashish ashish  865 Oct 23 12:58 workers.template
(base) ashish@ashishlaptop:/usr/local/spark/conf$ nano workers
(base) ashish@ashishlaptop:/usr/local/spark/conf$ cat workers
ashishdesktop
ashishlaptop
(base) ashish@ashishlaptop:/usr/local/spark/conf$ 


(base) ashish@ashishlaptop:/usr/local/spark/conf$ cp spark-env.sh.template spark-env.sh

8. Starting Driver/Master and Worker Nodes Using Script 'start-all.sh'

(base) ashish@ashishlaptop:/usr/local/spark/sbin$ source start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out
ashishlaptop: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishlaptop.out
ashishdesktop: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishdesktop.out

(base) ashish@ashishlaptop:/usr/local/spark/logs$ ls -l
total 12
-rw-rw-r-- 1 ashish ashish 2005 Oct 23 13:54 spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out
-rw-rw-r-- 1 ashish ashish 2340 Oct 23 13:47 spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out.1
-rw-rw-r-- 1 ashish ashish 2485 Oct 23 13:53 spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishlaptop.out
(base) ashish@ashishlaptop:/usr/local/spark/logs$ 

Master logs

(base) ashish@ashishlaptop:/usr/local/spark/logs$ cat spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host ashishlaptop --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
22/10/23 13:53:51 INFO Master: Started daemon with process name: 9224@ashishlaptop
22/10/23 13:53:51 INFO SignalUtils: Registering signal handler for TERM
22/10/23 13:53:51 INFO SignalUtils: Registering signal handler for HUP
22/10/23 13:53:51 INFO SignalUtils: Registering signal handler for INT
22/10/23 13:53:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/23 13:53:51 INFO SecurityManager: Changing view acls to: ashish
22/10/23 13:53:51 INFO SecurityManager: Changing modify acls to: ashish
22/10/23 13:53:51 INFO SecurityManager: Changing view acls groups to: 
22/10/23 13:53:51 INFO SecurityManager: Changing modify acls groups to: 
22/10/23 13:53:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ashish); groups with view permissions: Set(); users  with modify permissions: Set(ashish); groups with modify permissions: Set()
22/10/23 13:53:52 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
22/10/23 13:53:52 INFO Master: Starting Spark master at spark://ashishlaptop:7077
22/10/23 13:53:52 INFO Master: Running Spark version 3.3.0
22/10/23 13:53:52 INFO Utils: Successfully started service 'MasterUI' on port 8080.
22/10/23 13:53:52 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://ashishlaptop:8080
22/10/23 13:53:53 INFO Master: I have been elected leader! New state: ALIVE
22/10/23 13:53:56 INFO Master: Registering worker 192.168.1.142:43143 with 4 cores, 10.6 GiB RAM
22/10/23 13:54:00 INFO Master: Registering worker 192.168.1.106:44471 with 2 cores, 1024.0 MiB RAM
(base) ashish@ashishlaptop:/usr/local/spark/logs$ 

Worker Logs From 'ashishlaptop'

(base) ashish@ashishlaptop:/usr/local/spark/logs$ cat spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishlaptop.out 
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ashishlaptop:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
22/10/23 13:53:54 INFO Worker: Started daemon with process name: 9322@ashishlaptop
22/10/23 13:53:54 INFO SignalUtils: Registering signal handler for TERM
22/10/23 13:53:54 INFO SignalUtils: Registering signal handler for HUP
22/10/23 13:53:54 INFO SignalUtils: Registering signal handler for INT
22/10/23 13:53:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/23 13:53:55 INFO SecurityManager: Changing view acls to: ashish
22/10/23 13:53:55 INFO SecurityManager: Changing modify acls to: ashish
22/10/23 13:53:55 INFO SecurityManager: Changing view acls groups to: 
22/10/23 13:53:55 INFO SecurityManager: Changing modify acls groups to: 
22/10/23 13:53:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ashish); groups with view permissions: Set(); users  with modify permissions: Set(ashish); groups with modify permissions: Set()
22/10/23 13:53:55 INFO Utils: Successfully started service 'sparkWorker' on port 43143.
22/10/23 13:53:55 INFO Worker: Worker decommissioning not enabled.
22/10/23 13:53:56 INFO Worker: Starting Spark worker 192.168.1.142:43143 with 4 cores, 10.6 GiB RAM
22/10/23 13:53:56 INFO Worker: Running Spark version 3.3.0
22/10/23 13:53:56 INFO Worker: Spark home: /usr/local/spark
22/10/23 13:53:56 INFO ResourceUtils: ==============================================================
22/10/23 13:53:56 INFO ResourceUtils: No custom resources configured for spark.worker.
22/10/23 13:53:56 INFO ResourceUtils: ==============================================================
22/10/23 13:53:56 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
22/10/23 13:53:56 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://ashishlaptop:8081
22/10/23 13:53:56 INFO Worker: Connecting to master ashishlaptop:7077...
22/10/23 13:53:56 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 46 ms (0 ms spent in bootstraps)
22/10/23 13:53:56 INFO Worker: Successfully registered with master spark://ashishlaptop:7077
(base) ashish@ashishlaptop:/usr/local/spark/logs$ 

Worker Logs From 'ashishdesktop'

(base) ashish@ashishdesktop:/usr/local/spark/logs$ cat spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishdesktop.out 
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ashishlaptop:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
22/10/23 13:53:56 INFO Worker: Started daemon with process name: 19475@ashishdesktop
22/10/23 13:53:56 INFO SignalUtils: Registering signal handler for TERM
22/10/23 13:53:56 INFO SignalUtils: Registering signal handler for HUP
22/10/23 13:53:56 INFO SignalUtils: Registering signal handler for INT
22/10/23 13:53:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/23 13:53:57 INFO SecurityManager: Changing view acls to: ashish
22/10/23 13:53:57 INFO SecurityManager: Changing modify acls to: ashish
22/10/23 13:53:57 INFO SecurityManager: Changing view acls groups to: 
22/10/23 13:53:57 INFO SecurityManager: Changing modify acls groups to: 
22/10/23 13:53:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ashish); groups with view permissions: Set(); users  with modify permissions: Set(ashish); groups with modify permissions: Set()
22/10/23 13:53:58 INFO Utils: Successfully started service 'sparkWorker' on port 44471.
22/10/23 13:53:58 INFO Worker: Worker decommissioning not enabled.
22/10/23 13:53:59 INFO Worker: Starting Spark worker 192.168.1.106:44471 with 2 cores, 1024.0 MiB RAM
22/10/23 13:53:59 INFO Worker: Running Spark version 3.3.0
22/10/23 13:53:59 INFO Worker: Spark home: /usr/local/spark
22/10/23 13:53:59 INFO ResourceUtils: ==============================================================
22/10/23 13:53:59 INFO ResourceUtils: No custom resources configured for spark.worker.
22/10/23 13:53:59 INFO ResourceUtils: ==============================================================
22/10/23 13:53:59 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
22/10/23 13:54:00 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://ashishdesktop:8081
22/10/23 13:54:00 INFO Worker: Connecting to master ashishlaptop:7077...
22/10/23 13:54:00 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 157 ms (0 ms spent in bootstraps)
22/10/23 13:54:00 INFO Worker: Successfully registered with master spark://ashishlaptop:7077

9. Issue Resolution

Prompt for password when launching worker nodes using script 'start-all.sh'


(base) ashish@ashishlaptop:/usr/local/spark/sbin$ source start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.master.Master-1-ashishlaptop.out
ashishlaptop: Warning: Permanently added 'ashishlaptop' (ED25519) to the list of known hosts.
ashish@ashishlaptop's password: ashishdesktop: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.worker.Worker-1-ashishdesktop.out

ashishlaptop: Connection closed by 192.168.1.142 port 22
(base) ashish@ashishlaptop:/usr/local/spark/sbin$

Debugging

DOING SSH FROM 'ASHISHLAPTOP' TO 'ASHISHLAPTOP' (NOT A TYPE) STILL RESULTS IN PASSWORD PROMPT.

(base) ashish@ashishlaptop:/usr/local/spark/conf$ ssh ashishlaptop
ashish@ashishlaptop's password: 

DO THIS TO RESOLVE THE ISSUE

(base) ashish@ashishlaptop:/usr/local/spark/conf$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashishlaptop
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
ashish@ashishlaptop's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'ashish@ashishlaptop'"
and check to make sure that only the key(s) you wanted were added.

(base) ashish@ashishlaptop:/usr/local/spark/conf$ ssh ashish@ashishlaptop
Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

9 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

Last login: Sat Oct 22 22:00:44 2022 from 192.168.1.106
(base) ashish@ashishlaptop:~$ 

For more details on SSH setup: [ Link ]

Thursday, October 20, 2022

Python Interview Questions (2022 Oct, Week 3)

Ques 1: Count the range of each value in Python
I have dataset of student's scores for each subject.

StuID  Subject Scores                
1      Math    90
1      Geo     80
2      Math    70
2      Geo     60
3      Math    50
3      Geo     90
Now I want to count the range of scores for each subject like 0 < x <=20, 20 < x <=30 and get a dataframe like this:

Subject  0-20  20-40 40-60 60-80 80-100                 
Math       0     0     1     1     1
Geo        0     0     0     1     2    
How can I do it?

Ans 1:

import pandas as pd

df = pd.DataFrame({
    "StuID": [1,1,2,2,3,3],
    "Subject": ['Math', 'Geo', 'Math', 'Geo', 'Math', 'Geo'],
    "Scores": [90, 80, 70, 60, 50, 90]
})

bins = list(range(0, 100+1, 20))
# [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = [f'{a}-{b}' for a,b in zip(bins, bins[1:])]
# ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100']

out = (pd.crosstab(df['Subject'], pd.cut(df['Scores'], bins=bins,
                                            labels=labels, ordered=True, right=False))
            .reindex(labels, axis=1, fill_value=0)
        )

print(out)

Ques 2: Split one column into 3 columns in python or PySpark

I have:
Customerkeycode:
B01:B14:110083

I want:
PlanningCustomerSuperGroupCode, DPGCode, APGCode
BO1,B14,110083

Ans 2:

In pyspark, first split the string into an array, and then use the getItem method to split it into multiple columns.

import pyspark.sql.functions as F
...
cols = ['PlanningCustomerSuperGroupCode', 'DPGCode', 'APGCode']
arr_cols = [F.split('Customerkeycode', ':').getItem(i).alias(cols[i]) for i in range(3)]
df = df.select(*arr_cols)
df.show(truncate=False)

Using Plain Pandas:


import pandas as pd

df = pd.DataFrame(
    {
        "Customerkeycode": [
            "B01:B14:110083",
            "B02:B15:110084"
        ]
    }
)

df['PlanningCustomerSuperGroupCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[0])
df['DPGCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[1])
df['APGCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[2])

df_rep = df.drop("Customerkeycode", axis = 1)

print(df_rep)


   PlanningCustomerSuperGroupCode DPGCode APGCode
0                            B01     B14  110083
1                            B02     B15  110084

Ref

Ques 3: Create a new list of dict, from a dict with a key that has a list value

I have a dict with one of the keys have a value of list

example = {"a":1,"b":[1,2]}
I am trying to unpack example["b"] and create a list of the same dict with separate example["b"] value.

output = [{"a":1,"b":1},{"a":1,"b":2}]
I have tried to use a for loop to understand the unpacking and reconstruction of the list of dict but I am seeing a strange behavior:


iter = example.get("b")

new_list = []

for p in iter:
    print(f"p is {p}")
    tmp_dict = example
    tmp_dict["b"] = p
    print(tmp_dict)
    new_list.append(tmp_dict)

print(new_list)

Output:


p is 1
{'a': 1, 'b': 1}

p is 2
{'a': 1, 'b': 2}

[{'a': 1, 'b': 2}, {'a': 1, 'b': 2}]

Why is the first dict in the list gets assigned with example["b"] = 2 although the first print() shows that p is 1?

Answer 3.1:

Here's a general approach that works for all cases without hardcoding any keys

here's a general approach that works for all cases without hardcoding any keys. Let's first create a temporary dictionary where all values are lists.

temp = {k: v if isinstance(v, list) else [v] for k, v in example.items()}
This allows us to then obtain the list of all the values in our temp dict as a list of lists.

We want the product of all the values of this temp dictionary. To do this, we can use the itertools.product function, and unpack our list of lists to its arguments.

In each iteration, the resulting tuple will have one value per key of the temp dictionary, so all we need to do is zip that with our tuple, and create a dict out of those key-value pairs. That gives us our list element!


import itertools

keys = list(temp.keys())
vals = list(temp.values())

result = []

for vals_product in itertools.product(*vals):
    d = dict(zip(keys, vals_product))
    result.append(d) 
Which gives the required result:

[{'a': 1, 'b': 1}, {'a': 1, 'b': 2}]

This even works for an example with more keys:

example = {'a': 1, 'b': [1, 2], 'c': [1, 2, 3]}
which gives:


[{'a': 1, 'b': 1, 'c': 1},
 {'a': 1, 'b': 1, 'c': 2},
 {'a': 1, 'b': 1, 'c': 3},
 {'a': 1, 'b': 2, 'c': 1},
 {'a': 1, 'b': 2, 'c': 2},
 {'a': 1, 'b': 2, 'c': 3}]

Answer 3.2:
Just one minor correction: Usage of dict()


example = {"a":1,"b":[1,2]}

iter = example.get("b")

new_list = []

for p in iter:
    print(f"p is {p}")
    tmp_dict = dict(example)
    tmp_dict["b"] = p
    print(tmp_dict)
    new_list.append(tmp_dict)

print(new_list)

Output is as given below: [{'a': 1, 'b': 1}, {'a': 1, 'b': 2}]

Ref

Question 4: Classification of sentences

I have a list of sentences. Examples:


${INS1}, Watch our latest webinar about flu vaccine
Do you think patients would like to go up to 250 days without an attack?
Watch our latest webinar about flu vaccine
??? See if more of your patients are ready for vaccine
Important news for your invaccinated patients
Important news for your inv?ccinated patients
...

By 'good' I mean sentences with no strange characters and sequences of characters such as '${INS1}', '???', or '?' inside the word etc. Otherwise sentence is considered as 'bad'. I need to find 'good' patterns to be able to identify 'bad' sentences in the future and exclude them, as the list of sentences will become larger in the future and new 'bad' sentences might appear.

Is there any way to identify 'good' sentences?

Answer 4:

This solution based on character examples specifically given in the question. If there are more characters that should be used to identify good and bad sentences then they should also be added in the RegEx mentioned below in code.


import re 

sents = [
    "${INS1}, Watch our latest webinar about flu vaccine",
    "Do you think patients would like to go up to 250 days without an attack?",
    "Watch our latest webinar about flu vaccine",
    "??? See if more of your patients are ready for vaccine",
    "Important news for your invaccinated patients",
    "Important news for your inv?ccinated patients"
]

good_sents = []
bad_sents = []

for i in sents:
    x = re.findall("[?}{$]", i)
    if(len(x) == 0):
        good_sents.append(i)

    else:
        bad_sents.append(i)

print(good_sents)

Question 5: How to make the sequence result in one line? 

Need to print 'n' Fibonacci numbers.

Here is the expected output as shown below:


Enter n: 10
Fibonacci numbers = 1 1 2 3 5 8 13 21 34 55

Here is my current code:


n = input("Enter n: ")

def fib(n):
    cur = 1
    old = 1
    i = 1
    while (i < n):
        cur, old, i = cur+old, cur, i+1
    return cur

Answer 5:
To do the least calculation, it is more efficient to have a fib function generate a list of the first n Fibonacci numbers.


def fib(n):
    fibs = [0, 1]
    for _ in range(n-2):
    fibs.append(sum(fibs[-2:]))
    return fibs
We know the first two Fibonacci numbers are 0 and 1. For the remaining count we can add the sum of the last two numbers to the list.

>>> fib(10)
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

You can now:

print('Fibonacci numbers = ', end='')
print(*fib(10), sep=' ', end='\n')

Question 6: Maximum occurrences in a list / DataFrame column

I have a dataframe like the one below.


import pandas as pd

data = {'Date': ['2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05'],
        'Runner': ['Runner A', 'Runner A', 'Runner A', 'Runner A', 'Runner A','Runner B', 'Runner B', 'Runner B', 'Runner B', 'Runner B','Runner C', 'Runner C', 'Runner C', 'Runner C', 'Runner C'],
        'Training Time': ['less than 1 hour', 'less than 1 hour', 'less than 1 hour', 'less than 1 hour', '1 hour to 2 hour','less than 1 hour', '1 hour to 2 hour', 'less than 1 hour', '1 hour to 2 hour', '2 hour to 3 hour', '1 hour to 2 hour ', '2 hour to 3 hour' ,'1 hour to 2 hour ', '2 hour to 3 hour', '2 hour to 3 hour']
        }

df = pd.DataFrame(data)

I have counted the occurrence for each runner using the below code

s = df.groupby(['Runner','Training Time']).size()



The problem is on Runner B. It should show "1 hour to 2 hour" and "less than 1 hour". But it only shows "1 hour to 2 hour".

How can I get this expected result:



Answer 6.1:


import pandas as pd

data = {'Date': ['2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05'],
        'Runner': ['Runner A', 'Runner A', 'Runner A', 'Runner A', 'Runner A','Runner B', 'Runner B', 'Runner B', 'Runner B', 'Runner B','Runner C', 'Runner C', 'Runner C', 'Runner C', 'Runner C'],
        'Training Time': ['less than 1 hour', 'less than 1 hour', 'less than 1 hour', 'less than 1 hour', '1 hour to 2 hour','less than 1 hour', '1 hour to 2 hour', 'less than 1 hour', '1 hour to 2 hour', '2 hour to 3 hour', '1 hour to 2 hour ', '2 hour to 3 hour' ,'1 hour to 2 hour ', '2 hour to 3 hour', '2 hour to 3 hour']
        }

df = pd.DataFrame(data)
s = df.groupby(['Runner', 'Training Time'], as_index=False).size()
s.columns = ['Runner', 'Training Time', 'Size']

r = s.groupby(['Runner'], as_index=False)['Size'].max()

df_list = []
for index, row in r.iterrows():
    temp_df = s[(s['Runner'] == row['Runner']) & (s['Size'] == row['Size'])]
    df_list.append(temp_df)

df_report = pd.concat(df_list)
print(df_report)
    
df_report.to_csv('report.csv', index = False)

Answer 6.2:

def agg_most_common(vals):
    print("vals")
    matches = []
    for i in collections.Counter(vals).most_common():
        if not matches or matches[0][1] == i[1]:
            matches.append(i)
        else:
            break
    return [x[0] for x in matches]

print(df.groupby('Runner')['Training Time'].agg(agg_most_common))

Wednesday, October 19, 2022

HDFC may be excluded from Nifty 50 by Dec-end; these stocks front-runners to replace mortgage lender (2022-Oct-19)

The National Stock Exchange may exclude Housing Development Finance Corporation (HDFC) from the NSE Nifty 50 index in the next few weeks as the HDFC and HDFC Bank merger looks likely to be completed a few months earlier than expected, according to analysts.

Adhesive maker Pidilite Industries is among the front-runners to replace HDFC in the 50-share index. “The (HDFC) stock could be then excluded from all of the Nifty Indices earliest by end of December 2022 or max by middle of January 2023,” said Abhilash Pagaria, head of alternative and quantitative research at Nuvama Wealth Management. The housing finance company with a 5.5% weight in the Nifty could see an outflow of around $1.3-1.5 billion from the passive funds once it’s excluded from the benchmark.

This will be an ad hoc inclusion on account of the mortgage lender’s expulsion from the index on account of its merger with HDFC Bank, which has entered final stages. The shareholders’ meeting to approve the proposed merger of HDFC with HDFC Bank has been scheduled for 25 November. Nifty Indices Methodology book states, “In case of a merger, spin-off, capital restructuring or voluntary delisting, equity shareholders’ approval is considered as a trigger to initiate the replacement of such stock from the index through additional index reconstitution.”

As the weights of the indices are calculated based on free-float market capitalisation, the exclusion of HDFC will lead to a redistribution of most weightages in the top-10 Nifty heavyweights. According to Nuvama, HDFC’s stock could see outflows worth more than $1.5 billion because of exiting the benchmark indices, which could also have a bearing on the stock performance of HDFC Bank.

Aside from (1) Pidilite:
(2) Ambuja Cements (ACEM),
(3) Tata Power (TPWR) and
(4) SRF (SRF) are among the front-runners to replace HDFC in Nifty 50. Pagaria said that HDFC’s large 5.5% weight in the Nifty 50 index means that the stock could see heavy selling from passive funds that closely track the benchmark index.

Currently, over 22 ETFs track the Nifty 50 index as the benchmark with cumulative assets under management (AUM) of $30 billion (Rs 2.4 lakh crore).
It is worth noting that HDFC’s shareholders are set to receive 42 shares of HDFC Bank for every 25 shares of HDFC held by them. Given the swap ratio, a sharp decline in shares of HDFC in the run-up to the exit from the Nifty indices is likely to trigger a sell-off in HDFC Bank. In the Nuvama note, Pagaria dismissed concerns among some market participants over HDFC Bank being excluded from the NSE indices after the merger. “In fact, once the merged company starts trading with a higher free float market capitalization, HDFC Bank will see an increase in weightage,” he said.

[ Ref ]

Monday, October 17, 2022

ARIMA forecast for timeseries is one step ahead. Why? (Solved Interview Problem)

Question from interviewer 

I'm trying to forecast timeseries with ARIMA (Autoregressive Integrated Moving Average (ARIMA) model). As you can see from the plot, the forecast (red) is one step ahead of the expected values (input value colored in blue). How can I synchronize the two?



The code I used:


history = [x for x in train]
predictions = list()

for t in range(len(test)):
    model = ARIMA(history,order=(2, 2, 1))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast(alpha=0.05)
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))

rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
# plot
plt.plot(test, color='blue')
plt.plot(predictions, color='red')
plt.show() 

Answer
It appears you are using Python's statsmodel ARIMA .

There are two options:

1) Change order=argument. The p, d, or q hyperparameters might be causing that result.
2) Manually lag the predictions: predictions = predictions[1:]

What you are seeing in the plot can be interpreted differently while considering the time dimension on x-axis with future being on the right and past on the left.

Your predicted curve (the one colored in red) is FOLLOWING your input (blue) curve. At time t, if blue goes down, then at time t+1, red will go down. You can see this behavior happening twice very clearly in the image once around t = 10 and then second time around t = 20. Also, the magnitude of change in the predicted curve is not as big/low as the input curve.

Check in Python if a key is present in a JSON object (Solved Interview Problem)


jsonstreng = {
    "kjoretoydataListe":
    [
        {
            "kjoretoyId": {"kjennemerke":"EK 17058","understellsnummer":"5YJXCCE29GF009633"},
            "forstegangsregistrering": {"registrertForstegangNorgeDato":"2016-09-07"},
            "kjennemerke": [
                {"fomTidspunkt":"2016-09-07T00:00:00+02:00","kjennemerke":"EK 17058","kjennemerkekategori":"KJORETOY",
                "kjennemerketype":{"kodeBeskrivelse":"Sorte tegn p- hvit bunn","kodeNavn":"Ordin-re kjennemerker","kodeVerdi":"ORDINART","tidligereKodeVerdi":[]}},{"fomTidspunkt":"2022-08-19T17:04:04.334+02:00","kjennemerke":"GTD","kjennemerkekategori":"PERSONLIG"}
            ]
        }
    ]
}

def checkIfKeyExistsInDict(in_dict, i_key):
    if(isinstance(in_dict, dict)):
        if(i_key in in_dict):
            print(i_key + " found in: " + str(in_dict))
            print()
        
        for j_key in in_dict:
            checkIfKeyExistsInDict(in_dict[j_key], i_key)
    elif(isinstance(in_dict, list)):
        for j_key in range(len(in_dict)):
            checkIfKeyExistsInDict(in_dict[j_key], i_key)

        
checkIfKeyExistsInDict(jsonstreng, "kjennemerke")


Logs

(base) $ python "Check if key is present in a JSON object.py" 

kjennemerke found in: {'kjoretoyId': {'kjennemerke': 'EK 17058', 'understellsnummer': '5YJXCCE29GF009633'}, 'forstegangsregistrering': {'registrertForstegangNorgeDato': '2016-09-07'}, 'kjennemerke': [{'fomTidspunkt': '2016-09-07T00:00:00+02:00', 'kjennemerke': 'EK 17058', 'kjennemerkekategori': 'KJORETOY', 'kjennemerketype': {'kodeBeskrivelse': 'Sorte tegn p- hvit bunn', 'kodeNavn': 'Ordin-re kjennemerker', 'kodeVerdi': 'ORDINART', 'tidligereKodeVerdi': []}}, {'fomTidspunkt': '2022-08-19T17:04:04.334+02:00', 'kjennemerke': 'GTD', 'kjennemerkekategori': 'PERSONLIG'}]}

kjennemerke found in: {'kjennemerke': 'EK 17058', 'understellsnummer': '5YJXCCE29GF009633'}

kjennemerke found in: {'fomTidspunkt': '2016-09-07T00:00:00+02:00', 'kjennemerke': 'EK 17058', 'kjennemerkekategori': 'KJORETOY', 'kjennemerketype': {'kodeBeskrivelse': 'Sorte tegn p- hvit bunn', 'kodeNavn': 'Ordin-re kjennemerker', 'kodeVerdi': 'ORDINART', 'tidligereKodeVerdi': []}}

kjennemerke found in: {'fomTidspunkt': '2022-08-19T17:04:04.334+02:00', 'kjennemerke': 'GTD', 'kjennemerkekategori': 'PERSONLIG'}

Hinglish to English Machine Translation Using Transformers

Download Code and Data

import numpy as np
from keras_transformer import get_model, decode


with open('../Dataset/English_Hindi_Hinglish.txt', mode = 'r') as f:
    data = f.readlines()

data = data[0:195] # 195 Because we have that many labeled data points for Hinglish to English translation.

source_tokens = [i.split(',')[1].strip().split(' ') for i in data]
target_tokens = [i.split('	')[0].strip().split(' ') for i in data]



# Generate dictionaries
def build_token_dict(token_list):
    token_dict = {
        '<PAD>': 0,
        '<START>': 1,
        '<END>': 2,
    }
    for tokens in token_list:
        for token in tokens:
            if token not in token_dict:
                token_dict[token] = len(token_dict)
    return token_dict

source_token_dict = build_token_dict(source_tokens)
target_token_dict = build_token_dict(target_tokens)
target_token_dict_inv = {v: k for k, v in target_token_dict.items()}

# Add special tokens
encode_tokens = [['<START>'] + tokens + ['<END>'] for tokens in source_tokens]
decode_tokens = [['<START>'] + tokens + ['<END>'] for tokens in target_tokens]
output_tokens = [tokens + ['<END>', '<PAD>'] for tokens in target_tokens]

# Padding
source_max_len = max(map(len, encode_tokens))
target_max_len = max(map(len, decode_tokens))

encode_tokens = [tokens + ['<PAD>'] * (source_max_len - len(tokens)) for tokens in encode_tokens]
decode_tokens = [tokens + ['<PAD>'] * (target_max_len - len(tokens)) for tokens in decode_tokens]
output_tokens = [tokens + ['<PAD>'] * (target_max_len - len(tokens)) for tokens in output_tokens]

encode_input = [list(map(lambda x: source_token_dict[x], tokens)) for tokens in encode_tokens]
decode_input = [list(map(lambda x: target_token_dict[x], tokens)) for tokens in decode_tokens]
decode_output = [list(map(lambda x: [target_token_dict[x]], tokens)) for tokens in output_tokens]

# Build & fit model
model = get_model(
    token_num=max(len(source_token_dict), len(target_token_dict)),
    embed_dim=32,
    encoder_num=2,
    decoder_num=2,
    head_num=4,
    hidden_dim=128,
    dropout_rate=0.05,
    use_same_embed=False,  # Use different embeddings for different languages
)
model.compile('adam', 'sparse_categorical_crossentropy')
model.summary()


Number of Parameters to Train When There Are a Certain Number of Lines in Input

Number of Lines is 55
Model: "model_1"
... 
Total params: 65,827
Trainable params: 65,827
Non-trainable params: 0

Number of Lines is 115

Total params: 72,002
Trainable params: 72,002
Non-trainable params: 0


Number of Lines is 165

Total params: 77,787
Trainable params: 77,787
Non-trainable params: 0


Number of Lines is 195

Total params: 80,777
Trainable params: 80,777
Non-trainable params: 0


%%time
model.fit(
    x=[np.array(encode_input * 1024), np.array(decode_input * 1024)],
    y=np.array(decode_output * 1024),
    epochs=10,
    batch_size=32,
)


Training Logs When There is a Certain Number of Lines in Input
# Number of Lines in Input is 55

Number of Epochs: 10
CPU times: user 11min 31s, sys: 56 s, total: 12min 27s
Wall time: 5min 48s

<keras.callbacks.History at 0x7f8f347f69d0>


# Number of Lines in Input is 115


Number of Epochs: 10
CPU times: user 26min 55s, sys: 2min 7s, total: 29min 2s
Wall time: 13min 33s


# Number of Lines in Input is 150

Number of Epochs: 10
CPU times: user 41min 26s, sys: 3min 12s, total: 44min 39s
Wall time: 21min 1s


# Number of Lines in Input is 195
Number of Epochs: 10

Epoch 1/10
6240/6240 [==============================] - 165s 25ms/step - loss: 0.1641
Epoch 2/10
6240/6240 [==============================] - 163s 26ms/step - loss: 0.0049
Epoch 3/10
6240/6240 [==============================] - 151s 24ms/step - loss: 0.0043
Epoch 4/10
6240/6240 [==============================] - 150s 24ms/step - loss: 0.0038
Epoch 5/10
6240/6240 [==============================] - 150s 24ms/step - loss: 0.0043
Epoch 6/10
6240/6240 [==============================] - 153s 24ms/step - loss: 0.0036
Epoch 7/10
6240/6240 [==============================] - 153s 24ms/step - loss: 0.0036
Epoch 8/10
6240/6240 [==============================] - 151s 24ms/step - loss: 0.0036
Epoch 9/10
6240/6240 [==============================] - 150s 24ms/step - loss: 0.0038
Epoch 10/10
6240/6240 [==============================] - 152s 24ms/step - loss: 0.0037
CPU times: user 51min 23s, sys: 3min 52s, total: 55min 16s
Wall time: 25min 39s
    
# Validation


decoded = decode(
    model,
    encode_input,
    start_token=target_token_dict['<START>'],
    end_token=target_token_dict['<END>'],
    pad_token=target_token_dict['<PAD>'],
)
for i in decoded:
    print(' '.join(map(lambda x: target_token_dict_inv[x], i[1:-1])))



...
Follow him.
I am tired.
I can swim.
I can swim.
I love you.


# Testing

test_sents = [
    'kaise ho?',
    'kya tum mujhse pyar karte ho?',
    'kya tum mujhe pyar karte ho?'
]

test_tokens = [i.split() for i in test_sents]

test_token_dict = build_token_dict(test_tokens)

test_token_dict_inv = {v: k for k, v in test_token_dict.items()}

test_enc_tokens = [['<START>'] + tokens + ['<END>'] for tokens in test_tokens]
test_enc_tokens = [tokens + ['<PAD>'] * (target_max_len - len(tokens)) for tokens in test_enc_tokens]
test_input = [list(map(lambda x: test_token_dict[x], tokens)) for tokens in test_enc_tokens]

decoded = decode(
    model,
    test_input,
    start_token=test_token_dict['<START>'],
    end_token=test_token_dict['<END>'],
    pad_token=test_token_dict['<PAD>'],
)
for i in decoded:
    print(' '.join(map(lambda x: target_token_dict_inv[x], i[1:-1])))

Sunday, October 16, 2022

Stories from 'How Not to be Wrong' by Jordan Ellenberg

Borrowed from 'How Not to be Wrong: The Hidden Maths of Everyday Life' by Jordan Ellenberg.

Importance of data versus intelligent analysis

During World War II, numerous fighter planes were getting hit by anti-aircraft guns. Air Force officers wanted to add some protective armor/shield to the planes. The question was "where"? The planes could only support few more kilos of weight. A group of mathematicians and engineers were called for a short consulting project. Fighter planes returning from missions were analyzed for bullet holes per square foot. They found 1.93 bullet holes/sq. foot near the tail of the planes whereas only 1.11 bullet holes/sq. foot close to the engine. The Air Force officers thought that since the tail portion had the greatest density of bullets, that would be the logical location for putting an anti-bullet shield. A mathematician named Abraham Wald said exactly the opposite; more protection is needed where the bullet holes aren't - that is - around the engines. His judgement surprised everyone. He said "We are counting the planes that returned from a mission. Planes with lots of bullet holes in the engine did not return at all."

Debrief

If you go to the recovery room at the hospital, you'll see a lot more people with bullet holes in their legs than people with bullet holes in their chests. That's not because people don't get shot in the chest; it's because the people who get shot in the chest don't recover.

Remember the words of Einstein: "Not everything that counts can be counted, and not everything that can be counted, counts."

The Parable of the Baltimore Stockbroker

Imagine getting a mail from a little known stockbroking firm. The mail predicts that a certain stock will rise this week. You leave the mail aside, as you have seen enough such mails. But the prediction turns out to be right.

Next week

The Baltimore Stockbroker mails again, with another tip-this time, of a stock going south. The message turns out right too and you decide to mark the Baltimore Stockbroker as 'not spam'.

Week Three

Another hit. And your interest is piqued.
This goes on for ten weeks. Ten accurate predictions from the Baltimore Stockbroker.
You, the guy who recently retired with a substantial gratuity in the bank, are hooked.

Week eleven

The Baltimore Stockbroker sends you an offer to invest money with him, for a substantial fee of course. There is the usual caveat of past performances not guaranteeing future success, but the Baltimore Stockbroker nudges you to consider his ten week streak.
You do the math. Every week, the Stockbroker had a 50% chance with his prediction. Either he would be right, or wrong.
Combining the probabilities for ten weeks, the chances of the Baltimore Stockbroker to be right ten weeks in a row work out to... 1/2 x 1/2 x 1/2... ten times... = 1/1024

You consider. The Baltimore Stockbroker must be onto something. And it would be worthwhile to invest your nest egg with him.
You go in for the offer.

Things, from the view of the Baltimore Stockbroker, are a bit different.

What he did, was start out with sending 10,240 newsletters!
Of these, 5120 said a stock would go up, and 5120 said otherwise.
The 5120 who got a dud prediction never heard from the Baltimore Stockbroker again.
Week Two, the Baltimore Stockbroker sent 2560 newsletters, and the following week he again halved the number, based on who got his correct prediction.
This way, at the end of week 10, he had ten people, convinced he was a financial genius.
That's... The power of probabilities, cons, and the impact of mathematics on daily life... Just one aspect!

This is how the Tip Providers work.

Saturday, October 15, 2022

JavaScript RegEx Solved Interview Problem (Oct 2022)

Question
Update JSON-like object that contains new lines and comments.

I'm working on a Node application where I make a call to external API. I receive a response which looks exactly like this:

{
   "data": "{\r\n    // comment\r\n    someProperty: '*',\r\n\r\n    // another comment\r\n    method: function(e) {\r\n        if (e.name === 'something') {\r\n            onSomething();\r\n        }\r\n    }\r\n}"
}

So as you can see it contains some comments, new line characters etc.

I would like parse it somehow to a proper JSON and then update the method property with completely different function. What would be the best (working) way to achieve it? I've tried to use comment-json npm package, however it fails when I execute parse(response.data).

Answer

<script>
function get_response() {     return JSON.stringify({         "data": "{\r\n    // comment\r\n    someProperty: '*',\r\n\r\n    // another comment\r\n    method: function(e) {\r\n        if (e.name === 'something') {\r\n            onSomething();\r\n        }\r\n    }\r\n}"     }); }

var resp = get_response()

console.log("Response before formatting: " + resp);

resp = resp.replace(/\/.*?\\r\\n/g, "");
resp = resp.replace(/\\r\\n/g, "");

console.log("Response after formatting: " + resp);

console.log("And JSON.parse()")
console.log(JSON.parse(resp))
</script>

MongoDB and Node.js Installation on Ubuntu (Oct 2022)

Part 1: MongoDB
(base) ashish@ashishlaptop:~/Desktop$ sudo apt-get install -y mongodb-org=6.0.2 mongodb-org-database=6.0.2 mongodb-org-server=6.0.2 mongodb-org-mongos=6.0.2 mongodb-org-tools=6.0.2

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:

 mongodb-org-mongos : Depends: libssl1.1 (>= 1.1.1) but it is not installable
 mongodb-org-server : Depends: libssl1.1 (>= 1.1.1) but it is not installable
E: Unable to correct problems, you have held broken packages.

References 

1) Install MongoDB (latest) on Ubuntu 
2) Install MongoDB (v6.0) on Ubuntu 
3) Install MongoDB (v5.0) on Ubuntu 

Resolution

Dated: 2022-Oct-14
MongoDb has no official build for ubuntu 22.04 at the moment.

Ubuntu 22.04 has upgraded libssl to 3 and does not propose libssl1.1

You can force the installation of libssl1.1 by adding the ubuntu 20.04 source:

$ echo "deb http://security.ubuntu.com/ubuntu focal-security main" | sudo tee /etc/apt/sources.list.d/focal-security.list

$ sudo apt-get update
$ sudo apt-get install libssl1.1

Then use your commands to install mongodb-org.

Then delete the focal-security list file you just created:

$ sudo rm /etc/apt/sources.list.d/focal-security.list
[ Ref ]

Part 2: Node.js

(base) ashish@ashishlaptop:~/Desktop/node$ node

Command 'node' not found, but can be installed with:
sudo apt install nodejs

(base) ashish@ashishlaptop:~/Desktop/node$ sudo apt install nodejs

(base) ashish@ashishlaptop:~/Desktop/node$ node -v
v12.22.9 

(base) ashish@ashishlaptop:~/Desktop/node$ npm

Command 'npm' not found, but can be installed with:
sudo apt install npm

$ sudo apt install npm

(base) ashish@ashishlaptop:~/Desktop/node$ npm -v
8.5.1

Issue when MongoDB client is not installed

(base) ashish@ashishlaptop:~/Desktop/node$ node
Welcome to Node.js v12.22.9.
Type ".help" for more information.
> var mongo = require('mongodb');
Uncaught Error: Cannot find module 'mongodb' 
Require stack:
- <repl>
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:815:15)
    at Function.Module._load (internal/modules/cjs/loader.js:667:27)
    at Module.require (internal/modules/cjs/loader.js:887:19)
    at require (internal/modules/cjs/helpers.js:74:18) {
    code: 'MODULE_NOT_FOUND',
    requireStack: [ '<repl>' ]
}
> 

(base) ashish@ashishlaptop:~/Desktop/node$ npm install mongodb

added 20 packages, and audited 21 packages in 35s

3 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities 

After mongoDB client has been installed


> var mongo = require('mongodb');
undefined
>

Friday, October 14, 2022

Node.js and MongoDB Solved Interview Problem (Oct 2022)

Question:

I'm making 2 queries (or a single query) to the database. 
What I want to achieve:

If one of them is null, I want to add the value 'nil'. Example:

field1: nil,
field2: 'value'

If both are null, then I want it to respond with the 'not found' message.

What's a good approach for this?

Answer:

First, we setup the local database.


Running a local instance

(base) ashish@ashishlaptop:~/Desktop$ mongosh
Current Mongosh Log ID:	63498870b0f4b94029f7b626
Connecting to:		mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.0
Using MongoDB:		6.0.2
Using Mongosh:		1.6.0

For mongosh info see: https://docs.mongodb.com/mongodb-shell/


To help improve our products, anonymous usage data is collected and sent to MongoDB periodically (https://www.mongodb.com/legal/privacy-policy).
You can opt-out by running the disableTelemetry() command.

------
   The server generated these startup warnings when booting
   2022-10-14T21:33:50.958+05:30: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem
   2022-10-14T21:33:52.643+05:30: Access control is not enabled for the database. Read and write access to data and configuration is unrestricted
   2022-10-14T21:33:52.644+05:30: vm.max_map_count is too low
------

------
   Enable MongoDB's free cloud-based monitoring service, which will then receive and display
   metrics about your deployment (disk utilization, CPU, operation statistics, etc).
   
   The monitoring data will be available on a MongoDB website with a unique URL accessible to you
   and anyone you share the URL with. MongoDB may use this information to make product
   improvements and to suggest MongoDB products and deployment options to you.
   
   To enable free monitoring, run the following command: db.enableFreeMonitoring()
   To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
------

test> 
test> db
test
test> show dbs

admin   40.00 KiB
config  12.00 KiB
local   72.00 KiB  

test> 

test> 

test> db.createCollection("ccn1")
{ ok: 1 }

test> db.ccn1.insertOne({"name":"Ashish Jain","address":"Delhi"})

{
  acknowledged: true,
  insertedId: ObjectId("634989627d163a41b75e1e13")
}
test> 

Next Command

db.ccn1.insertMany([  
  {"name":"","address":"India"},
  {"name":"","address":""},
  {"name":"Ash","address":""}
])


test> db.ccn1.insertMany([  
... {"name":"","address":"India"},
... {"name":"","address":""},
... {"name":"Ash","address":""}
... ])

{
  acknowledged: true,
  insertedIds: {
    '0': ObjectId("634989cc7d163a41b75e1e14"),
    '1': ObjectId("634989cc7d163a41b75e1e15"),
    '2': ObjectId("634989cc7d163a41b75e1e16")
  }
}
test> 

Our Database Looks Like This

[
  {
    _id: new ObjectId("634989627d163a41b75e1e13"),
    name: 'Ashish Jain',
    address: 'Delhi'
  },
  {
    _id: new ObjectId("634989cc7d163a41b75e1e14"),
    name: '',
    address: 'India'
  },
  {
    _id: new ObjectId("634989cc7d163a41b75e1e15"),
    name: '',
    address: ''
  },
  {
    _id: new ObjectId("634989cc7d163a41b75e1e16"),
    name: 'Ash',
    address: ''
  }
]
 
This is our Node.js code in file "script.js"


var MongoClient = require('mongodb').MongoClient;

var url = "mongodb://localhost:27017/";

MongoClient.connect(url, function(err, db) {
  if (err) throw err;
  var dbo = db.db("test");
  var all_docs = dbo.collection("ccn1").find({}).toArray(function(err, result) {
    if (err) throw err;
    
    
    for (i in result){
        if(result[i]['name'] && result[i]['address']) {
            console.log("name: " + result[i]['name'])
            console.log("address: " + result[i]['address'])
        }

        else if (result[i]['name'] && !result[i]['address']){
            console.log("name: " + result[i]['name'])
            console.log("address: nil")
        }

        else if (!result[i]['name'] && result[i]['address']){
            console.log("name: nil")
            console.log("address: " + result[i]['address'])
        }

        else {
            console.log("Not Found")
        }

        console.log()
        
    }

    db.close();
  });
});  

This is our output


(base) ashish@ashishlaptop:~/Desktop/software/node$ node script.js 
name: Ashish Jain
address: Delhi

name: nil
address: India

Not Found

name: Ash
address: nil

Thursday, October 13, 2022

SSH Setup (on two Ubuntu machines), Error Messages and Resolution

System 1: ashishlaptop

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ ifconfig

enp1s0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 9c:5a:44:09:35:ee  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 375  bytes 45116 (45.1 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 375  bytes 45116 (45.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.131  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::5154:e768:24e1:aece  prefixlen 64  scopeid 0x20<link>
        inet6 2401:4900:47f6:d7d1:b724:d299:1a51:567  prefixlen 64  scopeid 0x0<global>
        inet6 2401:4900:47f6:d7d1:239a:fc2d:c994:6e54  prefixlen 64  scopeid 0x0<global>
        ether b0:fc:36:e5:ad:11  txqueuelen 1000  (Ethernet)
        RX packets 7899  bytes 8775440 (8.7 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5299  bytes 665165 (665.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ hostname
ashish-Lenovo-ideapad-130-15IKB

To Change The Hostname

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ sudo nano /etc/hostname
(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~$ cat /etc/hostname
ashishlaptop

System restart required at this point for new hostname to reflect everywhere.

To Setup Addressing of Connected Nodes and Their IP Addresses

Original File Contents


(base) ashish@ashishlaptop:~$ cat /etc/hosts
127.0.0.1	localhost
127.0.1.1	ashish-Lenovo-ideapad-130-15IKB

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

File "/etc/hosts" After Change

(base) ashish@ashishlaptop:~$ sudo nano /etc/hosts
(base) ashish@ashishlaptop:~$ cat /etc/hosts
192.168.1.131	ashishlaptop
192.168.1.106	ashishdesktop

Checking Connectivity With The Other Machine

(base) ashish@ashishlaptop:~$ ping 192.168.1.106


PING 192.168.1.106 (192.168.1.106) 56(84) bytes of data.
64 bytes from 192.168.1.106: icmp_seq=1 ttl=64 time=5.51 ms
64 bytes from 192.168.1.106: icmp_seq=2 ttl=64 time=115 ms
64 bytes from 192.168.1.106: icmp_seq=3 ttl=64 time=4.61 ms
64 bytes from 192.168.1.106: icmp_seq=4 ttl=64 time=362 ms
64 bytes from 192.168.1.106: icmp_seq=5 ttl=64 time=179 ms
64 bytes from 192.168.1.106: icmp_seq=6 ttl=64 time=4.53 ms
^C
--- 192.168.1.106 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5012ms
rtt min/avg/max/mdev = 4.525/111.739/361.954/129.976 ms 

System 2: ashishdesktop

(base) ashish@ashishdesktop:~$ ifconfig

ens33: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:e0:4c:3c:16:6b  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 317  bytes 33529 (33.5 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 317  bytes 33529 (33.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlx00e02d420fcb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.106  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 2401:4900:47f6:d7d1:3cc9:20f6:af75:bb28  prefixlen 64  scopeid 0x0<global>
        inet6 2401:4900:47f6:d7d1:73e6:fca0:4452:382  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::1cdd:53e7:d13a:4f52  prefixlen 64  scopeid 0x20<link>
        ether 00:e0:2d:42:0f:cb  txqueuelen 1000  (Ethernet)
        RX packets 42484  bytes 56651709 (56.6 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 28763  bytes 3324595 (3.3 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

(base) ashish@ashishdesktop:~$ hostname
ashishdesktop

Original Contents of File "/etc/hosts"


(base) ashish@ashishdesktop:~$ cat /etc/hosts
127.0.0.1	localhost
127.0.1.1	ashishdesktop

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters


Modified Contents of "/etc/hosts"

(base) ashish@ashishdesktop:~$ sudo nano /etc/hosts
(base) ashish@ashishdesktop:~$ cat /etc/hosts
192.168.1.106	ashishdesktop
192.168.1.131	ashishlaptop


SSH Commands

First: Follow steps 1 to 7 on every node.
1) sudo apt-get install openssh-server openssh-client
2) sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT
3) Use the network adapter 'NAT' in the Guest OS settings, and create a new port forwarding rule "SSH" for port 22.
4) sudo reboot
5) ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""
6) sudo service ssh stop
7) sudo service ssh start

Second: After 'First' is done, follow steps 8 to 10 on every node. 
8) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@master
9) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave1
10) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave2

Error Messages And Resolutions

Error 1: Port 22: Connection refused


(base) ashish@ashishlaptop:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashishdesktop
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: connect to host ashishdesktop port 22: Connection refused

Resolution

First follow SSH steps 1 to 7 on both the machines.

Error 2: Could not resolve hostname ashishlaptop

(base) ashish@ashishdesktop:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashishlaptop
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname ashishlaptop: Temporary failure in name resolution

Resolution

Modify contents of two files "/etc/hostname" and "/etc/hosts" as shown above as the starting activity for this task.

Setting up a three node Spark cluster on Ubuntu using VirtualBox (Apr 2020)

Setting hostname in three Guest OS(s):
$ sudo gedit /etc/hostname
    "to master, slave1, and slave2 on different machines"

-----------------------------------------------------------

ON MASTER (Host OS IP: 192.168.1.12):

$ cat /etc/hosts

192.169.1.12	master
192.168.1.3		slave1
192.168.1.4		slave2

Note: Mappings "127.0.0.1  master" and "127.0.1.1  master" should not be there.

$ cd /usr/local/spark/conf
$ sudo gedit slaves

slave1
slave2

$ sudo gedit spark-env.sh

# YOU CAN SET PORTS HERE, IF PORT-USE ISSUE COMES: SPARK_MASTER_PORT=10000 / SPARK_MASTER_WEBUI_PORT=8080
# REMEMBER TO ADD THESE PORTS ALSO IN THE VM SETTING FOR PORT FORWARDING.

export SPARK_WORKER_CORES=2
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

$ sudo gedit ~/.bashrc
Add the below line at the end of the file.
export SPARK_HOME=/usr/local/spark

Later we will start all the Spark cluster, using the following command:
$ cd /usr/local/spark/sbin
$ source start-all.sh

-----------------------------------------------------------

ON SLAVE2 (Host OS IP: 192.168.1.4):

$ cat /etc/hostname
slave2

$ cat /etc/hosts
192.169.1.12	master
192.168.1.3		slave1
192.168.1.4		slave2

Note: Localhost mappings are removed.
---

$ cd /usr/local/spark/conf
$ sudo gedit spark-env.sh
#Setting SPARK_LOCAL_IP to "192.168.1.4" (the Host OS IP) would be wrong and would result in port failure logs.
SPARK_LOCAL_IP=127.0.0.1

SPARK_MASTER_HOST=192.168.1.12
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

-----------------------------------------------------------

FOLLOW THE STEPS MENTIONED FOR SLAVE2 ALSO FOR SLAVE1 (Host OS IP: 192.168.1.3)

-----------------------------------------------------------

Configuring Key Based Login

Setup SSH in every node such that they can communicate with one another without any prompt for password.

First: Follow steps 1 to 7 on every node.
1) sudo apt-get install openssh-server openssh-client
2) sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT
3) Use the network adapter 'NAT' in the Guest OS settings, and create a new port forwarding rule "SSH" for port 22.
4) sudo reboot
5) ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""
6) sudo service ssh stop
7) sudo service ssh start

Second: After 'First' is done, follow steps 8 to 10 on every node. 
8) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@master
9) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave1
10) ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave2

LOGS:

(base) ashish@ashish-VirtualBox:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@ashish-VirtualBox
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub"
The authenticity of host 'ashish-virtualbox (127.0.1.1)' can't be established.
ECDSA key fingerprint is SHA256:FfT9M7GMzBA/yv8dw+7hKa91B1D68gLlMCINhbj3mt4.
Are you sure you want to continue connecting (yes/no)? y   
Please type 'yes' or 'no': yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
ashish@ashish-virtualbox's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'ashish@ashish-VirtualBox'"
and check to make sure that only the key(s) you wanted were added.

-----------------------------------------------------------

FEW SSH COMMANDS:

1) Checking the SSH status:
	(base) ashish@ashish-VirtualBox:~$ sudo service ssh status
	[sudo] password for ashish: 
	● ssh.service - OpenBSD Secure Shell server
	   Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
	   Active: active (running) since Wed 2019-07-24 18:03:50 IST; 1h 1min ago
	  Process: 953 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
	  Process: 946 ExecReload=/usr/sbin/sshd -t (code=exited, status=0/SUCCESS)
	  Process: 797 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=0/SUCCESS)
	 Main PID: 819 (sshd)
		Tasks: 1 (limit: 4915)
	   CGroup: /system.slice/ssh.service
			   └─819 /usr/sbin/sshd -D

	Jul 24 18:03:28 ashish-VirtualBox systemd[1]: Starting OpenBSD Secure Shell server...
	Jul 24 18:03:50 ashish-VirtualBox sshd[819]: Server listening on 0.0.0.0 port 22.
	Jul 24 18:03:50 ashish-VirtualBox sshd[819]: Server listening on :: port 22.
	Jul 24 18:03:50 ashish-VirtualBox systemd[1]: Started OpenBSD Secure Shell server.
	Jul 24 18:04:12 ashish-VirtualBox systemd[1]: Reloading OpenBSD Secure Shell server.
	Jul 24 18:04:12 ashish-VirtualBox sshd[819]: Received SIGHUP; restarting.
	Jul 24 18:04:12 ashish-VirtualBox sshd[819]: Server listening on 0.0.0.0 port 22.
	Jul 24 18:04:12 ashish-VirtualBox sshd[819]: Server listening on :: port 22.

2) (base) ashish@ashish-VirtualBox:/etc/ssh$ sudo gedit ssh_config 
You can change SSH port here.

3) Use the network adapter 'NAT' in the GuestOS settings, and create a new port forwarding rule "SSH" for port you are mentioning in Step 2.

4.A) sudo service ssh stop
4.B) sudo service ssh start

LOGS:

ashish@master:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub ashish@slave1
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ashish/.ssh/id_rsa.pub"
The authenticity of host 'slave1 (192.168.1.3)' can't be established.
ECDSA key fingerprint is SHA256:+GsO1Q6ilqwIYfZLIrBTtt/5HqltZPSjVlI36C+f7ZE.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
ashish@slave1's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'ashish@slave1'" and check to make sure that only the key(s) you wanted were added.

ashish@master:~$ ssh slave1
Welcome to Ubuntu 19.04 (GNU/Linux 5.0.0-21-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage


99 updates can be installed immediately.
0 of these updates are security updates.

-----------------------------------------------------------

LOGS FROM SPARK MASTER ON SUCCESSFUL START:
$ cd /usr/local/spark/sbin
$ source start-all.sh

ashish@master:/usr/local/spark/sbin$ cat /usr/local/spark/logs/spark-ashish-org.apache.spark.deploy.master.Master-1-master.out

Spark Command: /usr/lib/jvm/java-8-openjdk-amd64//bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host master --port 60000 --webui-port 50000
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/08/05 17:32:09 INFO Master: Started daemon with process name: 1664@master
19/08/05 17:32:10 INFO SignalUtils: Registered signal handler for TERM
19/08/05 17:32:10 INFO SignalUtils: Registered signal handler for HUP
19/08/05 17:32:10 INFO SignalUtils: Registered signal handler for INT
19/08/05 17:32:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/08/05 17:32:11 INFO SecurityManager: Changing view acls to: ashish
19/08/05 17:32:11 INFO SecurityManager: Changing modify acls to: ashish
19/08/05 17:32:11 INFO SecurityManager: Changing view acls groups to: 
19/08/05 17:32:11 INFO SecurityManager: Changing modify acls groups to: 
19/08/05 17:32:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ashish); groups with view permissions: Set(); users  with modify permissions: Set(ashish); groups with modify permissions: Set()
19/08/05 17:32:13 INFO Utils: Successfully started service 'sparkMaster' on port 60000.
19/08/05 17:32:13 INFO Master: Starting Spark master at spark://master:60000
19/08/05 17:32:13 INFO Master: Running Spark version 2.4.3
19/08/05 17:32:13 INFO Utils: Successfully started service 'MasterUI' on port 50000.
19/08/05 17:32:13 INFO MasterWebUI: Bound MasterWebUI to 127.0.0.1, and started at http://master:50000
19/08/05 17:32:14 INFO Master: I have been elected leader! New state: ALIVE

-----------------------------------------------------------