Sunday, October 30, 2022

Do we need all the one-hot features?


import pandas as pd
df = pd.read_csv('in/titanic_train.csv')
df.head()

enc_pc_df = pd.get_dummies(df, columns = ['Pclass'])
enc_pc_df.head()

Hypothetical Question

Q: If I remove one column (it could be the first, it could be the last) from a one-hot feature matrix with 'n' columns, can I reproduce the original matrix from the remaining 'n-1' columns? Rephrasing the question: how do we get the original matrix back, i.e. 'put back the dropped column'?

Answer: For each row, if there is no '1' among the remaining n-1 values, then the dropped value for that row is 1. Else it is 0.

Assumption made: the one-hot encoding has 'n' columns in total.

Conclusion: Removing one column from the one-hot feature matrix causes no data loss. Each column's value is fully determined by the values of the remaining n-1 columns.
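As a quick sanity check of the rule above, here is a minimal sketch (using hypothetical dummy columns named Pclass_2 and Pclass_3, with Pclass_1 as the dropped column) that rebuilds the dropped column from the remaining n-1 columns:

import pandas as pd

# Hypothetical one-hot matrix after dropping the first column (Pclass_1)
reduced = pd.DataFrame({
    'Pclass_2': [0, 0, 0, 1, 0],
    'Pclass_3': [1, 0, 1, 0, 1]
})

# Rule from above: if a row has no '1' among the remaining n-1 columns,
# the dropped column must have been 1 for that row; otherwise it was 0.
reduced.insert(0, 'Pclass_1', (reduced.sum(axis = 1) == 0).astype(int))
print(reduced)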

So, how do we remove this dependence among the columns?

We drop the first column from the one-hot feature matrix:

enc_pc_df = pd.get_dummies(df, columns = ['Pclass'], drop_first = True)
enc_pc_df.head()

The default value of the "drop_first" parameter is False.
Tags: Technology,Machine Learning,

Installing 'Category Encoders' Python Package Using Pip And Conda

You can install Category Encoders package from both Pip and Conda-forge.
But there is one subtle difference in the package name between the two repositories.

In PyPI, it is: category-encoders 

In Conda-forge, it is: category_encoders

Now, installing it via Conda with a minimal env.yml:

Contents of the file:

name: cat_enc
channels:
  - conda-forge
dependencies:
  - scikit-learn
  - pandas
  - ipykernel
  - jupyterlab
  - category_encoders

Conda Commands

$ conda remove -n cat_enc --all
$ conda env create -f env.yml
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages
category_encoders-2. | 62 KB   | ### | 100%
...
python-3.10.6        | 29.0 MB | ### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate cat_enc
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Retrieving notices: ...working... done

(base) $ conda activate cat_enc

(cat_enc) $ python -m ipykernel install --user --name cat_enc
Installed kernelspec cat_enc in /home/ashish/.local/share/jupyter/kernels/cat_enc

(cat_enc) $
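A quick smoke test for the freshly created 'cat_enc' environment is sketched below. It is only a sketch: the OneHotEncoder class, its cols argument and use_cat_names flag are written from memory of the category_encoders API, so check the package documentation if your installed version behaves differently.

import pandas as pd
import category_encoders as ce

# Tiny stand-in DataFrame instead of the Titanic data
df = pd.DataFrame({'Pclass': [3, 1, 3, 1, 2]})

# Assumed API: OneHotEncoder(cols=..., use_cat_names=...)
encoder = ce.OneHotEncoder(cols = ['Pclass'], use_cat_names = True)
print(encoder.fit_transform(df))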
Tags: Technology,Machine Learning,

Saturday, October 29, 2022

Clinical Psychology Books (May 2020)

Download Books

Covered/Read

1. The Man Who Mistook His Wife for a Hat. Book by Oliver Sacks
2. Maps of Meaning: The Architecture of Belief. Jordan B. Peterson (1999)
3. 12 Rules for Life. Book by Jordan Peterson
4. Beyond Order (12 More Rules for Life). Jordan Peterson
5. The Body Keeps the Score: Brain, Mind, and Body in the Healing of Trauma. Book by Bessel van der Kolk
6. The Boy Who Was Raised as a Dog: And Other Stories from a Child Psychiatrist's Notebook. What Traumatized Children Can Teach Us about Loss, Love, and Healing. Book by Bruce D. Perry
7. An Unquiet Mind. Book by Kay Redfield Jamison
8. My Little Epiphanies. Aisha Chaudhary
9. Expressive Writing: Words that Heal. James W. Pennebaker, John E. Evans (2014)
10. The Anxiety and Phobia Workbook (6th edition, 2015). Edmund Bourne

Pending (Too Technical)

1. Abnormal and Clinical Psychology: An Introductory Textbook. Book by Paul Bennett
2. The Handbook of Child and Adolescent Clinical Psychology. Book by Alan Carr
3. The Polyvagal Theory in Therapy: Engaging the Rhythm of Regulation. Book by Deb Dana
4. The Theory and Practice of Group Psychotherapy. Book by Irvin D. Yalom
5. Clinical Psychology: An Introduction. Book by Alan Carr
6. Clinical Psychology (1962). Editor: Graham Davey
7. Publication Manual of the American Psychological Association. Book by American Psychological Association
8. Stahl's Essential Psychopharmacology. Textbook by Stephen M. Stahl
9. Fish's Clinical Psychopathology: Signs and Symptoms in Psychiatry. Book by B. Kelly and Patricia Casey
10. A Dictionary of Psychology. Originally published: 2001. Author: Andrew Colman. Genre: Dictionary
11. Abnormal Psychology. Textbook by Ronald J. Comer
12. Existential Psychotherapy. Book by Irvin D. Yalom
13. Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5. Book by American Psychiatric Association
14. Clinical Psychology: A Very Short Introduction. Book by Katie Aafjes-van Doorn and Susan Llewelyn
15. Introduction to Clinical Psychology. Textbook (1980). Lynda A. Heiden and Michel Hersen
21. Clinical Psychology: Assessment, Treatment, and Research. Originally published: 2009. Genre: Self-help book. Editors: Steven K. Huprich, David C. S. Richard
22. The Oxford Handbook of Clinical Psychology. Originally published: 12 November 2010. Genre: Reference work. Editor: David H. Barlow
23. Introduction to Clinical Psychology. Textbook by Geoffrey L. Thorpe and Jeffrey Hecker
24. Clinical Psychology in Practice. Originally published: 2009. Editor: Paul Kennedy
25. Introducing Psychology. Originally published: 10 August 1994. Author: Nigel Benson. Genres: Study guide, Non-fiction comics
26. What is Clinical Psychology? Originally published: 13 April 2006. Editors: Susan Llewelyn, David Murphy
27. Theory and Practice of Counseling and Psychotherapy. Book by Gerald Corey
28. The Gift of Therapy. Book by Irvin D. Yalom
29. The Myth of Mental Illness. Book by Thomas Szasz
30. Trauma and Recovery. Book by Judith Lewis Herman
31. Brain Lock: Free Yourself from Obsessive-Compulsive Behavior. Book by Jeffrey M. Schwartz
32. Motivational Interviewing in Health Care: Helping Patients Change Behavior. Book by Christopher C. Butler, Stephen Rollnick, and William Richard Miller
33. Many Lives, Many Masters. Book by Brian Weiss
34. Clinical Handbook of Psychological Disorders, Fifth Edition: A Step-by-Step Treatment Manual. Originally published: 1985. Editor: David H. Barlow. Genres: Thesis, Reference work
35. Skills Training Manual for Treating Borderline Personality Disorder. Book by Marsha M. Linehan. Originally published: 14 May 1993
36. Diagnostic and Statistical Manual of Mental Disorders. Originally published: 1952. Author: American Psychiatric Association. Original language: English
37. Abnormal Psychology: Clinical Perspectives on Psychological Disorders. Originally published: 2000. Authors: Richard P. Halgin, Susan Krauss Whitbourne
38. Coping Skills for Kids Workbook: Over 75 Coping Strategies to Help Kids ... Book by Janine Halloran. Originally published: 4 June 2016
39. Beyond Behaviors: Using Brain Science and Compassion to Understand and Solve Children's Behavioral Challenges. Book by Mona Delahooke
40. Seeking Safety: A Treatment Manual for PTSD and Substance Abuse. Book by Lisa M. Najavits. Originally published: 2002
41. DBT Skills Training Manual, Second Edition. Book by Marsha M. Linehan. Originally published: 19 October 2014
42. ACT Made Simple: An Easy-To-Read Primer on Acceptance and Commitment Therapy. Book by Russ Harris. Originally published: November 2009
43. Psychopathology: Research, Assessment and Treatment in Clinical Psychology. Textbook by Graham Davey. Originally published: 29 September 2008
44. The Interpretation of Dreams. Book by Sigmund Freud. Originally published: 4 November 1899. Original title: Die Traumdeutung. Text: The Interpretation of Dreams at Wikisource. Original language: German. Subject: Dream interpretation
45. DSM-5 Made Easy: The Clinician's Guide to Diagnosis. Book by James Roy Morrison. Originally published: 11 April 2014
46. Insider's Guide to Graduate Programs in Clinical and Counseling Psychology. Book by John C. Norcross and Michael A. Sayette. Originally published: 10 March 1996
47. The Whole-Brain Child: 12 Revolutionary Strategies to Nurture Your Child's ... Book by Daniel J. Siegel and Tina Payne Bryson. Originally published: 4 October 2011
48. Madness Explained: Psychosis and Human Nature. Book by Richard P. Bentall
50. Becoming a Clinical Psychologist: Everything You Need to Know. Book by Amanda Mwale and Steven Mayers
51. Clinical Psychology: Science, Practice, and Culture. Textbook by Andrew M. Pomerantz
52. The Red Book. Book by Carl Jung. The Red Book is a red leather-bound folio manuscript crafted by the Swiss psychiatrist Carl Gustav Jung between 1915 and about 1930. It recounts and comments upon the author's psychological experiments between 1913 and 1916, and is based on manuscripts first drafted by Jung in 1914–15 and 1917. Originally published: 7 October 2009. Original title: Liber Novus ("The New Book"). Original language: German. Page count: 404. Genre: Biography
Tags: List of Books,Psychology,

Friday, October 28, 2022

One Hot Encoding Using Pandas' get_dummies() Method on Titanic Dataset

Download Data and Code

import pandas as pd
df = pd.read_csv('titanic_train.csv')
print(df.head())

print("Number of Unique Values in The Column 'Sex':") print(df['Sex'].nunique())
# 2 # This is also the width of it's one-hot encoding. print("Number of Unique Values in The Column For 'Passenger Class':") print(df['Pclass'].nunique()) # 3 # This is also the width of one-hot encoding for 'Passenger Class'.

Let us first see what happens when we do one-hot encoding of column 'Sex'.

enc_gender_df = pd.get_dummies(df, columns = ['Sex'])
print(enc_gender_df.head())
# Sex
# male
# female
# female
# female
# male
#
# Sex_female  Sex_male
# 0           1
# 1           0
# 1           0
# 1           0
# 0           1

enc_pc_df = pd.get_dummies(df, columns = ['Pclass'])
print(enc_pc_df.head())
# Pclass_1  Pclass_2  Pclass_3
# 0         0         1
# 1         0         0
# 0         0         1
# 1         0         0
# 0         0         1

Fun Facts

1. LabelEncoder of Scikit-Learn works by encoding the labels in ascending alphabetical sequence.
2. Besides the ascending alphabetical sequence, there are three more sequences that are common:
2.1. Descending alphabetical sequence
2.2. Ascending frequency-based sequence
2.3. Descending frequency-based sequence
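A small illustration of fun fact 1, sketched with the 'Sex' labels from the Titanic example above:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['male', 'female', 'female', 'male'])

print(le.classes_)                        # ['female' 'male'] : ascending alphabetical order
print(le.transform(['female', 'male']))   # [0 1]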
Tags: Technology,Machine Learning,

Elon Musk clarifying why he took over Twitter (2022 Oct 27)

Elon Musk's first day at twitter headquarters. "I wanted to reach out personally to share my motivation in acquiring Twitter. There has been much speculation about why I bought Twitter and what I think about advertising. Most of it has been wrong. The reason I acquired Twitter is because it is important to the future of civilization to have a common digital town square, where a wide range of beliefs can be debated in a healthy manner, without resorting to violence. There is currently great danger that social media will splinter into far right wing and far left wing echo chambers that generate more hate and divide our society. In the relentless pursuit of clicks, much of traditional media has fueled and catered to those polarised extremes, as they believe that is what brings in the money, but, in doing so, the opportunity for dialogue is lost. That is why I bought Twitter. I didn't do it because it would be easy. I didn't do it to make more money. I did it to try to help humanity, whom I love. And I do so with humility, recognizing that failure in pursuing this goal, despite our best efforts, is a very real possibility. That said, Twitter obviously cannot become a free-for-all hellscape, where anything can be said with no consequences! In addition to adhering to the laws of the land, our platform must be warm and welcoming to all, where you can choose your desired experience according to your preferences, just as you can choose, for example, to see movies or play video games ranging from all ages to mature." - Elon Musk
Tags: Investment,

Wednesday, October 26, 2022

Way 4: With respect to DataFrame.replace() Method (Ways in which Pandas API on PySpark differs from Plain Pandas)

Download Code

Working in Pandas

import pandas as pd

df = pd.DataFrame({
    'dummy_col': ["alpha", "beta", "gamma", "", "-", "0", "N/A", "-_-", "NA",
                  "delta", "epsilon", "zeta", "eta", "theta"]
})

df['cleaned'] = df.replace(to_replace = ["", "-", "0", "N/A", "-_-", "NA"], value = "Not Applicable")

Not working in Pandas API on PySpark

from pyspark import pandas as ppd

df_ppd = ppd.DataFrame({
    'dummy_col': ["alpha", "beta", "gamma", "", "-", "0", "N/A", "-_-", "NA",
                  "delta", "epsilon", "zeta", "eta", "theta"]
})

Error

df_ppd['cleaned'] = df_ppd.replace(to_replace = ["", "-", "0", "N/A", "-_-", "NA"], value = "Not Applicable")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [15], line 1
----> 1 df_ppd['cleaned'] = df_ppd.replace(to_replace =["","-","0","N/A","-_-","NA"], value = "Not Applicable")

File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/frame.py:12355, in DataFrame.__setitem__(self, key, value)
  12352     psdf = self._assign({k: value[c] for k, c in zip(key, field_names)})
  12353 else:
  12354     # Same Series.
> 12355     psdf = self._assign({key: value})
  12357 self._update_internal_frame(psdf._internal)

File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/frame.py:4921, in DataFrame._assign(self, kwargs)
   4917 is_invalid_assignee = (
   4918     not (isinstance(v, (IndexOpsMixin, Column)) or callable(v) or is_scalar(v))
   4919 ) or isinstance(v, MultiIndex)
   4920 if is_invalid_assignee:
-> 4921     raise TypeError(
   4922         "Column assignment doesn't support type " "{0}".format(type(v).__name__)
   4923     )
   4924 if callable(v):
   4925     kwargs[k] = v(self)

TypeError: Column assignment doesn't support type DataFrame
df_ppd_cleaned = df_ppd.replace(to_replace = ["","-","0","N/A","-_-","NA"], value = "Not Applicable")
df_ppd_cleaned.replace(to_replace = ['Not Applicable', 'alpha'], value = "Still NA", inplace = True)
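A possible alternative (not from the original post) is to call replace() on the single column, so that the value being assigned is a Series rather than a DataFrame. This assumes pyspark.pandas.Series.replace mirrors plain pandas' Series.replace here, which may differ across Spark versions:

# Sketch: replace on the Series instead of the whole DataFrame,
# so the column assignment receives a Series, not a DataFrame.
df_ppd['cleaned'] = df_ppd['dummy_col'].replace(
    to_replace = ["", "-", "0", "N/A", "-_-", "NA"],
    value = "Not Applicable"
)
df_ppd.head()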
Tags: Technology,Spark

Termux to get information about my Android device

Welcome to Termux!

Wiki: https://wiki.termux.com
Community forum: https://termux.com/community
Gitter chat: https://gitter.im/termux/termux
IRC channel: #termux on freenode

Working with packages:
* Search packages: pkg search [query]
* Install a package: pkg install [package]
* Upgrade packages: pkg upgrade

Subscribing to additional repositories:
* Root: pkg install root-repo
* Unstable: pkg install unstable-repo
* X11: pkg install x11-repo

Report issues at https://termux.com/issues

1. Getting OS Info

$ uname
Linux

$ uname -a
Linux localhost 4.14.199-24365169-abX205XXU1AVG1 #2 SMP PREEMPT Tue Jul 5 20:39:23 KST 2022 aarch64 Android

2. Getting Processor Info

$ more /proc/cpuinfo Processor : AArch64 Processor rev 1 (aarch64) processor : 0 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 1 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 2 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 3 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 4 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 5 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 6 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x3 CPU part : 0xd0a CPU revision : 1 processor : 7 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x3 CPU part : 0xd0a CPU revision : 1 Hardware : Unisoc ums512 Serial : 96789ab0ffeb70e8d1320621ab4d084fb1082517682936e1977afc5ae63a3c7b

3. Getting my username

$ whoami
u0_a218

4. Getting Your IP Address

$ ifconfig
Warning: cannot open /proc/net/dev (Permission denied). Limited output.
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 1000  (UNSPEC)
wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.102  netmask 255.255.255.0  broadcast 192.168.1.255
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 1000  (UNSPEC)

5. Checking RAM Usage

$ free -h
        total   used    free    shared  buff/cache  available
Mem:    2.4Gi   1.9Gi   113Mi   12Mi    493Mi       448Mi
Swap:   2.5Gi   1.2Gi   1.3Gi

6. Checking Space on Hard Disk

$ df -h Filesystem Size Used Avail Use% Mounted on /dev/block/dm-4 3.2G 3.2G 2.5M 100% / tmpfs 1.2G 1.3M 1.2G 1% /dev tmpfs 1.2G 0 1.2G 0% /mnt /dev/block/dm-1 122M 122M 0 100% /system_ext /dev/block/dm-5 759M 751M 0 100% /vendor /dev/block/dm-6 1.0G 1.0G 0 100% /product /dev/block/dm-7 271M 166M 99M 63% /prism /dev/block/dm-8 31M 408K 30M 2% /optics tmpfs 1.2G 0 1.2G 0% /apex /dev/block/dm-11 1.8M 1.7M 0 100% /apex/com.android.os.statsd@311510000 /dev/block/dm-12 704K 676K 16K 98% /apex/com.android.sdkext@330810010 /dev/block/dm-13 13M 13M 0 100% /apex/com.android.cellbroadcast@330911010 /dev/block/dm-14 15M 15M 0 100% /apex/com.android.permission@330912010 /dev/block/dm-15 7.9M 7.8M 0 100% /apex/com.android.tethering@330911010 /dev/block/dm-16 3.8M 3.7M 0 100% /apex/com.android.resolv@330910000 /dev/block/dm-17 19M 19M 0 100% /apex/com.android.media.swcodec@330443040 /dev/block/dm-18 8.4M 8.4M 0 100% /apex/com.android.mediaprovider@330911040 /dev/block/dm-19 836K 808K 12K 99% /apex/com.android.tzdata@303200001 /dev/block/dm-20 7.2M 7.1M 0 100% /apex/com.android.neuralnetworks@330443000 /dev/block/dm-21 7.8M 7.7M 0 100% /apex/com.android.adbd@330444000 /dev/block/dm-22 4.8M 4.8M 0 100% /apex/com.android.conscrypt@330443020 /dev/block/dm-23 5.6M 5.6M 0 100% /apex/com.android.extservices@330443000 /dev/block/dm-24 748K 720K 16K 98% /apex/com.android.ipsec@330443010 /dev/block/dm-25 5.7M 5.6M 0 100% /apex/com.android.media@330443030 /dev/block/loop21 24M 24M 0 100% /apex/com.android.i18n@1 /dev/block/loop22 5.1M 5.1M 0 100% /apex/com.android.wifi@300000000 /dev/block/loop23 5.0M 5.0M 0 100% /apex/com.android.runtime@1 /dev/block/loop24 236K 72K 160K 32% /apex/com.samsung.android.shell@303013100 /dev/block/loop25 82M 82M 0 100% /apex/com.android.art@1 /dev/block/loop26 232K 92K 136K 41% /apex/com.android.apex.cts.shim@1 /dev/block/loop27 109M 109M 0 100% /apex/com.android.vndk.v30@1 /dev/block/loop28 236K 32K 200K 14% /apex/com.samsung.android.wifi.broadcom@300000000 /dev/block/loop29 236K 32K 200K 14% /apex/com.samsung.android.camera.unihal@301742001 /dev/block/by-name/cache 303M 12M 285M 4% /cache /dev/block/by-name/sec_efs 11M 788K 10M 8% /efs /dev/fuse 22G 8.5G 13G 40% /storage/emulated

7. Print Environment Variables

$ echo $USER

$ echo $HOME
/data/data/com.termux/files/home

8. Print Working Directory

$ pwd
/data/data/com.termux/files/home
Tags: Technology,Android,Linux,

SSH Setup For Accessing Ubuntu From Windows Using SFTP

Getting Basic Info Like Hostname and IP

(base) C:\Users\ashish>hostname
CS3L

(base) C:\Users\ashish>ipconfig

Windows IP Configuration

Ethernet adapter Ethernet 2:
   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . : ad.itli.com

Ethernet adapter Ethernet:
   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . : ad.itli.com

Wireless LAN adapter Wi-Fi:
   Connection-specific DNS Suffix  . :
   IPv6 Address. . . . . . . . . . . : 2401:4900:47f2:5147:b1b2:6d59:f669:1b96
   Temporary IPv6 Address. . . . . . : 2401:4900:47f2:5147:15e3:46:9f5b:8d78
   Link-local IPv6 Address . . . . . : fe80::b1b2:6d59:f669:1b96%13
   IPv4 Address. . . . . . . . . . . : 192.168.1.100
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : fe80::d837:1aff:fe40:b173%13
                                       192.168.1.1

Ethernet adapter Bluetooth Network Connection:
   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . :

Setting up SSH

(base) C:\Users\ashish>mkdir .ssh (base) C:\Users\ashish>dir Volume in drive C is OSDisk Volume Serial Number is ABCD-PQRS Directory of C:\Users\ashish 10/26/2022 03:25 PM <DIR> . 10/26/2022 03:25 PM <DIR> .. 08/16/2022 01:29 PM <DIR> .3T 09/26/2022 08:04 AM 1,288 .bash_history 06/02/2022 10:15 AM <DIR> .cache 05/30/2022 11:39 AM <DIR> .conda 10/26/2022 02:58 PM 89 .dotty_history 08/19/2022 06:42 PM 68 .gitconfig 10/11/2022 02:03 PM <DIR> .ipython 05/30/2022 10:05 AM <DIR> .jupyter 05/30/2022 12:56 PM <DIR> .keras 08/20/2022 11:55 AM 20 .lesshst 07/04/2022 06:09 PM <DIR> .matplotlib 06/30/2022 10:32 AM <DIR> .ms-ad 10/07/2022 09:00 PM 1,457 .python_history 10/26/2022 03:25 PM <DIR> .ssh 09/06/2022 10:13 PM 2,379 .viminfo 05/30/2022 11:34 AM <DIR> .vscode 05/16/2022 03:19 PM <DIR> 3D Objects 10/07/2022 02:50 PM <DIR> Anaconda3 05/16/2022 03:19 PM <DIR> Contacts 10/26/2022 02:57 PM <DIR> Desktop 10/07/2022 06:27 PM <DIR> Documents 10/26/2022 03:18 PM <DIR> Downloads 05/16/2022 03:19 PM <DIR> Favorites 05/16/2022 03:19 PM <DIR> Links 05/16/2022 03:19 PM <DIR> Music 05/16/2022 02:13 PM <DIR> OneDrive 05/16/2022 03:20 PM <DIR> Pictures 05/16/2022 03:19 PM <DIR> Saved Games 05/16/2022 03:20 PM <DIR> Searches 05/30/2022 09:36 AM <DIR> Videos 6 File(s) 5,301 bytes 26 Dir(s) 81,987,842,048 bytes free (base) C:\Users\ashish>cd .ssh (base) C:\Users\ashish\.ssh>dir Volume in drive C is OSDisk Volume Serial Number is ABCD-PQRS Directory of C:\Users\ashish\.ssh 10/26/2022 03:25 PM <DIR> . 10/26/2022 03:25 PM <DIR> .. 0 File(s) 0 bytes 2 Dir(s) 81,987,903,488 bytes free (base) C:\Users\ashish\.ssh>echo "" > id_rsa (base) C:\Users\ashish\.ssh>dir Volume in drive C is OSDisk Volume Serial Number is ABCD-PQRS Directory of C:\Users\ashish\.ssh 10/26/2022 03:26 PM <DIR> . 10/26/2022 03:26 PM <DIR> .. 10/26/2022 03:26 PM 5 id_rsa 1 File(s) 5 bytes 2 Dir(s) 81,987,678,208 bytes free (base) C:\Users\ashish\.ssh>type id_rsa (base) C:\Users\ashish\.ssh> (base) C:\Users\ashish>ssh-keygen -t rsa -f ./.ssh/id_rsa -P "" Generating public/private rsa key pair. ./.ssh/id_rsa already exists. Overwrite (y/n)? y Your identification has been saved in ./.ssh/id_rsa. Your public key has been saved in ./.ssh/id_rsa.pub. The key fingerprint is: SHA256:fGEZHROeTzogrdXwo7haw0g3eXLVZnO9nM0ZtTbIBh8 itlitli\ashish@CS3L The key's randomart image is: +---[RSA 3072]----+ | oo+E .| | . B=+o +| | . B B=*=o| | . B =.Bo+B| | . S = o .=o| | . + B . | | . = | | o . | | . | +----[SHA256]-----+ (base) C:\Users\ashish>

Note This Error While Doing Setup on Windows

CMD> ssh-copy-id -i ./.ssh/id_rsa.pub ashish@192.168.1.100
'ssh-copy-id' is not recognized as an internal or external command,
operable program or batch file.

We overcome this issue by manually copying the public RSA key into the 'authorized_keys' file on the remote machine using SFTP.

(base) C:\Users\ashish>sftp
usage: sftp [-46aCfpqrv] [-B buffer_size] [-b batchfile] [-c cipher]
            [-D sftp_server_path] [-F ssh_config] [-i identity_file]
            [-J destination] [-l limit] [-o ssh_option] [-P port]
            [-R num_requests] [-S program] [-s subsystem | sftp_server]
            destination

Next Steps of Copying the Public Key Onto the Remote Machine And Vice-versa

Address of Ubuntu System: ashish@192.168.1.151

(base) C:\Users\ashish>sftp ashish@192.168.1.151
The authenticity of host '192.168.1.151 (192.168.1.151)' can't be established.
ECDSA key fingerprint is SHA256:2hgOVHHgkrT9/6XnK/KDaFQ0DaXLUoW82eeU6oQyTvQ.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
Warning: Permanently added '192.168.1.151' (ECDSA) to the list of known hosts.
ashish@192.168.1.151's password:
Connected to 192.168.1.151.
sftp> ls
Desktop  Documents  Downloads  Music  Pictures  Public  Templates  Videos  anaconda3  nltk_data  snap
sftp> bye

PWD: /home/ashish

sftp> put id_rsa.pub win_auth_key.txt
Uploading id_rsa.pub to /home/ashish/win_auth_key.txt
id_rsa.pub                                 100%  593    89.9KB/s   00:00
sftp>

PWD: /home/ashish/.ssh

sftp> get id_rsa.pub ./ubuntu_id_rsa.pub.txt
Fetching /home/ashish/.ssh/id_rsa.pub to ./ubuntu_id_rsa.pub.txt
/home/ashish/.ssh/id_rsa.pub               100%  573     2.7KB/s   00:00
sftp> bye

Steps on Ubuntu Machine

(base) ashish@ashishlaptop:~$ cat win_auth_key.txt
ssh-rsa AAA***vZs= itli\ashish@CS3L
(base) ashish@ashishlaptop:~$

Paste this Public RSA Key in 'authorized_keys' File

(base) ashish@ashishlaptop:~/.ssh$ nano authorized_keys
(base) ashish@ashishlaptop:~/.ssh$ cat authorized_keys
ssh-rsa AAAA***rzFM= ashish@ashishdesktop
ssh-rsa AAAA***GOD0= ashish@ashishlaptop
ssh-rsa AAAA***3vZs= itli\ashish@CS3L
(base) ashish@ashishlaptop:~/.ssh$

Testing The SSH

Back to Windows 10 System

(base) C:\Users\ashish>ssh ashish@ashishlaptop
The authenticity of host 'ashishlaptop (192.168.1.151)' can't be established.
ECDSA key fingerprint is SHA256:2hgOVHHgkrT9/6XnK/KDaFQ0DaXLUoW82eeU6oQyTvQ.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'ashishlaptop' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

2 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

Last login: Wed Oct 26 13:35:44 2022 from 192.168.1.151
(base) ashish@ashishlaptop:~$
(base) ashish@ashishlaptop:~$ ls
anaconda3  Desktop  Documents  Downloads  Music  nltk_data  Pictures  Public  snap  Templates  Videos  win_auth_key.txt
(base) ashish@ashishlaptop:~$ rm win_auth_key.txt
(base) ashish@ashishlaptop:~$ ls
anaconda3  Desktop  Documents  Downloads  Music  nltk_data  Pictures  Public  snap  Templates  Videos
(base) ashish@ashishlaptop:~$ exit
logout
Connection to ashishlaptop closed.

(base) C:\Users\ashish>ssh ashish@ashishlaptop
Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

2 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

Last login: Wed Oct 26 15:46:02 2022 from 192.168.1.100
(base) ashish@ashishlaptop:~$ client_loop: send disconnect: Connection reset

(base) C:\Users\ashish>
Tags: Technology,SSH,Linux,Windows CMD,

Tuesday, October 25, 2022

Way 3: How isin() works for Plain Pandas and how we have to use to_numpy() for it in PySpark's Pandas API (Ways in which Pandas API on PySpark differs from Plain Pandas)

Download Code

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'alphabets': [
        'alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta', 'theta', 'iota', 'kappa', 'lambda', 'mu', 'nu', 'xi', 'omicron', 'pi', 'rho', 'sigma', 'tau',
        'upsilon', 'phi', 'chi', 'psi', 'omega', # Greek Alphabets
        'ka', 'kh', 'ga', 'gh', 'ng', 'ch', 'chh', 'ja', 'jh', 'ny', 'ta', 'th', 'da', 'dh', 'na', 'ta', 'th', 'da', 'dh', 'na', 'pa', 'ph', 'ba', 'bh', 'ma', 
        'ya', 'ra', 'la', 'va', 'sh', 'sh', 'sa', 'ha', 'ksh', 'tr', 'gy', 'shr' # Hindi Consonants
    ]
})

df['first_letter'] = df['alphabets'].str[0] # Won't work for Pandas API on PySpark 

ixs = np.random.permutation(df.shape[0])
split_pct = 0.5

train_ixs = ixs[:round(len(ixs) * split_pct)]
test_ixs = ixs[round(len(ixs) * split_pct):]

df_train = df.iloc[train_ixs]
df_test = df.iloc[test_ixs]

df_train.head()

df_test.head()
not_in_train_but_in_test = df_test[-(df_test.first_letter.isin(df_train.first_letter))]
import pyspark
print(pyspark.__version__)
# 3.3.0

from pyspark import pandas as ppd

df_ppd = ppd.DataFrame({
    'alphabets': [
        'alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta', 'theta', 'iota', 'kappa', 'lambda', 'mu', 'nu', 'xi', 'omicron', 'pi', 'rho', 'sigma', 'tau',
        'upsilon', 'phi', 'chi', 'psi', 'omega', # Greek Alphabets
        'ka', 'kh', 'ga', 'gh', 'ng', 'ch', 'chh', 'ja', 'jh', 'ny', 'ta', 'th', 'da', 'dh', 'na', 'ta', 'th', 'da', 'dh', 'na', 'pa', 'ph', 'ba', 'bh', 'ma',
        'ya', 'ra', 'la', 'va', 'sh', 'sh', 'sa', 'ha', 'ksh', 'tr', 'gy', 'shr' # Hindi Consonants
    ]
})

df_ppd['first_letter'] = df_ppd['alphabets'].apply(lambda x: x[0])

df_ppd_train = df_ppd.iloc[train_ixs]
df_ppd_test = df_ppd.iloc[test_ixs]

Errors: We cannot filter PySpark's Pandas API based DataFrame using the same code we used for Pure Pandas DataFrame

1. not_in_train_but_in_test = df_ppd_test[-(df_ppd_test.first_letter.isin(df_ppd_train.first_letter))] --------------------------------------------------------------------------- PandasNotImplementedError Traceback (most recent call last) Cell In [62], line 1 ----> 1 not_in_train_but_in_test = df_ppd_test[-(df_ppd_test.first_letter.isin(df_ppd_train.first_letter))] File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/base.py:880, in IndexOpsMixin.isin(self, values) 873 if not is_list_like(values): 874 raise TypeError( 875 "only list-like objects are allowed to be passed" 876 " to isin(), you passed a [{values_type}]".format(values_type=type(values).__name__) 877 ) 879 values = ( --> 880 cast(np.ndarray, values).tolist() if isinstance(values, np.ndarray) else list(values) 881 ) 883 other = [SF.lit(v) for v in values] 884 scol = self.spark.column.isin(other) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/series.py:6485, in Series.__iter__(self) 6484 def __iter__(self) -> None: -> 6485 return MissingPandasLikeSeries.__iter__(self) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/missing/__init__.py:23, in unsupported_function..unsupported_function(*args, **kwargs) 22 def unsupported_function(*args, **kwargs): ---> 23 raise PandasNotImplementedError( 24 class_name=class_name, method_name=method_name, reason=reason 25 ) PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead. 2. df_ppd_test.first_letter.isin(df_ppd_train.first_letter) --------------------------------------------------------------------------- PandasNotImplementedError Traceback (most recent call last) Cell In [63], line 1 ----> 1 df_ppd_test.first_letter.isin(df_ppd_train.first_letter) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/base.py:880, in IndexOpsMixin.isin(self, values) 873 if not is_list_like(values): 874 raise TypeError( 875 "only list-like objects are allowed to be passed" 876 " to isin(), you passed a [{values_type}]".format(values_type=type(values).__name__) 877 ) 879 values = ( --> 880 cast(np.ndarray, values).tolist() if isinstance(values, np.ndarray) else list(values) 881 ) 883 other = [SF.lit(v) for v in values] 884 scol = self.spark.column.isin(other) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/series.py:6485, in Series.__iter__(self) 6484 def __iter__(self) -> None: -> 6485 return MissingPandasLikeSeries.__iter__(self) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/missing/__init__.py:23, in unsupported_function..unsupported_function(*args, **kwargs) 22 def unsupported_function(*args, **kwargs): ---> 23 raise PandasNotImplementedError( 24 class_name=class_name, method_name=method_name, reason=reason 25 ) PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

Use of: DataFrame.to_numpy() → numpy.ndarray

Returns: A NumPy ndarray representing the values in this DataFrame or Series. Note: This method should only be used if the resulting NumPy ndarray is expected to be small, as all the data is loaded into the driver’s memory.
df_ppd_test.first_letter.isin(df_ppd_train.first_letter.to_numpy())

0     False
1     False
2      True
3      True
4     False
6     False
7      True
9     False
10     True
12     True
13    False
23     True
24    False
25    False
28     True
30     True
31     True
33     True
34     True
39     True
41     True
43     True
44     True
45     True
46    False
47    False
49    False
53     True
56    False
57    False
58     True
Name: first_letter, dtype: bool

not_in_train_but_in_test = df_ppd_test[- (
    df_ppd_test.first_letter.isin(
        df_ppd_train.first_letter.to_numpy()
    )
)]
Tags: Technology,Spark,

Way 2: Difference in how access to str representation is provided (Ways in which Pandas API on PySpark differs from Plain Pandas)

Download Code

import seaborn as sns
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark import SparkContext
from pyspark.sql import SQLContext # Main entry point for DataFrame and SQL functionality.

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

import pyspark
print(pyspark.__version__)

3.3.0


import pandas as pd
df_student = pd.read_csv('./input/student.csv')
df_student

Aim: To retrieve the first letter from a column of string type

In Pandas

df_student['first_letter'] = df_student['FirstName'].str[0]
df_student

In Pandas API on Spark

from pyspark import pandas as ppd

df_student_ppd = ppd.read_csv('./input/student.csv')
df_student_ppd

Errors in Pandas API on Spark when we try with the way of Plain Pandas

1.

df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str[0]

# In Pandas API on Spark:
TypeError: 'StringMethods' object is not subscriptable

2.

df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str

TypeError: Column assignment doesn't support type StringMethods
# i.e. pyspark.pandas.strings.StringMethods, as shown below.

3.

df_student_ppd['FirstName'].str

<pyspark.pandas.strings.StringMethods at 0x7f7474157520>

How we resolved it:

df_student_ppd['FirstName'] = df_student_ppd['FirstName'].astype(str)
# If we do not do the above transformation, None values will result in an error:
# TypeError: 'NoneType' object is not subscriptable

df_student_ppd['first_letter'] = df_student_ppd['FirstName'].apply(lambda x: x[0])

# Warning:
# /home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
#   for column, series in pdf.iteritems():

df_student_ppd
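A possible alternative to the apply(lambda ...) workaround (not from the original post) is the explicit str.slice() method, which the Pandas API on Spark exposes on its StringMethods object even though subscripting is not supported. This assumes str.slice is available in your pyspark version:

# Sketch: take the first character via str.slice(0, 1) instead of .str[0]
df_student_ppd['FirstName'] = df_student_ppd['FirstName'].astype(str)
df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str.slice(0, 1)
df_student_ppd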
Tags: Technology,Spark

Way 1: In Reading null and NA values (Ways in which Pandas API on PySpark differs from Plain Pandas)



import seaborn as sns
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark import SparkContext
from pyspark.sql import SQLContext # Main entry point for DataFrame and SQL functionality.

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)
import pyspark
print(pyspark.__version__)


3.3.0


with open('./input/student.csv', mode = 'r', encoding = 'utf8') as f:
    data = f.readlines()

data



['sno,FirstName,LASTNAME\n',
'one,Ram,\n',
'two,,Sharma\n',
'three,Shyam,NA\n',
'four,Kabir,\n',
'five,NA,Singh\n']


import pandas as pd

df_student = pd.read_csv('./input/student.csv')
df_student.head()

When you load a Pandas DataFrame by reading from a CSV, blank values and 'NA' values are converted to 'NaN' values by default as shown above.

print(type(df_student))
# <class 'pandas.core.frame.DataFrame'>

df_student.fillna('Not Applicable', inplace = True) # Handles blank and 'NA' values both.
df_student
from pyspark import pandas as ppd

df_student_pyspark = ppd.read_csv('./input/student.csv')

type(df_student_pyspark)
# pyspark.pandas.frame.DataFrame

df_student_pyspark
df_student_pyspark.fillna('Not Applicable', inplace = True) # Handles blank (None) values.
df_student_pyspark
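Since fillna() on the Pandas-on-Spark DataFrame only handles the blank (None) values, the literal 'NA' strings survive the read. A hedged follow-up sketch, assuming DataFrame.replace() behaves as in plain pandas for this simple case, is to normalise them explicitly:

# Sketch: also map the literal 'NA' strings to 'Not Applicable'
df_student_pyspark = df_student_pyspark.replace('NA', 'Not Applicable')
df_student_pyspark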
Tags: Technology,Spark

Monday, October 24, 2022

Creating a three node Hadoop cluster using Ubuntu OS (Apr 2020)

Dated: 28 Apr 2020
Note about the setup: We are running the Ubuntu OS(s) on top of Windows via VirtualBox.

1. Setting hostname in three Guest OS(s)

$ sudo gedit /etc/hostname

The hostnames for the three machines are master, slave1, and slave2.

ON MASTER (Host OS IP: 192.168.1.12)

$ cat /etc/hosts
192.168.1.12 master
192.168.1.3 slave1
192.168.1.4 slave2

2. ON SLAVE2 (Host OS IP: 192.168.1.4)

$ cat /etc/hostname
slave2

$ cat /etc/hosts
192.168.1.12 master
192.168.1.3 slave1
192.168.1.4 slave2

3. FOLLOW THE STEPS MENTIONED FOR SLAVE2 ALSO FOR SLAVE1 (Host OS IP: 192.168.1.3)

4. Configuring Key Based Login

Set up SSH on every node so that the nodes can communicate with one another without any prompt for a password. Check this link for: Steps of Doing SSH Setup

5. Setting up ".bashrc" on each system (master, slave1, slave2)

$ sudo gedit ~/.bashrc

Add the below lines at the end of the file:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop

6. Follow all the nine steps from the article below to setup Hadoop on "master" machine

Getting started with Hadoop on Ubuntu in VirtualBox

On "master"

7. Set NameNode Location

Update your $HADOOP_HOME/etc/hadoop/core-site.xml file to set the NameNode location to master on port 9000:

$HADOOP_HOME: /usr/local/hadoop

Code:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

8. Set path for HDFS

Edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file to resemble the following configuration:

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

9. Set YARN as Job Scheduler

Edit the mapred-site.xml file ($HADOOP_HOME/etc/hadoop/mapred-site.xml), setting YARN as the default framework for MapReduce operations:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
</configuration>

10. Configure YARN

Edit yarn-site.xml ($HADOOP_HOME/etc/hadoop/yarn-site.xml), which contains the configuration options for YARN. In the value field for yarn.resourcemanager.hostname, replace 192.168.1.12 with the public IP address of "master":

<configuration>
    <property>
        <name>yarn.acl.enable</name>
        <value>0</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.1.12</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

11. Configure Workers

The workers file is used by the startup scripts to start the required daemons on all nodes. Edit this file:

$HADOOP_HOME/etc/hadoop/workers

to include both of the worker nodes:

slave1
slave2

12. Configure Memory Allocation (Two steps)

A) Edit $HADOOP_HOME/etc/hadoop/yarn-site.xml and add the following lines:

$ sudo gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

B) Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml and add the following lines:

$ sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>512</value>
</property>
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>256</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>256</value>
</property>

13. Duplicate Config Files on Each Node

Copy the Hadoop configuration files to the worker nodes:

$ scp -r /usr/local/hadoop/etc/* ashish@slave1:/usr/local/hadoop/etc/
$ scp -r /usr/local/hadoop/etc/* ashish@slave2:/usr/local/hadoop/etc/

When you are copying the contents of "/etc", the following file should be modified to contain the correct JAVA_HOME for each of the destination nodes:

/usr/local/hadoop/etc/hadoop/hadoop-env.sh

14. Format HDFS

HDFS needs to be formatted like any classical file system. On "master", run the following command:

$ hdfs namenode -format

Your Hadoop installation is now configured and ready to run.

15. ==> Start and Stop HDFS

Start the HDFS by running the following script from master:

/usr/local/hadoop/sbin/start-dfs.sh

This will start NameNode and SecondaryNameNode on master, and DataNode on slave1 and slave2, according to the configuration in the workers config file.

Check that every process is running with the jps command on each node. On master, you should see the following (the PID number will be different):

21922 Jps
21603 NameNode
21787 SecondaryNameNode

And on slave1 and slave2 you should see the following:

19728 DataNode
19819 Jps

To stop HDFS on the master and worker nodes, run the following command from master:

stop-dfs.sh

16. ==> Monitor your HDFS Cluster

Point your browser to http://master:9870/dfshealth.html, where "master" resolves to the IP address of your master node, and you'll get a user-friendly monitoring console.

Tags: Technology,Big Data,

Sunday, October 23, 2022

spark-submit For Two Node Spark Cluster With Spark's Standalone RM For Pi Computation (2022 Oct 23)

Previously: Creating Two Node Spark Cluster With Two Worker Nodes and One Master Node Using Spark's Standalone Resource Manager on Ubuntu Machines

Issue

(base) ashish@ashishlaptop:/usr/local/spark$ spark-submit --master spark://ashishlaptop:7077 examples/src/main/python/pi.py 100 22/10/23 15:14:36 INFO SparkContext: Running Spark version 3.3.0 22/10/23 15:14:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/10/23 15:14:36 INFO ResourceUtils: ============================================================== 22/10/23 15:14:36 INFO ResourceUtils: No custom resources configured for spark.driver. 22/10/23 15:14:36 INFO ResourceUtils: ============================================================== 22/10/23 15:14:36 INFO SparkContext: Submitted application: PythonPi 22/10/23 15:14:36 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 22/10/23 15:14:36 INFO ResourceProfile: Limiting resource is cpu 22/10/23 15:14:36 INFO ResourceProfileManager: Added ResourceProfile id: 0 22/10/23 15:14:36 INFO SecurityManager: Changing view acls to: ashish 22/10/23 15:14:36 INFO SecurityManager: Changing modify acls to: ashish 22/10/23 15:14:36 INFO SecurityManager: Changing view acls groups to: 22/10/23 15:14:36 INFO SecurityManager: Changing modify acls groups to: 22/10/23 15:14:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set() 22/10/23 15:14:37 INFO Utils: Successfully started service 'sparkDriver' on port 41631. 22/10/23 15:14:37 INFO SparkEnv: Registering MapOutputTracker 22/10/23 15:14:37 INFO SparkEnv: Registering BlockManagerMaster 22/10/23 15:14:37 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 22/10/23 15:14:37 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 22/10/23 15:14:37 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 22/10/23 15:14:37 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-9599974d-836e-482e-bcf1-5c6e15c29ce9 22/10/23 15:14:37 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB 22/10/23 15:14:37 INFO SparkEnv: Registering OutputCommitCoordinator 22/10/23 15:14:37 INFO Utils: Successfully started service 'SparkUI' on port 4040. 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://ashishlaptop:7077... 22/10/23 15:14:38 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 45 ms (0 ms spent in bootstraps) 22/10/23 15:14:38 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20221023151438-0000 22/10/23 15:14:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44369. 
22/10/23 15:14:38 INFO NettyBlockTransferService: Server created on ashishlaptop:44369 22/10/23 15:14:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 22/10/23 15:14:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ashishlaptop, 44369, None) 22/10/23 15:14:38 INFO BlockManagerMasterEndpoint: Registering block manager ashishlaptop:44369 with 366.3 MiB RAM, BlockManagerId(driver, ashishlaptop, 44369, None) 22/10/23 15:14:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ashishlaptop, 44369, None) 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023151438-0000/0 on worker-20221023135355-192.168.1.142-43143 (192.168.1.142:43143) with 4 core(s) 22/10/23 15:14:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ashishlaptop, 44369, None) 22/10/23 15:14:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023151438-0000/0 on hostPort 192.168.1.142:43143 with 4 core(s), 1024.0 MiB RAM 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023151438-0000/1 on worker-20221023135358-192.168.1.106-44471 (192.168.1.106:44471) with 2 core(s) 22/10/23 15:14:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023151438-0000/1 on hostPort 192.168.1.106:44471 with 2 core(s), 1024.0 MiB RAM 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023151438-0000/0 is now RUNNING 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023151438-0000/1 is now RUNNING 22/10/23 15:14:39 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 22/10/23 15:14:40 INFO SparkContext: Starting job: reduce at /usr/local/spark/examples/src/main/python/pi.py:42 22/10/23 15:14:41 INFO DAGScheduler: Got job 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) with 100 output partitions 22/10/23 15:14:41 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) 22/10/23 15:14:41 INFO DAGScheduler: Parents of final stage: List() 22/10/23 15:14:41 INFO DAGScheduler: Missing parents: List() 22/10/23 15:14:41 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42), which has no missing parents 22/10/23 15:14:41 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.3 KiB, free 366.3 MiB) 22/10/23 15:14:41 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.5 KiB, free 366.3 MiB) 22/10/23 15:14:41 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ashishlaptop:44369 (size: 8.5 KiB, free: 366.3 MiB) 22/10/23 15:14:41 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1513 22/10/23 15:14:41 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) 22/10/23 15:14:41 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks resource profile 0 22/10/23 15:14:43 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.142:37452) with ID 0, ResourceProfileId 0 22/10/23 15:14:43 INFO BlockManagerMasterEndpoint: 
Registering block manager 192.168.1.142:34419 with 366.3 MiB RAM, BlockManagerId(0, 192.168.1.142, 34419, None) 22/10/23 15:14:43 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.1.142, executor 0, partition 0, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (192.168.1.142, executor 0, partition 1, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:43 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2) (192.168.1.142, executor 0, partition 2, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:43 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (192.168.1.142, executor 0, partition 3, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.142:34419 (size: 8.5 KiB, free: 366.3 MiB) 22/10/23 15:14:46 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (192.168.1.142, executor 0, partition 4, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:46 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5) (192.168.1.142, executor 0, partition 5, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:46 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6) (192.168.1.142, executor 0, partition 6, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:46 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7) (192.168.1.142, executor 0, partition 7, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:46 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.106:44292) with ID 1, ResourceProfileId 0 22/10/23 15:14:46 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) (192.168.1.142 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 540, in main raise RuntimeError( RuntimeError: Python in worker has different version 3.10 than that in driver 3.9, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. ... 22/10/23 15:14:47 INFO SparkContext: Invoking stop() from shutdown hook 22/10/23 15:14:47 INFO SparkUI: Stopped Spark web UI at http://ashishlaptop:4040 22/10/23 15:14:47 INFO StandaloneSchedulerBackend: Shutting down all executors 22/10/23 15:14:47 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down 22/10/23 15:14:47 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 22/10/23 15:14:47 INFO MemoryStore: MemoryStore cleared 22/10/23 15:14:47 INFO BlockManager: BlockManager stopped 22/10/23 15:14:47 INFO BlockManagerMaster: BlockManagerMaster stopped 22/10/23 15:14:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 
22/10/23 15:14:47 INFO SparkContext: Successfully stopped SparkContext 22/10/23 15:14:47 INFO ShutdownHookManager: Shutdown hook called 22/10/23 15:14:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-c60126be-f479-4617-8548-ad0ca7f00763/pyspark-40737be6-41de-4d50-859d-88e13123232b 22/10/23 15:14:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-0915b97c-253d-4807-9eb6-e8f3d1a7019c 22/10/23 15:14:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-c60126be-f479-4617-8548-ad0ca7f00763 (base) ashish@ashishlaptop:/usr/local/spark$

Debugging

(base) ashish@ashishlaptop:/usr/local/spark$ echo $PYSPARK_PYTHON

(base) ashish@ashishlaptop:/usr/local/spark$ echo $PYSPARK_DRIVER_PYTHON

(base) ashish@ashishlaptop:/usr/local/spark$

Both are empty.

Setting the environment variables

(base) ashish@ashishlaptop:/usr/local/spark$ which python
/home/ashish/anaconda3/bin/python

(base) ashish@ashishlaptop:/usr/local/spark$ /home/ashish/anaconda3/bin/python
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

(base) ashish@ashishlaptop:/usr/local/spark$ sudo nano ~/.bashrc
[sudo] password for ashish:
(base) ashish@ashishlaptop:/usr/local/spark$

(base) ashish@ashishlaptop:/usr/local/spark$ tail ~/.bashrc
unset __conda_setup
# <<< conda initialize <<<

export PATH="/home/ashish/.local/bin:$PATH"
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export PATH="$PATH:/usr/local/spark/bin"
export PYSPARK_PYTHON="/home/ashish/anaconda3/bin/python"
export PYSPARK_DRIVER_PYTHON="/home/ashish/anaconda3/bin/python"

(base) ashish@ashishlaptop:/usr/local/spark$ source ~/.bashrc

(base) ashish@ashishlaptop:/usr/local/spark$ echo $PYSPARK_PYTHON
/home/ashish/anaconda3/bin/python
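An alternative to editing ~/.bashrc (a sketch, not from the original post) is to point both variables at the driver's interpreter from inside the PySpark script itself, before the SparkContext is created; this assumes the same interpreter path also exists on every worker node:

import os
import sys

# Use the interpreter running this driver script for the Python workers too.
# The path held by sys.executable must be valid on the worker machines as well.
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable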

Logs After Issue Resolution

(base) ashish@ashishlaptop:/usr/local/spark$ spark-submit --master spark://ashishlaptop:7077 examples/src/main/python/pi.py 100
22/10/23 15:30:51 INFO SparkContext: Running Spark version 3.3.0
22/10/23 15:30:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/23 15:30:52 INFO ResourceUtils: ==============================================================
22/10/23 15:30:52 INFO ResourceUtils: No custom resources configured for spark.driver.
22/10/23 15:30:52 INFO ResourceUtils: ==============================================================
22/10/23 15:30:52 INFO SparkContext: Submitted application: PythonPi
22/10/23 15:30:52 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/10/23 15:30:52 INFO ResourceProfile: Limiting resource is cpu
22/10/23 15:30:52 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/10/23 15:30:52 INFO SecurityManager: Changing view acls to: ashish
22/10/23 15:30:52 INFO SecurityManager: Changing modify acls to: ashish
22/10/23 15:30:52 INFO SecurityManager: Changing view acls groups to:
22/10/23 15:30:52 INFO SecurityManager: Changing modify acls groups to:
22/10/23 15:30:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set()
22/10/23 15:30:52 INFO Utils: Successfully started service 'sparkDriver' on port 41761.
22/10/23 15:30:52 INFO SparkEnv: Registering MapOutputTracker
22/10/23 15:30:52 INFO SparkEnv: Registering BlockManagerMaster
22/10/23 15:30:52 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/10/23 15:30:52 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/10/23 15:30:52 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/10/23 15:30:52 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ffa15e79-7af0-41f9-87eb-fce866f17ed8
22/10/23 15:30:53 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
22/10/23 15:30:53 INFO SparkEnv: Registering OutputCommitCoordinator
22/10/23 15:30:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/10/23 15:30:53 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://ashishlaptop:7077...
22/10/23 15:30:53 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 58 ms (0 ms spent in bootstraps)
22/10/23 15:30:53 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20221023153053-0001
22/10/23 15:30:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023153053-0001/0 on worker-20221023135355-192.168.1.142-43143 (192.168.1.142:43143) with 4 core(s)
22/10/23 15:30:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023153053-0001/0 on hostPort 192.168.1.142:43143 with 4 core(s), 1024.0 MiB RAM
22/10/23 15:30:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32809.
22/10/23 15:30:53 INFO NettyBlockTransferService: Server created on ashishlaptop:32809
22/10/23 15:30:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/10/23 15:30:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023153053-0001/1 on worker-20221023135358-192.168.1.106-44471 (192.168.1.106:44471) with 2 core(s)
22/10/23 15:30:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023153053-0001/1 on hostPort 192.168.1.106:44471 with 2 core(s), 1024.0 MiB RAM
22/10/23 15:30:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ashishlaptop, 32809, None)
22/10/23 15:30:53 INFO BlockManagerMasterEndpoint: Registering block manager ashishlaptop:32809 with 366.3 MiB RAM, BlockManagerId(driver, ashishlaptop, 32809, None)
22/10/23 15:30:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ashishlaptop, 32809, None)
22/10/23 15:30:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ashishlaptop, 32809, None)
22/10/23 15:30:54 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023153053-0001/0 is now RUNNING
22/10/23 15:30:54 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023153053-0001/1 is now RUNNING
22/10/23 15:30:54 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
22/10/23 15:30:55 INFO SparkContext: Starting job: reduce at /usr/local/spark/examples/src/main/python/pi.py:42
22/10/23 15:30:56 INFO DAGScheduler: Got job 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) with 100 output partitions
22/10/23 15:30:56 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42)
22/10/23 15:30:56 INFO DAGScheduler: Parents of final stage: List()
22/10/23 15:30:56 INFO DAGScheduler: Missing parents: List()
22/10/23 15:30:56 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42), which has no missing parents
22/10/23 15:30:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.4 KiB, free 366.3 MiB)
22/10/23 15:30:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.5 KiB, free 366.3 MiB)
22/10/23 15:30:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ashishlaptop:32809 (size: 8.5 KiB, free: 366.3 MiB)
22/10/23 15:30:56 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1513
22/10/23 15:30:56 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
22/10/23 15:30:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks resource profile 0
22/10/23 15:30:58 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.142:54146) with ID 0, ResourceProfileId 0
22/10/23 15:30:59 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.142:46811 with 366.3 MiB RAM, BlockManagerId(0, 192.168.1.142, 46811, None)
22/10/23 15:30:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.1.142, executor 0, partition 0, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:30:59 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (192.168.1.142, executor 0, partition 1, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:30:59 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2) (192.168.1.142, executor 0, partition 2, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:30:59 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (192.168.1.142, executor 0, partition 3, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/10/23 15:30:59 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.142:46811 (size: 8.5 KiB, free: 366.3 MiB)
22/10/23 15:31:01 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.106:60352) with ID 1, ResourceProfileId 0
22/10/23 15:31:01 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.106:41617 with 366.3 MiB RAM, BlockManagerId(1, 192.168.1.106, 41617, None)
22/10/23 15:31:01 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (192.168.1.106, executor 1, partition 4, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
...
22/10/23 15:31:09 INFO TaskSetManager: Finished task 93.0 in stage 0.0 (TID 93) in 344 ms on 192.168.1.142 (executor 0) (94/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 94.0 in stage 0.0 (TID 94) in 312 ms on 192.168.1.142 (executor 0) (95/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 95.0 in stage 0.0 (TID 95) in 314 ms on 192.168.1.142 (executor 0) (96/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 96.0 in stage 0.0 (TID 96) in 263 ms on 192.168.1.106 (executor 1) (97/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 98.0 in stage 0.0 (TID 98) in 260 ms on 192.168.1.142 (executor 0) (98/100)
22/10/23 15:31:09 INFO TaskSetManager: Finished task 99.0 in stage 0.0 (TID 99) in 256 ms on 192.168.1.142 (executor 0) (99/100)
22/10/23 15:31:10 INFO TaskSetManager: Finished task 97.0 in stage 0.0 (TID 97) in 384 ms on 192.168.1.106 (executor 1) (100/100)
22/10/23 15:31:10 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/10/23 15:31:10 INFO DAGScheduler: ResultStage 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) finished in 13.849 s
22/10/23 15:31:10 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/10/23 15:31:10 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
22/10/23 15:31:10 INFO DAGScheduler: Job 0 finished: reduce at /usr/local/spark/examples/src/main/python/pi.py:42, took 14.106103 s
Pi is roughly 3.142880
22/10/23 15:31:10 INFO SparkUI: Stopped Spark web UI at http://ashishlaptop:4040
22/10/23 15:31:10 INFO StandaloneSchedulerBackend: Shutting down all executors
22/10/23 15:31:10 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
22/10/23 15:31:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/10/23 15:31:10 INFO MemoryStore: MemoryStore cleared
22/10/23 15:31:10 INFO BlockManager: BlockManager stopped
22/10/23 15:31:10 INFO BlockManagerMaster: BlockManagerMaster stopped
22/10/23 15:31:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/10/23 15:31:10 INFO SparkContext: Successfully stopped SparkContext
22/10/23 15:31:11 INFO ShutdownHookManager: Shutdown hook called
22/10/23 15:31:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-6be4655c-e59a-403a-92e8-582583fa3f7d/pyspark-c4d7588d-a23a-4393-b29b-6689d20e7684
22/10/23 15:31:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-f4436e38-d155-4763-bb57-461eb3793d13
22/10/23 15:31:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-6be4655c-e59a-403a-92e8-582583fa3f7d
(base) ashish@ashishlaptop:/usr/local/spark$
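For context, the pi.py example these logs refer to estimates Pi by Monte Carlo sampling: random points are thrown into a 2x2 square and the fraction landing inside the unit circle approximates Pi/4. A condensed sketch of that idea, not a verbatim copy of the bundled script (the value 100 matches the argument passed on the command line above):

# pi_sketch.py -- Monte Carlo estimate of Pi, in the spirit of examples/src/main/python/pi.py.
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPiSketch").getOrCreate()

partitions = 100                 # the "100" passed to pi.py above
n = 100000 * partitions          # total number of random samples

def inside(_):
    # One random point in the 2x2 square centred at the origin;
    # return 1 if it falls inside the unit circle, else 0.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = (spark.sparkContext
              .parallelize(range(1, n + 1), partitions)
              .map(inside)
              .reduce(add))
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()

Submitted with the same spark-submit command, this prints a value close to the "Pi is roughly 3.142880" line in the logs; more partitions and samples tighten the estimate.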
Tags: Technology,Spark,