Saturday, October 29, 2022

Clinical Psychology Books (May 2020)

Download Books

Covered/Read

1. The Man Who Mistook His Wife for a Hat by Oliver Sacks
2. Maps of Meaning: The Architecture of Belief by Jordan B. Peterson (1999)
3. 12 Rules for Life by Jordan B. Peterson
4. Beyond Order: 12 More Rules for Life by Jordan B. Peterson
5. The Body Keeps the Score: Brain, Mind, and Body in the Healing of Trauma by Bessel van der Kolk
6. The Boy Who Was Raised as a Dog: And Other Stories from a Child Psychiatrist's Notebook. What Traumatized Children Can Teach Us about Loss, Love, and Healing by Bruce D. Perry
7. An Unquiet Mind by Kay Redfield Jamison
8. My Little Epiphanies by Aisha Chaudhary
9. Expressive Writing: Words That Heal by James W. Pennebaker and John E. Evans (2014)
10. The Anxiety and Phobia Workbook (6th edition, 2015) by Edmund Bourne

Pending (Too Technical)

1. Abnormal and Clinical Psychology: An Introductory Textbook Book by Paul Bennett
2. The handbook of child and adolescent clinical psychology Book by Alan Carr
3. The Polyvagal Theory in Therapy: Engaging the Rhythm of Regulation Book by Deb Dana
4. The theory and practice of group psychotherapy Book by Irvin D. Yalom
5. Clinical Psychology: An Introduction Book by Alan Carr
6. Clinical Psychology 1962 Editor: Graham Davey
7. Publication Manual of the American Psychological Association Book by American Psychological Association
8. Stahl's Essential Psychopharmacology Textbook by Stephen M. Stahl
9. Fish's Clinical Psychopathology: Signs and Symptoms in Psychiatry Book by B. Kelly and Patricia Casey
10. A dictionary of psychology Originally published: 2001 Author: Andrew Colman Genre: Dictionary
11. Abnormal Psychology Textbook by Ronald J. Comer
12. Existential psychotherapy Book by Irvin D. Yalom
13. Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5 Book by American Psychiatric Association
14. Clinical Psychology: A Very Short Introduction Book by Katie Aafjes-van Doorn and Susan Llewelyn
15. Introduction to Clinical Psychology Textbook 1980 Lynda A. Heiden and Michel Hersen
21. Clinical Psychology: Assessment, Treatment, and Research Originally published: 2009 Genre: Self-help book Editors: Steven K. Huprich, David C. S. Richard
22. The Oxford Handbook of Clinical Psychology Originally published: 12 November 2010 Genre: Reference work Editor: David H. Barlow
23. Introduction to Clinical Psychology Textbook by Geoffrey L. Thorpe and Jeffrey Hecker
24. Clinical Psychology in Practice Originally published: 2009 Editor: Paul Kennedy
25. Introducing Psychology Originally published: 10 August 1994 Author: Nigel Benson Genres: Study guide, Non-fiction comics
26. What is Clinical Psychology? Originally published: 13 April 2006 Editors: Susan Llewelyn, David Murphy
27. Theory and Practice of Counseling and Psychotherapy Book by Gerald Corey
28. The Gift of Therapy Book by Irvin D. Yalom
29. The Myth of Mental Illness Book by Thomas Szasz
30. Trauma and Recovery Book by Judith Lewis Herman
31. Brain Lock: Free Yourself from Obsessive-Compulsive Behavior Book by Jeffrey M. Schwartz
32. Motivational Interviewing in Health Care: Helping Patients Change Behavior Book by Christopher C. Butler, Stephen Rollnick, and William Richard Miller
33. Many Lives, Many Masters Book by Brian Weiss
34. Clinical Handbook of Psychological Disorders, Fifth Edition: A Step-by-Step Treatment Manual Originally published: 1985 Editor: David H. Barlow Genre: Reference work
35. Skills Training Manual for Treating Borderline Personality Disorder Book by Marsha M. Linehan Originally published: 14 May 1993
36. Diagnostic and Statistical Manual of Mental Disorders Originally published: 1952 Author: American Psychiatric Association Original language: English
37. Abnormal Psychology: Clinical Perspectives on Psychological Disorders Originally published: 2000 Authors: Richard P. Halgin, Susan Krauss Whitbourne
38. Coping Skills for Kids Workbook: Over 75 Coping Strategies to Help Kids ... Book by Janine Halloran Originally published: 4 June 2016
39. Beyond Behaviors: Using Brain Science and Compassion to Understand and Solve Children's Behavioral Challenges Book by Mona Delahooke
40. Seeking Safety: A Treatment Manual for PTSD and Substance Abuse Book by Lisa M. Najavits Originally published: 2002
41. DBT Skills Training Manual, Second Edition Book by Marsha M. Linehan Originally published: 19 October 2014
42. ACT Made Simple: An Easy-To-Read Primer on Acceptance and Commitment Therapy Book by Russ Harris Originally published: November 2009
43. Psychopathology: Research, Assessment and Treatment in Clinical Psychology Textbook by Graham Davey Originally published: 29 September 2008
44. The Interpretation of Dreams Book by Sigmund Freud Originally published: 4 November 1899 Original title: Die Traumdeutung Original language: German Subject: Dream interpretation
45. DSM-5 Made Easy: The Clinician's Guide to Diagnosis Book by James Roy Morrison Originally published: 11 April 2014
46. Insider's Guide to Graduate Programs in Clinical and Counseling Psychology Book by John C. Norcross and Michael A. Sayette Originally published: 10 March 1996
47. The Whole-Brain Child: 12 Revolutionary Strategies to Nurture Your Child's ... Book by Daniel J. Siegel and Tina Payne Bryson Originally published: 4 October 2011
48. Madness Explained: Psychosis and Human Nature Book by Richard P. Bentall
50. Becoming a Clinical Psychologist: Everything You Need to Know Book by Amanda Mwale and Steven Mayers
51. Clinical Psychology: Science, Practice, and Culture Textbook by Andrew M. Pomerantz
52. The Red Book Book by Carl Jung Originally published: 7 October 2009 Original title: Liber Novus ("The New Book") Original language: German Page count: 404. The Red Book is a red leather-bound folio manuscript crafted by the Swiss psychiatrist Carl Gustav Jung between 1915 and about 1930. It recounts and comments upon the author's psychological experiments between 1913 and 1916, and is based on manuscripts first drafted by Jung in 1914–15 and 1917.
Tags: List of Books,Psychology,

Friday, October 28, 2022

One Hot Encoding Using Pandas' get_dummies() Method on Titanic Dataset

Download Data and Code

import pandas as pd
df = pd.read_csv('titanic_train.csv')
print(df.head())

print("Number of Unique Values in The Column 'Sex':") print(df['Sex'].nunique())
# 2 # This is also the width of it's one-hot encoding. print("Number of Unique Values in The Column For 'Passenger Class':") print(df['Pclass'].nunique()) # 3 # This is also the width of one-hot encoding for 'Passenger Class'.

Let us first see what happens when we do one-hot encoding of column 'Sex'.

enc_gender_df = pd.get_dummies(df, columns = ['Sex'])
print(enc_gender_df.head())

# Original 'Sex' column:    One-hot encoded columns:
# Sex                       Sex_female  Sex_male
# male                      0           1
# female                    1           0
# female                    1           0
# female                    1           0
# male                      0           1

enc_pc_df = pd.get_dummies(df, columns = ['Pclass'])
print(enc_pc_df.head())

# Pclass_1  Pclass_2  Pclass_3
# 0         0         1
# 1         0         0
# 0         0         1
# 1         0         0
# 0         0         1
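As an aside (not part of the original walkthrough), get_dummies() also accepts a drop_first parameter; setting it to True encodes k categories in k-1 columns by dropping the first category:

enc_gender_df_2 = pd.get_dummies(df, columns = ['Sex'], drop_first = True)
print(enc_gender_df_2.head())
# Only 'Sex_male' remains; a row is 'female' when 'Sex_male' is 0.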

Fun Facts

1. Scikit-learn's LabelEncoder encodes labels in ascending alphabetical order.
2. Besides this ascending alphabetical ordering, three other orderings are common:
   2.1. Descending alphabetical order
   2.2. Ascending frequency-based order
   2.3. Descending frequency-based order
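A minimal sketch (added here for illustration, not taken from the original post) of the first fun fact, assuming scikit-learn is installed:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['male', 'female', 'female', 'male']))
# [1 0 0 1] -> 'female' sorts before 'male' alphabetically, so female = 0 and male = 1
print(le.classes_)
# ['female' 'male']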
Tags: Technology,Machine Learning,

Elon Musk clarifying why he took over Twitter (2022 Oct 27)

Elon Musk's first day at twitter headquarters. "I wanted to reach out personally to share my motivation in acquiring Twitter. There has been much speculation about why I bought Twitter and what I think about advertising. Most of it has been wrong. The reason I acquired Twitter is because it is important to the future of civilization to have a common digital town square, where a wide range of beliefs can be debated in a healthy manner, without resorting to violence. There is currently great danger that social media will splinter into far right wing and far left wing echo chambers that generate more hate and divide our society. In the relentless pursuit of clicks, much of traditional media has fueled and catered to those polarised extremes, as they believe that is what brings in the money, but, in doing so, the opportunity for dialogue is lost. That is why I bought Twitter. I didn't do it because it would be easy. I didn't do it to make more money. I did it to try to help humanity, whom I love. And I do so with humility, recognizing that failure in pursuing this goal, despite our best efforts, is a very real possibility. That said, Twitter obviously cannot become a free-for-all hellscape, where anything can be said with no consequences! In addition to adhering to the laws of the land, our platform must be warm and welcoming to all, where you can choose your desired experience according to your preferences, just as you can choose, for example, to see movies or play video games ranging from all ages to mature." - Elon Musk
Tags: Investment,

Wednesday, October 26, 2022

Way 4: With respect to DataFrame.replace() Method (Ways in which Pandas API on PySpark differs from Plain Pandas)

Download Code

Working in Pandas

import pandas as pd

df = pd.DataFrame({
    'dummy_col': ["alpha", "beta", "gamma", "", "-", "0", "N/A", "-_-", "NA",
                  "delta", "epsilon", "zeta", "eta", "theta"]
})

df['cleaned'] = df.replace(to_replace = ["", "-", "0", "N/A", "-_-", "NA"], value = "Not Applicable")

Not working in Pandas API on PySpark

from pyspark import pandas as ppd

df_ppd = ppd.DataFrame({
    'dummy_col': ["alpha", "beta", "gamma", "", "-", "0", "N/A", "-_-", "NA",
                  "delta", "epsilon", "zeta", "eta", "theta"]
})

Error

df_ppd['cleaned'] = df_ppd.replace(to_replace = ["", "-", "0", "N/A", "-_-", "NA"], value = "Not Applicable")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [15], line 1
----> 1 df_ppd['cleaned'] = df_ppd.replace(to_replace =["","-","0","N/A","-_-","NA"], value = "Not Applicable")

File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/frame.py:12355, in DataFrame.__setitem__(self, key, value)
  12352     psdf = self._assign({k: value[c] for k, c in zip(key, field_names)})
  12353 else:
  12354     # Same Series.
> 12355     psdf = self._assign({key: value})
  12357 self._update_internal_frame(psdf._internal)

File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/frame.py:4921, in DataFrame._assign(self, kwargs)
   4917 is_invalid_assignee = (
   4918     not (isinstance(v, (IndexOpsMixin, Column)) or callable(v) or is_scalar(v))
   4919 ) or isinstance(v, MultiIndex)
   4920 if is_invalid_assignee:
-> 4921     raise TypeError(
   4922         "Column assignment doesn't support type " "{0}".format(type(v).__name__)
   4923     )
   4924 if callable(v):
   4925     kwargs[k] = v(self)

TypeError: Column assignment doesn't support type DataFrame
# Workaround: call replace() on the whole DataFrame and keep the result as a new DataFrame,
# instead of assigning the result to a single column.
df_ppd_cleaned = df_ppd.replace(to_replace = ["", "-", "0", "N/A", "-_-", "NA"], value = "Not Applicable")

# Subsequent replace() calls with inplace = True work as expected on the pandas-on-Spark DataFrame.
df_ppd_cleaned.replace(to_replace = ['Not Applicable', 'alpha'], value = "Still NA", inplace = True)
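A possible alternative (my assumption, not something tested in the original post) is to call replace() on the single Series and assign the resulting Series, which sidesteps the DataFrame-to-column assignment that raised the TypeError above:

# Hypothetical sketch, assuming pyspark.pandas.Series.replace mirrors pandas' Series.replace:
df_ppd['cleaned'] = df_ppd['dummy_col'].replace(to_replace = ["", "-", "0", "N/A", "-_-", "NA"], value = "Not Applicable")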
Tags: Technology,Spark

Termux to get information about my Android device

Welcome to Termux!

Wiki:            https://wiki.termux.com
Community forum: https://termux.com/community
Gitter chat:     https://gitter.im/termux/termux
IRC channel:     #termux on freenode

Working with packages:
* Search packages:   pkg search [query]
* Install a package: pkg install [package]
* Upgrade packages:  pkg upgrade

Subscribing to additional repositories:
* Root:     pkg install root-repo
* Unstable: pkg install unstable-repo
* X11:      pkg install x11-repo

Report issues at https://termux.com/issues

1. Getting OS Info

$ uname
Linux

$ uname -a
Linux localhost 4.14.199-24365169-abX205XXU1AVG1 #2 SMP PREEMPT Tue Jul 5 20:39:23 KST 2022 aarch64 Android

2. Getting Processor Info

$ more /proc/cpuinfo Processor : AArch64 Processor rev 1 (aarch64) processor : 0 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 1 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 2 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 3 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 4 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 5 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 processor : 6 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x3 CPU part : 0xd0a CPU revision : 1 processor : 7 BogoMIPS : 52.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x3 CPU part : 0xd0a CPU revision : 1 Hardware : Unisoc ums512 Serial : 96789ab0ffeb70e8d1320621ab4d084fb1082517682936e1977afc5ae63a3c7b

3. Getting my username

$ whoami
u0_a218

4. Getting Your IP Address

$ ifconfig
Warning: cannot open /proc/net/dev (Permission denied). Limited output.

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 1000  (UNSPEC)

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.102  netmask 255.255.255.0  broadcast 192.168.1.255
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 1000  (UNSPEC)

5. Checking RAM Usage

$ free -h
        total   used    free    shared  buff/cache  available
Mem:    2.4Gi   1.9Gi   113Mi   12Mi    493Mi       448Mi
Swap:   2.5Gi   1.2Gi   1.3Gi

6. Checking Space on Hard Disk

$ df -h Filesystem Size Used Avail Use% Mounted on /dev/block/dm-4 3.2G 3.2G 2.5M 100% / tmpfs 1.2G 1.3M 1.2G 1% /dev tmpfs 1.2G 0 1.2G 0% /mnt /dev/block/dm-1 122M 122M 0 100% /system_ext /dev/block/dm-5 759M 751M 0 100% /vendor /dev/block/dm-6 1.0G 1.0G 0 100% /product /dev/block/dm-7 271M 166M 99M 63% /prism /dev/block/dm-8 31M 408K 30M 2% /optics tmpfs 1.2G 0 1.2G 0% /apex /dev/block/dm-11 1.8M 1.7M 0 100% /apex/com.android.os.statsd@311510000 /dev/block/dm-12 704K 676K 16K 98% /apex/com.android.sdkext@330810010 /dev/block/dm-13 13M 13M 0 100% /apex/com.android.cellbroadcast@330911010 /dev/block/dm-14 15M 15M 0 100% /apex/com.android.permission@330912010 /dev/block/dm-15 7.9M 7.8M 0 100% /apex/com.android.tethering@330911010 /dev/block/dm-16 3.8M 3.7M 0 100% /apex/com.android.resolv@330910000 /dev/block/dm-17 19M 19M 0 100% /apex/com.android.media.swcodec@330443040 /dev/block/dm-18 8.4M 8.4M 0 100% /apex/com.android.mediaprovider@330911040 /dev/block/dm-19 836K 808K 12K 99% /apex/com.android.tzdata@303200001 /dev/block/dm-20 7.2M 7.1M 0 100% /apex/com.android.neuralnetworks@330443000 /dev/block/dm-21 7.8M 7.7M 0 100% /apex/com.android.adbd@330444000 /dev/block/dm-22 4.8M 4.8M 0 100% /apex/com.android.conscrypt@330443020 /dev/block/dm-23 5.6M 5.6M 0 100% /apex/com.android.extservices@330443000 /dev/block/dm-24 748K 720K 16K 98% /apex/com.android.ipsec@330443010 /dev/block/dm-25 5.7M 5.6M 0 100% /apex/com.android.media@330443030 /dev/block/loop21 24M 24M 0 100% /apex/com.android.i18n@1 /dev/block/loop22 5.1M 5.1M 0 100% /apex/com.android.wifi@300000000 /dev/block/loop23 5.0M 5.0M 0 100% /apex/com.android.runtime@1 /dev/block/loop24 236K 72K 160K 32% /apex/com.samsung.android.shell@303013100 /dev/block/loop25 82M 82M 0 100% /apex/com.android.art@1 /dev/block/loop26 232K 92K 136K 41% /apex/com.android.apex.cts.shim@1 /dev/block/loop27 109M 109M 0 100% /apex/com.android.vndk.v30@1 /dev/block/loop28 236K 32K 200K 14% /apex/com.samsung.android.wifi.broadcom@300000000 /dev/block/loop29 236K 32K 200K 14% /apex/com.samsung.android.camera.unihal@301742001 /dev/block/by-name/cache 303M 12M 285M 4% /cache /dev/block/by-name/sec_efs 11M 788K 10M 8% /efs /dev/fuse 22G 8.5G 13G 40% /storage/emulated

7. Print Environment Variables

$ echo $USER

$ echo $HOME
/data/data/com.termux/files/home

8. Print Working Directory

$ pwd
/data/data/com.termux/files/home
Tags: Technology,Android,Linux,

SSH Setup For Accessing Ubuntu From Windows Using SFTP

Getting Basic Info Like Hostname and IP

(base) C:\Users\ashish>hostname
CS3L

(base) C:\Users\ashish>ipconfig

Windows IP Configuration

Ethernet adapter Ethernet 2:
   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . : ad.itli.com

Ethernet adapter Ethernet:
   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . : ad.itli.com

Wireless LAN adapter Wi-Fi:
   Connection-specific DNS Suffix  . :
   IPv6 Address. . . . . . . . . . . : 2401:4900:47f2:5147:b1b2:6d59:f669:1b96
   Temporary IPv6 Address. . . . . . : 2401:4900:47f2:5147:15e3:46:9f5b:8d78
   Link-local IPv6 Address . . . . . : fe80::b1b2:6d59:f669:1b96%13
   IPv4 Address. . . . . . . . . . . : 192.168.1.100
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : fe80::d837:1aff:fe40:b173%13
                                       192.168.1.1

Ethernet adapter Bluetooth Network Connection:
   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . :

Setting up SSH

(base) C:\Users\ashish>mkdir .ssh (base) C:\Users\ashish>dir Volume in drive C is OSDisk Volume Serial Number is ABCD-PQRS Directory of C:\Users\ashish 10/26/2022 03:25 PM <DIR> . 10/26/2022 03:25 PM <DIR> .. 08/16/2022 01:29 PM <DIR> .3T 09/26/2022 08:04 AM 1,288 .bash_history 06/02/2022 10:15 AM <DIR> .cache 05/30/2022 11:39 AM <DIR> .conda 10/26/2022 02:58 PM 89 .dotty_history 08/19/2022 06:42 PM 68 .gitconfig 10/11/2022 02:03 PM <DIR> .ipython 05/30/2022 10:05 AM <DIR> .jupyter 05/30/2022 12:56 PM <DIR> .keras 08/20/2022 11:55 AM 20 .lesshst 07/04/2022 06:09 PM <DIR> .matplotlib 06/30/2022 10:32 AM <DIR> .ms-ad 10/07/2022 09:00 PM 1,457 .python_history 10/26/2022 03:25 PM <DIR> .ssh 09/06/2022 10:13 PM 2,379 .viminfo 05/30/2022 11:34 AM <DIR> .vscode 05/16/2022 03:19 PM <DIR> 3D Objects 10/07/2022 02:50 PM <DIR> Anaconda3 05/16/2022 03:19 PM <DIR> Contacts 10/26/2022 02:57 PM <DIR> Desktop 10/07/2022 06:27 PM <DIR> Documents 10/26/2022 03:18 PM <DIR> Downloads 05/16/2022 03:19 PM <DIR> Favorites 05/16/2022 03:19 PM <DIR> Links 05/16/2022 03:19 PM <DIR> Music 05/16/2022 02:13 PM <DIR> OneDrive 05/16/2022 03:20 PM <DIR> Pictures 05/16/2022 03:19 PM <DIR> Saved Games 05/16/2022 03:20 PM <DIR> Searches 05/30/2022 09:36 AM <DIR> Videos 6 File(s) 5,301 bytes 26 Dir(s) 81,987,842,048 bytes free (base) C:\Users\ashish>cd .ssh (base) C:\Users\ashish\.ssh>dir Volume in drive C is OSDisk Volume Serial Number is ABCD-PQRS Directory of C:\Users\ashish\.ssh 10/26/2022 03:25 PM <DIR> . 10/26/2022 03:25 PM <DIR> .. 0 File(s) 0 bytes 2 Dir(s) 81,987,903,488 bytes free (base) C:\Users\ashish\.ssh>echo "" > id_rsa (base) C:\Users\ashish\.ssh>dir Volume in drive C is OSDisk Volume Serial Number is ABCD-PQRS Directory of C:\Users\ashish\.ssh 10/26/2022 03:26 PM <DIR> . 10/26/2022 03:26 PM <DIR> .. 10/26/2022 03:26 PM 5 id_rsa 1 File(s) 5 bytes 2 Dir(s) 81,987,678,208 bytes free (base) C:\Users\ashish\.ssh>type id_rsa (base) C:\Users\ashish\.ssh> (base) C:\Users\ashish>ssh-keygen -t rsa -f ./.ssh/id_rsa -P "" Generating public/private rsa key pair. ./.ssh/id_rsa already exists. Overwrite (y/n)? y Your identification has been saved in ./.ssh/id_rsa. Your public key has been saved in ./.ssh/id_rsa.pub. The key fingerprint is: SHA256:fGEZHROeTzogrdXwo7haw0g3eXLVZnO9nM0ZtTbIBh8 itlitli\ashish@CS3L The key's randomart image is: +---[RSA 3072]----+ | oo+E .| | . B=+o +| | . B B=*=o| | . B =.Bo+B| | . S = o .=o| | . + B . | | . = | | o . | | . | +----[SHA256]-----+ (base) C:\Users\ashish>

Note This Error While Doing Setup on Windows

CMD> ssh-copy-id -i ./.ssh/id_rsa.pub ashish@192.168.1.100
'ssh-copy-id' is not recognized as an internal or external command, operable program or batch file.

We overcome this issue by manually copying the public RSA key into the 'authorized_keys' file on the remote machine using SFTP.
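(As an aside, untested here: with the OpenSSH client on Windows, a rough equivalent of ssh-copy-id is to pipe the key over SSH, e.g. type .ssh\id_rsa.pub | ssh ashish@192.168.1.151 "cat >> ~/.ssh/authorized_keys". The SFTP route below is what was actually used in this setup.)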

(base) C:\Users\ashish>sftp
usage: sftp [-46aCfpqrv] [-B buffer_size] [-b batchfile] [-c cipher]
            [-D sftp_server_path] [-F ssh_config] [-i identity_file]
            [-J destination] [-l limit] [-o ssh_option] [-P port]
            [-R num_requests] [-S program] [-s subsystem | sftp_server]
            destination

Next Steps of Copying Public Key Onto Remote Machine And Vice-versa

Address of Ubuntu System: ashish@192.168.1.151

(base) C:\Users\ashish>sftp ashish@192.168.1.151
The authenticity of host '192.168.1.151 (192.168.1.151)' can't be established.
ECDSA key fingerprint is SHA256:2hgOVHHgkrT9/6XnK/KDaFQ0DaXLUoW82eeU6oQyTvQ.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
Warning: Permanently added '192.168.1.151' (ECDSA) to the list of known hosts.
ashish@192.168.1.151's password:
Connected to 192.168.1.151.
sftp> ls
Desktop     Documents   Downloads   Music       Pictures    Public      Templates   Videos
anaconda3   nltk_data   snap
sftp> bye

PWD: /home/ashish

sftp> put id_rsa.pub win_auth_key.txt
Uploading id_rsa.pub to /home/ashish/win_auth_key.txt
id_rsa.pub                                  100%  593    89.9KB/s   00:00
sftp>

PWD: /home/ashish/.ssh

sftp> get id_rsa.pub ./ubuntu_id_rsa.pub.txt
Fetching /home/ashish/.ssh/id_rsa.pub to ./ubuntu_id_rsa.pub.txt
/home/ashish/.ssh/id_rsa.pub                100%  573     2.7KB/s   00:00
sftp>
sftp> bye

Steps on Ubuntu Machine

(base) ashish@ashishlaptop:~$ cat win_auth_key.txt
ssh-rsa AAA***vZs= itli\ashish@CS3L
(base) ashish@ashishlaptop:~$

Paste this Public RSA Key in 'authorized_keys' File

(base) ashish@ashishlaptop:~/.ssh$ nano authorized_keys
(base) ashish@ashishlaptop:~/.ssh$ cat authorized_keys
ssh-rsa AAAA***rzFM= ashish@ashishdesktop
ssh-rsa AAAA***GOD0= ashish@ashishlaptop
ssh-rsa AAAA***3vZs= itli\ashish@CS3L
(base) ashish@ashishlaptop:~/.ssh$

Testing The SSH

Back to Windows 10 System

(base) C:\Users\ashish>ssh ashish@ashishlaptop The authenticity of host 'ashishlaptop (192.168.1.151)' can't be established. ECDSA key fingerprint is SHA256:2hgOVHHgkrT9/6XnK/KDaFQ0DaXLUoW82eeU6oQyTvQ. Are you sure you want to continue connecting (yes/no/[fingerprint])? yes Warning: Permanently added 'ashishlaptop' (ECDSA) to the list of known hosts. Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64) * Documentation: https://help.ubuntu.com * Management: https://landscape.canonical.com * Support: https://ubuntu.com/advantage 2 updates can be applied immediately. To see these additional updates run: apt list --upgradable Last login: Wed Oct 26 13:35:44 2022 from 192.168.1.151 (base) ashish@ashishlaptop:~$ (base) ashish@ashishlaptop:~$ ls anaconda3 Desktop Documents Downloads Music nltk_data Pictures Public snap Templates Videos win_auth_key.txt (base) ashish@ashishlaptop:~$ rm win_auth_key.txt (base) ashish@ashishlaptop:~$ ls anaconda3 Desktop Documents Downloads Music nltk_data Pictures Public snap Templates Videos (base) ashish@ashishlaptop:~$ exit logout Connection to ashishlaptop closed. (base) C:\Users\ashish>ssh ashish@ashishlaptop Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64) * Documentation: https://help.ubuntu.com * Management: https://landscape.canonical.com * Support: https://ubuntu.com/advantage 2 updates can be applied immediately. To see these additional updates run: apt list --upgradable Last login: Wed Oct 26 15:46:02 2022 from 192.168.1.100 (base) ashish@ashishlaptop:~$ client_loop: send disconnect: Connection reset (base) C:\Users\ashish>
Tags: Technology,SSH,Linux,Windows CMD,

Tuesday, October 25, 2022

Way 3: How isin() works for Plain Pandas and how we have to use to_numpy() for it in PySpark's Pandas API (Ways in which Pandas API on PySpark differs from Plain Pandas)

Download Code

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'alphabets': [
        'alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta', 'theta', 'iota', 'kappa', 'lambda', 'mu', 'nu', 'xi', 'omicron', 'pi', 'rho', 'sigma', 'tau',
        'upsilon', 'phi', 'chi', 'psi', 'omega', # Greek Alphabets
        'ka', 'kh', 'ga', 'gh', 'ng', 'ch', 'chh', 'ja', 'jh', 'ny', 'ta', 'th', 'da', 'dh', 'na', 'ta', 'th', 'da', 'dh', 'na', 'pa', 'ph', 'ba', 'bh', 'ma', 
        'ya', 'ra', 'la', 'va', 'sh', 'sh', 'sa', 'ha', 'ksh', 'tr', 'gy', 'shr' # Hindi Consonants
    ]
})

df['first_letter'] = df['alphabets'].str[0] # Won't work for Pandas API on PySpark 

ixs = np.random.permutation(df.shape[0])
split_pct = 0.5

train_ixs = ixs[:round(len(ixs) * split_pct)]
test_ixs = ixs[round(len(ixs) * split_pct):]

df_train = df.iloc[train_ixs]
df_test = df.iloc[test_ixs]

df_train.head()

df_test.head()
not_in_train_but_in_test = df_test[-(df_test.first_letter.isin(df_train.first_letter))]
import pyspark
print(pyspark.__version__)

3.3.0

from pyspark import pandas as ppd

df_ppd = ppd.DataFrame({
    'alphabets': [
        'alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta', 'theta', 'iota', 'kappa', 'lambda', 'mu', 'nu', 'xi', 'omicron', 'pi', 'rho', 'sigma', 'tau',
        'upsilon', 'phi', 'chi', 'psi', 'omega', # Greek
        'ka', 'kh', 'ga', 'gh', 'ng', 'ch', 'chh', 'ja', 'jh', 'ny', 'ta', 'th', 'da', 'dh', 'na', 'ta', 'th', 'da', 'dh', 'na', 'pa', 'ph', 'ba', 'bh', 'ma',
        'ya', 'ra', 'la', 'va', 'sh', 'sh', 'sa', 'ha', 'ksh', 'tr', 'gy', 'shr' # Hindi
    ]
})

df_ppd['first_letter'] = df_ppd['alphabets'].apply(lambda x: x[0])

df_ppd_train = df_ppd.iloc[train_ixs]
df_ppd_test = df_ppd.iloc[test_ixs]

Errors: We cannot filter PySpark's Pandas API based DataFrame using the same code we used for Pure Pandas DataFrame

1. not_in_train_but_in_test = df_ppd_test[-(df_ppd_test.first_letter.isin(df_ppd_train.first_letter))] --------------------------------------------------------------------------- PandasNotImplementedError Traceback (most recent call last) Cell In [62], line 1 ----> 1 not_in_train_but_in_test = df_ppd_test[-(df_ppd_test.first_letter.isin(df_ppd_train.first_letter))] File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/base.py:880, in IndexOpsMixin.isin(self, values) 873 if not is_list_like(values): 874 raise TypeError( 875 "only list-like objects are allowed to be passed" 876 " to isin(), you passed a [{values_type}]".format(values_type=type(values).__name__) 877 ) 879 values = ( --> 880 cast(np.ndarray, values).tolist() if isinstance(values, np.ndarray) else list(values) 881 ) 883 other = [SF.lit(v) for v in values] 884 scol = self.spark.column.isin(other) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/series.py:6485, in Series.__iter__(self) 6484 def __iter__(self) -> None: -> 6485 return MissingPandasLikeSeries.__iter__(self) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/missing/__init__.py:23, in unsupported_function..unsupported_function(*args, **kwargs) 22 def unsupported_function(*args, **kwargs): ---> 23 raise PandasNotImplementedError( 24 class_name=class_name, method_name=method_name, reason=reason 25 ) PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead. 2. df_ppd_test.first_letter.isin(df_ppd_train.first_letter) --------------------------------------------------------------------------- PandasNotImplementedError Traceback (most recent call last) Cell In [63], line 1 ----> 1 df_ppd_test.first_letter.isin(df_ppd_train.first_letter) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/base.py:880, in IndexOpsMixin.isin(self, values) 873 if not is_list_like(values): 874 raise TypeError( 875 "only list-like objects are allowed to be passed" 876 " to isin(), you passed a [{values_type}]".format(values_type=type(values).__name__) 877 ) 879 values = ( --> 880 cast(np.ndarray, values).tolist() if isinstance(values, np.ndarray) else list(values) 881 ) 883 other = [SF.lit(v) for v in values] 884 scol = self.spark.column.isin(other) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/series.py:6485, in Series.__iter__(self) 6484 def __iter__(self) -> None: -> 6485 return MissingPandasLikeSeries.__iter__(self) File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/missing/__init__.py:23, in unsupported_function..unsupported_function(*args, **kwargs) 22 def unsupported_function(*args, **kwargs): ---> 23 raise PandasNotImplementedError( 24 class_name=class_name, method_name=method_name, reason=reason 25 ) PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

Use of: DataFrame.to_numpy() → numpy.ndarray

Returns: A NumPy ndarray representing the values in this DataFrame or Series. Note: This method should only be used if the resulting NumPy ndarray is expected to be small, as all the data is loaded into the driver’s memory.
df_ppd_test.first_letter.isin(df_ppd_train.first_letter.to_numpy())

0     False
1     False
2      True
3      True
4     False
6     False
7      True
9     False
10     True
12     True
13    False
23     True
24    False
25    False
28     True
30     True
31     True
33     True
34     True
39     True
41     True
43     True
44     True
45     True
46    False
47    False
49    False
53     True
56    False
57    False
58     True
Name: first_letter, dtype: bool

not_in_train_but_in_test = df_ppd_test[- (
    df_ppd_test.first_letter.isin(
        df_ppd_train.first_letter.to_numpy()
    )
)]
Tags: Technology,Spark,

Way 2: Difference in how access to str representation is provided (Ways in which Pandas API on PySpark differs from Plain Pandas)

Download Code

import seaborn as sns
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark import SparkContext
from pyspark.sql import SQLContext # Main entry point for DataFrame and SQL functionality.

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

import pyspark
print(pyspark.__version__)

3.3.0


import pandas as pd
df_student = pd.read_csv('./input/student.csv')
df_student

Aim: To retrieve the first letter from a column of string type

In Pandas

df_student['first_letter'] = df_student['FirstName'].str[0]
df_student

In Pandas API on Spark

from pyspark import pandas as ppd

df_student_ppd = ppd.read_csv('./input/student.csv')
df_student_ppd

Errors in Pandas API on Spark when we try with the way of Plain Pandas

1.

df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str[0]

# In Pandas API on Spark:
# TypeError: 'StringMethods' object is not subscriptable

2.

df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str

# TypeError: Column assignment doesn't support type StringMethods
# (i.e., pyspark.pandas.strings.StringMethods, as shown below)

3.

df_student_ppd['FirstName'].str

# <pyspark.pandas.strings.StringMethods at 0x7f7474157520>

How we resolved it:

df_student_ppd['FirstName'] = df_student_ppd['FirstName'].astype(str)
# If we do not do the above transformation, None values will result in an error:
# TypeError: 'NoneType' object is not subscriptable

df_student_ppd['first_letter'] = df_student_ppd['FirstName'].apply(lambda x: x[0])

# Warning:
# /home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning:
# iteritems is deprecated and will be removed in a future version. Use .items instead.
#   for column, series in pdf.iteritems():

df_student_ppd
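A possible shortcut (my assumption, not verified in the original post): the StringMethods object of the Pandas API on Spark exposes named methods such as slice(), so the first letter can also be taken without apply():

# Hypothetical alternative, assuming pyspark.pandas.Series.str.slice mirrors pandas' Series.str.slice:
df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str.slice(0, 1)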
Tags: Technology,Spark

Way 1: In Reading null and NA values (Ways in which Pandas API on PySpark differs from Plain Pandas)



import seaborn as sns
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark import SparkContext
from pyspark.sql import SQLContext # Main entry point for DataFrame and SQL functionality.

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)
import pyspark
print(pyspark.__version__)


3.3.0


with open('./input/student.csv', mode = 'r', encoding = 'utf8') as f:
    data = f.readlines()

data



['sno,FirstName,LASTNAME\n',
'one,Ram,\n',
'two,,Sharma\n',
'three,Shyam,NA\n',
'four,Kabir,\n',
'five,NA,Singh\n']


import pandas as pd
df_student = pd.read_csv('./input/student.csv')
df_student.head()

When you load a Pandas DataFrame by reading from a CSV, blank values and 'NA' values are converted to 'NaN' values by default as shown above.
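A minimal sketch (added for illustration, using an inline string instead of the original CSV file) of this default behaviour:

import io
import pandas as pd

csv_text = "sno,FirstName,LASTNAME\none,Ram,\ntwo,,Sharma\nthree,Shyam,NA\n"
print(pd.read_csv(io.StringIO(csv_text)))
# Both the blank fields and the literal string 'NA' are parsed as NaN by default.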

print(type(df_student))
# <class 'pandas.core.frame.DataFrame'>

df_student.fillna('Not Applicable', inplace = True) # Handles blank and 'NA' values both.
df_student
from pyspark import pandas as ppd

df_student_pyspark = ppd.read_csv('./input/student.csv')

type(df_student_pyspark)
# pyspark.pandas.frame.DataFrame

df_student_pyspark
df_student_pyspark.fillna('Not Applicable', inplace = True) # Handles blank (None) values.
df_student_pyspark

Note the difference from plain Pandas: the Pandas API on PySpark reads only the blank fields as null, while the literal string 'NA' stays as ordinary text, so fillna() leaves it untouched.
Tags: Technology,Spark

Monday, October 24, 2022

Creating a three node Hadoop cluster using Ubuntu OS (Apr 2020)

Dated: 28 Apr 2020
Note about the setup: We are running the Ubuntu OS(s) on top of Windows via VirtualBox.

1. Setting hostname in three Guest OS(s)

$ sudo gedit /etc/hostname

The hostnames for the three machines are master, slave1, and slave2.

ON MASTER (Host OS IP: 192.168.1.12)

$ cat /etc/hosts
192.168.1.12 master
192.168.1.3  slave1
192.168.1.4  slave2

2. ON SLAVE2 (Host OS IP: 192.168.1.4)

$ cat /etc/hostname
slave2

$ cat /etc/hosts
192.168.1.12 master
192.168.1.3  slave1
192.168.1.4  slave2

3. FOLLOW THE STEPS MENTIONED FOR SLAVE2 ALSO FOR SLAVE1 (Host OS IP: 192.168.1.3)

4. Configuring Key Based Login

Set up SSH on every node so that the nodes can communicate with one another without any prompt for a password. Check this link for: Steps of Doing SSH Setup

5. Setting up ".bashrc" on each system (master, slave1, slave2)

$ sudo gedit ~/.bashrc

Add the below lines at the end of the file:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop

6. Follow all the nine steps from the article below to setup Hadoop on "master" machine

Getting started with Hadoop on Ubuntu in VirtualBox

On "master"

7. Set NameNode Location

Update your $HADOOP_HOME/etc/hadoop/core-site.xml file to set the NameNode location to master on port 9000:

$HADOOP_HOME: /usr/local/hadoop

Code:
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

8. Set path for HDFS

Edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file to resemble the following configuration:

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

9. Set YARN as Job Scheduler

Edit the mapred-site.xml file, setting YARN as the default framework for MapReduce operations:

$HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
</configuration>

10. Configure YARN

Edit yarn-site.xml, which contains the configuration options for YARN. In the value field for yarn.resourcemanager.hostname, replace 192.168.1.12 with the public IP address of "master":

$HADOOP_HOME/etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.acl.enable</name>
        <value>0</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.1.12</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

11. Configure Workers

The file "workers" is used by startup scripts to start required daemons on all nodes. Edit this file:

$HADOOP_HOME/etc/hadoop/workers

to include both of the worker nodes:

slave1
slave2

12. Configure Memory Allocation (Two steps)

A) Edit $HADOOP_HOME/etc/hadoop/yarn-site.xml and add the following lines:

$ sudo gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

B) Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml and add the following lines:

$ sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>512</value>
</property>
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>256</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>256</value>
</property>

13. Duplicate Config Files on Each Node

Copy the Hadoop configuration files to the worker nodes:

$ scp -r /usr/local/hadoop/etc/* ashish@slave1:/usr/local/hadoop/etc/
$ scp -r /usr/local/hadoop/etc/* ashish@slave2:/usr/local/hadoop/etc/

When you are copying the contents of "/etc", the following file should be modified to contain the correct JAVA_HOME for each of the destination nodes:

/usr/local/hadoop/etc/hadoop/hadoop-env.sh

14. Format HDFS

HDFS needs to be formatted like any classical file system. On "master", run the following command:

$ hdfs namenode -format

Your Hadoop installation is now configured and ready to run.

15. ==> Start and Stop HDFS

Start the HDFS by running the following script from master:

/usr/local/hadoop/sbin/start-dfs.sh

This will start NameNode and SecondaryNameNode on master, and DataNode on slave1 and slave2, according to the configuration in the workers config file.

Check that every process is running with the jps command on each node. On master, you should see the following (the PID number will be different):

21922 Jps
21603 NameNode
21787 SecondaryNameNode

And on slave1 and slave2 you should see the following:

19728 DataNode
19819 Jps

To stop HDFS on master and worker nodes, run the following command from node-master:

stop-dfs.sh

16. ==> Monitor your HDFS Cluster

Point your browser to http://master:9870/dfshealth.html, where "master" resolves to the IP address of your master node, and you will get a user-friendly monitoring console.
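You can also check the cluster state from the command line with the standard HDFS utility hdfs dfsadmin -report, which lists the live DataNodes and their capacities (an aside; it is not used elsewhere in this post).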

Tags: Technology,Big Data,

Sunday, October 23, 2022

spark-submit For Two Node Spark Cluster With Spark's Standalone RM For Pi Computation (2022 Oct 23)

Previously: Creating Two Node Spark Cluster With Two Worker Nodes and One Master Node Using Spark's Standalone Resource Manager on Ubuntu Machines

Issue

(base) ashish@ashishlaptop:/usr/local/spark$ spark-submit --master spark://ashishlaptop:7077 examples/src/main/python/pi.py 100 22/10/23 15:14:36 INFO SparkContext: Running Spark version 3.3.0 22/10/23 15:14:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/10/23 15:14:36 INFO ResourceUtils: ============================================================== 22/10/23 15:14:36 INFO ResourceUtils: No custom resources configured for spark.driver. 22/10/23 15:14:36 INFO ResourceUtils: ============================================================== 22/10/23 15:14:36 INFO SparkContext: Submitted application: PythonPi 22/10/23 15:14:36 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 22/10/23 15:14:36 INFO ResourceProfile: Limiting resource is cpu 22/10/23 15:14:36 INFO ResourceProfileManager: Added ResourceProfile id: 0 22/10/23 15:14:36 INFO SecurityManager: Changing view acls to: ashish 22/10/23 15:14:36 INFO SecurityManager: Changing modify acls to: ashish 22/10/23 15:14:36 INFO SecurityManager: Changing view acls groups to: 22/10/23 15:14:36 INFO SecurityManager: Changing modify acls groups to: 22/10/23 15:14:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set() 22/10/23 15:14:37 INFO Utils: Successfully started service 'sparkDriver' on port 41631. 22/10/23 15:14:37 INFO SparkEnv: Registering MapOutputTracker 22/10/23 15:14:37 INFO SparkEnv: Registering BlockManagerMaster 22/10/23 15:14:37 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 22/10/23 15:14:37 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 22/10/23 15:14:37 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 22/10/23 15:14:37 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-9599974d-836e-482e-bcf1-5c6e15c29ce9 22/10/23 15:14:37 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB 22/10/23 15:14:37 INFO SparkEnv: Registering OutputCommitCoordinator 22/10/23 15:14:37 INFO Utils: Successfully started service 'SparkUI' on port 4040. 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://ashishlaptop:7077... 22/10/23 15:14:38 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 45 ms (0 ms spent in bootstraps) 22/10/23 15:14:38 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20221023151438-0000 22/10/23 15:14:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44369. 
22/10/23 15:14:38 INFO NettyBlockTransferService: Server created on ashishlaptop:44369 22/10/23 15:14:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 22/10/23 15:14:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ashishlaptop, 44369, None) 22/10/23 15:14:38 INFO BlockManagerMasterEndpoint: Registering block manager ashishlaptop:44369 with 366.3 MiB RAM, BlockManagerId(driver, ashishlaptop, 44369, None) 22/10/23 15:14:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ashishlaptop, 44369, None) 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023151438-0000/0 on worker-20221023135355-192.168.1.142-43143 (192.168.1.142:43143) with 4 core(s) 22/10/23 15:14:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ashishlaptop, 44369, None) 22/10/23 15:14:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023151438-0000/0 on hostPort 192.168.1.142:43143 with 4 core(s), 1024.0 MiB RAM 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023151438-0000/1 on worker-20221023135358-192.168.1.106-44471 (192.168.1.106:44471) with 2 core(s) 22/10/23 15:14:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023151438-0000/1 on hostPort 192.168.1.106:44471 with 2 core(s), 1024.0 MiB RAM 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023151438-0000/0 is now RUNNING 22/10/23 15:14:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023151438-0000/1 is now RUNNING 22/10/23 15:14:39 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 22/10/23 15:14:40 INFO SparkContext: Starting job: reduce at /usr/local/spark/examples/src/main/python/pi.py:42 22/10/23 15:14:41 INFO DAGScheduler: Got job 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) with 100 output partitions 22/10/23 15:14:41 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) 22/10/23 15:14:41 INFO DAGScheduler: Parents of final stage: List() 22/10/23 15:14:41 INFO DAGScheduler: Missing parents: List() 22/10/23 15:14:41 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42), which has no missing parents 22/10/23 15:14:41 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.3 KiB, free 366.3 MiB) 22/10/23 15:14:41 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.5 KiB, free 366.3 MiB) 22/10/23 15:14:41 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ashishlaptop:44369 (size: 8.5 KiB, free: 366.3 MiB) 22/10/23 15:14:41 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1513 22/10/23 15:14:41 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) 22/10/23 15:14:41 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks resource profile 0 22/10/23 15:14:43 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.142:37452) with ID 0, ResourceProfileId 0 22/10/23 15:14:43 INFO BlockManagerMasterEndpoint: 
Registering block manager 192.168.1.142:34419 with 366.3 MiB RAM, BlockManagerId(0, 192.168.1.142, 34419, None) 22/10/23 15:14:43 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.1.142, executor 0, partition 0, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (192.168.1.142, executor 0, partition 1, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:43 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2) (192.168.1.142, executor 0, partition 2, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:43 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (192.168.1.142, executor 0, partition 3, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.142:34419 (size: 8.5 KiB, free: 366.3 MiB) 22/10/23 15:14:46 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (192.168.1.142, executor 0, partition 4, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:46 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5) (192.168.1.142, executor 0, partition 5, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:46 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6) (192.168.1.142, executor 0, partition 6, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:46 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7) (192.168.1.142, executor 0, partition 7, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:14:46 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.106:44292) with ID 1, ResourceProfileId 0 22/10/23 15:14:46 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) (192.168.1.142 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 540, in main raise RuntimeError( RuntimeError: Python in worker has different version 3.10 than that in driver 3.9, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. ... 22/10/23 15:14:47 INFO SparkContext: Invoking stop() from shutdown hook 22/10/23 15:14:47 INFO SparkUI: Stopped Spark web UI at http://ashishlaptop:4040 22/10/23 15:14:47 INFO StandaloneSchedulerBackend: Shutting down all executors 22/10/23 15:14:47 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down 22/10/23 15:14:47 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 22/10/23 15:14:47 INFO MemoryStore: MemoryStore cleared 22/10/23 15:14:47 INFO BlockManager: BlockManager stopped 22/10/23 15:14:47 INFO BlockManagerMaster: BlockManagerMaster stopped 22/10/23 15:14:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 
22/10/23 15:14:47 INFO SparkContext: Successfully stopped SparkContext 22/10/23 15:14:47 INFO ShutdownHookManager: Shutdown hook called 22/10/23 15:14:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-c60126be-f479-4617-8548-ad0ca7f00763/pyspark-40737be6-41de-4d50-859d-88e13123232b 22/10/23 15:14:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-0915b97c-253d-4807-9eb6-e8f3d1a7019c 22/10/23 15:14:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-c60126be-f479-4617-8548-ad0ca7f00763 (base) ashish@ashishlaptop:/usr/local/spark$

Debugging

(base) ashish@ashishlaptop:/usr/local/spark$ echo $PYSPARK_PYTHON

(base) ashish@ashishlaptop:/usr/local/spark$ echo $PYSPARK_DRIVER_PYTHON

(base) ashish@ashishlaptop:/usr/local/spark$

Both are empty.

Setting the environment variables

(base) ashish@ashishlaptop:/usr/local/spark$ which python
/home/ashish/anaconda3/bin/python

(base) ashish@ashishlaptop:/usr/local/spark$ /home/ashish/anaconda3/bin/python
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

(base) ashish@ashishlaptop:/usr/local/spark$ sudo nano ~/.bashrc
[sudo] password for ashish:
(base) ashish@ashishlaptop:/usr/local/spark$

(base) ashish@ashishlaptop:/usr/local/spark$ tail ~/.bashrc
unset __conda_setup
# <<< conda initialize <<<

export PATH="/home/ashish/.local/bin:$PATH"
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export PATH="$PATH:/usr/local/spark/bin"
export PYSPARK_PYTHON="/home/ashish/anaconda3/bin/python"
export PYSPARK_DRIVER_PYTHON="/home/ashish/anaconda3/bin/python"

(base) ashish@ashishlaptop:/usr/local/spark$ source ~/.bashrc
(base) ashish@ashishlaptop:/usr/local/spark$ echo $PYSPARK_PYTHON
/home/ashish/anaconda3/bin/python
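An equivalent approach (an aside, not used in this post) is to pass the interpreter paths to spark-submit directly, e.g. --conf spark.pyspark.python=/home/ashish/anaconda3/bin/python --conf spark.pyspark.driver.python=/home/ashish/anaconda3/bin/python; the environment variables set above achieve the same thing.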

Logs After Issue Resolution

(base) ashish@ashishlaptop:/usr/local/spark$ spark-submit --master spark://ashishlaptop:7077 examples/src/main/python/pi.py 100 22/10/23 15:30:51 INFO SparkContext: Running Spark version 3.3.0 22/10/23 15:30:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/10/23 15:30:52 INFO ResourceUtils: ============================================================== 22/10/23 15:30:52 INFO ResourceUtils: No custom resources configured for spark.driver. 22/10/23 15:30:52 INFO ResourceUtils: ============================================================== 22/10/23 15:30:52 INFO SparkContext: Submitted application: PythonPi 22/10/23 15:30:52 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 22/10/23 15:30:52 INFO ResourceProfile: Limiting resource is cpu 22/10/23 15:30:52 INFO ResourceProfileManager: Added ResourceProfile id: 0 22/10/23 15:30:52 INFO SecurityManager: Changing view acls to: ashish 22/10/23 15:30:52 INFO SecurityManager: Changing modify acls to: ashish 22/10/23 15:30:52 INFO SecurityManager: Changing view acls groups to: 22/10/23 15:30:52 INFO SecurityManager: Changing modify acls groups to: 22/10/23 15:30:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set() 22/10/23 15:30:52 INFO Utils: Successfully started service 'sparkDriver' on port 41761. 22/10/23 15:30:52 INFO SparkEnv: Registering MapOutputTracker 22/10/23 15:30:52 INFO SparkEnv: Registering BlockManagerMaster 22/10/23 15:30:52 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 22/10/23 15:30:52 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 22/10/23 15:30:52 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 22/10/23 15:30:52 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ffa15e79-7af0-41f9-87eb-fce866f17ed8 22/10/23 15:30:53 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB 22/10/23 15:30:53 INFO SparkEnv: Registering OutputCommitCoordinator 22/10/23 15:30:53 INFO Utils: Successfully started service 'SparkUI' on port 4040. 22/10/23 15:30:53 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://ashishlaptop:7077... 22/10/23 15:30:53 INFO TransportClientFactory: Successfully created connection to ashishlaptop/192.168.1.142:7077 after 58 ms (0 ms spent in bootstraps) 22/10/23 15:30:53 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20221023153053-0001 22/10/23 15:30:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023153053-0001/0 on worker-20221023135355-192.168.1.142-43143 (192.168.1.142:43143) with 4 core(s) 22/10/23 15:30:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023153053-0001/0 on hostPort 192.168.1.142:43143 with 4 core(s), 1024.0 MiB RAM 22/10/23 15:30:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32809. 
22/10/23 15:30:53 INFO NettyBlockTransferService: Server created on ashishlaptop:32809 22/10/23 15:30:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 22/10/23 15:30:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221023153053-0001/1 on worker-20221023135358-192.168.1.106-44471 (192.168.1.106:44471) with 2 core(s) 22/10/23 15:30:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20221023153053-0001/1 on hostPort 192.168.1.106:44471 with 2 core(s), 1024.0 MiB RAM 22/10/23 15:30:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ashishlaptop, 32809, None) 22/10/23 15:30:53 INFO BlockManagerMasterEndpoint: Registering block manager ashishlaptop:32809 with 366.3 MiB RAM, BlockManagerId(driver, ashishlaptop, 32809, None) 22/10/23 15:30:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ashishlaptop, 32809, None) 22/10/23 15:30:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ashishlaptop, 32809, None) 22/10/23 15:30:54 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023153053-0001/0 is now RUNNING 22/10/23 15:30:54 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221023153053-0001/1 is now RUNNING 22/10/23 15:30:54 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 22/10/23 15:30:55 INFO SparkContext: Starting job: reduce at /usr/local/spark/examples/src/main/python/pi.py:42 22/10/23 15:30:56 INFO DAGScheduler: Got job 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) with 100 output partitions 22/10/23 15:30:56 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) 22/10/23 15:30:56 INFO DAGScheduler: Parents of final stage: List() 22/10/23 15:30:56 INFO DAGScheduler: Missing parents: List() 22/10/23 15:30:56 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42), which has no missing parents 22/10/23 15:30:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.4 KiB, free 366.3 MiB) 22/10/23 15:30:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.5 KiB, free 366.3 MiB) 22/10/23 15:30:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ashishlaptop:32809 (size: 8.5 KiB, free: 366.3 MiB) 22/10/23 15:30:56 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1513 22/10/23 15:30:56 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark/examples/src/main/python/pi.py:42) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) 22/10/23 15:30:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks resource profile 0 22/10/23 15:30:58 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.142:54146) with ID 0, ResourceProfileId 0 22/10/23 15:30:59 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.142:46811 with 366.3 MiB RAM, BlockManagerId(0, 192.168.1.142, 46811, None) 22/10/23 15:30:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.1.142, executor 0, partition 0, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:30:59 INFO TaskSetManager: Starting task 1.0 in 
stage 0.0 (TID 1) (192.168.1.142, executor 0, partition 1, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:30:59 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2) (192.168.1.142, executor 0, partition 2, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:30:59 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (192.168.1.142, executor 0, partition 3, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() 22/10/23 15:30:59 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.142:46811 (size: 8.5 KiB, free: 366.3 MiB) 22/10/23 15:31:01 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.106:60352) with ID 1, ResourceProfileId 0 22/10/23 15:31:01 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.106:41617 with 366.3 MiB RAM, BlockManagerId(1, 192.168.1.106, 41617, None) 22/10/23 15:31:01 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (192.168.1.106, executor 1, partition 4, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map() ... 22/10/23 15:31:09 INFO TaskSetManager: Finished task 93.0 in stage 0.0 (TID 93) in 344 ms on 192.168.1.142 (executor 0) (94/100) 22/10/23 15:31:09 INFO TaskSetManager: Finished task 94.0 in stage 0.0 (TID 94) in 312 ms on 192.168.1.142 (executor 0) (95/100) 22/10/23 15:31:09 INFO TaskSetManager: Finished task 95.0 in stage 0.0 (TID 95) in 314 ms on 192.168.1.142 (executor 0) (96/100) 22/10/23 15:31:09 INFO TaskSetManager: Finished task 96.0 in stage 0.0 (TID 96) in 263 ms on 192.168.1.106 (executor 1) (97/100) 22/10/23 15:31:09 INFO TaskSetManager: Finished task 98.0 in stage 0.0 (TID 98) in 260 ms on 192.168.1.142 (executor 0) (98/100) 22/10/23 15:31:09 INFO TaskSetManager: Finished task 99.0 in stage 0.0 (TID 99) in 256 ms on 192.168.1.142 (executor 0) (99/100) 22/10/23 15:31:10 INFO TaskSetManager: Finished task 97.0 in stage 0.0 (TID 97) in 384 ms on 192.168.1.106 (executor 1) (100/100) 22/10/23 15:31:10 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 22/10/23 15:31:10 INFO DAGScheduler: ResultStage 0 (reduce at /usr/local/spark/examples/src/main/python/pi.py:42) finished in 13.849 s 22/10/23 15:31:10 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job 22/10/23 15:31:10 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished 22/10/23 15:31:10 INFO DAGScheduler: Job 0 finished: reduce at /usr/local/spark/examples/src/main/python/pi.py:42, took 14.106103 s Pi is roughly 3.142880 22/10/23 15:31:10 INFO SparkUI: Stopped Spark web UI at http://ashishlaptop:4040 22/10/23 15:31:10 INFO StandaloneSchedulerBackend: Shutting down all executors 22/10/23 15:31:10 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down 22/10/23 15:31:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 22/10/23 15:31:10 INFO MemoryStore: MemoryStore cleared 22/10/23 15:31:10 INFO BlockManager: BlockManager stopped 22/10/23 15:31:10 INFO BlockManagerMaster: BlockManagerMaster stopped 22/10/23 15:31:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 
22/10/23 15:31:10 INFO SparkContext: Successfully stopped SparkContext 22/10/23 15:31:11 INFO ShutdownHookManager: Shutdown hook called 22/10/23 15:31:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-6be4655c-e59a-403a-92e8-582583fa3f7d/pyspark-c4d7588d-a23a-4393-b29b-6689d20e7684 22/10/23 15:31:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-f4436e38-d155-4763-bb57-461eb3793d13 22/10/23 15:31:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-6be4655c-e59a-403a-92e8-582583fa3f7d (base) ashish@ashishlaptop:/usr/local/spark$
Tags: Technology,Spark,