1: Checking Environment Variables For Java, Scala and Python
Java
(base) C:\Users\ashish>java -version
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (Temurin)(build 25.322-b06, mixed mode)
(base) C:\Users\ashish>where java
C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin\java.exe
C:\Program Files\Zulu\zulu-17-jre\bin\java.exe
C:\Program Files\Zulu\zulu-17\bin\java.exe
(base) C:\Users\ashish>echo %JAVA_HOME%
C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot
Scala
(base) C:\Users\ashish>scala -version
Scala code runner version 3.2.0 -- Copyright 2002-2022, LAMP/EPFL
Python
(base) C:\Users\ashish>python --version
Python 3.9.12
(base) C:\Users\ashish>where python
C:\Users\ashish\Anaconda3\python.exe
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe
(base) C:\Users\ashish>echo %PYSPARK_PYTHON%
C:\Users\ashish\Anaconda3\python.exe
(base) C:\Users\ashish>echo %PYSPARK_DRIVER_PYTHON%
C:\Users\ashish\Anaconda3\python.exe
(base) C:\Users\ashish>echo %PYTHONPATH%
C:\Users\ashish\Anaconda3\envs\mh
Spark Home and Hadoop Home
(base) C:\Users\ashish>echo %SPARK_HOME%
D:\progfiles\spark-3.3.1-bin-hadoop3
(base) C:\Users\ashish>echo %HADOOP_HOME%
D:\progfiles\spark-3.3.1-bin-hadoop3\hadoop
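To sanity-check these values programmatically, here is a minimal Python sketch (the variable names are the ones checked above; the paths are from my machine, adjust for yours) that confirms each variable is set and points to an existing location:

import os

# Variables the rest of this setup relies on (names from the checks above).
required = ["JAVA_HOME", "SPARK_HOME", "HADOOP_HOME",
            "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"]

for name in required:
    value = os.environ.get(name)
    if value is None:
        print(f"{name} is NOT set")
    else:
        print(f"{name} = {value} (exists on disk: {os.path.exists(value)})")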
2: Checking Properties Like Hostname, Spark Workers, and PATH Variable Values
(base) D:\progfiles\spark-3.3.1-bin-hadoop3\conf>hostname
CS3L
(base) D:\progfiles\spark-3.3.1-bin-hadoop3\conf>type workers
CS3L
Note the following entry in the PATH value; it is added by Coursier and is used for Scala (a quick cross-check script follows the full PATH listing below):
C:\Users\ashish\AppData\Local\Coursier\data\bin;
(base) D:\progfiles\spark-3.3.1-bin-hadoop3\conf>echo %PATH%
C:\Users\ashish\Anaconda3;
C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;
C:\Users\ashish\Anaconda3\Library\usr\bin;
C:\Users\ashish\Anaconda3\Library\bin;
C:\Users\ashish\Anaconda3\Scripts;
C:\Users\ashish\Anaconda3\bin;
C:\Users\ashish\Anaconda3\condabin;
C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin;
C:\Program Files\Zulu\zulu-17-jre\bin;
C:\Program Files\Zulu\zulu-17\bin;
C:\windows\system32;C:\windows;
C:\windows\System32\Wbem;
C:\windows\System32\WindowsPowerShell\v1.0;
C:\windows\System32\OpenSSH;
C:\Program Files\Git\cmd;
C:\Users\ashish\Anaconda3;
C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;
C:\Users\ashish\Anaconda3\Library\usr\bin;
C:\Users\ashish\Anaconda3\Library\bin;
C:\Users\ashish\Anaconda3\Scripts;
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps;
C:\Users\ashish\AppData\Local\Programs\Microsoft VS Code\bin;
D:\progfiles\spark-3.3.1-bin-hadoop3\bin;
C:\Users\ashish\AppData\Local\Coursier\data\bin;
.
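As a quick cross-check, the sketch below (an illustrative example; it assumes SPARK_HOME is set as shown earlier) verifies that %SPARK_HOME%\bin is on PATH and that the machine's hostname matches the entry in conf\workers:

import os
import socket

spark_home = os.environ["SPARK_HOME"]
spark_bin = os.path.join(spark_home, "bin")

# Is the Spark bin folder on PATH? (case-insensitive compare on Windows)
path_entries = [p.strip().lower() for p in os.environ["PATH"].split(os.pathsep)]
print("SPARK_HOME\\bin on PATH:", spark_bin.lower() in path_entries)

# Does the hostname match what conf\workers lists?
with open(os.path.join(spark_home, "conf", "workers")) as f:
    workers = [line.strip() for line in f if line.strip()]
print("hostname:", socket.gethostname(), "| workers file:", workers)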
3: Turn off Windows Defender Firewall (for SSH to work)
4.1: Create inbound rule for allowing SSH connections on port 22
4.2: Create outbound rule for allowing SSH connections on port 22
5: Checking SSH Properties
C:\Users\ashish\.ssh>type known_hosts
192.168.1.151 ecdsa-sha2-nistp256 A***EhMzjgo=
ashishlaptop ecdsa-sha2-nistp256 A***EhMzjgo=
C:\Users\ashish\.ssh>ipconfig
Windows IP Configuration
Ethernet adapter Ethernet:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . : ad.itl.com
Ethernet adapter Ethernet 2:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . : ad.itl.com
Wireless LAN adapter Wi-Fi:
Connection-specific DNS Suffix . :
IPv6 Address. . . . . . . . . . . : 2401:4900:47f5:1737:b1b2:6d59:f669:1b96
Temporary IPv6 Address. . . . . . : 2401:4900:47f5:1737:88e0:bacc:7490:e794
Link-local IPv6 Address . . . . . : fe80::b1b2:6d59:f669:1b96%13
IPv4 Address. . . . . . . . . . . : 192.168.1.101
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : fe80::44da:eaff:feb6:7061%13
192.168.1.1
Ethernet adapter Bluetooth Network Connection:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . :
C:\Users\ashish\.ssh>
C:\Users\ashish\.ssh>dir
Volume in drive C is OSDisk
Volume Serial Number is 88CC-6EA2
Directory of C:\Users\ashish\.ssh
10/26/2022 03:45 PM <DIR> .
10/26/2022 03:45 PM <DIR> ..
10/26/2022 03:45 PM 574 authorized_keys
10/26/2022 03:27 PM 2,635 id_rsa
10/26/2022 03:27 PM 593 id_rsa.pub
10/26/2022 03:46 PM 351 known_hosts
4 File(s) 4,153 bytes
2 Dir(s) 78,491,791,360 bytes free
C:\Users\ashish\.ssh>type authorized_keys
ssh-rsa A***= ashish@ashishlaptop
C:\Users\ashish\.ssh>
6: Error We Faced While Trying to Run Spark's start-all.sh in Git Bash
$ ./sbin/start-all.sh
ps: unknown option -- o
Try `ps --help' for more information.
hostname: unknown option -- f
starting org.apache.spark.deploy.master.Master, logging to D:\progfiles\spark-3.3.1-bin-hadoop3/logs/spark--org.apache.spark.deploy.master.Master-1-CS3L.out
ps: unknown option -- o
...
Try `ps --help' for more information.
failed to launch: nice -n 0 D:\progfiles\spark-3.3.1-bin-hadoop3/bin/spark-class org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080
ps: unknown option -- o
Try `ps --help' for more information.
Spark Command: C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin\java -cp D:\progfiles\spark-3.3.1-bin-hadoop3/conf\;D:\progfiles\spark-3.3.1-bin-hadoop3\jars\* -Xmx1g org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080
========================================
"C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin\java" -cp "D:\progfiles\spark-3.3.1-bin-hadoop3/conf\;D:\progfiles\spark-3.3.1-bin-hadoop3\jars\*" -Xmx1g org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080
D:\progfiles\spark-3.3.1-bin-hadoop3/bin/spark-class: line 96: CMD: bad array subscript
full log in D:\progfiles\spark-3.3.1-bin-hadoop3/logs/spark--org.apache.spark.deploy.master.Master-1-CS3L.out
ps: unknown option -- o
Try `ps --help' for more information.
hostname: unknown option -- f
Try 'hostname --help' for more information.
ps: unknown option -- o
Try `ps --help' for more information.
CS3L: ssh: connect to host cs3l port 22: Connection refused
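The last line above ("Connection refused" on port 22) means no SSH server was reachable on the worker host. A small, hedged Python check like the following (hostname CS3L is the one from the workers file) tells you whether an SSH daemon is actually listening before you retry start-all.sh:

import socket

host, port = "CS3L", 22  # the worker hostname and SSH port used in this setup

try:
    with socket.create_connection((host, port), timeout=5):
        print(f"Port {port} on {host} is open; an SSH server appears to be listening.")
except OSError as exc:
    print(f"Could not connect to {host}:{port} -> {exc}")
    print("Start the OpenSSH Server service on Windows and re-check the firewall rules.")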
7: Spark-submit failing in CMD without HADOOP_HOME set
(base) D:\progfiles\spark-3.3.1-bin-hadoop3\bin>spark-submit --master local examples/src/main/python/pi.py 100
22/11/01 11:59:42 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/11/01 11:59:42 INFO ShutdownHookManager: Shutdown hook called
22/11/01 11:59:42 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-521c5e3c-beea-4f67-a1e6-71dd4c5c308c
8: Instructions for Configuring HADOOP_HOME for Spark on Windows
Installing winutils
Let’s download winutils.exe and configure our Spark installation to find it.
a) Create a hadoop\bin folder inside the SPARK_HOME folder.
b) Download the winutils.exe for the Hadoop version against which your Spark installation was built. In my case the Hadoop version (3) is mentioned in the Spark package name, so I downloaded the winutils.exe for Hadoop 3.0.0 and copied it to the hadoop\bin folder inside the SPARK_HOME folder.
c) Create a system environment variable in Windows called SPARK_HOME that points to the Spark installation folder (here, D:\progfiles\spark-3.3.1-bin-hadoop3).
d) Create another system environment variable in Windows called HADOOP_HOME that points to the hadoop folder inside the SPARK_HOME folder.
Since the hadoop folder is inside the SPARK_HOME folder, it is better to set HADOOP_HOME to the value %SPARK_HOME%\hadoop. That way you don’t have to change HADOOP_HOME if SPARK_HOME is updated. A quick verification sketch is shown after the reference below.
Ref: Download winutils.exe From GitHub
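After these steps, a quick Python check (a sketch based on the layout described above) confirms that HADOOP_HOME is set and that winutils.exe is where Spark expects it:

import os

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)

if hadoop_home:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe present:", os.path.isfile(winutils))
else:
    print("HADOOP_HOME is not set; Spark will keep logging the winutils.exe warning.")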
9: Error if PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are not set
A standard way of setting environment variables, including PYSPARK_PYTHON, is the conf/spark-env.sh file. Spark ships with a template (conf/spark-env.sh.template) that explains the most common options.
It is a normal Bash script, so you can use it the same way you would use .bashrc. On Windows, the equivalent file is conf\spark-env.cmd.
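Another option on Windows is to set the interpreters from Python itself before the SparkSession is created. A minimal sketch, assuming the Anaconda interpreter path shown in the 'where python' output earlier:

import os

# Point both the workers and the driver at the same interpreter
# (path taken from the 'where python' output above; adjust for your machine).
os.environ["PYSPARK_PYTHON"] = r"C:\Users\ashish\Anaconda3\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\ashish\Anaconda3\python.exe"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
# Run a tiny RDD action so a Python worker process is actually launched.
print(spark.sparkContext.parallelize(range(5)).sum())
spark.stop()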
9.1
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/11/01 12:21:02 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:21:02 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-18bce9f2-f7e8-4d04-843b-8c04a27675a7
9.2
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>where python
C:\Users\ashish\Anaconda3\python.exe
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
9.3
IN CMD:
D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md
Exception in thread "main" java.io.IOException: Cannot run program "C:\Users\ashish\Anaconda3": CreateProcess error=5, Access is denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: CreateProcess error=5, Access is denied
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:453)
at java.lang.ProcessImpl.start(ProcessImpl.java:139)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 14 more
22/11/01 12:32:10 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:32:10 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-46c29762-efc3-425a-98fd-466b1500aa5b
9.4
IN ANACONDA:
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md
Exception in thread "main" java.io.IOException: Cannot run program "C:\Users\ashish\Anaconda3": CreateProcess error=5, Access is denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: CreateProcess error=5, Access is denied
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:453)
at java.lang.ProcessImpl.start(ProcessImpl.java:139)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 14 more
22/11/01 12:34:25 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:34:25 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-7480dad3-92f5-41cb-98ec-e02fee0eaff5
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>
10: Successful Run of Word Count Program
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md
22/11/01 12:37:22 INFO SparkContext: Running Spark version 3.3.1
22/11/01 12:37:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/01 12:37:23 INFO ResourceUtils: ==============================================================
22/11/01 12:37:23 INFO ResourceUtils: No custom resources configured for spark.driver.
22/11/01 12:37:23 INFO ResourceUtils: ==============================================================
22/11/01 12:37:23 INFO SparkContext: Submitted application: PythonWordCount
22/11/01 12:37:23 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/11/01 12:37:23 INFO ResourceProfile: Limiting resource is cpu
22/11/01 12:37:23 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/11/01 12:37:23 INFO SecurityManager: Changing view acls to: ashish
22/11/01 12:37:23 INFO SecurityManager: Changing modify acls to: ashish
22/11/01 12:37:23 INFO SecurityManager: Changing view acls groups to:
22/11/01 12:37:23 INFO SecurityManager: Changing modify acls groups to:
22/11/01 12:37:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set()
22/11/01 12:37:24 INFO Utils: Successfully started service 'sparkDriver' on port 52785.
22/11/01 12:37:24 INFO SparkEnv: Registering MapOutputTracker
22/11/01 12:37:24 INFO SparkEnv: Registering BlockManagerMaster
22/11/01 12:37:24 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/11/01 12:37:24 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/11/01 12:37:24 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/11/01 12:37:24 INFO DiskBlockManager: Created local directory at C:\Users\ashish\AppData\Local\Temp\blockmgr-09454369-56f1-4fae-a0d9-b5f19b6a8bd1
22/11/01 12:37:24 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
22/11/01 12:37:24 INFO SparkEnv: Registering OutputCommitCoordinator
22/11/01 12:37:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/11/01 12:37:25 INFO Executor: Starting executor ID driver on host CS3L.ad.itl.com
22/11/01 12:37:25 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
22/11/01 12:37:25 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52828.
22/11/01 12:37:25 INFO NettyBlockTransferService: Server created on CS3L.ad.itl.com:52828
22/11/01 12:37:25 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/11/01 12:37:25 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, CS3L.ad.itl.com, 52828, None)
22/11/01 12:37:25 INFO BlockManagerMasterEndpoint: Registering block manager CS3L.ad.itl.com:52828 with 366.3 MiB RAM, BlockManagerId(driver, CS3L.ad.itl.com, 52828, None)
22/11/01 12:37:25 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, CS3L.ad.itl.com, 52828, None)
22/11/01 12:37:25 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, CS3L.ad.itl.com, 52828, None)
22/11/01 12:37:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/11/01 12:37:26 INFO SharedState: Warehouse path is 'file:/D:/progfiles/spark-3.3.1-bin-hadoop3/spark-warehouse'.
22/11/01 12:37:27 INFO InMemoryFileIndex: It took 51 ms to list leaf files for 1 paths.
22/11/01 12:37:31 INFO FileSourceStrategy: Pushed Filters:
22/11/01 12:37:31 INFO FileSourceStrategy: Post-Scan Filters:
22/11/01 12:37:31 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
22/11/01 12:37:31 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 349.6 KiB, free 366.0 MiB)
22/11/01 12:37:31 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 33.9 KiB, free 365.9 MiB)
22/11/01 12:37:31 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on CS3L.ad.itl.com:52828 (size: 33.9 KiB, free: 366.3 MiB)
22/11/01 12:37:31 INFO SparkContext: Created broadcast 0 from javaToPython at NativeMethodAccessorImpl.java:0
22/11/01 12:37:31 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
22/11/01 12:37:32 INFO SparkContext: Starting job: collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38
22/11/01 12:37:32 INFO DAGScheduler: Registering RDD 6 (reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35) as input to shuffle 0
22/11/01 12:37:32 INFO DAGScheduler: Got job 0 (collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) with 1 output partitions
22/11/01 12:37:32 INFO DAGScheduler: Final stage: ResultStage 1 (collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38)
22/11/01 12:37:32 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
22/11/01 12:37:32 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
22/11/01 12:37:32 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[6] at reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35), which has no missing parents
22/11/01 12:37:32 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 19.4 KiB, free 365.9 MiB)
22/11/01 12:37:32 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 10.2 KiB, free 365.9 MiB)
22/11/01 12:37:32 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on CS3L.ad.itl.com:52828 (size: 10.2 KiB, free: 366.3 MiB)
22/11/01 12:37:32 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1513
22/11/01 12:37:32 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[6] at reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35) (first 15 tasks are for partitions Vector(0))
22/11/01 12:37:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
22/11/01 12:37:32 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (CS3L.ad.itl.com, executor driver, partition 0, PROCESS_LOCAL, 4914 bytes) taskResourceAssignments Map()
22/11/01 12:37:32 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/11/01 12:37:34 INFO FileScanRDD: Reading File path: file:///D:/progfiles/spark-3.3.1-bin-hadoop3/README.md, range: 0-4585, partition values: [empty row]
22/11/01 12:37:34 INFO CodeGenerator: Code generated in 316.0539 ms
22/11/01 12:37:34 INFO PythonRunner: Times: total = 1638, boot = 1131, init = 504, finish = 3
22/11/01 12:37:34 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1928 bytes result sent to driver
22/11/01 12:37:34 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2463 ms on CS3L.ad.itl.com (executor driver) (1/1)
22/11/01 12:37:34 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/11/01 12:37:34 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 52829
22/11/01 12:37:34 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35) finished in 2.675 s
22/11/01 12:37:34 INFO DAGScheduler: looking for newly runnable stages
22/11/01 12:37:34 INFO DAGScheduler: running: Set()
22/11/01 12:37:34 INFO DAGScheduler: waiting: Set(ResultStage 1)
22/11/01 12:37:34 INFO DAGScheduler: failed: Set()
22/11/01 12:37:34 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[9] at collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38), which has no missing parents
22/11/01 12:37:35 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 9.5 KiB, free 365.9 MiB)
22/11/01 12:37:35 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 5.7 KiB, free 365.9 MiB)
22/11/01 12:37:35 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on CS3L.ad.itl.com:52828 (size: 5.7 KiB, free: 366.3 MiB)
22/11/01 12:37:35 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1513
22/11/01 12:37:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (PythonRDD[9] at collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) (first 15 tasks are for partitions Vector(0))
22/11/01 12:37:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks resource profile 0
22/11/01 12:37:35 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (CS3L.ad.itl.com, executor driver, partition 0, NODE_LOCAL, 4271 bytes) taskResourceAssignments Map()
22/11/01 12:37:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
22/11/01 12:37:35 INFO ShuffleBlockFetcherIterator: Getting 1 (3.2 KiB) non-empty blocks including 1 (3.2 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
22/11/01 12:37:35 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 20 ms
22/11/01 12:37:36 INFO PythonRunner: Times: total = 1154, boot = 1133, init = 20, finish = 1
22/11/01 12:37:36 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 6944 bytes result sent to driver
22/11/01 12:37:36 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1323 ms on CS3L.ad.itl.com (executor driver) (1/1)
22/11/01 12:37:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/11/01 12:37:36 INFO DAGScheduler: ResultStage 1 (collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) finished in 1.351 s
22/11/01 12:37:36 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/11/01 12:37:36 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
22/11/01 12:37:36 INFO DAGScheduler: Job 0 finished: collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38, took 4.172433 s
#: 1
Apache: 1
Spark: 15
...
project.: 1
22/11/01 12:37:36 INFO SparkUI: Stopped Spark web UI at http://CS3L.ad.itl.com:4040
22/11/01 12:37:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/11/01 12:37:36 INFO MemoryStore: MemoryStore cleared
22/11/01 12:37:36 INFO BlockManager: BlockManager stopped
22/11/01 12:37:36 INFO BlockManagerMaster: BlockManagerMaster stopped
22/11/01 12:37:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/11/01 12:37:36 INFO SparkContext: Successfully stopped SparkContext
22/11/01 12:37:37 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:37:37 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-d74fb7a9-5e3a-417e-b815-7f3de7efb44b
22/11/01 12:37:37 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-d74fb7a9-5e3a-417e-b815-7f3de7efb44b\pyspark-0c5fc2b0-e2ab-4e28-a021-2f57c08d8d8c
22/11/01 12:37:37 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-20206684-dcc9-4326-976b-d9499bc5e483
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>
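For reference, the bundled wordcount.py is essentially the following (a simplified sketch of the example script, not a verbatim copy): read the input file, split each line into words, map every word to a count of 1, and reduce by key to sum the counts.

import sys
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()

    # Read the input file passed on the command line (e.g. README.md).
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    for word, count in counts.collect():
        print(f"{word}: {count}")

    spark.stop()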
11: Missing PyArrow Package Issue and Resolution
(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit C:\Users\ashish\Desktop\mh\Code\pandas_to_pyspark\pyinstaller\script.py
Traceback (most recent call last):
File "D:\progfiles\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\pandas\utils.py", line 53, in require_minimum_pyarrow_version
ModuleNotFoundError: No module named 'pyarrow'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\ashish\Desktop\mh\Code\pandas_to_pyspark\pyinstaller\script.py", line 1, in <module>
from pyspark import pandas as pd
File "<frozen zipimport>", line 259, in load_module
File "D:\progfiles\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\pandas\__init__.py", line 34, in <module>
File "D:\progfiles\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\pandas\utils.py", line 60, in require_minimum_pyarrow_version
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
22/11/01 12:41:57 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:41:57 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-3293d857-5ddc-4fcf-8dcc-8d7b39a8f8cc
Resolution
CMD> pip install pyarrow
(base) C:\Users\ashish>pip show pyarrow
Name: pyarrow
Version: 10.0.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author:
Author-email:
License: Apache License, Version 2.0
Location: c:\users\ashish\anaconda3\lib\site-packages
Requires: numpy
Required-by:
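With pyarrow installed, the import that failed above should now succeed. A quick smoke test (a hedged sketch; it re-uses the same import as the failing script):

# Re-run the import that previously failed with ModuleNotFoundError.
import pyarrow
print("pyarrow version:", pyarrow.__version__)

from pyspark import pandas as pd  # pandas API on Spark; requires PyArrow >= 1.0.0

df = pd.DataFrame({"a": [1, 2, 3]})
print(df.sum())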