Friday, November 4, 2022

Histogram report and binning on Sales data

Layoffs and Reduction of Infrastructure Cost at Musk's Twitter (Nov 2022)

Musk Orders Twitter To Reduce Infrastructure Costs By $1 Billion: Sources

Elon Musk has directed Twitter Inc’s teams to find up to $1 billion in annual infrastructure cost savings, according to two sources familiar with the matter and an internal Slack message reviewed by Reuters, raising concerns that Twitter could go down during high-traffic events like the U.S. midterm elections. The company is aiming to find between $1.5 million and $3 million a day in savings from servers and cloud services, said the Slack message, which referred to the project as “Deep Cuts Plan.” Twitter is currently losing about $3 million a day “with all spending and revenue considered,” according to an internal document reviewed by Reuters. Twitter did not immediately respond to a request for comment.

'If On Way To Office, Return Home': Twitter To All Employees As Layoffs Begin

Elon Musk-owned Twitter is going ahead with a massive firing plan globally. Twitter has shut its offices and suspended the badges of all employees until a decision is made as to whether an employee is fired or retained. The scale is so massive that employees who are not fired will get “a notification via their Twitter email”, while those who are fired will get an email on their personal email ID. The decision will be made by Friday and all employees will get an email by “9AM PST on Friday Nov. 4th.” Elon Musk is said to be working with close colleagues at Tesla and SpaceX to structure the layoff plans. 3,738 Twitter employees could be laid off.

Employees at Twitter were notified in an email seen by The New York Times that layoffs would start on Friday and were instructed not to come into work on that day. The overall number of layoffs the company was contemplating was not mentioned in the email. Here’s the full letter that was sent to Twitter employees:

Team,

In an effort to place Twitter on a healthy path, we will go through the difficult process of reducing our global workforce on Friday. We recognize that this will impact a number of individuals who have made valuable contributions to Twitter, but this action is unfortunately necessary to ensure the company’s success moving forward.

Given the nature of our distributed workforce and our desire to inform impacted individuals as quickly as possible, communications for this process will take place via email. By 9AM PST on Friday Nov. 4th, everyone will receive an individual email with the subject line: Your Role at Twitter. Please check your email, including your spam folder.

• If your employment is not impacted, you will receive a notification via your Twitter email.
• If your employment is impacted, you will receive a notification with next steps via your personal email.
• If you do not receive an email from twitter-hr@ by 5PM PST on Friday Nov. 4th, please email peoplequestions@twitter.com.

To help ensure the safety of each employee as well as Twitter systems and customer data, our offices will be temporarily closed and all badge access will be suspended. If you are in an office or on your way to an office, please return home.

We acknowledge this is an incredibly challenging experience to go through, whether or not you are impacted. Thank you for continuing to adhere to Twitter policies that prohibit you from discussing confidential company information on social media, with the press or elsewhere.

We are grateful for your contributions to Twitter and for your patience as we move through this process.

Thank you.

Twitter
Tags: Investment,Management,

Thursday, November 3, 2022

Stratified sampling and fixed size sampling plus visualization using pie plot (Nov 2022)

Download Code And Data

1. Stratified Sampling Using Pandas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

complete_data = pd.read_csv('sales_data_sample.csv')

colslist = ['COUNTRY', 'PRODUCTLINE']
train_size = 0.33

data_sample = complete_data.groupby(colslist, group_keys=False).apply(
    lambda x: x.sample(
        int(train_size * len(x)),
        random_state=1
    )
)

complete_data.head()
complete_data.shape
(2823, 25)

data_sample.shape
(865, 25)

def plot_pie(labels, sizes, title=""):
    colors = ['#f47961', '#f0c419', '#255c61', '#78909c', '#6ad4cf',
              '#17aee8', '#5c6bc0', '#444b6e', '#ef4c60', '#744593',
              '#ee5691', '#9ccc65', '#708b75', '#d1cb65', '#0d8de1',
              '#a4554b', '#694f5d', '#45adb3', '#26a69a', '#bdc7cc']
    colors = colors[0:len(labels)]
    explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)  # explode 1st slice
    explode = explode[0:len(labels)]

    # Plot
    plt.figure(num=None, figsize=(9, 7), dpi=80, facecolor='w', edgecolor='k')
    plt.pie(sizes, explode=explode, labels=labels, colors=colors,
            autopct='%1.1f%%', shadow=True, startangle=140)
    plt.title(title)
    plt.axis('equal')
    plt.show()

pie_plot_data = complete_data.groupby('COUNTRY', as_index=False)['COUNTRY'].value_counts()
pie_plot_data.sort_values(by=['count'], inplace=True)
pie_plot_data.head()
plot_pie(pie_plot_data.COUNTRY.values, pie_plot_data['count'].values, 'Countries Before Sampling')
pie_plot_data = data_sample.groupby('COUNTRY', as_index=False)['COUNTRY'].value_counts()
pie_plot_data.sort_values(by=['count'], inplace=True)
pie_plot_data.head()
plot_pie(pie_plot_data.COUNTRY.values, pie_plot_data['count'].values, 'Countries After Sampling')
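For comparison, scikit-learn can draw a stratified sample in a single call via train_test_split. This is only a sketch, not part of the original notebook: it assumes scikit-learn is installed and sales_data_sample.csv is in the working directory, and it stratifies on COUNTRY only, whereas the groupby approach above stratifies on COUNTRY and PRODUCTLINE jointly.

import pandas as pd
from sklearn.model_selection import train_test_split

complete_data = pd.read_csv('sales_data_sample.csv')

# stratify= keeps the COUNTRY proportions in the sample close to the full data
data_sample, _ = train_test_split(
    complete_data,
    train_size=0.33,
    stratify=complete_data['COUNTRY'],
    random_state=1
)
print(data_sample.shape)
print(data_sample['COUNTRY'].value_counts(normalize=True).head())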

2. Fixed Size Sampling With Equal Representation When Number of Records is Too Large

data_sample = complete_data.groupby(colslist, group_keys=False).apply(
    lambda x: x.sample(n=1, random_state=1)
).reset_index(drop=True)

pie_plot_data = data_sample.groupby('COUNTRY', as_index=False)['COUNTRY'].value_counts()
pie_plot_data.sort_values(by=['count'], inplace=True)
pie_plot_data.head()
pie_plot_data.tail()
plot_pie(pie_plot_data.COUNTRY.values, pie_plot_data['count'].values, 'Countries After Sampling')
data_sample[data_sample['COUNTRY'] == 'USA'][colslist]
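A small variant of the fixed-size draw above (a sketch, not from the original post): if you ask for more than one row per group, wrapping the sample size in min(len(x), n) avoids the ValueError that DataFrame.sample raises when a group has fewer rows than requested. It reuses complete_data and colslist from above; n_per_group is a hypothetical value.

n_per_group = 5  # hypothetical per-group sample size
data_sample_n = complete_data.groupby(colslist, group_keys=False).apply(
    lambda x: x.sample(n=min(len(x), n_per_group), random_state=1)
).reset_index(drop=True)
print(data_sample_n.groupby('COUNTRY').size().head())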
Tags: Technology,Data Visualization,Machine Learning,

Tuesday, November 1, 2022

Installing Spark on Windows (Nov 2022)

1: Checking Environment Variables For Java, Scala and Python

Java

(base) C:\Users\ashish>java -version
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (Temurin)(build 25.322-b06, mixed mode)

(base) C:\Users\ashish>where java
C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin\java.exe
C:\Program Files\Zulu\zulu-17-jre\bin\java.exe
C:\Program Files\Zulu\zulu-17\bin\java.exe

(base) C:\Users\ashish>echo %JAVA_HOME%
C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot

Scala

(base) C:\Users\ashish>scala -version
Scala code runner version 3.2.0 -- Copyright 2002-2022, LAMP/EPFL

Python

(base) C:\Users\ashish>python --version
Python 3.9.12

(base) C:\Users\ashish>where python
C:\Users\ashish\Anaconda3\python.exe
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe

(base) C:\Users\ashish>echo %PYSPARK_PYTHON%
C:\Users\ashish\Anaconda3\python.exe

(base) C:\Users\ashish>echo %PYSPARK_DRIVER_PYTHON%
C:\Users\ashish\Anaconda3\python.exe

(base) C:\Users\ashish>echo %PYTHONPATH%
C:\Users\ashish\Anaconda3\envs\mh

Spark Home and Hadoop Home

(base) C:\Users\ashish>echo %SPARK_HOME%
D:\progfiles\spark-3.3.1-bin-hadoop3

(base) C:\Users\ashish>echo %HADOOP_HOME%
D:\progfiles\spark-3.3.1-bin-hadoop3\hadoop
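The same variables can also be cross-checked from Python, which is handy when a spark-submit run misbehaves. This is just a small sketch that prints what the shell commands above already showed.

import os

for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME",
            "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "PYTHONPATH"):
    print(var, "=", os.environ.get(var, "<not set>"))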

2: Checking Properties Like Hostname, Spark Workers and PATH variable values

(base) D:\progfiles\spark-3.3.1-bin-hadoop3\conf>hostname
CS3L

(base) D:\progfiles\spark-3.3.1-bin-hadoop3\conf>type workers
CS3L

PATH Value: C:\Users\ashish\AppData\Local\Coursier\data\bin; It is for Scala.

(base) D:\progfiles\spark-3.3.1-bin-hadoop3\conf>echo %PATH%
C:\Users\ashish\Anaconda3;
C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;
C:\Users\ashish\Anaconda3\Library\usr\bin;
C:\Users\ashish\Anaconda3\Library\bin;
C:\Users\ashish\Anaconda3\Scripts;
C:\Users\ashish\Anaconda3\bin;
C:\Users\ashish\Anaconda3\condabin;
C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin;
C:\Program Files\Zulu\zulu-17-jre\bin;
C:\Program Files\Zulu\zulu-17\bin;
C:\windows\system32;
C:\windows;
C:\windows\System32\Wbem;
C:\windows\System32\WindowsPowerShell\v1.0;
C:\windows\System32\OpenSSH;
C:\Program Files\Git\cmd;
C:\Users\ashish\Anaconda3;
C:\Users\ashish\Anaconda3\Library\mingw-w64\bin;
C:\Users\ashish\Anaconda3\Library\usr\bin;
C:\Users\ashish\Anaconda3\Library\bin;
C:\Users\ashish\Anaconda3\Scripts;
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps;
C:\Users\ashish\AppData\Local\Programs\Microsoft VS Code\bin;
D:\progfiles\spark-3.3.1-bin-hadoop3\bin;
C:\Users\ashish\AppData\Local\Coursier\data\bin;
.

3: Turn off Windows Defender Firewall (for SSH to work)

4.1: Create inbound rule for allowing SSH connections on port 22

4.2: Create outbound rule for allowing SSH connections on port 22

5: Checking SSH Properties

C:\Users\ashish\.ssh>type known_hosts
192.168.1.151 ecdsa-sha2-nistp256 A***EhMzjgo=
ashishlaptop ecdsa-sha2-nistp256 A***EhMzjgo=

C:\Users\ashish\.ssh>ipconfig

Windows IP Configuration

Ethernet adapter Ethernet:
   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . : ad.itl.com

Ethernet adapter Ethernet 2:
   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . : ad.itl.com

Wireless LAN adapter Wi-Fi:
   Connection-specific DNS Suffix  . :
   IPv6 Address. . . . . . . . . . . : 2401:4900:47f5:1737:b1b2:6d59:f669:1b96
   Temporary IPv6 Address. . . . . . : 2401:4900:47f5:1737:88e0:bacc:7490:e794
   Link-local IPv6 Address . . . . . : fe80::b1b2:6d59:f669:1b96%13
   IPv4 Address. . . . . . . . . . . : 192.168.1.101
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : fe80::44da:eaff:feb6:7061%13
                                       192.168.1.1

Ethernet adapter Bluetooth Network Connection:
   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . :

C:\Users\ashish\.ssh>dir
 Volume in drive C is OSDisk
 Volume Serial Number is 88CC-6EA2

 Directory of C:\Users\ashish\.ssh

10/26/2022  03:45 PM    <DIR>          .
10/26/2022  03:45 PM    <DIR>          ..
10/26/2022  03:45 PM               574 authorized_keys
10/26/2022  03:27 PM             2,635 id_rsa
10/26/2022  03:27 PM               593 id_rsa.pub
10/26/2022  03:46 PM               351 known_hosts
               4 File(s)          4,153 bytes
               2 Dir(s)  78,491,791,360 bytes free

C:\Users\ashish\.ssh>type authorized_keys
ssh-rsa A***= ashish@ashishlaptop

C:\Users\ashish\.ssh>

6: Error We Faced While Running Spark's start-all.sh in Git Bash

$ ./sbin/start-all.sh ps: unknown option -- o Try `ps --help' for more information. hostname: unknown option -- f starting org.apache.spark.deploy.master.Master, logging to D:\progfiles\spark-3.3.1-bin-hadoop3/logs/spark--org.apache.spark.deploy.master.Master-1-CS3L.out ps: unknown option -- o ... Try `ps --help' for more information. failed to launch: nice -n 0 D:\progfiles\spark-3.3.1-bin-hadoop3/bin/spark-class org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080 ps: unknown option -- o Try `ps --help' for more information. Spark Command: C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin\java -cp D:\progfiles\spark-3.3.1-bin-hadoop3/conf\;D:\progfiles\spark-3.3.1-bin-hadoop3\jars\* -Xmx1g org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080 ======================================== "C:\Program Files\Eclipse Adoptium\jdk-8.0.322.6-hotspot\bin\java" -cp "D:\progfiles\spark-3.3.1-bin-hadoop3/conf\;D:\progfiles\spark-3.3.1-bin-hadoop3\jars\*" -Xmx1g org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080 D:\progfiles\spark-3.3.1-bin-hadoop3/bin/spark-class: line 96: CMD: bad array subscript full log in D:\progfiles\spark-3.3.1-bin-hadoop3/logs/spark--org.apache.spark.deploy.master.Master-1-CS3L.out ps: unknown option -- o Try `ps --help' for more information. hostname: unknown option -- f Try 'hostname --help' for more information. ps: unknown option -- o Try `ps --help' for more information. CS3L: ssh: connect to host cs3l port 22: Connection refused

7: Spark-submit failing in CMD without HADOOP_HOME set

(base) D:\progfiles\spark-3.3.1-bin-hadoop3\bin>spark-submit --master local examples/src/main/python/pi.py 100
22/11/01 11:59:42 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/11/01 11:59:42 INFO ShutdownHookManager: Shutdown hook called
22/11/01 11:59:42 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-521c5e3c-beea-4f67-a1e6-71dd4c5c308c

8: Instructions for Configuring HADOOP_HOME for Spark on Windows

Installing winutils

Let’s download winutils.exe and configure our Spark installation to find it.

a) Create a hadoop\bin folder inside the SPARK_HOME folder.
b) Download the winutils.exe built for the Hadoop version your Spark distribution was built against. In my case the Spark package name mentions Hadoop 3, so I downloaded winutils.exe for Hadoop 3.0.0 and copied it into the hadoop\bin folder inside the SPARK_HOME folder.
c) Create a system environment variable in Windows called SPARK_HOME that points to the Spark installation folder.
d) Create another system environment variable in Windows called HADOOP_HOME that points to the hadoop folder inside the SPARK_HOME folder. Since the hadoop folder lives inside the SPARK_HOME folder, it is better to define HADOOP_HOME with the value %SPARK_HOME%\hadoop; that way you don't have to change HADOOP_HOME if SPARK_HOME is updated.

Ref: Download winutils.exe From GitHub
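After copying the file, a quick sanity check from Python can confirm that winutils.exe sits where Hadoop expects it, i.e. at %HADOOP_HOME%\bin\winutils.exe. This is only a sketch of such a check, not part of the original instructions.

import os
from pathlib import Path

hadoop_home = os.environ.get("HADOOP_HOME", "")
winutils = Path(hadoop_home) / "bin" / "winutils.exe"
print("HADOOP_HOME :", hadoop_home or "<not set>")
print("winutils.exe:", "found" if winutils.is_file() else "missing", "->", winutils)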

9: Error if PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are not set

A standard way of setting environment variables, including PYSPARK_PYTHON, is the conf/spark-env.sh file. Spark ships with a template (conf/spark-env.sh.template) that explains the most common options. It is a normal Bash script, so you can use it the same way you would use .bashrc. (On Windows, Spark's launch scripts look for conf\spark-env.cmd instead, and setting the variables as system environment variables works as well.)
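When Spark is driven from a Python script instead of spark-submit, another option is to set the two variables in os.environ before the SparkSession is created. The sketch below assumes the Anaconda interpreter path shown earlier on this machine; adjust it for your own setup.

import os

os.environ["PYSPARK_PYTHON"] = r"C:\Users\ashish\Anaconda3\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\ashish\Anaconda3\python.exe"

from pyspark.sql import SparkSession

# Local-mode session; the worker Python is taken from the variables set above
spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
print(spark.version)
spark.stop()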

9.1

(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/11/01 12:21:02 INFO ShutdownHookManager: Shutdown hook called
22/11/01 12:21:02 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-18bce9f2-f7e8-4d04-843b-8c04a27675a7

9.2

(base) D:\progfiles\spark-3.3.1-bin-hadoop3>where python
C:\Users\ashish\Anaconda3\python.exe
C:\Users\ashish\AppData\Local\Microsoft\WindowsApps\python.exe

(base) D:\progfiles\spark-3.3.1-bin-hadoop3>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

9.3

IN CMD: D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md Exception in thread "main" java.io.IOException: Cannot run program "C:\Users\ashish\Anaconda3": CreateProcess error=5, Access is denied at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97) at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.IOException: CreateProcess error=5, Access is denied at java.lang.ProcessImpl.create(Native Method) at java.lang.ProcessImpl.<init>(ProcessImpl.java:453) at java.lang.ProcessImpl.start(ProcessImpl.java:139) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ... 14 more 22/11/01 12:32:10 INFO ShutdownHookManager: Shutdown hook called 22/11/01 12:32:10 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-46c29762-efc3-425a-98fd-466b1500aa5b

9.4

IN ANACONDA: (base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md Exception in thread "main" java.io.IOException: Cannot run program "C:\Users\ashish\Anaconda3": CreateProcess error=5, Access is denied at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97) at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.IOException: CreateProcess error=5, Access is denied at java.lang.ProcessImpl.create(Native Method) at java.lang.ProcessImpl.<init>(ProcessImpl.java:453) at java.lang.ProcessImpl.start(ProcessImpl.java:139) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ... 14 more 22/11/01 12:34:25 INFO ShutdownHookManager: Shutdown hook called 22/11/01 12:34:25 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-7480dad3-92f5-41cb-98ec-e02fee0eaff5 (base) D:\progfiles\spark-3.3.1-bin-hadoop3>

10: Successful Run of Word Count Program

(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit examples\src\main\python\wordcount.py README.md 22/11/01 12:37:22 INFO SparkContext: Running Spark version 3.3.1 22/11/01 12:37:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/11/01 12:37:23 INFO ResourceUtils: ============================================================== 22/11/01 12:37:23 INFO ResourceUtils: No custom resources configured for spark.driver. 22/11/01 12:37:23 INFO ResourceUtils: ============================================================== 22/11/01 12:37:23 INFO SparkContext: Submitted application: PythonWordCount 22/11/01 12:37:23 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 22/11/01 12:37:23 INFO ResourceProfile: Limiting resource is cpu 22/11/01 12:37:23 INFO ResourceProfileManager: Added ResourceProfile id: 0 22/11/01 12:37:23 INFO SecurityManager: Changing view acls to: ashish 22/11/01 12:37:23 INFO SecurityManager: Changing modify acls to: ashish 22/11/01 12:37:23 INFO SecurityManager: Changing view acls groups to: 22/11/01 12:37:23 INFO SecurityManager: Changing modify acls groups to: 22/11/01 12:37:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashish); groups with view permissions: Set(); users with modify permissions: Set(ashish); groups with modify permissions: Set() 22/11/01 12:37:24 INFO Utils: Successfully started service 'sparkDriver' on port 52785. 22/11/01 12:37:24 INFO SparkEnv: Registering MapOutputTracker 22/11/01 12:37:24 INFO SparkEnv: Registering BlockManagerMaster 22/11/01 12:37:24 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 22/11/01 12:37:24 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 22/11/01 12:37:24 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 22/11/01 12:37:24 INFO DiskBlockManager: Created local directory at C:\Users\ashish\AppData\Local\Temp\blockmgr-09454369-56f1-4fae-a0d9-b5f19b6a8bd1 22/11/01 12:37:24 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB 22/11/01 12:37:24 INFO SparkEnv: Registering OutputCommitCoordinator 22/11/01 12:37:24 INFO Utils: Successfully started service 'SparkUI' on port 4040. 22/11/01 12:37:25 INFO Executor: Starting executor ID driver on host CS3L.ad.itl.com 22/11/01 12:37:25 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): '' 22/11/01 12:37:25 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52828. 
22/11/01 12:37:25 INFO NettyBlockTransferService: Server created on CS3L.ad.itl.com:52828 22/11/01 12:37:25 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 22/11/01 12:37:25 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, CS3L.ad.itl.com, 52828, None) 22/11/01 12:37:25 INFO BlockManagerMasterEndpoint: Registering block manager CS3L.ad.itl.com:52828 with 366.3 MiB RAM, BlockManagerId(driver, CS3L.ad.itl.com, 52828, None) 22/11/01 12:37:25 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, CS3L.ad.itl.com, 52828, None) 22/11/01 12:37:25 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, CS3L.ad.itl.com, 52828, None) 22/11/01 12:37:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir. 22/11/01 12:37:26 INFO SharedState: Warehouse path is 'file:/D:/progfiles/spark-3.3.1-bin-hadoop3/spark-warehouse'. 22/11/01 12:37:27 INFO InMemoryFileIndex: It took 51 ms to list leaf files for 1 paths. 22/11/01 12:37:31 INFO FileSourceStrategy: Pushed Filters: 22/11/01 12:37:31 INFO FileSourceStrategy: Post-Scan Filters: 22/11/01 12:37:31 INFO FileSourceStrategy: Output Data Schema: struct<value: string> 22/11/01 12:37:31 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 349.6 KiB, free 366.0 MiB) 22/11/01 12:37:31 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 33.9 KiB, free 365.9 MiB) 22/11/01 12:37:31 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on CS3L.ad.itl.com:52828 (size: 33.9 KiB, free: 366.3 MiB) 22/11/01 12:37:31 INFO SparkContext: Created broadcast 0 from javaToPython at NativeMethodAccessorImpl.java:0 22/11/01 12:37:31 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes. 
22/11/01 12:37:32 INFO SparkContext: Starting job: collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38 22/11/01 12:37:32 INFO DAGScheduler: Registering RDD 6 (reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35) as input to shuffle 0 22/11/01 12:37:32 INFO DAGScheduler: Got job 0 (collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) with 1 output partitions 22/11/01 12:37:32 INFO DAGScheduler: Final stage: ResultStage 1 (collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) 22/11/01 12:37:32 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 22/11/01 12:37:32 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0) 22/11/01 12:37:32 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[6] at reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35), which has no missing parents 22/11/01 12:37:32 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 19.4 KiB, free 365.9 MiB) 22/11/01 12:37:32 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 10.2 KiB, free 365.9 MiB) 22/11/01 12:37:32 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on CS3L.ad.itl.com:52828 (size: 10.2 KiB, free: 366.3 MiB) 22/11/01 12:37:32 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1513 22/11/01 12:37:32 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[6] at reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35) (first 15 tasks are for partitions Vector(0)) 22/11/01 12:37:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0 22/11/01 12:37:32 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (CS3L.ad.itl.com, executor driver, partition 0, PROCESS_LOCAL, 4914 bytes) taskResourceAssignments Map() 22/11/01 12:37:32 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 22/11/01 12:37:34 INFO FileScanRDD: Reading File path: file:///D:/progfiles/spark-3.3.1-bin-hadoop3/README.md, range: 0-4585, partition values: [empty row] 22/11/01 12:37:34 INFO CodeGenerator: Code generated in 316.0539 ms 22/11/01 12:37:34 INFO PythonRunner: Times: total = 1638, boot = 1131, init = 504, finish = 3 22/11/01 12:37:34 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 
1928 bytes result sent to driver 22/11/01 12:37:34 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2463 ms on CS3L.ad.itl.com (executor driver) (1/1) 22/11/01 12:37:34 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 22/11/01 12:37:34 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 52829 22/11/01 12:37:34 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:35) finished in 2.675 s 22/11/01 12:37:34 INFO DAGScheduler: looking for newly runnable stages 22/11/01 12:37:34 INFO DAGScheduler: running: Set() 22/11/01 12:37:34 INFO DAGScheduler: waiting: Set(ResultStage 1) 22/11/01 12:37:34 INFO DAGScheduler: failed: Set() 22/11/01 12:37:34 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[9] at collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38), which has no missing parents 22/11/01 12:37:35 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 9.5 KiB, free 365.9 MiB) 22/11/01 12:37:35 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 5.7 KiB, free 365.9 MiB) 22/11/01 12:37:35 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on CS3L.ad.itl.com:52828 (size: 5.7 KiB, free: 366.3 MiB) 22/11/01 12:37:35 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1513 22/11/01 12:37:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (PythonRDD[9] at collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) (first 15 tasks are for partitions Vector(0)) 22/11/01 12:37:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks resource profile 0 22/11/01 12:37:35 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (CS3L.ad.itl.com, executor driver, partition 0, NODE_LOCAL, 4271 bytes) taskResourceAssignments Map() 22/11/01 12:37:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 1) 22/11/01 12:37:35 INFO ShuffleBlockFetcherIterator: Getting 1 (3.2 KiB) non-empty blocks including 1 (3.2 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks 22/11/01 12:37:35 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 20 ms 22/11/01 12:37:36 INFO PythonRunner: Times: total = 1154, boot = 1133, init = 20, finish = 1 22/11/01 12:37:36 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 6944 bytes result sent to driver 22/11/01 12:37:36 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1323 ms on CS3L.ad.itl.com (executor driver) (1/1) 22/11/01 12:37:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 22/11/01 12:37:36 INFO DAGScheduler: ResultStage 1 (collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38) finished in 1.351 s 22/11/01 12:37:36 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job 22/11/01 12:37:36 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished 22/11/01 12:37:36 INFO DAGScheduler: Job 0 finished: collect at D:\progfiles\spark-3.3.1-bin-hadoop3\examples\src\main\python\wordcount.py:38, took 4.172433 s #: 1 Apache: 1 Spark: 15 ... project.: 1 22/11/01 12:37:36 INFO SparkUI: Stopped Spark web UI at http://CS3L.ad.itl.com:4040 22/11/01 12:37:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 
22/11/01 12:37:36 INFO MemoryStore: MemoryStore cleared 22/11/01 12:37:36 INFO BlockManager: BlockManager stopped 22/11/01 12:37:36 INFO BlockManagerMaster: BlockManagerMaster stopped 22/11/01 12:37:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 22/11/01 12:37:36 INFO SparkContext: Successfully stopped SparkContext 22/11/01 12:37:37 INFO ShutdownHookManager: Shutdown hook called 22/11/01 12:37:37 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-d74fb7a9-5e3a-417e-b815-7f3de7efb44b 22/11/01 12:37:37 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-d74fb7a9-5e3a-417e-b815-7f3de7efb44b\pyspark-0c5fc2b0-e2ab-4e28-a021-2f57c08d8d8c 22/11/01 12:37:37 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-20206684-dcc9-4326-976b-d9499bc5e483 (base) D:\progfiles\spark-3.3.1-bin-hadoop3>

11: Missing PyArrow Package Issue and Its Resolution

(base) D:\progfiles\spark-3.3.1-bin-hadoop3>bin\spark-submit C:\Users\ashish\Desktop\mh\Code\pandas_to_pyspark\pyinstaller\script.py Traceback (most recent call last): File "D:\progfiles\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\pandas\utils.py", line 53, in require_minimum_pyarrow_version ModuleNotFoundError: No module named 'pyarrow' The above exception was the direct cause of the following exception: Traceback (most recent call last): File "C:\Users\ashish\Desktop\mh\Code\pandas_to_pyspark\pyinstaller\script.py", line 1, in <module> from pyspark import pandas as pd File "<frozen zipimport>", line 259, in load_module File "D:\progfiles\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\pandas\__init__.py", line 34, in <module> File "D:\progfiles\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\pandas\utils.py", line 60, in require_minimum_pyarrow_version ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found. 22/11/01 12:41:57 INFO ShutdownHookManager: Shutdown hook called 22/11/01 12:41:57 INFO ShutdownHookManager: Deleting directory C:\Users\ashish\AppData\Local\Temp\spark-3293d857-5ddc-4fcf-8dcc-8d7b39a8f8cc

Resolution

CMD> pip install pyarrow

(base) C:\Users\ashish>pip show pyarrow
Name: pyarrow
Version: 10.0.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author:
Author-email:
License: Apache License, Version 2.0
Location: c:\users\ashish\anaconda3\lib\site-packages
Requires: numpy
Required-by:
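A quick way to confirm the fix (a sketch): check the installed PyArrow version and re-run the import that raised the error above, e.g. through bin\spark-submit as before.

import pyarrow
print(pyarrow.__version__)  # should be >= 1.0.0 for the pandas API on Spark

from pyspark import pandas as ppd  # the import that failed above; no longer raises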
Tags: Technology,Spark,

Way 5: WRT Cell Value Replacement Through Assignment (Ways in which Pandas API on PySpark differs from Plain Pandas)

Download Code

Cell Value Replacement Through Assignment in Pandas

import pandas as pd

df = pd.DataFrame({
    'col1': ["alpha", "beta", "gamma"],
    'col2': ['beta', 'gamma', 'alpha'],
    'col3': ['gamma', 'alpha', 'beta']
})

df[df == 'alpha'] = 'delta'
df

Error in PySpark For The Same Code: Unhashable type: 'DataFrame'

from pyspark import pandas as ppd

df_ppd = ppd.DataFrame({
    'col1': ["alpha", "beta", "gamma"],
    'col2': ['beta', 'gamma', 'alpha'],
    'col3': ['gamma', 'alpha', 'beta']
})

df_ppd[df_ppd == 'alpha'] = 'delta'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [13], line 1
----> 1 df_ppd[df_ppd == 'alpha'] = 'delta'

File ~/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/pandas/frame.py:12355, in DataFrame.__setitem__(self, key, value)
  12352     psdf = self._assign({k: value[c] for k, c in zip(key, field_names)})
  12353 else:
  12354     # Same Series.
> 12355     psdf = self._assign({key: value})
  12357 self._update_internal_frame(psdf._internal)

TypeError: unhashable type: 'DataFrame'

Alternate Way

df_ppd = df_ppd.replace(to_replace = ['alpha'], value = "delta")
df_ppd = df_ppd.replace(to_replace = ['beta', 'gamma'], value = "epsilon")
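For reference, in plain pandas the boolean assignment shown at the top of this post can also be written with DataFrame.mask, which replaces values where the condition holds and returns a new DataFrame instead of mutating in place. A minimal self-contained sketch, shown only for contrast with the replace() workaround that works on the pandas-on-Spark DataFrame:

import pandas as pd

pdf = pd.DataFrame({
    'col1': ["alpha", "beta", "gamma"],
    'col2': ['beta', 'gamma', 'alpha'],
    'col3': ['gamma', 'alpha', 'beta']
})

# Equivalent to pdf[pdf == 'alpha'] = 'delta', but without in-place mutation
print(pdf.mask(pdf == 'alpha', 'delta'))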
Also Check: Way 4: With respect to DataFrame.replace() Method (Ways in which Pandas API on PySpark differs from Plain Pandas)
Tags: Spark,Technology,

Sunday, October 30, 2022

Do we need all the one-hot features?


import pandas as pd
df = pd.read_csv('in/titanic_train.csv')
df.head()

enc_pc_df = pd.get_dummies(df, columns = ['Pclass'])
enc_pc_df.head()

Hypothetical Question

Q: If I remove one column (it could be the first, it could be the last) from a one-hot feature matrix with 'n' columns, can I reproduce the original matrix from the remaining 'n-1' columns? Put differently: how do we get the original matrix back, i.e. 'put back the dropped column'?

Answer: For each row, if there is no '1' among the remaining n-1 values, then the dropped value in that row was 1; otherwise it was 0. (Assumption: the encoded feature has 'n' categories, so the full one-hot matrix has 'n' columns with exactly one '1' per row.)

Conclusion: even after removing one column from the one-hot feature matrix, there is no loss of information. Any one column's value is determined by the values of the other n-1 columns.
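The reconstruction rule can be checked on a toy column (a sketch, not from the original post): build the full dummy matrix, drop one column, and recover it from the remaining n-1 columns.

import pandas as pd

toy = pd.DataFrame({'Pclass': [1, 3, 2, 3, 1]})
full = pd.get_dummies(toy, columns=['Pclass'])   # Pclass_1, Pclass_2, Pclass_3
reduced = full.drop(columns=['Pclass_1'])        # drop one dummy column

# If no '1' appears among the remaining n-1 dummies of a row, the dropped
# Pclass_1 must have been 1 for that row; otherwise it was 0.
recovered = (reduced[['Pclass_2', 'Pclass_3']].sum(axis=1) == 0).astype(int)
print((recovered.values == full['Pclass_1'].astype(int).values).all())  # True: no information lost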

So, what is the solution to remove this redundancy?

We drop the first column from the one-hot feature matrix:

enc_pc_df = pd.get_dummies(df, columns = ['Pclass'], drop_first = True)
enc_pc_df.head()

The default value of the "drop_first" parameter is False.
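scikit-learn's OneHotEncoder exposes the same idea through its drop parameter (an aside, assuming a reasonably recent scikit-learn is installed): drop='first' removes the first category of each encoded feature, just like drop_first=True in get_dummies.

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(drop='first')            # drop the first category of each feature
encoded = enc.fit_transform(df[['Pclass']])  # sparse matrix with n-1 dummy columns
print(enc.get_feature_names_out(['Pclass']))
print(encoded.toarray()[:5])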
Tags: Technology,Machine Learning,

Installing 'Category Encoders' Python Package Using Pip And Conda

You can install Category Encoders package from both Pip and Conda-forge.
But there is one subtle difference in the package name between the two repositories.

In PyPI, it is: category-encoders 

In Conda-forge, it is: category_encoders

Now, create the environment with a minimal env.yml:

Contents of the file:

name: cat_enc
channels:
  - conda-forge
dependencies:
  - scikit-learn
  - pandas
  - ipykernel
  - jupyterlab
  - category_encoders

Conda Commands

$ conda remove -n cat_enc --all
$ conda env create -f env.yml
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages
category_encoders-2. | 62 KB   | ### | 100%
...
python-3.10.6        | 29.0 MB | ### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate cat_enc
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Retrieving notices: ...working... done

(base) $ conda activate cat_enc
(cat_enc) $ python -m ipykernel install --user --name cat_enc
Installed kernelspec cat_enc in /home/ashish/.local/share/jupyter/kernels/cat_enc
(cat_enc) $
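Once the environment is active, a quick smoke test confirms that the package imports under the name category_encoders regardless of which repository it came from. The toy DataFrame below is hypothetical and not part of the original post.

import pandas as pd
import category_encoders as ce

toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green'],
                    'target': [1, 0, 1, 0]})
encoder = ce.OneHotEncoder(cols=['color'], use_cat_names=True)
print(encoder.fit_transform(toy))  # 'target' passes through untouched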
Tags: Technology,Machine Learning,