survival8: Way 2: Difference in how access to str representation is provided (Ways in which Pandas API on PySpark differs from Plain Pandas)

Tuesday, October 25, 2022




from pyspark.sql import SparkSession  # Main entry point for DataFrame and SQL functionality.

# SQLContext is deprecated since Spark 3.0; SparkSession replaces it.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

import pyspark
print(pyspark.__version__)

3.3.0


import pandas as pd
df_student = pd.read_csv('./input/student.csv')
df_student



Aim: To retrieve the first letter from a column of string type
In Pandas

df_student['first_letter'] = df_student['FirstName'].str[0] 
df_student
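To make the plain Pandas behaviour concrete, here is a minimal sketch on a small in-memory frame. The sample names are hypothetical (the contents of student.csv are not shown in this post); only the FirstName column name is taken from the code above. It shows that .str[0] picks the first character and leaves missing values missing.

```python
import pandas as pd

# Hypothetical sample data; the real student.csv is not shown in this post.
df = pd.DataFrame({"FirstName": ["Asha", "Bob", None]})

# .str[0] indexes into each string; a missing value stays missing (NaN).
df["first_letter"] = df["FirstName"].str[0]
print(df)
```

Note that no cast is needed in plain Pandas: the .str accessor skips missing values on its own.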




In Pandas API on Spark
from pyspark import pandas as ppd
df_student_ppd = ppd.read_csv('./input/student.csv')
df_student_ppd




Errors in Pandas API on Spark when we try the plain Pandas approach

1. Subscripting the .str accessor fails:
df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str[0]
TypeError: 'StringMethods' object is not subscriptable

2. Assigning the .str accessor itself to a column fails:
df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str
TypeError: Column assignment doesn't support type StringMethods

3. The accessor is a pyspark.pandas.strings.StringMethods object, not a Series:
df_student_ppd['FirstName'].str
<pyspark.pandas.strings.StringMethods at 0x7f7474157520>
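Since the errors above come from subscripting the accessor, one possible alternative is to stay with method calls on it. Both plain Pandas and the Pandas API on Spark document Series.str.slice and Series.str.get, so a helper built only on those should be portable. The sketch below is an assumption on our part: first_letter is a hypothetical helper, the sample names are made up, and the demo runs on plain Pandas so it works without a Spark session. We have not run it against pyspark 3.3.0 here.

```python
import pandas as pd


def first_letter(series):
    """Return the first character of each string using only .str methods."""
    # .str.slice(0, 1) is a method call, so the accessor is never subscripted.
    return series.str.slice(0, 1)


# Demonstrated on plain Pandas; the sample names are hypothetical.
names = pd.Series(["Asha", "Bob", None], name="FirstName")
letters = first_letter(names)
print(letters.tolist())
```

With a Pandas API on Spark frame the call would look like first_letter(df_student_ppd['FirstName']), and .str.get(0) is a similar method-call form worth trying.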


How we resolved it:

# Cast the column to str first. Without this cast, None values make the lambda below fail with:
# TypeError: 'NoneType' object is not subscriptable
df_student_ppd['FirstName'] = df_student_ppd['FirstName'].astype(str)

# The lambda receives a plain Python string, which can be subscripted.
df_student_ppd['first_letter'] = df_student_ppd['FirstName'].apply(lambda x: x[0])

Warning:
/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():


df_student_ppd
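The apply workaround above relies on casting to str so that None never reaches x[0]. A side effect to watch for is that the cast may turn a missing value into a literal string, whose first letter then looks like real data, and an empty string would still break x[0]. A hedged alternative is to guard inside the function instead. In the sketch below, safe_first_letter is a hypothetical helper and the sample names are made up; the demo uses plain Pandas so it runs without Spark.

```python
import pandas as pd


def safe_first_letter(value):
    """Return the first character of a non-empty string, else None."""
    if isinstance(value, str) and value:
        return value[0]
    # Covers None, NaN and the empty string.
    return None


# Hypothetical sample data with an empty and a missing name.
names = pd.Series(["Asha", "", None], name="FirstName", dtype=object)

# No astype(str) is needed because the function handles missing values itself.
via_guard = names.apply(safe_first_letter).tolist()
print(via_guard)
```

On the Spark side the same function could be passed to df_student_ppd['FirstName'].apply(...); we have not verified how the inferred return type behaves there, so treat it as a starting point.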