Tuesday, October 25, 2022

Way 2: Difference in how the .str accessor can be used (Ways in which Pandas API on Spark differs from Plain Pandas)


import seaborn as sns
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark import SparkContext
from pyspark.sql import SQLContext # Main entry point for DataFrame and SQL functionality.

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

import pyspark
print(pyspark.__version__)

3.3.0


import pandas as pd
df_student = pd.read_csv('./input/student.csv')
df_student

Aim: To retrieve the first letter from a column of string type

In Pandas

df_student['first_letter'] = df_student['FirstName'].str[0]
df_student
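As a quick self-contained check of the plain Pandas behaviour, here is a minimal sketch on a small made-up frame. The names in it are invented for illustration and are not taken from student.csv. It shows that indexing the .str accessor and calling .str.get(0) give the same first letters, and that a missing value stays missing instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data; the real student.csv is not reproduced here.
df_demo = pd.DataFrame({'FirstName': ['Ashish', 'Bina', None]})

# Indexing the .str accessor takes the character at position 0 of each value.
df_demo['first_letter'] = df_demo['FirstName'].str[0]

# .str.get(0) is the method form of the same operation.
first_letter_via_get = df_demo['FirstName'].str.get(0)

print(df_demo)
print(first_letter_via_get)
```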

In Pandas API on Spark

from pyspark import pandas as ppd
df_student_ppd = ppd.read_csv('./input/student.csv')
df_student_ppd

Errors raised by Pandas API on Spark when we try the plain Pandas approach

1.

# In Pandas API on Spark
df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str[0]

TypeError: 'StringMethods' object is not subscriptable

2.

df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str

TypeError: Column assignment doesn't support type StringMethods

The type in question is pyspark.pandas.strings.StringMethods, as shown below.

3.

df_student_ppd['FirstName'].str

<pyspark.pandas.strings.StringMethods at 0x7f7474157520>

How we resolved it:

df_student_ppd['FirstName'] = df_student_ppd['FirstName'].astype(str)
# If we do not do the above transformation, None values will result in an error:
# TypeError: 'NoneType' object is not subscriptable

df_student_ppd['first_letter'] = df_student_ppd['FirstName'].apply(lambda x: x[0])

Warning:
/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():

df_student_ppd
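One thing to watch with this workaround: if the cast turns a missing name into the text 'None', as str(None) does in plain Python, the first letter of a missing name would come out as 'N'. A possible alternative is a small null-safe helper passed to apply in place of the lambda. The sketch below is only that, a sketch: first_letter is a hypothetical helper name, and it is exercised here on plain Python values only, not against Spark.

```python
def first_letter(value):
    """Return the first character of a non-empty string, else None."""
    if isinstance(value, str) and value:
        return value[0]
    return None


# Plain-Python check of the helper on hypothetical values.
print([first_letter(v) for v in ['Ashish', '', None]])  # ['A', None, None]

# Assumed usage with the frame from this post (untested here):
# df_student_ppd['first_letter'] = df_student_ppd['FirstName'].apply(first_letter)
```

The pandas API on Spark reference also lists Series.str.get and Series.str.slice, which might avoid apply altogether; that route was not tried in this post.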
Tags: Technology,Spark
