import seaborn as sns
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Main entry point for DataFrame and SQL functionality.
sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

import pyspark
print(pyspark.__version__)
3.3.0

import pandas as pd
df_student = pd.read_csv('./input/student.csv')
df_student

Aim: To retrieve the first letter from a column of string type
In Pandas
df_student['first_letter'] = df_student['FirstName'].str[0]
df_student

In Pandas API on Spark
from pyspark import pandas as ppd
df_student_ppd = ppd.read_csv('./input/student.csv')
df_student_ppd

Errors in Pandas API on Spark when we try the plain Pandas approach
1. Indexing the str accessor, as we do in plain Pandas, fails in Pandas API on Spark:
df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str[0]
TypeError: 'StringMethods' object is not subscriptable

2. Assigning the str accessor itself to a column also fails:
df_student_ppd['first_letter'] = df_student_ppd['FirstName'].str
TypeError: Column assignment doesn't support type StringMethods
# The type is pyspark.pandas.strings.StringMethods, as item 3 shows.

3. Inspecting the accessor shows its type:
df_student_ppd['FirstName'].str
<pyspark.pandas.strings.StringMethods at 0x7f7474157520>

How we resolved it:
df_student_ppd['FirstName'] = df_student_ppd['FirstName'].astype(str)
# Without the conversion above, None values in the column raise an error in the next step:
# TypeError: 'NoneType' object is not subscriptable
df_student_ppd['first_letter'] = df_student_ppd['FirstName'].apply(lambda x: x[0])

Warning:
/home/ashish/anaconda3/envs/mh/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():

df_student_ppd
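
One thing to watch with the astype(str) step: as far as I can tell, it turns missing values into the literal string 'None', so a missing FirstName would get 'N' as its first letter. A possible alternative is to keep the missing values and make the function passed to apply tolerate them. The sketch below is plain Python and not something taken from the original notebook; the helper name first_letter and the sample list are hypothetical and only stand in for the FirstName column.

```python
def first_letter(value):
    """Return the first character of a string, or None if there is none.

    Anything that is not a non-empty string (None, NaN, '') yields None,
    so missing names stay missing instead of raising a TypeError.
    """
    if not isinstance(value, str) or value == "":
        return None
    return value[0]


# Hypothetical sample values standing in for the FirstName column.
names = ["Ashish", None, "", "Zoe"]
letters = [first_letter(name) for name in names]
print(letters)  # ['A', None, None, 'Z']
```

Assuming apply hands each value to the function the same way it does for the lambda above, this could be used as df_student_ppd['FirstName'].apply(first_letter) without the astype(str) conversion; I have not run that against Spark here. The pyspark.pandas string accessor also documents get and slice methods (for example .str.get(0)), which may be worth trying as a closer match to the plain Pandas .str[0].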
Tuesday, October 25, 2022
Way 2: Difference in how access to str representation is provided (Ways in which Pandas API on PySpark differs from Plain Pandas)
Labels: Spark, Technology