Tuesday, December 13, 2022

Accessing PostgreSQL from PySpark.Pandas

Installing a couple of packages

$ conda install sqlalchemy -c conda-forge $ conda install psycopg2 -c conda-forge

Data Ingestion Via PostGRE SQL terminal

PostGRE is case insensitive: These two statements are same: 1: CREATE TABLE states_and_capitals( state_name TEXT NOT NULL, capital_name TEXT NOT NULL, PRIMARY KEY( state_name, capital_name ) ); 2: create table states_and_capitals (state_name text not null, capital_name text not null, primary key(state_name, capital_name)); postgres=# CREATE TABLE states_and_capitals( state_name TEXT NOT NULL, capital_name TEXT NOT NULL, PRIMARY KEY( state_name, capital_name ) ); CREATE TABLE postgres=# create table states_and_capitals (state_name text not null, capital_name text not null, primary key(state_name, capital_name)); ERROR: relation "states_and_capitals" already exists postgres=# postgres=# select * from states_and_capitals; state_name | capital_name ------------+-------------- (0 rows) You can also view your table in the output of this command: select * from information_schema.tables;
INSERT INTO states_and_capitals (state_name, capital_name) VALUES ('haryana', 'chandigarh'); INSERT INTO states_and_capitals (state_name, capital_name) VALUES ('punjab', 'chandigarh'); Let's say you made a spelling mistake in a record. Then you can delete that record using the following statement: DELETE FROM states_and_capitals WHERE state_name = 'harayana'; postgres=# select * from states_and_capitals; state_name | capital_name ------------+-------------- punjab | chandigarh haryana | chandigarh (2 rows) postgres=# postgres=# \password postgres Enter new password for user "postgres": postgres Enter it again: postgres postgres=#

Reading data from PostgreSQL into Plain Pandas

from sqlalchemy import create_engine import pandas as pd engine = create_engine('postgresql://postgres:postgres@localhost/postgres') df = pd.read_sql("""SELECT * FROM states_and_capitals""", engine) df

Issue while trying with pyspark.pandas

jdbc_engine = 'jdbc:postgresql://postgres:postgres@localhost/postgres' query = """SELECT * FROM states_and_capitals""" import pyspark.pandas as pspd from pyspark import SparkContext from pyspark.sql import SQLContext sc = SparkContext.getOrCreate() sqlCtx = SQLContext(sc) df_pspd = pspd.read_sql(query, jdbc_engine) Py4JJavaError: An error occurred while calling o48.load. : java.sql.SQLException: No suitable driver
Tags: Technology,Spark,Database,

No comments:

Post a Comment