Showing posts with label Database. Show all posts
Showing posts with label Database. Show all posts

Friday, November 24, 2023

Hour One With SQLite Database

What does SQLite database contain?

SQLite Database has data stored in it in the form of tables.

What does a table contain?

It contains rows and columns.

A row is also called a record or tuple.

A column is also called an attribute.

Let's say we have a table called 'marks'.

What all columns can table of marks contain?

1. Year

2. Class

3. Subject

4. Marks

Four Basic Types of Operations We Do in Databases

Assuming you have a database with you:

Insert: To add or create a new record or tuple in your database.

Select: To read a records or tuples present in the database.

Update: To change an existing record or tuple in your database.

Delete: To remove or delete a record or tuple from the database.

Can be remembered using the acronym: CRUD for Create, Read, Update, Delete

Single File For Storing a Database

Note: SQLite uses a single file for storing data related a database. For example: we have two databases with us and there names are 'marks.db' and 'sharks.db'.

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ ls

marks.csv marks.db sharks.db

(base) ashish@ashish:~/Desktop/ws/sqlite_db$

Creating a Database file and Creating a Table

(base) ashish@ashish:~$ sqlite3 sharks.db

# Here, sqlit3 is your application and sharks.db is the filename for that application to open.

SQLite version 3.39.3 2022-09-05 11:02:23
Enter '.help' for usage hints.

sqlite> CREATE TABLE sharks(id integer NOT NULL, name text NOT NULL, sharktype text NOT NULL, length integer NOT NULL);

INSERT INTO sharks VALUES (1, 'Sammy', 'Greenland Shark', 427);
INSERT INTO sharks VALUES (2, 'Alyoshka', 'Great White Shark', 600);
INSERT INTO sharks VALUES (3, 'Himari', 'Megaladon', 1800);

sqlite> select * from sharks;


1|Sammy|Greenland Shark|427
2|Alyoshka|Great White Shark|600
3|Himari|Megaladon|1800    

sqlite>

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 sharks.db

SQLite version 3.39.3 2022-09-05 11:02:23

Enter '.help' for usage hints.

sqlite> .schema sharks

CREATE TABLE sharks(id integer NOT NULL, name text NOT NULL, sharktype text NOT NULL, length integer NOT NULL);

sqlite>

Viewing the Structure of a Table

Exploring a Database file

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 sharks.db

SQLite version 3.39.3 2022-09-05 11:02:23

Enter '.help' for usage hints.

sqlite> .tables

sharks

sqlite> select * from sharks;

1|Sammy|Greenland Shark|427

2|Alyoshka|Great White Shark|600

3|Himari|Megaladon|1800

sqlite> ^Z

[3]+ Stopped sqlite3 sharks.db

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 marks.db

SQLite version 3.39.3 2022-09-05 11:02:23

Enter '.help' for usage hints.

sqlite> .tables

marks

sqlite> select * from marks;

2023|8|english|75

2023|8|math|85

2023|8|computer|90

2022|7|english|90

2022|7|math|85

2022|7|computer|95

2023|8|physics|99

2023|8|chemistry|97

2023|8|biology|95

sqlite>

Import a CSV file

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 marks.db
SQLite version 3.39.3 2022-09-05 11:02:23
Enter '.help' for usage hints.
sqlite>
sqlite> .tables
sqlite> .mode csv
sqlite> .import ./marks.csv marks

First line in marks.csv should contain the header with column names.

sqlite> .tables

marks

sqlite>

sqlite> select * from marks;

2023,8,english,75

2023,8,math,80

2023,8,computer,90

2022,7,english,90

2022,7,math,85

2022,7,computer,95

sqlite>

Inserting a single row

sqlite> .schema marks

CREATE TABLE IF NOT EXISTS 'marks' ('year' TEXT, 'class' TEXT, 'subject' TEXT, 'marks' TEXT);

sqlite>

sqlite> insert into marks (year, class, subject, marks) values (2023, 8, 'sanskrit', 90);

sqlite> select * from marks;

2023|8|english|75

2023|8|math|80

2023|8|computer|90

2022|7|english|90

2022|7|math|85

2022|7|computer|95

2023|8|sanskrit|90

sqlite>

Inserting multiple rows

sqlite> insert into marks (year, class, subject, marks) values (2023, 8, 'physics', 99), (2023, 8, 'chemistry', '97'), (2023, 8, 'biology', 95);

sqlite> select * from marks;

2023|8|english|75

2023|8|math|80

2023|8|computer|90

2022|7|english|90

2022|7|math|85

2022|7|computer|95

2023|8|sanskrit|90

2023|8|physics|99

2023|8|chemistry|97

2023|8|biology|95

Selection and Filtering

sqlite> select * from marks where marks > 90;

2022|7|computer|95

2023|8|physics|99

2023|8|chemistry|97

2023|8|biology|95

sqlite> select * from marks where class = 8;

2023|8|english|75

2023|8|math|80

2023|8|computer|90

2023|8|sanskrit|90

2023|8|physics|99

2023|8|chemistry|97

2023|8|biology|95

Update

sqlite> select * from marks where class = 8 and subject = 'math';

2023|8|math|80

sqlite>

sqlite> update marks set marks = marks + 5 where class = 8 and subject = 'math';

sqlite> select * from marks where class = 8 and subject = 'math';

2023|8|math|85

Deleting a record

sqlite> select count(*) from marks where subject = 'sanskrit';

1

sqlite> select * from marks where subject = 'sanskrit';

2023|8|sanskrit|90

sqlite> delete from marks where subject = 'sanskrit';

sqlite> select * from marks where subject = 'sanskrit';

sqlite> select count(*) from marks where subject = 'sanskrit';

0

sqlite>

Cursor

Row Num

0:ID

1:Name

2:SharkType

3:Length

0

1

Sammy

Greenland Shark

427

1

2

Alyoshka

Great White Shark

600

2

3

Himari

Megaladon

1800

[0][0]

[0][1]

[0][2]

[0][3]

[1][0]

[1][1]

[1][2]

[1][3]

[2][0]

[2][1]

[2][2]

[2][3]

A cursor allows us to iterate over a table.

And, a table can be viewed as a list of rows.

And, a row can be viewed as a tuple of cells.

In Code (Part 1)

import sqlite3

conn = sqlite3.connect('sharks.db')

print('Opened database successfully')

cursor = conn.execute('SELECT id, name, sharktype, length FROM sharks')

for row in cursor:

print('ID = ', row[0])

print('NAME = ', row[1])

print('SHARKTYPE = ', row[2])

print('LENGTH = ', row[3])

print('\n')

print('Operation done successfully')

conn.close()

Output

Opened database successfully

ID = 1

NAME = Sammy

SHARKTYPE = Greenland Shark

LENGTH = 427

ID = 2

NAME = Alyoshka

SHARKTYPE = Great White Shark

LENGTH = 600

ID = 3

NAME = Himari

SHARKTYPE = Megaladon

LENGTH = 1800

Operation done successfully

Tags: Technology,Database,

Tuesday, February 7, 2023

Joins in PySpark using RDD, SQL DataFrame, PySpark.Pandas

Download Code and Data

Join Types

Inner Join The inner join is the default join in Spark SQL. It selects rows that have matching values in both relations. Syntax: relation [ INNER ] JOIN relation [ join_criteria ] Left Join A left join returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also referred to as a left outer join. Syntax: relation LEFT [ OUTER ] JOIN relation [ join_criteria ] Right Join A right join returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also referred to as a right outer join. Syntax: relation RIGHT [ OUTER ] JOIN relation [ join_criteria ] Full Join A full join returns all values from both relations, appending NULL values on the side that does not have a match. It is also referred to as a full outer join. Syntax: relation FULL [ OUTER ] JOIN relation [ join_criteria ] Cross Join A cross join returns the Cartesian product of two relations. Syntax: relation CROSS JOIN relation [ join_criteria ] Semi Join A semi join returns values from the left side of the relation that has a match with the right. It is also referred to as a left semi join. Syntax: relation [ LEFT ] SEMI JOIN relation [ join_criteria ] Anti Join An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join. Syntax: relation [ LEFT ] ANTI JOIN relation [ join_criteria ] Ref: spark.apache.org

Visualizing Using Venn Diagrams

Spark provides a join() function that can join two paired RDDs based on the same key.

join(): Performs an inner join between two RDDs: firstRDD.join(laterRDD)
rightOuterJoin(): Performs join operation between two RDDs with key present in first RDD: firstRDD.rightOuterJoin(laterRDD)
leftOuterJoin(): Performs join operation between two RDDs with key present in the later RDD: firstRDD.leftOuterJoin(laterRDD)

Requirement:

Let us consider two different datasets of ArisCCNetwork RouterLocation.tsv and RouterPurchase.tsv.

Schema:

RouterLocation.tsv: RouterID, Name, Location

RouterPurchase.tsv: RouterID, Date, PrimaryMemory, SecondaryMemory, Cost

Join these two datasets to fetch Routers Location, Cost, and Memory details into a single RDD.

Implementation steps to join

Step 1: Create namedtuple classes representing datasets

Create two namedtuple representing the schema of each dataset.

Note: namedtuple is just like a dictionary. It improves the code readability by providing a way to access the values using descriptive field names.

Step 2: Generate <K,V> pairs using namedtuple

In this step,

datasets are loaded as RDDs

Paired RDDs <K, V> are created where K = common column in both RDDs, V = value part which contains a complete record 

Step 3: Apply the join() function

Spark join is applied against the grouped fields of locRDD and purRDD from the previous step. 
how: str, optional

default inner. 
Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.
Tags: Spark,Technology,Database,