
Monday, July 8, 2024

Round (2) - 18 Interview Questions on Cypher Queries and Knowledge Graph Using Neo4j (For Data Scientist Role) - Jul 2024

To See All Interview Preparation Articles: Index For Interviews Preparation
THESE QUERIES HAVE BEEN TRIED ON:
Server version:	Neo4j/5.20-aura
Server address:	fd39c7f2.databases.neo4j.io:7687

QUESTION 1:

HOW DO YOU CREATE AN INDEX ON STRING TYPE DATA IN NEO4J?

ANSWER:

(1)

Ref

Question from site:

I have a NextJS app using a Neo4j DB, in which I want to create an index like this:

CALL db.index.fulltext.createNodeIndex(
  "FTS_Person_Company_fullName",
  ["Person", "Company"],
  ["fullName"],
  {analyzer: "standard-folding", eventually_consistent: "true"})

Answer from site:

In neo4j 5.x, the Cypher language directly supports creating fulltext indexes.

For example, in your case:

CREATE FULLTEXT INDEX FTS_Person_Company_fullName 
FOR (n:Person|Company)
ON EACH [n.fullName]
OPTIONS {
  indexConfig: {
    `fulltext.analyzer`: "standard-folding",
    `fulltext.eventually_consistent`: true
  }
}

Ref (2)

CALL db.index.fulltext.createNodeIndex('documents', ['Document'], ['title','text'])

~~~

QUESTION 2:

IF I WANT TO CREATE AN INDEX LIKE THIS:
CALL db.index.fulltext.createNodeIndex('MY_FTS_EPICS', ['EPIC'], ['SUMMARY','DESCRIPTION', 'ACCEPTANCE_CRITERIA'])

WHAT WOULD BE THE EQUIVALENT 'CREATE FULLTEXT INDEX' QUERY?


ANSWER 2:

Ref

CREATE FULLTEXT INDEX MY_FTS_EPICS 
FOR (n:EPIC)
ON EACH [n.SUMMARY, n.DESCRIPTION, n.ACCEPTANCE_CRITERIA]

~~~

QUESTION 3:

WHY ARE WE DOING IT WITH THE 'CREATE FULLTEXT INDEX' QUERY

INSTEAD OF SOMETHING LIKE THIS:
CALL db.index.fulltext.createNodeIndex('MY_FTS_EPICS', ['EPIC'], ['SUMMARY','DESCRIPTION', 'ACCEPTANCE_CRITERIA'])

ANSWER:

Because createNodeIndex() is not supported in Neo4j AuraDB.

Neo.ClientError.Procedure.ProcedureNotFound
There is no procedure with the name `db.index.fulltext.createNodeIndex` registered for this database instance. Please ensure you've spelled the procedure name correctly and that the procedure is properly deployed.

~~~

QUESTION 4:

WRITE A QUERY TO FETCH RESULTS FOR INPUT STRING 'SGQ' FOR THE INDEX (MY_FTS_EPICS) THAT YOU HAVE CREATED ABOVE.

ANSWER:

CALL db.index.fulltext.queryNodes('MY_FTS_EPICS', 'SGQ') 
YIELD node, score 
RETURN node, score

Ref 1
Ref 2
Ref 3
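
As a quick sanity check, here is a minimal sketch of running the same query from Python with the official neo4j driver (the URI and credentials below are placeholders, not the real ones):

from neo4j import GraphDatabase

# Placeholder connection details -- substitute your own Aura URI and password.
driver = GraphDatabase.driver("neo4j+s://<your-instance>.databases.neo4j.io",
                              auth=("neo4j", "<password>"))

with driver.session() as session:
    # queryNodes() takes the index name and the Lucene query string.
    result = session.run(
        "CALL db.index.fulltext.queryNodes('MY_FTS_EPICS', $q) "
        "YIELD node, score RETURN node, score",
        q="SGQ",
    )
    for record in result:
        print(record["score"], record["node"])

driver.close()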

~~~

QUESTION 5:

What would you get if you passed a NODE LABEL in place of the INDEX name in the above query?

ANSWER:

We get an error:


CALL db.index.fulltext.queryNodes('EPIC', 'My SGQ and yours') YIELD node, score
RETURN node, score


Neo.ClientError.Procedure.ProcedureCallFailed
Failed to invoke procedure `db.index.fulltext.queryNodes`: Caused by: java.lang.IllegalArgumentException: There is no such fulltext schema index: EPIC

~~~

QUESTION 6:

Can you combine calls to queryNodes() with a WHERE clause as in a relational query?

ANSWER:

Short answer: No.

Ref

"I understand now that since the full-text-search is based on the Lucene index it can't be made on a subset of nodes."

~~~

QUESTION 7:

Which string comparison operators do you know?

Answer:

Ref

STRING-specific comparison operators

	STARTS WITH, 
	ENDS WITH, 
	CONTAINS, 
	=~ (regex matching)
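
A small sketch exercising all four operators against the ENTITY.NAME property used later in this post (placeholder connection details; toy data assumed):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j+s://<your-instance>.databases.neo4j.io",
                              auth=("neo4j", "<password>"))

queries = {
    "STARTS WITH": "MATCH (n:ENTITY) WHERE n.NAME STARTS WITH 'SGQ' RETURN n.NAME",
    "ENDS WITH":   "MATCH (n:ENTITY) WHERE n.NAME ENDS WITH 'file' RETURN n.NAME",
    "CONTAINS":    "MATCH (n:ENTITY) WHERE n.NAME CONTAINS 'SGQ' RETURN n.NAME",
    "=~ (regex)":  "MATCH (n:ENTITY) WHERE n.NAME =~ '.*SGQ.*' RETURN n.NAME",
}
with driver.session() as session:
    for op, cypher in queries.items():
        print(op, "->", [r["n.NAME"] for r in session.run(cypher)])

driver.close()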

~~~

QUESTION 8:

Are there more operators related to strings?

Answer:

Ref

Yes. 

The string operators comprise:

	concatenating STRING values: + and ||

	checking if a STRING is normalized: IS NORMALIZED
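
A one-liner sketch confirming that + and || concatenate identically (pure expression, no graph data needed; connection details are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j+s://<your-instance>.databases.neo4j.io",
                              auth=("neo4j", "<password>"))

with driver.session() as session:
    rec = session.run(
        "RETURN 'SGQ' + '-' + 'EPIC' AS plus, 'SGQ' || '-' || 'EPIC' AS pipes"
    ).single()
    print(rec["plus"], rec["pipes"])  # SGQ-EPIC SGQ-EPIC

driver.close()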

~~~

QUESTION 9:

What does 'IS NORMALIZED' do?

Answer:

Checking if a STRING IS NORMALIZED
Introduced in 5.17
The IS NORMALIZED operator is used to check whether the given STRING is in the NFC Unicode normalization form:

Unicode normalization is a process that transforms different representations of the same string into a standardized form. For more information, see the documentation for Unicode normalization forms.

Query
RETURN "the \u212B char" IS NORMALIZED AS normalized

Result:

normalized: false

Because the given STRING contains a non-normalized Unicode character (\u212B), false is returned.

~~~

QUESTION 10:

Which string-related functions are available in Neo4j?
Which ones have you used?

Answer:

Ref

btrim()
btrim() returns the original STRING with leading and trailing trimCharacterString characters removed. If trimCharacterString is not specified then all leading and trailing whitespace will be removed.

left()
left() returns a STRING containing the specified number of leftmost characters of the given STRING.

ltrim()
ltrim() returns the original STRING with leading trimCharacterString characters removed. As of Neo4j 5.20, a trimCharacterString can be specified. If this is not specified all leading whitespace will be removed.

normalize()
normalize() returns the given STRING normalized using the NFC Unicode normalization form.

Unicode normalization is a process that transforms different representations of the same string into a standardized form. For more information, see the documentation for Unicode normalization forms.

The normalize() function is useful for converting STRING values into comparable forms. When comparing two STRING values, it is their Unicode codepoints that are compared. In Unicode, a codepoint for a character that looks the same may be represented by two, or more, different codepoints. For example, the character < can be represented as \uFE64 (﹤) or \u003C (<). To the human eye, the characters may appear identical. However, if compared, Cypher® will return false as \uFE64 does not equal \u003C. Using the normalize() function, it is possible to normalize the codepoint \uFE64 to \u003C, creating a single codepoint representation, allowing them to be successfully compared.


normalize(), with specified normal form
normalize() returns the given STRING normalized using the specified normalization form. The normalization form can be of type NFC, NFD, NFKC or NFKD.

replace()
replace() returns a STRING in which all occurrences of a specified STRING in the given STRING have been replaced by another (specified) replacement STRING.

reverse()
reverse() returns a STRING in which the order of all characters in the given STRING have been reversed.

right()
right() returns a STRING containing the specified number of rightmost characters in the given STRING.

rtrim()
rtrim() returns the original STRING with trailing trimCharacterString characters removed. As of Neo4j 5.20, a trimCharacterString can be specified. If this is not specified all trailing whitespace will be removed.

split()
split() returns a LIST resulting from the splitting of the given STRING around matches of the given delimiter.

substring()
substring() returns a substring of the given STRING, beginning with a zero-based index start and length.

toLower()
toLower() returns the given STRING in lowercase.

toString()
toString() converts an INTEGER, FLOAT, BOOLEAN, STRING, POINT, DURATION, DATE, ZONED TIME, LOCAL TIME, LOCAL DATETIME or ZONED DATETIME value to a STRING.
This function will return an error if provided with an expression that is not an INTEGER, FLOAT, BOOLEAN, STRING, POINT, DURATION, DATE, ZONED TIME, LOCAL TIME, LOCAL DATETIME or ZONED DATETIME value.

toStringOrNull()
The function toStringOrNull() converts an INTEGER, FLOAT, BOOLEAN, STRING, POINT, DURATION, DATE, ZONED TIME, LOCAL TIME, LOCAL DATETIME or ZONED DATETIME value to a STRING.
If the expression is not an INTEGER, FLOAT, BOOLEAN, STRING, POINT, DURATION, DATE, ZONED TIME, LOCAL TIME, LOCAL DATETIME or ZONED DATETIME value, null will be returned.

toUpper()
toUpper() returns the given STRING in uppercase.

trim()
trim() returns the given STRING with the leading and/or trailing trimCharacterString character removed. As of Neo4j 5.20, a trimCharacterString can be specified. If this is not specified all leading and/or trailing whitespace will be removed.
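
To get a feel for a few of these, here is a sketch evaluating some of the functions above as pure expressions (no graph data needed; connection details are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j+s://<your-instance>.databases.neo4j.io",
                              auth=("neo4j", "<password>"))

cypher = """
RETURN toUpper('sgq')                         AS upper,
       split('SGQ quoting process', ' ')      AS parts,
       substring('SGQ output file', 0, 3)     AS sub,
       replace('SGQ Quote PDF', 'PDF', 'CSV') AS replaced,
       trim('  SGQ  ')                        AS trimmed
"""
with driver.session() as session:
    # Prints e.g. {'upper': 'SGQ', 'parts': ['SGQ', 'quoting', 'process'], ...}
    print(session.run(cypher).single().data())

driver.close()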

~~~

QUESTION 11:

Is STRING matching in Neo4j using STRING COMPARISON operators case-sensitive or case-insensitive?

ANSWER:

Short answer: case-sensitive

Ref

STRING matching
The prefix and suffix of a STRING can be matched using STARTS WITH and ENDS WITH. To undertake a substring search (that is, match regardless of the location within a STRING), use CONTAINS.

The matching is case-sensitive. Attempting to use these operators on values which are not STRING values will return null.
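
Since the matching is case-sensitive, a common workaround for case-insensitive matching is to lower-case both sides. A minimal sketch (placeholder connection details; ENTITY data assumed):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j+s://<your-instance>.databases.neo4j.io",
                              auth=("neo4j", "<password>"))

cypher = """
MATCH (n:ENTITY)
WHERE toLower(n.NAME) CONTAINS toLower($needle)
RETURN n.NAME
"""
with driver.session() as session:
    for record in session.run(cypher, needle="sgq"):
        print(record["n.NAME"])

driver.close()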

~~~

QUESTION 12:

Write a query to find the ENTITY node(s) where 'SGQ' appears in its NAME property.

ANSWER:

MATCH (n:ENTITY) 
WHERE n.NAME CONTAINS 'SGQ'
RETURN n.NAME;

~~~

QUESTION 13:
WRITE THE ABOVE QUERY USING REGEX:

Answer:
MATCH (n:ENTITY) 
WHERE n.NAME =~ '.*SGQ.*'
RETURN n.NAME;

~~~

QUESTION 14:

WRITE THE ABOVE QUERY USING FULLTEXT INDEX:

Answer:

CREATE FULLTEXT INDEX MY_FTS_ON_ENTITY 
FOR (n:ENTITY)
ON EACH [n.NAME]

CALL db.index.fulltext.queryNodes('MY_FTS_ON_ENTITY', 'SGQ') 
YIELD node, score 
RETURN node.NAME, score



OUTPUT:

1    "SGQ"                    2.5313024520874023
2    "ABANDON SGQ"            2.0718064308166504
3    "SGQ output file"        1.7535011768341064
4    "SGQ quoting process"    1.7535011768341064
5    "SGQ Quote PDF"          1.7535011768341064
6    "BPR-SGQ-ISG Refresh"    1.5199768543243408

~~~

QUESTION 15:
Which query runs faster: the one that uses CONTAINS, the one that uses a REGEX, or the one that uses db.index.fulltext.queryNodes()?
An example of query: 
MATCH (n:ENTITY) 
WHERE n.NAME CONTAINS 'SGQ'
RETURN n.NAME;

And which one is best to use?

Answer:

All three queries run in comparable times.

Time for CONTAINS:
# 71.3 ms ± 2.52 ms per loop
# 70.6 ms ± 635 µs per loop

Time for REGEX:
# 72 ms ± 2.32 ms per loop
# 70.6 ms ± 643 µs per loop

Time for fulltext.QUERYNODES():
# 70.8 ms ± 601 µs per loop
# 70.1 ms ± 393 µs per loop
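
The figures above appear to be in IPython %timeit format. A sketch of how such a comparison could be reproduced from Python (connection details are placeholders, and absolute numbers will vary with network latency and data size):

import timeit
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("neo4j+s://<your-instance>.databases.neo4j.io",
                              auth=("neo4j", "<password>"))

QUERIES = {
    "CONTAINS": "MATCH (n:ENTITY) WHERE n.NAME CONTAINS 'SGQ' RETURN n.NAME",
    "REGEX": "MATCH (n:ENTITY) WHERE n.NAME =~ '.*SGQ.*' RETURN n.NAME",
    "queryNodes": (
        "CALL db.index.fulltext.queryNodes('MY_FTS_ON_ENTITY', 'SGQ') "
        "YIELD node, score RETURN node.NAME, score"
    ),
}

def run_query(cypher):
    with driver.session() as session:
        session.run(cypher).data()  # .data() pulls the full result from the server

for name, cypher in QUERIES.items():
    avg = timeit.timeit(lambda: run_query(cypher), number=20) / 20
    print(f"{name}: {avg * 1000:.1f} ms per loop")

driver.close()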


The advantage lies in using fulltext.queryNodes() because it also returns a SCORE, which tells you which match is the best and which is the worst.

'SGQ'	2.5313024520874023
'ABANDON SGQ'	2.0718064308166504
'SGQ output file'	1.7535011768341064
'SGQ quoting process'	1.7535011768341064
'SGQ Quote PDF'	1.7535011768341064
'BPR-SGQ-ISG Refresh'	1.5199768543243408

~~~

QUESTION 16:

Match a node in Neo4j given a couple of options for its label.

Ref

I want to match a graph where a node can be typeX or typeY... 

One way is:

MATCH (n)
WHERE any(label IN labels(n) WHERE label IN ['typeX','typeY'])
WITH n
MATCH (n)-[]-(z)
RETURN z
However, if "either typeX or typeY" are queried frequently and share some common purpose in your domain, you could add another common label to them like "commonXY" and query using that label instead.

--------------------------------------------------------------------------

MATCH (x:ENTITY) WHERE x.NAME IN ['SGQ','ABANDON SGQ'] 
WITH x
MATCH (x)-[:APPEARS_IN_EPIC]->(z:EPIC)<-[:APPEARS_IN_EPIC]-(y:ENTITY)
WHERE y.NAME in ['SGQ quoting process']
RETURN x, y, z

OUTPUT:

(:ENTITY {NAME: SGQ})	
(:ENTITY {NAME: SGQ quoting process})	
(:EPIC {SUMMARY: BPR-QTE-Dental Rate Bank updated to support DHMO on SGQ output file, DESCRIPTION: As an agent quoting DHMO Dental Plans in SGQ, _x000D_@@NL@@I want to be able to update the rates for these plans, _x000D_@@NL@@So that the price for these dental plans will be more compelling to the small group employer. , ACCEPTANCE_CRITERIA: Problem Statement _x000D_Current dental rate bank in SGQ does not support DHMO planss. _x000D__x000D_Benefit Hypothesis _x000D_By allowing DHMO plan rates to be updated through Rate Bank in SGQ, ElevanceHealth can present more competitive prices for the selected dental rate plans. _x000D__x000D_Process Flow (Scenario - new sales) _x000D_Given an agent has successfully accessed SGQ, _x000D_And the agent is quoting DHMO dental plans for a small group employer. _x000D_After proceeding through the SGQ quoting process as designed, _x000D_And the agent is on the quotes tab to see the plans selected to be quoted.  _x000D_When reviewing the DHMO plans there will be an 'override' link where the agent will be able to update the rates for the selected DHMO plans., KY: BPP-149271, CUSTOM_ID: 2})

~~~

QUESTION 17:
How do you set a property on all nodes with label EPIC?

Ref

MATCH (n:EPIC)
SET n.GeneratedFrom = 'JIRA', n.CreatedOn = date("2024-07-05")
RETURN n

# ALSO CHECK:
# Set multiple properties using one SET clause
MATCH (n {name: 'Andy'})
SET n.position = 'Developer', n.surname = 'Taylor'

# ALSO CHECK:
https://neo4j.com/docs/getting-started/cypher-intro/dates-datetimes-durations/
# Creating and updating Date and Datetime type values

MATCH (article:Article {title: "Dates, Datetimes, and Durations in Neo4j"})
SET article.datePublished = date("2019-09-30")

# HOW DO YOU SET A PROPERTY ON ALL NODES WITH LABEL EPIC OR STORY?

MATCH (n) 
WHERE n:EPIC OR n:STORY
SET n.GeneratedFrom = 'JIRA', n.CreatedOn = date("2024-07-05")
RETURN n

# ALSO CHECK
Update a property:
SET can be used to update a property on a node or relationship. This query forces a change of type in the age property:

Query
MATCH (n {name: 'Andy'})
SET n.age = toString(n.age)
RETURN n.name, n.age

MATCH (n:EPIC)
WHERE n.GeneratedFrom = 'Bot'
SET n.CreatedOn = date(n.CreatedOn)
RETURN n;

~~~

QUESTION 18:
RETURN ALL NODES WITH LABEL 'EPIC' WITH PROPERTY GeneratedFrom = 'Bot'

ANSWER:

MATCH (n:EPIC)
WHERE n.GeneratedFrom = 'Bot'
RETURN n;

...

# ALSO CHECK THE ASSOCIATED DELETE QUERY

# Delete a node with all its relationships
MATCH (n:EPIC)
WHERE n.GeneratedFrom = 'Bot'
DETACH DELETE n;

If you used just 'DELETE n' instead of 'DETACH DELETE n', the following error would occur:
Neo.ClientError.Schema.ConstraintValidationFailed
Cannot delete node<555>, because it still has relationships. To delete this node, you must first delete its relationships.
Tags: Database,Technology

Thursday, June 20, 2024

10 Interview Questions on Cypher Queries and Knowledge Graph Using Neo4j (For Data Scientist Role) - Jun 2024

To See All Interview Preparation Articles: Index For Interviews Preparation
Question 1: Write a CREATE query having the following nodes and the relationship from ROOT to other nodes is 'HAS_CHILD'.

ROOT 
|--BROKER
|--PROVIDER
|--MEMBER

Answer:

CREATE (root:ROOT),
       (broker:BROKER),
       (provider:PROVIDER),
       (member:MEMBER),
       (root)-[:HAS_CHILD]->(broker),
       (root)-[:HAS_CHILD]->(provider),
       (root)-[:HAS_CHILD]->(member)

~~~

Question 2: Write a DELETE query to delete all nodes and relationships in a graph. 

Answer:
MATCH (n) DETACH DELETE n

Ref

~~~

Question 3: Write a query to get a count for all nodes of a given label:

Answer:

MATCH (n:Person)
RETURN count(n) as count

Ref

~~~

Question 4: There are three EPIC nodes in my graph. 
Each node has a numerical property CUSTOM_ID.
Now, I want to retrieve the node with the largest CUSTOM_ID.

Answer:

MATCH (n:EPIC)
RETURN n
ORDER BY n.CUSTOM_ID DESC
LIMIT 1

~~~ 

Question 5: Write query to get a node by property value in Neo4j.

Answer:


MATCH (n) 
WHERE n.name = 'Mark' 
RETURN n

Ref

~~~

Question 6: Delete a node with a given property.

Answer:
MATCH (n:Person {name: 'Tom Hanks'})
DELETE n

Ref

~~~

Question 7:  Delete ONLY nodes having label of ENTITY:

Answer:

MATCH (n:ENTITY)
DELETE n

~~~

Question 8: Return number of EPIC nodes in the knowledge graph.

Answer:

MATCH (epic:EPIC)
RETURN count(epic) as count

~~~

Question 9: Write a query to get the EPIC node with largest numerical property of CUSTOM_ID. 

Answer:

MATCH (epic:EPIC)
RETURN epic
ORDER BY epic.CUSTOM_ID DESC
LIMIT 1

~~~

Question 10:
What are some of the use cases where the Betweenness Centrality Algorithm is used?

Answer:
The Betweenness Centrality Algorithm is a powerful tool used to understand the roles of nodes in a graph and their impact on the network. Here are some use cases where it finds application:

Supply Chain Risk Analysis: In supply chain processes, Betweenness Centrality helps identify critical nodes that act as bridges between different parts of the network. For example, when transporting a product internationally, it can pinpoint bottleneck nodes during cargo ship stops in intermediate ports [1].

Power Grid Contingency Analysis: The algorithm is used to analyze power grid networks, identifying critical nodes that affect the flow of electricity. Due to its computational intensity, this application often requires supercomputers [2].

Community Detection and Network Routing: Betweenness Centrality assists in Girvan–Newman community detection and network routing tasks. It helps find influential nodes that connect different communities or guide information flow [2].

Artificial Intelligence and Skill Characterization: Skill characterization in AI relies on identifying influential nodes. Betweenness Centrality helps determine which nodes play a crucial role in spreading information or resources [2].

Epidemiology and Rumor Spreading: In epidemiology, it identifies nodes that influence the spread of diseases. Similarly, it helps analyze rumor propagation in social networks [1].

Transportation Networks: The algorithm is applied to transportation networks, such as road or rail systems, to find critical nodes affecting traffic flow or resource distribution [1].

Remember, Betweenness Centrality is about detecting nodes that serve as bridges, allowing information or resources to flow efficiently across a graph. 

[1] graphable.ai
[2] computationalsocialnetworks.springeropen.com
[3] nature.com
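
As a toy illustration of the "bridge" idea, here is a minimal sketch using the networkx Python library (not Neo4j itself; within Neo4j the same analysis is available through the Graph Data Science library's betweenness procedures):

import networkx as nx

# Two triangles connected by a single edge: C and D act as bridges.
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("B", "C"), ("A", "C"),  # community 1
    ("C", "D"),                          # the bridge
    ("D", "E"), ("E", "F"), ("D", "F"),  # community 2
])

# C and D lie on every shortest path between the two triangles,
# so they receive the highest betweenness scores.
print(nx.betweenness_centrality(G))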

---
Tags: Database,Technology

Thursday, April 25, 2024

Index For Interviews Preparation For Data Scientist Role



For Data Scientist Role

Theoretical Questions On Data Science

Questions From Interviews (Data Scientist)

Questions For 'Natural Language Processing' Posed By ChatGPT / Gemini

Questions For 'Generative AI' and 'Large Language Models'

Questions For 'Machine Learning' Posed By ChatGPT / Gemini

MCQs and Complex Scenario Based Questions For Various Machine Learning Models

Questions For Graph Databases Like Neo4j And Apache Gremlin

Questions For Statistics From Khan Academy

Explain Your Last Project

Miscellaneous / Others

Tags: Interview Preparation,Generative AI,Technology,Natural Language Processing,Machine Learning,Large Language Models,Database,

Friday, November 24, 2023

Hour One With SQLite Database

What does SQLite database contain?

SQLite Database has data stored in it in the form of tables.

What does a table contain?

It contains rows and columns.

A row is also called a record or tuple.

A column is also called an attribute.

Let's say we have a table called 'marks'.

What columns can a table of marks contain?

1. Year

2. Class

3. Subject

4. Marks

Four Basic Types of Operations We Do in Databases

Assuming you have a database with you:

Insert: To add or create a new record or tuple in your database.

Select: To read records or tuples present in the database.

Update: To change an existing record or tuple in your database.

Delete: To remove or delete a record or tuple from the database.

Can be remembered using the acronym: CRUD for Create, Read, Update, Delete

Single File For Storing a Database

Note: SQLite uses a single file for storing data related to a database. For example: we have two databases with us and their names are 'marks.db' and 'sharks.db'.

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ ls

marks.csv marks.db sharks.db

(base) ashish@ashish:~/Desktop/ws/sqlite_db$

Creating a Database file and Creating a Table

(base) ashish@ashish:~$ sqlite3 sharks.db

# Here, sqlite3 is your application and sharks.db is the filename for that application to open.

SQLite version 3.39.3 2022-09-05 11:02:23
Enter '.help' for usage hints.

sqlite> CREATE TABLE sharks(id integer NOT NULL, name text NOT NULL, sharktype text NOT NULL, length integer NOT NULL);

INSERT INTO sharks VALUES (1, 'Sammy', 'Greenland Shark', 427);
INSERT INTO sharks VALUES (2, 'Alyoshka', 'Great White Shark', 600);
INSERT INTO sharks VALUES (3, 'Himari', 'Megaladon', 1800);

sqlite> select * from sharks;


1|Sammy|Greenland Shark|427
2|Alyoshka|Great White Shark|600
3|Himari|Megaladon|1800    

sqlite>

Viewing the Structure of a Table

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 sharks.db

SQLite version 3.39.3 2022-09-05 11:02:23

Enter '.help' for usage hints.

sqlite> .schema sharks

CREATE TABLE sharks(id integer NOT NULL, name text NOT NULL, sharktype text NOT NULL, length integer NOT NULL);

sqlite>

Exploring a Database file

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 sharks.db

SQLite version 3.39.3 2022-09-05 11:02:23

Enter '.help' for usage hints.

sqlite> .tables

sharks

sqlite> select * from sharks;

1|Sammy|Greenland Shark|427

2|Alyoshka|Great White Shark|600

3|Himari|Megaladon|1800

sqlite> ^Z

[3]+ Stopped sqlite3 sharks.db

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 marks.db

SQLite version 3.39.3 2022-09-05 11:02:23

Enter '.help' for usage hints.

sqlite> .tables

marks

sqlite> select * from marks;

2023|8|english|75

2023|8|math|85

2023|8|computer|90

2022|7|english|90

2022|7|math|85

2022|7|computer|95

2023|8|physics|99

2023|8|chemistry|97

2023|8|biology|95

sqlite>

Import a CSV file

(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 marks.db
SQLite version 3.39.3 2022-09-05 11:02:23
Enter '.help' for usage hints.
sqlite>
sqlite> .tables
sqlite> .mode csv
sqlite> .import ./marks.csv marks

The first line in marks.csv should contain the header with the column names.

sqlite> .tables

marks

sqlite>

sqlite> select * from marks;

2023,8,english,75

2023,8,math,80

2023,8,computer,90

2022,7,english,90

2022,7,math,85

2022,7,computer,95

sqlite>

Inserting a single row

sqlite> .schema marks

CREATE TABLE IF NOT EXISTS 'marks' ('year' TEXT, 'class' TEXT, 'subject' TEXT, 'marks' TEXT);

sqlite>

sqlite> insert into marks (year, class, subject, marks) values (2023, 8, 'sanskrit', 90);

sqlite> select * from marks;

2023|8|english|75

2023|8|math|80

2023|8|computer|90

2022|7|english|90

2022|7|math|85

2022|7|computer|95

2023|8|sanskrit|90

sqlite>

Inserting multiple rows

sqlite> insert into marks (year, class, subject, marks) values (2023, 8, 'physics', 99), (2023, 8, 'chemistry', '97'), (2023, 8, 'biology', 95);

sqlite> select * from marks;

2023|8|english|75

2023|8|math|80

2023|8|computer|90

2022|7|english|90

2022|7|math|85

2022|7|computer|95

2023|8|sanskrit|90

2023|8|physics|99

2023|8|chemistry|97

2023|8|biology|95

Selection and Filtering

sqlite> select * from marks where marks > 90;

2022|7|computer|95

2023|8|physics|99

2023|8|chemistry|97

2023|8|biology|95

sqlite> select * from marks where class = 8;

2023|8|english|75

2023|8|math|80

2023|8|computer|90

2023|8|sanskrit|90

2023|8|physics|99

2023|8|chemistry|97

2023|8|biology|95

Update

sqlite> select * from marks where class = 8 and subject = 'math';

2023|8|math|80

sqlite>

sqlite> update marks set marks = marks + 5 where class = 8 and subject = 'math';

sqlite> select * from marks where class = 8 and subject = 'math';

2023|8|math|85

Deleting a record

sqlite> select count(*) from marks where subject = 'sanskrit';

1

sqlite> select * from marks where subject = 'sanskrit';

2023|8|sanskrit|90

sqlite> delete from marks where subject = 'sanskrit';

sqlite> select * from marks where subject = 'sanskrit';

sqlite> select count(*) from marks where subject = 'sanskrit';

0

sqlite>

Cursor

Row Num | 0:ID | 1:Name   | 2:SharkType       | 3:Length
--------|------|----------|-------------------|---------
0       | 1    | Sammy    | Greenland Shark   | 427
1       | 2    | Alyoshka | Great White Shark | 600
2       | 3    | Himari   | Megaladon         | 1800

The fetched rows can be addressed as cells [row][column], from [0][0] for the first row's ID up to [2][3] for the last row's Length.

A cursor allows us to iterate over a table.

And, a table can be viewed as a list of rows.

And, a row can be viewed as a tuple of cells.

In Code (Part 1)

import sqlite3

conn = sqlite3.connect('sharks.db')
print('Opened database successfully')

cursor = conn.execute('SELECT id, name, sharktype, length FROM sharks')

for row in cursor:
    print('ID = ', row[0])
    print('NAME = ', row[1])
    print('SHARKTYPE = ', row[2])
    print('LENGTH = ', row[3])
    print('\n')

print('Operation done successfully')

conn.close()

Output

Opened database successfully

ID = 1

NAME = Sammy

SHARKTYPE = Greenland Shark

LENGTH = 427

ID = 2

NAME = Alyoshka

SHARKTYPE = Great White Shark

LENGTH = 600

ID = 3

NAME = Himari

SHARKTYPE = Megaladon

LENGTH = 1800

Operation done successfully

Tags: Technology,Database,

Tuesday, February 7, 2023

Joins in PySpark using RDD, SQL DataFrame, PySpark.Pandas

Download Code and Data

Join Types

Inner Join
The inner join is the default join in Spark SQL. It selects rows that have matching values in both relations.
Syntax: relation [ INNER ] JOIN relation [ join_criteria ]

Left Join
A left join returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also referred to as a left outer join.
Syntax: relation LEFT [ OUTER ] JOIN relation [ join_criteria ]

Right Join
A right join returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also referred to as a right outer join.
Syntax: relation RIGHT [ OUTER ] JOIN relation [ join_criteria ]

Full Join
A full join returns all values from both relations, appending NULL values on the side that does not have a match. It is also referred to as a full outer join.
Syntax: relation FULL [ OUTER ] JOIN relation [ join_criteria ]

Cross Join
A cross join returns the Cartesian product of two relations.
Syntax: relation CROSS JOIN relation [ join_criteria ]

Semi Join
A semi join returns values from the left side of the relation that has a match with the right. It is also referred to as a left semi join.
Syntax: relation [ LEFT ] SEMI JOIN relation [ join_criteria ]

Anti Join
An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.
Syntax: relation [ LEFT ] ANTI JOIN relation [ join_criteria ]

Ref: spark.apache.org
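
A quick way to see several of these join types in action is the DataFrame API. A minimal sketch with made-up toy data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-types-demo").getOrCreate()

# Toy relations: ids 1 and 2 match, 3 is left-only, 4 is right-only.
left = spark.createDataFrame([(1, "r1"), (2, "r2"), (3, "r3")], ["id", "name"])
right = spark.createDataFrame([(1, 427), (2, 600), (4, 1800)], ["id", "length"])

# The same join criteria under different join types from the list above:
for how in ["inner", "left", "right", "full", "left_semi", "left_anti"]:
    print(how)
    left.join(right, on="id", how=how).show()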

Visualizing Using Venn Diagrams

Spark provides a join() function that can join two paired RDDs based on the same key.

join(): Performs an inner join between two RDDs: firstRDD.join(laterRDD)
rightOuterJoin(): Keeps every key from the later RDD, filling in None where the first RDD has no match: firstRDD.rightOuterJoin(laterRDD)
leftOuterJoin(): Keeps every key from the first RDD, filling in None where the later RDD has no match: firstRDD.leftOuterJoin(laterRDD)

Requirement:

Let us consider two datasets of ArisCCNetwork: RouterLocation.tsv and RouterPurchase.tsv.

Schema:

RouterLocation.tsv: RouterID, Name, Location

RouterPurchase.tsv: RouterID, Date, PrimaryMemory, SecondaryMemory, Cost

Join these two datasets to fetch each router's Location, Cost, and Memory details into a single RDD.

Implementation steps to join

Step 1: Create namedtuple classes representing datasets

Create two namedtuple representing the schema of each dataset.

Note: a namedtuple behaves like a lightweight, immutable record. It improves code readability by providing access to values through descriptive field names.

Step 2: Generate <K,V> pairs using namedtuple

In this step,

datasets are loaded as RDDs

Paired RDDs <K, V> are created where K = common column in both RDDs, V = value part which contains a complete record 

Step 3: Apply the join() function

The Spark join() is applied against the paired RDDs locRDD and purRDD from the previous step (a runnable sketch follows below).
For DataFrame joins, the how parameter of DataFrame.join() works as follows:

how: str, optional
    Default: inner.
    Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.
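
Putting the three steps together, a minimal sketch (file paths, tab delimiting, and field order are assumptions based on the schemas above):

from collections import namedtuple
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Step 1: namedtuples mirroring the two schemas.
Location = namedtuple("Location", ["RouterID", "Name", "Location"])
Purchase = namedtuple("Purchase", ["RouterID", "Date", "PrimaryMemory",
                                   "SecondaryMemory", "Cost"])

# Step 2: load the TSVs and build <K, V> pairs keyed on RouterID.
locRDD = (sc.textFile("RouterLocation.tsv")
            .map(lambda line: Location(*line.split("\t")))
            .map(lambda rec: (rec.RouterID, rec)))
purRDD = (sc.textFile("RouterPurchase.tsv")
            .map(lambda line: Purchase(*line.split("\t")))
            .map(lambda rec: (rec.RouterID, rec)))

# Step 3: inner join on the common key, then project location, cost and memory.
joined = locRDD.join(purRDD)  # (RouterID, (Location, Purchase))
result = joined.map(lambda kv: (kv[0], kv[1][0].Location,
                                kv[1][1].Cost, kv[1][1].PrimaryMemory,
                                kv[1][1].SecondaryMemory))
print(result.take(3))
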
Tags: Spark,Technology,Database,