THESE QUERIES HAVE BEEN TRIED ON:
Server version: Neo4j/5.20-aura
Server address: fd39c7f2.databases.neo4j.io:7687

QUESTION 1: HOW DO YOU CREATE AN INDEX ON STRING TYPE DATA IN NEO4J?
ANSWER:
(1) Ref
Question from site: I have a NextJS app using a Neo4j DB, in which I want to create an index like this:
CALL db.index.fulltext.createNodeIndex(
    "FTS_Person_Company_fullName",
    ["Person", "Company"],
    ["fullName"],
    {analyzer: "standard-folding", eventually_consistent: "true"})
Answer from site: In Neo4j 5.x, the Cypher language directly supports creating fulltext indexes. For example, in your case:
CREATE FULLTEXT INDEX FTS_Person_Company_fullName
FOR (n:Person|Company) ON EACH [n.fullName]
OPTIONS {
    indexConfig: {
        `fulltext.analyzer`: "standard-folding",
        `fulltext.eventually_consistent`: true
    }
}
(2) Ref
CALL db.index.fulltext.createNodeIndex('documents', ['Document'], ['title','text'])
~~~
QUESTION 2: IF I WANT TO CREATE AN INDEX LIKE THIS:
CALL db.index.fulltext.createNodeIndex('MY_FTS_EPICS', ['EPIC'], ['SUMMARY','DESCRIPTION', 'ACCEPTANCE_CRITERIA'])
WHAT WOULD BE THE EQUIVALENT 'CREATE FULLTEXT INDEX' QUERY?
ANSWER 2:
Ref
CREATE FULLTEXT INDEX MY_FTS_EPICS FOR (n:EPIC) ON EACH [n.SUMMARY, n.DESCRIPTION, n.ACCEPTANCE_CRITERIA]
~~~
QUESTION 3: WHY ARE WE DOING IT WITH THE 'CREATE FULLTEXT INDEX' QUERY INSTEAD OF SOMETHING LIKE THIS:
CALL db.index.fulltext.createNodeIndex('MY_FTS_EPICS', ['EPIC'], ['SUMMARY','DESCRIPTION', 'ACCEPTANCE_CRITERIA'])
ANSWER: Because createNodeIndex() is not supported in Neo4j AuraDB. The procedure was removed in Neo4j 5 in favor of the CREATE FULLTEXT INDEX syntax, so calling it on an Aura 5.x instance fails with:
Neo.ClientError.Procedure.ProcedureNotFound
There is no procedure with the name `db.index.fulltext.createNodeIndex` registered for this database instance. Please ensure you've spelled the procedure name correctly and that the procedure is properly deployed.
~~~
QUESTION 4: WRITE A QUERY TO FETCH RESULTS FOR INPUT STRING 'SGQ' FOR THE INDEX (MY_FTS_EPICS) THAT YOU HAVE CREATED ABOVE.
ANSWER:
CALL db.index.fulltext.queryNodes('MY_FTS_EPICS', 'SGQ')
YIELD node, score
RETURN node, score
Ref 1
Ref 2
Ref 3
~~~
QUESTION 5: What would you get if you pass a NODE LABEL in the above query in place of the INDEX name?
ANSWER: We get an error:
CALL db.index.fulltext.queryNodes('EPIC', 'My SGQ and yours')
YIELD node, score
RETURN node, score
Neo.ClientError.Procedure.ProcedureCallFailed
Failed to invoke procedure `db.index.fulltext.queryNodes`: Caused by: java.lang.IllegalArgumentException: There is no such fulltext schema index: EPIC
~~~
QUESTION 6: Can you combine calls to queryNodes() with a WHERE clause as in a relational query?
ANSWER: Short answer: No.
Ref
"I understand now that since the full-text search is based on the Lucene index it can't be made on a subset of nodes."
~~~
QUESTION 7: Which string comparison operators do you know?
Answer:
Ref
STRING-specific comparison operators: STARTS WITH, ENDS WITH, CONTAINS, =~ (regex matching)
~~~
QUESTION 8: Are there more operators for or related to strings?
Answer:
Ref
Yes. The string operators comprise:
- concatenating STRING values: + and ||
- checking if a STRING is normalized: IS NORMALIZED
~~~
QUESTION 9: What does 'IS NORMALIZED' do?
Answer:
Checking if a STRING IS NORMALIZED (introduced in 5.17)
The IS NORMALIZED operator is used to check whether the given STRING is in the NFC Unicode normalization form. Unicode normalization is a process that transforms different representations of the same string into a standardized form. For more information, see the documentation for Unicode normalization forms.
Query:
RETURN "the \u212B char" IS NORMALIZED AS normalized
Result:
normalized
false
Because the given STRING contains a non-normalized Unicode character (\u212B), false is returned.
~~~
QUESTION 10: Which string-related functions are available in Neo4j? Which ones have you used?
Answer:
Ref
- btrim(): returns the original STRING with leading and trailing trimCharacterString characters removed. If trimCharacterString is not specified, then all leading and trailing whitespace will be removed.
- left(): returns a STRING containing the specified number of leftmost characters of the given STRING.
- ltrim(): returns the original STRING with leading trimCharacterString characters removed. As of Neo4j 5.20, a trimCharacterString can be specified. If this is not specified, all leading whitespace will be removed.
- normalize(): returns the given STRING normalized using the NFC Unicode normalization form. Unicode normalization is a process that transforms different representations of the same string into a standardized form. For more information, see the documentation for Unicode normalization forms. The normalize() function is useful for converting STRING values into comparable forms. When comparing two STRING values, it is their Unicode codepoints that are compared. In Unicode, a codepoint for a character that looks the same may be represented by two, or more, different codepoints. For example, the character < can be represented as \uFE64 (﹤) or \u003C (<). To the human eye, the characters may appear identical. However, if compared, Cypher® will return false as \uFE64 does not equal \u003C. Using the normalize() function, it is possible to normalize the codepoint \uFE64 to \u003C, creating a single codepoint representation, allowing them to be successfully compared.
- normalize(), with specified normal form: returns the given STRING normalized using the specified normalization form. The normalization form can be of type NFC, NFD, NFKC or NFKD.
- replace(): returns a STRING in which all occurrences of a specified STRING in the given STRING have been replaced by another (specified) replacement STRING.
- reverse(): returns a STRING in which the order of all characters in the given STRING have been reversed.
- right(): returns a STRING containing the specified number of rightmost characters in the given STRING.
- rtrim(): returns the original STRING with trailing trimCharacterString characters removed. As of Neo4j 5.20, a trimCharacterString can be specified. If this is not specified, all trailing whitespace will be removed.
- split(): returns a LIST resulting from the splitting of the given STRING around matches of the given delimiter.
- substring(): returns a substring of the given STRING, beginning with a zero-based index start and length.
- toLower(): returns the given STRING in lowercase.
- toString(): converts an INTEGER, FLOAT, BOOLEAN, STRING, POINT, DURATION, DATE, ZONED TIME, LOCAL TIME, LOCAL DATETIME or ZONED DATETIME value to a STRING. This function will return an error if provided with an expression that is not an INTEGER, FLOAT, BOOLEAN, STRING, POINT, DURATION, DATE, ZONED TIME, LOCAL TIME, LOCAL DATETIME or ZONED DATETIME value.
- toStringOrNull(): converts an INTEGER, FLOAT, BOOLEAN, STRING, POINT, DURATION, DATE, ZONED TIME, LOCAL TIME, LOCAL DATETIME or ZONED DATETIME value to a STRING.
If the expression is not an INTEGER, FLOAT, BOOLEAN, STRING, POINT, DURATION, DATE, ZONED TIME, LOCAL TIME, LOCAL DATETIME or ZONED DATETIME value, null will be returned.
- toUpper(): returns the given STRING in uppercase.
- trim(): returns the given STRING with the leading and/or trailing trimCharacterString character removed. As of Neo4j 5.20, a trimCharacterString can be specified. If this is not specified, all leading and/or trailing whitespace will be removed.
~~~
QUESTION 11: Is STRING matching in Neo4j using STRING comparison operators case-sensitive or case-insensitive?
ANSWER: Short answer: case-sensitive.
Ref
STRING matching
The prefix and suffix of a STRING can be matched using STARTS WITH and ENDS WITH. To undertake a substring search (that is, match regardless of the location within a STRING), use CONTAINS. The matching is case-sensitive. Attempting to use these operators on values which are not STRING values will return null.
~~~
QUESTION 12: Write a query to find the ENTITY node(s) where 'SGQ' appears in its NAME property.
ANSWER:
MATCH (n:ENTITY)
WHERE n.NAME CONTAINS 'SGQ'
RETURN n.NAME;
~~~
QUESTION 13: WRITE THE ABOVE QUERY USING REGEX:
Answer:
MATCH (n:ENTITY)
WHERE n.NAME =~ '.*SGQ.*'
RETURN n.NAME;
~~~
QUESTION 14: WRITE THE ABOVE QUERY USING A FULLTEXT INDEX:
Answer:
CREATE FULLTEXT INDEX MY_FTS_ON_ENTITY FOR (n:ENTITY) ON EACH [n.NAME]

CALL db.index.fulltext.queryNodes('MY_FTS_ON_ENTITY', 'SGQ')
YIELD node, score
RETURN node.NAME, score

1 "SGQ" 2.5313024520874023
2 "ABANDON SGQ" 2.0718064308166504
3 "SGQ output file" 1.7535011768341064
4 "SGQ quoting process" 1.7535011768341064
5 "SGQ Quote PDF" 1.7535011768341064
6 "BPR-SGQ-ISG Refresh" 1.5199768543243408
~~~
QUESTION 15: Which query runs faster: the one that uses CONTAINS, the one that uses a REGEX, or the one that uses db.index.fulltext.queryNodes()? An example of the query:
MATCH (n:ENTITY)
WHERE n.NAME CONTAINS 'SGQ'
RETURN n.NAME;
And which one is best to use?
Answer: All three queries run in comparable time.
Time for CONTAINS:
# 71.3 ms ± 2.52 ms per loop
# 70.6 ms ± 635 µs per loop
Time for REGEX:
# 72 ms ± 2.32 ms per loop
# 70.6 ms ± 643 µs per loop
Time for fulltext queryNodes():
# 70.8 ms ± 601 µs per loop
# 70.1 ms ± 393 µs per loop
The advantage of fulltext queryNodes() is that it also returns a SCORE, which tells you which result is the best match and which is the worst match.
'SGQ' 2.5313024520874023
'ABANDON SGQ' 2.0718064308166504
'SGQ output file' 1.7535011768341064
'SGQ quoting process' 1.7535011768341064
'SGQ Quote PDF' 1.7535011768341064
'BPR-SGQ-ISG Refresh' 1.5199768543243408
~~~
QUESTION 16: Match a node in Neo4j from a couple of options for its label.
Ref
I want to match a graph where a node can be typeX or typeY...
### One way is
MATCH (n) WHERE any(label IN labels(n) WHERE label IN ['typeX','typeY'])
WITH n
MATCH (n)-[]-(z)
RETURN z
However, if "either typeX or typeY" are queried frequently and share some common purpose in your domain, you could add another common label to them like "commonXY" and query using that label instead.
--------------------------------------------------------------------------
MATCH (x:ENTITY)
WHERE x.NAME IN ['SGQ','ABANDON SGQ']
WITH x
MATCH (x)-[:APPEARS_IN_EPIC]->(z:EPIC)<-[:APPEARS_IN_EPIC]-(y:ENTITY)
WHERE y.NAME IN ['SGQ quoting process']
RETURN x, y, z

OUTPUT:
(:ENTITY {NAME: SGQ})
(:ENTITY {NAME: SGQ quoting process})
(:EPIC {SUMMARY: BPR-QTE-Dental Rate Bank updated to support DHMO on SGQ output file, DESCRIPTION: As an agent quoting DHMO Dental Plans in SGQ, _x000D_@@NL@@I want to be able to update the rates for these plans, _x000D_@@NL@@So that the price for these dental plans will be more compelling to the small group employer. , ACCEPTANCE_CRITERIA: Problem Statement _x000D_Current dental rate bank in SGQ does not support DHMO planss. _x000D__x000D_Benefit Hypothesis _x000D_By allowing DHMO plan rates to be updated through Rate Bank in SGQ, ElevanceHealth can present more competitive prices for the selected dental rate plans. _x000D__x000D_Process Flow (Scenario - new sales) _x000D_Given an agent has successfully accessed SGQ, _x000D_And the agent is quoting DHMO dental plans for a small group employer. _x000D_After proceeding through the SGQ quoting process as designed, _x000D_And the agent is on the quotes tab to see the plans selected to be quoted. _x000D_When reviewing the DHMO plans there will be an 'override' link where the agent will be able to update the rates for the selected DHMO plans., KY: BPP-149271, CUSTOM_ID: 2})
~~~
QUESTION 17: How do you set a property on all nodes with label EPIC?
Ref
MATCH (n:EPIC)
SET n.GeneratedFrom = 'JIRA', n.CreatedOn = date("2024-07-05")
RETURN n

# ALSO CHECK:
# Set multiple properties using one SET clause
MATCH (n {name: 'Andy'})
SET n.position = 'Developer', n.surname = 'Taylor'

# ALSO CHECK: https://neo4j.com/docs/getting-started/cypher-intro/dates-datetimes-durations/
# Creating and updating Date and Datetime type values
MATCH (article:Article {title: "Dates, Datetimes, and Durations in Neo4j"})
SET article.datePublished = date("2019-09-30")

# HOW DO YOU SET A PROPERTY ON ALL NODES WITH LABEL EPIC OR STORY?
MATCH (n)
WHERE n:EPIC OR n:STORY
SET n.GeneratedFrom = 'JIRA', n.CreatedOn = date("2024-07-05")
RETURN n

# ALSO CHECK
# Update a property: SET can be used to update a property on a node or relationship. This query forces a change of type in the age property:
MATCH (n {name: 'Andy'})
SET n.age = toString(n.age)
RETURN n.name, n.age

MATCH (n:EPIC)
WHERE n.GeneratedFrom = 'Bot'
SET n.CreatedOn = date(n.CreatedOn)
RETURN n;
~~~
QUESTION 18: RETURN ALL NODES WITH LABEL 'EPIC' WITH PROPERTY GeneratedFrom = 'Bot'
ANSWER:
MATCH (n:EPIC)
WHERE n.GeneratedFrom = 'Bot'
RETURN n;
...
# ALSO CHECK THE ASSOCIATED DELETE QUERY
# Delete a node with all its relationships
MATCH (n:EPIC)
WHERE n.GeneratedFrom = 'Bot'
DETACH DELETE n;

If you used just 'DELETE n' instead of 'DETACH DELETE n', the following error would occur:
Neo.ClientError.Schema.ConstraintValidationFailed
Cannot delete node<555>, because it still has relationships. To delete this node, you must first delete its relationships.
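To run any of these Cypher statements from application code against the Aura instance mentioned at the top, the official Neo4j Python driver can be used. A minimal sketch (the URI and password below are placeholders, not the real credentials; the index name is the one created in Question 14):

from neo4j import GraphDatabase

URI = "neo4j+s://<your-instance>.databases.neo4j.io"  # placeholder
AUTH = ("neo4j", "<password>")                         # placeholder

query = """
CALL db.index.fulltext.queryNodes('MY_FTS_ON_ENTITY', $search_term)
YIELD node, score
RETURN node.NAME AS name, score
"""

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    # execute_query() is available in the 5.x Python driver; keyword arguments
    # such as search_term are passed to the query as Cypher parameters.
    records, summary, keys = driver.execute_query(query, search_term="SGQ", database_="neo4j")
    for record in records:
        print(record["name"], record["score"])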
Monday, July 8, 2024
Round (2) - 18 Interview Questions on Cypher Queries and Knowledge Graph Using Neo4j (For Data Scientist Role) - Jul 2024
Thursday, June 20, 2024
10 Interview Questions on Cypher Queries and Knowledge Graph Using Neo4j (For Data Scientist Role) - Jun 2024
Question 1: Write a CREATE query having the following nodes, where the relationship from ROOT to the other nodes is 'HAS_CHILD'.
ROOT
|--BROKER
|--PROVIDER
|--MEMBER
Answer:
CREATE (root:ROOT),
  (broker:BROKER),
  (provider:PROVIDER),
  (member:MEMBER),
  (root)-[:HAS_CHILD]->(broker),
  (root)-[:HAS_CHILD]->(provider),
  (root)-[:HAS_CHILD]->(member)
~~~
Question 2: Write a DELETE query to delete all nodes and relationships in a graph.
Answer:
MATCH (n)
DETACH DELETE n
Ref
~~~
Question 3: Write a query to get a count for all nodes of a given label.
Answer:
MATCH (n:Person)
RETURN count(n) as count
Ref
~~~
Question 4: There are three EPIC nodes in my graph. Each node has a numerical property CUSTOM_ID. Now, I want to retrieve the node with the largest CUSTOM_ID.
Answer:
MATCH (n:EPIC)
RETURN n
ORDER BY n.CUSTOM_ID DESC
LIMIT 1
~~~
Question 5: Write a query to get a node by property value in Neo4j.
Answer:
MATCH (n)
WHERE n.name = 'Mark'
RETURN n
Ref
~~~
Question 6: Delete a node with a given property.
Answer:
MATCH (n:Person {name: 'Tom Hanks'})
DELETE n
Ref
~~~
Question 7: Delete ONLY nodes having the label ENTITY.
Answer:
MATCH (n:ENTITY)
DELETE n
~~~
Question 8: Return the number of EPIC nodes in the knowledge graph.
Answer:
MATCH (epic:EPIC)
RETURN count(epic) as count
~~~
Question 9: Write a query to get the EPIC node with the largest numerical property CUSTOM_ID.
Answer:
MATCH (epic:EPIC)
RETURN epic
ORDER BY epic.CUSTOM_ID DESC
LIMIT 1
~~~
Question 10: What are some of the use cases where the Betweenness Centrality Algorithm is used?
Answer: The Betweenness Centrality Algorithm is a powerful tool used to understand the roles of nodes in a graph and their impact on the network. Here are some use cases where it finds application:
- Supply Chain Risk Analysis: In supply chain processes, Betweenness Centrality helps identify critical nodes that act as bridges between different parts of the network. For example, when transporting a product internationally, it can pinpoint bottleneck nodes during cargo ship stops in intermediate ports [1].
- Power Grid Contingency Analysis: The algorithm is used to analyze power grid networks, identifying critical nodes that affect the flow of electricity. Due to its computational intensity, this application often requires supercomputers [2].
- Community Detection and Network Routing: Betweenness Centrality assists in Girvan–Newman community detection and network routing tasks. It helps find influential nodes that connect different communities or guide information flow [2].
- Artificial Intelligence and Skill Characterization: Skill characterization in AI relies on identifying influential nodes. Betweenness Centrality helps determine which nodes play a crucial role in spreading information or resources [2].
- Epidemiology and Rumor Spreading: In epidemiology, it identifies nodes that influence the spread of diseases. Similarly, it helps analyze rumor propagation in social networks [1].
- Transportation Networks: The algorithm is applied to transportation networks, such as road or rail systems, to find critical nodes affecting traffic flow or resource distribution [1].
Remember, Betweenness Centrality is about detecting nodes that serve as bridges, allowing information or resources to flow efficiently across a graph.
[1]: graphable.ai
[2]: computationalsocialnetworks.springeropen.com
[3]: nature.com

Tags: Database,Technology
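To complement Question 10: if the Graph Data Science (GDS) library is available (it is not part of a plain AuraDB instance; AuraDS or a self-managed server with the GDS plugin is assumed here), Betweenness Centrality can be computed with the gds.betweenness procedures. A minimal Python sketch with placeholder connection details and an illustrative projected-graph name:

from neo4j import GraphDatabase

URI = "neo4j+s://<your-instance>.databases.neo4j.io"  # placeholder
AUTH = ("neo4j", "<password>")                         # placeholder

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    # Project all nodes and relationships into GDS's in-memory graph catalog.
    driver.execute_query("CALL gds.graph.project('myGraph', '*', '*')")

    # Stream betweenness centrality scores, highest (most bridge-like) first.
    records, _, _ = driver.execute_query(
        """
        CALL gds.betweenness.stream('myGraph')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId) AS node, score
        ORDER BY score DESC
        """
    )
    for record in records:
        print(record["node"], record["score"])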
Thursday, April 25, 2024
Index For Interview Preparation For Data Scientist Role
For Data Scientist Role
Theoretical Questions On Data Science
Questions From Interviews (Data Scientist)
- Interview at Nagarro for Data Scientist Role (Jul 27, 2024)
- Interview at Capgemini For Data Scientist Role (May 28, 2024)
- Interview for Data Scientist Role at Cognizant (Questions With Answers From Gemini - 18 Apr 2024)
- Coding Round in Interview for Data Scientist Role at National Australia Bank (17 Nov 2023)
- Data Structures, Algorithms and Coding Round For Data Scientist at Agoda (26 Sep 2023)
- Interview for Data Engineering and Machine Learning Profile (20 Sep 2023) - For the position of Infosys Digital Specialist
- Interview for Natural Language Processing, and Machine Learning Profile (2 Sep 2023) - For Data Scientist Role at Accenture
Questions For 'Natural Language Processing' Posed By ChatGPT / Gemini
Questions For 'Generative AI' and 'Large Language Models'
Questions For 'Machine Learning' Posed By ChatGPT / Gemini
- Day 1: Preparing Machine Learning Topics for Data Scientist Interview by The Help of Gemini
- Day 2: Some complex interview questions for the use case of Master Data Management for Tyson Foods (US) from my resume
- Day 2: Some complex questions on Bot Detection Project on Twitter Data For Infosys Digital Marketing Team
- Day 3 of Interview Preparation For Data Scientist Role: Some MCQs and Scenario Based Questions on The Topic of Regression
- Day 4 of Interview Preparation For Data Scientist Role: Questions on Classification
- Day 5: Complex Scenario Based Questions For Anomaly Detection Project (May 2024)
- Day 6: MCQs and Scenario Based Complex Questions on The Topic of "Binary and Multiclass Classification"
- Day 7: MCQs and Scenario Based Questions on 'Bayes Search For Hyperparameter Tuning'
- Day 8: Scenario Based Questions on Normalization and Standardization
- Day 9: MCQs and Scenario Based Questions on 'Support Vector Machines'
- Day 10: Questions on Principal Component Analysis (PCA) (Ch 6 from 'Applied ML and AI for Engineers')
MCQs and Complex Scenario Based Questions For Various Machine Learning Models
Questions For Graph Databases Like Neo4j And Apache Gremlin
Questions For Statistics From Khan Academy
Explain Your Last Project
Miscellaneous / Others
Tags: Interview Preparation,Generative AI,Technology,Natural Language Processing,Machine Learning,Large Language Models,Database
Friday, November 24, 2023
Hour One With SQLite Database
What does an SQLite database contain?
An SQLite database has data stored in it in the form of tables.
What does a table contain?
It contains rows and columns.
A row is also called a record or tuple.
A column is also called an attribute.
Let's say we have a table called 'marks'.
What columns can the 'marks' table contain?
1. Year
2. Class
3. Subject
4. Marks
Four Basic Types of Operations We Do in Databases
Assuming you have a database with you:
Insert: To add or create a new record or tuple in your database.
Select: To read records or tuples present in the database.
Update: To change an existing record or tuple in your database.
Delete: To remove or delete a record or tuple from the database.
Can be remembered using the acronym: CRUD for Create, Read, Update, Delete
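As a quick illustration of these four operations from Python, here is a minimal sketch using the standard-library sqlite3 module and an in-memory database (the values are made up for the example and mirror the 'marks' table discussed above):

import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks(year INTEGER, class INTEGER, subject TEXT, marks INTEGER)")

# Create (INSERT) -- parameterized, so values are not pasted into the SQL string.
conn.execute("INSERT INTO marks VALUES (?, ?, ?, ?)", (2023, 8, "english", 75))

# Read (SELECT)
print(conn.execute("SELECT * FROM marks").fetchall())

# Update (UPDATE)
conn.execute("UPDATE marks SET marks = marks + 5 WHERE subject = ?", ("english",))

# Delete (DELETE)
conn.execute("DELETE FROM marks WHERE subject = ?", ("english",))

conn.commit()
conn.close()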
Single File For Storing a Database
Note: SQLite uses a single file for storing data related to a database. For example: we have two databases with us and their names are 'marks.db' and 'sharks.db'.
(base) ashish@ashish:~/Desktop/ws/sqlite_db$ ls
marks.csv marks.db sharks.db
(base) ashish@ashish:~/Desktop/ws/sqlite_db$
Creating a Database file and Creating a Table
(base) ashish@ashish:~$ sqlite3 sharks.db
# Here, sqlite3 is your application and sharks.db is the filename for that application to open.
SQLite version 3.39.3 2022-09-05 11:02:23
Enter '.help' for usage hints.
sqlite> CREATE TABLE sharks(id integer NOT NULL, name text NOT NULL, sharktype text NOT NULL, length integer NOT NULL);
sqlite> INSERT INTO sharks VALUES (1, 'Sammy', 'Greenland Shark', 427);
sqlite> INSERT INTO sharks VALUES (2, 'Alyoshka', 'Great White Shark', 600);
sqlite> INSERT INTO sharks VALUES (3, 'Himari', 'Megaladon', 1800);
sqlite> select * from sharks;
1|Sammy|Greenland Shark|427
2|Alyoshka|Great White Shark|600
3|Himari|Megaladon|1800
sqlite>
Viewing the Structure of a Table
(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 sharks.db
SQLite version 3.39.3 2022-09-05 11:02:23
Enter '.help' for usage hints.
sqlite> .schema sharks
CREATE TABLE sharks(id integer NOT NULL, name text NOT NULL, sharktype text NOT NULL, length integer NOT NULL);
sqlite>
Exploring a Database file
(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 sharks.db
SQLite version 3.39.3 2022-09-05 11:02:23
Enter '.help' for usage hints.
sqlite> .tables
sharks
sqlite> select * from sharks;
1|Sammy|Greenland Shark|427
2|Alyoshka|Great White Shark|600
3|Himari|Megaladon|1800
sqlite> ^Z
[3]+ Stopped sqlite3 sharks.db
(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 marks.db
SQLite version 3.39.3 2022-09-05 11:02:23
Enter '.help' for usage hints.
sqlite> .tables
marks
sqlite> select * from marks;
2023|8|english|75
2023|8|math|85
2023|8|computer|90
2022|7|english|90
2022|7|math|85
2022|7|computer|95
2023|8|physics|99
2023|8|chemistry|97
2023|8|biology|95
sqlite>
Import a CSV file
(base) ashish@ashish:~/Desktop/ws/sqlite_db$ sqlite3 marks.db
SQLite version 3.39.3 2022-09-05 11:02:23
Enter '.help' for usage hints.
sqlite>
sqlite> .tables
sqlite> .mode csv
sqlite> .import ./marks.csv marks
First line in marks.csv should contain the header with column names.
sqlite> .tables
marks
sqlite>
sqlite> select * from marks;
2023,8,english,75
2023,8,math,80
2023,8,computer,90
2022,7,english,90
2022,7,math,85
2022,7,computer,95
sqlite>
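The same CSV import can also be scripted from Python instead of the sqlite3 shell. A minimal sketch using the standard-library csv and sqlite3 modules (it assumes the same marks.csv with a header row, as described above; creating all columns as TEXT is meant to mirror what the .import command does):

import csv
import sqlite3

conn = sqlite3.connect("marks.db")

with open("marks.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # the first line holds the column names, e.g. year,class,subject,marks

    # Create the table from the header if it does not exist yet.
    columns = ", ".join(f"'{name}' TEXT" for name in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS marks ({columns})")

    # Insert the remaining rows in one batch.
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO marks VALUES ({placeholders})", reader)

conn.commit()
conn.close()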
Inserting a single row
sqlite> .schema marks
CREATE TABLE IF NOT EXISTS 'marks' ('year' TEXT, 'class' TEXT, 'subject' TEXT, 'marks' TEXT);
sqlite>
sqlite> insert into marks (year, class, subject, marks) values (2023, 8, 'sanskrit', 90);
sqlite> select * from marks;
2023|8|english|75
2023|8|math|80
2023|8|computer|90
2022|7|english|90
2022|7|math|85
2022|7|computer|95
2023|8|sanskrit|90
sqlite>
Inserting multiple rows
sqlite> insert into marks (year, class, subject, marks) values (2023, 8, 'physics', 99), (2023, 8, 'chemistry', '97'), (2023, 8, 'biology', 95);
sqlite> select * from marks;
2023|8|english|75
2023|8|math|80
2023|8|computer|90
2022|7|english|90
2022|7|math|85
2022|7|computer|95
2023|8|sanskrit|90
2023|8|physics|99
2023|8|chemistry|97
2023|8|biology|95
Selection and Filtering
sqlite> select * from marks where marks > 90;
2022|7|computer|95
2023|8|physics|99
2023|8|chemistry|97
2023|8|biology|95
sqlite> select * from marks where class = 8;
2023|8|english|75
2023|8|math|80
2023|8|computer|90
2023|8|sanskrit|90
2023|8|physics|99
2023|8|chemistry|97
2023|8|biology|95
Update
sqlite> select * from marks where class = 8 and subject = 'math';
2023|8|math|80
sqlite>
sqlite> update marks set marks = marks + 5 where class = 8 and subject = 'math';
sqlite> select * from marks where class = 8 and subject = 'math';
2023|8|math|85
Deleting a record
sqlite> select count(*) from marks where subject = 'sanskrit';
1
sqlite> select * from marks where subject = 'sanskrit';
2023|8|sanskrit|90
sqlite> delete from marks where subject = 'sanskrit';
sqlite> select * from marks where subject = 'sanskrit';
sqlite> select count(*) from marks where subject = 'sanskrit';
0
sqlite>
Cursor
Row Num | 0:ID | 1:Name   | 2:SharkType       | 3:Length
0       | 1    | Sammy    | Greenland Shark   | 427
1       | 2    | Alyoshka | Great White Shark | 600
2       | 3    | Himari   | Megaladon         | 1800

Cell indices ([row][column]):
[0][0] | [0][1] | [0][2] | [0][3]
[1][0] | [1][1] | [1][2] | [1][3]
[2][0] | [2][1] | [2][2] | [2][3]
A cursor allows us to iterate over a table.
And, a table can be viewed as a list of rows.
And, a row can be viewed as a tuple of cells.
In Code (Part 1)
import sqlite3

conn = sqlite3.connect('sharks.db')
print('Opened database successfully')

cursor = conn.execute('SELECT id, name, sharktype, length FROM sharks')

for row in cursor:
    print('ID = ', row[0])
    print('NAME = ', row[1])
    print('SHARKTYPE = ', row[2])
    print('LENGTH = ', row[3])
    print('\n')

print('Operation done successfully')
conn.close()
Output
Opened database successfully
ID = 1
NAME = Sammy
SHARKTYPE = Greenland Shark
LENGTH = 427
ID = 2
NAME = Alyoshka
SHARKTYPE = Great White Shark
LENGTH = 600
ID = 3
NAME = Himari
SHARKTYPE = Megaladon
LENGTH = 1800
Operation done successfully
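Beyond iterating over every row, the cursor can also be combined with a parameterized query. A small follow-up sketch (not from the original post) against the same sharks.db:

import sqlite3

conn = sqlite3.connect('sharks.db')

# The '?' placeholder is filled in by sqlite3 itself, which avoids building
# SQL strings by hand.
cursor = conn.execute('SELECT name, length FROM sharks WHERE length > ?', (500,))

rows = cursor.fetchall()   # fetch all matching rows at once as a list of tuples
for name, length in rows:
    print(name, length)

conn.close()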
Tags: Technology,Database
Tuesday, February 7, 2023
Joins in PySpark using RDD, SQL DataFrame, PySpark.Pandas
Join Types
Inner Join
The inner join is the default join in Spark SQL. It selects rows that have matching values in both relations.
Syntax: relation [ INNER ] JOIN relation [ join_criteria ]

Left Join
A left join returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also referred to as a left outer join.
Syntax: relation LEFT [ OUTER ] JOIN relation [ join_criteria ]

Right Join
A right join returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also referred to as a right outer join.
Syntax: relation RIGHT [ OUTER ] JOIN relation [ join_criteria ]

Full Join
A full join returns all values from both relations, appending NULL values on the side that does not have a match. It is also referred to as a full outer join.
Syntax: relation FULL [ OUTER ] JOIN relation [ join_criteria ]

Cross Join
A cross join returns the Cartesian product of two relations.
Syntax: relation CROSS JOIN relation [ join_criteria ]

Semi Join
A semi join returns values from the left side of the relation that has a match with the right. It is also referred to as a left semi join.
Syntax: relation [ LEFT ] SEMI JOIN relation [ join_criteria ]

Anti Join
An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.
Syntax: relation [ LEFT ] ANTI JOIN relation [ join_criteria ]

Ref: spark.apache.org

Visualizing Using Venn Diagrams
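The sections below demonstrate joins through the RDD, DataFrame, and pandas-on-Spark APIs rather than the raw SQL syntax listed above, so here is a small sketch of the SQL form itself (illustration only; it assumes the spark session and the df and dfp DataFrames built in section 2 below, with the view names chosen just for this example):

# Register the DataFrames from section 2 as temporary views so that they can
# be referenced by name inside a SQL statement.
df.createOrReplaceTempView("location")
dfp.createOrReplaceTempView("purchase")

# LEFT OUTER JOIN using the syntax shown above.
spark.sql("""
    SELECT l.rid, l.name, l.location, p.cost
    FROM location AS l
    LEFT OUTER JOIN purchase AS p
    ON l.rid = p.rid
""").show()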
1: Using RDD
Spark provides a join() function that can join two paired RDDs based on the same key.
- join(): Performs an inner join between two RDDs: firstRDD.join(laterRDD)
- rightOuterJoin(): Performs a join operation between two RDDs with the key present in the first RDD: firstRDD.rightOuterJoin(laterRDD)
- leftOuterJoin(): Performs a join operation between two RDDs with the key present in the later RDD: firstRDD.leftOuterJoin(laterRDD)

Requirement: Let us consider two different datasets of ArisCCNetwork: RouterLocation.tsv and RouterPurchase.tsv.
Schema:
RouterLocation.tsv: RouterID, Name, Location
RouterPurchase.tsv: RouterID, Date, PrimaryMemory, SecondaryMemory, Cost
Join these two datasets to fetch Routers Location, Cost, and Memory details into a single RDD.

Implementation steps to join
Step 1: Create namedtuple classes representing the datasets
Create two namedtuples representing the schema of each dataset.
Note: A namedtuple is just like a dictionary. It improves code readability by providing a way to access the values using descriptive field names.
Step 2: Generate <K,V> pairs using the namedtuples
In this step, the datasets are loaded as RDDs, and paired RDDs <K, V> are created where K = the common column in both RDDs and V = the value part, which contains a complete record.
Step 3: Apply the join() function
Spark join is applied against the grouped fields of locRDD and purRDD from the previous step.
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
from collections import namedtuple
RouterLocation = namedtuple('RouterLocation',['rid','name','location'])
RouterPurchase = namedtuple('RouterPurchase',['rid','date','pmemory','smemory','cost'])
#Load RouterLocation dataset and generate Rid(common field),RouterLocation object
locRDD = sc.textFile("../in/RouterLocation.tsv")\
.map(lambda line:line.split("\t"))\
.map(lambda r : (r[0], RouterLocation(int(r[0]), r[1], r[2])))
#Load RouterPurchase dataset and generate Rid(common field),RouterLocation object
purRDD = sc.textFile("../in/RouterPurchase.tsv")\
.map(lambda line:line.split("\t"))\
.map(lambda r : (r[0], RouterPurchase(int(r[0]), r[1], int(r[2]), int(r[3]), r[4])))
locRDD.collect()
[('1', RouterLocation(rid=1, name='RTR1', location='Chennai')), ('2', RouterLocation(rid=2, name='RTR2', location='Bangalore')), ('3', RouterLocation(rid=3, name='RTR3', location='Pune')), ('4', RouterLocation(rid=4, name='RTR4', location='Delhi')), ('5', RouterLocation(rid=5, name='RTR5', location='Mumbai'))]
result = locRDD.join(purRDD)
result.collect()
[('4', (RouterLocation(rid=4, name='RTR4', location='Delhi'), RouterPurchase(rid=4, date='9/3/2014', pmemory=653235467, smemory=245913333, cost='1000USD'))), ('3', (RouterLocation(rid=3, name='RTR3', location='Pune'), RouterPurchase(rid=3, date='6/10/2013', pmemory=453232267, smemory=325913333, cost='1200USD'))), ('1', (RouterLocation(rid=1, name='RTR1', location='Chennai'), RouterPurchase(rid=1, date='9/3/2012', pmemory=453232267, smemory=175913333, cost='1000USD'))), ('2', (RouterLocation(rid=2, name='RTR2', location='Bangalore'), RouterPurchase(rid=2, date='9/7/2012', pmemory=453232345, smemory=255913333, cost='1500USD'))), ('5', (RouterLocation(rid=5, name='RTR5', location='Mumbai'), RouterPurchase(rid=5, date='7/7/2014', pmemory=373232267, smemory=465913333, cost='1300USD')))]
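For comparison with the outer-join variants mentioned above, here is a small sketch of leftOuterJoin(), reusing the locRDD and purRDD pair RDDs defined above; if a router id had no matching purchase record, the right-hand side of its value would be None:

# Left outer join: every (rid, RouterLocation) pair from locRDD is kept;
# keys with no match in purRDD get None on the purchase side.
left_joined = locRDD.leftOuterJoin(purRDD)
print(left_joined.collect())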
2: Using SQL DataFrames
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", False).option("delimiter", "\t").csv('../in/RouterLocation.tsv')
# Ref: sql-data-sources-csv
df = df.withColumnRenamed("_c0", "rid")\
.withColumnRenamed("_c1", "name")\
.withColumnRenamed("_c2", "location")
df.show()
+---+----+---------+
|rid|name| location|
+---+----+---------+
|  1|RTR1|  Chennai|
|  2|RTR2|Bangalore|
|  3|RTR3|     Pune|
|  4|RTR4|    Delhi|
|  5|RTR5|   Mumbai|
+---+----+---------+
dfp = spark.read.option("header", False).option("delimiter", "\t").csv('../in/RouterPurchase.tsv')
dfp.show()
+---+---------+---------+---------+-------+
|_c0|      _c1|      _c2|      _c3|    _c4|
+---+---------+---------+---------+-------+
|  1| 9/3/2012|453232267|175913333|1000USD|
|  2| 9/7/2012|453232345|255913333|1500USD|
|  3|6/10/2013|453232267|325913333|1200USD|
|  4| 9/3/2014|653235467|245913333|1000USD|
|  5| 7/7/2014|373232267|465913333|1300USD|
+---+---------+---------+---------+-------+
dfp = dfp.withColumnRenamed("_c0", "rid")\
.withColumnRenamed("_c1", "date")\
.withColumnRenamed("_c2", "pmemory")\
.withColumnRenamed("_c3", "smemory")\
.withColumnRenamed("_c4", "cost")
dfp.show()
+---+---------+---------+---------+-------+
|rid|     date|  pmemory|  smemory|   cost|
+---+---------+---------+---------+-------+
|  1| 9/3/2012|453232267|175913333|1000USD|
|  2| 9/7/2012|453232345|255913333|1500USD|
|  3|6/10/2013|453232267|325913333|1200USD|
|  4| 9/3/2014|653235467|245913333|1000USD|
|  5| 7/7/2014|373232267|465913333|1300USD|
+---+---------+---------+---------+-------+
res = df.join(dfp, df.rid == dfp.rid, how = 'inner')
res.show()
+---+----+---------+---+---------+---------+---------+-------+
|rid|name| location|rid|     date|  pmemory|  smemory|   cost|
+---+----+---------+---+---------+---------+---------+-------+
|  1|RTR1|  Chennai|  1| 9/3/2012|453232267|175913333|1000USD|
|  2|RTR2|Bangalore|  2| 9/7/2012|453232345|255913333|1500USD|
|  3|RTR3|     Pune|  3|6/10/2013|453232267|325913333|1200USD|
|  4|RTR4|    Delhi|  4| 9/3/2014|653235467|245913333|1000USD|
|  5|RTR5|   Mumbai|  5| 7/7/2014|373232267|465913333|1300USD|
+---+----+---------+---+---------+---------+---------+-------+
how: str, optional default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.
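Since the join above is an inner join, it may help to see the how parameter doing something different. A small sketch reusing df and dfp from above (the column names are the ones defined earlier):

# Left join: every row of df is kept; unmatched rows from dfp would appear as NULLs.
left_res = df.join(dfp, on="rid", how="left")
left_res.show()

# Left anti join: rows of df whose rid has no match in dfp
# (empty in this example, because every router has a purchase record).
anti_res = df.join(dfp, on="rid", how="left_anti")
anti_res.show()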
3: Using PySpark.Pandas
from pyspark import pandas as ppd
dfl = ppd.read_csv('../in/RouterLocation.tsv', sep = '\t',
names = ['rid', 'name', 'location'])
/home/ashish/anaconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead. warnings.warn(message, PandasAPIOnSparkAdviceWarning)
dfl
   rid  name   location
0    1  RTR1    Chennai
1    2  RTR2  Bangalore
2    3  RTR3       Pune
3    4  RTR4      Delhi
4    5  RTR5     Mumbai
dfp = ppd.read_csv('../in/RouterPurchase.tsv', sep = '\t',
names = ['rid', 'date', 'pmemory', 'smemory', 'cost'])
/home/ashish/anaconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead. warnings.warn(message, PandasAPIOnSparkAdviceWarning)
dfp
   rid       date    pmemory    smemory     cost
0    1   9/3/2012  453232267  175913333  1000USD
1    2   9/7/2012  453232345  255913333  1500USD
2    3  6/10/2013  453232267  325913333  1200USD
3    4   9/3/2014  653235467  245913333  1000USD
4    5   7/7/2014  373232267  465913333  1300USD
res = ppd.merge(dfl, dfp, on="rid", how = 'inner')
res
   rid  name   location       date    pmemory    smemory     cost
0    1  RTR1    Chennai   9/3/2012  453232267  175913333  1000USD
1    2  RTR2  Bangalore   9/7/2012  453232345  255913333  1500USD
2    3  RTR3       Pune  6/10/2013  453232267  325913333  1200USD
3    4  RTR4      Delhi   9/3/2014  653235467  245913333  1000USD
4    5  RTR5     Mumbai   7/7/2014  373232267  465913333  1300USD