Thursday, October 20, 2022

Python Interview Questions (2022 Oct, Week 3)

Ques 1: Count the range of each value in Python
I have a dataset of students' scores for each subject.

StuID  Subject Scores                
1      Math    90
1      Geo     80
2      Math    70
2      Geo     60
3      Math    50
3      Geo     90
Now I want to count how many scores fall in each range for each subject, such as 0 < x <= 20, 20 < x <= 40, and get a dataframe like this:

Subject  0-20  20-40 40-60 60-80 80-100                 
Math       0     0     1     1     1
Geo        0     0     0     1     2    
How can I do it?

Ans 1:

import pandas as pd

df = pd.DataFrame({
    "StuID": [1,1,2,2,3,3],
    "Subject": ['Math', 'Geo', 'Math', 'Geo', 'Math', 'Geo'],
    "Scores": [90, 80, 70, 60, 50, 90]
})

bins = list(range(0, 100+1, 20))
# [0, 20, 40, 60, 80, 100]
labels = [f'{a}-{b}' for a,b in zip(bins, bins[1:])]
# ['0-20', '20-40', '40-60', '60-80', '80-100']

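# right=False makes each bin left-closed (a <= x < b), which reproduces the
# expected table; a score of exactly 100 would fall outside the last bin.
# crosstab drops empty bins, so reindex restores them as zero columns.
out = (pd.crosstab(df['Subject'], pd.cut(df['Scores'], bins=bins,
                                            labels=labels, ordered=True, right=False))
            .reindex(labels, axis=1, fill_value=0)
        )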

print(out)
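
The question text describes right-closed ranges (0 < x <= 20), while the expected table corresponds to left-closed ranges (0 <= x < 20). A minimal sketch contrasting the two on the same data, assuming the same bins and labels as above; the helper name count_by_bin is hypothetical and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "StuID": [1, 1, 2, 2, 3, 3],
    "Subject": ["Math", "Geo", "Math", "Geo", "Math", "Geo"],
    "Scores": [90, 80, 70, 60, 50, 90],
})

bins = list(range(0, 101, 20))                         # [0, 20, 40, 60, 80, 100]
labels = [f"{a}-{b}" for a, b in zip(bins, bins[1:])]  # '0-20', '20-40', ...


def count_by_bin(frame, right):
    """Hypothetical helper: cross-tabulate Subject against binned Scores."""
    binned = pd.cut(frame["Scores"], bins=bins, labels=labels, right=right)
    # Restore bins that contain no scores as zero-filled columns.
    return (pd.crosstab(frame["Subject"], binned)
              .reindex(labels, axis=1, fill_value=0))


left_closed = count_by_bin(df, right=False)   # a <= x < b
right_closed = count_by_bin(df, right=True)   # a < x <= b (the pd.cut default)

print(left_closed)
print(right_closed)
```

With right=False the Geo row comes out as 0 0 0 1 2, matching the expected table. With right=True the Geo scores of 60 and 80 each move down one bin, so the row becomes 0 0 1 1 1.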

Ques 2: Split one column into 3 columns in python or PySpark

I have:
Customerkeycode:
B01:B14:110083

I want:
PlanningCustomerSuperGroupCode, DPGCode, APGCode
B01,B14,110083

Ans 2:

In PySpark, first split the string into an array, then use the getItem method to pull each element into its own column.

import pyspark.sql.functions as F
...
cols = ['PlanningCustomerSuperGroupCode', 'DPGCode', 'APGCode']
arr_cols = [F.split('Customerkeycode', ':').getItem(i).alias(cols[i]) for i in range(3)]
df = df.select(*arr_cols)
df.show(truncate=False)

Using Plain Pandas:


import pandas as pd

df = pd.DataFrame(
    {
        "Customerkeycode": [
            "B01:B14:110083",
            "B02:B15:110084"
        ]
    }
)

df['PlanningCustomerSuperGroupCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[0])
df['DPGCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[1])
df['APGCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[2])

df_rep = df.drop("Customerkeycode", axis = 1)

print(df_rep)


   PlanningCustomerSuperGroupCode DPGCode APGCode
0                            B01     B14  110083
1                            B02     B15  110084
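
As an alternative to calling apply three times, pandas can split once and expand the pieces into columns. A minimal sketch, assuming every value contains exactly two ':' separators; the variable name parts is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Customerkeycode": ["B01:B14:110083", "B02:B15:110084"]})

cols = ["PlanningCustomerSuperGroupCode", "DPGCode", "APGCode"]

# Split each string once on ':' and spread the pieces over three columns.
parts = df["Customerkeycode"].str.split(":", expand=True)
parts.columns = cols

print(parts)
```

This parses each string once instead of three times, and the original column is left out of the result, so no drop is needed.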


Ques 3: Create a new list of dict, from a dict with a key that has a list value

I have a dict in which one of the keys has a list as its value.

example = {"a":1,"b":[1,2]}
I am trying to unpack example["b"] and create a list of copies of the dict, one for each value in example["b"].

output = [{"a":1,"b":1},{"a":1,"b":2}]
I have tried to use a for loop to understand the unpacking and reconstruction of the list of dicts, but I am seeing strange behavior:


iter = example.get("b")

new_list = []

for p in iter:
    print(f"p is {p}")
    tmp_dict = example
    tmp_dict["b"] = p
    print(tmp_dict)
    new_list.append(tmp_dict)

print(new_list)

Output:


p is 1
{'a': 1, 'b': 1}

p is 2
{'a': 1, 'b': 2}

[{'a': 1, 'b': 2}, {'a': 1, 'b': 2}]

Why does the first dict in the list end up with example["b"] = 2, although the first print() shows that p is 1?

Answer 3.1:

Here's a general approach that works for all cases without hardcoding any keys. Let's first create a temporary dictionary where all values are lists.

temp = {k: v if isinstance(v, list) else [v] for k, v in example.items()}
This allows us to then obtain the list of all the values in our temp dict as a list of lists.

We want the product of all the values of this temp dictionary. To do this, we can use the itertools.product function, and unpack our list of lists to its arguments.

In each iteration, the resulting tuple holds one value per key of the temp dictionary, so all we need to do is zip the keys with that tuple and create a dict out of those key-value pairs. That gives us one element of our list!


import itertools

keys = list(temp.keys())
vals = list(temp.values())

result = []

for vals_product in itertools.product(*vals):
    d = dict(zip(keys, vals_product))
    result.append(d) 
Which gives the required result:

[{'a': 1, 'b': 1}, {'a': 1, 'b': 2}]

This even works for an example with more keys:

example = {'a': 1, 'b': [1, 2], 'c': [1, 2, 3]}
which gives:


[{'a': 1, 'b': 1, 'c': 1},
 {'a': 1, 'b': 1, 'c': 2},
 {'a': 1, 'b': 1, 'c': 3},
 {'a': 1, 'b': 2, 'c': 1},
 {'a': 1, 'b': 2, 'c': 2},
 {'a': 1, 'b': 2, 'c': 3}]

Answer 3.2:
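Only one minor correction is needed: use dict(). The assignment tmp_dict = example does not copy anything; it makes tmp_dict another name for the same dict object. Every entry appended to new_list is therefore that one object, and all entries show the last value assigned to "b". Calling dict(example) creates a new (shallow) copy on each iteration.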


example = {"a":1,"b":[1,2]}

iter = example.get("b")

new_list = []

for p in iter:
    print(f"p is {p}")
    tmp_dict = dict(example)
    tmp_dict["b"] = p
    print(tmp_dict)
    new_list.append(tmp_dict)

print(new_list)

Output is as given below: [{'a': 1, 'b': 1}, {'a': 1, 'b': 2}]
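
The difference between binding a second name and making a copy can be checked with the is operator. A minimal sketch; the names alias, copied, and compact are illustrative and not from the original post:

```python
example = {"a": 1, "b": [1, 2]}

alias = example          # no copy: a second name for the same dict object
copied = dict(example)   # shallow copy: a new dict with the same keys and values

print(alias is example)    # True
print(copied is example)   # False

# A change made through the alias is visible through the original name,
# but not in the copy.
alias["a"] = 99
print(example["a"])   # 99
print(copied["a"])    # 1

# The copy is shallow: the nested list is still shared by both dicts.
print(copied["b"] is example["b"])   # True

# Compact form of the loop: build a fresh dict for each value of "b".
example = {"a": 1, "b": [1, 2]}
compact = [{**example, "b": v} for v in example["b"]]
print(compact)   # [{'a': 1, 'b': 1}, {'a': 1, 'b': 2}]
```

A shallow copy is enough here because only the top-level key "b" is reassigned. If the nested list itself were mutated, copy.deepcopy would be the safer choice.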


Question 4: Classification of sentences

I have a list of sentences. Examples:


${INS1}, Watch our latest webinar about flu vaccine
Do you think patients would like to go up to 250 days without an attack?
Watch our latest webinar about flu vaccine
??? See if more of your patients are ready for vaccine
Important news for your invaccinated patients
Important news for your inv?ccinated patients
...

By 'good' I mean sentences with no strange characters or character sequences, such as '${INS1}', '???', or a '?' inside a word. Any other sentence is considered 'bad'. I need to find 'good' patterns so that I can identify and exclude 'bad' sentences in the future, since the list of sentences will grow and new 'bad' sentences might appear.

Is there any way to identify 'good' sentences?

Answer 4:

This solution is based on the character examples given in the question. If other characters or sequences should also mark a sentence as bad, add them to the regex in the code below.


import re 

sents = [
    "${INS1}, Watch our latest webinar about flu vaccine",
    "Do you think patients would like to go up to 250 days without an attack?",
    "Watch our latest webinar about flu vaccine",
    "??? See if more of your patients are ready for vaccine",
    "Important news for your invaccinated patients",
    "Important news for your inv?ccinated patients"
]

good_sents = []
bad_sents = []

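# A sentence is bad if it contains a template character ($, {, }),
# a run of two or more question marks, or a '?' between two word characters.
# A single '?' at the end of a sentence is allowed.
bad_pattern = re.compile(r"[${}]|\?{2,}|\w\?\w")

for sent in sents:
    if bad_pattern.search(sent):
        bad_sents.append(sent)
    else:
        good_sents.append(sent)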

print(good_sents)

Question 5: How to make the sequence result in one line? 

Need to print 'n' Fibonacci numbers.

Here is the expected output as shown below:


Enter n: 10
Fibonacci numbers = 1 1 2 3 5 8 13 21 34 55

Here is my current code:


n = input("Enter n: ")

def fib(n):
    cur = 1
    old = 1
    i = 1
    while (i < n):
        cur, old, i = cur+old, cur, i+1
    return cur

Answer 5:
To avoid recomputing earlier terms, it is more efficient to have a fib function generate a list of the first n Fibonacci numbers.


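def fib(n):
    fibs = [0, 1]
    for _ in range(n-2):
        # each new term is the sum of the previous two
        fibs.append(sum(fibs[-2:]))
    # slice so that n < 2 does not return more than n numbers
    return fibs[:n]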
We know the first two Fibonacci numbers are 0 and 1. For the remaining count we can add the sum of the last two numbers to the list.

>>> fib(10)
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

You can now:

print('Fibonacci numbers = ', end='')
print(*fib(10), sep=' ', end='\n')
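
The expected output in the question starts at 1, 1 rather than 0, 1. A minimal sketch that matches it, assuming the sequence should start from 1; the function name fib_from_one is hypothetical, and n is hardcoded here where the original script would read it with int(input("Enter n: ")):

```python
def fib_from_one(n):
    """Hypothetical helper: first n Fibonacci numbers, starting 1, 1."""
    fibs = []
    cur, nxt = 1, 1
    for _ in range(n):
        fibs.append(cur)
        cur, nxt = nxt, cur + nxt
    return fibs


n = 10  # in the question's script: n = int(input("Enter n: "))

# Unpacking the list lets print() separate the numbers with single spaces.
print("Fibonacci numbers =", *fib_from_one(n))
# Fibonacci numbers = 1 1 2 3 5 8 13 21 34 55
```

Note that input() returns a string, so the value has to be converted with int() before it is compared with a counter or passed to range().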

Question 6: Maximum occurrences in a list / DataFrame column

I have a dataframe like the one below.


import pandas as pd

data = {'Date': ['2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05'],
        'Runner': ['Runner A', 'Runner A', 'Runner A', 'Runner A', 'Runner A','Runner B', 'Runner B', 'Runner B', 'Runner B', 'Runner B','Runner C', 'Runner C', 'Runner C', 'Runner C', 'Runner C'],
        'Training Time': ['less than 1 hour', 'less than 1 hour', 'less than 1 hour', 'less than 1 hour', '1 hour to 2 hour','less than 1 hour', '1 hour to 2 hour', 'less than 1 hour', '1 hour to 2 hour', '2 hour to 3 hour', '1 hour to 2 hour ', '2 hour to 3 hour' ,'1 hour to 2 hour ', '2 hour to 3 hour', '2 hour to 3 hour']
        }

df = pd.DataFrame(data)

I have counted the occurrence for each runner using the below code

s = df.groupby(['Runner','Training Time']).size()

The problem is with Runner B. It should show both "1 hour to 2 hour" and "less than 1 hour", but it only shows "1 hour to 2 hour". How can I get the expected result?

Answer 6.1:

import pandas as pd

data = {'Date': ['2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05','2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05'],
        'Runner': ['Runner A', 'Runner A', 'Runner A', 'Runner A', 'Runner A','Runner B', 'Runner B', 'Runner B', 'Runner B', 'Runner B','Runner C', 'Runner C', 'Runner C', 'Runner C', 'Runner C'],
        'Training Time': ['less than 1 hour', 'less than 1 hour', 'less than 1 hour', 'less than 1 hour', '1 hour to 2 hour','less than 1 hour', '1 hour to 2 hour', 'less than 1 hour', '1 hour to 2 hour', '2 hour to 3 hour', '1 hour to 2 hour ', '2 hour to 3 hour' ,'1 hour to 2 hour ', '2 hour to 3 hour', '2 hour to 3 hour']
        }

df = pd.DataFrame(data)

# Count the occurrences of each training time per runner
s = df.groupby(['Runner', 'Training Time'], as_index=False).size()
s.columns = ['Runner', 'Training Time', 'Size']

# Highest count per runner
r = s.groupby(['Runner'], as_index=False)['Size'].max()

# Keep every row that reaches its runner's maximum, so ties are preserved
df_list = []

for index, row in r.iterrows():
    temp_df = s[(s['Runner'] == row['Runner']) & (s['Size'] == row['Size'])]
    df_list.append(temp_df)

df_report = pd.concat(df_list)

print(df_report)

df_report.to_csv('report.csv', index = False)

Answer 6.2:

import collections

def agg_most_common(vals):
    # most_common() sorts by count, so collect items until the count drops
    matches = []
    for i in collections.Counter(vals).most_common():
        if not matches or matches[0][1] == i[1]:
            matches.append(i)
        else:
            break
    return [x[0] for x in matches]

print(df.groupby('Runner')['Training Time'].agg(agg_most_common))
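
The same tie-preserving selection can also be written without an explicit loop, by comparing each count with its group's maximum. A minimal sketch on a small hypothetical frame (not the question's data), in which runner "B" has a tie; the names df_demo, counts, and top are illustrative:

```python
import pandas as pd

# Small hypothetical frame with a tie for runner "B"
df_demo = pd.DataFrame({
    "Runner": ["A", "A", "A", "B", "B", "B", "B"],
    "Training Time": ["short", "short", "long", "short", "long", "short", "long"],
})

# One row per (Runner, Training Time) pair with its count
counts = (df_demo.groupby(["Runner", "Training Time"])
                 .size()
                 .reset_index(name="Size"))

# transform("max") broadcasts each runner's highest count back onto its rows,
# so every row that ties for the maximum passes the comparison.
is_top = counts["Size"] == counts.groupby("Runner")["Size"].transform("max")
top = counts[is_top]

print(top)
```

Runner "A" keeps only "short", while runner "B" keeps both "long" and "short" because their counts are equal.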
Tags: Python, Technology
