Question 1: Count the range of each value in Python

I have a dataset of students' scores for each subject.

    StuID  Subject  Scores
    1      Math     90
    1      Geo      80
    2      Math     70
    2      Geo      60
    3      Math     50
    3      Geo      90

Now I want to count the range of scores for each subject, like 0 < x <= 20, 20 < x <= 40, and get a dataframe like this:

    Subject  0-20  20-40  40-60  60-80  80-100
    Math     0     0      1      1      1
    Geo      0     0      0      1      2

How can I do it?

Answer 1:

    import pandas as pd

    df = pd.DataFrame({
        "StuID": [1,1,2,2,3,3],
        "Subject": ['Math', 'Geo', 'Math', 'Geo', 'Math', 'Geo'],
        "Scores": [90, 80, 70, 60, 50, 90]
    })

    bins = list(range(0, 100+1, 20))
    # [0, 20, 40, 60, 80, 100]
    labels = [f'{a}-{b}' for a,b in zip(bins, bins[1:])]
    # ['0-20', '20-40', '40-60', '60-80', '80-100']

    out = (pd.crosstab(df['Subject'], pd.cut(df['Scores'], bins=bins, labels=labels, ordered=True, right=False))
             .reindex(labels, axis=1, fill_value=0)
          )
    print(out)

Note that right=False makes each bin closed on the left (for example 60 <= x < 80), which is what reproduces the table in the question. With this setting a score of exactly 100 falls outside the last bin.

Question 2: Split one column into 3 columns in Python or PySpark

I have:

    Customerkeycode:
    B01:B14:110083

I want:

    PlanningCustomerSuperGroupCode, DPGCode, APGCode
    B01,B14,110083

Answer 2:

In PySpark, first split the string into an array, and then use the getItem method to split it into multiple columns.

    import pyspark.sql.functions as F
    ...
    cols = ['PlanningCustomerSuperGroupCode', 'DPGCode', 'APGCode']
    arr_cols = [F.split('Customerkeycode', ':').getItem(i).alias(cols[i]) for i in range(3)]
    df = df.select(*arr_cols)
    df.show(truncate=False)

Using plain pandas:

    import pandas as pd

    df = pd.DataFrame(
        {
            "Customerkeycode": [
                "B01:B14:110083",
                "B02:B15:110084"
            ]
        }
    )

    df['PlanningCustomerSuperGroupCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[0])
    df['DPGCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[1])
    df['APGCode'] = df['Customerkeycode'].apply(lambda x: x.split(":")[2])
    df_rep = df.drop("Customerkeycode", axis = 1)
    print(df_rep)

Output:

      PlanningCustomerSuperGroupCode DPGCode APGCode
    0                            B01     B14  110083
    1                            B02     B15  110084

Question 3: Create a new list of dict, from a dict with a key that has a list value

I have a dict in which one of the keys has a list as its value:

    example = {"a":1,"b":[1,2]}

I am trying to unpack example["b"] and create a list of copies of the same dict, one for each value in example["b"]:

    output = [{"a":1,"b":1},{"a":1,"b":2}]

I have tried to use a for loop to understand the unpacking and reconstruction of the list of dicts, but I am seeing strange behavior:

    iter = example.get("b")
    new_list = []
    for p in iter:
        print(f"p is {p}")
        tmp_dict = example
        tmp_dict["b"] = p
        print(tmp_dict)
        new_list.append(tmp_dict)
    print(new_list)

Output:

    p is 1
    {'a': 1, 'b': 1}
    p is 2
    {'a': 1, 'b': 2}
    [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}]

Why does the first dict in the list end up with example["b"] = 2 although the first print() shows that p is 1?

Answer 3.1:

Here's a general approach that works for all cases without hardcoding any keys. Let's first create a temporary dictionary where all values are lists.

    temp = {k: v if isinstance(v, list) else [v] for k, v in example.items()}

This allows us to then obtain all the values of our temp dict as a list of lists. We want the product of all the values of this temp dictionary.
To do this, we can use the itertools.product function and unpack our list of lists into its arguments. In each iteration, the resulting tuple has one value per key of the temp dictionary, so all we need to do is zip the keys with that tuple and create a dict out of those key-value pairs. That gives us our list element!

    import itertools

    keys = list(temp.keys())
    vals = list(temp.values())

    result = []
    for vals_product in itertools.product(*vals):
        d = dict(zip(keys, vals_product))
        result.append(d)

Which gives the required result:

    [{'a': 1, 'b': 1}, {'a': 1, 'b': 2}]

This even works for an example with more keys:

    example = {'a': 1, 'b': [1, 2], 'c': [1, 2, 3]}

which gives:

    [{'a': 1, 'b': 1, 'c': 1},
     {'a': 1, 'b': 1, 'c': 2},
     {'a': 1, 'b': 1, 'c': 3},
     {'a': 1, 'b': 2, 'c': 1},
     {'a': 1, 'b': 2, 'c': 2},
     {'a': 1, 'b': 2, 'c': 3}]

Answer 3.2:

Just one minor correction: use dict() to copy the dictionary. In the original loop, tmp_dict = example only binds a second name to the same dict object, so every element appended to new_list is that one object, and they all show the last value assigned to "b".

    example = {"a":1,"b":[1,2]}
    iter = example.get("b")
    new_list = []
    for p in iter:
        print(f"p is {p}")
        tmp_dict = dict(example)
        tmp_dict["b"] = p
        print(tmp_dict)
        new_list.append(tmp_dict)
    print(new_list)

The final print() now gives:

    [{'a': 1, 'b': 1}, {'a': 1, 'b': 2}]

Question 4: Classification of sentences

I have a list of sentences. Examples:

    ${INS1}, Watch our latest webinar about flu vaccine
    Do you think patients would like to go up to 250 days without an attack?
    Watch our latest webinar about flu vaccine
    ??? See if more of your patients are ready for vaccine
    Important news for your invaccinated patients
    Important news for your inv?ccinated patients
    ...

By 'good' I mean sentences with no strange characters or character sequences such as '${INS1}', '???', or a '?' inside a word. Otherwise the sentence is considered 'bad'. I need to find 'good' patterns so that I can identify 'bad' sentences in the future and exclude them, as the list of sentences will grow and new 'bad' sentences might appear. Is there any way to identify 'good' sentences?
Answer 4:

This solution is based on the character examples specifically given in the question. If more characters should be used to tell good and bad sentences apart, add them to the regex in the code below.

    import re

    sents = [
        "${INS1}, Watch our latest webinar about flu vaccine",
        "Do you think patients would like to go up to 250 days without an attack?",
        "Watch our latest webinar about flu vaccine",
        "??? See if more of your patients are ready for vaccine",
        "Important news for your invaccinated patients",
        "Important news for your inv?ccinated patients"
    ]

    good_sents = []
    bad_sents = []
    for i in sents:
        x = re.findall("[?}{$]", i)
        if(len(x) == 0):
            good_sents.append(i)
        else:
            bad_sents.append(i)
    print(good_sents)

Note that this pattern also matches the question mark that ends the second sentence, so that sentence is classified as bad as well.

Question 5: How to make the sequence result in one line?

I need to print 'n' Fibonacci numbers. Here is the expected output:

    Enter n: 10
    Fibonacci numbers = 1 1 2 3 5 8 13 21 34 55

Here is my current code:

    n = input("Enter n: ")

    def fib(n):
        cur = 1
        old = 1
        i = 1
        while (i < n):
            cur, old, i = cur+old, cur, i+1
        return cur

Answer 5:

To avoid repeated calculation, it is more efficient to have a fib function generate a list of the first n Fibonacci numbers.

    def fib(n):
        fibs = [0, 1]
        for _ in range(n-2):
            fibs.append(sum(fibs[-2:]))
        return fibs[:n]  # slice so that n < 2 also returns exactly n numbers

We know the first two Fibonacci numbers are 0 and 1. For the remaining count we can add the sum of the last two numbers to the list.

    >>> fib(10)
    [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

You can now:

    print('Fibonacci numbers = ', end='')
    print(*fib(10), sep=' ', end='\n')

This sequence starts at 0. To match the expected output in the question, which starts 1 1 2, seed the list with [1, 1] instead.

Question 6: Maximum occurrences in a list / DataFrame column

I have a dataframe like the one below.
    import pandas as pd

    data = {'Date': ['2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05',
                     '2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05',
                     '2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05'],
            'Runner': ['Runner A', 'Runner A', 'Runner A', 'Runner A', 'Runner A',
                       'Runner B', 'Runner B', 'Runner B', 'Runner B', 'Runner B',
                       'Runner C', 'Runner C', 'Runner C', 'Runner C', 'Runner C'],
            'Training Time': ['less than 1 hour', 'less than 1 hour', 'less than 1 hour', 'less than 1 hour', '1 hour to 2 hour',
                              'less than 1 hour', '1 hour to 2 hour', 'less than 1 hour', '1 hour to 2 hour', '2 hour to 3 hour',
                              '1 hour to 2 hour ', '2 hour to 3 hour' ,'1 hour to 2 hour ', '2 hour to 3 hour', '2 hour to 3 hour']
            }

    df = pd.DataFrame(data)

I have counted the occurrences for each runner using the code below:

    s = df.groupby(['Runner','Training Time']).size()

The problem is with Runner B. The result should show both "1 hour to 2 hour" and "less than 1 hour", but it only shows "1 hour to 2 hour".
How can I get the expected result?

Answer 6.1:

    import pandas as pd

    data = {'Date': ['2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05',
                     '2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05',
                     '2022/09/01', '2022/09/02', '2022/09/03', '2022/09/04', '2022/09/05'],
            'Runner': ['Runner A', 'Runner A', 'Runner A', 'Runner A', 'Runner A',
                       'Runner B', 'Runner B', 'Runner B', 'Runner B', 'Runner B',
                       'Runner C', 'Runner C', 'Runner C', 'Runner C', 'Runner C'],
            'Training Time': ['less than 1 hour', 'less than 1 hour', 'less than 1 hour', 'less than 1 hour', '1 hour to 2 hour',
                              'less than 1 hour', '1 hour to 2 hour', 'less than 1 hour', '1 hour to 2 hour', '2 hour to 3 hour',
                              '1 hour to 2 hour ', '2 hour to 3 hour' ,'1 hour to 2 hour ', '2 hour to 3 hour', '2 hour to 3 hour']
            }

    df = pd.DataFrame(data)

    # Count each (Runner, Training Time) pair
    s = df.groupby(['Runner', 'Training Time'], as_index=False).size()
    s.columns = ['Runner', 'Training Time', 'Size']

    # Largest count per runner
    r = s.groupby(['Runner'], as_index=False)['Size'].max()

    # Keep every row that reaches its runner's largest count, so ties are preserved
    df_list = []
    for index, row in r.iterrows():
        temp_df = s[(s['Runner'] == row['Runner']) & (s['Size'] == row['Size'])]
        df_list.append(temp_df)

    df_report = pd.concat(df_list)
    print(df_report)
    df_report.to_csv('report.csv', index = False)

Answer 6.2:

    import collections

    def agg_most_common(vals):
        # most_common() is sorted by count, so stop at the first lower count
        matches = []
        for i in collections.Counter(vals).most_common():
            if not matches or matches[0][1] == i[1]:
                matches.append(i)
            else:
                break
        return [x[0] for x in matches]

    print(df.groupby('Runner')['Training Time'].agg(agg_most_common))
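Both answers keep every training time that ties for the highest count. The same idea can possibly be written without an explicit loop by comparing each count with its per-runner maximum. The snippet below is only a sketch: the small dataset and the names counts, is_max and top are made up for illustration and are not taken from either answer.

```python
import pandas as pd

# Hypothetical toy data (not the dataset from the question): Runner B has a tie.
df = pd.DataFrame({
    "Runner": ["A", "A", "A", "B", "B", "B", "B"],
    "Training Time": ["short", "short", "long", "short", "long", "short", "long"],
})

# Count each (Runner, Training Time) pair.
counts = (
    df.groupby(["Runner", "Training Time"])
      .size()
      .reset_index(name="Size")
)

# transform("max") broadcasts each runner's largest count back onto its rows,
# so the comparison keeps every row that ties for the maximum.
is_max = counts["Size"] == counts.groupby("Runner")["Size"].transform("max")
top = counts[is_max].reset_index(drop=True)
print(top)
```

Here Runner A keeps only "short", while Runner B keeps both "long" and "short" because each occurs twice.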
Thursday, October 20, 2022
Python Interview Questions (2022 Oct, Week 3)