Tuesday, June 7, 2022

Creating a Taxonomy for BBC News Articles (Part 6, based on "A Hybrid Approach to Hypernym Discovery")

The difference between Part 5 and Part 6.

In Part 5, we used the cosine distance between the embedding of the input text and the embedding of the output label. In Part 6, we instead take the dot product of the two embeddings and then pass it through the sigmoid function to get a probability, similar to Logistic Regression. The sigmoid is a mathematical function with a characteristic "S"-shaped curve that maps any real number to a value between 0 and 1.
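To make the difference concrete, here is a minimal sketch using toy 3-dimensional vectors (my own illustration, not the actual DistilBERT embeddings): Part 5 ranked labels by cosine similarity, while Part 6 squashes the raw dot product through a sigmoid so each label gets an independent score between 0 and 1 that we can read as a probability.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Toy vectors standing in for a text embedding and a label embedding.
text_vec = np.array([0.9, 0.1, 0.3])
label_vec = np.array([0.8, 0.2, 0.1])

# Part 5 style: cosine similarity (scale-invariant, in [-1, 1]).
cos_sim = np.dot(text_vec, label_vec) / (np.linalg.norm(text_vec) * np.linalg.norm(label_vec))

# Part 6 style: sigmoid of the raw dot product, giving a value in (0, 1).
proba = sigmoid(np.dot(text_vec, label_vec))

print(cos_sim, proba)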
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity  # Expects 2D arrays as input
from scipy.spatial.distance import cosine  # Works with 1D vectors
from sklearn.metrics import classification_report

smodel = SentenceTransformer('distilbert-base-nli-mean-tokens')

df1 = pd.read_csv('bbc_news_train.csv')
df1.head()
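Before encoding the full dataset, a quick sanity check (my own addition, not part of the original notebook) can confirm what the model returns. I expect distilbert-base-nli-mean-tokens to produce 768-dimensional sentence embeddings as numpy arrays, but it is worth verifying on one sample sentence.

# Sanity check on a single sentence before embedding every article.
sample_vec = smodel.encode(["a short test sentence"])[0]
print(type(sample_vec), sample_vec.shape)          # expected: numpy.ndarray, (768,)
print(smodel.get_sentence_embedding_dimension())   # expected: 768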
# Encode each article's text into a sentence embedding.
def get_sentence_vector(query):
    query_vec = smodel.encode([query])[0]
    return query_vec

df1['textVec'] = df1['Text'].apply(lambda x: get_sentence_vector(x))

# Normalise the category names so they match the label strings we will embed.
def std_category(x):
    if x == 'tech':
        return 'technology'
    elif x == 'sport':
        return 'sports'
    else:
        return x

df1['Category'] = df1['Category'].apply(std_category)

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Dot product of the text embedding and the label embedding,
# passed through the sigmoid to get a probability, as in Logistic Regression.
def get_logistic_regression_probability(x, Y):
    y = smodel.encode([Y])[0]
    d = np.dot(x, y)
    s = sigmoid(d)
    return s

df1['proba_business'] = df1['textVec'].apply(lambda x: get_logistic_regression_probability(x, 'business'))
# CPU times: total: 2min 1s. Wall time: 1min 1s

df1['Category'].unique()
# OUTPUT: array(['business', 'technology', 'politics', 'sports', 'entertainment'], dtype=object)

df1['proba_technology'] = df1['textVec'].apply(lambda x: get_logistic_regression_probability(x, 'technology'))
df1['proba_politics'] = df1['textVec'].apply(lambda x: get_logistic_regression_probability(x, 'politics'))
df1['proba_sports'] = df1['textVec'].apply(lambda x: get_logistic_regression_probability(x, 'sports'))
df1['proba_entertainment'] = df1['textVec'].apply(lambda x: get_logistic_regression_probability(x, 'entertainment'))

# The predicted category is the label with the highest probability.
def get_prediction(in_row):
    max_proba = 0
    label = ""
    for i in ['proba_business', 'proba_technology', 'proba_politics', 'proba_sports', 'proba_entertainment']:
        d = in_row[i]
        if d > max_proba:
            max_proba = d
            label = i.split('_')[1]
    return label

df1['prediction'] = df1.apply(lambda in_row: get_prediction(in_row), axis=1)

target_names = ['business', 'entertainment', 'politics', 'sports', 'technology']
print(classification_report(df1['Category'], df1['prediction'], target_names=target_names))
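As a usage note, the same scoring can be applied to a single unseen piece of text. The snippet below is a small sketch that reuses get_sentence_vector and get_logistic_regression_probability defined above; the example sentence is my own illustration, not taken from the BBC dataset.

# Score one new piece of text against every candidate label and pick the best.
candidate_labels = ['business', 'technology', 'politics', 'sports', 'entertainment']

def classify_text(text):
    text_vec = get_sentence_vector(text)
    scores = {label: get_logistic_regression_probability(text_vec, label)
              for label in candidate_labels}
    return max(scores, key=scores.get), scores

label, scores = classify_text("The chancellor announced new tax cuts in the budget speech.")
print(label)
print(scores)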
Tags: Technology, Natural Language Processing
