The software tools we need for the activity covered in this post are:
Google Colab
GitHub
ChatGPT
Why did I need these three items?
I needed Google Colab to write code. Colab let me avoid creating and configuring a local environment with the required packages, such as 'transformers' from Hugging Face.
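(If a package such as 'transformers' is ever missing from the Colab runtime, installing it is a one-line notebook cell; this is just the standard pip command, not anything specific to this post.)

!pip install transformers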
I needed GitHub to keep my code in a place that is available to me from anywhere, and also to you (my readers).
I needed ChatGPT to get boilerplate code for our particular task. I learnt about the prompts I needed for this activity from the book by Sinan Ozdemir titled:
Quick Start Guide to Large Language Models: Strategies and Best Practices for Using ChatGPT and Other LLMs (Addison-Wesley Professional, 2023)
4. Once the repo is selected, its notebooks start appearing below:
Code for a zero-shot Spam vs. Not Spam classifier using Facebook's BART
from transformers import pipeline

def classify_text(email):
    """
    Use Facebook's BART model to classify an email into "spam" or "not spam".

    Args:
        email (str): The email to classify

    Returns:
        str: The classification of the email
    """
    classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
    labels = ['spam', 'not spam']
    hypothesis_template = 'This email is {}.'
    results = classifier(email, labels, hypothesis_template=hypothesis_template)
    return results['labels'][0]
Usage:
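For example, calling the function on a made-up email might look like this (the email text is purely illustrative):

sample_email = "Congratulations! You have been selected for a free cruise. Click the link to claim your prize."
print(classify_text(sample_email))  # for a message like this we would expect 'spam'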
How did we use ChatGPT?
Prompt:
Provide code to do zero-shot sentiment analysis using Google BERT.
Answer:
To perform zero-shot sentiment analysis using Google's BERT model, you can utilize the Hugging Face Transformers library along with the pre-trained BERT model. Here's how you can do it in Python:
from transformers import pipeline

# Load the zero-shot classification pipeline with BERT
classifier = pipeline("zero-shot-classification", model="bert-base-uncased")

# Define a list of texts for sentiment analysis
texts = [
    "I love this product, it's amazing!",
    "This movie was terrible, I hated it.",
    "The weather today is fantastic.",
    "I feel neutral about this situation."
]

# Define the possible labels/categories for sentiment
candidate_labels = ["positive", "negative", "neutral"]

# Run zero-shot classification on all texts
results = classifier(texts, candidate_labels)

# Print the results
for i, text in enumerate(texts):
    print(f"Text: {text}")
    print("Sentiment probabilities:")
    for label, score in zip(results[i]['labels'], results[i]['scores']):
        print(f"{label}: {score:.4f}")
    print()
Output
Text: I love this product, it's amazing!
Sentiment probabilities:
positive: 0.4718
negative: 0.2679
neutral: 0.2603
Text: This movie was terrible, I hated it.
Sentiment probabilities:
negative: 0.3644
positive: 0.3179
neutral: 0.3177
Text: The weather today is fantastic.
Sentiment probabilities:
positive: 0.4026
negative: 0.3039
neutral: 0.2935
Text: I feel neutral about this situation.
Sentiment probabilities:
neutral: 0.3587
negative: 0.3248
positive: 0.3166
Additional Note
Question:
How does BERT know that the problem is sentiment analysis and not spam vs. not-spam?
Answer:
In the code provided, the BERT model doesn't inherently "know" that the problem is specifically about sentiment analysis. Instead, it is designed for zero-shot classification, which means it can classify texts into predefined categories without being explicitly trained on those categories.
The key lies in the input data and the labels provided. In the example, we've defined a list of texts that contain statements or reviews, and we've also provided a list of candidate labels (positive, negative, neutral) that represent sentiment categories. When you pass these texts and labels to the BERT model through the zero-shot classification pipeline, it infers the sentiment of each text based on its understanding of the language and the provided labels.
If you were to perform a different classification task, such as spam vs. not spam, you would simply replace the list of candidate labels with the appropriate labels for that task (e.g., spam, not_spam). The model doesn't need to be explicitly told the nature of the classification problem; it scores the text against whatever candidate labels you provide at inference time.
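To make that concrete, here is a small sketch that reuses the `classifier` built in the code above with spam-oriented labels instead (the email text is a made-up example):

spam_labels = ["spam", "not spam"]
email = "You have won a $1,000 gift card. Reply with your details to claim it."  # made-up example
result = classifier(email, spam_labels)
print(result["labels"][0], round(result["scores"][0], 4))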
We will do Sentiment Analysis using the code from this repo: GitHub
Note: The entire GitHub code base for this project is about 18 MB in size.
The first time you run "server.py" from the Anaconda Prompt, it downloads the BERT model, which is about 450 MB in size.
Contents of YAML file for conda environment creation: env.yml
import transformers as ppb
import torch
import numpy as np
print(ppb.__version__)
4.19.2
input_sentence_1 = "In recent years, a lot of hype has developed around the promise of neural networks and their ability to classify and identify input data, and more recently the ability of certain network architectures to generate original content. Companies large and small are using them for everything from image captioning and self-driving car navigation to identifying solar panels from satellite images and recognizing faces in security camera videos. And luckily for us, many NLP applications of neural nets exist as well. While deep neural networks have inspired a lot of hype and hyperbole, our robot overlords are probably further off than any clickbait cares to admit. Neural networks are, however, quite powerful tools, and you can easily use them in an NLP chatbot pipeline to classify input text, summarize documents, and even generate novel works. This chapter is intended as a primer for those with no experience in neural networks. We don’t cover anything specific to NLP in this chapter, but gaining a basic understanding of what is going on under the hood in a neural network is important for the upcoming chapters. If you’re familiar with the basics of a neural network, you can rest easy in skipping ahead to the next chapter, where you dive back into processing text with the various flavors of neural nets. Although the mathematics of the underlying algorithm, backpropagation, are outside this book’s scope, a high-level grasp of its basic functionality will help you understand language and the patterns hidden within. As the availability of processing power and memory has exploded over the course of the decade, an old technology has come into its own again. First proposed in the 1950s by Frank Rosenblatt, the perceptron1 offered a novel algorithm for finding patterns in data. The basic concept lies in a rough mimicry of the operation of a living neuron cell. As electrical signals flow into the cell through the dendrites (see figure 5.1) into the nucleus, an electric charge begins to build up. When the cell reaches a certain level of charge, it fires, sending an electrical signal out through the axon. However, the dendrites aren’t all created equal. The cell is more “sensitive” to signals through certain dendrites than others, so it takes less of a signal in those paths to fire the axon."
print(input_sentence_1)
print("Char count", len(input_sentence_1))
print("Word Count:", len(input_sentence_1.split(" ")))
Char count 2309
Word Count: 382
input_sentence_2 = "The biology that controls these relationships is most certainly beyond the scope of this book, but the key concept to notice here is the way the cell weights incoming signals when deciding when to fire. The neuron will dynamically change those weights in the decision making process over the course of its life. You are going to mimic that process. Rosenblatt’s original project was to teach a machine to recognize images. The original perceptron was a conglomeration of photo-receptors and potentiometers, not a computer in the current sense. But implementation specifics aside, Rosenblatt’s concept was to take the features of an image and assign a weight, a measure of importance, to each one. The features of the input image were each a small subsection of the image. A grid of photo-receptors would be exposed to the image. Each receptor would see one small piece of the image. The brightness of the image that a particular photoreceptor could see would determine the strength of the signal that it would send to the associated “dendrite.” Each dendrite had an associated weight in the form of a potentiometer. Once enough signal came in, it would pass the signal into the main body of the “nucleus” of the “cell.” Once enough of those signals from all the potentiometers passed a certain threshold, the perceptron would fire down its axon, indicating a positive match on the image it was presented with. If it didn’t fire for a given image, that was a negative classification match. Think “hot dog, not hot dog” or “iris setosa, not iris setosa.” So far there has been a lot of hand waving about biology and electric current and photo-receptors. Let’s pause for a second and peel out the most important parts of this concept. Basically, you’d like to take an example from a dataset, show it to an algorithm, and have the algorithm say yes or no. That’s all you’re doing so far. The first piece you need is a way to determine the features of the sample. Choosing appropriate features turns out to be a surprisingly challenging part of machine learning. In “normal” machine learning problems, like predicting home prices, your features might be square footage, last sold price, and ZIP code. Or perhaps you’d like to predict the species of a certain flower using the Iris dataset.2 In that case your features would be petal length, petal width, sepal length, and sepal width. In Rosenblatt’s experiment, the features were the intensity values of each pixel (subsections of the image), one pixel per photo receptor."
print(input_sentence_2)
print("Char count", len(input_sentence_2))
print("Word Count:", len(input_sentence_2.split(" ")))
Char count 2518
Word Count: 426
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
def get_embedding(in_list):
    # Tokenize each string, adding the special [CLS] and [SEP] tokens
    tokenized = [tokenizer.encode(x, add_special_tokens=True) for x in in_list]
    # Pad every token list to the length of the longest one
    max_len = 0
    for i in tokenized:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized])
    # Build an attention mask so BERT ignores the padding
    attention_mask = np.where(padded != 0, 1, 0)
    input_ids = torch.LongTensor(padded)
    attention_mask = torch.tensor(attention_mask)
    with torch.no_grad():
        last_hidden_states = model(input_ids=input_ids, attention_mask=attention_mask)
    # Use the [CLS] token's hidden state as the sentence embedding
    features = last_hidden_states[0][:,0,:].numpy()
    return features
string_embeddings = get_embedding([input_sentence_1, input_sentence_2])
Token indices sequence length is longer than the specified maximum sequence length for this model (560 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 string_embeddings = get_embedding([input_sentence_1, input_sentence_2])
Input In [11], in get_embedding(in_list)
14 attention_mask = torch.tensor(attention_mask)
16 with torch.no_grad():
---> 17 last_hidden_states = model(input_ids = input_ids, attention_mask = attention_mask)
19 features = last_hidden_states[0][:,0,:].numpy()
20 return features
File E:\programfiles\Anaconda3\envs\transformers\lib\site-packages\torch\nn\modules\module.py:1102, in Module._call_impl(self, *input, **kwargs)
1098 # If we don't have any hooks, we want to skip the rest of the logic in
1099 # this function, and just call forward.
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
File E:\programfiles\Anaconda3\envs\transformers\lib\site-packages\transformers\models\bert\modeling_bert.py:983, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
981 if hasattr(self.embeddings, "token_type_ids"):
982 buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
--> 983 buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
984 token_type_ids = buffered_token_type_ids_expanded
985 else:
RuntimeError: The expanded size of the tensor (560) must match the existing size (512) at non-singleton dimension 1. Target sizes: [2, 560]. Tensor sizes: [1, 512]
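The traceback above is just BERT's sequence-length limit at work: bert-base-uncased accepts at most 512 tokens, and the longer of the two passages tokenizes to 560. If truncating the text were acceptable, one way around it would be to cap the encoding length; the following is only a sketch of that idea (a hypothetical get_embedding_truncated helper), not what the rest of this post does:

def get_embedding_truncated(in_list, max_length=512):
    # Same as get_embedding, but truncate anything longer than BERT's 512-token limit
    tokenized = [tokenizer.encode(x, add_special_tokens=True,
                                  max_length=max_length, truncation=True)
                 for x in in_list]
    max_len = max(len(i) for i in tokenized)
    padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized])
    attention_mask = torch.tensor(np.where(padded != 0, 1, 0))
    input_ids = torch.LongTensor(padded)
    with torch.no_grad():
        last_hidden_states = model(input_ids=input_ids, attention_mask=attention_mask)
    return last_hidden_states[0][:, 0, :].numpy()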
Tags: Machine Learning, Natural Language Processing, Python, Technology
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import cosine_similarity
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')
from joblib import load, dump
import json
import re
print(ppb.__version__)
'3.0.1'
Loading the Pre-trained BERT model
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
When run for the first time, the above statements download a model about 440 MB in size.
Word Ambiguities
def get_embedding(in_list):
    tokenized = [tokenizer.encode(x, add_special_tokens=True) for x in in_list]
    max_len = 0
    for i in tokenized:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized])
    attention_mask = np.where(padded != 0, 1, 0)
    input_ids = torch.LongTensor(padded)
    attention_mask = torch.tensor(attention_mask)
    with torch.no_grad():
        last_hidden_states = model(input_ids = input_ids, attention_mask = attention_mask)
    features = last_hidden_states[0][:,0,:].numpy()
    return features
python_strings = [
'I love coding in Python language.',
'Python is more readable than Java.',
'Pythons are famous for their very long body.',
'Python is famous for its very long body.',
'All six continents have a python species.',
'Python is a programming language.',
'Python is a reptile.',
'The python ate a mouse.',
'python ate a mouse'
]
string_embeddings = get_embedding(python_strings)
print(string_embeddings.shape)
(9, 768)
csm = cosine_similarity(X=string_embeddings, Y=None, dense_output=True)
print(csm.round(2))
In the matrix below, if we ignore the diagonal (that is, the similarity of a sentence to itself), we can see which sentence is closest to which.
[[1. 0.83 0.8 0.79 0.8 0.84 0.84 0.81 0.81]
[0.83 1. 0.79 0.76 0.8 0.87 0.79 0.8 0.79]
[0.8 0.79 1. 0.96 0.86 0.77 0.88 0.77 0.78]
[0.79 0.76 0.96 1. 0.82 0.77 0.9 0.75 0.77]
[0.8 0.8 0.86 0.82 1. 0.78 0.85 0.8 0.8 ]
[0.84 0.87 0.77 0.77 0.78 1. 0.81 0.76 0.78]
[0.84 0.79 0.88 0.9 0.85 0.81 1. 0.81 0.86]
[0.81 0.8 0.77 0.75 0.8 0.76 0.81 1. 0.9 ]
[0.81 0.79 0.78 0.77 0.8 0.78 0.86 0.9 1. ]]

for i in range(len(csm)):
    ord_indx = np.argsort(csm[i])[::-1]
    print(python_strings[ord_indx[0]])
    print([python_strings[j] for j in ord_indx[1:]])
    print()

I love coding in Python language.
['Python is a reptile.', 'Python is a programming language.', 'Python is more readable than Java.', 'python ate a mouse', 'The python ate a mouse.', 'All six continents have a python species.', 'Pythons are famous for their very long body.', 'Python is famous for its very long body.']
Python is more readable than Java.
['Python is a programming language.', 'I love coding in Python language.', 'All six continents have a python species.', 'The python ate a mouse.', 'Python is a reptile.', 'python ate a mouse', 'Pythons are famous for their very long body.', 'Python is famous for its very long body.']
Pythons are famous for their very long body.
['Python is famous for its very long body.', 'Python is a reptile.', 'All six continents have a python species.', 'I love coding in Python language.', 'Python is more readable than Java.', 'python ate a mouse', 'Python is a programming language.', 'The python ate a mouse.']
Python is famous for its very long body.
['Pythons are famous for their very long body.', 'Python is a reptile.', 'All six continents have a python species.', 'I love coding in Python language.', 'python ate a mouse', 'Python is a programming language.', 'Python is more readable than Java.', 'The python ate a mouse.']
All six continents have a python species.
['Pythons are famous for their very long body.', 'Python is a reptile.', 'Python is famous for its very long body.', 'I love coding in Python language.', 'Python is more readable than Java.', 'The python ate a mouse.', 'python ate a mouse', 'Python is a programming language.']
Python is a programming language.
['Python is more readable than Java.', 'I love coding in Python language.', 'Python is a reptile.', 'All six continents have a python species.', 'python ate a mouse', 'Pythons are famous for their very long body.', 'Python is famous for its very long body.', 'The python ate a mouse.']
Python is a reptile.
['Python is famous for its very long body.', 'Pythons are famous for their very long body.', 'python ate a mouse', 'All six continents have a python species.', 'I love coding in Python language.', 'Python is a programming language.', 'The python ate a mouse.', 'Python is more readable than Java.']
The python ate a mouse.
['python ate a mouse', 'I love coding in Python language.', 'Python is a reptile.', 'All six continents have a python species.', 'Python is more readable than Java.', 'Pythons are famous for their very long body.', 'Python is a programming language.', 'Python is famous for its very long body.']
python ate a mouse
['The python ate a mouse.', 'Python is a reptile.', 'I love coding in Python language.', 'All six continents have a python species.', 'Python is more readable than Java.', 'Python is a programming language.', 'Pythons are famous for their very long body.', 'Python is famous for its very long body.']

A few observations:
1. "python ate a mouse" is more closer to "Python is a reptile." than "The python ate a mouse."
For closeness of these sentences to "Python is a reptile" shows "python ate a mouse" at number 3 while "The python ate a mouse" appears at number 7.
2. The model we are using is "uncased" so capitalization does not matter.
3. Sentences about Python language are similar to each other, and sentences about Python reptile are similar to each other.
4. Word "python" or "Python" alone is closest to 'I love coding in Python language.' then to 'Python is a reptile.', see code snippet below.
from scipy.spatial import distance
python_embedding = get_embedding(['python'])  # get_embedding expects a list of strings
csm = [1 - distance.cosine(u = python_embedding[0], v = i) for i in string_embeddings]
print([python_strings[j] for j in np.argsort(csm)[::-1]])

['I love coding in Python language.',
'Python is a reptile.',
'python ate a mouse',
'The python ate a mouse.',
'All six continents have a python species.',
'Python is a programming language.',
'Python is more readable than Java.',
'Python is famous for its very long body.',
'Pythons are famous for their very long body.']
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import cosine_similarity
from joblib import load, dump
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')
ppb.__version__
'3.0.1'
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
The above code downloads three files when it runs for the first time:
(Three download progress bars appear in the notebook: a small config file of 433 bytes, the vocabulary of roughly 230 KB, and the model weights.)
The third file is the model, about 440 MB in size.
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

sentences = ['First do it', 'then do it right', 'then do it better']
sentences_df = pd.DataFrame({"sentences": sentences})
tokenized = sentences_df['sentences'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

Padding
After tokenization, `tokenized` is a list of sentences -- each sentence is represented as a list of tokens. We want BERT to process our examples all at once (as one batch); it's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:
attention_mask = np.where(padded != 0, 1, 0)
%%time
input_ids = torch.LongTensor(padded)
attention_mask = torch.tensor(attention_mask)
with torch.no_grad():
    last_hidden_states = model(input_ids = input_ids, attention_mask = attention_mask)
features = last_hidden_states[0][:,0,:].numpy()
Let's slice out only the part of the output that we need: the output corresponding to the first token of each sentence. The way BERT does sentence classification is that it adds a token called `[CLS]` (for classification) at the beginning of every sentence; the last token is `[SEP]`. The output corresponding to the `[CLS]` token can be thought of as an embedding for the entire sentence.
We'll save those outputs in the `features` variable, as they'll serve as the features for our logistic regression model.
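The classifier itself isn't shown in this post, but as a rough, purely illustrative sketch, the `features` matrix could feed the LogisticRegression imported above like this (the labels below are made up for the three example sentences, not real data):

labels = [0, 1, 1]  # hypothetical labels, only to make the sketch runnable
lr_clf = LogisticRegression()
lr_clf.fit(features, labels)
print(lr_clf.predict(features))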
Testing
Word Analogies
def get_embedding(in_list):
    tokenized = [tokenizer.encode(x, add_special_tokens=True) for x in in_list]
    max_len = 0
    for i in tokenized:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized])
    attention_mask = np.where(padded != 0, 1, 0)
    input_ids = torch.LongTensor(padded)
    attention_mask = torch.tensor(attention_mask)
    with torch.no_grad():
        last_hidden_states = model(input_ids = input_ids, attention_mask = attention_mask)
    features = last_hidden_states[0][:,0,:].numpy()
    return features
analogies = [['king', 'man', 'queen', 'woman'],
['king', 'prince', 'queen', 'princess'],
['miami', 'florida', 'dallas', 'texas'],
['einstein', 'scientist', 'picasso', 'painter'],
['japan', 'sushi', 'germany', 'bratwurst'],
['man', 'woman', 'he', 'she'],
['man', 'woman', 'uncle', 'aunt'],
['man', 'woman', 'brother', 'sister'],
['man', 'woman', 'husband', 'wife'],
['man', 'woman', 'actor', 'actress'],
['man', 'woman', 'father', 'mother'],
['heir', 'heiress', 'prince', 'princess'],
['nephew', 'niece', 'uncle', 'aunt'],
['france', 'paris', 'japan', 'tokyo'],
['france', 'paris', 'china', 'beijing'],
['february', 'january', 'december', 'november'],
['france', 'paris', 'germany', 'berlin'],
['week', 'day', 'year', 'month'],
['week', 'day', 'hour', 'minute'],
['france', 'paris', 'italy', 'rome'],
['paris', 'france', 'rome', 'italy'],
['france', 'french', 'england', 'english'],
['japan', 'japanese', 'china', 'chinese'],
['china', 'chinese', 'america', 'american'],
['japan', 'japanese', 'italy', 'italian'],
['japan', 'japanese', 'australia', 'australian'],
['walk', 'walking', 'swim', 'swimming']]
for i in analogies:
    king = get_embedding([i[0]])
    queen = get_embedding([i[2]])
    man = get_embedding([i[1]])
    woman = get_embedding([i[3]])
    q = king - man + woman
    print(i[0], '-', i[1], '+', i[3], 'and', i[2], cosine_similarity(queen, q))

king - man + woman and queen [[0.95728725]]
king - prince + princess and queen [[0.9805071]]
miami - florida + texas and dallas [[0.93608725]]
einstein - scientist + painter and picasso [[0.9021458]]
japan - sushi + bratwurst and germany [[0.8383053]]
man - woman + she and he [[0.97603536]]
man - woman + aunt and uncle [[0.9624729]]
man - woman + sister and brother [[0.970188]]
man - woman + wife and husband [[0.9585104]]
man - woman + actress and actor [[0.95233154]]
man - woman + mother and father [[0.9783108]]
heir - heiress + princess and prince [[0.9558885]]
nephew - niece + aunt and uncle [[0.9844531]]
france - paris + tokyo and japan [[0.95287836]]
france - paris + beijing and china [[0.94868445]]
february - january + november and december [[0.89765096]]
france - paris + berlin and germany [[0.9586985]]
week - day + month and year [[0.9131064]]
week - day + minute and hour [[0.9280644]]
france - paris + rome and italy [[0.92742187]]
paris - france + italy and rome [[0.9252609]]
france - french + english and england [[0.9143828]]
japan - japanese + chinese and china [[0.9681916]]
china - chinese + american and america [[0.9371264]]
japan - japanese + italian and italy [[0.97318065]]
japan - japanese + australian and australia [[0.96878356]]
walk - walking + swimming and swim [[0.90309924]]

Nearest Words
We have retrieved nouns from the 'BERT Base Uncased' vocabulary. There are 15269 nouns in this vocabulary.
You can download "vocab.txt" from here: GitHub
We used spaCy to identify the nouns; a rough sketch of how that can be done is shown below.
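The exact extraction script isn't included in this post. The following is only a sketch of how nouns might be filtered out of BERT's vocab.txt with spaCy; the file path and the filtering rules are assumptions, not the script actually used:

import spacy
from joblib import dump

nlp = spacy.load('en_core_web_sm')

# Read BERT's vocabulary, skipping subword pieces like '##ing' and special tokens like [CLS]
with open('vocab.txt', encoding='utf-8') as f:
    vocab = [w.strip() for w in f if w.strip().isalpha()]

# Keep only the tokens that spaCy tags as nouns
nouns = [w for w in vocab if nlp(w)[0].pos_ == 'NOUN']

dump(nouns, 'list_of_nouns_from_bert_base_uncased_vocab.joblib')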
nouns = load('files_5_p3/list_of_nouns_from_bert_base_uncased_vocab.joblib')
%%time
noun_embeddings = [get_embedding([i]) for i in nouns]
dump(noun_embeddings, 'files_2_p2/list_of_noun_embeddings.joblib')

Wall time: 20min 8s

noun_embeddings = load('files_2_p2/list_of_noun_embeddings.joblib')
noun_embeddings = [n[0] for n in noun_embeddings]
from scipy.spatial import distance
def get_nn_of_words(in_list):
    for k in in_list:
        input_word = k
        if k not in nouns:
            continue
        p = noun_embeddings[nouns.index(input_word)]
        closest_embedding_indices = distance.cdist(np.array(p).reshape(1, -1),
            np.array(noun_embeddings).reshape(len(noun_embeddings), -1))[0].argsort()[1:11]
        closest_nouns = [nouns[i] for i in closest_embedding_indices]
        print("For", k, closest_nouns)
get_nn_of_words(set(pd.core.common.flatten(analogies)))

For germany ['austria', 'bavaria', 'berlin', 'luxembourg', 'europe', 'japan', 'britain', 'wurttemberg', 'dresden', 'sweden']
For niece ['nephew', 'granddaughter', 'fiancee', 'daughter', 'grandparents', 'grandson', 'stepmother', 'aunt', 'cousins', 'wife']
For aunt ['grandmother', 'grandfather', 'uncle', 'cousin', 'sister', 'mother', 'miriam', 'vicki', 'uncles', 'cousins']
For february ['january', 'april', 'june', 'november', 'march', 'july', 'august', 'december', 'october', 'spring']
For england ['britain', 'wales', 'australia', 'ireland', 'barbados', 'stoke', 'brentford', 'lancashire', 'cuba', 'luxembourg']
For america ['planet', 'dakota', 'hawaii', 'britain', 'hemisphere', 'coral', 'virginia', 'nina', 'columbia', 'victoria']
For italian ['italy', 'russian', 'catalan', 'portuguese', 'french', 'azerbaijani', 'indonesian', 'austrian', 'japanese', 'irish']
For uncle ['aunt', 'cousin', 'grandfather', 'brother', 'grandmother', 'uncles', 'doc', 'bobby', 'mother', 'kid']
For miami ['tampa', 'seattle', 'vancouver', 'portland', 'arizona', 'vegas', 'sydney', 'florida', 'houston', 'orlando']
For italy ['austria', 'germany', 'luxembourg', 'europe', 'rico', 'japan', 'africa', 'indonesia', 'florence', 'tuscany']
For woman ['teenager', 'girl', 'spouse', 'brother', 'partner', 'daughter', 'mother', 'consort', 'wife', 'stallion']
For english ['afrikaans', 'hindi', 'latin', 'portuguese', 'sanskrit', 'french', 'italian', 'hebrew', 'azerbaijani', 'lithuanian']
For king ['duke', 'prince', 'queen', 'princess', 'throne', 'consort', 'deity', 'queens', 'abbot', 'lords']
For dallas ['jasmine', 'travis', 'savannah', 'eden', 'lucas', 'mia', 'lexi', 'jack', 'hunter', 'penny']
For mother ['grandmother', 'brother', 'mothers', 'parents', 'daughter', 'mom', 'father', 'grandfather', 'sister', 'mary']
For heiress ['landowner', 'heir', 'daughters', 'heirs', 'daughter', 'granddaughter', 'siblings', 'grandson', 'childless', 'clerk']
For japanese ['korean', 'japan', 'thai', 'russian', 'hawaiian', 'malaysian', 'indonesian', 'khmer', 'taiwanese', 'bengali']
For heir ['heirs', 'consort', 'spouse', 'prince', 'womb', 'attendants', 'fulfillment', 'duke', 'daughter', 'keeper']
For january ['november', 'april', 'august', 'february', 'december', 'summer', 'july', 'spring', 'october', 'june']
For brother ['sister', 'grandfather', 'cousin', 'grandmother', 'mother', 'daughter', 'partner', 'bowl', 'mentor', 'beau']
For wife ['husbands', 'daughter', 'spouse', 'husband', 'woman', 'girlfriend', 'household', 'supporter', 'boyfriend', 'granddaughter']
For minute ['moments', 'hour', 'dozen', 'mile', 'cycles', 'millennia', 'moment', 'sizes', 'clocks', 'twenties']
For picasso ['goldsmith', 'michelangelo', 'fresco', 'carousel', 'chopin', 'verdi', 'hercules', 'palette', 'canvas', 'britten']
For week ['month', 'series', 'replacement', 'primetime', 'position', 'highlight', 'zone', 'slot', 'office', 'showcase']
For japan ['america', 'ceylon', 'hawaii', 'malaysia', 'australia', 'taiwan', 'osaka', 'fukuoka', 'indonesia', 'korea']
For einstein ['aristotle', 'nobel', 'beckett', 'wiener', 'relativity', 'abel', 'strauss', 'skinner', 'clifford', 'bernstein']
For australian ['australia', 'canadian', 'canada', 'fremantle', 'oceania', 'america', 'brazil', 'nepal', 'jakarta', 'hawaii']
For painter ['musician', 'painting', 'paintings', 'designer', 'dancer', 'filmmaker', 'illustrator', 'teacher', 'soldier', 'boxer']
For man ['lump', 'woman', 'boss', 'bear', 'scratch', 'intruder', 'alpha', 'rat', 'touch', 'condo']
For florida ['maine', 'louisiana', 'arizona', 'virginia', 'charleston', 'indiana', 'tampa', 'colorado', 'alabama', 'connecticut']
For year ['season', 'month', 'eligibility', 'seasons', 'name', 'calendar', 'date', 'colour', 'highlight', 'divisional']
For tokyo ['osaka', 'kyoto', 'fukuoka', 'nagoya', 'seoul', 'kobe', 'moscow', 'honolulu', 'japan', 'nippon']
For november ['october', 'january', 'december', 'winter', 'spring', 'august', 'april', 'monday', 'halloween', 'wednesday']
For rome ['titan', 'vulcan', 'mesopotamia', 'damascus', 'alexandria', 'egypt', 'baghdad', 'orion', 'denver', 'nevada']
For china ['taiwan', 'fujian', 'indonesia', 'japan', 'asia', 'sichuan', 'malawi', 'lebanon', 'russia', 'zimbabwe']
For hour ['minute', 'hours', 'moments', 'dozen', 'weeks', 'inning', 'day', 'cycles', 'midnight', 'minutes']
For texas ['oregon', 'alabama', 'florida', 'colorado', 'ohio', 'indiana', 'georgia', 'houston', 'arkansas', 'arizona']
For sister ['brother', 'daughter', 'mother', 'grandmother', 'grandfather', 'cousin', 'aunt', 'padre', 'sisters', 'blossom']
For berlin ['vienna', 'stuttgart', 'hannover', 'hamburg', 'bonn', 'dresden', 'dusseldorf', 'gottingen', 'mannheim', 'rosenthal']
For actress ['actor', 'musician', 'singer', 'novelist', 'teacher', 'dancer', 'magician', 'poet', 'painter', 'actors']
For beijing ['tianjin', 'guangzhou', 'singapore', 'honolulu', 'taipei', 'ankara', 'osaka', 'manila', 'durban', 'jakarta']
For princess ['prince', 'madam', 'papa', 'kira', 'sweetie', 'witch', 'ruby', 'wedding', 'tasha', 'marta']
For nephew ['niece', 'grandson', 'granddaughter', 'daughter', 'fiancee', 'girlfriend', 'son', 'brother', 'sidekick', 'wife']
For month ['week', 'summers', 'evening', 'calendar', 'decade', 'semester', 'term', 'position', 'seasonal', 'occasion']
For swimming ['diving', 'weightlifting', 'judo', 'badminton', 'tennis', 'gymnastics', 'archery', 'swimmers', 'breaststroke', 'hockey']
For queen ['princess', 'queens', 'maid', 'king', 'prince', 'duke', 'consort', 'crown', 'stallion', 'madam']
For actor ['actress', 'actors', 'poet', 'television', 'singer', 'novelist', 'comedian', 'musician', 'screenwriter', 'painter']
For december ['november', 'january', 'october', 'april', 'march', 'june', 'july', 'september', 'august', 'autumn']
For american ['british', 'americans', 'america', 'britain', 'african', 'haitian', 'kenyan', 'bangladeshi', 'resident', 'canadian']
For french ['italian', 'portuguese', 'dutch', 'english', 'spanish', 'afrikaans', 'filipino', 'romanian', 'france', 'greek']
For prince ['princess', 'duke', 'consort', 'king', 'benedict', 'commander', 'papa', 'dean', 'throne', 'kevin']
For scientist ['physician', 'archaeologist', 'golfer', 'inventor', 'chef', 'consultant', 'investigator', 'teenager', 'astronaut', 'technician']
For paris ['bonn', 'laval', 'provence', 'dublin', 'geneva', 'eugene', 'michel', 'koln', 'benoit', 'ville']
For father ['mother', 'fathers', 'brother', 'daddy', 'uncles', 'son', 'sister', 'homeland', 'dad', 'protector']
For husband ['wife', 'spouse', 'lover', 'boyfriend', 'husbands', 'woman', 'daughter', 'son', 'fiance', 'mother']
For france ['martinique', 'luxembourg', 'marseille', 'geneva', 'bordeaux', 'lyon', 'paris', 'clermont', 'alsace', 'switzerland']
For australia ['australian', 'america', 'canada', 'tasmania', 'sydney', 'britain', 'japan', 'fremantle', 'malaysia', 'hawaii']
For day ['evening', 'nightfall', 'midnight', 'night', 'dawn', 'morning', 'moments', 'afternoon', 'epoch', 'sunrise']

1D Spectrum

for i in analogies:
    king = get_embedding([i[0]])
    queen = get_embedding([i[2]])
    man = get_embedding([i[1]])
    woman = get_embedding([i[3]])
    q = king - man + woman
    print(i[0], i[1], i[2], i[3], cosine_similarity(queen, q))
    for j in i:
        print(j, ":")
        np.random.seed(1)
        plt.rcParams["figure.figsize"] = 8, 2
        x = np.linspace(0, 768, num=768)
        y = get_embedding([j])
        fig, (ax, ax2) = plt.subplots(nrows=2, sharex=True)
        extent = [x[0]-(x[1]-x[0])/2., x[-1]+(x[1]-x[0])/2., 0, 1]
        ax.imshow(y, cmap="plasma", aspect="auto", extent=extent)
        ax.set_yticks([])
        ax.set_xlim(extent[0], extent[1])
        ax2.plot(x, y.ravel())
        plt.tight_layout()
        plt.show()