Wednesday, October 7, 2020

Word Embeddings using BERT and testing using Word Analogies, Nearest Words, 1D Spectrum


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import cosine_similarity
from joblib import load, dump
import torch
import transformers as ppb

import warnings
warnings.filterwarnings('ignore')

ppb.__version__ 
'3.0.1'

model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased') 

Load pretrained model/tokenizer

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights) 

The above code downloads three files when it runs for the first time:

Downloading: 433 B
Downloading: 231,508 B
Downloading: 440,473,133 B

The third file is the model itself, about 440 MB in size.
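Optionally (not part of the original snippet), the model can be switched to evaluation mode so that dropout layers are disabled during inference:

model.eval()  # we only run inference in this post, so dropout is not needed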

Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with. 

sentences = ['First do it', 'then do it right', 'then do it better']
sentences_df = pd.DataFrame({"sentences": sentences})
tokenized = sentences_df['sentences'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True))) 
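For reference, here is a minimal way to inspect what the tokenizer produces for a single sentence (a sketch added for illustration; the exact ids come from the bert-base-uncased vocabulary, but 101 and 102 are the [CLS] and [SEP] ids):

ids = tokenizer.encode('First do it', add_special_tokens=True)
print(ids)                                    # begins with 101 ([CLS]) and ends with 102 ([SEP])
print(tokenizer.convert_ids_to_tokens(ids))   # ['[CLS]', 'first', 'do', 'it', '[SEP]']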

Padding
After tokenization, `tokenized` is a list of sentences -- each sentence is represented as a list of tokens. We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values]) 

Masking 
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

attention_mask = np.where(padded != 0, 1, 0) 
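A quick sanity check (illustrative, assuming the three example sentences above): every row of `padded` has the length of the longest tokenized sentence, and `attention_mask` is 1 over real tokens and 0 over padding.

print(padded.shape)          # (3, max_len)
print(attention_mask.shape)  # (3, max_len)
print(attention_mask)        # rows of 1s followed by 0s for the shorter sentences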

%%time
input_ids = torch.LongTensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids = input_ids, attention_mask = attention_mask) 
	
features = last_hidden_states[0][:,0,:].numpy() 
	
Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. The way BERT does sentence classification is that it adds a token called `[CLS]` (for classification) at the beginning of every sentence, and a `[SEP]` (separator) token at the end. The output corresponding to the `[CLS]` token can be thought of as an embedding for the entire sentence.

We've saved those in the `features` variable above; they would serve as the features for a logistic regression model.
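As a sanity check (a sketch, not in the original notebook), the shapes confirm that we keep one 768-dimensional vector per sentence:

print(last_hidden_states[0].shape)  # torch.Size([3, max_len, 768]): one vector per token
print(features.shape)               # (3, 768): one [CLS] vector per sentence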

Testing

Word Analogies

def get_embedding(in_list):
    tokenized = [tokenizer.encode(x, add_special_tokens=True) for x in in_list]
    max_len = 0
    for i in tokenized:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized])
    attention_mask = np.where(padded != 0, 1, 0)
    input_ids = torch.LongTensor(padded)
    attention_mask = torch.tensor(attention_mask)
    with torch.no_grad():
        last_hidden_states = model(input_ids = input_ids, attention_mask = attention_mask)
    features = last_hidden_states[0][:,0,:].numpy()
    return features

analogies = [['king', 'man', 'queen', 'woman'],
    ['king', 'prince', 'queen', 'princess'],
    ['miami', 'florida', 'dallas', 'texas'],
    ['einstein', 'scientist', 'picasso', 'painter'],
    ['japan', 'sushi', 'germany', 'bratwurst'],
    ['man', 'woman', 'he', 'she'],
    ['man', 'woman', 'uncle', 'aunt'],
    ['man', 'woman', 'brother', 'sister'],
    ['man', 'woman', 'husband', 'wife'],
    ['man', 'woman', 'actor', 'actress'],
    ['man', 'woman', 'father', 'mother'],
    ['heir', 'heiress', 'prince', 'princess'],
    ['nephew', 'niece', 'uncle', 'aunt'],
    ['france', 'paris', 'japan', 'tokyo'],
    ['france', 'paris', 'china', 'beijing'],
    ['february', 'january', 'december', 'november'],
    ['france', 'paris', 'germany', 'berlin'],
    ['week', 'day', 'year', 'month'],
    ['week', 'day', 'hour', 'minute'],
    ['france', 'paris', 'italy', 'rome'],
    ['paris', 'france', 'rome', 'italy'],
    ['france', 'french', 'england', 'english'],
    ['japan', 'japanese', 'china', 'chinese'],
    ['china', 'chinese', 'america', 'american'],
    ['japan', 'japanese', 'italy', 'italian'],
    ['japan', 'japanese', 'australia', 'australian'],
    ['walk', 'walking', 'swim', 'swimming']]

for i in analogies:
    king = get_embedding([i[0]])
    queen = get_embedding([i[2]])
    man = get_embedding([i[1]])
    woman = get_embedding([i[3]])
    q = king - man + woman
    print(i[0], '-', i[1], '+', i[3], 'and', i[2], cosine_similarity(queen, q))

king - man + woman and queen [[0.95728725]]
king - prince + princess and queen [[0.9805071]]
miami - florida + texas and dallas [[0.93608725]]
einstein - scientist + painter and picasso [[0.9021458]]
japan - sushi + bratwurst and germany [[0.8383053]]
man - woman + she and he [[0.97603536]]
man - woman + aunt and uncle [[0.9624729]]
man - woman + sister and brother [[0.970188]]
man - woman + wife and husband [[0.9585104]]
man - woman + actress and actor [[0.95233154]]
man - woman + mother and father [[0.9783108]]
heir - heiress + princess and prince [[0.9558885]]
nephew - niece + aunt and uncle [[0.9844531]]
france - paris + tokyo and japan [[0.95287836]]
france - paris + beijing and china [[0.94868445]]
february - january + november and december [[0.89765096]]
france - paris + berlin and germany [[0.9586985]]
week - day + month and year [[0.9131064]]
week - day + minute and hour [[0.9280644]]
france - paris + rome and italy [[0.92742187]]
paris - france + italy and rome [[0.9252609]]
france - french + english and england [[0.9143828]]
japan - japanese + chinese and china [[0.9681916]]
china - chinese + american and america [[0.9371264]]
japan - japanese + italian and italy [[0.97318065]]
japan - japanese + australian and australia [[0.96878356]]
walk - walking + swimming and swim [[0.90309924]]

Nearest Words

We have retrieved nouns from the 'BERT Base Uncased' vocabulary. There are 15269 nouns in this vocabulary. You can download "vocab.txt" from here: GitHub
We used spaCy to identify the nouns.
nouns = load('files_5_p3/list_of_nouns_from_bert_base_uncased_vocab.joblib')

%%time
noun_embeddings = [get_embedding([i]) for i in nouns]
dump(noun_embeddings, 'files_2_p2/list_of_noun_embeddings.joblib')
Wall time: 20min 8s

noun_embeddings = load('files_2_p2/list_of_noun_embeddings.joblib')
noun_embeddings = [n[0] for n in noun_embeddings]

from scipy.spatial import distance

def get_nn_of_words(in_list):
    for k in in_list:
        input_word = k
        if k not in nouns:
            continue
        p = noun_embeddings[nouns.index(input_word)]
        closest_embedding_indices = distance.cdist(np.array(p).reshape(1, -1), np.array(noun_embeddings).reshape(len(noun_embeddings),-1))[0].argsort()[1:11]
        closest_nouns = [nouns[i] for i in closest_embedding_indices]
        print("For", k, closest_nouns)

get_nn_of_words(set(pd.core.common.flatten(analogies)))

For germany ['austria', 'bavaria', 'berlin', 'luxembourg', 'europe', 'japan', 'britain', 'wurttemberg', 'dresden', 'sweden']
For niece ['nephew', 'granddaughter', 'fiancee', 'daughter', 'grandparents', 'grandson', 'stepmother', 'aunt', 'cousins', 'wife']
For aunt ['grandmother', 'grandfather', 'uncle', 'cousin', 'sister', 'mother', 'miriam', 'vicki', 'uncles', 'cousins']
For february ['january', 'april', 'june', 'november', 'march', 'july', 'august', 'december', 'october', 'spring']
For england ['britain', 'wales', 'australia', 'ireland', 'barbados', 'stoke', 'brentford', 'lancashire', 'cuba', 'luxembourg']
For america ['planet', 'dakota', 'hawaii', 'britain', 'hemisphere', 'coral', 'virginia', 'nina', 'columbia', 'victoria']
For italian ['italy', 'russian', 'catalan', 'portuguese', 'french', 'azerbaijani', 'indonesian', 'austrian', 'japanese', 'irish']
For uncle ['aunt', 'cousin', 'grandfather', 'brother', 'grandmother', 'uncles', 'doc', 'bobby', 'mother', 'kid']
For miami ['tampa', 'seattle', 'vancouver', 'portland', 'arizona', 'vegas', 'sydney', 'florida', 'houston', 'orlando']
For italy ['austria', 'germany', 'luxembourg', 'europe', 'rico', 'japan', 'africa', 'indonesia', 'florence', 'tuscany']
For woman ['teenager', 'girl', 'spouse', 'brother', 'partner', 'daughter', 'mother', 'consort', 'wife', 'stallion']
For english ['afrikaans', 'hindi', 'latin', 'portuguese', 'sanskrit', 'french', 'italian', 'hebrew', 'azerbaijani', 'lithuanian']
For king ['duke', 'prince', 'queen', 'princess', 'throne', 'consort', 'deity', 'queens', 'abbot', 'lords']
For dallas ['jasmine', 'travis', 'savannah', 'eden', 'lucas', 'mia', 'lexi', 'jack', 'hunter', 'penny']
For mother ['grandmother', 'brother', 'mothers', 'parents', 'daughter', 'mom', 'father', 'grandfather', 'sister', 'mary']
For heiress ['landowner', 'heir', 'daughters', 'heirs', 'daughter', 'granddaughter', 'siblings', 'grandson', 'childless', 'clerk']
For japanese ['korean', 'japan', 'thai', 'russian', 'hawaiian', 'malaysian', 'indonesian', 'khmer', 'taiwanese', 'bengali']
For heir ['heirs', 'consort', 'spouse', 'prince', 'womb', 'attendants', 'fulfillment', 'duke', 'daughter', 'keeper']
For january ['november', 'april', 'august', 'february', 'december', 'summer', 'july', 'spring', 'october', 'june']
For brother ['sister', 'grandfather', 'cousin', 'grandmother', 'mother', 'daughter', 'partner', 'bowl', 'mentor', 'beau']
For wife ['husbands', 'daughter', 'spouse', 'husband', 'woman', 'girlfriend', 'household', 'supporter', 'boyfriend', 'granddaughter']
For minute ['moments', 'hour', 'dozen', 'mile', 'cycles', 'millennia', 'moment', 'sizes', 'clocks', 'twenties']
For picasso ['goldsmith', 'michelangelo', 'fresco', 'carousel', 'chopin', 'verdi', 'hercules', 'palette', 'canvas', 'britten']
For week ['month', 'series', 'replacement', 'primetime', 'position', 'highlight', 'zone', 'slot', 'office', 'showcase']
For japan ['america', 'ceylon', 'hawaii', 'malaysia', 'australia', 'taiwan', 'osaka', 'fukuoka', 'indonesia', 'korea']
For einstein ['aristotle', 'nobel', 'beckett', 'wiener', 'relativity', 'abel', 'strauss', 'skinner', 'clifford', 'bernstein']
For australian ['australia', 'canadian', 'canada', 'fremantle', 'oceania', 'america', 'brazil', 'nepal', 'jakarta', 'hawaii']
For painter ['musician', 'painting', 'paintings', 'designer', 'dancer', 'filmmaker', 'illustrator', 'teacher', 'soldier', 'boxer']
For man ['lump', 'woman', 'boss', 'bear', 'scratch', 'intruder', 'alpha', 'rat', 'touch', 'condo']
For florida ['maine', 'louisiana', 'arizona', 'virginia', 'charleston', 'indiana', 'tampa', 'colorado', 'alabama', 'connecticut']
For year ['season', 'month', 'eligibility', 'seasons', 'name', 'calendar', 'date', 'colour', 'highlight', 'divisional']
For tokyo ['osaka', 'kyoto', 'fukuoka', 'nagoya', 'seoul', 'kobe', 'moscow', 'honolulu', 'japan', 'nippon']
For november ['october', 'january', 'december', 'winter', 'spring', 'august', 'april', 'monday', 'halloween', 'wednesday']
For rome ['titan', 'vulcan', 'mesopotamia', 'damascus', 'alexandria', 'egypt', 'baghdad', 'orion', 'denver', 'nevada']
For china ['taiwan', 'fujian', 'indonesia', 'japan', 'asia', 'sichuan', 'malawi', 'lebanon', 'russia', 'zimbabwe']
For hour ['minute', 'hours', 'moments', 'dozen', 'weeks', 'inning', 'day', 'cycles', 'midnight', 'minutes']
For texas ['oregon', 'alabama', 'florida', 'colorado', 'ohio', 'indiana', 'georgia', 'houston', 'arkansas', 'arizona']
For sister ['brother', 'daughter', 'mother', 'grandmother', 'grandfather', 'cousin', 'aunt', 'padre', 'sisters', 'blossom']
For berlin ['vienna', 'stuttgart', 'hannover', 'hamburg', 'bonn', 'dresden', 'dusseldorf', 'gottingen', 'mannheim', 'rosenthal']
For actress ['actor', 'musician', 'singer', 'novelist', 'teacher', 'dancer', 'magician', 'poet', 'painter', 'actors']
For beijing ['tianjin', 'guangzhou', 'singapore', 'honolulu', 'taipei', 'ankara', 'osaka', 'manila', 'durban', 'jakarta']
For princess ['prince', 'madam', 'papa', 'kira', 'sweetie', 'witch', 'ruby', 'wedding', 'tasha', 'marta']
For nephew ['niece', 'grandson', 'granddaughter', 'daughter', 'fiancee', 'girlfriend', 'son', 'brother', 'sidekick', 'wife']
For month ['week', 'summers', 'evening', 'calendar', 'decade', 'semester', 'term', 'position', 'seasonal', 'occasion']
For swimming ['diving', 'weightlifting', 'judo', 'badminton', 'tennis', 'gymnastics', 'archery', 'swimmers', 'breaststroke', 'hockey']
For queen ['princess', 'queens', 'maid', 'king', 'prince', 'duke', 'consort', 'crown', 'stallion', 'madam']
For actor ['actress', 'actors', 'poet', 'television', 'singer', 'novelist', 'comedian', 'musician', 'screenwriter', 'painter']
For december ['november', 'january', 'october', 'april', 'march', 'june', 'july', 'september', 'august', 'autumn']
For american ['british', 'americans', 'america', 'britain', 'african', 'haitian', 'kenyan', 'bangladeshi', 'resident', 'canadian']
For french ['italian', 'portuguese', 'dutch', 'english', 'spanish', 'afrikaans', 'filipino', 'romanian', 'france', 'greek']
For prince ['princess', 'duke', 'consort', 'king', 'benedict', 'commander', 'papa', 'dean', 'throne', 'kevin']
For scientist ['physician', 'archaeologist', 'golfer', 'inventor', 'chef', 'consultant', 'investigator', 'teenager', 'astronaut', 'technician']
For paris ['bonn', 'laval', 'provence', 'dublin', 'geneva', 'eugene', 'michel', 'koln', 'benoit', 'ville']
For father ['mother', 'fathers', 'brother', 'daddy', 'uncles', 'son', 'sister', 'homeland', 'dad', 'protector']
For husband ['wife', 'spouse', 'lover', 'boyfriend', 'husbands', 'woman', 'daughter', 'son', 'fiance', 'mother']
For france ['martinique', 'luxembourg', 'marseille', 'geneva', 'bordeaux', 'lyon', 'paris', 'clermont', 'alsace', 'switzerland']
For australia ['australian', 'america', 'canada', 'tasmania', 'sydney', 'britain', 'japan', 'fremantle', 'malaysia', 'hawaii']
For day ['evening', 'nightfall', 'midnight', 'night', 'dawn', 'morning', 'moments', 'afternoon', 'epoch', 'sunrise']

1D Spectrum

for i in analogies:
    king = get_embedding([i[0]])
    queen = get_embedding([i[2]])
    man = get_embedding([i[1]])
    woman = get_embedding([i[3]])
    q = king - man + woman
    print(i[0], i[1], i[2], i[3], cosine_similarity(queen, q))
    for j in i:
        print(j, ":")
        np.random.seed(1)
        plt.rcParams["figure.figsize"] = 8, 2
        x = np.linspace(0, 768, num=768)
        y = get_embedding([j])
        fig, (ax, ax2) = plt.subplots(nrows=2, sharex=True)
        extent = [x[0]-(x[1]-x[0])/2., x[-1]+(x[1]-x[0])/2., 0, 1]
        ax.imshow(y, cmap="plasma", aspect="auto", extent=extent)
        ax.set_yticks([])
        ax.set_xlim(extent[0], extent[1])
        ax2.plot(x, y.ravel())
        plt.tight_layout()
        plt.show()
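As an aside on the nearest-words lookup above (a sketch under the assumption that `nouns` and `noun_embeddings` are loaded as shown): `cdist` uses Euclidean distance by default, and the same ten-nearest-nouns query can be run with cosine distance instead.

emb = np.vstack(noun_embeddings)                    # shape (num_nouns, 768)
query = emb[nouns.index('king')].reshape(1, -1)
d = distance.cdist(query, emb, metric='cosine')[0]  # cosine distance to every noun
print([nouns[i] for i in d.argsort()[1:11]])        # ten nearest nouns to 'king'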

Tuesday, October 6, 2020

Word embeddings using BERT (Demo of BERT-as-a-Service)


For installation we use a YAML file.
File: D:\f25\files_2\bert_aas.yaml

name: bert_aas
channels:
  - conda-forge
  - defaults
dependencies:
  - termcolor
  - numpy
  - pyzmq
  - tensorflow==1.14.0
  - GPUtil
  - sphinx-argparse
  - pip:
    - bert-serving-server
    - bert-serving-client 

==========

(base) CMD>conda env list
# conda environments:
#
base                  *  D:\programfiles\Anaconda3
e20200909                D:\programfiles\Anaconda3\envs\e20200909
py38                     D:\programfiles\Anaconda3\envs\py38
... 

(base) CMD>jupyter kernelspec list
Available kernels:
  tf         C:\Users\aj\AppData\Roaming\jupyter\kernels\tf
  python3    D:\programfiles\Anaconda3\share\jupyter\kernels\python3
  py38       C:\ProgramData\jupyter\kernels\py38 
  
==========

==> WARNING: A newer version of conda exists.
  current version: 4.8.4
  latest version: 4.8.5

Please update conda by running

    $ conda update -n base -c defaults conda

==========

CMD> conda env create -f bert_aas_1.yaml

(base) CMD>conda activate bert_aas

==========

Checking TensorFlow installation

(bert_aas) CMD>pip freeze | find "tensorflow"
tensorflow @ file:///D:/bld/tensorflow_1594833538462/work/tensorflow-1.14.0-cp37-cp37m-win_amd64.whl
tensorflow-estimator==1.14.0

(bert_aas) CMD>python
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
...
D:\programfiles\Anaconda3\envs\bert_aas\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)]) 
>>> tf.__version__
'1.14.0' 
>>>
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2020-10-05 14:47:48.399686: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
>>> print(sess.run(hello))
b'Hello, TensorFlow!'
>>>   

==========

Setting up "model_dir" 

This directory is passed as an input argument to the "bert-serving-start" program from the Command Prompt.

Model can be downloaded from here: storage.googleapis.com: BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters

We also have these options for models (among others):

1. BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
2. BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
3. BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
4. BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
5. BERT-Base, Cased: 12-layer, 768-hidden, 12-heads , 110M parameters
6. BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters

Additional Note:

Fine-tuning with BERT 
Important: All results on the paper were fine-tuned on a single Cloud TPU, which has 64GB of RAM. It is currently not possible to re-produce most of the BERT-Large results on the paper using a GPU with 12GB - 16GB of RAM, because the maximum batch size that can fit in memory is too small. We are working on adding code to this repository which allows for much larger effective batch size on the GPU. See the section on out-of-memory issues for more details.

This code was tested with TensorFlow 1.11.0. It was tested with Python2 and Python3 (but more thoroughly with Python2, since this is what's used internally in Google).

The fine-tuning examples which use BERT-Base should be able to run on a GPU that has at least 12GB of RAM using the hyperparameters given.

Ref: bert#pre-trained-models

==========

Issues with newer versions of TensorFlow and Bert-as-a-service 

CMD> bert-serving-start -model_dir E:\e25/files_2/uncased_L-12_H-768_A-12 -num_worker=1

(bert_aas) E:\e25\files_2>bert-serving-start -model_dir E:\e25/files_2/uncased_L-12_H-768_A-12 -num_worker=1

2020-10-04 23:14:02.505211: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-10-04 23:14:02.515231: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
e:\programfiles\anaconda3\envs\bert_aas\lib\site-packages\bert_serving\server\helper.py:176: UserWarning: Tensorflow 2.3.0 is not tested! It may or may not work. Feel free to submit an issue at https://github.com/hanxiao/bert-as-service/issues/
  'Feel free to submit an issue at https://github.com/hanxiao/bert-as-service/issues/' % tf.__version__)
usage: E:\programfiles\Anaconda3\envs\bert_aas\Scripts\bert-serving-start -model_dir E:\e25/files_2/uncased_L-12_H-768_A-12 -num_worker=1
                  ARG   VALUE
__________________________________________________
            ckpt_name = bert_model.ckpt
          config_name = bert_config.json
                cors = *
                  cpu = False
          device_map = []
        do_lower_case = True
  fixed_embed_length = False
                fp16 = False
  gpu_memory_fraction = 0.5
        graph_tmp_dir = None
    http_max_connect = 10
            http_port = None
        mask_cls_sep = False
      max_batch_size = 256
          max_seq_len = 25
            model_dir = E:\e25/files_2/uncased_L-12_H-768_A-12
no_position_embeddings = False
    no_special_token = False
          num_worker = 1
        pooling_layer = [-2]
    pooling_strategy = REDUCE_MEAN
                port = 5555
            port_out = 5556
        prefetch_size = 10
  priority_batch_size = 16
show_tokens_to_client = False
      tuned_model_dir = None
              verbose = False
                  xla = False

I:VENTILATOR:freeze, optimize and export graph, could take a while...

e:\programfiles\anaconda3\envs\bert_aas\lib\site-packages\bert_serving\server\helper.py:176: UserWarning: Tensorflow 2.3.0 is not tested! It may or may not work. Feel free to submit an issue at https://github.com/hanxiao/bert-as-service/issues/
  'Feel free to submit an issue at https://github.com/hanxiao/bert-as-service/issues/' % tf.__version__)
E:GRAPHOPT:fail to optimize the graph!
Traceback (most recent call last):
  File "e:\programfiles\anaconda3\envs\bert_aas\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "e:\programfiles\anaconda3\envs\bert_aas\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "E:\programfiles\Anaconda3\envs\bert_aas\Scripts\bert-serving-start.exe\__main__.py", line 7, in 
  File "e:\programfiles\anaconda3\envs\bert_aas\lib\site-packages\bert_serving\server\cli\__init__.py", line 4, in main
    with BertServer(get_run_args()) as server:
  File "e:\programfiles\anaconda3\envs\bert_aas\lib\site-packages\bert_serving\server\__init__.py", line 71, in __init__
    self.graph_path, self.bert_config = pool.apply(optimize_graph, (self.args,))
TypeError: cannot unpack non-iterable NoneType object 

FROM HELPER.PY (excerpt from inside the version-check function, which is why `return tf_ver` appears at the end):
Path: e:\programfiles\anaconda3\envs\bert_aas\lib\site-packages\bert_serving\server\helper.py

import tensorflow as tf
tf_ver = tf.__version__.split('.')
if int(tf_ver[0]) <= 1 and int(tf_ver[1]) < 10:
    raise ModuleNotFoundError('Tensorflow >=1.10 (one-point-ten) is required!')
elif int(tf_ver[0]) > 1:
    warnings.warn('Tensorflow %s is not tested! It may or may not work. '
                  'Feel free to submit an issue at https://github.com/hanxiao/bert-as-service/issues/' % tf.__version__)
return tf_ver 

~   ~   ~   ~   ~

Removing Conda Environment in Case of Failure
(bert_aas) CMD>conda deactivate
(base) CMD>conda env remove -n bert_aas
Remove all packages in environment E:\programfiles\Anaconda3\envs\bert_aas: y

Checking installation

(bert_aas) CMD>pip freeze | findstr "GPUtil"

GPUtil @ file:///home/conda/feedstock_root/build_artifacts/gputil_1590646865081/work 

(bert_aas) CMD>pip freeze | findstr /C:"tensorflow" /C:"numpy" /C:"GPUtil" 

GPUtil @ file:///home/conda/feedstock_root/build_artifacts/gputil_1590646865081/work
numpy==1.18.5
tensorflow==1.14.0
tensorflow-estimator==1.14.0 

==========

Starting the server:

CMD> bert-serving-start -model_dir D:\ws\jupyter\f25_bert_for_sent_an\files_2\uncased_L-12_H-768_A-12 -num_worker=1

Logs:
                 ARG   VALUE
__________________________________________________
            ckpt_name = bert_model.ckpt
          config_name = bert_config.json
            model_dir = D:\ws\jupyter\f25_bert_for_sent_an\files_2\uncased_L-12_H-768_A-12
...
I:GRAPHOPT:model config: D:\ws\jupyter\f25_bert_for_sent_an\files_2\uncased_L-12_H-768_A-12\bert_config.json
I:GRAPHOPT:checkpoint: D:\ws\jupyter\f25_bert_for_sent_an\files_2\uncased_L-12_H-768_A-12\bert_model.ckpt
I:GRAPHOPT:build graph...
I:GRAPHOPT:load parameters from checkpoint...
I:GRAPHOPT:optimize...
I:GRAPHOPT:freeze...
I:GRAPHOPT:write graph to a tmp file: C:\Users\aj\AppData\Local\Temp\tmpxyxaq_b3
I:VENTILATOR:optimized graph is stored at: C:\Users\aj\AppData\Local\Temp\tmpxyxaq_b3
I:VENTILATOR:bind all sockets
I:VENTILATOR:open 8 ventilator-worker sockets
I:VENTILATOR:start the sink
...
I:WORKER-0:ready and listening!
I:VENTILATOR:all set, ready to serve request! 

==========

Running the client

(base) C:\Users\aj>conda activate bert_aas

(bert_aas) C:\Users\aj>python

>>> from bert_serving.client import BertClient
>>> bc = BertClient()
>>> enc_values = bc.encode(['First do it', 'then do it right', 'then do it better'])
>>> enc_values.shape
(3, 768)
>>> enc_values
array([[ 0.13186528,  0.3240411 , -0.82704353, ..., -0.37119573,
        -0.39250126, -0.31721842],
        [ 0.24873482, -0.12334437, -0.38933888, ..., -0.4475625 ,
        -0.55913603, -0.11345193],
        [ 0.2862734 , -0.18580128, -0.3090687 , ..., -0.29593647,
        -0.39310572,  0.0764024 ]], dtype=float32)
>>> 
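As a quick follow-up (not part of the original session), the vectors returned by bert-as-service can be compared directly with cosine similarity:

>>> from sklearn.metrics.pairwise import cosine_similarity
>>> cosine_similarity(enc_values[0:1], enc_values[1:3])  # first sentence vs. the other two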

Sunday, October 4, 2020

Set up Google OAuth 2.0 Authentication For Blogger and Retrieve Blog Posts


Installation 
Blogger APIs Client Library for Python
CMD> pip install --upgrade google-api-python-client 
Ref: developers.google.com

(base) C:\Users\ashish\Desktop>pip install oauth2client

(base) C:\Users\ashish\Desktop\blogger>pip show google-api-python-client
Name: google-api-python-client
Version: 1.12.3
Summary: Google API Client Library for Python
Home-page: https://github.com/googleapis/google-api-python-client/
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: d:\programfiles\anaconda3\lib\site-packages
Requires: six, google-auth-httplib2, google-auth, google-api-core, uritemplate, httplib2
Required-by: 

(base) C:\Users\ashish\Desktop\blogger>pip show oauth2client
Name: oauth2client
Version: 4.1.3
Summary: OAuth 2.0 client library
Home-page: http://github.com/google/oauth2client/
Author: Google Inc.
Author-email: jonwayne+oauth2client@google.com
License: Apache 2.0
Location: d:\programfiles\anaconda3\lib\site-packages
Requires: pyasn1, six, rsa, pyasn1-modules, httplib2
Required-by: 

==  ==  ==  ==  ==

Basic structure of the "client_secrets.json" is as written below:

Ref: GitHub

{
  "web": {
    "client_id": "[[INSERT CLIENT ID HERE]]",
    "client_secret": "[[INSERT CLIENT SECRET HERE]]",
    "redirect_uris": [],
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://accounts.google.com/o/oauth2/token"
  }
} 

Our "client_secrets.json" file looks like this:

{
  "installed":
  {
    "client_id": "0...9-a...z.apps.googleusercontent.com",
    "project_id": "banded...",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_secret": "2.....v",
    "redirect_uris": ["urn:ietf:wg:oauth:2.0:oob", "http://localhost"]
  }
} 

==  ==  ==  ==  ==

About authorization protocols 
Your application must use OAuth 2.0 to authorize requests. No other authorization protocols are supported. If your application uses Google Sign-In, some aspects of authorization are handled for you.

You can create "OAuth Client ID" from here: developers.google.com

You can also create a new "Project". We will select the already created "My Project" and set the "OAuth Client" type to "Desktop App":
== == == == ==

Error if the 'client_secrets.json' file is not present alongside "blogger.py":

(base) C:\Users\ashish\Desktop>python blogger.py
The client secrets were invalid: ('Error opening file', 'client_secrets.json', 'No such file or directory', 2)
WARNING: Please configure OAuth 2.0
To make this sample run you will need to populate the client_secrets.json file found at: client_secrets.json with information from the APIs Console [ https://code.google.com/apis/console ].

== == == == ==

Error on placing an empty (0 KB) file named "client_secrets.json":

(base) C:\Users\ashish\Desktop\blogger>python blogger_v2.py
Traceback (most recent call last):
  File "blogger_v2.py", line 43, in [module]
    main(sys.argv)
  File "blogger_v2.py", line 12, in main
    scope='https://www.googleapis.com/auth/blogger')
  File "D:\programfiles\Anaconda3\lib\site-packages\googleapiclient\sample_tools.py", line 88, in init
    client_secrets, scope=scope, message=tools.message_if_missing(client_secrets)
  File "D:\programfiles\Anaconda3\lib\site-packages\oauth2client\_helpers.py", line 133, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "D:\programfiles\Anaconda3\lib\site-packages\oauth2client\client.py", line 2135, in flow_from_clientsecrets
    cache=cache)
  File "D:\programfiles\Anaconda3\lib\site-packages\oauth2client\clientsecrets.py", line 165, in loadfile
    return _loadfile(filename)
  File "D:\programfiles\Anaconda3\lib\site-packages\oauth2client\clientsecrets.py", line 122, in _loadfile
    obj = json.load(fp)
  File "D:\programfiles\Anaconda3\lib\json\__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "D:\programfiles\Anaconda3\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "D:\programfiles\Anaconda3\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "D:\programfiles\Anaconda3\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

== == == == ==

Before authentication:

C:\Users\ashish\Desktop\blogger>tree /f
C:.
    blogger_v1.py
    client_secrets.json

No subfolders exist

After authentication (the authentication steps are shown below), a 'blogger.dat' file is created:

C:\Users\ashish\Desktop\blogger>tree /f
C:.
    blogger.dat
    blogger_v1.py
    client_secrets.json

No subfolders exist

== == == == ==

When we run the application for the first time, it opens Google's user-authentication window so we can grant the application access to Blogger by logging into the respective Google account.
Last prompt will ask you to confirm your choice:
Once the sign-in is complete, it opens a browser window (our browser is Chrome) as shown below:
Message: Application flow has been completed

== == == == ==

CODE V1

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Simple command-line sample for Blogger.

Command-line application that retrieves the users blogs and posts.

Usage:
  $ python blogger.py

You can also get help on all the command-line flags the program understands by running:
  $ python blogger.py --help

To get detailed log output run:
  $ python blogger.py --logging_level=DEBUG
"""

from __future__ import print_function

import sys

from oauth2client import client
from googleapiclient import sample_tools


def main(argv):
    # Authenticate and construct service.
    service, flags = sample_tools.init(
        argv, 'blogger', 'v3', __doc__, __file__,
        scope='https://www.googleapis.com/auth/blogger')

    try:
        users = service.users()

        # Retrieve this user's profile information
        thisuser = users.get(userId='self').execute()
        print('This user\'s display name is: %s' % thisuser['displayName'])

        blogs = service.blogs()

        # Retrieve the list of Blogs this user has write privileges on
        thisusersblogs = blogs.listByUser(userId='self').execute()
        for blog in thisusersblogs['items']:
            print('The blog named \'%s\' is at: %s' % (blog['name'], blog['url']))

        posts = service.posts()

        # List the posts for each blog this user has
        for blog in thisusersblogs['items']:
            print('The posts for %s:' % blog['name'])
            request = posts.list(blogId=blog['id'])
            while request != None:
                posts_doc = request.execute()
                if 'items' in posts_doc and not (posts_doc['items'] is None):
                    for post in posts_doc['items']:
                        print('  %s (%s)' % (post['title'], post['url']))
                request = posts.list_next(request, posts_doc)

    except client.AccessTokenRefreshError:
        print('The credentials have been revoked or expired, please re-run'
              ' the application to re-authorize')


if __name__ == '__main__':
    main(sys.argv)

== == == == ==

(base) CMD>python blogger_v1.py
D:\programfiles\Anaconda3\lib\site-packages\oauth2client\_helpers.py:255: UserWarning: Cannot access blogger.dat: No such file or directory
  warnings.warn(_MISSING_FILE_MESSAGE.format(filename))

Your browser has been opened to visit:
    https://accounts.google.com/o/oauth2/auth?client_id=1...

If your browser is on a different machine then exit and re-run this application with the command-line parameter --noauth_local_webserver

Authentication successful.
This user's display name is: Ashish Jain
The blog named 'survival8' is at: http://survival8.blogspot.com/
The blog named 'ashish' is at: http://ashish.blogspot.com/
The posts for survival8:
  Difficult Conversations. How to Discuss What Matters Most (D. Stone, B. Patton, S. Heen, 2010) (http://survival8.blogspot.com/2020/09/douglas-stone-b-patton-s-heen-difficult.html)
  ...
  Is having a water fountain a sensible thing to do? (http://survival8.blogspot.com/2016/11/is-having-water-fountain-sensible-thing_21.html)
The posts for ashish:
  ...

== == == == ==

'Cloud Resource Manager' Screenshot Showing Our Usage
== == == == ==

References

GitHub
  % Blogger: google-api-python-client
  % blogger.py
Google API Console
  % Google API Console
  % Dashboard
  % Cloud Resource Manager
Documentation
  % Blogger APIs Client Library for Python
  % Blogger API: Using the API
  % Blogger API v3 (Instance Methods)
  % Blogger API v3 . posts (Instance Methods)

Getting started with word embedding technique 'BERT'



BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

Our academic paper which describes BERT in detail and provides full results on a number of tasks can be found here: arxiv: BERT.

Ref: bert#pre-trained-models

URL to code for this post: GitHub: bert-as-service

STEP 1:
Installation of BERT as a server:

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server` 

STEP 2:
Download a pre-trained BERT model.
The one we are using is "BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters":

storage.googleapis.com: uncased_L-12_H-768_A-12.zip

STEP 3:
After installing the server, start it from a command prompt as follows:
$ bert-serving-start -model_dir D:\workspace\Jupyter\exp_42_bert\uncased_L-12_H-768_A-12 -num_worker=1

LOGS:

(env_for_python_36) C:\Users\ashish>bert-serving-start -model_dir D:\workspace\Jupyter\exp_42_bert\uncased_L-12_H-768_A-12 -num_worker=1

usage: C:\Users\ashish\AppData\Local\Continuum\anaconda3\envs\env_for_python_36\Scripts\bert-serving-start -model_dir D:\workspace\Jupyter\exp_42_bert\uncased_L-12_H-768_A-12 -num_worker=1
                 ARG   VALUE
__________________________________________________
           ckpt_name = bert_model.ckpt
         config_name = bert_config.json
                cors = *
                 cpu = False
          device_map = []
       do_lower_case = True
  fixed_embed_length = False
                fp16 = False
 gpu_memory_fraction = 0.5
       graph_tmp_dir = None
    http_max_connect = 10
           http_port = None
        mask_cls_sep = False
      max_batch_size = 256
         max_seq_len = 25
           model_dir = D:\workspace\Jupyter\exp_42_bert\uncased_L-12_H-768_A-12
no_position_embeddings = False
    no_special_token = False
          num_worker = 1
       pooling_layer = [-2]
    pooling_strategy = REDUCE_MEAN
                port = 5555
            port_out = 5556
       prefetch_size = 10
 priority_batch_size = 16
show_tokens_to_client = False
     tuned_model_dir = None
             verbose = False
                 xla = False

I:VENTILATOR:freeze, optimize and export graph, could take a while...
I:GRAPHOPT:model config: D:\workspace\Jupyter\exp_42_bert\uncased_L-12_H-768_A-12\bert_config.json
I:GRAPHOPT:checkpoint: D:\workspace\Jupyter\exp_42_bert\uncased_L-12_H-768_A-12\bert_model.ckpt
I:GRAPHOPT:build graph...
I:GRAPHOPT:load parameters from checkpoint...
I:GRAPHOPT:optimize...
I:GRAPHOPT:freeze...
I:GRAPHOPT:write graph to a tmp file: C:\Users\ashish\AppData\Local\Temp\tmpy8lsjd5y
I:VENTILATOR:optimized graph is stored at: C:\Users\ashish\AppData\Local\Temp\tmpy8lsjd5y
I:VENTILATOR:bind all sockets
I:VENTILATOR:open 8 ventilator-worker sockets
I:VENTILATOR:start the sink
I:SINK:ready
I:VENTILATOR:get devices
W:VENTILATOR:no GPU available, fall back to CPU
I:VENTILATOR:device map:
                worker  0 -> cpu
I:WORKER-0:use device cpu, load graph from C:\Users\ashish\AppData\Local\Temp\tmpy8lsjd5y
I:WORKER-0:ready and listening!
I:VENTILATOR:all set, ready to serve request! 

STEP 4:
Then to use the client in a different console:

from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better']) 

Logs:
(env_for_python_36) C:\Users\ashish>python
Python 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bert_serving.client import BertClient
>>> bc = BertClient()
>>> bc.encode(['First do it', 'then do it right', 'then do it better'])
array([[ 0.13186511,  0.32404116, -0.8270434 , ..., -0.37119645,
        -0.39250118, -0.3172187 ],
       [ 0.24873514, -0.12334443, -0.38933924, ..., -0.4475621 ,
        -0.559136  , -0.1134515 ],
       [ 0.28627324, -0.18580206, -0.30906808, ..., -0.2959365 ,
        -0.39310536,  0.07640218]], dtype=float32)
>>> 
Dated: 19-Dec-2019

Friday, October 2, 2020

Natural Language Toolkit (NLTK) - Highlights (Book by Steven Bird)



Book Edition: 2009, 1e

Lists and strings do not have exactly the same functionality. Lists have the added power that you can change their elements:

>>> beatles[0] = "John Lennon"
>>> del beatles[-1]

>>> beatles
['John Lennon', 'Paul', 'George']

On the other hand, if we try to do that with a string—changing the 0th character in query to 'F'—we get:

>>> query[0] = 'F'
Traceback (most recent call last):
File "[stdin]", line 1, in ?
TypeError: object does not support item assignment

This is because strings are immutable: you can’t change a string once you have created it. However, lists are mutable, and their contents can be modified at any time. As a result, lists support operations that modify the original value rather than producing a new value.
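A small illustration (not from the book) of the same point: list methods modify the list in place, while string methods return a new string and leave the original untouched.

beatles = ['John Lennon', 'Paul', 'George']
beatles.sort()           # modifies the list in place, returns None
print(beatles)           # ['George', 'John Lennon', 'Paul']

query = 'Who knows?'
print(query.upper())     # a NEW string: 'WHO KNOWS?'
print(query)             # original unchanged: 'Who knows?'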

---  ---  ---  ---  ---

What Is Unicode? 

Unicode supports over a million characters. Each character is assigned a number, called a code point. In Python, code points are written in the form \uXXXX, where XXXX is the number in four-digit hexadecimal form.

Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can support only a small subset of Unicode, enough for a single language. Other encodings (such as UTF-8) use multiple bytes and can represent the full range of Unicode characters.

Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode—translation into Unicode is called decoding. Conversely, to write out
Unicode to a file or a terminal, we first need to translate it into a suitable encoding—this translation out of Unicode is called encoding, and is illustrated in Figure 3-3.
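A minimal illustration of the decode/encode round trip described above (written in Python 3 syntax, whereas the book's own examples use Python 2):

data = 'café'.encode('utf-8')    # encoding: Unicode -> bytes, for writing to a file or terminal
print(data)                      # b'caf\xc3\xa9'
print(data.decode('utf-8'))      # decoding: bytes -> Unicode, prints 'café'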

From a Unicode perspective, characters are abstract entities that can be realized as one or more glyphs. Only glyphs can appear on a screen or be printed on paper. A font is a mapping from characters to glyphs.

In Python, a Unicode string literal can be specified by preceding an ordinary string literal with a u, as in u'hello'. Arbitrary Unicode characters are defined using the \uXXXX escape sequence inside a Unicode string literal. We find the integer ordinal of a character using ord(). For example:

>>> ord('a')
97

The hexadecimal four-digit notation for 97 is 0061, so we can define a Unicode string literal with the appropriate escape sequence:

>>> a = u'\u0061'
>>> a
u'a'
>>> print a
a

--- --- --- --- ---

4.7 Algorithm Design

A major part of algorithmic problem solving is selecting or adapting an appropriate algorithm for the problem at hand. Sometimes there are several alternatives, and choosing the best one depends on knowledge about how each alternative performs as the size of the data grows. Whole books are written on this topic, and we only have space to introduce some key concepts and elaborate on the approaches that are most prevalent in natural language processing.

The best-known strategy is known as divide-and-conquer. We attack a problem of size n by dividing it into two problems of size n/2, solve these problems, and combine their results into a solution of the original problem. For example, suppose that we had a pile of cards with a single word written on each card. We could sort this pile by splitting it in half and giving it to two other people to sort (they could do the same in turn). Then, when two sorted piles come back, it is an easy task to merge them into a single sorted pile. See Figure 4-3 for an illustration of this process.

Another example is the process of looking up a word in a dictionary. We open the book somewhere around the middle and compare our word with the current page. If it's earlier in the dictionary, we repeat the process on the first half; if it's later, we use the second half. This search method is called binary search since it splits the problem in half at every step.

In another approach to algorithm design, we attack a problem by transforming it into an instance of a problem we already know how to solve. For example, in order to detect duplicate entries in a list, we can pre-sort the list, then scan through it once to check whether any adjacent pairs of elements are identical.
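The two ideas in the last paragraphs can be written in a few lines of Python. This is a minimal sketch (not from the book) of binary search over a sorted word list and of duplicate detection by pre-sorting:

# Binary search: repeatedly halve the search interval of a sorted list.
def binary_search(sorted_words, target):
    lo, hi = 0, len(sorted_words) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_words[mid] == target:
            return mid
        elif sorted_words[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Duplicate detection by transformation: pre-sort, then compare adjacent pairs.
def has_duplicates(words):
    ws = sorted(words)
    return any(a == b for a, b in zip(ws, ws[1:]))

words = sorted(['pond', 'sword', 'mandate', 'ceremony', 'government'])
print(binary_search(words, 'mandate'))              # index of 'mandate' in the sorted list
print(has_duplicates(['do', 'it', 'right', 'do']))  # True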
--- --- --- --- ---

Stemmers

NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer, you should use one of these in preference to crafting your own using regular expressions, since NLTK's stemmers handle a wide range of irregular cases. The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), whereas the Lancaster stemmer does not.

>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(t) for t in tokens]
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
>>> [lancaster.stem(t) for t in tokens]
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']

--- --- --- --- ---

Lemmatization

The WordNet lemmatizer removes affixes only if the resulting word is in its dictionary. This additional checking process makes the lemmatizer slower than the stemmers just mentioned. Notice that it doesn't handle lying, but it converts women to woman.

>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas (or lexicon headwords).

--- --- --- --- ---

3.8 Segmentation

This section discusses more advanced concepts, which you may prefer to skip on the first time through this chapter.

Tokenization is an instance of a more general problem of segmentation. In this section, we will look at two other instances of this problem, which use radically different techniques to the ones we have seen so far in this chapter.

Sentence Segmentation

Manipulating texts at the level of individual words often presupposes the ability to divide a text into individual sentences. As we have seen, some corpora already provide access at the sentence level. In the following example, we compute the average number of words per sentence in the Brown Corpus:

>>> len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
20.250994070456922

In other cases, the text is available only as a stream of characters. Before tokenizing the text into words, we need to segment it into sentences. NLTK facilitates this by including the Punkt sentence segmenter (Kiss & Strunk, 2006). Here is an example of its use in segmenting the text of a novel. (Note that if the segmenter's internal data has been updated by the time you read this, you will see different output.)

>>> sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
>>> sents = sent_tokenizer.tokenize(text)
>>> pprint.pprint(sents[171:181])
['"Nonsense!',
 '" said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks and navvies in the\nrailway trains look so sad and tired,...',
 'I will\ntell you.',
 'It is because they know that the train is going right.',
 'It\nis because they know that whatever place they have taken a ticket\nfor that ...',
 'It is because after they have\npassed Sloane Square they know that the next stat...',
 'Oh, their wild rapture!',
 'oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation w...'
 '"\n\n"It is you who are unpoetical," replied the poet Syme.']

--- --- --- --- ---

Like every other NLTK module, distance.py begins with a group of comment lines giving a one-line title of the module and identifying the authors. (Since the code is distributed, it also includes the URL where the code is available, a copyright statement, and license information.) Next is the module-level docstring, a triple-quoted multiline string containing information about the module that will be printed when someone types help(nltk.metrics.distance).

# Natural Language Toolkit: Distance Metrics
#
# Author: Edward Loper : edloper@gradient.cis.upenn.edu
#         Steven Bird : sb@csse.unimelb.edu.au
#
"""
Distance Metrics.

Compute the distance between two items (usually strings). As metrics, they must satisfy the following three requirements:

1. d(a, a) = 0
2. d(a, b) >= 0
3. d(a, c) <= d(a, b) + d(b, c)
"""

After this comes all the import statements required for the module, then any global variables, followed by a series of function definitions that make up most of the module. Other modules define "classes," the main building blocks of object-oriented programming, which falls outside the scope of this book. (Most NLTK modules also include a demo() function, which can be used to see examples of the module in use.)

Some module variables and functions are only used within the module. These should have names beginning with an underscore, e.g., _helper(), since this will hide the name. If another module imports this one, using the idiom: from module import *, these names will not be imported. You can optionally list the externally accessible names of a module using a special built-in variable like this: __all__ = ['edit_distance', 'jaccard_distance'].

--- --- --- --- ---

Debugging Techniques

Since most code errors result from the programmer making incorrect assumptions, the first thing to do when you detect a bug is to check your assumptions. Localize the problem by adding print statements to the program, showing the value of important variables, and showing how far the program has progressed. If the program produced an "exception" (a runtime error) the interpreter will print a stack trace, pinpointing the location of program execution at the time of the error. If the program depends on input data, try to reduce this to the smallest size while still producing the error.

Once you have localized the problem to a particular function or to a line of code, you need to work out what is going wrong. It is often helpful to recreate the situation using the interactive command line. Define some variables, and then copy-paste the offending line of code into the session and see what happens.
Check your understanding of the code by reading some documentation and examining other code samples that purport to do the same thing that you are trying to do. Try explaining your code to someone else, in case she can see where things are going wrong.

Python provides a debugger which allows you to monitor the execution of your program, specify line numbers where execution will stop (i.e., breakpoints), and step through sections of code and inspect the value of variables. You can invoke the debugger on your code as follows:

>>> import pdb
>>> import mymodule
>>> pdb.run('mymodule.myfunction()')

It will present you with a prompt (Pdb) where you can type instructions to the debugger. Type help to see the full list of commands. Typing step (or just s) will execute the current line and stop. If the current line calls a function, it will enter the function and stop at the first line. Typing next (or just n) is similar, but it stops execution at the next line in the current function. The break (or b) command can be used to create or list breakpoints. Type continue (or c) to continue execution as far as the next breakpoint. Type the name of any variable to inspect its value.

We can use the Python debugger to locate the problem in our find_words() function. Remember that the problem arose the second time the function was called. We'll start by calling the function without using the debugger, using the smallest possible input. The second time, we'll call it with the debugger.

>>> import pdb
>>> find_words(['cat'], 3)
['cat']
>>> pdb.run("find_words(['dog'], 3)")
> [string](1)[module]()
(Pdb) step
--Call--
> [stdin](1)find_words()
(Pdb) args
text = ['dog']
wordlength = 3
result = ['cat']

Here we typed just two commands into the debugger: step took us inside the function, and args showed the values of its arguments (or parameters). We see immediately that result has an initial value of ['cat'], and not the empty list as expected. The debugger has helped us to localize the problem, prompting us to check our understanding of Python functions.

--- --- --- --- ---

Defensive Programming

In order to avoid some of the pain of debugging, it helps to adopt some defensive programming habits. Instead of writing a 20-line program and then testing it, build the program bottom-up out of small pieces that are known to work. Each time you combine these pieces to make a larger unit, test it carefully to see that it works as expected. Consider adding assert statements to your code, specifying properties of a variable, e.g., assert(isinstance(text, list)). If the value of the text variable later becomes a string when your code is used in some larger context, this will raise an AssertionError and you will get immediate notification of the problem.

Once you think you've found the bug, view your solution as a hypothesis. Try to predict the effect of your bugfix before re-running the program. If the bug isn't fixed, don't fall into the trap of blindly changing the code in the hope that it will magically start working again. Instead, for each change, try to articulate a hypothesis about what is wrong and why the change will fix the problem. Then undo the change if the problem was not resolved.

As you develop your program, extend its functionality, and fix any bugs, it helps to maintain a suite of test cases. This is called regression testing, since it is meant to detect situations where the code "regresses" (where a change to the code has an unintended side effect of breaking something that used to work).

Python provides a simple regression-testing framework in the form of the doctest module. This module searches a file of code or documentation for blocks of text that look like an interactive Python session, of the form you have already seen many times in this book. It executes the Python commands it finds, and tests that their output matches the output supplied in the original file. Whenever there is a mismatch, it reports the expected and actual values. For details, please consult the doctest documentation at DocTest Docs. Apart from its value for regression testing, the doctest module is useful for ensuring that your software documentation stays in sync with your code.

Perhaps the most important defensive programming strategy is to set out your code clearly, choose meaningful variable and function names, and simplify the code wherever possible by decomposing it into functions and modules with well-documented interfaces.

--- --- --- --- ---

CHAPTER 5 Categorizing and Tagging Words

The process of classifying words into their parts-of-speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts-of-speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.

5.1 Using a Tagger

A part-of-speech tagger, or POS tagger, processes a sequence of words, and attaches a part of speech tag to each word (don't forget to import nltk):

>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

5.2 Tagged Corpora: Brown Corpus has been POS tagged.

Representing Tagged Tokens

By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token, using the function str2tuple():

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

We can construct a list of tagged tokens directly from a string. The first step is to tokenize the string to access the individual word/tag strings, and then to convert each of these into a tuple (using str2tuple()).

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

Reading Tagged Corpora

Several of the corpora included with NLTK have been tagged for their part-of-speech. Here's an example of what you might see if you opened a file from the Brown Corpus with a text editor:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd / no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

Other corpora use a variety of formats for storing part-of-speech tags.
NLTK’s corpus readers provide a uniform interface so that you don’t have to be concerned with the different file formats. In contrast with the file extract just shown, the corpus reader for the Brown Corpus represents the data as shown next. Note that part-of-speech tags have been converted to uppercase; this has become standard practice since the Brown Corpus was published. >>> nltk.corpus.brown.tagged_words() [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...] >>> nltk.corpus.brown.tagged_words(simplify_tags=True) [('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...] Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method. Here are some more examples, again using the output format illustrated for the Brown Corpus: >>> print nltk.corpus.nps_chat.tagged_words() [('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...] >>> nltk.corpus.conll2000.tagged_words() [('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...] >>> nltk.corpus.treebank.tagged_words() [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...] Not all corpora employ the same set of tags; see the tagset help functionality and the readme() methods mentioned earlier for documentation. Initially we want to avoid the complications of these tagsets, so we use a built-in mapping to a simplified tagset: >>> nltk.corpus.brown.tagged_words(simplify_tags=True) [('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...] >>> nltk.corpus.treebank.tagged_words(simplify_tags=True) [('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...] Tagged corpora for several other languages are distributed with NLTK, including Chinese, Hindi, Portuguese, Spanish, Dutch, and Catalan. These usually contain non-ASCII text, and Python always displays this in hexadecimal when printing a larger structure such as a list. >>> nltk.corpus.sinica_treebank.tagged_words() [('\xe4\xb8\x80', 'Neu'), ('\xe5\x8f\x8b\xe6\x83\x85', 'Nad'), ...] >>> nltk.corpus.indian.tagged_words() [('\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0', 'NN'), ('\xe0\xa6\xb8\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xa4\xe0\xa6\xbe\xe0\xa6\xa8', 'NN'), ...] >>> nltk.corpus.mac_morpho.tagged_words() [('Jersei', 'N'), ('atinge', 'V'), ('m\xe9dia', 'N'), ...] >>> nltk.corpus.conll2002.tagged_words() [('Sao', 'NC'), ('Paulo', 'VMI'), ('(', 'Fpa'), ...] >>> nltk.corpus.cess_cat.tagged_words() [('El', 'da0ms0'), ('Tribunal_Suprem', 'np0000o'), ...] If your environment is set up correctly, with appropriate editors and fonts, you should be able to display individual strings in a human-readable way. If the corpus is also segmented into sentences, it will have a tagged_sents() method that divides up the tagged words into sentences rather than presenting them as one big list. This will be useful when we come to developing automatic taggers, as they are trained and tested on lists of sentences, not words.
5.4 Automatic Tagging

The Default Tagger

The simplest possible tagger assigns the same tag to each token. This may seem to be a rather banal step, but it establishes an important baseline for tagger performance. In order to get the best result, we tag each word with the most likely tag. Let's find out which tag is most likely (now using the unsimplified tagset):

>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
'NN'

Now we can create a tagger that tags everything as NN.

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]

Unsurprisingly, this method performs rather poorly. On a typical corpus, it will tag only about an eighth of the tokens correctly, as we see here:

>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028

The Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns, for example guessing that any word ending in -ing is a gerund:

>>> patterns = [
...     (r'.*ing$', 'VBG'),               # gerunds
...     (r'.*ed$', 'VBD'),                # simple past
...     (r'.*es$', 'VBZ'),                # 3rd singular present
...     (r'.*ould$', 'MD'),               # modals
...     (r'.*\'s$', 'NN$'),               # possessive nouns
...     (r'.*s$', 'NNS'),                 # plural nouns
...     (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
...     (r'.*', 'NN')                     # nouns (default)
... ]
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245

The Lookup Tagger

A lot of high-frequency words do not have the NN tag. Let's find the hundred most frequent words and store their most likely tag. We can then use this information as the model for a "lookup tagger" (an NLTK UnigramTagger):

>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.keys()[:100]
>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344
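A note on running this today: the lookup-tagger snippet is written for Python 2 era NLTK, where fd.keys() returned samples in decreasing frequency order. On Python 3 with NLTK 3.x, fd.keys()[:100] raises a TypeError. A minimal equivalent sketch using FreqDist.most_common():

>>> import nltk
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = [w for (w, _) in fd.most_common(100)]
>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)   # comparable accuracy to the figure quoted above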
5.5 N-Gram Tagging

Unigram Tagging

Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token. For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g., a frequent word) more often than it is used as a verb (e.g., I frequent this cafe). A unigram tagger behaves just like a lookup tagger (Section 5.4), except there is a more convenient technique for setting it up, called training. In the following code sample, we train a unigram tagger, use it to tag a sentence, and then evaluate:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
>>> unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'),
(',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'),
('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'),
('direct', 'JJ'), ('.', '.')]
>>> unigram_tagger.evaluate(brown_tagged_sents)
0.9349006503968017

We train a UnigramTagger by specifying tagged sentence data as a parameter when we initialize the tagger. The training process involves inspecting the tag of each word and storing the most likely tag for any word in a dictionary that is stored inside the tagger.

General N-Gram Tagging

When we perform a language processing task based on unigrams, we are using one item of context. In the case of tagging, we consider only the current token, in isolation from any larger context. Given such a model, the best we can do is tag each word with its a priori most likely tag. This means we would tag a word such as wind with the same tag, regardless of whether it appears in the context the wind or to wind.

An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens. The NgramTagger class uses a tagged training corpus to determine which part-of-speech tag is most likely for each context. Here we see a special case of an n-gram tagger, namely a bigram tagger. First we train it, then use it to tag untagged sentences:

>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'),
(',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'),
('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'),
('direct', 'JJ'), ('.', '.')]
>>> unseen_sent = brown_sents[4203]
>>> bigram_tagger.tag(unseen_sent)
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', 'NP'),
('is', 'BEZ'), ('13.5', None), ('million', None), (',', None), ('divided', None),
('into', None), ('at', None), ('least', None), ('seven', None), ('major', None),
('``', None), ('culture', None), ('clusters', None), ("''", None), ('and', None),
('innumerable', None), ('tribes', None), ('speaking', None), ('400', None),
('separate', None), ('dialects', None), ('.', None)]

Notice that the bigram tagger manages to tag every word in a sentence it saw during training, but does badly on an unseen sentence. As soon as it encounters a new word (i.e., 13.5), it is unable to assign a tag. It cannot tag the following word (i.e., million), even if it was seen during training, simply because it never saw it during training with a None tag on the previous word. Consequently, the tagger fails to tag the rest of the sentence.
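The bigram example above and the evaluations below refer to train_sents and test_sents, which are not defined in this excerpt. In the original chapter they come from a 90/10 split of the tagged sentences; a minimal sketch of that split, reusing brown_tagged_sents from above:

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]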
The bigram tagger's overall accuracy score is very low:

>>> bigram_tagger.evaluate(test_sents)
0.10276088906608193

Combining Taggers

One way to address the trade-off between accuracy and coverage is to use the more accurate algorithms when we can, but to fall back on algorithms with wider coverage when necessary. For example, we could combine the results of a bigram tagger, a unigram tagger, and a default tagger, as follows:

1. Try tagging the token with the bigram tagger.
2. If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
3. If the unigram tagger is also unable to find a tag, use a default tagger.

Most NLTK taggers permit a backoff tagger to be specified. The backoff tagger may itself have a backoff tagger:

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
0.84491179108940495

5.6 Transformation-Based Tagging

A potential issue with n-gram taggers is the size of their n-gram table (or language model). If tagging is to be employed in a variety of language technologies deployed on mobile computing devices, it is important to strike a balance between model size and tagger performance. An n-gram tagger with backoff may store trigram and bigram tables, which are large, sparse arrays that may have hundreds of millions of entries.

A second issue concerns context. The only information an n-gram tagger considers from prior context is tags, even though words themselves might be a useful source of information. It is simply impractical for n-gram models to be conditioned on the identities of words in the context. In this section, we examine Brill tagging, an inductive tagging method which performs very well using models that are only a tiny fraction of the size of n-gram taggers.

Brill tagging is a kind of transformation-based learning, named after its inventor. The general idea is very simple: guess the tag of each word, then go back and fix the mistakes. In this way, a Brill tagger successively transforms a bad tagging of a text into a better one. As with n-gram tagging, this is a supervised learning method, since we need annotated training data to figure out whether the tagger's guess is a mistake or not. However, unlike n-gram tagging, it does not count observations but compiles a list of transformational correction rules.

The process of Brill tagging is usually explained by analogy with painting. Suppose we were painting a tree, with all its details of boughs, branches, twigs, and leaves, against a uniform sky-blue background. Instead of painting the tree first and then trying to paint blue in the gaps, it is simpler to paint the whole canvas blue, then "correct" the tree section by over-painting the blue background. In the same fashion, we might paint the trunk a uniform brown before going back to over-paint further details with even finer brushes. Brill tagging uses the same idea: begin with broad brush strokes, and then fix up the details, with successively finer changes.
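The excerpt stops at the description. For readers who want to try Brill tagging themselves, here is a rough sketch using the trainer classes that ship with NLTK 3.x (the names fntbl37 and BrillTaggerTrainer are from that release and may differ in other versions), reusing the train_sents/test_sents split defined above:

>>> from nltk.tag.brill import fntbl37
>>> from nltk.tag.brill_trainer import BrillTaggerTrainer
>>> baseline = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
>>> trainer = BrillTaggerTrainer(initial_tagger=baseline, templates=fntbl37(), trace=0)
>>> brill_tagger = trainer.train(train_sents, max_rules=10)
>>> brill_tagger.rules()[:2]           # the first few learned correction rules
>>> brill_tagger.evaluate(test_sents)  # typically a small improvement over the baseline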
5.7 How to Determine the Category of a Word

Now that we have examined word classes in detail, we turn to a more basic question: how do we decide what category a word belongs to in the first place? In general, linguists use morphological, syntactic, and semantic clues to determine the category of a word.

Morphological Clues

The internal structure of a word may give useful clues as to the word's category. For example, -ness is a suffix that combines with an adjective to produce a noun, e.g., happy → happiness, ill → illness. So if we encounter a word that ends in -ness, this is very likely to be a noun. Similarly, -ment is a suffix that combines with some verbs to produce a noun, e.g., govern → government and establish → establishment.

English verbs can also be morphologically complex. For instance, the present participle of a verb ends in -ing, and expresses the idea of ongoing, incomplete action (e.g., falling, eating). The -ing suffix also appears on nouns derived from verbs, e.g., the falling of the leaves (this is known as the gerund).

Syntactic Clues

Another source of information is the typical contexts in which a word can occur. For example, assume that we have already determined the category of nouns. Then we might say that a syntactic criterion for an adjective in English is that it can occur immediately before a noun, or immediately following the words be or very. According to these tests, near should be categorized as an adjective:

(2) a. the near window
    b. The end is (very) near.

Semantic Clues

Finally, the meaning of a word is a useful clue as to its lexical category. For example, the best-known definition of a noun is semantic: "the name of a person, place, or thing." Within modern linguistics, semantic criteria for word classes are treated with suspicion, mainly because they are hard to formalize. Nevertheless, semantic criteria underpin many of our intuitions about word classes, and enable us to make a good guess about the categorization of words in languages with which we are unfamiliar. For example, if all we know about the Dutch word verjaardag is that it means the same as the English word birthday, then we can guess that verjaardag is a noun in Dutch. However, some care is needed: although we might translate zij is vandaag jarig as it's her birthday today, the word jarig is in fact an adjective in Dutch, and has no exact equivalent in English.

New Words

All languages acquire new lexical items. A list of words recently added to the Oxford Dictionary of English includes cyberslacker, fatoush, blamestorm, SARS, cantopop, bupkis, noughties, muggle, and robata. Notice that all these new words are nouns, and this is reflected in calling nouns an open class. By contrast, prepositions are regarded as a closed class. That is, there is a limited set of words belonging to the class (e.g., above, along, at, below, beside, between, during, for, from, in, near, on, outside, over, past, through, towards, under, up, with), and membership of the set only changes very gradually over time.

Morphology in Part-of-Speech Tagsets

Common tagsets often capture some morphosyntactic information, that is, information about the kind of morphological markings that words receive by virtue of their syntactic role. Consider, for example, the selection of distinct grammatical forms of the word go illustrated in the following sentences:

(3) a. Go away!
    b. He sometimes goes to the cafe.
    c. All the cakes have gone.
    d. We went on the excursion.

Each of these forms (go, goes, gone, and went) is morphologically distinct from the others. Consider the form goes. This occurs in a restricted set of grammatical contexts, and requires a third person singular subject. Thus, the following sentences are ungrammatical.

(4) a. *They sometimes goes to the cafe.
    b. *I sometimes goes to the cafe.
By contrast, gone is the past participle form; it is required after have (and cannot be replaced in this context by goes), and cannot occur as the main verb of a clause.

(5) a. *All the cakes have goes.
    b. *He sometimes gone to the cafe.

We can easily imagine a tagset in which the four distinct grammatical forms just discussed were all tagged as VB. Although this would be adequate for some purposes, a more fine-grained tagset provides useful information about these forms that can help other processors that try to detect patterns in tag sequences. In addition to this set of verb tags, the various forms of the verb to be have special tags: be/BE, being/BEG, am/BEM, are/BER, is/BEZ, been/BEN, were/BED, and was/BEDZ (plus extra tags for negative forms of the verb). All told, this fine-grained tagging of verbs means that an automatic tagger that uses this tagset is effectively carrying out a limited amount of morphological analysis.

Most part-of-speech tagsets make use of the same basic categories, such as noun, verb, adjective, and preposition. However, tagsets differ both in how finely they divide words into categories, and in how they define their categories. For example, is might be tagged simply as a verb in one tagset, but as a distinct form of the lexeme be in another tagset (as in the Brown Corpus). This variation in tagsets is unavoidable, since part-of-speech tags are used in different ways for different tasks. In other words, there is no one "right way" to assign tags, only more or less useful ways depending on one's goals.

--- --- --- --- ---

7.5 Named Entity Recognition

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print nltk.ne_chunk(sent, binary=True)
(S The/DT (NE U.S./NNP) is/VBZ one/CD ... according/VBG to/TO (NE Brooke/NNP T./NNP Mossman/NNP) ...)
>>> print nltk.ne_chunk(sent)
(S The/DT (GPE U.S./NNP) is/VBZ one/CD ... according/VBG to/TO (PERSON Brooke/NNP T./NNP Mossman/NNP) ...)

• Entity recognition is often performed using chunkers, which segment multi-token sequences and label them with the appropriate entity type. Common entity types include ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political entity).

--- --- --- --- ---
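A side note on using the ne_chunk() output above programmatically: instead of printing the tree, the entity subtrees can be collected into (label, text) pairs. A minimal sketch, assuming the same treebank sentence as above (the hasattr test distinguishes entity subtrees from plain (word, tag) leaves):

>>> tree = nltk.ne_chunk(sent)
>>> [(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))
...  for subtree in tree if hasattr(subtree, 'label')]
# e.g. [('GPE', 'U.S.'), ('PERSON', 'Brooke T. Mossman'), ...]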

Some Code

(temp) C:\Users\Ashish Jain>pip install --upgrade nltk
Processing c:\users\ashish jain\appdata\local\pip\cache\wheels\ae\8c\3f\b1fe0ba04555b08b57ab52ab7f86023639a526d8bc8d384306\nltk-3.5-cp37-none-any.whl
Requirement already satisfied, skipping upgrade: tqdm in e:\programfiles\anaconda3\envs\temp\lib\site-packages (from nltk) (4.48.2)
Requirement already satisfied, skipping upgrade: joblib in e:\programfiles\anaconda3\envs\temp\lib\site-packages (from nltk) (0.16.0)
Collecting regex
  Downloading regex-2020.9.27-cp37-cp37m-win_amd64.whl (268 kB)
     |████████████████████████████████| 268 kB 3.3 MB/s
Collecting click
  Using cached click-7.1.2-py2.py3-none-any.whl (82 kB)
Installing collected packages: regex, click, nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.4.5
    Uninstalling nltk-3.4.5:
      Successfully uninstalled nltk-3.4.5
Successfully installed click-7.1.2 nltk-3.5 regex-2020.9.27

(temp) C:\Users\Ashish Jain>pip show nltk
Name: nltk
Version: 3.5
Summary: Natural Language Toolkit
Home-page: http://nltk.org/
Author: Steven Bird
Author-email: stevenbird1@gmail.com
License: Apache License, Version 2.0
Location: e:\programfiles\anaconda3\envs\temp\lib\site-packages
Requires: tqdm, regex, click, joblib
Required-by: textblob, sumy

import nltk
print("nltk:", nltk.__version__)
nltk: 3.5

import matplotlib
import matplotlib.pyplot as plt

# Without "%matplotlib inline", you get error "Javascript Error: IPython is not defined" in JupyterLab.
%matplotlib inline
# For scrollable output image
%matplotlib nbagg

with open('files_1/Unicode.txt', mode = 'r') as f:
    txt = f.read()

txt[:80]
'What Is Unicode?\nUnicode supports over a million characters. Each character is a'

Tokenize

# Tokenize into words
words = nltk.tokenize.word_tokenize(txt)
print(words[:5])
print("Number of words:", len(words))

['What', 'Is', 'Unicode', '?', 'Unicode']
Number of words: 302

Stopwords

from nltk.corpus import stopwords
print("Number of English stopwords:", len(stopwords.words('english')))
Number of English stopwords: 179

Word-Frequency Plot

from nltk.probability import FreqDist
fdist1 = FreqDist(words)

%matplotlib inline
fig = plt.figure(figsize=(12,5))
fdist1.plot(100, cumulative=True)
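The stopword list loaded above is not actually applied in this run; to plot only content words, the tokens can be filtered first. A minimal sketch, reusing words, stopwords, and FreqDist from above (the isalpha() check also drops punctuation tokens):

sw = set(stopwords.words('english'))
content_words = [w for w in words if w.lower() not in sw and w.isalpha()]
fdist2 = FreqDist(content_words)
fdist2.plot(30, cumulative=False)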
Converting input text to an NLTK text

# text = nltk.Text(txt) # [Text: W h a t I s ...]
text = nltk.Text(words)
print(text)
[Text: What Is Unicode ? Unicode supports over a...]

Word Collocations (Bigram and Trigram)

text.collocation_list(5)
[('string', 'literal'), ('code', 'point'), ('Unicode', 'characters'), ('Unicode', 'string')]

from nltk.collocations import *
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(text)
finder.nbest(trigram_measures.pmi, 10)

[('abstract', 'entities', 'that'), ('by', 'preceding', 'an'), ('escape', 'sequence', 'inside'),
('just', 'like', 'normal'), ('preceding', 'an', 'ordinary'), ('specified', 'by', 'preceding'),
('\\uXXXX', 'escape', 'sequence'), ('encodingâ€', '”', 'this'), ('four-digit', 'hexadecimal', 'form'),
('like', 'normal', 'strings')]

Clean HTML

with open('files_1/tempate.html', mode = 'r') as f:
    in_html = f.read()

nltk.clean_html(in_html)
# NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
# (see the BeautifulSoup sketch at the end of this post)

Word Distance

Ref: nltk.org "Word Distance"

import pkgutil
for importer, modname, ispkg in pkgutil.iter_modules(nltk.__path__):
    print("Found submodule %s (is a package: %s)" % (modname, ispkg))

Found submodule app (is a package: True)
Found submodule book (is a package: False)
Found submodule ccg (is a package: True)
Found submodule chat (is a package: True)
Found submodule chunk (is a package: True)
...

dir(nltk)[:5]
['AbstractLazySequence', 'AffixTagger', 'AlignedSent', 'Alignment', 'AnnotationTask']

[i for i in dir(nltk) if 'dist' in i]
['binary_distance', 'custom_distance', 'distance', 'edit_distance', 'edit_distance_align',
'interval_distance', 'jaccard_distance', 'masi_distance']

string_distance_examples = [
    ("rain", "shine"),
    ("abcdef", "acbdef"),
    ("language", "lnaguaeg"),
    ("language", "lnaugage"),
    ("language", "lngauage"),
]

for i in string_distance_examples:
    print(i[0], i[1], ":", nltk.binary_distance(i[0], i[1]))

rain shine : 1.0
abcdef acbdef : 1.0
language lnaguaeg : 1.0
language lnaugage : 1.0
language lngauage : 1.0

for i in string_distance_examples:
    print(i[0], i[1], ":", nltk.edit_distance(i[0], i[1]))

rain shine : 3
abcdef acbdef : 2
language lnaguaeg : 4
language lnaugage : 3
language lngauage : 2
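As the NotImplementedError in the Clean HTML step suggests, HTML stripping has moved out of NLTK. A minimal sketch of the BeautifulSoup route, assuming the bs4 package is installed and reusing the in_html string loaded above:

from bs4 import BeautifulSoup

# Parse the HTML and keep only the visible text
soup = BeautifulSoup(in_html, "html.parser")
plain_text = soup.get_text(separator=" ", strip=True)

# The plain text can then be tokenized as before
words_from_html = nltk.tokenize.word_tokenize(plain_text)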

Thursday, October 1, 2020

7 Frequent Python 'os' Package Uses



(base) C:\Users\ashish\Desktop\TEST>tree /f

C:.
│   TEST2.txt
│
└───TEST1
    │   TEST1.2.txt
    │
    └───TEST1.1
            TEST1.1.1.txt 

= = = = =

1. List all the subdirectories and files in the current directory:

(base) C:\Users\ashish\Desktop\TEST>python
Python 3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
['TEST1', 'TEST2.txt']
>>>

= = = = =

2. Recursively get all the subdirectories and files under the current folder (signified by "."):

>>> for dirpath, subdirs, files in os.walk("."):
...  print("dirpath:", dirpath)
...  for s in subdirs:
...   print("subdir:", s)
...  for f in files:
...   print("file:", f)
...
dirpath: .
subdir: TEST1
subdir: TEST4
file: TEST2.txt
file: TEST3.txt

dirpath: .\TEST1
subdir: TEST1.1
file: TEST1.2.txt

dirpath: .\TEST1\TEST1.1
file: TEST1.1.1.txt

dirpath: .\TEST4
>>> 

= = = = =

3. Call an OS shell (or Windows CMD) command from the Python shell:

>>> os.system("tree /f")

C:.
│   TEST2.txt
│   TEST3.txt
│
├───TEST1
│   │   TEST1.2.txt
│   │
│   └───TEST1.1
│           TEST1.1.1.txt
│
└───TEST4
0 

Note: The 0 printed at the end is the command's return code; 0 signifies a successful exit.
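If you also want the command's output as a Python string rather than just its return code, os.popen from the same module can capture it. A small sketch (not part of the original session):

>>> output = os.popen("tree /f").read()
>>> print(output[:40])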

= = = = =

4. Create a "Path" from Strings:

"os.path.join" joins strings with the character representing 'path separater' for the OS.

>>> os.path.join("DIR1", "DIR2")
'DIR1\\DIR2' 

>>> os.path.join("DIR1", "DIR2", "DIR3")
'DIR1\\DIR2\\DIR3' 

>>> os.path.sep
'\\'

= = = = =

5. Get 'current working directory':

>>> os.getcwd()
'C:\\Users\\ashish\\Desktop\\TEST' 

= = = = =

6. Change Directory:

>>> os.chdir("TEST1")

>>> os.listdir()
['TEST1.1', 'TEST1.2.txt'] 

>>> os.getcwd()
'C:\\Users\\ashish\\Desktop\\TEST\\TEST1' 

= = = = =

7. Get environment variables:

>>> import os
>>> os.environ['PATH']
'E:\\programfiles\\Anaconda3;E:\\programfiles\\Anaconda3\\Lib...' 
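os.environ behaves like a dictionary, so looking up a missing variable raises KeyError; os.environ.get can supply a default instead. A small sketch (the variable name is just an example and is assumed not to be set):

>>> os.environ.get('MY_OPTIONAL_VAR', 'not set')
'not set'
>>> 'PATH' in os.environ
True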

= = = = =