Thursday, August 12, 2021

Document Parsing, Document-based Embeddings and Word Embeddings



For the examples in this post, we will use the following sentence as input:

sentence = "Thomas Jefferson began building Monticello at the age of 26."

-----

The fundamental building blocks of NLP have equivalents in a computer language compiler:

# tokenizer — scanner, lexer, lexical analyzer
# vocabulary — lexicon
# parser — compiler
# token, term, word, sentence, or n-gram — token, symbol, or terminal symbol

An n-gram is a sequence of n tokens: a two-gram, a three-gram, a four-gram, and so on.

For the sentence: Thomas Jefferson began building Monticello at the age of 26.

Two-grams: “Thomas Jefferson”, “Jefferson began”, “began building”, ...

Three-grams: “Thomas Jefferson began”, “Jefferson began building”, ...
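A minimal sketch of generating these n-grams with NLTK's ngrams helper (this assumes NLTK is installed and the punkt tokenizer data has been downloaded for word_tokenize):

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

sentence = "Thomas Jefferson began building Monticello at the age of 26."

# Tokenize the sentence into individual tokens
tokens = word_tokenize(sentence)

# Build two-grams and three-grams as tuples of adjacent tokens
two_grams = list(ngrams(tokens, 2))
three_grams = list(ngrams(tokens, 3))

print(two_grams[:3])
# [('Thomas', 'Jefferson'), ('Jefferson', 'began'), ('began', 'building')]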

-----

One Hot Vector

Each row of the table (one row per word of the sentence) is a binary row vector, and you can see why it’s also called a one-hot vector: all but one of the positions (columns) in a row are 0 or blank. Only one column, or position in the vector, is “hot” (“1”). A one (1) means on, or hot. A zero (0) means off, or absent. And you can use the vector [0, 0, 0, 0, 0, 0, 1, 0, 0, 0] to represent the word “began” in your NLP pipeline.
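As a minimal sketch (not the exact code of any particular library), the one-hot vector for “began” can be built from the sorted vocabulary of the example sentence:

import numpy as np

sentence = "Thomas Jefferson began building Monticello at the age of 26."

# Split on whitespace and drop the trailing period for a simple token list
tokens = sentence.rstrip(".").split()

# Sorting the unique words gives every word a fixed column position
vocab = sorted(set(tokens))

# One-hot vector for "began": all zeros except the column for that word
onehot = np.zeros(len(vocab), dtype=int)
onehot[vocab.index("began")] = 1

print(onehot)
# [0 0 0 0 0 0 1 0 0 0]

-----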

One-hot vector Representation of a Document
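A sketch of what such a representation might look like, with one one-hot row vector per word of the sentence (the DataFrame layout here is only for illustration):

import numpy as np
import pandas as pd

sentence = "Thomas Jefferson began building Monticello at the age of 26."
tokens = sentence.rstrip(".").split()
vocab = sorted(set(tokens))

# One row per token position, one column per vocabulary word;
# each row is the one-hot vector for the token at that position.
onehot_doc = np.zeros((len(tokens), len(vocab)), dtype=int)
for i, word in enumerate(tokens):
    onehot_doc[i, vocab.index(word)] = 1

print(pd.DataFrame(onehot_doc, columns=vocab, index=tokens))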

-----

One hot encoding of a categorical column
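A quick sketch with pandas (the "city" column and its values are hypothetical, used only for illustration):

import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# get_dummies expands the categorical column into one binary (one-hot) column per category
onehot = pd.get_dummies(df, columns=["city"])
print(onehot)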

-----

Word Frequency Vector Representation of the Corpus

If you summed all these one-hot vectors together, rather than “replaying” them one at a time, you’d get a bag-of-words vector. This is also called a word frequency vector, because it only counts the frequency of words, not their order.
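A minimal sketch of a bag-of-words (word frequency) vector using Python's Counter, which is equivalent to summing the one-hot vectors of every token:

from collections import Counter

sentence = "Thomas Jefferson began building Monticello at the age of 26."
tokens = sentence.rstrip(".").split()

# Counting token occurrences discards word order but keeps frequencies
bag_of_words = Counter(tokens)
print(bag_of_words)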


-----

Explaining the Construction of the "Word Frequency Vector Representation"

Consider a small corpus of four sentences:

1) Thomas Jefferson began building Monticello at the age of 26.
2) Construction was done mostly by local masons and carpenters.
3) He moved into the South Pavilion in 1770.
4) Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.

We have 26 and 1770 as numbers. The first step in building the vocabulary is sorting the words: 1770, 26, Construction, He...
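A sketch of this construction in Python (the tokenizer here is a deliberately simple whitespace split, not a full NLP tokenizer):

from collections import Counter
import pandas as pd

corpus = [
    "Thomas Jefferson began building Monticello at the age of 26.",
    "Construction was done mostly by local masons and carpenters.",
    "He moved into the South Pavilion in 1770.",
    "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.",
]

# Simple tokenizer: split on whitespace and strip trailing periods
def tokenize(text):
    return [token.strip(".") for token in text.split()]

# Step 1: build the vocabulary by sorting the unique words of the corpus
vocab = sorted(set(token for doc in corpus for token in tokenize(doc)))
print(vocab[:4])
# ['1770', '26', 'Construction', 'He']

# Step 2: one word frequency (bag-of-words) vector per sentence
counts = [Counter(tokenize(doc)) for doc in corpus]
freq_vectors = pd.DataFrame([[c[word] for word in vocab] for c in counts], columns=vocab)
print(freq_vectors)

-----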

Dot product / Inner product / Scalar product

Geometric Definition

In Euclidean space, a Euclidean vector is a geometric object that possesses both a magnitude and a direction. A vector can be pictured as an arrow. Its magnitude is its length, and its direction is the direction to which the arrow points. The magnitude of a vector “a” is denoted by ||a||. The dot product of two Euclidean vectors a and b is defined by:

a . b = ||a|| ||b|| cos(θ), where θ is the angle between a and b.

An application of dot product

Sent0 = "I am Ashish."
Sent1 = "Maybe am not."

Words:   I    am   Ashish   Maybe   not
Sent0:   1    1    1        0       0
Sent1:   0    1    0        1       1

Sent0 . Sent1 = (1 * 0) + (1 * 1) + (1 * 0) + (0 * 1) + (0 * 1) = 1

The number of common words in these two sentences is 1.
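The same overlap count, sketched with NumPy:

import numpy as np

# Word-count vectors over the shared vocabulary [I, am, Ashish, Maybe, not]
sent0 = np.array([1, 1, 1, 0, 0])   # "I am Ashish."
sent1 = np.array([0, 1, 0, 1, 1])   # "Maybe am not."

# The dot product counts how many words the two sentences share
print(np.dot(sent0, sent1))
# 1

-----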

Several Python libraries implement tokenizers, each with its own advantages and disadvantages

# spaCy — accurate, flexible, fast, Python
# Stanford CoreNLP — more accurate, less flexible, fast, depends on Java 8
# NLTK — standard used by many NLP contests and comparisons, popular, Python
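For example, a minimal spaCy tokenization sketch (this assumes the small English model en_core_web_sm has been installed with "python -m spacy download en_core_web_sm"):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Thomas Jefferson began building Monticello at the age of 26.")

# spaCy splits punctuation such as the final period into its own token
print([token.text for token in doc])

-----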

Treebank Word Tokenizer

An even better tokenizer is the Treebank Word Tokenizer from the NLTK package. It incorporates a variety of common rules for English word tokenization. For example, it separates phrase-terminating punctuation (?!.;,) from adjacent tokens and retains decimal numbers containing a period as a single token. In addition, it contains rules for English contractions. For example, “don’t” is tokenized as ["do", "n’t"]. This tokenization will help with subsequent steps in the NLP pipeline, such as stemming. You can find all the rules for the Treebank Tokenizer at: nltk.tokenize.treebank
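A short sketch of the Treebank Word Tokenizer in use (the example sentence here is made up to show both the contraction rule and the decimal-number rule):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# The contraction is split ("Don't" -> "Do", "n't") and the decimal number stays one token
print(tokenizer.tokenize("Don't forget that pi is about 3.14159."))
# ['Do', "n't", 'forget', 'that', 'pi', 'is', 'about', '3.14159', '.']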
-----

Stop Words

Stop words are common words in any language that occur with a high frequency but carry much less substantive information about the meaning of a phrase. Examples of some common stop words include:

# a, an
# the, this
# and, or
# of, on

Historically, stop words have been excluded from NLP pipelines in order to reduce the computational effort to extract information from a text. Even though the words themselves carry little information, the stop words can provide important relational information as part of an n-gram. Consider these two examples:

# Mark reported to the CEO
# Suzanne reported as the CEO to the board
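A minimal stop-word filtering sketch with NLTK (assumes the stopwords and punkt data have been downloaded via nltk.download):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

tokens = word_tokenize("Thomas Jefferson began building Monticello at the age of 26.")

# Keep only the tokens that are not in the English stop word list
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
# ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'age', '26', '.']

-----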

Document Parsing Ends Here, and Word Embeddings Begin

Word Embeddings

One of the most exciting recent advancements in NLP is the “discovery” of word vectors. This chapter will help you understand what they are and how to use them to do some surprisingly powerful things. You’ll learn how to recover some of the fuzziness and subtlety of word meaning that was lost in the approximations of earlier chapters.

In the previous chapters, we ignored the nearby context of a word. We ignored the words around each word. We ignored the effect the neighbors of a word have on its meaning and how those relationships affect the overall meaning of a statement. Our bag-of-words concept jumbled all the words from each document together into a statistical bag.

In this chapter, you’ll create much smaller bags of words from a “neighborhood” of only a few words, typically fewer than 10 tokens. You’ll also ensure that these neighborhoods of meaning don’t spill over into adjacent sentences. This process will help focus your word vector training on the relevant words.

Word Vectors or Word Embeddings

Word vectors are numerical vector representations of word semantics, or meaning, including literal and implied meaning. So word vectors can capture the connotation of words, like “peopleness,” “animalness,” “placeness,” “thingness,” and even “conceptness.” And they combine all that into a dense vector (no zeros) of floating point values. This dense vector enables queries and logical reasoning.
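A sketch of such queries with gensim's pretrained GloVe vectors (the model name glove-wiki-gigaword-50 is one of gensim's downloadable options; the first call downloads it, which can take a while):

import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-50")

# A dense vector of 50 floating point values for a single word
print(word_vectors["jefferson"][:5])

# Vector arithmetic as logical reasoning: king - man + woman is closest to queen
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))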
