Thursday, July 22, 2021

Normalizing your vocabulary (lexicon) for NLP applications



Why normalize our vocabulary:

1. To reduce the vocabulary size, since vocabulary size is important to the performance of an NLP pipeline.
2. So that tokens that mean similar things are combined into a single, normalized form.
3. To improve the association of meaning across the different “spellings” of a token or n-gram in your corpus.
4. To reduce the likelihood of overfitting, since a smaller vocabulary generalizes better.

Vocabulary is normalized in the following ways:

a) CASE FOLDING (aka case normalization)

Case folding is when you consolidate multiple “spellings” of a word that differ only in their capitalization. Naively lowercasing everything destroys the meaning of proper nouns, so a better approach for case normalization is to lowercase only the first word of a sentence and allow all other words, such as “Joe” and “Smith” in “Joe Smith”, to retain their capitalization.
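The heuristic above can be sketched in a few lines of Python (a minimal sketch; a production pipeline would first split text into sentences with a proper sentence segmenter):

```python
def case_normalize(sentence):
    """Lowercase only the first word of a sentence, leaving later words
    (and hence most proper nouns) with their original capitalization."""
    words = sentence.split()
    if words:
        words[0] = words[0].lower()
    return " ".join(words)

print(case_normalize("The engineer Joe Smith lives in Portland"))
# the engineer Joe Smith lives in Portland
```

Note the heuristic is imperfect: a sentence that begins with a proper noun (“Joe called me.”) still gets its first word lowercased.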

b) STEMMING

Another common vocabulary normalization technique is to eliminate the small meaning differences of pluralization or possessive endings of words, or even various verb forms. This normalization, identifying a common stem among various forms of a word, is called stemming. For example, the words housing and houses share the same stem, house. Stemming removes suffixes from words in an attempt to combine words with similar meanings together under their common stem. A stem isn’t required to be a properly spelled word, but merely a token, or label, representing several possible spellings of a word. Stemming is important for keyword search or information retrieval. It allows you to search for “developing houses in Portland” and get web pages or documents that use both the word “house” and “houses” and even the word “housing.”
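The idea can be illustrated with a deliberately crude suffix-stripping stemmer (a toy sketch only; real pipelines use a rule-based stemmer such as the Porter or Snowball stemmer, e.g. via nltk's `PorterStemmer`):

```python
def crude_stem(word):
    """A toy stemmer: strip one common English suffix, longest first.
    Very short stems are left alone to avoid over-stripping."""
    word = word.lower()
    for suffix in ("ing", "es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ("house", "houses", "housing")])
# ['hous', 'hous', 'hous']
```

All three forms collapse to the stem “hous”, which, as noted above, need not be a properly spelled word, just a shared label.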

# How does stemming affect precision and recall of a search engine?

This broadening of your search results would be a big improvement in the “recall” score for how well your search engine is doing its job at returning all the relevant documents. But stemming could greatly reduce the “precision” score for your search engine, because it might return many more irrelevant documents along with the relevant ones. In some applications this “false-positive rate” (the proportion of the pages returned that you don’t find useful) can be a problem. So most search engines allow you to turn off stemming and even case normalization by putting quotes around a word or phrase. Quoting indicates that you only want pages containing the exact spelling of a phrase, such as “Portland Housing Development software”. That would return a different sort of document than one that talks about “a Portland software developer’s house”.
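The precision/recall trade-off described above can be made concrete with a small numeric sketch (the document IDs and result sets here are made up for illustration):

```python
def precision(retrieved, relevant):
    """Fraction of the returned documents that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were returned."""
    return len(retrieved & relevant) / len(relevant)

relevant = {"d1", "d2", "d3", "d4"}        # documents truly about housing development
exact_match = {"d1", "d2"}                 # query without stemming: fewer hits
with_stemming = {"d1", "d2", "d3", "d9"}   # stemming adds d3 (relevant) and d9 (irrelevant)

print(precision(exact_match, relevant), recall(exact_match, relevant))      # 1.0 0.5
print(precision(with_stemming, relevant), recall(with_stemming, relevant))  # 0.75 0.75
```

Stemming raised recall from 0.5 to 0.75 but dropped precision from 1.0 to 0.75, exactly the trade-off quoting a phrase lets the user opt out of.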

c) LEMMATIZATION

If you have access to information about connections between the meanings of various words, you might be able to associate several words together even if their spelling is quite different. This more extensive normalization down to the semantic root of a word—its lemma—is called lemmatization.

# Lemmatization and its use in the chatbot pipeline:

Any NLP pipeline that wants to “react” the same for multiple different spellings of the same basic root word can benefit from a lemmatizer. It reduces the number of words you have to respond to, the dimensionality of your language model. Using it can make your model more general, but it can also make your model less precise, because it will treat all spelling variations of a given root word the same. For example, “chat,” “chatter,” “chatty,” “chatting,” and perhaps even “chatbot” would all be treated the same in an NLP pipeline with lemmatization, even though they have different meanings. Likewise, “bank,” “banked,” and “banking” would be treated the same by a stemming pipeline, despite the river meaning of “bank,” the motorcycle meaning of “banked,” and the finance meaning of “banking.” Lemmatization is a potentially more accurate way to normalize a word than stemming or case normalization because it takes into account a word’s meaning. A lemmatizer uses a knowledge base of word synonyms and word endings to ensure that only words that mean similar things are consolidated into a single token.

# Lemmatization and POS (Part of speech) Tagging

Some lemmatizers use the word’s part of speech (POS) tag in addition to its spelling to help improve accuracy. The POS tag for a word indicates its role in the grammar of a phrase or sentence. For example, the noun POS is for words that refer to “people, places, or things” within a phrase. An adjective POS is for a word that modifies or describes a noun. A verb refers to an action. The POS of a word in isolation cannot be determined; the context of a word must be known for its POS to be identified. So some advanced lemmatizers can’t be run on words in isolation.

>>> import nltk
>>> nltk.download('wordnet')
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize("better")  # default 'pos' is noun
'better'
>>> lemmatizer.lemmatize("better", pos="a")  # "a" --> adjective
'good'
>>> lemmatizer.lemmatize("goods", pos="n")
'good'
>>> lemmatizer.lemmatize("goods", pos="a")
'goods'
>>> lemmatizer.lemmatize("good", pos="a")
'good'
>>> lemmatizer.lemmatize("goodness", pos="n")
'goodness'
>>> lemmatizer.lemmatize("best", pos="a")
'best'

Difference between stemming and lemmatization:

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is => be
car, cars, car's, cars' => car

The result of this mapping of text will be something like:

the boy's cars are different colors => the boy car be differ color
However, the two techniques differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.

The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open source.

Ref: nlp.stanford.edu
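The contrast between the two approaches can be sketched in a few lines. Here `crude_stem` blindly chops suffixes, while `toy_lemmatize` consults a lookup table keyed by word and POS. `LEMMA_LEXICON` is a tiny hand-made stand-in for a real morphological dictionary such as WordNet; a real lemmatizer would consult that instead:

```python
# Tiny hypothetical lexicon mapping (word, pos) pairs to lemmas.
LEMMA_LEXICON = {
    ("saw", "v"): "see",        # verb use: past tense of "see"
    ("saw", "n"): "saw",        # noun use: the cutting tool
    ("organizes", "v"): "organize",
    ("organizing", "v"): "organize",
}

def crude_stem(word):
    """Chop one common suffix, longest first, ignoring meaning and POS."""
    word = word.lower()
    for suffix in ("ing", "es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Return the dictionary form if the lexicon knows it, else the word."""
    return LEMMA_LEXICON.get((word, pos), word)

print(crude_stem("organizing"), crude_stem("organizes"))     # organiz organiz
print(toy_lemmatize("organizing", "v"))                      # organize
print(toy_lemmatize("saw", "v"), toy_lemmatize("saw", "n"))  # see saw
```

The stemmer collapses the inflected forms to the non-word “organiz”, while the lemmatizer returns the dictionary form “organize”, and only the lemmatizer can distinguish saw-the-verb (lemma see) from saw-the-noun.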
