Thursday, July 29, 2021

JavaScript Intro (Dev Console, Data Types and Operators)



Where to run the JavaScript code?

PRINTING SOMETHING

The Hello World Program: console.log("Hello World!");

Variable Declaration in JavaScript

Values can be assigned to variables with an = sign:

x = 0;               // Now the variable x has the value 0
x                    // => 0: A variable evaluates to its value.

JavaScript supports several types of values:

x = 1;               // Numbers.
x = 0.01;            // Numbers can be integers or reals.
x = "hello world";   // Strings of text in quotation marks.
x = 'JavaScript';    // Single quote marks also delimit strings.
x = true;            // A Boolean value.
x = false;           // The other Boolean value.
x = null;            // Null is a special value that means "no value."
x = undefined;       // Undefined is another special value like null.

Object Declaration in JavaScript

JavaScript's most important datatype is the object. An object is a collection of name/value pairs, or a string-to-value map.

let book = {             // Objects are enclosed in curly braces.
    topic: "JavaScript", // The property "topic" has value "JavaScript."
    edition: 7           // The property "edition" has value 7.
};                       // The closing curly brace marks the end of the object.

Access the properties of an object with . or []:

book.topic                // => "JavaScript"
book["edition"]           // => 7: another way to access property values.
book.author = "Flanagan"; // Create new properties by assignment.
book.contents = {};       // {} is an empty object with no properties.

Conditionally access properties with ?. (ES2020):

book.contents?.ch01?.sect1 // => undefined: book.contents has no ch01 property.

Arrays in JavaScript (An Overview)

JavaScript also supports arrays (numerically indexed lists) of values:

let primes = [2, 3, 5, 7];  // An array of 4 values, delimited with [ and ].
primes[0]                   // => 2: the first element (index 0) of the array.
primes.length               // => 4: how many elements in the array.
primes[primes.length-1]     // => 7: the last element of the array.
primes[4] = 9;              // Add a new element by assignment.
primes[4] = 11;             // Or alter an existing element by assignment.
let empty = [];             // [] is an empty array with no elements.
empty.length                // => 0

Arrays and objects can hold other arrays and objects. Following is an array with 2 elements:

let points = [
    {x: 0, y: 0},           // Each element is an object.
    {x: 1, y: 1}
];

An object with 2 properties:

let data = {
    trial1: [[1,2], [3,4]], // The value of each property is an array.
    trial2: [[2,3], [4,5]]  // The elements of the arrays are arrays.
};

Operators

Operators act on values (the operands) to produce a new value. Arithmetic operators are some of the simplest:

3 + 2                      // => 5: addition
3 - 2                      // => 1: subtraction
3 * 2                      // => 6: multiplication
3 / 2                      // => 1.5: division
points[1].x - points[0].x  // => 1: more complicated operands also work
"3" + "2"                  // => "32": + adds numbers, concatenates strings

JavaScript defines some shorthand arithmetic operators:

let count = 0;             // Define a variable.
count++;                   // Increment the variable.
count--;                   // Decrement the variable.
count += 2;                // Add 2: same as count = count + 2;
count *= 3;                // Multiply by 3: same as count = count * 3;
count                      // => 6: variable names are expressions, too.

Equality and relational operators test whether two values are equal, unequal, less than, greater than, and so on. They evaluate to true or false.

let x = 2, y = 3;          // These = signs are assignment, not equality tests
x === y                    // => false: equality
x !== y                    // => true: inequality
x < y                      // => true: less-than
x <= y                     // => true: less-than or equal
x > y                      // => false: greater-than
x >= y                     // => false: greater-than or equal
"two" === "three"          // => false: the two strings are different
"two" > "three"            // => true: "tw" is alphabetically greater than "th"
false === (x > y)          // => true: false is equal to false

Logical operators combine or invert boolean values:

(x === 2) && (y === 3)     // => true: both comparisons are true. && is AND
(x > 3) || (y < 3)         // => false: neither comparison is true. || is OR
!(x === y)                 // => true: ! inverts a boolean value

Labels: Technology,Web Development,

Wednesday, July 28, 2021

Naïve Bayes Classifier for Spam Filtering

Concepts of Probability

Independent Events

Flipping a coin twice.

Dependent Events

Drawing two cards one by one from a deck without replacement. On the first draw there are 52 cards, so P(Jack of Hearts) = 1/52. At the time of the second draw, the deck has only 51 cards left. The deck has changed between the two draws because we are drawing without replacement.

Addition Rule

Multiplication Rule

Bayes Theorem

What Is the Probability of Getting “Class Ck and All the Evidence Events x1 to xN”?

x1 to xN are our evidence events, and they are all assumed to be independent in the Naïve Bayes algorithm (or classification). Applying the chain rule:

P(x1, x2, x3, C) = P(x1 | x2, x3, C) * P(x2, x3, C)
                 = P(x1 | x2, x3, C) * P(x2 | x3, C) * P(x3, C)
                 = P(x1 | x2, x3, C) * P(x2 | x3, C) * P(x3 | C) * P(C)

And if x1, x2 and x3 are independent of each other given C:

P(x1, x2, x3, C) = P(x1 | C) * P(x2 | C) * P(x3 | C) * P(C)

FRUIT PROBLEM

A fruit is long, sweet and yellow. Is it a banana? Is it an orange? Or is it some different fruit?

P(Banana | Long, Sweet, Yellow) = P(Long, Sweet, Yellow | Banana) * P(Banana) / P(Long, Sweet, Yellow)
P(Long, Sweet, Yellow | Banana) = P(Long, Sweet, Yellow, Banana) / P(Banana)

Naïve Bayes assumes all the evidence events (such as Long, Sweet, Yellow) are independent. Now, using the 'Chain Rule' alongside the 'Independence Condition':

P(Long, Sweet, Yellow, Banana) = P(Long | Banana) * P(Sweet | Banana) * P(Yellow | Banana) * P(Banana)

The same is done for P(Orange | Long, Sweet, Yellow) and P(Other Fruit | Long, Sweet, Yellow). Answer: whichever posterior P() is higher.

From the data set of 100 fruits:

P(Banana) = 50 / 100
P(Orange) = 30 / 100
P(Other) = 20 / 100
P(Long | Banana) = 40 / 50 = 0.8
P(Sweet | Banana) = 35 / 50 = 0.7
P(Yellow | Banana) = 45 / 50 = 0.9
P(Banana | Long, Sweet and Yellow)
= P(Long | Banana) * P(Sweet | Banana) * P(Yellow | Banana) * P(Banana) / (P(Long) * P(Sweet) * P(Yellow))
= 0.8 * 0.7 * 0.9 * 0.5 / P(evidence)
= 0.252 / denominator

P(Orange | Long, Sweet and Yellow) = 0 (one of the likelihood factors for Orange is 0)

P(Other Fruit | Long, Sweet and Yellow)
= P(Long | Other fruit) * P(Sweet | Other fruit) * P(Yellow | Other fruit) * P(Other Fruit) / (P(Long) * P(Sweet) * P(Yellow))
= 0.018 / denominator

Since 0.252 > 0.018 (and the denominator is the same for all three classes), the fruit is classified as a banana.
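Below is a minimal Python sketch of the same calculation. The priors and the banana likelihoods are the ones quoted above; the orange and other-fruit likelihoods are assumed values, chosen only so that the numerators come out to the quoted 0 and roughly 0.018.

priors = {"banana": 0.50, "orange": 0.30, "other": 0.20}
likelihoods = {
    "banana": {"long": 0.80, "sweet": 0.70, "yellow": 0.90},   # from the counts above
    "orange": {"long": 0.00, "sweet": 0.75, "yellow": 1.00},   # assumed: oranges are never long
    "other":  {"long": 0.25, "sweet": 0.75, "yellow": 0.50},   # assumed
}

evidence = ["long", "sweet", "yellow"]
numerators = {}
for fruit in priors:
    score = priors[fruit]
    for e in evidence:
        score *= likelihoods[fruit][e]          # multiply in P(e | fruit)
    numerators[fruit] = score                   # P(evidence | fruit) * P(fruit)

print(numerators)                               # roughly {'banana': 0.252, 'orange': 0.0, 'other': 0.019}
print(max(numerators, key=numerators.get))      # 'banana'; the common denominator P(evidence) can be ignored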
P(ham | d6) and P(spam | d6), where d6 = "good? Bad! very bad!"

P(ham | good, bad, very, bad) = P(good, bad, very, bad, ham) / P(good, bad, very, bad)
P(good, bad, very, bad, ham) = P(good | ham) * P(bad | ham) * P(very | ham) * P(bad | ham) * P(ham)
Classified as spam!
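For completeness, here is a small scikit-learn sketch of the same idea. The four training documents and their labels are made up for illustration (they are not the lecture's actual documents); only the test document mirrors d6. The classification_report call also shows the per-class precision and 'recall' metrics discussed in the practice questions below.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Hypothetical training corpus (not the documents from the lecture).
train_docs = ["good movie, very good", "not bad, good fun",
              "very bad, bad service", "bad! very bad!"]
train_labels = ["ham", "ham", "spam", "spam"]

vectorizer = CountVectorizer()                   # bag-of-words counts
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB()                            # multinomial Naive Bayes with Laplace smoothing
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["good? Bad! very bad!"])))   # expected: ['spam']

# Per-class precision and recall, here computed on the (tiny) training set itself.
print(classification_report(train_labels, clf.predict(X_train)))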

Practice Question

Ques 1: What is the assumption about the dataset on which we can apply the Naive Bayes classification algorithm?
Ans 1: That the evidence events are independent of each other (given the class).

Ques 2: What is the 'recall' metric in a classification report?
Ans 2: Recall measures how many of the actual instances of a class have been predicted correctly (or, we say, "have been recalled").

Labels: Technology,Artificial Intelligence,Machine Learning,

Sunday, July 25, 2021

Career Road Map for Artificial Intelligence & Data Science



The Data Science Skills Venn Diagram

What you saw in the previous Venn Diagram

Machine Learning = Statistics + Computer Science

It could roughly be interpreted as: "Machine Learning is doing statistics on a computer." This is not entirely wrong, as a lot of Machine Learning models have come directly from the field of Statistics, such as:

# Linear Regression
# Decision Trees
# Naïve Bayes’ Classification Model

More than that, the first step in doing Machine Learning on a data set involves doing Exploratory Data Analysis on the data, which is roughly equal to doing Descriptive Statistics and Inferential Statistics on the data. The next two equations we could write for intersections of fields are:

Traditional Software = Computer Science + Business Expertise

This roughly means that you are: Doing the business via a computer. And:

Traditional Research = Statistics + Business Expertise

This roughly means that you are: Using Statistics to understand, explain and grow your business. And the last one:

Data Science = Machine Learning + Traditional Research + Traditional Software

The Artificial Intelligence Venn Diagram

The definitions from the previous slide:

Artificial Intelligence: A program that can sense, reason, act and adapt.
Machine Learning: Algorithms whose performance improves as they are exposed to more data over time.
Deep Learning: A subset of Machine Learning in which multilayered neural networks learn from vast amounts of data.

And these definitions are not very different from what experts think of these fields:

The ‘Data Scientist vs Data Analyst vs ML Engineer vs Data Engineer’ Venn Diagram

The way to differentiate between an ML Engineer and a Data Analyst is that both know the math, but the Analyst knows more Statistics and less Programming, while the Engineer knows more Programming and less Statistics.

Data Scientist

A data scientist is responsible for pulling insights from data. It is the data scientist’s job to pull data, create models, create data products, and tell a story. A data scientist should typically have interactions with customers and/or executives. A data scientist should love scrubbing a dataset for more and more understanding. The main goal of a data scientist is to produce data products and tell the stories of the data. A data scientist would typically have stronger statistics and presentation skills than a data engineer.

Data Engineer

Data Engineering is more focused on the systems that store and retrieve data. A data engineer will be responsible for building and deploying storage systems that can adequately handle the needs. Sometimes the needs are fast real-time incoming data streams. Other times the needs are massive amounts of large video files. Still other times the needs are many reads of the data. In other words, a data engineer needs to build systems that can handle the 4 Vs of Big Data (Volume, Velocity, Variety and Veracity). The main goal of a data engineer is to make sure the data is properly stored and available to the data scientist and others that need access. A data engineer would typically have stronger software engineering and programming skills than a data scientist. Labels: Technology,Machine Learning,Artificial Intelligence,

Itch Guard Plus Cream (Menthol and Terbinafine)



Itch Guard Plus Cream
Manufacturer: Reckitt Benckiser
Pack Sizes: 
- 12 gm Cream / ₹75
- 20 gm Cream / ₹99

Product highlights

- It helps kill fungus and inhibits it from spreading further

- It can be used to treat jock itch

- It works by killing the fungi on the skin by destroying their cell membrane

Composition

1. Terbinafine Hydrochloride
2. Menthol

Terbinafine Uses

Terbinafine is used in the treatment of fungal infections.

How Terbinafine works 

Terbinafine is an antifungal medication. It kills and stops the growth of the fungi by destroying its cell membrane, thereby treating your skin infection.

Common side effects of Terbinafine

Headache, Diarrhea, Rash, Indigestion, Abnormal liver enzyme, Itching, Taste change, Nausea, Abdominal pain, Flatulence

Terbinafine

From Wikipedia (retrieved 2021-07-25): Terbinafine, sold under the brand name Lamisil among others, is an antifungal medication used to treat pityriasis versicolor, fungal nail infections, and ringworm, including jock itch and athlete's foot. It is either taken by mouth or applied to the skin as a cream or ointment. The cream and ointment are not effective for nail infections. Common side effects when taken by mouth include nausea, diarrhea, headache, cough, rash, and elevated liver enzymes. Severe side effects include liver problems and allergic reactions. Liver injury is, however, unusual. Use during pregnancy is not typically recommended. The cream and ointment may result in itchiness but are generally well tolerated. Terbinafine is in the allylamines family of medications. It works by decreasing the ability of fungi to make sterols. It appears to result in fungal cell death. Terbinafine was discovered in 1991. It is on the World Health Organization's List of Essential Medicines. In 2017, it was the 307th most commonly prescribed medication in the United States, with more than one million prescriptions.

Menthol

Menthol is an organic compound made synthetically or obtained from the oils of corn mint, peppermint, or other mints. It is a waxy, crystalline substance, clear or white in color, which is solid at room temperature and melts slightly above. The main form of menthol occurring in nature is (−)-menthol, which is assigned the (1R,2S,5R) configuration. Menthol has local anesthetic and counterirritant qualities, and it is widely used to relieve minor throat irritation. Menthol also acts as a weak κ-opioid receptor agonist. In 2017, it was the 193rd most commonly prescribed medication in the United States, with more than two million prescriptions. Labels: Medicine,Science,

Thursday, July 22, 2021

Normalizing your vocabulary (lexicon) for NLP application



Why normalize our vocabulary:

1. To reduce the vocabulary size, as vocabulary size is important to the performance of an NLP pipeline.
2. So that tokens that mean similar things are combined into a single, normalized form.
3. It improves the association of meaning across those different “spellings” of a token or n-gram in your corpus.
4. Reducing your vocabulary can reduce the likelihood of overfitting.

Vocabulary is normalized in the following ways:

a) CASE FOLDING (aka case normalization)

Case folding is when you consolidate multiple “spellings” of a word that differ only in their capitalization. To preserve the meaning of proper nouns, a better approach for case normalization is to lowercase only the first word of a sentence and allow all other words to retain their capitalization, such as “Joe” and “Smith” in “Joe Smith”.
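A minimal Python sketch of this approach, assuming the text is already split into sentences (the helper name and sample sentence are mine, for illustration):

>>> def casefold_sentence(sentence):
...     # Lowercase only the first word; "Joe" and "Smith" elsewhere keep their capitalization.
...     words = sentence.split()
...     if words:
...         words[0] = words[0].lower()
...     return " ".join(words)
...
>>> casefold_sentence("The house that Joe Smith built in Portland")
'the house that Joe Smith built in Portland'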

b) STEMMING

Another common vocabulary normalization technique is to eliminate the small meaning differences of pluralization or possessive endings of words, or even various verb forms. This normalization, identifying a common stem among various forms of a word, is called stemming. For example, the words housing and houses share the same stem, house. Stemming removes suffixes from words in an attempt to combine words with similar meanings together under their common stem. A stem isn’t required to be a properly spelled word, but merely a token, or label, representing several possible spellings of a word. Stemming is important for keyword search or information retrieval. It allows you to search for “developing houses in Portland” and get web pages or documents that use both the word “house” and “houses” and even the word “housing.”
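A quick illustration with NLTK's Porter stemmer (the word list is mine; note that the stem it produces is not required to be a dictionary word):

>>> from nltk.stem.porter import PorterStemmer
>>> stemmer = PorterStemmer()
>>> [stemmer.stem(w) for w in ["house", "houses", "housing"]]
['hous', 'hous', 'hous']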

# How does stemming affect precision and recall of a search engine?

This broadening of your search results would be a big improvement in the “recall” score for how well your search engine is doing its job at returning all the relevant documents. But stemming could greatly reduce the “precision” score for your search engine, because it might return many more irrelevant documents along with the relevant ones. In some applications this “false-positive rate” (proportion of the pages returned that you don’t find useful) can be a problem. So most search engines allow you to turn off stemming and even case normalization by putting quotes around a word or phrase. Quoting indicates that you only want pages containing the exact spelling of a phrase, such as “‘Portland Housing Development software.’” That would return a different sort of document than one that talks about a “‘a Portland software developer’s house’”.

c) LEMMATIZATION

If you have access to information about connections between the meanings of various words, you might be able to associate several words together even if their spelling is quite different. This more extensive normalization down to the semantic root of a word—its lemma—is called lemmatization.

# Lemmatization and its use in the chatbot pipeline:

Any NLP pipeline that wants to “react” the same for multiple different spellings of the same basic root word can benefit from a lemmatizer. It reduces the number of words you have to respond to, the dimensionality of your language model. Using it can make your model more general, but it can also make your model less precise, because it will treat all spelling variations of a given root word the same. For example, “chat,” “chatter,” “chatty,” “chatting,” and perhaps even “chatbot” would all be treated the same in an NLP pipeline with lemmatization, even though they have different meanings. Likewise, “bank,” “banked,” and “banking” would be treated the same by a stemming pipeline, despite the river meaning of “bank,” the motorcycle meaning of “banked,” and the finance meaning of “banking.” Lemmatization is a potentially more accurate way to normalize a word than stemming or case normalization because it takes into account a word’s meaning. A lemmatizer uses a knowledge base of word synonyms and word endings to ensure that only words that mean similar things are consolidated into a single token.

# Lemmatization and POS (Part of speech) Tagging

Some lemmatizers use the word’s part of speech (POS) tag in addition to its spelling to help improve accuracy. The POS tag for a word indicates its role in the grammar of a phrase or sentence. For example, the noun POS is for words that refer to “people, places, or things” within a phrase. An adjective POS is for a word that modifies or describes a noun. A verb refers to an action. The POS of a word in isolation cannot be determined; the context of a word must be known for its POS to be identified. So some advanced lemmatizers can’t be run on words in isolation.

>>> import nltk
>>> nltk.download('wordnet')
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize("better")            # Default 'pos' is noun.
'better'
>>> lemmatizer.lemmatize("better", pos="a")   # "a" --> adjective
'good'
>>> lemmatizer.lemmatize("goods", pos="n")
'good'
>>> lemmatizer.lemmatize("goods", pos="a")
'goods'
>>> lemmatizer.lemmatize("good", pos="a")
'good'
>>> lemmatizer.lemmatize("goodness", pos="n")
'goodness'
>>> lemmatizer.lemmatize("best", pos="a")
'best'

Difference between stemming and lemmatization:

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is => be
car, cars, car's, cars' => car

The result of this mapping of text will be something like:

the boy's cars are different colors => the boy car be differ color
However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.

The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.

Ref: nlp.stanford.edu
Labels: Technology,Natural Language Processing,Python,

VADER: Rule Based Approach to Sentiment Analysis



# VADER stands for Valence Aware Dictionary for sEntiment Reasoning.

# VADER is a rule-based approach towards doing sentiment analysis.

# As of NLTK version 3.6 (checked on 2021-07-21), NLTK uses VADER for sentiment analysis.

VADER: In Code

Try the following code snippets in the Python CLI:

>>> from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
>>> sa = SentimentIntensityAnalyzer()
>>> sa.lexicon
{ ...
  ':(': -1.9,
  ':)': 2.0,
  ...
  'pls': 0.3,
  'plz': 0.3,
  ...
  'great': 3.1,
  ... }
>>> [(tok, score) for tok, score in sa.lexicon.items() if " " in tok]
[("( '}{' )", 1.6), ("can't stand", -2.0), ('fed up', -1.8), ('screwed up', -1.5)]
>>> sa.polarity_scores(text="Python is very readable and it's great for NLP.")
{'compound': 0.6249, 'neg': 0.0, 'neu': 0.661, 'pos': 0.339}
>>> sa.polarity_scores(text="Python is not a bad choice for most applications.")
{'compound': 0.431, 'neg': 0.0, 'neu': 0.711, 'pos': 0.289}
>>> corpus = ["Absolutely perfect! Love it! :-) :-) :-)",
...           "Horrible! Completely useless. :(",
...           "It was OK. Some good and some bad things."]
>>> for doc in corpus:
...     scores = sa.polarity_scores(doc)
...     print('{:+}: {}'.format(scores['compound'], doc))
+0.9428: Absolutely perfect! Love it! :-) :-) :-)
-0.8768: Horrible! Completely useless. :(
+0.3254: It was OK. Some good and some bad things.

DRAWBACK OF VADER

The drawback of VADER is that it doesn’t look at all the words in a document, only about 7,500. The questions that remained unanswered by VADER are: What if you want all the words to help add to the sentiment score? And what if you don’t want to have to code your own understanding of the words in a dictionary of thousands of words or add a bunch of custom words to the dictionary in SentimentIntensityAnalyzer.lexicon? The rule-based approach might be impossible if you don’t understand the language, because you wouldn’t know what scores to put in the dictionary (lexicon)!

References

# “VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text” by Hutto and Gilbert: http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
# You can find more detailed installation instructions with the package source code on GitHub: https://github.com/cjhutto/vaderSentiment
Labels: Technology,Natural Language Processing,Python,

Wednesday, July 21, 2021

Command 'git merge'



Code Legend:
Black: main branch
Dark gray: test_branch 
   

Part 1: "git clone -b test_branch"

~\git_exp\test_branch>git clone -b test_branch https://github.com/ashishjain1547/repo_for_testing.git
Cloning into 'repo_for_testing'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 23 (delta 7), reused 14 (delta 3), pack-reused 0
Unpacking objects: 100% (23/23), 6.72 KiB | 5.00 KiB/s, done.

~\git_exp\test_branch>cd repo_for_testing

~\git_exp\test_branch\repo_for_testing>git branch
* test_branch

~\git_exp\test_branch\repo_for_testing>dir
 Volume in drive C is Windows
 Volume Serial Number is 8139-90C0

 Directory of ~\git_exp\test_branch\repo_for_testing

07/21/2021  12:26 PM    <DIR>          .
07/21/2021  12:26 PM    <DIR>          ..
07/21/2021  12:26 PM               368 .gitignore
07/21/2021  12:26 PM                30 20210528_test_branch.txt
07/21/2021  12:26 PM                17 202107141543.txt
07/21/2021  12:26 PM                17 202107141608.txt
07/21/2021  12:26 PM            11,558 LICENSE
07/21/2021  12:26 PM                11 newFile.txt
07/21/2021  12:26 PM                38 README.md
07/21/2021  12:26 PM                23 test_file_20210528.txt
               8 File(s)         12,062 bytes
               2 Dir(s)  56,473,489,408 bytes free

Part 2: "git clone" Default

~\git_exp\main>git clone https://github.com/ashishjain1547/repo_for_testing.git
Cloning into 'repo_for_testing'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 23 (delta 7), reused 14 (delta 3), pack-reused 0
Unpacking objects: 100% (23/23), 6.72 KiB | 5.00 KiB/s, done.

~\git_exp\main>cd repo_for_testing

~\git_exp\main\repo_for_testing>git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

~\git_exp\main\repo_for_testing>git branch -a
* main
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch

Part 3: Create new file in "test_branch"

~\git_exp\test_branch\repo_for_testing>echo "202107211228" > 202107211228.txt

~\git_exp\test_branch\repo_for_testing>dir
 Volume in drive C is Windows
 Volume Serial Number is 8139-90C0

 Directory of ~\git_exp\test_branch\repo_for_testing

07/21/2021  12:28 PM    <DIR>          .
07/21/2021  12:28 PM    <DIR>          ..
07/21/2021  12:26 PM               368 .gitignore
07/21/2021  12:26 PM                30 20210528_test_branch.txt
07/21/2021  12:26 PM                17 202107141543.txt
07/21/2021  12:26 PM                17 202107141608.txt
07/21/2021  12:28 PM                17 202107211228.txt
07/21/2021  12:26 PM            11,558 LICENSE
07/21/2021  12:26 PM                11 newFile.txt
07/21/2021  12:26 PM                38 README.md
07/21/2021  12:26 PM                23 test_file_20210528.txt
               9 File(s)         12,079 bytes
               2 Dir(s)  56,473,849,856 bytes free

~\git_exp\test_branch\repo_for_testing>git status
On branch test_branch
Your branch is up to date with 'origin/test_branch'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        202107211228.txt

nothing added to commit but untracked files present (use "git add" to track)

~\git_exp\test_branch\repo_for_testing>git add -A

~\git_exp\test_branch\repo_for_testing>git commit -m "20210721 1229"
[test_branch 087a5ca] 20210721 1229
 1 file changed, 1 insertion(+)
 create mode 100644 202107211228.txt

~\git_exp\test_branch\repo_for_testing>git push
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 289 bytes | 289.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/ashishjain1547/repo_for_testing.git
   9017804..087a5ca  test_branch -> test_branch

~\git_exp\test_branch\repo_for_testing>git status
On branch test_branch
Your branch is up to date with 'origin/test_branch'.

nothing to commit, working tree clean

Part 4: Git Metadata About the Files and the Use of 'git pull origin' to Update this Metadata

~\git_exp\main\repo_for_testing>git branch -a
* main
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch

~\git_exp\main\repo_for_testing>git checkout test_branch
Switched to a new branch 'test_branch'
Branch 'test_branch' set up to track remote branch 'test_branch' from 'origin'.

~\git_exp\main\repo_for_testing>git branch -a
  main
* test_branch
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch

~\git_exp\main\repo_for_testing>git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.

~\git_exp\main\repo_for_testing>git merge test_branch
Already up to date.

Part 5: 'git pull origin' and then 'git merge'

~\git_exp\main\repo_for_testing>git pull origin
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 1), reused 3 (delta 1), pack-reused 0
Unpacking objects: 100% (3/3), 269 bytes | 0 bytes/s, done.
From https://github.com/ashishjain1547/repo_for_testing
   9017804..087a5ca  test_branch -> origin/test_branch
Already up to date.

~\git_exp\main\repo_for_testing>git branch -D test_branch
Deleted branch test_branch (was 9017804).

~\git_exp\main\repo_for_testing>git checkout test_branch
Switched to a new branch 'test_branch'
Branch 'test_branch' set up to track remote branch 'test_branch' from 'origin'.

~\git_exp\main\repo_for_testing>git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.

~\git_exp\main\repo_for_testing>git merge test_branch
Merge made by the 'recursive' strategy.
 202107211228.txt | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 202107211228.txt

~\git_exp\main\repo_for_testing>git push
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 231 bytes | 231.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/ashishjain1547/repo_for_testing.git
   d210505..9f1b42f  main -> main

The above step puts the "main" branch ahead of "test_branch" by two commits.

Part 6: Bringing 'test_branch' on level with 'main' branch using 'git merge'

~\git_exp\test_branch\repo_for_testing>git branch
* test_branch

~\git_exp\test_branch\repo_for_testing>git branch -a
* test_branch
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch

~\git_exp\test_branch\repo_for_testing>git pull origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 211 bytes | 1024 bytes/s, done.
From https://github.com/ashishjain1547/repo_for_testing
   d210505..9f1b42f  main -> origin/main
Already up to date.

~\git_exp\test_branch\repo_for_testing>git checkout main
Switched to a new branch 'main'
Branch 'main' set up to track remote branch 'main' from 'origin'.

~\git_exp\test_branch\repo_for_testing>git checkout test_branch
Switched to branch 'test_branch'
Your branch is up to date with 'origin/test_branch'.

~\git_exp\test_branch\repo_for_testing>git merge main
Updating 087a5ca..9f1b42f
Fast-forward

~\git_exp\test_branch\repo_for_testing>git status
On branch test_branch
Your branch is ahead of 'origin/test_branch' by 2 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean

~\git_exp\test_branch\repo_for_testing>git push
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/ashishjain1547/repo_for_testing.git
   087a5ca..9f1b42f  test_branch -> test_branch

~\git_exp\test_branch\repo_for_testing>git branch -a
  main
* test_branch
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch

Labels: Technology,GitHub,

Tuesday, July 20, 2021

Session 1 on ‘Understanding, Analyzing and Generating Text'



Here, we focus on only one natural language, English, and only one programming language, Python.

The Way We Understand Language and How Machines See it is Quite Different

Natural languages have an additional “decoding” challenge (apart from the ‘Information Extraction’ from it) that is even harder to solve. Speakers and writers of natural languages assume that a human is the one doing the processing (listening or reading), not a machine. So when I say “good morning”, I assume that you have some knowledge about what makes up a morning, including not only that mornings come before noons and afternoons and evenings but also after midnights. And you need to know they can represent times of day as well as general experiences of a period of time. The interpreter is assumed to know that “good morning” is a common greeting that doesn’t contain much information at all about the morning. Rather it reflects the state of mind of the speaker and her readiness to speak with others.
TIP: The “r” before the quote specifies a raw string, not a regular expression. With a Python raw string, you can send backslashes directly to the regular expression compiler without having to double-backslash ("\\") all the special regular expression characters such as spaces ("\\ ") and curly braces or handlebars("\\{ \\}").

Architecture of a Chatbot

A chatbot requires four kinds of processing as well as a database to maintain a memory of past statements and responses. Each of the four processing stages can contain one or more processing algorithms working in parallel or in series (see figure 1.3):

1. Parse: Extract features, structured numerical data, from natural language text.
2. Analyze: Generate and combine features by scoring text for sentiment, grammaticality, and semantics.
3. Generate: Compose possible responses using templates, search, or language models.
4. Execute: Plan statements based on conversation history and objectives, and select the next response.

The Way Rasa Identifies a Greeting or Good-bye

How does Rasa understand your greetings?

An image taken from “rasa interactive” command output of our conversation.

IQ of some Natural Language Processing systems

In this image, we see that the bots working at depth are domain-specific bots.

For the fundamental building blocks of NLP, there are equivalents in a computer language compiler

# tokenizer -- scanner, lexer, lexical analyzer
# vocabulary -- lexicon
# parser -- compiler
# token, term, word, or n-gram -- token, symbol, or terminal symbol

A quick-and-dirty example of a ‘Tokenizer’ using str.split()

>>> import numpy as np
>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> token_sequence = str.split(sentence)
>>> vocab = sorted(set(token_sequence))
>>> ', '.join(vocab)
'26., Jefferson, Monticello, Thomas, age, at, began, building, of, the'
>>> num_tokens = len(token_sequence)
>>> vocab_size = len(vocab)
>>> onehot_vectors = np.zeros((num_tokens, vocab_size), int)
>>> for i, word in enumerate(token_sequence):
...     onehot_vectors[i, vocab.index(word)] = 1
>>> ' '.join(vocab)
'26. Jefferson Monticello Thomas age at began building of the'
>>> onehot_vectors

One-Hot Vectors and Memory Requirement

Let’s run through the math to give you an appreciation for just how big and unwieldy these “player piano paper rolls” are. In most cases, the vocabulary of tokens you’ll use in an NLP pipeline will be much more than 10,000 or 20,000 tokens. Sometimes it can be hundreds of thousands or even millions of tokens. Let’s assume you have a million tokens in your NLP pipeline vocabulary. And let’s say you have a meager 3,000 books with 3,500 sentences each and 15 words per sentence—reasonable averages for short books. That’s a whole lot of big tables (matrices): The example below is assuming that we have a million tokens (words in our vocabulary):
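A back-of-the-envelope version of that math, under the stated assumptions (3,000 books, 3,500 sentences each, 15 words per sentence, a 1,000,000-token vocabulary, and a bare minimum of 1 byte per matrix cell):

num_books = 3_000
sentences_per_book = 3_500
words_per_sentence = 15
vocab_size = 1_000_000

num_token_positions = num_books * sentences_per_book * words_per_sentence  # 157,500,000 one-hot rows
num_cells = num_token_positions * vocab_size                               # one column per vocabulary word
bytes_needed = num_cells * 1                                               # even at just 1 byte per cell
print(f"{bytes_needed / 1e12:.1f} TB")                                     # ~157.5 TB of one-hot vectors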

Document-Term Matrix

The one-hot vector based representation of sentences in the previous slide is a concept very similar to a “Document-Term” matrix.

For Tokenization: Use NLTK (Natural Language Toolkit)

You can use the NLTK function RegexpTokenizer to replicate your simple tokenizer example like this:
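The snippet referred to above appears to have been dropped from the post; here is a sketch along those lines (the regular expression is one plausible choice, not necessarily the exact one originally used):

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|$[0-9.]+|\S+')
>>> tokenizer.tokenize("Thomas Jefferson began building Monticello at the age of 26.")
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '.']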
An even better tokenizer is the Treebank Word Tokenizer from the NLTK package. It incorporates a variety of common rules for English word tokenization. For example, it separates phrase-terminating punctuation (?!.;,) from adjacent tokens and retains decimal numbers containing a period as a single token. In addition it contains rules for English contractions. For example “don’t” is tokenized as ["do", "n’t"]. This tokenization will help with subsequent steps in the NLP pipeline, such as stemming.
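A small example of the Treebank tokenizer (the sample sentence is mine, for illustration):

>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize("Monticello wasn't designated as a World Heritage Site until 1987.")
['Monticello', 'was', "n't", 'designated', 'as', 'a', 'World', 'Heritage', 'Site', 'until', '1987', '.']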

Stemming and lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is => be
car, cars, car's, cars' => car

The result of this mapping of text will be something like:

the boy's cars are different colors => the boy car be differ color
However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.

The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.

Ref: nlp.stanford.edu

CONTRACTIONS

You might wonder why you would split the contraction wasn’t into was and n’t. For some applications, like grammar-based NLP models that use syntax trees, it’s important to separate the words was and not to allow the syntax tree parser to have a consistent, predictable set of tokens with known grammar rules as its input. There are a variety of standard and nonstandard ways to contract words. By reducing contractions to their constituent words, a dependency tree parser or syntax parser only need be programmed to anticipate the various spellings of individual words rather than all possible contractions.

Tokenize informal text from social networks such as Twitter and Facebook

The NLTK library includes a tokenizer, casual_tokenize, that was built to deal with short, informal, emoticon-laced texts from social networks where grammar and spelling conventions vary widely. The casual_tokenize function allows you to strip usernames and reduce the number of repeated characters within a token:

>>> from nltk.tokenize.casual import casual_tokenize
>>> message = """RT @TJMonticello Best day everrrrrrr at Monticello.\
... Awesommmmmmeeeeeeee day :*)"""
>>> casual_tokenize(message)
['RT', '@TJMonticello', 'Best', 'day', 'everrrrrrr', 'at', 'Monticello', '.',
 'Awesommmmmmeeeeeeee', 'day', ':*)']
>>> casual_tokenize(message, reduce_len=True, strip_handles=True)
['RT', 'Best', 'day', 'everrr', 'at', 'Monticello', '.',
 'Awesommmeee', 'day', ':*)']

n-gram tokenizer from nltk in action
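The code this heading refers to seems to be missing from the post; below is a small sketch using nltk.util.ngrams, reusing the tokens of the Jefferson sentence from earlier:

>>> from nltk.util import ngrams
>>> tokens = ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '.']
>>> list(ngrams(tokens, 2))[:3]
[('Thomas', 'Jefferson'), ('Jefferson', 'began'), ('began', 'building')]
>>> [" ".join(pair) for pair in ngrams(tokens, 2)][:3]
['Thomas Jefferson', 'Jefferson began', 'began building']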

You might be able to sense a problem here. Looking at your earlier example, you can imagine that the token “Thomas Jefferson” will occur across quite a few documents. However the 2-grams “of 26” or even “Jefferson began” will likely be extremely rare. If tokens or n-grams are extremely rare, they don’t carry any correlation with other words that you can use to help identify topics or themes that connect documents or classes of documents. So rare n-grams won’t be helpful for classification problems. You can imagine that most 2-grams are pretty rare—even more so for 3- and 4-grams.

Problem of rare n-grams

Because word combinations are rarer than individual words, your vocabulary size is exponentially approaching the number of n-grams in all the documents in your corpus. If your feature vector dimensionality exceeds the length of all your documents, your feature extraction step is counterproductive. It’ll be virtually impossible to avoid overfitting a machine learning model to your vectors; your vectors have more dimensions than there are documents in your corpus. In chapter 3, you’ll use document frequency statistics to identify n-grams so rare that they are not useful for machine learning. Typically, n-grams are filtered out that occur too infrequently (for example, in three or fewer different documents). This scenario is represented by the “rare token” filter in the coin-sorting machine of chapter 1.

Problem of common n-grams

Now consider the opposite problem. Consider the 2-gram “at the” in the previous phrase. That’s probably not a rare combination of words. In fact it might be so common, spread among most of your documents, that it loses its utility for discriminating between the meanings of your documents. It has little predictive power. Just like words and other tokens, n-grams are usually filtered out if they occur too often. For example, if a token or n-gram occurs in more than 25% of all the documents in your corpus, you usually ignore it. This is equivalent to the “stop words” filter in the coin-sorting machine of chapter 1. These filters are as useful for n-grams as they are for individual tokens. In fact, they’re even more useful.

STOP WORDS

Stop words are common words in any language that occur with a high frequency but carry much less substantive information about the meaning of a phrase. Examples of some common stop words include:

# a, an
# the, this
# and, or
# of, on

A more comprehensive list of stop words for various languages can be found in NLTK’s corpora (stopwords.zip). Historically, stop words have been excluded from NLP pipelines in order to reduce the computational effort to extract information from a text. Even though the words themselves carry little information, stop words can provide important relational information as part of an n-gram. Consider these two examples:

# Mark reported to the CEO
# Suzanne reported as the CEO to the board

Also, some stop word lists contain the word ‘not’; with such a list, “feeling cold” and “not feeling cold” would both be reduced to “feeling cold” by a stop words filter.

Ref: stop words removal using nltk, spacy and gensim

Stop Words Removal

Designing a filter for stop words depends on your application. Vocabulary size will drive the computational complexity and memory requirements of all subsequent steps in the NLP pipeline. But stop words are only a small portion of your total vocabulary size. A typical stop word list has only 100 or so frequent and unimportant words listed in it. But a vocabulary size of 20,000 words would be required to keep track of 95% of the words seen in a large corpus of tweets, blog posts, and news articles. And that's just for 1-grams or single-word tokens. A 2-gram vocabulary designed to catch 95% of the 2-grams in a large English corpus will generally have more than 1 million unique 2-gram tokens in it.

You may be worried that vocabulary size drives the required size of any training set you must acquire to avoid overfitting to any particular word or combination of words. And you know that the size of your training set drives the amount of processing required to process it all. However, getting rid of 100 stop words out of 20,000 isn't going to significantly speed up your work. And for a 2-gram vocabulary, the savings you'd achieve by removing stop words is minuscule. In addition, for 2-grams you lose a lot more information when you get rid of stop words arbitrarily, without checking for the frequency of the 2-grams that use those stop words in your text. For example, you might miss mentions of "The Shining" as a unique title and instead treat texts about that violent, disturbing movie the same as you treat documents that mention "Shining Light" or "shoe shining."

So if you have sufficient memory and processing bandwidth to run all the NLP steps in your pipeline on the larger vocabulary, you probably don't want to worry about ignoring a few unimportant words here and there. And if you're worried about overfitting a small training set with a large vocabulary, there are better ways to select your vocabulary or reduce your dimensionality than ignoring stop words. Including stop words in your vocabulary allows the document frequency filters (discussed in chapter 3) to more accurately identify and ignore the words and n-grams with the least information content within your particular domain.

Stop Words in Code

>>> stop_words = ['a', 'an', 'the', 'on', 'of', 'off', 'this', 'is']
>>> tokens = ['the', 'house', 'is', 'on', 'fire']
>>> tokens_without_stopwords = [x for x in tokens if x not in stop_words]
>>> print(tokens_without_stopwords)
['house', 'fire']

Stop Words From NLTK and Scikit-Learn

Code for “Stop Words From NLTK and Scikit-Learn”:

>>> import nltk
>>> nltk.download('stopwords')
>>> stop_words = nltk.corpus.stopwords.words('english')
>>> len(stop_words)
179
>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
>>> len(sklearn_stop_words)
318
>>> len(set(stop_words).union(sklearn_stop_words))
378
>>> len(set(stop_words).intersection(sklearn_stop_words))
119

Labels: Artificial Intelligence,Natural Language Processing,Python,Technology,

Sunday, July 18, 2021

Journal (2011-Jan-05)



Index of Journals
5 January 2011

I went to bed around 0000 last night to get the usual half-hour long rest and I just didn’t put the alarm the sound. What happened next was obvious. I woke up around six, lucky me. At 0630 I was running to go to bath. And amma asked the usual question of when I would leave. Though I’d tell her that I’m in hurry but when was she getting the food ready before I come?

I reached college on time around 0800 and I just sat in the stairs of that closed building to study my left stuff. I had left almost half of what I had planned to do and I had planned to do two-third of the whole. 

I wrote the exam nicely, though I could have got along with the teacher who was staring me too much but I didn’t. That was because she saw me talking to myself.

I came home straight away, trying to keep away from girls though I love them all. I, kind of, feel like I am not doing justice with them some times.  
I was home around 1500 and then I was watching TV till four till I went to bed. 

When I’ll get over with this movie and after dinner it’ll be nine!

God Bless ‘Me’
Ashish

Journal (2011-Jan-04)



Index of Journals
4 January 2011

I went back to bed at, even before, 2300. That was silly, but I just couldn’t sit in bed for longer. I woke up around nine in the morning. I was awake for a second at five in morning but I was again out of any sense of tension. I slept back. I had nightmare, I been having them for quite some time now. 

I took the books taking the day lightly and I have been going very slowly till now. I came to watch television way before six-half and, it’s seven-half now. I better go now. 

I texted Vibha back last night and she’s still ready to comeback even after how I had ignored her on New Year eve. I am confused what to do with her; I am just not letting her go that easily.

I feel like I am such a big fool to dream of ever becoming big with this big useless thing in my room, Prashant. He is sick, and no one except his biological relations can tolerate his shit.

God Bless ‘Me’
Ashish