Concepts of Probability
Independent Events
Flipping a coin twice.
Dependent Events
Drawing two cards one by one from a deck without replacement.
First draw: 52 cards, so P(Jack of hearts) = 1/52.
At the time of the second draw, the deck is left with 51 cards.
So the deck at the time of the second draw has changed, because we are drawing without replacement.
Addition Rule
Multiplication Rule
Bayes' Theorem
What Is The Probability Of Getting “Class Ck And All The Evidence Events 1 To N”?
X1 to XN are our evidence events, and they are all independent, as assumed in the Naïve Bayes algorithm (or classification).

P(x1, x2, x3, C) = P(x1 | x2, x3, C) * P(x2, x3, C)
RHS = P(x1 | x2, x3, C) * P(x2 | x3, C) * P(x3, C)
RHS = P(x1 | x2, x3, C) * P(x2 | x3, C) * P(x3 | C) * P(C)

And if x1, x2 and x3 are independent of each other (given C):
RHS = P(x1 | C) * P(x2 | C) * P(x3 | C) * P(C)
FRUIT PROBLEM
A fruit is long, sweet and yellow. Is it a banana? Is it an orange? Or is it some different fruit?

P(Banana | Long, Sweet, Yellow) = (P(Long, Sweet, Yellow | Banana) * P(Banana)) / P(L, S, Y)
P(L, S, Y | B) = P(L, S, Y, B) / P(B)

Naïve Bayes => all the evidence events (such as L, S, Y) are independent (given the class).
Now, using the 'Chain Rule' alongside the 'Independence Condition':
=> P(L, S, Y, B) = P(L|B) * P(S|B) * P(Y|B) * P(B)

Similarly for P(Orange | Long, Sweet, Yellow).
Answer: whichever P() is higher.

P(Banana) = 50 / 100
P(Orange) = 30 / 100
P(Other) = 20 / 100
P(Long | Banana) = 40 / 50 = 0.8
P(Sweet | Banana) = 35 / 50 = 0.7
P(Yellow | Banana) = 45 / 50 = 0.9

P(Banana | Long, Sweet and Yellow) = P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) * P(Banana) / (P(Long) * P(Sweet) * P(Yellow))
= 0.8 * 0.7 * 0.9 * 0.5 / P(evidence) = 0.252 / denom

P(Orange | Long, Sweet and Yellow) = 0

P(Other Fruit | Long, Sweet and Yellow) = P(Long|Other fruit) * P(Sweet|Other fruit) * P(Yellow|Other fruit) * P(Other Fruit) / (P(Long) * P(Sweet) * P(Yellow))
= 0.018 / denom

Since 0.252 > 0.018 > 0, the fruit is classified as a banana.

SPAM EXAMPLE: P(ham | d6) and P(spam | d6)
d6: "good? Bad! very bad!"
P(ham | good, bad, very, bad) = P(good, bad, very, bad, ham) / P(good, bad, very, bad)
P(good, bad, very, bad, ham) = P(good|ham) * P(bad|ham) * P(very|ham) * P(bad|ham) * P(ham)
Classified as spam!
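As a quick check of the arithmetic above, here is a minimal Python sketch of my own; the shared denominator P(Long) * P(Sweet) * P(Yellow) is dropped because it is the same for every class:

>>> scores = {
...     "banana": 0.8 * 0.7 * 0.9 * 0.5,   # = 0.252, likelihoods and prior from the table above
...     "orange": 0.0,                      # P(Long | Orange) = 0, so the whole product is 0
...     "other": 0.018,                     # numerator quoted above for "Other Fruit"
... }
>>> max(scores, key=scores.get)
'banana'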
Practice Question
Ques 1: What is the assumption about the dataset on which we can apply the Naive Bayes classification algorithm?
Ans 1: That the evidence events should be independent of each other (given the class).
Ques 2: What is the 'recall' metric in a classification report?
Ans 2: Recall: of all the actual instances of a class, how many have been predicted correctly (or, we say, "have been recalled").
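To illustrate the recall answer, here is a small sketch of my own using scikit-learn's classification_report (the labels below are made up):

>>> from sklearn.metrics import classification_report
>>> y_true = ["spam", "ham", "ham", "spam", "ham", "spam"]   # hypothetical ground truth
>>> y_pred = ["spam", "ham", "spam", "spam", "ham", "ham"]   # hypothetical predictions
>>> # Recall per class = correctly predicted instances of that class / all actual instances of that class.
>>> # Here each class has 3 actual instances and 2 correct predictions, so recall is 2/3 for both.
>>> print(classification_report(y_true, y_pred))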
Wednesday, July 28, 2021
Naïve Bayes Classifier for Spam Filtering
Sunday, July 25, 2021
Career Road Map for Artificial Intelligence & Data Science
The Data Science Skills Venn Diagram
What you saw in the previous Venn Diagram
Machine Learning = Statistics + Computer Science

It could roughly be interpreted as: “Machine Learning is doing statistics on a computer”. That is not entirely wrong, as a lot of Machine Learning models have come directly from the field of Statistics, such as:
# Linear Regression
# Decision Trees
# Naïve Bayes’ Classification Model

More than that, the first step in doing Machine Learning on a data set involves doing Exploratory Data Analysis on the data. This is roughly equal to doing Descriptive Statistics and Inferential Statistics on the data.

The next two equations we could write for intersections of fields are:
Traditional Software = Computer Science + Business Expertise
This roughly means that you are: doing the business via a computer.
And:
Traditional Research = Statistics + Business Expertise
This roughly means that you are: using Statistics to understand, explain and grow your business.
And the last one:
Data Science = Machine Learning + Traditional Research + Traditional Software
The Artificial Intelligence Venn Diagram
The Definitions From The Previous Slide
Artificial Intelligence: A program that can sense, reason, act and adapt.
Machine Learning: Algorithms whose performance improves as they are exposed to more data over time.
Deep Learning: A subset of Machine Learning in which multilayered neural networks learn from vast amounts of data.
And these definitions are not very different from what experts think of these fields:
The ‘Data Scientist vs Data Analyst vs ML Engineer vs Data Engineer’ Venn Diagram
The way to differentiate between an ML Engineer and a Data Analyst is that both of them know the Math, but the Analyst knows more Statistics and less Programming, while the Engineer knows more Programming and less Statistics.
Data Scientist
A data scientist is responsible for pulling insights from data. It is the data scientist’s job to pull data, create models, create data products, and tell a story. A data scientist should typically have interactions with customers and/or executives. A data scientist should love scrubbing a dataset for more and more understanding. The main goal of a data scientist is to produce data products and tell the stories of the data. A data scientist would typically have stronger statistics and presentation skills than a data engineer.
Data Engineer
Data Engineering is more focused on the systems that store and retrieve data. A data engineer will be responsible for building and deploying storage systems that can adequately handle the needs. Sometimes the needs are fast real-time incoming data streams. Other times the needs are massive amounts of large video files. Still other times the needs are many reads of the data. In other words, a data engineer needs to build systems that can handle the 4 Vs of Big Data (Volume, Velocity, Variety and Veracity). The main goal of a data engineer is to make sure the data is properly stored and available to the data scientists and others who need access. A data engineer would typically have stronger software engineering and programming skills than a data scientist.
Labels: Technology,Machine Learning,Artificial Intelligence,
Itch Guard Plus Cream (Menthol and Terbinafine)
Itch Guard Plus Cream
Manufacturer: Reckitt Benckiser
Pack Sizes:
- 12 gm Cream / ₹75
- 20 gm Cream / ₹99
Product highlights
- It helps kill fungus and inhibits it from spreading further
- It can be used to treat jock itch
- It works by killing the fungi on the skin by destroying their cell membrane
Composition
1. Terbinafine Hydrochloride
2. Menthol
Terbinafine Uses
Terbinafine is used in the treatment of fungal infections.
How Terbinafine works
Terbinafine is an antifungal medication. It kills and stops the growth of the fungi by destroying their cell membrane, thereby treating your skin infection.
Common side effects of Terbinafine
Headache, Diarrhea, Rash, Indigestion, Abnormal liver enzyme, Itching, Taste change, Nausea, Abdominal pain, Flatulence
Terbinafine
- Wikipedia, 20210725
Terbinafine, sold under the brand name Lamisil among others, is an antifungal medication used to treat pityriasis versicolor, fungal nail infections, and ringworm including jock itch and athlete's foot. It is either taken by mouth or applied to the skin as a cream or ointment. The cream and ointment are not effective for nail infections. Common side effects when taken by mouth include nausea, diarrhea, headache, cough, rash, and elevated liver enzymes. Severe side effects include liver problems and allergic reactions. Liver injury is, however, unusual. Use during pregnancy is not typically recommended. The cream and ointment may result in itchiness but are generally well tolerated. Terbinafine is in the allylamines family of medications. It works by decreasing the ability of fungi to make sterols. It appears to result in fungal cell death. Terbinafine was discovered in 1991. It is on the World Health Organization's List of Essential Medicines. In 2017, it was the 307th most commonly prescribed medication in the United States, with more than one million prescriptions.
Menthol
Menthol is an organic compound made synthetically or obtained from the oils of corn mint, peppermint, or other mints. It is a waxy, crystalline substance, clear or white in color, which is solid at room temperature and melts slightly above. The main form of menthol occurring in nature is (−)-menthol, which is assigned the (1R,2S,5R) configuration. Menthol has local anesthetic and counterirritant qualities, and it is widely used to relieve minor throat irritation. Menthol also acts as a weak κ-opioid receptor agonist. In 2017, it was the 193rd most commonly prescribed medication in the United States, with more than two million prescriptions. Labels: Medicine,Science,
Thursday, July 22, 2021
Normalizing your vocabulary (lexicon) for NLP application
Why normalize our vocabulary:
1. To reduce the vocabulary size, as vocabulary size is important to the performance of an NLP pipeline.
2. So that tokens that mean similar things are combined into a single, normalized form.
3. It improves the association of meaning across those different “spellings” of a token or n-gram in your corpus.
4. Reducing your vocabulary can reduce the likelihood of overfitting.

Vocabulary is normalized in the following ways:
a) CASE FOLDING (aka case normalization)
Case folding is when you consolidate multiple “spellings” of a word that differ only in their capitalization. To preserve the meaning of proper nouns, a better approach for case normalization is to lowercase only the first word of a sentence and allow all other words to retain their capitalization, such as “Joe” and “Smith” in “Joe Smith”.
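A minimal sketch of my own of this "lowercase only the first word" approach, assuming each sentence is already split out and tokenized on whitespace:

>>> def fold_case(sentence):
...     """Lowercase only the first token; keep proper nouns like 'Joe Smith' intact."""
...     tokens = sentence.split()
...     if tokens:
...         tokens[0] = tokens[0].lower()
...     return tokens
>>> fold_case("Joe Smith met Jane in Portland.")
['joe', 'Smith', 'met', 'Jane', 'in', 'Portland.']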
b) STEMMING
Another common vocabulary normalization technique is to eliminate the small meaning differences of pluralization or possessive endings of words, or even various verb forms. This normalization, identifying a common stem among various forms of a word, is called stemming. For example, the words housing and houses share the same stem, house. Stemming removes suffixes from words in an attempt to combine words with similar meanings together under their common stem. A stem isn’t required to be a properly spelled word, but merely a token, or label, representing several possible spellings of a word. Stemming is important for keyword search or information retrieval. It allows you to search for “developing houses in Portland” and get web pages or documents that use both the word “house” and “houses” and even the word “housing.”
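A small sketch of my own using NLTK's PorterStemmer (assuming NLTK is installed); note that a stem such as 'hous' is just a shared label for the word forms, not necessarily a dictionary word:

>>> from nltk.stem.porter import PorterStemmer
>>> stemmer = PorterStemmer()
>>> [stemmer.stem(w) for w in "housing houses house developing".split()]
['hous', 'hous', 'hous', 'develop']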
# How does stemming affect precision and recall of a search engine?
This broadening of your search results would be a big improvement in the “recall” score for how well your search engine is doing its job at returning all the relevant documents. But stemming could greatly reduce the “precision” score for your search engine, because it might return many more irrelevant documents along with the relevant ones. In some applications this “false-positive rate” (the proportion of the pages returned that you don’t find useful) can be a problem. So most search engines allow you to turn off stemming and even case normalization by putting quotes around a word or phrase. Quoting indicates that you only want pages containing the exact spelling of a phrase, such as “Portland Housing Development software.” That would return a different sort of document than one that talks about “a Portland software developer’s house.”
c) LEMMATIZATION
If you have access to information about connections between the meanings of various words, you might be able to associate several words together even if their spelling is quite different. This more extensive normalization down to the semantic root of a word—its lemma—is called lemmatization.
# Lemmatization and its use in the chatbot pipeline:
Any NLP pipeline that wants to “react” the same way to multiple different spellings of the same basic root word can benefit from a lemmatizer. It reduces the number of words you have to respond to, i.e., the dimensionality of your language model. Using it can make your model more general, but it can also make your model less precise, because it will treat all spelling variations of a given root word the same. For example, “chat,” “chatter,” “chatty,” “chatting,” and perhaps even “chatbot” would all be treated the same in an NLP pipeline with lemmatization, even though they have different meanings. Likewise, “bank,” “banked,” and “banking” would be treated the same by a stemming pipeline, despite the river meaning of “bank,” the motorcycle meaning of “banked,” and the finance meaning of “banking.” Lemmatization is a potentially more accurate way to normalize a word than stemming or case normalization because it takes into account a word’s meaning. A lemmatizer uses a knowledge base of word synonyms and word endings to ensure that only words that mean similar things are consolidated into a single token.
# Lemmatization and POS (Part of Speech) Tagging
Some lemmatizers use the word’s part of speech (POS) tag in addition to its spelling to help improve accuracy. The POS tag for a word indicates its role in the grammar of a phrase or sentence. For example, the noun POS is for words that refer to “people, places, or things” within a phrase. An adjective POS is for a word that modifies or describes a noun. A verb refers to an action. The POS of a word in isolation cannot be determined; the context of a word must be known for its POS to be identified. So some advanced lemmatizers can’t be run on words in isolation.

>>> import nltk
>>> nltk.download('wordnet')
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize("better")  # Default 'pos' is noun.
'better'
>>> lemmatizer.lemmatize("better", pos="a")  # "a" --> adjective
'good'
>>> lemmatizer.lemmatize("goods", pos="n")
'good'
>>> lemmatizer.lemmatize("goods", pos="a")
'goods'
>>> lemmatizer.lemmatize("good", pos="a")
'good'
>>> lemmatizer.lemmatize("goodness", pos="n")
'goodness'
>>> lemmatizer.lemmatize("best", pos="a")
'best'
Difference between stemming and lemmatization:
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
am, are, is => be
car, cars, car's, cars' => car
The result of this mapping of text will be something like:
the boy's cars are different colors => the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.
Ref: nlp.stanford.edu
Labels: Technology,Natural Language Processing,Python,
VADER: Rule Based Approach to Sentiment Analysis
# VADER stands for Valence Aware Dictionary for sEntiment Reasoning.
# VADER is a rule-based approach to doing sentiment analysis.
# As of NLTK version 3.6 [20210721], NLTK uses VADER for sentiment analysis.
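Since NLTK ships its own port of VADER, the analyzer can also be loaded from NLTK directly. A small sketch of my own (it assumes the vader_lexicon resource has been downloaded; the scores should match the vaderSentiment output shown in the next section, as both use the same lexicon and rules):

>>> import nltk
>>> nltk.download('vader_lexicon')
>>> from nltk.sentiment import SentimentIntensityAnalyzer
>>> sia = SentimentIntensityAnalyzer()
>>> sia.polarity_scores("Python is very readable and it's great for NLP.")
{'neg': 0.0, 'neu': 0.661, 'pos': 0.339, 'compound': 0.6249}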
VADER: In Code
Try the following code snippets in the Python CLI:

>>> from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
>>> sa = SentimentIntensityAnalyzer()
>>> sa.lexicon
{ ...
':(': -1.9,
':)': 2.0,
...
'pls': 0.3,
'plz': 0.3,
...
'great': 3.1,
... }
>>> [(tok, score) for tok, score in sa.lexicon.items() if " " in tok]
[("( '}{' )", 1.6), ("can't stand", -2.0), ('fed up', -1.8), ('screwed up', -1.5)]
>>> sa.polarity_scores(text="Python is very readable and it's great for NLP.")
{'compound': 0.6249, 'neg': 0.0, 'neu': 0.661, 'pos': 0.339}
>>> sa.polarity_scores(text="Python is not a bad choice for most applications.")
{'compound': 0.431, 'neg': 0.0, 'neu': 0.711, 'pos': 0.289}
>>> corpus = ["Absolutely perfect! Love it! :-) :-) :-)",
...           "Horrible! Completely useless. :(",
...           "It was OK. Some good and some bad things."]
>>> for doc in corpus:
...     scores = sa.polarity_scores(doc)
...     print('{:+}: {}'.format(scores['compound'], doc))
+0.9428: Absolutely perfect! Love it! :-) :-) :-)
-0.8768: Horrible! Completely useless. :(
+0.3254: It was OK. Some good and some bad things.
DRAWBACK OF VADER
The drawback of VADER is that it doesn’t look at all the words in a document, only about 7,500. The questions that remain unanswered by VADER are: What if you want all the words to help add to the sentiment score? And what if you don’t want to have to code your own understanding of the words in a dictionary of thousands of words, or add a bunch of custom words to the dictionary in SentimentIntensityAnalyzer.lexicon? The rule-based approach might be impossible if you don’t understand the language, because you wouldn’t know what scores to put in the dictionary (lexicon)!
References
# “VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text” by Hutto and Gilbert (http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf)
# You can find more detailed installation instructions with the package source code on GitHub (https://github.com/cjhutto/vaderSentiment).
Labels: Technology,Natural Language Processing,Python,
Wednesday, July 21, 2021
Command 'git merge'
Code Legend:
Black: main branch
Dark gray: test_branch
Part 1: "git clone -b test_branch"
~\git_exp\test_branch>git clone -b test_branch https://github.com/ashishjain1547/repo_for_testing.git
Cloning into 'repo_for_testing'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 23 (delta 7), reused 14 (delta 3), pack-reused 0
Unpacking objects: 100% (23/23), 6.72 KiB | 5.00 KiB/s, done.

~\git_exp\test_branch>cd repo_for_testing

~\git_exp\test_branch\repo_for_testing>git branch
* test_branch

~\git_exp\test_branch\repo_for_testing>dir
 Volume in drive C is Windows
 Volume Serial Number is 8139-90C0

 Directory of ~\git_exp\test_branch\repo_for_testing

07/21/2021  12:26 PM    <DIR>          .
07/21/2021  12:26 PM    <DIR>          ..
07/21/2021  12:26 PM               368 .gitignore
07/21/2021  12:26 PM                30 20210528_test_branch.txt
07/21/2021  12:26 PM                17 202107141543.txt
07/21/2021  12:26 PM                17 202107141608.txt
07/21/2021  12:26 PM            11,558 LICENSE
07/21/2021  12:26 PM                11 newFile.txt
07/21/2021  12:26 PM                38 README.md
07/21/2021  12:26 PM                23 test_file_20210528.txt
               8 File(s)         12,062 bytes
               2 Dir(s)  56,473,489,408 bytes free
Part 2: "git clone" Default
~\git_exp\main>git clone https://github.com/ashishjain1547/repo_for_testing.git
Cloning into 'repo_for_testing'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 23 (delta 7), reused 14 (delta 3), pack-reused 0
Unpacking objects: 100% (23/23), 6.72 KiB | 5.00 KiB/s, done.

~\git_exp\main>cd repo_for_testing

~\git_exp\main\repo_for_testing>git status
On branch main
Your branch is up to date with 'origin/main'.
nothing to commit, working tree clean

~\git_exp\main\repo_for_testing>git branch -a
* main
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch
Part 3: Create new file in "test_branch"
~\git_exp\test_branch\repo_for_testing>echo "202107211228" > 202107211228.txt

~\git_exp\test_branch\repo_for_testing>dir
 Volume in drive C is Windows
 Volume Serial Number is 8139-90C0

 Directory of ~\git_exp\test_branch\repo_for_testing

07/21/2021  12:28 PM    <DIR>          .
07/21/2021  12:28 PM    <DIR>          ..
07/21/2021  12:26 PM               368 .gitignore
07/21/2021  12:26 PM                30 20210528_test_branch.txt
07/21/2021  12:26 PM                17 202107141543.txt
07/21/2021  12:26 PM                17 202107141608.txt
07/21/2021  12:28 PM                17 202107211228.txt
07/21/2021  12:26 PM            11,558 LICENSE
07/21/2021  12:26 PM                11 newFile.txt
07/21/2021  12:26 PM                38 README.md
07/21/2021  12:26 PM                23 test_file_20210528.txt
               9 File(s)         12,079 bytes
               2 Dir(s)  56,473,849,856 bytes free

~\git_exp\test_branch\repo_for_testing>git status
On branch test_branch
Your branch is up to date with 'origin/test_branch'.
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        202107211228.txt
nothing added to commit but untracked files present (use "git add" to track)

~\git_exp\test_branch\repo_for_testing>git add -A

~\git_exp\test_branch\repo_for_testing>git commit -m "20210721 1229"
[test_branch 087a5ca] 20210721 1229
 1 file changed, 1 insertion(+)
 create mode 100644 202107211228.txt

~\git_exp\test_branch\repo_for_testing>git push
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 289 bytes | 289.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/ashishjain1547/repo_for_testing.git
   9017804..087a5ca  test_branch -> test_branch

~\git_exp\test_branch\repo_for_testing>git status
On branch test_branch
Your branch is up to date with 'origin/test_branch'.
nothing to commit, working tree clean
Part 4: Git Metadata About the Files and the Use of 'git pull origin' to Update this Metadata
~\git_exp\main\repo_for_testing>git branch -a
* main
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch

~\git_exp\main\repo_for_testing>git checkout test_branch
Switched to a new branch 'test_branch'
Branch 'test_branch' set up to track remote branch 'test_branch' from 'origin'.

~\git_exp\main\repo_for_testing>git branch -a
  main
* test_branch
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch

~\git_exp\main\repo_for_testing>git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.

~\git_exp\main\repo_for_testing>git merge test_branch
Already up to date.
Part 5: 'git pull origin' and then 'git merge'
~\git_exp\main\repo_for_testing>git pull origin
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 1), reused 3 (delta 1), pack-reused 0
Unpacking objects: 100% (3/3), 269 bytes | 0 bytes/s, done.
From https://github.com/ashishjain1547/repo_for_testing
   9017804..087a5ca  test_branch -> origin/test_branch
Already up to date.

~\git_exp\main\repo_for_testing>git branch -D test_branch
Deleted branch test_branch (was 9017804).

~\git_exp\main\repo_for_testing>git checkout test_branch
Switched to a new branch 'test_branch'
Branch 'test_branch' set up to track remote branch 'test_branch' from 'origin'.

~\git_exp\main\repo_for_testing>git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.

~\git_exp\main\repo_for_testing>git merge test_branch
Merge made by the 'recursive' strategy.
 202107211228.txt | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 202107211228.txt

~\git_exp\main\repo_for_testing>git push
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 231 bytes | 231.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/ashishjain1547/repo_for_testing.git
   d210505..9f1b42f  main -> main

The above step puts the "main" branch ahead of "test_branch" by two commits.
Part 6: Bringing 'test_branch' on level with 'main' branch using 'git merge'
~\git_exp\test_branch\repo_for_testing>git branch
* test_branch

~\git_exp\test_branch\repo_for_testing>git branch -a
* test_branch
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch

~\git_exp\test_branch\repo_for_testing>git pull origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 211 bytes | 1024 bytes/s, done.
From https://github.com/ashishjain1547/repo_for_testing
   d210505..9f1b42f  main -> origin/main
Already up to date.

~\git_exp\test_branch\repo_for_testing>git checkout main
Switched to a new branch 'main'
Branch 'main' set up to track remote branch 'main' from 'origin'.

~\git_exp\test_branch\repo_for_testing>git checkout test_branch
Switched to branch 'test_branch'
Your branch is up to date with 'origin/test_branch'.

~\git_exp\test_branch\repo_for_testing>git merge main
Updating 087a5ca..9f1b42f
Fast-forward

~\git_exp\test_branch\repo_for_testing>git status
On branch test_branch
Your branch is ahead of 'origin/test_branch' by 2 commits.
  (use "git push" to publish your local commits)
nothing to commit, working tree clean

~\git_exp\test_branch\repo_for_testing>git push
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/ashishjain1547/repo_for_testing.git
   087a5ca..9f1b42f  test_branch -> test_branch

~\git_exp\test_branch\repo_for_testing>git branch -a
  main
* test_branch
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/test_branch

Labels: Technology,GitHub,
Tuesday, July 20, 2021
Session 1 on ‘Understanding, Analyzing and Generating Text'
Here, we focus on only one natural language, English, and only one programming language, Python.
The Way We Understand Language and How Machines See it is Quite Different
Natural languages have an additional “decoding” challenge (apart from the ‘Information Extraction’ from them) that is even harder to solve. Speakers and writers of natural languages assume that a human is the one doing the processing (listening or reading), not a machine. So when I say “good morning”, I assume that you have some knowledge about what makes up a morning, including not only that mornings come before noons and afternoons and evenings but also after midnights. And you need to know they can represent times of day as well as general experiences of a period of time. The interpreter is assumed to know that “good morning” is a common greeting that doesn’t contain much information at all about the morning. Rather, it reflects the state of mind of the speaker and her readiness to speak with others.

TIP: The “r” before the quote specifies a raw string, not a regular expression. With a Python raw string, you can send backslashes directly to the regular expression compiler without having to double-backslash ("\\") all the special regular expression characters such as spaces ("\\ ") and curly braces or handlebars ("\\{ \\}").
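To illustrate that TIP, here is a small sketch of my own (the pattern and sentence are made up for this example): with a raw string the backslash in \d reaches the regex compiler as-is, while a normal string needs it doubled.

>>> import re
>>> re.findall(r"\d{1,2}", "Monticello was begun at age 26")   # raw string: \d passed through intact
['26']
>>> re.findall("\\d{1,2}", "Monticello was begun at age 26")   # without r'' you must double the backslash
['26']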
Architecture of a Chatbot
A chatbot requires four kinds of processing as well as a database to maintain a memory of past statements and responses. Each of the four processing stages can contain one or more processing algorithms working in parallel or in series (see figure 1.3):
1. Parse—Extract features, structured numerical data, from natural language text.
2. Analyze—Generate and combine features by scoring text for sentiment, grammaticality, and semantics.
3. Generate—Compose possible responses using templates, search, or language models.
4. Execute—Plan statements based on conversation history and objectives, and select the next response.
The Way Rasa Identifies a Greeting or Good-bye
How does Rasa understand your greetings?
An image taken from the “rasa interactive” command output of our conversation.
IQ of some Natural Language Processing systems
We see that the bots working at depth in this image are: Domain Specific Bots.
For the fundamental building blocks of NLP, there are equivalents in a computer language compiler
# tokenizer -- scanner, lexer, lexical analyzer
# vocabulary -- lexicon
# parser -- compiler
# token, term, word, or n-gram -- token, symbol, or terminal symbol
A quick-and-dirty example of a ‘Tokenizer’ using str.split()
>>> import numpy as np
>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> token_sequence = str.split(sentence)
>>> vocab = sorted(set(token_sequence))
>>> ', '.join(vocab)
'26., Jefferson, Monticello, Thomas, age, at, began, building, of, the'
>>> num_tokens = len(token_sequence)
>>> vocab_size = len(vocab)
>>> onehot_vectors = np.zeros((num_tokens, vocab_size), int)
>>> for i, word in enumerate(token_sequence):
...     onehot_vectors[i, vocab.index(word)] = 1
>>> ' '.join(vocab)
'26. Jefferson Monticello Thomas age at began building of the'
>>> onehot_vectors
One-Hot Vectors and Memory Requirement
Let’s run through the math to give you an appreciation for just how big and unwieldy these “player piano paper rolls” are. In most cases, the vocabulary of tokens you’ll use in an NLP pipeline will be much more than 10,000 or 20,000 tokens. Sometimes it can be hundreds of thousands or even millions of tokens. Let’s assume you have a million tokens in your NLP pipeline vocabulary. And let’s say you have a meager 3,000 books with 3,500 sentences each and 15 words per sentence—reasonable averages for short books. That’s a whole lot of big tables (matrices). The example below assumes that we have a million tokens (words) in our vocabulary:
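The worked example from the original slide is not included here, so below is a back-of-the-envelope sketch of my own, assuming one byte per cell of the one-hot matrix:

>>> num_rows = 3000 * 3500 * 15          # books * sentences per book * words per sentence = total tokens
>>> num_rows
157500000
>>> num_cells = num_rows * 1000000       # one column per word in the million-token vocabulary
>>> num_cells / 1e12                     # terabytes, at 1 byte per cell
157.5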
Document-Term Matrix
The One-Hot-Vector-Based Representation of Sentences in the previous slide is a concept very similar to a “Document-Term” matrix.
For Tokenization: Use NLTK (Natural Language Toolkit)
You can use the NLTK function RegexpTokenizer to replicate your simple tokenizer example (see the sketch below). An even better tokenizer is the Treebank Word Tokenizer from the NLTK package. It incorporates a variety of common rules for English word tokenization. For example, it separates phrase-terminating punctuation (?!.;,) from adjacent tokens and retains decimal numbers containing a period as a single token. In addition, it contains rules for English contractions. For example, “don’t” is tokenized as ["do", "n’t"]. This tokenization will help with subsequent steps in the NLP pipeline, such as stemming.
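A sketch of my own of both tokenizers (assuming NLTK is installed; the regex pattern is an illustrative choice, not the one from the original slide):

>>> from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer
>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> RegexpTokenizer(r'\w+|\$[0-9.]+|\S+').tokenize(sentence)
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '.']
>>> TreebankWordTokenizer().tokenize("Don't split decimals like 26.2, please.")
['Do', "n't", 'split', 'decimals', 'like', '26.2', ',', 'please', '.']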
Stemming and lemmatization
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
am, are, is => be
car, cars, car's, cars' => car
The result of this mapping of text will be something like:
the boy's cars are different colors => the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.
Ref: nlp.stanford.edu
CONTRACTIONS
You might wonder why you would split the contraction wasn’t into was and n’t. For some applications, like grammar-based NLP models that use syntax trees, it’s important to separate the words was and not to allow the syntax tree parser to have a consistent, predictable set of tokens with known grammar rules as its input. There are a variety of standard and nonstandard ways to contract words. By reducing contractions to their constituent words, a dependency tree parser or syntax parser only needs to be programmed to anticipate the various spellings of individual words rather than all possible contractions.
Tokenize informal text from social networks such as Twitter and Facebook
The NLTK library includes a tokenizer—casual_tokenize—that was built to deal with short, informal, emoticon-laced texts from social networks where grammar and spelling conventions vary widely. The casual_tokenize function allows you to strip usernames and reduce the number of repeated characters within a token:

>>> from nltk.tokenize.casual import casual_tokenize
>>> message = """RT @TJMonticello Best day everrrrrrr at Monticello.\
... Awesommmmmmeeeeeeee day :*)"""
>>> casual_tokenize(message)
['RT', '@TJMonticello', 'Best', 'day', 'everrrrrrr', 'at', 'Monticello', '.', 'Awesommmmmmeeeeeeee', 'day', ':*)']
>>> casual_tokenize(message, reduce_len=True, strip_handles=True)
['RT', 'Best', 'day', 'everrr', 'at', 'Monticello', '.', 'Awesommmeee', 'day', ':*)']
n-gram tokenizer from nltk in action
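The code for this slide is not in the post, so here is a small sketch of my own using nltk.util.ngrams on the token sequence from the earlier example:

>>> from nltk.util import ngrams
>>> tokens = ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26.']
>>> list(ngrams(tokens, 2))[:3]
[('Thomas', 'Jefferson'), ('Jefferson', 'began'), ('began', 'building')]
>>> [" ".join(g) for g in ngrams(tokens, 2)][:3]   # join each 2-gram back into a single token string
['Thomas Jefferson', 'Jefferson began', 'began building']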
You might be able to sense a problem here. Looking at your earlier example, you can imagine that the 2-gram “Thomas Jefferson” will occur across quite a few documents. However, the 2-grams “of 26” or even “Jefferson began” will likely be extremely rare. If tokens or n-grams are extremely rare, they don’t carry any correlation with other words that you can use to help identify topics or themes that connect documents or classes of documents. So rare n-grams won’t be helpful for classification problems. You can imagine that most 2-grams are pretty rare—even more so for 3- and 4-grams.
Problem of rare n-grams
Because word combinations are rarer than individual words, your vocabulary size is exponentially approaching the number of n-grams in all the documents in your corpus. If your feature vector dimensionality exceeds the length of all your documents, your feature extraction step is counterproductive. It’ll be virtually impossible to avoid overfitting a machine learning model to your vectors; your vectors have more dimensions than there are documents in your corpus. In chapter 3, you’ll use document frequency statistics to identify n-grams so rare that they are not useful for machine learning. Typically, n-grams are filtered out that occur too infrequently (for example, in three or fewer different documents). This scenario is represented by the “rare token” filter in the coin-sorting machine of chapter 1.
Problem of common n-grams
Now consider the opposite problem. Consider the 2-gram “at the” in the previous phrase. That’s probably not a rare combination of words. In fact it might be so common, spread among most of your documents, that it loses its utility for discriminating between the meanings of your documents. It has little predictive power. Just like words and other tokens, n-grams are usually filtered out if they occur too often. For example, if a token or n-gram occurs in more than 25% of all the documents in your corpus, you usually ignore it. This is equivalent to the “stop words” filter in the coin-sorting machine of chapter 1. These filters are as useful for n-grams as they are for individual tokens. In fact, they’re even more useful.
STOP WORDS
Stop words are common words in any language that occur with a high frequency but carry much less substantive information about the meaning of a phrase. Examples of some common stop words include:
a, an
the, this
and, or
of, on
A more comprehensive list of stop words for various languages can be found in NLTK’s corpora (stopwords.zip). Historically, stop words have been excluded from NLP pipelines in order to reduce the computational effort to extract information from a text. Even though the words themselves carry little information, stop words can provide important relational information as part of an n-gram. Consider these two examples:
Mark reported to the CEO
Suzanne reported as the CEO to the board
Also, some stop word lists contain the word ‘not’, which means “feeling cold” and “not feeling cold” would both be reduced to “feeling cold” by a stop words filter.
Ref: stop words removal using nltk, spacy and gensim
Stop Words Removal
Designing a filter for stop words depends on your application. Vocabulary size will drive the computational complexity and memory requirements of all subsequent steps in the NLP pipeline. But stop words are only a small portion of your total vocabulary size. A typical stop word list has only 100 or so frequent and unimportant words listed in it. But a vocabulary size of 20,000 words would be required to keep track of 95% of the words seen in a large corpus of tweets, blog posts, and news articles. And that’s just for 1-grams or single-word tokens. A 2-gram vocabulary designed to catch 95% of the 2-grams in a large English corpus will generally have more than 1 million unique 2-gram tokens in it.

You may be worried that vocabulary size drives the required size of any training set you must acquire to avoid overfitting to any particular word or combination of words. And you know that the size of your training set drives the amount of processing required to process it all. However, getting rid of 100 stop words out of 20,000 isn’t going to significantly speed up your work. And for a 2-gram vocabulary, the savings you’d achieve by removing stop words are minuscule. In addition, for 2-grams you lose a lot more information when you get rid of stop words arbitrarily, without checking for the frequency of the 2-grams that use those stop words in your text. For example, you might miss mentions of “The Shining” as a unique title and instead treat texts about that violent, disturbing movie the same as you treat documents that mention “Shining Light” or “shoe shining.”

So if you have sufficient memory and processing bandwidth to run all the NLP steps in your pipeline on the larger vocabulary, you probably don’t want to worry about ignoring a few unimportant words here and there. And if you’re worried about overfitting a small training set with a large vocabulary, there are better ways to select your vocabulary or reduce your dimensionality than ignoring stop words. Including stop words in your vocabulary allows the document frequency filters (discussed in chapter 3) to more accurately identify and ignore the words and n-grams with the least information content within your particular domain.
Stop Words in Code
>>> stop_words = ['a', 'an', 'the', 'on', 'of', 'off', 'this', 'is']
>>> tokens = ['the', 'house', 'is', 'on', 'fire']
>>> tokens_without_stopwords = [x for x in tokens if x not in stop_words]
>>> print(tokens_without_stopwords)
['house', 'fire']
Stop Words From NLTK and Scikit-Learn
Code for “Stop Words From NLTK and Scikit-Learn”:

>>> import nltk
>>> nltk.download('stopwords')
>>> stop_words = nltk.corpus.stopwords.words('english')
>>> len(stop_words)
179
>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
>>> len(sklearn_stop_words)
318
>>> len(set(stop_words).union(sklearn_stop_words))
378
>>> len(set(stop_words).intersection(sklearn_stop_words))
119
Labels: Artificial Intelligence,Natural Language Processing,Python,Technology,
Sunday, July 18, 2021
Journal (2011-Jan-05)
Index of Journals
5 January 2011 I went to bed around 0000 last night to get the usual half-hour long rest and I just didn’t put the alarm the sound. What happened next was obvious. I woke up around six, lucky me. At 0630 I was running to go to bath. And amma asked the usual question of when I would leave. Though I’d tell her that I’m in hurry but when was she getting the food ready before I come? I reached college on time around 0800 and I just sat in the stairs of that closed building to study my left stuff. I had left almost half of what I had planned to do and I had planned to do two-third of the whole. I wrote the exam nicely, though I could have got along with the teacher who was staring me too much but I didn’t. That was because she saw me talking to myself. I came home straight away, trying to keep away from girls though I love them all. I, kind of, feel like I am not doing justice with them some times. I was home around 1500 and then I was watching TV till four till I went to bed. When I’ll get over with this movie and after dinner it’ll be nine! God Bless ‘Me’ Ashish
Journal (2011-Jan-04)
Index of Journals
4 January 2011 I went back to bed at, even before, 2300. That was silly, but I just couldn’t sit in bed for longer. I woke up around nine in the morning. I was awake for a second at five in morning but I was again out of any sense of tension. I slept back. I had nightmare, I been having them for quite some time now. I took the books taking the day lightly and I have been going very slowly till now. I came to watch television way before six-half and, it’s seven-half now. I better go now. I texted Vibha back last night and she’s still ready to comeback even after how I had ignored her on New Year eve. I am confused what to do with her; I am just not letting her go that easily. I feel like I am such a big fool to dream of ever becoming big with this big useless thing in my room, Prashant. He is sick, and no one except his biological relations can tolerate his shit. God Bless ‘Me’ Ashish
Journal (2011-Jan-03 (ADGITM students cheat during the exam))
Index of Journals
3 January 2011 Last night I went to bed early. I just couldn’t study yesterday. I never got to make up the mood. It was really sad to know that later. But then miraculously I woke in the middle of the night around 0130 and I thanked god several times for bringing me luck. I studied and I had just started studying. I studied according the requirement of the question paper, went strategically to cover the minimum amount of chapters to pass. Around 0500 Prashant’s phone started ringing and I had to push it off but he then noticed it in his sleep, so was awaken right after. I went to college before time as always and went to an isolated place to study. Exam went fine, I guess. I don’t know man, if I would get marks for attempting questions I’d surely pass, rest is in examiner’s hand. I didn’t give Vibha a look, nor did she let me do that at any moment. Plus, I don’t have to. Because with hi-hello with Astha it’s clear I have a pass for their group. It feels really cold while writing the exam. I never get to cheat. It is really, really sad. Behind me sits Parul and on the front sits Nitish, none of them ever seem willing to even hear to any call, let alone cheating! And for themselves Parul, Sonam (who is sitting behind her) and Srishti (who is sitting on the left of Sonam) would always be discussing questions in which they are doubtful. Abhilash and Roshan would get scolded, but Kriti has always been cheating and has always been safe. I came home and slept. It was around 1930 that I woke up. My god, my knee is causing pain when I bend it to sit specifically. It’s crazy; I never got into trouble as far as I remember. I met Hardik today while returning from Metro station in the afternoon. It felt nice to see him. I never wanted to see him though, and that will never change. God Bless ‘Me’ Ashish
Journal (2011-Jan-02 - Gareema Sethi was not teaching algorithms)
Index of Journals
2 January 2011 “Gareema Sethi ma’am was just not doing the complete job. I mean, who’s supposed to teach the students to learn to write algorithms. I mean, may be the time given to her for theory classes wasn’t enough to teach algorithms but then why didn’t she utilize the practical periods to do the same?” Vibha Bhardwaj had called yesterday but I was on terrace and I did not take the call. I didn’t bother to call back. She has been acting b***hy and that just isn’t right. She had texted me for New Year but I didn’t even reply to that. I didn’t wish anybody this time then why her so specially? I actually meant ‘nobody’ when I said it. Life is not good; I just want to keep moving to get to some place of mine. It is my exam next day and I need to theoretically start studying. I was watching movie on “Movies Now” when I returned from terrace where I was flying kite. Yeah I was lucky to find one there incidentally. It was awkward to collect the thread on that branch before Kunal Jain’s mother who had suddenly come to the terrace. She must have probably come to see if I was doing anything suspicious. Huh, these people sometimes make me go crazy. God Bless ‘Me’ Ashish
Journal (2011-Jan-01)
Index of Journals
1 January 2011 “Why do I still feel bad about my short height? I feel like quitting on living a normal life! And what is surprising is that when I hate myself the most so much that if on earth I would want one person to get removed would be me, these girls still covet me.” I was watching this movie ‘Transporter 2’ and then during the last fight grandpa was giving grandma calls. Grandma wasn’t listening and so I told him. So he called me, it felt bad to spread the blanket on him during the last action scene of the movie I was so interestingly watching. But then I realized that we had a situation here and I didn’t do anything wrong from no angle. The situation was grandpa can’t shout and grandma couldn’t listen. And since past few days grandpa’s throat has been at its worst, he’s been coughing all time, sometimes it worries me too much (though it has nothing to do with me). Okay, I chose not to go for a ride and etcetera when aunt just asked. Well I was asked, or informed about them being going, twice, actually thrice when the last time Anu came. God Bless ‘Me’ Ashish