Wednesday, October 5, 2022

Python Packages Useful For Natural Language Processing

1. nltk

The Natural Language Toolkit (NLTK) is a Python package for natural language processing. NLTK requires Python 3.7, 3.8, 3.9 or 3.10.

As in:
1.1. from nltk.sentiment.vader import SentimentIntensityAnalyzer
1.2. from nltk.stem import WordNetLemmatizer
1.3. from nltk.corpus import stopwords
1.4. from nltk.tokenize import word_tokenize

2. scikit-learn

Some of its most popular uses in NLP:
2.1. from sklearn.feature_extraction.text import TfidfVectorizer
2.2. from sklearn.metrics.pairwise import cosine_similarity
2.3. from sklearn.manifold import TSNE
2.4. from sklearn.decomposition import LatentDirichletAllocation, PCA
2.5. from sklearn.cluster import AgglomerativeClustering

Ref: Math with words
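As a minimal sketch of how the TfidfVectorizer and cosine_similarity imports fit together, the snippet below turns three sentences into TF-IDF vectors and compares them pairwise. The sentences and variable names are made up for illustration and are not taken from the original use case.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical sentences, for illustration only).
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]

# Learn the vocabulary and turn each document into a TF-IDF row vector.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Pairwise cosine similarity between all documents, shape (3, 3).
sim = cosine_similarity(tfidf)

print(sim.round(2))
```

The first two sentences share several words, so their similarity is above zero, while the third shares none with the first and scores zero against it.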

3. spaCy: Industrial-strength NLP

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products.

spaCy comes with pretrained pipelines and currently supports tokenization and training for 60+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management. spaCy is commercial open-source software, released under the MIT license.

Our use case involved the NER capability of spaCy.

Ref:
% Exploring Word2Vec
% Python Code to Create Annotations For SpaCy NER

4. gensim

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.

Ref: PyPI

Our use case involved:
4.1. from gensim.models import Word2Vec
4.2. from gensim.corpora.dictionary import Dictionary
4.3. from gensim.models.lsimodel import LsiModel, stochastic_svd
4.4. from gensim.models.coherencemodel import CoherenceModel
4.5. from gensim.models.ldamodel import LdaModel # Latent Dirichlet Allocation, not 'Linear Discriminant Analysis'
4.6. from gensim.models import RpModel
4.7. from gensim.matutils import corpus2dense, Dense2Corpus
4.8. from gensim.test.utils import common_texts

5. word2vec

Python interface to Google word2vec. Training is done using the original C code; other functionality is pure Python with NumPy.

6. GloVe

A general Cython implementation of GloVe with multi-threaded training. GloVe is an unsupervised learning algorithm for generating vector representations for words. Training is done using a co-occurrence matrix from a corpus. The resulting representations contain structure useful for many other tasks. The model is described in the paper "GloVe: Global Vectors for Word Representation" (Pennington, Socher and Manning, EMNLP 2014); the original implementation is published by the Stanford NLP Group.
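To make the co-occurrence matrix idea concrete, here is a minimal standard-library sketch that counts how often word pairs appear within a fixed window. It is not the GloVe package's own code: the function name `cooccurrence_counts`, the `window` parameter and the toy corpus are all hypothetical, and the counts are unweighted, whereas real GloVe training typically also weights pairs by their distance.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look at neighbours to the left and right inside the window.
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy tokenized corpus (hypothetical, for illustration only).
corpus = [
    ["ice", "is", "cold"],
    ["steam", "is", "hot"],
]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("is", "cold")])   # pair seen inside the window
print(counts[("ice", "cold")])  # pair too far apart for window=1
```

A sparse mapping like this is the raw input from which word vectors are then fitted; the fitting step itself is left to the library.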

7. fastText

fastText is a library for efficient learning of word representations and sentence classification.

Ref:
% PyPI
% Reasoning with Word Vectors

8. TextWiser: Text Featurization Library

TextWiser (AAAI'21) is a research library that provides a unified framework for text featurization based on a rich set of methods while taking advantage of pretrained models provided by the state-of-the-art libraries.

The main contributions include:
% Rich Set of Embeddings: A wide range of available embeddings and transformations to choose from.
% Fine-Tuning: Designed to support a PyTorch backend, and hence retains the ability to fine-tune featurizations for downstream tasks. That means, if you pass the resulting fine-tunable embeddings to a training method, the features will be optimized automatically for your application.
% Parameter Optimization: Interoperable with the standard scikit-learn pipeline for hyper-parameter tuning and rapid experimentation. All underlying parameters are exposed to the user.
% Grammar of Embeddings: Introduces a novel approach to design embeddings from components. The compound embedding allows forming arbitrarily complex embeddings in accordance with a context-free grammar that defines a formal language for valid text featurization.
% GPU Native: Built with GPUs in mind. If it detects available hardware, the relevant models are automatically placed on the GPU.

TextWiser is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments. Documentation is available at fidelity.github.io/textwiser. Here is the video of the paper presentation at AAAI 2021.

Our use case involved:
% Document Embeddings (Doc2Vec): Supported by gensim
% Defaults to training from scratch

Ref: PyPI

9. BERT-As-a-Service

pip install bert-serving-server # server
pip install bert-serving-client # client, independent of `bert-serving-server`

Ref:
% Getting Started with BERT-As-a-Service
% Word Embeddings Using BERT (Demo of BERT-As-a-Service)

10. transformers

State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow.

Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. These models can be applied on:
- Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages.
- Images, for tasks like image classification, object detection, and segmentation.
- Audio, for tasks like speech recognition and audio classification.

Transformer models can also perform tasks on several modalities combined, such as table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. At the same time, each Python module defining an architecture is fully standalone and can be modified to enable quick research experiments.

Transformers is backed by the three most popular deep learning libraries (Jax, PyTorch and TensorFlow) with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other.

Ref:
% Word Embeddings using BERT and testing using Word Analogies, Nearest Words, 1D Spectrum
% PyPI
% conda-forge

11. torch

PyTorch is a Python package that provides two high-level features:
% Tensor computation (like NumPy) with strong GPU acceleration
% Deep neural networks built on a tape-based autograd system

You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch when needed.

Ref: PyPI
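The two features above can be sketched in a few lines. This is only an illustrative example with made-up values: it computes a simple function of a tensor, lets autograd derive the gradient, and then moves the data to a GPU if one happens to be available.

```python
import torch

# A tensor that autograd should track.
x = torch.tensor([2.0, 3.0], requires_grad=True)

# y = x0^2 + x1^2; the operations are recorded as they are computed.
y = (x ** 2).sum()

# Backpropagate through the recorded operations: fills x.grad with dy/dx = 2 * x.
y.backward()

print(y.item())
print(x.grad)

# Use a GPU only if one is available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_on_device = x.detach().to(device)
print(x_on_device.device)
```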

12. sentence-transformers

Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.

This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar text is close and can efficiently be found using cosine similarity.

We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases. Further, this framework allows easy fine-tuning of custom embedding models, to achieve maximal performance on your specific task. For the full documentation, see www.SBERT.net.

The following publications are integrated in this framework:
% Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019)
% Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (EMNLP 2020)
% Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks (NAACL 2021)
% The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (arXiv 2020)
% TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (arXiv 2021)
% BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv 2021)

Ref:
% PyPI
% conda-forge

13. Scrapy

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors. Check the Scrapy homepage at https://scrapy.org for more information, including a list of features.

Requirements:
% Python 3.6+
% Works on Linux, Windows, macOS, BSD

14. Rasa

% Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more
% Create chatbots and voice assistants

Ref: Rasa

15. Sentiment Analysis using BERT, DistilBERT and ALBERT

Sentiment analysis neural network trained by fine-tuning BERT, ALBERT, or DistilBERT on the Stanford Sentiment Treebank.

Ref:
% barissayil/SentimentAnalysis
% Sentiment Analysis Using BERT

16. pyLDAvis

Python library for interactive topic model visualization. This is a port of the fabulous R package by Carson Sievert and Kenny Shirley.

pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization. The visualization is intended to be used within an IPython notebook but can also be saved to a stand-alone HTML file for easy sharing.

Note: LDA stands for latent Dirichlet allocation.

Ref:
% PyPI
% Creating taxonomy for BBC news articles

17. scipy

As in, for cosine distance calculation:
from scipy.spatial.distance import cosine

Note:
from sklearn.metrics.pairwise import cosine_similarity # Expects 2D arrays as input
from scipy.spatial.distance import cosine # Works with 1D vectors
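The difference in expected input shape, and the fact that SciPy returns a distance while scikit-learn returns a similarity, can be seen in a small sketch. The two vectors below are arbitrary illustrative values.

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

# Two arbitrary 1D vectors (illustrative values only).
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])

# SciPy takes 1D vectors and returns a *distance*: 1 - cosine similarity.
dist = cosine(a, b)

# scikit-learn takes 2D arrays of shape (n_samples, n_features) and returns
# a *similarity* matrix, so each vector is reshaped into a single row first.
sim = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(dist, sim)
```

For the same pair of vectors the two results add up to 1, so either function can be used as long as the shape and the distance-versus-similarity convention are kept in mind.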

18. twitter

The Minimalist Twitter API for Python is a Python API for Twitter, everyone's favorite Web 2.0 Facebook-style status updater for people on the go. Also included is a Twitter command-line tool for getting your friends' tweets and setting your own tweet from the safety and security of your favorite shell and an IRC bot that can announce Twitter updates to an IRC channel.

Ref:
% https://pypi.org/project/twitter/
% https://survival8.blogspot.com/2022/09/using-twitter-api-to-fetch-trending.html

19. spark-nlp

Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.

Spark NLP comes with 11000+ pretrained pipelines and models in 200+ languages. It also offers tasks such as Tokenization, Word Segmentation, Part-of-Speech Tagging, Word and Sentence Embeddings, Named Entity Recognition, Dependency Parsing, Spell Checking, Text Classification, Sentiment Analysis, Token Classification, Machine Translation (+180 languages), Summarization, Question Answering, Table Question Answering, Text Generation, Image Classification, Automatic Speech Recognition, and many more NLP tasks.

Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, CamemBERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, MarianMT, GPT2, and Vision Transformers (ViT) not only to Python and R, but also to the JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively.

Ref: https://pypi.org/project/spark-nlp/

20. keras-transformer

Popular Usage: Machine Translation

Ref:
% PyPI
% GitHub

21. pronouncing

Pronouncing is a simple interface for the CMU Pronouncing Dictionary. It’s easy to use and has no external dependencies. For example, here’s how to find rhymes for a given word:

>>> import pronouncing
>>> pronouncing.rhymes("climbing")
['diming', 'liming', 'priming', 'rhyming', 'timing']

Ref: https://pypi.org/project/pronouncing/

22. random-word

This is a simple python package to generate random English words.

23. langdetect

Port of Nakatani Shuyo's language-detection library (version from 03/03/2014) to Python.

langdetect supports 55 languages out of the box (ISO 639-1 codes): af, ar, bg, bn, ca, cs, cy, da, de, el, en (English), es, et, fa, fi, fr, gu, he, hi (Hindi), hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw

Ref: https://pypi.org/project/langdetect/

24. PyPDF2

PyPDF2 is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. PyPDF2 can retrieve text and metadata from PDFs as well.

25. python-docx

python-docx is a Python library for creating and updating Microsoft Word (.docx) files.

Installation:
$ pip install python-docx

Usage:
import docx

Note: Does not support .pdf and .doc

Ref:
% github
% Convert MS Word files into PDF format

26. emoji

Emoji for Python. This project was inspired by kyokomi.

The entire set of Emoji codes as defined by the Unicode consortium is supported in addition to a bunch of aliases. By default, only the official list is enabled but doing emoji.emojize(language='alias') enables both the full list and aliases.

Ref:
% PyPI
% conda-forge
% Social Analysis (SOAN using Python 3) Report

27. pattern

Web mining module for Python.

Ref:
% PyPI
% conda-forge

28. wordcloud

A little word cloud generator in Python. Read more about it on the blog post or the website. The code is tested against Python 2.7, 3.4, 3.5, 3.6 and 3.7. [Dated: 20221005]

Installation with pip3

$ pip3 install wordcloud
Collecting wordcloud
Downloading wordcloud-1.8.2.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (458 kB)
|████████████████████████████████| 458 kB 13 kB/s
Requirement already satisfied: numpy>=1.6.1 in /home/ashish/anaconda3/lib/python3.9/site-packages (from wordcloud) (1.21.5)
Requirement already satisfied: matplotlib in /home/ashish/anaconda3/lib/python3.9/site-packages (from wordcloud) (3.5.1)
Requirement already satisfied: pillow in /home/ashish/anaconda3/lib/python3.9/site-packages (from wordcloud) (9.0.1)
Requirement already satisfied: cycler>=0.10 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (4.25.0)
Requirement already satisfied: packaging>=20.0 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (21.3)
Requirement already satisfied: python-dateutil>=2.7 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (2.8.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (1.3.2)
Requirement already satisfied: pyparsing>=2.2.1 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (3.0.4)
Requirement already satisfied: six>=1.5 in /home/ashish/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.8.2.2

Ref:
% PyPI
% conda-forge

29. Social Analysis (SOAN)

Social Analysis based on WhatsApp data.

Ref: GitHub
Tags: Technology, Natural Language Processing
