Friday, August 28, 2020

Elbow Method for identifying k in kMeans (clustering) and kNN (classification)


Elbow method (clustering)

In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components to describe a data set.

Intuition

Using the "elbow" or "knee of a curve" as a cutoff point is a common heuristic in mathematical optimization to choose a point where diminishing returns are no longer worth the additional cost. In clustering, this means one should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. The intuition is that increasing the number of clusters will naturally improve the fit (explain more of the variation), since there are more parameters (more clusters) to use, but that at some point this is over-fitting, and the elbow reflects this. For example, given data that actually consist of k labeled groups – for example, k points sampled with noise – clustering with more than k clusters will "explain" more of the variation (since it can use smaller, tighter clusters), but this is over-fitting, since it is subdividing the labeled groups into multiple clusters. The idea is that the first clusters will add much information (explain a lot of variation), since the data actually consist of that many groups (so these clusters are necessary), but once the number of clusters exceeds the actual number of groups in the data, the added information will drop sharply, because it is just subdividing the actual groups. Assuming this happens, there will be a sharp elbow in the graph of explained variation versus clusters: increasing rapidly up to k (under-fitting region), and then increasing slowly after k (over-fitting region). In practice there may not be a sharp elbow, and as a heuristic method, such an "elbow" cannot always be unambiguously identified.

Measures of variation

There are various measures of "explained variation" used in the elbow method. Most commonly, variation is quantified by variance, and the ratio used is the ratio of between-group variance to the total variance. Alternatively, one uses the ratio of between-group variance to within-group variance, which is the one-way ANOVA F-test statistic.
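For concreteness, the between-group-to-total variance ratio described above can be computed from scikit-learn's inertia_ (the within-cluster sum of squares), since for a k-means partition the total sum of squares equals the within-cluster plus the between-cluster sums of squares. A minimal sketch on stand-in data (the array X and the range of k are illustrative assumptions, not from the original article):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                       # stand-in data; replace with your own
total_ss = ((X - X.mean(axis=0)) ** 2).sum()     # total sum of squares

for k in range(1, 8):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    explained = 1 - km.inertia_ / total_ss       # between-group SS / total SS
    print(k, round(explained, 3))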
Explained variance. The "elbow" is indicated by the red circle. The number of clusters chosen should therefore be 4.

Related Concepts

ANOVA

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the "variation" among and between groups) used to analyze the differences among group means in a sample. ANOVA was developed by the statistician Ronald Fisher. ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means.

Principal component analysis (PCA)

Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest. PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i-th principal component can be taken as a direction orthogonal to the first (i-1) principal components that maximizes the variance of the projected data.

Python based Software/source code
% Matplotlib – the Python plotting library has a PCA package in its .mlab module.
% Scikit-learn – Python library for machine learning which contains PCA, Probabilistic PCA, Kernel PCA, Sparse PCA and other techniques in the decomposition module.

Reiterating... Determining the number of clusters in a data set

Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain class of clustering algorithms (in particular k-means, k-medoids and the expectation–maximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect. Other algorithms such as DBSCAN and the OPTICS algorithm do not require the specification of this parameter; hierarchical clustering avoids the problem altogether. The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.
The elbow method for clustering

The elbow method looks at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified, making this method very subjective and unreliable. Percentage of variance explained is the ratio of the between-group variance to the total variance; the closely related ratio of between-group to within-group variance is the F-test statistic. A slight variation of this method plots the curvature of the within-group variance.

The silhouette method (for clustering)

The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data instance is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighbouring cluster, i.e. the cluster whose average distance from the datum is lowest. A silhouette close to 1 implies the datum is in an appropriate cluster, while a silhouette close to −1 implies the datum is in the wrong cluster. Optimization techniques such as genetic algorithms are useful in determining the number of clusters that gives rise to the largest silhouette. It is also possible to re-scale the data in such a way that the silhouette is more likely to be maximised at the correct number of clusters.

Silhouette coefficient

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. We can compute the mean Silhouette Coefficient over all samples and use this as a metric to judge the number of clusters.

Ref 1: Elbow method (clustering)
Ref 2: F-test
Ref 3: Analysis of variance
Ref 4: Principal component analysis
Ref 5: Determining the number of clusters in a data set

Elbow Method for optimal value of k in KMeans (using 'Distortion' and 'Inertia', not explained variance)

A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k. We now define the following:

Distortion: It is calculated as the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.
Inertia: It is the sum of squared distances of samples to their closest cluster center (this is what scikit-learn exposes as the fitted model's inertia_ attribute). Note that the code below actually computes distortion as the mean of the unsquared Euclidean distances to the nearest center, while inertia sums the squared distances.
Ref 6: Determining the optimal number of clusters
Ref 7: Choosing the number of clusters (Coursera)

In code

from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Creating the data
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)

# Visualizing the data
plt.plot()
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('Dataset')
plt.scatter(x1, x2)
plt.show()
distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)

for k in K:
    # Building and fitting the model
    kmeanModel = KMeans(n_clusters=k).fit(X)

    # Distortion: mean distance of each point to its nearest cluster center
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
    # Inertia: sum of squared distances to the nearest cluster center
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0]
    mapping2[k] = kmeanModel.inertia_

for key, val in mapping1.items():
    print(str(key), ': ', str(val))

plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()
for key, val in mapping2.items():
    print(str(key), ': ', str(val))

plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
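The silhouette method described earlier can be evaluated over the same data using scikit-learn's silhouette_score. A minimal sketch (not part of the original code; note that the silhouette requires at least 2 clusters, so K starts at 2):

from sklearn.metrics import silhouette_score

sil_scores = []
K_sil = range(2, 10)                      # silhouette needs k >= 2
for k in K_sil:
    kmeanModel = KMeans(n_clusters=k).fit(X)
    sil_scores.append(silhouette_score(X, kmeanModel.labels_))

plt.plot(K_sil, sil_scores, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Mean silhouette coefficient')
plt.title('Silhouette scores (higher is better)')
plt.show()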
A note about np.array(), np.min() and "from scipy.spatial.distance import cdist"
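In short, cdist(X, centers, 'euclidean') builds the full point-to-center distance matrix, and np.min(..., axis=1) keeps each point's distance to its nearest center – which is what the distortion computation above relies on. A tiny illustrative sketch with toy arrays of my own (not from the post):

import numpy as np
from scipy.spatial.distance import cdist

pts = np.array([[0, 0], [1, 1], [5, 5]])
centers = np.array([[0, 0], [4, 4]])

d = cdist(pts, centers, 'euclidean')   # shape (3, 2): distance of every point to every center
print(d)
print(np.min(d, axis=1))               # distance of each point to its nearest center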

Elbow Method for kNN (classification problem)

How to select the optimal K value (the number of nearest neighbors)?
- Initialize a random K value and start computing.
- Choosing a small value of K leads to unstable decision boundaries.
- A larger K value is better for classification as it smoothens the decision boundaries.
- Plot the error rate against K over a defined range, then choose the K value that has the minimum error rate (see the sketch below).
- Instead of "error", one could also plot 'accuracy' against 'K'. With error, the curve generally decreases with K; with accuracy, it generally increases.
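A minimal sketch of the error-rate-vs-K plot described above, using scikit-learn's KNeighborsClassifier; the Iris data and the train/test split here are illustrative assumptions, not from the original post:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

k_values = range(1, 26)
error_rate = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate.append(1 - knn.score(X_test, y_test))   # misclassification rate on held-out data

plt.plot(k_values, error_rate, 'bx-')
plt.xlabel('K (number of nearest neighbors)')
plt.ylabel('Error rate')
plt.title('Error rate vs. K')
plt.show()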

Wednesday, August 26, 2020

Deploying Flask based 'Hello World' REST API on Heroku Cloud


Getting Started on Heroku with Python
Basic requirement:
- a free Heroku account
- Python version 3.7 installed locally - see the installation guides for OS X, Windows, and Linux.

- Heroku CLI requires Git 
You can get Git from here: git-scm

- For first-time Git setup: Getting-Started-First-Time-Git-Setup

The Heroku CLI is available for macOS, Windows and Linux.

You use the Heroku CLI to manage and scale your applications, provision add-ons, view your application logs, and run your application locally.

Once installed, you can use the heroku command from your command shell.
On Windows, start the Command Prompt (cmd.exe) or PowerShell to access the command shell.

Use the heroku login command to log in to the Heroku CLI:

(base) C:\Users\Ashish Jain>heroku login
heroku: Press any key to open up the browser to login or q to exit:
Opening browser to https://cli-auth.heroku.com/auth/cli/browser/716***J1k
heroku: Waiting for login... - 

Logging in... done
Logged in as a***@gmail.com

Create the app

Create an app on Heroku, which prepares Heroku to receive your source code. When you create an app, a git remote (called heroku) is also created and associated with your local git repository. Heroku generates a random name (in this case serene-caverns-82714) for your app, or you can pass a parameter to specify your own app name.

(base) C:\Users\Ashish Jain\OneDrive\Desktop>cd myapp

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp>dir
 Directory of C:\Users\Ashish Jain\OneDrive\Desktop\myapp
 0 File(s) 0 bytes

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp>heroku create
Creating app... done, ⬢ rocky-spire-96801
https://rocky-spire-96801.herokuapp.com/ | https://git.heroku.com/rocky-spire-96801.git

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp>git clone https://git.heroku.com/rocky-spire-96801.git
Cloning into 'rocky-spire-96801'...
warning: You appear to have cloned an empty repository.

Writing a Python Script file

We are at: C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801

We create a Python script: MyRESTAPIUsingPythonScript.py
It has the following code:

from flask import Flask, request
from flask_restful import Resource, Api
import os

app = Flask(__name__)
api = Api(app)

class Tracks(Resource):
    def get(self):
        result = "Hello World"
        return result

api.add_resource(Tracks, '/tracks') # URL Route

if __name__ == '__main__':
    port = int(os.environ.get('PORT', 5000))
    app.run(host='0.0.0.0', port=port)

In the code above: Heroku dynamically assigns your app a port, so we cannot set the port to a fixed number. Heroku adds the port to the environment, so we pull it from there.

Wrong Code 1

if __name__ == '__main__':
    app.run(port='5002')

Error Logs:
2020-08-26T16:20:58.493306+00:00 app[web.1]: * Running on http://127.0.0.1:5002/ (Press CTRL+C to quit)
...
2020-08-26T16:23:01.745361+00:00 heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch
2020-08-26T16:23:01.782641+00:00 heroku[web.1]: Stopping process with SIGKILL
2020-08-26T16:23:01.914043+00:00 heroku[web.1]: Process exited with status 137
2020-08-26T16:23:01.987508+00:00 heroku[web.1]: State changed from starting to crashed

Wrong Code 2

if __name__ == '__main__':
    app.run()

Error logs:
(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>heroku logs
...
2020-08-26T16:27:43.161519+00:00 app[web.1]: * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
2020-08-26T16:27:45.000000+00:00 app[api]: Build succeeded
2020-08-26T16:28:40.527069+00:00 heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch
2020-08-26T16:28:40.548232+00:00 heroku[web.1]: Stopping process with SIGKILL
2020-08-26T16:28:40.611066+00:00 heroku[web.1]: Process exited with status 137
2020-08-26T16:28:40.655930+00:00 heroku[web.1]: State changed from starting to crashed

Wrong Code 3

if __name__ == '__main__':
    app.run(host='0.0.0.0')

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>heroku logs
...
2020-08-26T16:35:36.792884+00:00 heroku[web.1]: Starting process with command `python MyRESTAPIUsingPythonScript.py`
2020-08-26T16:35:40.000000+00:00 app[api]: Build succeeded
2020-08-26T16:35:40.100687+00:00 app[web.1]: * Serving Flask app "MyRESTAPIUsingPythonScript" (lazy loading)
2020-08-26T16:35:40.100727+00:00 app[web.1]: * Environment: production
2020-08-26T16:35:40.100730+00:00 app[web.1]: WARNING: This is a development server. Do not use it in a production deployment.
2020-08-26T16:35:40.100738+00:00 app[web.1]: Use a production WSGI server instead.
2020-08-26T16:35:40.100767+00:00 app[web.1]: * Debug mode: off
2020-08-26T16:35:40.103621+00:00 app[web.1]: * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
2020-08-26T16:37:41.234182+00:00 heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch
2020-08-26T16:37:41.260167+00:00 heroku[web.1]: Stopping process with SIGKILL
2020-08-26T16:37:41.377892+00:00 heroku[web.1]: Process exited with status 137
2020-08-26T16:37:41.426917+00:00 heroku[web.1]: State changed from starting to crashed

About "git commit" logs

Every time we make changes and commit, Heroku knows which release this is. See below, it says "Released v6":

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>git add .
(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>git commit -m "1012"
[master a4975c0] 1012
 1 file changed, 3 insertions(+), 1 deletion(-)

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>git push
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 423 bytes | 423.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Compressing source files... done.
remote: Building source:
remote:
remote: -----> Python app detected
remote: -----> No change in requirements detected, installing from cache
remote: -----> Installing pip 20.1.1, setuptools 47.1.1 and wheel 0.34.2
remote: -----> Installing SQLite3
remote: -----> Installing requirements with pip
remote: -----> Discovering process types
remote:        Procfile declares types -> web
remote:
remote: -----> Compressing...
remote:        Done: 45.6M
remote: -----> Launching...
remote:        Released v6
remote:        https://rocky-spire-96801.herokuapp.com/ deployed to Heroku
remote:
remote: Verifying deploy... done.
To https://git.heroku.com/rocky-spire-96801.git
   0391d70..a4975c0  master -> master

Define a Procfile

We create a file "Procfile" at: C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801 Based on our code (which is a Python script to run a simple Flask based REST API), we write in Procfile: web: python MyRESTAPIUsingPythonScript.py Procfile naming and location The Procfile is always a simple text file that is named Procfile without a file extension. For example, Procfile.txt is not valid. The Procfile must live in your app’s root directory. It does not function if placed anywhere else. Procfile format A Procfile declares its process types on individual lines, each with the following format: [process type]: [command] [process type] is an alphanumeric name for your command, such as web, worker, urgentworker, clock, and so on. [command] indicates the command that every dyno of the process type should execute on startup, such as rake jobs:work. The "web" process type A Heroku app’s web process type is special: it’s the only process type that can receive external HTTP traffic from Heroku’s routers. If your app includes a web server, you should declare it as your app’s web process. For example, the Procfile for a Rails web app might include the following process type: web: bundle exec rails server -p $PORT In this case, every web dyno executes bundle exec rails server -p $PORT, which starts up a web server. A Clojure app’s web process type might look like this: web: lein run -m demo.web $PORT You can refer to your app’s config vars, most usefully $PORT, in the commands you specify. This might be the web process type for an executable Java JAR file, such as when using Spring Boot: web: java -jar target/myapp-1.0.0.jar More on Procfile here: devcenter.heroku Deploying to Heroku A Procfile is not technically required to deploy simple apps written in most Heroku-supported languages—the platform automatically detects the language and creates a default web process type to boot the application server. However, creating an explicit Procfile is recommended for greater control and flexibility over your app. For Heroku to use your Procfile, add the Procfile to the root directory of your application, then push to Heroku: (base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>dir Directory of C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801 08/26/2020 09:47 PM [DIR] . 08/26/2020 09:47 PM [DIR] .. 08/26/2020 09:40 PM 326 MyRESTAPIUsingPythonScript.py 08/26/2020 09:42 PM 41 Procfile 08/26/2020 09:33 PM 22 requirements.txt 3 File(s) 389 bytes 2 Dir(s) 65,828,458,496 bytes free (base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>git add . (base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>git commit -m "first commit" [master (root-commit) ff73728] first commit 3 files changed, 18 insertions(+) create mode 100644 MyRESTAPIUsingPythonScript.py create mode 100644 Procfile create mode 100644 requirements.txt As opposed to what appears on the Heroku documentation, we simply have to do "git push" now. Otherwise we see following errors: (base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>git push heroku master fatal: 'heroku' does not appear to be a git repository fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists. (base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>git push master fatal: 'master' does not appear to be a git repository fatal: Could not read from remote repository. 
Please make sure you have the correct access rights and the repository exists.

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>git push
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (5/5), 585 bytes | 292.00 KiB/s, done.
Total 5 (delta 0), reused 0 (delta 0), pack-reused 0
remote: Compressing source files... done.
remote: Building source:
remote:
remote: -----> Python app detected
remote: -----> Installing python-3.6.12
remote: -----> Installing pip 20.1.1, setuptools 47.1.1 and wheel 0.34.2
remote: -----> Installing SQLite3
remote: -----> Installing requirements with pip
remote:        Collecting flask
remote:          Downloading Flask-1.1.2-py2.py3-none-any.whl (94 kB)
remote:        Collecting flask_restful
remote:          Downloading Flask_RESTful-0.3.8-py2.py3-none-any.whl (25 kB)
remote:        Collecting click>=5.1
remote:          Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
remote:        Collecting Jinja2>=2.10.1
remote:          Downloading Jinja2-2.11.2-py2.py3-none-any.whl (125 kB)
remote:        Collecting Werkzeug>=0.15
remote:          Downloading Werkzeug-1.0.1-py2.py3-none-any.whl (298 kB)
remote:        Collecting itsdangerous>=0.24
remote:          Downloading itsdangerous-1.1.0-py2.py3-none-any.whl (16 kB)
remote:        Collecting pytz
remote:          Downloading pytz-2020.1-py2.py3-none-any.whl (510 kB)
remote:        Collecting aniso8601>=0.82
remote:          Downloading aniso8601-8.0.0-py2.py3-none-any.whl (43 kB)
remote:        Collecting six>=1.3.0
remote:          Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
remote:        Collecting MarkupSafe>=0.23
remote:          Downloading MarkupSafe-1.1.1-cp36-cp36m-manylinux1_x86_64.whl (27 kB)
remote:        Installing collected packages: click, MarkupSafe, Jinja2, Werkzeug, itsdangerous, flask, pytz, aniso8601, six, flask-restful
remote:        Successfully installed Jinja2-2.11.2 MarkupSafe-1.1.1 Werkzeug-1.0.1 aniso8601-8.0.0 click-7.1.2 flask-1.1.2 flask-restful-0.3.8 itsdangerous-1.1.0 pytz-2020.1 six-1.15.0
remote: -----> Discovering process types
remote:        Procfile declares types -> web
remote:
remote: -----> Compressing...
remote:        Done: 45.6M
remote: -----> Launching...
remote:        Released v3
remote:        https://rocky-spire-96801.herokuapp.com/ deployed to Heroku
remote:
remote: Verifying deploy... done.
To https://git.heroku.com/rocky-spire-96801.git
 * [new branch]      master -> master

Checking Heroku process status

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>heroku ps
Free dyno hours quota remaining this month: 550h 0m (100%)
Free dyno usage for this app: 0h 0m (0%)
For more information on dyno sleeping and how to upgrade, see:
https://devcenter.heroku.com/articles/dyno-sleeping

=== web (Free): python MyRESTAPIUsingPythonScript.py (1)
web.1: restarting 2020/08/26 21:51:56 +0530 (~ 41s ago)

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>heroku ps
Free dyno hours quota remaining this month: 550h 0m (100%)
Free dyno usage for this app: 0h 0m (0%)
For more information on dyno sleeping and how to upgrade, see:
https://devcenter.heroku.com/articles/dyno-sleeping

=== web (Free): python MyRESTAPIUsingPythonScript.py (1)
web.1: up 2020/08/26 22:28:06 +0530 (~ 12m ago)

Check Logs

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>heroku logs
2020-08-26T15:55:49.855840+00:00 app[api]: Initial release by user a***@gmail.com
2020-08-26T15:55:49.855840+00:00 app[api]: Release v1 created by user a***@gmail.com
2020-08-26T15:55:49.992678+00:00 app[api]: Enable Logplex by user a***@gmail.com
2020-08-26T15:55:49.992678+00:00 app[api]: Release v2 created by user a***@gmail.com
2020-08-26T16:20:26.000000+00:00 app[api]: Build started by user a***@gmail.com
2020-08-26T16:20:51.873133+00:00 app[api]: Release v3 created by user a***@gmail.com
2020-08-26T16:20:51.873133+00:00 app[api]: Deploy ff73728d by user a***@gmail.com
2020-08-26T16:20:51.891792+00:00 app[api]: Scaled to web@1:Free by user a***@gmail.com
2020-08-26T16:20:55.659055+00:00 heroku[web.1]: Starting process with command `python MyRESTAPIUsingPythonScript.py`
2020-08-26T16:20:58.489161+00:00 app[web.1]: * Serving Flask app "MyRESTAPIUsingPythonScript" (lazy loading)
2020-08-26T16:20:58.489192+00:00 app[web.1]: * Environment: production
2020-08-26T16:20:58.489257+00:00 app[web.1]: WARNING: This is a development server. Do not use it in a production deployment.
2020-08-26T16:20:58.489350+00:00 app[web.1]: Use a production WSGI server instead.
2020-08-26T16:20:58.489393+00:00 app[web.1]: * Debug mode: off
2020-08-26T16:20:58.493306+00:00 app[web.1]: * Running on http://127.0.0.1:5002/ (Press CTRL+C to quit)
2020-08-26T16:21:00.000000+00:00 app[api]: Build succeeded
2020-08-26T16:28:40.527069+00:00 heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch
2020-08-26T16:28:40.548232+00:00 heroku[web.1]: Stopping process with SIGKILL
2020-08-26T16:28:40.611066+00:00 heroku[web.1]: Process exited with status 137
2020-08-26T16:28:40.655930+00:00 heroku[web.1]: State changed from starting to crashed
...
2020-08-26T16:43:04.803725+00:00 heroku[web.1]: State changed from crashed to starting
2020-08-26T16:43:07.586143+00:00 heroku[web.1]: Starting process with command `python MyRESTAPIUsingPythonScript.py`
2020-08-26T16:43:09.742529+00:00 app[web.1]: * Serving Flask app "MyRESTAPIUsingPythonScript" (lazy loading)
2020-08-26T16:43:09.742547+00:00 app[web.1]: * Environment: production
2020-08-26T16:43:09.742586+00:00 app[web.1]: WARNING: This is a development server. Do not use it in a production deployment.
2020-08-26T16:43:09.742625+00:00 app[web.1]: Use a production WSGI server instead.
2020-08-26T16:43:09.742662+00:00 app[web.1]: * Debug mode: off
2020-08-26T16:43:09.745322+00:00 app[web.1]: * Running on http://0.0.0.0:32410/ (Press CTRL+C to quit)
2020-08-26T16:43:09.847177+00:00 heroku[web.1]: State changed from starting to up
2020-08-26T16:43:12.000000+00:00 app[api]: Build succeeded

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>heroku open

In Firefox browser with URL: https://rocky-spire-96801.herokuapp.com/
Not Found
The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.

In Firefox at URL: https://rocky-spire-96801.herokuapp.com/tracks
In Chrome at URL: https://rocky-spire-96801.herokuapp.com/tracks
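The /tracks endpoint can also be checked programmatically instead of through the browser. A small sketch using the requests library (my addition, not part of the original walkthrough):

import requests

# Query the deployed Flask-RESTful resource; expect HTTP 200 and the string "Hello World"
resp = requests.get("https://rocky-spire-96801.herokuapp.com/tracks")
print(resp.status_code)
print(resp.json())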
Logout

(base) C:\Users\Ashish Jain\OneDrive\Desktop\myapp\rocky-spire-96801>heroku logout
Logging out... done

Tuesday, August 25, 2020

Working with 'dir' command on Windows CMD prompt


# Finding a file/folder with a string in its name.

Note: /s Lists every occurrence of the specified file name within the specified directory and all subdirectories.

Exploring "dir" documentation

C:\Users\Ashish Jain>help dir
Displays a list of files and subdirectories in a directory.

DIR [drive:][path][filename] [/A[[:]attributes]] [/B] [/C] [/D] [/L] [/N] [/O[[:]sortorder]] [/P] [/Q] [/R] [/S] [/T[[:]timefield]] [/W] [/X] [/4]

  [drive:][path][filename]
              Specifies drive, directory, and/or files to list.
  /A          Displays files with specified attributes.
  attributes   D  Directories                R  Read-only files
               H  Hidden files               A  Files ready for archiving
               S  System files               I  Not content indexed files
               L  Reparse Points             -  Prefix meaning not
  /B          Uses bare format (no heading information or summary).
  /C          Display the thousand separator in file sizes. This is the default. Use /-C to disable display of separator.
  /D          Same as wide but files are list sorted by column.
  /L          Uses lowercase.
  /N          New long list format where filenames are on the far right.
  /O          List by files in sorted order.
  sortorder    N  By name (alphabetic)       S  By size (smallest first)
               E  By extension (alphabetic)  D  By date/time (oldest first)
               G  Group directories first    -  Prefix to reverse order
  /P          Pauses after each screenful of information.
  /Q          Display the owner of the file.
  /R          Display alternate data streams of the file.
  /S          Displays files in specified directory and all subdirectories.
  /T          Controls which time field displayed or used for sorting
  timefield    C  Creation
               A  Last Access
               W  Last Written
  /W          Uses wide list format.
  /X          This displays the short names generated for non-8dot3 file names. The format is that of /N with the short name inserted before the long name. If no short name is present, blanks are displayed in its place.
  /4          Displays four-digit years

Switches may be preset in the DIRCMD environment variable. Override preset switches by prefixing any switch with - (hyphen)--for example, /-W.

--- --- --- --- ---

# You can include files in the current or named directory plus all of its accessible subdirectories by using the /S option. This example displays all of the .WKS and .WK1 files in the D:\DATA directory and each of its subdirectories:

dir /s d:\data\*.wks;*.wk1

--- --- --- --- ---

# Look for text files in the D: drive containing the letters 'ACC' in a case-insensitive manner.

dir /s D:\*ACC*.txt

OUTPUT:
 Directory of D:\Downloads\rw\jakarta-tomcat-8.0.35\logs
30-Dec-16  02:05 PM            61,549 localhost_access_log.2016-10-06.txt
...
 Directory of D:\Work Space\rw_new\temp\iTAP\licenses
27-Jan-16  07:46 PM             1,536 javacc-license.txt
               1 File(s)          1,536 bytes

--- --- --- --- ---

We have the following directory structure in a "test" folder:

C:\Users\Ashish Jain\OneDrive\Desktop\test>tree /f
Folder PATH listing for volume Windows
Volume serial number is 8139-90C0
C:.
│   3.txt
│
├───1
│   └───a
│           file.txt
│
└───2
        file_2.txt

1. List everything in this directory:

C:\Users\Ashish Jain\OneDrive\Desktop\test>dir /s/b
C:\Users\Ashish Jain\OneDrive\Desktop\test\1
C:\Users\Ashish Jain\OneDrive\Desktop\test\2
C:\Users\Ashish Jain\OneDrive\Desktop\test\3.txt
C:\Users\Ashish Jain\OneDrive\Desktop\test\1\a
C:\Users\Ashish Jain\OneDrive\Desktop\test\1\a\file.txt
C:\Users\Ashish Jain\OneDrive\Desktop\test\2\file_2.txt

2. List subdirectories of this directory:

C:\Users\Ashish Jain\OneDrive\Desktop\test>dir /s/b /A:D
C:\Users\Ashish Jain\OneDrive\Desktop\test\1
C:\Users\Ashish Jain\OneDrive\Desktop\test\2
C:\Users\Ashish Jain\OneDrive\Desktop\test\1\a

3.
List files in this directory and subdirectories:

C:\Users\Ashish Jain\OneDrive\Desktop\test>dir /s/b /A:-D
C:\Users\Ashish Jain\OneDrive\Desktop\test\3.txt
C:\Users\Ashish Jain\OneDrive\Desktop\test\1\a\file.txt
C:\Users\Ashish Jain\OneDrive\Desktop\test\2\file_2.txt

Explanation for 'dir /A:D':

D:\>dir /?
Displays a list of files and subdirectories in a directory.

DIR [drive:][path][filename] [/A[[:]attributes]] [/B] [/C] [/D] [/L] [/N] [/O[[:]sortorder]] [/P] [/Q] [/R] [/S] [/T[[:]timefield]] [/W] [/X] [/4]

  [drive:][path][filename]
              Specifies drive, directory, and/or files to list.
  /A          Displays files with specified attributes.
  attributes   D  Directories                R  Read-only files
               H  Hidden files               A  Files ready for archiving
               S  System files               I  Not content indexed files
               L  Reparse Points             -  Prefix meaning not

Another way of listing only subdirectories:

C:\Users\Ashish Jain\OneDrive\Desktop\test>dir /s | find "\"
 Directory of C:\Users\Ashish Jain\OneDrive\Desktop\test
 Directory of C:\Users\Ashish Jain\OneDrive\Desktop\test\1
 Directory of C:\Users\Ashish Jain\OneDrive\Desktop\test\1\a
 Directory of C:\Users\Ashish Jain\OneDrive\Desktop\test\2
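For comparison, the earlier case-insensitive recursive search (dir /s D:\*ACC*.txt) can be approximated in Python with pathlib; an illustrative sketch, not part of the original post:

from pathlib import Path

# Rough Python equivalent of: dir /s D:\*ACC*.txt (case-insensitive match on the file name)
root = Path("D:/")
matches = [p for p in root.rglob("*.txt") if "acc" in p.name.lower()]
for p in matches:
    print(p)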

Monday, August 24, 2020

Sentiment Analysis Books (Aug 2020)


Google Search String: "sentiment analysis books" 1. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions Book by Bing Liu Originally published: 28 May 2015 Author: Bing Liu Genre: Reference work 2. Sentiment Analysis and Opinion Mining Book by Bing Liu Originally published: 2012 Author: Bing Liu 3. A Practical Guide to Sentiment Analysis Book Originally published: 7 April 2017 Erik Cambria, Dipankar Das, Sivaji Bandyopadhyay, Antonio Feraco (eds.) Springer International Publishing 4. Sentiment Analysis in Social Networks Book Originally published: 30 September 2016 5. Opinion Mining and Sentiment Analysis Book by Bo Pang and Lillian Lee Originally published: 2008 Authors: Bo Pang, Lillian Lee 6. Deep Learning-Based Approaches for Sentiment Analysis Book Originally published: 24 January 2020 7. Text Mining with R: A Tidy Approach Book by David Robinson and Julia Silge Originally published: 2017 8. Advanced Positioning, Flow, and Sentiment Analysis in Commodity Markets: Bridging Fundamental and Technical Analysis Book by Mark J. S. Keenan Originally published: 20 December 2019 9. Prominent Feature Extraction for Sentiment Analysis Book by Basant Agarwal and Namita Mittal Originally published: 14 December 2015 10. Handbook of Sentiment Analysis in Finance Book Originally published: 2016 11. Sentiment Analysis and Knowledge Discovery in Contemporary Business Book Originally published: 25 May 2018 12. Visual and Text Sentiment Analysis Through Hierarchical Deep Learning Networks Book by Arindam Chaudhuri Originally published: 6 April 2019 Author: Arindam Chaudhuri 13. Affective Computing and Sentiment Analysis: Emotion, Metaphor and Terminology Book Originally published: 2011 Editor: Khurshid Ahmad 14. Sentiment Analysis and Ontology Engineering: An Environment of Computational Intelligence Book Originally published: 22 March 2016 15. Advances in Social Networking-based Learning: Machine Learning-based User Modelling and Sentiment Analysis Book by Christos Troussas and Maria Virvou Originally published: 20 January 2020 16. Machine Learning: An overview with the help of R software Book by Editor Ijsmi Originally published: 20 November 2018 17. People Analytics & Text Mining with R Book by Mong Shen Ng Originally published: 21 March 2019 18. The Successful Trader Foundation: How To Become The 1% Successful ... Book by Thang Duc Chu Originally published: 18 July 2019 19. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data Book by Bing Liu Originally published: 30 May 2007 20. Deep Learning in Natural Language Processing Book Originally published: 23 May 2018 Li Deng, Yang Liu Springer 21. Multimodal Sentiment Analysis Novel by Amir Hussain, Erik Cambria, and Soujanya Poria Originally published: 24 October 2018 22. Sentic Computing: Techniques, Tools, and Applications Novel by Amir Hussain and Erik Cambria Originally published: 28 July 2012 23. Semantic Sentiment Analysis in Social Streams Book by Hassan Saif Originally published: 2017 Author: Hassan Saif Genre: Dissertation 24. Trading on Sentiment: The Power of Minds Over Markets Book by Richard L. Peterson Originally published: 2 March 2016 25. Natural Language Processing with Python Book by Edward Loper, Ewan Klein, and Steven Bird Originally published: June 2009 26. Sentiment Analysis Book by BLOKDYK. GERARDUS Originally published: May 2018 27. 
Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning Book by Benjamin Bengfort, Rebecca Bilbro, and Tony Ojeda Originally published: 2018 28. Sentiment Analysis in the Bio-Medical Domain: Techniques, Tools, and Applications Book by Amir Hussain, Erik Cambria, and Ranjan Satapathy Originally published: 23 January 2018 29. Sentiment in the Forex Market: Indicators and Strategies To Profit from Crowd Behavior and Market Extremes Book by Jamie Saettele Originally published: 2008 30. Big Data Analytics with Java Book by Rajat Mehta Originally published: 28 July 2017 31. Sentiment Analysis for Social Media Book Originally published: 2 April 2020 32. Applying Sentiment Analysis for Tweets Linking to Scientific Papers Book by Natalie Friedrich Originally published: 21 December 2015 33. A Survey of Sentiment Analysis Book by Moritz Platt Originally published: May 2014 34. Textual Classification for Sentiment Detection. Brand Reputation Analysis on the ... Book by Mike Nkongolo Originally published: 10 April 2018 35. Company Fit: A Decision Support Tool Based on Feature Level Sentiment ... Book by Akshi Kumar Originally published: 30 August 2017 36. KNN Classifier Based Approach for Multi-Class Sentiment Analysis of Twitter Data Book by Soudamini Hota and Sudhir Pathak Originally published: 18 October 2017 37. A Classification Technique for Sentiment Analysis in Data Mining Book Originally published: 13 September 2017 38. Exploration of Competitive Market Behavior Using Near-Real-Time Sentiment Analysis Book by Norman Peitek Originally published: 30 December 2014 39. Sentiment Analysis for PTSD Signals Book by Demetrios Sapounas, Edward Rossini, and Vadim Kagan Originally published: 25 October 2013 40. Lifelong Machine Learning: Second Edition Book by Bing Liu and Zhiyuan Chen Originally published: 7 November 2016 41. Handbook of Natural Language Processing Book Originally published: 19 February 2010 42. Neural Network Methods in Natural Language Processing Book by Yoav Goldberg Originally published: 2017 43. The General Inquirer: A Computer Approach to Content Analysis Book by Philip James Stone Originally published: 1966 44. Plutchik, Robert (1980), Emotion: Theory, research, and experience: Vol. 1. Theories of emotion, 1, New York: Academic 45. Foundations of Statistical Natural Language Processing Book by Christopher D. Manning and Hinrich Schütze Originally published: 1999 46. Sentiment Analysis: Quick Reference Book by BLOKDYK. GERARDUS Originally published: 14 January 2018 47. Intelligent Asset Management Book by Erik Cambria, Frank Xing, and Roy Welsch Originally published: 13 November 2019 48. Sentic Computing: A Common-Sense-Based Framework for Concept-Level Sentiment Analysis Book by Amir Hussain and Erik Cambria Originally published: 11 December 2015 49. Computational Linguistics and Intelligent Text Processing: 18th International Conference, CICLing 2017, Budapest, Hungary, April 17–23, 2017, Revised Selected Papers, Part II Book Originally published: 9 October 2018 50. The SenticNet Sentiment Lexicon: Exploring Semantic Richness in Multi-Word Concepts Book by Raoul Biagioni Originally published: 28 May 2016

Sunday, August 23, 2020

Compare Two Files Using 'git diff'


Note: Our current directory is not in a Git repository. We have two files, "file_1.txt" and "file_2.txt".

"file_1.txt" has content:
Hermione is a good girl.
Hermione is a bad girl.
Hermione is a very good girl.

"file_2.txt" has content:
Hermione is a good girl.
No, Hermione is not a bad girl.
Hermione is a very good girl.

The color coding below is as it appears in the Windows CMD prompt:

C:\Users\Ashish Jain\OneDrive\Desktop>git diff --no-index file_1.txt file_2.txt
diff --git a/file_1.txt b/file_2.txt
index fc04cd5..52bdfd9 100644
--- a/file_1.txt
+++ b/file_2.txt
@@ -1,3 +1,3 @@
 Hermione is a good girl.
-Hermione is a bad girl.
+No, Hermione is not a bad girl.
 Hermione is a very good girl.
\ No newline at end of file

C:\Users\Ashish Jain\OneDrive\Desktop>git diff --no-index file_2.txt file_1.txt
diff --git a/file_2.txt b/file_1.txt
index 52bdfd9..fc04cd5 100644
--- a/file_2.txt
+++ b/file_1.txt
@@ -1,3 +1,3 @@
 Hermione is a good girl.
-No, Hermione is not a bad girl.
+Hermione is a bad girl.
 Hermione is a very good girl.
\ No newline at end of file

The output of "git diff" is with respect to the first file. The output of "git diff file_1.txt file_2.txt" is read as: "What changed in file_1 as we move from file_1 to file_2".
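The same comparison can be reproduced in pure Python with the standard library's difflib, whose unified output mirrors git's '-' / '+' convention; a minimal sketch (my addition, not from the original post):

import difflib

with open("file_1.txt") as f1, open("file_2.txt") as f2:
    a, b = f1.readlines(), f2.readlines()

# '-' lines come from file_1, '+' lines from file_2, just like git diff file_1.txt file_2.txt
for line in difflib.unified_diff(a, b, fromfile="file_1.txt", tofile="file_2.txt"):
    print(line, end="")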

Saturday, August 22, 2020

Using Snorkel to create test data and classifying using Scikit-Learn


The data set we have is the "Iris" dataset. We will augment the dataset to create a "test" dataset and then use Scikit-Learn's Support Vector Machine classifier 'SVC' to classify the test points into one of the Iris species.

import pandas as pd
import numpy as np

from snorkel.augmentation import transformation_function
from snorkel.augmentation import RandomPolicy
from snorkel.augmentation import PandasTFApplier

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

df = pd.read_csv('files_1/datasets_19_420_Iris.csv')

for i in set(df.Species):
    # Other columns are ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    print(i)
    print(df[df.Species == i].describe().loc[['min', '25%', '50%', '75%', 'max'], :])
	
features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
classes = ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor']

desc_dict = {}
for i in classes:
    desc_dict[i] = df[df.Species == i].describe()

df['Train'] = 'Train'

# np.random.randint returns a random integer N such that low <= N < high
@transformation_function(pre = [])
def get_new_instance_for_this_class(x):
    x.SepalLengthCm = np.random.randint(round(desc_dict[x.Species].loc[['25%'], ['SepalLengthCm']].iloc[0,0], 2) * 100,
                                        round(desc_dict[x.Species].loc[['75%'], ['SepalLengthCm']].iloc[0,0], 2) * 100) / 100
    x.SepalWidthCm = np.random.randint(round(desc_dict[x.Species].loc[['25%'], ['SepalWidthCm']].iloc[0,0], 2) * 100,
                                       round(desc_dict[x.Species].loc[['75%'], ['SepalWidthCm']].iloc[0,0], 2) * 100) / 100
    x.PetalLengthCm = np.random.randint(round(desc_dict[x.Species].loc[['25%'], ['PetalLengthCm']].iloc[0,0], 2) * 100,
                                        round(desc_dict[x.Species].loc[['75%'], ['PetalLengthCm']].iloc[0,0], 2) * 100) / 100
    x.PetalWidthCm = np.random.randint(round(desc_dict[x.Species].loc[['25%'], ['PetalWidthCm']].iloc[0,0], 2) * 100,
                                       round(desc_dict[x.Species].loc[['75%'], ['PetalWidthCm']].iloc[0,0], 2) * 100) / 100
    x.Train = 'Test'
    return x

tfs = [ get_new_instance_for_this_class ]

random_policy = RandomPolicy(
    len(tfs), sequence_length=2, n_per_original=1, keep_original=True
)

tf_applier = PandasTFApplier(tfs, random_policy)
df_train_augmented = tf_applier.apply(df)

print(f"Original training set size: {len(df)}")
print(f"Augmented training set size: {len(df_train_augmented)}")
Output:
Original training set size: 150
Augmented training set size: 300

df_test = df_train_augmented[df_train_augmented.Train == 'Test']

clf = svm.SVC(gamma = 'auto')
clf.fit(df[features], df['Species'])

Output:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

pred = clf.predict(df_test[features])
print("Accuracy: {:.3f}".format(accuracy_score(df_test['Species'], pred)))
print("Confusion matrix:\n{}".format(confusion_matrix(df_test['Species'], pred)))
To confirm that we do not have an overlap in training and testing data:

left = df[features]
right = df_test[features]
print(left.merge(right, on = features, how = 'inner').shape)

Output: (0, 4)

left = df[['Id']]
right = df_test[['Id']]
print(left.merge(right, on = ['Id'], how = 'inner').shape)

Output: (150, 1)

The first merge (on the four feature columns) returns 0 rows, so no augmented test point duplicates a training point feature-for-feature. The second merge (on 'Id') returns 150 rows because each augmented row keeps the Id of the original row it was generated from.

Friday, August 21, 2020

Using Snorkel, SpaCy to augment text data


We have some data that looks like this in a file "names.csv":

names,text
Harry Potter,Harry Potter is the protagonist.
Ronald Weasley,Ronald Weasley is the chess expert.
Hermione Granger,Hermione is the super witch.
Hermione Granger,Hermione Granger weds Ron.

We augment this data by replacing the names in the "text" column with new, randomly selected names. For this we write Python code as given below:

import pandas as pd
from collections import OrderedDict
import numpy as np
import names

from snorkel.augmentation import transformation_function
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

df = pd.read_csv('names.csv', encoding='cp1252')
print(df.head())
print()

# Pregenerate some random person names to replace existing ones with for the transformation strategies below
replacement_names = [names.get_full_name() for _ in range(50)]

# Replace a random named entity with a different entity of the same type.
@transformation_function(pre=[spacy])
def change_person(x):
    person_names = [ent.text for ent in x.doc.ents if ent.label_ == "PERSON"]
    # If there is at least one person name, replace a random one. Else return None.
    if person_names:
        name_to_replace = np.random.choice(person_names)
        replacement_name = np.random.choice(replacement_names)
        x.text = x.text.replace(name_to_replace, replacement_name)
        return x

tfs = [ change_person ]

from snorkel.augmentation import RandomPolicy
random_policy = RandomPolicy(
    len(tfs), sequence_length=2, n_per_original=1, keep_original=True
)

from snorkel.augmentation import PandasTFApplier
tf_applier = PandasTFApplier(tfs, random_policy)
df_train_augmented = tf_applier.apply(df)

print(f"Original training set size: {len(df)}")
print(f"Augmented training set size: {len(df_train_augmented)}")
print(df_train_augmented)

print("\nDebugging for 'Hermione':\n")

import spacy
nlp = spacy.load('en_core_web_sm')

def format_str(str, max_len = 25):
    str = str + " " * max_len
    return str[:max_len]

for i, row in df.iterrows():
    doc = nlp(row.text)
    for ent in doc.ents:
        # print(ent.text, ent.start_char, ent.end_char, ent.label_)
        print(format_str(ent.text), ent.label_)

The Snorkel version we are running is:

(temp) E:\>conda list snorkel
# packages in environment at E:\programfiles\Anaconda3\envs\temp:
#
# Name                    Version                   Build  Channel
snorkel                   0.9.3                      py_0    conda-forge

Now, we run it in "Anaconda Prompt":

(temp) E:\>python script.py
              names                                 text
0      Harry Potter     Harry Potter is the protagonist.
1    Ronald Weasley  Ronald Weasley is the chess expert.
2  Hermione Granger         Hermione is the super witch.
3  Hermione Granger           Hermione Granger weds Ron.
100%|██████████| 4/4 [00:00<00:00, 34.58it/s]
Original training set size: 4
Augmented training set size: 7
              names                                  text
0      Harry Potter      Harry Potter is the protagonist.
0      Harry Potter   Donald Gregoire is the protagonist.
1    Ronald Weasley   Ronald Weasley is the chess expert.
1    Ronald Weasley        John Hill is the chess expert.
2  Hermione Granger          Hermione is the super witch.
3  Hermione Granger            Hermione Granger weds Ron.
3  Hermione Granger           Jonathan Humphrey weds Ron.

Debugging for 'Hermione':

Harry Potter              PERSON
Ronald Weasley            PERSON
Hermione                  ORG
Hermione Granger          PERSON
Ron                       PERSON

There is an error with the name "Hermione" (the row highlighted in red in the original post). Upon debugging we see that it is recognized as an 'Organization' and not a 'Person', so the transformation function never replaces it.
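One possible workaround for the mislabelled "Hermione" (my assumption, not something the original post implements) is to loosen the entity filter so that entities tagged ORG are also eligible for replacement. This variant reuses the SpacyPreprocessor bound to the name spacy earlier in the script (i.e., it must be defined before the later "import spacy" line):

# Naive variant of change_person that also accepts ORG entities;
# it may occasionally replace genuine organisation names as well.
@transformation_function(pre=[spacy])
def change_person_or_org(x):
    candidate_names = [ent.text for ent in x.doc.ents if ent.label_ in ("PERSON", "ORG")]
    if candidate_names:
        name_to_replace = np.random.choice(candidate_names)
        replacement_name = np.random.choice(replacement_names)
        x.text = x.text.replace(name_to_replace, replacement_name)
        return x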

Thursday, August 20, 2020

Using Conda to install and manage packages through YAML file and installing kernel



We are at path: E:\exp_snorkel\
File we are writing: env.yml

CONTENTS OF THE YAML FILE FOR THE ENVIRONMENT:

name: temp
channels:
  - conda-forge
dependencies:
  - pip
  - ca-certificates=2020.6.24=0
  - matplotlib=3.1.1
  - pip:
    - names==0.3.0
  - nltk=3.4.5
  - numpy>=1.16.0,<1.17.0
  - pandas>=0.24.0,<0.25.0
  - scikit-learn>=0.20.2
  - spacy>=2.1.6,<2.2.0
  - tensorflow=1.14.0
  - textblob=0.15.3
prefix: E:\programfiles\Anaconda3\envs\temp

WORKING IN THE CONDA SHELL:

(base) E:\exp_snorkel>conda remove -n temp --all

(base) E:\exp_snorkel>conda env create -f env.yml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): done
Solving environment: done
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Installing pip dependencies: | Ran pip subprocess with arguments:
['E:\\programfiles\\Anaconda3\\envs\\temp\\python.exe', '-m', 'pip', 'install', '-U', '-r', 'E:\\exp_snorkel\\condaenv.m7xf9r2n.requirements.txt']
Pip subprocess output:
Collecting names==0.3.0
  Using cached names-0.3.0.tar.gz (789 kB)
Building wheels for collected packages: names
  Building wheel for names (setup.py): started
  Building wheel for names (setup.py): finished with status 'done'
  Created wheel for names: filename=names-0.3.0-py3-none-any.whl size=803694 sha256=318443a7ae55ef0d24be16b374517a61494a29b2a0f0f07d164ab1ef058efb0a
  Stored in directory: c:\users\ashish jain\appdata\local\pip\cache\wheels\05\ea\68\92f6b0669e478af9b7c3c524520d03050089e034edcc775c2b
Successfully built names
Installing collected packages: names
Successfully installed names-0.3.0

done
#
# To activate this environment, use
#
#     $ conda activate temp
#
# To deactivate an active environment, use
#
#     $ conda deactivate

MANAGING A KERNEL:

1. Create a kernel:

(base) E:\exp_snorkel>conda activate temp
(temp) E:\exp_snorkel>pip install ipykernel jupyter
(temp) E:\exp_snorkel>python -m ipykernel install --user --name temp

2. To remove a kernel from Jupyter Notebook (kernel name is "temp"):

(base) E:\exp_snorkel>jupyter kernelspec uninstall temp

3. To view all installed kernels:

(base) E:\exp_snorkel>jupyter kernelspec list
Available kernels:
  temp       C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\temp
  python3    E:\programfiles\Anaconda3\share\jupyter\kernels\python3

Ref: docs.conda.io
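To confirm from inside a notebook that the "temp" kernel really points at the new environment, a quick check (my suggestion, not part of the original steps):

import sys

print(sys.executable)   # expect something like E:\programfiles\Anaconda3\envs\temp\python.exe
print(sys.version)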

Wednesday, August 19, 2020

Technology Listing related to web application security (Aug 2020)


1. Internet Message Access Protocol

In computing, the Internet Message Access Protocol (IMAP) is an Internet standard protocol used by email clients to retrieve email messages from a mail server over a TCP/IP connection. IMAP is defined by RFC 3501. IMAP was designed with the goal of permitting complete management of an email box by multiple email clients, therefore clients generally leave messages on the server until the user explicitly deletes them. An IMAP server typically listens on port number 143. IMAP over SSL (IMAPS) is assigned the port number 993. Virtually all modern e-mail clients and servers support IMAP, which along with the earlier POP3 (Post Office Protocol) are the two most prevalent standard protocols for email retrieval. Many webmail service providers such as Gmail, Outlook.com and Yahoo! Mail also provide support for both IMAP and POP3.

Email protocols

The Internet Message Access Protocol is an Application Layer Internet protocol that allows an e-mail client to access email on a remote mail server. The current version is defined by RFC 3501. An IMAP server typically listens on well-known port 143, while IMAP over SSL (IMAPS) uses 993. Incoming email messages are sent to an email server that stores messages in the recipient's email box. The user retrieves the messages with an email client that uses one of a number of email retrieval protocols. While some clients and servers preferentially use vendor-specific, proprietary protocols, almost all support POP and IMAP for retrieving email – giving users a free choice between many e-mail clients such as Pegasus Mail or Mozilla Thunderbird to access these servers, and allowing the clients to be used with other servers. Email clients using IMAP generally leave messages on the server until the user explicitly deletes them. This and other characteristics of IMAP operation allow multiple clients to manage the same mailbox. Most email clients support IMAP in addition to Post Office Protocol (POP) to retrieve messages. IMAP offers access to the mail storage. Clients may store local copies of the messages, but these are considered to be a temporary cache.

Ref: Wikipedia - IMAP
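For a concrete feel of the protocol, Python's standard-library imaplib can talk to an IMAPS server on port 993; a minimal sketch with placeholder host and credentials (not from the original article):

import imaplib

conn = imaplib.IMAP4_SSL("imap.example.com", 993)   # IMAPS on port 993
conn.login("user@example.com", "app-password")
conn.select("INBOX")
status, data = conn.search(None, "UNSEEN")           # sequence numbers of unread messages
print(status, data[0].split())
conn.logout()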

2. Kerberos (krb5)

Kerberos (/ˈkɜːrbərɒs/) is a computer-network authentication protocol that works on the basis of tickets to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner. The protocol was named after the character Kerberos (or Cerberus) from Greek mythology, the ferocious three-headed guard dog of Hades. Its designers aimed it primarily at a client–server model and it provides mutual authentication—both the user and the server verify each other's identity. Kerberos protocol messages are protected against eavesdropping and replay attacks. Kerberos builds on symmetric key cryptography and requires a trusted third party, and optionally may use public-key cryptography during certain phases of authentication. Kerberos uses UDP port 88 by default. Ref: Kerberos

3. Securing Java Enterprise Apps (Spring Security)

Spring Security is a Java/Java EE framework that provides authentication, authorization and other security features for enterprise applications.

Key authentication features:
% LDAP (using both bind-based and password comparison strategies) for centralization of authentication information.
% Single sign-on capabilities using the popular Central Authentication Service.
% Java Authentication and Authorization Service (JAAS) LoginModule, a standards-based method for authentication used within Java. Note this feature is only a delegation to a JAAS LoginModule.
% Basic access authentication as defined through RFC 1945.
% Digest access authentication as defined through RFC 2617 and RFC 2069.
% X.509 client certificate presentation over the Secure Sockets Layer standard.
% CA, Inc SiteMinder for authentication (a popular commercial access management product).
% Su (Unix)-like support for switching principal identity over a HTTP or HTTPS connection.
% Run-as replacement, which enables an operation to assume a different security identity.
% Anonymous authentication, which means that even unauthenticated principals are allocated a security identity.
% Container adapter (custom realm) support for Apache Tomcat, Resin, JBoss and Jetty (web server).
% Windows NTLM to enable browser integration (experimental).
% Web form authentication, similar to the servlet container specification.
% "Remember-me" support via HTTP cookies.
% Concurrent session support, which limits the number of simultaneous logins permitted by a principal.
% Full support for customization and plugging in custom authentication implementations.

Ref: Spring Security

4. LDAP (Lightweight Directory Access Protocol)

The Lightweight Directory Access Protocol (LDAP /ˈɛldæp/) is an open, vendor-neutral, industry standard application protocol for accessing and maintaining distributed directory information services over an Internet Protocol (IP) network. Directory services play an important role in developing intranet and Internet applications by allowing the sharing of information about users, systems, networks, services, and applications throughout the network. As examples, directory services may provide any organized set of records, often with a hierarchical structure, such as a corporate email directory. Similarly, a telephone directory is a list of subscribers with an address and a phone number. LDAP is specified in a series of Internet Engineering Task Force (IETF) Standard Track publications called Request for Comments (RFCs), using the description language ASN.1. The latest specification is Version 3, published as RFC 4511 (a road map to the technical specifications is provided by RFC4510). A common use of LDAP is to provide a central place to store usernames and passwords. This allows many different applications and services to connect to the LDAP server to validate users. LDAP is based on a simpler subset of the standards contained within the X.500 standard. Because of this relationship, LDAP is sometimes called X.500-lite. Ref: LDAP (Lightweight Directory Access Protocol)
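As an illustration of the "central place to validate users" use case described above, the third-party ldap3 package can bind and search against a directory. A rough sketch with placeholder host, base DN and credentials (my addition, not from the original article):

from ldap3 import Server, Connection, ALL

# Placeholder connection details; a successful bind is effectively a credential check
server = Server("ldap.example.com", port=389, get_info=ALL)
conn = Connection(server, user="uid=jdoe,ou=people,dc=example,dc=com", password="secret", auto_bind=True)

# Look up an entry and a couple of attributes
conn.search("dc=example,dc=com", "(uid=jdoe)", attributes=["cn", "mail"])
print(conn.entries)
conn.unbind()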

5. Keycloak

Keycloak is an open source software product to allow single sign-on with Identity Management and Access Management aimed at modern applications and services. As of March 2018 this JBoss community project is under the stewardship of Red Hat, who use it as the upstream project for their RH-SSO product.

Features: Among the many features of Keycloak are:
% User Registration
% Social login
% Single Sign-On/Sign-Off across all applications belonging to the same Realm
% 2-factor authentication
% LDAP integration
% Kerberos broker
% Multitenancy with per-realm customizable skin

Components: There are 2 main components of Keycloak:
% Keycloak server
% Keycloak application adapter

Ref: Keycloak

6. OAuth

OAuth is an open standard for access delegation, commonly used as a way for Internet users to grant websites or applications access to their information on other websites without giving them the passwords. This mechanism is used by companies such as Amazon, Google, Facebook, Microsoft and Twitter to permit users to share information about their accounts with third-party applications or websites. Generally, OAuth provides clients a "secure delegated access" to server resources on behalf of a resource owner. It specifies a process for resource owners to authorize third-party access to their server resources without sharing their credentials. Designed specifically to work with Hypertext Transfer Protocol (HTTP), OAuth essentially allows access tokens to be issued to third-party clients by an authorization server, with the approval of the resource owner. The third party then uses the access token to access the protected resources hosted by the resource server. OAuth is complementary to and distinct from OpenID. OAuth is unrelated to OATH, which is a reference architecture for authentication, not a standard for authorization. However, OAuth is directly related to OpenID Connect (OIDC), since OIDC is an authentication layer built on top of OAuth 2.0. OAuth is also unrelated to XACML, which is an authorization policy standard. OAuth can be used in conjunction with XACML, where OAuth is used for ownership consent and access delegation whereas XACML is used to define the authorization policies (e.g., managers can view documents in their region). Ref: OAuth
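A minimal sketch of the flow described above: the client exchanges an authorization code for an access token at the authorization server, then presents that token to the resource server. All URLs, identifiers and the code value are placeholders.

# OAuth 2.0 authorization-code exchange followed by a token-protected API call.
import requests

token_response = requests.post(
    "https://auth.example.com/oauth/token",
    data={
        "grant_type": "authorization_code",
        "code": "AUTH_CODE_FROM_REDIRECT",
        "redirect_uri": "https://myapp.example.com/callback",
        "client_id": "my-client-id",
        "client_secret": "my-client-secret",
    },
)
access_token = token_response.json()["access_token"]

# The third party then uses the access token to access protected resources.
api_response = requests.get(
    "https://api.example.com/user/photos",
    headers={"Authorization": f"Bearer {access_token}"},
)
print(api_response.json())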

7. OpenID

OpenID is an open standard and decentralized authentication protocol. Promoted by the non-profit OpenID Foundation, it allows users to be authenticated by co-operating sites (known as relying parties, or RPs) using a third-party service, eliminating the need for webmasters to provide their own ad hoc login systems and allowing users to log into multiple unrelated websites without needing a separate identity and password for each. Users create accounts by selecting an OpenID identity provider and then use those accounts to sign onto any website that accepts OpenID authentication. Several large organizations either issue or accept OpenIDs on their websites, according to the OpenID Foundation. The OpenID standard provides a framework for the communication that must take place between the identity provider and the OpenID acceptor (the "relying party"). An extension to the standard (OpenID Attribute Exchange) facilitates the transfer of user attributes, such as name and gender, from the OpenID identity provider to the relying party (each relying party may request a different set of attributes, depending on its requirements). The OpenID protocol does not rely on a central authority to authenticate a user's identity. Moreover, neither services nor the OpenID standard may mandate a specific means by which to authenticate users, allowing for approaches ranging from the common (such as passwords) to the novel (such as smart cards or biometrics). The final version of OpenID is OpenID 2.0, finalized and published in December 2007. The term OpenID may also refer to an identifier as specified in the OpenID standard; these identifiers take the form of a unique Uniform Resource Identifier (URI) and are managed by some "OpenID provider" that handles authentication. OpenID vs OAuth: OpenID is about authentication (proving who a user is), whereas OAuth is about authorization (letting a user grant an application access to resources without sharing credentials); OpenID Connect (OIDC) combines the two by adding an authentication layer on top of OAuth 2.0.
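The legacy OpenID 2.0 protocol has its own discovery mechanism, but its modern successor OpenID Connect (mentioned in the OAuth entry above) publishes a standard discovery document that a relying party can fetch to find the provider's endpoints. The issuer URL below is a placeholder; this is only a sketch of that OIDC discovery step.

# Fetch an OpenID Connect provider's discovery document.
import requests

issuer = "https://id.example.org"
config = requests.get(issuer + "/.well-known/openid-configuration").json()

print(config["authorization_endpoint"])  # where users are sent to authenticate
print(config["token_endpoint"])          # where ID/access tokens are issued
print(config["jwks_uri"])                # keys for verifying ID token signatures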
The Top 10 OWASP vulnerabilities in 2020 are:
% Injection
% Broken Authentication
% Sensitive Data Exposure
% XML External Entities (XXE)
% Broken Access Control
% Security Misconfiguration
% Cross-Site Scripting (XSS)
% Insecure Deserialization
% Using Components with Known Vulnerabilities
% Insufficient Logging and Monitoring
Ref (a): OpenID
Ref (b): Top 10 Security Vulnerabilities in 2020
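As a small illustration of the first item on the list (Injection) and its standard fix, the sketch below contrasts a query built by string concatenation with a parameterized query. The table and the attacker-controlled input are toy placeholders.

# SQL injection versus a parameterized query, using the standard sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 1), ('bob', 0)")

user_input = "bob' OR '1'='1"  # attacker-controlled value

# Vulnerable: the input is spliced into the SQL text, so the OR clause matches every row.
vulnerable = conn.execute(
    "SELECT name FROM users WHERE name = '" + user_input + "'"
).fetchall()

# Safe: the driver treats the value as data, not as SQL.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(vulnerable)  # [('alice',), ('bob',)]: injection succeeded
print(safe)        # []: no user is literally named "bob' OR '1'='1"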

8. Heroku

Heroku is a cloud platform as a service (PaaS) supporting several programming languages. One of the first cloud platforms, Heroku has been in development since June 2007, when it supported only the Ruby programming language, but now supports Java, Node.js, Scala, Clojure, Python, PHP, and Go. For this reason, Heroku is said to be a polyglot platform as it has features for a developer to build, run and scale applications in a similar manner across most languages. Heroku was acquired by Salesforce.com in 2010 for $212 million. Ref: Heroku

9. Facebook Prophet

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. Prophet is open source software released by Facebook's Core Data Science team. It is available for download on CRAN and PyPI. % CRAN is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. Link: https://cran.r-project.org/ The Prophet procedure includes many possibilities for users to tweak and adjust forecasts. You can use human-interpretable parameters to improve your forecast by adding your domain knowledge. Ref: Facebook.github.io
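A minimal sketch of fitting Prophet to a daily series, assuming the Python package as distributed on PyPI around 2020 (fbprophet) and a CSV file with the required 'ds' (date) and 'y' (value) columns; the file name is a placeholder.

# Fit Prophet and forecast 90 days past the end of the history.
import pandas as pd
from fbprophet import Prophet

df = pd.read_csv("daily_sales.csv")  # must contain columns 'ds' and 'y'

m = Prophet(yearly_seasonality=True, weekly_seasonality=True, daily_seasonality=False)
m.fit(df)

future = m.make_future_dataframe(periods=90)
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())

# Built-in plots of the forecast and its trend/seasonality components.
m.plot(forecast)
m.plot_components(forecast)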

10. Snorkel

% Programmatically Build Training Data. The Snorkel team is now focusing its efforts on Snorkel Flow, an end-to-end AI application development platform based on the core ideas behind Snorkel. The Snorkel project started at Stanford in 2016 with a simple technical bet: that it would increasingly be the training data, not the models, algorithms, or infrastructure, that decided whether a machine learning project succeeded or failed. Given this premise, we set out to explore the radical idea that you could bring mathematical and systems structure to the messy and often entirely manual process of training data creation and management, starting by empowering users to programmatically label, build, and manage training data. To say that the Snorkel project succeeded and expanded beyond what we had ever expected would be an understatement. The basic goals of a research repo like Snorkel are to provide a minimum viable framework for testing and validating hypotheses. Snorkel-related innovations are in weak supervision modeling, data augmentation, multi-task learning, and more. The ideas behind Snorkel change not just how you label training data, but much of the entire lifecycle and pipeline of building, deploying, and managing ML: how users inject their knowledge; how models are constructed, trained, inspected, versioned, and monitored; how entire pipelines are developed iteratively; and how the full set of stakeholders in any ML deployment, from subject matter experts to ML engineers, are incorporated into the process. Over the last year, we have been building the platform to support this broader vision: Snorkel Flow, an end-to-end machine learning platform for developing and deploying AI applications. Snorkel Flow incorporates many of the concepts of the Snorkel project with a range of newer techniques around weak supervision modeling, data augmentation, multi-task learning, data slicing and structuring, monitoring and analysis, and more, all of which integrate in a way that is greater than the sum of its parts, and that we believe makes ML truly faster, more flexible, and more practical than ever before. Ref: snorkel.org
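A hedged sketch of the core Snorkel idea of programmatic labeling: write labeling functions (LFs) that vote noisily on labels, then combine their votes with a label model to produce training labels. This assumes the snorkel package (circa version 0.9); the toy spam/ham data and LFs are placeholders.

# Programmatic training-data labeling with Snorkel labeling functions.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_mentions_free(x):
    return SPAM if "free" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "check out http://spam.example.com now",
    "see you at lunch",
    "free money at http://scam.example.com",
]})

lfs = [lf_contains_link, lf_mentions_free, lf_short_message]
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())

# The label model estimates LF accuracies and outputs probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=42)
print(label_model.predict(L=L_train))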