Wednesday, September 16, 2020

Snorkel's Analysis Package Overview (v0.9.6, Sep 2020)


The current version of Snorkel is v0.9.6 (as of 16-Sep-2020). Snorkel has 8 packages.

Package Reference:
1. Snorkel Analysis Package
2. Snorkel Augmentation Package
3. Snorkel Classification Package
4. Snorkel Labeling Package
5. Snorkel Map Package
6. Snorkel Preprocess Package
7. Snorkel Slicing Package
8. Snorkel Utils Package

What is Snorkel's Analysis Package for? This package covers how to interpret classification results. It provides generic model analysis utilities shared across Snorkel.

1: Scorer

Calculates one or more scores from user-specified and/or user-defined metrics. This defines a class 'Scorer' with two methods: 'score()' and 'score_slices()'. You have to specify input arguments such as metrics (related to 'metric_score()', discussed below), true labels, predicted labels and predicted probabilities. It is through this class that the code in 'metrics.py' is used.
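A minimal usage sketch (my own, not from the Snorkel docs; it assumes the v0.9 API and the 'accuracy'/'f1' metric names listed in the METRICS dict further below):

import numpy as np
from snorkel.analysis import Scorer

golds = np.array([1, 0, 1, 1, 0])  # true labels
preds = np.array([1, 0, 0, 1, 1])  # predicted labels

# Score the predictions with two built-in metrics at once.
scorer = Scorer(metrics=["accuracy", "f1"])
print(scorer.score(golds=golds, preds=preds))
# Expected (approximately): {'accuracy': 0.6, 'f1': 0.667}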
~~~ ~~~ ~~~

2: get_label_buckets

Returns data point indices bucketed by label combinations. This is a function written in the error_analysis.py file.

Code:

import snorkel
import numpy as np
from snorkel.analysis import get_label_buckets

print("Snorkel version:", snorkel.__version__)
Snorkel version: 0.9.3

A common use case is calling ``buckets = get_label_buckets(Y_gold, Y_pred)`` where ``Y_gold`` is a set of gold (i.e. ground truth) labels and ``Y_pred`` is a corresponding set of predicted labels.

Y_gold = np.array([1, 1, 1, 0, 0, 0, 1])
Y_pred = np.array([1, 1, -1, -1, 1, 0, 1])
buckets = get_label_buckets(Y_gold, Y_pred)

# If gold and pred have a different number of elements:
# ValueError: Arrays must all have the same number of elements

The returned ``buckets[(i, j)]`` is a NumPy array of data point indices with true label i and predicted label j. More generally, the returned indices within each bucket refer to the order of the labels that were passed in as function arguments.

print(buckets[(1, 1)]) # true positives (gold 1, predicted 1)
Out: array([0, 1, 6])

buckets[(0, 0)] # true negatives (gold 0, predicted 0)
Out: array([5])

# false negatives / false positives / true negatives
print((1, 0) in buckets, '/', (0, 1) in buckets, '/', (0, 0) in buckets)
Out: False / True / True

buckets[(1, -1)] # abstained positives
Out: array([2])

buckets[(0, -1)] # abstained negatives
Out: array([3])

~~~ ~~~ ~~~

3: metric_score()

Evaluates a standard metric on a set of predictions/probabilities. The code for metric_score() is in metrics.py. Using it, you can evaluate a standard metric on a set of gold labels, predictions and/or probabilities. The scores it adds are:
1. _coverage_score
2. _roc_auc_score
3. _f1_score
4. _f1_micro_score
5. _f1_macro_score

It is a wrapper around "sklearn.metrics" and extends it with the above five metrics.

METRICS = {
    "accuracy": Metric(sklearn.metrics.accuracy_score),
    "coverage": Metric(_coverage_score, ["preds"]),
    "precision": Metric(sklearn.metrics.precision_score),
    "recall": Metric(sklearn.metrics.recall_score),
    "f1": Metric(_f1_score, ["golds", "preds"]),
    "f1_micro": Metric(_f1_micro_score, ["golds", "preds"]),
    "f1_macro": Metric(_f1_macro_score, ["golds", "preds"]),
    "fbeta": Metric(sklearn.metrics.fbeta_score),
    "matthews_corrcoef": Metric(sklearn.metrics.matthews_corrcoef),
    "roc_auc": Metric(_roc_auc_score, ["golds", "probs"]),
}
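A small sketch of metric_score() usage (my own, assuming the metric names from the METRICS dict above; note that "coverage" needs only preds, while "accuracy" needs golds and preds):

import numpy as np
from snorkel.analysis import metric_score

golds = np.array([1, 0, 1, 1, 0])
preds = np.array([1, 0, 0, 1, 1])

# Accuracy is delegated to sklearn.metrics.accuracy_score under the hood.
acc = metric_score(golds=golds, preds=preds, metric="accuracy")

# Coverage only needs preds: the fraction of non-abstain (-1) predictions.
cov = metric_score(preds=np.array([1, 0, -1, 1, -1]), metric="coverage")

print(acc, cov)  # expected: 0.6 0.6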

Monday, September 14, 2020

Starting With Selenium's Python Package (Installation)



We have a YAML file to set up our conda environment. The file 'selenium.yml' has contents:

name: selenium
channels:
  - conda-forge
  - defaults
dependencies:
  - selenium
  - jupyterlab
  - ipykernel

To set up the environment, we run the commands:

(base) CMD> conda env create -f selenium.yml
(selenium) CMD> conda activate selenium

After that, if we want to see which packages were installed, we run:

(selenium) CMD> conda env export

Next, we set up a kernel from this environment:

(selenium) CMD> python -m ipykernel install --user --name selenium
Installed kernelspec selenium in C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\selenium

To view the list of kernels:

(selenium) CMD> jupyter kernelspec list
Available kernels:
  selenium    C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\selenium
  python3     E:\programfiles\Anaconda3\envs\selenium\share\jupyter\kernels\python3
...

A basic piece of code would start the browser. We have tried and tested it for Chrome and Firefox. To do this, we need the web driver file, or we get the following exception:

CODE:

from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

ERROR:

----------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
E:\programfiles\Anaconda3\envs\selenium\lib\site-packages\selenium\webdriver\common\service.py in start(self)
     71             cmd.extend(self.command_line_args())
---> 72             self.process = subprocess.Popen(cmd, env=self.env,
     73                                             close_fds=platform.system() != 'Windows',

E:\programfiles\Anaconda3\envs\selenium\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    853
--> 854             self._execute_child(args, executable, preexec_fn, close_fds,
    855                                 pass_fds, cwd, env,

E:\programfiles\Anaconda3\envs\selenium\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
   1306             try:
-> 1307                 hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
   1308                                          # no special security

FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

WebDriverException                        Traceback (most recent call last)
...
WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

We got the file from here: chromedriver.storage.googleapis.com (for v86): chromedriver_win32.zip ---> chromedriver.exe

Error for a WebDriver and browser version mismatch:

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 86
Current browser version is 85.0.4183.102 with binary path C:\Program Files (x86)\Google\Chrome\Application\chrome.exe

Download from here for Chrome v85: chromedriver.storage.googleapis.com (for v85)

One point to note about ChromeDriver as of September 2020: ChromeDriver only supports characters in the BMP (Basic Multilingual Plane). This is a known issue with the Chromium team, as ChromeDriver still doesn't support characters with a Unicode code point above FFFF. Hence it is impossible to send any character beyond FFFF via ChromeDriver.
As a result, any attempt to send SMP (Supplementary Multilingual Plane) characters (e.g. CJK extensions, emojis, symbols, etc.) raises an error. Firefox, on the other hand, supports emojis sent via the 'send_keys()' method. As of Unicode 13.0, the SMP comprises 134 blocks, beginning with the archaic Greek and other left-to-right scripts: Linear B Syllabary (10000–1007F), Linear B Ideograms (10080–100FF), and so on.

~ ~ ~ ~ ~

If you are working with the Firefox browser, you need the Gecko WebDriver available on the Windows 'PATH' variable. Without the WebDriver file:

FileNotFoundError: [WinError 2] The system cannot find the file specified
WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Download the Gecko driver from here: GitHub Repo of Mozilla

The statement to launch the web browser will be:

driver = webdriver.Firefox()

By default, browsers open in a partial-size window. To maximize the window:

driver.maximize_window()

Now, we open a link:

driver.get("http://survival8.blogspot.com/")
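Putting the pieces from this post together, a minimal end-to-end sketch (assuming geckodriver is already on the PATH):

from selenium import webdriver

# Firefox needs geckodriver on PATH; for Chrome, use webdriver.Chrome()
# with a chromedriver.exe that matches the installed browser version.
driver = webdriver.Firefox()
driver.maximize_window()
driver.get("http://survival8.blogspot.com/")
print(driver.title)
driver.quit()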

Wednesday, September 9, 2020

Sentiment Analysis using BERT, DistilBERT and ALBERT


We will do Sentiment Analysis using the code from this repo: GitHub

Check out the code from the above repository to get started.

For creating Conda environment, we have a file "sentiment_analysis.yml" with content:

name: e20200909
channels:
  - defaults
  - conda-forge
  - pytorch
  
dependencies:
  - pytorch
  - pandas
  - numpy
  - pip:
    - transformers==3.0.1
  - flask
  - flask_cors
  - scikit-learn
  - ipykernel 

(base) C:\>conda env create -f sentiment_analysis.yml

It will install the above-mentioned dependencies and their nested dependencies.

(base) C:\Users\Ashish Jain>conda env list 
# conda environments:
#
base                  *  E:\programfiles\Anaconda3
e20200909                E:\programfiles\Anaconda3\envs\e20200909
env_py_36                E:\programfiles\Anaconda3\envs\env_py_36
temp                     E:\programfiles\Anaconda3\envs\temp
temp202009               E:\programfiles\Anaconda3\envs\temp202009
tf                       E:\programfiles\Anaconda3\envs\tf 

(base) C:\Users\Ashish Jain>conda activate e20200909 

(e20200909) C:\Users\Ashish Jain>conda env export
name: e20200909
channels:
  - conda-forge
  - defaults
dependencies:
  - _pytorch_select=0.1=cpu_0
  - backcall=0.2.0=py_0
  - blas=1.0=mkl
  - ca-certificates=2020.7.22=0
  - certifi=2020.6.20=py38_0
  - cffi=1.14.2=py38h7a1dbc1_0
  - click=7.1.2=py_0
  - colorama=0.4.3=py_0
  - decorator=4.4.2=py_0
  - flask=1.1.2=py_0
  - flask_cors=3.0.9=pyh9f0ad1d_0
  - icc_rt=2019.0.0=h0cc432a_1
  - intel-openmp=2019.4=245
  - ipykernel=5.3.4=py38h5ca1d4c_0
  - ipython=7.18.1=py38h5ca1d4c_0
  - ipython_genutils=0.2.0=py38_0
  - itsdangerous=1.1.0=py_0
  - jedi=0.17.2=py38_0
  - jinja2=2.11.2=py_0
  - joblib=0.16.0=py_0
  - jupyter_client=6.1.6=py_0
  - jupyter_core=4.6.3=py38_0
  - libmklml=2019.0.5=0
  - libsodium=1.0.18=h62dcd97_0
  - markupsafe=1.1.1=py38he774522_0
  - mkl=2019.4=245
  - mkl-service=2.3.0=py38hb782905_0
  - mkl_fft=1.1.0=py38h45dec08_0
  - mkl_random=1.1.0=py38hf9181ef_0
  - ninja=1.10.1=py38h7ef1ec2_0
  - numpy=1.19.1=py38h5510c5b_0
  - numpy-base=1.19.1=py38ha3acd2a_0
  - openssl=1.1.1g=he774522_1
  - pandas=1.1.1=py38ha925a31_0
  - parso=0.7.0=py_0
  - pickleshare=0.7.5=py38_1000
  - pip=20.2.2=py38_0
  - prompt-toolkit=3.0.7=py_0
  - pycparser=2.20=py_2
  - pygments=2.6.1=py_0
  - python=3.8.5=h5fd99cc_1
  - python-dateutil=2.8.1=py_0
  - pytorch=1.6.0=cpu_py38h538a6d7_0
  - pytz=2020.1=py_0
  - pywin32=227=py38he774522_1
  - pyzmq=19.0.1=py38ha925a31_1
  - scikit-learn=0.23.2=py38h47e9c7a_0
  - scipy=1.5.0=py38h9439919_0
  - setuptools=49.6.0=py38_0
  - six=1.15.0=py_0
  - sqlite=3.33.0=h2a8f88b_0
  - threadpoolctl=2.1.0=pyh5ca1d4c_0
  - tornado=6.0.4=py38he774522_1
  - traitlets=4.3.3=py38_0
  - vc=14.1=h0510ff6_4
  - vs2015_runtime=14.16.27012=hf0eaf9b_3
  - wcwidth=0.2.5=py_0
  - werkzeug=1.0.1=py_0
  - wheel=0.35.1=py_0
  - wincertstore=0.2=py38_0
  - zeromq=4.3.2=ha925a31_2
  - zlib=1.2.11=h62dcd97_4
  - pip:
    - chardet==3.0.4
    - filelock==3.0.12
    - idna==2.10
    - packaging==20.4
    - pyparsing==2.4.7
    - regex==2020.7.14
    - requests==2.24.0
    - sacremoses==0.0.43
    - sentencepiece==0.1.91
    - tokenizers==0.8.0rc4
    - tqdm==4.48.2
    - transformers==3.0.1
    - urllib3==1.25.10
prefix: E:\programfiles\Anaconda3\envs\e20200909

(e20200909) C:\Users\Ashish Jain> 

Next, we run the 'analyser' code:

(e20200909) C:\SentimentAnalysis-master>python analyze.py 
Please wait while the analyser is being prepared.
Input sentiment to analyze: I am feeling good.
Positive with probability 99%.
Input sentiment to analyze: I am feeling bad.
Negative with probability 99%.
Input sentiment to analyze: I am Ashish.
Positive with probability 81%.
Input sentiment to analyze: 

Next, we run it in the browser:

We pass the same sentences as above.

Here are server logs:

(e20200909) C:\SentimentAnalysis-master>python server.py 
 * Serving Flask app "server" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [09/Sep/2020 21:35:48] "GET / HTTP/1.1" 400 -
127.0.0.1 - - [09/Sep/2020 21:35:48] "GET /favicon.ico HTTP/1.1" 404 -
127.0.0.1 - - [09/Sep/2020 21:36:02] "GET /?text=hello HTTP/1.1" 200 -
127.0.0.1 - - [09/Sep/2020 21:36:38] "GET /?text=shut%20up HTTP/1.1" 200 -
127.0.0.1 - - [09/Sep/2020 21:36:50] "GET /?text=i%20am%20feeling%20good HTTP/1.1" 200 -
127.0.0.1 - - [09/Sep/2020 21:36:54] "GET /?text=i%20am%20feeling%20bad HTTP/1.1" 200 -
127.0.0.1 - - [09/Sep/2020 21:37:00] "GET /?text=i%20am%20ashish HTTP/1.1" 200 - 

The browser screens:
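For a quick check without the browser, the same endpoint can also be queried from Python (a small sketch; the 'text' query parameter matches the URLs in the server logs above):

import requests

# server.py from the repo listens on http://127.0.0.1:5000/
resp = requests.get("http://127.0.0.1:5000/", params={"text": "I am feeling good."})
print(resp.status_code)
print(resp.text)  # the response body carries the predicted sentiment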

Tuesday, September 8, 2020

2 X 2 Idempotent matrix

I had to provide an example of an idempotent matrix. That's the kind of matrix that yields itself when multiplied by itself, much like 0 and 1 in scalar multiplication (0 x 0 = 0, 1 x 1 = 1).
It is not so easy to predict the result of a matrix multiplication, especially for large matrices. So, instead of settling for the naïve method of guessing by trial and error, I explored the properties of a square matrix of order 2.
In this page I state the question and begin to attempt it. I realised that for a matrix to be idempotent, it would have to retain its dimensions (order), and hence be a square matrix.
I have intentionally used distinct variable names a, b, c, and d. This keeps open the possibility of a different number at each index. I derived 'bc' from the first equation and substituted it into its instance in the last equation to obtain a solution for 'a'.
Since 0 cannot be divided by 0, I could not divide by either term unless it was non-zero. Thus, I had two possibilities, which I called cases A and B.
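As a sketch of the system being described (my reconstruction from the text above), writing the matrix and imposing M² = M:

M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad
M^2 = \begin{pmatrix} a^2 + bc & b(a+d) \\ c(a+d) & bc + d^2 \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}

If b or c is non-zero, dividing b(a+d) = b (or c(a+d) = c) by that term gives a + d = 1 (case A); otherwise b = c = 0 (case B), and the diagonal equations reduce to a² = a and d² = d, so a, d ∈ {0, 1}.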
I solved the four equations in case A by making substitutions into the four main equations, and later tested the solution with b = 1.
As you can see, I could not use the elimination method to any advantage in this case.
I couldn't get a unique solution in either case. That is because there are many possible square matrices that are idempotent. However, I don't feel comfortable intuiting that every 2 X 2 idempotent matrix has one of only two possible numbers as its first and last elements.

Others’ take on it

My classmate Sabari Sreekumar did manage to use elimination for the ‘bc’ term for the general case.
I took it a step further and defined the last element in terms of the other elements.
So given any 2 X 2 idempotent matrix and its first three elements, you can find the last element unequivocally with this formula.
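A hedged reconstruction of such a formula (not necessarily the exact derivation referred to above): when b or c is non-zero, the off-diagonal equation b(a+d) = b forces

d = 1 - a

while in the degenerate case b = c = 0, the last equation d² = d only pins d down to d ∈ {0, 1}, independently of a.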

Conclusion

I wonder if multiples of matrices that satisfy either case are also idempotent. Perhaps I will see if I can prove that in another post.
In the next lecture, professor Venkata Ratnam suggested using the sure-shot approach of a zero matrix. And I was like, “Why didn’t I think of that?”

Sunday, September 6, 2020

Setting up Conda Environment for Swagger and Scrapy based project


We have a file named "my_yml.yml" that reads:

name: swagger2
channels:
  - conda-forge
  - defaults
dependencies:
  - beautifulsoup4
  - connexion
  - flask
  - flask_cors
  - scrapy

It will do these three things:
1. It will create an environment "swagger2".
2. For downloading packages, it will use the channels "conda-forge" and "defaults".
3. The packages it will install are listed as "dependencies".

Checking our current environments:

(base) C:\Users\Ashish Jain>conda env list
# conda environments:
#
base       * E:\programfiles\Anaconda3
env_py_36    E:\programfiles\Anaconda3\envs\env_py_36
tf           E:\programfiles\Anaconda3\envs\tf

(base) C:\experiment_with_conda>conda env create -f my_yml.yml
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages
pysocks-1.7.1        | 27 KB   | ### | 100%
flask_cors-3.0.9     | 15 KB   | ### | 100%
chardet-3.0.4        | 189 KB  | ### | 100%
clickclick-1.2.2     | 9 KB    | ### | 100%
cssselect-1.1.0      | 18 KB   | ### | 100%
importlib-metadata-1 | 45 KB   | ### | 100%
attrs-20.2.0         | 41 KB   | ### | 100%
protego-0.1.16       | 2.6 MB  | ### | 100%
twisted-20.3.0       | 5.1 MB  | ### | 100%
pywin32-227          | 6.9 MB  | ### | 100%
pyrsistent-0.16.0    | 91 KB   | ### | 100%
beautifulsoup4-4.9.1 | 86 KB   | ### | 100%
connexion-2.7.0      | 51 KB   | ### | 100%
pyhamcrest-2.0.2     | 29 KB   | ### | 100%
libxslt-1.1.33       | 499 KB  | ### | 100%
libxml2-2.9.10       | 3.5 MB  | ### | 100%
incremental-17.5.0   | 14 KB   | ### | 100%
flask-1.1.2          | 70 KB   | ### | 100%
scrapy-2.3.0         | 640 KB  | ### | 100%
automat-20.2.0       | 30 KB   | ### | 100%
python-3.8.5         | 18.9 MB | ### | 100%
bcrypt-3.2.0         | 41 KB   | ### | 100%
service_identity-18. | 12 KB   | ### | 100%
win_inet_pton-1.1.0  | 7 KB    | ### | 100%
cryptography-3.1     | 587 KB  | ### | 100%
libiconv-1.16        | 680 KB  | ### | 100%
jmespath-0.10.0      | 21 KB   | ### | 100%
markupsafe-1.1.1     | 29 KB   | ### | 100%
parsel-1.6.0         | 15 KB   | ### | 100%
constantly-15.1.0    | 9 KB    | ### | 100%
pydispatcher-2.0.5   | 12 KB   | ### | 100%
zope.interface-5.1.0 | 299 KB  | ### | 100%
pyasn1-modules-0.2.7 | 60 KB   | ### | 100%
hyperlink-20.0.1     | 42 KB   | ### | 100%
inflection-0.5.1     | 9 KB    | ### | 100%
pyasn1-0.4.8         | 53 KB   | ### | 100%
w3lib-1.22.0         | 21 KB   | ### | 100%
pathlib2-2.3.5       | 34 KB   | ### | 100%
jinja2-2.11.2        | 93 KB   | ### | 100%
setuptools-49.6.0    | 968 KB  | ### | 100%
queuelib-1.5.0       | 13 KB   | ### | 100%
itemloaders-1.0.2    | 14 KB   | ### | 100%
pyyaml-5.3.1         | 158 KB  | ### | 100%
soupsieve-2.0.1      | 30 KB   | ### | 100%
brotlipy-0.7.0       | 368 KB  | ### | 100%
wincertstore-0.2     | 13 KB   | ### | 100%
lxml-4.5.2           | 1.1 MB  | ### | 100%
cffi-1.14.1          | 227 KB  | ### | 100%
itsdangerous-1.1.0   | 16 KB   | ### | 100%
click-7.1.2          | 64 KB   | ### | 100%
certifi-2020.6.20    | 151 KB  | ### | 100%
python_abi-3.8       | 4 KB    | ### | 100%
zlib-1.2.11          | 126 KB  | ### | 100%
openapi-spec-validat | 23 KB   | ### | 100%
jsonschema-3.2.0     | 108 KB  | ### | 100%
itemadapter-0.1.0    | 10 KB   | ### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate swagger2
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) C:\experiment_with_conda>conda activate swagger2

(swagger2) C:\experiment_with_conda>conda env export
name: swagger2
channels:
  - conda-forge
  - defaults
dependencies:
  - attrs=20.2.0=pyh9f0ad1d_0
  - automat=20.2.0=py_0
  - bcrypt=3.2.0=py38h1e8a9f7_0
  - beautifulsoup4=4.9.1=py_1
  - brotlipy=0.7.0=py38h1e8a9f7_1000
  - ca-certificates=2020.6.20=hecda079_0
  - certifi=2020.6.20=py38h32f6830_0
  - cffi=1.14.1=py38hba49e27_0
  - chardet=3.0.4=py38h32f6830_1006
  - click=7.1.2=pyh9f0ad1d_0
  - clickclick=1.2.2=py_1
  - connexion=2.7.0=py_0
  - constantly=15.1.0=py_0
  - cryptography=3.1=py38hba49e27_0
  - cssselect=1.1.0=py_0
  - flask=1.1.2=pyh9f0ad1d_0
  - flask_cors=3.0.9=pyh9f0ad1d_0
  - hyperlink=20.0.1=pyh9f0ad1d_0
  - idna=2.10=pyh9f0ad1d_0
  - importlib-metadata=1.7.0=py38h32f6830_0
  - importlib_metadata=1.7.0=0
  - incremental=17.5.0=py_0
  - inflection=0.5.1=pyh9f0ad1d_0
  - itemadapter=0.1.0=py_0
  - itemloaders=1.0.2=py_0
  - itsdangerous=1.1.0=py_0
  - jinja2=2.11.2=pyh9f0ad1d_0
  - jmespath=0.10.0=pyh9f0ad1d_0
  - jsonschema=3.2.0=py38h32f6830_1
  - libiconv=1.16=he774522_0
  - libxml2=2.9.10=h1006b36_2
  - libxslt=1.1.33=h579f668_1
  - lxml=4.5.2=py38he3d0fc9_0
  - markupsafe=1.1.1=py38h9de7a3e_1
  - openapi-spec-validator=0.2.9=pyh9f0ad1d_0
  - openssl=1.1.1g=he774522_1
  - parsel=1.6.0=py_0
  - pathlib2=2.3.5=py38h32f6830_1
  - pip=20.2.2=py_0
  - protego=0.1.16=py_0
  - pyasn1=0.4.8=py_0
  - pyasn1-modules=0.2.7=py_0
  - pycparser=2.20=pyh9f0ad1d_2
  - pydispatcher=2.0.5=py_1
  - pyhamcrest=2.0.2=py_0
  - pyopenssl=19.1.0=py_1
  - pyrsistent=0.16.0=py38h9de7a3e_0
  - pysocks=1.7.1=py38h32f6830_1
  - python=3.8.5=h60c2a47_7_cpython
  - python_abi=3.8=1_cp38
  - pywin32=227=py38hfa6e2cd_0
  - pyyaml=5.3.1=py38h9de7a3e_0
  - queuelib=1.5.0=pyh9f0ad1d_0
  - requests=2.24.0=pyh9f0ad1d_0
  - scrapy=2.3.0=py38h32f6830_0
  - service_identity=18.1.0=py_0
  - setuptools=49.6.0=py38h32f6830_0
  - six=1.15.0=pyh9f0ad1d_0
  - soupsieve=2.0.1=py_1
  - sqlite=3.33.0=he774522_0
  - twisted=20.3.0=py38h9de7a3e_0
  - urllib3=1.25.10=py_0
  - vc=14.1=h869be7e_1
  - vs2015_runtime=14.16.27012=h30e32a0_2
  - w3lib=1.22.0=pyh9f0ad1d_0
  - werkzeug=1.0.1=pyh9f0ad1d_0
  - wheel=0.35.1=pyh9f0ad1d_0
  - win_inet_pton=1.1.0=py38_0
  - wincertstore=0.2=py38_1003
  - yaml=0.2.5=he774522_0
  - zipp=3.1.0=py_0
  - zlib=1.2.11=h62dcd97_1009
  - zope.interface=5.1.0=py38h9de7a3e_0
prefix: E:\programfiles\Anaconda3\envs\swagger2

(swagger2) C:\experiment_with_conda>conda deactivate

(base) C:\experiment_with_conda>conda env remove --name swagger2
Remove all packages in environment E:\programfiles\Anaconda3\envs\swagger2:

Alternatively: conda remove --name myenv --all

(base) C:\experiment_with_conda>conda info --envs
# conda environments:
#
base       * E:\programfiles\Anaconda3
env_py_36    E:\programfiles\Anaconda3\envs\env_py_36
tf           E:\programfiles\Anaconda3\envs\tf

Ref: conda.io

Saturday, September 5, 2020

Prediction of Nifty50 index using LSTM based model


Here we will use LSTM layers to develop a time series forecasting model for the prediction of the Nifty50 index's closing value.

Our environment:

(py383) ashish@ashish-VirtualBox:~/Desktop$ conda list keras
# packages in environment at /home/ashish/anaconda3/envs/py383:
#
# Name                    Version                   Build  Channel
keras                     2.4.3                    pypi_0    pypi
keras-preprocessing       1.1.2                    pypi_0    pypi
(py383) ashish@ashish-VirtualBox:~/Desktop$ conda list tensorflow
# packages in environment at /home/ashish/anaconda3/envs/py383:
#
# Name                    Version                   Build  Channel
tensorflow                2.2.0                    pypi_0    pypi
tensorflow-estimator      2.2.0                    pypi_0    pypi
(py383) ashish@ashish-VirtualBox:~/Desktop$ conda list matplotlib
# packages in environment at /home/ashish/anaconda3/envs/py383:
#
# Name                    Version                   Build  Channel
matplotlib                3.2.2                         0  
matplotlib-base           3.2.2            py38hef1b27d_0  
(py383) ashish@ashish-VirtualBox:~/Desktop$ conda list scikit-learn
# packages in environment at /home/ashish/anaconda3/envs/py383:
#
# Name                    Version                   Build  Channel
scikit-learn              0.23.1           py38h423224d_0  
(py383) ashish@ashish-VirtualBox:~/Desktop$ conda list seaborn
# packages in environment at /home/ashish/anaconda3/envs/py383:
#
# Name                    Version                   Build  Channel
seaborn                   0.10.1                     py_0  

Python Code:

from __future__ import print_function
import os
import sys
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import datetime
from dateutil.parser import parse
from sklearn.metrics import mean_absolute_error 

# Read the dataset 
l = []
for i in os.listdir('files_2'):
    l.append(pd.read_csv(os.path.join('files_2', i)))

df = pd.concat(l, axis = 0) 

We have data that looks like this (a table whose columns include 'Date' and 'Close'):

def convert_str_to_date(in_date):
    return parse(in_date)

df['Date'] = df['Date'].apply(convert_str_to_date)
df.sort_values(by = ['Date'], axis = 0, ascending = True, inplace = True, na_position = 'last')
df.reset_index(drop=True, inplace=True)

Gradient descent algorithms perform better (for example, converge faster) if the variables are within the range [-1, 1]. Many sources relax the boundary to even [-3, 3]. The 'Close' variable is min-max scaled to bound the transformed variable within [0, 1].

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
df['scaled_close'] = scaler.fit_transform(np.array(df['Close']).reshape(-1, 1))

Before training the model, the dataset is split into two parts: a train set and a validation set. The neural network is trained on the train set. This means computation of the loss function, back propagation and weight updates by a gradient descent algorithm are done on the train set. The validation set is used to evaluate the model and to determine the number of epochs in model training. Increasing the number of epochs will further decrease the loss function on the train set, but might not necessarily have the same effect for the validation set due to overfitting on the train set. Hence, the number of epochs is controlled by keeping a tab on the loss function computed for the validation set. We use Keras with the TensorFlow backend to define and train the model. All the steps involved in model training and validation are done by calling appropriate functions of the Keras API.

# Let's start by splitting the dataset into train and validation.
split_date = datetime.datetime(year=2020, month=8, day=1, hour=0)
df_train = df.loc[df['Date'] < split_date]
df_val = df.loc[df['Date'] >= split_date]

# Reset the indices of the validation set
df_val.reset_index(drop=True, inplace=True)

Now we need to generate regressors (X) and the target variable (y) for train and validation. A 2-D array of regressors and a 1-D array of targets are created from the original 1-D array of the column 'Close' in the DataFrames. For the time series forecasting model, the past seven days of observations are used to predict for the next day. This is equivalent to an AR(7) model. We define a function which takes the original time series and the number of timesteps in the regressors as input to generate the arrays of X and y.

The makeXy function is used to generate the arrays of regressors and targets: X_train, X_val, y_train and y_val. X_train and X_val, as generated by the makeXy function, are 2D arrays of shape (number of samples, number of timesteps). However, the input to RNN layers must be of shape (number of samples, number of timesteps, number of features per timestep). In this case, we are dealing with only 'Close', hence the number of features per timestep is one. The number of timesteps is seven, and the number of samples is the same as the number of samples in X_train and X_val, which are reshaped to 3D arrays:

def makeXy(ts, nb_timesteps):
    """
    Input:
        ts: original time series
        nb_timesteps: number of time steps in the regressors
    Output:
        X: 2-D array of regressors
        y: 1-D array of target
    """
    X = []
    y = []
    for i in range(nb_timesteps, ts.shape[0]):
        X.append(list(ts.loc[i-nb_timesteps:i-1]))
        y.append(ts.loc[i])
    X, y = np.array(X), np.array(y)
    return X, y

X_train, y_train = makeXy(df_train['scaled_close'], 7)
X_val, y_val = makeXy(df_val['scaled_close'], 7)

# X_train and X_val are reshaped to 3D arrays
X_train, X_val = X_train.reshape((X_train.shape[0], X_train.shape[1], 1)), X_val.reshape((X_val.shape[0], X_val.shape[1], 1))

Now we define the model using the Keras Functional API. In this approach, a layer can be declared as the input of the following layer at the time of defining the next layer.

from keras.layers import Dense, Input, Dropout
from keras.layers.recurrent import LSTM
from keras.optimizers import SGD
from keras.models import Model
from keras.models import load_model
from keras.callbacks import ModelCheckpoint

# Define the input layer, which has shape (None, 7, 1) and type float32.
# None indicates the number of instances.
input_layer = Input(shape=(7,1), dtype='float32')

The LSTM layers are defined for seven timesteps. In this example, two LSTM layers are stacked. The first LSTM returns output from all seven timesteps. This output is a sequence and is fed to the second LSTM, which returns output only from the last timestep. The first LSTM has sixty-four hidden neurons in each timestep; hence the sequence returned by the first LSTM has sixty-four features.

lstm_layer1 = LSTM(64, input_shape=(7,1), return_sequences=True)(input_layer)
lstm_layer2 = LSTM(32, input_shape=(7,64), return_sequences=False)(lstm_layer1)
dropout_layer = Dropout(0.2)(lstm_layer2)

# Finally, the output layer gives the prediction.
output_layer = Dense(1, activation='linear')(dropout_layer)

The input, LSTM, dropout and output layers will now be packed inside a Model, which is a wrapper class for training and making predictions. In the presence of outliers, mean absolute error (MAE) is used as the loss, since absolute deviations suffer less fluctuation compared to squared deviations. The network's weights are optimized by the Adam algorithm. Adam stands for adaptive moment estimation and has been a popular choice for training deep neural networks. Unlike stochastic gradient descent, Adam uses different learning rates for each weight and updates them separately as the training progresses. The learning rate of a weight is updated based on exponentially weighted moving averages of the weight's gradients and the squared gradients.
ts_model = Model(inputs=input_layer, outputs=output_layer)
ts_model.compile(loss='mean_absolute_error', optimizer='adam') # Alternative: SGD(lr=0.001, decay=1e-5)
ts_model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 7, 1)]            0
_________________________________________________________________
lstm (LSTM)                  (None, 7, 64)             16896
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                12416
_________________________________________________________________
dropout (Dropout)            (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 1)                 33
=================================================================
Total params: 29,345
Trainable params: 29,345
Non-trainable params: 0
_________________________________________________________________

The model is trained by calling the fit function on the model object and passing X_train and y_train. The training is done for a predefined number of epochs. Additionally, batch_size defines the number of samples of the train set to be used for an instance of back propagation. The validation dataset is also passed to evaluate the model after every epoch completes. A ModelCheckpoint object tracks the loss function on the validation set and saves the model for the epoch at which the loss function has been minimum.

save_weights_at = os.path.join('files_1', 'models', 'p5', 'p5_nifty50_LSTM_weights.{epoch:02d}-{val_loss:.4f}.hdf5')
save_best = ModelCheckpoint(save_weights_at, monitor='val_loss', verbose=0,
                            save_best_only=True, save_weights_only=False, mode='min', period=1)
ts_model.fit(x=X_train, y=y_train, batch_size=16, epochs=30,
             verbose=1, callbacks=[save_best], validation_data=(X_val, y_val),
             shuffle=True)

WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
Epoch 1/30
381/381 [==============================] - 13s 33ms/step - loss: 0.0181 - val_loss: 0.0258
...
381/381 [==============================] - 10s 25ms/step - loss: 0.0175 - val_loss: 0.0384
<tensorflow.python.keras.callbacks.History at 0x7fed1c0a05b0>

Predictions are made from the best saved model. The model's predictions, which are on the scaled 'Close', are inverse transformed to get predictions of the original 'Close'.

best_model = load_model(os.path.join('files_1', 'models', 'p5', 'p5_nifty50_LSTM_weights.12-0.0057.hdf5'))
preds = best_model.predict(X_val)
pred = scaler.inverse_transform(preds)
pred = np.squeeze(pred)

mae = mean_absolute_error(df_val['Close'].loc[7:], pred)
print('MAE for the validation set:', round(mae, 4))
MAE for the validation set: 65.7769

# Let's plot the actual and predicted values.
plt.figure(figsize=(5.5, 5.5))
plt.plot(range(len(df_val['Close'].loc[7:])), df_val['Close'].loc[7:], linestyle='-', marker='*', color='r')
plt.plot(range(len(df_val['Close'].loc[7:])), pred[:df_val.shape[0]], linestyle='-', marker='.', color='b')
plt.legend(['Actual','Predicted'], loc=2)
plt.title('Actual vs Predicted')
plt.ylabel('Close')
plt.xlabel('Index')

from sklearn.metrics import r2_score
r2 = r2_score(df_val['Close'].loc[7:], pred)
print('R-squared for the validation set:', round(r2, 4))
R-squared for the validation set: 0.3702

Friday, September 4, 2020

Logging in Python


When to use logging 

Logging provides a set of convenience functions for simple logging usage. These are debug(), info(), warning(), error() and critical(). To determine when to use logging, see the table below (from the Python Logging HOWTO), which states, for each of a set of common tasks, the best tool to use for it:

% Display console output for ordinary usage of a command line script or program -> print()
% Report events that occur during normal operation of a program (e.g. for status monitoring or fault investigation) -> logging.info() (or logging.debug() for very detailed output for diagnostic purposes)
% Issue a warning regarding a particular runtime event -> warnings.warn() in library code if the issue is avoidable and the client application should be modified to eliminate the warning; logging.warning() if there is nothing the client application can do, but the event should still be noted
% Report an error regarding a particular runtime event -> Raise an exception
% Report suppression of an error without raising an exception (e.g. an error handler in a long-running server process) -> logging.error(), logging.exception() or logging.critical(), as appropriate

The logging functions are named after the level or severity of the events they are used to track. The standard levels and their applicability are described below (in increasing order of severity):

% DEBUG: Detailed information, typically of interest only when diagnosing problems.
% INFO: Confirmation that things are working as expected.
% WARNING: An indication that something unexpected happened, or of some problem in the near future (e.g. 'disk space low'). The software is still working as expected.
% ERROR: Due to a more serious problem, the software has not been able to perform some function.
% CRITICAL: A serious error, indicating that the program itself may be unable to continue running.

The default level is WARNING, which means that only events of this level and above will be tracked, unless the logging package is configured to do otherwise. Events that are tracked can be handled in different ways. The simplest way of handling tracked events is to print them to the console. Another common way is to write them to a disk file.

Advanced Logging Tutorial

The logging library takes a modular approach and offers several categories of components: loggers, handlers, filters, and formatters.
% Loggers expose the interface that application code directly uses.
% Handlers send the log records (created by loggers) to the appropriate destination.
% Filters provide a finer-grained facility for determining which log records to output.
% Formatters specify the layout of log records in the final output.

Log event information is passed between loggers, handlers, filters and formatters in a LogRecord instance.

Logging is performed by calling methods on instances of the Logger class (hereafter called loggers). Each instance has a name, and they are conceptually arranged in a namespace hierarchy using dots (periods) as separators. For example, a logger named ‘scan’ is the parent of loggers ‘scan.text’, ‘scan.html’ and ‘scan.pdf’. Logger names can be anything you want, and indicate the area of an application in which a logged message originates.

A good convention to use when naming loggers is to use a module-level logger, in each module which uses logging, named as follows:

logger = logging.getLogger(__name__)

This means that logger names track the package/module hierarchy, and it’s intuitively obvious where events are logged just from the logger name.

The root of the hierarchy of loggers is called the root logger. That’s the logger used by the functions debug(), info(), warning(), error() and critical(), which just call the same-named method of the root logger. The functions and the methods have the same signatures. The root logger’s name is printed as ‘root’ in the logged output.

It is, of course, possible to log messages to different destinations. Support is included in the package for writing log messages to files, HTTP GET/POST locations, email via SMTP, generic sockets, queues, or OS-specific logging mechanisms such as syslog or the Windows NT event log. Destinations are served by handler classes. You can create your own log destination class if you have special requirements not met by any of the built-in handler classes.

By default, no destination is set for any logging messages. You can specify a destination (such as console or file) by using basicConfig() as in the tutorial examples. If you call the functions debug(), info(), warning(), error() and critical(), they will check to see if no destination is set; and if one is not set, they will set a destination of the console (sys.stderr) and a default format for the displayed message before delegating to the root logger to do the actual message output. The default format set by basicConfig() for messages is:

severity:logger name:message

You can change this by passing a format string to basicConfig() with the format keyword argument. For all options regarding how a format string is constructed, see Formatter Objects.

Logging Flow

The flow of log event information in loggers and handlers is illustrated in the following diagram (see the Python docs for the figure).
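As a small illustration (standard library only; asctime, levelname, name and message are standard LogRecord attributes):

import logging

# Replace the default 'severity:logger name:message' layout with a
# custom format, and log everything from DEBUG upwards to the console.
logging.basicConfig(format='%(asctime)s [%(levelname)s] %(name)s: %(message)s',
                    level=logging.DEBUG)
logging.debug('This will now be shown')
logging.warning('And so will this')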
Loggers

Logger objects have a threefold job. First, they expose several methods to application code so that applications can log messages at runtime. Second, logger objects determine which log messages to act upon based upon severity (the default filtering facility) or filter objects. Third, logger objects pass along relevant log messages to all interested log handlers.

The most widely used methods on logger objects fall into two categories: configuration and message sending. These are the most common configuration methods:

% Logger.setLevel() specifies the lowest-severity log message a logger will handle, where debug is the lowest built-in severity level and critical is the highest built-in severity. For example, if the severity level is INFO, the logger will handle only INFO, WARNING, ERROR, and CRITICAL messages and will ignore DEBUG messages.
% Logger.addHandler() and Logger.removeHandler() add and remove handler objects from the logger object. Handlers are covered in more detail in Handlers.
% Logger.addFilter() and Logger.removeFilter() add and remove filter objects from the logger object. Filters are covered in more detail in Filter Objects.

You don’t need to always call these methods on every logger you create. See the last two paragraphs in this section.

With the logger object configured, the following methods create log messages:

% Logger.debug(), Logger.info(), Logger.warning(), Logger.error(), and Logger.critical() all create log records with a message and a level that corresponds to their respective method names. The message is actually a format string, which may contain the standard string substitution syntax of %s, %d, %f, and so on. The rest of their arguments is a list of objects that correspond with the substitution fields in the message. With regard to **kwargs, the logging methods care only about a keyword of exc_info and use it to determine whether to log exception information.
% Logger.exception() creates a log message similar to Logger.error(). The difference is that Logger.exception() dumps a stack trace along with it. Call this method only from an exception handler.
% Logger.log() takes a log level as an explicit argument. This is a little more verbose for logging messages than using the log level convenience methods listed above, but this is how to log at custom log levels.

getLogger() returns a reference to a logger instance with the specified name if it is provided, or root if not. The names are period-separated hierarchical structures. Multiple calls to getLogger() with the same name will return a reference to the same logger object. Loggers that are further down in the hierarchical list are children of loggers higher up in the list. For example, given a logger with a name of foo, loggers with names of foo.bar, foo.bar.baz, and foo.bam are all descendants of foo.

Loggers have a concept of effective level. If a level is not explicitly set on a logger, the level of its parent is used instead as its effective level. If the parent has no explicit level set, its parent is examined, and so on - all ancestors are searched until an explicitly set level is found. The root logger always has an explicit level set (WARNING by default). When deciding whether to process an event, the effective level of the logger is used to determine whether the event is passed to the logger’s handlers. Child loggers propagate messages up to the handlers associated with their ancestor loggers.
Because of this, it is unnecessary to define and configure handlers for all the loggers an application uses. It is sufficient to configure handlers for a top-level logger and create child loggers as needed. (You can, however, turn off propagation by setting the propagate attribute of a logger to False.)

Ref: docs.python.org/3/howto

Logging Levels

The numeric values of logging levels are given in the following table. These are primarily of interest if you want to define your own levels, and need them to have specific values relative to the predefined levels. If you define a level with the same numeric value, it overwrites the predefined value; the predefined name is lost.

Level     Numeric value
CRITICAL  50
ERROR     40
WARNING   30
INFO      20
DEBUG     10
NOTSET    0
Ref 1: docs.python.org/3/library/logging
Ref 2: docs.python.org/3/howto/logging

Using logging in multiple modules

Multiple calls to logging.getLogger('someLogger') return a reference to the same logger object. This is true not only within the same module, but also across modules as long as it is in the same Python interpreter process. It is true for references to the same object; additionally, application code can define and configure a parent logger in one module and create (but not configure) a child logger in a separate module, and all logger calls to the child will pass up to the parent.

Here is a main module:

import logging
import auxiliary_module

# create logger with 'spam_application'
logger = logging.getLogger('spam_application')
logger.setLevel(logging.DEBUG)

# create file handler which logs even debug messages
fh = logging.FileHandler('spam.log')
fh.setLevel(logging.DEBUG)

# create console handler with a higher log level
ch = logging.StreamHandler()
ch.setLevel(logging.ERROR)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)

# add the handlers to the logger
logger.addHandler(fh)
logger.addHandler(ch)

logger.info('creating an instance of auxiliary_module.Auxiliary')
a = auxiliary_module.Auxiliary()
logger.info('created an instance of auxiliary_module.Auxiliary')
logger.info('calling auxiliary_module.Auxiliary.do_something')
a.do_something()
logger.info('finished auxiliary_module.Auxiliary.do_something')
logger.info('calling auxiliary_module.some_function()')
auxiliary_module.some_function()
logger.info('done with auxiliary_module.some_function()')

Here is the auxiliary module:

import logging

# create logger
module_logger = logging.getLogger('spam_application.auxiliary')

class Auxiliary:
    def __init__(self):
        self.logger = logging.getLogger('spam_application.auxiliary.Auxiliary')
        self.logger.info('creating an instance of Auxiliary')

    def do_something(self):
        self.logger.info('doing something')
        a = 1 + 1
        self.logger.info('done doing something')

def some_function():
    module_logger.info('received a call to "some_function"')

The output looks like this:

2005-03-23 23:47:11,663 - spam_application - INFO - creating an instance of auxiliary_module.Auxiliary
2005-03-23 23:47:11,665 - spam_application.auxiliary.Auxiliary - INFO - creating an instance of Auxiliary
2005-03-23 23:47:11,665 - spam_application - INFO - created an instance of auxiliary_module.Auxiliary
2005-03-23 23:47:11,668 - spam_application - INFO - calling auxiliary_module.Auxiliary.do_something
2005-03-23 23:47:11,668 - spam_application.auxiliary.Auxiliary - INFO - doing something
2005-03-23 23:47:11,669 - spam_application.auxiliary.Auxiliary - INFO - done doing something
2005-03-23 23:47:11,670 - spam_application - INFO - finished auxiliary_module.Auxiliary.do_something
2005-03-23 23:47:11,671 - spam_application - INFO - calling auxiliary_module.some_function()
2005-03-23 23:47:11,672 - spam_application.auxiliary - INFO - received a call to 'some_function'
2005-03-23 23:47:11,673 - spam_application - INFO - done with auxiliary_module.some_function()

When we ran it:

PS C:\Users\Ashish Jain> cd .\OneDrive\Desktop\code\
PS C:\Users\Ashish Jain\OneDrive\Desktop\code> ls

Directory: C:\Users\Ashish Jain\OneDrive\Desktop\code

Mode     LastWriteTime        Length  Name
----     -------------        ------  ----
-a----   9/4/2020 11:30 PM      1126  app.py
-a----   9/4/2020 11:32 PM       518  auxiliary_module.py

PS C:\Users\Ashish Jain\OneDrive\Desktop\code> python app.py
PS C:\Users\Ashish Jain\OneDrive\Desktop\code>
PS C:\Users\Ashish Jain\OneDrive\Desktop\code> ls

Directory: C:\Users\Ashish Jain\OneDrive\Desktop\code

Mode     LastWriteTime        Length  Name
----     -------------        ------  ----
d-----   9/4/2020 11:36 PM            __pycache__
-a----   9/4/2020 11:30 PM      1126  app.py
-a----   9/4/2020 11:32 PM       518  auxiliary_module.py
-a----   9/4/2020 11:36 PM       988  spam.log

Contents of spam.log:

2020-09-04 23:36:37,281 - spam_application - INFO - creating an instance of auxiliary_module.Auxiliary
2020-09-04 23:36:37,281 - spam_application.auxiliary.Auxiliary - INFO - creating an instance of Auxiliary
2020-09-04 23:36:37,281 - spam_application - INFO - created an instance of auxiliary_module.Auxiliary
2020-09-04 23:36:37,281 - spam_application - INFO - calling auxiliary_module.Auxiliary.do_something
2020-09-04 23:36:37,281 - spam_application.auxiliary.Auxiliary - INFO - doing something
2020-09-04 23:36:37,281 - spam_application.auxiliary.Auxiliary - INFO - done doing something
2020-09-04 23:36:37,281 - spam_application - INFO - finished auxiliary_module.Auxiliary.do_something
2020-09-04 23:36:37,281 - spam_application - INFO - calling auxiliary_module.some_function()
2020-09-04 23:36:37,281 - spam_application.auxiliary - INFO - received a call to "some_function"
2020-09-04 23:36:37,281 - spam_application - INFO - done with auxiliary_module.some_function()

Ref for above example: howto/logging-cookbook

A second example:

PS C:\Users\Ashish Jain> cd .\OneDrive\Desktop\code2\
PS C:\Users\Ashish Jain\OneDrive\Desktop\code2> ls

Directory: C:\Users\Ashish Jain\OneDrive\Desktop\code2

Mode     LastWriteTime        Length  Name
----     -------------        ------  ----
-a----   9/4/2020 11:49 PM      1072  app.py
-a----   9/4/2020 11:49 PM       325  submodule.py

File "app.py":

# app.py (runs when application starts)
import logging
import logging.config  # This is required. Otherwise, you get the error: AttributeError: module 'logging' has no attribute 'config'
import os.path
import submodule as sm

def main():
    logging_config = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'standard': {
                'format': '%(asctime)s [%(levelname)s] %(name)s: %(message)s'
            },
        },
        'handlers': {
            'default_handler': {
                'class': 'logging.FileHandler',
                'level': 'DEBUG',
                'formatter': 'standard',
                #'filename': os.path.join('logs', 'application.log'),
                'filename': 'application.log',
                'encoding': 'utf8'
            },
        },
        'loggers': {
            '': {
                'handlers': ['default_handler'],
                'level': 'DEBUG',
                'propagate': False
            }
        }
    }
    logging.config.dictConfig(logging_config)
    logger = logging.getLogger(__name__)
    logger.info("Application started.")
    sm.do_something()

if __name__ == '__main__':
    main()

File "submodule.py" has code:

import logging

# define top level module logger
logger = logging.getLogger(__name__)

def do_something():
    logger.info('Something happended.')
    try:
        logger.info("In 'try'.")
    except Exception as e:
        logger.exception(e)
        logger.exception('Something broke.')

Run...

PS C:\Users\Ashish Jain\OneDrive\Desktop\code2> python .\app.py
PS C:\Users\Ashish Jain\OneDrive\Desktop\code2> ls

Directory: C:\Users\Ashish Jain\OneDrive\Desktop\code2

Mode     LastWriteTime        Length  Name
----     -------------        ------  ----
d-----   9/4/2020 11:50 PM            __pycache__
-a----   9/4/2020 11:52 PM      1259  app.py
-a----   9/4/2020 11:52 PM       180  application.log
-a----   9/4/2020 11:49 PM       325  submodule.py

PS C:\Users\Ashish Jain\OneDrive\Desktop\code2>

Logs in file "application.log":

2020-09-04 23:52:00,208 [INFO] __main__: Application started.
2020-09-04 23:52:00,208 [INFO] submodule: Something happended.
2020-09-04 23:52:00,208 [INFO] submodule: In 'try'.

Ref for second example: stackoverflow

References
% realpython.com/python-logging
% Python/2 Logging
% Toptal - Python Logging
% docs.python-guide.org/writing/logging
% machinelearningplus
% zetcode
% tutorialspoint

Requests.get method, cleaning html and writing output to text file


Setup

(base) C:\Users\Ashish Jain>conda env list
# conda environments:
#
base       * E:\programfiles\Anaconda3
env_py_36    E:\programfiles\Anaconda3\envs\env_py_36
temp         E:\programfiles\Anaconda3\envs\temp
tf           E:\programfiles\Anaconda3\envs\tf

(base) C:\Users\Ashish Jain>conda create -n temp202009 python=3.8
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: E:\programfiles\Anaconda3\envs\temp202009

  added / updated specs:
    - python=3.8

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2020.7.22  |                0         164 KB
    python-3.8.5               |       h5fd99cc_1        18.7 MB
    sqlite-3.33.0              |       h2a8f88b_0         1.3 MB
    wheel-0.35.1               |             py_0          36 KB
    ------------------------------------------------------------
                                           Total:        20.2 MB

The following NEW packages will be INSTALLED:

  ca-certificates    pkgs/main/win-64::ca-certificates-2020.7.22-0
  certifi            pkgs/main/win-64::certifi-2020.6.20-py38_0
  openssl            pkgs/main/win-64::openssl-1.1.1g-he774522_1
  pip                pkgs/main/win-64::pip-20.2.2-py38_0
  python             pkgs/main/win-64::python-3.8.5-h5fd99cc_1
  setuptools         pkgs/main/win-64::setuptools-49.6.0-py38_0
  sqlite             pkgs/main/win-64::sqlite-3.33.0-h2a8f88b_0
  vc                 pkgs/main/win-64::vc-14.1-h0510ff6_4
  vs2015_runtime     pkgs/main/win-64::vs2015_runtime-14.16.27012-hf0eaf9b_3
  wheel              pkgs/main/noarch::wheel-0.35.1-py_0
  wincertstore       pkgs/main/win-64::wincertstore-0.2-py38_0
  zlib               pkgs/main/win-64::zlib-1.2.11-h62dcd97_4

Proceed ([y]/n)? y

Downloading and Extracting Packages
wheel-0.35.1         | 36 KB   | ##################################### | 100%
sqlite-3.33.0        | 1.3 MB  | ##################################### | 100%
ca-certificates-2020 | 164 KB  | ##################################### | 100%
python-3.8.5         | 18.7 MB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate temp202009
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) C:\Users\Ashish Jain>conda activate temp202009

(temp202009) C:\Users\Ashish Jain>pip install ipykernel jupyter jupyterlab
Collecting ipykernel
Collecting jupyter
Collecting jupyterlab
...
Building wheels for collected packages: pandocfilters, pyrsistent
  Building wheel for pandocfilters (setup.py) ... done
  Created wheel for pandocfilters: filename=pandocfilters-1.4.2-py3-none-any.whl size=7861 sha256=eaf50b551ad8291621c8a87234dca80f07b0e9b1603ec8ad7179740f988b4dec
  Stored in directory: c:\users\ashish jain\appdata\local\pip\cache\wheels\f6\08\65\e4636b703d0e870cd62692dafd6b47db27287fe80cea433722
  Building wheel for pyrsistent (setup.py) ... done
  Created wheel for pyrsistent: filename=pyrsistent-0.16.0-cp38-cp38-win_amd64.whl size=71143 sha256=1f0233569beedcff74c358bd0666684c2a0f2d74b56fbdea893711c2f1a761f8
  Stored in directory: c:\users\ashish jain\appdata\local\pip\cache\wheels\17\be\0f\727fb20889ada6aaaaba861f5f0eb21663533915429ad43f28
Successfully built pandocfilters pyrsistent
Installing collected packages: tornado, ipython-genutils, traitlets, pyzmq, six, python-dateutil, pywin32, jupyter-core, jupyter-client, colorama, parso, jedi, pygments, backcall, wcwidth, prompt-toolkit, decorator, pickleshare, ipython, ipykernel, jupyter-console, qtpy, qtconsole, MarkupSafe, jinja2, attrs, pyrsistent, jsonschema, nbformat, mistune, pyparsing, packaging, webencodings, bleach, pandocfilters, entrypoints, testpath, defusedxml, nbconvert, pywinpty, terminado, prometheus-client, Send2Trash, pycparser, cffi, argon2-cffi, notebook, widgetsnbextension, ipywidgets, jupyter, json5, urllib3, chardet, idna, requests, jupyterlab-server, jupyterlab
Successfully installed MarkupSafe-1.1.1 Send2Trash-1.5.0 argon2-cffi-20.1.0 attrs-20.1.0 backcall-0.2.0 bleach-3.1.5 cffi-1.14.2 chardet-3.0.4 colorama-0.4.3 decorator-4.4.2 defusedxml-0.6.0 entrypoints-0.3 idna-2.10 ipykernel-5.3.4 ipython-7.18.1 ipython-genutils-0.2.0 ipywidgets-7.5.1 jedi-0.17.2 jinja2-2.11.2 json5-0.9.5 jsonschema-3.2.0 jupyter-1.0.0 jupyter-client-6.1.7 jupyter-console-6.2.0 jupyter-core-4.6.3 jupyterlab-2.2.6 jupyterlab-server-1.2.0 mistune-0.8.4 nbconvert-5.6.1 nbformat-5.0.7 notebook-6.1.3 packaging-20.4 pandocfilters-1.4.2 parso-0.7.1 pickleshare-0.7.5 prometheus-client-0.8.0 prompt-toolkit-3.0.7 pycparser-2.20 pygments-2.6.1 pyparsing-2.4.7 pyrsistent-0.16.0 python-dateutil-2.8.1 pywin32-228 pywinpty-0.5.7 pyzmq-19.0.2 qtconsole-4.7.7 qtpy-1.9.0 requests-2.24.0 six-1.15.0 terminado-0.8.3 testpath-0.4.4 tornado-6.0.4 traitlets-5.0.3 urllib3-1.25.10 wcwidth-0.2.5 webencodings-0.5.1 widgetsnbextension-3.5.1

(temp202009) C:\Users\Ashish Jain>python -m ipykernel install --user --name temp202009
Installed kernelspec temp202009 in C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\temp202009

=== === === ===

ERROR: ImportError: DLL load failed while importing win32api: The specified module could not be found.

(temp202009) E:\>conda install pywin32

=== === === ===

(temp202009) E:\>pip install htmllaundry

(temp202009) E:\>pip install html-sanitizer
Collecting html-sanitizer
Collecting beautifulsoup4
Collecting soupsieve>1.2
  Downloading soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4, html-sanitizer
Successfully installed beautifulsoup4-4.9.1 html-sanitizer-1.9.1 soupsieve-2.0.1

Issues faced with pulling an article using the "newsapi" and "newspaper" packages:
#1 Exception occurred for: <newspaper.article.Article object at 0x00000248F12896D8> and 2020-08-08T16:55:21Z
Article `download()` failed with 503 Server Error: Service Unavailable for url: https://www.marketwatch.com/story/profit-up-87-at-buffetts-berkshire-but-coronavirus-slows-businesses-2020-08-08 on URL https://www.marketwatch.com/story/profit-up-87-at-buffetts-berkshire-but-coronavirus-slows-businesses-2020-08-08

#2 Exception occurred for: <newspaper.article.Article object at 0x00000248F1297B70> and 2020-08-11T22:59:42Z
Article `download()` failed with 403 Client Error: Forbidden for url: https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues on URL https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues

#3 Exception occurred for: <newspaper.article.Article object at 0x00000248F12AC550> and 2020-08-11T16:17:55Z
Article `download()` failed with HTTPSConnectionPool(host='www.freerepublic.com', port=443): Max retries exceeded with url: /focus/f-news/3873373/posts (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])"))) on URL https://www.freerepublic.com/focus/f-news/3873373/posts

Trying a fix using the Python shell:

(base) C:\Users\Ashish Jain>python
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues')
<Response [200]>
>>> requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text
'<!DOCTYPE html><html itemscope="" itemtype="https://schema.org/WebPage" lang="en">...
>>> with open('html.txt', 'w') as f:
...     f.write(requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "E:\programfiles\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 13665: character maps to <undefined>
>>> with open('html.txt', 'w', encoding="utf-8") as f:
...     f.write(requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text)
...
636685

Now we have the HTML. Next, we clean it to remove the HTML tags. (In the snippets below, 'r' holds the response returned by requests.get() for the Seeking Alpha URL above.)

Using htmllaundry

from htmllaundry import sanitize

!pip show htmllaundry
Name: htmllaundry
Version: 2.2
Summary: Simple HTML cleanup utilities
Home-page: UNKNOWN
Author: Wichert Akkerman
Author-email: wichert@wiggy.net
License: BSD
Location: e:\programfiles\anaconda3\envs\temp202009\lib\site-packages
Requires: lxml, six
Required-by:

sanitize(r.text)
'<p>\n\n\n \n \n Access to this page has been denied.\n \n \n \n\n\n\n \n \n To continue, please prove you are not a robot\n \n \n \n \n \n \n </p><p>\n To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser.<br/>\n Is this happening to you frequently? Please <a href="https://seekingalpha.userecho.com?source=captcha" rel="nofollow">report it on our feedback forum</a>.\n </p>\n <p>\n If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh.\n </p>\n <p>Reference ID: </p>\n \n \n \n\n\n\n\n\n\n\n'

from htmllaundry import strip_markup

cleantext = strip_markup(sanitize(r.text)).strip()
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
print(cleantext)
'Access to this page has been denied. To continue, please prove you are not a robot To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser. Is this happening to you frequently? Please report it on our feedback forum. If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. Reference ID:'

Using html_sanitizer

from html_sanitizer import Sanitizer

!pip show html_sanitizer
Name: html-sanitizer
Version: 1.9.1
Summary: HTML sanitizer
Home-page: https://github.com/matthiask/html-sanitizer/
Author: Matthias Kestenholz
Author-email: mk@feinheit.ch
License: BSD License
Location: e:\programfiles\anaconda3\envs\temp202009\lib\site-packages
Requires: beautifulsoup4, lxml
Required-by:

sanitizer = Sanitizer()
cleantext = sanitizer.sanitize(r.text).strip()
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
print(cleantext)
'Access to this page has been denied. <h1>To continue, please prove you are not a robot</h1> <p> To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser.<br> Is this happening to you frequently? Please <a href="https://seekingalpha.userecho.com?source=captcha">report it on our feedback forum</a>. </p> <p> If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. </p> <p>Reference ID: </p>'

Using beautifulsoup4

import re
from bs4 import BeautifulSoup

cleantext = BeautifulSoup(r.text, "lxml").text
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
cleantext.strip()
'Access to this page has been denied. To continue, please prove you are not a robot To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser. Is this happening to you frequently? Please report it on our feedback forum. If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. Reference ID:'
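Putting the whole post together, a small sketch of a fetch-clean-save helper (using the BeautifulSoup approach from above; the URL and output filename are placeholders):

import re
import requests
from bs4 import BeautifulSoup

def fetch_clean_save(url, out_path):
    # Fetch the page, strip HTML tags, collapse whitespace, and
    # write the cleaned text to a UTF-8 encoded file.
    r = requests.get(url)
    cleantext = BeautifulSoup(r.text, "lxml").text
    cleantext = re.sub(r"(\n)+", " ", cleantext)
    cleantext = re.sub(r"\s+", " ", cleantext).strip()
    with open(out_path, 'w', encoding="utf-8") as f:
        f.write(cleantext)
    return cleantext

# Example (hypothetical URL):
# print(fetch_clean_save("https://example.com/article", "html.txt")[:200])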

Thursday, September 3, 2020

Working with base 64 encoding using Windows CMD



We have a zip file "input1.zip" that we will turn into "output1.txt" using base-64 encoding:

C:\Users\Ashish\Desktop\e5>certutil -encode input1.zip output1.txt 

Input Length = 202
Output Length = 338
CertUtil: -encode command completed successfully. 

Notes about "output1.txt":
1. This is the output file from 'certutil'. 
2. This has character encoding base64.
3. The file encoding is utf-8. 
4. Maximum length of a line is 64.
5. Base64 encoding usually has last few characters as "=". "=" represents padding.
6. The first line in encoded file is: -----BEGIN CERTIFICATE-----
7. Last line in encoded file is: -----END CERTIFICATE-----


C:\Users\Ashish\Desktop\e5>certutil -decode output1.txt input2.zip
Input Length = 338
Output Length = 202
CertUtil: -decode command completed successfully.

Contents of "output1.txt" with header and footer:

-----BEGIN CERTIFICATE-----
UEsDBBQAAAAAAEy8IVE3rlRbAgAAAAIAAAAGAAAAdDEudHh0dDFQSwMEFAAAAAAA
TrwhUY3/XcICAAAAAgAAAAYAAAB0Mi50eHR0MlBLAQIUABQAAAAAAEy8IVE3rlRb
AgAAAAIAAAAGAAAAAAAAAAEAIAAAAAAAAAB0MS50eHRQSwECFAAUAAAAAABOvCFR
jf9dwgIAAAACAAAABgAAAAAAAAABACAAAAAmAAAAdDIudHh0UEsFBgAAAAACAAIA
aAAAAEwAAAAAAA==
-----END CERTIFICATE-----

Contents of "output1.txt" without header and footer:

C:\Users\Ashish\Desktop\e5>type output1.txt | find /V "-----BEGIN CERTIFICATE-----" | find /V "-----END CERTIFICATE-----"

UEsDBBQAAAAAAEy8IVE3rlRbAgAAAAIAAAAGAAAAdDEudHh0dDFQSwMEFAAAAAAA
TrwhUY3/XcICAAAAAgAAAAYAAAB0Mi50eHR0MlBLAQIUABQAAAAAAEy8IVE3rlRb
AgAAAAIAAAAGAAAAAAAAAAEAIAAAAAAAAAB0MS50eHRQSwECFAAUAAAAAABOvCFR
jf9dwgIAAAACAAAABgAAAAAAAAABACAAAAAmAAAAdDIudHh0UEsFBgAAAAACAAIA
aAAAAEwAAAAAAA==

Encoding the input file without header and footer:

C:\Users\Ashish\Desktop\e5>certutil -encodehex -f input1.zip output2.txt 0x40000001

Input Length = 202
Output Length = 272
CertUtil: -encodehex command completed successfully.

Contents of output2.txt:

UEsDBBQAAAAAAEy8IVE3rlRbAgAAAAIAAAAGAAAAdDEudHh0dDFQSwMEFAAAAAAATrwhUY3/XcICAAAAAgAAAAYAAAB0Mi50eHR0MlBLAQIUABQAAAAAAEy8IVE3rlRbAgAAAAIAAAAGAAAAAAAAAAEAIAAAAAAAAAB0MS50eHRQSwECFAAUAAAAAABOvCFRjf9dwgIAAAACAAAABgAAAAAAAAABACAAAAAmAAAAdDIudHh0UEsFBgAAAAACAAIAaAAAAEwAAAAAAA==
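For reference, the same round-trip can be done in Python's standard library (a minimal sketch; the file names match the certutil examples above):

import base64

# Encode input1.zip to base64 text (one long line, like certutil's
# -encodehex 0x40000001 output without headers).
with open('input1.zip', 'rb') as f:
    encoded = base64.b64encode(f.read()).decode('ascii')
with open('output2.txt', 'w') as f:
    f.write(encoded)

# Decode it back; the result is byte-identical to input1.zip.
with open('output2.txt') as f:
    decoded = base64.b64decode(f.read())
with open('input2.zip', 'wb') as f:
    f.write(decoded)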

The limitation on the size of the input file when encoding using certutil:

Interesting stats about encoding found in webpages: