Friday, September 4, 2020

The requests.get method, cleaning HTML, and writing output to a text file


Setup

(base) C:\Users\Ashish Jain>conda env list
# conda environments:
#
base          *  E:\programfiles\Anaconda3
env_py_36        E:\programfiles\Anaconda3\envs\env_py_36
temp             E:\programfiles\Anaconda3\envs\temp
tf               E:\programfiles\Anaconda3\envs\tf

(base) C:\Users\Ashish Jain>conda create -n temp202009 python=3.8
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: E:\programfiles\Anaconda3\envs\temp202009

  added / updated specs:
    - python=3.8

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2020.7.22  |                0         164 KB
    python-3.8.5               |       h5fd99cc_1        18.7 MB
    sqlite-3.33.0              |       h2a8f88b_0         1.3 MB
    wheel-0.35.1               |             py_0          36 KB
    ------------------------------------------------------------
                                           Total:        20.2 MB

The following NEW packages will be INSTALLED:

  ca-certificates    pkgs/main/win-64::ca-certificates-2020.7.22-0
  certifi            pkgs/main/win-64::certifi-2020.6.20-py38_0
  openssl            pkgs/main/win-64::openssl-1.1.1g-he774522_1
  pip                pkgs/main/win-64::pip-20.2.2-py38_0
  python             pkgs/main/win-64::python-3.8.5-h5fd99cc_1
  setuptools         pkgs/main/win-64::setuptools-49.6.0-py38_0
  sqlite             pkgs/main/win-64::sqlite-3.33.0-h2a8f88b_0
  vc                 pkgs/main/win-64::vc-14.1-h0510ff6_4
  vs2015_runtime     pkgs/main/win-64::vs2015_runtime-14.16.27012-hf0eaf9b_3
  wheel              pkgs/main/noarch::wheel-0.35.1-py_0
  wincertstore       pkgs/main/win-64::wincertstore-0.2-py38_0
  zlib               pkgs/main/win-64::zlib-1.2.11-h62dcd97_4

Proceed ([y]/n)? y

Downloading and Extracting Packages
wheel-0.35.1         | 36 KB   | ##################################### | 100%
sqlite-3.33.0        | 1.3 MB  | ##################################### | 100%
ca-certificates-2020 | 164 KB  | ##################################### | 100%
python-3.8.5         | 18.7 MB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate temp202009
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) C:\Users\Ashish Jain>conda activate temp202009
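As an aside, the same environment can be captured declaratively so it is reproducible with one command. A minimal sketch (the file name environment.yml is my choice; the Python pin comes from the transcript above, and the pip packages are the ones installed in the next step):

environment.yml:

name: temp202009
channels:
  - defaults
dependencies:
  - python=3.8
  - pip
  - pip:
      - ipykernel
      - jupyter
      - jupyterlab

(base) C:\Users\Ashish Jain>conda env create -f environment.yml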
(temp202009) C:\Users\Ashish Jain>pip install ipykernel jupyter jupyterlab
Collecting ipykernel
Collecting jupyter
Collecting jupyterlab
...
Building wheels for collected packages: pandocfilters, pyrsistent
  Building wheel for pandocfilters (setup.py) ... done
  Created wheel for pandocfilters: filename=pandocfilters-1.4.2-py3-none-any.whl size=7861 sha256=eaf50b551ad8291621c8a87234dca80f07b0e9b1603ec8ad7179740f988b4dec
  Stored in directory: c:\users\ashish jain\appdata\local\pip\cache\wheels\f6\08\65\e4636b703d0e870cd62692dafd6b47db27287fe80cea433722
  Building wheel for pyrsistent (setup.py) ... done
  Created wheel for pyrsistent: filename=pyrsistent-0.16.0-cp38-cp38-win_amd64.whl size=71143 sha256=1f0233569beedcff74c358bd0666684c2a0f2d74b56fbdea893711c2f1a761f8
  Stored in directory: c:\users\ashish jain\appdata\local\pip\cache\wheels\17\be\0f\727fb20889ada6aaaaba861f5f0eb21663533915429ad43f28
Successfully built pandocfilters pyrsistent
Installing collected packages: tornado, ipython-genutils, traitlets, pyzmq, six, python-dateutil, pywin32, jupyter-core, jupyter-client, colorama, parso, jedi, pygments, backcall, wcwidth, prompt-toolkit, decorator, pickleshare, ipython, ipykernel, jupyter-console, qtpy, qtconsole, MarkupSafe, jinja2, attrs, pyrsistent, jsonschema, nbformat, mistune, pyparsing, packaging, webencodings, bleach, pandocfilters, entrypoints, testpath, defusedxml, nbconvert, pywinpty, terminado, prometheus-client, Send2Trash, pycparser, cffi, argon2-cffi, notebook, widgetsnbextension, ipywidgets, jupyter, json5, urllib3, chardet, idna, requests, jupyterlab-server, jupyterlab
Successfully installed MarkupSafe-1.1.1 Send2Trash-1.5.0 argon2-cffi-20.1.0 attrs-20.1.0 backcall-0.2.0 bleach-3.1.5 cffi-1.14.2 chardet-3.0.4 colorama-0.4.3 decorator-4.4.2 defusedxml-0.6.0 entrypoints-0.3 idna-2.10 ipykernel-5.3.4 ipython-7.18.1 ipython-genutils-0.2.0 ipywidgets-7.5.1 jedi-0.17.2 jinja2-2.11.2 json5-0.9.5 jsonschema-3.2.0 jupyter-1.0.0 jupyter-client-6.1.7 jupyter-console-6.2.0 jupyter-core-4.6.3 jupyterlab-2.2.6 jupyterlab-server-1.2.0 mistune-0.8.4 nbconvert-5.6.1 nbformat-5.0.7 notebook-6.1.3 packaging-20.4 pandocfilters-1.4.2 parso-0.7.1 pickleshare-0.7.5 prometheus-client-0.8.0 prompt-toolkit-3.0.7 pycparser-2.20 pygments-2.6.1 pyparsing-2.4.7 pyrsistent-0.16.0 python-dateutil-2.8.1 pywin32-228 pywinpty-0.5.7 pyzmq-19.0.2 qtconsole-4.7.7 qtpy-1.9.0 requests-2.24.0 six-1.15.0 terminado-0.8.3 testpath-0.4.4 tornado-6.0.4 traitlets-5.0.3 urllib3-1.25.10 wcwidth-0.2.5 webencodings-0.5.1 widgetsnbextension-3.5.1

(temp202009) C:\Users\Ashish Jain>python -m ipykernel install --user --name temp202009
Installed kernelspec temp202009 in C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\temp202009

=== === === ===

One error came up along the way; reinstalling pywin32 through conda resolved it:

ERROR: ImportError: DLL load failed while importing win32api: The specified module could not be found.

(temp202009) E:\>conda install pywin32

=== === === ===

Next, the two HTML-cleaning libraries:

(temp202009) E:\>pip install htmllaundry

(temp202009) E:\>pip install html-sanitizer
Collecting html-sanitizer
Collecting beautifulsoup4
Collecting soupsieve>1.2
  Downloading soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4, html-sanitizer
Successfully installed beautifulsoup4-4.9.1 html-sanitizer-1.9.1 soupsieve-2.0.1
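A quick way to confirm the pywin32 fix took (this smoke test is my own addition, not from the original session):

# Importing win32api raises the same "DLL load failed" ImportError
# if the pywin32 installation is still broken.
import win32api
print("win32api OK")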
Issues faced with pulling articles using the "newsapi" and "newspaper" packages:

#1 Exception occurred for: <newspaper.article.Article object at 0x00000248F12896D8> and 2020-08-08T16:55:21Z
Article `download()` failed with 503 Server Error: Service Unavailable for url: https://www.marketwatch.com/story/profit-up-87-at-buffetts-berkshire-but-coronavirus-slows-businesses-2020-08-08 on URL https://www.marketwatch.com/story/profit-up-87-at-buffetts-berkshire-but-coronavirus-slows-businesses-2020-08-08

#2 Exception occurred for: <newspaper.article.Article object at 0x00000248F1297B70> and 2020-08-11T22:59:42Z
Article `download()` failed with 403 Client Error: Forbidden for url: https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues on URL https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues

#3 Exception occurred for: <newspaper.article.Article object at 0x00000248F12AC550> and 2020-08-11T16:17:55Z
Article `download()` failed with HTTPSConnectionPool(host='www.freerepublic.com', port=443): Max retries exceeded with url: /focus/f-news/3873373/posts (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])"))) on URL https://www.freerepublic.com/focus/f-news/3873373/posts

Trying a fix using the Python shell: a plain requests.get succeeds where newspaper's download() failed.

(base) C:\Users\Ashish Jain>python
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues')
<Response [200]>
>>> requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text
'<!DOCTYPE html><html itemscope="" itemtype="https://schema.org/WebPage" lang="en">...

Writing the response straight to a file fails, because Python on Windows defaults to the cp1252 encoding, which cannot represent every character in the page:

>>> with open('html.txt', 'w') as f:
...     f.write(requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "E:\programfiles\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 13665: character maps to <undefined>

Opening the file with an explicit encoding="utf-8" fixes it (the number echoed back is the count of characters written):

>>> with open('html.txt', 'w', encoding="utf-8") as f:
...     f.write(requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text)
...
636685

Now we have the HTML.
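Putting the workaround together: a minimal sketch (the helper name fetch_html is my own, not from the original session) that tries newspaper first and falls back to a plain requests.get, saving the raw HTML as UTF-8:

import requests
from newspaper import Article

def fetch_html(url):
    # Try newspaper first; its download() raised the 403/503/SSL errors above.
    try:
        article = Article(url)
        article.download()
        return article.html
    except Exception as exc:
        print("newspaper failed, falling back to requests:", exc)
        return requests.get(url).text  # the plain GET returned 200 above

url = 'https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues'
with open('html.txt', 'w', encoding='utf-8') as f:  # utf-8 avoids the cp1252 error
    f.write(fetch_html(url))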
Next, we clean it to remove the HTML tags. In the snippets below, r is the Response object from above, i.e. r = requests.get(url). Note that even though the request returned 200, the body turns out to be Seeking Alpha's bot-check page rather than the article, which the stripped text makes obvious.

Using htmllaundry

from htmllaundry import sanitize

!pip show htmllaundry
Name: htmllaundry
Version: 2.2
Summary: Simple HTML cleanup utilities
Home-page: UNKNOWN
Author: Wichert Akkerman
Author-email: wichert@wiggy.net
License: BSD
Location: e:\programfiles\anaconda3\envs\temp202009\lib\site-packages
Requires: lxml, six
Required-by:

sanitize(r.text)
'<p>\n\n\n \n \n Access to this page has been denied.\n \n \n \n\n\n\n \n \n To continue, please prove you are not a robot\n \n \n \n \n \n \n </p><p>\n To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser.<br/>\n Is this happening to you frequently? Please <a href="https://seekingalpha.userecho.com?source=captcha" rel="nofollow">report it on our feedback forum</a>.\n </p>\n <p>\n If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh.\n </p>\n <p>Reference ID: </p>\n \n \n \n\n\n\n\n\n\n\n'

sanitize() keeps basic markup, so we strip the remaining tags with strip_markup and collapse the whitespace (note the import re, which the original snippet was missing):

import re
from htmllaundry import strip_markup

cleantext = strip_markup(sanitize(r.text)).strip()
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
print(cleantext)

'Access to this page has been denied. To continue, please prove you are not a robot To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser. Is this happening to you frequently? Please report it on our feedback forum. If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. Reference ID:'

Using html_sanitizer

from html_sanitizer import Sanitizer

!pip show html_sanitizer
Name: html-sanitizer
Version: 1.9.1
Summary: HTML sanitizer
Home-page: https://github.com/matthiask/html-sanitizer/
Author: Matthias Kestenholz
Author-email: mk@feinheit.ch
License: BSD License
Location: e:\programfiles\anaconda3\envs\temp202009\lib\site-packages
Requires: beautifulsoup4, lxml
Required-by:

sanitizer = Sanitizer()
cleantext = sanitizer.sanitize(r.text).strip()
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
print(cleantext)

'Access to this page has been denied. <h1>To continue, please prove you are not a robot</h1> <p> To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser.<br> Is this happening to you frequently? Please <a href="https://seekingalpha.userecho.com?source=captcha">report it on our feedback forum</a>. </p> <p> If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. </p> <p>Reference ID: </p>'

As the output shows, html_sanitizer whitelists tags such as h1, p and a rather than removing them, so it is a sanitizer, not a text extractor.

Using beautifulsoup4

import re
from bs4 import BeautifulSoup

cleantext = BeautifulSoup(r.text, "lxml").text
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
cleantext.strip()

'Access to this page has been denied. To continue, please prove you are not a robot To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser. Is this happening to you frequently? Please report it on our feedback forum. If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. Reference ID:'
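End to end, the approach that worked can be condensed into one small script. A minimal sketch using the BeautifulSoup route above (the function name clean_html is my own):

import re
import requests
from bs4 import BeautifulSoup

def clean_html(html):
    # Strip tags, then collapse newlines and runs of whitespace.
    text = BeautifulSoup(html, "lxml").text
    text = re.sub(r"(\n)+", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

url = 'https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues'
r = requests.get(url)

with open('html.txt', 'w', encoding='utf-8') as f:  # utf-8 avoids the cp1252 error
    f.write(r.text)

print(clean_html(r.text))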
