
Tuesday, February 23, 2021

HTTP Error Codes and REST APIs



HTTP response status codes indicate whether a specific HTTP request has been successfully completed. Responses are grouped in five classes:

1. Informational responses (100–199)
2. Successful responses (200–299)
3. Redirects (300–399)
4. Client errors (400–499)
5. Server errors (500–599)

If you receive a response that is not in this list, it is a non-standard response, possibly custom to the server's software.
Ref: developer.mozilla.org

All HTTP response status codes are separated into five classes or categories. The first digit of the status code defines the class of response, while the last two digits do not have any classifying or categorization role. There are five classes defined by the standard:

1xx informational response – the request was received, continuing process
2xx successful – the request was successfully received, understood, and accepted
3xx redirection – further action needs to be taken in order to complete the request
4xx client error – the request contains bad syntax or cannot be fulfilled
5xx server error – the server failed to fulfil an apparently valid request

Ref: en.wikipedia.org

Some common error codes one must know:

401 Unauthorized
Although the HTTP standard specifies "unauthorized", semantically this response means "unauthenticated". That is, the client must authenticate itself to get the requested response.

403 Forbidden
The client does not have access rights to the content; that is, it is unauthorized, so the server is refusing to give the requested resource. Unlike 401, the client's identity is known to the server.

405 Method Not Allowed
The request method is known by the server but has been disabled and cannot be used. For example, an API may forbid DELETE-ing a resource. The two mandatory methods, GET and HEAD, must never be disabled and should not return this error code.

415 Unsupported Media Type
The media format of the requested data is not supported by the server, so the server is rejecting the request.
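
As a quick illustration (not part of the referenced documentation), the class of a status code can be checked programmatically from its first digit; here is a minimal sketch using the requests library, with the URL as an example only:

import requests

# Map the first digit of the status code to its class.
CLASSES = {1: "Informational", 2: "Successful", 3: "Redirection",
           4: "Client error", 5: "Server error"}

response = requests.get('http://survival8.blogspot.com/')
print(response.status_code, CLASSES.get(response.status_code // 100, "Non-standard"))

# requests can also raise an exception for 4xx/5xx responses:
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as error:
    print("Request failed:", error)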

RESTful API Response Codes (used by Amazon Drive API)

The HTTP status codes used by the RESTful Amazon Drive API:

HTTP Status Code --- Description
200 OK --- Successful.
201 Created --- Created. Status code '201' is important for REST APIs that are performing some action such as raising a ticket or logging something.
400 Bad Request --- Bad input parameter. The error message should indicate which one and why.
401 Unauthorized --- The client passed an invalid auth token. The client should refresh the token and then try again.
403 Forbidden --- Any of the following:
    * Customer doesn't exist.
    * Application not registered.
    * Application tried to access properties that do not belong to the app.
    * Application tried to trash/purge the root node.
    * Application tried to update contentProperties.
    * Operation is blocked (for third-party apps).
    * Customer account is over quota.
404 Not Found --- Resource not found.
405 Method Not Allowed --- The resource doesn't support the specified HTTP verb.
409 Conflict --- Conflict.
411 Length Required --- The Content-Length header was not specified.
412 Precondition Failed --- Precondition failed.
429 Too Many Requests --- Too many requests (rate limiting).
500 Internal Server Error --- Servers are not working as expected. The request is probably valid but needs to be requested again later.
503 Service Unavailable --- Service unavailable.

Ref: developer.amazon.com (Dated: 24 Feb 2021)

Additional Notes

In Europe, the NotFound project, created by multiple European organizations including Missing Children Europe and Child Focus, encourages site operators to add a snippet of code to serve customized 404 error pages which provide data about missing children.
Ref: HTTP 404
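
To make the 401 row concrete, here is a minimal, hypothetical sketch of the refresh-and-retry pattern it describes; refresh_access_token() and the URL are placeholders, not part of the Amazon Drive API documentation quoted above:

import requests

def refresh_access_token():
    # Placeholder: obtain a new auth token from your authorization server.
    raise NotImplementedError

def get_with_retry(url, token):
    # First attempt with the current token.
    response = requests.get(url, headers={"Authorization": "Bearer " + token})
    if response.status_code == 401:
        # 401 Unauthorized: the token is invalid or expired, so refresh and retry once.
        token = refresh_access_token()
        response = requests.get(url, headers={"Authorization": "Bearer " + token})
    return response
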
Tags: Technology, Web Development, Web Scraping

Tuesday, February 16, 2021

Getting a Web Server's Response Header Using Python



Here is a Python script that gets the response headers for a website:

from datetime import datetime
import requests

url = 'http://survival8.blogspot.com/'

# Fetch the page.
x = requests.get(url)

# Print the response headers sent back by the server.
print(x.headers)

curr_time = datetime.now()

# We also write our main HTML output to a timestamped log file.
# ':' is not allowed in Windows file names, and an explicit UTF-8 encoding avoids encoding errors.
with open("s8_" + str(curr_time).replace(":", "_") + ".log", mode='w', encoding='utf-8') as f:
    f.write(x.text)

The output of this code looks as shown below:

(base) ~/Desktop$ python response_header_info.py 

{'Content-Type': 'text/html; charset=UTF-8', 'Expires': 'Tue, 16 Feb 2021 10:13:29 GMT', 'Date': 'Tue, 16 Feb 2021 10:13:29 GMT', 'Cache-Control': 'private, max-age=0', 'Last-Modified': 'Tue, 16 Feb 2021 08:54:25 GMT', 'ETag': 'W/"047a2cb250a2ad10a53227bf4085727f97833f5235788c95f99a149e4d1afa68"', 'Content-Encoding': 'gzip', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '135818', 'Server': 'GSE'} 
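
Individual headers can also be read from the same response object; requests exposes them as a case-insensitive dictionary. A minimal sketch:

import requests

x = requests.get('http://survival8.blogspot.com/')

print(x.headers['Content-Type'])           # e.g. 'text/html; charset=UTF-8'
print(x.headers.get('server'))             # lookups are case-insensitive
print(x.headers.get('X-Frame-Options'))    # None if the header is absent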

Next, we discuss some important response headers:

1: Response header
Ref: developer.mozilla.org

A response header is an HTTP header that can be used in an HTTP response and that doesn't relate to the content of the message. Response headers, like Age, Location or Server are used to give a more detailed context of the response.

Not all headers appearing in a response are categorized as response headers by the specification. For example, the Content-Length header is a representation metadata header indicating the size of the body of the response message (it was categorized as an entity header in older versions of the specification). However, "conversationally" all headers are usually referred to as response headers in a response message.

The following shows a few response headers after a GET request. Note that, strictly speaking, the Content-Encoding and Content-Type headers are entity headers:

200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Date: Mon, 18 Jul 2016 16:06:00 GMT
Etag: "c561c68d0ba92bbeb8b0f612a9199f722e3a621a"
Keep-Alive: timeout=5, max=997
Last-Modified: Mon, 18 Jul 2016 02:36:04 GMT
Server: Apache
Set-Cookie: mykey=myvalue; expires=Mon, 17-Jul-2017 16:06:00 GMT; Max-Age=31449600; Path=/; secure
Transfer-Encoding: chunked
Vary: Cookie, Accept-Encoding
X-Backend-Server: developer2.webapp.scl3.mozilla.com
X-Cache-Info: not cacheable; meta data too large
X-kuma-revision: 1085259
x-frame-options: DENY

###

2: 'Cache-Control': 'private'

Ref: developer.mozilla.org

Cacheability

Directives that define whether a response/request can be cached, where it may be cached, and whether it must be validated with the origin server before caching.

public
    The response may be stored by any cache, even if the response is normally non-cacheable.

private
    The response may be stored only by a browser's cache, even if the response is normally non-cacheable. If you mean to not store the response in any cache, use no-store instead. This directive is not effective in preventing caches from storing your response.

no-cache
    The response may be stored by any cache, even if the response is normally non-cacheable. However, the stored response MUST always go through validation with the origin server first before using it; therefore, you cannot use no-cache in conjunction with immutable. If you mean to not store the response in any cache, use no-store instead. This directive is not effective in preventing caches from storing your response.

no-store
    The response may not be stored in any cache. Note that this will not prevent a valid pre-existing cached response being returned. Clients can set max-age=0 to also clear existing cache responses, as this forces the cache to revalidate with the server (no other directives have an effect when used with no-store). 
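
As a small illustration (not part of the referenced documentation), the Cache-Control value returned by requests can be split into its individual directives:

import requests

x = requests.get('http://survival8.blogspot.com/')

# Split e.g. 'private, max-age=0' into ['private', 'max-age=0'].
cache_control = x.headers.get('Cache-Control', '')
directives = [d.strip() for d in cache_control.split(',') if d.strip()]
print(directives)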

###

'Transfer-Encoding': 'chunked'

The Transfer-Encoding header specifies the form of encoding used to safely transfer the payload body to the user.

chunked
    Data is sent in a series of chunks. The Content-Length header is omitted in this case and at the beginning of each chunk you need to add the length of the current chunk in hexadecimal format, followed by '\r\n' and then the chunk itself, followed by another '\r\n'. The terminating chunk is a regular chunk, with the exception that its length is zero. It is followed by the trailer, which consists of a (possibly empty) sequence of entity header fields.
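
To make the chunk format concrete, here is a small illustrative sketch (a made-up two-chunk body carrying the text "Hello, world!"); it only shows what a chunked payload looks like on the wire:

# Each chunk is: <length in hex>\r\n<chunk data>\r\n, terminated by a zero-length chunk.
chunked_body = (
    b"7\r\n"        # length of the first chunk in hexadecimal (7 bytes)
    b"Hello, \r\n"
    b"6\r\n"        # length of the second chunk (6 bytes)
    b"world!\r\n"
    b"0\r\n"        # terminating chunk of length zero
    b"\r\n"         # final CRLF after an (empty) trailer
)
print(chunked_body.decode())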

### 

'Content-Type': 'application/json; charset=utf-8'

Content-type: application/json; charset=utf-8 designates the content to be in JSON format, encoded in the UTF-8 character encoding.
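
When a response carries this header, requests can decode the body directly; a minimal sketch using the httpbin.org test service (assuming it is reachable):

import requests

x = requests.get('https://httpbin.org/get')
print(x.headers['Content-Type'])   # 'application/json'
print(x.json())                    # body parsed into a Python dict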

### 

'Server': 'Private Server'

The Server header describes the software used by the origin server that handled the request — that is, the server that generated the response.

Examples: 
  Server: Apache/2.4.1 (Unix)

Ref: developer.mozilla.org

### 

'jsonerror': 'true'

No documentation was found for this header; it appears to be a non-standard, application-specific header.

###

'X-Frame-Options': 'SAMEORIGIN'

The X-Frame-Options HTTP response header can be used to indicate whether or not a browser should be allowed to render a page in a <frame>, <iframe>, <embed> or <object>. Sites can use this to avoid click-jacking attacks, by ensuring that their content is not embedded into other sites.

The added security is provided only if the user accessing the document is using a browser that supports X-Frame-Options.

There are two possible directives for X-Frame-Options:

X-Frame-Options: DENY
X-Frame-Options: SAMEORIGIN


SAMEORIGIN
    The page can only be displayed in a frame on the same origin as the page itself. The spec leaves it up to browser vendors to decide whether this option applies to the top level, the parent, or the whole chain, although it is argued that the option is not very useful unless all ancestors are also in the same origin (see bug 725490). Also see Browser compatibility for support details.

Ref: developer.mozilla.org
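
A quick way to see which of these directives a site sends is to inspect its response headers; a minimal sketch (the URL is only an example):

import requests

x = requests.get('https://developer.mozilla.org/')
# Prints e.g. 'DENY' or 'SAMEORIGIN', or None if the header is not sent.
print(x.headers.get('X-Frame-Options'))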

###

'Strict-Transport-Security': 'max-age=31536000'

The HTTP Strict-Transport-Security response header (often abbreviated as HSTS) lets a web site tell browsers that it should only be accessed using HTTPS, instead of using HTTP.

max-age=<expire-time>
    The time, in seconds, that the browser should remember that a site is only to be accessed using HTTPS.

###

'X-UA-Compatible': 'IE=EmulateIE7'

Ref: docs.microsoft.com

Web developers can also specify a document mode by including instructions in a meta element or HTTP response header:

    Webpages that include a meta element (see [HTML5:2014]) with an http-equiv value of X-UA-Compatible.

    Webpages that are served with an HTTP header named "X-UA-Compatible".


IE=EmulateIE7:
    IE7 mode (if a valid <!DOCTYPE> declaration is present)
    Quirks Mode (otherwise)

###

'X-Content-Type-Options': 'nosniff'

Ref: developer.mozilla.org

The X-Content-Type-Options response HTTP header is a marker used by the server to indicate that the MIME types advertised in the Content-Type headers should be followed and not be changed. This is a way to opt out of MIME type sniffing, or, in other words, to say that the MIME types are deliberately configured.

This header was introduced by Microsoft in IE 8 as a way for webmasters to block content sniffing that was happening and could transform non-executable MIME types into executable MIME types. Since then, other browsers have introduced it, even if their MIME sniffing algorithms were less aggressive.

Starting with Firefox 72, the opting out of MIME sniffing is also applied to top-level documents if a Content-type is provided. This can cause HTML web pages to be downloaded instead of being rendered when they are served with a MIME type other than text/html. Make sure to set both headers correctly.

Site security testers usually expect this header to be set.

X-Content-Type-Options: nosniff

nosniff
    Blocks a request if the request destination is of type:

        "style" and the MIME type is not text/css, or
        "script" and the MIME type is not a JavaScript MIME type

    Enables Cross-Origin Read Blocking (CORB) protection for the MIME-types:

        text/html
        text/plain
        text/json, application/json or any other type with a JSON extension: */*+json
        text/xml, application/xml or any other type with an XML extension: */*+xml (excluding image/svg+xml)

###


'X-XSS-Protection': '1; mode=block'

Ref: developer.mozilla.org

The HTTP X-XSS-Protection response header is a feature of Internet Explorer, Chrome and Safari that stops pages from loading when they detect reflected cross-site scripting (XSS) attacks. Although these protections are largely unnecessary in modern browsers when sites implement a strong Content-Security-Policy that disables the use of inline JavaScript ('unsafe-inline'), they can still provide protections for users of older web browsers that don't yet support CSP.

X-XSS-Protection: 0
X-XSS-Protection: 1
X-XSS-Protection: 1; mode=block
X-XSS-Protection: 1; report=<reporting-uri>


1; mode=block
    Enables XSS filtering. Rather than sanitizing the page, the browser will prevent rendering of the page if an attack is detected.

###

Date

The Date general HTTP header contains the date and time at which the message was originated.

Ref: developer.mozilla.org

fetch('https://httpbin.org/get', {
    'headers': {
        'Date': (new Date()).toUTCString()
    }
})

Header type: General header
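
On the Python side, the Date value from a response can be parsed into a datetime object with the standard library; a minimal sketch:

import requests
from email.utils import parsedate_to_datetime

x = requests.get('http://survival8.blogspot.com/')
# Converts e.g. 'Tue, 16 Feb 2021 10:13:29 GMT' into a timezone-aware datetime.
print(parsedate_to_datetime(x.headers['Date']))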

Tags: Technology, Web Scraping, Web Development

Monday, September 14, 2020

Starting With Selenium's Python Package (Installation)



We have a YAML file to set up our conda environment. The file 'selenium.yml' has the contents:

name: selenium
channels:
  - conda-forge
  - defaults
dependencies:
  - selenium
  - jupyterlab
  - ipykernel

To set up the environment, we run the commands:

(base) CMD> conda env create -f selenium.yml
(selenium) CMD> conda activate selenium

After that, if we want to see which packages got installed, we run the command:

(selenium) CMD> conda env export

Next, we set up a kernel from this environment:

(selenium) CMD> python -m ipykernel install --user --name selenium
Installed kernelspec selenium in C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\selenium

To view the list of kernels:

(selenium) CMD> jupyter kernelspec list
Available kernels:
  selenium    C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\selenium
  python3     E:\programfiles\Anaconda3\envs\selenium\share\jupyter\kernels\python3
  ...

A basic piece of code would start the browser. We have tried and tested it for Chrome and Firefox. To do this, we need the WebDriver file, or we get the following exception:

CODE:

from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

ERROR:

----------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
E:\programfiles\Anaconda3\envs\selenium\lib\site-packages\selenium\webdriver\common\service.py in start(self)
     71             cmd.extend(self.command_line_args())
---> 72             self.process = subprocess.Popen(cmd, env=self.env,
     73                                             close_fds=platform.system() != 'Windows',

E:\programfiles\Anaconda3\envs\selenium\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    853
--> 854             self._execute_child(args, executable, preexec_fn, close_fds,
    855                                 pass_fds, cwd, env,

E:\programfiles\Anaconda3\envs\selenium\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
   1306         try:
-> 1307             hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
   1308                                                      # no special security

FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

WebDriverException                        Traceback (most recent call last)
...
WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

We got the file from here: chromedriver.storage.googleapis.com (for v86)
chromedriver_win32.zip ---> chromedriver.exe

Error for WebDriver and browser version mismatch:

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 86
Current browser version is 85.0.4183.102 with binary path C:\Program Files (x86)\Google\Chrome\Application\chrome.exe

Download from here for Chrome v85: chromedriver.storage.googleapis.com (for v85)

One point to note about ChromeDriver as of September 2020: ChromeDriver only supports characters in the BMP (Basic Multilingual Plane). This is a known issue with the Chromium team, as ChromeDriver still doesn't support characters with a Unicode code point above FFFF. Hence it is impossible to send any character beyond FFFF via ChromeDriver. As a result, any attempt to send SMP (Supplementary Multilingual Plane) characters (e.g. CJK, emojis, symbols, etc.) raises an error. Firefox, in contrast, supports emojis sent via the 'send_keys()' method. As of Unicode 13.0, the SMP comprises 134 blocks, starting (in the 'Archaic Greek and Other Left-to-right scripts' group) with: Linear B Syllabary (10000–1007F), Linear B Ideograms (10080–100FF).

~ ~ ~ ~ ~

If you are working with the Firefox browser, you need the Gecko WebDriver available on the Windows 'PATH' variable. Without the WebDriver file:

FileNotFoundError: [WinError 2] The system cannot find the file specified
WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Download the Gecko driver from here: GitHub Repo of Mozilla

The statement to launch the web browser will be:

driver = webdriver.Firefox()

By default, browsers open in a partial-size window. To maximize the window:

driver.maximize_window()

Now, we open a link:

driver.get("http://survival8.blogspot.com/")
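
Putting the pieces above together, a minimal end-to-end sketch (assuming the WebDriver executable matching your browser version is already on the PATH):

from selenium import webdriver

# Use webdriver.Firefox() with geckodriver instead, if preferred.
driver = webdriver.Chrome()

# Browsers open in a partial-size window by default, so maximize it.
driver.maximize_window()

# Open a page, print its title and close the browser.
driver.get("http://survival8.blogspot.com/")
print(driver.title)
driver.quit()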

Friday, September 4, 2020

Requests.get method, cleaning html and writing output to text file


Setup

(base) C:\Users\Ashish Jain>conda env list
# conda environments:
#
base        *  E:\programfiles\Anaconda3
env_py_36      E:\programfiles\Anaconda3\envs\env_py_36
temp           E:\programfiles\Anaconda3\envs\temp
tf             E:\programfiles\Anaconda3\envs\tf

(base) C:\Users\Ashish Jain>conda create -n temp202009 python=3.8
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: E:\programfiles\Anaconda3\envs\temp202009

  added / updated specs:
    - python=3.8

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2020.7.22  |                0         164 KB
    python-3.8.5               |       h5fd99cc_1        18.7 MB
    sqlite-3.33.0              |       h2a8f88b_0         1.3 MB
    wheel-0.35.1               |             py_0          36 KB
    ------------------------------------------------------------
                                           Total:        20.2 MB

The following NEW packages will be INSTALLED:

  ca-certificates    pkgs/main/win-64::ca-certificates-2020.7.22-0
  certifi            pkgs/main/win-64::certifi-2020.6.20-py38_0
  openssl            pkgs/main/win-64::openssl-1.1.1g-he774522_1
  pip                pkgs/main/win-64::pip-20.2.2-py38_0
  python             pkgs/main/win-64::python-3.8.5-h5fd99cc_1
  setuptools         pkgs/main/win-64::setuptools-49.6.0-py38_0
  sqlite             pkgs/main/win-64::sqlite-3.33.0-h2a8f88b_0
  vc                 pkgs/main/win-64::vc-14.1-h0510ff6_4
  vs2015_runtime     pkgs/main/win-64::vs2015_runtime-14.16.27012-hf0eaf9b_3
  wheel              pkgs/main/noarch::wheel-0.35.1-py_0
  wincertstore       pkgs/main/win-64::wincertstore-0.2-py38_0
  zlib               pkgs/main/win-64::zlib-1.2.11-h62dcd97_4

Proceed ([y]/n)? y

Downloading and Extracting Packages
wheel-0.35.1         | 36 KB   | ##################################### | 100%
sqlite-3.33.0        | 1.3 MB  | ##################################### | 100%
ca-certificates-2020 | 164 KB  | ##################################### | 100%
python-3.8.5         | 18.7 MB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate temp202009
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) C:\Users\Ashish Jain>conda activate temp202009

(temp202009) C:\Users\Ashish Jain>pip install ipykernel jupyter jupyterlab
Collecting ipykernel
Collecting jupyter
Collecting jupyterlab
...
Building wheels for collected packages: pandocfilters, pyrsistent
  Building wheel for pandocfilters (setup.py) ... done
  Created wheel for pandocfilters: filename=pandocfilters-1.4.2-py3-none-any.whl size=7861 sha256=eaf50b551ad8291621c8a87234dca80f07b0e9b1603ec8ad7179740f988b4dec
  Stored in directory: c:\users\ashish jain\appdata\local\pip\cache\wheels\f6\08\65\e4636b703d0e870cd62692dafd6b47db27287fe80cea433722
  Building wheel for pyrsistent (setup.py) ... done
  Created wheel for pyrsistent: filename=pyrsistent-0.16.0-cp38-cp38-win_amd64.whl size=71143 sha256=1f0233569beedcff74c358bd0666684c2a0f2d74b56fbdea893711c2f1a761f8
  Stored in directory: c:\users\ashish jain\appdata\local\pip\cache\wheels\17\be\0f\727fb20889ada6aaaaba861f5f0eb21663533915429ad43f28
Successfully built pandocfilters pyrsistent
Installing collected packages: tornado, ipython-genutils, traitlets, pyzmq, six, python-dateutil, pywin32, jupyter-core, jupyter-client, colorama, parso, jedi, pygments, backcall, wcwidth, prompt-toolkit, decorator, pickleshare, ipython, ipykernel, jupyter-console, qtpy, qtconsole, MarkupSafe, jinja2, attrs, pyrsistent, jsonschema, nbformat, mistune, pyparsing, packaging, webencodings, bleach, pandocfilters, entrypoints, testpath, defusedxml, nbconvert, pywinpty, terminado, prometheus-client, Send2Trash, pycparser, cffi, argon2-cffi, notebook, widgetsnbextension, ipywidgets, jupyter, json5, urllib3, chardet, idna, requests, jupyterlab-server, jupyterlab
Successfully installed MarkupSafe-1.1.1 Send2Trash-1.5.0 argon2-cffi-20.1.0 attrs-20.1.0 backcall-0.2.0 bleach-3.1.5 cffi-1.14.2 chardet-3.0.4 colorama-0.4.3 decorator-4.4.2 defusedxml-0.6.0 entrypoints-0.3 idna-2.10 ipykernel-5.3.4 ipython-7.18.1 ipython-genutils-0.2.0 ipywidgets-7.5.1 jedi-0.17.2 jinja2-2.11.2 json5-0.9.5 jsonschema-3.2.0 jupyter-1.0.0 jupyter-client-6.1.7 jupyter-console-6.2.0 jupyter-core-4.6.3 jupyterlab-2.2.6 jupyterlab-server-1.2.0 mistune-0.8.4 nbconvert-5.6.1 nbformat-5.0.7 notebook-6.1.3 packaging-20.4 pandocfilters-1.4.2 parso-0.7.1 pickleshare-0.7.5 prometheus-client-0.8.0 prompt-toolkit-3.0.7 pycparser-2.20 pygments-2.6.1 pyparsing-2.4.7 pyrsistent-0.16.0 python-dateutil-2.8.1 pywin32-228 pywinpty-0.5.7 pyzmq-19.0.2 qtconsole-4.7.7 qtpy-1.9.0 requests-2.24.0 six-1.15.0 terminado-0.8.3 testpath-0.4.4 tornado-6.0.4 traitlets-5.0.3 urllib3-1.25.10 wcwidth-0.2.5 webencodings-0.5.1 widgetsnbextension-3.5.1

(temp202009) C:\Users\Ashish Jain>python -m ipykernel install --user --name temp202009
Installed kernelspec temp202009 in C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\temp202009

=== === === ===

ERROR: ImportError: DLL load failed while importing win32api: The specified module could not be found.

(temp202009) E:\>conda install pywin32

=== === === ===

(temp202009) E:\>pip install htmllaundry

(temp202009) E:\>pip install html-sanitizer
Collecting html-sanitizer
Collecting beautifulsoup4
Collecting soupsieve>1.2
  Downloading soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4, html-sanitizer
Successfully installed beautifulsoup4-4.9.1 html-sanitizer-1.9.1 soupsieve-2.0.1

Issues faced with pulling an article using the "newsapi" and "newspaper" packages:

#1 Exception occurred for: <newspaper.article.Article object at 0x00000248F12896D8> and 2020-08-08T16:55:21Z
Article `download()` failed with 503 Server Error: Service Unavailable for url: https://www.marketwatch.com/story/profit-up-87-at-buffetts-berkshire-but-coronavirus-slows-businesses-2020-08-08 on URL https://www.marketwatch.com/story/profit-up-87-at-buffetts-berkshire-but-coronavirus-slows-businesses-2020-08-08

#2 Exception occurred for: <newspaper.article.Article object at 0x00000248F1297B70> and 2020-08-11T22:59:42Z
Article `download()` failed with 403 Client Error: Forbidden for url: https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues on URL https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues

#3 Exception occurred for: <newspaper.article.Article object at 0x00000248F12AC550> and 2020-08-11T16:17:55Z
Article `download()` failed with HTTPSConnectionPool(host='www.freerepublic.com', port=443): Max retries exceeded with url: /focus/f-news/3873373/posts (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])"))) on URL https://www.freerepublic.com/focus/f-news/3873373/posts

Trying a fix using the Python shell:

(base) C:\Users\Ashish Jain>python
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues')
<Response [200]>
>>> requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text
'<!DOCTYPE html><html itemscope="" itemtype="https://schema.org/WebPage" lang="en">...
>>> with open('html.txt', 'w') as f:
...     f.write(requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "E:\programfiles\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 13665: character maps to <undefined>
>>> with open('html.txt', 'w', encoding="utf-8") as f:
...     f.write(requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text)
...
636685

Now we have the HTML. Next, we clean it to remove HTML tags.

Using htmllaundry

from htmllaundry import sanitize

!pip show htmllaundry
Name: htmllaundry
Version: 2.2
Summary: Simple HTML cleanup utilities
Home-page: UNKNOWN
Author: Wichert Akkerman
Author-email: wichert@wiggy.net
License: BSD
Location: e:\programfiles\anaconda3\envs\temp202009\lib\site-packages
Requires: lxml, six
Required-by:

sanitize(r.text)
'<p>\n\n\n \n \n Access to this page has been denied.\n \n \n \n\n\n\n \n \n To continue, please prove you are not a robot\n \n \n \n \n \n \n </p><p>\n To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser.<br/>\n Is this happening to you frequently? Please <a href="https://seekingalpha.userecho.com?source=captcha" rel="nofollow">report it on our feedback forum</a>.\n </p>\n <p>\n If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh.\n </p>\n <p>Reference ID: </p>\n \n \n \n\n\n\n\n\n\n\n'

from htmllaundry import strip_markup

cleantext = strip_markup(sanitize(r.text)).strip()
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
print(cleantext)

'Access to this page has been denied. To continue, please prove you are not a robot To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser. Is this happening to you frequently? Please report it on our feedback forum. If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. Reference ID:'

Using html_sanitizer

from html_sanitizer import Sanitizer

!pip show html_sanitizer
Name: html-sanitizer
Version: 1.9.1
Summary: HTML sanitizer
Home-page: https://github.com/matthiask/html-sanitizer/
Author: Matthias Kestenholz
Author-email: mk@feinheit.ch
License: BSD License
Location: e:\programfiles\anaconda3\envs\temp202009\lib\site-packages
Requires: beautifulsoup4, lxml
Required-by:

sanitizer = Sanitizer()
cleantext = sanitizer.sanitize(r.text).strip()
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
print(cleantext)

'Access to this page has been denied. <h1>To continue, please prove you are not a robot</h1> <p> To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser.<br> Is this happening to you frequently? Please <a href="https://seekingalpha.userecho.com?source=captcha">report it on our feedback forum</a>. </p> <p> If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. </p> <p>Reference ID: </p>'

Using beautifulsoup4

import re
from bs4 import BeautifulSoup

cleantext = BeautifulSoup(r.text, "lxml").text
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
cleantext.strip()

'Access to this page has been denied. To continue, please prove you are not a robot To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser. Is this happening to you frequently? Please report it on our feedback forum. If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. Reference ID:'
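
Putting this together, a minimal self-contained sketch of the approach that worked here: requests to fetch the page, BeautifulSoup to strip tags, and an explicit UTF-8 encoding to avoid the UnicodeEncodeError seen above (the URL is only an example):

import re
import requests
from bs4 import BeautifulSoup

url = 'http://survival8.blogspot.com/'
r = requests.get(url)

# Strip HTML tags and collapse whitespace.
cleantext = BeautifulSoup(r.text, "lxml").text
cleantext = re.sub(r"\s+", " ", cleantext).strip()

# Write with an explicit encoding so Windows' default cp1252 codec is not used.
with open('clean_text.txt', 'w', encoding='utf-8') as f:
    f.write(cleantext)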