
Tuesday, February 23, 2021

HTTP Error Codes and REST APIs



HTTP response status codes indicate whether a specific HTTP request has been successfully completed. Responses are grouped in five classes:

1. Informational responses (100–199)
2. Successful responses (200–299)
3. Redirects (300–399)
4. Client errors (400–499)
5. Server errors (500–599)

If you receive a response that is not in this list, it is a non-standard response, possibly custom to the server's software.
Ref: developer.mozilla.org

All HTTP response status codes are separated into five classes or categories. The first digit of the status code defines the class of response, while the last two digits do not have any classifying or categorization role. There are five classes defined by the standard:

1xx informational response – the request was received, continuing process
2xx successful – the request was successfully received, understood, and accepted
3xx redirection – further action needs to be taken in order to complete the request
4xx client error – the request contains bad syntax or cannot be fulfilled
5xx server error – the server failed to fulfil an apparently valid request

Ref: en.wikipedia.org

Some common error codes one must know:

401 Unauthorized
Although the HTTP standard specifies "unauthorized", semantically this response means "unauthenticated". That is, the client must authenticate itself to get the requested response.

403 Forbidden
The client does not have access rights to the content; that is, it is unauthorized, so the server is refusing to give the requested resource. Unlike 401, the client's identity is known to the server.

405 Method Not Allowed
The request method is known by the server but has been disabled and cannot be used. For example, an API may forbid DELETE-ing a resource. The two mandatory methods, GET and HEAD, must never be disabled and should not return this error code.

415 Unsupported Media Type
The media format of the requested data is not supported by the server, so the server is rejecting the request.
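
As a quick illustration (not part of the referenced documentation), the class of a status code can be checked programmatically from its first digit; here is a minimal sketch using the requests library, with the URL as an example only:

import requests

# Map the first digit of the status code to its class.
CLASSES = {1: "Informational", 2: "Successful", 3: "Redirection",
           4: "Client error", 5: "Server error"}

response = requests.get('http://survival8.blogspot.com/')
print(response.status_code, CLASSES.get(response.status_code // 100, "Non-standard"))

# requests can also raise an exception for 4xx/5xx responses:
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as error:
    print("Request failed:", error)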

RESTful API Response Codes (used by Amazon Drive API)

The HTTP status codes used by the RESTful Amazon Drive API:

HTTP Status Code --- Description
200 OK --- Successful.
201 Created --- Created. Status code '201' is important for REST APIs that are performing some action such as raising a ticket or logging something.
400 Bad Request --- Bad input parameter. The error message should indicate which one and why.
401 Unauthorized --- The client passed an invalid auth token. The client should refresh the token and then try again.
403 Forbidden --- Any of the following:
    * Customer doesn't exist.
    * Application not registered.
    * Application tried to access properties that do not belong to the app.
    * Application tried to trash/purge the root node.
    * Application tried to update contentProperties.
    * Operation is blocked (for third-party apps).
    * Customer account is over quota.
404 Not Found --- Resource not found.
405 Method Not Allowed --- The resource doesn't support the specified HTTP verb.
409 Conflict --- Conflict.
411 Length Required --- The Content-Length header was not specified.
412 Precondition Failed --- Precondition failed.
429 Too Many Requests --- Too many requests (rate limiting).
500 Internal Server Error --- Servers are not working as expected. The request is probably valid but needs to be requested again later.
503 Service Unavailable --- Service unavailable.

Ref: developer.amazon.com (Dated: 24 Feb 2021)

Additional Notes

In Europe, the NotFound project, created by multiple European organizations including Missing Children Europe and Child Focus, encourages site operators to add a snippet of code to serve customized 404 error pages which provide data about missing children.
Ref: HTTP 404
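
To make the 401 row concrete, here is a minimal, hypothetical sketch of the refresh-and-retry pattern it describes; refresh_access_token() and the URL are placeholders, not part of the Amazon Drive API documentation quoted above:

import requests

def refresh_access_token():
    # Placeholder: obtain a new auth token from your authorization server.
    raise NotImplementedError

def get_with_retry(url, token):
    # First attempt with the current token.
    response = requests.get(url, headers={"Authorization": "Bearer " + token})
    if response.status_code == 401:
        # 401 Unauthorized: the token is invalid or expired, so refresh and retry once.
        token = refresh_access_token()
        response = requests.get(url, headers={"Authorization": "Bearer " + token})
    return response
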
Tags: Technology, Web Development, Web Scraping

Tuesday, February 16, 2021

Getting a Web Server's Response Header Using Python



Here is a Python script that gets the response headers for a website:

from datetime import datetime
import requests

url = 'http://survival8.blogspot.com/'

# Fetch the page.
x = requests.get(url)

# Print the response headers sent back by the server.
print(x.headers)

curr_time = datetime.now()

# We also write our main HTML output to a timestamped log file.
# ':' is not allowed in Windows file names, and an explicit UTF-8 encoding avoids encoding errors.
with open("s8_" + str(curr_time).replace(":", "_") + ".log", mode='w', encoding='utf-8') as f:
    f.write(x.text)

The output of this code looks as shown below:

(base) ~/Desktop$ python response_header_info.py 

{'Content-Type': 'text/html; charset=UTF-8', 'Expires': 'Tue, 16 Feb 2021 10:13:29 GMT', 'Date': 'Tue, 16 Feb 2021 10:13:29 GMT', 'Cache-Control': 'private, max-age=0', 'Last-Modified': 'Tue, 16 Feb 2021 08:54:25 GMT', 'ETag': 'W/"047a2cb250a2ad10a53227bf4085727f97833f5235788c95f99a149e4d1afa68"', 'Content-Encoding': 'gzip', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '135818', 'Server': 'GSE'} 
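
Individual headers can also be read from the same response object; requests exposes them as a case-insensitive dictionary. A minimal sketch:

import requests

x = requests.get('http://survival8.blogspot.com/')

print(x.headers['Content-Type'])           # e.g. 'text/html; charset=UTF-8'
print(x.headers.get('server'))             # lookups are case-insensitive
print(x.headers.get('X-Frame-Options'))    # None if the header is absent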

Next, we discuss some important response headers:

1: Response header
Ref: developer.mozilla.org

A response header is an HTTP header that can be used in an HTTP response and that doesn't relate to the content of the message. Response headers, like Age, Location or Server are used to give a more detailed context of the response.

Not all headers appearing in a response are categorized as response headers by the specification. For example, the Content-Length header is a representation metadata header indicating the size of the body of the response message (it was categorized as an entity header in older versions of the specification). However, "conversationally" all headers are usually referred to as response headers in a response message.

The following shows a few response headers after a GET request. Note that, strictly speaking, the Content-Encoding and Content-Type headers are entity headers:

200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Date: Mon, 18 Jul 2016 16:06:00 GMT
Etag: "c561c68d0ba92bbeb8b0f612a9199f722e3a621a"
Keep-Alive: timeout=5, max=997
Last-Modified: Mon, 18 Jul 2016 02:36:04 GMT
Server: Apache
Set-Cookie: mykey=myvalue; expires=Mon, 17-Jul-2017 16:06:00 GMT; Max-Age=31449600; Path=/; secure
Transfer-Encoding: chunked
Vary: Cookie, Accept-Encoding
X-Backend-Server: developer2.webapp.scl3.mozilla.com
X-Cache-Info: not cacheable; meta data too large
X-kuma-revision: 1085259
x-frame-options: DENY

###

2: 'Cache-Control': 'private'

Ref: developer.mozilla.org

Cacheability

Directives that define whether a response/request can be cached, where it may be cached, and whether it must be validated with the origin server before caching.

public
    The response may be stored by any cache, even if the response is normally non-cacheable.

private
    The response may be stored only by a browser's cache, even if the response is normally non-cacheable. If you mean to not store the response in any cache, use no-store instead. This directive is not effective in preventing caches from storing your response.

no-cache
    The response may be stored by any cache, even if the response is normally non-cacheable. However, the stored response MUST always go through validation with the origin server first before using it; therefore, you cannot use no-cache in conjunction with immutable. If you mean to not store the response in any cache, use no-store instead. This directive is not effective in preventing caches from storing your response.

no-store
    The response may not be stored in any cache. Note that this will not prevent a valid pre-existing cached response being returned. Clients can set max-age=0 to also clear existing cache responses, as this forces the cache to revalidate with the server (no other directives have an effect when used with no-store). 
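
As a small illustration (not part of the referenced documentation), the Cache-Control value returned by requests can be split into its individual directives:

import requests

x = requests.get('http://survival8.blogspot.com/')

# Split e.g. 'private, max-age=0' into ['private', 'max-age=0'].
cache_control = x.headers.get('Cache-Control', '')
directives = [d.strip() for d in cache_control.split(',') if d.strip()]
print(directives)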

###

'Transfer-Encoding': 'chunked'

The Transfer-Encoding header specifies the form of encoding used to safely transfer the payload body to the user.

chunked
    Data is sent in a series of chunks. The Content-Length header is omitted in this case and at the beginning of each chunk you need to add the length of the current chunk in hexadecimal format, followed by '\r\n' and then the chunk itself, followed by another '\r\n'. The terminating chunk is a regular chunk, with the exception that its length is zero. It is followed by the trailer, which consists of a (possibly empty) sequence of entity header fields.
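
To make the chunk format concrete, here is a small illustrative sketch (a made-up two-chunk body carrying the text "Hello, world!"); it only shows what a chunked payload looks like on the wire:

# Each chunk is: <length in hex>\r\n<chunk data>\r\n, terminated by a zero-length chunk.
chunked_body = (
    b"7\r\n"        # length of the first chunk in hexadecimal (7 bytes)
    b"Hello, \r\n"
    b"6\r\n"        # length of the second chunk (6 bytes)
    b"world!\r\n"
    b"0\r\n"        # terminating chunk of length zero
    b"\r\n"         # final CRLF after an (empty) trailer
)
print(chunked_body.decode())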

### 

'Content-Type': 'application/json; charset=utf-8'

Content-type: application/json; charset=utf-8 designates the content to be in JSON format, encoded in the UTF-8 character encoding.
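
When a response carries this header, requests can decode the body directly; a minimal sketch using the httpbin.org test service (assuming it is reachable):

import requests

x = requests.get('https://httpbin.org/get')
print(x.headers['Content-Type'])   # 'application/json'
print(x.json())                    # body parsed into a Python dict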

### 

'Server': 'Private Server'

The Server header describes the software used by the origin server that handled the request — that is, the server that generated the response.

Examples: 
  Server: Apache/2.4.1 (Unix)

Ref: developer.mozilla.org

### 

'jsonerror': 'true'

No documentation was found for this header; it appears to be a non-standard, application-specific header.

###

'X-Frame-Options': 'SAMEORIGIN'

The X-Frame-Options HTTP response header can be used to indicate whether or not a browser should be allowed to render a page in a <frame>, <iframe>, <embed> or <object>. Sites can use this to avoid click-jacking attacks, by ensuring that their content is not embedded into other sites.

The added security is provided only if the user accessing the document is using a browser that supports X-Frame-Options.

There are two possible directives for X-Frame-Options:

X-Frame-Options: DENY
X-Frame-Options: SAMEORIGIN


SAMEORIGIN
    The page can only be displayed in a frame on the same origin as the page itself. The spec leaves it up to browser vendors to decide whether this option applies to the top level, the parent, or the whole chain, although it is argued that the option is not very useful unless all ancestors are also in the same origin (see bug 725490). Also see Browser compatibility for support details.

Ref: developer.mozilla.org
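
A quick way to see which of these directives a site sends is to inspect its response headers; a minimal sketch (the URL is only an example):

import requests

x = requests.get('https://developer.mozilla.org/')
# Prints e.g. 'DENY' or 'SAMEORIGIN', or None if the header is not sent.
print(x.headers.get('X-Frame-Options'))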

###

'Strict-Transport-Security': 'max-age=31536000'

The HTTP Strict-Transport-Security response header (often abbreviated as HSTS) lets a web site tell browsers that it should only be accessed using HTTPS, instead of using HTTP.

max-age=<expire-time>
    The time, in seconds, that the browser should remember that a site is only to be accessed using HTTPS.

###

'X-UA-Compatible': 'IE=EmulateIE7'

Ref: docs.microsoft.com

Web developers can also specify a document mode by including instructions in a meta element or HTTP response header:

    Webpages that include a meta element (see [HTML5:2014]) with an http-equiv value of X-UA-Compatible.

    Webpages that are served with an HTTP header named "X-UA-Compatible".


IE=EmulateIE7:
    IE7 mode (if a valid <!DOCTYPE> declaration is present)
    Quirks Mode (otherwise)

###

'X-Content-Type-Options': 'nosniff'

Ref: developer.mozilla.org

The X-Content-Type-Options response HTTP header is a marker used by the server to indicate that the MIME types advertised in the Content-Type headers should be followed and not be changed. This is a way to opt out of MIME type sniffing, or, in other words, to say that the MIME types are deliberately configured.

This header was introduced by Microsoft in IE 8 as a way for webmasters to block content sniffing that was happening and could transform non-executable MIME types into executable MIME types. Since then, other browsers have introduced it, even if their MIME sniffing algorithms were less aggressive.

Starting with Firefox 72, the opting out of MIME sniffing is also applied to top-level documents if a Content-type is provided. This can cause HTML web pages to be downloaded instead of being rendered when they are served with a MIME type other than text/html. Make sure to set both headers correctly.

Site security testers usually expect this header to be set.

X-Content-Type-Options: nosniff

nosniff
    Blocks a request if the request destination is of type:

        "style" and the MIME type is not text/css, or
        "script" and the MIME type is not a JavaScript MIME type

    Enables Cross-Origin Read Blocking (CORB) protection for the MIME-types:

        text/html
        text/plain
        text/json, application/json or any other type with a JSON extension: */*+json
        text/xml, application/xml or any other type with an XML extension: */*+xml (excluding image/svg+xml)

###


'X-XSS-Protection': '1; mode=block'

Ref: developer.mozilla.org

The HTTP X-XSS-Protection response header is a feature of Internet Explorer, Chrome and Safari that stops pages from loading when they detect reflected cross-site scripting (XSS) attacks. Although these protections are largely unnecessary in modern browsers when sites implement a strong Content-Security-Policy that disables the use of inline JavaScript ('unsafe-inline'), they can still provide protections for users of older web browsers that don't yet support CSP.

X-XSS-Protection: 0
X-XSS-Protection: 1
X-XSS-Protection: 1; mode=block
X-XSS-Protection: 1; report=<reporting-uri>


1; mode=block
    Enables XSS filtering. Rather than sanitizing the page, the browser will prevent rendering of the page if an attack is detected.

###

Date

The Date general HTTP header contains the date and time at which the message was originated.

Ref: developer.mozilla.org

fetch('https://httpbin.org/get', {
    'headers': {
        'Date': (new Date()).toUTCString()
    }
})

Header type: General header
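
On the Python side, the Date value from a response can be parsed into a datetime object with the standard library; a minimal sketch:

import requests
from email.utils import parsedate_to_datetime

x = requests.get('http://survival8.blogspot.com/')
# Converts e.g. 'Tue, 16 Feb 2021 10:13:29 GMT' into a timezone-aware datetime.
print(parsedate_to_datetime(x.headers['Date']))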

Tags: Technology, Web Scraping, Web Development

Monday, September 14, 2020

Starting With Selenium's Python Package (Installation)



We have a YAML file to set up our conda environment. The file 'selenium.yml' has the contents:

name: selenium
channels:
  - conda-forge
  - defaults
dependencies:
  - selenium
  - jupyterlab
  - ipykernel

To set up the environment, we run the commands:

(base) CMD> conda env create -f selenium.yml
(selenium) CMD> conda activate selenium

After that, if we want to see which packages got installed, we run the command:

(selenium) CMD> conda env export

Next, we set up a kernel from this environment:

(selenium) CMD> python -m ipykernel install --user --name selenium
Installed kernelspec selenium in C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\selenium

To view the list of kernels:

(selenium) CMD> jupyter kernelspec list
Available kernels:
  selenium    C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\selenium
  python3     E:\programfiles\Anaconda3\envs\selenium\share\jupyter\kernels\python3
  ...

A basic piece of code would start the browser. We have tried and tested it for Chrome and Firefox. To do this, we need the WebDriver file, or we get the following exception:

CODE:

from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

ERROR:

----------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
E:\programfiles\Anaconda3\envs\selenium\lib\site-packages\selenium\webdriver\common\service.py in start(self)
     71             cmd.extend(self.command_line_args())
---> 72             self.process = subprocess.Popen(cmd, env=self.env,
     73                                             close_fds=platform.system() != 'Windows',

E:\programfiles\Anaconda3\envs\selenium\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    853
--> 854             self._execute_child(args, executable, preexec_fn, close_fds,
    855                                 pass_fds, cwd, env,

E:\programfiles\Anaconda3\envs\selenium\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
   1306         try:
-> 1307             hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
   1308                                                      # no special security

FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

WebDriverException                        Traceback (most recent call last)
...
WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

We got the file from here: chromedriver.storage.googleapis.com (for v86)
chromedriver_win32.zip ---> chromedriver.exe

Error for WebDriver and browser version mismatch:

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 86
Current browser version is 85.0.4183.102 with binary path C:\Program Files (x86)\Google\Chrome\Application\chrome.exe

Download from here for Chrome v85: chromedriver.storage.googleapis.com (for v85)

One point to note about ChromeDriver as of September 2020: ChromeDriver only supports characters in the BMP (Basic Multilingual Plane). This is a known issue with the Chromium team, as ChromeDriver still doesn't support characters with a Unicode code point above FFFF. Hence it is impossible to send any character beyond FFFF via ChromeDriver. As a result, any attempt to send SMP (Supplementary Multilingual Plane) characters (e.g. CJK, emojis, symbols, etc.) raises an error. Firefox, in contrast, supports emojis sent via the 'send_keys()' method. As of Unicode 13.0, the SMP comprises 134 blocks, starting (in the 'Archaic Greek and Other Left-to-right scripts' group) with: Linear B Syllabary (10000–1007F), Linear B Ideograms (10080–100FF).

~ ~ ~ ~ ~

If you are working with the Firefox browser, you need the Gecko WebDriver available on the Windows 'PATH' variable. Without the WebDriver file:

FileNotFoundError: [WinError 2] The system cannot find the file specified
WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Download the Gecko driver from here: GitHub Repo of Mozilla

The statement to launch the web browser will be:

driver = webdriver.Firefox()

By default, browsers open in a partial-size window. To maximize the window:

driver.maximize_window()

Now, we open a link:

driver.get("http://survival8.blogspot.com/")
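
Putting the pieces above together, a minimal end-to-end sketch (assuming the WebDriver executable matching your browser version is already on the PATH):

from selenium import webdriver

# Use webdriver.Firefox() with geckodriver instead, if preferred.
driver = webdriver.Chrome()

# Browsers open in a partial-size window by default, so maximize it.
driver.maximize_window()

# Open a page, print its title and close the browser.
driver.get("http://survival8.blogspot.com/")
print(driver.title)
driver.quit()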

Friday, September 4, 2020

Requests.get method, cleaning html and writing output to text file


Setup

(base) C:\Users\Ashish Jain>conda env list
# conda environments:
#
base        *  E:\programfiles\Anaconda3
env_py_36      E:\programfiles\Anaconda3\envs\env_py_36
temp           E:\programfiles\Anaconda3\envs\temp
tf             E:\programfiles\Anaconda3\envs\tf

(base) C:\Users\Ashish Jain>conda create -n temp202009 python=3.8
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: E:\programfiles\Anaconda3\envs\temp202009

  added / updated specs:
    - python=3.8

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2020.7.22  |                0         164 KB
    python-3.8.5               |       h5fd99cc_1        18.7 MB
    sqlite-3.33.0              |       h2a8f88b_0         1.3 MB
    wheel-0.35.1               |             py_0          36 KB
    ------------------------------------------------------------
                                           Total:        20.2 MB

The following NEW packages will be INSTALLED:

  ca-certificates    pkgs/main/win-64::ca-certificates-2020.7.22-0
  certifi            pkgs/main/win-64::certifi-2020.6.20-py38_0
  openssl            pkgs/main/win-64::openssl-1.1.1g-he774522_1
  pip                pkgs/main/win-64::pip-20.2.2-py38_0
  python             pkgs/main/win-64::python-3.8.5-h5fd99cc_1
  setuptools         pkgs/main/win-64::setuptools-49.6.0-py38_0
  sqlite             pkgs/main/win-64::sqlite-3.33.0-h2a8f88b_0
  vc                 pkgs/main/win-64::vc-14.1-h0510ff6_4
  vs2015_runtime     pkgs/main/win-64::vs2015_runtime-14.16.27012-hf0eaf9b_3
  wheel              pkgs/main/noarch::wheel-0.35.1-py_0
  wincertstore       pkgs/main/win-64::wincertstore-0.2-py38_0
  zlib               pkgs/main/win-64::zlib-1.2.11-h62dcd97_4

Proceed ([y]/n)? y

Downloading and Extracting Packages
wheel-0.35.1         | 36 KB   | ##################################### | 100%
sqlite-3.33.0        | 1.3 MB  | ##################################### | 100%
ca-certificates-2020 | 164 KB  | ##################################### | 100%
python-3.8.5         | 18.7 MB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate temp202009
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) C:\Users\Ashish Jain>conda activate temp202009

(temp202009) C:\Users\Ashish Jain>pip install ipykernel jupyter jupyterlab
Collecting ipykernel
Collecting jupyter
Collecting jupyterlab
...
Building wheels for collected packages: pandocfilters, pyrsistent
  Building wheel for pandocfilters (setup.py) ... done
  Created wheel for pandocfilters: filename=pandocfilters-1.4.2-py3-none-any.whl size=7861 sha256=eaf50b551ad8291621c8a87234dca80f07b0e9b1603ec8ad7179740f988b4dec
  Stored in directory: c:\users\ashish jain\appdata\local\pip\cache\wheels\f6\08\65\e4636b703d0e870cd62692dafd6b47db27287fe80cea433722
  Building wheel for pyrsistent (setup.py) ... done
  Created wheel for pyrsistent: filename=pyrsistent-0.16.0-cp38-cp38-win_amd64.whl size=71143 sha256=1f0233569beedcff74c358bd0666684c2a0f2d74b56fbdea893711c2f1a761f8
  Stored in directory: c:\users\ashish jain\appdata\local\pip\cache\wheels\17\be\0f\727fb20889ada6aaaaba861f5f0eb21663533915429ad43f28
Successfully built pandocfilters pyrsistent
Installing collected packages: tornado, ipython-genutils, traitlets, pyzmq, six, python-dateutil, pywin32, jupyter-core, jupyter-client, colorama, parso, jedi, pygments, backcall, wcwidth, prompt-toolkit, decorator, pickleshare, ipython, ipykernel, jupyter-console, qtpy, qtconsole, MarkupSafe, jinja2, attrs, pyrsistent, jsonschema, nbformat, mistune, pyparsing, packaging, webencodings, bleach, pandocfilters, entrypoints, testpath, defusedxml, nbconvert, pywinpty, terminado, prometheus-client, Send2Trash, pycparser, cffi, argon2-cffi, notebook, widgetsnbextension, ipywidgets, jupyter, json5, urllib3, chardet, idna, requests, jupyterlab-server, jupyterlab
Successfully installed MarkupSafe-1.1.1 Send2Trash-1.5.0 argon2-cffi-20.1.0 attrs-20.1.0 backcall-0.2.0 bleach-3.1.5 cffi-1.14.2 chardet-3.0.4 colorama-0.4.3 decorator-4.4.2 defusedxml-0.6.0 entrypoints-0.3 idna-2.10 ipykernel-5.3.4 ipython-7.18.1 ipython-genutils-0.2.0 ipywidgets-7.5.1 jedi-0.17.2 jinja2-2.11.2 json5-0.9.5 jsonschema-3.2.0 jupyter-1.0.0 jupyter-client-6.1.7 jupyter-console-6.2.0 jupyter-core-4.6.3 jupyterlab-2.2.6 jupyterlab-server-1.2.0 mistune-0.8.4 nbconvert-5.6.1 nbformat-5.0.7 notebook-6.1.3 packaging-20.4 pandocfilters-1.4.2 parso-0.7.1 pickleshare-0.7.5 prometheus-client-0.8.0 prompt-toolkit-3.0.7 pycparser-2.20 pygments-2.6.1 pyparsing-2.4.7 pyrsistent-0.16.0 python-dateutil-2.8.1 pywin32-228 pywinpty-0.5.7 pyzmq-19.0.2 qtconsole-4.7.7 qtpy-1.9.0 requests-2.24.0 six-1.15.0 terminado-0.8.3 testpath-0.4.4 tornado-6.0.4 traitlets-5.0.3 urllib3-1.25.10 wcwidth-0.2.5 webencodings-0.5.1 widgetsnbextension-3.5.1

(temp202009) C:\Users\Ashish Jain>python -m ipykernel install --user --name temp202009
Installed kernelspec temp202009 in C:\Users\Ashish Jain\AppData\Roaming\jupyter\kernels\temp202009

=== === === ===

ERROR: ImportError: DLL load failed while importing win32api: The specified module could not be found.

(temp202009) E:\>conda install pywin32

=== === === ===

(temp202009) E:\>pip install htmllaundry

(temp202009) E:\>pip install html-sanitizer
Collecting html-sanitizer
Collecting beautifulsoup4
Collecting soupsieve>1.2
  Downloading soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4, html-sanitizer
Successfully installed beautifulsoup4-4.9.1 html-sanitizer-1.9.1 soupsieve-2.0.1

Issues faced with pulling an article using the "newsapi" and "newspaper" packages:

#1 Exception occurred for: <newspaper.article.Article object at 0x00000248F12896D8> and 2020-08-08T16:55:21Z
Article `download()` failed with 503 Server Error: Service Unavailable for url: https://www.marketwatch.com/story/profit-up-87-at-buffetts-berkshire-but-coronavirus-slows-businesses-2020-08-08 on URL https://www.marketwatch.com/story/profit-up-87-at-buffetts-berkshire-but-coronavirus-slows-businesses-2020-08-08

#2 Exception occurred for: <newspaper.article.Article object at 0x00000248F1297B70> and 2020-08-11T22:59:42Z
Article `download()` failed with 403 Client Error: Forbidden for url: https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues on URL https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues

#3 Exception occurred for: <newspaper.article.Article object at 0x00000248F12AC550> and 2020-08-11T16:17:55Z
Article `download()` failed with HTTPSConnectionPool(host='www.freerepublic.com', port=443): Max retries exceeded with url: /focus/f-news/3873373/posts (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])"))) on URL https://www.freerepublic.com/focus/f-news/3873373/posts

Trying a fix using the Python shell:

(base) C:\Users\Ashish Jain>python
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues')
<Response [200]>
>>> requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text
'<!DOCTYPE html><html itemscope="" itemtype="https://schema.org/WebPage" lang="en">...
>>> with open('html.txt', 'w') as f:
...     f.write(requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "E:\programfiles\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 13665: character maps to <undefined>
>>> with open('html.txt', 'w', encoding="utf-8") as f:
...     f.write(requests.get('https://seekingalpha.com/article/4367745-greatest-disconnect-stocks-and-economy-continues').text)
...
636685

Now we have the HTML. Next, we clean it to remove HTML tags.

Using htmllaundry

from htmllaundry import sanitize

!pip show htmllaundry
Name: htmllaundry
Version: 2.2
Summary: Simple HTML cleanup utilities
Home-page: UNKNOWN
Author: Wichert Akkerman
Author-email: wichert@wiggy.net
License: BSD
Location: e:\programfiles\anaconda3\envs\temp202009\lib\site-packages
Requires: lxml, six
Required-by:

sanitize(r.text)
'<p>\n\n\n \n \n Access to this page has been denied.\n \n \n \n\n\n\n \n \n To continue, please prove you are not a robot\n \n \n \n \n \n \n </p><p>\n To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser.<br/>\n Is this happening to you frequently? Please <a href="https://seekingalpha.userecho.com?source=captcha" rel="nofollow">report it on our feedback forum</a>.\n </p>\n <p>\n If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh.\n </p>\n <p>Reference ID: </p>\n \n \n \n\n\n\n\n\n\n\n'

from htmllaundry import strip_markup

cleantext = strip_markup(sanitize(r.text)).strip()
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
print(cleantext)

'Access to this page has been denied. To continue, please prove you are not a robot To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser. Is this happening to you frequently? Please report it on our feedback forum. If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. Reference ID:'

Using html_sanitizer

from html_sanitizer import Sanitizer

!pip show html_sanitizer
Name: html-sanitizer
Version: 1.9.1
Summary: HTML sanitizer
Home-page: https://github.com/matthiask/html-sanitizer/
Author: Matthias Kestenholz
Author-email: mk@feinheit.ch
License: BSD License
Location: e:\programfiles\anaconda3\envs\temp202009\lib\site-packages
Requires: beautifulsoup4, lxml
Required-by:

sanitizer = Sanitizer()
cleantext = sanitizer.sanitize(r.text).strip()
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
print(cleantext)

'Access to this page has been denied. <h1>To continue, please prove you are not a robot</h1> <p> To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser.<br> Is this happening to you frequently? Please <a href="https://seekingalpha.userecho.com?source=captcha">report it on our feedback forum</a>. </p> <p> If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. </p> <p>Reference ID: </p>'

Using beautifulsoup4

import re
from bs4 import BeautifulSoup

cleantext = BeautifulSoup(r.text, "lxml").text
cleantext = re.sub(r"(\n)+", " ", cleantext)
cleantext = re.sub(r"\s+", " ", cleantext)
cleantext.strip()

'Access to this page has been denied. To continue, please prove you are not a robot To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser. Is this happening to you frequently? Please report it on our feedback forum. If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh. Reference ID:'
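
Putting this together, a minimal self-contained sketch of the approach that worked here: requests to fetch the page, BeautifulSoup to strip tags, and an explicit UTF-8 encoding to avoid the UnicodeEncodeError seen above (the URL is only an example):

import re
import requests
from bs4 import BeautifulSoup

url = 'http://survival8.blogspot.com/'
r = requests.get(url)

# Strip HTML tags and collapse whitespace.
cleantext = BeautifulSoup(r.text, "lxml").text
cleantext = re.sub(r"\s+", " ", cleantext).strip()

# Write with an explicit encoding so Windows' default cp1252 codec is not used.
with open('clean_text.txt', 'w', encoding='utf-8') as f:
    f.write(cleantext)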