Creating a Flask API for Scrapy and dumping the parsed content into a JSON file


We will create a project as described in the post titled "Getting Started with Scrapy".

It will give us the following directory structure:
(base) D:\exp_44_scrapy\ourfirstscraper>dir /b /s
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper
D:\exp_44_scrapy\ourfirstscraper\scrapy.cfg
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\items.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\middlewares.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\pipelines.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\settings.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__init__.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__pycache__
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\wikipedia.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__init__.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__pycache__
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__pycache__\wikipedia.cpython-37.pyc
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__pycache__\__init__.cpython-37.pyc
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__pycache__\settings.cpython-37.pyc
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__pycache__\__init__.cpython-37.pyc
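
The spider itself is not listed in this post. For reference, a minimal sketch of what "ourfirstscraper\spiders\wikipedia.py" might contain (the actual start URL and selectors come from the Getting Started post, so treat the ones below as assumptions):

import scrapy

class WikipediaSpider(scrapy.Spider):
    name = "wikipedia"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        # Yield one item per link on the page; the feed exporter configured
        # in STEP 1 dumps these items into response_logs.json.
        for href in response.css("a::attr(href)").getall():
            yield {"link": href}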

STEP 1:
We add the following properties to the file "D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\settings.py":
FEED_FORMAT="json"
FEED_URI="D:/exp_44_scrapy/ourfirstscraper/response_logs.json"
FEED_STORAGES={'d': 'scrapy.extensions.feedexport.FileFeedStorage'}

Note:
Without the FEED_STORAGES property, the following error is seen in the logs:
2019-12-26 13:18:37 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: d
This happens because Scrapy parses FEED_URI as a URI and treats the Windows drive letter "D:" as a scheme named "d"; mapping that scheme to FileFeedStorage tells Scrapy to write the feed to a local file.

On Ubuntu 19.10, the FEED_URI is a plain absolute path with no drive letter, so the FEED_STORAGES property is not needed, and we can set the properties as follows:

FEED_FORMAT="json"
FEED_URI="/home/ashish/Desktop/workspace/exp_44_scrapy/ourfirstscraper/response_logs.json"

STEP 2:
We create a file "D:\exp_44_scrapy\ourfirstscraper\client.py" with the following code:

import requests

headers = {'content-type': 'application/json'}

# POST to /helloworld triggers the Scrapy crawl on the server
URL = "http://127.0.0.1:5050/helloworld"
r = requests.post(url = URL, data = {}, headers = headers)
print("Response text: " + r.text)

# GET on the root URL is a simple health check
URL = "http://127.0.0.1:5050"
r = requests.get(url = URL, headers = headers)
print("Response text: " + r.text)

STEP 3: 
We create the file "D:\exp_44_scrapy\ourfirstscraper\server.py" with the following code:

from ourfirstscraper.spiders.wikipedia import WikipediaSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Flask Imports
from flask_cors import CORS, cross_origin
from flask import Flask

app = Flask(__name__)
cors = CORS(app)
app.config['CORS_HEADERS'] = 'Content-Type'

# POST
@app.route("/helloworld", methods = ['POST'])
@cross_origin()
def helloWorld():
    process.crawl(WikipediaSpider)
    # start() runs the Twisted reactor and blocks until the crawl finishes;
    # the reactor cannot be started a second time (see ISSUES / BUGS below).
    process.start()
    return "Exiting helloWorld()"

# GET
@app.route("/")
@cross_origin() # allow all origins all methods.
def hello():
    return "Hello, cross-origin-world!"
    
if __name__ == "__main__":
    # Create the CrawlerProcess once, before Flask starts serving requests.
    process = CrawlerProcess(get_project_settings())
    app.run(host = "0.0.0.0", port = 5050)
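
Note that server.py should be run from the project root (the directory containing scrapy.cfg, here "D:\exp_44_scrapy\ourfirstscraper") so that get_project_settings() can locate the project settings, including the FEED_* properties from STEP 1.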
 
STEP 4:
We start the server:

(base) D:\exp_44_scrapy\ourfirstscraper>python server.py
2019-12-26 14:07:31 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ourfirstscraper)
2019-12-26 14:07:31 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 18.9.0, Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.16299-SP0
 * Serving Flask app "server" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
2019-12-26 14:07:32 [werkzeug] INFO:  * Running on http://0.0.0.0:5050/ (Press CTRL+C to quit)

STEP 5:
We make a request from the client:

(base) D:\exp_44_scrapy\ourfirstscraper>python client.py
Response text: Exiting helloWorld()
Response text: Hello, cross-origin-world!
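
To confirm that the parsed content was dumped, we can read the feed file back (a small sketch; the item fields depend on what the spider yields):

import json

with open("D:/exp_44_scrapy/ourfirstscraper/response_logs.json", encoding = "utf-8") as f:
    items = json.load(f)

print("Number of items scraped:", len(items))
print("First item:", items[0] if items else None)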

ISSUES / BUGS
On hitting the Flask API a second time, the following error occurs:
twisted.internet.error.ReactorNotRestartable

CrawlerProcess.start() starts the Twisted reactor, and Twisted does not allow a stopped reactor to be started again. Hence, the code in this post only serves the first crawl request; every subsequent POST to /helloworld fails.
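
A possible workaround (a sketch only, not verified in this project) is to let the reactor run once in a background thread via the crochet library ("pip install crochet") and use CrawlerRunner instead of CrawlerProcess, so the reactor is started a single time and never restarted. An alternative server.py along these lines:

from crochet import setup, wait_for
setup()  # starts the Twisted reactor in a background thread, once

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from ourfirstscraper.spiders.wikipedia import WikipediaSpider

from flask import Flask
from flask_cors import CORS, cross_origin

app = Flask(__name__)
cors = CORS(app)
app.config['CORS_HEADERS'] = 'Content-Type'

runner = CrawlerRunner(get_project_settings())

@wait_for(timeout = 600.0)
def run_crawl():
    # runner.crawl() returns a Deferred; wait_for runs it on the reactor
    # thread and blocks this call until the crawl finishes.
    return runner.crawl(WikipediaSpider)

@app.route("/helloworld", methods = ['POST'])
@cross_origin()
def helloWorld():
    run_crawl()
    return "Exiting helloWorld()"

@app.route("/")
@cross_origin()
def hello():
    return "Hello, cross-origin-world!"

if __name__ == "__main__":
    app.run(host = "0.0.0.0", port = 5050)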

References:
1. https://dingyuliang.me/scrapy-build-scrapy-flask-rest-api-1/
2. https://dingyuliang.me/scrapy-how-to-build-scrapy-with-flask-rest-api-2/
3. https://docs.scrapy.org/en/latest/topics/practices.html
