Creating a Flask API for Scrapy and dumping the parsed content into a JSON file

We will create a project as shown in the post titled "Getting Started with Scrapy". It gives us the following directory structure:

(base) D:\exp_44_scrapy\ourfirstscraper>dir /b /s
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper
D:\exp_44_scrapy\ourfirstscraper\scrapy.cfg
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\items.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\middlewares.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\pipelines.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\settings.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__init__.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__pycache__
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\wikipedia.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__init__.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__pycache__
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__pycache__\wikipedia.cpython-37.pyc
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__pycache__\__init__.cpython-37.pyc
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__pycache__\settings.cpython-37.pyc
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__pycache__\__init__.cpython-37.pyc

STEP 1: We add these properties to the file "D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\settings.py":

FEED_FORMAT = "json"
FEED_URI = "D:/exp_44_scrapy/ourfirstscraper/response_logs.json"
FEED_STORAGES = {'d': 'scrapy.extensions.feedexport.FileFeedStorage'}

Note: Without the FEED_STORAGES property, the following error is seen in the logs, because Scrapy parses the Windows drive letter "D:" at the start of FEED_URI as a URI scheme:

2019-12-26 13:18:37 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: d

On Ubuntu 19.10, we do not need the FEED_STORAGES property, since an absolute Linux path contains no drive letter. There we can set these properties as follows:

FEED_FORMAT = "json"
FEED_URI = "/home/ashish/Desktop/workspace/exp_44_scrapy/ourfirstscraper/response_logs.json"

STEP 2: We create a file "D:\exp_44_scrapy\ourfirstscraper\client.py" with the following code:

import requests

headers = {'content-type': 'application/json'}

# POST request that triggers the crawl
URL = "http://127.0.0.1:5050/helloworld"
r = requests.post(url = URL, data = {}, headers = headers)
print("Response text: " + r.text)

# GET request to the root endpoint
URL = "http://127.0.0.1:5050"
r = requests.get(url = URL, headers = headers)
print("Response text: " + r.text)

STEP 3: We create the file "D:\exp_44_scrapy\ourfirstscraper\server.py" with the following code:

from ourfirstscraper.spiders.wikipedia import WikipediaSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Flask Imports
from flask_cors import CORS, cross_origin
from flask import Flask

app = Flask(__name__)
cors = CORS(app)
app.config['CORS_HEADERS'] = 'Content-Type'

# POST: run the spider and block until the crawl finishes
@app.route("/helloworld", methods = ['POST'])
@cross_origin()
def helloWorld():
    process.crawl(WikipediaSpider)
    process.start()
    return "Exiting helloWorld()"

# GET
@app.route("/")
@cross_origin() # allow all origins, all methods
def hello():
    return "Hello, cross-origin-world!"

if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())
    app.run(host = "0.0.0.0", port = 5050)
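The spider that server.py imports is not shown in the post. For orientation, here is a minimal sketch of what "D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\wikipedia.py" could look like; the actual parse logic of the post's WikipediaSpider is not shown, so the start URL and the yielded item below are assumptions:

import scrapy

class WikipediaSpider(scrapy.Spider):
    name = "wikipedia"
    allowed_domains = ["en.wikipedia.org"]
    # Assumed start page; the post does not say which URL the spider crawls
    start_urls = ["https://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        # Anything yielded here is exported by the FEED_* settings
        # from STEP 1 into response_logs.json
        yield {"title": response.css("title::text").get()}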
STEP 4: We start the server:

(base) D:\exp_44_scrapy\ourfirstscraper>python server.py
2019-12-26 14:07:31 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ourfirstscraper)
2019-12-26 14:07:31 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 18.9.0, Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.16299-SP0
 * Serving Flask app "server" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
2019-12-26 14:07:32 [werkzeug] INFO:  * Running on http://0.0.0.0:5050/ (Press CTRL+C to quit)

STEP 5: We make a request from the client:

(base) D:\exp_44_scrapy\ourfirstscraper>python client.py
Response text: Exiting helloWorld()
Response text: Hello, cross-origin-world!

ISSUES / BUGS

On hitting the Flask API a second time, the following error occurs:

twisted.internet.error.ReactorNotRestartable

This happens because CrawlerProcess.start() runs the Twisted reactor, and a Twisted reactor can be started only once per process: the first POST stops the reactor when its crawl finishes, so the second POST cannot start it again. Hence, the code in this post does not work beyond the first request.
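One common workaround (not part of the original post) is to launch each crawl in a fresh child process, so that every request gets its own, never-restarted reactor. A minimal sketch of how the /helloworld route in server.py could be rewritten, assuming the rest of the file stays unchanged:

# Additional import at the top of server.py
import multiprocessing

def run_spider():
    # Each call builds a fresh CrawlerProcess in a brand-new child process,
    # so the Twisted reactor is started (and stopped) at most once per process
    process = CrawlerProcess(get_project_settings())
    process.crawl(WikipediaSpider)
    process.start()  # blocks until the crawl finishes

# Replacement for the /helloworld route
@app.route("/helloworld", methods = ['POST'])
@cross_origin()
def helloWorld():
    p = multiprocessing.Process(target = run_spider)
    p.start()
    p.join()  # wait for the crawl to complete before responding
    return "Exiting helloWorld()"

With this change, the shared CrawlerProcess created under if __name__ == "__main__" is no longer needed. Reference 3 below, the Scrapy "Common Practices" page, covers running Scrapy from a script with CrawlerProcess and CrawlerRunner, which is the same constraint at play here.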
References:
1. https://dingyuliang.me/scrapy-build-scrapy-flask-rest-api-1/
2. https://dingyuliang.me/scrapy-how-to-build-scrapy-with-flask-rest-api-2/
3. https://docs.scrapy.org/en/latest/topics/practices.html