Creating a Flask API for Scrapy that creates custom Spiders for every request

We will create a project as shown in the post titled "Getting Started with Scrapy". It gives us the following directory structure:

(base) D:\exp_44_scrapy\ourfirstscraper>dir /b /s
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper
D:\exp_44_scrapy\ourfirstscraper\scrapy.cfg
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\items.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\middlewares.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\pipelines.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\settings.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__init__.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\wikipedia.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__init__.py

STEP 1: We add these properties in the file "D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\settings.py":

FEED_FORMAT = "json"
FEED_URI = "D:/exp_44_scrapy/ourfirstscraper/response_logs.json"
FEED_STORAGES = {'d': 'scrapy.extensions.feedexport.FileFeedStorage'}

Note: Without the FEED_STORAGES property, the following error is seen in the logs, because Scrapy misreads the Windows drive letter "D:" in FEED_URI as a URI scheme:

2019-12-26 13:18:37 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: d

On Ubuntu 19.10, we do not need the FEED_STORAGES property, and we can set these properties as follows:

FEED_FORMAT = "json"
FEED_URI = "/home/ashish/Desktop/workspace/exp_44_scrapy/ourfirstscraper/response_logs.json"

STEP 2: We create the file "D:\exp_44_scrapy\ourfirstscraper\client.py" with the following code:

import requests

headers = {'content-type': 'application/json'}

# POST request asking the server to generate and run a new spider.
URL = "http://127.0.0.1:5050/helloworld"
r = requests.post(url = URL, json = {
    'name': 'xyz',
    'allowed_domains': 'survival8.blogspot.com',
    'start_urls': ['https://survival8.blogspot.com/p/index-of-lessons-in-technology.html']
}, headers = headers)
print("Response text: " + r.text)

# GET request to the root endpoint.
URL = "http://127.0.0.1:5050"
r = requests.get(url = URL, data = {}, headers = headers)
print("Response text: " + r.text)

STEP 3: We create the file "D:\exp_44_scrapy\ourfirstscraper\server.py" with the following code:

import os
import re

# Flask imports
from flask import Flask, request
from flask_cors import CORS, cross_origin

app = Flask(__name__)
cors = CORS(app)
app.config['CORS_HEADERS'] = 'Content-Type'

# POST
@app.route("/helloworld", methods = ['POST'])
@cross_origin()
def helloWorld():
    print("Content-Type: " + request.headers['Content-Type'])
    if request.headers['Content-Type'] == 'text/plain':
        # request.data is bytes in Python 3, so decode it before concatenating.
        return "Request from client is: " + request.data.decode('utf-8')
    elif request.headers['Content-Type'] in ['application/json', 'application/json; charset=utf-8']:
        # For C#: request.headers['Content-Type'] == 'application/json; charset=utf-8'
        # For Python: request.headers['Content-Type'] == 'application/json'
        # For AngularJS: the "Content-Type" header is 'application/json'
        print("request.json: " + str(request.json))
        dirpath = os.getcwd()
        print("current directory is : " + dirpath)
        # OUTPUT: current directory is : D:\exp_44_scrapy\ourfirstscraper
        # The next statement changes the current directory to the path from where the CLI commands will be executed.
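        # Scrapy's "genspider" and "crawl" commands must run from the project
        # root (the directory that contains scrapy.cfg); otherwise Scrapy
        # cannot locate the project settings and the spiders module.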
os.chdir(r"D:/workspace/Jupyter/exp_44_scrapy/ourfirstscraper") # The next line will create a file: D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\request.json['name'].py # If the 'name' is already in use, you get these logs in the server console: # Spider 'wikipedia_spidy' already exists in module: # ourfirstscraper.spiders.wikipedia_spidy os.system('scrapy genspider ' + request.json['name'] + " " + request.json['allowed_domains']) r = re.compile(r"^ *") file_path = dirpath + '/ourfirstscraper/spiders/' + request.json['name'] + '.py' with open(file_path, 'r') as file: # read a list of lines into data data = file.readlines() for i in range(len(data)): if 'start_urls' in data[i]: data[i] = r.findall(data[i])[0] + "start_urls = " + str(request.json['start_urls']) + "\n" if 'pass' in data[i]: data[i] = r.findall(data[i])[0] + "yield {'text': response.text}" + "\n" # And write everything back with open(file_path, 'w') as file: file.writelines(data) os.system('scrapy crawl ' + request.json['name']) return "Exiting helloWorld()" # GET @app.route("/") @cross_origin() # allow all origins all methods. def hello(): return "Hello, cross-origin-world!" if __name__ == "__main__": app.run(host = "0.0.0.0", port = 5050) STEP 4: We start the server: (base) D:\exp_44_scrapy\ourfirstscraper>python server.py 2019-12-26 14:07:31 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ourfirstscraper) 2019-12-26 14:07:31 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 18.9.0, Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.16299-SP0 * Serving Flask app "server" (lazy loading) * Environment: production WARNING: Do not use the development server in a production environment. Use a production WSGI server instead. * Debug mode: off 2019-12-26 14:07:32 [werkzeug] INFO: * Running on http://0.0.0.0:5050/ (Press CTRL+C to quit) STEP 5: We make a request from the client: (base) D:\exp_44_scrapy\ourfirstscraper>python client.py Response text: Exiting helloWorld() Response text: Hello, cross-origin-world! NOTES: The server.py created the following file for the custom Spider made for the last request: D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\xyz.py The program will work even for two consecutive requests having same input data for the server from the client code. 
The first request will produce these logs:

127.0.0.1 - - [29/Dec/2019 22:44:08] "POST /helloworld HTTP/1.1" 200 -
127.0.0.1 - - [29/Dec/2019 22:44:08] "GET / HTTP/1.1" 200 -
Content-Type: application/json
request.json: {'name': 'xyz', 'allowed_domains': 'survival8.blogspot.com', 'start_urls': ['https://survival8.blogspot.com/p/index-of-lessons-in-technology.html']}
current directory is : D:\exp_44_scrapy\ourfirstscraper
Created spider 'xyz' using template 'basic' in module:
  ourfirstscraper.spiders.xyz
2019-12-29 22:44:57 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ourfirstscraper)

The second request will produce logs that look like these:

127.0.0.1 - - [29/Dec/2019 22:45:04] "POST /helloworld HTTP/1.1" 200 -
127.0.0.1 - - [29/Dec/2019 22:45:04] "GET / HTTP/1.1" 200 -
Content-Type: application/json
request.json: {'name': 'xyz', 'allowed_domains': 'survival8.blogspot.com', 'start_urls': ['https://survival8.blogspot.com/p/index-of-lessons-in-technology.html']}
current directory is : D:\exp_44_scrapy\ourfirstscraper
Spider 'xyz' already exists in module:
  ourfirstscraper.spiders.xyz
2019-12-29 22:45:28 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ourfirstscraper)
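When the crawl finishes, the FEED_FORMAT/FEED_URI settings from STEP 1 serialize every yielded item into response_logs.json. Since each spider here yields a single item of the form {'text': response.text}, the file should contain a JSON array with one object whose 'text' value is the raw HTML of the crawled page, roughly like this (contents abbreviated, exact escaping will differ):

[
  {"text": "<!DOCTYPE html>\n<html ...> ... </html>"}
]

Note that Scrapy's file feed storage appends to an existing file, so repeated requests may append a new JSON array to the same file rather than overwrite it.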