Creating a Flask API for Scrapy that creates custom Spiders for every request


We will create a project as shown in the post titled "Getting Started with Scrapy".

It gives us the following directory structure:
(base) D:\exp_44_scrapy\ourfirstscraper>dir /b /s
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper
D:\exp_44_scrapy\ourfirstscraper\scrapy.cfg
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\items.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\middlewares.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\pipelines.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\settings.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\__init__.py

D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\wikipedia.py
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\__init__.py

STEP 1:
We add these properties to the file "D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\settings.py":
FEED_FORMAT="json"
FEED_URI="D:/exp_44_scrapy/ourfirstscraper/response_logs.json"
FEED_STORAGES={'d': 'scrapy.extensions.feedexport.FileFeedStorage'}

Note: 
Without the FEED_STORAGES property, the following error appears in the logs:
2019-12-26 13:18:37 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: d
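
The unknown scheme "d" is just the Windows drive letter: Scrapy parses FEED_URI as a URI, and "D:" looks like a URI scheme. A quick check with Python's urlparse shows this:

from urllib.parse import urlparse
# The Windows drive letter is mistaken for a URI scheme:
print(urlparse("D:/exp_44_scrapy/ourfirstscraper/response_logs.json").scheme)
# Output: d

Mapping the scheme 'd' to FileFeedStorage, as done above, makes Scrapy treat the URI as a plain local file path.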

On Ubuntu 19.10, the FEED_STORAGES property is not needed, since the feed path contains no drive letter. There we can set the properties as follows:

FEED_FORMAT="json"
FEED_URI="/home/ashish/Desktop/workspace/exp_44_scrapy/ourfirstscraper/response_logs.json"

STEP 2:
We create a file "D:\exp_44_scrapy\ourfirstscraper\client.py" with the following code:

import requests

headers = {'content-type': 'application/json'}

# POST: ask the server to generate and run a spider for the given site.
URL = "http://127.0.0.1:5050/helloworld"
r = requests.post(url = URL, json = {
    'name': 'xyz', 
    'allowed_domains': 'survival8.blogspot.com', 
    'start_urls': ['https://survival8.blogspot.com/p/index-of-lessons-in-technology.html']
    }, headers = headers)

print("Response text: " + r.text)

# GET: a simple sanity check against the root route.
URL = "http://127.0.0.1:5050"
r = requests.get(url = URL, headers = headers)
print("Response text: " + r.text)

STEP 3: 
We create the file "D:\exp_44_scrapy\ourfirstscraper\server.py" with the following code:

import re

# Flask Imports
from flask_cors import CORS, cross_origin
from flask import Flask, request
import os
    
app = Flask(__name__)
cors = CORS(app)
app.config['CORS_HEADERS'] = 'Content-Type'

# POST
@app.route("/helloworld", methods = ['POST'])
@cross_origin()
def helloWorld():
    print("Content-Type: " + request.headers['Content-Type'])

    if request.headers['Content-Type'] == 'text/plain':
        # request.data is bytes in Python 3, so decode it before concatenating.
        return "Request from client is: " + request.data.decode('utf-8')
    
    elif request.headers['Content-Type'] in ['application/json', 'application/json; charset=utf-8']:
        # C# clients typically send 'application/json; charset=utf-8';
        # Python (requests) and AngularJS send plain 'application/json'.
        print("request.json: " + str(request.json))
    
    dirpath = os.getcwd()
    print("current directory is : " + dirpath) 
    # OUTPUT: current directory is: D:\exp_44_scrapy\ourfirstscraper

    # Change the current directory to the project root, from where the
    # Scrapy CLI commands below will be executed.
    os.chdir(r"D:/exp_44_scrapy/ourfirstscraper")

    # The next line creates the file:
    # D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\<name>.py
    # If the 'name' is already in use, Scrapy logs this in the server console:
    #   Spider 'wikipedia_spidy' already exists in module:
    #   ourfirstscraper.spiders.wikipedia_spidy
    os.system('scrapy genspider ' + request.json['name'] + " " + request.json['allowed_domains'])

    # Regex that captures a line's leading indentation.
    indent = re.compile(r"^ *")
    file_path = dirpath + '/ourfirstscraper/spiders/' + request.json['name'] + '.py'
    with open(file_path, 'r') as file:
        # Read the generated spider into a list of lines.
        data = file.readlines()

    for i in range(len(data)):
        # Swap the template's start_urls for the ones from the request.
        if 'start_urls' in data[i]:
            data[i] = indent.findall(data[i])[0] + "start_urls = " + str(request.json['start_urls']) + "\n"

        # Replace the template's 'pass' in parse() with a yield of the page text.
        if 'pass' in data[i]:
            data[i] = indent.findall(data[i])[0] + "yield {'text': response.text}" + "\n"
    
    # And write everything back
    with open(file_path, 'w') as file:
        file.writelines(data)
        
    # Run the crawl synchronously; the HTTP response below is not sent
    # until the crawl finishes.
    os.system('scrapy crawl ' + request.json['name'])
    
    return "Exiting helloWorld()"

# GET
@app.route("/")
@cross_origin() # allow all origins all methods.
def hello():
    return "Hello, cross-origin-world!"
    
if __name__ == "__main__":
    app.run(host = "0.0.0.0", port = 5050)
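
A note on the two os.system calls: they splice client-supplied values straight into a shell command. A safer sketch, assuming the same behaviour is wanted, is subprocess.run with an argument list, so nothing passes through the shell:

import subprocess

# Equivalent to the os.system calls above, but without shell interpolation
# of the client-supplied 'name' and 'allowed_domains' values.
subprocess.run(['scrapy', 'genspider', request.json['name'], request.json['allowed_domains']])
subprocess.run(['scrapy', 'crawl', request.json['name']])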

STEP 4:
We start the server:

(base) D:\exp_44_scrapy\ourfirstscraper>python server.py
2019-12-26 14:07:31 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ourfirstscraper)
2019-12-26 14:07:31 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 18.9.0, Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.16299-SP0
    * Serving Flask app "server" (lazy loading)
    * Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
    * Debug mode: off
2019-12-26 14:07:32 [werkzeug] INFO:  * Running on http://0.0.0.0:5050/ (Press CTRL+C to quit)

STEP 5:
We make a request from the client:

(base) D:\exp_44_scrapy\ourfirstscraper>python client.py
Response text: Exiting helloWorld()
Response text: Hello, cross-origin-world!
    
NOTES: 

For the request above, server.py created the following file for the custom spider:
D:\exp_44_scrapy\ourfirstscraper\ourfirstscraper\spiders\xyz.py
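
Assuming Scrapy's stock 'basic' template, the rewritten xyz.py should look roughly like this after the start_urls and pass lines are replaced:

# -*- coding: utf-8 -*-
import scrapy


class XyzSpider(scrapy.Spider):
    name = 'xyz'
    allowed_domains = ['survival8.blogspot.com']
    start_urls = ['https://survival8.blogspot.com/p/index-of-lessons-in-technology.html']

    def parse(self, response):
        yield {'text': response.text}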

The program works even when the client sends two consecutive requests with the same input data: the genspider call just logs that the spider already exists, the rewrite step re-applies the same start_urls to the existing file, and the crawl runs again.

The first request will have these logs:
127.0.0.1 - - [29/Dec/2019 22:44:08] "POST /helloworld HTTP/1.1" 200 -
127.0.0.1 - - [29/Dec/2019 22:44:08] "GET / HTTP/1.1" 200 -
Content-Type: application/json
request.json: {'name': 'xyz', 'allowed_domains': 'survival8.blogspot.com', 'start_urls': ['https://survival8.blogspot.com/p/index-of-lessons-in-technology.html']}
current directory is : D:\exp_44_scrapy\ourfirstscraper
Created spider 'xyz' using template 'basic' in module:
  ourfirstscraper.spiders.xyz
2019-12-29 22:44:57 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ourfirstscraper)

The second request will have logs that look like these:
127.0.0.1 - - [29/Dec/2019 22:45:04] "POST /helloworld HTTP/1.1" 200 -
127.0.0.1 - - [29/Dec/2019 22:45:04] "GET / HTTP/1.1" 200 -
Content-Type: application/json
request.json: {'name': 'xyz', 'allowed_domains': 'survival8.blogspot.com', 'start_urls': ['https://survival8.blogspot.com/p/index-of-lessons-in-technology.html']}
current directory is : D:\exp_44_scrapy\ourfirstscraper
Spider 'xyz' already exists in module:
  ourfirstscraper.spiders.xyz
2019-12-29 22:45:28 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ourfirstscraper)
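
After a single successful crawl, response_logs.json holds the yielded items. A quick sanity check (note that the stock file storage opens the feed in append mode, so repeated crawls concatenate JSON arrays and the file is only valid JSON after one run):

import json

# Load the feed written by the crawl.
with open('D:/exp_44_scrapy/ourfirstscraper/response_logs.json') as f:
    items = json.load(f)
print(len(items), list(items[0].keys()))  # e.g.: 1 ['text']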
