Demonstrating proxy authorization for a Scrapy spider through middleware and custom settings


This post demonstrates how to pass custom settings to a spider so that they override the project's "settings.py" file. This lets us configure a spider to use a proxy for specific sites. The proxy is configured by defining a downloader middleware class.

First, the proxy settings are defined in the "middlewares.py" file:

For our project it is: "D:\exp_44\myscrapers\ourfirstscraper\middlewares.py"

The middleware class that we are going to write is:

from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route the request through the authenticated proxy.
        request.meta['proxy'] = "https://system:manager@10.10.16.74:80"
        # Some proxies also expect explicit Basic auth credentials in the header.
        request.headers['Proxy-Authorization'] = basic_auth_header('system', 'manager')
        # Returning None lets the request continue through the middleware chain.
        return None
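
Note that, as written, this middleware attaches the proxy to every request. Since the goal is to use the proxy only for specific sites, a small variation can check the request's host first. The following is a minimal sketch; the PROXIED_DOMAINS set and the PerSiteProxyMiddleware class name are hypothetical names introduced here for illustration, and the class would replace CustomProxyMiddleware in DOWNLOADER_MIDDLEWARES:

from urllib.parse import urlparse
from w3lib.http import basic_auth_header

# Hypothetical set of domains that should be fetched through the proxy.
PROXIED_DOMAINS = {'abc.com', 'xyz.in'}

class PerSiteProxyMiddleware(object):
    def process_request(self, request, spider):
        host = urlparse(request.url).hostname or ''
        # Attach the proxy only for the configured domains (and their subdomains).
        if any(host == d or host.endswith('.' + d) for d in PROXIED_DOMAINS):
            request.meta['proxy'] = "https://system:manager@10.10.16.74:80"
            request.headers['Proxy-Authorization'] = basic_auth_header('system', 'manager')
        return None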
        
The spider code looks like this:

# File: D:\exp_44\myscrapers\ourfirstscraper\spiders\xyz.py

import scrapy

class XyzSpider(scrapy.Spider):
    name = 'xyz'

    # Enable our proxy middleware for this spider only. It runs at
    # priority 350, before Scrapy's HttpProxyMiddleware at 400, so the
    # proxy set in request.meta is picked up downstream.
    custom_settings = {"DOWNLOADER_MIDDLEWARES": {
            'ourfirstscraper.middlewares.CustomProxyMiddleware': 350,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
    }}

    def __init__(self, *args, **kwargs):
        super(XyzSpider, self).__init__(*args, **kwargs)
        if kwargs.get('start_urls'):
            # Comma-separated URLs passed on the command line via -a.
            self.start_urls = kwargs.get('start_urls').split(',')
        else:
            raise ValueError("No URL to start with! Pass -a start_urls=...")

    def parse(self, response):
        # Collect the text of every anchor tag, joined by semicolons.
        links = ';'.join(response.xpath('//a/text()').extract())
        yield {'links': links, 'text': response.text}
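
For comparison, the same wiring could be done project-wide in the "settings.py" file instead of in custom_settings; using custom_settings keeps the proxy configuration confined to this one spider. The settings.py equivalent would look like this:

# settings.py (project-wide alternative to custom_settings)
DOWNLOADER_MIDDLEWARES = {
    'ourfirstscraper.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}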

The way to execute this spider is:

> cd D:/exp_44/myscrapers/ourfirstscraper
> scrapy crawl xyz -a start_urls="http://abc.com,http://xyz.in"

Note that each start URL needs a scheme (http:// or https://); Scrapy rejects bare domain names with a "Missing scheme" error.
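
To quickly check that traffic is actually going through the proxy, you can point the spider at a service that echoes the caller's IP address (assuming the machine has outbound access to it), for example:

> scrapy crawl xyz -a start_urls="https://httpbin.org/ip"

The 'text' field of the yielded item should then contain the proxy's address rather than your own.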
