Demonstrating proxy authorization for a Scrapy spider through middleware and custom settings

This post demonstrates how you can pass custom settings to a spider that override the project's "settings.py" file. This way we can configure a spider to use a proxy for specific sites only. The proxy itself is configured by defining a downloader middleware class.

First, the proxy settings are defined in the "middlewares.py" file. For our project this file is:

D:\exp_44\myscrapers\ourfirstscraper\middlewares.py

The middleware class that we are going to write is:

from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):

    def process_request(self, request, spider):
        # Route the request through the proxy; the credentials are supplied
        # both in the proxy URL and in the Proxy-Authorization header.
        request.meta['proxy'] = "https://system:manager@10.10.16.74:80"
        request.headers['Proxy-Authorization'] = basic_auth_header('system', 'manager')
        # Returning None lets Scrapy continue processing the request.
        return None

The spider code looks like this:

# File: D:\exp_44\myscrapers\ourfirstscraper\spiders\xyz.py

import scrapy

class XyzSpider(scrapy.Spider):
    name = 'xyz'

    # Per-spider settings that override settings.py. The lower the number,
    # the earlier the middleware runs in the downloader chain.
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            'ourfirstscraper.middlewares.CustomProxyMiddleware': 350,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
        }
    }

    def __init__(self, *args, **kwargs):
        super(XyzSpider, self).__init__(*args, **kwargs)
        # The start URLs are passed on the command line via -a start_urls=...
        if kwargs.get('start_urls'):
            self.start_urls = kwargs.get('start_urls').split(',')
        else:
            raise ValueError("No URL to start with!")

    def parse(self, response):
        # Yield the raw page body so we can inspect what came back through the proxy.
        yield {
            'text': response.text
        }

The way to execute this spider is:

>>> cd D:/exp_44/myscrapers/ourfirstscraper
>>> scrapy crawl xyz -a start_urls="http://abc.com,http://xyz.in"

Note that each start URL must include its scheme (http:// or https://), otherwise Scrapy rejects it with a "Missing scheme in request url" error.
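To confirm that requests really go out through the proxy, one quick check (assuming the proxy at 10.10.16.74 is reachable from your machine) is to crawl a service that echoes the caller's IP address, such as httpbin.org, and dump the scraped item to a file; "check.json" below is just an arbitrary output file name:

>>> scrapy crawl xyz -a start_urls="https://httpbin.org/ip" -o check.json

The 'text' field of the saved item should show the proxy's public address rather than your own.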
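As an aside, if every spider in the project should use the proxy, the same two middleware entries can be enabled project-wide instead of per spider. A minimal sketch of how that would look in our project's "settings.py", reusing the same keys and priorities as the custom_settings above:

# File: D:\exp_44\myscrapers\ourfirstscraper\settings.py

# Sketch: enable the proxy middleware for every spider in the project,
# instead of per spider via custom_settings.
DOWNLOADER_MIDDLEWARES = {
    'ourfirstscraper.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

A spider's custom_settings take precedence over settings.py, which is exactly what the per-spider approach above relies on.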