This post demonstrates how to pass custom settings to a spider so that they override the project-wide "settings.py" file. This way we can configure a spider to use a proxy only for specific sites. The proxy is configured by defining a downloader middleware class.
First, the proxy middleware is defined in the "middlewares.py" file.
For our project it is: "D:\exp_44\myscrapers\ourfirstscraper\middlewares.py"
The middleware class that we are going to write is:
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every request through the proxy and attach Basic auth credentials.
        request.meta['proxy'] = "https://system:manager@10.10.16.74:80"
        request.headers['Proxy-Authorization'] = basic_auth_header('system', 'manager')
        return None
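The basic_auth_header helper comes from w3lib (a dependency of Scrapy); it simply base64-encodes "username:password" into an RFC 7617 Basic auth value. A minimal sketch of what it produces (the function below is a plain-Python stand-in for the real w3lib helper, shown only to make the header value concrete):

```python
import base64

def basic_auth_header(username, password):
    # Stand-in for w3lib.http.basic_auth_header: base64-encode "user:pass"
    # and prefix it with the "Basic " scheme (RFC 7617).
    creds = ("%s:%s" % (username, password)).encode("latin-1")
    return b"Basic " + base64.b64encode(creds)

print(basic_auth_header("system", "manager"))
# b'Basic c3lzdGVtOm1hbmFnZXI='
```

This is the value that ends up in the Proxy-Authorization header of every outgoing request.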
The spider code looks like this:
# File: D:\exp_44\myscrapers\ourfirstscraper\spiders\xyz.py
import scrapy

class XyzSpider(scrapy.Spider):
    name = 'xyz'

    # These settings override the project-wide "settings.py" for this spider only.
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            'ourfirstscraper.middlewares.CustomProxyMiddleware': 350,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
        }
    }

    def __init__(self, *args, **kwargs):
        super(XyzSpider, self).__init__(*args, **kwargs)
        if kwargs.get('start_urls'):
            # Start URLs are passed on the command line as a comma-separated string.
            self.start_urls = kwargs.get('start_urls').split(',')
        else:
            raise ValueError("No URL to start with!")

    def parse(self, response):
        # Join the text of all anchor tags on the page.
        anchor_text = ';'.join(response.xpath('//a/text()').extract())
        yield {'anchor_text': anchor_text}
The way to execute this spider is:
>>> cd D:/exp_44/myscrapers/ourfirstscraper
>>> scrapy crawl xyz -a start_urls="http://abc.com,http://xyz.in"
Note that each start URL must include its scheme (http:// or https://); otherwise Scrapy rejects the request with a missing-scheme error.
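A word on the two priority numbers in custom_settings: Scrapy calls process_request on downloader middlewares in ascending priority order, so our CustomProxyMiddleware (350) sets the proxy and auth header before the built-in HttpProxyMiddleware (400) runs. The ordering can be illustrated with plain Python (this is only a sketch of the sorting rule, not Scrapy's actual middleware manager):

```python
# Illustration: Scrapy sorts DOWNLOADER_MIDDLEWARES by value and invokes
# process_request starting from the lowest number.
DOWNLOADER_MIDDLEWARES = {
    'ourfirstscraper.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

call_order = [
    path.rsplit('.', 1)[-1]
    for path in sorted(DOWNLOADER_MIDDLEWARES, key=DOWNLOADER_MIDDLEWARES.get)
]
print(call_order)  # ['CustomProxyMiddleware', 'HttpProxyMiddleware']
```

If the custom middleware were given a number higher than 400, the built-in proxy middleware would run first and our proxy metadata would arrive too late.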