Getting Started with Scrapy


Set up your system
Scrapy supports both Python 2 and Python 3. If you’re using Anaconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows and OS X.

To install Scrapy using conda, run:

conda install -c conda-forge scrapy
Alternatively, if you’re on Linux or Mac OS X, you can install Scrapy directly with pip:

pip install scrapy
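
A quick way to confirm the installation worked is to ask Scrapy for its version from the command line:

scrapy version

If this prints a version number without errors, you’re ready to go.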

Scrapy Shell
Scrapy provides an interactive shell of its own that you can use to experiment. To start the Scrapy shell, type in your command line:

scrapy shell
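
You can also pass a URL directly when starting the shell, so the page is fetched right away:

scrapy shell "https://www.wikipedia.org/"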

To download a page from inside the shell, type:

fetch("https://www.wikipedia.org/")

When you crawl something with Scrapy, it returns a “response” object that contains the downloaded information. Let’s open what the crawler has downloaded in a browser:

view(response)

Let’s see what the raw content looks like:

print(response.text)

Digging deeper:

response.css(".title::text").extract_first()

Here response.css(...) is a method that extracts content based on the CSS selector passed to it. The ‘.’ prefix is used with title because it is a CSS class selector. You also need to append ::text to tell the scraper to extract only the text content of the matching elements; otherwise Scrapy returns the whole matching element along with its HTML markup.
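
As a quick illustration of the difference (the selectors below are only placeholders; the tags and class names on the page you fetched may differ):

response.css("h1").extract_first()          # the full element, tags included
response.css("h1::text").extract_first()    # only the text inside the element
response.css("a::attr(href)").extract()     # href attribute values of all matching links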

So far:

response – The object that the Scrapy crawler returns. It contains all the information about the downloaded content.

response.css(...) – Matches elements using the given CSS selector.

extract_first() – Returns the first matched element (or None if nothing matched).

extract() – Returns a list of all matched elements. See the example after this list.
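
For example, still in the shell (the selector is again a placeholder; substitute one that matches the page you fetched):

# extract() returns a list of strings, extract_first() a single string or None
titles = response.css("a::text").extract()
first_title = response.css("a::text").extract_first()
print(len(titles), first_title)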

Writing Custom Spiders 

Let’s exit the Scrapy shell first and create a new Scrapy project:

(base) D:\workspace\Jupyter\exp_44_scrapy>scrapy startproject ourfirstscraper
LOGS:
New Scrapy project 'ourfirstscraper', using template directory 'c:\users\ashish\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\templates\project', created in:
D:\workspace\Jupyter\exp_44_scrapy\ourfirstscraper

You can start your first spider with:
    cd ourfirstscraper
    scrapy genspider example example.com

For now, the two most important files are:

settings.py – This file contains the settings for your project; you’ll be dealing with it a lot (see the sketch after this list).

spiders/ – This folder is where all your custom spiders will be stored. Every time you ask Scrapy to run a spider, it will look for it in this folder.
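
As a sketch of the kind of settings you might touch early on (these are standard Scrapy settings, but the values below are illustrative, not recommendations):

# settings.py (excerpt) -- illustrative values only
BOT_NAME = 'ourfirstscraper'

# Identify your crawler politely; the URL here is a placeholder
USER_AGENT = 'ourfirstscraper (+https://example.com)'

# Respect robots.txt rules on the sites you crawl
ROBOTSTXT_OBEY = True

# Slow the crawler down so you don't hammer the target site
DOWNLOAD_DELAY = 1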

Creating a spider

Let’s change into the project directory and generate a basic spider called "wikipedia":

scrapy genspider wikipedia www.wikipedia.org

A few things to note here:

name : Name of the spider, in this case it is “wikipedia”. Naming spiders properly becomes a huge relief when you have to maintain hundreds of spiders.

allowed_domains : An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed.

parse(self, response) : This function is called whenever the crawler successfully crawls a URL. Remember the response object from earlier? This is the same response object that is passed to parse(...).

Code that goes in 'spiders/wikipedia.py': 

# -*- coding: utf-8 -*-
import scrapy


class WikipediaSpider(scrapy.Spider):
    name = 'wikipedia'                      # name used with "scrapy crawl"
    allowed_domains = ['wikipedia.org']     # requests outside this domain won't be followed
    start_urls = ['http://wikipedia.org/']  # the crawl starts from these URLs

    def parse(self, response):
        # Yield the full HTML of the downloaded page as a single item
        yield {"text": response.text}
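
If you want the spider to pull out specific pieces of the page instead of dumping the raw HTML, the same CSS selectors from the shell work inside parse(). A minimal sketch of an alternative parse() method (the selectors and field names are assumptions, not taken from the actual page):

    def parse(self, response):
        # Hypothetical example: yield one item per link on the page
        for link in response.css("a"):
            yield {
                "text": link.css("::text").extract_first(),
                "url": link.css("::attr(href)").extract_first(),
            }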

And to run the spider:

(base) D:\workspace\Jupyter\exp_44_scrapy\ourfirstscraper>scrapy crawl wikipedia 
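
To save what the spider yields instead of only seeing it in the logs, Scrapy can write the items to a file via the -o flag (the filename here is arbitrary):

scrapy crawl wikipedia -o wikipedia.json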

#Ref: https://docs.scrapy.org/en/latest/topics/spiders.html#topics-spiders
#Ref: https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/
