Scrapy spiders for getting data about programming languages from Wikipedia

First, create a Scrapy project as shown in this post: Getting Started with Scrapy

Then make the following change in the "" file:

File: D:\workspace\Jupyter\myscrapers\ourfirstscraper\

FEED_STORAGES={'d': 'scrapy.extensions.feedexport.FileFeedStorage'}
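
The keys of FEED_STORAGES are URI schemes, so one plausible reading of the 'd' key is that an output path on the D: drive gets parsed as if 'd' were its scheme. The sketch below uses only the standard library to show that parsing. The output path is a hypothetical example, not something taken from the project above.

```python
from urllib.parse import urlparse

# Hypothetical output path on the D: drive (an assumption for illustration).
output_path = r'D:\workspace\Jupyter\items.json'

# urlparse treats the drive letter before ':' as a URI scheme and lowercases it.
scheme = urlparse(output_path).scheme
print(scheme)

# A lookup keyed by scheme, mirroring the FEED_STORAGES entry above.
FEED_STORAGES = {'d': 'scrapy.extensions.feedexport.FileFeedStorage'}
print(FEED_STORAGES.get(scheme))
```

With this mapping in place, a path whose "scheme" is the drive letter resolves to the plain file storage backend.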

How to execute the spider 'xyz': 
(base) D:\workspace\Jupyter\myscrapers\ourfirstscraper>scrapy crawl xyz

# Spider 1: To extract the names of all the programming languages listed on a Wikipedia page:

import scrapy

class XyzSpider(scrapy.Spider):
    name = 'xyz'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        # Collect the text of every link on the page and join the pieces with ';'.
        body = ';'.join(response.xpath('//a/text()').extract())
        yield { 'text': body }

Output in the JSON file:

[{"text": "...;1C:Enterprise programming language;A# .NET;A-0 System;A+;A++;ABAP;ABC;ABC ALGOL;ACC;Accent;..."}]
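
Because the spider stores all link texts as one semicolon-joined string, whoever reads the file has to split it again. A minimal sketch, assuming the JSON output has the shape shown above. The sample string is a shortened stand-in for the real file contents, not the full output.

```python
import json

# Shortened stand-in for the spider's JSON output (not the full file).
raw = '[{"text": "A# .NET;A-0 System;A+;A++;ABAP;ABC;ABC ALGOL;ACC;Accent"}]'

items = json.loads(raw)
# Split the joined string back into individual link texts and drop blanks.
names = [n.strip() for n in items[0]['text'].split(';') if n.strip()]
print(len(names))
print(names[:3])
```

Since the XPath matches every link on the page, the real list also contains navigation and footer links. Filtering those out is not shown here.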

--- --- --- --- ---

# Spider 2: To extract the contents of the Wikipedia pages for the programming languages "Python, C, C#, C++, ECMAScript, Java":

import scrapy

class XyzSpider(scrapy.Spider):
    name = 'xyz'
    allowed_domains = ['']
    start_urls = [
        '', # For C++.
    ]

    def parse(self, response):
        # Join all text nodes under <body>, skipping the contents of <script> elements.
        body = u''.join(response.xpath('//body/descendant-or-self::*[not(self::script)]/text()').extract()).strip()
        yield { 'url': response.url, 'text': body }
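
Joining every text node this way tends to leave long runs of newlines and spaces in the result. A possible cleanup step, sketched with the standard library only. normalize_whitespace is a hypothetical helper and the sample string is made up for illustration. Neither is part of the spider above.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample imitating the spacing of scraped page text.
sample = 'Python\n\n\t is a   high-level\nprogramming language.\n'
print(normalize_whitespace(sample))
```

Applying such a step before saving would also shrink the text length, which matters for the character limit discussed below the listing.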

--- --- --- --- ---

We extract only these six pages because spaCy, which we use for NER, cannot by default process a text longer than 1,000,000 characters (the default value of nlp.max_length).
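
One way to deal with this limit before running NER is to measure each text and, if needed, split it into pieces below the limit. A rough sketch under assumptions: MAX_CHARS mirrors the 1,000,000 figure above, split_text is a hypothetical helper, and the fixed-size cut is simply the easiest option (it can split a sentence in the middle).

```python
MAX_CHARS = 1_000_000  # the limit quoted above

def split_text(text, max_chars=MAX_CHARS):
    """Return consecutive slices of text, each at most max_chars long."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Hypothetical oversized text, used only to exercise the helper.
text = 'x' * 2_500_000
chunks = split_text(text)
print([len(c) for c in chunks])
```

Cutting at paragraph or sentence boundaries instead would likely give better NER results, at the cost of a slightly longer helper.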
