Scrapy spiders for getting data about programming languages from Wikipedia


First, create a Scrapy project as shown in this post: Getting Started with Scrapy

Then make the following changes in the "settings.py" file:

File: D:\workspace\Jupyter\myscrapers\ourfirstscraper\settings.py

# Export scraped items as JSON to the file below.
FEED_FORMAT = "json"
FEED_URI = "D:/workspace/Jupyter/myscrapers/response_logs.json"
# Scrapy parses the Windows drive letter "D:" in the URI as a scheme,
# so map the "d" scheme to the standard file storage backend.
FEED_STORAGES = {'d': 'scrapy.extensions.feedexport.FileFeedStorage'}
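On Scrapy 2.1 and later, the same export can be configured with the newer FEEDS setting (which deprecates FEED_FORMAT and FEED_URI); a sketch using a file:// URI for the same path, which also avoids the drive-letter-as-scheme workaround:

```python
# settings.py -- Scrapy 2.1+ equivalent of the FEED_* settings above
FEEDS = {
    'file:///D:/workspace/Jupyter/myscrapers/response_logs.json': {
        'format': 'json',
    },
}
```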


How to execute the spider 'xyz': 
(base) D:\workspace\Jupyter\myscrapers\ourfirstscraper>scrapy crawl xyz

# Spider 1: To extract the names of all the programming languages listed on a Wikipedia page:

import scrapy

class XyzSpider(scrapy.Spider):
    name = 'xyz'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_programming_languages']

    def parse(self, response):
        # Join the text of every <a> element on the page with ';'.
        body = ';'.join(response.xpath('//a/text()').extract())

        yield { 'text': body }

Logs in JSON file:

[{"text": "...;1C:Enterprise programming language;A# .NET;A-0 System;A+;A++;ABAP;ABC;ABC ALGOL;ACC;Accent;..."}]
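The individual language names can be recovered from the exported feed by splitting on the ';' delimiter used in parse(). A minimal sketch, using a truncated sample of the JSON record above in place of reading the actual log file:

```python
import json

# Sample record as exported in the JSON feed (truncated from the actual log).
log = '[{"text": "A# .NET;A-0 System;A+;A++;ABAP;ABC"}]'

records = json.loads(log)
names = records[0]['text'].split(';')
print(names)  # → ['A# .NET', 'A-0 System', 'A+', 'A++', 'ABAP', 'ABC']
```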

--- --- --- --- ---

# Spider 2: To extract the contents of the Wikipedia pages for the programming languages "Python, C, C#, C++, ECMAScript, Java":

import scrapy

class XyzSpider(scrapy.Spider):
    name = 'xyz'
    allowed_domains = ['wikipedia.org']
    start_urls = [
        'https://en.wikipedia.org/wiki/Python_(programming_language)',
        'https://en.wikipedia.org/wiki/C_(programming_language)',
        'https://en.wikipedia.org/wiki/C_Sharp_(programming_language)',
        'https://en.wikipedia.org/wiki/C%2B%2B', # For C++.
        'https://en.wikipedia.org/wiki/ECMAScript',
        'https://en.wikipedia.org/wiki/Java_(programming_language)',
    ]

    def parse(self, response):
        # Concatenate the text nodes of every element under <body>,
        # skipping <script> elements, then trim surrounding whitespace.
        body = ''.join(response.xpath('//body/descendant-or-self::*[not(self::script)]/text()').extract()).strip()

        yield { 'url': response.url, 'text': body }
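Because the XPath above concatenates raw text nodes, the result contains long runs of newlines and spaces from the page layout. Before feeding the corpus to spaCy, it helps to collapse that whitespace; a minimal sketch on a made-up sample string:

```python
import re

# Made-up sample of what the concatenated text nodes can look like.
raw = "Python  is  an   interpreted,\n\n high-level language.\t"

# Collapse every run of whitespace to a single space and trim the ends.
clean = re.sub(r'\s+', ' ', raw).strip()
print(clean)  # → Python is an interpreted, high-level language.
```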

--- --- --- --- ---

The reason we restrict the crawl to these six pages is that spaCy cannot run NER on a text longer than 1,000,000 characters (its default nlp.max_length limit), and a larger corpus would exceed that limit.
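An alternative to restricting the crawl is to split the corpus into pieces shorter than the limit before running NER on each piece. A minimal pure-Python sketch (the chunk_text helper and MAX_LEN constant are illustrative, not part of spaCy):

```python
MAX_LEN = 1_000_000  # spaCy's default nlp.max_length

def chunk_text(text, max_len=MAX_LEN):
    """Split text into pieces no longer than max_len, breaking at spaces where possible."""
    chunks = []
    while len(text) > max_len:
        # Prefer the last space within the limit so words are not cut in half.
        cut = text.rfind(' ', 0, max_len)
        if cut <= 0:
            cut = max_len
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

print(len(chunk_text('x' * 2_500_000)))  # → 3
```

Each chunk can then be passed to nlp() separately and the entities merged afterwards.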
        
