First, create a Scrapy project as shown in this post: Getting Started with Scrapy

Then make the following changes in the "settings.py" file:

File: D:\workspace\Jupyter\myscrapers\ourfirstscraper\settings.py

    FEED_FORMAT="json"
    FEED_URI="D:/workspace/Jupyter/myscrapers/response_logs.json"
    FEED_STORAGES={'d': 'scrapy.extensions.feedexport.FileFeedStorage'}

The FEED_STORAGES entry appears to be needed on Windows because Scrapy reads the drive letter "D:" in FEED_URI as a URI scheme named "d". Mapping that scheme to FileFeedStorage makes Scrapy write the feed to the local file.

How to execute the spider 'xyz':

    (base) D:\workspace\Jupyter\myscrapers\ourfirstscraper>scrapy crawl xyz

# Spider 1: To extract the names of all the programming languages listed on a Wikipedia page:

    import scrapy

    class XyzSpider(scrapy.Spider):
        name = 'xyz'
        allowed_domains = ['wikipedia.org']
        start_urls = ['https://en.wikipedia.org/wiki/List_of_programming_languages']

        def parse(self, response):
            # Join the text of every link on the page with ';'.
            # '//a' matches all links, so navigation links are captured too,
            # not only the language names.
            body = ';'.join(response.xpath('//a/text()').extract())
            yield { 'text': body }

Logs in JSON file:

    [{"text": "...;1C:Enterprise programming language;A# .NET;A-0 System;A+;A++;ABAP;ABC;ABC ALGOL;ACC;Accent;..."}]

--- --- --- --- ---

# Spider 2: To extract the contents of the Wikipedia pages for the programming languages "Python, C, C#, C++, ECMAScript, Java":

    import scrapy

    class XyzSpider(scrapy.Spider):
        name = 'xyz'
        allowed_domains = ['wikipedia.org']
        start_urls = [
            'https://en.wikipedia.org/wiki/Python_(programming_language)',
            'https://en.wikipedia.org/wiki/C_(programming_language)',
            'https://en.wikipedia.org/wiki/C_Sharp_(programming_language)',
            'https://en.wikipedia.org/wiki/C%2B%2B',  # For C++.
            'https://en.wikipedia.org/wiki/ECMAScript',
            'https://en.wikipedia.org/wiki/Java_(programming_language)',
        ]

        def parse(self, response):
            # Collect every text node under <body>, skipping the contents
            # of <script> elements, and join them into one string.
            body = ''.join(response.xpath('//body/descendant-or-self::*[not(self::script)]/text()').extract()).strip()
            yield { 'url': response.url, 'text': body }

--- --- --- --- ---

We extract only these six pages because spaCy, which we use for NER, by default refuses to process a text longer than 1,000,000 characters (its "max_length" setting).
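The length limit mentioned above can be checked before running NER. The following is a minimal sketch, assuming the feed has the format produced by Spider 2 (a JSON array of objects with 'url' and 'text' keys) and that the limit is counted in characters. The function names and the sample records are hypothetical and only stand in for the real feed.

```python
import json

# Limit quoted in this post; assumed to be counted in characters.
SPACY_CHAR_LIMIT = 1_000_000


def load_records(path):
    """Load the JSON feed written by Scrapy (a list of dicts)."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def total_length(records):
    """Total number of characters across all scraped pages."""
    return sum(len(r['text']) for r in records)


def texts_over_limit(records, limit=SPACY_CHAR_LIMIT):
    """Return (url, length) pairs for pages whose text is longer than limit."""
    return [(r['url'], len(r['text'])) for r in records if len(r['text']) > limit]


# Hypothetical sample records standing in for the real feed.
sample = [
    {'url': 'https://en.wikipedia.org/wiki/ECMAScript', 'text': 'a' * 10},
    {'url': 'https://en.wikipedia.org/wiki/C%2B%2B', 'text': 'b' * 25},
]

print(total_length(sample))
print(texts_over_limit(sample, limit=20))
```

With the real feed, pass the FEED_URI path to load_records and give the result to the other two functions. If the combined text is over the limit, one option is to process the pages one at a time instead of as a single string.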
Scrapy spiders for getting data about programming languages from Wikipedia