Getting Wikipedia data as text


You can download the English Wikipedia dumps from here:
• https://dumps.wikimedia.org/enwiki/20191120/
• https://dumps.wikimedia.org/enwiki/latest/

Use the WikiExtractor.py script to convert the archive into a single text file. The script is available here: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

Usage:

python3 WikiExtractor.py --infn dump.xml.bz2

For more information: http://wiki.apertium.org/wiki/Wikipedia_Extractor
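
Before running the full extraction, it can help to check which pages a dump file contains. The following is a minimal sketch using only the Python standard library; it is not how WikiExtractor.py works internally. The helper names local_name and iter_pages are hypothetical, and the sketch assumes the dump follows the usual MediaWiki export layout, where each page element holds a title and an id before its revision.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_id, title) for each page in a bzip2-compressed dump.

    `source` may be a file name or a binary file object.
    """
    with bz2.open(source, "rb") as stream:
        page_id = title = None
        # Stream the XML so the whole dump never has to fit in memory.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "id" and page_id is None:
                # The first <id> closed inside a page is the page id;
                # revision and contributor ids come later and are skipped.
                page_id = elem.text
            elif name == "page":
                yield page_id, title
                page_id = title = None
                elem.clear()  # release the memory held by the finished page


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        for pid, page_title in iter_pages(sys.argv[1]):
            print(pid, page_title)
```

When called with a dump file name as its only argument, the sketch prints one id and title per line, similar in shape to the progress lines WikiExtractor.py prints.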

Alternatively, you can download older Wikipedia archives as plain text from here:

http://kopiwiki.dsd.sztaki.hu/

~ * ~

The Python script is also available here:
https://drive.google.com/open?id=10Slry7jVmaVo2XcD_gZeqJrH9USN-hjC

~ * ~

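The script fails on Windows 10 (Conda version: 4.7.5) with a UnicodeEncodeError, because the output is encoded with the cp1252 code page, which cannot represent every character in the dump: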

(base) D:\workspace\Jupyter\exp_40_wikipedia_data_as_text>python WikiExtractor.py --infn enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2
File detected as being bzip2.
12 Anarchism
25 Autism
Traceback (most recent call last):
  File "WikiExtractor.py", line 760, in <module>
    main()
  File "WikiExtractor.py", line 743, in main
    process_data('bzip2',f, output_sentences, vital_titles, incubator, vital_tags)
  File "WikiExtractor.py", line 643, in process_data
    ''.join(page))
  File "WikiExtractor.py", line 143, in WikiDocumentSentences
    print(line, file=out)
  File "WikiExtractor.py", line 548, in write
    self.out_file.write(text)
  File "C:\Users\ashish\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 209-213: character maps to <undefined>
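
The traceback ends in cp1252.py, which suggests the output stream was opened with the Windows default encoding rather than UTF-8. The following is a minimal sketch of the usual remedy, which is to pass an explicit encoding when opening the output file. It is not a patch for WikiExtractor.py. The helper names can_encode and write_utf8 and the sample sentence are made up for illustration.

```python
import os
import tempfile


def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_utf8(path, lines):
    """Write lines as UTF-8, independent of the platform's default encoding."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for line in lines:
            print(line, file=out)


# Illustrative sample: Greek letters have no mapping in cp1252.
sample = "Anarchism comes from the Greek ἀναρχία."

if __name__ == "__main__":
    print("cp1252:", can_encode(sample, "cp1252"))
    print("utf-8: ", can_encode(sample, "utf-8"))
    target = os.path.join(tempfile.mkdtemp(), "out.txt")
    write_utf8(target, [sample])
    print("wrote", target)
```

On Python 3.7 or later, starting the interpreter in UTF-8 mode (python -X utf8, or setting the environment variable PYTHONUTF8=1) may also avoid the error without editing the script, although that was not tested here.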

~ * ~

It runs successfully on Ubuntu 19.04.

Logs:

(base) ashish@ashish-vBox:~/Desktop/test$ python WikiExtractor.py --infn enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2

File detected as being bzip2.
12 Anarchism
25 Autism
39 Albedo
290 A
303 Alabama
305 Achilles
307 Abraham Lincoln
308 Aristotle
309 An American in Paris
...

30284 Statistical hypothesis testing
30292 The Hobbit
30296 Tax Freedom Day
30297 Tax
30299 Transhumanism
30302 TARDIS

~ * ~
