You can download the English Wikipedia dumps from here:

• https://dumps.wikimedia.org/enwiki/20191120/
• https://dumps.wikimedia.org/enwiki/latest/

Use this Python script to convert the archive into a single text file:
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

Usage:
python3 WikiExtractor.py --infn dump.xml.bz2

For more information: http://wiki.apertium.org/wiki/Wikipedia_Extractor

Alternatively, you can download older Wikipedia archives already converted to text from here: http://kopiwiki.dsd.sztaki.hu/

~ * ~

The Python script is also available here: https://drive.google.com/open?id=10Slry7jVmaVo2XcD_gZeqJrH9USN-hjC

~ * ~

This Python program fails on Windows 10 (Conda version: 4.7.5):

(base) D:\workspace\Jupyter\exp_40_wikipedia_data_as_text>python WikiExtractor.py --infn enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2
File detected as being bzip2.
12 Anarchism
25 Autism
Traceback (most recent call last):
  File "WikiExtractor.py", line 760, in <module>
    main()
  File "WikiExtractor.py", line 743, in main
    process_data('bzip2', f, output_sentences, vital_titles, incubator, vital_tags)
  File "WikiExtractor.py", line 643, in process_data
    ''.join(page))
  File "WikiExtractor.py", line 143, in WikiDocumentSentences
    print(line, file=out)
  File "WikiExtractor.py", line 548, in write
    self.out_file.write(text)
  File "C:\Users\ashish\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input, self.errors, encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 209-213: character maps to <undefined>

~ * ~

It runs successfully on Ubuntu 19.04. Logs:

(base) ashish@ashish-vBox:~/Desktop/test$ python WikiExtractor.py --infn enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2
File detected as being bzip2.
12 Anarchism
25 Autism
39 Albedo
290 A
303 Alabama
305 Achilles
307 Abraham Lincoln
308 Aristotle
309 An American in Paris
...
30284 Statistical hypothesis testing
30292 The Hobbit
30296 Tax Freedom Day
30297 Tax
30299 Transhumanism
30302 TARDIS

~ * ~
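The traceback points at cp1252.py: on Windows, Python's default text encoding for console output and opened files is the legacy cp1252 code page, which cannot represent many characters found in Wikipedia articles, while on Ubuntu the default is UTF-8. The snippet below is a minimal reproduction of the error plus the usual fix (opening the output file with an explicit UTF-8 encoding); it is a sketch illustrating the cause, not code from WikiExtractor.py itself:

```python
# U+2615 (hot beverage) has no mapping in the cp1252 code page,
# so encoding it raises the same UnicodeEncodeError seen on Windows.
text = "café \u2615"

try:
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)

# Writing with an explicit UTF-8 encoding succeeds regardless of
# the platform's default code page:
with open("out.txt", "w", encoding="utf-8") as f:
    f.write(text)

with open("out.txt", encoding="utf-8") as f:
    assert f.read() == text
```

Without editing the script, the same effect can be had on Python 3.7+ by setting the environment variable PYTHONUTF8=1 (or PYTHONIOENCODING=utf-8) before running it, which forces UTF-8 defaults for all text I/O.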
Getting Wikipedia data as text