You can download the English Wikipedia dumps from here (a small download sketch in Python follows the links):
• https://dumps.wikimedia.org/enwiki/20191120/
• https://dumps.wikimedia.org/enwiki/latest/
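If you prefer to script the download, here is a minimal sketch using only the Python standard library. The file name is the same 20191120 multistream chunk that appears in the logs below; dated dumps are eventually purged from dumps.wikimedia.org, so you may need to pick a file from the latest/ directory instead.

import urllib.request

# One multistream chunk of the 20191120 English dump (same file as in the logs below).
# Dated dumps are removed over time; switch to a file under /enwiki/latest/ if this 404s.
DUMP_FILE = "enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2"
DUMP_URL = "https://dumps.wikimedia.org/enwiki/20191120/" + DUMP_FILE

urllib.request.urlretrieve(DUMP_URL, DUMP_FILE)
print("Downloaded", DUMP_FILE)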
Use this Python script to convert the archive into a single text file: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
Usage:
python3 WikiExtractor.py --infn dump.xml.bz2
For more information: http://wiki.apertium.org/wiki/Wikipedia_Extractor
Alternatively, you can download older Wikipedia archives as plain text from here:
http://kopiwiki.dsd.sztaki.hu/
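If you would rather launch the extraction from Python than from the shell, a minimal sketch (assuming WikiExtractor.py and the downloaded dump sit in the current directory) is:

import subprocess

# Equivalent to: python3 WikiExtractor.py --infn dump.xml.bz2
subprocess.run(
    ["python3", "WikiExtractor.py", "--infn",
     "enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2"],
    check=True,  # raise CalledProcessError if the extractor exits with an error
)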
~ * ~
The Python script is also available here:
https://drive.google.com/open?id=10Slry7jVmaVo2XcD_gZeqJrH9USN-hjC
~ * ~
This Python program fails on Windows 10 (Conda version 4.7.5) with a UnicodeEncodeError:
(base) D:\workspace\Jupyter\exp_40_wikipedia_data_as_text>python WikiExtractor.py --infn enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2
File detected as being bzip2.
12 Anarchism
25 Autism
Traceback (most recent call last):
  File "WikiExtractor.py", line 760, in <module>
    main()
  File "WikiExtractor.py", line 743, in main
    process_data('bzip2',f, output_sentences, vital_titles, incubator, vital_tags)
  File "WikiExtractor.py", line 643, in process_data
    ''.join(page))
  File "WikiExtractor.py", line 143, in WikiDocumentSentences
    print(line, file=out)
  File "WikiExtractor.py", line 548, in write
    self.out_file.write(text)
  File "C:\Users\ashish\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 209-213: character maps to <undefined>
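The traceback shows the actual cause: the script opens its output file without an explicit encoding, so Windows falls back to cp1252, which cannot represent many characters that occur in Wikipedia articles. Two hedged workarounds are sketched below; the exact open() call differs between versions of WikiExtractor.py, so the patch shown is only an illustration, not a specific line to change.

# Option 1 (no code change): with Python 3.7+ you can force UTF-8 mode so that
# open() defaults to UTF-8 instead of cp1252. In the same cmd session, before
# running the script:
#     set PYTHONUTF8=1
#
# Option 2: pass the encoding explicitly wherever WikiExtractor.py opens its
# output file, e.g. change a call like
#     out_file = open(out_path, "w")
# to
#     out_file = open(out_path, "w", encoding="utf-8")

# Minimal self-contained demonstration of the difference:
text = "Łódź, ελληνικά, 中文"  # characters that cp1252 cannot encode
with open("utf8_ok.txt", "w", encoding="utf-8") as out_file:
    out_file.write(text)       # works on any platform
try:
    with open("cp1252_fails.txt", "w", encoding="cp1252") as out_file:
        out_file.write(text)   # reproduces the UnicodeEncodeError from the log
except UnicodeEncodeError as err:
    print("cp1252 cannot encode this text:", err)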
~ * ~
It runs successfully on Ubuntu 19.04.
Logs:
(base) ashish@ashish-vBox:~/Desktop/test$ python WikiExtractor.py --infn enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2
File detected as being bzip2.
12 Anarchism
25 Autism
39 Albedo
290 A
303 Alabama
305 Achilles
307 Abraham Lincoln
308 Aristotle
309 An American in Paris
...
30284 Statistical hypothesis testing
30292 The Hobbit
30296 Tax Freedom Day
30297 Tax
30299 Transhumanism
30302 TARDIS
~ * ~