You can download the English Wikipedia dumps from here:

- https://dumps.wikimedia.org/enwiki/20191120/
- https://dumps.wikimedia.org/enwiki/latest/

Use this Python script to convert the archive into a single text file:
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

Usage:

    python3 WikiExtractor.py --infn dump.xml.bz2

For more information: http://wiki.apertium.org/wiki/Wikipedia_Extractor

Alternatively, you can download older Wikipedia archives as plain text from here: http://kopiwiki.dsd.sztaki.hu/

~ * ~

The Python script is also available here: https://drive.google.com/open?id=10Slry7jVmaVo2XcD_gZeqJrH9USN-hjC

~ * ~

This Python program fails on Windows 10 (Conda version: 4.7.5):

    (base) D:\workspace\Jupyter\exp_40_wikipedia_data_as_text>python WikiExtractor.py --infn enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2
    File detected as being bzip2.
    12 Anarchism
    25 Autism
    Traceback (most recent call last):
      File "WikiExtractor.py", line 760, in <module>
        main()
      File "WikiExtractor.py", line 743, in main
        process_data('bzip2', f, output_sentences, vital_titles, incubator, vital_tags)
      File "WikiExtractor.py", line 643, in process_data
        ''.join(page))
      File "WikiExtractor.py", line 143, in WikiDocumentSentences
        print(line, file=out)
      File "WikiExtractor.py", line 548, in write
        self.out_file.write(text)
      File "C:\Users\ashish\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 209-213: character maps to <undefined>

~ * ~

It runs successfully on Ubuntu 19.04. Logs:

    (base) ashish@ashish-vBox:~/Desktop/test$ python WikiExtractor.py --infn enwiki-20191120-pages-articles-multistream1.xml-p10p30302.bz2
    File detected as being bzip2.
    12 Anarchism
    25 Autism
    39 Albedo
    290 A
    303 Alabama
    305 Achilles
    307 Abraham Lincoln
    308 Aristotle
    309 An American in Paris
    ...
    30284 Statistical hypothesis testing
    30292 The Hobbit
    30296 Tax Freedom Day
    30297 Tax
    30299 Transhumanism
    30302 TARDIS

~ * ~
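The Windows failure above is an encoding issue rather than a problem with the dump: on Windows, Python opens files with the platform default codec (cp1252), which cannot represent many characters that occur in Wikipedia text. A minimal sketch of the problem and of one possible fix, namely opening the output file with an explicit UTF-8 encoding (the `wiki_out.txt` path here is just an illustration, not a file the script actually creates):

```python
import os
import tempfile

# Sample text containing Greek letters, which cp1252 cannot encode:
text = "Autism \u03b1\u03b2 sample line"

# This mirrors the Windows failure: cp1252 has no mapping for U+03B1.
try:
    text.encode("cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True

# The fix: open output files with encoding="utf-8" instead of relying on
# the platform default (i.e. wherever WikiExtractor.py opens its output file).
out_path = os.path.join(tempfile.gettempdir(), "wiki_out.txt")
with open(out_path, "w", encoding="utf-8") as out:
    print(text, file=out)

with open(out_path, encoding="utf-8") as f:
    round_tripped = f.read().rstrip("\n")

print(failed, round_tripped == text)  # True True
```

On Python 3.7+, setting the environment variable PYTHONUTF8=1 before running the script should have the same effect without editing the code, since it makes UTF-8 the default encoding for open().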
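For reference, the .xml.bz2 dump can also be read directly in Python with the standard-library bz2 module, which decompresses the stream incrementally instead of unpacking the whole archive. A self-contained sketch (it writes a tiny sample file first, so it does not need the real dump; the title-extraction logic is only illustrative, not how WikiExtractor parses pages):

```python
import bz2

# Write a small bz2 file standing in for the dump, then read it back
# line by line; bz2.open decompresses the stream as it is consumed.
sample = "<page>\n<title>Anarchism</title>\n</page>\n"
with bz2.open("sample.xml.bz2", "wt", encoding="utf-8") as f:
    f.write(sample)

titles = []
with bz2.open("sample.xml.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line.startswith("<title>") and line.endswith("</title>"):
            titles.append(line[len("<title>"):-len("</title>")])

print(titles)  # ['Anarchism']
```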
Getting Wikipedia data as text