App/Import/mwad
mwad is a Python script and executable by Mattze96 that generates XML dumps using the MediaWiki API.
Overview: XML Dumps
XOWA is an offline wiki application for online wikis. It works by converting a MediaWiki XML dump into an .xowa sqlite3 database.
XML dumps can be obtained in the following locations:
- Wikimedia wikis: https://dumps.wikimedia.org/backup-index.html
- Wikia wikis: on the Special:Statistics page of a given wiki. For example, for the freespeech Wikia wiki, go to http://freespeech.wikia.com/wiki/Special:Statistics
- Other wikis: Varies and depends on wiki setup.
For non-Wikimedia wikis (Wikia wikis and other wikis), the dumps may be unavailable or out of date. For example, the freespeech Wikia wiki has a dump date of 2013-12-26, which is over two and a half years old as of 2016-07-10.
For Wikia wikis, one can request an XML dump by doing the following:
- Logging in with a user account
- Requesting a dump through the Special:Statistics page
- Waiting for the dump to be generated
Other wikis may require emails to the wiki's admins.
An alternative to this process is to use Mattze96's mwad (MediaWiki API dump), which builds the XML dump by querying the wiki's API directly.
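To give a sense of what "dumping through the API" involves, here is a minimal sketch that only builds request URLs for two standard MediaWiki API features: listing page titles (list=allpages) and exporting pages as XML (export). It is not taken from mwad's source and may differ from what mwad does internally. The function names are hypothetical, and the api.php location is an assumption, since it varies by wiki setup.

```python
from urllib.parse import urlencode


def allpages_url(api_url, continue_from=None, limit=50):
    """Build a URL that lists page titles in batches via list=allpages."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": limit,
        "format": "json",
    }
    if continue_from is not None:
        # apcontinue resumes the listing where the previous batch stopped
        params["apcontinue"] = continue_from
    return api_url + "?" + urlencode(params)


def export_url(api_url, titles):
    """Build a URL that asks the API to export the given titles as XML."""
    params = {
        "action": "query",
        "titles": "|".join(titles),  # the API separates titles with "|"
        "export": 1,
        "exportnowrap": 1,  # return the bare export XML, not wrapped in a result
    }
    return api_url + "?" + urlencode(params)


# Assumed API location for the example wiki; check the wiki's own setup.
API = "http://freespeech.wikia.com/api.php"
print(allpages_url(API))
print(export_url(API, ["Main Page"]))
```

No request is sent here; the sketch only shows the shape of the queries a dump tool would issue repeatedly until every page has been exported.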
Usage
Currently, mwad is available as a command-line executable and as a Python script.
- For up-to-date info, see https://github.com/Mattze96/mwad
- For info as of 2016-07-10, see mwad usage notes below
- For a walk-through synopsis, see the following:
Generating the dump
Executable
- Open up a command prompt
  - cmd
- Change to the mwad directory
  - cd C:\xowa\bin\windows\python\mwad
- Run mwad with the following options
  - mediawiki_api_dump.win32.exe http://freespeech.wikia.com
Python script
- Make sure you have Python 3 installed on your system
- Open up a command prompt
  - cmd
- Change to the mwad directory
  - cd C:\xowa\bin\any\python\mwad
- Run mwad with the following options
  - python mediawiki_api_dump.py http://freespeech.wikia.com
In both cases, mwad generates an XML file called freespeech.wikia.com-20160710-pages-articles.xml in the current directory.
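If you script around mwad, it can help to predict the output file name. The sketch below assumes the pattern is host name, then run date as YYYYMMDD, then "-pages-articles.xml". That pattern is inferred only from the single example above, and the function name is hypothetical.

```python
from datetime import date
from urllib.parse import urlparse


def expected_dump_name(url, on_date):
    """Guess the dump file name for a wiki URL and a run date.

    Assumption: <host>-<YYYYMMDD>-pages-articles.xml, inferred from the
    freespeech.wikia.com example; mwad's -n option may change the name.
    """
    host = urlparse(url).netloc
    return f"{host}-{on_date:%Y%m%d}-pages-articles.xml"


print(expected_dump_name("http://freespeech.wikia.com", date(2016, 7, 10)))
```

If the actual file name differs on your system, trust the file that mwad wrote rather than this guess.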
Importing the dump
- Create a folder called C:\xowa\wiki\freespeech.wikia.com
- Move the xml file to C:\xowa\wiki\freespeech.wikia.com
- Rename the file to freespeech.wikia.com.xml
- Choose "Main Menu" -> "Tools" -> "Import Offline"
- Change "Wiki" to "Other wiki"
- Change "Where to get the dump" to "read from file"
- Select the XML file by clicking "..."
- Press "Import Now"
Depending on the wiki, the Main_Page may not be available. You can use the XOWA search bar to look for pages in the wiki.
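The first three import steps (create the folder, move the file, rename it) can also be scripted. The following is a minimal sketch, not part of XOWA or mwad; the function name and its parameters are hypothetical, and the paths in the commented usage line are the ones from the example above.

```python
import shutil
from pathlib import Path


def stage_dump_for_import(dump_file, wiki_root, domain):
    """Move a dump to <wiki_root>/<domain>/<domain>.xml and return the new path."""
    wiki_dir = Path(wiki_root) / domain
    # Create the wiki folder (and any missing parents) if it is not there yet
    wiki_dir.mkdir(parents=True, exist_ok=True)
    target = wiki_dir / f"{domain}.xml"
    # Moving to the final name covers both the "move" and "rename" steps
    shutil.move(str(dump_file), str(target))
    return target


# Example call with the paths used above (uncomment to run on your machine):
# stage_dump_for_import(
#     r"C:\xowa\bin\any\python\mwad\freespeech.wikia.com-20160710-pages-articles.xml",
#     r"C:\xowa\wiki",
#     "freespeech.wikia.com",
# )
```

The remaining steps still happen in the XOWA user interface, starting from "Main Menu" -> "Tools" -> "Import Offline".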
Other notes
- Do not run this on Wikimedia wikis. Wikimedia has strict web-crawling policies. If you run this on a Wikimedia wiki, such as en.wikipedia.org, your IP address will probably be banned and you will be unable to access Wikipedia.
- Pay attention to the license of the wiki. All Wikia wikis are under a Creative Commons license for article text[1]. Other wikis may follow similarly permissive licensing, but it is your responsibility to check. If a wiki has a strict copyright license, please do not run mwad on it.
- Violating a wiki's web-scraping policy may get your IP banned. Different wikis may have different limits on the number of articles downloaded, even through their API. If you're downloading a large wiki, you should consult the wiki's admins first. Otherwise, your IP address may be flagged as an unauthorized web-crawler and you will be banned.
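If you wrap mwad in your own script, a simple guard can enforce the first note by refusing Wikimedia URLs before anything is downloaded. This is a minimal sketch; the function name is hypothetical and the domain list is an illustrative, possibly incomplete list of Wikimedia project domains.

```python
from urllib.parse import urlparse

# Illustrative list of Wikimedia project domains; not guaranteed complete.
WIKIMEDIA_DOMAINS = (
    "wikipedia.org", "wiktionary.org", "wikimedia.org", "wikibooks.org",
    "wikiquote.org", "wikisource.org", "wikinews.org", "wikiversity.org",
    "wikivoyage.org", "wikidata.org", "mediawiki.org",
)


def is_wikimedia_url(url):
    """Return True if the URL's host is a listed Wikimedia domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in WIKIMEDIA_DOMAINS)


for candidate in ("https://en.wikipedia.org", "http://freespeech.wikia.com"):
    verdict = "use dumps.wikimedia.org instead" if is_wikimedia_url(candidate) else "not a listed Wikimedia domain"
    print(candidate, "->", verdict)
```

Passing this check says nothing about a wiki's license or crawling limits, so the other two notes still apply.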
mwad usage notes
usage: mediawiki_api_dump.py [-h] [-v] [-n NAME] [-l LOG] [-c] [-x] url

Create a wiki xml-dump via api.php

positional arguments:
  url                   download url

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         verbose level... repeat up to three times
  -n NAME, --name NAME  name of the wiki for filename etc.
  -l LOG, --log LOG     specify log-file.
  -c, --compress        compress output file with bz2
  -x, --xowa            special XOWA mode: xml to stdout, progress to stderr

Example: ./mediawiki_api_dump.py http://wiki.archlinux.org
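When calling the Python script from another program, the options above can be assembled into an argument list. The sketch below uses only the flags shown in the usage text; the helper name and its parameters are hypothetical, and it builds the command without running it.

```python
import sys


def build_mwad_command(url, name=None, log=None, compress=False,
                       verbosity=0, script="mediawiki_api_dump.py"):
    """Assemble an argument list for mediawiki_api_dump.py from the documented flags."""
    cmd = [sys.executable, script]
    # -v may be repeated up to three times, per the usage text
    cmd += ["-v"] * max(0, min(verbosity, 3))
    if name:
        cmd += ["-n", name]
    if log:
        cmd += ["-l", log]
    if compress:
        cmd.append("-c")
    cmd.append(url)  # the positional url argument goes last
    return cmd


print(build_mwad_command("http://freespeech.wikia.com", compress=True, verbosity=1))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the mwad directory, so that the script path resolves.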