Dev/Design/Data dump format

From XOWA: the free, open-source, offline wiki application

The Wikimedia data dump files are released in compressed form: bzip2 or gzip. Prior to v0.5.2, XOWA required the files to be uncompressed before it could read them. As of v0.5.2, the user can choose to read from either the compressed file or the uncompressed file.
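As an illustration of the two ways of reading a dump, here is a minimal Python sketch using only the standard library. It is not XOWA's code (XOWA is a Java application); the function names, file names, and the tiny sample dump are all hypothetical stand-ins.

```python
import bz2
import os
import shutil
import tempfile

def count_lines_direct(bz2_path):
    """Read straight from the .bz2 file; no uncompressed copy is written."""
    with bz2.open(bz2_path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)

def count_lines_after_unzip(bz2_path, xml_path):
    """Decompress to xml_path first, then read the uncompressed file."""
    with bz2.open(bz2_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    with open(xml_path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)

# Tiny stand-in for a real dump file (hypothetical name and content).
work_dir = tempfile.mkdtemp()
bz2_path = os.path.join(work_dir, "sample-pages-articles.xml.bz2")
xml_path = os.path.join(work_dir, "sample-pages-articles.xml")
with bz2.open(bz2_path, "wt", encoding="utf-8") as f:
    f.write("<mediawiki>\n<page><title>A</title></page>\n</mediawiki>\n")

direct = count_lines_direct(bz2_path)
unzipped = count_lines_after_unzip(bz2_path, xml_path)
print(direct, unzipped)
```

Both functions see the same content; the second one also leaves an uncompressed .xml copy on disk, which is where the extra disk space goes.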

bzip2: disk space vs speed

Currently, reading directly from a bzip2 file is much slower than unzipping it and reading from the .xml file.[1]

For example, using a 10 GB English Wikipedia dump file:

  • unzip takes 120 minutes and 40 GB of extra disk space. This process includes unzipping to .xml with 7-zip (40 min; 40 GB) and then importing the wiki (80 min).
  • bzip2 takes 330 minutes and 0 GB of extra disk space. This process includes reading directly from the .bz2 file (250 min; 0 GB) and importing the wiki (80 min).

If you have the extra disk space, use the unzip route. If you are low on disk space, use the bzip2 route instead.
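The arithmetic behind this trade-off can be sketched as follows. The minute and GB figures are the ones quoted above for the 10 GB English Wikipedia dump; the Route class, the choose_route helper, and its simple "enough free space" rule are hypothetical and only meant to illustrate the decision.

```python
from dataclasses import dataclass

@dataclass
class Route:
    name: str
    read_minutes: int     # unzip step, or direct read from the .bz2 file
    import_minutes: int   # importing the wiki
    extra_disk_gb: int    # space needed for the uncompressed .xml

    @property
    def total_minutes(self):
        return self.read_minutes + self.import_minutes

# Figures from the example above.
UNZIP = Route("unzip", read_minutes=40, import_minutes=80, extra_disk_gb=40)
BZIP2 = Route("bzip2", read_minutes=250, import_minutes=80, extra_disk_gb=0)

def choose_route(free_disk_gb):
    """Prefer the faster unzip route when the extra space is available.

    Simplification: only the space for the .xml file is considered.
    """
    return UNZIP if free_disk_gb >= UNZIP.extra_disk_gb else BZIP2

for route in (UNZIP, BZIP2):
    print(route.name, route.total_minutes, "min,", route.extra_disk_gb, "GB extra")
```

With these numbers the bzip2 route costs 210 more minutes in exchange for saving 40 GB.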

bzip2: Application install (GUI)

By default, the application install uses the unzip route.

To change it to the bzip2 route:

  • Go to Options/Import
  • Change Custom wiki commands to wiki.download,wiki.import
Note: the key step is to remove the wiki.unzip command that follows wiki.download.
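The change amounts to dropping one entry from a comma-separated command list. A minimal sketch, assuming the default value is wiki.download,wiki.unzip,wiki.import (the page only states the target value, so the default here is an inference, and use_bzip2_route is a hypothetical helper, not part of XOWA):

```python
def use_bzip2_route(commands):
    """Return the comma-separated command list without the wiki.unzip step."""
    steps = [s.strip() for s in commands.split(",") if s.strip()]
    return ",".join(s for s in steps if s != "wiki.unzip")

# Assumed default value; not confirmed by the page.
default_commands = "wiki.download,wiki.unzip,wiki.import"
print(use_bzip2_route(default_commands))  # wiki.download,wiki.import
```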

Command-line install

The core_init build step now has an extra property: src_bz2_fil_. A sample invocation would be:

.add('simple.wikipedia.org', 'core.init').src_bz2_fil_('/home/download/simplewiki-latest-pages-articles.bz2').owner

Note that XOWA can also auto-detect the appropriate file. For example, using a directory of /xowa/wiki/simple.wikipedia.org/:

  • If only a .bz2 file is there, it will use it.
  • If only a .xml file is there, it will use it.
  • If both a .bz2 file and a .xml file are there, it will use the .xml file, since reading from the .xml file is faster.
  • If neither is there, it will fail.
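The detection rules above can be sketched as a small Python function. This is an illustration of the stated precedence only, not XOWA's implementation; pick_dump_file, its parameters, and the choice of the first match in sorted order are assumptions.

```python
import os
import tempfile

def pick_dump_file(wiki_dir, plain_ext=".xml", packed_ext=".bz2"):
    """Pick the dump file to import from wiki_dir.

    The uncompressed file wins when both kinds are present,
    because reading it is faster. Fails if neither is present.
    """
    names = sorted(os.listdir(wiki_dir))
    plain = [n for n in names if n.endswith(plain_ext)]
    packed = [n for n in names if n.endswith(packed_ext)]
    if plain:
        return os.path.join(wiki_dir, plain[0])
    if packed:
        return os.path.join(wiki_dir, packed[0])
    raise FileNotFoundError(
        "no %s or %s file in %s" % (plain_ext, packed_ext, wiki_dir))

# Demo in a scratch directory with hypothetical, empty files.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "dump.xml.bz2"), "w").close()
print(os.path.basename(pick_dump_file(demo_dir)))  # dump.xml.bz2
open(os.path.join(demo_dir, "dump.xml"), "w").close()
print(os.path.basename(pick_dump_file(demo_dir)))  # dump.xml
```

The same precedence could be applied to the .gz / .sql pair by calling pick_dump_file(wiki_dir, plain_ext=".sql", packed_ext=".gz").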

gzip

Currently, gzip is only used for the /category2/ system.

  • For application setup, .gz is always used (there is no unzipping)
  • For CLI, either .gz or .sql can be used. Note that usage follows the same rules as described above for .bz2 / .xml.

References

  1. This seems to be a result of Java's lack of support for an unsigned byte data type, as well as other performance gains from a native C/C++ application (7-zip on Windows; bzip2 on Linux).
