Help/Features/Search/Build

From XOWA: the free, open-source, offline wiki application

XOWA builds search indexes in a multi-step process.


Overview

As of v3.3.4.1, XOWA has a new search engine, which uses the same search functionality as XOWA Android.

Unfortunately, this search engine is not backward compatible with old wikis. To use it, you must do one of the following:

  • Import a new wiki with v3.3.4.1. The new search-databases are built automatically.
  • Upgrade an existing wiki with v3.3.4.1 by following the process described below.


Process

XOWA will generate a search index during the wiki import process. The following steps occur:

  • XOWA reads every page title and breaks it up into words
  • XOWA stores this data in an inverted index. From a database standpoint, the data is placed in two database tables called search_word and search_link.
  • XOWA then downloads a list of pagelinks from Wikimedia's dump servers. For example, for the 2016-03 English Wikipedia, the link is http://dumps.wikimedia.org/enwiki/20160305/enwiki-20160305-pagelinks.sql.gz
  • XOWA then parses this data and calculates PageRank based on which page links to which page
  • XOWA then applies a series of calculations to come up with a page score for each page. For more info, see Help/Features/Search/Score
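
The first two steps above can be sketched as follows. This is only a minimal illustration, not XOWA's actual code: the table names search_word and search_link come from this page, but the column names (word_id, word_text, page_id), the tokenizing rule, and the helper functions build_title_index and find_pages are assumptions made for the example.

```python
import re
import sqlite3

def build_title_index(conn, titles):
    """Build a tiny inverted index from {page_id: title}.

    Column names are hypothetical; only the table names come from the docs.
    """
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS search_word (
            word_id   INTEGER PRIMARY KEY,
            word_text TEXT UNIQUE
        );
        CREATE TABLE IF NOT EXISTS search_link (
            word_id INTEGER,
            page_id INTEGER,
            PRIMARY KEY (word_id, page_id)
        );
    """)
    for page_id, title in titles.items():
        # Break the title into lowercase words (letters and digits only).
        for word in set(re.findall(r"[^\W_]+", title.lower())):
            conn.execute(
                "INSERT OR IGNORE INTO search_word (word_text) VALUES (?)",
                (word,))
            word_id = conn.execute(
                "SELECT word_id FROM search_word WHERE word_text = ?",
                (word,)).fetchone()[0]
            # One row per (word, page) pair: this is the inverted index.
            conn.execute(
                "INSERT OR IGNORE INTO search_link (word_id, page_id) "
                "VALUES (?, ?)", (word_id, page_id))
    conn.commit()

def find_pages(conn, word):
    """Return the ids of pages whose title contains the given word."""
    rows = conn.execute(
        "SELECT l.page_id FROM search_link l "
        "JOIN search_word w ON w.word_id = l.word_id "
        "WHERE w.word_text = ? ORDER BY l.page_id", (word.lower(),))
    return [row[0] for row in rows]
```

For example, after indexing the titles "Earth", "Earth science" and "History of science", looking up "science" in this sketch returns the ids of the last two pages.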


Build process

Due to the nature of the PageRank algorithm, a lot of additional time and disk space are needed. These requirements are especially dramatic for English Wikipedia:

  • 125+ GB hard disk space needed: The pagelinks dump is 4.7 GB compressed (.gz), expands to 40 GB (.sql), and requires 80 GB of scratch space (.sqlite3).
  • 8+ hours of processing time needed: The PageRank algorithm is computationally expensive on three fronts:
    • English Wikipedia has 16.3 million pages.
    • These pages link to one another through more than 1 billion links.
    • PageRank needs approximately 20 iterations to completely rank all pages.
Note that on a machine with a fast processor and an SSD, this process takes only about 2 hours.
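
To show why the cost grows with both the number of links and the number of iterations, here is a minimal sketch of the textbook PageRank iteration. It is not XOWA's implementation: the function name pagerank, the damping factor of 0.85, and the handling of pages without outgoing links are assumptions made for the example.

```python
def pagerank(links, iterations=20, damping=0.85):
    """Textbook PageRank over {page: [pages it links to]}.

    Every iteration visits every link once, so the total work is roughly
    (number of links) x (number of iterations).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank
```

With a toy graph such as {"A": ["C"], "B": ["C"], "C": ["A"]}, page B ends up with the lowest rank because no page links to it. With iterations=0 the sketch simply returns the uniform starting ranks.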

With this in mind, XOWA offers the following options:

Download the XOWA search databases from archive.org

Monthly versions of English Wikipedia's search databases will be posted to https://archive.org/edit/Xowa_enwiki_latest . You can just download a 2 GB dump of these databases and replace your copies.

Use page-length instead of PageRank

XOWA can use page-length and skip the pagelinks download (125+ GB) as well as the PageRank running time (8+ hours). However, the generated results will not be as accurate as those from PageRank. Specifically, long pages like "List of ...." will have a high page score.

To use this option, do the following:

  • Go to Options/Import
  • Change "PageRank iteration count" to 0
  • Save the page
  • Import the wiki.

Note that 0 is the default value for this option.
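
The drawback of page-length ranking can be seen in a small sketch. This is an illustration only: the function rank_by_page_length is hypothetical, and XOWA's real page score involves further calculations (see Help/Features/Search/Score).

```python
def rank_by_page_length(page_lengths):
    """Order titles from longest to shortest page.

    page_lengths maps a title to its length in bytes. Ties are broken
    alphabetically so the result is deterministic.
    """
    return sorted(page_lengths, key=lambda title: (-page_lengths[title], title))
```

If a long list page and a shorter but widely linked article both match a search, this ordering puts the list page first simply because it is longer, which is the behavior described above.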

Use PageRank but limit to 1 iteration

This option will still require a lot of disk space, but will limit the running time to a few hours. To use this option, do the same as above, but change "PageRank iteration count" to 1.

Use PageRank but limit to 1000 iterations

This option will create the full version of PageRank search indexes. To use this option, do the same as above, but change "PageRank iteration count" to 1000.


Other notes

  • Test with Simple Wikipedia: The full PageRank process only takes about 20 minutes and requires no more than 1 GB of additional disk space.
  • Recreate through Dashboard/Wiki_maintenance: A search index is built when a wiki is first created. You can also recreate it at Dashboard/Wiki_maintenance.
