App/Full-text search/Lucene/Search indexes/Building

From XOWA: the free, open-source, offline wiki application

XOWA can generate full-text search indexes for existing Download Central wikis.

Purpose

There are two reasons to build your own search index:

  • Old Download Central wiki: Wikis built before 2017-04 do not have search indexes. Rather than wait for a new version or download a new one, you can build an index for the existing wiki.
  • Building a custom index: Download Central imposes two restrictions to keep the disk usage of search indexes low. You may want to build a custom index to work around them:
    • Main namespace only: The Project, Portal, Category and other namespaces are not indexed. For example, some Wikisources have an Author and an Index namespace.
    • Proximity queries are not supported: Lucene supports proximity queries such as "word1 word2"~8 which means find pages where word1 and word2 are within 8 words of each other. This support can be added, but it uses significantly more space. For English Wikipedia, the index size can go from 9 GB to 40 GB.
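
As a rough illustration of what a proximity query asks for, here is a minimal Python sketch. It is not XOWA or Lucene code: the function names are hypothetical, and the check is simplified to "the two words occur within N token positions of each other". Lucene's real matching differs in detail. The sketch does show why word positions must be stored in the index, which is where the extra space goes.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def within_proximity(text, word1, word2, max_distance):
    """Return True if word1 and word2 occur within max_distance
    token positions of each other (simplified proximity check)."""
    tokens = tokenize(text)
    # Positions are exactly what a "Positions" index has to store.
    positions1 = [i for i, t in enumerate(tokens) if t == word1.lower()]
    positions2 = [i for i, t in enumerate(tokens) if t == word2.lower()]
    return any(
        abs(p1 - p2) <= max_distance
        for p1 in positions1
        for p2 in positions2
    )

text = "the earth orbits the sun once a year"
print(within_proximity(text, "earth", "sun", 8))  # True: 3 positions apart
print(within_proximity(text, "earth", "sun", 2))  # False: more than 2 apart
```

Without stored positions, an index only knows that both words appear somewhere on the page, so it cannot answer this kind of query.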

Requirements

Indexes can only be built for wikis downloaded from Download Central.

If your wiki isn't yet on Download Central, please contact me through Help/Feedback and I'll add your wiki to the queue.

Process

  • Go to Special:XowaSearchBuilder
  • Choose the domain for your wiki. For example, en.wiktionary.org
  • Choose namespaces for your wiki. For example, 0,4,14. For more info, see https://en.wikipedia.org/wiki/Wikipedia:Namespace
  • Choose index options.
    • Documents: This index uses the least amount of space. However, it's not as accurate as "Documents / Frequencies"
    • Documents / Frequencies: This is the default index option used for all XOWA wikis. It is slightly more accurate, as it tracks how many times each word appears on a page. For example, if you're searching for "earth" and Page1 has "earth" 1 time, Page2 has "earth" 10 times, and Page3 has "earth" 20 times, then "Documents / Frequencies" returns the pages in the following order: Page3, Page2, Page1. "Documents" would list them in no particular order.
    • Documents / Frequencies / Positions: This index option allows proximity queries such as "word1 word2"~8. However, it can use 4 to 5 times as much space.
    • Documents / Frequencies / Positions / Offsets: This index option is primarily used for Lucene highlighting. XOWA uses its own highlighter in order to save space. At the moment, there's no reason to choose this option, but it may be useful to some power Lucene users.
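
The "earth" example for "Documents / Frequencies" can be sketched in a few lines of Python. This is only an illustration of ranking by raw word count. The function names are hypothetical, and Lucene's actual scoring also weighs other factors, such as page length.

```python
import re
from collections import Counter

def term_frequency(text, term):
    """Count how many times term appears in text (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)[term.lower()]

def rank_by_frequency(pages, term):
    """Return the titles of pages containing term, most occurrences first.

    pages maps a page title to its text.
    """
    counts = {title: term_frequency(text, term) for title, text in pages.items()}
    matches = [title for title, count in counts.items() if count > 0]
    return sorted(matches, key=lambda title: counts[title], reverse=True)

# Same numbers as the example above: 1, 10 and 20 occurrences of "earth".
pages = {
    "Page1": "earth",
    "Page2": " ".join(["earth"] * 10),
    "Page3": " ".join(["earth"] * 20),
}
print(rank_by_frequency(pages, "earth"))  # ['Page3', 'Page2', 'Page1']
```

A "Documents" index stores only which pages contain a word, not the counts, so it has nothing to sort by in this way.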
