App/Full-text search

From XOWA: the free, open-source, offline wiki application

As of v4.5.0, XOWA can search for words in the page text.

Usage

Running

Full-text search can be reached in any of the following ways:

  • Main menu: Bookmarks -> Search for pages in the wiki
  • GUI:
    • Type in the text-box to the left of the magnifying glass
    • Press enter or click the magnifying glass.
  • HTML:
    • Type in the text-box to the right of View HTML
    • Press enter or click the magnifying glass.
  • URL: Go to Special:XowaSearch

Canceling

Searches can be canceled by pressing the cancel button to the right of the search button.

Options

The following options are available:

  • Wikis: Wikis to search, separated by a pipe (|) character; EX: en.wikipedia.org|simple.wikipedia.org
  • Namespaces: Namespaces to search, separated by a comma; EX: 0,4
  • Results per page: Number of results per page; EX: 50
  • Expand pages: Whether to expand page sections when retrieving results. If 'n', pages will be collapsed; if 'y', pages will be expanded
  • Expand snips: Whether to expand snip sections when retrieving results. If 'n', snips will be collapsed; if 'y', snips will be expanded
  • Show all snips: Whether to show all snips when retrieving results. If 'n', only the first snip will show; if 'y', all snips will show

Multiple wikis

In addition, for multiple wikis, options can be specified per wiki using the pipe character. For example:

  • If Wikis is en.wikipedia.org|simple.wikipedia.org
  • And Results per page is 20|10
  • Then en.wikipedia.org will retrieve 20 results per page and simple.wikipedia.org will retrieve 10

If fewer values than wikis are given, the last value is used for the remaining wikis. For example:

  • If Wikis is en.wikipedia.org|simple.wikipedia.org|home
  • And Results per page is 20|10
  • Then home will retrieve 10 results per page
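
The per-wiki rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described on this page, not XOWA's actual code; the function name resolve_per_wiki is hypothetical.

```python
def resolve_per_wiki(wikis, option):
    """Pair each wiki with its option value (hypothetical helper, not XOWA code)."""
    wiki_list = wikis.split("|")
    values = option.split("|")
    resolved = {}
    for i, wiki in enumerate(wiki_list):
        # Wikis beyond the end of the value list reuse the last value.
        resolved[wiki] = values[min(i, len(values) - 1)]
    return resolved


settings = resolve_per_wiki("en.wikipedia.org|simple.wikipedia.org|home", "20|10")
for wiki, per_page in settings.items():
    print(wiki, per_page)
```

Here home is given no value of its own, so it picks up 10, the last value in the list.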

Search engine types

XOWA supports two types of full-text search engines: XOWA Wikitext and Lucene HTML.

The following table illustrates the high-level differences.

function | XOWA Wikitext | Lucene HTML
availability | Wikitext wikis (Import Online / Offline) | HTML wikis (Download Central)
how it works | Opens every page and scans wikitext | Searches precompiled Lucene indexes
speed | slower: small wikis will be subsecond, but en.wikipedia.org searches can take 1+ hour for each search | fast: en.wikipedia.org searches can execute in less than a second
disk space | no additional space is needed | additional space is needed. en.wikipedia.org will use at least 9 GB
syntax | uses same syntax as title search. See App/Search | uses Lucene syntax. See the lucene search page as well as below

Options

Options can be configured at Special:XowaCfg?grp=xowa.addon.fulltext search

In addition, the Special:XowaSearch page also has a copy of the more-frequently used options.


Lucene search syntax

The best reference for Lucene syntax is probably the lucene search page. The following is an edited version of that page.

Fields

XOWA uses one searchable field: body.

The body field holds the HTML of a page without the markup. For example, <span title='some more words'>word</span> is reduced to word only; span, title, some, more, and words are ignored.
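
As a rough illustration of what "HTML without the markup" means, the following Python sketch keeps only text nodes and drops tag names and attribute values. It uses Python's standard html.parser module and is an assumption about the effect, not a description of how XOWA builds its index; the names BodyTextExtractor and body_text are hypothetical.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text nodes only; tag names and attribute values are dropped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags, never for attributes such as title='...'.
        self.parts.append(data)


def body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


text = body_text("<span title='some more words'>word</span>")
print(text)
```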

In addition, XOWA uses three other fields: page_id, title, and page_score. These are included for system purposes only.

Wildcards

Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).

  • To perform a single character wildcard search use the "?" symbol. For example, E?rth
  • To perform a multiple character wildcard search use the "*" symbol. For example, Ear*
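
The meaning of the two wildcard symbols can be sketched by translating a pattern into a regular expression. This Python example is only a model of the matching rule ("?" stands for exactly one character, "*" for zero or more); it does not reflect how Lucene evaluates wildcards internally, and the helper names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a ?/* wildcard pattern into a compiled regular expression."""
    pieces = []
    for ch in pattern:
        if ch == "?":
            pieces.append(".")      # exactly one character
        elif ch == "*":
            pieces.append(".*")     # zero or more characters
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces))


def matches(pattern, term):
    # The whole term must match, not just a part of it.
    return wildcard_to_regex(pattern).fullmatch(term) is not None


print(matches("E?rth", "Earth"))
print(matches("Ear*", "Earthquake"))
```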

Fuzzy Searches

Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance, algorithm. To do a fuzzy search, use the tilde, "~", symbol at the end of a single-word term. For example, to search for a term similar in spelling to "roam", use the fuzzy search: roam~

This search will find terms like foam and roams.

An additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. For example: roam~0.8

The default that is used if the parameter is not given is 0.5.
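
For readers unfamiliar with the Levenshtein (edit) distance, the following Python sketch computes it with the standard dynamic-programming approach. It shows only the distance itself; how Lucene converts that distance into the 0 to 1 similarity value is not covered here, and the function name edit_distance is hypothetical.

```python
def edit_distance(a, b):
    """Number of single-character inserts, deletes, or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


print(edit_distance("roam", "foam"))
print(edit_distance("roam", "roams"))
```

Both foam and roams are one edit away from roam, which is why a fuzzy search for roam~ can find them.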

Proximity Searches

Lucene supports finding words that are within a specific distance of each other. To do a proximity search, use the tilde, "~", symbol at the end of a phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search:

"jakarta apache"~10

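The idea behind a proximity search can be sketched as a check on word positions. This Python example is a simplified reading of "within N words of each other" (it compares position differences on whitespace-split text) and is not Lucene's actual distance calculation; the function name within_distance and the sample sentence are hypothetical.

```python
def within_distance(text, word_a, word_b, max_distance):
    """True if some occurrence of word_a is at most max_distance positions from word_b."""
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == word_a]
    positions_b = [i for i, w in enumerate(words) if w == word_b]
    return any(abs(i - j) <= max_distance
               for i in positions_a
               for j in positions_b)


sample = "Jakarta was a project hosted by Apache"
print(within_distance(sample, "jakarta", "apache", 10))
print(within_distance(sample, "jakarta", "apache", 3))
```
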
Boosting a Term

Lucene provides the relevance level of matching documents based on the terms found. To boost a term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.

Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for

jakarta apache

and you want the term "jakarta" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:

jakarta^4 apache

This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in the example:

"jakarta apache"^4 "Apache Lucene"

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).
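
The boost syntax can be illustrated by splitting a query into terms and boost factors, with 1 as the default. The Python sketch below also includes a toy score (boost multiplied by word count) purely to show why a higher boost ranks a document higher; it is not Lucene's scoring formula, it handles single-word terms only in the score, and the names parse_boosts and toy_score are hypothetical.

```python
import re

# A term is either a quoted phrase or a run of non-space characters,
# optionally followed by ^ and a positive number.
TOKEN = re.compile(r'("[^"]*"|[^\s^]+)(?:\^(\d+(?:\.\d+)?))?')


def parse_boosts(query):
    """Return (term, boost) pairs; terms without ^ get the default boost of 1."""
    result = []
    for match in TOKEN.finditer(query):
        term, boost = match.group(1), match.group(2)
        result.append((term.strip('"'), float(boost) if boost else 1.0))
    return result


def toy_score(query, document_words):
    """Toy relevance: sum of boost * occurrences (NOT Lucene's real formula)."""
    return sum(boost * document_words.count(term)
               for term, boost in parse_boosts(query))


print(parse_boosts('jakarta^4 apache'))
print(parse_boosts('"jakarta apache"^4 "Apache Lucene"'))
print(toy_score("jakarta^4 apache", ["jakarta", "news"]))
print(toy_score("jakarta^4 apache", ["apache", "news"]))
```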

Boolean Operators

Boolean operators allow terms to be combined through logic operators. Lucene supports AND, "+", OR, NOT and "-" as Boolean operators. (Note: Boolean operators must be ALL CAPS.)

OR

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exists in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word OR.

To search for documents that contain either jakarta apache or just jakarta use the query:

"jakarta apache" jakarta

or

"jakarta apache" OR jakarta


AND

The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets. The symbol && can be used in place of the word AND.

To search for documents that contain "jakarta apache" and "Apache Lucene" use the query:

"jakarta apache" AND "Apache Lucene"

+

The "+" or required operator requires that the term after the "+" symbol exist somewhere in the field of a single document.

To search for documents that must contain "jakarta" and may contain "lucene" use the query:

+jakarta lucene

NOT

The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol ! can be used in place of the word NOT.

To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:

"jakarta apache" NOT "Apache Lucene"

Note: The NOT operator cannot be used with just one term. For example, the following search will return no results:

NOT "jakarta apache"

-

The "-" or prohibit operator excludes documents that contain the term after the "-" symbol.

To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:

"jakarta apache" -"Apache Lucene"

Grouping

Lucene supports using parentheses to group clauses to form sub queries. This can be very useful if you want to control the boolean logic for a query.

To search for either "jakarta" or "apache" and "website" use the query:

(jakarta OR apache) AND website

This eliminates any confusion and makes sure that website must exist and that either jakarta or apache may exist.
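
The set equivalences mentioned for the Boolean operators (OR as union, AND as intersection, NOT as difference) and the grouped query above can be sketched with Python sets. The term-to-page mapping below is made-up sample data, not output from XOWA or Lucene.

```python
# Hypothetical index: each term maps to the set of page ids that contain it.
index = {
    "jakarta": {1, 2},
    "apache": {2, 3},
    "website": {3, 4},
    "lucene": {2, 5},
}

either = index["jakarta"] | index["apache"]        # jakarta OR apache  (union)
both = index["jakarta"] & index["apache"]          # jakarta AND apache (intersection)
without = index["jakarta"] - index["lucene"]       # jakarta NOT lucene (difference)

# (jakarta OR apache) AND website: the parentheses decide what is combined first.
grouped = (index["jakarta"] | index["apache"]) & index["website"]

print(sorted(either), sorted(both), sorted(without), sorted(grouped))
```

Without the parentheses the grouping would be ambiguous; with them, website is required and either of the other two terms is enough.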

Escaping Special Characters

Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these characters, use the \ before the character. For example, to search for (1+1):2 use the query:

\(1\+1\)\:2
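
When building queries programmatically, the escaping rule can be applied with a small helper. The Python sketch below puts a backslash before every character from the list above; as a simplifying assumption it also escapes a single & or |, even though only the doubled forms && and || are listed as special. The name escape_query is hypothetical.

```python
# Characters taken from the list above; & and | are escaped one at a time.
SPECIAL_CHARACTERS = set('+-&|!(){}[]^"~*?:\\')


def escape_query(text):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch for ch in text)


escaped = escape_query("(1+1):2")
print(escaped)
```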
