App/Search/Score

From XOWA: the free, open-source, offline wiki application

XOWA calculates a score for every page in order to rank search results.

Overview

From a broad perspective, the following happens:

  • A PageRank score is calculated for the page. This score is scaled from 0 to 1,000,000.
  • A page-length score is then calculated for the page. This score is ranked from 0 to 1,000,000.
  • If the page has a low page-length score, its PageRank score is multiplied by a ratio.
  • The resulting score is then ranked from 0 to 1,000,000.


Scaling / Ranking

XOWA uses "scaling" and "ranking" at various stages to calculate the score.

Scaling

A simplified definition of scaling is converting a number from one range to another range in proportion. For a more thorough definition, see the Wikipedia page on feature scaling.

For example, let's say you have a score of 100 in a range of 0 to 400 and want to scale it to 0 to 1000. The following steps would be involved:

  • Take 100 and divide it by 400. This yields 0.25.
  • Take 0.25 and multiply it by 1000. This yields 250.

The following formula is the basis for scaling:

newScore = \frac{oldScore - \text{min}(oldRange)} {\text{max}(oldRange)-\text{min}(oldRange)} \cdot (\text{max}(newRange)-\text{min}(newRange)) + \text{min}(newRange)

Or, to use the example from above:

250 = \frac{100 - 0} {400-0} \cdot (1000-0) + 0
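
As a minimal sketch of the scaling step, the formula can be written as a small Python function. The name `scale` and its parameters are hypothetical and are not taken from XOWA's source. The sketch also adds the new range's minimum at the end, which is zero in the example.

```python
def scale(old_score, old_min, old_max, new_min, new_max):
    """Map old_score from [old_min, old_max] to [new_min, new_max] by proportion."""
    if old_max == old_min:
        # Assumption: an empty old range is treated as an error.
        raise ValueError("old range must have a nonzero width")
    fraction = (old_score - old_min) / (old_max - old_min)
    return fraction * (new_max - new_min) + new_min


# The example from above: 100 in a range of 0 to 400, scaled to 0 to 1000.
print(scale(100, 0, 400, 0, 1000))  # 250.0
```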

Ranking

A simplified definition of ranking is assigning a number to a value based on its position in an ordered population of numbers. For those familiar with a school setting, this is "grading on a curve". For a more thorough definition, see the Wikipedia page on percentile ranks.

For example, let's say you have the following:

  • A minimum score of 0
  • A maximum score of 100
  • 5 entities with the following scores
    • A : 99
    • B : 10
    • C : 42
    • D : 71
    • E : 56

Ranking would do the following:

  • Sort the scores
    • A : 99
    • D : 71
    • E : 56
    • C : 42
    • B : 10
  • Calculate the "interval" between scores by taking the maximum and dividing by the number of scores.
    • In this case, the interval is 100 / 5 = 20
  • Assign each score a new score based on its position in the sort: the lowest score receives one interval, and each higher position adds another interval.
    • A : 100
    • D : 80
    • E : 60
    • C : 40
    • B : 20
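
The ranking steps above can be sketched in Python as follows. The function name `rank` and the dictionary input are hypothetical. How ties are handled is an assumption: here, equal scores simply keep their original relative order.

```python
def rank(scores, max_score):
    """Return {name: ranked score} for a {name: raw score} mapping."""
    if not scores:
        return {}
    # One "interval" per entity: the maximum divided by the number of scores.
    interval = max_score / len(scores)
    # Sort names by raw score, highest first.
    ordered = sorted(scores, key=scores.get, reverse=True)
    count = len(ordered)
    # The highest score gets count * interval; the lowest gets one interval.
    return {name: interval * (count - position)
            for position, name in enumerate(ordered)}


raw = {"A": 99, "B": 10, "C": 42, "D": 71, "E": 56}
print(rank(raw, 100))
# {'A': 100.0, 'D': 80.0, 'E': 60.0, 'C': 40.0, 'B': 20.0}
```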


Calculation

PageRank

The basis of XOWA's page score is PageRank.

In brief, PageRank will give high scores to pages which are:

  1. linked to by many pages
  2. linked to by pages which have a high score.

Note that #2 is recursive (a linking page will itself have a high score only if it is linked to by many pages, or by pages with a high score). For more info, a good starting point is the Wikipedia page on PageRank.

After XOWA calculates the PageRank, XOWA then scales this score to a range of 0 to 1,000,000.
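
For illustration only, here is a textbook power-iteration PageRank followed by a scale to the range 0 to 1,000,000. This is not XOWA's implementation: the function names, the damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are all assumptions. The sketch also assumes every link target appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Assumption: pages with no outgoing links share their score with all pages.
        dangling = sum(score[page] for page in pages if not links[page])
        new_score = {page: (1.0 - damping) / n + damping * dangling / n
                     for page in pages}
        # Each page passes an equal share of its score to every page it links to.
        for page in pages:
            for target in links[page]:
                new_score[target] += damping * score[page] / len(links[page])
        score = new_score
    return score


def scale_to_million(score):
    """Scale raw scores proportionally to the range 0 to 1,000,000."""
    low, high = min(score.values()), max(score.values())
    if high == low:
        return {page: 0.0 for page in score}
    return {page: (value - low) / (high - low) * 1_000_000
            for page, value in score.items()}


# Hypothetical link graph: three pages link to "Earth", one links to "Moon".
links = {"Earth": ["Moon"], "Moon": ["Earth"], "Mars": ["Earth"], "Stub": ["Earth"]}
scaled = scale_to_million(pagerank(links))
print(max(scaled, key=scaled.get))  # Earth
```

In this graph "Moon" scores well despite having a single incoming link, because that link comes from the highest-scoring page. This is the recursive effect described in #2.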

Short pages are penalized

XOWA penalizes short pages. This reduces the effect of small stub pages which are linked to by many articles, but mostly from boilerplate navigation boxes.

XOWA ranks pages based on page length. The generated score is in a range from 0 to 1,000,000.

Currently the method is:

  • If the page is in the bottom 60% of page lengths...
  • Then multiply the page score by the page's page-length percentile.

For example, a page whose length is at the 10th percentile and whose score is 1,000 will end up with a score of 100 (1,000 * 10%). In contrast, a page whose length is at the 65th percentile and whose score is 9,000 will keep its score of 9,000, because it is above the 60% cutoff.
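
The penalty can be sketched as a short Python function. The name `penalize_short_page` and its parameters are hypothetical. The sketch assumes the page-length rank has already been converted to a fraction between 0.0 and 1.0 (for example, by dividing a 0 to 1,000,000 rank by 1,000,000), and that a page exactly at the cutoff is not penalized.

```python
def penalize_short_page(page_score, length_percentile, cutoff=0.60):
    """Reduce the score of pages whose length falls below the cutoff.

    length_percentile is the page's length rank as a fraction (0.0 to 1.0).
    """
    if length_percentile < cutoff:
        # Short page: keep only that fraction of the score.
        return page_score * length_percentile
    # Long enough: the score is left unchanged.
    return page_score


print(penalize_short_page(1000, 0.10))  # 100.0
print(penalize_short_page(9000, 0.65))  # 9000
```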

Note that this calculation is an ad-hoc creation and will probably change in the future.

Scores are re-ranked

The final step is to take the modified score and rank it from 0 to 1,000,000. Note that this final score is an integer (not a decimal / float).
