App/Category/Internals
This page documents some of the internals of the V2 Category system.
Builder commands
For reference, this is the current script to set up the V2 Category system:
app.bldr.pause_at_end_('n');
app.bldr.cmds
  .add_many('simple.wikipedia.org', 'ctg.hiddencat_sql', 'ctg.hiddencat_ttl', 'ctg.link_sql', 'ctg.link_idx')
  .owner;
app.bldr.run;
Note that 'ctg.link_sql' and 'ctg.link_idx' are required.
Note that 'ctg.hiddencat_sql' and 'ctg.hiddencat_ttl' can be omitted. However, running them is recommended (for English Wikipedia, they add less than 5 minutes to the entire process).
ctg.hiddencat_sql
- This command will look for a file matching *page_props.sql in the wiki directory
- For example: /xowa/wiki/simple.wikipedia.org/simplewiki-latest-page_props.sql. Note that the rows in this .sql file have the format (page_id, prop_name, prop_val)
- It will then parse the .sql file and look for entries having a prop_name of "hiddencat". For example: (1, 'hiddencat', '')
- When it's done, it will generate a Base85 encoded list of all matching page_ids
- The output directory will be /xowa/wiki/simple.wikipedia.org/tmp/ctg.hiddencat_sql/make/
- An example of a file would be:
!!!!#
!!!!$
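The filtering and encoding step described above can be sketched in Python. This is a minimal sketch, not XOWA's actual implementation. It assumes the rows have already been parsed out of the .sql file into (page_id, prop_name, prop_val) tuples, and it assumes the Base85 encoding is a fixed-width, 5-character scheme in which '!' is the digit for zero. That assumption agrees with the examples on this page but is not confirmed here. The function names are hypothetical.

```python
BASE = 85
ZERO = ord("!")  # assumption: '!' is the Base85 digit for 0


def base85_encode(value, width=5):
    """Encode a non-negative int as a fixed-width Base85 string (assumed scheme)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = []
    for _ in range(width):
        value, rem = divmod(value, BASE)
        digits.append(chr(ZERO + rem))
    if value:
        raise ValueError("value does not fit in the given width")
    # Digits were collected least-significant first, so reverse them.
    return "".join(reversed(digits))


def hiddencat_ids(rows):
    """Return the encoded page_ids of rows whose prop_name is 'hiddencat'."""
    return [
        base85_encode(page_id)
        for page_id, prop_name, _prop_val in rows
        if prop_name == "hiddencat"
    ]


# Hypothetical rows in the (page_id, prop_name, prop_val) format.
rows = [(2, "hiddencat", ""), (3, "hiddencat", ""), (4, "defaultsort", "X")]
print("\n".join(hiddencat_ids(rows)))
```

With these sample rows the sketch prints !!!!# and !!!!$, one per line; the third row is skipped because its prop_name is not "hiddencat".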
ctg.hiddencat_ttl
- This command will read the output of ctg.hiddencat_sql and look up the title for each id
- This step is necessary as the category indexes are sorted by title, not by id.
- When it's done, it will generate a sorted list of title|id.
- The output directory will be /xowa/wiki/simple.wikipedia.org/tmp/ctg.hiddencat_ttl/make/
- An example of a file would be:
A|!!!!#
B|!!!!$
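The id-to-title step can be sketched as follows. This is only an illustration under stated assumptions: the Base85 digits are assumed to start at '!' for zero, and the id-to-title lookup is passed in as a plain dictionary, whereas the real command would get titles from the wiki's page data. The function names are hypothetical.

```python
ZERO = ord("!")  # assumption: '!' is the Base85 digit for 0


def base85_decode(text):
    """Decode a Base85 string (assumed scheme) into an int."""
    value = 0
    for ch in text:
        digit = ord(ch) - ZERO
        if not 0 <= digit < 85:
            raise ValueError(f"not a Base85 digit: {ch!r}")
        value = value * 85 + digit
    return value


def title_lines(encoded_ids, titles_by_id):
    """Pair each encoded id with its title and return lines sorted by title."""
    pairs = []
    for encoded in encoded_ids:
        title = titles_by_id.get(base85_decode(encoded))
        if title is not None:  # ids without a known title are skipped
            pairs.append((title, encoded))
    pairs.sort()  # sort by title, since the category indexes are title-ordered
    return [f"{title}|{encoded}" for title, encoded in pairs]


# Hypothetical id-to-title lookup and input ids (deliberately unsorted).
titles = {2: "A", 3: "B"}
print("\n".join(title_lines(["!!!!$", "!!!!#"], titles)))
```

The ids arrive in id order but leave in title order, which is what allows the later merge against title-sorted category data.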
ctg.link_sql
- This command will look for a file matching *categorylinks.sql in the wiki directory
- For example: /xowa/wiki/simple.wikipedia.org/simplewiki-latest-categorylinks.sql.
- It will then parse the .sql file and extract the following data: category_name, page_id, page_member_type, page_sortkey, page_member_add_date
- When it's done, it will generate a sorted list of category|type|sortkey|id|date.
- The output directory will be /xowa/wiki/simple.wikipedia.org/tmp/ctg.link_sql/make/
- An example of a file would be:
A|p|Page_1_sortkey|!!!!%|!!!@!|
B|p|Page_2_sortkey|!!!!^|!!!@@|
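The output format of this step can be sketched in a few lines of Python. The sketch assumes the rows have already been extracted from the .sql file and that page_id and page_member_add_date are already Base85 encoded (how the date is turned into a number is not described on this page). It also assumes the sort order is field by field, left to right. The function name is hypothetical.

```python
def link_lines(rows):
    """Format (category, type, sortkey, id, date) rows as sorted, pipe-terminated lines.

    Sorting tuples compares category first, then type, then sortkey, and so on.
    """
    return ["|".join(row) + "|" for row in sorted(rows)]


# Hypothetical extracted rows, deliberately out of order.
rows = [
    ("B", "p", "Page_2_sortkey", "!!!!^", "!!!@@"),
    ("A", "p", "Page_1_sortkey", "!!!!%", "!!!@!"),
]
print("\n".join(link_lines(rows)))
```

For the two sample rows this reproduces the example file shown above.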
ctg.link_idx
This command will generate the /category2/ hive based on the output of the above commands. It uses the following:
- Category link data as built in /xowa/wiki/simple.wikipedia.org/tmp/ctg.link_sql/make/.
- Category hidden data as built in /xowa/wiki/simple.wikipedia.org/tmp/ctg.hiddencat_ttl/make/.
It will then merge the above data and generate the /main/ and /link/ subdirectories in /category2/
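The merge can be pictured with the following Python sketch, which builds one header row per category from the two inputs. It is an illustration under assumptions, not the actual builder code: the member type codes 'c' (subcategory) and 'f' (file) are guesses (only 'p' appears in the examples on this page), the Base85 scheme is assumed to use '!' for zero, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def base85_encode(value, width=5):
    """Encode a non-negative int as fixed-width Base85 (assumed: '!' is zero)."""
    digits = []
    for _ in range(width):
        value, rem = divmod(value, 85)
        digits.append(chr(ord("!") + rem))
    return "".join(reversed(digits))


def main_rows(link_lines, hidden_lines):
    """Merge link data and hidden-category data into per-category header rows."""
    # Hidden lines look like "title|id"; only the title is needed here.
    hidden = {line.split("|", 1)[0] for line in hidden_lines}
    counts = defaultdict(Counter)
    for line in link_lines:
        category, member_type = line.split("|")[:2]
        counts[category][member_type] += 1
    rows = []
    for category in sorted(counts):
        flag = "y" if category in hidden else "n"
        # Assumed type codes: 'c' = subcategory, 'f' = file, 'p' = page.
        totals = [base85_encode(counts[category][t]) for t in ("c", "f", "p")]
        rows.append("|".join([category, flag] + totals) + "|")
    return rows


links = ["A|p|Page_1_sortkey|!!!!%|!!!@!|", "B|p|Page_2_sortkey|!!!!^|!!!@@|"]
hidden = ["A|!!!!#"]
print("\n".join(main_rows(links, hidden)))
```

Because both inputs are keyed by category title, a real implementation could merge them in a single pass over two sorted streams; the dictionary-based version here trades that efficiency for brevity.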
/category2/
/main/
The main files are located at /xowa/wiki/simple.wikipedia.org/site/category2/main/. They follow the same hive structure as the other directories (a main reg.csv and subdirectories in the format /00/00/00/00/0123456789.xdat).
Each file contains header information for a category. Presently, this includes the following:
- Category name
- Hidden: "y" means hidden; "n" means not hidden
- Number of subcategories (Base85 encoded)
- Number of files (Base85 encoded)
- Number of pages (Base85 encoded)
EX:
A|y|!!!!!|!!!!!|!!!!!|
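A reader for this row format might look like the sketch below. It assumes the Base85 digits start at '!' for zero (so "!!!!!" decodes to 0) and that each row ends with a trailing pipe, as in the example. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CategoryHeader:
    name: str
    hidden: bool
    subcats: int
    files: int
    pages: int


def base85_decode(text):
    """Decode a Base85 string, assuming '!' is the digit for 0."""
    value = 0
    for ch in text:
        value = value * 85 + (ord(ch) - ord("!"))
    return value


def parse_main_row(row):
    """Parse 'name|hidden|subcats|files|pages|' into a CategoryHeader."""
    # The trailing pipe yields one empty field at the end, which is discarded.
    name, hidden, subcats, files, pages, _ = row.strip().split("|")
    return CategoryHeader(
        name=name,
        hidden=(hidden == "y"),
        subcats=base85_decode(subcats),
        files=base85_decode(files),
        pages=base85_decode(pages),
    )


print(parse_main_row("A|y|!!!!!|!!!!!|!!!!!|"))
```

Splitting on the pipe is safe for the category name because MediaWiki titles cannot contain a pipe character.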
/link/
The link files are located at /xowa/wiki/simple.wikipedia.org/site/category2/link/. They also follow the same hive structure as the other directories.
Each file contains members of a category. Presently, this includes the following:
- Category name
- Length of subcategories data
- Length of files data
- Length of pages data
- A series of entries listing category members
  - Note that these entries are broken into subgroups (subcategories / files / pages) depending on the preceding lengths.
  - Each entry is in a semi-colon delimited format
    - page_id (Base85 encoded)
    - page_member_add_date (Base85 encoded)
    - page_sortkey
EX (for entry):
|!!!!%;!!!@!;Page_1_sortkey|
EX (for all):
A|!!!!!|!!!!!|!!!!X|!!!!%;!!!@!;Page_1_sortkey|!!!!^;!!!@@;Page_2_sortkey|
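The row layout can be explored with a small Python parser. This is a sketch under assumptions: Base85 digits are taken to start at '!' for zero, and sortkeys are assumed not to contain a pipe. The three lengths are decoded and returned, but the sketch does not use them to split the members into subcategories / files / pages, because this page does not spell out exactly what the lengths count. All names are hypothetical.

```python
from typing import NamedTuple


class Member(NamedTuple):
    page_id: int
    add_date: int  # decoded number; its date interpretation is not covered here
    sortkey: str


def base85_decode(text):
    """Decode a Base85 string, assuming '!' is the digit for 0."""
    value = 0
    for ch in text:
        value = value * 85 + (ord(ch) - ord("!"))
    return value


def parse_link_row(row):
    """Parse 'name|len|len|len|entry|entry|...|' into (name, lengths, members)."""
    fields = row.strip().split("|")
    name = fields[0]
    lengths = tuple(base85_decode(field) for field in fields[1:4])
    members = []
    for entry in fields[4:]:
        if not entry:  # the trailing pipe leaves one empty field
            continue
        # Split at most twice so a sortkey may itself contain semicolons.
        page_id, add_date, sortkey = entry.split(";", 2)
        members.append(Member(base85_decode(page_id), base85_decode(add_date), sortkey))
    return name, lengths, members


row = "A|!!!!!|!!!!!|!!!!X|!!!!%;!!!@!;Page_1_sortkey|!!!!^;!!!@@;Page_2_sortkey|"
name, lengths, members = parse_link_row(row)
print(name, lengths)
for member in members:
    print(member)
```

Under the assumed encoding, the example row decodes to lengths of 0, 0 and 55, followed by two members with page ids 4 and 61.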