Dev/Command-line/Thumbs
XOWA can build complete wikis with the following:
- All images downloaded offline
- All pages compiled into HTML (pages will load faster)
This process is run by a custom command-line make script.
Please note that this script is for power users, not casual users. Please read through these instructions carefully: if you fail to follow them, you may end up downloading millions of images by accident and have your IP address banned by Wikimedia. Also, the script will change in the future without warning; there is no backward compatibility. Although the XOWA databases have a fixed format, the scripts do not. If you discover that your script breaks, please refer to this page, contact me for assistance, or go through the code.
Overview
The make script works in the following way:
- Loads the wikitext for a page.
- Converts the wikitext to HTML and saves it.
- Gathers a list of [[File]] links.
- Repeats for each page until there are no more pages.
- Downloads the gathered [[File]] links to create the XOWA file databases.
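Concretely, every build script below follows the same pattern: set a few app-level options, queue a list of build commands for one or more wikis, then run the queue. The following is a minimal sketch of that shape only; the domain and command list are placeholders, and the full, working scripts are given further down.
app.bldr.pause_at_end_('n');    // do not pause at the end of the run
app.bldr.cmds {                 // queue build commands for a wiki
  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
  add ('simple.wikipedia.org' , 'text.init');
}
app.bldr.run;                   // run the queued commands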
Process
- Open up a terminal
  - On Windows, run cmd
  - On Linux / Mac OS X, run the Terminal app
- Change to the xowa root directory
  - For example, if xowa is set up in C:\xowa, run cd C:\xowa
- Create a text file called make_xowa.gfs in your xowa root folder with a text editor
  - For Windows, Notepad++ is recommended, or any other text editor that does not use Windows line endings. (Do not use Notepad.)
  - For other systems, you can use a text editor like Atom, jEdit, or whatever you're most comfortable with
- Copy each of the scripts below into the text file
- Run the following command, making sure the jar path and jar file match your setup (a Linux / Mac OS X example is shown after this list):
  java -jar C:\xowa\xowa_windows_64.jar --app_mode cmd --cmd_file C:\xowa\make_xowa.gfs --show_license n --show_args n
- Wait for the script to complete
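On Linux or Mac OS X, the same invocation applies with that platform's jar and paths. The following is only a sketch: the jar name and install path depend on your download and setup (for example, xowa_linux_64.jar or xowa_macosx_64.jar, following the xowa_windows_64.jar naming pattern).
  java -jar /home/your_name/xowa/xowa_linux_64.jar --app_mode cmd --cmd_file /home/your_name/xowa/make_xowa.gfs --show_license n --show_args n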
Script
The make script should be run in 3 parts:
- make_commons script: Builds commons.wikimedia.org, which is needed to provide image metadata for the download
- make_wikidata script: Builds www.wikidata.org, which is needed for data from {{#property}} calls or Module code
- make_wiki script: Builds the actual wiki
Note that other wikis can re-use the same commons and wikidata databases. For example, if you want to build enwiki and dewiki, you only need to run make_commons and make_wikidata once.
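For example, once make_commons and make_wikidata have completed, a German Wikipedia build would repeat only the make_wiki script with the domain swapped. A minimal sketch follows; the elided commands are identical to the make_wiki script below, with 'simple.wikipedia.org' replaced by 'de.wikipedia.org':
app.bldr.cmds {
  add ('de.wikipedia.org' , 'util.cleanup') {delete_all = 'y';}
  add ('de.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
  // ... the remaining make_wiki commands, each using 'de.wikipedia.org' as the domain ...
}
app.bldr.run;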
make_commons
- Copy the following into make_xowa.gfs
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
app.cfg.set_temp('app', 'xowa.app.web.enabled', 'y');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.text', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.html', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.file', '0');
app.bldr.cmds {
// build commons database; this only needs to be done once, whenever commons is updated
add ('commons.wikimedia.org' , 'util.cleanup') {delete_all = 'y';}
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'pages-articles';}
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'page_props';}
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'image';}
add ('commons.wikimedia.org' , 'text.init');
add ('commons.wikimedia.org' , 'text.page');
add ('commons.wikimedia.org' , 'text.term');
add ('commons.wikimedia.org' , 'text.css');
add ('commons.wikimedia.org' , 'wiki.page_props');
add ('commons.wikimedia.org' , 'wiki.image');
add ('commons.wikimedia.org' , 'file.page_regy') {build_commons = 'y'}
add ('commons.wikimedia.org' , 'wiki.page_dump.make');
add ('commons.wikimedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
add ('commons.wikimedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
}
app.bldr.run;
- Run the script using the process above
- For 2020-02, this script will take about 7 hours to complete and use 125 GB of disk space.
make_wikidata
- Copy the following into make_xowa.gfs
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
app.cfg.set_temp('app', 'xowa.app.web.enabled', 'y');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.text', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.html', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.file', '0');
app.bldr.cmds {
// build wikidata database; this only needs to be done once, whenever wikidata is updated
add ('www.wikidata.org' , 'util.cleanup') {delete_all = 'y';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'pages-articles';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'categorylinks';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'page_props';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'image';}
add ('www.wikidata.org' , 'text.init');
add ('www.wikidata.org' , 'text.page');
add ('www.wikidata.org' , 'text.term');
add ('www.wikidata.org' , 'text.css');
add ('www.wikidata.org' , 'wiki.page_props');
add ('www.wikidata.org' , 'wiki.categorylinks');
add ('www.wikidata.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
}
app.bldr.run;
- Run the script using the process above
- For 2020-02, this script can take about 24 hours to complete and use 250 GB of disk space.
make_wiki
- Copy the following into make_xowa.gfs
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
app.cfg.set_temp('app', 'xowa.app.web.enabled', 'y');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.text', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.html', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.file', '0');
app.bldr.cmds {
// build simple.wikipedia.org
add ('simple.wikipedia.org' , 'util.cleanup') {delete_all = 'y';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'categorylinks';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'page_props';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'image';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pagelinks';} // needed for sorting search results by PageRank
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'imagelinks';}
add ('simple.wikipedia.org' , 'text.init');
add ('simple.wikipedia.org' , 'text.page') {
// calculate redirect_id for #REDIRECT pages. needed for html databases
redirect_id_enabled = 'y';
}
add ('simple.wikipedia.org' , 'text.search');
// upload desktop css
add ('simple.wikipedia.org' , 'text.css');
// upload mobile css
add ('simple.wikipedia.org' , 'text.css') {css_key = 'xowa.mobile'; /* css_dir = 'C:\xowa\user\anonymous\wiki\simple.wikipedia.org-mobile\html\'; */}
add ('simple.wikipedia.org' , 'text.term');
add ('simple.wikipedia.org' , 'wiki.page_props');
add ('simple.wikipedia.org' , 'wiki.categorylinks');
// create local "page" tables in each "text" database for "lnki_temp"
add ('simple.wikipedia.org' , 'wiki.page_dump.make');
// create a redirect table for pages in the File namespace
add ('simple.wikipedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
// create an "image" table to get the metadata for all files in the current wiki
add ('simple.wikipedia.org' , 'wiki.image');
// create an "imagelinks" table to find out which images are used for the wiki
add ('simple.wikipedia.org' , 'wiki.imagelinks');
// parse all page-to-page links
add ('simple.wikipedia.org' , 'wiki.page_link');
// calculate a score for each page using the page-to-page links
add ('simple.wikipedia.org' , 'search.page__page_score') {iteration_max = 100;}
// update link score statistics for the search tables
add ('simple.wikipedia.org' , 'search.link__link_score') {page_rank_enabled = 'y';}
// update word count statistics for the search_word table
add ('simple.wikipedia.org' , 'search.word__link_count');
// cleanup all downloaded files as well as temporary files
add ('simple.wikipedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
// v2 html generator; allows for multi-threaded / multi-machine builds
add ('simple.wikipedia.org' , 'wiki.mass_parse.init') {cfg {ns_ids = '0|4|14|8';}}
// uncomment the next line to resume parsing. See === Resuming === below
// add ('simple.wikipedia.org' , 'wiki.mass_parse.resume');
// NOTE: must change manual_now
add ('simple.wikipedia.org' , 'wiki.mass_parse.exec') {
cfg {
// locks time to a specific value so all pages use the same time when calling Date.Now()
manual_now = '2020-02-01 01:02:03';
// number of threads; set to 1 to skip multi-threaded behavior
num_wkrs = 8;
// enables building full-text search indexes
indexer_enabled = 'y';
// optimization; loads all templates in memory instead of loading each one from disk
load_all_templates = 'y';
// optimization; loads all imglinks in memory instead of loading each one from disk
// an imglink maps a given image (File:Abc.png) to a repo (commons vs local wiki) as well as a rename
load_all_imglinks = 'y';
// number of pages after which XOWA empties cache
cleanup_interval = 50;
// DEPRECATE: uncomment these 2 lines to use custom HTML zip compression
// hzip_enabled = 'y';
// hdiff_enabled ='y';
// uncomment these 3 lines if using the build script as a "worker" helping a "server"
// num_pages_in_pool = 32000;
// mgr_url = '\\server_machine_name\xowa\wiki\en.wikipedia.org\tmp\xomp\';
// wkr_machine_name = 'worker_machine_1'
}
}
// note that if multi-machine mode is enabled, all worker directories must be manually copied to the server directory (a build command will be added later)
add ('simple.wikipedia.org' , 'wiki.mass_parse.make');
// aggregate the lnkis
add ('simple.wikipedia.org' , 'file.lnki_regy');
// generate orig metadata for files in the current wiki (for example, for pages in en.wikipedia.org/wiki/File:*)
add ('simple.wikipedia.org' , 'file.page_regy') {build_commons = 'n';}
// generate all orig metadata for all lnkis
add ('simple.wikipedia.org' , 'file.orig_regy');
// generate list of files to download based on "orig_regy" and XOWA image code
add ('simple.wikipedia.org' , 'file.xfer_temp.thumb');
// aggregate list one more time
add ('simple.wikipedia.org' , 'file.xfer_regy');
// identify images that have already been downloaded
add ('simple.wikipedia.org' , 'file.xfer_regy_update');
// download images. This step may also take a long time, depending on how many images are needed
add ('simple.wikipedia.org' , 'file.fsdb_make') {
commit_interval = 1000; progress_interval = 200; select_interval = 10000;
ns_ids = '0|4|14';
// specify whether original wiki databases are v1 (.sqlite3) or v2 (.xowa)
src_bin_mgr__fsdb_version = 'v1';
// always redownload certain files
src_bin_mgr__fsdb_skip_wkrs = 'page_gt_1|small_size';
// allow downloads from wikimedia
src_bin_mgr__wmf_enabled = 'y';
}
// generate registry of original metadata by file title
add ('simple.wikipedia.org' , 'file.orig_reg');
// drop page_dump tables
add ('simple.wikipedia.org' , 'wiki.page_dump.drop');
}
app.bldr.run;
- Change the manual_now above to match the first day of the current month. For example, if today is 2020-02-16, change it to manual_now = '2020-02-01 01:02:03' (see the sketch after this list).
- Run the script using the process above
- For 2020-02, this script can take about 1 hour to complete and use 5 GB of disk space.
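As another illustration of the "first day of the current month" rule, if the build were run on 2020-03-10, the line inside the wiki.mass_parse.exec cfg block would be edited to read (the value shown here is only an example):
manual_now = '2020-03-01 01:02:03';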
Resuming
The wiki.mass_parse.exec command may take many hours. For English Wikipedia, it can take up to 5 days, even with 8 threads.
During this time, the build can be canceled by any of the following:
- Manual: User presses Ctrl+C
- Unanticipated: Process dies or machine shuts down
To resume the build, apply the following steps:
- Comment out all commands before wiki.mass_parse.exec using a block comment
  - Place a /* before the line with 'util.cleanup'
  - Place a */ after the line with 'wiki.mass_parse.init'
- Uncomment the line for 'wiki.mass_parse.resume'
- Run the command-line again:
  java -jar C:\xowa\xowa_windows_64.jar --app_mode cmd --cmd_file C:\xowa\make_xowa.gfs --show_license n --show_args n
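Put together, the top of the make_wiki command list would look roughly like this while resuming. This is only a sketch; the elided lines are the unchanged commands from the make_wiki script above.
app.bldr.cmds {
/*
add ('simple.wikipedia.org' , 'util.cleanup') {delete_all = 'y';}
// ... every command down to and including ...
add ('simple.wikipedia.org' , 'wiki.mass_parse.init') {cfg {ns_ids = '0|4|14|8';}}
*/
add ('simple.wikipedia.org' , 'wiki.mass_parse.resume');
add ('simple.wikipedia.org' , 'wiki.mass_parse.exec') {
  // ... cfg unchanged from the original run ...
}
// ... remaining commands unchanged ...
}
app.bldr.run;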
Appendix
Requirements
Hardware
You should have a recent-generation machine with relatively high-performance hardware, especially if you're planning to run the make script for English Wikipedia.
For context, here is my current machine setup for generating the image dumps:
- Processor: Intel Core i7-4770K; 3.5 GHz with 8 MB L3 cache
- Memory: 16 GB DDR3 SDRAM, 1600 MHz (PC3 12800)
- Hard Drive: 1TB SSD
- Operating System: openSUSE 13.2
(Note: The hardware was assembled in late 2013.)
For English Wikipedia, it takes about 50 hours for the entire process.
Internet connectivity
You should have a broadband connection to the internet. The script needs to download dump files from Wikimedia, and some dump files (like English Wikipedia's) are in the tens of GB.
Pre-existing image databases for your wiki (optional)
XOWA will automatically re-use the images from existing image databases so that you do not have to redownload them. This is particularly useful for large wikis, where redownloading millions of images would be undesirable.
It is strongly advised that you download the image database for your wiki. You can find a full list here: http://xowa.sourceforge.net/image_dbs.html. Note that if an image database does not exist for your wiki, you can still proceed to use the script.
- If you have v1 image databases, they should be placed in /xowa/file/wiki_domain-prv. For example, English Wikipedia should have /xowa/file/en.wikipedia.org-prv/fsdb.main/fsdb.bin.0000.sqlite3
- If you have v2 image databases, they should be placed in /xowa/wiki/wiki_domain/prv. For example, English Wikipedia should have /xowa/wiki/en.wikipedia.org/prv/en.wikipedia.org-file-ns.000-db.001.xowa
gfs script
The script is written in the gfs format. This is a custom scripting format specific to XOWA. It is similar to JSON, but also supports commenting.
Unfortunately, the error-handling for gfs is quite minimal. When making changes, make them in small steps and be prepared to fall back to backups.
The following is a brief list of rules:
- Comments are made with either "//" (terminated by a newline) or "/*" and "*/". For example: // single-line comment or /* multi-line comment */
- Booleans are "y" and "n" (yes / no or true / false). For example: enabled = 'y';
- Numbers are 32-bit integers and are not enclosed in quotes. For example: count = 10000;
- Strings are surrounded by apostrophes (') or quotes ("). For example: key = 'val';
- Statements are terminated by a semi-colon (;). For example: procedure1;
- Statements can take arguments in parentheses. For example: procedure1('argument1', 'argument2', 'argument3');
- Statements are grouped with curly braces ({}). For example: group {procedure1; procedure2; procedure3;}
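The following short snippet combines these rules into a single hedged example; the group and procedure names are made up purely for illustration:
// a single-line comment
/* a multi-line
   comment */
example_group {
  enabled = 'y';              // boolean
  count = 10000;              // 32-bit integer, no quotes
  key = 'val';                // string
  procedure1;                 // statement
  procedure2('argument1');    // statement with an argument
}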
Terms
lnki
A lnki is short for "link internal". It refers to all wikitext with the double-bracket syntax: [[A]]. A more elaborate example for files would be [[File:A.png|thumb|200x300px|upright=.80]]. Note that the abbreviation was chosen to differentiate it from lnke, which is short for "link external".
For the purposes of the script, all lnki data comes from the wikitext in the current wiki's data dump.
orig
An orig is short for "original file". It refers to the original file metadata.
For the purposes of this script, all orig data comes from commons.wikimedia.org.
xfer
An xfer is short for "transfer file". It refers to the actual file to be downloaded.
fsdb
The fsdb is short for "file system database". It refers to the file as it is stored in the internal table format of the XOWA image databases.
Examples
Simple Wikipedia example with documentation
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
app.cfg.set_temp('app', 'xowa.app.web.enabled', 'y');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.text', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.html', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.file', '0');
app.bldr.cmds {
// build commons database; this only needs to be done once, whenever commons is updated
add ('commons.wikimedia.org' , 'util.cleanup') {delete_all = 'y';}
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'pages-articles';}
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'page_props';}
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'image';}
add ('commons.wikimedia.org' , 'text.init');
add ('commons.wikimedia.org' , 'text.page');
add ('commons.wikimedia.org' , 'text.term');
add ('commons.wikimedia.org' , 'text.css');
add ('commons.wikimedia.org' , 'wiki.page_props');
add ('commons.wikimedia.org' , 'wiki.image');
add ('commons.wikimedia.org' , 'file.page_regy') {build_commons = 'y'}
add ('commons.wikimedia.org' , 'wiki.page_dump.make');
add ('commons.wikimedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
add ('commons.wikimedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
// build wikidata database; this only needs to be done once, whenever wikidata is updated
add ('www.wikidata.org' , 'util.cleanup') {delete_all = 'y';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'pages-articles';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'categorylinks';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'page_props';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'image';}
add ('www.wikidata.org' , 'text.init');
add ('www.wikidata.org' , 'text.page');
add ('www.wikidata.org' , 'text.term');
add ('www.wikidata.org' , 'text.css');
add ('www.wikidata.org' , 'wiki.page_props');
add ('www.wikidata.org' , 'wiki.categorylinks');
add ('www.wikidata.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
// build simple.wikipedia.org
add ('simple.wikipedia.org' , 'util.cleanup') {delete_all = 'y';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'categorylinks';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'page_props';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'image';}
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pagelinks';} // needed for sorting search results by PageRank
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'imagelinks';}
add ('simple.wikipedia.org' , 'text.init');
add ('simple.wikipedia.org' , 'text.page') {
// calculate redirect_id for #REDIRECT pages. needed for html databases
redirect_id_enabled = 'y';
}
add ('simple.wikipedia.org' , 'text.search');
// upload desktop css
add ('simple.wikipedia.org' , 'text.css');
// upload mobile css
add ('simple.wikipedia.org' , 'text.css') {css_key = 'xowa.mobile'; /* css_dir = 'C:\xowa\user\anonymous\wiki\simple.wikipedia.org-mobile\html\'; */}
add ('simple.wikipedia.org' , 'text.term');
add ('simple.wikipedia.org' , 'wiki.page_props');
add ('simple.wikipedia.org' , 'wiki.categorylinks');
// create local "page" tables in each "text" database for "lnki_temp"
add ('simple.wikipedia.org' , 'wiki.page_dump.make');
// create a redirect table for pages in the File namespace
add ('simple.wikipedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
// create an "image" table to get the metadata for all files in the current wiki
add ('simple.wikipedia.org' , 'wiki.image');
// create an "imagelinks" table to find out which images are used for the wiki
add ('simple.wikipedia.org' , 'wiki.imagelinks');
// parse all page-to-page links
add ('simple.wikipedia.org' , 'wiki.page_link');
// calculate a score for each page using the page-to-page links
add ('simple.wikipedia.org' , 'search.page__page_score') {iteration_max = 100;}
// update link score statistics for the search tables
add ('simple.wikipedia.org' , 'search.link__link_score') {page_rank_enabled = 'y';}
// update word count statistics for the search_word table
add ('simple.wikipedia.org' , 'search.word__link_count');
// cleanup all downloaded files as well as temporary files
add ('simple.wikipedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
// OBSOLETE: use v2
// v1 html generator
// parse every page in the listed namespace and gather data on their lnkis.
// this step will take the longest amount of time.
/*
add ('simple.wikipedia.org' , 'file.lnki_temp') {
// save data every # of pages
commit_interval = 10000;
// update progress every # of pages
progress_interval = 50;
// free memory by flushing internal caches every # of pages
cleanup_interval = 50;
// specify how much page text to read into memory at a time, in MB. For example, 25 means read approximately 25 MB of page text into memory
select_size = 25;
// namespaces to parse. See en.wikipedia.org/wiki/Help:Namespaces
ns_ids = '0|4|14';
// enable generation of ".html" databases. This will increase processing time by 20% - 25%
hdump_bldr {
// generate html databases
enabled = 'y';
// compression method for html: 1=none; 2=zip; 3=gz; 4=bz2
zip_tid = 3;
// enable additional custom compression
hzip_enabled = 'y';
// perform extra validation step of custom compression
hzip_diff = 'y';
}
}
*/
// v2 html generator; allows for multi-threaded / multi-machine builds
add ('simple.wikipedia.org' , 'wiki.mass_parse.init') {cfg {ns_ids = '0|4|14|8';}}
add ('simple.wikipedia.org' , 'wiki.mass_parse.exec') {
cfg {
num_wkrs = 8; load_all_templates = 'y'; cleanup_interval = 50; hzip_enabled = 'y'; hdiff_enabled ='y'; manual_now = '2016-08-01 01:02:03';
load_all_imglinks = 'y';
// uncomment the following 3 lines if using the build script as a "worker" helping a "server"
// num_pages_in_pool = 32000;
// mgr_url = '\\server_machine_name\xowa\wiki\en.wikipedia.org\tmp\xomp\';
// wkr_machine_name = 'worker_machine_1'
}
}
// note that if multi-machine mode is enabled, all worker directories must be manually copied to the server directory (a build command will be added later)
add ('simple.wikipedia.org' , 'wiki.mass_parse.make');
// aggregate the lnkis
add ('simple.wikipedia.org' , 'file.lnki_regy');
// generate orig metadata for files in the current wiki (for example, for pages in en.wikipedia.org/wiki/File:*)
add ('simple.wikipedia.org' , 'file.page_regy') {build_commons = 'n';}
// generate all orig metadata for all lnkis
add ('simple.wikipedia.org' , 'file.orig_regy');
// generate list of files to download based on "orig_regy" and XOWA image code
add ('simple.wikipedia.org' , 'file.xfer_temp.thumb');
// aggregate list one more time
add ('simple.wikipedia.org' , 'file.xfer_regy');
// identify images that have already been downloaded
add ('simple.wikipedia.org' , 'file.xfer_regy_update');
// download images. This step may also take a long time, depending on how many images are needed
add ('simple.wikipedia.org' , 'file.fsdb_make') {
commit_interval = 1000; progress_interval = 200; select_interval = 10000;
ns_ids = '0|4|14';
// specify whether original wiki databases are v1 (.sqlite3) or v2 (.xowa)
src_bin_mgr__fsdb_version = 'v1';
// always redownload certain files
src_bin_mgr__fsdb_skip_wkrs = 'page_gt_1|small_size';
// allow downloads from wikimedia
src_bin_mgr__wmf_enabled = 'y';
}
// generate registry of original metadata by file title
add ('simple.wikipedia.org' , 'file.orig_reg');
// drop page_dump tables
add ('simple.wikipedia.org' , 'wiki.page_dump.drop');
}
app.bldr.run;
Script: gnosygnu's actual English Wikipedia script (dirty; provided for reference only)
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
app.cfg.set_temp('app', 'xowa.app.web.enabled', 'y');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.text', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.html', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.file', '0');
app.bldr.cmds {
/*
add ('www.wikidata.org' , 'util.cleanup') {delete_all = 'y';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'pages-articles';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'categorylinks';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'page_props';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'image';}
add ('www.wikidata.org' , 'text.init');
add ('www.wikidata.org' , 'text.page');
add ('www.wikidata.org' , 'text.term');
add ('www.wikidata.org' , 'text.css');
add ('www.wikidata.org' , 'wiki.image');
add ('www.wikidata.org' , 'wiki.page_dump.make');
add ('www.wikidata.org' , 'wiki.page_props');
add ('www.wikidata.org' , 'wiki.categorylinks');
add ('www.wikidata.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
// add ('www.wikidata.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
add ('commons.wikimedia.org' , 'util.cleanup') {delete_all = 'y';}
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'pages-articles';}
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'image';}
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'page_props';}
add ('commons.wikimedia.org' , 'text.init');
add ('commons.wikimedia.org' , 'text.page');
add ('commons.wikimedia.org' , 'text.term');
add ('commons.wikimedia.org' , 'text.css');
add ('commons.wikimedia.org' , 'wiki.image');
add ('commons.wikimedia.org' , 'file.page_regy') {build_commons = 'y'}
add ('commons.wikimedia.org' , 'wiki.page_dump.make');
add ('commons.wikimedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
// add ('commons.wikimedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
add ('en.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
add ('en.wikipedia.org' , 'util.download') {dump_type = 'pagelinks';}
add ('en.wikipedia.org' , 'util.download') {dump_type = 'categorylinks';}
add ('en.wikipedia.org' , 'util.download') {dump_type = 'page_props';}
add ('en.wikipedia.org' , 'util.download') {dump_type = 'image';}
add ('en.wikipedia.org' , 'util.download') {dump_type = 'imagelinks';}
*/
/*
// en.wikipedia.org
add ('en.wikipedia.org' , 'text.init');
add ('en.wikipedia.org' , 'text.page') {redirect_id_enabled = 'y';}
add ('en.wikipedia.org' , 'text.search');
add ('en.wikipedia.org' , 'text.css');
add ('en.wikipedia.org' , 'text.term');
add ('en.wikipedia.org' , 'wiki.image');
add ('en.wikipedia.org' , 'wiki.imagelinks');
add ('en.wikipedia.org' , 'wiki.page_dump.make');
add ('en.wikipedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
add ('en.wikipedia.org' , 'wiki.page_link');
add ('en.wikipedia.org' , 'search.page__page_score') {iteration_max = 100;}
add ('en.wikipedia.org' , 'search.link__link_score') {page_rank_enabled = 'y';
score_adjustment_mgr {
match_mgr {
get(0) {
add('bgn', 'mult', '.999', 'List_of_', 'National_Register_of_Historic_Places_listings_');
add('end', 'mult', '.999', '_United_States_Census');
add('all', 'mult', '.999', 'Copyright_infringement', 'Time_zone', 'Daylight_saving_time');
add('all', 'add' , '0' , 'Animal');
}
}
}
}
add ('en.wikipedia.org' , 'search.word__link_count');
add ('en.wikipedia.org' , 'wiki.page_props');
add ('en.wikipedia.org' , 'wiki.categorylinks');
*/
/*
add ('en.wikipedia.org' , 'file.page_regy') {build_commons = 'n'}
add ('en.wikipedia.org' , 'wiki.mass_parse.init') {cfg {ns_ids = '0|4|100|14|8';}}
// add ('en.wikipedia.org' , 'wiki.mass_parse.resume');
add ('en.wikipedia.org' , 'wiki.mass_parse.exec') {cfg {
num_wkrs = 8; load_all_templates = 'y'; load_ifexists_ns = '*'; cleanup_interval = 25; hzip_enabled = 'y'; hdiff_enabled ='y'; manual_now = '2017-01-01 01:02:03';}
// num_wkrs = 1; load_all_templates = 'n'; load_all_imglnks = 'n'; cleanup_interval = 50; hzip_enabled = 'y'; hdiff_enabled ='y'; manual_now = '2016-07-28 01:02:03';}
}
add ('en.wikipedia.org' , 'wiki.mass_parse.make');
*/
/*
add ('en.wikipedia.org' , 'file.lnki_temp') {
commit_interval = 10000; progress_interval = 50; cleanup_interval = 50; select_size = 25;
ns_ids = '0|4|14|100|12|8|6|10|828|108|118|446|710|2300|2302|2600';
hdump_bldr {enabled = 'y'; hzip_enabled = 'y'; hzip_diff = 'y';}
}
*/
/*
add ('commons.wikimedia.org' , 'file.page_regy') {build_commons = 'y'}
add ('en.wikipedia.org' , 'file.page_regy') {build_commons = 'n';}
add ('en.wikipedia.org' , 'file.lnki_regy');
// add ('en.wikipedia.org' , 'wiki.image');
add ('en.wikipedia.org' , 'file.orig_regy');
add ('en.wikipedia.org' , 'file.xfer_temp.thumb');
add ('en.wikipedia.org' , 'file.xfer_regy');
add ('en.wikipedia.org' , 'file.xfer_regy_update');
*/
/*
add ('en.wikipedia.org' , 'file.fsdb_make') {
commit_interval = 1000; progress_interval = 200; select_interval = 10000;
ns_ids = '0|4|100|14|8';
// // specify whether original wiki databases are v1 (.sqlite3) or v2 (.xowa)
// src_bin_mgr__fsdb_version = 'v2';
// trg_bin_mgr__fsdb_version = 'v1';
// always redownload certain files
src_bin_mgr__fsdb_skip_wkrs = 'page_gt_1|small_size';
// allow downloads from wikimedia
src_bin_mgr__wmf_enabled = 'y';
}
add ('en.wikipedia.org' , 'file.orig_reg');
add ('en.wikipedia.org' , 'wiki.page_dump.drop');
add ('en.wikipedia.org' , 'file.page_file_map.create');
*/
}
app.bldr.run;
Change log
- 2016-10-12: explicitly set web_access_enabled to y
- 2017-02-02: updated script for multi-threaded version and new options
- 2020-02-16: rewrote page to provide more explicit step-by-steps. Moved content to glossary