Code for “Current water contact and <i>Schistosoma mansoni</i> infection have distinct determinants: a data-driven population-based study in rural Uganda”
Version 3 2024-10-07, 16:27
Version 2 2024-10-07, 14:41
Version 1 2024-09-10, 16:29
software
Posted on 2024-10-07, 16:27, authored by Fabian Reitzug, Narcis B. Kabatereine, Anatoli Maranda Byaruhanga, Fred Besigye, Betty Nabatte, Goylette F. Chami
<p dir="ltr">Details on how to rerun the analysis pipeline described in:</p><p dir="ltr">“Current water contact and <i>Schistosoma mansoni</i> infection have distinct determinants: a data-driven population-based study in rural Uganda” by Fabian Reitzug, Narcis B. Kabatereine, Anatol M. Byaruhanga, Fred Besigye, Betty Nabatte, Goylette F. Chami</p><p dir="ltr"><b>System requirements</b></p><p dir="ltr">This code was run on the University of Oxford Biomedical Research Computing (BMRC) high-performance computing cluster on August 24, 2024, using 1 CPU core with 24 GB RAM (approximate run time 24 hours).</p><p dir="ltr">The following software modules on the BMRC cluster were required:</p><ul><li><code>R/4.1.0</code></li><li><code>SQLite/3.38.3-GCCcore-11.3.0</code></li><li><code>PROJ/9.0.0-GCCcore-11.3.0</code></li><li><code>GEOS/3.10.3-GCC-11.3.0</code></li><li><code>GDAL/3.3.0-foss-2021a</code></li><li><code>rgdal/1.5-23-foss-2021a-R-4.1.0</code></li><li><code>MPFR/4.1.0-GCCcore-11.3.0</code></li></ul><h2><b>Installation guide</b></h2><p dir="ltr">To run this code, installation of <code>R &gt;= 4.1.0</code> is required.</p><p dir="ltr">All required R packages are loaded in <code>/code/prep/01_paths_pkgs.R</code> (any packages not already installed can be installed via the <code>install.packages</code> function).</p><p dir="ltr">Typical install time on a normal desktop computer should be less than 30 minutes.</p><h2><b>Demo</b></h2><p dir="ltr"><i>Instructions to run on data</i></p><p dir="ltr">The following changes may be needed to run the scripts successfully on a local computer:</p><ul><li>Set the working directory to the code directory using the <code>setwd</code> command in R.</li><li>Set the directory paths so that they point to the directory where the demo data is located.</li></ul><p dir="ltr">The entire analysis pipeline can be run by executing the <code>/code/RUN.R</code> script, which runs all scripts required to reproduce the results.</p><p 
dir="ltr"><i>Expected output</i></p><ul><li><b>Variable selection output</b>: Outputs from the variable selection process (via likelihood ratio tests and Bayesian variable selection) are saved in the <code>/code/out/var_sel/</code> directory (the variable selection is run on the confidential raw data, so only the selection outputs are publicly available).</li><li><b>Main figures</b>: Figs. 3-9 are written to the <code>/code/out/main/</code> directory (Fig. 1 is not created programmatically, and Fig. 2 uses latitude and longitude columns and requires an external waterbody dataset that is not included with the demo data).</li><li><b>Supplementary tables</b>: All supplementary tables are written to the <code>/code/out/main/s_tabs/</code> directory.</li><li><b>Supplementary figures</b>: All supplementary figures are written to the <code>/code/out/main/s_figs/</code> directory.</li><li><b>Supplementary file</b>: All supplementary figures and tables are compiled into a single document using LaTeX (via the <code>/code/out/s_file/s_file.Rnw</code> script, which generates a PDF saved in the same folder).</li></ul><p dir="ltr"><i>Expected run time for demo on a "normal" desktop computer</i></p><ul><li>The expected run time of the demo is less than two hours.</li></ul><h2><b>Instructions for use</b></h2><p dir="ltr">To run the code on a different dataset with a similar structure, the following modifications would be required:</p><ul><li>The <code>/code/prep/03_read.R</code> file would need to be modified to load the desired datasets.</li><li>All preprocessing steps for the data (subsequent to loading) should be done by scripts in the <code>/code/prep/</code> directory, which is intended to contain all data preparation scripts.</li><li>A data dictionary in the same format as the one saved in <code>/code/dict/</code> (in <code>.csv</code> format) would be required to label the main datasets and specify the variables that should be included in the candidate variable set (this is done 
in the <code>/code/prep/12_dict.R</code> and the <code>/code/prep/13_applylabs.R</code> scripts).</li></ul>
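<p dir="ltr">The installation and demo steps above can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not part of the released code: the helper names <code>missing_packages</code> and <code>run_pipeline</code> are hypothetical, the package vector has to be filled in from <code>/code/prep/01_paths_pkgs.R</code>, and the code directory is a placeholder for wherever the repository was unpacked. Directory paths pointing at the demo data still need to be edited by hand in the scripts themselves.</p>

```r
# Sketch only: helper names are hypothetical and the package list is a
# placeholder for the packages loaded in /code/prep/01_paths_pkgs.R.

# Return the entries of `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Optionally install missing packages, then run the pipeline entry point
# (RUN.R) with the code directory as the working directory.
run_pipeline <- function(code_dir, pkgs = character(0), install = FALSE) {
  stopifnot(dir.exists(code_dir))
  if (!file.exists(file.path(code_dir, "RUN.R"))) {
    stop("RUN.R not found in ", code_dir)
  }

  to_install <- missing_packages(pkgs)
  if (install && length(to_install) > 0) {
    install.packages(to_install)
  }

  # The description asks for the working directory to be the code directory;
  # restore the previous working directory when the function exits.
  old_wd <- setwd(code_dir)
  on.exit(setwd(old_wd), add = TRUE)

  source("RUN.R", echo = FALSE)
  invisible(TRUE)
}
```

<p dir="ltr">A call might then look like <code>run_pipeline("path/to/code", pkgs = my_pkgs, install = TRUE)</code>, where <code>my_pkgs</code> is a character vector assembled by the user. Restoring the working directory on exit is a design choice of this sketch so that an interactive session is left unchanged after the run.</p>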
Funding
A DPhil scholarship was awarded by the Nuffield Department of Population Health (NDPH) to Fabian Reitzug. Grants from the Wellcome Trust Institutional Strategic Support Fund (204826/Z/16/Z), NDPH Pump Priming Fund, John Fell Fund, Robertson Foundation Fellowship, and UKRI EPSRC Award (EP/X021793/1) were awarded to Goylette F. Chami.