The UK Biobank is a resource that includes detailed health-related and genetic data on about 500,000 individuals and is available to the research community. ukbtools removes all the upfront data wrangling required to get a single dataset for statistical analysis, and provides tools to assist in quality control, query of disease diagnoses, and retrieval of genetic metadata.
Download and decrypt your data with the supplied helper programs. To use ukbtools, you need to create a UKB fileset (.tab, .r, and .html):
ukb_unpack decrypts your downloaded ukbxxxx.enc file, outputting a ukbxxxx.enc_ukb file. ukb_conv with the r flag converts the decrypted data to a tab-delimited file ukbxxxx.tab and an R script ukbxxxx.r that reads the tab file. The docs flag creates an html file containing a field-code-to-description table (among others).
Note. Full details of the data download and decrypt process are given in the Using UK Biobank Data documentation. Updated versions of these helper programs exist. Other than small name changes (underscores removed), they appear to function similarly.
In R,
# Install from CRAN
install.packages("ukbtools")
# Install latest development version
devtools::install_github("kenhanscombe/ukbtools", build_vignettes = TRUE, dependencies = TRUE)
The function ukb_df() takes the stem of your fileset and returns a dataframe with usable column names.
You can also specify the path to your fileset if it is not in the current directory. For example, if your fileset is in a subdirectory of the working directory called data, pass that directory to ukb_df() as the path argument.
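As a minimal sketch of the subdirectory case, the snippet below builds the path with base R and only attempts the read when ukbtools is installed and the three fileset files are actually present. The stem "ukbxxxx" is the placeholder used throughout this document, and the guard itself is an illustrative assumption rather than something the package requires.

```r
# Path to a "data" subdirectory of the current working directory.
fileset_dir <- file.path(getwd(), "data")

# Check that the three files of the (placeholder) fileset are all there.
fileset_files <- file.path(fileset_dir, paste0("ukbxxxx", c(".tab", ".r", ".html")))
fileset_present <- all(file.exists(fileset_files))

# Read the fileset only if the package and the files are available.
if (requireNamespace("ukbtools", quietly = TRUE) && fileset_present) {
  my_ukb_data <- ukbtools::ukb_df("ukbxxxx", path = fileset_dir)
}
```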
Use ukb_df_field to create a field code-to-descriptive name key, as a dataframe or named lookup vector.
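To illustrate what a named lookup vector offers over a dataframe key, here is a base R sketch on a toy key. The dataframe and its column names (field_code, col_name) are hypothetical stand-ins, not the structure ukb_df_field actually returns.

```r
# Toy field key (hypothetical column names, for illustration only).
field_key <- data.frame(
  field_code = c("31", "21001"),
  col_name = c("sex_0_0", "body_mass_index_bmi_0_0"),
  stringsAsFactors = FALSE
)

# Named lookup vector: names are field codes, values are descriptive names.
lookup <- setNames(field_key$col_name, field_key$field_code)

# Look up the descriptive name for one field code.
bmi_name <- lookup[["21001"]]
```

A named vector makes single lookups by field code concise, while the dataframe form is easier to filter or join.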
Note. You can move the three files in your fileset after creating them with ukb_conv, but they should be kept together. ukb_df() automatically updates the read call in the R source file to point to the correct directory (the current directory by default, or the directory specified by path).
Memory and efficiency
To reduce your memory usage, you could save your new UKB dataset with save(my_ukb_data, file = "my_ukb_data.rda"). Load the dataset with load("my_ukb_data.rda"). A UKB dataset from my largest UKB fileset, which included a 2.6 GB .tab file, took a little under 2 minutes to create with ukb_df. The associated .rda file was 138 MB and loaded in a little under 1.5 minutes.
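The save and load round trip can be tried on a small stand-in before committing to a full dataset. The sketch below uses a toy dataframe with made-up values and writes to a temporary directory. The object name my_ukb_data follows the text above.

```r
# Toy stand-in for a UKB dataset (values are made up).
my_ukb_data <- data.frame(
  eid = 1:3,
  body_mass_index_bmi_0_0 = c(24.1, 31.5, NA)
)

# Save to an .rda file in a temporary directory.
rda_file <- file.path(tempdir(), "my_ukb_data.rda")
save(my_ukb_data, file = rda_file)

# Keep a copy for comparison, remove the object, then restore it from disk.
original <- my_ukb_data
rm(my_ukb_data)
load(rda_file)

# The restored object should match the original; file.size() reports bytes on disk.
round_trip_ok <- identical(original, my_ukb_data)
rda_bytes <- file.size(rda_file)
```

Note that load() restores the object under the name it was saved with, so it will overwrite any existing object called my_ukb_data.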
If you have multiple UKB downloads, first read each one in, then merge them with your preferred method. You could use ukb_df_full_join, which is a thin wrapper around dplyr::full_join applied recursively with purrr::reduce.
ukbxxxx_data <- ukb_df("ukbxxxx")
ukbyyyy_data <- ukb_df("ukbyyyy")
ukbzzzz_data <- ukb_df("ukbzzzz")
ukb_df_full_join(ukbxxxx_data, ukbyyyy_data, ukbzzzz_data)
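To make the "full join applied recursively" idea concrete, here is a base R analogue using Reduce and merge on toy tables. This is a sketch of the technique, not the implementation of ukb_df_full_join. The helper name full_join_all, the column names, and the values are all hypothetical.

```r
# Toy stand-ins for three UKB datasets (hypothetical columns and values).
ukbxxxx_data <- data.frame(eid = c(1, 2, 3), var_a_0_0 = c("F", "M", "F"))
ukbyyyy_data <- data.frame(eid = c(2, 3, 4), var_a_0_0 = c("M", "F", "M"),
                           var_b_0_0 = c(27.1, 22.4, 30.2))
ukbzzzz_data <- data.frame(eid = c(1, 4, 5), var_c_0_0 = c(160, 175, 182))

# Recursive full join on "eid": merge(all = TRUE) keeps rows from both sides,
# and Reduce folds the pairwise merge over the list of tables.
full_join_all <- function(...) {
  Reduce(function(x, y) merge(x, y, by = "eid", all = TRUE), list(...))
}

merged <- full_join_all(ukbxxxx_data, ukbyyyy_data, ukbzzzz_data)

# Every eid from every table is retained. The non-key column shared by the
# first two tables is disambiguated with ".x" and ".y" suffixes.
merged_names <- names(merged)
```

Like dplyr::full_join, base merge appends ".x" and ".y" to non-key columns that appear in both inputs, so the toy result shows the same kind of repeated variable names.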
Repeated variables.
The join key is set to "eid" only (default value of the by parameter). Any additional variables common to any two tables will have ".x" and ".y" appended to their names. If you are satisfied the additional variables are identical to the original, the copies can be safely deleted. For example, if setequal(my_ukb_data$var, my_ukb_data$var.x) is TRUE, then my_ukb_data$var.x can be dropped. A dplyr::full_join is like the set operation union in that all observations from all tables are included, i.e., all samples are included even if they are not included in all datasets.

Repeated variable names within UKB datasets are unlikely to occur. ukb_df creates variable names by combining a snake_case descriptor with the variable's field code, index and array. This should be sufficient to uniquely identify the variable. However, if an index_array combination is incorrectly repeated in the original UKB data, this will result in a duplicated variable name. We observed two instances. The variables were encoded with the index_array suffixes -0.0, -1.0, -1.0, and ukb_df created variables named var_0_0, var_1_0, var_1_0. This is probably a typo that should have been -0.0, -1.0, -2.0, consistent with UKB official documentation describing the field as having 3 values for index. We have provided ukb_df_duplicated_names to identify duplicated names within a dataset. This will allow the user to make changes as appropriate. We expect the occurrence of such duplicates will be rare.
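The two checks described in this section can be sketched in base R on toy inputs. The dataframe, its values, and the vector of column names are hypothetical. The duplicate detection shown is a generic approach and is not claimed to be how ukb_df_duplicated_names works internally.

```r
# 1. Dropping a repeated variable after a join (toy data).
my_ukb_data <- data.frame(eid = 1:3, var = c(1, 2, 3), var.x = c(1, 2, 3))

# The check from the text: drop the copy only if it holds the same set of values.
if (setequal(my_ukb_data$var, my_ukb_data$var.x)) {
  my_ukb_data$var.x <- NULL
}

# 2. Finding duplicated column names (hypothetical names with a repeated index_array).
col_names <- c("eid", "var_0_0", "var_1_0", "var_1_0")

# duplicated() flags the second and later occurrences of each name.
duplicated_names <- unique(col_names[duplicated(col_names)])
```

Since setequal compares sets of values rather than values row by row, a stricter check such as identical() on the two columns may be preferable when row order matters.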
ukbxxxx.tab, ukbxxxx.r, ukbxxxx.html
A minimal example fileset is included with the package, in the subdirectory inst/extdata. This fileset allows the user to test the read (ukb_df, ukb_df_field) and summarise (ukb_context) functionality.
# To load the example data
path_to_example_data <- system.file("extdata", package = "ukbtools")
df <- ukb_df("ukbxxxx", path = path_to_example_data)
# To create a field code to name key
df_field <- ukb_df_field("ukbxxxx", path = path_to_example_data)
The full path to the raw test data can be retrieved with system.file("extdata", "ukbxxxx.tab", package = "ukbtools").
As an exploratory step, you might want to look at the demographics of a particular subset of the UKB sample relative to a reference sample. For example, using the nonmiss.var argument of ukb_context will produce a plot of the primary demographics (sex, age, ethnicity, and Townsend deprivation score), plus employment status and assessment centre, for the subsample with data on your variable of interest compared to those without data (i.e. NA).
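The split that this comparison rests on (individuals with data versus those with NA) can be previewed in base R before plotting. The sketch below uses a toy dataframe, and the column name variable_of_interest is hypothetical. It only counts the two groups and does not call ukb_context.

```r
# Toy data: three individuals have a value, three are missing (NA).
my_ukb_data <- data.frame(
  eid = 1:6,
  variable_of_interest = c(2.3, NA, 1.8, NA, NA, 4.0)
)

# TRUE where the variable is observed, FALSE where it is NA.
has_data <- !is.na(my_ukb_data$variable_of_interest)

# Sizes of the subset (has data) and the reference sample (no data).
group_sizes <- table(ifelse(has_data, "subset", "reference"))
```

Checking the group sizes first can flag comparisons where one group is too small for the demographic plots to be informative.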
It is also possible to supply a logical vector with subset.var to define the subset and reference sample. This is particularly useful for understanding a subgroup within the UKB study, e.g., overweight individuals.
subgroup_of_interest <- (my_ukb_data$body_mass_index_bmi_0_0 >= 25)
ukb_context(my_ukb_data, subset.var = subgroup_of_interest)
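A comparison such as BMI >= 25 returns NA wherever BMI itself is missing, so the logical vector can hold three states rather than two. The base R sketch below, on toy values, counts them. How ukb_context treats NA entries in subset.var is not covered here, so inspecting the vector first is offered as a precaution rather than a requirement.

```r
# Toy data with one missing BMI value.
my_ukb_data <- data.frame(
  eid = 1:5,
  body_mass_index_bmi_0_0 = c(22.0, 27.5, NA, 31.2, 24.9)
)

# TRUE for BMI >= 25, FALSE for BMI < 25, NA where BMI is missing.
subgroup_of_interest <- (my_ukb_data$body_mass_index_bmi_0_0 >= 25)

# Count each state, including NA.
subgroup_counts <- table(subgroup_of_interest, useNA = "ifany")
n_in_subgroup <- sum(subgroup_of_interest, na.rm = TRUE)
n_undefined <- sum(is.na(subgroup_of_interest))
```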