Getting started with xtune

Purpose of this vignette

This vignette is a tutorial on how to use the xtune package to fit differential-shrinkage penalized regression models based on external information.

In this tutorial the following points are going to be viewed:

• The overall idea of xtune model
• Three examples of external information Z
• How to fit the model
• Two special cases of Z

xtune Overview

The main usage of xtune is to tune multiple shrinkage parameters in Lasso and Ridge regression, based on external information.

The classical penalized regression uses a single penalty parameter $$\lambda$$ that applies equally to all regression coefficients to control the amount of regularization in the model. And the single penalty parameter tuning is typically performed using cross-validation.

Here we apply an individual shrinkage parameter $$\lambda_j$$ to each regression coefficient $$\beta_j$$. And the vector of shrinkage parameters $$\lambda s = (\lambda_1,...,\lambda_p)$$ is guided by external information $$Z$$. In specific, $$\lambda$$s is modeled as a log-linear function of $$Z$$. Better prediction accuracy for penalized regression models may be achieved by allowing individual shrinkage for each regression coefficients based on external information.

To tune the differential shrinkage parameter vector $$\lambda s = (\lambda_1,...,\lambda_p)$$, we employ an Empirical Bayes approach by specifying LASSO and Ridge to their random-effect formulation. Once the tuning parameters $$\lambda$$s are estimated, and therefore the penalties known, the regression coefficients are obtained using the glmnet package.

The response variable can be either quantitative or a factor with two levels. For a binary response variable, a sparse LDA procedure is used to get the fisher discriminant direction and the discriminant function for a new data point.

The sparse LDA procedure is described in this paper:

Mai, Qing & Zou, Hui & Yuan, Ming. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika. 99. 10.2307/41720670.

Utilities for carrying out post-fitting summarization and prediction are also provided.

Examples

This file uses three simulated examples to illustrate the usage and syntax of xtune. The first example gives users a general sense of the data structure and model fitting process. The second and third examples use simulated data in concrete scenarios to illustrate the usage of the package. In the second example diet, we provide simulated data to mimic the dietary example described in this paper:

S. Witte, John & Greenland, Sander & W. Haile, Robert & L. Bird, Cristy. (1994). Hierarchical Regression Analysis Applied to a Study of Multiple Dietary Exposures and Breast Cancer. Epidemiology (Cambridge, Mass.). 5. 612-21. 10.1097/00001648-199411000-00009.

And in the third example gene, we provide simulated data to mimic the bone density data published in the European Bioinformatics Institute (EMBL-EBI) ArrayExpress repository, ID: E-MEXP-1618.

General Example

In the first example, $$Y$$ is a $$n = 100$$-dimensional continuous observed outcome vector, $$X$$ is matrix of $$p$$ potential predictors observed on the $$n$$ observations, and $$Z$$ is a set of $$q = 4$$ external features available for the $$p = 300$$ predictors.

library(xtune)
data("example")
X <- example$X; Y <- example$Y; Z <- example$Z dim(X);dim(Z) #>  100 300 #>  300 4 Each column of Z contains information about the predictors in design matrix X. The number of rows in Z equals to the number of predictors in X. X[1:3,1:10] #> Predictor_1 Predictor_2 Predictor_3 Predictor_4 Predictor_5 #> Observation_1 -0.7667960 0.9212806 2.0149030 0.79004563 -1.4244699 #> Observation_2 -0.8164583 -0.3144157 -0.2253684 0.08712746 -1.0296026 #> Observation_3 -0.1415352 0.6623149 -1.0398456 1.87611212 0.7340254 #> Predictor_6 Predictor_7 Predictor_8 Predictor_9 Predictor_10 #> Observation_1 -0.9529327 -0.9344928 -0.32964818 0.4486023 -0.70894600 #> Observation_2 2.2546851 0.2732793 -0.03852896 1.3830463 0.03070716 #> Observation_3 1.7534320 0.3263808 0.09564893 -2.2104531 0.22615224 The external information is encoded as follows: Z[1:10,] #> External_variable_1 External_variable_2 External_variable_3 #> Predictor_1 1 0 0 #> Predictor_2 1 0 0 #> Predictor_3 0 1 0 #> Predictor_4 0 1 0 #> Predictor_5 0 0 1 #> Predictor_6 0 0 1 #> Predictor_7 0 0 0 #> Predictor_8 0 0 0 #> Predictor_9 0 0 0 #> Predictor_10 0 0 0 #> External_variable_4 #> Predictor_1 0 #> Predictor_2 0 #> Predictor_3 0 #> Predictor_4 0 #> Predictor_5 0 #> Predictor_6 0 #> Predictor_7 1 #> Predictor_8 1 #> Predictor_9 1 #> Predictor_10 1 Here, each variable in Z is a binary variable. $$Z_{jk}$$ indicates if $$Predictor_j$$ has $$ExternalVariable_k$$ or not. This Z is an example of (non-overlapping) grouping of predictors. Predictor 1 and 2 belongs to group 1; predictor 3 and 4 belongs to group 2; predictor 5 and 6 belongs to group 3, and the rest of the predictors belongs to group 4. To fit a differential-shrinkage lasso model to this data: fit.example1 <- xtune(X,Y,Z) #> Z provided, start estimating individual tuning parameters The individual penalty parameters are returned by fit.example1$penalty.vector

In this example, predictors in each group get different estimated penalty parameters.

unique(fit.example1$penalty.vector) #> [,1] #> Predictor_1 0.00768154 #> Predictor_3 0.01407634 #> Predictor_5 0.03259048 #> Predictor_7 0.87835961 Coefficient estimates and predicted values and can be obtained via predict and coef: coef(fit.example1) predict(fit.example1, newX = X) The mse function can be used to get the mean square error (MSE) between prediction values and true values. mse(predict(fit.example1, newX = X), Y) Dietary example Suppose we want to predict a person’s weight loss (binary outcome) using his/her weekly dietary intake. Our external information Z could incorporate information about the levels of relevant food constituents in the dietary items. data(diet) head(diet$DietItems)
#>      Milk Margarine Eggs Apples Lettuce Celery Hot dogs Liver Dark bread
#> [1,]    1         1    4      1       2      1        1     0          0
#> [2,]    1         0    0      0       2      4        0     0          2
#> [3,]    0         1    2      3       1      3        0     0          4
#> [4,]    0         2    0      1       0      1        1     0          3
#> [5,]    0         1    1      3       1      1        0     2          2
#> [6,]    2         1    2      1       3      0        1     2          2
#>      Pasta Beer Liquor Cookies Bran
#> [1,]     1    0      2       3    4
#> [2,]     0    2      1       0    3
#> [3,]     0    1      2       1    1
#> [4,]     1    1      1       1    2
#> [5,]     2    0      0       0    3
#> [6,]     1    1      0       2    1
head(diet$weightloss) #>  0 1 1 1 1 1 The external information Z in this example is: head(diet$NuitritionFact)
#> NULL

In this example, Z is not a grouping of the predictors. The idea is that the nutrition facts about the dietary items might give us some information on the importance of each predictor in the model.

Similar to the previous example, the xtune model could be fit by:

fit.diet = xtune(X = diet$DietItems,Y=diet$weightloss,Z = diet$NuitritionFact, family="binary") #> No Z matrix provided, only a single tuning parameter will be estimated using empirical Bayes tuning Each dietary predictor is estimated an individual tuning parameter based on their nutrition fact. fit.diet$penalty.vector
#>   1 1 1 1 1 1 1 1 1 1 1 1 1 1

To make prediction using the trained model

predict(fit.diet,newX = diet$DietItems) The above code returns the predicted probabilities (scores). To make a class prediction, use the type = "class" option and also provide the original X and Y used for the fit. pred_class <- predict(fit.diet,newX = diet$DietItems,type = "class",X = diet$DietItems,Y=diet$weightloss)

The misclassification() function can be used to extract the misclassification rate. The prediction AUC can be calculated using the auc() function from the AUC package.

misclassification(pred_class,true = diet$weightloss) Gene expression data example The gene data contains simulated gene expression data. The dimension of data is $$50\times 200$$. The outcome Y is continuous (bone mineral density). The external information is four previous study results that identify the biological importance of genes. For example $$Z_{jk}$$ means whether $$gene_j$$ is identified to be biologically important in previous study $$k$$ result. $$Z_{jk} = 1$$ means that gene $$j$$ is identified by previous study $$k$$ result and $$Z_{jk} = 0$$ means that gene $$j$$ is not identified to be important by previous study $$k$$ result. data(gene) gene$GeneExpression[1:3,1:5]
#>          Gene_1     Gene_2     Gene_3     Gene_4     Gene_5
#> [1,] -0.7667960  1.7520578  0.9212806 -0.6273008  2.0149030
#> [2,] -0.8164583 -0.5477714 -0.3144157 -0.8796116 -0.2253684
#> [3,] -0.1415352 -0.8585257  0.6623149 -0.3053110 -1.0398456
gene$PreviousStudy[1:5,] #> Identified by previous study 1 Identified by previous study 2 #> Gene_1 0 0 #> Gene_2 0 0 #> Gene_3 0 0 #> Gene_4 0 0 #> Gene_5 0 0 #> Identified by previous study 3 Identified by previous study 4 #> Gene_1 0 0 #> Gene_2 0 0 #> Gene_3 0 0 #> Gene_4 0 0 #> Gene_5 0 0 A gene can be identified to be important by several previous study results, therefore the external information Z in this example can be seen as an overlapping group of variables. Model fitting: fit.gene = xtune(X = gene$GeneExpression,Y=gene$bonedensity,Z = gene$PreviousStudy)

The rest of the steps are the same as the previous two examples.

Two special cases

No external information Z

If you just want to tune a single penalty parameter using empirical Bayes tuning, simply do not provide Z in the xtune() function. If no external information Z is provided, the function will perform empirical Bayes tuning to choose the single penalty parameter in penalized regression, as an alternative to cross-validation. For example

fit.eb <- xtune(X,Y)
#> No Z matrix provided, only a single tuning parameter will be estimated using empirical Bayes tuning

The estimated tuning parameter is:

fit.eb$lambda Z as an identity matrix If you provide an identity matrix as external information Z to xtune(), the function will estimate a separate tuning parameter $$\lambda_j$$ for each regression coefficient $$\beta_j$$. Note that this is not advised when the number of predictors $$p$$ is very large. Using the dietary example, the following code would estimate a separate penalty parameter for each coefficient. Z_iden = diag(ncol(diet$DietItems))
fit.diet.identity = xtune(diet$DietItems,diet$weightloss,Z_iden)
#> Z provided, start estimating individual tuning parameters
fit.diet.identity\$penalty.vector
#>               [,1]
#>  [1,] 5.490105e-02
#>  [2,] 8.796161e-02
#>  [3,] 6.863956e+00
#>  [4,] 2.056148e-02
#>  [5,] 2.323703e+01
#>  [6,] 1.912083e-02
#>  [7,] 1.970638e-02
#>  [8,] 3.083099e-01
#>  [9,] 5.411627e-03
#> [10,] 3.176205e+00
#> [11,] 4.214480e-02
#> [12,] 1.117616e-01
#> [13,] 3.396484e+02
#> [14,] 5.695925e-02

A predictor is excluded from the model (regression coefficient equals to zero) if its corresponding penalty parameter is estimated to be infinity.

Conclusion

We presented the main usage of xtune package. For more details about each function, please go check the package documentation. If you would like to give us feedback or report issue, please tell us on Github.