# 1 Description

BranchGLM is a package for fitting GLMs and performing variable selection. Most functions in this package make use of RcppArmadillo and some of them can also make use of OpenMP to perform parallel computations. This vignette introduces the package, provides examples on how to use the main functions in the package and also briefly describes the methods employed by the functions.

# 2 Installation

BranchGLM can be installed using the install.packages() function


install.packages("BranchGLM")

It can also be installed via the install_github() function from the devtools package.


devtools::install_github("JacobSeedorff21/BranchGLM")

# 3 Fitting GLMs

• BranchGLM() allows fitting of gaussian, binomial, gamma, and poisson GLMs with a variety of links available.
• Parallel computation can also be done to speed up the fitting process, but it is only useful for larger datasets.

## 3.1 Optimization methods

• The optimization method can be specified, the default method is fisher scoring, but BFGS and L-BFGS are also available.
• BFGS and L-BFGS typically perform better when there are many predictors in the model (at least 50 predictors), otherwise fisher scoring is typically faster.
• The grads argument is for L-BFGS only and it is the number of gradients that are stored at a time and are used to approximate the inverse information. The default value for this is 10, but another common choice is 5.
• The tol argument controls how strict the convergence criteria are, lower values of this will lead to more accurate results, but may also be slower.
• The method argument is ignored for linear regression and the OLS solution is used.

## 3.2 Examples

• An offset can be specified using offset, it should be a numeric vector.
### Using mtcars

library(BranchGLM)

cars <- mtcars

### Fitting linear regression model with Fisher scoring

LinearFit <- BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity")

LinearFit
#> Results from gaussian regression with identity link function
#> Using the formula mpg ~ .
#>
#>             Estimate       SE       t p.values
#> (Intercept) 12.30000 18.72000  0.6573  0.51810
#> cyl         -0.11140  1.04500 -0.1066  0.91610
#> disp         0.01334  0.01786  0.7468  0.46350
#> hp          -0.02148  0.02177 -0.9868  0.33500
#> drat         0.78710  1.63500  0.4813  0.63530
#> wt          -3.71500  1.89400 -1.9610  0.06325 .
#> qsec         0.82100  0.73080  1.1230  0.27390
#> vs           0.31780  2.10500  0.1510  0.88140
#> am           2.52000  2.05700  1.2250  0.23400
#> gear         0.65540  1.49300  0.4389  0.66520
#> carb        -0.19940  0.82880 -0.2406  0.81220
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 7.0235
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 147 on 21 degrees of freedom
#> AIC: 166

### Fitting gamma regression with inverse link with L-BFGS

GammaFit <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "inverse",
method = "LBFGS", grads = 5)

GammaFit
#> Results from gamma regression with inverse link function
#> Using the formula mpg ~ .
#>
#>               Estimate         SE       t p.values
#> (Intercept) -6.794e-02  3.085e-02 -2.2020  0.03898 *
#> cyl          1.760e-03  1.946e-03  0.9047  0.37590
#> disp        -7.947e-06  3.515e-05 -0.2261  0.82330
#> hp          -6.724e-05  4.050e-05 -1.6600  0.11170
#> drat        -4.283e-04  2.728e-03 -0.1570  0.87670
#> wt          -9.224e-03  3.360e-03 -2.7450  0.01214 *
#> qsec         1.739e-03  1.165e-03  1.4930  0.15040
#> vs          -3.087e-04  3.322e-03 -0.0929  0.92690
#> am           6.304e-04  3.441e-03  0.1832  0.85640
#> gear         4.882e-03  2.577e-03  1.8940  0.07204 .
#> carb        -1.025e-03  1.481e-03 -0.6926  0.49610
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 0.0087
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 0 on 21 degrees of freedom
#> AIC: 152
#> Algorithm converged in 3 iterations using L-BFGS

## 3.3 Useful functions

• BranchGLM also has many utility functions for GLMs such as
• coef() to extract the coefficients
• logLik() to extract the log likelihood
• AIC() to extract the AIC
• BIC() to extract the BIC
• predict() to obtain predictions from the fitted model
• The coefficients, standard errors, wald test statistics, and p-values are stored in the coefficients slot of the fitted model
• Unlike GLM there is no summary method, all the important information is included in the BranchGLM object.
### Predict method

predict(GammaFit)
#>  [1] 21.18646 20.58540 25.08262 19.73260 16.84361 19.66750 14.40825 22.08188
#>  [9] 23.72831 18.71060 19.08321 15.34256 16.20899 16.27084 12.61678 12.23244
#> [17] 12.09701 28.31601 31.25370 32.11988 22.42958 17.18607 17.63342 13.73885
#> [25] 15.78752 29.52892 26.61243 30.53420 16.86624 19.39720 14.08588 21.53099

### Accessing coefficients matrix

GammaFit$coefficients #> Estimate SE t p.values #> (Intercept) -6.794026e-02 3.085396e-02 -2.20199519 0.03897971 #> cyl 1.760179e-03 1.945517e-03 0.90473592 0.37586830 #> disp -7.947123e-06 3.515071e-05 -0.22608709 0.82331968 #> hp -6.724158e-05 4.049734e-05 -1.66039498 0.11169449 #> drat -4.282730e-04 2.727545e-03 -0.15701775 0.87673070 #> wt -9.224050e-03 3.360483e-03 -2.74485832 0.01213720 #> qsec 1.739266e-03 1.165327e-03 1.49251309 0.15043663 #> vs -3.086809e-04 3.322383e-03 -0.09290948 0.92685616 #> am 6.304156e-04 3.441185e-03 0.18319724 0.85640047 #> gear 4.881920e-03 2.577117e-03 1.89433406 0.07203615 #> carb -1.025451e-03 1.480570e-03 -0.69260584 0.49614538 # 4 Performing variable selection • Forward selection, backward elimination, and branch and bound selection can be done using VariableSelection(). • VariableSelection() can accept either a BranchGLM object or a formula along with the data and the desired family and link to perform the variable selection. • Available metrics are AIC and BIC, which are used to compare models and to select the best model. • VariableSelection() returns the final model and some other information about the search. • Note that VariableSelection() will properly handle interaction terms. • keep can also be specified if any set of variables are desired to be kept in every model. ## 4.1 Stepwise methods • Forward selection and backward elimination are both stepwise variable selection methods. • They are not guaranteed to find the best model or even a good model, but they are very fast. • Forward selection is recommended if the number of variables is greater than the number of observations or if many of the larger models don’t converge. • Parallel computation can be used for the methods, but is generally only necessary for large datasets. ### 4.1.1 Forward selection example ### Forward selection with mtcars VariableSelection(GammaFit, type = "forward") #> Variable Selection Info: #> -------------------------------------------- #> Variables were selected using forward selection with AIC #> The best value of AIC obtained was 142 #> Number of models fit: 27 #> #> Order the variables were added to the model: #> #> 1). wt #> 2). hp #> -------------------------------------------- #> Final Model: #> -------------------------------------------- #> Results from gamma regression with inverse link function #> Using the formula mpg ~ hp + wt #> #> Estimate SE t p.values #> (Intercept) -8.923e-03 2.806e-03 -3.180 0.0034910 ** #> hp -8.887e-05 2.110e-05 -4.212 0.0002245 *** #> wt -9.826e-03 1.384e-03 -7.100 8.21e-08 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Dispersion parameter taken to be 0.0104 #> 32 observations used to fit model #> (0 observations removed due to missingness) #> #> Residual Deviance: 0 on 29 degrees of freedom #> AIC: 142 #> Algorithm converged in 3 iterations using Fisher's scoring ### 4.1.2 Backward elimination example ### Backward elimination with mtcars VariableSelection(GammaFit, type = "backward") #> Variable Selection Info: #> -------------------------------------------- #> Variables were selected using backward elimination with AIC #> The best value of AIC obtained was 142 #> Number of models fit: 49 #> #> Order the variables were removed from the model: #> #> 1). vs #> 2). drat #> 3). am #> 4). disp #> 5). carb #> 6). cyl #> -------------------------------------------- #> Final Model: #> -------------------------------------------- #> Results from gamma regression with inverse link function #> Using the formula mpg ~ hp + wt + qsec + gear #> #> Estimate SE t p.values #> (Intercept) -4.691e-02 1.782e-02 -2.633 0.01384 * #> hp -6.284e-05 3.015e-05 -2.084 0.04675 * #> wt -9.485e-03 1.819e-03 -5.213 1.719e-05 *** #> qsec 1.299e-03 7.471e-04 1.739 0.09348 . #> gear 2.662e-03 1.652e-03 1.612 0.11870 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Dispersion parameter taken to be 0.0091 #> 32 observations used to fit model #> (0 observations removed due to missingness) #> #> Residual Deviance: 0 on 27 degrees of freedom #> AIC: 142 #> Algorithm converged in 3 iterations using Fisher's scoring ## 4.2 Branch and bound • The Branch and bound methods can be much slower than the stepwise methods, but they are guaranteed to find the best model. • The branch and bound methods can be much faster than an exhaustive search and can also be made many times faster if parallel computation is used. • One way to judge how much faster the branch and bound algorithm is compared to an exhaustive search is to look at the ratio of the number of models fit against the total number of possible models. • If this ratio is close to 1, then it is performing poorly and has to fit almost every model, if it is close to 0, then it is performing well and ruled out many models without having to fit them. ### 4.2.1 Branch and bound example • If showprogress is true, then progress of the branch and bound algorithm will be reported occasionally. • Parallel computation can be used with these methods and can lead to very large speedups. ### Branch and bound with mtcars VariableSelection(GammaFit, type = "branch and bound", showprogress = FALSE) #> Variable Selection Info: #> -------------------------------------------- #> Variables were selected using branch and bound selection with AIC #> The best value of AIC obtained was 142 #> Number of models fit: 74 #> #> #> -------------------------------------------- #> Final Model: #> -------------------------------------------- #> Results from gamma regression with inverse link function #> Using the formula mpg ~ hp + wt + qsec + gear #> #> Estimate SE t p.values #> (Intercept) -4.691e-02 1.782e-02 -2.633 0.01384 * #> hp -6.284e-05 3.015e-05 -2.084 0.04675 * #> wt -9.485e-03 1.819e-03 -5.213 1.719e-05 *** #> qsec 1.299e-03 7.471e-04 1.739 0.09348 . #> gear 2.662e-03 1.652e-03 1.612 0.11870 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Dispersion parameter taken to be 0.0091 #> 32 observations used to fit model #> (0 observations removed due to missingness) #> #> Residual Deviance: 0 on 27 degrees of freedom #> AIC: 142 #> Algorithm converged in 3 iterations using Fisher's scoring ### Can also use a formula and data FormulaVS <- VariableSelection(mpg ~ . ,data = cars, family = "gamma", link = "inverse", type = "branch and bound", showprogress = FALSE) ### Number of models fit divided by the number of possible models FormulaVS$numchecked / 2^(length(FormulaVS$variables)) #> [1] 0.07226562 ### Extracting final model FormulaVS$finalmodel
#> Results from gamma regression with inverse link function
#> Using the formula mpg ~ hp + wt + qsec + gear
#>
#>               Estimate         SE      t  p.values
#> (Intercept) -4.691e-02  1.782e-02 -2.633   0.01384 *
#> hp          -6.284e-05  3.015e-05 -2.084   0.04675 *
#> wt          -9.485e-03  1.819e-03 -5.213 1.719e-05 ***
#> qsec         1.299e-03  7.471e-04  1.739   0.09348 .
#> gear         2.662e-03  1.652e-03  1.612   0.11870
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 3 iterations using Fisher's scoring

## 4.3 Using keep

• Specifying variables via keep will ensure that those variables are kept through the selection process.
### Example of using keep

VariableSelection(mpg ~ . ,data = cars, family = "gamma",
link = "inverse", type = "branch and bound",
keep = c("hp", "cyl"), metric = "AIC",
showprogress = FALSE)
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using branch and bound selection with AIC
#> The best value of AIC obtained was 143
#> Number of models fit: 38
#> Variables that were kept in each model:  hp, cyl
#>
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function
#> Using the formula mpg ~ cyl + hp + wt + qsec + gear
#>
#>               Estimate         SE       t  p.values
#> (Intercept) -6.464e-02  2.700e-02 -2.3940   0.02418 *
#> cyl          1.412e-03  1.642e-03  0.8603   0.39750
#> hp          -7.523e-05  3.321e-05 -2.2650   0.03205 *
#> wt          -1.037e-02  2.082e-03 -4.9830 3.517e-05 ***
#> qsec         1.816e-03  9.462e-04  1.9190   0.06603 .
#> gear         3.861e-03  2.155e-03  1.7910   0.08490 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 0.0089
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 0 on 26 degrees of freedom
#> AIC: 143
#> Algorithm converged in 3 iterations using Fisher's scoring

## 4.4 Convergence issues

• It is not recommended to use branch and bound if the upper models do not converge since it can make the algorithm very slow.
• Sometimes when using backwards selection and all the upper models that are tested do not converge, no final model can be selected.
• For these reasons, if there are convergence issues it is recommended to use forward selection.

# 5 Utility functions for binomial GLMs

• BranchGLM also has some utility functions for binomial GLMs
• Table() creates a confusion matrix based on the predicted classes and observed classes
• ROC() creates an ROC curve which can be plotted with plot()
• AUC() and Cindex() calculate the area under the ROC curve
• MultipleROCCurves() allows for the plotting of multiple ROC curves on the same plot

## 5.1 Table

### Predicting if a car gets at least 18 mpg

catData <- ToothGrowth

catFit <- BranchGLM(supp ~ ., data = catData, family = "binomial", link = "logit")

Table(catFit)
#> Confusion matrix:
#> ----------------------
#>             Predicted
#>              OJ   VC
#>
#>          OJ  17   13
#> Observed
#>          VC  7    23
#>
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy:  0.6667
#> Sensitivity:  0.7667
#> Specificity:  0.5667
#> PPV:  0.6389

## 5.2 ROC


catROC <- ROC(catFit)

plot(catROC, main = "ROC Curve", col = "indianred")

## 5.3 Cindex/AUC


Cindex(catFit)
#> [1] 0.7127778

AUC(catFit)
#> [1] 0.7127778

## 5.4 MultipleROCPlots

### Showing ROC plots for logit, probit, and cloglog

probitFit <- BranchGLM(supp ~ . ,data = catData, family = "binomial",

cloglogFit <- BranchGLM(supp ~ . ,data = catData, family = "binomial",

MultipleROCCurves(catROC, ROC(probitFit), ROC(cloglogFit),
names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))

## 5.5 Using predictions

• For each of the methods used in this section predicted probabilities and observed classes can also be supplied instead of the BranchGLM object.

preds <- predict(catFit)

Table(preds, catData$supp) #> Confusion matrix: #> ---------------------- #> Predicted #> OJ VC #> #> OJ 17 13 #> Observed #> VC 7 23 #> #> ---------------------- #> Measures: #> ---------------------- #> Accuracy: 0.6667 #> Sensitivity: 0.7667 #> Specificity: 0.5667 #> PPV: 0.6389 AUC(preds, catData$supp)
#> [1] 0.7127778

ROC(preds, catData\$supp) |> plot(main = "ROC Curve", col = "deepskyblue")