A beneficial property of the lasso penalty is that it shrinks
coefficients to zero. Less beneficial is that in the process, the lasso
tends to overshrink large coefficients. It has been argued that the
lasso should “be considered as a *variable screener* rather than
a *model selector*” (Su et al., 2017). There appears a trade-off
between prediction and selection: The penalty parameter value optimal
for variable selection will overshrink the large coefficients, making it
suboptimal for prediction. The penalty parameter value optimal for
prediction (“lambda.min”) selects too many variables. The relaxed lasso
and adaptive lasso have been proposed to improve upon this problem.

When fitting prediction rule ensembles (PREs), the high false-positive selection rate of the lasso may be a nuisance, because often we want to obtain a sparse set of rules and linear terms. This vignette aims to show how the relaxed and adaptive lassos may deliver sparser ensembles, while retaining (relatively) high predictive accuracy.

An excellent discussion of consistency, predictive performance and selection accuracy of the lasso and its variations is provided in the master’s thesis of Kirkland (2014; pages 101-120).

The relaxed lasso was originally proposed by Meinshausen (2007).
Investigations of Su et al. (2107) provide insight on why the relaxed
lasso is beneficial. Hastie, Tibshirani & Tibshirani (2017) propose
a simplified version of the relaxed lasso, which is implemented in
package ** glmnet** and can be employed in
package

`pre`

`pre`

supports use of the relaxed lasso through passing of
argument `relax`

. A short introduction to the relaxed lasso
is provided in `glmnet`

`R`

`vignette("relax", "glmnet")`

.The adaptive lasso has been proposed by Zou (2006). It applies positive weighting factors to the lasso penalty to control the bias through shrinking coefficients with weights inversely proportional to their size. It thus aims to shrink small coefficients more and large coefficients less. It requires an initial estimate of the coefficients, for which OLS (if \(N > p\)) or ridge (if \(N < p\)) estimation is usually employed, to obtain a vector of weights. These weights can then be used to scale the predictor matrix, or to scale the penalty; both approaches have the same effect.

Function `pre`

allows for adaptive lasso estimation
through specification of argument `ad.alpha`

. It first uses
ridge regression for computing penalty weights. In principle, any
elastic net solution can be used, but use of ridge is recommended and
recommended. Other solutions can be used by specifying the
`ad.alpha`

and `ad.penalty`

arguments. Lasso
regression can be used by specifying `ad.alpha = 1`

and OLS
can be used by specifying `ad.penalty = 0`

. Next, the inverse
of the absolute values of the estimated coefficients are supplied as
penalty factors to the `cv.glmnet`

function. For the initial
and final estimates, the same fold assignments are used in the cross
validation.

It should not come as a surprise that a combination has also been
proposed: The relaxed adaptive lasso (Zhang et al., 2022). It can easily
be employed through specifying both `ad.alpha = 0`

and
`relax = TRUE`

.

We fit a PRE to predict `Ozone`

and employ the relaxed
lasso by specifying `relax = TRUE`

:

```
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens.rel <- pre(Ozone ~ ., data = airq, relax = TRUE)
```

If we specify `relax = TRUE`

, the `gamma`

argument (see `?cv.glmnet`

for documentation on arguments
`relax`

and `gamma`

) will by default be set to a
range of five values in the interval [0, 1]. This can be overruled by
specifying different values for argument `gamma`

in the call
to function `pre`

(but the default likely suffices in most
applications).

We take a look at the regularization paths for the relaxed fits:

We obtained one regularization path for each value of \(\gamma\). \(gamma\) is a mixing parameter, that
determines the weight of the original lasso solution, relative to a
solution containing only the selected variables, but with unpenalized
coefficient estimates. The path for \(\gamma =
1\) is the default lasso path, which we would also have obtained
without specifying `relax = TRUE`

. Lower values of \(\gamma\) ‘unshrink’ the value of the
non-zero coefficients of the lasso towards their unpenalized values. We
see that for the \(\lambda\) value
yielding the minimum MSE (indicated by the left-most vertical dotted
line), the value of \(\gamma\) does not
make a lot of difference for the MSE, but when \(\lambda\) values increase, higher values of
\(\gamma\) tend to improve predictive
performance. This is a common pattern for \(\lambda\) and \(\gamma\).

For model selection using the `"lambda.min"`

criterion, by
default the \(\lambda\) and \(\gamma\) combination yielding the lowest CV
error is returned. For the `"lambda.1se"`

criterion, the
\(\lambda\) and \(\gamma\) combination yielding the sparsest
solution within 1 standard error of the error criterion of the minimum
is returned:

```
fit <- airq.ens.rel$glmnet.fit$relaxed
mat <- data.frame(lambda.1se = c(fit$lambda.1se, fit$gamma.1se, fit$nzero.1se),
lambda.min = c(fit$lambda.min, fit$gamma.min, fit$nzero.min),
row.names = c("lamda", "gamma", "# of non-zero terms"))
mat
```

```
## lambda.1se lambda.min
## lamda 6.193185 3.382889
## gamma 0.000000 0.000000
## # of non-zero terms 9.000000 12.000000
```

Thus, as the dotted vertical lines in the plots already suggest, with
the default `"lambda.1se"`

criterion, a final model with 9
terms will be selected, with coefficients obtained using a \(\lambda\) value of 6.193 and a \(\gamma\) value of 0. With the
`"lambda.min"`

criterion, we obtain a more complex fit; \(\gamma = 0\) still yields the lowest CV
error. Note that use of `"lambda.min"`

increases the
likelihood of overfitting, because function `pre`

uses the
same data to extract the rules and fit the penalized regression, so in
most cases the default `"lambda.1se"`

criterion can be
expected to provide a less complex, better generalizable, often more
accurate fit.

The default of function `pre`

is to use the
`"lambda.1se"`

criterion. When `relax = TRUE`

has
been specified in the call to function `pre`

, the default of
all functions and `S3`

methods applied to objects of class
`pre`

(`print`

, `plot`

,
`coef`

, `predict`

, `importance`

,
`explain`

, `cvpre`

, `singleplot`

,
`pairplot`

, `interact`

) is to use the solution
obtained with `"lambda.1se"`

and the \(\gamma\) value yielding lowest CV error at
that value of \(\lambda\). This can be
overruled by specifying a different value of \(\lambda\) (`penalty.par.val`

)
and/or \(\gamma\) (`gamma`

).
Some examples:

```
##
## Final ensemble with cv error within 1se of minimum:
##
## lambda = 6.193185
## gamma = 0
## number of terms = 9
## mean cv error (se) = 304.7364 (79.60512)
##
## cv error type : Mean-Squared Error
```

```
## Final ensemble with minimum cv error:
##
## lambda = 3.382889
## gamma = 0
## number of terms = 12
## mean cv error (se) = 244.7256 (67.54855)
##
## cv error type : Mean-Squared Error
```

```
## Final ensemble:
##
## lambda = 7.814913
## gamma = 0
## number of terms = 5
## mean cv error (se) = 390.2582 (101.9163)
##
## cv error type : Mean-Squared Error
```

```
## Final ensemble:
##
## lambda = 7.814913
## gamma = 1
## number of terms = 5
## mean cv error (se) = 682.127 (146.0948)
##
## cv error type : Mean-Squared Error
```

Note how lowest CV error is indeed obtained with the
`"lambda.min"`

criterion, while the default
`"lambda.1se"`

yields a sparser model, with accuracy within 1
standard error of `"lambda.min"`

. If we want to go (much)
sparser, we need to specify a lower value for the \(\lambda\) penalty, and a lower value of
\(\gamma\) should likely be preferred,
to retain good-enough predictive accuracy.

Some rules for specification of \(\lambda\) and \(\gamma\):

If a numeric value of \(\lambda\) has been supplied, a (numeric) value for \(\gamma\)

*must*be supplied.Otherwise (if the default

`"lambda.1se"`

criterion is employed, or`"lambda.min"`

specified), the \(\gamma\) value yielding lowest CV error (at the \(\lambda\) value associated with the specified criterion) will be used; this \(\gamma\) value can be overruled by supplying the desired \(\gamma\) value to the`gamma`

argument.Multiple values of \(\gamma\) can be passed to function

`pre`

, but all other methods and functions accept*only a single value*for \(\gamma\) (this differs from severalfunctions) .`glmnet`

If a specific \(\lambda\) value is supplied, results are returned for a penalty parameter value that was used in the path, and closest to the specified value.

Also note that in the code chunk above we refer to the
`penalty.par.val`

argument by abbreviating it to
`penalty`

; this has the same effect as writing
`penalty.par.val`

in full.

Using \(\gamma = 0\) amounts to a forward stepwise selection approach, with entry order of the variables (rules and linear terms) determined by the lasso. This approach can be useful if we want a rule ensemble with low complexity and high generalizability, and especially when we want to decide a-priori on the number of terms we want to retain. By specifying a high value of \(\lambda\), we can retain a small number of rules, while specifying \(\gamma = 0\) will provide unbiased (unpenalized) coefficients. This avoids the overshrinking of large coefficients. In terms of predictive accuracy, this approach may not perform best, but if low complexity (interpretability) is most important, this is a very useful approach, which does not reduce predictive accuracy too much.

To use forward stepwise regression with variable entry order
determined by the lasso, we specify a \(\gamma\) value of 0, and specify the number
of variables we want to retain through specification of \(\lambda\) (`penalty.par.val`

).
To find the value of \(\lambda\)
corresponding to the number of terms one want to retain, check (results
not shown for space considerations):

Function `prune_pre`

is helpful for selecting sparser
ensembles. Say, we want to retain an ensemble with only five rules, then
`prune_pre`

will return the \(\lambda\) and \(\gamma\) values that yield an ensemble of
specified size, with optimal cross-validated predictive accuracy.

Here, we request the optimal parameter values for a five-term ensemble:

```
## The best ensemble with 5 non-zero terms is obtained with a lambda value of 6.797013 and a gamma value of 0.
##
## Overview of performance of ensembles selected with the nearest lambda values:
## lambda number_of_nonzero_terms optimal_gamma mean_cv_error
## s7 8.985251 2 0 425.7366
## s8 8.576857 2 0 414.1260
## s9 8.187026 4 0 407.1132
## s10 7.814913 5 0 390.2582
## s11 7.459713 5 0 378.8175
## s12 7.120658 5 0.25 384.3582
## s13 6.797013 5 0 370.8112
## s14 6.488078 6 0 347.7939
## s15 6.193185 9 0 304.7364
## s16 5.911695 9 0 294.8892
```

Note that the `mean_cv_error`

may be slightly optimistic.
Cross validation was performed on the same data that was used the
generate the rules. A less optimistic estimate of generalization error
can be obtained using function `cvpre`

.

Finally, we fit a PRE with adaptive lasso to predict
`Ozone`

levels:

```
##
## Final ensemble with cv error within 1se of minimum:
##
## lambda = 16.74798
## number of terms = 8
## mean cv error (se) = 318.0806 (91.18799)
##
## cv error type : Mean-Squared Error
```

The adaptive lasso did not provide a sparser ensemble, while the mean
CV error suggests better predictive accuracy than the standard, but not
the relaxed, lasso. Adaptive lasso settings can further be adjusted by
specification of argument `ad.penalty`

.

We can also fit a rule ensemble using the relaxed adaptive lasso:

```
set.seed(42)
airq.ens.rel.ad <- pre(Ozone ~ ., data = airq, relax = TRUE, ad.alpha = 0)
print(airq.ens.rel.ad)
```

```
##
## Final ensemble with cv error within 1se of minimum:
##
## lambda = 40.53227
## gamma = 0.25
## number of terms = 4
## mean cv error (se) = 302.991 (91.09976)
##
## cv error type : Mean-Squared Error
##
## rule coefficient description
## (Intercept) 75.01177 1
## rule191 -21.82831 Wind > 5.7 & Temp <= 87
## rule173 -18.12641 Wind > 5.7 & Temp <= 82
## rule204 12.38470 Wind <= 10.3 & Solar.R > 148
## rule51 -11.81249 Wind > 5.7 & Temp <= 84
```

The summary suggests that the relaxed adaptive lasso provides highest
predictive accuracy compared to the standard, the relaxed and the
adaptive lasso, when using the default `"lambda.1se"`

criterion. Note however that the training data have been used to
generate the rules, to estimate the weights for the penalty factors
using ridge regression and to estimate the final lasso model. Thus, the
printed CV error can provide an overly optimistic estimate of predictive
accuracy. To obtain an honest estimate of predictive accuracy, it should
be computed on a separate test dataset or using an additional layer of
cross validation (e.g., using function `cvpre`

or other
approach).

Use of the relaxed lasso improves accuracy and sparsity of the final ensemble. Relaxed lasso can be used to obtain an ensemble of pre-specified sparsity, that still provides good predictive performance. Use of the adaptive lasso penalties may further improve predictive accuracy.

Hastie, T., Tibshirani, R., & Tibshirani, R. J. (2017). Extended
comparisons of best subset selection, forward stepwise selection, and
the lasso. *arXiv:1707.08692*, https://arxiv.org/abs/1707.08692.

Kirklan, L.-A. (2014). LASSO - Simultaneous shrinkage and selection
via the L1 norm. Master’s thesis, University of Pretoria.
www_dot_researchgate_dot_net/publication/325929497_LASSO_-_Simultaneous_shrinkage_and_selection_via_the_L1_norm
(replace *dot* with a period and there you go, this is just to
avoid issues on CRAN submission)

Meinshausen, N. (2007). Relaxed lasso. *Computational Statistics
& Data Analysis, 52*(1), 374-393.

Su, W., Bogdan, M., & Candes, E. (2017). False discoveries occur
early on the lasso path. *The Annals of Statistics, 45*(5),
2133-2150.

Zhang, R., Zhao, T., Lu, Y., & Xu, X. (2022). Relaxed Adaptive
Lasso and its asymptotic results. *Symmetry, 14*(7), 1422.

Zou, H. (2006). The adaptive lasso and its oracle properties.
*Journal of the American Statistical Association, 101*(476),
1418-1429.

In case you obtained different results, the results above were obtained using the following:

```
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=Dutch_Netherlands.utf8
## [3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Netherlands.utf8
##
## time zone: Europe/Amsterdam
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pre_1.0.7 mice_3.16.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3 tidyr_1.3.0
## [5] shape_1.4.6 stringi_1.7.12 lattice_0.21-8 inum_1.0-5
## [9] plotmo_3.6.2 lme4_1.1-35.1 digest_0.6.31 magrittr_2.0.3
## [13] mitml_0.4-5 evaluate_0.23 grid_4.3.1 iterators_1.0.14
## [17] mvtnorm_1.2-3 fastmap_1.1.1 foreach_1.5.2 jomo_2.7-6
## [21] jsonlite_1.8.4 glmnet_4.1-8 Matrix_1.6-2 nnet_7.3-19
## [25] backports_1.4.1 Formula_1.2-5 survival_3.5-5 purrr_1.0.1
## [29] fansi_1.0.4 codetools_0.2-19 jquerylib_0.1.4 cli_3.6.0
## [33] rlang_1.1.1 splines_4.3.1 plotrix_3.8-4 cachem_1.0.8
## [37] yaml_2.3.7 pan_1.9 tools_4.3.1 deldir_1.0-9
## [41] MatrixModels_0.5-3 earth_5.3.2 nloptr_2.0.3 minqa_1.2.6
## [45] dplyr_1.1.2 interp_1.1-4 boot_1.3-28.1 broom_1.0.5
## [49] rpart_4.1.19 vctrs_0.6.3 R6_2.5.1 lifecycle_1.0.4
## [53] stringr_1.5.1 libcoin_1.0-10 MASS_7.3-60 partykit_1.2-20
## [57] pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.1 TeachingDemos_2.12
## [61] glue_1.6.2 Rcpp_1.0.10 highr_0.10 xfun_0.41
## [65] tibble_3.2.1 tidyselect_1.2.0 rstudioapi_0.15.0 knitr_1.45
## [69] htmltools_0.5.7 nlme_3.1-162 rmarkdown_2.25 compiler_4.3.1
```