After installing scorecard via instructions in the README section, load the package into your environment.
Let’s use the germancredit dataset for the purposes of this demonstration.
The var_filter function drops variables whose missing rate is too
high (> 95% by default), whose information value (IV) is too low
(< 0.02 by default), or whose identical value rate is too high
(> 95% by default).
dt_f <- var_filter(germancredit, y = "creditability")
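To make two of these filter criteria concrete, here is a minimal base R sketch that computes a missing rate and an identical value rate on a toy data frame. The column names, the 0.5 limits, and the exact definition of the identical value rate (share of the most frequent non-missing value) are assumptions for illustration and may differ from what var_filter does internally.

```r
# Toy data frame; the column names are hypothetical.
df <- data.frame(
  mostly_na = c(NA, NA, NA, NA, 1),
  constant  = c("a", "a", "a", "a", "a"),
  useful    = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Share of missing values in a column.
missing_rate <- function(x) mean(is.na(x))

# Share of the single most frequent non-missing value (assumed definition).
identical_rate <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(1)
  max(table(x)) / length(x)
}

miss  <- vapply(df, missing_rate, numeric(1))
ident <- vapply(df, identical_rate, numeric(1))

# Variables that survive illustrative limits of 0.5 on both rates.
kept <- names(df)[miss <= 0.5 & ident <= 0.5]
```

Only the `useful` column passes both checks here: `mostly_na` fails on its missing rate and `constant` fails on its identical value rate.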
When building scorecard models, a subset of the observations should
be held out of the data used to train the model and assigned to a
test set instead, as in most other traditional modeling approaches.
We can create the train and test datasets with the split_df function.
dt_list <- split_df(dt_f, y = "creditability", ratios = c(0.6, 0.4), seed = 30)
label_list <- lapply(dt_list, function(x) x$creditability)
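For intuition, a 60/40 split that preserves the class balance can be sketched in base R as follows. This is one plausible approach on toy data; whether split_df samples in exactly this way is an assumption.

```r
set.seed(30)
y <- rep(c(0, 1), times = c(70, 30))  # toy binary target with a 30% bad rate
ratio <- 0.6

# Sample 60% of the row indices within each class.
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(ratio * length(idx)))]
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]
```

Because the sampling is done per class, both `train_y` and `test_y` keep the original 30% bad rate.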
Weight-of-Evidence (WoE) binning is a technique for binning both continuous
and categorical independent variables in a way that provides the most
robust bifurcation of the data against the dependent variable. This
technique can be easily executed across all independent variables using
the woebin function.
bins <- woebin(dt_f, y = "creditability")
# woebin_plot(bins)
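As a rough illustration of the quantities behind the binning, the sketch below computes WoE and IV for three bins from made-up counts. The sign convention (positive WoE means relatively more bad outcomes) is an assumption; the IV does not depend on it.

```r
# Toy counts of good and bad outcomes in three bins (made-up numbers).
toy_bins <- data.frame(
  bin  = c("low", "mid", "high"),
  good = c(50, 30, 20),
  bad  = c(10, 20, 70)
)

dist_good <- toy_bins$good / sum(toy_bins$good)
dist_bad  <- toy_bins$bad / sum(toy_bins$bad)

# Assumed sign convention: positive WoE means relatively more bads.
toy_bins$woe <- log(dist_bad / dist_good)

# Each bin's contribution to the information value is non-negative.
toy_bins$iv_part <- (dist_bad - dist_good) * toy_bins$woe
total_iv <- sum(toy_bins$iv_part)
```

With these counts the variable separates the classes strongly, so `total_iv` lands far above the 0.02 filter threshold mentioned earlier.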
The user can also adjust bin breaks interactively by using the woebin_adj function.
# breaks_adj <- woebin_adj(dt_f, y = "creditability", bins = bins)
Furthermore, the user can set the bin breaks manually via the
breaks_list = list() argument in the woebin function. Note the use
of %,% as a separator to create a single bin from two classes in a
categorical independent variable.
breaks_adj <- list(
  age.in.years = c(26, 35, 40),
  other.debtors.or.guarantors = c("none", "co-applicant%,%guarantor")
)
bins_adj <- woebin(dt_f, y = "creditability", breaks_list = breaks_adj)
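The effect of such breaks can be mimicked in base R. The sketch below assigns toy ages to intervals and merges two categories into one group. The left-closed interval convention and the lookup helper are assumptions for illustration, not the package's internal code.

```r
age <- c(19, 26, 30, 35, 39, 40, 75)

# Left-closed intervals are an assumption about how the breaks are applied.
age_bin <- cut(age, breaks = c(-Inf, 26, 35, 40, Inf), right = FALSE)

debtors <- c("none", "co-applicant", "guarantor", "none")
groups  <- strsplit(c("none", "co-applicant%,%guarantor"), "%,%", fixed = TRUE)

# Map every original level to the label of the group that contains it.
lookup <- unlist(lapply(groups, function(g) {
  setNames(rep(paste(g, collapse = "%,%"), length(g)), g)
}))
debtor_bin <- unname(lookup[debtors])
```

Here "co-applicant" and "guarantor" end up with the same group label, so they would share one bin.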
Once your WoE bins are established for all desired independent variables, apply the binning logic to the training and test datasets.
dt_woe_list <- lapply(dt_list, function(x) woebin_ply(x, bins_adj))
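Conceptually, applying the bins means replacing each raw value with the WoE of the bin it falls into. A minimal base R sketch with a hypothetical lookup table (the bin labels and WoE values are made up):

```r
# Hypothetical bins and WoE values for one numeric variable.
bin_labels <- c("[-Inf,26)", "[26,35)", "[35,Inf)")
bin_woe    <- c(0.5, -0.1, -0.4)

age <- c(22, 30, 50)

# Assign each value to its bin, then look up that bin's WoE.
age_bin <- cut(age, breaks = c(-Inf, 26, 35, Inf), right = FALSE,
               labels = bin_labels)
age_woe <- bin_woe[match(age_bin, bin_labels)]
```

The transformed vector `age_woe` is what a downstream regression would see in place of the raw ages.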
Logistic regression can often be leveraged effectively to assist in building the scorecards.
m1 <- glm(creditability ~ ., family = binomial(), data = dt_woe_list$train)
# vif(m1, merge_coef = TRUE)
# summary(m1)

# Select a formula-based model by AIC (or by LASSO for large dataset)
m_step <- step(m1, direction = "both", trace = FALSE)
m2 <- eval(m_step$call)
# vif(m2, merge_coef = TRUE)
# summary(m2)
If oversampling is a concern, the following code chunk could be uncommented and run to help adjust for this issue.
# Read documentation on handling oversampling (support.sas.com/kb/22/601.html)
# library(data.table)
# p1 <- 0.03 # bad probability in population
# r1 <- 0.3 # bad probability in sample dataset
# dt_woe <- copy(dt_woe_list$train)[, weight := ifelse(creditability == 1, p1/r1, (1-p1)/(1-r1))]
# fmla <- as.formula(paste("creditability ~", paste(names(coef(m2))[-1], collapse = "+")))
# m3 <- glm(fmla, family = binomial(), data = dt_woe, weights = weight)
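To see why these weights correct for oversampling, the following base R sketch applies the same weighting formula to a toy sample with a 30% bad rate and checks that the weighted bad rate matches the assumed 3% population rate. The data are made up.

```r
p1 <- 0.03  # bad probability in population
r1 <- 0.3   # bad probability in sample dataset

# Toy sample: 30 bads (1) and 70 goods (0), so the sample bad rate is r1.
y <- rep(c(1, 0), times = c(30, 70))

# Same weighting rule as in the chunk above.
w <- ifelse(y == 1, p1 / r1, (1 - p1) / (1 - r1))

# After weighting, the bad rate equals the population rate p1.
weighted_bad_rate <- sum(w * (y == 1)) / sum(w)
```

Bads are down-weighted and goods are up-weighted, so the weighted sample resembles the population the model will be applied to.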
The perf_eva function provides model accuracy statistics
(such as mse, rmse, logloss, r2, ks, auc, gini) and plots (such as ks,
lift, gain, roc, lz, pr, f1, density).
# First, get probabilistic predictions
pred_list <- lapply(dt_woe_list, function(x) predict(m2, x, type = 'response'))
# Then evaluate model accuracy
perf <- perf_eva(pred = pred_list, label = label_list)
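Three of these statistics (auc, gini, ks) are easy to reproduce by hand. The base R sketch below computes them on a small made-up set of labels and predictions; it illustrates the standard definitions and is not a copy of the package's implementation.

```r
label <- c(0, 0, 0, 0, 1, 0, 1, 1)
pred  <- c(0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9)

# AUC via the rank-sum (Mann-Whitney) identity.
n_pos <- sum(label == 1)
n_neg <- sum(label == 0)
auc <- (sum(rank(pred)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Gini coefficient as a rescaling of AUC.
gini <- 2 * auc - 1

# KS: largest gap between the empirical score distributions of the two classes.
thresholds <- sort(unique(pred))
cdf_pos <- ecdf(pred[label == 1])(thresholds)
cdf_neg <- ecdf(pred[label == 0])(thresholds)
ks <- max(abs(cdf_pos - cdf_neg))
```

In this toy example 14 of the 15 positive-negative pairs are ranked correctly, and the largest distribution gap occurs at a threshold of 0.4.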
Once the model has been selected, scorecards can be created via the
scorecard function. Note that the defaults are 600 target points,
target odds of 1/19, and 50 points to double the odds. See
?scorecard for more information on the function and its arguments.
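The three defaults pin down a linear mapping from log-odds to points. The sketch below uses a common scaling convention, with odds expressed as bad-to-good; treating this as the exact formula used inside scorecard is an assumption.

```r
points0 <- 600     # target points
odds0   <- 1 / 19  # target odds (bad to good)
pdo     <- 50      # points to double the odds

# Slope and intercept of the score as a function of log-odds.
b <- pdo / log(2)
a <- points0 + b * log(odds0)

# Score for a given bad-to-good odds value.
score_from_odds <- function(odds_bad) a - b * log(odds_bad)

score_at_target <- score_from_odds(1 / 19)
score_at_half   <- score_from_odds(1 / 38)
```

Under this convention, odds of 1/19 map to the target score of 600, and halving the bad-to-good odds to 1/38 (doubling the good-to-bad odds) adds 50 points.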
The scorecard can then be applied to the original data using the
scorecard_ply function. Lastly, a chart encompassing
Population Stability Index (PSI) statistics can be rendered via the
perf_psi function.
# Build the card
card <- scorecard(bins_adj, m2)
# Obtain Credit Scores
score_list <- lapply(dt_list, function(x) scorecard_ply(x, card))
# Analyze the PSI
perf_psi(score = score_list, label = label_list)
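For reference, the PSI compares the share of observations per score band between two samples. A minimal base R sketch using the standard formula and made-up band shares (the banding itself is assumed to have been done already):

```r
# Share of observations in each of five score bands (made-up numbers).
train_share <- c(0.10, 0.20, 0.40, 0.20, 0.10)
test_share  <- c(0.12, 0.18, 0.38, 0.22, 0.10)

# PSI: sum over bands of (actual - expected) * log(actual / expected).
psi_value <- function(expected, actual) {
  sum((actual - expected) * log(actual / expected))
}

psi <- psi_value(train_share, test_share)
psi_same <- psi_value(train_share, train_share)
```

Identical distributions give a PSI of zero, and the value grows as the test distribution drifts away from the training distribution.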