To satisfy the main hallmarks of scientific model development – rigor, testability, replicability and precision, and confidence – it’s important to consider model validation and how to deal with unbalanced data. This article outlines advanced validation frameworks that users can implement to satisfy those hallmarks and provides a brief overview of methodologies commonly used to deal with unbalanced data.
Advanced Validation Framework
Users should be suspicious of any predictive model that fits the data too well. When building complex, high-performance predictive models, data scientists often make a modelling error referred to as overfitting. Overfitting – which occurs when a model fits the training dataset almost perfectly but fails to generalize to a test dataset – is a fundamental issue and the biggest threat to predictive models. It leads to poor predictions on new (unseen, holdout) datasets.
Figure 1. Model overfitting
Many validation frameworks exist that detect and minimize overfitting, and they differ in terms of algorithm complexity, computational power, and robustness. Two simple, common techniques are:
- Simple validation – Random or stratified partitioning into train and test partitions, and
- Nested holdout validation – Random or stratified partitioning into train, validation, and test partitions. Different models are trained on the training partition and compared with one another on the validation partition, and the champion model is then validated on unseen data, namely the testing partition.
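The stratified three-way partitioning described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `stratified_split`, the 60/20/20 fractions, and the toy labels are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return index lists (train, validation, test), stratified by label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test

# Hypothetical data: 1,000 cases with a 10% "bad" rate (label 1).
labels = [1] * 100 + [0] * 900
train, valid, test = stratified_split(labels)
```

Because each class is split separately, every partition keeps roughly the same "bad" rate as the full dataset. Dropping the validation fraction reduces this to the simple train/test case.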
The main drawback of these two approaches is that the model fitted to a subset of the available data could still be subject to overfitting. This is especially true with datasets that contain few observations.
Another drawback, which occurs in simple validation, arises when model parameters are adjusted repeatedly and model performance is checked each time on the same test sample. This leads to data leakage: the model effectively “learns” from the test sample, so the test sample is no longer a true holdout sample and overfitting may become a problem. Nested holdout validation can resolve the problem to a certain extent; however, this approach requires a lot of data, which may not be available.
Bootstrapping and cross-validation are two of the validation frameworks specifically designed to overcome problems with overfitting and to capture sources of variation more thoroughly:
- Bootstrapping – Sampling with replacement. The standard bootstrap validation process randomly creates M different samples, each the same size as the original data. The model is fitted on each of the bootstrap samples and subsequently tested on the entire dataset to measure performance.
- Cross-validation (CV) – Uses the entire population for both fitting and testing by systematically swapping samples between the training and testing roles. Cross-validation has many forms, including K-fold, stratified, leave-one-out, and nested cross-validation.
- Nested cross-validation is required if users want to validate the model in addition to parameter tuning and/or variable selection. It consists of an inner and an outer CV. The inner CV is used for either parameter tuning or variable selection, while the outer CV is used for model validation.
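The bootstrap and K-fold procedures listed above can be sketched as follows. The helper names, the toy majority-class "model", and the data are assumptions chosen only to keep the example self-contained; any `fit` and `score` functions with the same shape could be substituted.

```python
import random

def kfold_indices(n, k, seed=0):
    """Split range(n) into k shuffled folds of near-equal size."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, fit, score, k=5, seed=0):
    """Train on k-1 folds and test on the held-out fold, once per fold."""
    scores = []
    for held_out in kfold_indices(len(data), k, seed):
        held = set(held_out)
        train = [row for i, row in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        scores.append(score(fit(train), test))
    return scores

def bootstrap_validate(data, fit, score, m=100, seed=0):
    """Fit on each of m bootstrap samples and score on the full data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(m):
        sample = rng.choices(data, k=len(data))  # sampling with replacement
        scores.append(score(fit(sample), data))
    return scores

# Toy "model" (an assumption for illustration): predict the majority class.
def fit_majority(rows):
    positives = sum(y for _, y in rows)
    return 1 if 2 * positives > len(rows) else 0

def accuracy(model, rows):
    return sum(1 for _, y in rows if y == model) / len(rows)

data = [(x, 1 if x >= 70 else 0) for x in range(100)]  # 30% positives
cv_scores = cross_validate(data, fit_majority, accuracy, k=5)
boot_scores = bootstrap_validate(data, fit_majority, accuracy, m=50)
```

The spread of `cv_scores` and `boot_scores`, and not only their mean, gives a sense of how much performance varies with the sample used for fitting.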
With some modifications, both bootstrapping and cross-validation can simultaneously achieve three different objectives: model validation, variable selection, and parameter tuning (grid-search).
| Design Framework | Execution Complexity | Technique | Optimisation Parameters | CV folds | CV repeats |
|---|---|---|---|---|---|
| Variable selection | 1 | 1-D grid-search CV | S* | K | N |
| Parameter tuning | 1 | 1-D grid-search CV | P** | K | N |
| | 2 | 2-D grid-search CV | (S, P) | K | N |
| | 2 | 1-D grid-search Nested-CV | S | K1, K2 | N1, N2 |
| | 2 | 1-D grid-search Nested-CV | P | K1, K2 | N1, N2 |
| | 3 | 2-D grid-search Nested-CV | (S, P) | K1, K2 | N1, N2 |
Table 2. Grid-search and CV for validation, selection, and tuning
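As a rough illustration of a "1-D grid-search Nested-CV" design, the sketch below tunes a single parameter P in an inner CV and validates the tuned model in an outer CV. It assumes a toy one-dimensional k-nearest-neighbour classifier whose neighbour count plays the role of P; the data, the grid, and all function names are hypothetical, and CV repeats are omitted for brevity.

```python
import random

def make_folds(indices, k, rng):
    """Shuffle the given indices and deal them into k folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return 1 if 2 * sum(y for _, y in nearest) > len(nearest) else 0

def accuracy(train, test, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def cv_score(data, folds, param):
    """Mean held-out accuracy of `param` over the given folds."""
    scores = []
    for j, held_out in enumerate(folds):
        train = [data[i] for m, fold in enumerate(folds) if m != j for i in fold]
        test = [data[i] for i in held_out]
        scores.append(accuracy(train, test, param))
    return sum(scores) / len(scores)

def nested_cv(data, grid, k_outer=5, k_inner=3, seed=0):
    """Outer CV (K1 folds) validates; inner CV (K2 folds) tunes the parameter."""
    rng = random.Random(seed)
    outer_folds = make_folds(range(len(data)), k_outer, rng)
    results = []
    for j, held_out in enumerate(outer_folds):
        outer_train = [i for m, fold in enumerate(outer_folds) if m != j for i in fold]
        # Inner loop: grid search that never sees the outer held-out fold.
        inner_folds = make_folds(outer_train, k_inner, rng)
        best = max(grid, key=lambda p: cv_score(data, inner_folds, p))
        train = [data[i] for i in outer_train]
        test = [data[i] for i in held_out]
        results.append((best, accuracy(train, test, best)))
    return results

# Hypothetical noisy data: label is 1 above 0.5, with about 10% of labels flipped.
rng = random.Random(42)
data = []
for _ in range(150):
    x = rng.random()
    y = 1 if x > 0.5 else 0
    data.append((x, 1 - y if rng.random() < 0.1 else y))

results = nested_cv(data, grid=[1, 5, 15])  # list of (chosen P, outer accuracy)
```

Keeping the grid search inside the outer training folds is the point of the design: the outer accuracy then estimates the performance of the whole tuning procedure, not of one hand-picked parameter value.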
Modeling Unbalanced Data
Model accuracy – the ratio of correct predictions to the total number of cases – is a typical measurement used to assess model performance. However, assessing model performance solely by accuracy can itself present problems, such as the accuracy paradox. For example, assume we have an unbalanced training dataset with a very small percentage of the target population (1%) for whom we predict fraud or other catastrophic events. Even without a predictive model, just by always making the same guess of “no fraud” or “no catastrophe”, we reach 99% accuracy! Impressive! However, such a strategy would have a 100% miss rate, meaning that we still need a predictive model to either reduce the miss rate (false negative, a “type II error”) or to reduce false alarms (false positive, a “type I error”).
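The arithmetic behind the accuracy paradox can be checked directly. The portfolio size below is a hypothetical choice; only the 1% target rate comes from the example above.

```python
# Hypothetical portfolio: 10,000 cases, 1% of which are fraud (label 1).
actual = [1] * 100 + [0] * 9900
predicted = [0] * len(actual)  # always guess "no fraud"

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)   # share of correct predictions
miss_rate = fn / (fn + tp)           # false negative rate (type II error)
false_alarm_rate = fp / (fp + tn)    # false positive rate (type I error)

print(accuracy, miss_rate, false_alarm_rate)  # 0.99 1.0 0.0
```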
The right performance measure depends on business objectives. Some cases require minimizing miss rate, while others are more focused on minimizing false alarms, especially if customer satisfaction is the primary aim. Based on the overall objective, data scientists need to identify the best methodology to build and evaluate a model using unbalanced data.
Unbalanced data may be a problem when using machine learning algorithms since these datasets could have insufficient information about the minority class. This is because algorithms based on minimizing the overall error are biased towards the majority class, neglecting the contribution of the cases we’re more interested in.
Two general techniques used to combat unbalanced data modelling issues are sampling and ensemble modelling.
Sampling methods are further classified into undersampling and oversampling techniques. Undersampling involves removing cases from the majority class and keeping the complete minority population. Oversampling is the process of replicating the minority class to balance the data. Both aim to create balanced training data so the learning algorithms can produce less biased results. Both techniques have potential disadvantages; undersampling may lead to information loss, while oversampling can lead to overfitting.
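A minimal sketch of random undersampling and random oversampling is shown below. It assumes rows are `(features, label)` tuples with label 1 marking the minority class, and it balances the classes to 50/50; the function names and the toy data are illustrative only.

```python
import random

def undersample(rows, seed=0):
    """Keep every minority row and an equal-sized random subset of majority rows."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    return minority + rng.sample(majority, len(minority))

def oversample(rows, seed=0):
    """Keep every row and replicate minority rows until the classes are equal."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    # Replicated rows are exact copies drawn with replacement.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return rows + extra

# Hypothetical data: 50 minority cases out of 1,000.
rows = [((i,), 1) for i in range(50)] + [((i,), 0) for i in range(50, 1000)]
under = undersample(rows)  # 100 rows: the information in 900 majority rows is discarded
over = oversample(rows)    # 1,900 rows: each minority row appears many times
```

The two result sizes make the trade-off visible: undersampling throws data away, while oversampling repeats the same few minority cases, which is what invites overfitting.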
A popular modification of the oversampling technique, developed to minimize overfitting, is the synthetic minority oversampling technique (SMOTE), which creates synthetic minority cases using another learning technique, usually the k-nearest neighbours (KNN) algorithm. As a rule of thumb, if a large number of observations is available, use undersampling; otherwise, oversampling is the better choice.
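The idea behind SMOTE can be sketched as interpolation between a minority case and one of its nearest minority neighbours. The code below is a simplified illustration under that assumption, not the reference algorithm or any library's implementation; the function name, `k=5`, and the random toy points are hypothetical.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority class: 20 points with two numeric features.
rng = random.Random(7)
minority = [(rng.random(), rng.random()) for _ in range(20)]
synthetic = smote(minority, n_synthetic=30)
```

Unlike plain replication, the synthetic points are new values lying between existing minority cases, which is why the approach is less prone to memorising individual observations.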
The following steps outline a simple development example using the undersampling technique:
1. Create a balanced training view by selecting all “bad” cases and a random sample of “good” cases in a set proportion, for example 35%/65%, respectively. If there are enough “bad” cases, undersample from an unbalanced training partition; otherwise, use the entire population to undersample.
2. Select the best set of predictors using the usual modeling steps:
   a. Selection of candidate variables
   b. Fine classing
   c. Coarse classing with optimal binning
   d. Weight of evidence (WOE) or dummy transformations
   e. Stepwise logistic regression model
3. If not created in step 1, partition the full unbalanced dataset into train and test partitions, for example 70% in the training partition and 30% in the testing partition. Keep the ratio of the minority class the same in both partitions.
4. Train the model on the training partition, using the variables selected by the stepwise method in step 2e.
5. Validate the model on the testing partition.
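The first step of the list above, building a balanced training view at a chosen “bad”/“good” proportion, can be sketched as follows. The function name `balanced_view`, the row layout `(id, label)` with label 1 for “bad”, and the portfolio size are assumptions for illustration; the 35%/65% split is the example proportion from the text.

```python
import random

def balanced_view(rows, bad_share=0.35, seed=0):
    """All "bad" rows (label 1) plus a random sample of "good" rows (label 0)."""
    rng = random.Random(seed)
    bad = [r for r in rows if r[1] == 1]
    good = [r for r in rows if r[1] == 0]
    # Number of goods needed so that bads make up `bad_share` of the view.
    n_good = round(len(bad) * (1 - bad_share) / bad_share)
    n_good = min(n_good, len(good))  # cannot sample more goods than exist
    return bad + rng.sample(good, n_good)

# Hypothetical portfolio: 350 "bad" cases among 10,000 (a 3.5% bad rate).
rows = [(i, 1) for i in range(350)] + [(i, 0) for i in range(350, 10000)]
view = balanced_view(rows)

bad_rate = sum(label for _, label in view) / len(view)
print(len(view), bad_rate)  # 1000 0.35
```

Variable selection would then run on `view`, while the final model is trained and validated on partitions of the full unbalanced data, as the later steps describe.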
Ensemble modeling is an alternative approach to modelling unbalanced data. Bagging and boosting are typical techniques used to make stronger predictors and overcome overfitting without using undersampling or oversampling. Bagging (bootstrap aggregation) creates different bootstrap samples with replacement, trains the model on each sample, and averages the prediction results. Boosting works by gradually building a stronger predictor in each iteration, learning from the errors made in the previous iteration.
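Bagging, as described above, can be illustrated with a very small base learner. The sketch below assumes a one-feature decision stump (a single threshold) as the model and averages the stump votes into a score; the names, the noisy toy data, and the choice of 25 models are hypothetical. Boosting is not shown.

```python
import random

def fit_stump(rows):
    """Pick the threshold on a 1-D feature that minimises training errors."""
    best_threshold, best_errors = None, len(rows) + 1
    for threshold, _ in rows:
        errors = sum((x >= threshold) != bool(y) for x, y in rows)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging(rows, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(rows, k=len(rows))) for _ in range(n_models)]

def predict_proba(thresholds, x):
    """Average the 0/1 votes of the individual stumps."""
    return sum(x >= t for t in thresholds) / len(thresholds)

# Hypothetical noisy data: label is 1 above 0.5, with about 10% of labels flipped.
rng = random.Random(1)
rows = []
for _ in range(200):
    x = rng.random()
    y = 1 if x > 0.5 else 0
    rows.append((x, 1 - y if rng.random() < 0.1 else y))

models = bagging(rows)
score = predict_proba(models, 0.55)  # share of stumps voting "1" near the boundary
```

Each stump sees a slightly different sample, so their thresholds differ; averaging the votes smooths out the noise any single stump would have fitted.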
As discussed above, accuracy is not the preferred metric for unbalanced data since it considers only correct predictions. By considering correct and incorrect results simultaneously, we can gain more insight into the classification model. In such cases, the useful performance measures are sensitivity (also known as recall, hit rate, probability of detection, or true positive rate), specificity (true negative rate), and precision.
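These three measures follow directly from the confusion matrix counts. The sketch below uses a hypothetical set of 100 cases and assumes label 1 marks the positive (“bad”) class and that no denominator is zero.

```python
def classification_metrics(actual, predicted):
    """Sensitivity, specificity, and precision from 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # share of flagged cases that are truly positive
    }

# Hypothetical results: 10 positives (8 caught, 2 missed), 90 negatives (9 false alarms).
actual = [1] * 10 + [0] * 90
predicted = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81
metrics = classification_metrics(actual, predicted)
```

In this example sensitivity is 8/10 and specificity is 81/90, yet precision is only 8/17: with few positives, even a modest false alarm rate produces more false alarms than hits.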
In addition to these three scalar metrics, another popular measure that dominates across the industry is the ROC curve. The ROC curve is independent of the proportion of “bad” vs. “good” cases, which is an important feature, especially for unbalanced data. When there are enough “bad” cases, the standard modelling methodology can be applied in place of unbalanced data methods, and the resulting model tested using the ROC curve.
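One way to see this independence is through the area under the ROC curve (AUC), which equals the probability that a randomly chosen “bad” case receives a higher score than a randomly chosen “good” case. The sketch below computes the AUC by comparing every bad/good pair; the scores are hypothetical, and it assumes a higher score means higher risk.

```python
def roc_auc(scores_bad, scores_good):
    """Share of (bad, good) pairs in which the bad case scores higher; ties count half."""
    wins = 0.0
    for b in scores_bad:
        for g in scores_good:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(scores_bad) * len(scores_good))

# Hypothetical model scores.
scores_bad = [0.9, 0.8, 0.35, 0.6]
scores_good = [0.1, 0.4, 0.3, 0.7, 0.2]

auc = roc_auc(scores_bad, scores_good)
# Make the data far more unbalanced by repeating the good cases 20 times.
auc_more_goods = roc_auc(scores_bad, scores_good * 20)

print(auc, auc_more_goods)  # 0.85 0.85
```

Changing the class proportion from 4:5 to 4:100 leaves the AUC unchanged, because the measure depends only on how the two score distributions are ordered relative to each other.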
Credit scoring is a dynamic, flexible, and powerful tool for lenders, but there are plenty of ins and outs that are worth covering in detail. To learn more about credit scoring and credit risk mitigation techniques, read the next installment of our credit scoring series, Part Eight: Credit Risk Strategies.