Handling Overfitting

One-in-ten rule

In statistics, the one in ten rule is a rule of thumb for how many predictor parameters can be estimated from data in regression analysis (in particular proportional hazards models in survival analysis and logistic regression) while keeping the risk of overfitting low. The rule states that one predictive variable can be studied for every ten events. For logistic regression, the number of events is the size of the smallest outcome category (for a binary outcome, the less frequent of the two); for survival analysis, it is the number of uncensored events.

Example

For example, if a sample of 200 patients is studied and 20 patients die during the study (so that 180 patients survive), the one in ten rule implies that two pre-specified predictors can reliably be fitted to the data. Similarly, if 100 patients die during the study (so that 100 patients survive), ten pre-specified predictors can be fitted reliably. If more are fitted, the rule implies that overfitting is likely and the model will not predict well outside the training data. The 1:10 rule is often violated in fields with many variables (e.g. gene expression studies in cancer), which decreases confidence in the reported findings.
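
The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration for a binary outcome (as in logistic regression); the function name max_predictors and its parameters are hypothetical and chosen here for clarity, not taken from any library.

```python
def max_predictors(n_events: int, n_non_events: int,
                   events_per_predictor: int = 10) -> int:
    """Rough upper bound on pre-specified predictors for a binary outcome.

    Assumption: the limiting sample size is the smaller outcome category,
    as the rule states for logistic regression. For survival analysis the
    number of uncensored events would be used directly instead.
    """
    limiting = min(n_events, n_non_events)
    # Integer division: a partial predictor does not count.
    return limiting // events_per_predictor


# The two cases from the text: 20 deaths / 180 survivors, 100 / 100.
print(max_predictors(20, 180))   # 2
print(max_predictors(100, 100))  # 10
```

The events_per_predictor argument is left adjustable so that a stricter ratio can be tried without changing the function.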

Stricter variants have also been suggested: a "one in 20 rule", which indicates the need for shrinkage of regression coefficients, and a "one in 50 rule" for stepwise selection with the default p-value threshold of 5%.

Cross Validation

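In standard cross-validation, say k-fold, we shuffle the data and split it into k chunks (folds) of roughly equal size. We then train the model on k-1 folds and test it on the remaining fold, repeating this k times so that each fold serves as the test set exactly once. The k test scores are then combined, usually by averaging, to estimate how well the model performs on unseen data.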
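
The k-fold procedure can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names k_fold_indices and cross_validate are hypothetical, and the "model" in the usage example is simply the mean of the training targets.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(xs, ys, k, fit, score, seed=0):
    """Average the test score over the k folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)


# Toy usage (assumed data): predict every target by the training mean and
# score with mean squared error on the held-out fold.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
print(cross_validate(xs, ys, k=5, fit=fit_mean, score=mse))
```

Shuffling once before splitting, rather than re-drawing a random split on every repetition, is what guarantees that the test folds do not overlap.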