How to Select Variables Robustly in a Scoring Model | Towards Data Science


Towards Data Science·JUNIOR JUMBONG·about 1 month ago

#editorspicks #deepdives #newsletter #creditscoring #dataanalysis #variables


…fail for one reason: bad variable selection. You pick variables that work on your training data, they fall apart on new data, and the model that looked great in development breaks in production.

There is a better way. This article shows you how to select variables that are stable, interpretable, and robust, no matter how you split the data.

The Core Idea: Stability Over Performance

A variable is robust if it matters on every subset of your data, not just on the full dataset.

To check this, we split the training data into 4 folds using stratified cross-validation. We stratify by the default variable and the year (the def_year column in the code) to ensure each fold is representative of the full population.

    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
    train_imputed["fold"] = -1  # -1 marks rows not yet assigned to a fold
    for fold, (_, test_idx) in enumerate(skf.split(train_imputed, train_imputed["def_year"])):
        # split() yields positional indices; map them to index labels before using .loc
        train_imputed.loc[train_imputed.index[test_idx], "fold"] = fold

We then build four pairs (train, test).…
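The fold assignment and the four (train, test) pairs described above can be sketched end to end as follows. This is a minimal illustration, not the author's code. The synthetic data, the column names `default`, `year`, and `x1`, the string concatenation used to build `def_year`, and the helper `build_fold_pairs` are all assumptions made here. It also assumes that each pair uses one fold as the test set and the other three as the training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for train_imputed (assumption: the real data has a
# binary default flag and an observation year; names here are hypothetical).
rng = np.random.default_rng(0)
n = 400
train_imputed = pd.DataFrame({
    "default": rng.integers(0, 2, size=n),
    "year": rng.choice([2019, 2020, 2021], size=n),
    "x1": rng.normal(size=n),
})

# Assumed construction of the stratification key: one class per
# (default, year) combination.
train_imputed["def_year"] = (
    train_imputed["default"].astype(str) + "_" + train_imputed["year"].astype(str)
)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
train_imputed["fold"] = -1
for fold, (_, test_idx) in enumerate(skf.split(train_imputed, train_imputed["def_year"])):
    # test_idx is positional, so translate it to index labels for .loc
    train_imputed.loc[train_imputed.index[test_idx], "fold"] = fold


def build_fold_pairs(df, fold_col="fold"):
    """Return one (train, test) pair per fold: the fold is the test set,
    all remaining folds form the training set."""
    pairs = []
    for k in sorted(df[fold_col].unique()):
        test = df[df[fold_col] == k]
        train = df[df[fold_col] != k]
        pairs.append((train, test))
    return pairs


pairs = build_fold_pairs(train_imputed)
```

Because the key mixes default status and year, each test fold should carry roughly the same default rate and year mix as the full sample. That matters when comparing how a variable behaves from one pair to the next.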
