
How to Select Variables Robustly in a Scoring Model | Towards Data Science

Towards Data Science·JUNIOR JUMBONG·about 1 month ago

… fail for one reason: bad variable selection. You pick variables that work on your training data, they fall apart on new data, and the model that looked great in development breaks in production.

There is a better way. This article shows you how to select variables that are stable, interpretable, and robust, no matter how you split the data.

The Core Idea: Stability Over Performance

A variable is robust if it matters on every subset of your data, not just on the full dataset. To check this, we split the training data into 4 folds using stratified cross-validation. We stratify on the def_year column, which reflects both the default variable and the year, so that each fold is representative of the full population.

    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

    # -1 marks rows that have not been assigned to a fold yet
    train_imputed["fold"] = -1
    fold_col = train_imputed.columns.get_loc("fold")

    # split() yields positional indices, so assign with .iloc rather than .loc;
    # .loc would only be correct if the index happened to be 0..n-1
    for fold, (_, test_idx) in enumerate(skf.split(train_imputed, train_imputed["def_year"])):
        train_imputed.iloc[test_idx, fold_col] = fold

We then build four pairs (train, test).…
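The excerpt stops before showing how the four (train, test) pairs are built, so here is a minimal sketch of one way it could be done. Everything in it beyond `StratifiedKFold` and the `fold` and `def_year` column names is an assumption made for illustration: the toy data, the `default` and `year` columns, the way `def_year` is formed by joining them, and the helper names `assign_folds` and `build_pairs` are hypothetical and not taken from the article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def assign_folds(df, strat_col, n_splits=4, seed=42):
    """Return a copy of df with a 'fold' column holding values 0..n_splits-1."""
    out = df.copy()
    out["fold"] = -1
    fold_pos = out.columns.get_loc("fold")
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (_, test_idx) in enumerate(skf.split(out, out[strat_col])):
        # test_idx holds positional indices, hence .iloc
        out.iloc[test_idx, fold_pos] = fold
    return out


def build_pairs(df, n_splits=4):
    """Return one (train, test) pair per fold: fold k is held out as test."""
    return [(df[df["fold"] != k], df[df["fold"] == k]) for k in range(n_splits)]


# Toy data standing in for train_imputed (hypothetical columns and values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "default": rng.binomial(1, 0.2, size=400),
    "year": rng.choice([2019, 2020, 2021], size=400),
})
# Assumed construction of the stratification key: default flag joined with year
toy["def_year"] = toy["default"].astype(str) + "_" + toy["year"].astype(str)

toy_folds = assign_folds(toy, "def_year")
pairs = build_pairs(toy_folds)

for k, (train, test) in enumerate(pairs):
    print(k, len(train), len(test), round(test["default"].mean(), 3))
```

Each row lands in exactly one test set, so a variable can be evaluated four times on four different training subsets. Printing the default rate per test fold is a quick check that the stratification kept the folds comparable.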
