Machine Learning for Credit Risk Assessment: A Credit Default Prediction Model
🎯 The Objective
Develop a model to predict Unpaid Tagging (Default) with a target Recall > 60%.
⚠️ Why Recall?
Minimizing false negatives is crucial: a missed defaulter costs far more than a false alarm, so we must catch as many at-risk customers as possible, even if it means flagging some safe ones.
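To make the metric concrete, recall is the share of actual defaulters the model catches: Recall = TP / (TP + FN). A minimal scikit-learn sketch with toy labels (not the project data):

```python
from sklearn.metrics import recall_score

# Toy labels: 1 = default, 0 = paid.
# Recall = TP / (TP + FN): the share of actual defaulters the model flags.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(recall_score(y_true, y_pred))  # catches 3 of 4 defaulters -> 0.75
```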
🧪 Experimental Approach
To find the best predictor, I ran two experiments, each comparing Logistic Regression, Gradient Boosting, and Random Forest (a minimal training sketch follows the list):
- Experiment 1 (Annual Review): Analyzing behavior over the last 12 months (Q1-Q4).
- Experiment 2 (Semester Review): Focusing on recent behavior over the last 6 months (Q3-Q4).
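As a rough illustration of the setup, the sketch below trains the three candidate models and scores test recall. The data is synthetic and stands in for the quarterly behavioural features (Q1-Q4 or Q3-Q4); the hyperparameters shown are library defaults, not the tuned values used in the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer feature matrix; defaulters are the minority class.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
}

# Fit each candidate and report recall on the held-out test split.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test recall:", recall_score(y_test, model.predict(X_test)))
```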
📉 Model Performance Results
| Algorithm | Test Accuracy | Test Recall | Validation Recall |
|---|---|---|---|
| Logistic Regression | 77.7% | 43.5% | 26.2% |
| Gradient Boosting (Exp 1) | 68.6% | 60.6% ✅ | 44.9% 📉 |
| Random Forest | 80.8% | 34.0% | 35.0% |
💡 Evaluation & Next Steps
Key Insight: Vintage_CR (Credit Card Tenure) and Delta Balance (Balance Fluctuation) are the strongest predictors. This suggests that how long a customer has been with us and how drastically their balance changes are the biggest indicators of default risk.
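A sketch of how such a ranking can be read off a fitted gradient boosting model via impurity-based feature importances; the feature names and data here are illustrative stand-ins, not the project dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature names; "Vintage_CR" and "Delta_Balance" stand in for
# the real tenure and balance-fluctuation features.
feature_names = ["Vintage_CR", "Delta_Balance", "Utilization", "Limit", "Age"]
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=feature_names)

model = GradientBoostingClassifier().fit(X, y)

# Rank features by the model's impurity-based importance scores.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```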
The Challenge: While Gradient Boosting achieved the 60% Recall target on the Test data, it dropped to ~45% on the Validation set. This indicates the model struggles to generalize to completely unseen data (Potential Overfitting).
Optimization Plan:
To improve robustness, the next iteration will focus on:
- Oversampling (SMOTE): to handle the class imbalance, since defaulters are the minority class (see the sketch after this list).
- Feature Selection: Removing low-impact variables to reduce noise.
- Extending Data Horizon: Using more than 1 year of historical data for better trend capture.
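A minimal sketch of the planned SMOTE step, assuming imbalanced-learn (imblearn) is available. Wrapping SMOTE in a pipeline keeps the oversampling restricted to the training data, so the test set keeps its natural class balance; data and parameters are illustrative.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the credit portfolio.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE is applied only when fitting on the training data inside the pipeline.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("model", GradientBoostingClassifier()),
])
pipeline.fit(X_train, y_train)
print("Test recall:", recall_score(y_test, pipeline.predict(X_test)))
```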