Machine Learning for Credit Risk Assessment: Credit Default Prediction Model

Late credit card payments disrupt cash flow and increase bad debt. This project aims to build a classification model to identify high-risk customers early, enabling proactive collection strategies.

🎯 The Objective

Develop a model that predicts the Unpaid Tagging label (Default) with a Recall above 60%.

⚠️ Why Recall?

Minimizing False Negatives is crucial. We must catch as many at-risk customers as possible, even if it means flagging some safe ones.
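
For reference, Recall is the share of actual defaulters that the model flags: TP / (TP + FN), where a false negative (FN) is a defaulter the model missed. The snippet below is a minimal sketch in plain Python. The labels are made up for illustration (1 = default, 0 = paid on time) and are not project data.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN): share of actual positives that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true")
    return tp / (tp + fn)


# Hypothetical labels, not project data: 1 = default, 0 = paid on time.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

example_recall = recall(y_true, y_pred)  # 3 of the 5 defaulters are caught
example_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"recall={example_recall:.2f}, accuracy={example_accuracy:.2f}")
```

In this toy case accuracy is higher than Recall, which is why accuracy alone can hide missed defaulters.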

🧪 Experimental Approach

To find the best predictor, I ran two experiments, each comparing Logistic Regression, Gradient Boosting, and Random Forest:

  • Experiment 1 (Annual Review): Analyzing behavior over the last 12 months (Q1-Q4).
  • Experiment 2 (Semester Review): Focusing on recent behavior over the last 6 months (Q3-Q4).
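
The two experiments could be set up roughly as follows. This is a sketch under stated assumptions, not the project's actual pipeline: it assumes scikit-learn, uses synthetic data with one hypothetical feature per quarter, and leaves every model at near-default settings. The real feature names, preprocessing, and hyperparameters are not given here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: one hypothetical feature per quarter (Q1..Q4).
n = 600
X_all = rng.normal(size=(n, 4))
# Assumption for the toy data only: default depends mostly on recent quarters,
# and about 25% of customers default.
score = 0.5 * X_all[:, 2] + 1.0 * X_all[:, 3] + rng.normal(scale=1.0, size=n)
y = (score > np.quantile(score, 0.75)).astype(int)

# Column indices used by each experiment.
experiments = {
    "Experiment 1 (Q1-Q4)": [0, 1, 2, 3],
    "Experiment 2 (Q3-Q4)": [2, 3],
}
# Factories so each experiment trains a fresh model.
models = {
    "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
    "Gradient Boosting": lambda: GradientBoostingClassifier(random_state=0),
    "Random Forest": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for exp_name, cols in experiments.items():
    X = X_all[:, cols]
    # Stratify so train and test keep the same share of defaulters.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    for model_name, make_model in models.items():
        model = make_model().fit(X_train, y_train)
        results[(exp_name, model_name)] = recall_score(y_test, model.predict(X_test))

for (exp_name, model_name), rec in results.items():
    print(f"{exp_name:22s} {model_name:20s} test recall = {rec:.3f}")
```

The numbers this prints come from random toy data and say nothing about the real results reported in this README.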

📊 Feature Importance (Gradient Boosting)

[Figure: Feature Importance Plot]

📉 Model Performance Results

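| Algorithm                 | Test Accuracy | Test Recall | Validation Recall |
|---------------------------|---------------|-------------|-------------------|
| Logistic Regression       | 77.7%         | 43.5%       | 26.2%             |
| Gradient Boosting (Exp 1) | 68.6%         | 60.6% ✅    | 44.9% 📉          |
| Random Forest             | 80.8%         | 34.0%       | 35.0%             |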

💡 Evaluation & Next Steps

Key Insight: Vintage_CR (Credit Card Tenure) and Delta Balance (Balance Fluctuation) are the strongest predictors in the Gradient Boosting model. This suggests that how long a customer has held the card and how sharply their balance changes carry the most signal about default risk.

The Challenge: While Gradient Boosting reached the 60% Recall target on the Test data (60.6%), its Recall fell to ~45% on the Validation set. The model does not yet generalize well to completely unseen data, which points to potential overfitting.
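
The size of that drop can be checked directly from the reported figures. The sketch below only does arithmetic on the Recall values from the results table. The `TARGET_RECALL` name and the "meets target on both sets" check are my own framing of the objective, not part of the original evaluation.

```python
# Recall figures copied from the results table (percent).
reported = {
    "Logistic Regression": {"test": 43.5, "validation": 26.2},
    "Gradient Boosting (Exp 1)": {"test": 60.6, "validation": 44.9},
    "Random Forest": {"test": 34.0, "validation": 35.0},
}

TARGET_RECALL = 60.0  # objective: Recall > 60%

gaps = {}
for name, r in reported.items():
    # Positive gap = Recall lost when moving from Test to Validation.
    gaps[name] = round(r["test"] - r["validation"], 1)
    meets_target = r["test"] > TARGET_RECALL and r["validation"] > TARGET_RECALL
    print(f"{name:26s} gap = {gaps[name]:+5.1f} pts, meets target on both sets: {meets_target}")
```

On these figures, Gradient Boosting loses about 15.7 points between Test and Validation and Logistic Regression about 17.3. Random Forest is stable, but well below the target on both sets.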

Optimization Plan: To improve robustness, the next iteration will focus on:

  • Oversampling (SMOTE): To handle the class imbalance (defaulters are the minority class).
  • Feature Selection: Removing low-impact variables to reduce noise.
  • Extending Data Horizon: Using more than 1 year of historical data for better trend capture.
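
To illustrate the oversampling idea, here is a simplified SMOTE-style sketch in NumPy. It is not the reference SMOTE implementation and not the code planned for this project. It only shows the core step: each synthetic defaulter is an interpolation between a real minority sample and one of its nearest minority neighbours. The function name `smote_like_oversample`, the class sizes, and the three-feature toy data are all hypothetical.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base_idx = rng.integers(0, n, size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    lam = rng.random(size=(n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base_idx] + lam * (X_min[neigh_idx] - X_min[base_idx])


# Hypothetical imbalanced training split: 90 non-defaulters, 10 defaulters.
rng = np.random.default_rng(1)
X_major = rng.normal(loc=0.0, size=(90, 3))
X_minor = rng.normal(loc=2.0, size=(10, 3))

# Add enough synthetic defaulters to balance the two classes.
X_synth = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_train = np.vstack([X_major, X_minor, X_synth])
y_train = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(X_synth)))
print(X_train.shape, y_train.mean())
```

Oversampling of this kind should be applied to the training split only. The Test and Validation sets need to keep their original class ratio so that Recall is measured on realistic data. In practice a maintained implementation, such as the `SMOTE` class in imbalanced-learn, would likely be a better choice than hand-rolled code.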