Beyond Scores: Transaction-Level Modeling for Credit Risk

Machine Learning and NLP for Optimized Risk Assessment

By Yujie Kang, Ethan Flores, Jason Kim, Aile Banuelos

Mentors: Kyle Nero, Daniel Mathew

View on GitHub

Introduction

Traditional credit risk models rely heavily on credit bureau information such as payment history, credit utilization, and tradelines. While these systems are effective for consumers with established credit histories, they provide limited information for individuals with thin or nonexistent credit files. As a result, many consumers are evaluated on information that does not accurately reflect their true financial behavior.


This project investigates whether transaction-level financial data can be used to estimate credit risk directly. By analyzing behavioral signals such as income patterns, spending activity, and balance stability, we construct a behavior-based probability-of-default model that predicts delinquency risk from observed financial activity.

Results

Feature Summary

The final model used 50 engineered features covering liquidity, cashflow stability, spending patterns, and distress indicators such as overdraft-related activity. The table below summarizes how these features are distributed across behavioral categories.

Distribution of the Top 50 Selected Features by Behavioral Category
Feature Category        Number of Features
Category                24
Transaction              9
Balance                  8
Income & Spending        2
Low Balance Risk         2
Overdraft & Fees         2
Cashflow                 1
Income Regularity        1
Multi-Account            1
Total                   50

Model Performance

We evaluate several machine learning models trained on the engineered behavioral features. Because the delinquency rate is relatively low, we focus on ROC-AUC and F1 rather than accuracy.

Final model performance comparison
Model                 ROC-AUC  Precision  Recall   F1      Training Time (s)  Prediction Time (ms)
Logistic Regression   0.7525   0.1583     0.7159   0.2593   0.067              0.887
XGBoost               0.8061   0.1903     0.7386   0.3027   0.260              0.166
LightGBM              0.7637   0.1609     0.6648   0.2591   0.580              0.187
Random Forest         0.8001   0.1696     0.7727   0.2781   0.608             48.161
Gradient Boosting     0.8129   0.2328     0.6364   0.3409  46.209              9.758

Analysis:

The table above compares model performance on the engineered features. Logistic Regression serves as a baseline (ROC-AUC 0.7525) with relatively high recall but low precision. Tree-based models (Random Forest, XGBoost, Gradient Boosting) outperform the linear baseline. Gradient Boosting attains the best overall balance (ROC-AUC 0.8129, F1 0.3409), with the highest precision and a reasonable prediction time, making it preferable for production deployment where false positives are costly.
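The metrics reported in the table can be reproduced with scikit-learn. The sketch below is illustrative only: it trains a Gradient Boosting model on synthetic imbalanced data (the project's engineered features are not public) and times fitting and prediction the way the table's last two columns do.

```python
# Sketch of the evaluation loop behind the comparison table, on synthetic
# data with roughly the same class imbalance (~8% positives).
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, weights=[0.92],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = GradientBoostingClassifier(random_state=42)

start = time.perf_counter()
model.fit(X_train, y_train)
train_time = time.perf_counter() - start  # "Training Time (s)"

start = time.perf_counter()
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)
predict_ms = (time.perf_counter() - start) * 1000  # "Prediction Time (ms)"

print(f"ROC-AUC:   {roc_auc_score(y_test, proba):.4f}")
print(f"Precision: {precision_score(y_test, pred):.4f}")
print(f"Recall:    {recall_score(y_test, pred):.4f}")
print(f"F1:        {f1_score(y_test, pred):.4f}")
```

Note that ROC-AUC is computed from the predicted probabilities, while precision, recall, and F1 depend on the 0.5 threshold; moving that threshold trades recall against precision.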

ROC Curve Comparison

ROC curves comparing models
ROC curves comparing the evaluated models. The Gradient Boosting model achieves the strongest separation.

The ROC curves confirm the numeric results: Gradient Boosting maintains the highest true positive rate across most false positive rates, consistent with its higher ROC-AUC.

Confusion Matrix Comparison

Confusion matrices for XGBoost and Gradient Boosting
Confusion matrices for XGBoost and Gradient Boosting on the test dataset.

The confusion matrices show that XGBoost achieves slightly higher recall for delinquency but at the cost of more false positives. Gradient Boosting produces fewer false positives while still identifying a substantial portion of delinquent consumers, explaining its stronger F1 score.
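For readers comparing the matrices above, scikit-learn's convention puts actual classes on rows and predicted classes on columns. A minimal sketch with stand-in labels (the real test predictions are not reproduced here):

```python
# How the confusion matrices above are produced and read with scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix

# y_true / y_pred stand in for the test labels and model predictions.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# Rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"recall = {tp / (tp + fn):.2f}, precision = {tp / (tp + fp):.2f}")
```

In these terms, XGBoost's higher recall means a larger TP count at the cost of a larger FP count, while Gradient Boosting shrinks FP while keeping TP high enough to win on F1.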

Model Interpretation Using SHAP

SHAP summary plot
SHAP summary plot showing the most influential features for the Gradient Boosting model.

SHAP analysis highlights features related to account balances and financial stability as key predictors. Lower liquid balances and more time spent below balance thresholds are associated with higher predicted delinquency risk, indicating short-term liquidity and balance stability are important behavioral signals.

Limitations

While the models demonstrate predictive signal, there are important limitations to note. The model can overfit to the training data, especially given the class imbalance and the large set of engineered features. We mitigated this with cross-validation, regularization, and early stopping, but overfitting and optimistic performance estimates remain possible without stronger temporal holdouts or external validation.

Additional limitations include dataset representativeness (results may not generalize to different populations or institutions), potential label noise, and risks of temporal leakage. Before production deployment we recommend further external validation, calibration of predicted probabilities, and monitoring for data drift.

Methods

Data Description

The dataset used in this project contains financial activity data that captures how consumers manage their money over time. Rather than relying solely on traditional credit bureau information, the dataset provides detailed behavioral signals through bank account balances and transaction histories. These signals allow us to observe patterns such as income stability, spending behavior, and liquidity risk. The data is organized into four primary tables that together describe consumer financial activity:


Dataset structure diagram
Figure 1. Core Dataset Structure

Exploratory Data Analysis

Exploratory analysis was conducted to understand the behavioral patterns present in the financial data before feature engineering and modeling. The analysis focuses on balance dynamics, population delinquency rates, and account composition across consumers.


Balance over time by delinquency status
Figure 2. Median account balance trends for delinquent and non-delinquent consumers
Consumer population distribution
Figure 3. Delinquency Distribution
Account type distribution
Figure 4. Financial Account Distribution

Exploratory analysis reveals several behavioral patterns within the dataset. Consumers who eventually became delinquent tend to maintain lower median balances and exhibit greater balance volatility compared to non-delinquent consumers. The dataset is also imbalanced, with delinquent consumers representing a small portion of the population, and checking accounts comprising the majority of observed account activity.
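The comparison in Figure 2 reduces to a grouped aggregation over the balance table. A toy sketch (column names here are illustrative, not the dataset's actual schema):

```python
# Toy version of the Figure 2 comparison: balance level and volatility
# by delinquency status.
import pandas as pd

balances = pd.DataFrame({
    "consumer_id": [1, 1, 2, 2, 3, 3],
    "balance":     [1200.0, 1100.0, 150.0, 40.0, 900.0, 950.0],
    "delinquent":  [0, 0, 1, 1, 0, 0],
})

# Delinquent consumers tend to show lower, more volatile balances.
summary = balances.groupby("delinquent")["balance"].agg(["median", "std"])
print(summary)
```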

Feature Generation

Raw transaction and balance data were transformed into behavioral features that summarize financial activity patterns. These features capture balance dynamics, income stability, spending behavior, and overall account activity across multiple financial accounts.

Behavioral feature engineering framework
Figure 5. Behavioral Feature Engineering Framework
  • Balance Behavior: Statistical measures of account balances including averages, volatility, and liquidity risk across multiple time windows.
  • Cashflow Behavior: Measures of income inflows, transaction frequency, and stability of daily cashflow patterns.
  • Transaction Activity: Aggregate transaction volumes, credit and debit magnitudes, and credit-to-debit ratios.
  • Category-Level Spending: Spending distributions across merchant categories including essential and discretionary expenditures.
  • Multi-Account Features: Aggregated balances across savings, investments, credit cards, and loans to compute metrics such as total assets, debt, and savings-to-debt ratios.
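The balance-behavior features above are, in essence, per-consumer aggregations over a daily balance table. A minimal sketch of that style, with hypothetical column and feature names:

```python
# Illustrative balance-behavior features on a toy daily-balance table.
import pandas as pd

daily = pd.DataFrame({
    "consumer_id": [1] * 4 + [2] * 4,
    "balance": [500.0, 480.0, 20.0, 510.0, 90.0, 60.0, 10.0, 5.0],
})

grp = daily.groupby("consumer_id")["balance"]
features = pd.DataFrame({
    "balance_mean": grp.mean(),
    "balance_volatility": grp.std(),
    # Liquidity risk: share of days spent below a $100 threshold.
    "pct_days_below_100": grp.apply(lambda s: (s < 100).mean()),
})
print(features)
```

The real pipeline repeats this pattern over multiple time windows, transaction types, spending categories, and account types to produce the full feature set.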

Feature Selection

To reduce dimensionality and retain the most predictive behavioral signals, we implemented a multi-stage feature selection pipeline. The process combined preprocessing filters with multiple feature importance metrics to identify variables that consistently contributed to predictive performance.

  • Preprocessing Filters: Near-constant features were removed using a variance threshold (variance < 1%), and highly correlated variables were eliminated using a correlation filter (correlation > 0.95).
  • Multi-Metric Evaluation: Remaining features were evaluated using Gradient Boosting feature importance, permutation importance, and mutual information to capture both linear and non-linear predictive relationships.
  • Consensus Ranking: Importance scores from all three methods were normalized and combined into a weighted consensus score.
Top 5 Feature Importance using SHAP Values
Figure 6. Top behavioral features contributing to delinquency prediction.

Using this consensus ranking, the top 50 features were selected for model training. SHAP analysis highlights several key behavioral signals driving predictions, including liquid balance levels, account history length, and low-balance frequency. These features primarily capture liquidity stability and financial activity patterns associated with delinquency risk.
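The three-stage pipeline can be sketched entirely with scikit-learn. The version below runs on synthetic data with illustrative settings; the real pipeline's exact weighting of the three importance scores is not reproduced here (a simple unweighted mean of normalized scores is used instead).

```python
# Sketch of the selection pipeline: variance and correlation filters,
# then a consensus of three normalized importance metrics.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])

# 1) Drop near-constant features (variance threshold).
vt = VarianceThreshold(threshold=0.01)
X = X.loc[:, vt.fit(X).get_support()]

# 2) Drop one of each highly correlated pair (|r| > 0.95).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

# 3) Consensus ranking across three importance metrics.
model = GradientBoostingClassifier(random_state=0).fit(X, y)
scores = pd.DataFrame({
    "gb": model.feature_importances_,
    "perm": permutation_importance(model, X, y, random_state=0).importances_mean,
    "mi": mutual_info_classif(X, y, random_state=0),
}, index=X.columns)
consensus = scores.div(scores.max()).mean(axis=1).sort_values(ascending=False)
print(consensus.head())
```

In the project, the top 50 features by this kind of consensus score were kept for model training.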

Modeling Pipeline

After constructing behavioral financial features, we implemented a machine learning pipeline to predict delinquency outcomes. The pipeline begins with engineered behavioral signals derived from account balances, cashflow activity, transaction behavior, and category-level spending patterns. These features are then processed through the feature selection stage before being used to train and evaluate several machine learning models.

Behavioral feature engineering and modeling pipeline
Figure 7. End-to-End Credit Risk Prediction Pipeline

Models

After feature engineering and feature selection, several machine learning models were trained to predict delinquency risk. The dataset was split into 60% training, 20% validation, and 20% test sets. The validation set was used for hyperparameter tuning, while the test set was reserved for final evaluation. Because delinquency is relatively rare in the dataset (approximately 8%), model performance was evaluated using ROC-AUC and F1 score rather than accuracy.
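The 60/20/20 split can be produced with two successive stratified splits, as sketched below on synthetic data:

```python
# Sketch of the 60/20/20 train/validation/test split described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.92], random_state=0)

# First carve off 40% for validation+test, then split that half-and-half.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying both splits keeps the rare delinquent class at roughly the same rate in all three sets, which matters when the positive rate is only about 8%.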

  • Logistic Regression: A linear baseline model used to establish a reference point for evaluating whether more complex models capture nonlinear patterns in the behavioral features.
  • Random Forest: An ensemble tree-based method that builds multiple decision trees using bootstrapped samples and averages their predictions, allowing the model to capture nonlinear interactions between financial behavior variables.
  • LightGBM: A gradient boosting framework optimized for tabular data that grows trees leaf-wise, improving efficiency and predictive performance on structured financial datasets.
  • XGBoost: A gradient boosting implementation that sequentially builds trees to correct errors from previous models, incorporating regularization and efficient handling of sparse features.
  • Gradient Boosting: The scikit-learn implementation of gradient boosting, which iteratively fits decision trees to residual errors to learn complex nonlinear relationships between behavioral signals and delinquency risk.

All models were trained using the selected feature set from the consensus feature selection pipeline. Hyperparameters were tuned on the validation set, and final performance metrics were computed on the held-out test set.

Model Evaluation

Model performance was evaluated using several metrics commonly used in credit risk modeling. Because the dataset is imbalanced, these metrics focus on the model’s ability to correctly identify delinquent consumers rather than relying on overall accuracy.

  • ROC-AUC: Measures how well the model distinguishes between delinquent and non-delinquent consumers across all classification thresholds.
  • Precision: The proportion of consumers predicted to be delinquent that actually become delinquent.
  • Recall: The proportion of truly delinquent consumers correctly identified by the model.
  • F1 Score: The harmonic mean of precision and recall, balancing false positives and false negatives.
  • Training Time: The computational time required to fit the model on the training dataset, indicating the algorithm's efficiency during the learning phase.
  • Prediction Time: The time required for the model to generate predictions, which is important for scalability and real-time deployment.

Conclusion

This project shows that transaction-level financial data can provide meaningful signals for credit risk prediction. By transforming raw balance histories and transaction activity into behavioral features, we built a model that captures patterns related to liquidity stability, income regularity, and spending behavior.


Among the models we tested, Gradient Boosting performed best overall, achieving the strongest balance between predictive performance and classification quality. The results suggest that behavioral signals from bank activity can help identify default risk in ways that traditional credit bureau data may miss.


These findings are especially relevant for consumers with thin or limited credit histories, where traditional credit scores may not fully reflect real financial behavior. Overall, the project highlights how transaction-based features can serve as a valuable complement to modern credit risk assessment.

Next Steps

Future work could focus on improving model calibration so that predicted probabilities more closely match observed delinquency outcomes. Incorporating longer financial histories and additional behavioral signals may also strengthen predictive performance.
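One concrete route to the calibration step mentioned above is scikit-learn's CalibratedClassifierCV. The sketch below is illustrative (toy data, arbitrary settings), not the project's implementation:

```python
# Sketch of probability calibration with CalibratedClassifierCV.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.92], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Isotonic calibration refits predicted probabilities on held-out folds
# so they track observed delinquency frequencies more closely.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="isotonic", cv=3)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]
print(proba.mean())  # should sit near the base delinquency rate
```

Calibration quality can then be checked with reliability curves or the Brier score before the probabilities are used for lending decisions.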


Another promising direction is expanding the behavioral feature set. Additional features could be created by combining spending categories, analyzing category-level spending volatility, or tracking changes in spending composition over time. These signals could provide deeper insight into financial stability and consumer behavior.


More broadly, future work could focus on optimizing the behavioral modeling framework itself. This includes testing additional feature transformations, exploring interactions between behavioral signals, and evaluating how well the model generalizes across consumer populations. Continued improvements in feature engineering and modeling would help strengthen the reliability of transaction-based credit risk assessment.