Date of Award
Fall 2025
Access Type
Thesis - Open Access
Degree Name
Master of Science in Data Science
Department
Mathematics
Committee Chair
Prashant Shekhar
Committee Chair Email
SHEKHARP@erau.edu
First Committee Member
Hari Adhikari
First Committee Member Email
adhikarh@erau.edu
Second Committee Member
Timothy Smith
Second Committee Member Email
smitht1@erau.edu
College Dean
Jayathi Raghavan
Abstract
This research explores a systematic application of machine learning techniques combined with causal inference to predict loan defaults in peer-to-peer lending. Accurately forecasting loan defaults is crucial for mitigating financial risk and optimizing lending strategies. The analysis is based on multiple datasets of loan applications spanning over a decade, containing detailed financial and credit information about borrowers. Beginning with extensive Exploratory Data Analysis (EDA) coupled with scaling strategies, the research identifies key trends in loan performance across a large number of factors, such as interest rates and borrower creditworthiness. One objective is to determine which of the many available predictors are the most essential. The research introduces two Recursive Feature Elimination (RFE) variations based on (1) occlusion sensitivity and (2) double machine learning (DML), each providing a unique view of feature engineering. These RFE methods were then coupled with machine learning models such as Logistic Regression, Random Forest, Naive Bayes, XGBoost, and CatBoost for final classification. For performance validation, the modeling pipelines (feature selection + prediction) were evaluated on accuracy, precision, recall, and F1-score to determine their effectiveness in predicting loan defaults.
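As a rough illustration of the occlusion-sensitivity idea, the sketch below runs a greedy recursive feature elimination loop in plain Python. It is not the thesis's implementation: the nearest-centroid classifier, the mean-fill occlusion, the validation split, and the synthetic data are all assumptions chosen to keep the example self-contained. At each round, every remaining feature is occluded in turn, and the feature whose occlusion costs the least accuracy is dropped.

```python
import random

def fit_centroids(X, y, features):
    """Fit a nearest-centroid classifier (a hypothetical stand-in model)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = {f: sum(r[f] for r in rows) / len(rows) for f in features}
    return centroids

def predict(centroids, row, features):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((row[f] - c[f]) ** 2 for f in features)
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, X, y, features):
    hits = sum(predict(centroids, x, features) == t for x, t in zip(X, y))
    return hits / len(y)

def occlusion_rfe(X_train, y_train, X_val, y_val, n_keep):
    """Greedily drop the feature whose occlusion hurts validation accuracy least."""
    active = list(range(len(X_train[0])))
    while len(active) > n_keep:
        model = fit_centroids(X_train, y_train, active)
        baseline = accuracy(model, X_val, y_val, active)
        drops = {}
        for f in active:
            # Assumption: occlude a feature by overwriting it with its training mean.
            fill = sum(r[f] for r in X_train) / len(X_train)
            occluded = [r[:f] + [fill] + r[f + 1:] for r in X_val]
            drops[f] = baseline - accuracy(model, occluded, y_val, active)
        weakest = min(active, key=lambda f: drops[f])
        active.remove(weakest)
    return active

# Synthetic data (made up): feature 0 separates the classes, features 1 and 2 are noise.
rng = random.Random(0)
X, y = [], []
for i in range(200):
    label = i % 2
    X.append([rng.gauss(5.0 * label, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
    y.append(label)

X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]
kept = occlusion_rfe(X_train, y_train, X_val, y_val, n_keep=1)
print("Features kept:", kept)
```

Because each round commits to removing a single feature based on the current model, the order of removals can differ from one classifier to another, which is one way a greedy procedure of this kind can yield model-dependent selections.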
Among the models tested, Naive Bayes demonstrated the best overall performance by F1-score, significantly outperforming the next best model, XGBoost. However, Naive Bayes exhibited lower precision, suggesting that while it captured more defaulted loans, it also produced more false positives. Neural network models developed using TensorFlow and Keras to explore non-linear relationships in the data underperformed the traditional models. Between the two RFE methods tested, models that selected their features using DML were more consistent with each other, making the same selection choices more often, which indicates that this method is less sensitive to model choice. Occlusion sensitivity produced more varied results due to its greedy selection algorithm, but often showed better performance than DML for the model used. This study demonstrates the potential of machine learning in financial risk assessment and highlights the need for intelligent selection of features. Future work will extend this analysis by refining feature engineering, using more sophisticated non-linear models, and performing in-depth causal inference to uncover specific useful features.
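The precision and recall trade-off described above can be made concrete with a small worked example. The sketch below computes accuracy, precision, recall, and F1-score from binary labels (1 = default) in plain Python. The ten labels and the two sets of predictions are invented for illustration and are not results from the thesis; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels for ten loans: four defaults, six repaid.
y_true   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # flags many loans: all defaults caught, 3 false alarms
y_pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # flags few loans: no false alarms, 2 defaults missed

metrics_a = classification_metrics(y_true, y_pred_a)
metrics_b = classification_metrics(y_true, y_pred_b)
print(metrics_a)
print(metrics_b)
```

In this toy case the first set of predictions has precision 4/7 and recall 1.0, for an F1-score of 8/11, while the second has precision 1.0 and recall 0.5, for an F1-score of 2/3. A classifier can therefore lead on F1-score while trailing on precision, which is the pattern the abstract reports for Naive Bayes.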
Scholarly Commons Citation
Guida, Luca, "Leveraging Machine Learning and Causal Inference for Loan Default Prediction" (2025). Doctoral Dissertations and Master's Theses. 939.
https://commons.erau.edu/edt/939
Included in
Artificial Intelligence and Robotics Commons, Data Science Commons, Finance and Financial Management Commons