Date of Award

Fall 2025

Access Type

Thesis - Open Access

Degree Name

Master of Science in Data Science

Department

Mathematics

Committee Chair

Prashant Shekhar

Committee Chair Email

SHEKHARP@erau.edu

First Committee Member

Hari Adhikari

First Committee Member Email

adhikarh@erau.edu

Second Committee Member

Timothy Smith

Second Committee Member Email

smitht1@erau.edu

College Dean

Jayathi Raghavan

Abstract

This research systematically applies machine learning techniques combined with causal inference to predict loan defaults in peer-to-peer lending. Accurately forecasting loan defaults is crucial for mitigating financial risk and optimizing lending strategies. The analysis draws on multiple datasets of loan applications spanning over a decade, containing detailed financial and credit information about borrowers. Beginning with extensive Exploratory Data Analysis (EDA) coupled with scaling strategies, the research identifies key trends in loan performance across a large number of factors, such as interest rates and borrower creditworthiness. One objective is to determine which of the many available predictors are the most essential. The research introduces two variations of Recursive Feature Elimination (RFE), based on (1) occlusion sensitivity and (2) double machine learning (DML), each providing a distinct view of feature engineering. These RFE methods were then coupled with machine learning models such as Logistic Regression, Random Forest, Naive Bayes, XGBoost, and CatBoost for final classification. For performance validation, the modeling pipelines (feature selection + prediction) were evaluated on accuracy, precision, recall, and F1-score to determine their effectiveness in predicting loan defaults.
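As a rough illustration of the occlusion-sensitivity variant of RFE, the sketch below repeatedly occludes each remaining feature (here, by replacing it with its column mean), measures the resulting drop in F1-score, and eliminates the feature whose occlusion hurts least. This is a minimal sketch under stated assumptions, not the thesis's implementation: the nearest-centroid stand-in model, the mean-value occlusion, the toy data, and all function names (`occlusion_rfe`, `fit_centroids`, `predict`, `f1_score`) are hypothetical choices made for illustration, and scoring is done on the training rows purely for brevity.

```python
from statistics import mean

def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1, i.e. a defaulted loan)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def fit_centroids(X, y, cols):
    """Stand-in model (assumption): per-class feature means over `cols`."""
    return {
        label: {c: mean(x[c] for x, t in zip(X, y) if t == label) for c in cols}
        for label in (0, 1)
    }

def predict(centroids, X, cols):
    """Assign each row to the class with the nearest centroid."""
    preds = []
    for x in X:
        dist = {lab: sum((x[c] - cen[c]) ** 2 for c in cols)
                for lab, cen in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds

def occlusion_rfe(X, y, n_keep):
    """Greedy elimination: drop the feature whose occlusion costs the least F1."""
    cols = list(range(len(X[0])))
    eliminated = []
    while len(cols) > n_keep:
        model = fit_centroids(X, y, cols)
        base = f1_score(y, predict(model, X, cols))
        drops = {}
        for c in cols:
            col_mean = mean(x[c] for x in X)
            # Occlude feature c by overwriting it with its mean in every row.
            X_occ = [x[:c] + [col_mean] + x[c + 1:] for x in X]
            drops[c] = base - f1_score(y, predict(model, X_occ, cols))
        weakest = min(cols, key=lambda c: drops[c])
        cols.remove(weakest)
        eliminated.append(weakest)
    return cols, eliminated

# Toy data (hypothetical): only feature 0 separates the two classes.
X = [[1.0, 5.0, 3.0], [2.0, 4.0, 7.0], [1.5, 6.0, 5.0], [2.5, 5.0, 4.0],
     [8.0, 5.0, 4.0], [9.0, 6.0, 6.0], [8.5, 4.0, 3.0], [7.5, 5.0, 6.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

kept, eliminated = occlusion_rfe(X, y, n_keep=1)
print(kept, eliminated)
```

Because each round removes only the locally least important feature, the elimination order depends on the model being occluded, which is consistent with the abstract's later remark that this greedy approach yields more varied selections across models.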

Among the models tested, Naive Bayes demonstrated the best overall performance by F1-score, significantly outperforming the next best model, XGBoost. However, Naive Bayes exhibited lower precision, suggesting that while it captured more defaulted loans, it also produced more false positives. Neural network models developed using TensorFlow and Keras to explore non-linear relationships in the data underperformed the traditional models. Of the two RFE methods tested, DML produced more consistent feature selections: models that selected their features using DML made the same selection choices more often, indicating that this method is less sensitive to model choice. Occlusion sensitivity produced more varied results due to its greedy selection algorithm, but often yielded better performance than DML for a given model. This study demonstrates the potential of machine learning in financial risk assessment and highlights the need for intelligent selection of features. Future work will extend this analysis by refining feature engineering, using more sophisticated non-linear models, and performing in-depth causal inference to uncover specific useful features.
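The trade-off described above, where a model leads on F1-score despite lower precision, follows from F1 being the harmonic mean of precision and recall. The short sketch below works through the arithmetic. The confusion-matrix counts are hypothetical and chosen only to illustrate the effect; they are not results from the thesis, and the function name `precision_recall_f1` is likewise an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical counts for 100 actual defaults (NOT taken from the thesis).
# Model A flags many loans: catches 80 defaults but raises 80 false alarms.
model_a = precision_recall_f1(tp=80, fp=80, fn=20)
# Model B is conservative: few false alarms, but it misses 60 defaults.
model_b = precision_recall_f1(tp=40, fp=10, fn=60)

for name, (p, r, f1) in (("A", model_a), ("B", model_b)):
    print(f"model {name}: precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
```

In this illustrative case the high-recall model ends up with the higher F1 even though half of its flagged loans are false positives, which is the same pattern the abstract reports for Naive Bayes.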
