Date of Award


Access Type

Dissertation - Open Access

Degree Name

Doctor of Philosophy in Aviation


College of Aviation

Committee Chair

Bruce A. Conway, Ph.D.

First Committee Member

Dothang Truong, Ph.D.

Second Committee Member

David S. Cross, Ph.D.

Third Committee Member

Robert W. Maxson, Ph.D.


Aviation safety management is implemented through reactive, proactive, and predictive methodologies. Unlike reactive and proactive safety, predictive safety seeks to anticipate the next accident and enable prevention before it occurs. The study outlined here promotes predictive safety management through machine learning technologies that use large amounts of data to facilitate predictive modeling.

The study addresses the effort to reduce General Aviation accidents, which was renewed in earnest with the Federal Aviation Administration’s 1998 Safer Skies Initiative. Over the past 22 years, the General Aviation fatality rate has decreased. However, accidents still happen, and some evidence shows that the number of accidents, which represents hazard exposure, is increasing. The accident data suggest that the aviation community still has more to learn about the variables involved in an accident sequence.

The purpose of the study was to conduct an exploratory, data-driven examination of General Aviation accidents in the United States from January 1, 1998, to December 31, 2018, using machine learning and data mining techniques. The goal was to determine which model best predicts fatal and severe injury aviation accidents and, further, which variables were most important in the prediction model.

The study sample comprised 26,387 fixed-wing General Aviation accidents obtained from the publicly accessible National Transportation Safety Board Aviation Accident Database and Synopses archive. Using a mixed-methods approach, the study employed both unstructured narrative text and structured tabular data within the predictive modeling. First, the accident narratives were processed using text mining algorithms to develop text-based quantitative variables. Next, data mining algorithms were used to develop models based on both text- and data-based variables derived from the accident reports.
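The two-step idea above, turning narrative text into quantitative variables and then combining them with structured fields, can be sketched in a few lines. This is only an illustration in Python, not the study's text mining procedure: the topic vocabularies in `TOPIC_TERMS`, the field names `pilot_hours` and `narrative`, and the sample record are all hypothetical, and simple keyword counting stands in for whatever text transformation the study actually applied.

```python
import re
from collections import Counter

# Hypothetical topic vocabularies for illustration only; they are NOT
# the text topics derived in the study.
TOPIC_TERMS = {
    "weather": {"fog", "icing", "imc", "thunderstorm"},
    "stall": {"stall", "stalled", "spin"},
}


def tokenize(text):
    """Lowercase the narrative and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def text_features(narrative):
    """Count how many tokens in the narrative fall under each topic."""
    counts = Counter(tokenize(narrative))
    return {
        f"text_{topic}": sum(counts[term] for term in terms)
        for topic, terms in TOPIC_TERMS.items()
    }


def build_row(record):
    """Merge structured fields with text-derived variables into one row."""
    row = {key: value for key, value in record.items() if key != "narrative"}
    row.update(text_features(record["narrative"]))
    return row


# Hypothetical accident record (not taken from the NTSB database).
record = {
    "pilot_hours": 250,
    "narrative": "The airplane stalled after entering IMC in fog.",
}
row = build_row(record)
print(row)
```

Each resulting row holds both data-based and text-based variables, so the same table could feed a text-only, data-only, or combined model by selecting the relevant columns.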

Five types of machine learning models were created using SAS® Enterprise Miner™: Decision Tree, Gradient Boosting, Logistic Regression, Neural Network, and Random Forest. Additionally, three broad sets of variables were used in modeling: text-only, data-only, and a combination of text and data variables. Three models, Logistic Regression (text-only variables), Random Forest (text-only variables), and Gradient Boosting (text and data variables), emerged with similar predictive capability. The top six variables within the models were all text-based, covering the topics Medical, Slow-flight and stalls, Flight control, IMC flight, Weather factors, and Flight hours. The Logistic Regression (Text) model was selected as the champion model: Misclassification Rate = 0.098, ROC Index = 0.945, and Cumulative Lift = 3.46.
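For readers unfamiliar with the three fit statistics reported for the champion model, the following Python sketch shows one common way to compute them. It assumes that the ROC Index corresponds to the area under the ROC curve and that cumulative lift is the event rate among the top-scored fraction of cases divided by the overall event rate; the depth at which the study's lift was computed is not stated here, so the `depth` parameter is an assumption. The labels and scores are made-up toy values, not study data.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of cases whose predicted class differs from the actual class."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_index(y_true, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen event outscores a randomly chosen non-event (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cumulative_lift(y_true, scores, depth=0.1):
    """Event rate in the top `depth` fraction of scored cases, relative to
    the overall event rate. `depth` is an assumed parameter."""
    ranked = [t for _, t in sorted(zip(scores, y_true),
                                   key=lambda pair: pair[0], reverse=True)]
    k = max(1, round(depth * len(ranked)))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate


# Toy data: 1 = fatal or severe injury accident, 0 = otherwise.
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1, 0.35, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

mis_rate = misclassification_rate(y_true, y_pred)
auc = roc_index(y_true, scores)
lift_20 = cumulative_lift(y_true, scores, depth=0.2)
print(mis_rate, auc, lift_20)
```

A lower misclassification rate and a higher ROC index and lift indicate a better model, which is the sense in which the statistics above are used to compare candidates.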

The results of the study provide insights for the entire General Aviation community, including government, industry, flight training, and the operational pilot. Specific recommendations include the following: 1) improve the quality and usefulness of accident reports for machine learning applications, 2) investigate ways to capture and publish more open-source flight data for use in safety modeling, 3) invest in additional medical education and find ways to address impairing medications and high-risk medical conditions, 4) renew efforts on improving flight skills and combatting decision-based errors, 5) emphasize the importance of weather briefings, pre-flight planning, and weather-based risk management, and 6) create an aviation-specific corpus for text mining to improve text analysis and transformation.