Date of Award
Spring 2025
Access Type
Dissertation - Open Access
Degree Name
Doctor of Philosophy in Electrical Engineering & Computer Science
Department
Electrical Engineering and Computer Science
Committee Chair
Omar Ochoa
Committee Advisor
Omar Ochoa
First Committee Member
Massood Towhidnejad
Second Committee Member
Nicholas Del Rio
Third Committee Member
Alex Vargas
Fourth Committee Member
Keith Garfield
College Dean
James W. Gregory
Abstract
The present dissertation delineates a system that enables those engaged in software development to automatically generate and maintain project life cycle provenance. All projects are implemented and made manifest through the development of artifacts, e.g., papers and code files. Tools exist to accelerate artifact creation, but little attention is paid to the processes that produce them. In Ontology (from Ancient Greek, the study of being), the two most basic entities in reality are Continuant and Occurrent, or, roughly, “Artifact” and “Process”. This dissertation posits that for any created artifact, its process of creation, i.e., its life cycle provenance, must also be captured and maintained. Artifacts are often delivered without an explicit trace of their evolution. This is particularly unacceptable for critical systems, where requirements documents, codebases, and other meta-artifacts are revisited without a corresponding history of how or why they came to be, leading to confusion and rework.
While the software development life cycle (SDLC) incorporates meta-artifacts like traceability matrices to improve artifact provenance, these are typically informal and heavy with natural language, and they lack structured explainability. This work proposes that each artifact should be accompanied by a machine-readable, human-interpretable, extensible provenance record, implemented in the form of a knowledge graph and backed by well-established ontologies. The developed system, ProvTracer, leverages structured knowledge via PROV-O and the Basic Formal Ontology, alongside generative natural language capabilities via the Generative Pre-trained Transformer (GPT) series of Multimodal Large Language Models (MLLMs), to create real-time, traceable, and explainable links between development activities and their resulting artifacts. By capturing these provenance trace links automatically through multimodal signals, e.g., screenshots and peripheral device input, ProvTracer aims to bridge the gap between implicit processes and explicit traces, enabling developers to understand, query, maintain, integrate, and trust the evolution of their projects and systems.
The synergy between knowledge graphs and MLLMs enables a novel form of interactive, explainable software development. Natural language queries of provenance trace link knowledge graphs can reduce information overload, extract developer rationale and decision histories, and support task assignment and a range of other project management activities. This aligns with a burgeoning trend in research demonstrating that structured knowledge improves trust, transparency, and reproducibility in machine learning. The present dissertation addresses the challenges of traceability and explainability in the SDLC by presenting a system that automatically captures artifact provenance and operationalizes it for practical use in real-world software development.
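As a rough illustration of the kind of provenance record the abstract describes, the following minimal sketch stores PROV-O style triples in a plain Python set and walks them backwards from an artifact to the artifacts it depended on. It is not ProvTracer's implementation: the `EX` namespace, the function names `record_activity` and `upstream_artifacts`, and the example artifacts and developers are all hypothetical. Only the PROV-O terms (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`) come from the W3C vocabulary.

```python
# Minimal sketch (an assumption, not the dissertation's code): provenance
# trace links as (subject, predicate, object) triples using PROV-O terms.
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/project/"  # hypothetical project namespace


def record_activity(graph, activity, agent, used, generated):
    """Add triples linking one development activity to its inputs and outputs."""
    graph.add((EX + activity, RDF_TYPE, PROV + "Activity"))
    graph.add((EX + agent, RDF_TYPE, PROV + "Agent"))
    graph.add((EX + activity, PROV + "wasAssociatedWith", EX + agent))
    for entity in used:
        graph.add((EX + entity, RDF_TYPE, PROV + "Entity"))
        graph.add((EX + activity, PROV + "used", EX + entity))
    for entity in generated:
        graph.add((EX + entity, RDF_TYPE, PROV + "Entity"))
        graph.add((EX + entity, PROV + "wasGeneratedBy", EX + activity))


def upstream_artifacts(graph, artifact):
    """Return the local names of every entity the artifact transitively depended on."""
    seen = set()
    frontier = [EX + artifact]
    while frontier:
        node = frontier.pop()
        # Activities that generated the current node.
        activities = [o for s, p, o in graph
                      if s == node and p == PROV + "wasGeneratedBy"]
        for act in activities:
            # Entities those activities used are upstream of the node.
            for s, p, o in graph:
                if s == act and p == PROV + "used" and o not in seen:
                    seen.add(o)
                    frontier.append(o)
    return {iri[len(EX):] for iri in seen}


# Hypothetical project history: three activities, three artifacts.
graph = set()
record_activity(graph, "write_requirements", "dev_alice",
                used=[], generated=["requirements.md"])
record_activity(graph, "implement_parser", "dev_alice",
                used=["requirements.md"], generated=["parser.py"])
record_activity(graph, "write_tests", "dev_bob",
                used=["parser.py", "requirements.md"],
                generated=["test_parser.py"])

print(sorted(upstream_artifacts(graph, "test_parser.py")))
```

A production system would more plausibly hold such triples in an RDF store and query them with SPARQL or through a natural language interface. The set of tuples here only shows the shape of the links between activities, agents, and artifacts.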
Scholarly Commons Citation
Procko, Tyler, "On the Provenance of Software Systems: Automating Software Traceability with Knowledge Graph and Large Language Model Synergy" (2025). Doctoral Dissertations and Master's Theses. 898.
https://commons.erau.edu/edt/898
Included in
Applied Behavior Analysis Commons, Archival Science Commons, Cataloging and Metadata Commons, Cognition and Perception Commons, Cognitive Science Commons, Communication Technology and New Media Commons, Computational Linguistics Commons, Computer and Systems Architecture Commons, Data Storage Systems Commons, Experimental Analysis of Behavior Commons, Graphic Communications Commons, Operational Research Commons, Organizational Communication Commons, Signal Processing Commons, Systems Engineering Commons
Comments
https://github.com/PR0CK0/ProvTracer