This paper presents an algorithm for comparing large numbers of traces to each other and identifying and presenting groups of traces with similar features. It is applied to forensic analysis in which groups of similar traces are automatically identified and presented so that attribution and other related claims may be asserted, and independently confirmed or refuted. The approach of this paper is to identify an approximate algorithm that will find a large subset of greatest common factor similar groups of arbitrary factors in far less time and space than an exact algorithm using examiner-provided selection criteria for factor definition.


[1] E. Locard, "The Analysis of Dust Traces", The American Journal of Police Science, (1930), V1:276-298. [This is one of the most referenced papers in forensics, and is considered the defining paper in this area by many subsequent authors.]

[2] F. Cohen, "Digital Forensic Evidence Examination", 2009, ASP Press. [This book introduces a physics of digital systems and explores issues in examination of digital forensic evidence for legal purposes, including analysis, interpretation, attribution, reconstruction, and presentation. It also includes a substantial review of the literature in related areas.]

[3] F. Cohen, "Analysis of redundant traces for consistency", IEEE International Workshop on Computer Forensics in Software Engineering (CFSE 09), Seattle, Washington, USA, July 20-24, 2009. [This paper provides complexity results for analysis of traces and extends the use of redundancy for trace analysis for the identification of consistency and inconsistency of traces and events.]

[4] T. Stallard and K. Levitt, "Automated Analysis for Digital Forensic Science: Semantic Integrity Checking", ACSAC-2003. [This paper identifies the use of redundancy for testing hypotheses with regard to digital forensic evidence.]

[5] F. Cohen, "Two models of digital forensics examination", IEEE/SADFE-2009, Fourth International IEEE Workshop on Systematic Approaches to Digital Forensic Engineering, Oakland Conference, Oakland, CA, USA, May 21, 2009. [This paper identifies a model for performing analysis of issues related to the examination of digital forensic evidence.]

[6] K. Popper, The Logic of Scientific Discovery (1959), Hutchins and Company, London. ISBN10: 0415278449.. [This book is considered a defining standard for philosophy of science in terms of identifying the principles and properties of confirmation and refutation in the scientific arena.

[7] Federal Rules of Evidence Rules 701-706. [These are the rules used by the US Federal and many state court systems to determine admissibility of evidence and expert witness credentials and testimony.]

[8] Daubert v. Merrell Dow Pharmaceuticals, Inc. 509 US 579, 125 L. Ed. 2d 469, 113 S. Ct. 2786 (1993). [This is one of the key cases that defines the standard for admissibility of evidence based on Supreme Court precedent.]

[9] Committee on Identifying the Needs of the Forensic Sciences Community, "Strengthening Forensic Science in the United States: A Path Forward", ISBN: 978-0-309-13130-8, 254 pages, (2009).; Committee on Applied and Theoretical Statistics, National Research Council. [This report outlines the current limits of scientific evidence in the US legal system and identifies a need for increased scientific rigor in many aareas.]

[10] D. Knuth, "The Art of Computer Programming, Volume 3: Sorting and Searching", ISBN 0-201-03803-X, Addison Wesley, 1973. [This book summarizes a wide array of research in computer science, and is the third in a series of books by this author that are highly regarded in clarifying the nature of results in computer science.]

[11] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, Volume 30, Issues 1-7, April 1998, Pages 107-117. (see also an extended version at http://infolab.stanford.edu/~backrub/google.html) [This paper defines the architecture used to provide for the initial Google search engine and identifies many of the properties and features of many search engines since that time.]

[12] C. Chaski, "Who’s At The Keyboard? Authorship Attribution in DigiEvidence Investigations", International Journal of Digital Evidence, V4#1, 2005. [This paper summarizes results in attribution relating individuals to actions based on behavioral characteristics and identifies the extent to which the US legal system to date has accepted such results in court cases.]

[13] T. Pedersen, "Computational Approaches to Measuring the Similarity of Short Contexts: A Review of Applications and Methods", Journal of Intelligent Systems (Special Issue : Recent Advances in KnowledgeBased Systems and Their Applications), 17(1-3), 37-50, 2008. http://www.d.umn.edu/~tpederse/ [This paper summarizes results in examining n-tuples and proximity measurements for natural language processing to determine similarity and attribute word sequences to authorship.]

[14] M. Corney, "Analysing E-mail Text Authorship for Forensic Purposes", Masters Thesis, Queensland University of Technology, March, 2003 [This thesis examines using a variety of classifiers with output fed into a Support Vector Machine (SVM). The approach is to compare a specific email to an SVM model built from a corpus of emails with known provenance e.g. given 20 emails from each of A, B and C, compare a new email to those models to see which author it is appears to belong to.]

[15] Bomze, M. Budinich, P. Pardalos, and M. Pelillo, "The Maximum Clique Problem", The Handbook of Combinatorial Optimization (http://reference.kfupm.edu.sa/content/m/a/the_maximum_clique_proble m__265525.pdf). [This book chapter summarizes mathematical results related to the identification of cliques, a problem loosely related to the GCF group problem addressed by this paper.]

[16] Susan Polgar vs. United States of America Chess Federation et. al. Case # 5-08CV0169-C. [This is a current legal matter in which the results of analysis have been publicly released. It was still pending at the time of the original submission of this paper, is related to several other pending cases, including a Federal felony case against a related party but this particular matter has now been settled. It applies the analytical techniques identified in this paper in support of attribution.]

[17] J. Hunt and M. McIlroy, "An Algorithm for Differential File Comparison". Computing Science Technical Report, Bell Laboratories 41. (see http://www.cs.dartmouth.edu/~doug/diff.ps) June 1976. [This paper defined the algorithms used in the Unix "diff" program and introduces the problems associated with identifying maximum matching sequences.]

[18] D. Maier (1978). "The Complexity of Some Problems on Subsequences and Supersequences". J. ACM (ACM Press) 25 (2): 322–336. [This paper summarizes results in matching and related problems for pairs of sequence, including complexity results associated with a wide range of related problems and details of algorithms and their limitations.]

[19] L. Bergroth and H. Hakonen and T. Raita (2000). "A Survey of Longest Common Subsequence Algorithms". SPIRE (IEEE Computer Society) 00: 39–48. [This paper summarizes results in the particular specialty of identifying longest common sequences, and is the evolution of the work on identifying differences (and similarities) related to the "diff" and similar sorts of programs.]

[20] P. Weiner, "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory: 1-11. (1973). [This paper demonstrates an approach to doing linear time matching of regular expressions to strings based on the creation of a finite state machine that optimally implements a matching device. The device can then be implemented in software with linear time detection of these expressions. The same method can be applied to LALR parsing and a wide range of other similar problems that are equivalent in relevant ways to regular expression parsing.]



To view the content in your browser, please download Adobe Reader or, alternately,
you may Download the file to your hard drive.

NOTE: The latest versions of Adobe Reader do not support viewing PDF files within Firefox on Mac OS and if you are using a modern (Intel) Mac, there is no official plugin for viewing PDF files within the browser window.