Bytewise approximate matching algorithms have in recent years shown significant promise in detecting files that are similar at the byte level. This is very useful for digital forensic investigators, who are regularly faced with the problem of searching through a seized device for pertinent data. A common scenario is where an investigator is in possession of a collection of "known-illegal" files (e.g. a collection of child abuse material) and wishes to find whether copies of these are stored on the seized device. Approximate matching addresses shortcomings in traditional hashing, which can only find identical files, by also being able to deal with cases of merged files, embedded files, partial files, or if a file has been changed in any way. Most approximate matching algorithms work by comparing pairs of files, which is not a scalable approach when faced with large corpora. This paper demonstrates the effectiveness of using a "Hierarchical Bloom Filter Tree" (HBFT) data structure to reduce the running time of collection-against-collection matching, with a specific focus on the MRSH-v2 algorithm. Three experiments are discussed, which explore the effects of different configurations of HBFTs. The proposed approach dramatically reduces the number of pairwise comparisons required, and demonstrates substantial speed gains, while maintaining effectiveness.


Bloom, B. H. (1970). Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM , 13 (7), 422–426.

Breitinger, F., & Baier, H. (2012). Similarity Preserving Hashing: Eligible Properties and a New Algorithm MRSH-v2. In International conference on digital forensics and cyber crime (pp. 167–182). Springer.

Breitinger, F., Baier, H., & White, D. (2014). On the Database Lookup Problem of Approximate Matching. Digital Investigation, 11, S1–S9. doi: 10.1016/j.diin.2014.03.001

Breitinger, F., Guttman, B., McCarrin, M., Roussev, V., & White, D. (2014). Approximate Matching: Definition and Terminology. NIST Special Publication, 800, 168.

Breitinger, F., Rathgeb, C., & Baier, H. (2014, sep). An Efficient Similarity Digests Database Lookup - A Logarithmic Divide & Conquer Approach. Journal of Digital Forensics, Security and Law, 9(2), 155– 166.

Broder, A. Z. (1997). On the Resemblance and Containment of Documents. In Compression and complexity of sequences 1997. proceedings (pp. 21–29). doi: 10.1109/ SEQUEN.1997.666900

Casey, E., Ferraro, M., & Nguyen, L. (2009). Investigation Delayed is Justice Denied: Pro- posals for Expediting Forensic Examinations of Digital Evidence. Journal of forensic sciences, 54(6), 1353–1364.

de Braekt, R. I., Le-Khac, N. A., Farina, J., Scanlon, M., & Kechadi, T. (2016, April). Increasing Digital Investigator Availability Through Efficient Workflow Management and Automation. In 4th international symposium on digital forensic and security (isdfs) (p. 68-73). doi: 10.1109/ ISDFS.2016.7473520

Gupta, J. N., Kalaimannan, E., & Yoo, S.-M. (2016). A Heuristic for Maximizing Investigation Effectiveness of Digital Forensic Cases Involving Multiple Investigators. Computers & Operations Research, 69, 1–9. doi: 10.1016/j.cor.2015.11.003

Gupta, V., & Breitinger, F. (2015). How cuckoo filter can improve existing approximate matching techniques. In J. James & F. Breitinger (Eds.), Digital forensics and cyber crime (Vol. 157, p. 39-52). Springer International Publishing. doi: 10.1007/ 978-3-319-25512-5 4

Harichandran, V. S., Breitinger, F., & Baggili, I. (2016). Bytewise Approximate Matching: The Good, The Bad, and The Unknown. The Journal of Digital Forensics, Security and Law: JDFSL, 11(2), 59.

James, J. I., & Gladyshev, P. (2015). Automated Inference of Past Action Instances in Digital Investigations. International Journal of Information Security, 14(3), 249–261. doi: 10.1007/s10207-014-0249-6

Kornblum, J. (2006). Identifying Identical Files Using Context Triggered Piecewise Hashing. Digital investigation , 3 , 91–97. doi: 10.1016/j.diin.2006.06.015

Lillis, D., Becker, B., O’Sullivan, T., & Scanlon, M. (2016). Current Challenges and Future Research Areas for Digital Forensic Investigation. In 11th ADFSL Conference on Digital Forensics, Security and Law (CDFSL 2016). Daytona Beach, FL, USA: ADFSL. doi: 10.13140/RG.2.2.34898.76489

Lillis, D., Breitinger, F., & Scanlon, M. (2017). Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees. In 9th EAI International Conference on Digital Forensics and Cyber Crime (ICDF2C 2017). Prague, Czech Republic.

Oliver, J., Cheng, C., & Chen, Y. (2013). TLSH– A Locality Sensitive Hash. In Fourth Cybercrime and trustworthy computing workshop (CTC), 2013 (pp. 7–13). doi: 10.1109/CTC .2013.9

Quick, D., & Choo, K.-K. R. (2014). Impacts of Increasing Volume of Digital Forensic Data: A Survey and Future Research Challenges. Digital Investigation , 11 (4), 273– 294. doi: 10.1016/j.diin.2014.09.002

Rogers, M. K., Goldman, J., Mislan, R., Wedge, T., & Debrota, S. (2006). Computer Forensics Field Triage Process Model. Journal of Digital Forensics, Security and Law, 1(2), 19–38.

Roussev, V. (2010). Data Fingerprinting with Similarity Digests. In IFIP International Conference on Digital Forensics (pp. 207– 226). Springer. doi: 10.1007/978-3-642 -15506-2 15

Roussev, V. (2011). An Evaluation of Forensic Similarity Hashes. Digital Investigation, 8, S34–S41.

Roussev, V., & Richard III, G. G. (2004). Break- ing the Performance Wall: The Case for Distributed Digital Forensics. In Proceedings of the 2004 Digital Forensics Research Workshop (Vol. 94).

Sadowski, C., & Levin, G. (2007). Simhash: Hash-based Similarity Detection. Citeseer.

Scanlon, M. (2016, 08). Battling the Digital Forensic Backlog through Data Deduplication. In Proceedings of the 6th IEEE International Conference on Innovative Computing Technologies (INTECH 2016). Dublin, Ireland: IEEE.

van Baar, R., van Beek, H., & van Eijk, E. (2014). Digital Forensics as a Service: A Game Changer. Digital Investigation , 11, Supplement 1, S54 - S62. doi: 10.1016/ j.diin.2014.03.007



To view the content in your browser, please download Adobe Reader or, alternately,
you may Download the file to your hard drive.

NOTE: The latest versions of Adobe Reader do not support viewing PDF files within Firefox on Mac OS and if you are using a modern (Intel) Mac, there is no official plugin for viewing PDF files within the browser window.