Over the past few years the popularity of approximate matching algorithms (a.k.a. fuzzy hashing) has increased. Especially within the area of bytewise approximate matching, several algorithms were published, tested and improved. It has been shown that these algorithms are powerful, however they are sometimes too precise for real world investigations. That is, even very small commonalities (e.g., in the header of a le) can cause a match. While this is a desired property, it may also lead to unwanted results. In this paper we show that by using simple pre-processing, we signicantly can in uence the outcome. Although our test set is based on text-based le types (cause of an easy processing), this technique can be used for other, well-documented types as well. Our results show, that it can be benecial to focus on the content of les only (depending on the use-case). While for this experiment we utilized text les, Additionally, we present a small, self-created dataset that can be used in the future for approximate matching algorithms since it is labeled (we know which les are similar and how).


Baier, H., & Breitinger, F. (2011, May). Security Aspects of Piecewise Hashing in Computer Forensics. IT Security Incident Management & IT Forensics (IMF), 21–36. doi: 10.1109/IMF.2011.16

Bjelland, P. C., Franke, K., & ˚Arnes, A. (2014, May). Practical use of approximate hash based matching in digital investigations. Digital Investigation, 11 , 18–26. Retrieved from http://dx.doi.org/10.1016/ j.diin.2014.03.003 doi: 10.1016/j.diin.2014.03.003

Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM , 13 , 422–426.

Borenstein, N. S., & Freed, N. (1993, September). Mime (multipurpose internet mail extensions) — part one: Mechanisms for specifying and describing the format of internet message bodies (Tech. Rep.). Internet RFC 1521.

Breitinger, F., & Baier, H. (2013). Similarity preserving hashing: Eligible properties and a new algorithm mrsh-v2. In M. Rogers & K. Seigfried-Spellar (Eds.), Digital forensics and cyber crime (Vol. 114, pp. 167–182). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/ 978-3-642-39891-9 11 doi: 10.1007/978-3-642-39891-9 11

Breitinger, F., Guttman, B., McCarrin, M., Roussev, V., & White, D. (2014, May). Approximate matching: Definition and terminology (Special Publication 800-168). National Institute of Standards and Technologies. Retrieved fromhttp://dx.doi.org/10.6028/ NIST.SP.800-168

Breitinger, F., Stivaktakis, G., & Roussev, V. (2014, June). Evaluating detection error trade-offs for bytewise approximate matching algorithms. Digital Investigation, 11 (2), 81–89. Retrieved from http://dx.doi.org/ 10.1016/j.diin.2014.05.002 doi: 10.1016/j.diin.2014.05.002

Farrell, P., Garfinkel, S. L., & White, D. (2008). Practical applications of bloom filters to the nist rds and hard drive triage. In Computer security applications conference, 2008. acsac 2008. annual (pp. 13–22).

Garfinkel, S. L., & McCarrin, M. (2015). Hash-based carving: Searching media for complete files and file fragments with sector hashing and hashdb. Digital Investigation, 14 , S95–S105.

Klimt, B., & Yang, Y. (2004). The enron corpus: A new dataset for email classification research. In Machine learning: Ecml 2004 (pp. 217–226). Springer. Kornblum, J. (2006, September). Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, 3 , 91–97. Retrieved from http://dx.doi.org/ 10.1016/j.diin.2006.06.015 doi: 10.1016/j.diin.2006.06.015

Resnick, P. (2001). RFC 2822: Internet message format (Tech. Rep.). IETF. Retrieved from http://www.rfc-archive.org/ getrfc.php?rfc=2822

Roussev, V. (2010). Data fingerprinting with similarity digests. In K.-P. Chow & S. Shenoi (Eds.), Advances in digital forensics vi (Vol. 337, pp. 207–226). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/ 10.1007/978-3-642-15506-2 15 doi: 10.1007/978-3-642-15506-2\ 15





To view the content in your browser, please download Adobe Reader or, alternately,
you may Download the file to your hard drive.

NOTE: The latest versions of Adobe Reader do not support viewing PDF files within Firefox on Mac OS and if you are using a modern (Intel) Mac, there is no official plugin for viewing PDF files within the browser window.