Prior Publisher
The Association of Digital Forensics, Security and Law (ADFSL)
Abstract
Over the past few years, the popularity of approximate matching algorithms (a.k.a. fuzzy hashing) has increased. Especially within the area of bytewise approximate matching, several algorithms have been published, tested and improved. These algorithms have been shown to be powerful; however, they are sometimes too precise for real-world investigations. That is, even very small commonalities (e.g., in the header of a file) can cause a match. While this is a desired property, it may also lead to unwanted results. In this paper we show that simple pre-processing can significantly influence the outcome. Although our test set is based on text-based file types (because they are easy to process), this technique can be used for other well-documented file types as well. Our results show that, depending on the use case, it can be beneficial to focus on the content of files only. Additionally, we present a small, self-created dataset that can be used in the future for approximate matching algorithms, since it is labeled (we know which files are similar and how).
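As an illustration of the pre-processing idea described above, the following is a minimal sketch and not the authors' implementation. It assumes the text files are RFC 2822-style messages whose header is separated from the body by a blank line. The helper names `strip_headers` and `similarity` are hypothetical, and `difflib.SequenceMatcher` is only a stand-in score; it is not an approximate matching algorithm.

```python
from difflib import SequenceMatcher
from email import message_from_string


def strip_headers(raw: str) -> str:
    """Hypothetical pre-processing step: keep only the message body."""
    msg = message_from_string(raw)
    payload = msg.get_payload()
    # Multipart messages return a list of parts; this sketch leaves them untouched.
    return payload if isinstance(payload, str) else raw


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; not a fuzzy hashing algorithm."""
    return SequenceMatcher(None, a, b).ratio()


# Two made-up messages that share a header but have unrelated content.
HEADER = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Weekly status\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
)
mail_a = HEADER + "The quarterly numbers are attached for review.\n"
mail_b = HEADER + "Lunch on Friday? I found a new place downtown.\n"

raw_score = similarity(mail_a, mail_b)
body_score = similarity(strip_headers(mail_a), strip_headers(mail_b))

print(f"with headers:    {raw_score:.2f}")
print(f"content only:    {body_score:.2f}")
```

In this toy case the shared header dominates the score of the unprocessed files, while the content-only comparison reflects that the bodies are unrelated.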
Recommended Citation
Jeong, Doowon; Breitinger, Frank; Kang, Hari; and Lee, Sangjin (2016) "Towards Syntactic Approximate Matching - A Pre-Processing Experiment," Journal of Digital Forensics, Security and Law: Vol. 11, Article 6.
DOI: https://doi.org/10.15394/jdfsl.2016.1381
Available at: https://commons.erau.edu/jdfsl/vol11/iss2/6
Included in
Computer Engineering Commons, Computer Law Commons, Electrical and Computer Engineering Commons, Forensic Science and Technology Commons, Information Security Commons