Towards Syntactic Approximate Matching - A Pre-Processing Experiment

Doowon Jeong, Korea University
Frank Breitinger, University of New Haven
Hari Kang, Korea University
Sangjin Lee, Korea University

Prior Publisher

The Association of Digital Forensics, Security and Law (ADFSL)

Abstract

Over the past few years, the popularity of approximate matching algorithms (a.k.a. fuzzy hashing) has increased. Especially within the area of bytewise approximate matching, several algorithms have been published, tested and improved. It has been shown that these algorithms are powerful; however, they are sometimes too precise for real-world investigations. That is, even very small commonalities (e.g., in the header of a file) can cause a match. While this is a desired property, it may also lead to unwanted results. In this paper we show that simple pre-processing can significantly influence the outcome. Although our test set is based on text-based file types (because they are easy to process), this technique can be used for other well-documented types as well. Our results show that it can be beneficial to focus only on the content of files, depending on the use case. Additionally, we present a small, self-created dataset that can be used in the future to evaluate approximate matching algorithms, since it is labeled (we know which files are similar and how).
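The idea of focusing on content can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an email-style message as the text-based file type, uses a hypothetical helper `strip_headers` as the pre-processing step, and substitutes Python's `difflib.SequenceMatcher` for a real bytewise approximate matching algorithm. The two sample messages are invented for illustration.

```python
import difflib
from email import message_from_string


def strip_headers(raw: str) -> str:
    """Hypothetical pre-processing step: keep only the body of an email-style text."""
    msg = message_from_string(raw)
    payload = msg.get_payload()
    # Multipart messages return a list of parts; this sketch handles plain text only.
    return payload if isinstance(payload, str) else ""


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; NOT a bytewise approximate matching algorithm."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Two invented messages that share a header but have unrelated content.
HEADER = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: status\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
)
mail_a = HEADER + "The quarterly report is attached.\n"
mail_b = HEADER + "Lunch at noon tomorrow?\n"

raw_score = similarity(mail_a, mail_b)
body_score = similarity(strip_headers(mail_a), strip_headers(mail_b))
print(f"with headers: {raw_score:.2f}")
print(f"content only: {body_score:.2f}")
```

In this toy case the shared header dominates the score of the unprocessed messages, while the content-only comparison scores lower. Whether that is the desired behavior depends on the use case, as the abstract notes.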

Recommended Citation

Jeong, Doowon; Breitinger, Frank; Kang, Hari; and Lee, Sangjin (2016) "Towards Syntactic Approximate Matching - A Pre-Processing Experiment," Journal of Digital Forensics, Security and Law: Vol. 11, Article 6.
DOI: https://doi.org/10.15394/jdfsl.2016.1381
Available at: https://commons.erau.edu/jdfsl/vol11/iss2/6

Included in

Computer Engineering Commons, Computer Law Commons, Electrical and Computer Engineering Commons, Forensic Science and Technology Commons, Information Security Commons
