Hash functions are established and well-known in digital forensics, where they are commonly used for proving integrity and file identification (i.e., hash all files on a seized device and compare the fingerprints against a reference database). However, with respect to the latter operation, an active adversary can easily overcome this approach because traditional hashes are designed to be sensitive to altering an input; output will significantly change if a single bit is flipped. Therefore, researchers developed approximate matching, which is a rather new, less prominent area but was conceived as a more robust counterpart to traditional hashing. Since the conception of approximate matching, the community has constructed numerous algorithms, extensions, and additional applications for this technology, and are still working on novel concepts to improve the status quo. In this survey article, we conduct a high-level review of the existing literature from a non-technical perspective and summarize the existing body of knowledge in approximate matching, with special focus on bytewise algorithms. Our contribution allows researchers and practitioners to receive an overview of the state of the art of approximate matching so that they may understand the capabilities and challenges of the field. Simply, we present the terminology, use cases, classification, requirements, testing methods, algorithms, applications, and a list of primary and secondary literature.


Allen, B. (2015). Hashdb. https://github.com/simsong/hashdb. ˚Astebøl, K. P. (2012). mvhash - a new approach for fuzzy hashing (Unpublished master’s thesis). Gjøvik University College.

Baier, H. (2015). Towards automated preprocessing of bulk data in digital forensic investigations using hash functions. it-Information Technology, 57 (6), 347–356.

Baier, H., & Breitinger, F. (2011, May). Security Aspects of Piecewise Hashing in Computer Forensics. IT Security Incident Management & IT Forensics (IMF), 21–36. doi: 10.1109/IMF.2011.16

Bertoni, G., Daemen, J., Peeters, M., & Assche, G. V. (2008, October). Keccak specifications (Tech. Rep.). STMicroelectronics and NXP Semiconductors.

Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors.Communications of the ACM , 13 , 422–426.

Breitinger, F. (2014). On the utility of bytewise approximate matching in computer science with a special focus on digital forensics investigations (Doctoral dissertation, Technical University Darmstadt). Retrieved from http://tuprints.ulb.tu -darmstadt.de/4055/

Breitinger, F., ˚Astebøl, K. P., Baier, H., & Busch, C. (2013, March). mvhash-b - a new approach for similarity preserving hashing. In It security incident management and it forensics (imf ), 2013 seventh international conference on (p. 33-44). doi: 10.1109/IMF.2013.18

Breitinger, F., & Baggili, I. (2014, September). File detection on network traffic using approximate matching. Journal of Digital Forensics, Security and Law (JDFSL), 9 (2), 23–36. Retrieved from http://ojs.jdfsl.org/index.php/ jdfsl/article/view/261

Breitinger, F., & Baier, H. (2012a, May). A Fuzzy Hashing Approach based on Random Sequences and Hamming Distance. In 7th annual Conference on Digital Forensics, Security and Law (ADFSL) (pp. 89–100).

Breitinger, F., & Baier, H. (2012b). Performance issues about context-triggered piecewise hashing. In P. Gladyshev & M. Rogers (Eds.), Digital forensics and cyber crime (Vol. 88, p. 141-155). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/ 978-3-642-35515-8 12 doi: 10.1007/978-3-642-35515-8 12

Breitinger, F., & Baier, H. (2012c). Properties of a similarity preserving hash function and their realization in sdhash. In Information security for south africa (issa), 2012 (p. 1-8). doi: 10.1109/ISSA.2012.6320445

Breitinger, F., & Baier, H. (2013). Similarity preserving hashing: Eligible properties and a new algorithm mrsh-v2. In M. Rogers & K. Seigfried-Spellar (Eds.), Digital forensics and cyber crime (Vol. 114, pp. 167–182). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/ 978-3-642-39891-9 11 doi: 10.1007/978-3-642-39891-9 11

Breitinger, F., Baier, H., & White, D. (2014, May). On the database lookup problem of approximate matching. Digital Investigation, 11, Supplement 1 (0), S1 - S9. Retrieved from http://www.sciencedirect.com/ science/article/pii/ S1742287614000061 (Proceedings of the First Annual DFRWS Europe) doi: http://dx.doi.org/10.1016/ j.diin.2014.03.001

Breitinger, F., Guttman, B., McCarrin, M., Roussev, V., & White, D. (2014, May). Approximate matching: Definition and terminology (Special Publication 800-168). National Institute of Standards and Technologies. Retrieved from http://dx.doi.org/10.6028/ NIST.SP.800-168

Breitinger, F., Rathgeb, C., & Baier, H. (2014, September). An efficient similarity digests database lookup - A logarithmic divide & conquer approach. Journal of Digital Forensics, Security and Law (JDFSL), 9 (2), 155–166. Retrieved from http://ojs.jdfsl.org/index.php/ jdfsl/article/view/276

Breitinger, F., & Roussev, V. (2014, May). Automated evaluation of approximate matching algorithms on real data. Digital Investigation, 11, Supplement 1 (0), S10 - S17. Retrieved from http://www.sciencedirect.com/ science/article/pii/ S1742287614000073 (Proceedings of the First Annual DFRWS Europe) doi: http://dx.doi.org/10.1016/ j.diin.2014.03.002

Breitinger, F., Stivaktakis, G., & Baier, H. (2013, August). FRASH: A framework to test algorithms of similarity hashing. In 13th Digital Forensics Research Conference (DFRWS’13). Monterey.

Breitinger, F., Stivaktakis, G., & Roussev, V. (2013, Sept). Evaluating detection error trade-offs for bytewise approximate matching algorithms. 5th ICST Conference on Digital Forensics & Cyber Crime (ICDF2C).

Breitinger, F., Ziroff, G., Lange, S., & Baier, H. (2014, January). sahash: Similarity hashing based on levenshtein distance. In Tenth annual ifip wg 11.9 international conference on digital forensics (ifip wg11.9).

Broder, A. Z. (1997). On the resemblance and containment of documents. In Compression and complexity of sequences (sequences’97) (pp. 21–29). IEEE Computer Society.

Broder, A. Z., Charikar, M., Frieze, A. M., & Mitzenmacher, M. (1998). Min-wise Independent Permutations. Journal of Computer and System Sciences, 60 , 327-336.

Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual acm symposium on theory of computing (pp. 380–388).

Chen, L., & Wang, G. (2008). An efficient piecewise hashing method for computer forensics. In Knowledge discovery and data mining, 2008. wkdd 2008. first international workshop on (pp. 635–638). doi: 10.1109/WKDD.2008.80

Collange, S., Dandass, Y. S., Daumas, M., & Defour, D. (2009). Using graphics processors for parallelizing hash-based data carving. In System sciences, 2009. hicss’09. 42nd hawaii international conference on (pp. 1–10).

Comodo Group Inc. (2013). PDM (Partial Document Matching). https:// www.mydlp.com/partial-document -matching-how-it-works/.

Faruki, P., Laxmi, V., Bharmal, A., Gaur, M., & Ganmoor, V. (2015). Androsimilar: Robust signature for detecting variants of android malware. Journal of Information Security and Applications, 22 , 66–80.

FIPS, P. (1995). 180-1. secure hash standard. National Institute of Standards and Technology, 17 , 45.

Fowler, G., Noll, L. C., & Vo, P. (1994–2012). Fnv hash. http://www.isthe.com/chongo/ tech/comp/fnv/index.html.

Garfinkel, S., & McCarrin, M. (2014). Hash-based carving: Searching media for complete files and file fragments with sector hashing and hashdb. digital investigation, 12 .

Garfinkel, S., Nelson, A., White, D., & Roussev, V. (2010). Using purpose-built functions and block hashes to enable small block and sub-file forensics. digital investigation, 7 , S13–S23.

Garfinkel, S. L. (2009, March). Announcing frag find: finding file fragments in disk images using sector hashing. Retrieved from http:// tech.groups.yahoo.com/group/ linux forensics/message/3063

Gupta, V. (2013). File fragment detection on network traffic using similarity hashing (Unpublished master’s thesis). Denmark Technical University.

Harichandran, V. S., Breitinger, F., Baggili, I., & Marrington, A. (2016). A cyber forensics needs analysis survey: Revisiting the domain’s needs a decade later. Computers & Security, 57 , 1–13.

Jaccard, P. (1901). Distribution de la flore alpine dans le bassin des drouces et dans quelques regions voisines. Bulletin de la Soci´et´e Vaudoise des Sciences Naturelles, 37 , 241–272.

Jaccard, P. (1912, February). The distribution of the flora in the alpine zone. New Phytologist, 11 , 37–50.

Key, S. (2013). File block hash map analysis. Retrieved from https://www.guidancesoftware .com/appcentral/

Kornblum, J. (2006, September). Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, 3 , 91–97. Retrieved from http://dx.doi.org/ 10.1016/j.diin.2006.06.015 doi: 10.1016/j.diin.2006.06.015

Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press.

Li, Y., Sundaramurthy, S. C., Bardas, A. G., Ou, X., Caragea, D., Hu, X., & Jang, J. (2015). Experimental study of fuzzy hashing in malware clustering analysis. In 8th workshop on cyber security experimentation and test (cset 15).

Manber, U. (1994). Finding similar files in a large file system. In Proceedings of the usenix winter 1994 technical conference on usenix winter 1994 technical conference (pp. 1–10).

Mart´ınez, V. G., Alvarez, F. H., & Encinas, ´ L. H. (2014). State of the art in similarity preserving hashing functions. worldcomp-proceedings.com.

Mart´ınez, V. G., Alvarez, F. H., Encinas, ´ L. H., & Avila, C. S. (2015). A new ´ edit distance for fuzzy hashing applications. In Proceedings of the international conference on security and management (sam) (p. 326).

Neuner, S., Mulazzani, M., Schrittwieser, S., & Weippl, E. (2015). Gradually improving the forensic process. In Availability, reliability and security (ares), 2015 10th international conference on (pp. 404–410).

Oliver, J., Cheng, C., & Chen, Y. (2013). Tlsh–a locality sensitive hash. In 4th cybercrime and trustworthy computing workshop (ctc) (pp. 7–13).

Oliver, J., Forman, S., & Cheng, C. (2014). Using randomization to attack similarity digests. In Applications and techniques in information security (pp. 199–210). Springer.

Payer, M., Crane, S., Larsen, P., Brunthaler, S., Wartell, R., & Franz, M. (2014). Similarity-based matching meets malware diversity. arXiv preprint arXiv:1409.7760 .

Rabin, M. O. (1981). Fingerprinting by random polynomials (Tech. Rep. No. TR1581). Cambridge, Massachusetts: Center for Research in Computing Technology, Harvard University.

Rajaraman, A., & Ullman, J. D. (2012). Mining of massive datasets. Cambridge: Cambridge University Press. Rathgeb, C., Breitinger, F., & Busch, C. (2013, June). Alignment-free cancelable iris biometric templates based on adaptive bloom filters. In Biometrics (icb), international conference on (p. 1-8). doi: 10.1109/ICB.2013.6612976

Rathgeb, C., Breitinger, F., Busch, C., & Baier, H. (2013). On the application of bloom filters to iris biometrics. IET Biometrics.

Ren, L., & Cheng, R. (2015, August). Bytewise approximate matching, searching and clustering. DFRWS USA.

Rivest, R. (1992). The MD5 Message-Digest Algorithm.

Roussev, V. (2009). Building a better similarity trap with statistically improbable features. In System sciences, 2009. hicss ’09. 42nd hawaii international conference on (pp. 1–10). doi: 10.1109/HICSS.2009.97

Roussev, V. (2010). Data fingerprinting with similarity digests. In K.-P. Chow & S. Shenoi (Eds.), Advances in digital forensics vi (Vol. 337, pp. 207–226). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/ 10.1007/978-3-642-15506-2 15 doi: 10.1007/978-3-642-15506-2\ 15

Roussev, V. (2011, August). An evaluation of forensic similarity hashes. Digital Investigation, 8 , 34–41. Retrieved from http://dx.doi.org/10.1016/ j.diin.2011.05.005 doi: 10.1016/j.diin.2011.05.005

Roussev, V. (2012). Managing terabyte-scale investigations with similarity digests. In G. Peterson & S. Shenoi (Eds.), Advances in digital forensics viii (Vol. 383, pp. 19–34). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/ 978-3-642-33962-2 2 doi: 10.1007/978-3-642-33962-2 2

Roussev, V., III, G. G. R., & Marziale, L. (2007, September). Multi-resolution similarity hashing. Digital Investigation, 4 , 105–113. doi: 10.1016/j.diin.2007.06.011

Roussev, V., Richard, I., Golden, & Marziale, L. (2008). Class-aware similarity hashing for data classification. In I. Ray & S. Shenoi (Eds.), Advances in digital forensics iv (Vol. 285, pp. 101–113). Springer US. Retrieved from http://dx.doi.org/ 10.1007/978-0-387-84927-0 9 doi: 10.1007/978-0-387-84927-0 9

Rowe, N. C. (2012). Testing the national software reference library. Digital Investigation, 9 , S131–S138.

Sadowski, C., & Levin, G. (2007, December). Simhash: Hash-based similarity detection. http://simhash.googlecode.com/svn/ trunk/paper/SimHashWithBib.pdf.

Security, H. N. (2013). The rise of sophisticated malware. Retrieved from http://www.net-security.org/ malware news.php?id=2543

Seo, K., Lim, K., Choi, J., Chang, K., & Lee, S. (2009). Detecting similar files based on hash and statistical analysis for digital forensic investigation. In Computer science and its applications, 2009. csa ’09. 2nd international conference on (p. 1-6). doi: 10.1109/CSA.2009.5404198

Symantec. (2010). Machine learning sets new standard for data loss prevention: Describe, fingerprint, learn (Tech. Rep.). Symantec Corporation.

Tridgell, A. (2002–2009). spamsum. http://www.samba.org/ftp/ unpacked/junkcode/spamsum/. Retrieved from http://samba.org/ftp/unpacked/ junkcode/spamsum/README (last accessed Feb 4th, 2016)

Winter, C., Schneider, M., & Yannikos, Y. (2013). F2s2: Fast forensic similarity search through indexing piecewise hashsignatures. Digital Investigation, 10 (4), 361–371. doi: http:// dx.doi.org/10.1016/j.diin.2013.08.003

Zhou, W., Zhou, Y., Jiang, X., & Ning, P. (2012). Detecting repackaged smartphone applications in third-party android marketplaces. In Proceedings of the second acm conference on data and application security and privacy (pp. 317–326).

Ziroff, G. (2012, February). Approaches to similarity-preserving hashing, Bachelor’s thesis, Hochschule Darmstadt.





To view the content in your browser, please download Adobe Reader or, alternately,
you may Download the file to your hard drive.

NOTE: The latest versions of Adobe Reader do not support viewing PDF files within Firefox on Mac OS and if you are using a modern (Intel) Mac, there is no official plugin for viewing PDF files within the browser window.