In the era of big data, the volume of digital data is growing rapidly, making it harder for investigators to examine evidence in a reasonable amount of time. A major requirement of modern forensic investigation is therefore the ability to automatically filter correlated data, thereby reducing and focusing the investigator's manual effort. Approximate matching is a technique for measuring the “closeness” of two digital artifacts. mvHash-B is a well-known approximate matching scheme that compares two digital objects and produces a ‘score of similarity’ on a scale of 0 to 100. However, no security analysis of mvHash-B is available in the literature. In this work, we perform the first academic security analysis of this algorithm and show that an attacker can “fool” it by driving the similarity score close to zero even when the objects are very similar. By similarity of objects, we mean semantic similarity for text and visual match for images.

The designers of mvHash-B claimed that the scheme is secure against ‘active manipulation’. We contest this claim. We propose an algorithm that takes a given document and produces another of the same size that preserves its semantic meaning (for text files) or visual appearance (for image files) but has a low similarity score as measured by mvHash-B. In our experiments, the similarity score can be reduced from 100 to less than 6 for text and image documents. We experimented with 50 text files and 200 images; the average similarity score between the original file and the file produced by our algorithm was 4 for text files and 6 for image files. Moreover, when the original file was small, the similarity score between the two files was almost always close to 0.
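The general shape of such an attack can be illustrated with a deliberately simplified sketch. The code below is not mvHash-B and is not the algorithm proposed in this work; it assumes a hypothetical toy scheme (fixed 16-byte blocks, one SHA-256 digest per block, score equal to the percentage of matching blocks) and a naive perturbation that flips one bit in every block. All names, including `BLOCK`, `toy_score`, and `perturb_every_block`, are illustrative assumptions. Unlike the attack described above, this perturbation makes no attempt to preserve semantic or visual meaning; it only shows how a same-size file with few changed bytes can receive a score of 0.

```python
import hashlib

BLOCK = 16  # hypothetical block size for this toy scheme (not an mvHash-B parameter)


def block_digests(data: bytes) -> list[bytes]:
    """Hash each fixed-size block of the input independently."""
    return [
        hashlib.sha256(data[i:i + BLOCK]).digest()
        for i in range(0, len(data), BLOCK)
    ]


def toy_score(a: bytes, b: bytes) -> int:
    """Toy similarity on a 0..100 scale: share of positions whose block digests match."""
    da, db = block_digests(a), block_digests(b)
    if not da or not db:
        return 0
    matches = sum(1 for x, y in zip(da, db) if x == y)
    return round(100 * matches / max(len(da), len(db)))


def perturb_every_block(data: bytes) -> bytes:
    """Flip the lowest bit of the first byte of every block; the size is unchanged."""
    out = bytearray(data)
    for i in range(0, len(out), BLOCK):
        out[i] ^= 0x01
    return bytes(out)


if __name__ == "__main__":
    original = b"The quick brown fox jumps over the lazy dog. " * 20
    forged = perturb_every_block(original)
    changed = sum(1 for x, y in zip(original, forged) if x != y)
    print("same size:", len(original) == len(forged))
    print("bytes changed:", changed, "of", len(original))
    print("score(original, original):", toy_score(original, original))
    print("score(original, forged):", toy_score(original, forged))
```

In this toy setting, one changed byte per block is enough to invalidate every block digest, which is why schemes of this kind need to be analysed against an active adversary and not only against accidental changes.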

To improve the security of mvHash-B against active adversaries, we propose a modification to the scheme and show that it prevents the attack described in this work.






