Abstract
In the last decade, the proliferation of machine learning (ML) algorithms and their application to large datasets have benefited researchers and practitioners in many scientific areas. Consequently, research in cybercrime and digital forensics has come to rely on ML techniques for analyzing large quantities of data, such as text, graphics, images, videos, and network traffic scans, to support criminal investigations. Complete and accurate training datasets are indispensable for efficient and effective ML models, and an essential part of creating such datasets is annotating, or labelling, the data. We present a method for law enforcement agency investigators to annotate and store specific dark web content. Following a design science strategy, we design and develop tools that enable and extend web content annotation. The annotation tool is implemented as a plugin for the Tor browser. It can store web content, thus automatically creating a dataset of dark web data pertinent to criminal investigations. Combined with a central storage management server, which enables annotation sharing and collaboration, and a web scraping program, the dataset becomes multifold, dynamic, and extensive while maintaining the forensic soundness of the data saved and transmitted. To demonstrate the toolset's fitness for purpose, we used our dataset as training data for ML-based classification models. The classifiers were evaluated with five-fold cross-validation and reported accuracy scores between 85% and 96%. In the concluding sections, we discuss possible use cases of the proposed method in real-life cybercrime investigations, along with ethical concerns and future extensions.
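The five-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the keyword-counting classifier, the function names, and the ten sample texts with their "market" and "forum" labels are all hypothetical stand-ins, chosen only so the example runs with the Python standard library.

```python
import random
from collections import Counter


def five_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_keyword_classifier(texts, labels):
    """Toy stand-in model: count how often each word occurs per label."""
    counts = {}
    for text, label in zip(texts, labels):
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def predict(model, text):
    """Pick the label whose training vocabulary overlaps the text most."""
    words = text.lower().split()
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))


def cross_validate(texts, labels, k=5, seed=0):
    """Return one accuracy score per fold, each fold held out once."""
    scores = []
    for held_out in five_fold_indices(len(texts), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(texts)) if i not in held]
        model = train_keyword_classifier(
            [texts[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct = sum(predict(model, texts[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Hypothetical annotated snippets (invented for illustration only).
texts = [
    "vendor listing price shipping escrow",
    "new listing vendor price escrow",
    "shipping price vendor review listing",
    "escrow vendor shipping listing",
    "price listing escrow vendor",
    "forum thread reply user post",
    "user post thread reply",
    "reply thread forum user",
    "post forum user thread",
    "thread reply post forum",
]
labels = ["market"] * 5 + ["forum"] * 5

scores = cross_validate(texts, labels)
print("per-fold accuracy:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Each sample is used for testing exactly once and for training in the other four rounds, so the mean of the five scores estimates how a classifier would perform on unseen annotated content.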
Recommended Citation
Bergman, Jesper and Popov, Oliver B. (2022) "The Digital Detective's Discourse - A toolset for forensically sound collaborative dark web content annotation and collection," Journal of Digital Forensics, Security and Law: Vol. 17, Article 5.
DOI: https://doi.org/10.15394/jdfsl.2022.1740
Available at: https://commons.erau.edu/jdfsl/vol17/iss1/5