In the last decade, the proliferation of machine learning (ML) algorithms and their application to large datasets have benefited many researchers and practitioners across scientific areas. Consequently, research in cybercrime and digital forensics has relied on ML techniques and methods for analyzing large quantities of data, such as text, graphics, images, videos, and network traffic scans, to support criminal investigations. Complete and accurate training datasets are indispensable for efficient and effective machine learning models. An essential part of creating complete and accurate datasets is annotating, or labelling, data. We present a method for law enforcement agency investigators to annotate and store specific dark web content. Using a design science strategy, we design and develop tools to enable and extend web content annotation. The annotation tool was implemented as a plugin for the Tor browser. It can store web content, thus automatically creating a dataset of dark web data pertinent to criminal investigations. Combined with a central storage management server, which enables annotation sharing and collaboration, and a web scraping program, the dataset becomes multifold, dynamic, and extensive while maintaining the forensic soundness of the data saved and transmitted. To demonstrate our toolset's fitness for purpose, we used our dataset as training data for ML-based classification models. Five-fold cross-validation was used to evaluate the classifiers, which reported accuracy scores of 85% to 96%. In the concluding sections, we discuss possible use cases of the proposed method in real-life cybercrime investigations, along with ethical concerns and future extensions.
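The five-fold cross-validation used to evaluate the classifiers can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the toy bag-of-words classifier, the sample texts, the labels `market` and `forum`, and every function name are hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
from collections import Counter, defaultdict


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train(texts, labels):
    """Count word occurrences per label (a toy bag-of-words model)."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    return counts


def predict(model, text):
    """Pick the label whose training word counts overlap the text most."""
    words = text.lower().split()
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))


def cross_validate(texts, labels, k=5):
    """Return the accuracy on each held-out fold (assumes len(texts) >= k)."""
    scores = []
    for test_idx in k_fold_indices(len(texts), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(texts)) if i not in held_out]
        model = train([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, texts[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Hypothetical annotated samples; labels alternate so every fold holds both classes.
texts = [
    "buy product price shipping", "thread reply user post",
    "vendor price escrow shipping", "post user thread topic",
    "product listing price vendor", "topic reply post discussion",
    "escrow vendor buy listing", "discussion user topic thread",
    "shipping price buy product", "reply discussion thread post",
]
labels = ["market", "forum"] * 5

scores = cross_validate(texts, labels, k=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Each sample is held out exactly once, so the mean of the five fold scores estimates how the classifier performs on annotations it has not seen. In practice a library implementation with shuffling or stratification would usually replace the contiguous split shown here.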




