
Abstract

In the last decade, the proliferation of machine learning (ML) algorithms and their application to big data sets have benefited many researchers and practitioners in different scientific areas. Consequently, research in cybercrime and digital forensics has relied on ML techniques and methods for analyzing large quantities of data such as text, graphics, images, videos, and network traffic scans to support criminal investigations. Complete and accurate training data sets are indispensable for efficient and effective machine learning models. An essential part of creating complete and accurate data sets is annotating, or labelling, data. We present a method for law enforcement agency investigators to annotate and store specific dark web content. Using a design science strategy, we design and develop tools to enable and extend web content annotation. The annotation tool was implemented as a plugin for the Tor browser. It can store web content, thus automatically creating a dataset of dark web data pertinent to criminal investigations. Combined with a central storage management server, which enables annotation sharing and collaboration, and a web scraping program, the dataset becomes multifold, dynamic, and extensive while maintaining the forensic soundness of the data saved and transmitted. To demonstrate the toolset's fitness for purpose, we used our dataset as training data for ML-based classification models. The classifiers were evaluated with five-fold cross-validation and reported accuracy scores of 85-96%. In the concluding sections, we discuss possible use cases of the proposed method in real-life cybercrime investigations, along with ethical concerns and future extensions.
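As an illustration of the evaluation step named in the abstract, the following is a minimal sketch of five-fold cross-validation in plain Python. It is not the authors' implementation: the keyword-vote classifier, the function names, and the ten toy page snippets are all hypothetical stand-ins chosen only to make the fold logic concrete.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_keyword_model(texts, labels):
    """Hypothetical stand-in classifier: count how often each word co-occurs with each label."""
    counts = {}
    for text, label in zip(texts, labels):
        for word in text.lower().split():
            counts.setdefault(word, Counter())[label] += 1
    default = Counter(labels).most_common(1)[0][0]  # fallback when no word is known
    return counts, default


def predict_keyword_model(model, text):
    """Predict the label that receives the most word votes."""
    counts, default = model
    votes = Counter()
    for word in text.lower().split():
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default


def cross_validate(texts, labels, k=5, seed=0):
    """Hold out each fold once, train on the remaining folds, and return per-fold accuracy."""
    scores = []
    for held_out in k_fold_indices(len(texts), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(texts)) if i not in held]
        model = train_keyword_model(
            [texts[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct = sum(predict_keyword_model(model, texts[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Toy annotated snippets (invented for this sketch, not taken from the paper's dataset).
texts = [
    "vendor listing price shipping escrow",
    "new listing price per gram shipping",
    "vendor escrow listing worldwide shipping",
    "listing price bulk discount vendor",
    "escrow payment listing price vendor",
    "thread reply question about privacy",
    "forum thread reply opsec question",
    "reply to thread privacy question",
    "forum question thread about opsec",
    "thread reply forum privacy opsec",
]
labels = ["market"] * 5 + ["forum"] * 5

scores = cross_validate(texts, labels, k=5, seed=0)
print("per-fold accuracy:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Each annotated sample is used for testing exactly once and for training in the other four rounds, so the reported range reflects the spread of accuracy across the five held-out folds.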

DOI

https://doi.org/10.15394/jdfsl.2022.1740
