Email is the most common and comparatively the most efficient means of exchanging information in today's world. However, given its widespread use in all sectors, email has been a target of spammers since the beginning. Spam filtering has since given rise to critical follow-on activities such as forensic analysis based on mining spam email. The spam data mine at the University of Alabama at Birmingham is considered to be one of the most prominent resources for mining and identifying spam sources. It is a widely studied repository used by researchers from different global organizations. The usual process of mining the spam data involves going through every email in the data mine and clustering the emails based on their attributes. However, given the size of the data mine, executing the clustering mechanism takes an exceptionally long time on every run. In this paper, we illustrate sampling as an efficient tool for data reduction that preserves the information within the clusters, allowing spam forensic experts to quickly and effectively identify the ‘hot zone’ from the spam campaigns. We provide a detailed comparative analysis of the quality of the clusters after sampling, the overall distribution of clusters on the spam data, and timing measurements for our sampling approach. Additionally, we present different strategies that allowed us to optimize the sampling process using data preprocessing and the database engine's computational resources, thereby improving the performance of the clustering process.
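The idea that a sample can preserve the relative sizes of clusters can be illustrated with a minimal sketch. It assumes simple random sampling without replacement, which the abstract does not confirm as the method used in the paper. The cluster labels, the data, and the helper names `cluster_shares` and `sample_labels` are hypothetical and are not drawn from the spam data mine.

```python
import random
from collections import Counter


def cluster_shares(labels):
    """Return each cluster's share of the records as {label: fraction}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def sample_labels(labels, fraction, seed=0):
    """Draw a simple random sample (without replacement) of the given fraction."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    k = max(1, int(len(labels) * fraction))
    return rng.sample(labels, k)


# Hypothetical data: one cluster label per spam email (assumption, not real data).
full = ["campaign_a"] * 6000 + ["campaign_b"] * 3000 + ["campaign_c"] * 1000
sample = sample_labels(full, fraction=0.05)

full_shares = cluster_shares(full)
sample_shares = cluster_shares(sample)

# Compare each cluster's share in the full data with its share in the sample.
for label in sorted(full_shares):
    print(label, round(full_shares[label], 3), round(sample_shares.get(label, 0.0), 3))
```

If the shares in the sample stay close to the shares in the full data, the largest clusters can be located from the sample alone, at a fraction of the clustering cost. How close is close enough is the kind of question the comparative analysis in the paper addresses.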
Khan, Rasib; Mizan, Mainul; Hasan, Ragib; and Sprague, Alan. "Hot Zone Identification: Analyzing Effects of Data Sampling On Spam Clustering," Journal of Digital Forensics, Security and Law: Vol. 9, Article 5.
Available at: http://commons.erau.edu/jdfsl/vol9/iss1/5