“Time for Some Traffic Problems”: Enhancing E-Discovery and Big Data Processing Tools with Linguistic Methods for Deception Detection

Erin S. Crabb, University of Maryland, College Park

Prior Publisher

The Association of Digital Forensics, Security and Law (ADFSL)

Abstract

Linguistic deception theory provides methods for discovering potentially deceptive texts so that they can be made accessible to clerical review. This paper proposes integrating these linguistic methods with traditional e-discovery techniques to identify deceptive texts within a given author’s larger body of written work, such as their sent email box. First, a set of linguistic features associated with deception is identified, and a prototype classifier is constructed to analyze texts and describe the distributions of those features, while avoiding topic-specific features to improve recall of relevant documents. The tool is then applied to a portion of the Enron Email Dataset to illustrate how these strategies identify records, demonstrating the tool's advantages and its ability to stratify a large data set.
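The general idea in the abstract, scoring each text on topic-independent linguistic features and comparing it with the same author's other writing, can be illustrated with a minimal Python sketch. It is not the paper's classifier: the cue word lists, the feature names, and the z-score ranking in `rank_by_deviation` are all assumptions chosen for illustration, and the abstract does not say which features or which model the prototype uses.

```python
import re
from statistics import mean, pstdev

# Hypothetical cue lists, for illustration only (not the paper's feature set).
# They contain function words rather than topic words, so the features do not
# depend on what an email is about.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
NEGATIONS = {"no", "not", "never", "none", "nothing", "cannot"}
HEDGES = {"maybe", "perhaps", "possibly", "probably", "somewhat"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def features(text):
    """Return the rate of each cue class per token in the text."""
    tokens = tokenize(text)
    n = len(tokens) or 1  # avoid division by zero on empty texts
    return {
        "first_person": sum(t in FIRST_PERSON for t in tokens) / n,
        "negation": sum(t in NEGATIONS for t in tokens) / n,
        "hedge": sum(t in HEDGES for t in tokens) / n,
    }


def rank_by_deviation(docs):
    """Order document indices by how far each document's feature rates
    deviate from the author's own baseline (sum of absolute z-scores)."""
    if not docs:
        return []
    rows = [features(d) for d in docs]
    names = list(rows[0])
    # Per-feature mean and population standard deviation across the author's texts.
    stats = {}
    for name in names:
        values = [r[name] for r in rows]
        stats[name] = (mean(values), pstdev(values))
    scored = []
    for index, row in enumerate(rows):
        score = 0.0
        for name in names:
            mu, sigma = stats[name]
            if sigma > 0:  # skip features that never vary for this author
                score += abs(row[name] - mu) / sigma
        scored.append((score, index))
    # Highest deviation first; ties are broken by original position.
    return [index for score, index in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Calling `rank_by_deviation` on a list of one author's emails returns the indices of those emails with the most atypical ones first. A high rank only marks a text as unusual for that author and a candidate for review; it does not establish that the text is deceptive.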

Recommended Citation

Crabb, Erin S. (2014) "“Time for Some Traffic Problems”: Enhancing E-Discovery and Big Data Processing Tools with Linguistic Methods for Deception Detection," Journal of Digital Forensics, Security and Law: Vol. 9, No. 2, Article 14.
DOI: https://doi.org/10.15394/jdfsl.2014.1179
Available at: https://commons.erau.edu/jdfsl/vol9/iss2/14

Included in

Computer Engineering Commons, Computer Law Commons, Electrical and Computer Engineering Commons, Forensic Science and Technology Commons, Information Security Commons
