Pages: 21 (5005 words). Published: December 2, 2012
Structure in the Enron Email Dataset
P.S. Keila and D.B. Skillicorn, School of Computing, Queen’s University, {keila,skill}@cs.queensu.ca

Abstract

We investigate the structures present in the Enron email dataset using singular value decomposition and semidiscrete decomposition. Using word frequency profiles we show that messages fall into two distinct groups, whose extrema are characterized by short messages and rare words versus long messages and common words. It is surprising that length of message and word use pattern should be related in this way. We also investigate relationships among individuals based on their patterns of word use in email. We show that word use is correlated to function within the organization, as expected. We also show that word use among those involved in alleged criminal activity may be slightly distinctive.

1 Introduction

Many countries intercept communication and analyze messages as an intelligence technique. The largest such system is Echelon [3], run jointly by the U.S., Canada, the U.K., Australia, and New Zealand. The standard publicly acknowledged analysis of intercepted data is to search messages for keywords, discard those messages that do not contain keywords, and pass those that do to analysts for further processing. An interesting question is what else can be learned from such messages; for example, can connections between otherwise innocuous messages reveal links between their senders and/or receivers [13]?

The Enron email dataset provides real-world data that is arguably of the same kind as data from Echelon intercepts: a set of messages about a wide range of topics, from a large group of people who do not form a closed set. Further, individuals at Enron were involved in several apparently criminal activities. Hence, like Echelon data, there are probably patterns of unusual communication within the dataset. Understanding the characteristics and structure of both normal and abnormal (collusive) emails therefore provides information about how such data might be better analyzed in an intelligence setting.

Linguistically, email has been considered to occupy a middle ground between written material, which is typically well organized and uses a more formal grammatical style and word choices, and speech, which is produced in real time and characterized by sentence fragments and informal word choices. Although the potential for editing email exists, anecdotal evidence suggests that this rarely happens; on the other hand, email does not usually contain the spoken artifacts of pausing (ums etc.).

We examine the structure of the Enron email dataset, looking for what it can tell us about how email is constructed and used, and also for what it can tell us about how individuals use email to communicate.

2 Related Work

Previous attention has been paid to email with two main goals: spam detection and email topic classification. Spam detection tends to rely on local properties of email: the use of particular words and, more generally, the occurrence of unlikely combinations of words. This approach has become less successful as spam has increasingly used symbol substitution (still readable to humans), which makes most of its content seem not to be words at all.

Email topic classification attempts to assist users by automatically classifying their email into different folders by topic. Some examples are [2, 7, 10, 12]. This work has been moderately successful when the topics are known in advance, but performs much less adequately in an unsupervised setting (but see some of the papers in this workshop). An attempt to find connections between people based on patterns in their email can be found in [8].

3 Matrix Decompositions

We will use two matrix decompositions: Singular Value Decomposition (SVD) [4] and SemiDiscrete Decomposition (SDD) [5, 6]. Both decompose a matrix, A, with n rows and m columns into the form

    A = C W F

where C is n×k, W is a k×k diagonal matrix whose entries indicate the...
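
As a rough illustration of the form A = C W F, the sketch below computes only the k = 1 case (C is n×1, W is 1×1, F is 1×m) for a tiny matrix. It is not the authors' code and not a full SVD or SDD: it swaps in plain power iteration on the product of A's transpose with A to find the leading singular triple. The toy matrix, the reading of rows as messages and columns as words, and the helper names transpose, matvec, norm, and leading_triple are all assumptions made for the example.

```python
import math

def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def leading_triple(A, iterations=200):
    """Rank-1 factors (c, w, f) such that A is approximated by w * c * f^T.

    Uses power iteration on A^T A. Assumes A is non-zero and has
    non-negative entries (as count data would), so the all-ones start
    vector is not orthogonal to the leading right singular vector.
    """
    At = transpose(A)
    f = [1.0] * len(A[0])              # start vector of length m
    for _ in range(iterations):
        f = matvec(At, matvec(A, f))   # one step of A^T A applied to f
        length = norm(f)
        f = [x / length for x in f]    # renormalize to unit length
    Af = matvec(A, f)
    w = norm(Af)                       # leading singular value
    c = [x / w for x in Af]            # matching unit vector of length n
    return c, w, f

# Hypothetical toy data: 3 rows (messages) by 3 columns (words).
A = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

c, w, f = leading_triple(A)

# Rank-1 reconstruction: entry (i, j) is w * c[i] * f[j].
approx = [[w * ci * fj for fj in f] for ci in c]
```

Here c plays the role of the single column of C, w the single diagonal entry of W, and f the single row of F. For this toy matrix the rank-1 product recovers the block formed by the first two rows and columns and leaves out the isolated entry in the third, which is the kind of dominant structure a truncated decomposition keeps.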