Solo disponible en BuenasTareas
  • Páginas : 46 (11256 palabras )
  • Descarga(s) : 0
  • Publicado : 23 de noviembre de 2010
Leer documento completo
Vista previa del texto
Machine Learning, 42, 31–60, 2001 c 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

SPADE: An Efficient Algorithm for Mining Frequent Sequences
MOHAMMED J. ZAKI Computer Science Department, Rensselaer Polytechnic Institute, Troy NY 12180-3590 Editor: Douglas Fisher zaki@cs.rpi.edu

Abstract. In this paper we present SPADE, a new algorithm for fast discovery of SequentialPatterns. The existing solutions to this problem make repeated database scans, and use complex hash structures which have poor locality. SPADE utilizes combinatorial properties to decompose the original problem into smaller sub-problems, that can be independently solved in main-memory using efficient lattice search techniques, and using simple join operations. All sequences are discovered in onlythree database scans. Experiments show that SPADE outperforms the best previous algorithm by a factor of two, and by an order of magnitude with some pre-processed data. It also has linear scalability with respect to the number of input-sequences, and a number of other database parameters. Finally, we discuss how the results of sequence mining can be applied in a real application domain. Keywords:sequence mining, sequential patterns, frequent patterns, data mining, knowledge discovery

1. Introduction The sequence mining task is to discover a set of attributes, shared across time among a large number of objects in a given database. For example, consider the sales database of a bookstore, where the objects represent customers and the attributes represent authors or books. Let’s say that thedatabase records the books bought by each customer over a period of time. The discovered patterns are the sequences of books most frequently bought by the customers. An example could be that, “70% of the people who buy Jane Austen’s Pride and Prejudice also buy Emma within a month.” Stores can use these patterns for promotions, shelf placement, etc. Consider another example of a web accessdatabase at a popular site, where an object is a web user and an attribute is a web page. The discovered patterns are the sequences of most frequently accessed pages at that site. This kind of information can be used to restructure the web-site, or to dynamically insert relevant links in web pages based on user access patterns. Other domains where sequence mining has been applied include identifying planfailures (Zaki, Lesh, & Ogihara, 1998), finding network alarm patterns (Hatonen et al., 1996), and so on. The task of discovering all frequent sequences in large databases is quite challenging. The search space is extremely large. For example, with m attributes there are O(m k ) potentially frequent sequences of length k. With millions of objects in the database the problem of I/O minimizationbecomes paramount. However, most current algorithms are iterative in nature, requiring as many full database scans as the longest frequent sequence;



clearly a very expensive process. Some of the methods, especially those using some form of sampling, can be sensitive to the data-skew, which can adversely affect performance. Furthermore, most approaches use very complicatedinternal data structures which have poor locality (Parthasarathy, Zaki, & Li, 1998), and add additional space and computation overheads. Our goal is to overcome all of these limitations. In this paper we present a new algorithm, called SPADE (Sequential PAttern Discovery using Equivalence classes), for discovering the set of all frequent sequences. The key features of our approach are as follows: 1.We use a vertical id-list database format, where we associate with each sequence a list of objects in which it occurs, along with the time-stamps. We show that all frequent sequences can be enumerated via simple temporal joins (or intersections) on id-lists. 2. We use a lattice-theoretic approach to decompose the original search space (lattice) into smaller pieces (sub-lattices) which can be...
tracking img