IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, Dec. 1996.

Visualization Techniques for Mining Large Databases: A Comparison
Daniel A. Keim, Hans-Peter Kriegel

Visual data mining techniques have proven to be of high value in exploratory data analysis and they also have a high potential for mining large databases. In this article, we describe and evaluate a new visualization-based approach to mining large databases. The basic idea of our visual data mining techniques is to represent as many data items as possible on the screen at the same time by mapping each data value to a pixel of the screen and arranging the pixels adequately. The major goal of this article is to evaluate our visual data mining techniques and to compare them to other well-known visualization techniques for multidimensional data: the parallel coordinate and stick figure visualization techniques. For the evaluation of visual data mining techniques, in the first place the perception of properties of the data counts, and only in the second place the CPU time and the number of secondary storage accesses are important. In addition to testing the visualization techniques using real data, we developed a testing environment for database visualizations similar to the benchmark approach used for comparing the performance of database systems. The testing environment allows the generation of test data sets with predefined data characteristics which are important for comparing the perceptual abilities of visual data mining techniques.

Keywords: Data Mining, Explorative Data Analysis, Visualizing Large Databases, Visualizing Multidimensional and Multivariate Data

1. Introduction

Having the right information at the right time is crucial for making the right decisions. Because of the fast technological progress, the amount of information which may be of interest for making decisions increases very fast. One reason for the ever increasing stream of data is the automation of activities in all areas, including business, engineering, science, and government. Today, even simple transactions, such as paying by credit card or using the telephone, are typically recorded by using computers. Test series in physics, chemistry, and medicine generate large amounts of data which are collected automatically via sensors and monitoring systems. Even larger amounts of data are collected by satellite observation systems which are expected to generate one terabyte of data every day in the near future. But finding the valuable information hidden in these data is like searching for a needle in a haystack. Very large amounts of data are an important resource, but most of the time it is very hard to find the relevant information. 'Data Mining' may be defined as the (non-trivial) process of searching and analyzing data in order to find implicit but potentially useful information [1]. Let $D = \{d_1, \ldots, d_n\}$ be the data set to be analyzed. Then, the data mining process may be described as the process of finding


• a subset $D'$ of $D$ and
• hypotheses $H_U(D', C)$ about $D'$

that a user $U$ considers useful in an application context $C$. Note that $D'$ may not only have fewer data elements than $D$, but it may also have a lower dimensionality ($m'$). Since in databases the data is often partitioned into relations or object classes, $D$ may be considered as a union of relations $R_1, \ldots, R_k$ ($D = \bigcup_{i=1}^{k} R_i$), each having its own dimensionality ($m_1, \ldots, m_k$). The hypotheses expressing interesting aspects of the data may deal with the whole database or with a single relation ($D' = D$ or $D' = R_i$); they may deal with real subsets of the database ($D' \subset D$ with $|D'| \ll |D|$ and $|D'|$ sufficiently large) or with single exceptional data items, so-called hot spots ($D' \subset D$ and $|D'| = 1$ or sufficiently small when compared to $|D|$). Among others, hypotheses may be
• properties that hold for all or most $e_i \in D'$ ($D' \subseteq D$),
• classifications of $D'$ into classes $C_i$ with different properties $P_i$...
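
The notions of a data set D, a lower-dimensional subset D', and a hot spot introduced above can be illustrated with a small sketch. Everything in it is an assumption made for illustration only: the toy data, the helper names `project` and `hot_spots`, and in particular the rule that flags an item as exceptional when it lies more than 1.5 standard deviations from the mean. The article itself does not prescribe any such rule.

```python
from statistics import mean, pstdev

# Hypothetical toy data set D: each data item has m = 3 attribute values.
D = [
    (1.0, 10.0, 100.0),
    (1.2, 11.0, 101.0),
    (0.9, 9.5, 99.0),
    (1.1, 10.5, 100.5),
    (9.0, 10.2, 100.2),  # exceptional in the first dimension
]

def project(items, dims):
    """Restrict every item to the chosen dimensions (lower dimensionality m')."""
    return [tuple(item[d] for d in dims) for item in items]

def hot_spots(items, dim, threshold=1.5):
    """Return the items whose value in `dim` deviates from the mean by more
    than `threshold` population standard deviations (an assumed, simple rule)."""
    values = [item[dim] for item in items]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [item for item in items if abs(item[dim] - mu) > threshold * sigma]

# D' with lower dimensionality: keep only the first two attributes.
D_low = project(D, (0, 1))

# D' as a very small subset of D: the exceptional items in dimension 0.
D_hot = hot_spots(D, dim=0)
```

Here `D_low` has as many items as D but only two dimensions, while `D_hot` is a subset of D whose size is small compared to |D|, matching the hot-spot case described above.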