Machine learning for imbalanced datasets: application in medical diagnostic

Pages: 17 (4195 words) Published: June 25, 2011

Proceedings of the 19th International FLAIRS Conference (FLAIRS-2006), Melbourne Beach, Florida, May 11-13, 2006

Machine Learning for Imbalanced Datasets: Application in Medical Diagnostic
Luis Mena (a, b) and Jesus A. Gonzalez (a)

(a) Department of Computer Science, National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico.
(b) Department of Computer Science, Faculty of Engineering, University of Zulia, Maracaibo, Venezuela.
{lmena,jagonzalez}@.inaoep.mx

Abstract
In this paper, we present a new rule induction algorithm for machine learning in medical diagnosis. Medical datasets, like many other real-world datasets, exhibit an imbalanced class distribution. However, this is not the only problem to solve for this kind of dataset: we must also consider problems other than the poor classification accuracy caused by the class distribution. Therefore, we propose a different strategy based on maximizing the classification accuracy of the minority class, as opposed to the commonly used sampling and cost techniques. Our experiments were conducted using an original dataset for cardiovascular disease diagnosis and three public datasets. The experiments were performed using standard classifiers (Naïve Bayes, C4.5 and k-Nearest Neighbor), emergent classifiers (Neural Networks and Support Vector Machines) and other classifiers used for imbalanced datasets (Ripper and Random Forest). In all the tests, our algorithm showed competitive results in terms of accuracy and area under the ROC curve, and it outperformed the other classifiers in terms of comprehensibility and validity.

Key words: machine learning, imbalanced datasets, medical diagnosis, accuracy, validity and comprehensibility.

1. Introduction
Many real-world datasets exhibit an imbalanced class distribution, with a majority class of normal data and a minority class of abnormal or important data. Fraud detection, network intrusion and medical diagnosis are examples of this kind of dataset. However, unlike other machine learning applications, the medical diagnosis problem does not end once we obtain a model to classify new instances. That is, if an instance is classified as sick (the most important class), the generated knowledge should be able to provide the medical staff with a novel point of view about the given problem. This could help to apply a medical treatment on time to avoid, delay, or...

Another important issue is that medical datasets used for machine learning should be representative of the general incidence of the studied disease. This is necessary for the generated knowledge to be usable with other populations. Therefore, the over-sampling and under-sampling techniques (Kubat and Matwin 1997, Chawla et al. 2002) frequently used to balance the classes and to improve the minority class prediction of some classifiers could generate biased knowledge that might not be applicable to the general population, because the datasets have been artificially manipulated. For this reason, we propose a different strategy that tries to maximize the classification accuracy of the minority class (sick people) without modifying the original dataset. Thus, each step of our algorithm is guided by this objective. Since we are dealing with binary classification problems, the majority class accuracy is guaranteed by default.

In section 2 we describe the methodology of our algorithm. In section 3, we present a brief description of the datasets and classifiers used in our experiments. In section 4 we compare the performance of our algorithm with some standard classifiers, emergent classifiers, and classifiers specifically used for imbalanced datasets. This comparison covers the accuracy, comprehensibility and validity (only for the symbolic classifiers) of the obtained results. In section 5 we analyze the results and finally, in section 6, we present our conclusions and future work.
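
The following Python sketch illustrates two points made in this introduction; it is not the authors' algorithm. It assumes a hypothetical dataset of 95 healthy and 5 sick patients and a trivial classifier that always predicts the majority class. The helper `class_accuracies` and all variable names are illustrative assumptions. The sketch shows that overall accuracy can look high while minority-class accuracy is zero, and that random under-sampling changes the disease prevalence in the training data.

```python
import random
from typing import List, Sequence, Tuple


def class_accuracies(
    y_true: Sequence[int], y_pred: Sequence[int], minority: int = 1
) -> Tuple[float, float, float]:
    """Return (overall, minority-class, majority-class) accuracy."""
    pairs = list(zip(y_true, y_pred))
    minority_pairs = [(t, p) for t, p in pairs if t == minority]
    majority_pairs = [(t, p) for t, p in pairs if t != minority]

    def acc(ps: List[Tuple[int, int]]) -> float:
        # Fraction of correct predictions; NaN if the subset is empty.
        return sum(t == p for t, p in ps) / len(ps) if ps else float("nan")

    return acc(pairs), acc(minority_pairs), acc(majority_pairs)


# Hypothetical imbalanced dataset: 0 = healthy (majority), 1 = sick (minority).
y_true = [0] * 95 + [1] * 5
# Trivial classifier that always predicts the majority class.
y_pred = [0] * 100

overall, minority_acc, majority_acc = class_accuracies(y_true, y_pred)
print(f"overall={overall:.2f} minority={minority_acc:.2f} majority={majority_acc:.2f}")

# Random under-sampling: keep as many healthy cases as there are sick cases.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
healthy_idx = [i for i, t in enumerate(y_true) if t == 0]
sick_idx = [i for i, t in enumerate(y_true) if t == 1]
kept_idx = rng.sample(healthy_idx, len(sick_idx)) + sick_idx

prevalence_before = len(sick_idx) / len(y_true)
prevalence_after = len(sick_idx) / len(kept_idx)
print(f"prevalence before={prevalence_before:.2f} after={prevalence_after:.2f}")
```

In this toy setting the balanced sample no longer reflects the incidence of the disease in the original data, which is the kind of artificial manipulation the text argues against.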
