Computational Learning

Pages: 89 (22,236 words) Published: April 23, 2012
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 9, SEPTEMBER 2009, PAGE 1263

Learning from Imbalanced Data
Haibo He, Member, IEEE, and Edwardo A. Garcia
Abstract—With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as
surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and
analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques
have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning
problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning
problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class
distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new
understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge
representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data.
Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment
metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future
research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for
learning from imbalanced data.
Index Terms—Imbalanced learning, classification, sampling methods, cost-sensitive learning, kernel-based learning, active learning,
assessment metrics.

1 INTRODUCTION

RECENT developments in science and technology have
enabled the growth and availability of raw data to
occur at an explosive rate. This has created an immense
opportunity for knowledge discovery and data engineering
research to play an essential role in a wide range of
applications from daily civilian life to national security,
from enterprise information processing to governmental
decision-making support systems, from microscale data
analysis to macroscale knowledge discovery. In recent
years, the imbalanced learning problem has drawn a
significant amount of interest from academia, industry,
and government funding agencies. The fundamental issue
with the imbalanced learning problem is the ability of
imbalanced data to significantly compromise the performance of most standard learning algorithms. Most standard
algorithms assume or expect balanced class distributions or
equal misclassification costs. Therefore, when presented
with complex imbalanced data sets, these algorithms fail to
properly represent the distributive characteristics of the
data and resultantly provide unfavorable accuracies across
the classes of the data. When translated to real-world
domains, the imbalanced learning problem represents a
recurring problem of high importance with wide-ranging
implications, warranting increasing exploration. This increased interest is reflected in the recent installment of
several major workshops, conferences,...

The authors are with the Department of Electrical and Computer
Engineering, Stevens Institute of Technology, Hoboken, NJ 07030.
E-mail: hhe@stevens.edu, edwardo.garcia@nyu.edu.
Manuscript received 1 May 2008; revised 6 Oct. 2008; accepted 1 Dec. 2008;
published online 19 Dec. 2008.
Recommended for acceptance by C. Clifton.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number
TKDE-2008-05-0233.
Digital Object Identifier no. 10.1109/TKDE.2008.239.
1041-4347/09/$25.00 © 2009 IEEE
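
The introduction's claim that standard algorithms can yield unfavorable accuracies across the classes of imbalanced data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the paper: the toy labels (990 majority examples, 10 minority examples), the always-predict-majority classifier, and the helper names `accuracy` and `per_class_recall`.

```python
# Hypothetical toy data (not from the paper): 990 majority-class examples
# (label 0) and 10 minority-class examples (label 1).
y_true = [0] * 990 + [1] * 10

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all examples that were classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each class, the fraction of its examples classified correctly."""
    recalls = {}
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / total
    return recalls


overall = accuracy(y_true, y_pred)
by_class = per_class_recall(y_true, y_pred)
print("overall accuracy:", overall)
print("per-class recall:", by_class)
```

Under these assumptions the overall accuracy is high even though no minority example is ever recognized, which is one reason a single accuracy figure can be misleading on skewed class distributions.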