Categorization

Pages: 8 (1835 words) Published: August 4, 2010
Inductive Learning Algorithms and Representations for
Text Categorization
Susan Dumais
Microsoft Research
One Microsoft Way
Redmond, WA 98052
sdumais@microsoft.com
John Platt
Microsoft Research
One Microsoft Way
Redmond, WA 98052
jplatt@microsoft.com
Mehran Sahami
Computer Science Department
Stanford University
Stanford, CA 94305-9010
sahami@cs.stanford.edu
David Heckerman
Microsoft Research
One Microsoft Way
Redmond, WA 98052
heckerma@microsoft.com
1. ABSTRACT
Text categorization – the assignment of natural
language texts to one or more predefined
categories based on their content – is an
important component in many information
organization and management tasks. We
compare the effectiveness of five different
automatic learning algorithms for text categorization in terms of learning speed, real-time
classification speed, and classification
accuracy. We also examine training set size,
and alternative document representations.
Very accurate text classifiers can be learned
automatically from training examples. Linear
Support Vector Machines (SVMs) are
particularly promising because they are very
accurate, quick to train, and quick to evaluate.
1.1 Keywords
Text categorization, classification, support vector machines,
machine learning, information management.
2. INTRODUCTION
As the volume of information available on the Internet and
corporate intranets continues to increase, there is growing
interest in helping people better find, filter, and manage
these resources. Text categorization – the assignment of
natural language texts to one or more predefined categories
based on their content – is an important component in many
information organization and management tasks. Its most
widespread application to date has been for assigning
subject categories to documents to support text retrieval,
routing and filtering.
Automatic text categorization can play an important role in
a wide variety of more flexible, dynamic and personalized
information management tasks as well: real-time sorting of
email or files into folder hierarchies; topic identification to
support topic-specific processing operations; structured
search and/or browsing; or finding documents that match
long-term standing interests or more dynamic task-based
interests. Classification technologies should be able to
support category structures that are very general, consistent
across individuals, and relatively static (e.g., Dewey
Decimal or Library of Congress classification systems,
Medical Subject Headings (MeSH), or Yahoo!’s topic
hierarchy), as well as those that are more dynamic and
customized to individual interests or tasks (e.g., email about
the CIKM conference).
In many contexts (Dewey, MeSH, Yahoo!, CyberPatrol),
trained professionals are employed to categorize new items.
This process is very time-consuming and costly, thus
limiting its applicability. Consequently there is increased
interest in developing technologies for automatic text
categorization. Rule-based approaches similar to those
used in expert systems are common (e.g., Hayes and
Weinstein’s CONSTRUE system for classifying Reuters
news stories, 1990), but they generally require manual
construction of the rules, make rigid binary decisions about
category membership, and are typically difficult to modify.
Another strategy is to use inductive learning techniques to
automatically construct classifiers using labeled training
data. Text classification poses many challenges for
inductive learning methods since there can be millions of
word features. The resulting classifiers, however, have
many advantages: they are easy to construct and update,
they depend only on information that is easy for people to
provide (i.e., examples of items that are in or out of
categories), they can be customized to specific categories of
interest to individuals, and they allow users to smoothly
trade off precision and recall depending on their...
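
The idea of learning a classifier from labeled examples and then trading off precision against recall can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy documents, the binary bag-of-words features, and the use of a simple perceptron as the linear learner. It is not the SVM training procedure or any other method evaluated in this paper; it only shows that a learned linear scorer yields a number per document, and that moving the decision threshold on that number shifts the balance between precision and recall.

```python
from collections import Counter

# Hypothetical toy training set: (text, label), label True = "in category".
TRAIN = [
    ("wheat corn harvest exports rise", True),
    ("corn prices fall on strong harvest", True),
    ("wheat farmers expect record crop", True),
    ("bank raises interest rates", False),
    ("stocks fall as interest rates rise", False),
    ("bank profits rise on strong trading", False),
]


def features(text):
    """Binary bag-of-words: the set of distinct lowercase tokens."""
    return set(text.lower().split())


def score(weights, bias, text):
    """Linear score: sum of the weights of the words present, plus a bias."""
    return sum(weights[word] for word in features(text)) + bias


def train_perceptron(examples, epochs=10):
    """Learn one weight per word (and a bias) from labeled examples.

    A plain perceptron, used here only as a stand-in linear learner.
    """
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for text, label in examples:
            target = 1 if label else -1
            if target * score(weights, bias, text) <= 0:  # wrong or undecided
                for word in features(text):
                    weights[word] += target
                bias += target
    return weights, bias


def precision_recall(weights, bias, examples, threshold):
    """Precision and recall when 'in category' means score > threshold."""
    predicted = [score(weights, bias, text) > threshold for text, _ in examples]
    actual = [label for _, label in examples]
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


weights, bias = train_perceptron(TRAIN)
for threshold in (-2, 0, 2):
    precision, recall = precision_recall(weights, bias, TRAIN, threshold)
    print(f"threshold={threshold:+d}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, lowering the threshold admits more documents into the category (recall stays high while precision drops), and raising it does the reverse. The figures are measured on the training examples themselves, so they say nothing about accuracy on unseen text.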