Probabilistic Latent Semantic Analysis

To appear in: Uncertainty in Artificial Intelligence, UAI'99, Stockholm

Thomas Hofmann

EECS Department, Computer Science Division, University of California, Berkeley &
International Computer Science Institute, Berkeley, CA
hofmann@cs.berkeley.edu

Abstract
Probabilistic Latent Semantic Analysis is a
novel statistical technique for the analysis of two-mode and co-occurrence data, which
has applications in information retrieval and
filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic
Analysis, which stems from linear algebra and
performs a Singular Value Decomposition of
co-occurrence tables, the proposed method
is based on a mixture decomposition derived
from a latent class model. This results in a
more principled approach which has a solid
foundation in statistics. In order to avoid
overfitting, we propose a widely applicable
generalization of maximum likelihood model
fitting by tempered EM. Our approach yields
substantial and consistent improvements over
Latent Semantic Analysis in a number of experiments.

1 Introduction
Learning from text and natural language is one of the
great challenges of Artificial Intelligence and Machine
Learning. Any substantial progress in this domain has
strong impact on many applications ranging from information retrieval, information filtering, and intelligent interfaces, to speech recognition, natural language
processing, and machine translation. One of the fundamental problems is to learn the meaning and usage
of words in a data-driven fashion, i.e., from some given
text corpus, possibly without further linguistic prior
knowledge.
The main challenge a machine learning system has to
address is rooted in the distinction between the lexical
level of "what actually has been said or written" and
the semantic level of "what was intended" or "what
was referred to" in a text or an utterance. The resulting problems are twofold: (i) polysemy, i.e., a word
may have multiple senses and multiple types of usage
in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have a
similar meaning: they may, at least in certain contexts,
denote the same concept or, in a weaker sense, refer
to the same topic.
Latent semantic analysis (LSA) [3] is a well-known technique which partially addresses these questions. The
key idea is to map high-dimensional count vectors,
such as the ones arising in vector space representations of text documents [12], to a lower dimensional
representation in a so-called latent semantic space. As
the name suggests, the goal of LSA is to find a data
mapping which provides information well beyond the
lexical level and reveals semantic relations between
the entities of interest. Due to its generality, LSA
has proven to be a valuable analysis tool with a wide
range of applications (e.g. [3, 5, 8, 1]). Yet its theoretical foundation remains to a large extent unsatisfactory
and incomplete.
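
The mapping from count vectors to a latent semantic space can be illustrated with a truncated Singular Value Decomposition, the operation the abstract attributes to standard LSA. The following is a minimal sketch, not the procedure of any particular LSA implementation: the function name `lsa_project`, the toy counts, and the choice of two latent dimensions are assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def lsa_project(counts, k):
    """Project rows of a document-term count matrix onto k latent dimensions.

    Hypothetical helper for illustration; rows are documents, columns are terms.
    """
    X = np.asarray(counts, dtype=float)
    # Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_coords = U_k * s_k      # one k-dimensional row per document
    X_k = doc_coords @ Vt_k     # rank-k approximation of the count matrix
    return doc_coords, Vt_k, X_k

# Toy table: 4 documents (rows) over 5 terms (columns); the values are made up.
counts = [
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
]
doc_coords, term_axes, approx = lsa_project(counts, k=2)
```

Each row of `doc_coords` is the low-dimensional representation of one document, and `approx` is the corresponding rank-2 reconstruction of the original counts.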
This paper presents a statistical view on LSA which
leads to a new model called Probabilistic Latent Semantic Analysis (PLSA). In contrast to standard
LSA, its probabilistic variant has a sound statistical
foundation and defines a proper generative model of
the data. A detailed discussion of the numerous advantages of PLSA can be found in subsequent sections.

2 Latent Semantic Analysis
2.1 Count Data and Co-occurrence Tables
LSA can in principle be applied to any type of count
data over a discrete dyadic domain (cf. [7]). However, since the most prominent application of LSA is
in the analysis and retrieval of text documents, we
focus on this setting for the sake of concreteness. Suppose
therefore that we are given a collection of text documents
$D = \{d_1, \ldots, d_N\}$ with terms from a vocabulary $W = \{w_1, \ldots, w_M\}$. By ignoring the sequential order in which words occur in a document, one
may summarize the data...
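
The order-free summary described above can be sketched as a table of per-document term counts. The source text breaks off before stating the exact form of the summary, so the tabular layout below (one row per document, one column per vocabulary term) is an assumption, as are the function name `cooccurrence_table`, the whitespace tokenization, and the two example documents.

```python
from collections import Counter

def cooccurrence_table(documents):
    """Return (vocabulary, table), where table[i][j] counts term j in document i.

    Hypothetical helper: tokenization is plain lowercasing and whitespace
    splitting, and the vocabulary is sorted alphabetically.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    index = {w: j for j, w in enumerate(vocabulary)}
    table = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        # Counter discards word order and keeps only how often each term occurs.
        for w, c in Counter(tokens).items():
            row[index[w]] = c
        table.append(row)
    return vocabulary, table

docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, table = cooccurrence_table(docs)
# vocabulary -> ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# table      -> [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Because only counts are kept, shuffling the words inside a document leaves its row unchanged, which is the sense in which the sequential order is ignored.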