Duplication

Pages: 4 (780 words) Published: October 15, 2014
6.5.5 Duplicate Detection
Duplicate documents or pages are not a problem in traditional IR. However, in the context of the Web, duplication is a significant issue. There are different types of duplication of pages and contents on the Web. Copying a page is usually called duplication or replication, and copying an entire site is called mirroring. Duplicate pages and mirror sites are often used to improve the efficiency of browsing and file downloading worldwide, because bandwidth across geographic regions is limited and network performance can be poor or unpredictable. Of course, some duplicate pages are the result of plagiarism. Detecting such pages and sites can reduce the index size and improve search results.

Several methods can be used to find duplicate information. The simplest method is to hash the whole document, e.g., using the MD5 algorithm, or to compute an aggregated number (e.g., a checksum). However, these methods are only useful for detecting exact duplicates. On the Web, one seldom finds exact duplicates. For example, even different mirror sites may have different URLs, different webmasters, different contact information, different advertisements to suit local needs, etc.

One efficient duplicate detection technique is based on n-grams (also called shingles). An n-gram is simply a consecutive sequence of words of a fixed window size n. For example, the sentence “John went to school with his brother” can be represented with five 3-gram phrases: “John went to”, “went to school”, “to school with”, “school with his”, and “with his brother”. Note that a 1-gram is simply an individual word. Let Sn(d) be the set of distinct n-grams (or shingles) contained in document d. Each n-gram may be coded with a number or an MD5 hash (which is usually written as a 32-digit hexadecimal number). Given the n-gram representations of two documents d1 and d2, Sn(d1) and Sn(d2), the Jaccard coefficient can be used to compute the similarity of the two documents:

sim(d1, d2) = |Sn(d1) ∩ Sn(d2)| / |Sn(d1) ∪ Sn(d2)|

A...
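
The techniques described in this section (whole-document hashing, n-gram shingling, and Jaccard similarity) can be illustrated with a minimal Python sketch. It is not taken from the source. The function names (`tokenize`, `shingles`, `hashed_shingles`, `jaccard`, `exact_fingerprint`) are hypothetical, and the tokenizer (lowercased runs of letters and digits) is an assumption, since the text does not specify how words are extracted.

```python
import hashlib
import re


def tokenize(text):
    # Assumption: words are lowercased runs of letters/digits.
    # The source text does not prescribe a tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text, n=3):
    # Set of distinct n-grams: every window of n consecutive words.
    words = tokenize(text)
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def hashed_shingles(text, n=3):
    # Code each n-gram as an MD5 digest (32 hexadecimal digits).
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles(text, n)}


def jaccard(a, b):
    # Jaccard coefficient of two sets: |a & b| / |a | b|.
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def exact_fingerprint(text):
    # Whole-document hash: only matches byte-for-byte identical text.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


d1 = "John went to school with his brother"
d2 = "John went to school with his sister"

# The whole-document hashes differ even though only one word changed.
print(exact_fingerprint(d1) == exact_fingerprint(d2))

# The shingle sets share 4 of 6 distinct 3-grams.
print(jaccard(shingles(d1), shingles(d2)))
```

The two sample documents differ in a single word, so the whole-document hash treats them as unrelated, while the shingle sets still overlap in four of six distinct 3-grams. Any similarity threshold for declaring two pages near-duplicates would be a separate tuning choice that this sketch does not make.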