Nutch: A Flexible and Scalable Open-Source Web Search Engine
Rohit Khare
CommerceNet Labs, 510 Logue Avenue, Mountain View, CA 94043, +1 650 714 5529
rohit@commerce.net

Doug Cutting
Nutch Organization, PO Box 5633, Petaluma, CA 94955, +1 707 696 8996
cutting@nutch.org

Kragen Sitaker
CommerceNet Labs, 510 Logue Avenue, Mountain View, CA 94043, +1 415 505 1494
kragen@commerce.net

Adam Rifkin
CommerceNet Labs, 510 Logue Avenue, Mountain View, CA 94043, +1 650 906 4652
adam@commerce.net

ABSTRACT

Nutch is an open-source Web search engine that can be used at global, local, and even personal scale. Its initial design goal was to enable a transparent alternative for global Web search in the public interest; one of its signature features is the ability to “explain” its result rankings. Recent work has emphasized how it can also be used for intranets; by local communities with richer data models, such as the Creative Commons metadata-enabled search for licensed content; and on a personal scale to index a user's files, email, and web-surfing history. We also report on several other research projects built on Nutch. In this paper, we present how the architecture of the Nutch system enables it to be more flexible and scalable than other comparable systems today.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software - World Wide Web; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Search Process; K.6.3 [Management of Computing and Information Systems]: Software Management - Software development

General Terms
Design, Documentation, Experimentation

Keywords
Open Source Software, Web Search, Software Architecture

1. INTRODUCTION
Nutch is a complete open-source Web search engine package that aims to index the World Wide Web as effectively as commercial search services [9]. As a research platform it is also promising at smaller scales, since its flexible architecture enables communities to customize it, and it can even scale down to a personal computer. Its founding goal was to increase the transparency of the Web search process as searching becomes an everyday task. The nonprofit Nutch Organization supports the open-source development effort as it addresses significant technical challenges of operating at the scale of the entire public Web. Nutch server installations have already indexed 100M-page collections while providing state-of-the-art...

Figure 1: A Nutch-powered search engine for Creative Commons

More significantly, Nutch makes it easy to customize the search process for particular kinds of content, such as a recent Creative Commons search engine that can query using intellectual-property licensing constraints (see Figure 1). This trend inspired our own experiment to apply Nutch at personal scale at CommerceNet. We hypothesized that the same architecture used to run public search engines on dedicated hardware ten years ago might be ready to run as a background task on a laptop. We also evaluated how Nutch could adapt to the distinct hypertext structure of a user’s personal archives.

We also suggest that there are intriguing possibilities for blending these scales. In particular, we extended Nutch to index an intranet or extranet as well as all of the content it links to. This sort of ‘neighborhood’ search helps visitors explore an organization’s collective memory. Furthermore, many other academic and industrial research projects are building upon Nutch to explore aspects of Web search, from integration with document summarizers to teaching graduate courses in data mining.

This paper motivates Nutch in the context of other hypertext search research; describes its internal architecture and behavior; evaluates how that architecture supports or conflicts with the requirements of global-, local-, and personal-scale hypertext search problems; reports on how it has been adopted by other researchers and users; and reflects upon future directions.
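
The ‘neighborhood’ search mentioned above (indexing an intranet or extranet together with all of the content it links to) can be illustrated with a minimal sketch. This is not Nutch code and does not use any Nutch API. The function name `neighborhood`, the in-memory `outlinks` link graph, the `home_hosts` set, and the `example` URLs are all hypothetical, and the sketch assumes that "content it links to" means pages exactly one link away from the organization's own site.

```python
from collections import deque
from urllib.parse import urlsplit


def neighborhood(seeds, outlinks, home_hosts):
    """Return the set of URLs a one-hop 'neighborhood' crawl would index.

    seeds      -- starting URLs inside the organization's own site
    outlinks   -- hypothetical link graph: URL -> list of URLs it links to
    home_hosts -- hostnames that count as the intranet or extranet
    """
    def is_home(url):
        return urlsplit(url).hostname in home_hosts

    selected = set()
    queue = deque(u for u in seeds if is_home(u))
    while queue:
        url = queue.popleft()
        if url in selected:
            continue
        selected.add(url)
        for target in outlinks.get(url, []):
            if is_home(target):
                queue.append(target)   # keep crawling inside the site
            else:
                selected.add(target)   # index the linked page, but do not follow its links
    return selected


# Toy link graph (assumed data, for illustration only).
graph = {
    "http://intranet.example/": [
        "http://intranet.example/wiki",
        "http://news.example/story",
    ],
    "http://intranet.example/wiki": ["http://intranet.example/"],
    "http://news.example/story": ["http://far.example/page"],
}
pages = neighborhood(["http://intranet.example/"], graph, {"intranet.example"})
```

With this toy graph, `pages` holds both intranet pages and the externally linked story, while the page two hops away is left out. A real crawler would also need fetching, politeness rules, and URL normalization, none of which are modeled here.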