Lalaa

Pages: 15 (3514 words) Published: July 19, 2011
Large Text Compression Benchmark

Matt Mahoney
Last update: July 21, 2009. history
This page is no longer maintained. The newest version can be found at http://mattmahoney.net/dc/text.html

This competition ranks lossless data compression programs by the compressed size (including the size of the decompression program) of the first 10^9 bytes of the XML text dump of the English version of Wikipedia on Mar. 3, 2006. About the test data.
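As a rough illustration of how such a test file could be cut from a larger dump, here is a minimal sketch that copies the first N bytes of a file. The function name `write_prefix` and the chunk size are assumptions made for this example, not part of the benchmark's own tooling.

```python
def write_prefix(src_path, dst_path, n_bytes, chunk_size=1 << 20):
    """Copy at most the first n_bytes of src_path to dst_path.

    Returns the number of bytes actually written, which is smaller than
    n_bytes only if the source file is shorter than requested.
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while written < n_bytes:
            # Never read past the requested prefix length.
            chunk = src.read(min(chunk_size, n_bytes - written))
            if not chunk:  # source exhausted early
                break
            dst.write(chunk)
            written += len(chunk)
    return written


# Hypothetical usage (file names assumed to be in the current directory):
#   write_prefix("enwiki-20060303-pages-articles.xml", "enwik9", 10**9)
#   write_prefix("enwik9", "enwik8", 10**8)
```

Reading in bounded chunks keeps memory use small even for a 10^9-byte prefix.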

The goal of this benchmark is not to find the best overall compression program, but to encourage research in artificial intelligence and natural language processing (NLP). A fundamental problem in both NLP and text compression is modeling: the ability to distinguish between high probability strings like "recognize speech" and low probability strings like "reckon eyes peach". Rationale.
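One standard way to see why modeling matters for compression is that an ideal coder spends about -log2(p) bits on a string to which the model assigns probability p. The sketch below only illustrates that relationship. The two probabilities are hypothetical values chosen for the example, not outputs of any real model.

```python
import math


def code_length_bits(probability):
    """Ideal code length in bits for an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical model probabilities, chosen only for illustration.
p_likely = 1e-6     # assumed probability of "recognize speech"
p_unlikely = 1e-12  # assumed probability of "reckon eyes peach"

bits_likely = code_length_bits(p_likely)
bits_unlikely = code_length_bits(p_unlikely)

# The string the model considers more probable gets the shorter code.
print(f"likely string:   {bits_likely:.1f} bits")
print(f"unlikely string: {bits_unlikely:.1f} bits")
```

A model that assigns higher probability to the text that actually occurs therefore produces a smaller compressed file.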

This is an open benchmark. Anyone may contribute results. Please read the rules first.

Compression improvements to the first 10^8 bytes are eligible for the Hutter Prize, with 50,000 euros of funding.
Benchmark Results

Compressors are ranked by the compressed size of enwik9 (10^9 bytes) plus the size of a zip archive containing the decompressor. Options are selected for maximum compression at the cost of speed and memory. Other data in the table does not affect rankings. This benchmark is for informational purposes only. There is no prize money for a top ranking. Notes about the table:

Program: The version believed to give the best compression. A | denotes a combination of 2 programs.
Compression options: selected for what I believe gives the best compression.
enwik8: compressed size of first 10^8 bytes of enwik9. This data is used for the Hutter Prize, but has no effect on this ranking.
enwik9: compressed size of first 10^9 bytes of enwiki-20060303-pages-articles.xml.
Decompressor size: size of a zip archive containing the decompression program (source code or executable) and all associated files needed to run it (e.g. dictionaries). A letter following the size has the following meaning:
x = executable size.
s = source code size (if available and smaller).
d = size of a separate decompression program (separate from compression). For self extracting archives (SFX), the size is 0 because the decompressor and compressed data are combined into one file.
For testing, if no zipfile is supplied I create archives using InfoZIP 2.32 -9. (Prior to July 1, 2008 I used 7zip 4.32 -tzip -mx=9).
Total size: total size of compressed enwik9 + decompressor size, ranked smallest to largest.
Comp: compression rate in nanoseconds per byte on the largest file tested (e.g. seconds for enwik9). Speed is approximate and has no effect on ranking. A ~ means "very approximate". Not all tests are done on the same computer. Tests on my computer (Compaq Presario 5440, 2.188 GHz Athlon-64 3500+, 2 GB memory, Windows XP SP2 or occasionally Ubuntu Linux 2.6) are usually process times (user + system) measured with timer 3.01 by Igor Pavlov. (A newer version can be found in 7-Benchmark). This does not include disk I/O time, which can be significant for fast compressors. CPU time may increase because of Cool'n'Quiet, which updates the clock speed every 1/30 second and drops to 994 MHz if waiting on disk. I don't average over multiple runs. An underlined time means that no better compressor is faster.
Decomp: decompression time as above. If blank, decompression was not tested yet and ranking is pending verification that the output is identical. An underlined time means that no better compressor is faster.
Mem: approximate memory used for compression in MB. Decompression uses the same or possibly less. There is some ambiguity whether a megabyte means 10^6 bytes or 2^20 bytes. The approximation is coarse enough that it doesn't matter. I use peak memory as measured with Windows Task Manager during compression (so if you really want to know, 1 MB = 1,024,000...
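The ranking rule and the Comp column described in the notes above can be sketched as follows. This is a minimal illustration, not the script used to build the table. The `Entry` class, the helper names, and all three entries are hypothetical, with sizes and times invented for the example.

```python
from dataclasses import dataclass

ENWIK9_BYTES = 10**9  # size of the uncompressed test file


@dataclass
class Entry:
    program: str
    enwik9_size: int        # compressed size of enwik9, in bytes
    decompressor_size: int  # zipped decompressor size in bytes (0 for SFX)
    comp_seconds: float     # compression time on enwik9, in seconds

    @property
    def total_size(self):
        # Total size = compressed enwik9 + decompressor archive.
        return self.enwik9_size + self.decompressor_size


def ns_per_byte(seconds, n_bytes):
    """Convert a wall or process time into nanoseconds per input byte."""
    return seconds * 1e9 / n_bytes


def rank(entries):
    """Order entries from smallest to largest total size."""
    return sorted(entries, key=lambda e: e.total_size)


# Hypothetical entries; none of these numbers come from the real table.
entries = [
    Entry("prog_b", 199_990_000, 100_000, 40_000.0),
    Entry("prog_c_sfx", 250_000_000, 0, 300.0),
    Entry("prog_a", 200_000_000, 50_000, 1_500.0),
]

for e in rank(entries):
    rate = ns_per_byte(e.comp_seconds, ENWIK9_BYTES)
    print(f"{e.program:12s} total={e.total_size:>11,d}  comp={rate:.0f} ns/byte")
```

In this made-up data, prog_b produces the smaller compressed file but ranks below prog_a because its decompressor is larger. Since enwik9 is 10^9 bytes, the rate in nanoseconds per byte equals the time in seconds, which matches the "seconds for enwik9" remark in the Comp note.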