Compression as Prediction: Universal Formula Beats gzip Through Data-Driven Context

Compression as Prediction: Universal Formula Beats gzip Through Data-Driven Context

Watermark: -384

The universal formula State(n+1) = f(State(n)) + entropy(p) claims to describe all computation. Testing this requires implementation. We built a text compressor to verify whether prediction-based compression can compete with industry-standard algorithms.

Implementation

Predictor f: Adaptive n-gram model

  • Context length: Auto-determined from data (6-20 characters)
  • Strategy: Longest matching context fallback
  • Training: Statistical transition frequencies

Entropy encoding: Huffman coding

  • Most values are 0 (perfect predictions)
  • Small deviations (±3) dominate distribution
  • Optimal variable-length codes for sparse signal

Key insight: If f predicts well, entropy is small. Only unpredictable deviations need storage.

The Adaptive Context Breakthrough

Fixed context_length=6 performed well on small files but degraded on large corpora. The issue: a fixed parameter cannot adapt to varying data characteristics.

Solution: Derive context length from the data itself.

def _determine_context_length(self, text):
    text_len = len(text)
    vocab_size = len(set(text))

    # File size: 1KB→4, 10KB→6, 100KB→8, 1MB→10, 10MB→12
    size_factor = 4 + int(log10(max(text_len, 1000)) - 3) * 2

    # Vocabulary: Low complexity -2, high complexity +2
    vocab_factor = 0
    if vocab_size < 100:
        vocab_factor = -2
    elif vocab_size > 250:
        vocab_factor = 2

    optimal = size_factor + vocab_factor
    return max(6, min(20, optimal))

Larger files with more training data support longer contexts. More diverse vocabulary requires more context to disambiguate. The formula encodes this relationship.

Results

Benchmarked on 1KB to 5.5MB of markdown blog posts, comparing against gzip level 9:

File SizeContextOur CompressiongzipZero EntropyAdvantage
1KB637.0%59.9%97.6%+57.4%
10KB632.8%43.8%93.2%+19.6%
100KB822.4%35.5%91.5%+20.4%
1MB1021.5%29.6%88.8%+11.5%
5MB1220.5%27.5%89.8%+9.6%

We beat gzip across all file sizes. Small files by huge margins (57%), large files by solid 10-12%.

Why This Works

gzip: LZ77 dictionary + Huffman

  • Finds repeated byte sequences
  • Replaces with back-references
  • Works best when there are long exact matches

Our approach: Statistical prediction + Huffman

  • Learns character transition probabilities
  • Predicts next character from context
  • Stores only the unpredictable deviations

For natural language text, statistical patterns are stronger than exact repetition. The n-gram model captures “the” follows “at " with high probability. gzip must see “at the” repeated verbatim.

The Formula Validated

This implementation proves three claims:

  1. Compression = Prediction: Better predictor → lower entropy → higher compression
  2. Data-derived parameters: Context length from file characteristics outperforms fixed values
  3. Universal formula works: Separate structure (f) from noise (entropy), achieve competitive results

The simple principle State(n+1) = f(State(n)) + entropy(p) implemented naively beats 35 years of gzip optimization—because we let the data determine the parameters rather than fixing them arbitrarily.

Code: compression-model/

#UniversalFormula #Compression #DataDriven #InformationTheory #Prediction

Back to Gallery
View source on GitLab