Count-Min Sketch

Count–Min Sketch is a simple space-efficient probabilistic data structure that is used to estimate frequencies of elements in data streams and can address the Heavy hitters problem. It was presented in 2003 [1] by Graham Cormode and Shan Muthukrishnan and published in 2005 [2].

References

[1] Cormode, G., Muthukrishnan, S.
What’s hot and what’s not: Tracking most frequent items dynamically Proceedings of the 22th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, San Diego, California - June 09-11, 2003, pp. 296–306, ACM New York, NY. http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CormodeM-hot.pdf
[2] Cormode, G., Muthukrishnan, S.
An Improved Data Stream Summary: The Count–Min Sketch and its Applications Journal of Algorithms, Vol. 55 (1), pp. 58–75. http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf

This implementation uses MurmurHash3 family of hash functions which yields a 32-bit hash value. Thus, the length of the counters is expected to be smaller or equal to the (2^{32} - 1), since we cannot access elements with indexes above this value.

from pdsa.frequency.count_min_sketch import CountMinSketch

cms = CountMinSketch(5, 2000)
cms.add("hello")
cms.frequency("hello")

Build a sketch

You can build a new sketch either from specifiyng its dimensions (number of counter arrays and their length), or from the expected overestimation diviation and standard error probability.

Build filter from its dimensions

from pdsa.frequency.count_min_sketch import CountMinSketch

cms = CountMinSketch(num_of_counters=5, length_of_counter=2000)

Build filter from the expected errors

In this case the number of counter arrays and their length will be calculated corresponsing to the expected overestimation and the requested error.

from pdsa.frequency.count_min_sketch import CountMinSketch

cms = CountMinSketch.create_from_expected_error(deviation=0.000001, error=0.01)

Note

The deviation is the error ε in answering the paricular query. For example, if we expect 10^7 elements and allow the fixed overestimate of 10, the deviation is 10/10^7 = 10^{-6}.

The error is the standard error δ (0 < error < 1).

Note

The Count–Min Sketch is approximate and probabilistic at the same time, therefore two parameters, the error ε in answering the paricular query and the error probability δ, affect the space and time requirements. In fact, it provides the guarantee that the estimation error for frequencies will not exceed ε x n with probability at least 1 – δ.

Index element into the sketch

cms.add("hello")

Note

It is possible to index into the counter any elements (internally it uses repr() of the python object to calculate hash values for elements that are not integers, strings or bytes.

Estmiate frequency of the element

print(cms.frequency("hello"))

Warning

It is only an approximation of the exact frequency.

Size of the sketch in bytes

print(cms.sizeof())

Length of the sketch

print(len(cms))