Count-Min Sketch¶
Count–Min Sketch is a simple space-efficient probabilistic data structure that is used to estimate frequencies of elements in data streams and can address the Heavy hitters problem. It was presented in 2003 [1] by Graham Cormode and Shan Muthukrishnan and published in 2005 [2].
References¶
- [1] Cormode, G., Muthukrishnan, S.
- What’s hot and what’s not: Tracking most frequent items dynamically Proceedings of the 22th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, San Diego, California - June 09-11, 2003, pp. 296–306, ACM New York, NY. http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CormodeM-hot.pdf
- [2] Cormode, G., Muthukrishnan, S.
- An Improved Data Stream Summary: The Count–Min Sketch and its Applications Journal of Algorithms, Vol. 55 (1), pp. 58–75. http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
This implementation uses MurmurHash3 family of hash functions which yields a 32-bit hash value. Thus, the length of the counters is expected to be smaller or equal to the (2^{32} - 1), since we cannot access elements with indexes above this value.
from pdsa.frequency.count_min_sketch import CountMinSketch
cms = CountMinSketch(5, 2000)
cms.add("hello")
cms.frequency("hello")
Build a sketch¶
You can build a new sketch either from specifiyng its dimensions (number of counter arrays and their length), or from the expected overestimation diviation and standard error probability.
Build filter from its dimensions¶
from pdsa.frequency.count_min_sketch import CountMinSketch
cms = CountMinSketch(num_of_counters=5, length_of_counter=2000)
Build filter from the expected errors¶
In this case the number of counter arrays and their length will be calculated corresponsing to the expected overestimation and the requested error.
from pdsa.frequency.count_min_sketch import CountMinSketch
cms = CountMinSketch.create_from_expected_error(deviation=0.000001, error=0.01)
Note
The deviation is the error ε in answering the paricular query. For example, if we expect 10^7 elements and allow the fixed overestimate of 10, the deviation is 10/10^7 = 10^{-6}.
The error is the standard error δ (0 < error < 1).
Note
The Count–Min Sketch is approximate and probabilistic at the same time, therefore two parameters, the error ε in answering the paricular query and the error probability δ, affect the space and time requirements. In fact, it provides the guarantee that the estimation error for frequencies will not exceed ε x n with probability at least 1 – δ.
Index element into the sketch¶
cms.add("hello")
Note
It is possible to index into the counter any elements (internally it uses repr() of the python object to calculate hash values for elements that are not integers, strings or bytes.
Estmiate frequency of the element¶
print(cms.frequency("hello"))
Warning
It is only an approximation of the exact frequency.
Size of the sketch in bytes¶
print(cms.sizeof())
Length of the sketch¶
print(len(cms))