Count Sketch

Count Sketch is a simple space-efficient probabilistic data structure that is used to estimate frequencies of elements in data streams and can address the Heavy hitters problem. It was proposed by Moses Charikar, Kevin Chen, and Martin Farach-Colton in 2002.

References

[1] Charikar, M., Chen, K., Farach-Colton, M.
Finding Frequent Items in Data Streams Proceedings of the 29th International Colloquium on Automata, Languages and Programming, pp. 693–703, Springer, Heidelberg. https://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf

This implementation uses MurmurHash3 family of hash functions which yields a 32-bit hash value. Thus, the length of the counters is expected to be smaller or equal to the (2^{32} - 1), since we cannot access elements with indexes above this value.

from pdsa.frequency.count_min_sketch import CountSketch

cs = CountSketch(5, 2000)
cs.add("hello")
cs.frequency("hello")

Build a sketch

You can build a new sketch either from specifiyng its dimensions (number of counter arrays and their length), or from the expected overestimation diviation and standard error probability.

Build filter from its dimensions

from pdsa.frequency.count_min_sketch import CountSketch

cs = CountSketch(num_of_counters=5, length_of_counter=2000)

Build filter from the expected errors

In this case the number of counter arrays and their length will be calculated corresponsing to the expected overestimation and the requested error.

from pdsa.frequency.count_min_sketch import CountSketch

cs = CountSketch.create_from_expected_error(deviation=0.000001, error=0.01)

Note

The deviation is the error ε in answering the paricular query. For example, if we expect 10^7 elements and allow the fixed overestimate of 10, the deviation is 10/10^7 = 10^{-6}.

The error is the standard error δ (0 < error < 1).

Note

The Count–Min Sketch is approximate and probabilistic at the same time, therefore two parameters, the error ε in answering the paricular query and the error probability δ, affect the space and time requirements. In fact, it provides the guarantee that the estimation error for frequencies will not exceed ε x n with probability at least 1 – δ.

Index element into the sketch

cs.add("hello")

Note

It is possible to index into the counter any elements (internally it uses repr() of the python object to calculate hash values for elements that are not integers, strings or bytes.

Estmiate frequency of the element

print(cs.frequency("hello"))

Warning

It is only an approximation of the exact frequency.

Size of the sketch in bytes

print(cs.sizeof())

Length of the sketch

print(len(cs))