HyperLogLog

HyperLogLog algorithm was proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007.

It’s a hash-based probabilistic algorithm for counting the number of distinct values in the presence of duplicates.

This implementation uses the classical algorithm with a 32-bit hash function and 4-byte counters.

from pdsa.cardinality.hyperloglog import HyperLogLog

hll = HyperLogLog(10)
hll.add("hello")
print(hll.count())

Build a counter

To build a counter, specify its precision - the number of bits that should be used to randomly choose the counter (stochastic averaging). The rest of the bits of the 32-bit hash value will be used to index into the selected counter.

from pdsa.cardinality.hyperloglog import HyperLogLog

hll = HyperLogLog(precision=10)

Note

Precision has to be an integer in range 4 … 16.

Note

This implementation uses MurmurHash3 family of hash functions which yields a 32-bit hash value.

Index element into the counter

hll.add("hello")

Note

It is possible to index into the counter any elements (internally it uses repr() of the python object to calculate hash values for elements that are not integers, strings or bytes.

Size of the counter in bytes

print(hll.sizeof())

Length of the counter

print(len(hll))

Count of unique elements in the counter

print(hll.count())

Warning

It is only an approximation of the exact cardinality.