HyperLogLog¶
HyperLogLog algorithm was proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007.
It’s a hash-based probabilistic algorithm for counting the number of distinct values in the presence of duplicates.
This implementation uses the classical algorithm with a 32-bit hash function and 4-byte counters.
from pdsa.cardinality.hyperloglog import HyperLogLog
hll = HyperLogLog(10)
hll.add("hello")
print(hll.count())
Build a counter¶
To build a counter, specify its precision - the number of bits that should be used to randomly choose the counter (stochastic averaging). The rest of the bits of the 32-bit hash value will be used to index into the selected counter.
from pdsa.cardinality.hyperloglog import HyperLogLog
hll = HyperLogLog(precision=10)
Note
Precision has to be an integer in range 4 … 16.
Note
This implementation uses MurmurHash3 family of hash functions which yields a 32-bit hash value.
Index element into the counter¶
hll.add("hello")
Note
It is possible to index into the counter any elements (internally it uses repr() of the python object to calculate hash values for elements that are not integers, strings or bytes.
Size of the counter in bytes¶
print(hll.sizeof())
Length of the counter¶
print(len(hll))
Count of unique elements in the counter¶
print(hll.count())
Warning
It is only an approximation of the exact cardinality.