HyperLogLog Algorithm

Why are the HyperLogLog and LogLog algorithms needed?

Let us take a common but challenging use case:

Find the number of unique users who visited or connected to a website at any given point in time, in an efficient way.

This is a count-distinct problem, which can be solved by the HyperLogLog algorithm.

HyperLogLog is an extension of the earlier LogLog algorithm, which itself derived from the 1984 Flajolet–Martin algorithm.

Keep in mind, HyperLogLog is an approximation algorithm. It is able to estimate cardinalities greater than 10^9 with a typical accuracy (standard error) of 2%. [Source - Wikipedia]

LogLog Algorithm

Algorithm Steps

  • Hash the object into a 64-bit representation
  • Count the number of leading 0s in the hash
  • Estimate : 2**max(num_leading_zeros(element) for element in data)
Example - suppose we have observed the following hashed values.

Count the leading 0's (marked between *s):

- *00*11 0101
- *000*1 1011
- *0*110 0011
- *0000 0*100 - max value, with 5 leading 0s
- *00*10 0010

Max value = 5

Estimated unique value = 2**(max_value) = 2**5 = 32
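The steps above can be sketched in Python. This is a minimal illustration, not a production implementation; it assumes a 64-bit hash derived from SHA-1, but any reasonably uniform hash would do:

```python
import hashlib

def leading_zeros_64(x):
    """Count the leading zeros in a 64-bit integer."""
    n = 0
    for i in range(63, -1, -1):
        if x & (1 << i):
            break
        n += 1
    return n

def loglog_estimate(items):
    """Single-estimator LogLog sketch: 2 ** max(leading zeros)."""
    max_z = 0
    for item in items:
        # Take the first 8 bytes of SHA-1 as a 64-bit hash value
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        max_z = max(max_z, leading_zeros_64(h))
    return 2 ** max_z

print(loglog_estimate(["Picard", "Data", "Worf"]))
```

Note that the estimate is always a power of two, which is one reason a single estimator is so coarse.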

Accuracy is poor: a single unlucky hash with a long run of leading 0s can wildly inflate the estimate.

HyperLogLog Algorithm

Algorithm Steps

  • Hash the object into a 64-bit representation
  • Take the first few bits of the hash to choose a bucket
  • Take the remaining bits and count the leading 0's (as in the LogLog algorithm)
  • Estimate : average of the per-bucket maxima (real HyperLogLog uses a bias-corrected harmonic mean rather than a plain average)
Example - with the same hashed values, using the first 4 bits to pick the bucket and counting leading 0's in the remaining 4 bits:
- 0011 0101     =>  bucket(0011) max value = 1
- 0001 1011     =>  bucket(0001) max value = 0
- 0110 0011     =>  bucket(0110) max value = 2
- 0000 0100     =>  bucket(0000) max value = 1
- 0010 0010     =>  bucket(0010) max value = 2

Avg(bucket_max) = (1+0+2+1+2)/5 = 6/5 = 1.2
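The bucketing scheme above can be sketched in Python. This is for illustration only: it averages the bucket maxima as described here, whereas real HyperLogLog applies a harmonic mean and a bias-correction constant, and it again assumes a 64-bit SHA-1-derived hash:

```python
import hashlib

def bucketed_estimate(items, bucket_bits=4):
    m = 1 << bucket_bits      # number of buckets
    maxima = [0] * m          # max leading-zero count seen per bucket
    rest_bits = 64 - bucket_bits
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        bucket = h >> rest_bits               # first bits choose the bucket
        rest = h & ((1 << rest_bits) - 1)     # remaining bits
        z = 0                                 # leading zeros of the rest
        for i in range(rest_bits - 1, -1, -1):
            if rest & (1 << i):
                break
            z += 1
        maxima[bucket] = max(maxima[bucket], z)
    avg = sum(maxima) / m     # the averaging step shown above
    return m * 2 ** avg       # scale by bucket count (no bias correction)

print(bucketed_estimate(["Picard", "Data", "Worf", "Riker"]))
```

Averaging across many buckets is what smooths out the single unlucky hash that breaks plain LogLog.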

Watch the talk from the PyCon 2018 conference here

Downsides of exact counting

  • O(n) space is needed, since every distinct element must be remembered
  • Limited by available memory for large datasets
  • No streaming mode: the full set has to be kept around
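For contrast, the exact-counting baseline with these downsides is just a set that remembers every element it has seen:

```python
def exact_distinct(stream):
    seen = set()   # grows O(n): every distinct element is stored
    for item in stream:
        seen.add(item)
    return len(seen)

print(exact_distinct(["Picard", "Data", "Picard", "Worf"]))  # 3
```

HyperLogLog trades this exactness for a small, fixed-size structure that can be updated one element at a time.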

HyperLogLog in Python

Pre-requisites

  • Install the python-hll package for HyperLogLog from here
  • Install the mmh3 package for MurmurHash from here
### Find unique users on the Starship Enterprise using the main control unit
from python_hll.hll import HLL
import mmh3

set1_user = HLL(13, 5)   # log2m=13, regwidth=5
set2_user = HLL(13, 5)

### Add Users to Set1 HLL
set1_user.add_raw(mmh3.hash('Picard'))
set1_user.add_raw(mmh3.hash('Data'))
set1_user.add_raw(mmh3.hash('Picard'))
set1_user.add_raw(mmh3.hash('Worf'))

cardinality = set1_user.cardinality()
print("Cardinality or Unique users in Set 1 - " , cardinality)


### Add Users to Set2 HLL
set2_user.add_raw(mmh3.hash('Picard'))
set2_user.add_raw(mmh3.hash('Riker'))
set2_user.add_raw(mmh3.hash('Troi'))
set2_user.add_raw(mmh3.hash('La Forge'))

cardinality = set2_user.cardinality()
print("Cardinality or Unique users in Set 2 - " , cardinality)

set1_user.union(set2_user)   # merges set2 into set1 in place

cardinality = set1_user.cardinality()

print("After Union cardinality/distinct count - " , cardinality)

Output

Cardinality or Unique users in Set 1 -  3
Cardinality or Unique users in Set 2 -  4
After Union cardinality/distinct count -  6
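As a sanity check, an exact count with plain Python sets gives the same numbers for this tiny input (HLL estimates only start to deviate at much larger cardinalities):

```python
set1 = {"Picard", "Data", "Picard", "Worf"}     # duplicates collapse
set2 = {"Picard", "Riker", "Troi", "La Forge"}

print(len(set1), len(set2), len(set1 | set2))   # 3 4 6
```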

Interesting facts

  • Redis has HyperLogLog (HLL) commands which are claimed to be 1000x faster than a SQL distinct count. Link is here

  • HLL functionality is implemented in AWS Redshift. More details are here

  • Google Cloud has HLL++ functions in Standard SQL. HLL++ is an improved version of HLL that estimates small and large cardinalities more accurately. More details are here

  • HLL++ from Google - the paper describing this improved version of HLL is here

  • Microsoft Azure has various functions for HLL, which are documented here

  • Facebook implemented HLL in Presto to handle cardinality estimation. More details are in the FB engineering blog here