HyperLogLog Algorithm

Why are the HyperLogLog and LogLog algorithms needed?

Let us take a common but challenging use case:

Find the number of unique users who visited or connected to a website at any given point in time, in an efficient way.

This is a count-distinct problem, which can be solved by the HyperLogLog algorithm.

HyperLogLog is an extension of the earlier LogLog algorithm, which itself derived from the 1984 Flajolet–Martin algorithm.

Keep in mind, HyperLogLog is an approximation algorithm. It is able to estimate cardinalities greater than 10^9 with a typical accuracy (standard error) of 2%. [Source - Wikipedia]

LogLog Algorithm

Algorithm Steps

  • Hash the object into a 64-bit representation
  • Count the number of leading 0s in the hash
  • Estimate : 2**max(num_leading_zeros(element) for element in data)
Example - suppose we have observed the following hashed values.

Count the leading 0's (marked between *s):

- *00*11 0101
- *000*1 1011
- *0*110 0011
- *0000 0*100 - max value, with 5 leading 0s
- *00*10 0010

Max value = 5

Estimated unique value = 2**(max_value) = 2**5 = 32
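The steps above can be sketched in Python. This is a minimal illustration, not a production implementation; it assumes a 64-bit hash derived from SHA-1, but any reasonably uniform hash would do:

```python
import hashlib

def leading_zeros_64(x):
    """Count the leading zeros in a 64-bit integer."""
    n = 0
    for i in range(63, -1, -1):
        if x & (1 << i):
            break
        n += 1
    return n

def loglog_estimate(items):
    """Single-estimator LogLog sketch: 2 ** max(leading zeros)."""
    max_z = 0
    for item in items:
        # Take the first 8 bytes of SHA-1 as a 64-bit hash value
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        max_z = max(max_z, leading_zeros_64(h))
    return 2 ** max_z

print(loglog_estimate(["Picard", "Data", "Worf"]))
```

Note that the estimate is always a power of two, which is one reason a single estimator is so coarse.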

Accuracy is poor: a single unlucky hash with a long run of leading 0s can wildly inflate the estimate.

HyperLogLog Algorithm

Algorithm Steps

  • Hash the object into a 64-bit representation
  • Take the first few bits of the hash to choose a bucket
  • Take the remaining bits and count the leading 0's (as in the LogLog algorithm)
  • Estimate : average of the per-bucket maxima (real HyperLogLog uses a bias-corrected harmonic mean rather than a plain average)
Example - with the same hashed values, using the first 4 bits to pick the bucket and counting leading 0's in the remaining 4 bits:
- 0011 0101     =>  bucket(0011) max value = 1
- 0001 1011     =>  bucket(0001) max value = 0
- 0110 0011     =>  bucket(0110) max value = 2
- 0000 0100     =>  bucket(0000) max value = 1
- 0010 0010     =>  bucket(0010) max value = 2

Avg(bucket_max) = (1+0+2+1+2)/5 = 6/5 = 1.2
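The bucketing scheme above can be sketched in Python. This is for illustration only: it averages the bucket maxima as described here, whereas real HyperLogLog applies a harmonic mean and a bias-correction constant, and it again assumes a 64-bit SHA-1-derived hash:

```python
import hashlib

def bucketed_estimate(items, bucket_bits=4):
    m = 1 << bucket_bits      # number of buckets
    maxima = [0] * m          # max leading-zero count seen per bucket
    rest_bits = 64 - bucket_bits
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        bucket = h >> rest_bits               # first bits choose the bucket
        rest = h & ((1 << rest_bits) - 1)     # remaining bits
        z = 0                                 # leading zeros of the rest
        for i in range(rest_bits - 1, -1, -1):
            if rest & (1 << i):
                break
            z += 1
        maxima[bucket] = max(maxima[bucket], z)
    avg = sum(maxima) / m     # the averaging step shown above
    return m * 2 ** avg       # scale by bucket count (no bias correction)

print(bucketed_estimate(["Picard", "Data", "Worf", "Riker"]))
```

Averaging across many buckets is what smooths out the single unlucky hash that breaks plain LogLog.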

Watch the talk from the PyCon 2018 conference here

Downsides of exact counting

  • O(n) space is needed, since every distinct element must be remembered
  • Limited by available memory for large datasets
  • No streaming mode: the full set has to be kept around
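For contrast, the exact-counting baseline with these downsides is just a set that remembers every element it has seen:

```python
def exact_distinct(stream):
    seen = set()   # grows O(n): every distinct element is stored
    for item in stream:
        seen.add(item)
    return len(seen)

print(exact_distinct(["Picard", "Data", "Picard", "Worf"]))  # 3
```

HyperLogLog trades this exactness for a small, fixed-size structure that can be updated one element at a time.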

HyperLogLog in Python

Pre-requisites

  • Install the python-hll package for HyperLogLog from here
  • Install the mmh3 package for MurmurHash from here
### Find unique users on the Starship Enterprise using the main control unit
from python_hll.hll import HLL
import mmh3

set1_user = HLL(13, 5)   # log2m=13, regwidth=5
set2_user = HLL(13, 5)

### Add Users to Set1 HLL
set1_user.add_raw(mmh3.hash('Picard'))
set1_user.add_raw(mmh3.hash('Data'))
set1_user.add_raw(mmh3.hash('Picard'))
set1_user.add_raw(mmh3.hash('Worf'))

cardinality = set1_user.cardinality()
print("Cardinality or Unique users in Set 1 - " , cardinality)


### Add Users to Set2 HLL
set2_user.add_raw(mmh3.hash('Picard'))
set2_user.add_raw(mmh3.hash('Riker'))
set2_user.add_raw(mmh3.hash('Troi'))
set2_user.add_raw(mmh3.hash('La Forge'))

cardinality = set2_user.cardinality()
print("Cardinality or Unique users in Set 2 - " , cardinality)

set1_user.union(set2_user)   # merges set2 into set1 in place

cardinality = set1_user.cardinality()

print("After Union cardinality/distinct count - " , cardinality)

Output

Cardinality or Unique users in Set 1 -  3
Cardinality or Unique users in Set 2 -  4
After Union cardinality/distinct count -  6
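As a sanity check, an exact count with plain Python sets gives the same numbers for this tiny input (HLL estimates only start to deviate at much larger cardinalities):

```python
set1 = {"Picard", "Data", "Picard", "Worf"}     # duplicates collapse
set2 = {"Picard", "Riker", "Troi", "La Forge"}

print(len(set1), len(set2), len(set1 | set2))   # 3 4 6
```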

Interesting facts

  • Redis has HyperLogLog (HLL) commands which are claimed to be 1000x faster than a SQL distinct count. Link is here

  • HLL functionality is implemented in AWS Redshift. More details are here

  • Google Cloud has HLL++ functions in Standard SQL. HLL++ is an improved version of HLL that estimates small and large cardinalities more accurately. More details are here

  • HLL++ from Google - the paper describing this improved version of HLL is here

  • Microsoft Azure has various functions for HLL, which are documented here

  • Facebook implemented HLL in Presto to handle cardinality estimation. More details are in the FB engineering blog here