Lets us take a common but challenging use-case :
Find out unique users who visited or connected to a website at any given point of time (in efficient way)
This is a Count-distinct problem which can be solved by an Algorithm “HyperLogLog”
HyperLogLog is an extension of the earlier LogLog algorithm which derived from the 1984 Flajolet–Martin algorithm.
Keep in mind , Hyperloglog is an approximation algorithm. It is able to estimate cardinalities of > 109 with a typical accuracy (standard error) of 2%. [Source - Wikipedia]
Algorithm Steps
Example - If we have observed the following hashed values such as
Count the leading 0's (which is quoted with *)
- *00*11 0101
- *000*1 1011
- *0*110 0011
- *0000* 0100 - Max value with leading 0s as 4
- *00*10 0010
Max value = 4
Estimated unique value = 2**(max_value) = 2**4 = 16
Accuracy is not good
Algorithm Steps
Example - If we have observed the following hashed values such as
- 0011 0101 => bucket(0011) max value = 1
- 0001 1011 => bucket(0001) max value = 0
- 0110 0011 => bucket(0110) max value = 2
- 0000 0100 => bucket(0000) max value = 1
- 0010 0010 => bucket(0010) max value = 2
Avg (bucketmax) = (1+0+2+1+2)/5 = 6/5 = 1
Watch video from PyCon 2018 conference here
Downsides
### Find unique users in Startrek Enterprise using Main control unit
from python_hll.hll import HLL
import mmh3
set1_user = HLL(13, 5)
set2_user = HLL(13, 5)
### Add Users to Set1 HLL
set1_user.add_raw(mmh3.hash('Picard'))
set1_user.add_raw(mmh3.hash('Data'))
set1_user.add_raw(mmh3.hash('Picard'))
set1_user.add_raw(mmh3.hash('Worf'))
cardinality = set1_user.cardinality()
print("Cardinality or Unique users in Set 1 - " , cardinality)
### Add Users to Set2 HLL
set2_user.add_raw(mmh3.hash('Picard'))
set2_user.add_raw(mmh3.hash('Riker'))
set2_user.add_raw(mmh3.hash('Troi'))
set2_user.add_raw(mmh3.hash('La Forge'))
cardinality = set2_user.cardinality()
print("Cardinality or Unique users in Set 2 - " , cardinality)
set1_user.union(set2_user)
cardinality = set1_user.cardinality()
print("After Union cardinality/distinct count - " , cardinality)
Cardinality or Unique users in Set 1 - 3
Cardinality or Unique users in Set 2 - 4
After Union cardinality/distinct count - 6
Redis has HyperLogLog (HLL) function which claims to be 1000x faster than SQL distinct count Link is here
HLL functionality is implemented in AWS redshift . More details are here
Google Cloud has HLL++ functions in Standard SQL . HLL++ improved version of HLL by more accurately estimating small and large cardinalities. More details are here
HLL++ from Google - improved version of HLL paper is here
Microsoft Azure has various functions on HLL which is documented here
Facebook implemented HLL in Presto to handle cardinality estimation. More details are in FB enggineering blog here