What’s a Bloom filter
A bloom filter is a data-structure that can be used to check if a set contains an element. It uses way less memory than a conventional set data-structure by sacrificing accuracy.
Example
Say we are building a log-structured merge-tree, we can use a bloom filter to find out if the LSM-tree contains a particular key in O(1) time in most cases, the downside is that sometimes the bloom filter would say that the LSM-tree contains a key, but it actually does not and we would go searching for the value that’s mapped to the key and never actually find it.
It is used in a lot of places.
How it works
A bloom filter is just a bit-set that uses n
deterministic hash functions to add elements to it.
empty bit-set
Adding elements to the set
To add the key bob
to the set, we run the key through each of the n
hash functions and map the hash function output to one of the positions in the bit-set and for each position, we flip the bit to 1
.
bit-set after bob was added to the bloom filter
Finding out if the set contains an element
To find out if the set contains the key bob
, we run the key through each of the n
hash functions again – since the hash functions must be deterministic they will always map to the same position in the bit-set – and check if the bit is set to 1
for each of the bit-set positions we reached after running the key through the hash functions. If every hash function maps to a bit set to 1
, it means the key is in the set.
bob is in the set because every hash function mapped it to a bit set to 1
alice is not in the set because not every hash function mapped to a bit set to 1
False positives
Since collisions can happen some keys will be mapped to bits that were set to 1
when other keys were added to the set. In this case, the bloom filter will say that it contains the key even though it does not.
the bit-set after bob was added to it
since john maps to the same bits as bob and the bits were set to 1 after bob was added to the set, we got a false positive
Removing an element from the set
As it stands, removing an element from the set is not actually possible. If we had a bloom filter that uses 3 hash functions that looks like this after adding alice
and bob
to it:
bloom filter after adding alice and bob to it with 3 hash functions
Note that alice
and bob
hash to the same position in the bit-set for some of the hash functions
bits shared between alice and bob are in white
The naive solution is to remove alice
from the bloom filter by setting the bits mapped by hashi(alice) to 0
:
bits that were flipped to 0 are in white
Now, let’s check if alice
is in the set, for a key to be in the set hashi(key) must map to bits set to 1
hashi(alice) maps to bits set to 0 which means alice is not in the set
alice
is not in the set as expected. Let’s see if bob
is still in the set, it should be since we didn’t remove it.
not every hashi(bob) maps to bits set to 1 which means bob is not in the set, bits set to 0 after removing alice from the set are in white
bob
is not in the set anymore, even though we didn’t remove it. The problem is that since keys may share the positions in the bit-set, we cannot just flip bits back to 0
to remove a key from the set because in doing so we may flip bits that are used by other keys.
Counting bloom filter
Since we cannot flip bits back to 0
to remove a key from the set, we could maintain a counter instead of a single bit. When a key is added to the set, the counter is incremented and when a key is removed from the set, the counter is decremented. If the counter reaches 0, it means no keys are mapped to the position.
Positions that have more than one key mapped to it will have a counter greater than 1
.
positions in white are shared between two or more keys and have a counter greater than 1
removing alice from the set by decrementing the counters mapped by hashi(alice)
After decrementing the counters, not every hashi(alice) maps to a counter greater than 0
which means alice
is not in the set anymore. Unlike the bloom filter that uses only bits, hashi(bob) still maps to counters that are greater than 0
which means bob
is still in the set.
Example in Rust
https://github.com/PoorlyDefinedBehaviour/bloom_filter/
References
Network Applications of Bloom Filters: A Survey - https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - Martin Kleppmann