Bloom filters are probabilistic data structures: they can test for the non-membership of an element with 100% certainty, but they can’t give 100% certainty about the membership of an element.
Bloom filters are popular because of their savings in space.
Internally, they’re an array of bits, which are set by the members being passed through hash functions. When a membership is checked, if those exact bits are not set, we know for a fact that the element was not present, but if all the bits are set, they might be a result of collisions, meaning we can’t know for sure.
Note that because of this design, elements cannot be removed from a bloom filter.
Counting bloom filters
A variation is to store numbers in each positions instead of a bit that’s only set/not set. This allows for elements to be “removed” so that the other elements are not removed accidentally. However, removing elements that had not been entered in the first place (which we can’t test for with 100% certainty) is a possibility, meaning that we introduce the chance for false-negatives too.
Choosing parameters
Optimal number of hash functions, which minimizes the error rate:
Where is the number of hash functions, is the number of bits that the bloom filter is composed of, is the number of elements to be stored.
However, if we know of an acceptable error rate (), this means we can calculate the amount of bits: