Bloom Filter is a data structure that can be used to compactly represent a very large set of items. It was originally developed by Burton H. Bloom. Bloom filters have the interesting property of requiring constant time to add an element to the set or test for membership, regardless of the size of the elements or the number of elements already in the set. It works by storing a bit vector representing the set. Additions are simply setting k bits (obtained by performing some hash functions) to 1 and checks are implemented by performing the same hash functions and returning true if all of the resulting positions are 1. Because the set stored is a proper superset of the set of items added, false positives may occur, though false negatives cannot. The false positive rate can be specified. We can reduce the false positive rate at the cost of space used. In a bloom filter with k hash values, both insertion and membership testing are O(k).

Instead of storing each value, a bloom filter holds simply an array of bits indicating the presence of that key in the filter. A set data structure is used for this purpose.

Set data structure

Set is an abstract data type that can store values without any particular order. A value can be stored by inserting a ‘1’ bit at the corresponding index. For efficiently using the space, we can use a bit array for the implementation of a set, so that instead of storing a byte, a bit can be stored at the bit position corresponding to the value entered.

Hashing

We can store words(strings) in a set by finding some integer values that represents the original strings. A hash function called MD5 hash can generate a fixed length hexadecimal formatted string corresponding to each word. A single integer value can be obtained by taking some bits from the hash value and combining them mathematically.

A Bloom filter uses multiple distinct hash values to store a value. For each word, we can find hash values by extracting different sets of bytes from the hash string. Here we are taking three hash values for each word.

Suppose we want to make a set consisting of three elements. For each element, we can take an array initialized to zero and find the block obtained by taking hash value / 8 corresponding to each hash value obtained. In the block, there will be eight zeros at each bit position. For example, consider the following case:

hash1(“word1”) / 8 => 4

hash2(“word1”) / 8 => 58

hash3(“word1”) / 8 => 121

block 4 block 58 block 121

Now, to add the string “word1” to the set, we just insert a “1” at the corresponding bit positions obtained by finding hash value % 8 in each block.

hash1(“word1”) % 8 => 7

hash2(“word1”) % 8 => 4

hash3(“word1”) % 8 => 6

block 4 block 58 block 121

Likewise all words can be stored. Member checking can be done by taking the same hash values checking the status of the bits. If all the bits corresponding to a word entered are “1”, then the bloom filter makes an assumption that the word is present in the array. There is a possibility for for wrong positives. That means, we may get a result “present” even if the word is not present. This is because all the bit positions obtained for the word are already set by some other words.

An set with three stored words is shown below:In the above figure, bit positions corresponding to the word w are set by words x, y and z. So, w is assumed to be present.This is an example for false positives.

But, we can reduce the rate of false positives by increasing the array size or by increasing the hash values. False negatives will not be there in a Bloom filter.

Here is my code for implementing dictionary spell check using bloom filter: https://bitbucket.org/fathimashadiyya/set-data-structure/changeset/f89e45e3fd81