Thoughts on Programming

November 4, 2011

“Bloom” Filter

Filed under: C — shadiyya @ 10:14 am

Bloom Filter is a data structure that can be used to compactly represent a very large set of items. It was originally developed by Burton H. Bloom. Bloom filters have the interesting property of requiring constant time to add an element to the set or test for membership, regardless of the size of the elements or the number of elements already in the set. It works by storing a bit vector representing the set. Additions are simply setting k bits (obtained by performing some hash functions) to 1 and checks are implemented by performing the same hash functions and returning true if all of the resulting positions are 1. Because the set stored is a proper superset of the set of items added, false positives may occur, though false negatives cannot. The false positive rate can be specified.  We can reduce the false positive rate at the cost of space used. In a bloom filter with k hash values, both insertion and membership testing are O(k).

Instead of storing each value, a bloom filter holds simply an array of bits indicating the presence of that key in the filter. A set data structure is used for this purpose.

Set data structure

Set is an abstract data type that can store values without any particular order. A value can be stored by inserting a ‘1’ bit at the corresponding index. For efficiently using the space, we can use a bit array for the implementation of a set, so that instead of storing a byte, a bit can be stored at the bit position corresponding to the value entered.

Hashing

We can store words(strings) in a set by finding some integer values that represents the original strings. A hash function called MD5 hash can generate a fixed length hexadecimal formatted string corresponding to each word. A single integer value can be obtained by taking some bits from the hash value and combining them mathematically.

A Bloom filter uses multiple distinct hash values to store a value. For each word, we can find hash values by extracting different sets of bytes from the hash string. Here we are taking three hash values for each word.

Suppose we want to make a set consisting of three elements. For each element, we can take an array initialized to zero and find the block  obtained by taking hash value / 8 corresponding to each hash value obtained. In the block, there will be eight zeros at each bit position. For example, consider the following case:

hash1(“word1”) / 8      =>     4

hash2(“word1”) / 8       =>    58

hash3(“word1”) / 8       =>    121


                        block 4                                                    block 58                                                      block 121

Now, to add the string “word1” to the set, we just insert a “1” at the corresponding bit positions obtained by finding hash value % 8 in each block.

hash1(“word1”) % 8     =>    7

hash2(“word1”) % 8     =>    4

hash3(“word1”)  % 8    =>    6

                        block 4                                                    block 58                                                  block 121

Likewise all words can be stored. Member checking can be done by taking the same hash values checking the status of the bits. If all the bits corresponding to a word entered are “1”, then the bloom filter makes an assumption that the word is present in the array. There is a possibility for for wrong positives. That means, we may get a result “present” even if the word is not present. This is because all the bit positions obtained for the word are already set by some other words.

An set with three stored words is shown below:In the above figure, bit positions corresponding to the word w are set by words x, y and z. So, w is assumed to be present.This is an example for false positives.

But, we can reduce the rate of false positives by increasing the array size or by increasing the hash values. False negatives will not be there in a Bloom filter.

Here is my code for implementing dictionary spell check using bloom filter: https://bitbucket.org/fathimashadiyya/set-data-structure/changeset/f89e45e3fd81

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: