Thoughts on Programming

November 4, 2011

“Bloom” Filter

Filed under: C — shadiyya @ 10:14 am

Bloom Filter is a data structure that can be used to compactly represent a very large set of items. It was originally developed by Burton H. Bloom. Bloom filters have the interesting property of requiring constant time to add an element to the set or test for membership, regardless of the size of the elements or the number of elements already in the set. It works by storing a bit vector representing the set. Additions are simply setting k bits (obtained by performing some hash functions) to 1 and checks are implemented by performing the same hash functions and returning true if all of the resulting positions are 1. Because the set stored is a proper superset of the set of items added, false positives may occur, though false negatives cannot. The false positive rate can be specified.  We can reduce the false positive rate at the cost of space used. In a bloom filter with k hash values, both insertion and membership testing are O(k).

Instead of storing each value, a bloom filter holds simply an array of bits indicating the presence of that key in the filter. A set data structure is used for this purpose.

Set data structure

Set is an abstract data type that can store values without any particular order. A value can be stored by inserting a ‘1’ bit at the corresponding index. For efficiently using the space, we can use a bit array for the implementation of a set, so that instead of storing a byte, a bit can be stored at the bit position corresponding to the value entered.

Hashing

We can store words(strings) in a set by finding some integer values that represents the original strings. A hash function called MD5 hash can generate a fixed length hexadecimal formatted string corresponding to each word. A single integer value can be obtained by taking some bits from the hash value and combining them mathematically.

A Bloom filter uses multiple distinct hash values to store a value. For each word, we can find hash values by extracting different sets of bytes from the hash string. Here we are taking three hash values for each word.

Suppose we want to make a set consisting of three elements. For each element, we can take an array initialized to zero and find the block  obtained by taking hash value / 8 corresponding to each hash value obtained. In the block, there will be eight zeros at each bit position. For example, consider the following case:

hash1(“word1”) / 8      =>     4

hash2(“word1”) / 8       =>    58

hash3(“word1”) / 8       =>    121


                        block 4                                                    block 58                                                      block 121

Now, to add the string “word1” to the set, we just insert a “1” at the corresponding bit positions obtained by finding hash value % 8 in each block.

hash1(“word1”) % 8     =>    7

hash2(“word1”) % 8     =>    4

hash3(“word1”)  % 8    =>    6

                        block 4                                                    block 58                                                  block 121

Likewise all words can be stored. Member checking can be done by taking the same hash values checking the status of the bits. If all the bits corresponding to a word entered are “1”, then the bloom filter makes an assumption that the word is present in the array. There is a possibility for for wrong positives. That means, we may get a result “present” even if the word is not present. This is because all the bit positions obtained for the word are already set by some other words.

An set with three stored words is shown below:In the above figure, bit positions corresponding to the word w are set by words x, y and z. So, w is assumed to be present.This is an example for false positives.

But, we can reduce the rate of false positives by increasing the array size or by increasing the hash values. False negatives will not be there in a Bloom filter.

Here is my code for implementing dictionary spell check using bloom filter: https://bitbucket.org/fathimashadiyya/set-data-structure/changeset/f89e45e3fd81

Advertisements

October 30, 2011

Storage Allocator – “malloc”

Filed under: C — shadiyya @ 6:41 am
Tags: , , ,

The dynamic memory allocator malloc will request space from the operating system as needed rather than allocating from a compiled-in fixed-size array. The storage used by a program can be requested by malloc,but also by other sources, so its memory blocks are basically of three types:

  • Not owned by malloc
  • Owned by malloc which are free
  • Owned by malloc which are in useWhen malloc function is called, the free list is scanned until a big-enough block is found.  (big enough for current data to be stored, a first fit algorithm is used meaning that first free block which has a size larger than the one requested is taken)

There are three possibilities:

  • No free big-enough block found, in this case program must a request another piece of memory from operating system 
  • Found a bigger block than the size requested, in this case only the needed space is used and residue storage is returned to the free list
  • Found a block having exactly the needed size, in this case this block is “unlinked” from the free list

When free is called, corresponding “freed” block is added to the free list at proper place. If the block being freed is adjacent to a free block on either side, it is united with it to form a single bigger block, so storage does not become too fragmented. Determining the adjacency is easy because the free list is maintained in order of decreasing address.

Every free block contains three fields as shown in the picture below: a pointer to the next block, the size of the block, and the free space itself.
The control information at the beginning is called the “header” that contains a union of a structure and alignment type. To simplify alignment, all blocks are multiples of the header size.

In malloc, the requested size in characters is rounded up to the proper number of header-sized units; the block that will be allocated contains one more unit, for the header itself, and this is the value recorded in the size field of the header.

For the first call of malloc, a list that contains one block of size zero, and points to itself is created. The free list is then searched for a free block of required size. The search begins at the point where the last block was found. If a too-big block is found, the tail end is returned to the user; in this way the header of the original block needs only to have its size adjusted. In all cases, the pointer returned to the user points to the free space within the block, which begins one unit beyond the header.

             

If no adequate memory block is found, then the function obtains storage from the operating system using the UNIX system call sbrk(). sbrk returns -1 if there is no space.Since asking the system for memory is a comparatively expensive operation, we don’t want to do that on every call to malloc, so the function requests at least NALLOC units which is a larger quantity; this larger block will be chopped up as needed. The extra space is inserted into the free list. It scans the free list, starting at the point where a free block was found the last time, looking for the place to insert the free block. This is either between two existing blocks or at the end of the list. In any case, if the block being freed is adjacent to either neighbor, the adjacent blocks are combined and thus prevents memory becoming too fragmented.

October 13, 2011

Trie data structure and implementation of “Auto complete”

Filed under: C — shadiyya @ 10:04 am
Tags: , ,

A trie is a data structure that stores the information about the contents of each node in the path from the root to the node, rather than the node itself. That means its position in the tree shows what key it is associated with. Searching a word of length m in a trie is having a time complexity of o(m) and are more space efficient when they contain a large number of short keys. Each node contains an array of pointers, one pointer for each character in the alphabet. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.

For example, in the case of alphabetical keys, each node is a structure having two members. One is an integer for storing the value corresponding to a word if a word terminates at that node. Second is a pointer to an array of 26 characters for each of 26 alphabet characters. For example, consider the tree shown in the below figure. In this tree, each path between the root and a child represents a key and the end of each word is denoted by storing some data in the node, so this trie contains the words “to”, “tea”, “ted”, “ten”, “i”, “in”, “inn”, “A”. Inserting a new key traverses the trie until it either reaches the end of the string, or it discovers that the trie does not contain the string. In the first case, all that is necessary is to mark the end node as being the end of a key and assigning a data into this node. In the second case, the trie must be extended with new nodes that represent the remainder of the string. 

Starting from the root node, we can check if a word exists in the trie easily by following pointers corresponding to the letters in the target word.

Implementation of Auto Complete

Auto-complete functionality is used widely over the internet and mobile apps. A lot of websites and apps try to complete our input as soon as we start typing. 

The autocompletion I have implemented returns all words starting with a given prefix prefix. If a null string is entered as the input, then it will print all the words stored in the trie. Searching uses a well known algorithm called DFS (Depth First Search). It searches the depth first and then only the side links.

Here is the complete code for autocomplete using Trie: https://bitbucket.org/fathimashadiyya/trie

July 13, 2011

C bitwise operators – Precedence and associativity

Filed under: C — shadiyya @ 1:48 am

Bitwise means working at the level of single bits. We need to know about the precedences and associativities of different bitwise operators for efficient and error prone coding. An operator’s precedence is meaningful only if other operators with higher or lower precedence are present. Expressions with higher-precedence operators are evaluated first. The grouping of operands can be forced by using parentheses. Where several operators appear together, they have equal precedence and are evaluated according to their associativity. The following table list the C language operators in order of precedence and show the direction of associativity for each operator.

Operator

Description

Associativity

()
[]

++  —
Parentheses (function call)
Brackets (array subscript)
Postfix increment/decrement

left-to-right

++  —
+  –
!  ~
(type)
*
&
sizeof
 
Prefix increment/decrement
Unary plus/minus
Logical negation/bitwise complement
Cast (change type)
Dereference
Address
Determine size in bytes
right-to-left
*  /  % Multiplication/division/modulus left-to-right
+  – Addition/subtraction left-to-right
<<  >> Bitwise shift left, Bitwise shift right left-to-right
<  <=
>  >=
Relational less than/less than or equal to
Relational greater than/greater than or equal to
left-to-right
==  != Relational is equal to/is not equal to left-to-right
& Bitwise AND left-to-right
^ Bitwise exclusive OR left-to-right
| Bitwise inclusive OR left-to-right
&& Logical AND left-to-right
|| Logical OR left-to-right
?: Ternary conditional right-to-left
=
+=  -=
*=  /=
%=  &=
^=  |=
<<=  >>=
Assignment
Addition/subtraction assignment
Multiplication/division assignment
Modulus/bitwise AND assignment
Bitwise exclusive/inclusive OR assignment
Bitwise shift left/right assignment
right-to-left

,

Comma (separate expressions) left-to-right

Create a free website or blog at WordPress.com.

%d bloggers like this: