Thoughts on Programming

November 4, 2011

“Bloom” Filter

Filed under: C — shadiyya @ 10:14 am

Bloom Filter is a data structure that can be used to compactly represent a very large set of items. It was originally developed by Burton H. Bloom. Bloom filters have the interesting property of requiring constant time to add an element to the set or test for membership, regardless of the size of the elements or the number of elements already in the set. It works by storing a bit vector representing the set. Additions are simply setting k bits (obtained by performing some hash functions) to 1 and checks are implemented by performing the same hash functions and returning true if all of the resulting positions are 1. Because the set stored is a proper superset of the set of items added, false positives may occur, though false negatives cannot. The false positive rate can be specified.  We can reduce the false positive rate at the cost of space used. In a bloom filter with k hash values, both insertion and membership testing are O(k).

Instead of storing each value, a bloom filter holds simply an array of bits indicating the presence of that key in the filter. A set data structure is used for this purpose.

Set data structure

Set is an abstract data type that can store values without any particular order. A value can be stored by inserting a ‘1’ bit at the corresponding index. For efficiently using the space, we can use a bit array for the implementation of a set, so that instead of storing a byte, a bit can be stored at the bit position corresponding to the value entered.

Hashing

We can store words(strings) in a set by finding some integer values that represents the original strings. A hash function called MD5 hash can generate a fixed length hexadecimal formatted string corresponding to each word. A single integer value can be obtained by taking some bits from the hash value and combining them mathematically.

A Bloom filter uses multiple distinct hash values to store a value. For each word, we can find hash values by extracting different sets of bytes from the hash string. Here we are taking three hash values for each word.

Suppose we want to make a set consisting of three elements. For each element, we can take an array initialized to zero and find the block  obtained by taking hash value / 8 corresponding to each hash value obtained. In the block, there will be eight zeros at each bit position. For example, consider the following case:

hash1(“word1”) / 8      =>     4

hash2(“word1”) / 8       =>    58

hash3(“word1”) / 8       =>    121


                        block 4                                                    block 58                                                      block 121

Now, to add the string “word1” to the set, we just insert a “1” at the corresponding bit positions obtained by finding hash value % 8 in each block.

hash1(“word1”) % 8     =>    7

hash2(“word1”) % 8     =>    4

hash3(“word1”)  % 8    =>    6

                        block 4                                                    block 58                                                  block 121

Likewise all words can be stored. Member checking can be done by taking the same hash values checking the status of the bits. If all the bits corresponding to a word entered are “1”, then the bloom filter makes an assumption that the word is present in the array. There is a possibility for for wrong positives. That means, we may get a result “present” even if the word is not present. This is because all the bit positions obtained for the word are already set by some other words.

An set with three stored words is shown below:In the above figure, bit positions corresponding to the word w are set by words x, y and z. So, w is assumed to be present.This is an example for false positives.

But, we can reduce the rate of false positives by increasing the array size or by increasing the hash values. False negatives will not be there in a Bloom filter.

Here is my code for implementing dictionary spell check using bloom filter: https://bitbucket.org/fathimashadiyya/set-data-structure/changeset/f89e45e3fd81

October 30, 2011

Storage Allocator – “malloc”

Filed under: C — shadiyya @ 6:41 am
Tags: , , ,

The dynamic memory allocator malloc will request space from the operating system as needed rather than allocating from a compiled-in fixed-size array. The storage used by a program can be requested by malloc,but also by other sources, so its memory blocks are basically of three types:

  • Not owned by malloc
  • Owned by malloc which are free
  • Owned by malloc which are in useWhen malloc function is called, the free list is scanned until a big-enough block is found.  (big enough for current data to be stored, a first fit algorithm is used meaning that first free block which has a size larger than the one requested is taken)

There are three possibilities:

  • No free big-enough block found, in this case program must a request another piece of memory from operating system 
  • Found a bigger block than the size requested, in this case only the needed space is used and residue storage is returned to the free list
  • Found a block having exactly the needed size, in this case this block is “unlinked” from the free list

When free is called, corresponding “freed” block is added to the free list at proper place. If the block being freed is adjacent to a free block on either side, it is united with it to form a single bigger block, so storage does not become too fragmented. Determining the adjacency is easy because the free list is maintained in order of decreasing address.

Every free block contains three fields as shown in the picture below: a pointer to the next block, the size of the block, and the free space itself.
The control information at the beginning is called the “header” that contains a union of a structure and alignment type. To simplify alignment, all blocks are multiples of the header size.

In malloc, the requested size in characters is rounded up to the proper number of header-sized units; the block that will be allocated contains one more unit, for the header itself, and this is the value recorded in the size field of the header.

For the first call of malloc, a list that contains one block of size zero, and points to itself is created. The free list is then searched for a free block of required size. The search begins at the point where the last block was found. If a too-big block is found, the tail end is returned to the user; in this way the header of the original block needs only to have its size adjusted. In all cases, the pointer returned to the user points to the free space within the block, which begins one unit beyond the header.

             

If no adequate memory block is found, then the function obtains storage from the operating system using the UNIX system call sbrk(). sbrk returns -1 if there is no space.Since asking the system for memory is a comparatively expensive operation, we don’t want to do that on every call to malloc, so the function requests at least NALLOC units which is a larger quantity; this larger block will be chopped up as needed. The extra space is inserted into the free list. It scans the free list, starting at the point where a free block was found the last time, looking for the place to insert the free block. This is either between two existing blocks or at the end of the list. In any case, if the block being freed is adjacent to either neighbor, the adjacent blocks are combined and thus prevents memory becoming too fragmented.

October 13, 2011

Trie data structure and implementation of “Auto complete”

Filed under: C — shadiyya @ 10:04 am
Tags: , ,

A trie is a data structure that stores the information about the contents of each node in the path from the root to the node, rather than the node itself. That means its position in the tree shows what key it is associated with. Searching a word of length m in a trie is having a time complexity of o(m) and are more space efficient when they contain a large number of short keys. Each node contains an array of pointers, one pointer for each character in the alphabet. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.

For example, in the case of alphabetical keys, each node is a structure having two members. One is an integer for storing the value corresponding to a word if a word terminates at that node. Second is a pointer to an array of 26 characters for each of 26 alphabet characters. For example, consider the tree shown in the below figure. In this tree, each path between the root and a child represents a key and the end of each word is denoted by storing some data in the node, so this trie contains the words “to”, “tea”, “ted”, “ten”, “i”, “in”, “inn”, “A”. Inserting a new key traverses the trie until it either reaches the end of the string, or it discovers that the trie does not contain the string. In the first case, all that is necessary is to mark the end node as being the end of a key and assigning a data into this node. In the second case, the trie must be extended with new nodes that represent the remainder of the string. 

Starting from the root node, we can check if a word exists in the trie easily by following pointers corresponding to the letters in the target word.

Implementation of Auto Complete

Auto-complete functionality is used widely over the internet and mobile apps. A lot of websites and apps try to complete our input as soon as we start typing. 

The autocompletion I have implemented returns all words starting with a given prefix prefix. If a null string is entered as the input, then it will print all the words stored in the trie. Searching uses a well known algorithm called DFS (Depth First Search). It searches the depth first and then only the side links.

Here is the complete code for autocomplete using Trie: https://bitbucket.org/fathimashadiyya/trie

September 10, 2011

The traceroute utility

Filed under: General — shadiyya @ 6:38 am
Tags: , ,

Traceroute is a very useful debugging tool, which can be used to find a number of useful things about host, client, and routers that data passes through on its way from the source to the destination. However, Traceroute is most commonly used for network troubleshooting by finding out the path taken by an IP datagram from the source to the destination. Providing a list of routers traversed, it allows the user to identify the path taken to reach a particular destination in the network. This can help identify routing problems or firewalls that may be blocking access to a site..

How It Works

Traceroute begins by sending a UDP datagram from the originating host to the destination host with the TTL initially set to a value of 1. When the datagram arrives at the first router, that router decrements the TTL by one which results in a TTL of zero. The datagram is now expired so the router sends an ICMP Time Exceeded message to the originating host. The source address of this ICMP message is the address of the router and the destination address is the address of the originating host. This response to the originating host now gives the IP address of the first router.

The second UDP datagram sent by the originating host is exactly the same at the first but with a TTL value of 2. This time the first router receives
the datagram and decrements the TTL by 1. Since the TTL is still greater than 0, this first router now routes the datagram to the next router. The second router in turn decrements the TTL by 1 resulting in a TTL of 0. This triggers an ICMP Time Exceeded message from the second router. The source address of the ICMP message is the address of the second router and the destination address is the address of the originating host. Now the originating host knows the address of the second router also. This process continues until a maximum TTL is reached (typically around 30) or the destination host is finally reached.

            

Here is the python code:

August 8, 2011

Unix “make” command

Filed under: Linux — shadiyya @ 6:27 pm

This post is a short introduction to the Unix make utility.The make utility is a software engineering tool for managing and maintaining computer programs. Make provides most help when the program consists of many component files. As the number of files in the program increases so to does the compile time, complexity of compilation command and the likelihood of human error when entering command lines, i.e. typos and missing file names.

By creating a descriptor file containing dependency rules, macros and suffix rules, you can instruct make to automatically rebuild your program whenever one of the program’s component files is modified. makeis smart enough to only recompile the files that were affected by changes thus saving compile time.

If you run:

make

this program will look for a file named ‘makefile’ in your directory, and then execute it.
If you have several makefiles, then you can execute them with the command:

make -f MyMakefile

Simple Example

This is an example descriptor file to build an executable file called prog1. It requires the source files file1.c, file2.c, and file3.c. An include file, mydefs.h, is required by files file1.c and file2.c. If you wanted to compile this file from the command line using c the command would be

% cc -o prog1 file1.c file2.c file3.c

This command line is rather long to be entered many times as a program is developed and is prone to typing errors. A descriptor file could run the same command better by using the simple command

% make prog1

or if prog1is the first target defined in the descriptor file

% make

This first example descriptor file is much longer than necessary but is useful for describing what is going on.

prog1 : file1.o file2.o file3.o

cc -o prog1 file1.o file2.o file3.o

file1.o : file1.c mydefs.h

cc -c file1.c

file2.o : file2.c mydefs.h

cc -c file2.c

file3.o : file3.c

cc -c file3.c

clean : rm file1.o file2.o file3.o

Let’s go through the example to see what make does by executing with the command make prog1and assuming the program has never been compiled.
makefinds the target prog1 and sees that it depends on the object files file1.o file2.o file3.o

makenext looks to see if any of the three object files are listed as targets. They are so make looks at each target to see what it depends on. make sees that file1.o depends on the files file1.cc and mydefs.h

Now make looks to see if either of these files are listed as targets and since they aren’t it executes the commands given in file1.o’s rule and compiles file1.cc to get the object file.

make looks at the targets file2.o and file3.o and compiles these object files in a similar fashion.

make now has all the object files required to make prog1 and does so by executing the commands in its rule.

Unix commands: sed & awk

Filed under: Linux — shadiyya @ 5:20 pm

sed

It is a stream editor is used to perform basic text transformations on an input stream. The sed utility is a stream editor that reads one or more text files, makes editing changes according to a script of editing commands, and writes the results to standard output.

The pattern to match is typically included between a pair of slashes // and quoted. For example, to print lines containing the string “1024”, we may use:

cat filename | sed -n ‘/1024/p’

Here, sed filters the output from the cat command. The option “-n” tells sed to block all the incoming lines but those explicitly matching the expression.
Here is another example, this time for deleting selected lines:

cat filename | sed ‘/.*o$/d’ > new_file

In this example, lines ending with an “o” will be deleted. It uses a regular expression for matching any string followed by an “o” and the end of the line. The output (i.e., all lines but those ending with “o”) is directed to new_file.

To search and replace, use the sed ‘s’ action, which comes in front of two expressions:

sed ‘s/string_old/string_new/’ filename > newfile

awk

The awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that matches with the specified patterns and then perform associated actions. Awk breaks each line of input passed to it into fields. By default, a field is a string of consecutive characters delimited by whitespace, though there are options for changing this.

Awk runs through a text file by reading and processing one record at time. Its commands are written with the intention that they act repetitively on each record as it is read in to awk. A record that has been read by awk is broken into separate fields, and actions can be performed on the separate fields as well as on the whole record.

As awk processes each line of the input file, each word on the line is assigned to variables named $1 (the first word), $2 (the second word), and so on.

Let’s start with a file, words.txt, that contains these lines:

nail hammer wood
pedal foot car
clown pie circus

Now we’ll use the print function in awk to plug the words from each input line into a template, like this:

awk ‘{print “Hit the”,$1,”with your”,$2}’ words.txt


Then the output will be:

Hit the nail with your hammer
Hit the pedal with your foot
Hit the clown with your pie

Shell Scripting

Filed under: Linux — shadiyya @ 10:28 am

Shell scripts combine lengthy and repetitive sequences of commands and generalize a sequence of operations on one set of data.

Following steps are required to write shell script:

(1) Use any editor like vi to write shell script.

(2) After writing shell script set execute permission for your script as follows

syntax:
chmod permission your-script-name

Example:
$ chmod 755 your-script-name

This will set read write execute(7) permission for owner, for group and other – permission is read and execute only(5).

(3) Execute your script as

syntax:
bash your-script-name
sh your-script-name
./your-script-name

Following is an example of a very simple shell script

echo You are
who am i
echo The directory you are in is
pwd
echo The date is
date
echo calendar
cal

It will print the user’s name, current directory, date & time and calendar of this month.

Now, we can color this output on the screen, by the following code word :

echo -e “33[31m This is in Red”

Now, to get a red colored calendar, we can use the following code:

clear
echo -e “33[31m calendar”
cal

There are many more interesting features in shells. Some detailed informations are available here:

http://www.freeos.com/guides/lsst/misc.htm

July 24, 2011

Piping and Redirection in Linux

Filed under: Linux — shadiyya @ 6:07 pm

With pipes and redirection, we can “chain” multiple programs to create extremely powerful commands. Most programs on the command-line accept different modes of operation. Many can read and write to files for data, and most can accept standard input or output. This means that we can direct the output of one program as input to another program. We can then take the output of the second program and redirect it as input to yet another program, or redirect the output to a file.

A pipe is used to to send the output of one program to another program for further processing. With pipes, we can combine multiple commands together in a powerful way. The general syntax for pipes is:

$ command_1 | command_2 [| command_3 . . . ]

This chain can continue for any number of commands or programs.

The following is one of the most common ways of using pipes:

$ ls | less

As another example, suppose we have a text file called applist.txt and we want to find out what lines of it contain the word “desktop”. It’s easy with pipes. We list the contents of the file applist.txt and send the results to grep, which then filters all lines containing the desired word “desktop”, and displays those lines on your screen:

$ cat applist.txt | grep desktop

This direct connection between programs allows them to operate simultaneously and permits data to be transferred between them continuously rather than having to pass it through temporary text files or through the display screen and having to wait for one program to be completed before the next program begins.

Redirection is the transferring of standard output to some other destination, such as another program, a file or a printer, instead of the display monitor (which is its default destination). Standard output, sometimes abbreviated stdout, is the destination of the output from command line programs in Unix-like operating systems.

For example, to redirect the output to a file, we can use the > character like this:

$ ls > dir_listing.txt

The above redirects the output of the ls command to a file called dir_listing.txt. Because the output is redirected to a file, we won’t see any results of ls on your screen.

Each time we repeat the above command, the file dir_listing.txt is overwritten with the results of the ls command. If we want to append the new results to the file instead of rewriting it, we can use >> instead.

$ ls >> dir_listing.txt

Each time we repeat the above command, the new output of ls is added at the end of the dir_listing.txt file instead of overwriting the file.

The following adds the contents of File1 at the end of File2:

$ cat File1 >> File2

Towers of Hanoi

Filed under: Python — shadiyya @ 3:23 pm

In 1883 the French mathematician Edouard Lucas, the number theory scholar who became famous for the analysis on the Fibonacci sequence invented the Towers of Hanoi puzzle.

He was inspired by an Indian legend that tells of  the temple of Brahma in Benares. The legend tells that in the Temple with a dome which marked the center of the world, a brass plate is located on which 64  golden disk threaded on a needle are placed. Monks move one at a time disks on another needle according to the rule that a large disk can never be put on a smaller disk. When all disks are moved, recreating a Tower on another needle, then comes the end of the world.

It would take at least 264 -1 moves to complete the task. Assuming one move per second, and no wrong moves, it would take almost 585,000,000,000 years to complete. So we are safe even if the legend is true.
The objective of the puzzle is to move the entire stack to another rod, obeying the following rules:

  • Only one disk may be moved at a time.
  • Each move consists of taking the upper disk from one of the pegs and sliding it onto another rod.
  • No disk may be placed on top of a smaller disk.

The number of moves necessary to move a tower with n disks can be calculated as: 2n – 1

There is a general rule for moving a tower of size n (n > 1) from the position A to the position C using the extra position B:

  • move a tower of n – 1 discs from A to B. Disk n is left alone at A
  • Move disk n from A to C
  • move the tower of n – 1 discs from B to C, i.e. this tower will be put on top of Disk n

The algorithm, which we have just defined, is a recursive algorithm to move a tower of size n. It actually is the one, which we will use in our Python implementation to solve the Towers of Hanoi. Step 2 is a simple move of a disk. But to accomplish the steps 1 and 3, we apply the same algorithm again on a tower of n-1. The calculation will finish with a finite number of steps, because very time the recursion will be started with a tower which is 1 smaller than the one in the calling function. So finally we will end up with a tower of size n = 1, i.e. a simple move.

The following Python program contains a recursive function “hanoi”, which implements a recusive solution for Towers of Hanoi. Here A is renamed as “source”, B as “helper” and C as “target”.

The Pygame implementation of each step of the algorithm is given here:


The complete code is available at:

https://bitbucket.org/fathimashadiyya/exercises/changeset/9cb1a9f7d823

July 18, 2011

Flood Fill

Filed under: Python — shadiyya @ 11:10 am

Flood filling means that a group of connected pixels with close values is filled with, or is set to, a certain value. The flood filling process starts with a specified point (“seed”) and continues until it cannot find any new pixels to fill due to a large difference in pixel values. Here I am using pygame with a recursive method to fill the area. For every pixel filled, the functions analyze four neighbor pixels. So, this kind of connectivity is called 4-connectivity.

The pygame module should be imported first. The code is using large number of recursions. So we have to set the recursion limit to some large value. For that, we should import the sys module also.

We have to define a function to check whether the user pressed the quit button.  Then we have to take a point inside the polygon to specify the “seed” and fill it with the specified colour. The recursive function “floodfill” should call its four adjacent points and fill them until it reaches the boundary.

The function to implement this is:

The program should display the colour filling process until the user press the quit button. Here are some screen shots during flood filling process:

                                                                                       


The whole program is available here:  

https://bitbucket.org/fathimashadiyya/exercises/changeset/00b89ce23e8b

Next Page »

Blog at WordPress.com.

%d bloggers like this: