Apr
14
2009
0

A Fast String Scanning Algorithm with Small Startup Overhead

Abstract

A fast substring search algorithm is presented here. This algorithm is faster than a simple left to right scanning algorithm. Startup overhead is taken into consideration to avoid the use of a large lookup table. This algorithm is easily implementable in Java.

Introduction

The Java class ‘String’ provides substring searching capabilities.

The method String.indexOf(pattern_string) will search for the location of the first occurrence of pattern string in the source string. For example, “Who is Tomas”.indexOf(”Tomas”) would return 7.

Typical implementation algorithm would search for the character ‘T’ from left to right. Once ‘T’ is found, then ‘o’, ‘m’, ‘a’, and ’s’ would be matched. If a mismatch happens, then we have to resume searching for ‘T’ until we hit the end of the search string.

This is pseudo code implementing the simple algorithm.

outer_loop:
for indx = 0 to length(source) - length(pattern)
{
  // match first pattern character
  if pattern[0] == source[indx] then
  {
    for indx2 = 1 to length(pattern) - 1
    {
      // match subsequent pattern characters
      if pattern[indx2] != source[indx + indx2] then
        continue outer_loop;
    }
    return indx; // everything matched
  }
}
return did_not_match;

We can do much better than the above algorithm.

Looking For Impossible Characters

The key to a faster scanning algorithm is to look at the last character of the pattern string first. If the last pattern character matches, then we continue to search for pattern characters from left to right, until the second last pattern character. (remember the last pattern character already matches)

"Who is Tomas".indexOf("Tomas")

    v
Who i........         <- source
Tomas                 <- pattern
    ^

In the above example, we found a character ‘i’ where a character ’s’ is required to make a match. Because ‘i’ does not occur anywhere in the string ‘Tomas’, we can advance the search position 5 characters! While we are matching the last character of the pattern string, if we found an “impossible character” in the search string, we are safe to advance the search pointer by the length of the pattern string.

How do we program this test? We can use a 64 bit long integer for a quick test.

// initialization
long cache = 0L;
for i = 0 to length(pattern) - 1 {
  cache |= 1L << (pattern[i] & 63);
}

...

// test for last pattern character; pattern string is "Tomas"
if (source[i] != 's') { // last character did not match
  if ((cache & (1L << (source[i] & 63))) == 0L) { // got an impossible char
    i += 5; // skip 5 characters
  }
}

Being able to skip 5 characters instead of 1 character at a time gives us a speed boost.

If the implementation language does not have good support for 64 bit integer arithmetic, a 32 bit integer cache can be used in a pinch.

Search Order Discussion

After the last character of the pattern string matches with the source string character under the scan pointer, we scan the rest of the pattern string from left to right. The question is ‘why’? Some other choices include right to left, and random order. The answer has to do with worst case performance estimate.

If we find an “impossible character” in the subsequent pattern string matches, we can still skip more than one characters. For example, if we find the fifth character is “impossible”, then we can increase search pointer by 5.

stre stress     <- source
stress          <- pattern

234561          <- match order

    v
stre stress     <- source
stress          <- pattern
    ^

By scanning from left to right, we ensure the amount we skip is roughly proportional to the number of characters we scanned. If we scan from right to left, then it is possible for a long scan to yield a small skip amount.

MD2 Optimization

Optimization of md2 offset is used when the last character of the pattern string matches, but some other pattern characters do not match.

Take the example of “James Tomas”.indexOf(”Tomas”)

In this example, the 5th character of the search string is ’s’, so it matches the last character of the pattern string. We then match ‘J’ against ‘T’, which does not match. At this point, we can again skip 5 characters! Why?

This is because in the string “Tomas”, ’s’ only occurs as the last character and no where else. We know the 5th character is ’s’ but the front part did not match. If we advance the search position one character, then the ’s’ character would be eventually matched against the 4th character of the pattern string (’a'). But we already said ’s’ does not occur in the pattern string from the first position up to the 4th position.

    v
James....       <- source
 Tomas          <- pattern
    ^

So before we do anything we can already predict any shift less than 5 would produce an ultimate mismatch.

The safe shift distance for subsequent mismatches is called ‘md2′. One can calculate md2 by looking for ’s’ starting from the 2nd last pattern character to the first pattern character

For the pattern string “Tomss”, md2 would be 1.
For the pattern string “Tosas”, md2 would be 2.
For the pattern string “Tsmas”, md2 would be 3.
For the pattern string “somas”, md2 would be 4.

Pseudo Code for Full Implementation

// initialize the cache
long cache = 0L;
for i = 0 to length(pattern) - 1
{
  cache |= 1L << (pattern[i] & 63);
}

// calculate md2
last_pattern = pattern[ length(pattern) - 1 ];
md2 = length(pattern)
for i = 0 to length(pattern) - 1
{
  if last_pattern == pattern[i] then
  {
    md2 = length(pattern) - i;
  }
}

scan_loop:
for i = length(pattern) - 1 to length(source) - 1
{
  if last_pattern == source[i] then
  {
    // last character matched, match the rest

    for i2 = 0 to length(pattern) - 2
    {
      if pattern[i2] != source[i - length(pattern) + 1] then
      {
        // see if character under cursor is "impossible".
        if ((cache & (1L << (source[i] & 63))) == 0L)
          altskip = i2 + 1;
        else
          altskip = 1;
        // skip the maximum of md2 and impossible calculation
        i += max(md2, altskip);
        continue scan_loop;
      }
    }
    return i - length(pattern) + 1; // everything matched
  }
  if ((cache & (1L << (source[i] & 63))) == 0L)
    i += length(pattern); // got an impossible character
  else
    i += 1;
}
return not_found;

Performance Estimate

The new string searching mechanism can potentially speed up string searching by ‘p’ times, where p is the length of the pattern string. The down side would be that there is pre-processing cost of 2p instructions, calculating the impossible bitmap, and md2. When we discovered two instances of impossible characters in the search string, then we should be even in processing time.

This performance profile compares favorably with traditional fast substring scanning algorithms, such as Boyer Moore. Since Java’s character is 16 bit wide, a naive implementation of Boyer Moore would require 65536 machine instructions to initialize its lookup table. This is obviously not acceptable when we are scanning any string shorter than 70000 characters long.

Written by admin in: Algorithms | Tags: ,
Apr
07
2009
0

Search Algorithms - Binary Search

Binary Search Algorithms

The Binary search requires an ordered list.  You can use Heap Sort Algorithm to sort the list first.

Iterative Algorithm
int find (const apvector &list, double target)
// pre: list is sorted in ascending order
//post: ITERATIVE binary search will return the index of the target element, else -1

{
	int mid;
	int first = 0;
	int last = list.length( ) -1;
	while ( first <= last )
	{
		mid = (first + last) / 2;
		if ( list[mid] == target )
			return mid;
		if ( list[mid] > target )
			last = mid - 1;
		else	first = mid + 1;
	}
	return -1;
}

Recursive Algorithm
int find (const apvector &list, double target, int first, int last)
// pre: list is sorted in ascending order
//post: RECURSIVE binary search will return the index of the target element, else -1

{
	if (first > last)
		return -1;

int mid = (first + last) / 2;
	if (list[mid] == target)
		return mid;

if (list[mid] < target)
		return find(list, target, mid+1, last);

return find(list, target, first, mid-1);

}
Written by admin in: Algorithms | Tags: ,
Apr
07
2009
1

Sorting Algorithms - Heap Sort

Heap Sort


Algorithm Analysis

The heap sort is the slowest of the O(n log n) sorting algorithms, but unlike the merge and quick sorts it doesn’t require massive recursion or multiple arrays to work. This makes it the most attractive option for very large data sets of millions of items. You can compare performance of All Sorting Algorithms to see which sort algorithm is the best.

For mobile map application, we suggest you choose heap sort because this algorithm does not require much memory like merge and quick sorts.

The heap sort works as it name suggests - it begins by building a heap out of the data set, and then removing the largest item and placing it at the end of the sorted array. After removing the largest item, it reconstructs the heap and removes the largest remaining item and places it in the next open position from the end of the sorted array. This is repeated until there are no items left in the heap and the sorted array is full. Elementary implementations require two arrays - one to hold the heap and the other to hold the sorted elements.

To do an in-place sort and save the space the second array would require, the algorithm below “cheats” by using the same array to store both the heap and the sorted array. Whenever an item is removed from the heap, it frees up a space at the end of the array that the removed item can be placed in.

Pros: In-place and non-recursive, making it a good choice for extremely large data sets.
Cons: Slower than the merge and quick sorts.

Empirical Analysis

Heap Sort Efficiency
Heap Sort Efficiency Graph

As mentioned above, the heap sort is slower than the merge and quick sorts but doesn’t use multiple arrays or massive recursion like they do. This makes it a good choice for really large sets, but most modern computers have enough memory and processing power to handle the faster sorts unless over a million items are being sorted.

The “million item rule” is just a rule of thumb for common applications - high-end servers and workstations can probably safely handle sorting tens of millions of items with the quick or merge sorts. But if you’re working on a common user-level application, there’s always going to be some yahoo who tries to run it on junk machine older than the programmer who wrote it, so better safe than sorry.

Source Code
Below is the basic heap sort algorithm. The siftDown() function builds and reconstructs the heap.

void heapSort(int numbers[], int array_size)
{
  int i, temp;

  for (i = (array_size / 2)-1; i >= 0; i--)
    siftDown(numbers, i, array_size);

  for (i = array_size-1; i >= 1; i--)
  {
    temp = numbers[0];
    numbers[0] = numbers[i];
    numbers[i] = temp;
    siftDown(numbers, 0, i-1);
  }
}

void siftDown(int numbers[], int root, int bottom)
{
  int done, maxChild, temp;

  done = 0;
  while ((root*2 <= bottom) && (!done))
  {
    if (root*2 == bottom)
      maxChild = root * 2;
    else if (numbers[root * 2] > numbers[root * 2 + 1])
      maxChild = root * 2;
    else
      maxChild = root * 2 + 1;

    if (numbers[root] < numbers[maxChild])
    {
      temp = numbers[root];
      numbers[root] = numbers[maxChild];
      numbers[maxChild] = temp;
      root = maxChild;
    }
    else
      done = 1;
  }
}
Written by admin in: Algorithms | Tags: ,

OpenMobileMap would not be possible without the generous support from our sponsors:

www.andgps.com provides gps and routing source code templates for android developers

www.j2megps.com provides gps and routing source code templates for j2me developers


OpenMobileMap | The Free Mobile Map Resources