CS3851 -- Lab 1: Searching

Outcomes Addressed

apply asymptotic time complexity analysis to choose among competing algorithms.

Individual Assignment

Students should work individually on this assignment.

Overview

The purpose of this lab assignment is to review concepts from previous courses. This is accomplished through implementing two substring search algorithms and comparing their performance.

Time Logging Tip

If you found Process Dashboard cumbersome, you are encouraged to take a look at SlimTimer.com. SlimTimer is an easy to use web application that allows you to "clock-in" and "clock-out" of various tasks. You may find it helpful in keeping track of the time you spend on this course.

Problem Statement

You should write a program that will search through text files for a given sequence of words and display all of the line numbers (of the file) on which the sequence of words was found. You will implement two substring search algorithms (described below) and compare the run times of these two algorithms and the String.indexOf() method on a variety of test cases.

You should test your program on the files in searches.zip (taken from Project Gutenburg). Upon unzipping you will find a file called searches. This file contains a set of substrings (each on a separate line) that should be searched for in the other files that are part of the archive. You should compare all of the searches on all of these files using your implementation of the two algorithms described below and the String.indexOf() method.

One required feature that is not addressed in the algorithm pseudocode below is that your program should assume that all whitespace is created equal. This will allow your program to find My clothes are on fire in all three of the following examples:

Example 1                  Example 2                  Example 3
=========                  =========                  =========
Bill exclaimed,            Bill exclaimed, "My        Bill exclaimed, "My
"My clothes are on fire"   clothes are on fire"       clothes are
as he stopped, dropped,    as he stopped, dropped,
and rolled.                and rolled.                on    fire" as he
                                                      stopped, dropped, ...

Expected output:

Example 1                  Example 2                  Example 3
=========                  =========                  =========
2                          1-2                        1-4

The program should find all of the matches in the file (including multiple matches on the same line).

Be sure to spend adequate time understanding the problem statement and developing a strategy for how to design a successful solution before jumping into coding. You'll want to find a way to ignore whitespace while searching but still maintain information about which line each character in the file was originally positioned. You may find that the easiest way to deal with this is to pre-process the file being searched on in such a way that all whitespace has been normalized while still retaining enough information (perhaps elsewhere) so that line number information is not lost.

Algorithms

Both algorithms described below have the same parameters passed in and returned. file is the file (with n characters in it) and target is the substring (with m characters in it) to be found in the file. The value returned is either -1 or the index of file where target begins. Please note that pseudocode is given to help explain the algorithm. Your code may not look like the pseudocode. For example, your algorithms may not take in a string representing the whole file. Also, your program should display the line number(s) on which the substring was found in the file.

Algorithm 1

The first algorithm is a simple brute-force technique. Here is pseudocode for the first algorithm:

Algorithm1(file, target) 
  for i=0 to n-m step 1
    j=0
    while j<m and file[i+j]==target[j]
      j = j+1
    if j==m
      return i
  return -1

Algorithm 2

The second algorithm attempts to improve the search time by making use of the following techniques: 1) when testing a potential placement for a match of target in file, start comparing at the end of target and move towards the front. 2) when a test for a match fails (say at file[i]) remember the character c = file[i] which caused the failure. If c is not contained in target we can avoid checking any potential placement of target that overlaps with file[i]. Otherwise, we can shift target until an occurrence of c in target is aligned with file[i]. Here is pseudocode for the second algorithm:

Algorithm2(file, target)
  i=m-1
  j=m-1
  do
    if target[j]==file[i]
      if j==0
        return i
      else
        i=i-1
        j=j-1
    else
      i=i + m - min(j, 1+last(file[i]))
      j=m-1
  until i>n-1
  return -1

Where last(c) is the index of the right most occurrence of c in target (unless c is not in target in which case, last(c) we set it to -1).

Enhancements (optional)

Students interested in exploring this topic further may wish to consider this algorithm as well.

Other students may be interested in implementing a command line interface so that the program functions much like the UNIX grep command. Be sure to implement the requirements of the lab (returning the line number on which matches are found as a command line option).

Intermediate Deliverable (due one hour prior to week 2 lab)

You must be prepared to demonstrate benchmarking the String.indexOf() method and your implementation of Algorithm 1.

Lab report (due 11:00pm, Monday of week 3)

Here (right-click and save) is a template file to use as a starting point for this report. You'll also need the cctHW.xsl (right-click and save preserving the file name) file in the same folder.

Your report should be in your own words and include:

Pattern focus: what patterns did you consider using when designing your solution, what did you use/not use, why?
An intuitive explanation of how Algorithm 2 works
Discussion: convince me that you were thinking when you did this assignment.
Activity Log
Complete program results for the test cases provided
Documented source code

As with any report you submit, correct spelling and grammar are required. In addition, your report should be submitted electronically following the Electronic Submission Guidelines. (You may wish to consult the sample report before submitting your report.) Be sure to keep copies of all your files, in case something gets lost.

If you have any questions, consult the instructor.

Acknowledgment

This laboratory was developed by Dr. Chris Taylor.

Office: L-343 •
Phone: 277-7339 •
npǝ˙ǝosɯ@ɹolʎɐʇ