CS286 -- Lab 1: Searching



->Courses
->CS286
-->Homework
->Lab 1
-->Lab 2
-->Lab 3
-->Lab 4
->Electronic Submission
->Old Exams
->C++ Examples
->MSVC++ Info
->STL Info
->MFC/GUI Info
->Software
->Tentative Schedule
->Support Forum
->Course Policies

[Courses]
[Rich][Home][Rich]
[PHome]

Spring Quarter 2003

Objectives Addressed

  • Be able to apply asymptotic time complexity analysis to choose among competing algorithms.
  • Be familiar with engineering applications for many of the fundamental algorithms discussed in the course.

Purpose

The purpose of this lab assignment is to review concepts from previous courses. This is accomplished through implementing two substring search algorithms and comparing their performance.

Assignment

You should write a C++ program that will search through text files for a given sequence of words and display all of the line numbers (of the file) on which the sequence of words was found. You will implement two substring search algorithms (described below) and compare the run times of these two algorithms and the C++ string class' find algorithm on a variety of test cases.

You should test your program on the files in searches.zip (taken from Project Gutenburg). Upon unzipping you will find a file called searches. This file contains a set of substrings (each on a separate line) that should be searched for in the other files that are part of the archive. You should compare all of the searches on all of these files using your implementation of the two algorithms described below and the find member function from the C++ string class.

Algorithms

Both algorithms described below have the same parameters passed in and returned. file is the file (with n characters in it) and substr is the substring (with m characters in it) to be found in the file. The value returned is either -1 or the index of file where substr begins. Please note that pseudocode is given to help explain the algorithm. Your code should not look like the pseudocode. For example, your algorithms should not take in a string representing the whole file (pass one line at a time instead). Also, your program should display the line number(s) on which the substring was found in the file.

Algorithm 1

The first algorithm is a simple brute-force technique. Here is pseudocode for the first algorithm:

Algorithm1(file, substr) {
  for i=0 to n-m step 1 do {
    j=0
    while j<m and file[i+j]==substr[j] do {
      j = j+1
    }
    if j==m then {
      return i
    }
  }
  return -1
}

Algorithm 2

The second algorithm attempts to improve the search time by making use of the following techniques: 1) when testing a potential placement for a match of substr in file, start comparing at the end of substr and move towards the front. 2) when a test for a match fails (say at file[i]) remember the character c = file[i] which caused the failure. If c is not contained in substr we can avoid checking any potential placement of substr that overlaps with file[i]. Otherwise, we can shift substr until an occurrence of c in substr is aligned with file[i]. Here is pseudocode for the second algorithm:

Algorithm2(file, substr) {
  i=m-1
  j=m-1
  do {
    if substr[j]==file[i] then {
      if j==0 then {
        return i
      } else {
        i=i-1
        j=j-1
      }
    } else {
      i=i + m - min(j, 1+last(file[i]))
      j=m-1
    }
  } until i>n-1
  return -1
}

Where last(c) is the index of the right most occurrence of c in substr (unless c is not in substr in which case, last(c) we set it to -1).

Support Classes

I have written some simple classes that you may find useful:

  • A Benchmark class for timing your algorithms
  • A fastReader class (fastReader.h and fastReader.cpp) that reads lines of text files significantly faster than the getline function defined in the string class.

Enhancements (optional)

Students interested in exploring this topic further may wish to consider this algorithm as well.

Another feature that could be added to your program is to assume that all whitespace is created equal. This will allow your program to find My clothes are on fire in all three of the following examples:

Example 1                  Example 2                  Example 3
=========                  =========                  =========
Bill exclaimed,            Bill exclaimed, "My        Bill exclaimed, "My
"My clothes are on fire"   clothes are on fire"       clothes are
as he stopped, dropped,    as he stopped, dropped,
and rolled.                and rolled.                on    fire" as he
                                                      stopped, dropped, ...

Other students may be interested in implementing a command line interface so that the program functions much like the UNIX grep command. Be sure to implement the requirements of the lab (returning the line number on which matches are found as a command line option).

Lab report (due 11:00pm, the day prior to week 3 lab)

The lab report should be self-contained. That is, it should be possible for someone to understand what you did and why without seeing anything other than your report. Your report should include:

  • Educational Objective
  • Problem Statement
  • Discussion: convince me that you were thinking when you did this assignment.
  • Conclusions (what you learned, suggestions of how the lab could have been better, things you would have done differently, etc.)
  • Program results
  • PSP spreadsheet
  • Documented source code (using your program to format the code)

Template Lab Report

As with any report you submit, correct spelling and grammar are required. In addition, your report should be submitted electronically following the Electronic Submission Guidelines. (You may wish to consult the sample report before submitting your report.) Be sure to keep copies of all your files, in case something gets lost. It may be wise to keep a diskette backup as well.

If you have any questions, consult the instructor.

© 2002-2003 Dr. Christopher C. Taylor Office: CC-27C Phone: 277-7339 Last Updated: Mon Mar 24 03:05:26 2003
I am responsible for all content posted on these pages; MSOE is welcome to share these opinions but may not want to.