Hands-on Day

Quiz: Range Searching


Files

The files that you need for today's activity may be downloaded as a single zipfile or as individual files. The file search.py is the only one that you should be editing, but the others will be required for executing the various tests.


Overview

Today's activity will look at how the basic binary search algorithm can be leveraged for more advanced searches. In particular, we will examine the goal of performing a range search. Rather than looking for a single value, we might be interested in finding all values that lie within a particular range of values (e.g., find all zipcodes from 63100 through 63199).

We will focus purely on determining the number of matching results and will examine two algorithms (the first is the required part of this quiz; the second will be considered extra credit). We will give a brief prose description of each algorithm in this section, and further advice on implementing the algorithms below.

The first algorithm uses binary search to find any matching element of the list (assuming one can be found). Once that has been found, a pair of while loops can be used to scan leftward and rightward in the list, so long as you continue to find qualifying elements. For example, consider the following sequence of characters as our data list

B D E E G I J J L M M N P P Q Q S Y Y Z
and a goal of counting the number of values within the range K through P inclusive. If we were to binary search, perhaps for the ending value P, we might end up at the leftmost of the two P values in our data list (that at index 12 above). A while loop could be used to walk leftward finding the relevant N, M, M, L and another while loop might walk rightward to find the second relevant P. In the end, there were six valid elements found (L M M N P P).

A drawback to the first algorithm is that although the binary search is efficient, the linear scanning to find the full extent of the range could be quite time-consuming when there are many matching values in the search range. The second algorithm we propose is more efficient. With care, a binary search can be implemented to locate the index of the rightmost element that is strictly less than a given value or, by symmetry, to locate the index of the leftmost element that is strictly greater than a given value. Those two searches can therefore be used to find the full extent of the desired range by locating the greatest element that is too small for the range and the least element that is too large for the range. For example, in the above example, we'd hope to quickly identify the last element strictly less than K (which is the second J at index 7) and the first element strictly greater than P (which is the first Q at index 14). With that knowledge, we can immediately determine the number of elements that match the desired range: everything strictly between those two indices, or 14 - 7 - 1 = 6 elements.
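
The boundary arithmetic for the K through P example can be checked independently with Python's standard bisect module. This is only a cross-check under the assumption that the data list is stored as single-character strings; it is not the recursive implementation that the quiz asks for, and the variable names are ours.

```python
from bisect import bisect_left, bisect_right

# The sample data list from the overview, one character per element.
data = list("BDEEGIJJLMMNPPQQSYYZ")

# bisect_left gives the first index whose element is >= 'K', so one
# position earlier is the rightmost element strictly less than 'K'.
last_below = bisect_left(data, "K") - 1

# bisect_right gives the first index whose element is > 'P'.
first_above = bisect_right(data, "P")

# Everything strictly between the two boundary indices is in range.
count = first_above - last_below - 1
print(last_below, first_above, count)  # 7 14 6
```

A cross-check like this can be handy for confirming the answers of your own functions on small inputs.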


First Algorithm (Required)

Your first algorithm should be implemented through the public countRangeSlow(data,lowTarget,highTarget) function. This takes a previously sorted list of values, as well as a low and high target for the desired range. NOTE WELL: we have chosen to make this an inclusive range, in that you are to count the number of elements satisfying lowTarget <= element <= highTarget.

In solving this task, you are welcome to define a new utility function that performs the original binary search to attempt to locate at least one matching element. To help you, we have included a version of the original recursive binary search implementation from the book, but that code was written to look for a single target value and to return a True/False result. You will need to adapt that code to locate any matching element and to return the index at which your search completed (rather than True/False). Presumably when a match was found, the returned index should indicate the location of such a value. You can decide how you'd like to report a failed search (perhaps a -1 or perhaps as the index where you ended up, with the presumption that the surrounding code will detect that this was not a valid match).

If you successfully adapt the search, you can call that from countRangeSlow and after the binary search completes, you can continue with the loops to find the full extent of matching elements.
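
One possible shape for this approach is sketched below. The helper name find_any, its half-open start/stop convention, and the choice of -1 to signal a failed search are all assumptions made for illustration; the provided search.py may use different conventions, and the debug() call from the model code is omitted here.

```python
def find_any(data, low, high, start, stop):
    """Return the index of some element with low <= element <= high
    among data[start:stop], or -1 if there is none.
    (Hypothetical helper; stop is treated as exclusive.)"""
    if start >= stop:
        return -1                      # empty slice: failed search
    mid = (start + stop) // 2
    if data[mid] < low:                # too small: look to the right
        return find_any(data, low, high, mid + 1, stop)
    elif data[mid] > high:             # too large: look to the left
        return find_any(data, low, high, start, mid)
    else:                              # within the inclusive range
        return mid

def countRangeSlow(data, lowTarget, highTarget):
    """Count elements with lowTarget <= element <= highTarget."""
    hit = find_any(data, lowTarget, highTarget, 0, len(data))
    if hit == -1:
        return 0
    left = hit
    while left > 0 and data[left - 1] >= lowTarget:
        left -= 1                      # extend the range leftward
    right = hit
    while right < len(data) - 1 and data[right + 1] <= highTarget:
        right += 1                     # extend the range rightward
    return right - left + 1

sample = list("BDEEGIJJLMMNPPQQSYYZ")
print(countRangeSlow(sample, "K", "P"))  # 6
```

Note that both while loops check the index bound before touching the list, which matters when the range includes the first or last element.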


Second Algorithm (Extra Credit)

The second algorithm (should you attempt it) should be implemented using the function countRangeFast. For this algorithm, you will presumably want to have two new recursive search algorithms. For example, you might have a recursive function with signature search_lt(data, target, start, stop) that is responsible for finding the greatest index of an element that is strictly less than the target and a corresponding function search_gt(data, target, start, stop) that is responsible for finding the smallest index of an element that is strictly greater than the target. We recommend that search_lt return index -1 when there are no elements strictly less than the target and that search_gt return len(data) when there are no elements strictly greater than the target.

If you can get these two new functions working properly, then countRangeFast can determine the index of the greatest element strictly less than lowTarget and the index of the least element strictly greater than highTarget. Comparing those indices should allow you to report the number of matching elements within the list (without needing any loop through the full range of matches).
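
A minimal sketch of the two searches follows. The signatures come from the description above, but the interpretation of start and stop as a half-open interval (stop exclusive) is our assumption and may not match the model code in search.py. The clamp to zero for an inverted range (lowTarget greater than highTarget) is likewise our own choice.

```python
def search_lt(data, target, start, stop):
    """Greatest index in [start, stop) whose element is < target,
    or start - 1 if there is none (so -1 for a full-list search)."""
    if start >= stop:
        return start - 1
    mid = (start + stop) // 2
    if data[mid] < target:             # mid qualifies; try further right
        return search_lt(data, target, mid + 1, stop)
    else:                              # mid is too large; look left
        return search_lt(data, target, start, mid)

def search_gt(data, target, start, stop):
    """Smallest index in [start, stop) whose element is > target,
    or stop if there is none (so len(data) for a full-list search)."""
    if start >= stop:
        return stop
    mid = (start + stop) // 2
    if data[mid] > target:             # mid qualifies; try further left
        return search_gt(data, target, start, mid)
    else:                              # mid is too small; look right
        return search_gt(data, target, mid + 1, stop)

def countRangeFast(data, lowTarget, highTarget):
    """Count elements with lowTarget <= element <= highTarget."""
    below = search_lt(data, lowTarget, 0, len(data))
    above = search_gt(data, highTarget, 0, len(data))
    return max(0, above - below - 1)   # clamp guards an inverted range

sample = list("BDEEGIJJLMMNPPQQSYYZ")
print(countRangeFast(sample, "K", "P"))  # 6
```

With the recommended sentinel values of -1 and len(data), the same subtraction works even when the range reaches the first or last element of the list, with no special cases.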


Testing

We offer two forms of testing: some small-scale tests that will allow you to manually trace your algorithm's behavior, and a larger-scale automated test that will both report Success/Failure based on the accuracy of your return values and report the overall time spent performing the tests (which should allow us to see the benefit of the extra credit algorithm on large inputs).

manualTest.py
The manual test generates a sample list of 20 characters, echoes that list to the screen in a format that shows the indices of the elements, and then allows you to manually indicate a low/high pair for a range search. This will allow you to test special cases such as ranges with no results or ranges that include the first and/or last elements of the list.

Furthermore, we have placed a function call debug() at the beginning of our model recursive search algorithm. Should you keep that call as the first command of your own recursive implementation, we have outfitted the code so that we will echo the series of recursive calls made while you are testing your code.

automatedTest.py
The automated test is meant to do larger scale tests (and we intentionally turn off the verbose debugging code when running the automated tests).

The basic experiment that we perform is the following. For increasing values of N, we generate a sorted list of N values, each an integer chosen uniformly at random from [0,999]. Then we exhaustively perform a search for each individual possible value, and range searches for pairs of consecutive possible values. (So we are performing about 2000 different range searches for each such test.) Our driver oversees your work and will report SUCCESS if your code succeeds on all tests and FAILURE if there was at least one test on which your answer was wrong; we do not try to detail what test(s) you failed.
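
If you would like to run a similar check yourself on small inputs, an experiment of this kind might be approximated as below. This is not the code of automatedTest.py; the function names, the fixed seed, and the reading of "pairs of consecutive possible values" as the ranges (v, v+1) are our assumptions. The brute-force reference takes linear time per query, so it is only suitable for small N.

```python
import random
from bisect import bisect_left, bisect_right

def brute_force_count(data, low, high):
    """Reference answer: test every element against the inclusive range."""
    return sum(1 for x in data if low <= x <= high)

def run_experiment(count_fn, n, seed=0):
    """Return True if count_fn agrees with the reference on every query."""
    rng = random.Random(seed)
    data = sorted(rng.randint(0, 999) for _ in range(n))
    queries = [(v, v) for v in range(1000)]        # single values
    queries += [(v, v + 1) for v in range(999)]    # consecutive pairs
    return all(count_fn(data, lo, hi) == brute_force_count(data, lo, hi)
               for lo, hi in queries)

def candidate(data, low, high):
    """Stand-in for a function under test, built on the bisect module."""
    return bisect_right(data, high) - bisect_left(data, low)

print("SUCCESS" if run_experiment(candidate, 200) else "FAILURE")
```

To check your own work, you could pass countRangeSlow or countRangeFast in place of candidate.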

For the sake of timing, we try this experiment for N=10 using both the countRangeSlow and countRangeFast functions. (If you don't try the extra credit, don't worry about it; it will simply report failure.) We then repeat the same thing for N=100, N=1000, and so on, eventually testing N of ten million. The larger tests should allow you to see the improved efficiency of the second algorithm (in particular because there are many more duplicate values when we have millions of entries, all of which are numbers from 0 to 999). For a sense of scale, on our machine, we find the slower algorithm completing the biggest test in about 5 seconds and the faster algorithm finishing the same test in 0.03 seconds.


Submission

One member of your partnership should submit the single file, search.py, to the quiz19 folder of their git repository. Please make sure that both contributing names appear at the top of that file.


Michael Goldwasser
Last modified: Friday, 07 December 2018