Optimization: implementing a binary search
Searching in a list occurs in a lot of programs. The lazy approach - the one I will normally use in almost all cases - is to just use the default IndexOf method (I am mostly using C# nowadays). However, this gets quickly inefficient if the number of searches starts climbing - IndexOf is a linear search and the inefficiency adds up.
If the list is sorted, a binary search will be orders of magnitude faster. (I cannot emphasize enough how important this is - binary search will return incorrect results if the list is not sorted.) The description of the algorithm is as follows:
- start with the whole list; let
landrbe the first and last indices in the interval to search (defaulting to0andlist.Count - 1) - let
m= the middle of thel..rinterval - if the item at index
mis equal to the value we're searching, success: returnm - otherwise, if the item is lower than the value, we need to focus on the upper half of the list; set
l=m + 1 - if the item is higher than the value we need the lower half; set
r=m - 1 - if
l>r, we couldn't find the value; return-1. Ifl<=r, go back to step 2
The implementation is pretty straightforward:
private static int BinarySearch1(List<string> list, string item)
{
var l = 0;
var r = list.Count - 1;
while (l <= r)
{
var m = l + (r - l) / 2;
var comparison = string.CompareOrdinal(list[m], item);
if (comparison == 0)
return m;
else if (comparison < 0)
l = m + 1;
else
r = m - 1;
}
return -1;
}
Of course, .NET already has this implemented in the core libraries, so I'm going to add it for comparison:
private static int BinarySearch2(List<string> list, string item)
{
return list.BinarySearch(item);
}
I'll also add the IndexOf method:
private static int SearchUsingIndexOf(List<string> list, string item)
{
return list.IndexOf(item);
}
and a manual implementation of the linear search:
private static int LinearSearch1(List<string> list, string item)
{
for (var i = 0; i < list.Count; i++)
if (list[i] == item)
return i;
return -1;
}
Incidentally, even this linear search can be optimized; the loop has to make two comparisons (i < list.Count and list[i] == item). I can remove one of them by adding to the list the value I'm searching for. That way, I only need the second test. (Of course, I also need to remove the extra value from the list before returning.) Here is the optimized linear search method:
private static int LinearSearch2(List<string> list, string item)
{
list.Add(item);
var i = 0;
while (list[i] != item)
i++;
list.RemoveAt(list.Count - 1);
return i == list.Count ? -1 : i;
}
In order to test these methods, I created a list of one hundred thousand strings, containing the numbers from 0 to 99999 left-padded with zeroes (that is, the strings "000000" to "099999"). Then I created a new list with ten thousand random values from 0 to 199999, also with leading zeroes, so that one average half the searches would pass and half would fail. I then ran the search function ten thousand times and timed the total running time. Here are the remaining fields and methods in the Program class (this is a console application):
private const int MAX = 100000;
private const int RUNS = 10000;
private static void Main(string[] args)
{
var list = Enumerable
.Range(0, MAX)
.Select(it => it.ToString("000000"))
.ToList();
var rng = new Random();
var randomValues = Enumerable
.Range(0, RUNS)
.Select(it => rng.Next(MAX * 2).ToString("000000"))
.ToList();
var dict = new Dictionary<string, Func<List<string>, string, int>>
{
{ "SearchUsingIndexOf", SearchUsingIndexOf },
{ "LinearSearch1", LinearSearch1 },
{ "LinearSearch2", LinearSearch2 },
{ "BinarySearch1", BinarySearch1 },
{ "BinarySearch2", BinarySearch2 },
};
var timings = dict
.Select(it => new { name = it.Key, ticks = Run(i => it.Value(list, randomValues[i]), RUNS) })
.Select(timing => string.Format("{0,-20} -> {1,12} msec", timing.name, timing.ticks));
foreach (var timing in timings)
Console.WriteLine(timing);
}
private static long Run(Action<int> action, int count)
{
var timer = new Stopwatch();
timer.Start();
for (var i = 0; i < count; i++)
action(i);
timer.Stop();
return timer.ElapsedMilliseconds;
}
This is the result of running the program on my computer in Release mode:
SearchUsingIndexOf -> 6768 msec LinearSearch1 -> 5161 msec LinearSearch2 -> 4954 msec BinarySearch1 -> 2 msec BinarySearch2 -> 13 msec Press any key to continue . . .
I was surprised to see that the optimization of the linear search doesn't reduce the time more (the improvement is around 4%). However, the difference between the linear and binary searches is colossal: my binary search is more than three thousand times faster than the IndexOf method, and even the existing implementation is over 500 times faster. (.NET does a lot more checks than I do - as you can see, the linear searches are also faster than the predefined IndexOf method.)
In any case, here you go: the next time you have to do a lot of searches through lists, remember the binary search. It won't pay off unless you have at least hundreds of searches, preferably over a thousand; otherwise, just stick to IndexOf.
Edit: The code is available on GitHub.
Comments