Optimization: implementing a binary search
Searching in a list occurs in a lot of programs. The lazy approach - the one I will normally use in almost all cases - is to just use the default IndexOf
method (I am mostly using C# nowadays). However, this gets quickly inefficient if the number of searches starts climbing - IndexOf
is a linear search and the inefficiency adds up.
If the list is sorted, a binary search will be orders of magnitude faster. (I cannot emphasize enough how important this is - binary search will return incorrect results if the list is not sorted.) The description of the algorithm is as follows:
- start with the whole list; let
l
andr
be the first and last indices in the interval to search (defaulting to0
andlist.Count - 1
) - let
m
= the middle of thel
..r
interval - if the item at index
m
is equal to the value we're searching, success: returnm
- otherwise, if the item is lower than the value, we need to focus on the upper half of the list; set
l
=m + 1
- if the item is higher than the value we need the lower half; set
r
=m - 1
- if
l
>r
, we couldn't find the value; return-1
. Ifl
<=r
, go back to step 2
The implementation is pretty straightforward:
private static int BinarySearch1(List<string> list, string item) { var l = 0; var r = list.Count - 1; while (l <= r) { var m = l + (r - l) / 2; var comparison = string.CompareOrdinal(list[m], item); if (comparison == 0) return m; else if (comparison < 0) l = m + 1; else r = m - 1; } return -1; }
Of course, .NET already has this implemented in the core libraries, so I'm going to add it for comparison:
private static int BinarySearch2(List<string> list, string item) { return list.BinarySearch(item); }
I'll also add the IndexOf
method:
private static int SearchUsingIndexOf(List<string> list, string item) { return list.IndexOf(item); }
and a manual implementation of the linear search:
private static int LinearSearch1(List<string> list, string item) { for (var i = 0; i < list.Count; i++) if (list[i] == item) return i; return -1; }
Incidentally, even this linear search can be optimized; the loop has to make two comparisons (i < list.Count
and list[i] == item
). I can remove one of them by adding to the list the value I'm searching for. That way, I only need the second test. (Of course, I also need to remove the extra value from the list before returning.) Here is the optimized linear search method:
private static int LinearSearch2(List<string> list, string item) { list.Add(item); var i = 0; while (list[i] != item) i++; list.RemoveAt(list.Count - 1); return i == list.Count ? -1 : i; }
In order to test these methods, I created a list of one hundred thousand strings, containing the numbers from 0 to 99999 left-padded with zeroes (that is, the strings "000000"
to "099999"
). Then I created a new list with ten thousand random values from 0 to 199999, also with leading zeroes, so that one average half the searches would pass and half would fail. I then ran the search function ten thousand times and timed the total running time. Here are the remaining fields and methods in the Program class (this is a console application):
private const int MAX = 100000; private const int RUNS = 10000; private static void Main(string[] args) { var list = Enumerable .Range(0, MAX) .Select(it => it.ToString("000000")) .ToList(); var rng = new Random(); var randomValues = Enumerable .Range(0, RUNS) .Select(it => rng.Next(MAX * 2).ToString("000000")) .ToList(); var dict = new Dictionary<string, Func<List<string>, string, int>> { { "SearchUsingIndexOf", SearchUsingIndexOf }, { "LinearSearch1", LinearSearch1 }, { "LinearSearch2", LinearSearch2 }, { "BinarySearch1", BinarySearch1 }, { "BinarySearch2", BinarySearch2 }, }; var timings = dict .Select(it => new { name = it.Key, ticks = Run(i => it.Value(list, randomValues[i]), RUNS) }) .Select(timing => string.Format("{0,-20} -> {1,12} msec", timing.name, timing.ticks)); foreach (var timing in timings) Console.WriteLine(timing); } private static long Run(Action<int> action, int count) { var timer = new Stopwatch(); timer.Start(); for (var i = 0; i < count; i++) action(i); timer.Stop(); return timer.ElapsedMilliseconds; }
This is the result of running the program on my computer in Release mode:
SearchUsingIndexOf -> 6768 msec LinearSearch1 -> 5161 msec LinearSearch2 -> 4954 msec BinarySearch1 -> 2 msec BinarySearch2 -> 13 msec Press any key to continue . . .
I was surprised to see that the optimization of the linear search doesn't reduce the time more (the improvement is around 4%). However, the difference between the linear and binary searches is colossal: my binary search is more than three thousand times faster than the IndexOf
method, and even the existing implementation is over 500 times faster. (.NET does a lot more checks than I do - as you can see, the linear searches are also faster than the predefined IndexOf
method.)
In any case, here you go: the next time you have to do a lot of searches through lists, remember the binary search. It won't pay off unless you have at least hundreds of searches, preferably over a thousand; otherwise, just stick to IndexOf
.
Edit: The code is available on GitHub.
Comments