Thursday, May 09, 2013

Optimization: implementing a binary search

Searching in a list occurs in a lot of programs. The lazy approach - the one I will normally use in almost all cases - is to just use the default IndexOf method (I am mostly using C# nowadays). However, this gets quickly inefficient if the number of searches starts climbing - IndexOf is a linear search and the inefficiency adds up.

If the list is sorted, a binary search will be orders of magnitude faster. (I cannot emphasize enough how important this is - binary search will return incorrect results if the list is not sorted.) The description of the algorithm is as follows:

  1. start with the whole list; let l and r be the first and last indices in the interval to search (defaulting to 0 and list.Count - 1)
  2. let m = the middle of the l .. r interval
  3. if the item at index m is equal to the value we're searching, success: return m
  4. otherwise, if the item is lower than the value, we need to focus on the upper half of the list; set l = m + 1
  5. if the item is higher than the value we need the lower half; set r = m - 1
  6. if l > r, we couldn't find the value; return -1. If l <= r, go back to step 2

The implementation is pretty straightforward:

    private static int BinarySearch1(List<string> list, string item)
    {
      var l = 0;
      var r = list.Count - 1;

      while (l <= r)
      {
        var m = l + (r - l) / 2;
        var comparison = string.CompareOrdinal(list[m], item);

        if (comparison == 0)
          return m;
        else if (comparison < 0)
          l = m + 1;
        else
          r = m - 1;
      }

      return -1;
    }

Of course, .NET already has this implemented in the core libraries, so I'm going to add it for comparison:

    private static int BinarySearch2(List<string> list, string item)
    {
      return list.BinarySearch(item);
    }

I'll also add the IndexOf method:

    private static int SearchUsingIndexOf(List<string> list, string item)
    {
      return list.IndexOf(item);
    }

and a manual implementation of the linear search:

    private static int LinearSearch1(List<string> list, string item)
    {
      for (var i = 0; i < list.Count; i++)
        if (list[i] == item)
          return i;

      return -1;
    }

Incidentally, even this linear search can be optimized; the loop has to make two comparisons (i < list.Count and list[i] == item). I can remove one of them by adding to the list the value I'm searching for. That way, I only need the second test. (Of course, I also need to remove the extra value from the list before returning.) Here is the optimized linear search method:

    private static int LinearSearch2(List<string> list, string item)
    {
      list.Add(item);
      var i = 0;
      while (list[i] != item)
        i++;

      list.RemoveAt(list.Count - 1);

      return i == list.Count ? -1 : i;
    }

In order to test these methods, I created a list of one hundred thousand strings, containing the numbers from 0 to 99999 left-padded with zeroes (that is, the strings "000000" to "099999"). Then I created a new list with ten thousand random values from 0 to 199999, also with leading zeroes, so that one average half the searches would pass and half would fail. I then ran the search function ten thousand times and timed the total running time. Here are the remaining fields and methods in the Program class (this is a console application):

    private const int MAX = 100000;
    private const int RUNS = 10000;

    private static void Main(string[] args)
    {
      var list = Enumerable
        .Range(0, MAX)
        .Select(it => it.ToString("000000"))
        .ToList();

      var rng = new Random();
      var randomValues = Enumerable
        .Range(0, RUNS)
        .Select(it => rng.Next(MAX * 2).ToString("000000"))
        .ToList();

      var dict = new Dictionary<string, Func<List<string>, string, int>>
      {
        { "SearchUsingIndexOf", SearchUsingIndexOf },
        { "LinearSearch1", LinearSearch1 },
        { "LinearSearch2", LinearSearch2 },
        { "BinarySearch1", BinarySearch1 },
        { "BinarySearch2", BinarySearch2 },
      };

      var timings = dict
        .Select(it => new { name = it.Key, ticks = Run(i => it.Value(list, randomValues[i]), RUNS) })
        .Select(timing => string.Format("{0,-20} -> {1,12} msec", timing.name, timing.ticks));

      foreach (var timing in timings)
        Console.WriteLine(timing);
    }

    private static long Run(Action<int> action, int count)
    {
      var timer = new Stopwatch();
      timer.Start();
      for (var i = 0; i < count; i++)
        action(i);
      timer.Stop();

      return timer.ElapsedMilliseconds;
    }

This is the result of running the program on my computer in Release mode:

SearchUsingIndexOf   ->         6768 msec
LinearSearch1        ->         5161 msec
LinearSearch2        ->         4954 msec
BinarySearch1        ->            2 msec
BinarySearch2        ->           13 msec
Press any key to continue . . .

I was surprised to see that the optimization of the linear search doesn't reduce the time more (the improvement is around 4%). However, the difference between the linear and binary searches is colossal: my binary search is more than three thousand times faster than the IndexOf method, and even the existing implementation is over 500 times faster. (.NET does a lot more checks than I do - as you can see, the linear searches are also faster than the predefined IndexOf method.)

In any case, here you go: the next time you have to do a lot of searches through lists, remember the binary search. It won't pay off unless you have at least hundreds of searches, preferably over a thousand; otherwise, just stick to IndexOf.

Edit: The code is available on GitHub.

No comments: