Skip to content

Musings of an Anonymous Geek

Made with only the finest 1's and 0's

Menu
  • About
  • Search Results
Menu

Python Code Kata 2: Karate Chop (Binary Search)

Posted on December 13, 2009December 14, 2009 by bkjones

I’ve decided it’d be fun to go through the Code Kata put together by Dave Thomas, co-author of The Pragmatic Programmer, and co-founder of Pragmatic Programmers, LLC (publishers of the finest tech books on the market right now; they can’t expand their selection fast enough for me).

Anyway, the first Code Kata, while being truly interesting to ponder, doesn’t really involve code, so I’m starting with Kata 2, which is a binary search exercise. It occurred to me while reading it over that I’d never written a binary search in Python, because you typically don’t need to, since list objects have a method called ‘index()’ which will hand you the index position of your search target within the list. I decided to force myself not to rely on it, as an exercise. If you’ve never had to go through a computer science curriculum, but want to excel at programming, I suggest doing all of the Kata, and getting your hands on at least one good computer science text (a computer science grad student hanging around is nice, too!)

Binary Search Background

So, the idea behind a binary search is often likened to looking up a name in a phone book, which is a pretty good analogy. A phone book has thousands of pages and perhaps millions of names, and yet it doesn’t take us long to find any given name, no matter which part of the book it’s in. You open the book to the middle, and the name you’re looking for is either on one of the facing pages, in the first half of the book, or in the last half of the book. After glancing at a single page, you’ve reduced your haystack by 50%. You flip to the middle of the proper half of the book, and repeat the process. If you had a phone book listing every person in the entire world (say, 7,000,000,000 people), you could find any given person in 33 steps or less. That’s pretty efficient (in big O notation, it’s O(logn)).

You can use a binary algorithm to tease your friends, too. Bet a friend $1 that you can guess any number they pick between 1 and 100 in 7 guesses if they tell you if your guess is too low or too high. If you use a binary algorithm to make your guesses, it’s not possible to lose this bet as far as I know. Here’s an example:

Your buddy chooses number 70. Your guesses will be:

  1. 50 (too low)
  2. 75 (too high)
  3. 62 (too low)
  4. 69 (too low)
  5. 72 (too high)
  6. 71 (too high)
  7. 70 (that’s it!)

One more. Your buddy chooses the number 1. Your guesses will be:

  1. 50 (too high)
  2. 25 (too high)
  3. 13 (too high)
  4. 7 (too high)
  5. 4 (too high)
  6. 2 (too high)
  7. 1  (that’s it!)

But what if your buddy wants to up the stakes by increasing the scope to numbers between 1 and 1000? Say he picks 485:

  1. 500 (too high)
  2. 250 (too low)
  3. 375 (too low)
  4. 438 (too low)
  5. 469 (too low)
  6. 485 (that’s it!)

You got lucky on that last one, but you might’ve negotiated more guesses for yourself. To do that and still guarantee a win, figure out what log2N is, where N is the highest number they can choose. So, log2100 = 7, and log21000 = 10. Another tip: when calculating your next guess, always round your division UP. Notice that log27000000000 = 32.7, which is where I got the max of 33 steps above.

Ok, now on to the Kata!!

Binary Search in Python Take 1: Traditional Binary Search

This is just my take on a traditional binary search algorithm in Python:

def chop(needle, haystack):
   """
   Takes an integer target (needle) and a sorted list of integers (haystack),
   and performs a binary search for the target. Return the integer index of the
   target in the array, or -1 if it's not found.
   """
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      mid = (lo+hi)/2
      midval = haystack[mid]
      if needle < midval:
         hi = mid
      elif needle > midval:
         lo = mid
      else:
         print "Found %d at index %d" % (needle, mid)
         break

This makes assumptions. Namely, it assumes a 0-indexed, sorted list with no missing values. So, my test lists are always output from range(N), where N is the number of elements I want to search. It works well enough. I think a CS prof would accept this. Let’s move on.

What if I break it in *THREE* (oooooohhhhhh)

If breaking the list in two is so cool, why don’t we break it in three pieces? We can do one extra check, and then cut out 2/3 of the haystack in one fell swoop! Ok, let’s see how it goes:

def chop3(needle, haystack):

   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      brk1 = (lo+hi)/3
      brk2 = (2 * (lo+hi)) / 3
      val1 = haystack[brk1]
      val2 = haystack[brk2]

      if needle < val1:
         hi = brk1
      elif needle > val1 and needle < val2:
         hi = brk2
         lo = brk1
      elif needle > val2:
         lo = brk2

This one attempts to cut the initial array into three parts instead of bisecting. This won’t work, actually. At some point, you’ll shrink the list down two where the lower bound is 1/2 the upper bound, at which point, all hell breaks loose. Think about how this works when lo is 16 and hi is 32. 16+32 = 48, and 48/3 = 16! This means brk1 never actually changes, and you’ve created perfect conditions for an infinite loop.

Instead, define the hi and lo variables and break points in the list differently:

def chop3_take2(needle, haystack):
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      incr = (hi-lo)/3
      brk1 = lo+incr
      brk2 = lo + (incr * 2)
      val1 = haystack[brk1]
      val2 = haystack[brk2]
      if needle < val1:
         hi = brk1
      elif needle > val1 and needle < val2:
         hi = brk2
         lo = brk1
      elif needle > val2:
         lo = brk2
      elif needle == val1:
         break
      elif needle == val2:
         break

This one will work. It’s slightly more heavy lifting than chop2, but not too bad. In the calculation for the brk variables, I’ve decided to calculate an increment defined as 1/3 the distance between lo and hi. So, if lo=16 and hi=32, instead of brk1 becoming 16 and looping forever, it becomes the point 1/3 of the way between lo and hi. It makes fewer assumptions, which in my experience means fewer bugs. (32-16)/3 == 5 (the remainder in integer division is truncated), so brk1 is going to be 21 instead of 16.

From simple tests, it doesn’t appear that this extra bit munging buys you anything. On a list with 1 million elements, chop() and chop3_take2() run in almost the exact same amount of time. Increasing the list size to 10 million elements starts to show evidence that the initial chop() routine is going to scale better than chop3_take2(). I think chop() is also nicer to read and understand, which makes it a better candidate for production code.

Combine the two?

I sorta wondered what would happen if I only did the 2/3 reduction on the inital pass, and then passed the part of the array where the target exists to chop(). The function would look a lot like chop3_take2, but inside the if…elif parts, instead of setting new vals for lo or hi, we’d pass the part of haystack bounded by lo and hi to chop().

A nice idea except that, if you recall, chop() assumes a zero-indexed array, and our chop3_take2() function sets new values for lo in some instances, so what we really need is a chop_non_zero_index() function. Here’s one:

def chop_non_zero_index(needle, haystack):
   lo = 0
   hi = len(haystack)
   while lo < hi:
      mid = (lo+hi)/2
      midval = haystack[mid]
      if needle < midval:
         hi = mid
      elif needle > midval:
         lo = mid
      else:
         print "Found %d at index %d" % (needle, mid)
         break

And then we pass portions of our haystack to chop_non_zero_index() from a function called chop_salad(). The chop_salad() function does an initial pass to cut out 2/3 of the candidates, then delegates to chop_non_zero_index() for the rest.

def chop_salad(needle, haystack):
   """Same as take 1 above, but this time we pass to chop_non_zero_index
   after the first pass.
   """
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      incr = (hi-lo)/3
      print "increment = %d" % incr
      brk1 = lo+incr
      print "brk1 = %d" % brk1
      brk2 = lo + (incr * 2)
      print "brk2 = %d" % brk2
      val1 = haystack[brk1]
      val2 = haystack[brk2]

      if needle < val1:
         chop_non_zero_index(needle, haystack[:brk1+1])
         break
      elif needle > val1 and needle < val2:
         chop_non_zero_index(needle, haystack[brk1:brk2+1])
         break
      elif needle > val2:
         chop_non_zero_index(needle, haystack[brk2:])
         break
      elif needle == val1:
         print "Found it @val1!"
         break
      elif needle == val2:
         print "Found it @val2!"
         break

At this point, we’re getting into doing more list manipulations, more comparisons, passing stuff between functions… not nearly as clean and simple as plain old chop(), and it doesn’t buy you anything in terms of performance either.

Extend the List Object, and using Recursion

I almost never extend built-in objects like list or dict. I couldn’t remember the last time I did it, but I wondered if there was any benefit to adding a bisect() operation to a list object, so I gave it a shot. It’s not hard or cool or anything. Here’s my list object extension:

class mylist(list):
   def bisect(self):
      return [mylist(self[:len(self)/2]),mylist(self[len(self)/2:])]

And to use it, you create a haystack that is a mylist object, then pass it to a recursive function that’ll call bisect()

def chop_bisect_object(needle, haystack):
   while True:
      head, tail = haystack.bisect()
      try:
         if needle < head[-1]:
            chop_bisect_object(needle, head)
            break
         elif needle > tail[0]:
            chop_bisect_object(needle, tail)
            break
         elif needle == head[0]:
            print "needle in head[0]"
            break
         elif needle == tail[0]:
            print "needle in tail[0]"
            break

if __name__ == "__main__":
   x = mylist(range(1000000))
   chop_bisect_object(needle,x)

Finally, Chef Guido Chops

Of course, this was all a fun exercise to get the synapses sparking. The reality is that, if all you really want to do is find out where in a list some value exists, you call l.index(val), where val is what you’re looking for. This method only runs slightly faster than the fastest algorithm I was able to come up with (for both smallish and largish haystacks), but it’s absolutely the most readable, and being that it’s so short, it has the least number of ways to introduce bugs, so it really is the right tool for the job in this case.

Onward

First, I don’t put forth these examples as flawless. Quite the contrary. I assume others have some cool ideas, or can point out some mistakes I’ve made, so please share in the comments!

I plan to go through the rest of the exercises in the Code Kata, and I also have a couple of interesting ones of my own I might throw in. I’ll try to post roughly once a week about this, so if you’re interested, subscribe to my RSS feed, or just the Python or CodeKata feeds.

Share this:

  • Click to share on X (Opens in new window) X
  • Click to share on Reddit (Opens in new window) Reddit
  • Click to share on Tumblr (Opens in new window) Tumblr
  • Click to share on Facebook (Opens in new window) Facebook

Recent Posts

  • Auditing Your Data Migration To ClickHouse Using ClickHouse Local
  • ClickHouse Cheat Sheet 2024
  • User Activation With Django and Djoser
  • Python Selenium Webdriver Notes
  • On Keeping A Journal and Journaling
  • What Geeks Could Learn From Working In Restaurants
  • What I’ve Been Up To
  • PyCon Talk Proposals: All You Need to Know And More
  • Sending Alerts With Graphite Graphs From Nagios
  • The Python User Group in Princeton (PUG-IP): 6 months in

Categories

  • Apple
  • Big Ideas
  • Books
  • CodeKata
  • Database
  • Django
  • Freelancing
  • Hacks
  • journaling
  • Leadership
  • Linux
  • LinuxLaboratory
  • Loghetti
  • Me stuff
  • Other Cool Blogs
  • PHP
  • Productivity
  • Python
  • PyTPMOTW
  • Ruby
  • Scripting
  • Sysadmin
  • Technology
  • Testing
  • Uncategorized
  • Web Services
  • Woodworking

Archives

  • January 2024
  • May 2021
  • December 2020
  • January 2014
  • September 2012
  • August 2012
  • February 2012
  • November 2011
  • October 2011
  • June 2011
  • April 2011
  • February 2011
  • January 2011
  • December 2010
  • November 2010
  • September 2010
  • July 2010
  • June 2010
  • May 2010
  • April 2010
  • March 2010
  • February 2010
  • January 2010
  • December 2009
  • November 2009
  • October 2009
  • September 2009
  • August 2009
  • July 2009
  • June 2009
  • May 2009
  • April 2009
  • March 2009
  • February 2009
  • January 2009
  • December 2008
  • November 2008
  • October 2008
  • September 2008
  • August 2008
  • July 2008
  • June 2008
  • May 2008
  • April 2008
  • March 2008
  • February 2008
  • January 2008
  • December 2007
  • November 2007
  • October 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007
  • May 2007
  • April 2007
  • March 2007
  • February 2007
  • January 2007
  • December 2006
  • November 2006
  • September 2006
  • August 2006
  • July 2006
  • June 2006
  • April 2006
  • March 2006
  • February 2006
  • January 2006
  • December 2005
  • November 2005
  • October 2005
  • September 2005
  • August 2005
  • July 2005
  • June 2005
  • May 2005
  • April 2005
  • March 2005
  • February 2005
  • January 2005
  • December 2004
  • November 2004
  • October 2004
  • September 2004
  • August 2004
© 2025 Musings of an Anonymous Geek | Powered by Minimalist Blog WordPress Theme