I’ve decided it’d be fun to go through the Code Kata put together by Dave Thomas, co-author of The Pragmatic Programmer, and co-founder of Pragmatic Programmers, LLC (publishers of the finest tech books on the market right now; they can’t expand their selection fast enough for me).
Anyway, the first Code Kata, while being truly interesting to ponder, doesn’t really involve code, so I’m starting with Kata 2, which is a binary search exercise. It occurred to me while reading it over that I’d never written a binary search in Python, because you typically don’t need to, since list objects have a method called ‘index()’ which will hand you the index position of your search target within the list. I decided to force myself not to rely on it, as an exercise. If you’ve never had to go through a computer science curriculum, but want to excel at programming, I suggest doing all of the Kata, and getting your hands on at least one good computer science text (a computer science grad student hanging around is nice, too!)
Binary Search Background
So, the idea behind a binary search is often likened to looking up a name in a phone book, which is a pretty good analogy. A phone book has thousands of pages and perhaps millions of names, and yet it doesn’t take us long to find any given name, no matter which part of the book it’s in. You open the book to the middle, and the name you’re looking for is either on one of the facing pages, in the first half of the book, or in the last half of the book. After glancing at a single page, you’ve reduced your haystack by 50%. You flip to the middle of the proper half of the book, and repeat the process. If you had a phone book listing every person in the entire world (say, 7,000,000,000 people), you could find any given person in 33 steps or less. That’s pretty efficient (in big O notation, it’s O(logn)).
You can use a binary algorithm to tease your friends, too. Bet a friend $1 that you can guess any number they pick between 1 and 100 in 7 guesses if they tell you if your guess is too low or too high. If you use a binary algorithm to make your guesses, it’s not possible to lose this bet as far as I know. Here’s an example:
Your buddy chooses number 70. Your guesses will be:
- 50 (too low)
- 75 (too high)
- 62 (too low)
- 69 (too low)
- 72 (too high)
- 71 (too high)
- 70 (that’s it!)
One more. Your buddy chooses the number 1. Your guesses will be:
- 50 (too high)
- 25 (too high)
- 13 (too high)
- 7 (too high)
- 4 (too high)
- 2 (too high)
- 1 (that’s it!)
But what if your buddy wants to up the stakes by increasing the scope to numbers between 1 and 1000? Say he picks 485:
- 500 (too high)
- 250 (too low)
- 375 (too low)
- 438 (too low)
- 469 (too low)
- 485 (that’s it!)
You got lucky on that last one, but you might’ve negotiated more guesses for yourself. To do that and still guarantee a win, figure out what log2N is, where N is the highest number they can choose. So, log2100 = 7, and log21000 = 10. Another tip: when calculating your next guess, always round your division UP. Notice that log27000000000 = 32.7, which is where I got the max of 33 steps above.
Ok, now on to the Kata!!
Binary Search in Python Take 1: Traditional Binary Search
This is just my take on a traditional binary search algorithm in Python:
def chop(needle, haystack): """ Takes an integer target (needle) and a sorted list of integers (haystack), and performs a binary search for the target. Return the integer index of the target in the array, or -1 if it's not found. """ lo = haystack hi = len(haystack) while lo < hi: mid = (lo+hi)/2 midval = haystack[mid] if needle < midval: hi = mid elif needle > midval: lo = mid else: print "Found %d at index %d" % (needle, mid) break
This makes assumptions. Namely, it assumes a 0-indexed, sorted list with no missing values. So, my test lists are always output from range(N), where N is the number of elements I want to search. It works well enough. I think a CS prof would accept this. Let’s move on.
What if I break it in *THREE* (oooooohhhhhh)
If breaking the list in two is so cool, why don’t we break it in three pieces? We can do one extra check, and then cut out 2/3 of the haystack in one fell swoop! Ok, let’s see how it goes:
def chop3(needle, haystack): lo = haystack hi = len(haystack) while lo < hi: brk1 = (lo+hi)/3 brk2 = (2 * (lo+hi)) / 3 val1 = haystack[brk1] val2 = haystack[brk2] if needle < val1: hi = brk1 elif needle > val1 and needle < val2: hi = brk2 lo = brk1 elif needle > val2: lo = brk2
This one attempts to cut the initial array into three parts instead of bisecting. This won’t work, actually. At some point, you’ll shrink the list down two where the lower bound is 1/2 the upper bound, at which point, all hell breaks loose. Think about how this works when lo is 16 and hi is 32. 16+32 = 48, and 48/3 = 16! This means brk1 never actually changes, and you’ve created perfect conditions for an infinite loop.
Instead, define the hi and lo variables and break points in the list differently:
def chop3_take2(needle, haystack): lo = haystack hi = len(haystack) while lo < hi: incr = (hi-lo)/3 brk1 = lo+incr brk2 = lo + (incr * 2) val1 = haystack[brk1] val2 = haystack[brk2] if needle < val1: hi = brk1 elif needle > val1 and needle < val2: hi = brk2 lo = brk1 elif needle > val2: lo = brk2 elif needle == val1: break elif needle == val2: break
This one will work. It’s slightly more heavy lifting than chop2, but not too bad. In the calculation for the brk variables, I’ve decided to calculate an increment defined as 1/3 the distance between lo and hi. So, if lo=16 and hi=32, instead of brk1 becoming 16 and looping forever, it becomes the point 1/3 of the way between lo and hi. It makes fewer assumptions, which in my experience means fewer bugs. (32-16)/3 == 5 (the remainder in integer division is truncated), so brk1 is going to be 21 instead of 16.
From simple tests, it doesn’t appear that this extra bit munging buys you anything. On a list with 1 million elements, chop() and chop3_take2() run in almost the exact same amount of time. Increasing the list size to 10 million elements starts to show evidence that the initial chop() routine is going to scale better than chop3_take2(). I think chop() is also nicer to read and understand, which makes it a better candidate for production code.
Combine the two?
I sorta wondered what would happen if I only did the 2/3 reduction on the inital pass, and then passed the part of the array where the target exists to chop(). The function would look a lot like chop3_take2, but inside the if…elif parts, instead of setting new vals for lo or hi, we’d pass the part of haystack bounded by lo and hi to chop().
A nice idea except that, if you recall, chop() assumes a zero-indexed array, and our chop3_take2() function sets new values for lo in some instances, so what we really need is a chop_non_zero_index() function. Here’s one:
def chop_non_zero_index(needle, haystack): lo = 0 hi = len(haystack) while lo < hi: mid = (lo+hi)/2 midval = haystack[mid] if needle < midval: hi = mid elif needle > midval: lo = mid else: print "Found %d at index %d" % (needle, mid) break
And then we pass portions of our haystack to chop_non_zero_index() from a function called chop_salad(). The chop_salad() function does an initial pass to cut out 2/3 of the candidates, then delegates to chop_non_zero_index() for the rest.
def chop_salad(needle, haystack): """Same as take 1 above, but this time we pass to chop_non_zero_index after the first pass. """ lo = haystack hi = len(haystack) while lo < hi: incr = (hi-lo)/3 print "increment = %d" % incr brk1 = lo+incr print "brk1 = %d" % brk1 brk2 = lo + (incr * 2) print "brk2 = %d" % brk2 val1 = haystack[brk1] val2 = haystack[brk2] if needle < val1: chop_non_zero_index(needle, haystack[:brk1+1]) break elif needle > val1 and needle < val2: chop_non_zero_index(needle, haystack[brk1:brk2+1]) break elif needle > val2: chop_non_zero_index(needle, haystack[brk2:]) break elif needle == val1: print "Found it @val1!" break elif needle == val2: print "Found it @val2!" break
At this point, we’re getting into doing more list manipulations, more comparisons, passing stuff between functions… not nearly as clean and simple as plain old chop(), and it doesn’t buy you anything in terms of performance either.
Extend the List Object, and using Recursion
I almost never extend built-in objects like list or dict. I couldn’t remember the last time I did it, but I wondered if there was any benefit to adding a bisect() operation to a list object, so I gave it a shot. It’s not hard or cool or anything. Here’s my list object extension:
class mylist(list): def bisect(self): return [mylist(self[:len(self)/2]),mylist(self[len(self)/2:])]
And to use it, you create a haystack that is a mylist object, then pass it to a recursive function that’ll call bisect()
def chop_bisect_object(needle, haystack): while True: head, tail = haystack.bisect() try: if needle < head[-1]: chop_bisect_object(needle, head) break elif needle > tail: chop_bisect_object(needle, tail) break elif needle == head: print "needle in head" break elif needle == tail: print "needle in tail" break if __name__ == "__main__": x = mylist(range(1000000)) chop_bisect_object(needle,x)
Finally, Chef Guido Chops
Of course, this was all a fun exercise to get the synapses sparking. The reality is that, if all you really want to do is find out where in a list some value exists, you call l.index(val), where val is what you’re looking for. This method only runs slightly faster than the fastest algorithm I was able to come up with (for both smallish and largish haystacks), but it’s absolutely the most readable, and being that it’s so short, it has the least number of ways to introduce bugs, so it really is the right tool for the job in this case.
First, I don’t put forth these examples as flawless. Quite the contrary. I assume others have some cool ideas, or can point out some mistakes I’ve made, so please share in the comments!
I plan to go through the rest of the exercises in the Code Kata, and I also have a couple of interesting ones of my own I might throw in. I’ll try to post roughly once a week about this, so if you’re interested, subscribe to my RSS feed, or just the Python or CodeKata feeds.