If you haven’t heard the news, Dave Beazley and I have officially begun work on the next edition of the Python Cookbook, which will be completely overhauled using absolutely nothing but Python 3. Yay!
Right now, I’m going through some string formatting recipes from the 2nd edition to see if they still work, and if Python 3 offers any preferred alternatives to the solutions provided. As usual, it turns out that the answer to that is often ‘it depends’. For example, you might decide on a slower solution that’s more readable. Conversely, you might need to run an operation in a loop a million times and really need the speed.
New string formatting operations like the built-in format() function (separate from the str.format method) and the format mini-language are available in 2.6, and made nicer in 2.7. All of it is backported from the 3.x tree to my knowledge, and I’ll be using a Python 3.2b2 interpreter session for my examples.
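Here is a minimal sketch of the difference between the two entry points. The values are made up for illustration and don't come from the Cookbook. The built-in format() takes one value and a bare format spec, while str.format puts the same mini-language after the colon in a replacement field:

```python
# The built-in format() applies one format spec to one value.
price = format(3.14159, '.2f')

# str.format() puts the same spec after the colon in a replacement field.
# The explicit index 0 is used because 2.6 lacks auto-numbered '{}' fields.
padded = '{0:>8.2f}'.format(3.14159)

print(price)
print(repr(padded))
```

Both calls go through the same format spec machinery, so anything the spec can express (fill, alignment, width, precision, type) should be available from either one.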
I want to focus specifically on string alignment here, because there are very obviously multiple ways to solve alignment needs. Here’s an example solution from the 2nd edition:
>>> print '|' , 'hej'.ljust(20) , '|' , 'hej'.rjust(20) , '|' , 'hej'.center(20) , '|'
| hej                  |                  hej |         hej          |
Note that this is Python 2.x syntax, but it works in Python 3.2 if you make print a function call instead of a statement (that is, wrap the arguments in parentheses). The string methods used here are still in Python 3.2, with no deprecation notices and no stated preference for the newer methods. That said, this looks messy to me, so I wondered whether I could make it more readable without losing performance, or at least without losing so much performance that the readability gains aren’t worth it.
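For reference, a sketch of the same recipe as a Python 3 function call. The `line` variable is my own addition, there only to show what print() does with its arguments when the separator is left at its default of a single space:

```python
# Python 3 version of the 2nd edition recipe: print is a function call now.
# The default separator is a single space, so the layout matches the 2.x output.
print('|', 'hej'.ljust(20), '|', 'hej'.rjust(20), '|', 'hej'.center(20), '|')

# Building the same line by hand makes the implicit separator visible.
parts = ['|', 'hej'.ljust(20), '|', 'hej'.rjust(20), '|', 'hej'.center(20), '|']
line = ' '.join(parts)
print(line)
```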
Single String Formatting
Here are three ways to get the same string alignment behavior in Python 3.2b2:
>>> '{:+<20s}'.format('hej')
'hej+++++++++++++++++'
>>> format('hej', '+<20s')
'hej+++++++++++++++++'
>>> 'hej'.ljust(20, '+')
'hej+++++++++++++++++'
Ok, so they all work the same. Now I’m going to wrap each one in a function and use the timeit module to help me get an idea what the difference is in terms of performance.
>>> from timeit import timeit
>>> def runit():
...     format('hej', '+<20s')
...
>>> def runit2():
...     'hej'.ljust(20, '+')
...
>>> def runit3():
...     '{:+<20s}'.format('hej')
...
>>> timeit(stmt=runit3, number=1000000)
0.6168370246887207
>>> timeit(stmt=runit3, number=1000000)
0.6109819412231445
>>> timeit(stmt=runit3, number=1000000)
0.6166291236877441
>>> timeit(stmt=runit2, number=1000000)
0.49651098251342773
>>> timeit(stmt=runit2, number=1000000)
0.4870288372039795
>>> timeit(stmt=runit2, number=1000000)
0.49135899543762207
>>> timeit(stmt=runit, number=1000000)
0.7751290798187256
>>> timeit(stmt=runit, number=1000000)
0.7771239280700684
>>> timeit(stmt=runit, number=1000000)
0.7805869579315186
Turns out that using the old, tried-and-true str.* methods is fastest in this case, though I think in a more complex case like the recipe from the 2nd edition I’d opt for something more readable if I had the chance.
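If you want to repeat this comparison outside the interpreter, something like the following script might do. It is only a sketch: the function names and the best_time helper are hypothetical, the functions return their result (unlike the ones timed in this post) so the outputs can be checked for equality first, and the numbers you get will depend on your machine and Python version.

```python
from timeit import repeat

def with_builtin():
    return format('hej', '+<20s')

def with_ljust():
    return 'hej'.ljust(20, '+')

def with_str_format():
    return '{:+<20s}'.format('hej')

candidates = [with_builtin, with_ljust, with_str_format]

# Sanity check: every candidate must build the same string before timing it.
results = {func.__name__: func() for func in candidates}
assert len(set(results.values())) == 1

def best_time(func, number=100000, rounds=3):
    # Take the fastest of several rounds to reduce noise from other processes.
    return min(repeat(func, number=number, repeat=rounds))

for func in candidates:
    print(func.__name__, best_time(func))
```

Taking the minimum over a few rounds is a common way to read timeit results, since the slower rounds usually reflect interference rather than the code being measured.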
One String, Three Ways
Let’s look at a more complex case. Let’s take each of the methodologies used in runit, runit2, and runit3, and see how things pan out when we want to do something like the 2nd edition recipe. I’ll start with the bare interpreter operation to compare the output:
>>> '|' + format('hej', '+<20s') + '|' + format('hej', '+^20s') + '|' + format('hej', '+>20s') + '|'
'|hej+++++++++++++++++|++++++++hej+++++++++|+++++++++++++++++hej|'
>>> '|' + 'hej'.ljust(20, '+') + '|' + 'hej'.center(20, '+') + '|' + 'hej'.rjust(20, '+') + '|'
'|hej+++++++++++++++++|++++++++hej+++++++++|+++++++++++++++++hej|'
>>> '|{0:+<20s}|{0:+^20s}|{0:+>20s}|'.format('hej')
'|hej+++++++++++++++++|++++++++hej+++++++++|+++++++++++++++++hej|'
Unless you go through the rigamarole of creating a sequence and using '|'.join(myseq), the last method seems the most readable to me. I’d really just like to use the built-in print function with a sep='|' argument, but that won’t cover the pipes at the beginning and end of the string unless I’ve missed something.
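One possible workaround, offered as a sketch rather than a recommendation: if the first and last arguments are empty strings, the separator ends up at both ends of the line. The names cells, padded, and buffer are mine, and the StringIO buffer is only there to capture what print() writes so the two options can be compared.

```python
import io

cells = ['hej'.ljust(20, '+'), 'hej'.center(20, '+'), 'hej'.rjust(20, '+')]

# Option 1: join the cells, then add the outer pipes by hand.
joined = '|' + '|'.join(cells) + '|'

# Option 2: pad the argument list with empty strings so that sep='|'
# also produces the leading and trailing pipes.
padded = [''] + cells + ['']
buffer = io.StringIO()
print(*padded, sep='|', file=buffer)
printed = buffer.getvalue().rstrip('\n')

print(joined)
print(printed == joined)
```

Whether the empty-string trick is more readable than a single format string is debatable; it mostly shows that sep alone can produce the outer pipes.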
Here are the functions and timings:
>>> def threeways():
...     '|' + format('hej', '+<20s') + '|' + format('hej', '+^20s') + '|' + format('hej', '+>20s') + '|'
...
>>> def threeways2():
...     '|' + 'hej'.ljust(20, '+') + '|' + 'hej'.center(20, '+') + '|' + 'hej'.rjust(20, '+') + '|'
...
>>> def threeways3():
...     '|{0:+<20s}|{0:+^20s}|{0:+>20s}|'.format('hej')
...
>>> timeit(stmt=threeways, number=1000000)
2.4910600185394287
>>> timeit(stmt=threeways, number=1000000)
2.50291109085083
>>> timeit(stmt=threeways, number=1000000)
2.4913830757141113
>>> timeit(stmt=threeways2, number=1000000)
1.9027390480041504
>>> timeit(stmt=threeways2, number=1000000)
1.8975908756256104
>>> timeit(stmt=threeways2, number=1000000)
1.8957319259643555
>>> timeit(stmt=threeways3, number=1000000)
1.311446189880371
>>> timeit(stmt=threeways3, number=1000000)
1.3099820613861084
>>> timeit(stmt=threeways3, number=1000000)
1.3031558990478516
The threeways3() function has a bit of an advantage in not having to muck with concatenation at all, and this probably explains the difference. Changing threeways() to use a list and '|'.join() brought it from about 2.50 to about 2.30. Better. Changing threeways2() in the same way was also a small improvement from ~1.90 to ~1.77. No big wins there, and they’re not particularly readable in either case. For this one arguably trivial corner case, the new formatting mini-language wins in both performance and (IMO) readability.
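The list-and-join variants aren't shown above, so here is a guess at their general shape. The function names are hypothetical, the empty strings at each end are one way to get the outer pipes from the join itself, and the functions return the string so the results can be compared with the single-format-string version.

```python
def threeways_join():
    # format() built-in for each cell, joined with pipes.
    cells = ['', format('hej', '+<20s'), format('hej', '+^20s'),
             format('hej', '+>20s'), '']
    return '|'.join(cells)

def threeways2_join():
    # str methods for each cell, joined with pipes.
    cells = ['', 'hej'.ljust(20, '+'), 'hej'.center(20, '+'),
             'hej'.rjust(20, '+'), '']
    return '|'.join(cells)

expected = '|{0:+<20s}|{0:+^20s}|{0:+>20s}|'.format('hej')
print(threeways_join() == expected)
print(threeways2_join() == expected)
```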
Big Assumptions
This of course assumes that I didn’t overlook something in creating the comparison functions, and that there isn’t a different way to do this that blows all of my work out of the water. If you see a completely different way to do this that’s both readable and performant, or if I did something bone-headed, please let me know in the comments. 🙂