Home > Dive Into Python > Unit Testing > Refactoring | << >> | ||||
diveintopython.org Python for experienced programmers |
The best thing about comprehensive unit testing is not the feeling you get when all your test cases finally pass, or even the feeling you get when someone else blames you for breaking their code and you can actually prove that you didn't. The best thing about unit testing is that it gives you the freedom to refactor mercilessly.
Refactoring is the process of taking working code and making it work better. Usually, “better” means “faster”, although it can also mean “using less memory”, or “using less disk space”, or simply “more elegantly”. Whatever it means to you, to your project, in your environment, refactoring is important to the long-term health of any program.
Here, “better” means “faster”. Specifically, the fromRoman function is slower than it needs to be, because of that big nasty regular expression that we use to validate Roman numerals. It's probably not worth trying to do away with the regular expression altogether (it would be difficult, and it might not end up any faster), but we can speed up the function by precompiling the regular expression.
Example 6.31. Compiling regular expressions
>>> import re >>> pattern = '^M?M?M?$' >>> re.search(pattern, 'M') <SRE_Match object at 01090490> >>> compiledPattern = re.compile(pattern) >>> compiledPattern <SRE_Pattern object at 00F06E28> >>> dir(compiledPattern) ['findall', 'match', 'scanner', 'search', 'split', 'sub', 'subn'] >>> compiledPattern.search('M') <SRE_Match object at 01104928>
Whenever you are going to use a regular expression more than once, you should compile it to get a pattern object, then call the methods on the pattern object directly. |
Example 6.32. Compiled regular expressions in roman81.py
If you have not already done so, you can download this and other examples used in this book.
# toRoman and rest of module omitted for clarity romanNumeralPattern = \ re.compile('^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$') def fromRoman(s): """convert Roman numeral to integer""" if not s: raise InvalidRomanNumeralError, 'Input can not be blank' if not romanNumeralPattern.search(s): raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s result = 0 index = 0 for numeral, integer in romanNumeralMap: while s[index:index+len(numeral)] == numeral: result += integer index += len(numeral) return result
So how much faster is it to compile our regular expressions? See for yourself:
Example 6.33. Output of romantest81.py against roman81.py
............. ---------------------------------------------------------------------- Ran 13 tests in 3.385s OK
Just a note in passing here: this time, I ran the unit test without the -v option, so instead of the full doc string for each test, we only get a dot for each test that passes. (If a test failed, we'd get an F, and if it had an error, we'd get an E. We'd still get complete tracebacks for each failure and error, so we could track down any problems.) | |
We ran 13 tests in 3.385 seconds, compared to 3.685 seconds without precompiling the regular expressions. That's an 8% improvement overall, and remember that most of the time spent during the unit test is spent doing other things. (Separately, I time-tested the regular expressions by themselves, apart from the rest of the unit tests, and found that compiling this regular expression speeds up the search by an average of 54%.) Not bad for such a simple fix. | |
Oh, and in case you were wondering, precompiling our regular expression didn't break anything, and we just proved it. |
There is one other performance optimization that I want to try. Given the complexity of regular expression syntax, it should come as no surprise that there is frequently more than one way to write the same expression. After some discussion about this module on comp.lang.python, someone suggested that I try using the {m,n} syntax for the optional repeated characters.
If you have not already done so, you can download this and other examples used in this book.
# rest of program omitted for clarity #old version #romanNumeralPattern = \ # re.compile('^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$') #new version romanNumeralPattern = \ re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')
This form of the regular expression is a little shorter (though not any more readable). The big question is, is it any faster?
Example 6.35. Output of romantest82.py against roman82.py
............. ---------------------------------------------------------------------- Ran 13 tests in 3.315s OK
One other tweak I would like to make, and then I promise I'll stop refactoring and put this module to bed. As we've seen repeatedly, regular expressions can get pretty hairy and unreadable pretty quickly. I wouldn't like to come back to this module in six months and try to maintain it. Sure, the test cases pass, so I know that it works, but if I can't figure out how it works, I won't be able to add new features, fix new bugs, or otherwise maintain it. Documentation is critical, and Python provides a way of verbosely documenting your regular expressions.
If you have not already done so, you can download this and other examples used in this book.
# rest of program omitted for clarity #old version #romanNumeralPattern = \ # re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$') #new version romanNumeralPattern = re.compile(''' ^ # beginning of string M{0,4} # thousands - 0 to 4 M's (CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's), # or 500-800 (D, followed by 0 to 3 C's) (XC|XL|L?X{0,3}) # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's), # or 50-80 (L, followed by 0 to 3 X's) (IX|IV|V?I{0,3}) # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's), # or 5-8 (V, followed by 0 to 3 I's) $ # end of string ''', re.VERBOSE)
Example 6.37. Output of romantest83.py against roman83.py
............. ---------------------------------------------------------------------- Ran 13 tests in 3.315s OK
Handling changing requirements | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | Postscript |