Thursday, December 11, 2008

Python and Ruby: Regular Expression Anchors

In Python regular expressions, multiline mode is off by default. The documentation says:
"When [multiline mode is] specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string."
In Ruby regular expressions, the multiline modifier (m) is also off by default. However, '^' still matches at the beginning of each line: in Ruby, '^' and '$' are always line anchors, and the m modifier only makes '.' match newlines.

Hence, in Python, the following does not match:
re.search(r'^foo', '\nfoo\nbar')
Interestingly enough, this does not match in Perl either:
"\nfoo\nbar" =~ /^foo/
In Ruby, it does:
/^foo/.match("\nfoo\nbar")
Both Python and Ruby support the "\A" anchor, which explicitly matches the beginning of the string (not the line).
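
To see the Python side of this in one place, here is a minimal sketch. It uses re.search rather than re.match, since re.match only ever tries to match at the start of the string and so would hide what '^' is doing. The variable names are mine, chosen for illustration.

```python
import re

text = '\nfoo\nbar'

# Default mode: '^' matches only at the very start of the string.
default_match = re.search(r'^foo', text)

# re.MULTILINE: '^' also matches immediately after each newline.
multiline_match = re.search(r'^foo', text, re.MULTILINE)

# '\A' matches only at the start of the string, even with re.MULTILINE.
string_start_match = re.search(r'\Afoo', text, re.MULTILINE)

print(default_match)        # None
print(multiline_match)      # a match object covering characters 1 to 4
print(string_start_match)   # None
```

Only the re.MULTILINE search finds "foo", which is the behavior Ruby gives you without any flag.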

To make matters even more confusing, in Python "\Z" matches only at the very end of the string. In Ruby, "\Z" matches at the end of the string or just before a final newline, whereas "\z" matches only at the very end of the string. Ruby is similar to Perl in this regard.
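
A short Python sketch of the end-of-string side, again with illustrative variable names of my own. It contrasts '$', which tolerates one trailing newline, with Python's "\Z", which does not.

```python
import re

text = 'foo\n'

# '$' matches at the end of the string and also just before a trailing newline.
dollar_match = re.search(r'foo$', text)

# Python's '\Z' matches only at the very end of the string,
# so the trailing newline prevents a match here.
end_match = re.search(r'foo\Z', text)

# Consuming the newline explicitly lets '\Z' match.
newline_end_match = re.search(r'foo\n\Z', text)

print(dollar_match is not None)       # True
print(end_match is not None)          # False
print(newline_end_match is not None)  # True
```

In other words, Python's "\Z" behaves like Ruby's "\z", not like Ruby's "\Z".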

I was surprised to discover such subtle differences existed. Things like that make expert-level proficiency in multiple languages extremely difficult.
