Friday, April 28, 2006

Python: Protecting UTF-8 Strings from Naive Code

"""Temporarily convert a UTF-8 string to Unicode to prevent breakage.

BASIC IDEA: protect_utf8 is a function decorator that can prevent naive
functions from breaking UTF-8.

"""

def protect_utf8(wrapped_function, encoding='UTF-8'):
    """Temporarily convert a UTF-8 string to Unicode to prevent breakage.

    protect_utf8 is a function decorator that can prevent naive
    functions from breaking UTF-8.

    If the wrapped function takes a string, and that string happens to be valid
    UTF-8, convert it to a unicode object and call the wrapped function. If a
    conversion was done and if a unicode object was returned, convert it back
    to a UTF-8 string.

    The wrapped function should take a string as its first parameter and it may
    return an object of the same type. Anything else is optional. For
    example:

        def truncate(s):
            return s[:1]

    Pass "encoding" if you want to protect something other than UTF-8.

    Ideally, we'd have unicode objects everywhere, but sometimes life is not
    ideal. :)

    """

    def proxy_function(s, *args, **kargs):
        unconvert = False
        if isinstance(s, str):
            try:
                s = s.decode(encoding)
                unconvert = True
            except UnicodeDecodeError:
                pass  # Not valid in this encoding; pass the string through as-is.
        ret = wrapped_function(s, *args, **kargs)
        # Only re-encode if we decoded on the way in and got unicode back.
        if unconvert and isinstance(ret, unicode):
            ret = ret.encode(encoding)
        return ret

    return proxy_function

def truncate(s, length=1, etc="..."):
    """Truncate a string to the given length.

    If truncation is necessary, append the value of "etc".

    This is really just a silly test.

    """
    # A string that already fits needs no truncation and no "etc".
    if len(s) <= length:
        return s
    return s[:length] + etc
truncate = protect_utf8(truncate)  # I'm stuck on Python 2.3.

if __name__ == '__main__':
    assert (truncate('\xe3\x82\xa6\xe3\x82\xb6\xe3\x83\x86', etc="") ==
            '\xe3\x82\xa6')
    assert truncate('abc') == 'a...'
    assert truncate(u'\u30a0\u30b1\u30c3', etc="") == u'\u30a0'
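
For readers on Python 3, where str is already Unicode and the raw type is bytes, the same idea might look like the sketch below. The names protect_utf8_bytes, naive_truncate, and protected are made up for illustration and are not part of the module above; functools.wraps is used only to keep the wrapped function's name and docstring. The sketch also shows the breakage being protected against: slicing the raw bytes cuts a three-byte character in half.

```python
import functools


def protect_utf8_bytes(wrapped_function, encoding="utf-8"):
    """Hypothetical Python 3 analogue: decode bytes, call, re-encode."""

    @functools.wraps(wrapped_function)
    def proxy_function(s, *args, **kwargs):
        unconvert = False
        if isinstance(s, bytes):
            try:
                s = s.decode(encoding)
                unconvert = True
            except UnicodeDecodeError:
                pass  # Not valid in this encoding; pass the bytes through as-is.
        ret = wrapped_function(s, *args, **kwargs)
        # Only re-encode if we decoded on the way in and got str back.
        if unconvert and isinstance(ret, str):
            ret = ret.encode(encoding)
        return ret

    return proxy_function


def naive_truncate(s):
    """A naive function that assumes one element is one character."""
    return s[:1]


# The same three-character UTF-8 string used in the test above.
raw = b"\xe3\x82\xa6\xe3\x82\xb6\xe3\x83\x86"

# Unprotected: only the first byte of a three-byte character survives.
broken = naive_truncate(raw)
try:
    broken.decode("utf-8")
    broken_is_valid = True
except UnicodeDecodeError:
    broken_is_valid = False

# Protected: the slice happens on characters, then gets encoded again.
protected = protect_utf8_bytes(naive_truncate)
safe = protected(raw)
```

Input that is not valid UTF-8 is handed to the wrapped function untouched, which mirrors the "pass" branch in the original.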

Humor: bash-3.00$ jj < coffee.cup >

This is my second all-nighter this week (well, technically, I slept for a couple of hours the other day). I had two double espressos yesterday, a cup of coffee last night at 10:30 PM, and I got one this morning at 5:00 AM. I think I realized I had a problem when I discovered that I was irritated that Starbucks wasn't open between 11:00 PM and 4:30 AM.

Wednesday, April 26, 2006

Python: Django Meeting at Google

I've organized a BayPIGies meeting to take place at Google tonight at 7:30 PM. Jacob Kaplan-Moss, one of the lead developers of Django, will be giving a talk on Django. There's more information on the BayPIGies Web site.

Wednesday, April 19, 2006

UNIX: ssh + tar + gzip -q = goodness

To retrieve a hierarchy of files from a remote server (or to copy it back to a remote server), I often do something like:
ssh servername "tar cvzf - dirname" | tar xvfz -
However, I usually get the following error message:
gzip: stdin: decompression OK, trailing garbage ignored
tar: Child returned status 2
tar: Error exit delayed from previous errors
Strangely enough, as I write this, I get the error message copying something from one FreeBSD system to another FreeBSD system, but I don't get it when copying something from one FreeBSD system to my Ubuntu system. Weird.

I put up with this problem for years. However, I recently needed to use it in a Makefile. Having an error like that is fine when you're a human, but a non-zero return code is a deal-breaker in a Makefile. I needed to clean up my act.

One easy way to make the problem go away is to drop the "z" flag from both tar invocations. This is somewhat icky, because it really would be nice to have the content gzipped; without compression, the transfer could take too long.

Finally, I found the real solution on the gzip Web site. Instead of passing the "z" flag to tar when untarring, use gunzip separately and pass the "q" flag to tell it to be quiet:
ssh server "tar cvzf - dirname" | gunzip -q | tar xvf -
By the way, as is standard in UNIX, there are plenty of other variations of this tar + ssh idiom. For instance, consider:
ssh server "cd myapplication/share/locale && 
find . -name '*.po' -o -name '*.mo' |
xargs tar cvzf -" |
gunzip -q | tar xvf -
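
The warning quoted above means that gzip found extra bytes after the end of the compressed stream. The small Python sketch below is separate from the ssh pipeline and uses made-up payload and padding values; it reproduces that condition with the standard zlib module. The decompressor returns the payload intact and leaves the trailing bytes in unused_data. Where the extra bytes come from in the ssh case is not established here.

```python
import gzip
import zlib

payload = b"contents of a tar stream"  # stand-in data, not a real archive
garbage = b"\x00" * 16                 # hypothetical trailing bytes

stream = gzip.compress(payload) + garbage

# MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
recovered = decompressor.decompress(stream)

print(recovered == payload)      # whether the data itself came through
print(decompressor.eof)          # whether the gzip stream ended cleanly
print(decompressor.unused_data)  # whatever followed it: the "trailing garbage"
```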

Wednesday, April 05, 2006

Software Engineering: Professional Software Development

I've written a comprehensive summary of the book "Professional Software Development". You can find the slides here. I highly encourage everyone to take the time to read the slides, as the signal-to-noise ratio is extremely high.

Sunday, April 02, 2006

Hardware: Smarter Memory

I'm not a hardware guy, but it seems to me that it would be really nice if RAM could be a little smarter and implement a few simple commands:
  • memcpy - Copy one area of memory to another.
  • memcmp - Check whether two areas of memory are equal.
  • memzero - Zero out an area of memory.
Now naturally, these commands can't directly be used by applications because of the difficulties of virtual memory. I also wouldn't expect them to be as smart as their C counterparts. However, given some support in the standard library and the kernel, these commands could be very useful optimizations.

Emacs: Syntax Highlighting

It's that time again! Whether it's because I'm drinking coffee and that's causing my obsessive-compulsive nature to do crazy things, or because I'm inspired by other smart programmers who use Emacs, I'm getting "must use Emacs" cravings again. However, as soon as I started it up, the syntax highlighting irritations hit me like a stop sign over the head. This time, it's just a normal Python file. Notice how Emacs doesn't understand that double quotes can be embedded in triple double quotes. By the way, this isn't my code, so don't send me hate mail because there's HTML embedded in Python ;)