Friday, June 29, 2007

Computer Science: Coping with Unknown Types

What do "void *" (a la C), polymorphism (a la C++ classes), interfaces (a la Java), generics (a la C++ templates), and duck typing (a la Python) all have to do with one another? They're all ways in which you can write code that works with types that you didn't envision when writing the code.

A "void *" in C is a pointer to something of unspecified type. You can't do very much with it unless you know what type the something is. However, you can still pass it around. You can store it in a list or tree. You can take it and later pass it back to a callback function. All of these things are useful, and, in fact, this functionality still exists in Java (albeit, it's a lot safer in Java). However, instead of casting to "void *", you cast to "Object".

Polymorphism in languages like C++ and Java let you take an object and call methods on it without necessarily knowing exactly which subclass the object is a member of. Let's suppose there is a class named Fruit with a method named peel, and let's suppose there are two subclasses named Apple and Orange. If you have a list of apples and oranges, you can loop over that list and call peel on each fruit. Even better, if someone later creates a Lemon class, and slips a few lemons into that list, your code will still know how to peel them.

However, what if you don't want to subclass Fruit? What happens if you have an object that knows how to peel itself as well as a ton of other things? Do you need to subclass multiple classes? An interface in Java (or a typeclass in Haskell) lets you say that your object knows how to peel itself, without requiring any specific subclassing. Instead, it can implement some Peelable interface, and that's close enough. Hence, instead of peeling a list of fruit, your code can now peel a list of objects that each implement the Peelable interface. Those objects might not be related at all, and they're free to implement all sorts of interfaces aside from just the Peelable interface.

Generics, which are called templates in Java and C++, let you write code and leave blanks in it that can be filled in later. Generics are an interesting subject, and the question really comes down to what kind of stuff can you leave blank?

Generics in Java are actually pretty weak. It use to be that if you wanted a list, you had to have a list of Objects (remember the "void *" trick?). You didn't know exactly what was in the list. These days, with Java templates, you can tell the compiler that you're creating a list, and that the list can only contain Apples. The list is a template, and you're "filling in the blank" with the type Apple. However, templates in Java are limited. For instance, you can't create a template that says "Create a new [blank]". (If I'm wrong, please leave a comment!)

C++ templates are more powerful. You can do all the same things that you can do in Java, but you can also do things like create a template that says "Create a new [blank]". The differences have to do with how the compiler implements templates. When you tell the compiler that you want to "fill in the blank" with an Apple, i.e. "Create a new Apple", that's called instantiating the template. By the way, this is something that happens at compile time. Now, let's suppose you have a template for lists, and you want a "list of Apple" and a "list of Orange". One way the compiler can implement this is to take the code for list and fill in all the blanks with Apple, then take the code for list and fill in all the blanks with Orange. You'd end up with two slightly different versions of the same code in the compiled binary. I don't know how modern C++ compilers do it, and feel free to call me ignorant, but it really makes me suspicious when I see how big C++ binaries are compared to C binaries ;)

Generics in functional programming languages like Haskell are even more impressive. If a function takes a fruit and then peels it, Haskell can automatically figure out that the function will work with any object that can be peeled. The impressive thing is that it can in many cases automatically infer this interface at compile time! You don't even have to tell the compiler that you're trying to write generic code. (Note, I'm handwaiving a little about when you do and don't need to use type classes.)

Duck typing (also known as latent typing) achieves the same goal, but does so using runtime checks. Hence, if you write a function that takes an object and peels it, you don't need to subclass anything or write an interface. However, at runtime, the interpreter will figure out if the object actually knows how to peel itself. On the one hand, you don't get as many compile-time safety checks, but on the other hand, it's really easy to understand. You can accept whatever objects you want, and call any methods you want, and if it doesn't actually work at runtime, you'll get a nice exception that you can respond to in a controlled manner. There's an old joke that says that C++ is like juggling chainsaws in full body armor, whereas Python is like juggling rubber chickens. Even better, you can do tricks like have the same object respond to any method. For instance, you can call any method on a proxy object, and it will just proxy that method call to the object it is acting as a proxy for. The same proxy object can proxy any object with any interface.

Ok, that was a pretty basic overview of a bunch of related language features. As I said, they're all ways in which you can write code that works with types that you didn't envision when writing the code. Now, take a minute and think about that problem and the many different ways to solve it. If you wrote a new language, how might you solve it differently? Leave me a comment below!

9 comments:

Shannon -jj Behrens said...

Mike Cheponis has been complaining to me lately that he thinks object oriented programming is silly. Mike is a smart guy from MIT, so he got me thinking. In a way, this is my response to him. Hi, Mike!

Paddy3118 said...

In Duck Typing objects only need supply the methods needed when determined at runtime.

To extend your example, if you write a new function that takes an object and peels it if it is ripe or ferments it otherwise, then you could pass the function objects that are ripe and only know how to peel themselves, or conversely, objects that are not ripe and only know how to ferment themselves. If you then say that the objects to the second function are satisfying an interface then it is an interface that can only be determined by running the code.

I only write because you have written a good article but someone from a non-dynamic language background might conclude that Duck-Typing is something that can be covered by a non-dynamic language in its entirety.

- Paddy.

Shannon -jj Behrens said...

Yep.

Kelly Yancey said...

With Python 3's Voluntary Abstract Base Classes a lot of the guesswork involved with detecting sequences and whatnot in the world of latent typing should be relieved. Of course, one could argue that you shouldn't need to detect types in a latently-typed language, but some of us like to catch mismatched argument errors early and with intuitive error messages. :)

I'm looking forward to being able to say isinstance(x, Sequence) rather than my current roundabout incantation hasattr(x, '__getitem__') and not isinstance(x, basestring). (I'm assuming/hoping strings won't inherit from the Sequence ABC).

Shannon -jj Behrens said...

Kelly, I agree with you, although I'm quite sure strings will still inherit from the sequence ABC ;)

Anonymous said...

Don't forget about reflection in Java. You can implement duck typing if you want it, by getting the method via reflection.

Shannon -jj Behrens said...

> Don't forget about...

Yep, point granted. It's definitely not the norm, though.

Eat_My_Shortz said...

OK so on the generics/templates discussion. I'm not really sure what you meant by Java is limited because you can't say "create a new [blank]".

The only limitation in Java versus C++ templates is that C++ templates can hold values (such as "3") OR types, while templates in Java and Haskell and most other languages can only hold types. That's quite a bit safer IMHO!

The main difference between generics and inheritance is that with generics, you can say "this function accepts a T and returns a T", guaranteeing that it returns an object of the same type as you passed in. Whereas with inheritance, there is no guarantee that "Fruit a" and "Fruit b" are both the same type of fruit - a might be an Apple and b might be an Orange.

As far as implementation goes, yes, C++ code which uses templates bloats out quite a bit because when you instantiate a template in C++, the compiler generates a new copy of the code with the template replaced by what you replaced it with. Templates are really just fancy (and safe) macro substitution.

In Java, it was tacked on later. What it really does, behind the scenes, is just a) does compiler type checking for safety, then b) secretly just makes all generic types an Object, and downcasts them as necessary! This means the compiled code is completely unsafe (as if you just used Objects everywhere), but at least it is type-checked at compile time.

Note that this is really just an implementation detail in the Java bytecode. You could probably write a new Java compiler (that didn't use Java bytecode) which handled it differently. Note that if you compiled it to native it would be unsafe anyway.

Shannon -jj Behrens said...

> OK so on the generics/templates discussion.

Hey, thanks for your comments. They're really good :)

> I'm not really sure what you meant by Java is limited because you can't say "create a new [blank]".

I was alluding to a lot of comments Bruce Eckel made on his blog and what he wrote in "Thinking in Java". I don't remember the exact details. Does Java actually let you write a factory function and specify the return type via the templated type?

> The only limitation in Java versus C++ templates is that C++ templates can hold values (such as "3") OR types, while templates in Java and Haskell and most other languages can only hold types. That's quite a bit safer IMHO!

> The main difference between generics and inheritance is that with generics, you can say "this function accepts a T and returns a T", guaranteeing that it returns an object of the same type as you passed in. Whereas with inheritance, there is no guarantee that "Fruit a" and "Fruit b" are both the same type of fruit - a might be an Apple and b might be an Orange.

That's a good point. It seems to me generics are more free form than OO is, especially in C++. In Java, you can use a generic with any class, but if you want to use OO, you have to start putting interfaces in the mix.

> As far as implementation goes, yes, C++ code which uses templates bloats out quite a bit because when you instantiate a template in C++, the compiler generates a new copy of the code with the template replaced by what you replaced it with. Templates are really just fancy (and safe) macro substitution.

Ah, I was right! Thanks for telling me this! I suspect big binaries results in more RAM usage, which is a good explanation for the explosion in RAM usage we're seeing these days.

> In Java, it was tacked on later. What it really does, behind the scenes, is just a) does compiler type checking for safety, then b) secretly just makes all generic types an Object, and downcasts them as necessary!

Yeah, erasure. That's what I was talking about when I said Java templates were weaker than C++ templates.

> This means the compiled code is completely unsafe (as if you just used Objects everywhere), but at least it is type-checked at compile time.

> Note that this is really just an implementation detail in the Java bytecode. You could probably write a new Java compiler (that didn't use Java bytecode) which handled it differently.

Yeah, the way I heard it was that they were trying to maintain binary backward compatibility which is why they had to do it that way. I've also heard that binary backward compatibility was broken for other reasons, so it was a wasted sacrifice.

> Note that if you compiled it to native it would be unsafe anyway.

Hmm.

Thanks again for your comments!