Monday, February 28, 2011

Personal: Looking for a New Job

I'm looking for a new job. I really like my current company, but it's not a very good fit for me for various reasons. Anyway, here's my resume. Here's what I'm looking for:
  • I'd like to find something in San Francisco or the East Bay
  • I'd like to go somewhere where the code is clean and high quality
  • As much as I enjoy startups, I'd really like to find something calm and relaxing
  • I prefer to code in Python or Ruby, but I also really enjoy weird programming languages like Scala, Haskell, and Erlang

Friday, February 25, 2011

Bless: A New Twist on Concurrent Programming

(Disclaimer: I'm just a guy who likes learning new programming languages. I'm not an academic language expert.)

I've been reading Coders at Work lately and really enjoying it. Fran Allen is a woman who won the Turing Award for her work on optimizing compilers. She said something interesting. She said that we've been in a rut since C came along because the pointers in C prevent the compiler from doing a lot of interesting optimizations. She said that programming languages before C were actually more optimizable. She also said something about using offsets rather than pointers so that the compiler could easily move memory around.

It reminded me of Dreaming in Code. Alan Kay said that when it comes to software, we're still building pyramids because we haven't yet discovered the programming version of the arch. Pyramids are heavy, whereas arches can support an amazing amount of weight while not having as much weight themselves.

I remember years ago that Alan Kay's statements made me think of actors in Erlang. Fran Allen's statements also made me think of actors in that an actor could be pointed to by a pointer, whereas all the state in that actor could be represented by offsets of that pointer. This is how you can isolate actors from each other within the same UNIX process. (Of course, you have to make sure the language doesn't provide the equivalent of pointer arithmetic with indexes. That's easy. References are indexes, and you can't set them arbitrarily.)

One thing is clear in "Coders at Work". Everyone is trying to figure out better ways to do concurrency. Currently, the three competing models are threads, actors, and STM (software transactional memory). One lesson that I learned when I read Why Science Fiction So Often Fails to Predict the Future is that life isn't homogeneous. That post pointed out that every form of human transportation is still in use somewhere in the world. Did I walk to work, drive, or take the train? The answer is a little bit of all three. Similarly, I've read that the CISC vs. RISC debate has fizzled because these days chips are CISC on the outside and RISC on the inside. Hence, I've been thinking of ways to use threads, actors, and STM together.

Bless is a new "twist" on concurrent programming. I haven't written it yet--I'm just thinking out loud to get feedback from fellow language enthusiasts. Bless is a little bit like Erlang at a high-level and a little like Python at the mid-level. I haven't decided on the low-level parts yet.

First, let's start with Erlang-style processes. I like to think of them as "green processes" because they form their own "address space", but they're implemented by the VM. However, to avoid confusion, I'm going to call them actors. An actor in Bless is an entity that is implemented by the VM. It has its own VM-level stack and its own VM-level heap. The VM-level stack must not use the C stack. At the C level, an actor can be referenced by a pointer (i.e. a location in memory), but all data within the actor is referenced by offsets to that pointer. Hence, the actor can move around in memory without disturbing the contents.

Each actor has its own green process. Actors can be scheduled onto any number of operating system threads. Erlang is like this.

Each actor can receive asynchronous messages using a queue. I think there are queue implementations that don't require locks; however, I was also thinking of implementing the queue using STM.

All data is local to an actor. The code for an actor looks sort of like a Python module. It has lexically scoped functions that have local variables. Variables global to the actor are kind of like instance variables in object oriented languages. If a module contains the code for an actor, you can run that same code on multiple green processes. For instance, you might use a new actor for every web server connection. Perhaps it makes sense to use the term "module" for the source code for an actor and "actor" for when that module is run on a green process (remember, I say "process" not in the UNIX sense, but rather because each actor has its own data).

Data is not immutable like In Erlang. However, messages between actors are passed by deeping copying the data. I'm thinking of the messages as being JSON-like messages. This is something my buddy Warren DeLano was a real fan of. Immutable data made a lot of sense when Erlang was created because you can implement message passing via passing a pointer to immutable data. However, mutable data is more familiar to most programmers, and if you have 100+ CPUs, you can afford to copy some data around. In a perfect implementation, messages would be contiguous blocks of data, and they could be memcpy'd between green processes. One thing I learned from the Sam Rushing is that you'd be amazed at how fast memcpy is, especially if you have 100+ CPUs to play with.

I know that a lot of functional people are going to be upset with my use of mutable data. That's okay. Most of these ideas still apply if you want to go purely functional. That's cool too.

Actors can block on IO, like in Erlang. (I don't buy Node.JS's argument that everything should be a callback. A little brilliance at the VM level can make everyone else's life easier. Think of how Java frees programmers from the pains of manual memory management.) However, like Erlang, you can use asynchronous IO under the covers.

Bless will need a scheduler in order to schedule actors onto the real kernel threads underneath. This sort of thing is done in Erlang and Scala, so there's nothing particularly mysterious about this. Note that if you have 100+ CPUs, the time slices can be longer which cuts down on the amount of work you need to do for scheduling.

I want Pythonic syntax. I want indentation-based syntax. I know some Python haters will reject Bless because of this, but they'd probably find other reasons to reject Bless anyway. Besides, Python isn't the only language to use indentation. Another one of my favorite languages, Haskell, does it too.

I want blocks. Blocks in Ruby are wonderful. I think the best way to represent a block in Python is "function_call(args) do: ...". The "do:" behaves the same way "if ...:" does. Blocks should follow the same scoping rules that if statements do. I'm not opposed to anonymous functions such as "def (args):".

I want keyword arguments. Python, Common Lisp, and Smalltalk all got this right.

I want variable declarations. I would prefer the keyword "var", but it should behave sort of like "my" in Perl. Variable declarations can help the system catch you if you misspell things, and they also help when you want to set a variable in an outer scope.

There is no "self" in method calls. It's just like functions in a module. However, you can of course "instantiate" the module (i.e. run the actor) multiple times in separate green processes. There's no need for "@" like in Ruby or "self." in Python because it's all lexically scoped.

If I decide to do static typing, I want it to look like ActionScript3, "var a: type". If I decide to do an ML-style type system, the ":type" can be optional.

Although I really like Python syntax, I think I'd be okay with a slightly-lower level language like Java. I.e., I'm not against using a compiler, and I am interested in resolving symbols at compile time. If you think about levels of dynamism, Python, Java, and Haskell all have valid, interesting approaches. I really like the middle ground that ActionScript3 provides. At this point, I care more about the approach to concurrency in Bless than which level of dynamism to use. Certainly, having both a compiler and an interpreter (like Common Lisp, Haskell, and Ocaml) is very appealing.

Compared to how crazy you can get in Python, I'd be okay with sacrificing a little bit of that dynamism for a little more safety, although I'm not a fan of babysitting the compiler like in Java, and I am still a fan of either duck typing or ML-style typing.

By the way, there are other forms of concurrency such as data parallelism, data-flow programming (aka promises and futures), etc. To some degree, I think they are orthogonal--i.e. you can do them in Bless as well, but that's not my focus. It's best to avoid scope creep and second system syndrome.

I think there are a few things I haven't straightened out in my head. I haven't thought about how to reuse functions between actors. Perhaps mixins make sense. I also haven't thought about inheritance.

Notice that if you do it right, you can safely run multiple programs in the same UNIX process using separate actors. However, to do this, you'll need to control how actors are able to obtain references to each other. This would be a nice feature, but it can easily be thrown out if it makes the implementation too complex--there's a reason we have UNIX processes. It may make sense to think of the actors within a UNIX process as all being "friendly", whereas separate UNIX processes are allowed to be somewhat less "friendly" with each other.

That's it. It's basically Erlang-style concurrency with Python syntax and actors with mutable "process-local" state (note, that I know that Erlang has mutable process-local state too). Easy peasy. Anyone know how to implement this thing? ;)

By the way, if I ever did manage to implement it, my next step would be to implement an operating system as I described here. Basically, start with a microkernel and then code all the things like the filesystem in userland using Bless.