Saturday, February 16, 2008

Computer Science: Offset-based Linked Lists

Note, I'm speaking a little loosely, and I'm not a C expert, but I think this post is still interesting nonetheless.

There are many ways to "architect" and implement lists.

These days, it's the norm in languages like Java, Python, etc. for the language or language library itself to provide a list implementation. You simply create a list and then start shoving objects into it. Python, Ruby, Perl and PHP all provide native syntax for lists. The algorithm used is more sophisticated than a singly or doubly-linked list because lists must double as arrays in those languages. Java and C++ have native arrays and array syntax, but they also provide a variety of list implementations in their libraries. Both can use generics to constrain the type of objects you put in those lists. One thing to note, however, is that there's a distinction between the list itself and the items in the list.

Since C's standard library doesn't provide a list implementation, a variety of approaches are used. It's not uncommon to create structs that have prev and next pointers in them and to manage the list manually. In general, it seems common in C to create data structures manually and to simply manage the data structure as part of the overall programming task. This is in stark contrast to, say, Python where the list implementation is in C and the code using the list is in Python.

My buddy Kelly Yancey once showed me that FreeBSD had macros so that programmers wouldn't have to keep reimplementing linked lists all the time in the kernel. I think macros were used instead of functions so that the code behaved as if it were actually written inline, thus avoiding the function call overhead--but I could be wrong.

At Metaweb, I had a buddy named Scott Meyer who use to work at Oracle. He showed me a pretty interesting trick for creating linked lists. As before, he had structs which contained application data as well as next and prev pointers for managing the structs in a linked list. However, rather than manipulate those linked lists manually, he wrote separate, generic linked list functions. The question is: how can a generic linked list function operate on random structs (i.e. void *'s) that just happen to have prev and next fields somewhere within them?

For each type of struct, he would create a "descriptor" struct that contained the offset of the prev and next fields. He would pass the descriptor to the function anytime he needed to manipulate the linked list. Loosely speaking, it's like saying, "Hey, I got this linked list, and I want you to insert a new member into it. You might not know anything about the type of structs in the linked list, but I can tell you that the prev pointer is 16 bytes from the beginning of the struct, and the next pointer is 20 bytes from the beginning."

I think the lesson is deep. If you want to write a function that operates generically on structs that you know have certain fields, you just have to tell that function the offset of those fields. Naturally, you don't want to just pass the offsets. Rather, you pack them up into a struct. Creating such a struct is like declaring that you implement some interface. I had seen how structs full of function pointers doubled as interfaces in Apache, but structs full of offsets was something entirely new for me.



Ian Bicking said...

Alan Kay described what he considers the first object programming similarly. I can't remember enough keywords to look it up. But basically it was in filesystems -- people wrote different filesystems for tapes, but this caused problems as new filesystems appeared or old filesystems were forgotten about. So some unnamed programmer made a system where the tape contained the code to access the records on the tape at known offsets. So, say, the get-record routine was at offset 0 (or maybe a pointer? Or maybe enough room to just jump to where the routine was held), set-record at offset 10, etc.

Similarly the linked list routines are just one step away from being a dynamically typed system -- if you put the struct describing offsets in the original struct itself then you'd have a type.

Game_Ender said...

This is off topic, but that wouldn't happen to be the Scott Meyers of Effective C++ fame would it?

mike barton said...

Huh... I've done similar things in the past, but my generic list functions mostly take a next pointer offset, which is computed with a macro. Something like:

#define NEXTOFFSET(list) ((unsigned long)((list)->next) - (unsigned long)(list))

list = list_sort(list, NEXTOFFSET(list), comparison_function);

I guess building a framework to organize that isn't too bad an idea.

Dave said...

The entire Amiga operating system was built around a linked list structure - every object started with the linked list struct, and the OS provided functions for navigating round them, even from assembler. Everything else was referenced by offsets to one of these structs - by convention data was at positive offsets and pointers to functions were at negative offsets. This enabled consistent object oriented programming, even from assembler.

In many ways the Amiga was way ahead of its time.

Kelly Yancey said...

Hey JJ, just wanted to add that I've seen versions of the BSD queue.h macros that used offsets from a known base rather than pointers for the next/previous links. The advantage is that you can then put the linked objects in a shared memory segment (the segment may have a different base address in each process).

There was a proposal to add them to FreeBSD back in 2002 which includes the source code.

Shannon -jj Behrens said...

Ian, thanks for your comment.

Shannon -jj Behrens said...

> This is off topic, but that wouldn't happen to be the Scott Meyers of Effective C++ fame would it?

Embarrassingly, I don't know. I don't know him all that well. I do know he's a brilliant hacker, and he's the manager of the graph database team at Metaweb.

Shannon -jj Behrens said...

Wow, thanks for the great comments, everyone! I was taking a risk posting something outside my realm of experience, but, man, what interesting responses!

Xavier said...

Yet another case of "why I didn't I think of it before?". A very simple approach to take advantage of obvious commonalities in existing code. .

I thnk that the Scott Meyers of Effective C++ is into training and consulting

Shannon -jj Behrens said...

> I thnk that the Scott Meyers of Effective C++ is into training and consulting

Ah, not the same guy.

Anonymous said...

Interesting that you mention macros, and passing offsets... However, if you consistently name the next/prev pointers, you can combine the two into a macro that calls the underlying function with the offsets for the specific struct. For example:

#define LINKSHOW(object) linkshow(((void *) - ((void *) object), \
((void *) object.prev) - ((void *) object), object)

so you can just call "LINKSHOW(object)" which then calls linkshow(nextOffset, prevOffset, object)

The magic of macros effectively implement C++-like code templates.


Shannon -jj Behrens said...

> #define LINKSHOW(object) linkshow(((void *) - ((void *) object), \
((void *) object.prev) - ((void *) object), object)

Very nice ;)

George Reilly said...

Windows, internally, has used the same idea for many years. Look at CONTAINING_RECORD and LIST_ENTRY, which are defined in ntdef.h. LIST_ENTRYs are used heavily in the Windows source code, though it's only exposed to general developers in driver code.