Skip to main content

Software Engineering: Coping When You Must Repeat Yourself

These days, most software engineers are familiar with the acronym "DRY" which stands for "don't repeat yourself". In general, it's good advice. In a perfect world, no code would ever be duplicated and every piece of "truth" would only exist in one place. However, the real world isn't quite so perfect, and things are far less DRY than you might imagine. The question is, how do you cope?

First let me show you some reasons why you can't always keep it DRY:

Often, the same truth is duplicated in the code, the tests, and the docs. Duplicating the same piece of truth in the code and in the tests helps each verify the other. Generally, the code is more general than the tests (the tests verify examples of running the more general code), but the duplication is there. When you update one (for instance to change the API), you'll need to update the other. This is a case where not keeping it DRY pays off--if you have to update the tests, that's a reminder that you'll also have to update all the other consumers of your API.

Similarly, the API docs often duplicate the truth that is built in to the code. That's because it's helpful to explain the truth to the computer in one way (using very precise code) and explain the truth to the reader in another way (using friendly, high-level English). Every truthful comment duplicates what the code already says, but not every piece of code is easily and quickly readable by human readers--this is especially true in, say, assembly.

Another area where truth is duplicated is in APIs. The function defines a name and an API. The caller uses it. They must agree on these things or the code won't work. If the caller decides to use a different name or a different API, the code will break. Essentially, programmers have decided that it's better to duplicate the name and the API rather than duplicate the contents of the function. This points to a useful trick--sometimes a small amount of duplication saves a large amount of duplication. You'll also see this sometimes in comments when they say "see also..."

Another source of duplication concerns public vs. private. For instance, in C, the same API is duplicated in the .h file and the .c file. Sometimes, the same piece of code must be duplicated in different projects. For instance, one operating system might need to define the same C types as another operating system because there's no easy way for them to share the same header files.

At a higher level, one time I had to add the same function I wrote to two projects. One project was proprietary company code. The other was open source (I had permission, of course). For technical reasons, it was impractical for the company code to import or subclass the open source code, so I was stuck just duplicating it.

Often, you'll need to duplicate the same piece of truth in multiple languages. For instance, think of how many HTTP client libraries there are in all the different programming languages. It doesn't matter how good an HTTP client library is if it's not easily accessible from the programming language I'm currently coding in. Sometimes there will be multiple HTTP client libraries for the same language because they're implemented differently (for instance, syncronously vs. asyncronously).

I mentioned tests before. Often tests duplicate some setup or teardown or perhaps the same pattern of interacting with a function. Refactoring is sometimes appropriate, but not always. It is commonly held that this is one area where keeping it DRY is less important than keeping it simple and isolated. A perfectly DRY collection of unittests that is difficult to comprehend and difficult to debug when something fails is less helpful than a set of simple, isolated unittests that contain a small amount of duplication. If the duplication causes multiple tests to fail, you'll know to keep fixing the tests until they all pass.

The question remains, how do you cope when you can't keep it DRY?

Greppability is very important. (By grep, I mean any tool that can search for a string or regular expression. I don't necessarily mean the UNIX tool "grep".) In highly dynamic languages like Ruby (that have great facilities for metaprogramming, but no static typing or interfaces) and highly factored frameworks like Rails (that use lots of files and levels of indirection), even a brilliant IDE can fail in comparison to a simple "grep tour". If you refactor a class in Ruby, how will you remember to refactor all the mocks of that class? You might have a user of your class that has a mock of your class that still makes use of your old API. The tests might be passing even though the code will assuredly crash. If you use grep, you can update all the callers of your class as well as all the mocks of the class. Grep can also help you find instances of a string in non-code documentation, and it even occasionally works with binary files. My point is, don't underestimate the utility of grep. Rather, you should aim for greppability. A function named "f" is not greppable, but a function named "calculate_apr" is. (By the way, naming all your loop variables "iterator" does not improve greppability, it just wastes time.)

Another way of coping when things aren't DRY is to have cross referencing comments. If you know that you must duplicate the same piece of truth in 5 places, add a comment next to each of those 5 places that refers to the other 5 places. Don't be afraid to duplicate the comment exactly. Your comment can say something like, "If you change this, don't forget to update..."

Another thing that helps mitigate duplication is proximity. Docstrings belong in the code because if a programmer updates one, he'll be more likely to update the other (although even proximity can't always help lazy programmers). If all the API documentation is in a separate file, that file will go stale very quickly.

Parallelization also helps. For instance, this code has a small amount of duplication:
 some_a = 1
some_a.invoke_method()
register(some_a)
call_something_unrelated()
some_b = 2
some_b.invoke_method()
register(some_b)
Sometimes you can factor out this duplication. However, in less dynamic languages like C, it may not always be easy to do so. However, parallelization can really help:
 some_a = 1
some_b = 2
some_a.invoke_method()
some_b.invoke_method()
register(some_a)
register(some_b)
call_something_unrelated()
Another old trick for coping with duplication is to have one source generate the other. Generating API documentation using javadoc is a good example of this. Sometimes you can use a program to generate code for multiple programming languages. There's another example of "generation" that I sometimes use in Python. I use string interpolation when creating docstrings. For instance, if there's a piece of documentation that should be duplicated in multiple places, string interpolation makes it possible so that I only have to write that piece of documentation once.

Another source of duplication has to deal with the plethora of tools programmers must use. There is the source code itself, a revision control system, a bug tracker, and a wiki. Often times, the same piece of truth needs to be duplicated in all of these places. This is one place where Trac really shines. Once you properly configure Trac, you can reference the bug number in each of your commits. Trac's commit hook will take that commit and add it as a comment in the original bug with a reference to the source code in Trac's source code viewer. Hence, Trac (which is a bug tracking system, a wiki, and a source code viewer) and the revision control system work together to reduce duplication.

It's unfortunate that life isn't always as DRY as you'd like it to be. However, keeping a few tricks in mind can really help mitigate the problems caused by having to duplicate a piece of truth in more than once place. If you have other tricks, feel free to leave them in a comment below.

Comments

Popular posts from this blog

Drawing Sierpinski's Triangle in Minecraft Using Python

In his keynote at PyCon, Eben Upton, the Executive Director of the Rasberry Pi Foundation, mentioned that not only has Minecraft been ported to the Rasberry Pi, but you can even control it with Python. Since four of my kids are avid Minecraft fans, I figured this might be a good time to teach them to program using Python. So I started yesterday with the goal of programming something cool for Minecraft and then showing it off at the San Francisco Python Meetup in the evening.

The first problem that I faced was that I didn't have a Rasberry Pi. You can't hack Minecraft by just installing the Minecraft client. Speaking of which, I didn't have the Minecraft client installed either ;) My kids always play it on their Nexus 7s. I found an open source Minecraft server called Bukkit that "provides the means to extend the popular Minecraft multiplayer server." Then I found a plugin called RaspberryJuice that implements a subset of the Minecraft Pi modding API for Bukkit s…

Creating Windows 10 Boot Media for a Lenovo Thinkpad T410 Using Only a Mac and a Linux Machine

TL;DR: Giovanni and I struggled trying to get Windows 10 installed on the Lenovo Thinkpad T410. We struggled a lot trying to create the installation media because we only had a Mac and a Linux machine to work with. Everytime we tried to boot the USB thumb drive, it just showed us a blinking cursor. At the end, we finally realized that Windows 10 wasn't supported on this laptop :-/I've heard that it took Thomas Edison 100 tries to figure out the right material to use as a lightbulb filament. Well, I'm no Thomas Edison, but I thought it might be noteworthy to document our attempts at getting it to boot off a USB thumb drive:Download the ISO. Attempt 1: Use Etcher. Etcher says it doesn't work for Windows. Attempt 2: Use Boot Camp Assistant. It doesn't have that feature anymore. Attempt 3: Use Disk Utility on a Mac. Erase a USB thumb drive: Format: ExFAT Scheme: GUID Partition Map Mount the ISO. Copy everything from the I…

ERNOS: Erlang Networked Operating System

I've been reading Dreaming in Code lately, and I really like it. If you're not a dreamer, you may safely skip the rest of this post ;)

In Chapter 10, "Engineers and Artists", Alan Kay, John Backus, and Jaron Lanier really got me thinking. I've also been thinking a lot about Minix 3, Erlang, and the original Lisp machine. The ideas are beginning to synthesize into something cohesive--more than just the sum of their parts.

Now, I'm sure that many of these ideas have already been envisioned within Tunes.org, LLVM, Microsoft's Singularity project, or in some other place that I haven't managed to discover or fully read, but I'm going to blog them anyway.

Rather than wax philosophical, let me just dump out some ideas:Start with Minix 3. It's a new microkernel, and it's meant for real use, unlike the original Minix. "This new OS is extremely small, with the part that runs in kernel mode under 4000 lines of executable code." I bet it&…