
Software Engineering: Coping When You Must Repeat Yourself

These days, most software engineers are familiar with the acronym "DRY", which stands for "don't repeat yourself". In general, it's good advice. In a perfect world, no code would ever be duplicated and every piece of "truth" would exist in only one place. However, the real world isn't quite so perfect, and things are far less DRY than you might imagine. The question is, how do you cope?

First, let me show you some reasons why you can't always keep it DRY:

Often, the same truth is duplicated in the code, the tests, and the docs. Duplicating the same piece of truth in the code and in the tests helps each verify the other. Generally, the code is more general than the tests (the tests verify examples of running the more general code), but the duplication is there. When you update one (for instance to change the API), you'll need to update the other. This is a case where not keeping it DRY pays off--if you have to update the tests, that's a reminder that you'll also have to update all the other consumers of your API.

Similarly, the API docs often duplicate the truth that is built into the code. That's because it's helpful to explain the truth to the computer in one way (using very precise code) and explain the truth to the reader in another way (using friendly, high-level English). Every truthful comment duplicates what the code already says, but not every piece of code is easily and quickly readable by human readers--this is especially true in, say, assembly.

Another area where truth is duplicated is in APIs. The function defines a name and an API. The caller uses it. They must agree on these things or the code won't work. If the caller decides to use a different name or a different API, the code will break. Essentially, programmers have decided that it's better to duplicate the name and the API rather than duplicate the contents of the function. This points to a useful trick--sometimes a small amount of duplication saves a large amount of duplication. You'll also see this sometimes in comments when they say "see also..."

Another source of duplication concerns public vs. private. For instance, in C, the same function signature appears in both the .h file and the .c file. Sometimes, the same piece of code must be duplicated in different projects. For instance, one operating system might need to define the same C types as another operating system because there's no easy way for them to share the same header files.

At a higher level, one time I had to add the same function I wrote to two projects. One project was proprietary company code. The other was open source (I had permission, of course). For technical reasons, it was impractical for the company code to import or subclass the open source code, so I was stuck just duplicating it.

Often, you'll need to duplicate the same piece of truth in multiple languages. For instance, think of how many HTTP client libraries there are in all the different programming languages. It doesn't matter how good an HTTP client library is if it's not easily accessible from the programming language I'm currently coding in. Sometimes there will be multiple HTTP client libraries for the same language because they're implemented differently (for instance, synchronously vs. asynchronously).

I mentioned tests before. Often tests duplicate some setup or teardown or perhaps the same pattern of interacting with a function. Refactoring is sometimes appropriate, but not always. It is commonly held that this is one area where keeping it DRY is less important than keeping it simple and isolated. A perfectly DRY collection of unit tests that is difficult to comprehend and difficult to debug when something fails is less helpful than a set of simple, isolated unit tests that contain a small amount of duplication. If the duplication causes multiple tests to fail, you'll know to keep fixing the tests until they all pass.
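
Here's a minimal sketch of what simple, isolated tests with a little duplication can look like in Python. The calculate_apr function and its formula are made up for illustration; the point is only that each test repeats its own setup rather than sharing it through a helper.

```python
import unittest


def calculate_apr(interest, principal, days):
    """Return a simple annual percentage rate (hypothetical formula)."""
    return interest * 365 * 100 / (principal * days)


class CalculateAprTest(unittest.TestCase):
    # Each test repeats its own setup so it can be read and debugged alone.

    def test_full_year(self):
        principal = 1000.0
        interest = 50.0
        apr = calculate_apr(interest, principal, days=365)
        self.assertAlmostEqual(apr, 5.0)

    def test_half_year(self):
        principal = 1000.0
        interest = 50.0
        apr = calculate_apr(interest, principal, days=182.5)
        self.assertAlmostEqual(apr, 10.0)
```

If this were saved in a file, running "python -m unittest" against it would pick up both tests. The two setup lines are duplicated on purpose: when one test fails, everything needed to understand it is in the test itself.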

The question remains, how do you cope when you can't keep it DRY?

Greppability is very important. (By grep, I mean any tool that can search for a string or regular expression. I don't necessarily mean the UNIX tool "grep".) In highly dynamic languages like Ruby (that have great facilities for metaprogramming, but no static typing or interfaces) and highly factored frameworks like Rails (that use lots of files and levels of indirection), even a brilliant IDE can pale in comparison to a simple "grep tour". If you refactor a class in Ruby, how will you remember to refactor all the mocks of that class? You might have a user of your class that has a mock of your class that still makes use of your old API. The tests might be passing even though the code will assuredly crash. If you use grep, you can find and update all the callers of your class as well as all the mocks of the class. Grep can also help you find instances of a string in non-code documentation, and it even occasionally works with binary files. My point is, don't underestimate the utility of grep. Rather, you should aim for greppability. A function named "f" is not greppable, but a function named "calculate_apr" is. (By the way, naming all your loop variables "iterator" does not improve greppability; it just wastes time.)
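
Since "grep" here means any search tool, a "grep tour" can be sketched in a few lines of Python. This is only an illustration: the grep_tour name, the list of file suffixes, and the choice to skip other file types are all assumptions, and a real grep will be faster and more flexible.

```python
import re
from pathlib import Path


def grep_tour(root, pattern, suffixes=(".py", ".rb", ".txt", ".md")):
    """Yield (path, line_number, line) for each line under root matching pattern."""
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # errors="replace" keeps the search going past undecodable bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path, line_number, line.strip()
```

Looping over grep_tour(".", r"\bcalculate_apr\b") would visit the definition, the callers, the mocks, and any docs that mention the name, which is exactly the list you need before changing an API. The same search for a name like "f" would match far too much to be useful.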

Another way of coping when things aren't DRY is to have cross-referencing comments. If you know that you must duplicate the same piece of truth in 5 places, add a comment next to each of those 5 places that refers to the other 4 places. Don't be afraid to duplicate the comment exactly. Your comment can say something like, "If you change this, don't forget to update..."
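
As a small sketch of cross-referencing comments, here is one limit that has to appear both as a Python constant and inside a SQL fragment. All of the names (MAX_USERNAME_LENGTH, USERNAME_COLUMN_SQL, is_valid_username) are hypothetical.

```python
# NOTE: This limit is duplicated. If you change this, don't forget to
# update the VARCHAR size in USERNAME_COLUMN_SQL below.
MAX_USERNAME_LENGTH = 32

# NOTE: This limit is duplicated. If you change this, don't forget to
# update MAX_USERNAME_LENGTH above.
USERNAME_COLUMN_SQL = "username VARCHAR(32) NOT NULL"


def is_valid_username(name):
    """Return True if name is non-empty and fits in the username column."""
    return 0 < len(name) <= MAX_USERNAME_LENGTH
```

The comments are nearly identical on purpose. Whichever copy a maintainer finds first, the comment points at the other one, and the shared wording is itself greppable.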

Another thing that helps mitigate duplication is proximity. Docstrings belong in the code because a programmer who updates the code is more likely to update a docstring sitting right next to it (although even proximity can't always help lazy programmers). If all the API documentation is in a separate file, that file will go stale very quickly.
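
One way to get even more out of proximity in Python is to put a runnable example inside the docstring, so the documentation sits next to the code and can be checked against it. This is a sketch using the standard doctest module; the calculate_apr function and its formula are hypothetical.

```python
def calculate_apr(interest, principal, days):
    """Return a simple annual percentage rate, as a percentage.

    The example below lives next to the code it describes, so it is
    hard to edit one without seeing the other.

    >>> calculate_apr(50.0, 1000.0, 365)
    5.0
    """
    return interest * 365 * 100 / (principal * days)
```

Running "python -m doctest" on a file containing this function executes the example and reports a failure if the docstring and the code have drifted apart.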

Parallelization also helps. For instance, this code has a small amount of duplication:

    some_a = 1
    some_a.invoke_method()
    register(some_a)
    call_something_unrelated()
    some_b = 2
    some_b.invoke_method()
    register(some_b)

Sometimes you can factor out this duplication. In less dynamic languages like C, though, that may not be easy. In that case, parallelization can really help:

    some_a = 1
    some_b = 2
    some_a.invoke_method()
    some_b.invoke_method()
    register(some_a)
    register(some_b)
    call_something_unrelated()

With the duplicated lines sitting side by side, it's much easier to notice when one copy changes and the other doesn't.

Another old trick for coping with duplication is to have one source generate the other. Generating API documentation using javadoc is a good example of this. Sometimes you can use a program to generate code for multiple programming languages. There's another example of "generation" that I sometimes use in Python: string interpolation when creating docstrings. If a piece of documentation should appear in multiple places, string interpolation lets me write it once and insert it everywhere it's needed.
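
Here is one possible way to do the docstring trick, not necessarily the only or best one. A plain docstring can't contain a formatting expression, so this sketch fills in the placeholders with a small decorator. SHARED_DOCS, with_shared_docs, fetch, and post are all hypothetical names.

```python
# Documentation fragments that should appear in more than one docstring.
SHARED_DOCS = {
    "timeout_doc": "timeout -- seconds to wait before giving up.",
}


def with_shared_docs(func):
    """Fill %(name)s placeholders in func's docstring from SHARED_DOCS."""
    if func.__doc__ is not None:  # docstrings are stripped under python -OO
        func.__doc__ = func.__doc__ % SHARED_DOCS
    return func


@with_shared_docs
def fetch(url, timeout=30):
    """Pretend to fetch url.

    %(timeout_doc)s
    """
    return f"fetched {url} within {timeout}s"


@with_shared_docs
def post(url, body, timeout=30):
    """Pretend to post body to url.

    %(timeout_doc)s
    """
    return f"posted {len(body)} bytes to {url} within {timeout}s"
```

After this runs, help(fetch) and help(post) both show the same timeout description, but it is written only once. One caveat of %-style interpolation is that any literal percent sign in such a docstring has to be written as "%%".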

Another source of duplication has to do with the plethora of tools programmers must use. There is the source code itself, a revision control system, a bug tracker, and a wiki. Often, the same piece of truth needs to be duplicated in all of these places. This is one place where Trac really shines. Once you properly configure Trac, you can reference the bug number in each of your commits. Trac's commit hook will take that commit and add it as a comment in the original bug with a reference to the source code in Trac's source code viewer. Hence, Trac (which is a bug tracking system, a wiki, and a source code viewer) and the revision control system work together to reduce duplication.

It's unfortunate that life isn't always as DRY as you'd like it to be. However, keeping a few tricks in mind can really help mitigate the problems caused by having to duplicate a piece of truth in more than one place. If you have other tricks, feel free to leave them in a comment below.
