
Software Engineering: Coping When You Must Repeat Yourself

These days, most software engineers are familiar with the acronym "DRY", which stands for "don't repeat yourself". In general, it's good advice. In a perfect world, no code would ever be duplicated, and every piece of "truth" would exist in only one place. However, the real world isn't quite so perfect, and things are far less DRY than you might imagine. The question is, how do you cope?

First let me show you some reasons why you can't always keep it DRY:

Often, the same truth is duplicated in the code, the tests, and the docs. Duplicating the same piece of truth in the code and in the tests helps each verify the other. Generally, the code is more general than the tests (the tests verify examples of running the more general code), but the duplication is there. When you update one (for instance to change the API), you'll need to update the other. This is a case where not keeping it DRY pays off--if you have to update the tests, that's a reminder that you'll also have to update all the other consumers of your API.
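To make that concrete, here's a small sketch (the function and its numbers are hypothetical): the code states the truth in general form, and the test restates the same truth as one specific example.

```python
# The code states the truth in its general form.
def monthly_rate(apr):
    """Convert an annual percentage rate to a monthly rate."""
    return apr / 12

# The test duplicates that truth as a concrete example. If the API changes,
# this duplication forces the test to change too--a useful reminder that
# other callers will need updating as well.
def test_monthly_rate():
    assert abs(monthly_rate(0.12) - 0.01) < 1e-9

test_monthly_rate()
```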

Similarly, the API docs often duplicate the truth that is built into the code. That's because it's helpful to explain the truth to the computer in one way (using very precise code) and explain the truth to the reader in another way (using friendly, high-level English). Every truthful comment duplicates what the code already says, but not every piece of code is easily and quickly readable by human readers--this is especially true in, say, assembly.

Another area where truth is duplicated is in APIs. The function defines a name and an API. The caller uses it. They must agree on these things or the code won't work. If the caller decides to use a different name or a different API, the code will break. Essentially, programmers have decided that it's better to duplicate the name and the API rather than duplicate the contents of the function. This points to a useful trick--sometimes a small amount of duplication saves a large amount of duplication. You'll also see this sometimes in comments when they say "see also..."

Another source of duplication concerns public vs. private. For instance, in C, the same API is duplicated in the .h file and the .c file. Sometimes, the same piece of code must be duplicated in different projects. For instance, one operating system might need to define the same C types as another operating system because there's no easy way for them to share the same header files.

At a higher level, one time I had to add the same function I wrote to two projects. One project was proprietary company code. The other was open source (I had permission, of course). For technical reasons, it was impractical for the company code to import or subclass the open source code, so I was stuck just duplicating it.

Often, you'll need to duplicate the same piece of truth in multiple languages. For instance, think of how many HTTP client libraries there are in all the different programming languages. It doesn't matter how good an HTTP client library is if it's not easily accessible from the programming language I'm currently coding in. Sometimes there will be multiple HTTP client libraries for the same language because they're implemented differently (for instance, synchronously vs. asynchronously).

I mentioned tests before. Often tests duplicate some setup or teardown, or perhaps the same pattern of interacting with a function. Refactoring is sometimes appropriate, but not always. It is commonly held that this is one area where keeping it DRY is less important than keeping it simple and isolated. A perfectly DRY collection of unit tests that is difficult to comprehend and difficult to debug when something fails is less helpful than a set of simple, isolated unit tests that contain a small amount of duplication. If the duplication causes multiple tests to fail, you'll know to keep fixing the tests until they all pass.
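Here's a sketch of what that trade-off can look like (the cart example is made up). Each test repeats a couple of setup lines, and that duplication buys simplicity: any single failure can be read and debugged in isolation.

```python
import unittest

class TestCart(unittest.TestCase):
    def test_add_item(self):
        cart = []               # duplicated setup, kept deliberately
        cart.append("apple")
        self.assertEqual(len(cart), 1)

    def test_clear(self):
        cart = []               # same duplicated setup
        cart.append("apple")
        cart.clear()
        self.assertEqual(cart, [])

if __name__ == "__main__":
    unittest.main()
```

A shared setUp() would be DRYer, but with only two lines of setup per test, each test reads top to bottom without jumping around the file.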

The question remains, how do you cope when you can't keep it DRY?

Greppability is very important. (By grep, I mean any tool that can search for a string or regular expression. I don't necessarily mean the UNIX tool "grep".) In highly dynamic languages like Ruby (that have great facilities for metaprogramming, but no static typing or interfaces) and highly factored frameworks like Rails (that use lots of files and levels of indirection), even a brilliant IDE can pale in comparison to a simple "grep tour". If you refactor a class in Ruby, how will you remember to refactor all the mocks of that class? A user of your class might have a mock of it that still uses your old API. The tests might be passing even though the code will assuredly crash. If you use grep, you can update all the callers of your class as well as all the mocks of the class. Grep can also help you find instances of a string in non-code documentation, and it even occasionally works with binary files. My point is, don't underestimate the utility of grep. Rather, you should aim for greppability. A function named "f" is not greppable, but a function named "calculate_apr" is. (By the way, naming all your loop variables "iterator" does not improve greppability, it just wastes time.)
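That stale-mock trap also exists in Python. Here's a minimal sketch using unittest.mock (the class and method names are my own invention): Mock happily fabricates any attribute, so a test written against a renamed API stays green.

```python
from unittest import mock

class RateClient:
    # Suppose this method was recently renamed from calc() to calculate_apr().
    def calculate_apr(self, principal):
        return principal * 0.05

# A stale mock written against the old API. Mock auto-creates attributes,
# so the test keeps passing even though real callers of calc() would crash.
client = mock.Mock()
client.calc.return_value = 42
assert client.calc(1000) == 42  # green test, stale truth
```

Using mock.create_autospec(RateClient) would make the stale call fail at test time, but a grep for the old name is what finds every stale reference--mocks, callers, and docs alike.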

Another way of coping when things aren't DRY is to have cross-referencing comments. If you know that you must duplicate the same piece of truth in 5 places, add a comment next to each of those 5 places that refers to the other 4. Don't be afraid to duplicate the comment exactly. Your comment can say something like, "If you change this, don't forget to update..."
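For example (the file names and the format list are hypothetical), two copies of the same truth, each pointing at the other:

```python
# settings.py
# NOTE: this list is duplicated in exporter.py and docs/formats.md.
# If you change it here, don't forget to update those places too.
SUPPORTED_FORMATS = ["csv", "json", "xml"]

# exporter.py
# NOTE: this list is duplicated in settings.py and docs/formats.md.
# If you change it here, don't forget to update those places too.
EXPORTER_FORMATS = ["csv", "json", "xml"]
```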

Another thing that helps mitigate duplication is proximity. Docstrings belong in the code because if a programmer updates one, he'll be more likely to update the other (although even proximity can't always help lazy programmers). If all the API documentation is in a separate file, that file will go stale very quickly.

Parallelization also helps. For instance, this code has a small amount of duplication:
some_a = 1
some_a.invoke_method()
register(some_a)
call_something_unrelated()
some_b = 2
some_b.invoke_method()
register(some_b)
Sometimes you can factor out this duplication. In less dynamic languages like C, however, that isn't always easy. Parallelization can really help:
some_a = 1
some_b = 2
some_a.invoke_method()
some_b.invoke_method()
register(some_a)
register(some_b)
call_something_unrelated()
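And when the language does make factoring easy, the duplicated assign/invoke/register triple can collapse entirely. A sketch in Python (all names are hypothetical, with stubs included so the snippet runs):

```python
class Thing:
    def __init__(self, value):
        self.value = value
        self.invoked = False

    def invoke_method(self):
        self.invoked = True

registry = []

def register(thing):
    registry.append(thing)

def call_something_unrelated():
    pass

# The duplicated triple becomes one loop over the values.
for value in (1, 2):
    thing = Thing(value)
    thing.invoke_method()
    register(thing)

call_something_unrelated()
```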
Another old trick for coping with duplication is to have one source generate the other. Generating API documentation using javadoc is a good example of this. Sometimes you can use a program to generate code for multiple programming languages. There's another example of "generation" that I sometimes use in Python: string interpolation when creating docstrings. For instance, if there's a piece of documentation that should be duplicated in multiple places, string interpolation lets me write that piece of documentation only once.

Another source of duplication has to do with the plethora of tools programmers must use. There is the source code itself, a revision control system, a bug tracker, and a wiki. Oftentimes, the same piece of truth needs to be duplicated in all of these places. This is one place where Trac really shines. Once you properly configure Trac, you can reference the bug number in each of your commits. Trac's commit hook will take that commit and add it as a comment on the original bug, with a reference to the source code in Trac's source code viewer. Hence, Trac (which is a bug tracker, a wiki, and a source code viewer) and the revision control system work together to reduce duplication.

It's unfortunate that life isn't always as DRY as you'd like it to be. However, keeping a few tricks in mind can really help mitigate the problems caused by having to duplicate a piece of truth in more than one place. If you have other tricks, feel free to leave them in a comment below.
