Tuesday, May 27, 2008

Better than simian-check

Simian is great for helping to maintain code quality and enforce the DRY principle, especially in a greenfield project. It is less helpful in a brownfield project, but it can still help to implement the idea of "making things no worse" there. However, its goodness extends in only one dimension of a two-dimensional problem.

A team can mandate that the number of duplications within the codebase shall be no more than some threshold. That's one dimension. But what about duplications that already occur below the threshold? simian-check does nothing to ensure that the level of duplication below the threshold doesn't get worse over time. Simian is good for legacy codebases, but there is plenty of room for improvement.

This is something I've been mildly aware of for several years now, but I hadn't actively tried to find a way to solve it until recently. I had some slack time and was feverishly refactoring our legacy codebase to reduce our simian-check threshold another notch. Having reduced the count by one, I took a quick survey of what would be required to reduce it by yet another notch. At that moment I half panicked and half had a flash of insight (most likely I had the insight, quickly followed by a brief moment of terror).

If you've never had to play in a large legacy codebase, you may never have run into the situation I had just found myself in. Since simian had not been running on this system, there was a lot of duplication. The more the threshold was lowered for the simian-check, the more work (i.e. more duplications to be removed) was required to lower the threshold one more notch. And the increase in work is more than linear.

My insight was this: I had a finite amount of time in which to lower the duplication count. If I couldn't do all of the work within that time, the threshold would remain the same. It was quite possible that I could remove half the duplications and then return sometime later and find that I had double the work to do. While that's possible, it's not very probable. Still, I had seen the second dimension to the simian-check, and my fear from that initial thought set the wheels in my head in motion.

The solution is very simple. The simian-report is an XML file, so I wrote a SAX2 DefaultHandler that was able to parse the number of duplications at the different threshold levels. Putting this into a trivial ant task then gave us a task to help make things no worse even at levels below what the simian-check was doing! Within the first week, the new legacy-check was breaking the build (where the simian-check never would have) and focusing the team's attention on how to make things better.
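For a rough idea of what such a handler might look like, here is a minimal sketch. It is not the actual code from our build. It assumes the report contains one "set" element per unique duplication, carrying a "lineCount" attribute, with one nested "block" element per occurrence; check those names against the report your version of Simian produces. The class name DuplicationCounter is made up for illustration.

```java
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: tallies duplications per line-count level. */
class DuplicationCounter extends DefaultHandler {
    // level (lines per duplicated block) -> {unique duplications, total occurrences}
    final Map<Integer, int[]> levels = new TreeMap<>();
    private int[] current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("set".equals(qName)) {
            // Assumed attribute name; a missing attribute fails fast with NumberFormatException.
            int level = Integer.parseInt(attrs.getValue("lineCount"));
            current = levels.computeIfAbsent(level, k -> new int[2]);
            current[0]++; // one more unique duplication at this level
        } else if ("block".equals(qName) && current != null) {
            current[1]++; // one more occurrence of that duplication
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("set".equals(qName)) {
            current = null;
        }
    }

    /** Parses a report and returns the tallies, ordered by level. */
    static Map<Integer, int[]> count(InputStream report) throws Exception {
        DuplicationCounter handler = new DuplicationCounter();
        SAXParserFactory.newInstance().newSAXParser().parse(report, handler);
        return handler.levels;
    }
}
```

An ant task would only need to call count() on the report file and compare the result with numbers recorded from an earlier build.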

2 Comments:

At 28 May, 2008 07:50, Blogger Simon Harris said...

Hey Doug. I'm not sure I follow. Are you saying that you now check to ensure that no duplication occurs at specific levels? If so, that's kinda cool. I'd be more than happy to roll that into simian proper. Or have I missed the point totally? I fear I'm getting some of my wife's placenta brain lately :)

 
At 28 May, 2008 19:54, Blogger Doug said...

Hi Simon,

Sort of. It helps to ensure that no new duplications occur, helping to lock down legacy codebases better than the current simian-check.

As an example, say a project's simian-check is set at a threshold of 10. At level 9 it has 2 unique duplications spread across 4 occurrences. If someone were to check in a change such that one of the level 9 duplications was duplicated again (2 unique duplications across 5 occurrences), the legacy-check would fail.
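A rough Java sketch of that comparison, just to make the example concrete. This is not the actual ant task; the class name LegacyCheck, the violations method and the idea of a recorded baseline map (level to number of occurrences) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical legacy-check: flags any level below the threshold that got worse. */
class LegacyCheck {
    /** Returns one message per level whose occurrence count rose above the baseline. */
    static List<String> violations(Map<Integer, Integer> baseline,
                                   Map<Integer, Integer> current,
                                   int threshold) {
        List<String> problems = new ArrayList<>();
        // TreeMap gives a stable, ascending order for the messages.
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(current).entrySet()) {
            int level = e.getKey();
            if (level >= threshold) {
                continue; // simian-check already covers these levels
            }
            int allowed = baseline.getOrDefault(level, 0);
            if (e.getValue() > allowed) {
                problems.add("level " + level + ": " + e.getValue()
                        + " occurrences, baseline allows " + allowed);
            }
        }
        return problems;
    }
}
```

With a threshold of 10, a baseline of 4 occurrences at level 9 and a current count of 5, violations() returns a single message for level 9, which is where the build would be failed.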

In this way, it really starts to highlight to the team when they are (and how they are) adding new duplicated code.

It's a bit rough in that it still needs simian-check to back it up, but I think I've come up with a design to remove the reliance on simian-check altogether.

I'm at JAOO this week, but when I get back in the office on Monday, I'll post an example of what the ant task looks like in use. If there is any confusion, that should help to resolve it.

 
