Removing duplicate code is a great way to improve the internal quality of your application code. Duplications mean that you have more code than you should and are often the source of more subtle bugs of the “I’ve already fixed that ..” variety.

While working on a scriptable command line tool for Java (see Automated refactoring for library updates) I discovered a neat way of identifying duplications at the method and block level.

The tool is written in Scala (still learning so improvement suggestions welcome) and uses ANTLR4 to generate an Abstract Syntax Tree (AST) of Java source supplied to the parser. The AST and the Java grammar I am using make it really easy to traverse the tree after processing and identify blocks [cci lang=”scala”]{ code }[/cci]. The lexer ignores white space and comments so the AST just contains tokens of the blocks.

Buckets

Once the source code has been parsed and blocks identified they can be placed into buckets. A bucket is a hash-map of lists. The key for the hashmap must be unique to the significant elements of the block. For this I used a hash from the token text of all terminal elements within the block. The list contains the block represented by the hash. In this way lists with more than one element identify block duplicates.

private def bucketBlocks(models: List[CodeModel]): BucketSet[BlockDeclaration] = {
    val blockBuckets = new BucketSet[BlockDeclaration]
    models foreach (model => {
      model.blocks foreach (block => blockBuckets.add(block.signature, block))
    })
    blockBuckets
  }

The block.signature is a computed hash of all the token in the block. A BlockDeclaration holds a reference to the root node in the AST for the block.

Iterating through all the blocks and added them to the buckets means that duplicate blocks are buckets with more than one block:

class BucketSet[T] {

  val buckets = new mutable.HashMap[Signature, Bucket[T]]

  def add(signature: Signature, item: T) {
    if (buckets.contains(signature)) {
      buckets(signature).add(item)
    } else {
      buckets(signature) = new Bucket(item)
    }
  }

  def duplicates(): List[Bucket[T]] = buckets.values.filter(bucket => bucket.hasDuplicates).toList

  def getDuplicateCount = duplicates().size

  def getDuplicates = duplicates()

  def eachDuplicate(function: (Bucket[T]) => Any) {
    buckets.values foreach (bucket => {
      if (bucket.hasDuplicates) {
        function(bucket)
      }
    })
  }

  private def hasDuplicates(items: List[T]): Boolean = {
    items.size > 1
  }
}

What’s interesting about this approach is that the identified duplicates are naturally low hanging fruit for extracting common methods.