Fingerprint

I have always be fascinated with how things work. In my early years my focus was on the things around me. Bicycles, Motorcycles and then cars. When I was first introduced to a computer I was hooked and had to know how things work.

That was a long time ago but there are still times when I discover something and just need to know how it works. Typically this means working it out by building something.

Software development at scale presents its own unique challenges that mean manual inspection or review are just not viable. Pattern recognition for the humans is what we do. Machines tend to be a little more pedantic in comparing two things.

The source code for this article can be found here https://github.com/grahambrooks/fingerprint and the original article here http://igm.univ-mlv.fr/~mac/ENS/DOC/sigmod03-1.pdf

Using k-grams and winnowing (selecting a single entry from the k-gram) hides details that then allows us to compute similarity. if we don’t remove something from the original text or stream of data then we are comparing and not identifying similarity.

A worked example

The following gist scans the filesystem for readme files, computes a fingerprint for the contents of each file and then looks for pairs of files that have similar fingerprints.

For files with a similarity score above 0.75 the gist uses a diff library to highlight the differences between the files that are shown as similar.

package main

import (
	"fmt"
	"github.com/grahambrooks/fingerprint/fingerprinter"
	"github.com/grahambrooks/fingerprint/similarity"
	"github.com/sergi/go-diff/diffmatchpatch"
	"io/fs"
	"io/ioutil"
	"os"
	"path/filepath"
	"strings"
)

type FileFingerprint struct {
	path        string
	fingerprint fingerprinter.Fingerprint
}

type Clusters []FileFingerprint

func main() {
	options := fingerprinter.Options{
		GuaranteeThreshold: 4,
		NoiseThreshold:     4,
	}

	var matchingFiles Clusters

	scanRoot := os.Args[1]
	fmt.Printf("Scanning %s\n", scanRoot)
	filepath.WalkDir(scanRoot, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			fmt.Println(err)
			return nil
		}
		switch {
		case d.IsDir():
			switch d.Name() {
			case "node_modules":
				return filepath.SkipDir
			}
			return nil
		case strings.EqualFold(d.Name(), "readme.md"):
			fallthrough
		case strings.EqualFold(d.Name(), "readme.adoc"):
			content, err := ioutil.ReadFile(path)
			if err != nil {
				return err
			}
			fingerprint := fingerprinter.TextFingerprint(string(content), options)
			matchingFiles = append(matchingFiles, FileFingerprint{path: path, fingerprint: fingerprint})
			return nil
		default:
			return nil
		}
	})

	for i, f1 := range matchingFiles {
		for j, f2 := range matchingFiles {
			if i != j {
				s := similarity.Compare(f1.fingerprint, f2.fingerprint)
				switch {
				case s == 1.0:
					fmt.Printf(`
near identical
    %s
    %s
`, f1.path, f2.path)
				case s > 0.75:
					fmt.Printf(`
similar
    %s
    %s
`, f1.path, f2.path)
					dmp := diffmatchpatch.New()

					content1, err := ioutil.ReadFile(f1.path)
					if err != nil {
						return
					}

					content2, err := ioutil.ReadFile(f2.path)
					if err != nil {
						return
					}

					diffs := dmp.DiffMain(string(content1), string(content2), true)

					fmt.Println("================================================================================")
					fmt.Println(dmp.DiffPrettyText(diffs))
					fmt.Println("================================================================================")
				}
			}
		}
	}
}

This example is not very practical. At the moment the fingerprint is represented as an array of structs. Finding pairs that means O(n2) solution that is just way too slow for anything over a small number of files.

The fingerprinting library has only been tested on small samples and a very small set of options. Tuning and testing recommended!