Sunday, 24 June 2012

Towards Semantic Code Search

In this discussion I will use the term "unit" to generalize over functions, classes, methods, procedures, macros and so on.

As I sit and write code, I would like to have a background process quietly searching open source repositories for units that are semantically similar to the ones that I am currently engaged in creating and/or modifying; these can then be presented as options for re-use; to limit the re-invention of the wheel, or to help identify and give advance warning of potential pitfalls and/or bugs.

To achieve this, we need to create some sort of metric space wherein similar units sit close together, and dissimilar (functions/classes) far apart. I would like this similarity metric to take > 2 values, so we do not just limit ourselves to binary matching, but can tune the sensitivity of the algorithm. This approach is suitable for exploratory work, because it helps give us a tool that we can use to build an intuitive understanding of the problem.

The algorithm can follow the prototypical pattern recognition architecture: A collection of 2-6 feature-extraction algorithms, each of which extracts a structural feature of the code under search. These structural features shall be designed so that their outputs are invariant under certain non-significant transformations of the source unit. (E.g. arbitrary renaming of variables, non-significant re-ordering of operations etc..).


These feature extraction algorithms could equally correctly be called normalisation algorithms.


Together, the outputs of the feature-extraction algorithms will define a point in some feature space. This point will be represented either by a numeric vector (if all the feature-extraction algorithms have scalar-numeric outputs), or will be something more complex, in the highly probable event that one or more of the feature extraction algorithms produces non scalar-numeric output (e.g. a tree structure).


Once we can construct such feature "vector"/structures, we need a metric that we can use to measure similarity/dissimilarity between pairs of such structures. If all the features end up being scalar-numeric, and the resulting point in feature space is a numeric vector, then a wide range of possible metrics are immediately available, and something very simple will probably do the trick. If, on the other hand, the features end up being partially or wholly non-numeric, then the computation of the metric may end being more involved.


If this all sounds too complex, then perhaps an example of a possible (non scalar-numeric) feature extraction algorithm will bring the discussion back down to earth and make it more real: http://c2.com/doc/SignatureSurvey/

The number of features should be strictly limited by the number of samples that are available to search. Given repositories like Github and SourceForge, somewhere around 3 features would be appropriate. 

It is worth noting that we are not just looking for ways of identifying functional equivalence in programs. Functional equivalence is an important feature, but similarities in presentation are important also; so for some features, choice of function / class / variable name may not be important, but for others, it may be significant.


What types of normalization might be useful? What features of the source unit should we consider to be significant, and what features should we ignore?


Suggestions in comments gladly received.


Note: There exists some prior literature in this field. I have not yet started to explore it, so the above passages illustrate my uninformed initial thoughts on the problem.
http://www.cs.brown.edu/~spr/research/s6.html
http://adamldavis.com/post/24918802271/a-modest-proposal-for-open-source-project-hashes