Tuesday, June 11, 2013

Run your research: on the effectiveness of lightweight mechanization

Run your research: on the effectiveness of lightweight mechanization
C. Klein, J. Clements, C. Dimoulas, C. Eastlund, M. Felleisen, M. Flatt, J. McCarthy, S. Tobin-Hochstadt, and R. Findler
POPL 2012

This paper gives an overview of the use of Redex, a Scheme/Racket dialect, as a lightweight framework for formalization of programming language properties, with the goal of using automated/randomized testing to weed out shallow bugs.

The idea of randomized testing for language properties has been around for a long time (the authors note Hanford's work in the 1970s which I wasn't aware of before).  However, most interest over the last 10 years has been on formalization and verification.  The problem with this is that the applicable verification tools (Isabelle, Coq, Twelf) have a high learning curve, and do not necessarily apply in certain circumstances, for example when trying to formalize the behavior of a given, unknown system (e.g. relaxed memory models).  Moreover, it can be unclear whether a verification attempt is unsuccessful because the system is wrong or because the proof techniques/representations being used are unsuitable, and solving the latter kinds of problems can require new research orthogonal to the system being studied.  Failure to prove a result does not always give insight into counterexamples.  Finally, even having a complete formalization leaves open the possibility that the specification is wrong or that errors are introduced in the process of writing up the paper.

So there is strong motivation for complementary techniques that may not guarantee correctness but are applicable and helpful in the common case where the system is incorrect (or "almost" correct).  Some other researchers have proposed systems for automating parts of the associated tasks (Ott for mapping concrete to abstract syntax and LaTeX or Coq/Isabelle code, other systems mentioned include  αML, αProlog, K, and Ruler).  This paper focuses on Redex, which the authors have developed over nine years to include many convenient features such as visualization of reduction steps and mapping to LaTeX.

At a technical level, there is nothing new here: Redex has been around for a while and the randomized searching and unit testing techniques are well-established, though the paper devotes some space to working around Redex's lack of support for name-binding and other complications.  The main contribution of the paper is an in-depth investigation of one previous paper by some of the authors, and investigations of nine previous ICFP papers.  All ten turn out to have problems, some amounting to typos or transcription problems from on-paper or mechanical formalizations to LaTeX, others of which are nontrivial.  This is valuable empirical validation of the effectiveness of automated testing of formalizations.

I did some work in this area a while ago with Alberto Momigliano (building on αProlog developed with Christian Urban) which the authors are kind enough to mention.  It would be interesting to see whether the Redex implementation can be extended to support our bounded exhaustive search and negation elimination techniques (possibly adopting ideas from αKanren, an implementation of αProlog-like logic programming in Scheme).  Conversely, αProlog effectively already supports simple unit tests (though without a supporting library as in Redex) and its symbolic counterexample search technique is more advanced in some ways.  It could easily be extended to support random search.  The implementation of Redex is clearly vastly more mature.

It would also be interesting to construct a "benchmark" of systems and bugs (maybe starting with the ICFP 2009 papers studied here).  It's not clear though how one could fairly compare different mechanized metatheory testing/checking systems given that the representation languages and search strategies will likely vary widely.  Within a single system one can at least compare different search strategies effectively, but across systems it's less clear how to do an apples-to-apples comparison.

Labels: ,


Post a Comment

Subscribe to Post Comments [Atom]

<< Home