## Monday, July 28, 2014

### Composing queries in F#

In our ICFP 2013 paper last year, we introduced a formal model of Microsoft's "Language-Integrated Query", as implemented in F#.  The basic idea is to allow programmers to write queries in a way that more closely resembles ordinary code, can be typechecked alongside ordinary code, but still (hopefully) generates reasonable SQL queries.  For example, in F# one can write something like

query {
for student in db.Student do
where (student.Age.Value >= 10
&& student.Age.Value < 15)
select student
}

to select all students with ages in the interval $[10,15)$.

Observe that the above query does not depend on any data in the F# host language.  There's a single SQL query that we can generate to answer this query on the database:

SELECT  student
FROM Student student
WHERE student.Age >= 10 AND student.Age < 15

(here, I'm glossing over differences in SQL syntax between Microsoft SQL server and other DBMSs).

## Dynamic queries

The ability to write (and typecheck) these simple, "static" queries is useful, but would be fairly limited if that is all it allowed.  Most programs that access a database need to do so in a more dyanmic way, where the query depends on some data only available at run time.  A simple example is a query that is parameterized by lower and upper bounds:

fun range lower upper =
query {
for student in db.Student do
where (student.Age.Value >= lower
&& student.Age.Value < upper)
select student
}

Thus, there is not a single SQL query, but rather a set of possible instances of a more abstract query template, and we need to be able to select the right one at run time. The lower and upper bounds could be provided by the user, in which case there is no way we can know what their values will be in advance.

The way LINQ handles this is to treat the code inside the query brackets as a quotation (after desugaring from F#'s "Computation Expressions" to a more primitive API).  The query operator then processes this quoted code and transforms it into SQL.  (Technically, what happens is that the F# standard library transforms the F# Expr values into C# Expression values, which are in turn translated to SQL by the C# LINQ-to-SQL query provider.  But we'll treat that process as a black box - we are only going to be interested in what happens before F# takes over.)

Already, this is a win - since F# knows that lower and upper are supposed to be integers, they will automatically be converted to SQL strings representing integers, and we don't have to do this conversion explicitly ourselves; similarly, any string parameters will be safely escaped so that special characters used in SQL injection attacks will be escaped correctly.  In other words, the explicit templating and parameterization step needed in a library like JDBC to avoid SQL injection attachs is unnecessary here.

## Higher-order query parameters

Now let's ask what happens if we want to parameterize a query over something other than an integer or a string.  F# is a higher-order language; can we parameterize a query over a function?  For example, suppose we want to support lots of different predicates on students, not just ranges.  Can we do this?

let satisfies p =
query {
for student in db.Students do
if p student then
yield student
}

satisfies (fun s -> s.Age.Value >= 10
&& s.Age.Value < 15) 

The answer is "yes", but it's not quite as simple as the above.  The problem is the lambda-abstraction over $p$.  This is a function value, and we are using it inside a query expression, which is quoted.  At run time, we have no way of turning the function value bound to p into a quotation corresponding to the code for $p$.  For values of base type, this conversion is straightforward (i.e., F# does the "cross-stage persistence" for us) but for function values, we're out of luck.

To make F# happy, therefore, we need to quote the lambda-abstraction:

let satisfies =
<@ fun p -> query {
for student in db.Students do
if p student then
yield student
} @>
query { for x in (%satisfies) (fun s ->
s.Age.Value >= 10 && s.Age.Value < 15)
yield x }
Here, we define satisfies to be a quotation of a function that takes a student and returns a boolean.  (The <@... @> notation is F#'s syntax for quoted code.)  Subsequently we write a query that returns all of the elements of

(%satisfies) (fun s -> s.Age.Value >= 10
&& s.Age.Value < 15)

(here, % is F#'s notation for antiquotation).  The above code splices in the definition of satisfies, and applies it to a function.  But since this function is itself inside a quotation, its code representation will be available at run time.  Therefore, we can use it to generate a query.


## Normalization

You're an alert reader, so you've certainly already noticed a potential problem with this.  Suppose we evaluate the above parameterized, quoted query; we should wind up with something like:

query { for x in (fun p -> query {
for student in db.Students do
if p student then
yield student
} ) (fun s -> s.Age.Value >= 10
&& s.Age.Value < 15)
yield x }
Note that the code inside the query{} brackets is essentially a value: it's as evaluated as it's going to get (as far as F# is concerned).  In particular, the $\beta$-reduction arising from the anonymous function applied to the predicate is frozen. This code contains a $\lambda$-abstraction, and $\lambda$-abstraction is not a feature available in SQL.  How are we going to translate this code at run-time to equivalent SQL?

The answer is a normalization algorithm for query expressions that can remove incidental uses of features that go beyond SQL.  In this case, all that is really needed is to do a beta-conversion and substitute in the parameter for $p$, as follows:

query { for x in for student in db.Students do
if (fun s -> s.Age.Value >= 10
&& s.Age.Value < 15)
student then
yield student
yield x }

We can further simplify the query expression  as follows:

query { for student in db.Students do
if student.Age.Value >= 10
&& student.Age.Value < 15 then
yield student }

However, in general, things can be a little more complicated (e.g. we may need to do some transformations that depend on the fact that we are working with a query language, not just pure lambda-calculus reductions.)

A simple version of this algorithm (which would suffice for this example) dates to Wong's work in the mid-1990s on Kleisli; later, Ezra Cooper (one of Phil Wadler's students) generalized Wong's algorithm to the higher-order case in the context of the Links programming language.  Our ICFP 2013 paper gave a variant of Cooper's approach that is technically a little more lightweight, but is based on the same basic idea; the main novelty was the application to the F# context.

## Introducing FSharpComposableQuery

Our paper was primarily based on the F# 2.0 version of LINQ, which was distributed as part of the F# PowerPack rather than as a fully-supported language feature.  We did test our approach to some extend on F# 3.0, and in particular showed thatour library did not interfere with other F# 3.0 operators such as grouping and aggregation.

However, as is fairly typical for reserach-driven code, our approach was a proof of concept and not in a form that would have been easy for someone else to reuse.  In addition, as noted above it was developed originally against the F# 2.0 version of LINQ, and we expected some updating would be needed to adapt to F# 3.0.

Later in 2013, the F# team was kind enough to help us to set up a F# community project called FSharpComposableQuery containing our code, and we did indeed run  into some subtle issues in adapting our approach from F# 2.0 to F# 3.0.

Over the past month, with support from an LFCS internship for Edinburgh undergraduate Yordan Stoyanov, we have solved these problems and generally improved FSharpComposableQuery to the point where we believe other reserachers or F# programmers will be able to use it.  The most recent version can be downloaded from NuGet here and the source code (including test programs and examples) is available here.

## Why is this useful?

As we argued in more detail in our paper, the combination of quotation and normalization provides a significant boos to expressive power.  It means that abstraction features of the host language (e.g., lambda-abstraction in F#) can be used to organize and simplify database query code, as well as ordinary code meant for in-memory execution.  Queries can be decomposed into smaller, more reusable (and more comprehensible) units, as already illustrated above.  In addition, we can leverage the normalization property to write queries over virtual "views" of a database; as long as the end result is flat, and the query uses a simple subset of SQL, normalization will yield a single query.  Even more, we can build queries by traversing in-mempory recursive data structures; for example, we can write a simple XPath query evaluator that translates path expressions to SQL queries over a relational representation of XPath.

This post is already getting long, so I refer the interested reader to the paper, or to the tutorial examples accompanying FSharpComposableQuery.  More advanced features might be the subject of a future post.

[EDIT: Updated the links to the project pages, since the project has been renamed to remove the word "Experimental".]

Labels: ,