## Monday, April 29, 2013

### An algebraic approach for data-centric scientific workflows

An algebraic approach for data-centric scientific workflows
E. Ogasawara, D. de Oliveira, P. Valduriez, J. Dias, F. Porto, M. Mattoso
VLDB 2011

This paper argues for the use of an algebraic core language to support parallel execution of data-centric workflows, i.e., graphs that correspond to programs over bulk data.  The idea is that just as relational algebra is used inside databases, and can be evaluated in different ways (using different physical operators) or optimized (using equivalences derived from the semantics, with profiling information/statistics driving cost estimates), the algebraic operations presented here can be used to support different execution models or optimizations.

The operators include "Map", "Reduce", "Filter", and a variant of "Map" called "SplitMap"; all of these can take an arbitrary executable and run it on many inputs.  SplitMap has some additional grouping / splitting behavior that isn't explained in detail in the paper.  There are also two relational operators: SRQuery, which applies a selection/projection query to a single relation, and JoinQuery, which applies a multiple-input query to several relations.  The connections between operators are typed as tuples of base values or filenames (or possibly other nested values, but this isn't discussed further).  So this can be viewed as a generalization of the relational calculus, where some nodes of the graph correspond to whole queries, and other nodes correspond to structured user-defined operations.
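As a rough sketch of how I read these operators — the names, signatures, and tuple representation below are my own reconstruction, not the paper's formal definitions:

```python
# A toy interpretation of the workflow algebra: a relation is a list of
# tuples, each tuple a dict of field -> value. Activities are plain
# Python callables standing in for arbitrary executables.

def wf_map(activity, relation):
    """Map: run an activity on each input tuple, one output per input."""
    return [activity(t) for t in relation]

def wf_filter(predicate, relation):
    """Filter: keep only the tuples the predicate accepts."""
    return [t for t in relation if predicate(t)]

def wf_splitmap(activity, relation):
    """SplitMap: like Map, but each activity call may emit several tuples."""
    out = []
    for t in relation:
        out.extend(activity(t))
    return out

def wf_reduce(key_fields, aggregate, relation):
    """Reduce: group tuples by key fields and aggregate each group."""
    groups = {}
    for t in relation:
        k = tuple(t[f] for f in key_fields)
        groups.setdefault(k, []).append(t)
    return [aggregate(k, g) for k, g in groups.items()]

def sr_query(select, project, relation):
    """SRQuery: selection + projection over a single relation."""
    return [{f: t[f] for f in project} for t in relation if select(t)]

# Example pipeline: Filter, then Map, then Reduce over a small relation.
data = [{"id": 1, "x": 2}, {"id": 2, "x": 5}, {"id": 3, "x": 5}]
big = wf_filter(lambda t: t["x"] > 2, data)
doubled = wf_map(lambda t: {"id": t["id"], "x": t["x"] * 2}, big)
total = wf_reduce(["x"], lambda k, g: {"x": k[0], "count": len(g)}, doubled)
```

The point of casting everything as relation-to-relation operators is that a whole workflow is then just an expression in the algebra, which is what makes equivalence-based rewriting conceivable at all.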

The main experimental result is a comparison of different scheduling strategies: "dynamic" vs. "static" and "first tuple first" (FTF) vs. "first activity first" (FAF).  The definitions of these different strategies in sec. 4.3 are not easy for me to follow, but the basic idea seems to be:
• static = allocate each operator to a processor in advance, based on information available before execution
• dynamic = allocate operators / subgraphs to processors dynamically in response to demand
• FTF = partition graphs into connected components
• FAF = partition graphs into components that have no shared edges
It isn't clear why FTF and FAF are the only two options considered.  The experimental results suggest that dynamic FTF is best, which seems plausible given that FTF may be better for locality and dynamic techniques may be better at adapting to the data.
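To make my reading of the FTF/FAF distinction concrete, here is a small sketch of the two fragmentation schemes — again my own reconstruction of sec. 4.3, which may not match the paper's definitions exactly:

```python
# A workflow is a set of activities plus directed edges between them.
# FTF (as I read it): fragments are connected components, so a tuple can
# flow through a whole pipeline on one processor.
# FAF (as I read it): each activity is its own fragment, so one activity
# is finished over many tuples before its successors start.
from collections import defaultdict

def ftf_fragments(activities, edges):
    """Group activities into (undirected) connected components."""
    parent = {a: a for a in activities}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for u, v in edges:
        parent[find(u)] = find(v)
    comps = defaultdict(list)
    for a in activities:
        comps[find(a)].append(a)
    return [sorted(c) for c in comps.values()]

def faf_fragments(activities, edges):
    """One fragment per activity; edges never cross inside a fragment."""
    return [[a] for a in activities]

# A workflow with two independent pipelines: A->B->C and D->E.
acts = ["A", "B", "C", "D", "E"]
deps = [("A", "B"), ("B", "C"), ("D", "E")]
```

On this example FTF yields two pipeline fragments while FAF yields five singleton fragments, which is consistent with the locality argument: under FTF a tuple's intermediate results stay on one processor.
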

The paper contrasts its approach primarily with previous work in the workflow and parallel database areas, as opposed to parallel computing per se.  It would be interesting to compare the dynamic approaches here to techniques familiar from parallel programming languages, such as work stealing.  Likewise, it isn't clear what distinguishes the static approach from a static allocation obtained from synchronous dataflow (which is what Kepler typically uses).  Systems such as Taverna also have an underlying calculus that could in principle be used for optimization or parallelization but I'm not sure if this has been explored.  Blelloch's and others' work on nested data parallelism seems related too.

There is also a lot of work (now, though maybe not when the paper was written) on extensions / variations on MapReduce that decouple the map and reduce operations or allow other operations — Pig Latin, for example. There is now some work on static analysis and rewriting for such computations, whereas in this paper the possibility of rewriting workflows to improve performance is not investigated deeply (beyond an experiment showing that doing so manually can be beneficial).

Overall, I think it is a good idea to develop algebraic foundations for workflow computation, and a more tightly constrained workflow model may well expose more optimization opportunities than general-purpose parallel programming does.  But it seems necessary to first see how much benefit can be derived from known techniques in parallel programming, and to compare new proposals against applicable previous techniques.
