An algebraic approach for data-centric scientific workflows
E. Ogasawara, D. de Oliveira, P. Valduriez, J. Dias, F. Porto, M. Mattoso
VLDB 2011
This paper argues for the use of an algebraic core language to support parallel execution of data-centric workflows, i.e., graphs that correspond to programs over bulk data. The idea is that relational algebra is used inside databases, where it can be evaluated in different ways (using different physical operators) or optimized (using equivalences derived from the semantics, with profiling information and statistics driving cost estimates). In the same way, the algebraic operations presented here can be used to support different execution models or optimizations.
The operators include "Map", "Reduce", "Filter", and a variant of "Map" called "SplitMap"; all of these can take an arbitrary executable and run it on many inputs. SplitMap has some additional grouping/splitting behavior that isn't explained in detail in the paper. There are also two relational operators: SRQuery, which applies a selection/projection query to a single relation, and JoinQuery, which applies a multiple-input query to several relations. The connections between operators are typed as tuples of base values or filenames (or possibly other nested values, but this isn't discussed further). So this can be viewed as a generalization of the relational calculus, where some nodes of the graph correspond to whole queries, and other nodes correspond to structured user-defined operations.
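To make the shape of these operators concrete, here is a minimal sketch in Python. It is an illustration under several assumptions, not the paper's definitions: relations are modelled as lists of dicts, the "arbitrary executable" is stood in for by a plain Python callable, and the names `map_op`, `filter_op`, `reduce_op`, and `sr_query` are hypothetical. SplitMap and JoinQuery are left out, since SplitMap's grouping/splitting behavior isn't spelled out in enough detail to sketch faithfully.

```python
from itertools import groupby
from typing import Any, Callable, Dict, List

# Assumption: a relation is a list of tuples, each tuple a dict of
# base values or filenames keyed by attribute name.
Row = Dict[str, Any]
Relation = List[Row]


def map_op(activity: Callable[[Row], Row], rel: Relation) -> Relation:
    """Run the activity once per input tuple; one output tuple each."""
    return [activity(t) for t in rel]


def filter_op(predicate: Callable[[Row], bool], rel: Relation) -> Relation:
    """Keep only the tuples the activity accepts; tuples are unchanged."""
    return [t for t in rel if predicate(t)]


def reduce_op(activity: Callable[[List[Row]], Row], key: str,
              rel: Relation) -> Relation:
    """Group tuples by one attribute and run the activity once per group."""
    ordered = sorted(rel, key=lambda t: t[key])  # groupby needs sorted input
    return [activity(list(group))
            for _, group in groupby(ordered, key=lambda t: t[key])]


def sr_query(predicate: Callable[[Row], bool], columns: List[str],
             rel: Relation) -> Relation:
    """Selection followed by projection over a single relation."""
    return [{c: t[c] for c in columns} for t in rel if predicate(t)]


# Hypothetical input relation: filenames plus base values.
inputs = [
    {"run": 1, "file": "a.dat", "size": 10},
    {"run": 1, "file": "b.dat", "size": 30},
    {"run": 2, "file": "c.dat", "size": 5},
]


def double_size(t: Row) -> Row:
    return {**t, "size": t["size"] * 2}


def total(group: List[Row]) -> Row:
    return {"run": group[0]["run"], "size": sum(t["size"] for t in group)}


# A small workflow: Map, then Filter, then Reduce.
big = filter_op(lambda t: t["size"] >= 20, map_op(double_size, inputs))
totals = reduce_op(total, "run", big)
files = sr_query(lambda t: t["run"] == 1, ["file"], inputs)
```

Because each operator is a pure function from relations to relations, a workflow is just a composition of them. That is what leaves room for an engine to choose different physical strategies (for example, running the per-tuple calls in `map_op` in parallel) or to rewrite the composition using equivalences.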
Labels: algebraic optimization, workflows