Matthijs Ooms - Provenance Management in Practice

author:Matthijs Ooms
title:Provenance Management in Practice
keywords:
committee:dr. P.E. van der Vet (1st supervisor)
dr.ir. D. Hiemstra
dr.ir. R. Langerak
ir. I. Wassink
graduation date:24 September 2009


Abstract

Scientific Workflow Management Systems (SWfMSs), such as our own research prototype e-BioFlow, are used by bioinformaticians to design and run data-intensive experiments that connect local and remote (Web) services and tools. Preserving data for later inspection or reuse, determining the quality of results and validating results are essential for scientific experiments. This can all be achieved by collecting provenance data. The dependencies between services and data are captured in a provenance model, such as the interchangeable Open Provenance Model (OPM). This research has the following two provenance-related goals:

  1. Using a provenance archive effectively and efficiently as cache for workflow tasks.
  2. Designing techniques to support browsing and navigation through a provenance archive.

Early in this research it was determined that a representative use case was needed. A use case, in the form of a scientific workflow, can show the performance improvements possibly gained by caching workflow tasks. If this use case is large-scale and data-intensive, and provenance is collected during its execution, it can also be used to show the levels of detail that can be addressed in the provenance data. Different levels of detail can be of aid whilst browsing and navigating provenance data.

The use case identified is called OligoRAP, taken from the life science domain. OligoRAP is cast as a workflow in the SWfMS e-BioFlow. Its performance in terms of duration was measured, and its results were validated by comparing them to the results of the original Perl implementation. By casting OligoRAP as a workflow and using parallelism, its performance is improved by a factor of two.

Many improvements were made to e-BioFlow in order to run OligoRAP, among which a new provenance implementation based on the OPM, enabling provenance capturing during the execution of OligoRAP in e-BioFlow. During this research, e-BioFlow has grown from a proof-of-concept to a powerful research prototype.

For the OPM implementation, a profile for the OPM has been proposed that defines how provenance data is collected during workflow enactment. The proposed profile maintains the hierarchical structure of (sub)workflows in the collected provenance data. With this profile, interoperability of the OPM for SWfMSs is improved.

A caching strategy is proposed for caching workflow tasks and is implemented in e-BioFlow. It queries the OPM implementation for previous task executions. The queries are optimised by formulating them differently and creating several indices. The performance improvement of each optimisation was measured using a query set taken from an OligoRAP cache run. Three tasks in OligoRAP were cached, resulting in a performance improvement of 19%. A provenance archive based on the OPM can be used to effectively cache workflow tasks.
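
The general idea of using a provenance archive as a task cache can be illustrated with a minimal Python sketch. It is not the e-BioFlow implementation: the `ProvenanceArchive` class, its methods, the `run_task` function and the hash-based lookup key are all hypothetical names and choices made for this illustration, and the in-memory dictionary only stands in for the indexed queries against the OPM implementation.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class ProvenanceArchive:
    """Toy in-memory stand-in for a provenance archive (hypothetical)."""

    def __init__(self) -> None:
        # Maps (task name, digest of inputs) to the recorded outputs.
        # The dictionary plays the role of an index on the archive.
        self._outputs_by_key: Dict[Tuple[str, str], Any] = {}

    @staticmethod
    def _key(task_name: str, inputs: Dict[str, Any]) -> Tuple[str, str]:
        # Serialise inputs canonically so equal inputs give equal digests.
        canonical = json.dumps(inputs, sort_keys=True)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return task_name, digest

    def contains(self, task_name: str, inputs: Dict[str, Any]) -> bool:
        return self._key(task_name, inputs) in self._outputs_by_key

    def outputs_of(self, task_name: str, inputs: Dict[str, Any]) -> Any:
        return self._outputs_by_key[self._key(task_name, inputs)]

    def record(self, task_name: str, inputs: Dict[str, Any], outputs: Any) -> None:
        self._outputs_by_key[self._key(task_name, inputs)] = outputs


def run_task(
    archive: ProvenanceArchive,
    task_name: str,
    inputs: Dict[str, Any],
    execute: Callable[[Dict[str, Any]], Any],
) -> Tuple[Any, bool]:
    """Return (outputs, cache_hit), executing the task only on a miss."""
    if archive.contains(task_name, inputs):
        return archive.outputs_of(task_name, inputs), True
    outputs = execute(inputs)
    archive.record(task_name, inputs, outputs)
    return outputs, False
```

Such a scheme is only safe for tasks whose outputs depend solely on their inputs; a task that calls a remote service whose underlying data changes over time would return stale results from the cache.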

A provenance browser is introduced that incorporates several techniques to help users browse large provenance archives. Its primary visualisation is the graph representation specified by the OPM. The following techniques have been designed:

  • An account navigator that uses the hierarchy of composite tasks and subworkflows captured by the OPM profile to visualise a tree structure of generic and detailed views of the provenance data.
  • Several perspectives on the provenance data, namely the data flow, control flow and resource perspectives, identical to the perspectives on workflows in e-BioFlow. These enable the end-user to show detail on demand.
  • A query panel that enables the end-user to specify a provenance query. The result is directly visualised in the provenance browser, allowing the user to query for certain data items, tasks or even complete derivation trails.
  • Retrieval of tasks or data items that are not loaded in the provenance browser, but are neighbours of currently visible tasks or data items.
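
The last technique, retrieving neighbours of visible items, can be sketched as a simple graph operation. The following Python sketch is an illustration under assumptions, not the browser's actual code: the edge-list representation, the `hidden_neighbours` function and the node names are hypothetical, while `used` and `wasGeneratedBy` are edge types defined by the OPM.

```python
from typing import Iterable, Set, Tuple

# An edge is (effect, relation, cause), following the OPM convention that
# edges point from an effect to its cause.
Edge = Tuple[str, str, str]


def hidden_neighbours(edges: Iterable[Edge], visible: Set[str]) -> Set[str]:
    """Return nodes adjacent to a visible node that are not yet visible."""
    found: Set[str] = set()
    for effect, _relation, cause in edges:
        if effect in visible and cause not in visible:
            found.add(cause)
        elif cause in visible and effect not in visible:
            found.add(effect)
    return found


# Hypothetical provenance fragment: a task reads one file and writes another,
# and a second task turns that output into a chart.
edges = [
    ("blast", "used", "oligos.fasta"),
    ("hits.xml", "wasGeneratedBy", "blast"),
    ("plot", "used", "hits.xml"),
    ("chart.png", "wasGeneratedBy", "plot"),
]
to_load = hidden_neighbours(edges, {"blast"})
```

Calling the function again with the enlarged visible set expands the view by one more step, so the end-user can walk through the archive incrementally instead of loading it as a whole.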

These techniques have already proven their value whilst debugging OligoRAP: error messages, and more interestingly, their cause, were easily identified using the provenance browser. The provenance archive could be queried for all generated pie charts using the query panel, presenting a clear overview of the results of an OligoRAP run.