Orchestrating Similar Stream Processing Jobs to Merge Equivalent Subjobs

Duration: July 2017 until July 2017

Summary of the project

Distributed stream processing frameworks have become considerably more powerful and efficient. However, existing optimizations mostly target the efficiency of single jobs or the load balance of the task scheduler. In this thesis, we propose an approach for merging, at runtime, equivalent subjobs of streaming jobs that are generated from a predefined template. Because our template structure resembles that of simple Spark Streaming applications, templates can be created with minimal development overhead. Furthermore, we analyze the complexity of benchmarking Spark Streaming applications and, based on this analysis, design a benchmarking method that uses maximum throughput as its metric. We apply this method in an experimental comparison of merged versus unmerged jobs on the CTIT cluster of the University of Twente. Based on the results of this analysis, however, we cannot conclude in which cases job merging increases the maximum throughput.
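To illustrate the intuition behind subjob merging (the thesis's actual template mechanism is not detailed here), the following minimal Scala sketch shows two logical Spark Streaming jobs that share an identical ingest-and-parse stage. After merging, that shared stage is computed once per batch and both downstream computations consume its output. The host, port, and the two downstream jobs are hypothetical placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MergedJobsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("merged-jobs-sketch")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Shared (equivalent) subjob: both logical jobs ingest and parse
    // the same stream, so after merging this work is done only once.
    val parsed = ssc.socketTextStream("localhost", 9999) // hypothetical source
      .map(_.split(","))
      .filter(_.length == 2)

    // Logical job A: count events per key.
    parsed.map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)
      .print()

    // Logical job B: keep only events above a threshold.
    parsed.filter(fields => fields(1).toLong > 100L)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Without merging, each logical job would run in its own application and re-read and re-parse the stream independently; sharing the `parsed` DStream is what the merge saves.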
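The thesis's benchmarking method itself is not reproduced here, but as a rough sketch of the underlying idea: a Spark Streaming job is commonly considered sustainable at a given input rate only if each batch's processing time stays within the batch interval, and the maximum throughput is the highest rate for which this still holds. The listener below, a hypothetical helper built on Spark's standard `StreamingListener` API, logs per-batch record counts and flags batches that fall behind.

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs per-batch throughput and flags batches whose processing time
// exceeds the batch interval (i.e. the job cannot keep up at this rate).
class ThroughputListener(batchIntervalMs: Long) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info        = batch.batchInfo
    val procMs      = info.processingDelay.getOrElse(0L)
    val sustainable = procMs <= batchIntervalMs
    println(s"batch=${info.batchTime} records=${info.numRecords} " +
      s"processingMs=$procMs sustainable=$sustainable")
  }
}

// Attach before starting the streaming context:
//   ssc.addStreamingListener(new ThroughputListener(1000L))
```

Repeating such a measurement at increasing input rates, for both merged and unmerged jobs, is one plausible way to compare their maximum sustainable throughput.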