author: Vadim Zaytsev
title: Consider It Parsed
keywords: parsing, replication, optimisation
topics: Algorithms and Data Structures , Case studies and Applications , Software Technology
committee: Vadim Zaytsev

Description

The world of parsing techniques is huge and full of various techniques — if you don't believe me, have a look at Parsing Techniques and bask in its 700 page glory. As a matter of fact, even since the second edition of this lovely book came out in 2008, there have been several significant advances in this area, with important new results published almost yearly.

One of these results came out earlier this year, and published at PLDI 2020 as “Faster General Parsing through Context-Free Memoization”. It is a nicely readable 14-page proposal of a new approach which seems to:

  • be easier to understand than some other techniques
  • map more straightforwardly to actions at parse time, making parser debugging a sensible activity
  • theoretically, have the same/comparable complexity as competitors (other parsing techniques)
  • experimentally, outperform competitors (other parser generators) by speed factors of 3–30 depending on which task is needed exactly

These are all nice features. What's even nicer is that the authors put up their proof of concept implementation on GitLab, free for public use.

Let's perform a replication of this study. Concretely, this means:

  • read the paper/documentation, get a basic understanding of how their algorithm works;
  • familiarise yourself with the code linked above;
  • find inconsistencies between the two and report on them (e.g., the paper advertises a C++ implementation, while the repository has C#);
  • rerun the original experiment (on elastic/elasticsearch);
  • rerun the experiment on a larger data set (Elastic Search has ~9k Java files, it is easy to get much, much more by grabbing Qualitas Corpus or just the Top 50 Repositories, or even literally crawling the entire GitHub with GHTorrent or GitHub REST API);
  • propose optimisations or list ideas on optimisations based on the findings.

The code is relatively fresh and seems operational (compiles and runs on .NET Core on a Mac laptop, both in Rider and Visual Studio). If you don't know C#, that's not a problem: it looks and reads like Java, just designed better.