author: Ewout van der Wal
title: Rosetta ANTLR: the Ultimate Grammar Extractor
keywords: grammar extraction
topics: Algorithms and Data Structures, Languages
committee: Vadim Zaytsev
started: November 2020
end: January 2021


The Software Language Processing Suite includes a grammar extractor from ANTLR to BGF. It relies on a method of "extraction by abstraction" [1][2][3], which produces "grammars in a broad sense" [4] from grammar specifications written in the ANTLR3 notation. The extractor has proven useful for grammar recovery [1], for grammar-based testing [5], and more generally for collecting grammars for the empirical evaluation of grammarware [6].

However, extraction by abstraction has an inherent weakness: it misrepresents some details of the original grammar, simply because the expressivity of the target notation is deliberately limited. In this project, we aim to provide an alternative extractor that takes into account the disambiguating features of ANTLR4, the details of the ALL(*) parsing technique [7], the underlying semantics of the programming language used in the semantic actions, and possibly other elements used to break the context-freedom of EBNF-like notations. The goal is an ultimate extractor that represents ANTLR grammars as universal, technology-agnostic, yet very precise specifications of commitment to grammatical structure.
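
The loss of detail can be illustrated with a toy sketch. The Python below is not the Software Language Processing Suite extractor; it is a hypothetical, deliberately naive abstraction step that drops brace-delimited actions and semantic predicates from one ANTLR-style rule. The function name abstract_rule, the rule decl and the predicate call self.isType() are assumptions made up for this illustration, and the scanner ignores braces inside quoted literals.

```python
def abstract_rule(rule: str) -> str:
    """Drop {...} actions and {...}? predicates from one ANTLR-style rule.

    Hypothetical helper for illustration only; it does not handle
    braces that appear inside quoted literals such as '{'.
    """
    out = []
    depth = 0
    i = 0
    while i < len(rule):
        ch = rule[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            # A '?' right after the closing brace marks a semantic predicate.
            if depth == 0 and i + 1 < len(rule) and rule[i + 1] == "?":
                i += 1
        elif depth == 0:
            out.append(ch)
        i += 1
    # Collapse the whitespace left behind by the removed blocks.
    return " ".join("".join(out).split())


# Two rules that differ only in a disambiguating predicate (made-up example).
with_predicate = "decl : {self.isType()}? ID ID ';' | ID '=' ID ';' ;"
without_predicate = "decl : ID ID ';' | ID '=' ID ';' ;"

# After abstraction the two rules are indistinguishable.
assert abstract_rule(with_predicate) == abstract_rule(without_predicate)
```

Both inputs collapse to the same context-free rule, so the condition under which the first alternative applies is no longer visible in the extracted grammar. This is the kind of information a more precise extractor would need to keep.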


  1. Lämmel, Zaytsev, Introduction to Grammar Convergence, iFM 2009.
  2. Zaytsev, Language Convergence Infrastructure, GTTSE 2009.
  3. Zaytsev, Notation-Parametric Grammar Recovery, LDTA 2012.
  4. Klint, Lämmel, Verhoef, Toward an Engineering Discipline for Grammarware, ToSEM 2005.
  5. Fischer, Lämmel, Zaytsev, Comparison of Context-free Grammars Based on Parsing Generated Test Data, SLE 2011.
  6. Zaytsev, Grammar Zoo: A Corpus of Experimental Grammarware, SCP 2015.
  7. Parr, Harwell, Fisher, Adaptive LL(*) Parsing: The Power of Dynamic Analysis, OOPSLA 2014.