Writing interpreters with ANTLR

23 Jan 2016

I recently decided to take a look at learning ANTLR for the purposes of building an interpreter similar to the one we developed not too long ago. ANTLR, or Another Tool For Language Recognition, takes as input a grammar, or language specification describing the language one would like to interpret, and generates as output source code for a recognizer for that language. Remember all the hard work we did to parse LISP code before we could interpret it? This time we’ll let ANTLR write that code for us; we just need to provide an implementation of a tree visitor to traverse the abstract syntax tree that the ANTLR parser gives us.

Motivation and why ANTLR

Rather than a programming language, for this project we’ll interpret Roman numerals to decimals. I thought what with the NFL championship right around the corner that this might be relevant. Turns out that Super Bowl fifty won’t be marketed in the tradition of previous championships as “Super Bowl L” but rather as “Super Bowl 50.” Oh well!

You might be wondering why I decided against writing everything for the interpreter myself. When I developed the LISP interpreter, I somewhat struggled with handling edge cases correctly. Calls to functions that returned functions proved a little difficult and required a lot of testing before I felt confident moving forward.

If you look at the source code for this project, it’s much simpler to debug the concise .g4 files than a hand written parser. Moreover, we know that the parser and lexer base classes from which our auto-generated parsers and lexers derive from are bulletproof; they’ve been tested for years and any errors are far more likely to originate in our specification of the language than in the parser or lexer algorithm that consumes our language.

Another advantage to using a parser generator like ANTLR is portability. When we compile our .g4 file, we’ll pass along the target language, or language for which to generate code to recognize the grammar (ANTLR currently supports Java, C#, Python, Ruby among others). Had we written code for the lexer and parser by hand, we would have had to manually reimplement our algorithms anytime we decided to change the target language. It’s really great that ANTLR takes care of that for us.

Setting up with Xamarin

I understand that the instructions for setting up on Visual Studio are significantly easier. Unfortunately, I have a Mac and had some difficulty installing ANTLR to work with Xamarin. Here’s my process for reference.

Install Java (version 1.6 or higher)
Open up Xamarin and start a new project. Go to your packages and install this version of ANTLR from the package manager
Move the ANTLR .jar file for the package above to /usr/local/lib
Add an alias to execute the .jar file to your .bash_profile e.g. alias antlr4='java -jar /usr/local/lib/antlr4-csharp-4.5-SNAPSHOT-complete.jar'

You should now be good to go. Here’s my typical workflow.

Create a new project
Add a package reference to ANTLR from nuget
Create a folder called “Antlr” for the project using the package
Create a .g4 grammar file at the same level as the Antlr directory
Whenever you modify the grammar recompile the .g4 file from a terminal window with antlr4 XXX.g4 -Dlanguage=CSharp_v4_0 -o Antlr. This last option will put them in the folder created in step 3. Make sure those files are added to the project, though the grammar file need not explicitly be part of the project

A grammar for Roman numerals

Rather than writing code to parse the characters of a Roman numeral directly we’ll be writing a “grammar” to describe the language. If you’ve heard the term “meta-language” before, that’s exactly what a grammar is; a language for describing languages. I also like to think of grammars as “specification documents” or the rules and requirements for generating valid text in the language.

Roman numerals are made from combining symbols (I, V, X, L, C, D, M) and adding their values. If a preceding numeral is less than a given numeral, the total value of the pair is the given character’s value minus that of the preceding character.

Let’s take a stab at the grammar. ANTLR grammars are made up primarily of token definitions and parser definitions. The first set of rules concerns terminal expressions, the basic set of characters or keywords that are used to build more complex expressions. In our case, these will be the seven symbols upon which the whole system is based.

// --- atomic definitions
one         : 'I';
five        : 'V';
ten         : 'X';
fifty       : 'L';
oneHundred  : 'C';
fiveHundred : 'D';
oneThousand : 'M';

Now for the tricky part. We need to write rules to express any other numeral in the system. Any Roman numeral can be interpreted in decimal form by translating the symbols representing the thousands, hundreds, tens, and units places individually, and then summing the constituent results from left to right. Consider the number MMMCMLXI. Starting from left to right, we can see that the first three symbols represent 3000. The next two symbols represent 900. The next two 46 and the last represents 1. Taken together, the number represents 3961 in decimal form.

Any Roman numeral should have zero or more thousand symbols in it. In ANTLR, the operator to denote “zero or more” is (...)*. Below I’ve defined this parser definition to be root since this will be the start of the abstract syntax tree for any Roman numeral. Perhaps roman would have been a better name, but root has plenty of significance associated with it, too.

root  : (oneThousand)* hundreds? tens? units?;

Notice that the rest of the place values are strictly zero or one, represented with the (...)? operation. Here we’ve defined hundreds, tens, and units to be themselves higher level parser rules as this makes the representation of a Roman numeral (i.e. root) much simpler.

// --- I, II, III, IV, IX or V VI, VII, VIII
units : one ((one)* | five  | ten) | five (one)*; 

// --- X, XX, XXX, XL, XC or L, LX, LXX, LXXX
tens  : ten ((ten)* | fifty | oneHundred) | fifty (ten)*;

// --- C, CC, CCC, CD, CM or D, DC, DCC, DCCC 
hundreds : oneHundred ((oneHundred)* | fiveHundred | oneThousand) | fiveHundred (oneHundred)*; 

// --- skip over white spaces, tabs, newlines
WS : [ \t\r\n]+ -> skip ;

The | (pipe) character above is used like a logical OR; for a given starting token, the parser will continue based on the first rule match that succeeds.

With that we can run antlr4 Roman.g4 -Dlanguage=CSharp_v4_0 -o Antlr on the grammar file to generate the lexer/parser source code.

Visiting the ANTLR abstract syntax tree

Among the files ANTLR generated for us is the interface IRomanVisitor<T>. Let’s implement this interface for int and also extend our concrete class to inherit from AbstractParseTreeVisitor<int>.

public class RomanDecodeVisitor : AbstractParseTreeVisitor<int>, IRomanVisitor<int>
{
    public int VisitRoot(RomanParser.RootContext context)
    {
        throw new NotImplementedException();
    }

    public int VisitHundreds(RomanParser.HundredsContext context)
    {
        throw new NotImplementedException();
    }

    public int VisitTens(RomanParser.TensContext context)
    {
        throw new NotImplementedException();
    }

    public int VisitUnits(RomanParser.UnitsContext context)
    {
        throw new NotImplementedException();
    }

    public int VisitOne(RomanParser.OneContext context)
    {
        return 1;
    }

    public int VisitFive(RomanParser.FiveContext context)
    {
        return 5;
    }

    public int VisitTen(RomanParser.TenContext context)
    {
        return 10;
    }

    public int VisitFifty(RomanParser.FiftyContext context)
    {
        return 50;
    }

    public int VisitOneHundred(RomanParser.OneHundredContext context)
    {
        return 100;
    }

    public int VisitFiveHundred(RomanParser.FiveHundredContext context)
    {
        return 500;
    }

    public int VisitOneThousand(RomanParser.OneThousandContext context)
    {
        return 1000;
    }  
}

The atomic visits should be pretty straightforward, and I’ve actually implemented them in the code above. Visiting our parser rules is a little more involved.

We need an algorithm for actually doing the adding/subtracting after we’ve evaluated each of the place value numerals. Recall that we need to look at the numeral in pairs, checking whether a given numeral is less than or greater than the preceding numeral. In code, this can be done as follows:

private static int RomanSum(IEnumerable<int> nums)
{
    // our zip might not capture the last number
    // start the sum with this (e.g. zip for 'X' will be no elements)
    int sum = nums.Last();

    // look at each consecutive pair
    // add/subtract based on less than/greater than
    foreach (var pair in nums.Zip(nums.Skip(1), 
        (smaller, larger) => new { smaller, larger }))
    {
        sum += (pair.smaller < pair.larger) ? 
            -pair.smaller : pair.smaller;
    }
    return sum;
}

To interpret the remaining cases in our visitor, all we have to do at a given node is recursively visit the child nodes and return their RomanSum. Note that the Visit method is actually defined in the abstract base class from which our implementation derives.

public int VisitRoot(RomanParser.RootContext context)
{
    return RomanSum(from child in context.children select Visit(child));
}

That’s all there is to it!

Testing

I had some trouble setting up nuget unit testing on my machine. It’s easy enough to mock this functionality in a test class. Below is a simple test drive that shows how the RomanParser should be called to produce the parse tree.

public class Program
{
    public static void Main(string[] args)
    {
        BasicTest("MMMCMLXI", 3961);
        BasicTest("MMMCMLXXIV", 3974);
        BasicTest("MCMXC", 1990);
        BasicTest("MMVIII", 2008);
        BasicTest("MDCLXVI", 1666);
        BasicTest("MCMLIV", 1954);
        BasicTest("MCMLXXIV", 1974);
    }

    private static void BasicTest(string text, int expected)
    {
        // setup the lexer
        var inputStream = new AntlrInputStream(text);
        var lexer = new RomanLexer(inputStream);

        // get the tree and and traverse
        IParseTree tree = new RomanParser(new CommonTokenStream(lexer)).root();
        int actual = tree.Accept(new RomanDecodeVisitor());

        if (expected != actual)
            throw new Exception(string.Format(
                "failed to convert {0}: expected {1} got {2}", text, expected, actual));
    }
}

Excellent!

Final thoughts

Regarding validation. If input text doesn’t parse ANTLR won’t crash and will instead trigger warnings in the output window. You can override methods in the parser code to catch these warnings and handle appropriately (e.g. throw an exception). I didn’t really want to modify this code and decided against it. I’m still working on adding validation in the visitor code and will try and update when I’ve worked things out fully. My idea for this project was simply to get acquainted with ANTLR and the idea of grammars in general.

We only used two ANTLR operators for this project (+, |, *, ?). A full reference can be found here. It’s worth noting that target language code can also be included in the grammar file. I decided against this because I wanted to decouple the grammar from the interpreter. I’ll note, however, that Rosetta Code has a great sample of how you would do this project that way rather than with a visitor. To each his own.

Not too long ago, I watched a documentary on IBM Watson and the programming that went into getting the machine ready for the Jeopardy Challenge. It turns out that only a short while before the competition did someone realize that Watson wasn’t trained to recognize Roman numerals. Not sure how they ended up implementing recognition but I’m sure it was more involved than this!

Thanks for reading!