Java LR Parsing Challenges 2014-09-17

I recently gave a talk at the Parsing@SLE workshop in Västerås about LR parsing problems in Java 8. I thought it would be nice to have a short list available of all parsing challenges including previous Java versions too, so that is what you will find in this post. I will mostly not talk about the different ways to solve the parsing challenges listed here, but that may be the topic of follow up posts.

Java Evolution

Obviously the Java language has evolved a lot from it’s original form. It is crazy to think that people used Java without generics once upon a time, yet that is how it started and now we have Lambdas, the shiny new feature that everyone will soon also take for granted.

Adding Java features does have a significant cost. Ideally this cost is paid only by the language designer. Java is pretty good at not breaking old code, which is usually one of the biggest costs when a programming language changes. Providing good backward compatibility is paid for instead by investing a lot in the design effort for new features. Personally I think that Java 8 is well designed, given the constraints of backward compatibility that were placed on the design. With that said though, there are definitely a whole bunch of new parsing problems that resulted from the evolution of Java. The Java grammar is no longer an LR(1) grammar, but with some massaging and scanner hacks you can get it to parse anyway.

LR Parsing Challenges

The Java parsing challenges that I am aware of are listed below. They are described in the context of an LR parser.

1. Right Angle Brackets Have Multiple Meanings

This is one of the most well known parsing problems in Java. Whenever two or three right angle brackets are encountered right after each other a regular Java lexer will tokenize them as the bit shift operators >> or >>>. Consider the below example:

class X<A<B>> { }
class X<A<B> > { }

On the first line the right angle brackets are not tokenized individually, but on the second line they are. The parser needs to be able to parse both variations.

This is not too difficult to solve, it has been done many times before in different Java parsers. Usually the bracket nesting depth is encoded for in the parser productions, so that, for example, when parsing inside two or more nested brackets the parser can accept the signed right shift operator >>.

2. Conflict Between Generic Type Cast and Less-Than Expressions

Example of conflict:

(T<A>) x    // casting to generic type
(T<A)       // less-than expression

As you can see there is a common prefix between the two expressions. In LR parsing common prefixes are not a problem if they involve the same terminals or nonterminals. However in this case the same terminal must be reduced to two different nonterminals, RelationalExpression versus ReferenceType, and so we have a reduce-reduce conflict.

3. Conflict Between Generic Type Cast and Relational Expressions

Example of conflict:

(T<A<B>>) x   // generic type cast
(T<A<B)        // relational expressions

It is not possible to use relational operators with boolean value expressions, in other words the second line in the above example is really semantically invalid. However the JLS still allows that syntax – the expression is just discarded during type checking. Another way of putting it is that the relational operators are associative when they could really be non-associative.

The fact that relational operators are associative causes a conflict in the above example. Even if we disregard for a while the common-prefix reduce-reduce conflict you would still get a shift-reduce problem after encountering the second less-than because the parser does not know if it should reduce the previous to a relational expression or shift to build a type argument list. This is a shift-reduce conflict.

4. Ambiguous Grammar Specification for Lambda Expressions

An example of an ambiguous string:

(T) () -> a * b

This is a lambda expression being cast to some type T. The Java Language Specification (JLS) argues for lambda expressions wanting to consume as big expressions as they can, i.e., they have lowest precedence. However in the JLS grammar you can find the following productions (simplified):

Expression -> Lambda
Cast -> "(" Type ")" Lambda

Lambda being in the Expression production makes it live in the top of the expression hierarchy, meaning that it should have the lowest precedence. However, the Cast production turns this on its head and makes the above sample expression parseable as if the expression had the following logical form:

((T) () -> a) * b

This interpretation of the example is incorrect, as the JLS points out. However, the correct interpretation is also a valid parse:

(T) (() -> a * b)

If there are multiple valid parse trees of a single expression it means that the grammar is ambiguous.

5. Conflict Between the Lambda Parameter List and Parenthesized Less-Than Expressions

Example of conflicting expressions:

(T<A> x) -> x    // lambda expression
(T<A)            // less-than expression

This is essentially the same conflict as described in 2. This conflict was introduced in Java 8, as opposed to the reduce-reduce conflict discussed in 2, which was added in Java 5.

6. Reduce-Reduce Conflict for Modifiers

In Java 8 there are new nonterminals for different kinds of modifiers, for example there exist the new nonterminals InterfaceMethodModifier and ClassModifier. Both of these kinds of modifier may exist inside an interface declaration:

interface I {
    static class C { } // static is ClassModifier
    public void m();   // public is InterfaceMethodModifier
}

The problem is that both static and public are valid productions of both these nonterminals, so there is a reduce-reduce conflict whenever the parser encounters either of these modifiers, and a few other modifiers, inside an interface body.

You can solve this reduce-reduce conflict by letting all modifiers be parsed as the same nonterminal and discard incorrect modifier uses during semantic error checking, however then you will run into a shift-reduce conflict in the switch statement, which is really interesting. This could be a topic for a whole other blog post.

7. Conflict Between Method References and Relational Expressions

Example of conflicting expressions:

f(T<A, B>::m)   // method reference as argument
f(T<A, B>  m)   // two relational expression arguments

Note that the method reference is qualified by a parameterized type. This is probably a very uncommon use of method references, nevertheless it is part of the specification and a complete parser must handle it correctly.

As seen in the above example there is a common prefix between the method reference and the list of relational expressions. This case is different from the case in 2 in that the list can be arbitrarily long.

8. Conflict Between Intersection Type Cast and Bitwise-And Expressions

An intersection type cast is simply a cast expression that looks like this:

(A & B) x

There can be many types in the intersection type, and these type casts conflict with parenthesized bitwise-and expressions:

(A & B & C) x   // intersection type cast
(A & B & C)     // bitwise-and

The type of the expression is determined by what follows the closing parenthesis. This is a shift-reduce conflict as the parser can not decide whether to shift an identifier onto a type list, or reduce to a bitwise-and expression.

Conclusions

This post was just a brief rundown of the parsing problems we encountered in implementing Java parsing using a generated LALR parser. The explanations were very short and I mostly skipped describing how to solve the challenges. This was done to save space and time. Nevertheless, I hope that this was somewhat useful and/or interesting.

Personally I think the parsing problems I have described here were interesting because they show how very specific and complex parsing problems can be. It is amazing to me that the lambda expression, which is syntactically very flexible, has broken so little in the Java grammar.

Categories: Uncategorized

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.