Parse a string using ANTLR4

Example: (CHGA/B234A/B231

String:
        a) Designator: 3 LETTERS
        b) Message number (OPTIONAL): 1 to 4 LETTERS, followed by A SLASH (/) followed by 1 to 4 LETTERS, followed by 3 NUMBERS indicating the serial number.
        c) Reference data (OPTIONAL): 1 to 4 LETTERS, followed by A SLASH (/) followed by 1 to 4 LETTERS, followed by 3 NUMBERS indicating the serial number.

Result: 
 CHG
 A/B234
 A/B231
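
To make the expected split concrete, here is a minimal sketch that uses a plain `java.util.regex` pattern instead of ANTLR. The class name `FormatSketch`, the method `split`, and the decision to treat the leading parenthesis as part of the designator are assumptions made for illustration only; this is not a replacement for the grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question.
public class FormatSketch {

    // One "letters / letters + 3 digits" group, e.g. A/B234.
    private static final String ID = "[A-Za-z]{1,4}/[A-Za-z]{1,4}[0-9]{3}";

    // "(" + 3-letter designator + up to two optional ID groups.
    private static final Pattern MESSAGE =
            Pattern.compile("\\(([A-Za-z]{3})(" + ID + ")?(" + ID + ")?");

    // Returns the parts that are present, or an empty list if the input does not match.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = MESSAGE.matcher(input);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                if (m.group(i) != null) {
                    parts.add(m.group(i));
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("(CHGA/B234A/B231"));
    }
}
```

For the example string this prints `[CHG, A/B234, A/B231]`. The designator is matched as exactly three letters, so the fourth letter is left for the message number.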

In grammar file:

/*
 * Parser Rules
 */

tipo3: designador idmensaje? idmensaje?;
designador: PARENTHESIS CHG;
idmensaje: LETTER4 SLASH LETTER4 DIGIT3;

/*
 * Lexer Rules
 */

CHG     : 'CHG' ;

fragment DIGIT      : [0-9] ;
fragment LETTER     : [a-zA-Z] ;

SLASH               : '/' ;
PARENTHESIS         : '(' ;

DIGIT3              : DIGIT DIGIT DIGIT ;
LETTER4             : LETTER LETTER? LETTER? LETTER? ;

But when testing the tipo3 rule, it gives me the following message:

line 1:1 missing 'CHG' at 'CHGA'

How can I parse that string in ANTLR4?



Solution 1:[1]

When it's unclear why a certain parser rule is not being matched, always start with the lexer: dump the tokens your lexer produces to stdout. Here's how you can do that:

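// I've placed your grammar in a file called T.g4 (hence the name `TLexer`)
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class DumpTokens {
  public static void main(String[] args) {
    String source = "(CHGA/B234A/B231";
    TLexer lexer = new TLexer(CharStreams.fromString(source));
    CommonTokenStream stream = new CommonTokenStream(lexer);
    stream.fill(); // make the lexer tokenize the whole input up front

    // Print each token's symbolic name followed by its text
    for (Token t : stream.getTokens()) {
      System.out.printf("%-20s `%s`%n",
          TLexer.VOCABULARY.getSymbolicName(t.getType()),
          t.getText().replace("\n", "\\n"));
    }
  }
}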

If you run the Java code above, this will be printed:

PARENTHESIS          `(`
LETTER4              `CHGA`
SLASH                `/`
LETTER4              `B`
DIGIT3               `234`
LETTER4              `A`
SLASH                `/`
LETTER4              `B`
DIGIT3               `231`
EOF                  `<EOF>`

As you can see, CHGA becomes a single LETTER4 token, not a CHG token followed by a LETTER4 token. Try changing the LETTER4 rule into LETTER4 : LETTER; and re-test. Now you'll get the expected result.

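In your current grammar, CHGA will always become a single LETTER4 token. This is just how ANTLR works: the lexer tries to consume as many characters as possible for a single rule, and only when two rules match the same number of characters does the rule defined first win. You cannot change this.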
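
The longest-match behaviour described above can be imitated in a few lines of plain Java. This is only a rough sketch built on `java.util.regex`, not ANTLR's actual lexer algorithm; the class `LongestMatchSketch` and the methods `tokenize` and `rules` are hypothetical names introduced here, and the regular expressions are hand-written stand-ins for the lexer rules in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of longest-match tokenization (not ANTLR itself).
public class LongestMatchSketch {

    // At each position, try every rule and keep the longest match;
    // on a tie, the rule listed first wins.
    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            tokens.add(bestName + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    // Stand-ins for the question's lexer rules, in declaration order.
    // The LETTER4 pattern is a parameter so both variants can be compared.
    static Map<String, Pattern> rules(String letter4Regex) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("CHG", Pattern.compile("CHG"));
        rules.put("SLASH", Pattern.compile("/"));
        rules.put("PARENTHESIS", Pattern.compile("\\("));
        rules.put("DIGIT3", Pattern.compile("[0-9]{3}"));
        rules.put("LETTER4", Pattern.compile(letter4Regex));
        return rules;
    }

    public static void main(String[] args) {
        String source = "(CHGA/B234A/B231";
        // LETTER4 as 1 to 4 letters: the four letters CHGA beat the three letters CHG.
        System.out.println(tokenize(source, rules("[a-zA-Z]{1,4}")));
        // LETTER4 as a single letter: CHG is now the longest match.
        System.out.println(tokenize(source, rules("[a-zA-Z]")));
    }
}
```

With the 1-to-4-letter variant, the second token is `LETTER4:CHGA`; with the single-letter variant, it is `CHG:CHG` followed by `LETTER4:A`, which mirrors the two token dumps discussed in this answer.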

What you could do is move the construction of the multi-letter rule to the parser instead of the lexer:

tipo3       : designador idmensaje? idmensaje?;
designador  : PARENTHESIS CHG;
idmensaje   : letter4 SLASH letter4 DIGIT3;
letter4     : LETTER LETTER? LETTER? LETTER?
            | CHG
            ;

CHG         : 'CHG' ;
LETTER      : [a-zA-Z] ;
SLASH       : '/';
PARENTHESIS : '(';
DIGIT3      : DIGIT DIGIT DIGIT;

fragment DIGIT : [0-9];

With these rules, CHG is matched as its own token and the example input parses as a tipo3.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bart Kiers