'ANTLR4: Lexer.getCharIndex() return value not behaving as expected

I want to extract specific fragment of lexer rule, so I wrote the following rule:

parser grammar TestParser;

options { tokenVocab=TestLexer; }

root
    : LINE+ EOF
    ;

lexer grammar TestLexer;

@lexer::members {
    private int startIndex = 0;

    private void updateStartIndex() {
        startIndex = getCharIndex();
    }

    private void printNumber() {
        String number = _input.getText(Interval.of(startIndex, getCharIndex() - 1));
        System.out.println(number);
    }
}

LINE:                          {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} .+? NEWLINE;
OTHER:                         . -> skip;

fragment NUMBER:               [0-9]+;
fragment ANSWER:               '( ' [A-D] ' )';
fragment SPACE:                ' ';
fragment NEWLINE:              '\n';
fragment DOT:                  '.';

Execute the following code:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;

public class TestParseTest {

    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( A ) 1. haha\n" +
                "( B ) 12. hahaha\n" );
        Lexer lexer = new TestLexer(charStream);

        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();

        System.out.println(parseTree.toStringTree(parser));
    }

}

The output is as follows:

1
12
(root ( A ) 1. haha\n ( B ) 12. hahaha\n <EOF>)

At this point, the value of the fragment NUMBER is printed as expected. Then I add the fragment DOT to the lexer rule LINE:

LINE:                          {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} DOT .+? NEWLINE;

The output of the above test code is as follows:

1
1
(root ( A ) 1. haha\n ( B ) 12. hahaha\n <EOF>)

Why does the second line of output change to 1, this is what I don't understand.

If we modify the test code as follows:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;

public class TestParseTest {

    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( B ) 12. hahaha\n"+
                "( B ) 123. hahaha\n");
        Lexer lexer = new TestLexer(charStream);

        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();

        System.out.println(parseTree.toStringTree(parser));
    }

}

At this time, when LINE does not contain DOT, the output is as follows:

12
123
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)

When LINE contains DOT, the output is as follows:

12
12
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)

Update

I have submitted this issue to GitHub: Lexer.getCharIndex() return value not behaving as expected · Issue #3606 · antlr/antlr4 · GitHub

antlr4

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source

'ANTLR4: Lexer.getCharIndex() return value not behaving as expected

Update

Sources

Related Questions