ANTLR4: Lexer.getCharIndex() return value not behaving as expected
I want to extract the text matched by a specific fragment inside a lexer rule, so I wrote the following grammars:
parser grammar TestParser;
options { tokenVocab=TestLexer; }
root
    : LINE+ EOF
    ;
lexer grammar TestLexer;
@lexer::members {
    private int startIndex = 0;

    private void updateStartIndex() {
        startIndex = getCharIndex();
    }

    private void printNumber() {
        String number = _input.getText(Interval.of(startIndex, getCharIndex() - 1));
        System.out.println(number);
    }
}
LINE: {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} .+? NEWLINE;
OTHER: . -> skip;
fragment NUMBER: [0-9]+;
fragment ANSWER: '( ' [A-D] ' )';
fragment SPACE: ' ';
fragment NEWLINE: '\n';
fragment DOT: '.';
I then run the following test code:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;
public class TestParseTest {
    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( A ) 1. haha\n" +
                "( B ) 12. hahaha\n");
        Lexer lexer = new TestLexer(charStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();
        System.out.println(parseTree.toStringTree(parser));
    }
}
The output is as follows:
1
12
(root ( A ) 1. haha\n ( B ) 12. hahaha\n <EOF>)
At this point, the value of the fragment NUMBER is printed as expected. Then I add the fragment DOT to the lexer rule LINE:
LINE: {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} DOT .+? NEWLINE;
The output of the above test code is as follows:
1
1
(root ( A ) 1. haha\n ( B ) 12. hahaha\n <EOF>)
Why does the second line of output change from 12 to 1? This is what I don't understand.
The same thing happens with different input. If I change the test code as follows:
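To make the expectation concrete, here is a minimal sketch in plain Java (no ANTLR involved) that computes which characters the interval between the two actions should cover for each line. The class and method names are hypothetical and exist only for this illustration; the sketch assumes every line starts with the 6-character "( X ) " prefix, as in the test input.

```java
public class ExpectedIndices {
    // Returns the digits that directly follow the "( X ) " prefix of the
    // line beginning at lineStart. This is the text the lexer actions are
    // expected to print for that line.
    static String expectedNumber(String input, int lineStart) {
        // Position where updateStartIndex() is expected to run.
        int start = lineStart + "( A ) ".length();
        // Position where printNumber() is expected to run (end of the digits).
        int end = start;
        while (end < input.length() && Character.isDigit(input.charAt(end))) {
            end++;
        }
        return input.substring(start, end);
    }

    public static void main(String[] args) {
        String input = "( A ) 1. haha\n" + "( B ) 12. hahaha\n";
        int secondLine = input.indexOf('\n') + 1;
        System.out.println(expectedNumber(input, 0));
        System.out.println(expectedNumber(input, secondLine));
    }
}
```

For the second line the expected interval covers two characters, whereas the lexer printed only one, the same length as the number on the first line.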
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;
public class TestParseTest {
    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( B ) 12. hahaha\n" +
                "( B ) 123. hahaha\n");
        Lexer lexer = new TestLexer(charStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();
        System.out.println(parseTree.toStringTree(parser));
    }
}
With this input, when LINE does not contain DOT, the output is as follows:
12
123
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)
When LINE contains DOT, the output is as follows:
12
12
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)
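In both runs the tree shows that the LINE token text itself is complete; only the text printed from the mid-rule action is short. As a possible workaround while the cause is unclear, the number could be taken from the finished token text instead of from getCharIndex(). The following is only a sketch under that assumption: the class name, method name, and regular expression are hypothetical, and the string passed in stands for what Token.getText() would return for a LINE token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineNumberExtractor {
    // Mirrors ANSWER SPACE NUMBER DOT from the grammar: "( X ) ", digits, ".".
    private static final Pattern LINE_PREFIX =
            Pattern.compile("^\\( [A-D] \\) ([0-9]+)\\.");

    // Returns the digits of the line, or null if the text does not match.
    static String numberOf(String lineText) {
        Matcher m = LINE_PREFIX.matcher(lineText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(numberOf("( B ) 12. hahaha\n"));
        System.out.println(numberOf("( B ) 123. hahaha\n"));
    }
}
```

This does not explain the getCharIndex() behavior; it only avoids depending on the input position at the moment a mid-rule action runs.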
Update
I have submitted this issue to GitHub: Lexer.getCharIndex() return value not behaving as expected · Issue #3606 · antlr/antlr4 · GitHub
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
