What is the best way to handle overlapping lexer patterns that are sensitive to context?
I'm attempting to write an ANTLR grammar to parse the C4 DSL. However, the DSL is very open ended in a number of places, which results in overlapping lexer rules (in the sense that multiple token rules can match the same input).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
properties {
xyz "a string property"
nonstring nodoublequotes
}
}
The issue I'm running into is that the rules for the <name> and <value> have to be very broad: basically anything except whitespace. Also, property values containing spaces are wrapped in double quotes, so they match my STRING token.
My current solution is the grammar below, using property_value : BLOB | STRING ; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens, I would define NAME and VALUE tokens instead. In the actual grammar I define case-insensitive keyword tokens for things like workspace and properties. This lets me easily match the existing DSL semantics, but adds the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to
[@0,0:8='workspace',<'workspace'>,1:0]
[@1,10:15='"Name"',<STRING>,1:10]
[@2,17:29='"Description"',<STRING>,1:17]
[@3,31:31='{',<'{'>,1:31]
[@4,32:32='\n',<NL>,1:32]
[@5,37:46='properties',<'properties'>,2:4]
[@6,48:48='{',<'{'>,2:15]
[@7,49:49='\n',<NL>,2:16]
[@8,58:60='xyz',<BLOB>,3:8]
[@9,62:80='"a string property"',<STRING>,3:12]
[@10,81:81='\n',<NL>,3:31]
[@11,90:98='nonstring',<BLOB>,4:8]
[@12,100:113='nodoublequotes',<BLOB>,4:18]
[@13,114:114='\n',<NL>,4:32]
[@14,119:119='}',<'}'>,5:4]
[@15,120:120='\n',<NL>,5:5]
[@16,121:121='}',<'}'>,6:0]
[@17,122:122='\n',<NL>,6:1]
[@18,123:122='<EOF>',<EOF>,7:0]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar I expect to end up with a lot of BLOB tokens, since defining a narrower token in the lexer would be pointless: BLOB would match the input instead.
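For reference, here is a rough sketch of what I mean by context-sensitive lexer tokens, using ANTLR's lexer modes. This is only a sketch I have not fully worked out (modes require splitting out a separate lexer grammar, and the PROP_* rule names below are just placeholders I made up): entering the properties block pushes a mode in which a deliberately broad word token is allowed, without that token leaking into the rest of the grammar.

lexer grammar c4mceLexer;

K_WORKSPACE  : 'workspace' ;
// entering a properties block switches the lexer into PROP_MODE
K_PROPERTIES : 'properties' -> pushMode(PROP_MODE) ;
LBRACE  : '{' ;
RBRACE  : '}' ;
STRING  : '"' (~["\\\r\n] | '\\' .)* '"' ;
BLOB    : [\p{Alpha}]+ ;
NL      : '\r'? '\n' ;
WS      : [ \t]+ -> skip ;

mode PROP_MODE;
PROP_LBRACE : '{' -> type(LBRACE) ;
PROP_RBRACE : '}' -> type(RBRACE), popMode ;   // closing brace leaves the mode
PROP_STRING : '"' (~["\\\r\n] | '\\' .)* '"' -> type(STRING) ;
PROP_WORD   : ~[ \t\r\n"{}]+ ;                 // anything except whitespace, quotes and braces
PROP_NL     : '\r'? '\n' -> type(NL) ;
PROP_WS     : [ \t]+ -> skip ;

The parser grammar would then use this token vocabulary (options { tokenVocab = c4mceLexer; }) and accept PROP_WORD | STRING where names and values are expected. The only parser-level alternative I can see is to keep a single token set and list the keyword tokens wherever a name or value is allowed, e.g. property_name : BLOB | K_WORKSPACE | K_PROPERTIES ;, but that list has to be kept in sync with every keyword I add.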
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow