'Data model for transforming source into AST and back?
I am working on a custom programming language. On compiling it, the parser first converts the text into a simple stream of tokens. The tokens are then converted into a simple tree. The tree is then converted into an object graph (with holes in it, as the types haven't yet been necessarily fully figured out). The holey tree is then transformed into a compact object graph.
Then we can go further and compile it to, say, JavaScript. The compact object graph is then transformed into a JavaScript AST. The JS AST is then transformed into a "concrete" syntax tree (with whitespace and such), and then that is converted into the JS text.
So in going from text to compact object graph, there are 5 transformation steps (text -> token_list -> tree -> holey_graph -> graph). In other situations (other languages), you might have more or less.
The way I am doing this transformation now is very ad-hoc and not keeping track of line numbers, so it's impossible to really tell where an error is coming from. I would like to fix that.
In my case, I am wondering how you could create a data model to keep track of the line of text where something was defined. This way, you could report any compilation errors nicely to the developer. The way I have modeled that so far is with a sort of "folding" model as I'm calling it. The initial "fold" is on the text -> token_list transformation. For each token, it keeps track of 3 things: the line, the column, and the text length, for the token. At first you may model it like this:
{
token: 'function',
line: 10,
column: 2,
size: 8
}
But that is tying two concepts into one object: the token itself, and the "fold" as I am calling it. Really it would be better like this:
fold = {
line: 10,
column: 2,
size: 8
}
token = {
value: 'function'
}
// bind the two together.
fold.data = token
token.fold = fold
Then, you transform from token to AST node in the simple tree. That might be like:
treeNode = {
type: 'function'
}
fold = {
previous: tokenFold,
data: treeNode
}
And so connecting the dots like this. In the end, you would have a fold list, which could be traversed theorertically from the compact object graph, to the text, so if there was a compile eror when doing typechecking for example, you could report the exact line number and everything to the developer. The navigation would look something like this:
data = compactObjectGraph
.fold
.previous.previous.previous.previous
.data
data.line
data.column
data.size
In theory. But the problem is, the "compact object graph" might have been created not from a simple linear chain of inputs, but from a suite of inputs. While I have modeled this on paper so far, I am starting to think there isn't actually in reality a clear way of mapping from object to object how it was transfformed, using this sort of "fold" sort of system.
The question is, how can I define the data model to allow for getting back to the source text line/column number, given there is a complex sequence of transformations from one data structure to the next? That is, at a high level, what is a way to model this that will allow you to isolate the transformation data structures, yet be able to map from the last generated one to the first, to find how some compact object graph node was actually represented in the original source text?
Solution 1:[1]
I would create a data structure containing the filename, line and column. In C++ it may work well to store a reference to this structure, rather than copy it to many places.
There isn't really that many ways to solve this, but having a single structure that is re-usable across your other data structures is almost certainly the right solution.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mats Petersson |
