How to scoop up non-consecutive pieces of the input and concatenate the pieces together?

Here is an XML start tag and end tag with the text Hello, world between them:

<foo>Hello, world</foo>

In XML there is something called a CDATA section. It has this unusual syntax:

<![CDATA[...]]>

A CDATA section is a wrapper around data.

If a start-tag/end-tag pair contains a CDATA section, then the data inside the start-tag/end-tag pair is the concatenation of the data outside the CDATA section with the data inside the CDATA section. For example, the content of foo:

<foo>First expression <![CDATA[A < B]]>, second expression <![CDATA[C < D + 1]]>.</foo>

is this:

First expression A < B, second expression C < D + 1.
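
To make the expected result concrete, here is a minimal C sketch of the concatenation rule described above. It is not part of the lexer. The name `flatten_cdata` is hypothetical, and the sketch assumes well-formed input with no unterminated CDATA sections and a caller-supplied output buffer that is large enough.

```c
#include <string.h>

/* Copy element content from src to dst, dropping the CDATA start and end
   markers and keeping every other character in order.
   dst must have room for at least strlen(src) + 1 bytes. */
static void flatten_cdata(const char *src, char *dst)
{
    static const char open_mark[]  = "<![CDATA[";
    static const char close_mark[] = "]]>";
    int in_cdata = 0;   /* 1 while we are between the two markers */

    while (*src != '\0') {
        if (!in_cdata && strncmp(src, open_mark, strlen(open_mark)) == 0) {
            in_cdata = 1;
            src += strlen(open_mark);       /* skip the start marker */
        } else if (in_cdata && strncmp(src, close_mark, strlen(close_mark)) == 0) {
            in_cdata = 0;
            src += strlen(close_mark);      /* skip the end marker */
        } else {
            *dst++ = *src++;                /* ordinary data: keep it */
        }
    }
    *dst = '\0';
}
```

Calling `flatten_cdata` on the content of foo shown above should produce the single string shown on the line before this sketch.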

Question: How to scoop up each piece within foo and concatenate the pieces together? That is, how to scoop up these pieces ("First expression ", "A < B", ", second expression ", "C < D + 1", ".") and concatenate them together?

Below is a lexer I created. It works fine if foo doesn't have any CDATA sections, but when foo has a CDATA section the lexer hangs.

Notice that my lexer uses yyless() and yymore(). I am imitating the example at the bottom of page 137 in the book Flex & Bison. The lexer scoops up the characters before the CDATA section plus the CDATA start syntax, then it pushes the CDATA start syntax back into the input and calls yymore(). Another rule discards the CDATA start syntax. I think this is not the right approach. What is the right way to accomplish this? Is there a way to solve this problem without using yyless() and yymore()?

%option noyywrap
%x ELEMENT_CONTENT
%{
  enum yytokentype {
    TOK_START_TAG = 258,
    TOK_END_TAG = 259,
    TOK_ELEMENT_CONTENT = 260
  };
%}
%%  
<INITIAL>{
   "<foo>"    { BEGIN(ELEMENT_CONTENT); return(TOK_START_TAG); }
   "</foo>"   { return(TOK_END_TAG); }
}

<ELEMENT_CONTENT>{
   [^<]+"<![CDATA["     { yyless(9); yymore(); }
   "<![CDATA["          { /* ignore CDATA start syntax */ }
   [^\]]+"]]>"          { yyless(3); yymore(); }
   "]]>"                { /* ignore CDATA end syntax */ }
   [^<]*                { BEGIN(INITIAL); return TOK_ELEMENT_CONTENT; }   
}
%%
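int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    printf("In the lexer\n");
    yyin = fopen(argv[1], "r");
    if (yyin == NULL) {
        perror(argv[1]);
        return 1;
    }
    int tok;
    /* yylex() returns 0 at end of input */
    while ((tok = yylex()) != 0) {
       switch (tok){
          case TOK_START_TAG:
             printf("TOK_START_TAG: %s\n", yytext);
             break;
          case TOK_END_TAG:
             printf("TOK_END_TAG: %s\n", yytext);
             break;
          case TOK_ELEMENT_CONTENT:
             printf("TOK_ELEMENT_CONTENT: %s\n", yytext);
             break;
          default:
             printf("unexpected: %s\n", yytext);
       }
    }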
    
    fclose(yyin);
    
    return 0;
}
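
One thing worth noting about the rules above: yyless(n) keeps the first n characters of yytext and pushes the rest back, so yyless(9) does not strip the 9-character CDATA start marker from the end of the match. Below is a minimal sketch of one alternative that avoids yymore() entirely: append each piece to a separate buffer from the rule actions, and use an extra start condition for the inside of a CDATA section. The names `piece_buf`, `piece_reset`, `piece_append`, `content`, and the `CDATA` start condition are all hypothetical. The helpers are plain C. The flex rules in the trailing comment are an untested illustration of how they might be wired in, and they still use yyless(0) once, to hand the end tag back to the INITIAL rules.

```c
#include <stdlib.h>
#include <string.h>

/* Growable buffer that collects the pieces of one element's content. */
struct piece_buf {
    char  *data;   /* NUL-terminated after the first append */
    size_t len;    /* bytes used, not counting the NUL */
    size_t cap;    /* bytes allocated */
};

/* Forget the collected text but keep the allocation for reuse. */
static void piece_reset(struct piece_buf *b)
{
    b->len = 0;
    if (b->data != NULL)
        b->data[0] = '\0';
}

/* Append n bytes of text; returns 0 on success, -1 if out of memory. */
static int piece_append(struct piece_buf *b, const char *text, size_t n)
{
    if (b->len + n + 1 > b->cap) {
        size_t new_cap = b->cap ? b->cap : 64;
        while (new_cap < b->len + n + 1)
            new_cap *= 2;
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, text, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

/*
 * Hypothetical flex rules calling the helpers (assumption: `content` is a
 * file-scope struct piece_buf, and CDATA is declared with %x next to
 * ELEMENT_CONTENT):
 *
 * <INITIAL>"<foo>"              { piece_reset(&content); BEGIN(ELEMENT_CONTENT); return TOK_START_TAG; }
 * <ELEMENT_CONTENT>[^<]+        { piece_append(&content, yytext, yyleng); }
 * <ELEMENT_CONTENT>"<![CDATA["  { BEGIN(CDATA); }
 * <ELEMENT_CONTENT>"</foo>"     { yyless(0); BEGIN(INITIAL); return TOK_ELEMENT_CONTENT; }
 * <CDATA>"]]>"                  { BEGIN(ELEMENT_CONTENT); }
 * <CDATA>[^\]]+|"]"             { piece_append(&content, yytext, yyleng); }
 */
```

With this layout the concatenated text would live in `content.data` rather than in yytext, so the driver would print `content.data` when it receives TOK_ELEMENT_CONTENT.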

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
