Making a lexer piece by piece and struggling to read operators correctly
I am fairly new to coding and am working on a lexer project for school. So far it splits the input into tokens, and I have been extending it to distinguish specific token types. It already recognizes comments, strings, characters, and keywords, but I am struggling to get it to read operators correctly. The code for recognizing operators is as follows:
char* operatorStr[] = {":=", "..", "<<", ">>", "<>", "<=", ">=", "**", "!=", "=>", "{:", "}:"};
else if (line[index] == '.' || line[index] == ':' || line[index] == '<' || line[index] == '>' || line[index] == '*' || line[index] == '!' || line[index] == '=' || line[index] == '{' || line[index] == '}'){
    char token[2];
    strncpy(token, &line[index], 2);
    int i;
    while (i < 12){
        i++;
        if (strcmp(operatorStr[i], token) == 0){
            lex(line, lastSpace, index + 1, 1, "Operator", size);
            lastSpace = index + 2;
            index = index + 2;
        }
    }
}
else if (line[index] == ':' || line[index] == '>' || line[index] == '<' || line[index] == '(' || line[index] == ')' || line[index] == '+' || line[index] == '-' || line[index] == '*' || line[index] == '/' || line[index] == '|' || line[index] == '&' || line[index] == ';' || line[index] == '=' || line[index] == '$' || line[index] == '@' || line[index] == '[' || line[index] == ']' || line[index] == '{' || line[index] == '}'){
    lex(line, lastSpace, index - 1, 1, "Token", size);
    lastSpace = index;
    index++;
    lex(line, lastSpace, index - 1, 1, "Operator", size);
    lastSpace = index;
    index++;
}
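For the two-character check, one option is to compare directly against the line with strncmp, so that no temporary buffer or terminator is needed, and to start the loop counter at 0 so every table entry is tested. What follows is only a minimal sketch under that assumption: the helper name is_two_char_operator is hypothetical and not part of the original code, while the table is copied from operatorStr above.

```c
#include <stddef.h>
#include <string.h>

/* Two-character operators, copied from the operatorStr table. */
static const char *const TWO_CHAR_OPS[] = {
    ":=", "..", "<<", ">>", "<>", "<=", ">=", "**", "!=", "=>", "{:", "}:"
};

/* Hypothetical helper (not in the original code): returns 1 when the two
 * characters starting at line[index] form one of the operators above,
 * and 0 otherwise. */
int is_two_char_operator(const char *line, int index, int size)
{
    size_t count = sizeof TWO_CHAR_OPS / sizeof TWO_CHAR_OPS[0];
    size_t i;

    if (index < 0 || index + 1 >= size) {
        return 0;                       /* fewer than two characters remain */
    }
    for (i = 0; i < count; i++) {       /* starts at 0, so ":=" is tested too */
        /* Compare exactly two characters; the line needs no terminator here. */
        if (strncmp(TWO_CHAR_OPS[i], &line[index], 2) == 0) {
            return 1;
        }
    }
    return 0;
}
```

With this shape, the caller decides how far to advance index based on the return value instead of adjusting it inside the search loop.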
I know this isn't the most effective way to write a lexer, but I just need my code to read the operators correctly to finish this assignment. In the calls to lex, line is the array holding the current line, lastSpace marks where the last token sent ended, index is my current position, the 1 that follows is a flag for removing leading spaces, "Operator" is the type name for the token being sent, and size is the size of the line. The intended output and the output I am getting are as follows:
Expected
Operator: :=
Operator: ..
Operator: <<
Operator: >>
Operator: <>
Operator: <=
Operator: >=
Operator: **
Operator: !=
Operator: =>
Operator: [
Operator: ]
Operator: {
Operator: }
Operator: {:
Operator: }:
Token: .
Operator: <
Operator: >
Operator: (
Operator: )
Operator: +
Operator: -
Operator: *
Operator: /
Operator: |
Operator: &
Operator: ;
Operator: ,
Operator: :
Operator: =
Operator: $
Operator: @
Operator: :=
Operator: ..
Operator: <<
Operator: >>
Operator: <>
Operator: <=
Operator: >=
Operator: **
Operator: !=
Operator: =>
Operator: [
Operator: ]
Operator: {
Operator: }
Operator: {:
Operator: }:
Token: .
Operator: >
Operator: <
Operator: (
Operator: )
Operator: +
Operator: -
Operator: *
Operator: /
Operator: |
Operator: &
Operator: ;
Operator: ,
Operator: =
Operator: :
Operator: $
Operator: @
Token: class
Operator: +
Token: seven
Token: a
Operator: +
Token: b
Keyword: if
Keyword: and
Keyword: ifc
Token: cliff
Keyword: if
Token: elsefred
Token: ifif
Operator: $
Operator: @
Operator: +
Operator: =
Operator: =>
Operator: {
Operator: {:
String: "string"
Comment: /*comment*/
String: "string"
Char: 'c'
Comment: /*comment*/
Char: 'c'
Got
Token: :=
Token: ..
Token: <<
Token: >>
Token: <>
Token: <=
Token: >=
Token: **
Token: !=
Token: =>
Operator: [
Token: ]
Token: {
Token: }
Token: {:
Token: }:
Token: .
Token: <
Token: >
Operator: (
Token: )
Operator: +
Token: -
Token: *
Operator: /
Token: |
Operator: &
Token: ;
Token: ,
Token: :
Token: =
Operator: $
Token: @
Token: :=..<<>><><=>=**!==>
Operator: [
Token: ]{}{:}:.><
Operator: (
Token: )+
Operator: -
Token: */
Operator: |
Token: &;,=:
Operator: $
Token: class
Operator: +
Token: seven
Token: a
Operator: +
Keyword: if
Keyword: and
Keyword: ifc
Token: cliff
Keyword: if
Token: elsefred
Token: ifif
Operator: $
Token: @+==>{{:
String: "string"
Comment: /*comment*/
Operator: /
Token: "string
String: "'c'/*comment*/'c'
Without the operator section, the string and comment handling works fine. Since I only have to label operators as "Operator" rather than distinguish between them, the first branch looks for the two-char operators and the second branch looks for the single-char operators. I feel like my issue is mainly with the two-char operators, since that part isn't working at all and the lastSpace/index increment is being set wrong. I have fiddled with the amount added and haven't been able to get it to work. I have also tried handling all operators in one branch, but that worked even less: some symbols aren't single-char operators themselves yet are the start of two-char operators, and they mess it up again.
Any suggestions would be helpful. The entirety of my tokenize and lex functions is:
//Breaks a line into tokens
void tokenize(char *line, int size){
    int index = 0;
    int lastSpace = 0;
    static int inComment = 0;
    char* operatorStr[] = {":=", "..", "<<", ">>", "<>", "<=", ">=", "**", "!=", "=>", "{:", "}:"};
    //printf("%s\n", line):
    for(index = 0; index < size; index++){
        if(inComment){
            while(index < size && (line[index] != '*' || line[index+1] != '/')){
                index++;
            }
            //printf("SecondPart");
            if(index == size){
                lex(line, lastSpace, index, 0, "Comment", size);
                lastSpace = index;
            }
            else{
                lex(line, lastSpace, index + 1, 0, "Comment", size);
                lastSpace = index + 2;
                inComment = 0;
            }
        }
        else if(line[index] == '/' && line[index+1] == '*'){
            lex(line, lastSpace, index - 1, 1, "Token", size);
            lastSpace = index;
            index = index + 2;
            while(index < size && (line[index] != '*' || line[index + 1] != '/')){
                index++;
            }
            //printf("FirstPart");
            if(index == size){
                lex(line, lastSpace, index - 2, 0, "Comment", size);
                lastSpace = index;
                inComment = 1;
            }
            else{
                lex(line, lastSpace, index + 1, 1, "Comment", size);
                lastSpace = index + 2;
            }
        }
        else if (line[index] == '.' || line[index] == ':' || line[index] == '<' || line[index] == '>' || line[index] == '*' || line[index] == '!' || line[index] == '=' || line[index] == '{' || line[index] == '}'){
            char token[2];
            strncpy(token, &line[index], 2);
            int i;
            while (i < 12){
                i++;
                if (strcmp(operatorStr[i], token) == 0){
                    lex(line, lastSpace, index + 1, 1, "Operator", size);
                    lastSpace = index + 2;
                    index = index + 2;
                }
            }
        }
        else if (line[index] == ':' || line[index] == '>' || line[index] == '<' || line[index] == '(' || line[index] == ')' || line[index] == '+' || line[index] == '-' || line[index] == '*' || line[index] == '/' || line[index] == '|' || line[index] == '&' || line[index] == ';' || line[index] == '=' || line[index] == '$' || line[index] == '@' || line[index] == '[' || line[index] == ']' || line[index] == '{' || line[index] == '}'){
            lex(line, lastSpace, index, 1, "Token", size);
            lastSpace = index;
            index++;
            lex(line, lastSpace, index, 1, "Operator", size);
            lastSpace = index;
            index++;
        }
        else if(line[index] == '"'){
            lex(line, lastSpace, index - 1, 1, "Token", size);
            lastSpace = index;
            index++;
            while(index < size && line[index] != '"'){
                index++;
            }
            //index++;
            lex(line, lastSpace, index, 1, "String", size);
            lastSpace = index + 1;
        }
        else if(line[index] == '\''){
            lex(line, lastSpace, index - 1, 1, "Token",size);
            lastSpace = index;
            index++;
            if(line[index] == '\\'){
                index++;
            }
            index++;
            lex(line, lastSpace, index, 1, "Char", size);
            lastSpace = index + 1;
        }
        else if(isspace(line[index])){
            lex(line, lastSpace, index - 1, 1, "Token", size);
            lastSpace = index;
        }
    }
}
//Assign meaning to tokens
void lex(char *line, int start, int end, int removeleadingspaces, char type[], int size){
    char token[MAXTOKENSIZE];
    char* keywords[] = {"accessor", "array", "and", "bool", "character", "constant", "elsif", "else", "exit", "end", "float", "func", "integer", "ifc", "if", "in", "is", "mutator", "natural", "null", "others", "out", "of", "or", "positive", "proc", "pkg", "ptr", "range", "subtype", "then", "type", "while", "when"};
    while(start < size && removeleadingspaces && isspace(line[start])){
        start++;
    }
    if(start >= end + 1){
        return;
    }
    if(end-size >= 0){
        strncpy(token, &line[start], (end + 1) - start);
    }
    else{
        strncpy(token, &line[start], (end + 1) - start);
    }
    token[(end-start) + 1] = '\0';
    int i;
    for (i = 0; i < 34; i++){
        if(strcmp(keywords[i], token) == 0){
            printf("%s: %s\n", "Keyword", token);
            return;
        }
    }
    printf("%s: %s\n", type, token);
}
Edits:
I added a null terminator to the char arr token i.e.:
char token[3];
strncpy(token, &line[index], 2);
token[3] = '\0';
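One detail worth double-checking here: a char token[3] has valid indices 0 to 2, so after copying two characters the terminator belongs at token[2], and token[3] is one past the end of the array. A minimal sketch of a bounded copy, assuming a hypothetical helper named copy_two_chars that is not part of the original code:

```c
#include <string.h>

/* Hypothetical helper (not in the original code): copies at most two
 * characters starting at line[index] into dest and always terminates the
 * result. dest must have room for three chars. */
void copy_two_chars(char dest[3], const char *line, int index, int size)
{
    int n;

    if (index < 0 || index >= size) {
        dest[0] = '\0';                 /* nothing left to copy */
        return;
    }
    n = size - index;                   /* characters remaining on the line */
    if (n > 2) {
        n = 2;
    }
    memcpy(dest, &line[index], (size_t)n);
    dest[n] = '\0';                     /* n is at most 2, so this stays in bounds */
}
```

The size check also covers the case where only one character remains on the line, where copying two characters would read past the end.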
I combined the two functions and switched from a while loop to a for loop and it is working better, but it still isn't completely working. Here is the new code:
else if (line[index] == '.' || line[index] == '!' || line[index] == ':' || line[index] == '>' || line[index] == '<' || line[index] == '(' || line[index] == ')' || line[index] == '+' || line[index] == '-' || line[index] == '*' || line[index] == '/' || line[index] == '|' || line[index] == '&' || line[index] == ';' || line[index] == '=' || line[index] == '$' || line[index] == '@' || line[index] == '[' || line[index] == ']' || line[index] == '{' || line[index] == '}'){
    if(line[index + 1] == '=' || line[index + 1] == '.' || line[index + 1] == '<' || line[index + 1] == '>' || line[index + 1] == '*' || line[index + 1] == ':'){
        char token[3];
        strncpy(token, &line[index], 2);
        token[3] = '\0';
        int i;
        for (i = 0; i < 12; i++){
            i++;
            if (strcmp(operatorStr[i], token) == 0){
                lex(line, lastSpace, index + 1, 1, "Operator", size);
                lastSpace = index + 1;
                index = index + 1;
            }
        }
    } else if (line[index] == '!' || line[index] == '.'){
        lex(line, lastSpace, index, 1, "Token", size);
        lastSpace = index;
        index++;
    } else {
        lex(line, lastSpace, index, 1, "Operator", size);
        lastSpace = index;
        index++;
    }
}
The new output:
Expected
Operator: :=
Operator: ..
Operator: <<
Operator: >>
Operator: <>
Operator: <=
Operator: >=
Operator: **
Operator: !=
Operator: =>
Operator: [
Operator: ]
Operator: {
Operator: }
Operator: {:
Operator: }:
Token: .
Operator: <
Operator: >
Operator: (
Operator: )
Operator: +
Operator: -
Operator: *
Operator: /
Operator: |
Operator: &
Operator: ;
Operator: ,
Operator: :
Operator: =
Operator: $
Operator: @
Operator: :=
Operator: ..
Operator: <<
Operator: >>
Operator: <>
Operator: <=
Operator: >=
Operator: **
Operator: !=
Operator: =>
Operator: [
Operator: ]
Operator: {
Operator: }
Operator: {:
Operator: }:
Token: .
Operator: >
Operator: <
Operator: (
Operator: )
Operator: +
Operator: -
Operator: *
Operator: /
Operator: |
Operator: &
Operator: ;
Operator: ,
Operator: =
Operator: :
Operator: $
Operator: @
Token: class
Operator: +
Token: seven
Token: a
Operator: +
Token: b
Keyword: if
Keyword: and
Keyword: ifc
Token: cliff
Keyword: if
Token: elsefred
Token: ifif
Operator: $
Operator: @
Operator: +
Operator: =
Operator: =>
Operator: {
Operator: {:
String: "string"
Comment: /*comment*/
String: "string"
Char: 'c'
Comment: /*comment*/
Char: 'c'
Got
Operator: :=
Token: = ..
Operator: . <<
Operator: < >>
Operator: > <>
Operator: > <=
Operator: = >=
Operator: = **
Operator: * !=
Operator: = =>
Operator: > [
Operator: [ ]
Operator: ] {
Operator: { }
Operator: } {:
Operator: : }:
Token: : .
Operator: . <
Operator: < >
Operator: > (
Operator: ( )
Operator: ) +
Operator: + -
Operator: - *
Operator: * /
Operator: / |
Operator: | &
Operator: & ;
Token: ; ,
Operator: :
Operator: : =
Operator: = $
Operator: $ @
Operator: :=..<<>><><=>=**
Operator: *!==>
Operator: >[]
Operator: ]{}
Operator: }{:
Operator: :}:.><
Operator: <()
Operator: )+-*
Operator: */|
Operator: |&;
Operator: ;,=:
Operator: :$@
Operator: class+
Token: +seven
Operator: a+
Token: +b
Keyword: if
Keyword: and
Keyword: ifc
Token: cliff
Keyword: if
Token: elsefred
Token: ifif
Operator: $
Operator: $@+==>
Operator: >{{:
String: "string"
Comment: /*comment*/
Token: /"string
String: "'c'/*comment*/'c'
I am currently fiddling with the iteration to try to fix the issues, but any advice would still be very nice.
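On the iteration itself, the bookkeeping may be easier to reason about if the pending word is flushed before each operator is emitted, and if index is advanced by one less than the operator length, because the enclosing for loop adds the last step. The sketch below is an assumption-laden illustration rather than a drop-in fix: operator_length, emit, and scan_line are hypothetical names, emit stands in for lex and writes to a buffer instead of printing, the single-character set is read off the expected output above, and comments, strings, and keywords are left out.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Operator sets read off the expected output in the question. */
static const char *const TWO_CHAR_OPS[] = {
    ":=", "..", "<<", ">>", "<>", "<=", ">=", "**", "!=", "=>", "{:", "}:"
};
static const char SINGLE_CHAR_OPS[] = "[]{}<>()+-*/|&;,:=$@";

/* Hypothetical helper: length of the operator at line[index] (2, 1 or 0).
 * Two-character operators are tried first so "{:" wins over "{". */
static int operator_length(const char *line, int index, int size)
{
    size_t i;

    if (index + 1 < size) {
        for (i = 0; i < sizeof TWO_CHAR_OPS / sizeof TWO_CHAR_OPS[0]; i++) {
            if (strncmp(TWO_CHAR_OPS[i], &line[index], 2) == 0) {
                return 2;
            }
        }
    }
    if (line[index] != '\0' && strchr(SINGLE_CHAR_OPS, line[index]) != NULL) {
        return 1;
    }
    return 0;
}

/* Stand-in for lex(): appends "Type: text\n" for line[start..end) to out.
 * Empty ranges are skipped. */
static void emit(char *out, size_t cap, const char *type,
                 const char *line, int start, int end)
{
    size_t used = strlen(out);

    if (end > start && used < cap) {
        snprintf(out + used, cap - used, "%s: %.*s\n",
                 type, end - start, &line[start]);
    }
}

/* Hypothetical driver: splits one line into Token and Operator entries. */
void scan_line(const char *line, int size, char *out, size_t cap)
{
    int start = 0;                      /* first character of the pending word */
    int index;

    if (cap == 0) {
        return;
    }
    out[0] = '\0';
    for (index = 0; index < size; index++) {
        int len = operator_length(line, index, size);

        if (len > 0) {
            emit(out, cap, "Token", line, start, index);        /* flush word */
            emit(out, cap, "Operator", line, index, index + len);
            start = index + len;
            index += len - 1;           /* the for loop adds the final +1 */
        } else if (isspace((unsigned char)line[index])) {
            emit(out, cap, "Token", line, start, index);
            start = index + 1;
        }
    }
    emit(out, cap, "Token", line, start, size);                 /* trailing word */
}
```

Under these assumptions, calling scan_line("a + b", 5, out, sizeof out) fills out with the three entries Token: a, Operator: + and Token: b, one per line. A lone '.' or '!' is not in either operator set, so it stays part of an ordinary token, which matches the "Token: ." line in the expected output.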
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
