Improved Scanning of JSON with Flex/Bison

2020-10-27 flex/bison json

Using Flex/Bison in the last post we Parsed JSON into an AST. In the first post we setup a Simple Example to Parse JSON. In this post we will improve our Flex Scanner to tokenize and validate more JSON cases. We will continue to align with the grammar at json.org. We will add rules and tokens for DQUOTE (double-quote), string, HEX, and CHARACTERS not in our previous implementation.

Some Examples of Valid JSON

Here are some examples that the previous Flex scanner does not support; but the new implementation will. Many of them contrived and strange, but nevertheless they should be supported.

{ "some hex": "\uBEEF" }
{"":""}
{"\b": "\\", "\"quoted\"": "thing"}
{"x": -0}

The largest improvements are to support hex in the form of \uXXXX and support additional characters between the quotes. These additional characters are the escape characters in JSON of " \ / b f n r t which are all escaped with \. See the grammar at json.org for another reference.

Changes to the Token Stream

Previously, our scanner would include the quotes as part of a STRING token. For example we would have "hello" as a single token. We would just tokenize everything as a char * between two literal double quotes. This char * would include the double quotes. Below is the regex that includes the quotes.

\"[^"]*\" { yylval.string = strdup(yytext); return STRING; }

This isn’t aligned to the following grammar:

string: " characters "

characters: ""
          | characters character

j

Now, we can tokenize the quotes and the characters between the quotes seperately. We can put our scanner into a state called QUOTED and then we’ll read the characters until we hit another double quote. Once we reach and end quote we’ll go back to the INITIAL state. All the characters between the quotes will be returned in a token called CHARACTERS. This tokenization will look like DQUOTE CHARACTERS DQUOTE where the CHARACTERS are a char *. See the following where we have a regex just for a single quote, which starts a Flex state, then tokenizes the characters then the state ends with another single quote.

\" { BEGIN QUOTED; return DQUOTE; }
<QUOTED>\" { BEGIN INITIAL; return DQUOTE; }
<QUOTED>\\u[a-fA-F0-9]{4} { yylval.characters = strdup(yytext); return HEX; }
<QUOTED>([\x20-\x7e]{-}[\"\\]|\\[\\"\/bfnrt])+ { yylval.characters = strdup(yytext); return CHARACTERS; }

Remember that Flex will tokenize the longest regex match which is the concept of precedence here.

New Flex Scanner File

Here is our new scanner file. Note, we will still handle decimal numbers and integers separately. I highlighted the major lines that changed between the previous scanner.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
%{
// filename: scanner.l
#include "parser.h"
#include <string.h>
#include <math.h>

extern void yyerror(const char * message);

%}

%option noyywrap
%x QUOTED
EXP ([Ee][-+]?[0-9]+)

%%

"{" { return LCURLY; }
"}" { return RCURLY; }
"[" { return LBRAC; }
"]" { return RBRAC; }
"," { return COMMA; }
":" { return COLON; }
"true" { return VTRUE; }
"false" { return VFALSE; }
"null" { return VNULL; }
[ \t\r\n]+ { /* eat whitespace */ }
\" { BEGIN QUOTED; return DQUOTE; }
<QUOTED>\" { BEGIN INITIAL; return DQUOTE; }
<QUOTED>\\u[a-fA-F0-9]{4} { yylval.characters = strdup(yytext); return HEX; }
<QUOTED>([\x20-\x7e]{-}[\"\\]|\\[\\"\/bfnrt])+ { yylval.characters = strdup(yytext); return CHARACTERS; }
\-?[0-9]+[0-9]* {
    yylval.integer = atoi(yytext);
    return NUMBER_INTEGER;
}
\-?[0-9]+"."?[0-9]*{EXP}? {
  float f = atof(yytext);
  yylval.decimal = f;
  return NUMBER_FLOAT;
}

%%

While writing this post I realized that a valid JSON number is -0 which is strange, but accepted.

New Parser File

Now instead of having a STRING token of type char * we will return a CHARACTERS token of type char *. Our new string type will be the rule on lines 131-139. The parser is less impacted than the scanner for these improvements. This also illustrates a good architecture and decoupling between scanning and parsing.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
%{

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "ast.h"
#include "list.h"

extern struct value_type *json;

extern int yylineno;
extern char* yytext;
int yylex();
void yyerror(const char *s);

// gives good debug information
int yydebug=0;

%}

%start json

%union {
  char *characters;
  char *str;
  float decimal;
  int integer;
  struct value_type *v;
  struct member_type *memb;
  struct List *membs;
  struct List *elems;
};

%token LCURLY RCURLY LBRAC RBRAC COMMA COLON DQUOTE
%token VTRUE VFALSE VNULL
%token <characters> CHARACTERS;
%token <characters> HEX;
%token <decimal> NUMBER_FLOAT;
%token <integer> NUMBER_INTEGER;

%type <v> value object array json element
%type <str> string
%type <memb> member
%type <membs> members
%type <elems> elements

%%

json:
    {
     $$ = NULL;
    }
    | value
    {
      $$ = $1;
      json = $$;
    }
    ;

value: object {
        $$ = $1;
     }
     | array {
        $$ = $1;
     }
     | string {
        $$ = new_value_string($1);
     }
     | NUMBER_FLOAT {
       $$ = new_value_number($1);
     }
     | NUMBER_INTEGER {
       $$ = new_value_integer($1);
     }
     | VTRUE {
       $$ = new_value_boolean(1);
     }
     | VFALSE {
       $$ = new_value_boolean(0);
     }
     | VNULL {
       $$ = new_value_null();
     }
     ;

object: LCURLY RCURLY {
        $$ = new_value_object(NULL);
      }
      | LCURLY members RCURLY {
        $$ = new_value_object($2);
      }
      ;

members: member {
         $$ = new_members();
         list_add_last($$, $1);
       }
       | members COMMA member {
         list_add_last($1, $3);
       }
       ;

member: string COLON value {
        $$ = new_member_type($1, $3);
      }
      ;

array: LBRAC RBRAC {
       $$ = new_value_array(NULL);
     }
     | LBRAC elements RBRAC {
       $$ = new_value_array($2);
     }
     ;

elements: element {
          $$ = new_elements();
          new_array_element($$, $1);
        }
        | elements COMMA element {
          new_array_element($1, $3);
        }
        ;

element: value {
         $$ = $1;
       }
       ;

string: DQUOTE DQUOTE {
          $$ = strdup("\0");
        }
        | DQUOTE CHARACTERS DQUOTE {
          $$ = $2;
        }
        | DQUOTE HEX DQUOTE {
          $$ = $2;
        }
        ;

%%

void
yyerror(const char *s)
{
  fprintf(stderr,"error: %s on line %d\n", s, yylineno);
}

Downloading and Usage

Download the full source of parse_json.

To use it do the following.

$ tar xf parse_json-1.2.tar.gz
$ cd parse_json
$ ./configure
$ make
$ ./src/parse_json some_file.json

The files are all inside the src folder. See the following files that were changed in this post.

  • scanner.l the Flex Scanner
  • parser.y the Bison Parser