Improved Scanning of JSON with Flex/Bison
Using Flex/Bison in the last post we Parsed JSON into an AST. In the first post we setup a Simple Example to Parse JSON. In this post we will improve our Flex Scanner to tokenize and validate more JSON cases. We will continue to align with the grammar at json.org. We will add rules and tokens for DQUOTE
(double-quote), string
, HEX
, and CHARACTERS
not in our previous implementation.
Some Examples of Valid JSON
Here are some examples that the previous Flex scanner does not support; but the new implementation will. Many of them contrived and strange, but nevertheless they should be supported.
{ "some hex": "\uBEEF" }
{"":""}
{"\b": "\\", "\"quoted\"": "thing"}
{"x": -0}
The largest improvements are to support hex in the form of \uXXXX
and support additional characters between the quotes. These additional characters are the escape characters in JSON of " \ / b f n r t
which are all escaped with \
. See the grammar at json.org for another reference.
Changes to the Token Stream
Previously, our scanner would include the quotes as part of a STRING
token. For example we would have "hello"
as a single token. We would just tokenize everything as a char *
between two literal double quotes. This char *
would include the double quotes. Below is the regex that includes the quotes.
\"[^"]*\" { yylval.string = strdup(yytext); return STRING; }
This isn’t aligned to the following grammar:
string: " characters "
characters: ""
| characters character
j
Now, we can tokenize the quotes and the characters between the quotes seperately. We can put our scanner into a state called QUOTED
and then we’ll read the characters until we hit another double quote. Once we reach and end quote we’ll go back to the INITIAL
state. All the characters between the quotes will be returned in a token called CHARACTERS
. This tokenization will look like DQUOTE CHARACTERS DQUOTE
where the CHARACTERS
are a char *
. See the following where we have a regex just for a single quote, which starts a Flex state, then tokenizes the characters then the state ends with another single quote.
\" { BEGIN QUOTED; return DQUOTE; }
<QUOTED>\" { BEGIN INITIAL; return DQUOTE; }
<QUOTED>\\u[a-fA-F0-9]{4} { yylval.characters = strdup(yytext); return HEX; }
<QUOTED>([\x20-\x7e]{-}[\"\\]|\\[\\"\/bfnrt])+ { yylval.characters = strdup(yytext); return CHARACTERS; }
Remember that Flex will tokenize the longest regex match which is the concept of precedence here.
New Flex Scanner File
Here is our new scanner file. Note, we will still handle decimal numbers and integers separately. I highlighted the major lines that changed between the previous scanner.
|
|
While writing this post I realized that a valid JSON number is -0
which is strange, but accepted.
New Parser File
Now instead of having a STRING
token of type char *
we will return a CHARACTERS
token of type char *
. Our new string
type will be the rule on lines 131-139. The parser is less impacted than the scanner for these improvements. This also illustrates a good architecture and decoupling between scanning and parsing.
|
|
Downloading and Usage
Download the full source of parse_json.
To use it do the following.
$ tar xf parse_json-1.2.tar.gz
$ cd parse_json
$ ./configure
$ make
$ ./src/parse_json some_file.json
The files are all inside the src
folder. See the following files that were changed in this post.
- scanner.l the Flex Scanner
- parser.y the Bison Parser