A simple example that parses JSON in C using Flex and Bison. A forewarning: while this example works well, it’s not going to handle every JSON case. I’ll highlight the limitations towards the end of the post.
FLEX Scanner to Parse JSON
The scanner creates a token stream. Tokens are just regex matches made by Flex, and they can also carry values. For example, for a DECIMAL token we return a float by calling atof(yytext), whereas LCURLY is tokenized with no value. The ECHO macro is turned off, but it can be turned on for further debug info.
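To make the idea of "tokens with and without values" concrete, here is a hand-written C approximation of what a scanner does. It is not the Flex scanner from this post and not Flex-generated code; the names token_kind, struct token and next_token are hypothetical, and strtod stands in for the atof(yytext) call mentioned above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token kinds; DECIMAL and LCURLY mirror the names used above. */
enum token_kind { TOK_EOF, TOK_LCURLY, TOK_RCURLY, TOK_DECIMAL, TOK_UNKNOWN };

struct token {
    enum token_kind kind;
    double value;               /* only meaningful for TOK_DECIMAL */
};

/* Reads one token from *cursor and advances it past the matched text. */
struct token next_token(const char **cursor)
{
    struct token tok = { TOK_EOF, 0.0 };
    const char *p;

    assert(cursor != NULL && *cursor != NULL);
    p = *cursor;

    while (isspace((unsigned char)*p))  /* skip whitespace between tokens */
        p++;

    if (*p == '\0') {
        tok.kind = TOK_EOF;
    } else if (*p == '{') {
        tok.kind = TOK_LCURLY;          /* a token with no value attached */
        p++;
    } else if (*p == '}') {
        tok.kind = TOK_RCURLY;
        p++;
    } else if (isdigit((unsigned char)*p) || *p == '-') {
        char *end;
        /* Plays the role of atof(yytext). Note that strtod accepts more
         * than JSON's number grammar, so this is only an approximation. */
        double v = strtod(p, &end);
        if (end == p) {                 /* e.g. a lone '-' */
            tok.kind = TOK_UNKNOWN;
            p++;
        } else {
            tok.kind = TOK_DECIMAL;
            tok.value = v;
            p = (const char *)end;
        }
    } else {
        tok.kind = TOK_UNKNOWN;
        p++;
    }

    *cursor = p;
    return tok;
}
```

Calling next_token repeatedly on the string " { 3.5 }" yields an LCURLY token, a DECIMAL token carrying 3.5, an RCURLY token, and then end of input. A generated Flex scanner hands the same kind of stream to the parser through yylex().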
Bison Parser for JSON
Below is a Bison parser for JSON. There are no semantic actions for the grammar rules, so no action is taken when a rule matches. You’d have to add those to make the example useful. The code just runs through yyparse() and will return a 0 exit code if successful.
For this grammar I consulted json.org. On the right of the main page, the JSON grammar is provided in McKeeman Form, which is clean and concise. However, the grammar we define in Bison is not exactly what is specified there, though it follows the major cases. One difference is how whitespace is handled.
C Example function to Parse JSON
Let’s pull it all together, using our Flex scanner and Bison parser to parse JSON. We’ll define a C file called main.c that reads from STDIN, or from a file if one is provided. There is a section below where we test this example; for that, I just created some example JSON files.
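As a rough sketch of the "STDIN or a file" choice, the helper below picks the input stream from the command-line arguments. The function name open_input is hypothetical and this is not the main.c from the tarball; only yyin (Flex’s input stream) and yyparse() (Bison’s entry point), mentioned in the comment, are real names from those tools.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Returns the stream to parse: the file named by argv[1] when one is given,
 * otherwise stdin. Returns NULL if the file cannot be opened.
 *
 * In a main.c linked against the generated scanner and parser, this would
 * typically be wired up roughly as:
 *
 *     yyin = open_input(argc, argv);   // yyin is Flex's input stream
 *     if (yyin == NULL) { perror(argv[1]); return 1; }
 *     return yyparse();                // 0 on success, non-zero on failure
 */
FILE *open_input(int argc, char **argv)
{
    assert(argv != NULL);

    if (argc < 2)
        return stdin;               /* no file argument: read from STDIN */

    return fopen(argv[1], "r");     /* may be NULL; the caller must check */
}
```

Returning yyparse() directly from main is what gives the 0-on-success exit code described above.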
Usage and Downloading
If you want to use it then simply download parse_json. From there untar it and run:
Note: Debugging is turned on, so you’ll see the parser state as it pushes and pops tokens on the stack. If you want to turn it off, then I guess you’ll have to message me and I’ll need to make another tarball. I used flex 2.6.4 and bison 3.5 for this, although I’d be surprised if it doesn’t work with much older versions, including the older lex and yacc.
The following limitations are present:
- Doesn’t handle the escape sequences. The escape sequences are defined as
\t. This also includes hex, which would be
- Flex generates an 8-bit scanner. This means UTF-8 will work, unless you use character classes. The JSON specification allows 0x0020 through 0x10FFFF, which is more than 8 bits. Thus, if we’re talking UTF-16 or greater, it’s not going to work unless something major is done to Flex.
- There are probably limitations I don’t know about.
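For anyone who wants to tackle the escape-sequence limitation, here is a minimal sketch of decoding a single escape in plain C. It assumes the escape set from the JSON grammar on json.org (the two-character escapes plus \u followed by four hex digits). The names hex_value and decode_escape are hypothetical and are not part of the example in this post. One option would be to call something like this from a scanner action for string tokens.

```c
#include <assert.h>
#include <stddef.h>

/* Converts one hex digit to its value, or returns -1 if it is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decodes one JSON escape sequence starting at s, which must point at the
 * backslash. On success stores the decoded value in *out and returns the
 * number of characters consumed; returns 0 if the sequence is invalid.
 * For \uXXXX the result is a single UTF-16 code unit; surrogate pairs are
 * not combined here.
 */
size_t decode_escape(const char *s, unsigned *out)
{
    assert(s != NULL && out != NULL);

    if (s[0] != '\\')
        return 0;

    switch (s[1]) {
    case '"':  *out = '"';  return 2;
    case '\\': *out = '\\'; return 2;
    case '/':  *out = '/';  return 2;
    case 'b':  *out = '\b'; return 2;
    case 'f':  *out = '\f'; return 2;
    case 'n':  *out = '\n'; return 2;
    case 'r':  *out = '\r'; return 2;
    case 't':  *out = '\t'; return 2;
    case 'u': {
        unsigned value = 0;
        int i;
        /* Stops at the first non-hex character, including the terminator. */
        for (i = 2; i < 6; i++) {
            int digit = hex_value(s[i]);
            if (digit < 0)
                return 0;
            value = value * 16 + (unsigned)digit;
        }
        *out = value;
        return 6;
    }
    default:
        return 0;
    }
}
```

This only decodes; turning a \uXXXX value into UTF-8 bytes, and pairing up surrogates, would still have to be handled separately.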
Obviously, this example only parses! There are no semantic actions associated with each grammar rule. It simply parses the JSON and exits. If it can parse the JSON it returns 0, else a non-zero exit code. Actions would have to be added to the example for your use case.
Here is what I used to test: