A simple example that will parse JSON in C using Flex and Bison. Forewarning, while this example works well it’s not going to handle every JSON case. I’ll highlight the limitations of what cannot be parsed towards the end of the post.

FLEX Scanner to Parse JSON

The scanner will create a token stream. Tokens are just regex matches made by flex that can also have values. For example a DECIMAL token we return a float by calling atof(yytext). Whereas, we also tokenize { to LCURLY, but there is no value. The ECHO macro is turned off, and can be turned on for further debug info.

%{
// flex file: scanner.l
#include "parser.h"
#include <string.h>

extern void yyerror(const char * message);
#define ECHO fwrite( yytext, yyleng, 1, yyout )

%}

%option noyywrap
EXP ([Ee][-+]?[0-9]+)

%%

"{" { ECHO; return LCURLY; }
"}" { ECHO; return RCURLY; }
"[" { ECHO; return LBRAC; }
"]" { ECHO; return RBRAC; }
"," { ECHO; return COMMA; }
":" { ECHO; return COLON; }
"true" { ECHO; return VTRUE; }
"false" { ECHO; return VFALSE; }
"null" { ECHO; return VNULL; }
[ \t\r\n]+ { /* eat whitespace */ }
\"[^"]*\" { yylval.string = strdup(yytext); ECHO; return STRING; }
[1-9]+\.?[0-9]*{EXP}? { ECHO; yylval.decimal = atof(yytext); return DECIMAL; }

%%

Bison Parser for JSON

Below is a Bison Parser for JSON. There are no semantic actions for each grammar rule, so action is taken to do anything. You’d have to add those to make the example useful. The code just runs through yyparse() and will return a 0 exit code if successful.

For this grammar I consulted json.org. On the main page to the right JSON grammar is provided. This form of this grammar is specified in McKeeman Form. The McKeeman Form is clean and concise. However, the grammar we define in Bison is not exactly what as specified, but follows the major cases. One, difference is how whitespace is handled.

%{

// bison file: parser.y
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern int yylineno;
extern char* yytext;
int yylex();
void yyerror(const char *s);

// gives good debug information
int yydebug=1;

%}

%token LCURLY RCURLY LBRAC RBRAC COMMA COLON
%token VTRUE VFALSE VNULL
%token <string> STRING;
%token <decimal> DECIMAL;

%union {
  char *string;
  double decimal;
}

%start json

%%

json:
    | value
    ;

value: object
     | STRING
     | DECIMAL
     | array
     | VTRUE
     | VFALSE
     | VNULL
     ;

object: LCURLY RCURLY
      | LCURLY members RCURLY
      ;

members: member
       | members COMMA member
       ;

member: STRING COLON value
      ;

array: LBRAC RBRAC
     | LBRAC values RBRAC
     ;

values: value
      | values COMMA value
      ;

%%

void
yyerror(const char *s)
{
  fprintf(stderr,"error: %s on line %d\n", s, yylineno);
}

C Example function to Parse JSON

Let’s pull it all together using our Flex Scanner and Bison Parser to parse JSON. We’ll define a C file called main.c that will read from STDIN or a file if provided. There is a section below where we test this exampe, there, I just created some example JSON files.

// c file: main.c
#include <stdio.h>
#include <stdlib.h>
#include "parser.h"

extern FILE* yyin;

int
main(int argc, char *argv[])
{
  // if a file is given read from it
  // otherwise we'll read from STDIN
  if(argc == 2)
  {
    if(!(yyin = fopen(argv[1],"r")))
    {
      perror(argv[1]);
      return EXIT_FAILURE;
    }
  }
  return yyparse();
}

Usage and Downloading

If you want to use it then simply download parse_json. From there untar it and run:

$ tar xf parse_json-1.0.tar.gz
$ cd parse_json-1.0
$ ./configure
$ make
$ ./src/parse_json some_file.json
$ ./src/parse_json <enter in text or paste in, harder this way though>

Note: Debugging is turned on so you’ll see the parser state as it pushes and pops tokens on the stack. If you want turn it off then I guess you’ll have to message me and I’ll need to make another tarball. I used flex 2.6.4 and bison version 3.5 for this. Although, I’d be surprised if it doesn’t work for much older versions, including the older lex and yacc.

Limitations

The following limitations are present:

  • Doesn’t handle the escape sequences. The escape sequences are defined as \", \\, \/, \b, \f, \n, \r, \t. This also includes hex, which would be \u[a-fA-F]{4}.
  • The Flex tool generates 8-bit parsing. This means UTF-8 will work, unless you have characters classes. The JSON specification has 0x0020 through 0x10FFFF, which is is more than 8-bit. Thus, if we’re talking UTF-16 or greater it’s not going to work unless something major is done to Flex.
  • There are probably limitations I don’t know about.

Obviously, this example only parses! There are no semantic actions associated with each grammar rule. It simply parses the JSON and exits. If it can parse the JSON it returns 0, else a non-zero exit code. Actions would have to be added to the example for your use case.

Test Cases

Here is what I used to test:

{ }
[1,1]
[2,2,2,2,2e3, 2.0, 1e-9, [1,2,3,4.0]]
{ "hello" : "world" }
{
  "people":
    {
      "first": "bob",
      "last" : "stevens",
      "children": [ "sue", "anne" ],
      "wallet": null,
      "legs": true,
      "hair": false
    }
}
{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": [
              "GML",
              "XML"
            ]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}