Saturday, 19 February 2011

Parsing JSON

For reasons I won't go into, I recently needed to write a JSON parser - yes, I genuinely couldn't just use Doug Crockford's json2.js. It was my first attempt at writing a parser and was an awesome learning experience. I was lucky to have some pretty smart people on hand to help with the methodology. The process is relatively simple once you know how, but I will admit it took me a while to get my head round how to approach the problem.

The task was to write something that would take a JSON string and parse it back into the JavaScript object it represents. The string might contain types nested to any depth (for example an array of objects, with each object containing other objects or arrays). I initially expected to have to work out what the top-level types were, then recursively pass their contents to the parsing functions. So I was somewhat surprised to find that the method suggested to me was simply to run through the string character by character, from left to right, keeping track of my position within the string and recursively calling the parse function each time I found a symbol that denoted the start of a new primitive.


Functions

I wrote a few helper functions:

  • ignoreTrash - which moved past trash characters in the space between primitives, such as spaces, commas and colons
  • MoveToNextCharacter - which moved the string position counter
  • amEscaping - which checked for backslashed quotes
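Something along these lines would do the job. This is only a sketch: the three helper names are the ones listed above, but the bodies, the createState function and the state object it returns are assumptions made up for illustration, as is the exact set of characters treated as trash.

```javascript
// Hypothetical: bundle the string and the position counter together.
function createState(text) {
  return { text: text, pos: 0 };
}

// Move the position counter on by one and return the new current character
// (an empty string once we run off the end).
function MoveToNextCharacter(state) {
  state.pos += 1;
  return state.text.charAt(state.pos);
}

// Skip the filler between primitives. The character set here is a guess:
// whitespace, commas and colons.
function ignoreTrash(state) {
  while (state.pos < state.text.length &&
         " \t\r\n,:".indexOf(state.text.charAt(state.pos)) !== -1) {
    MoveToNextCharacter(state);
  }
}

// True when the current character is preceded by an odd number of
// backslashes, i.e. a quote at this position is escaped.
function amEscaping(state) {
  var count = 0;
  var i = state.pos - 1;
  while (i >= 0 && state.text.charAt(i) === "\\") {
    count += 1;
    i -= 1;
  }
  return count % 2 === 1;
}
```

Counting the backslashes, rather than just checking the previous character, matters for input like \\" where the backslash is itself escaped and the quote really does end the string.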

Ok, so that list was shorter than I recalled. The parse function itself was really just a glorified if statement, looking at the next character in the string - if it is a quotation mark, call parseString; if it is an opening square bracket, call parseArray and so on.

The parse functions for each primitive would create a new object of their type, then use a while loop to move through the string until they found their corresponding terminator symbol (usually matched with a regular expression). If they found another starter character along the way, they simply called the main parse function again.
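To make that concrete, here is a minimal sketch of the whole approach. It is not the code I actually wrote: parse, parseString and parseArray are the names mentioned above, while parseJson, skipTrash, parseObject and parseLiteral are assumptions for illustration. It assumes well-formed input, and only simple escapes such as \" and \\ are handled.

```javascript
function parseJson(text) {
  var pos = 0; // current position within the string

  // Skip whitespace, commas and colons between primitives.
  function skipTrash() {
    while (pos < text.length && " \t\r\n,:".indexOf(text.charAt(pos)) !== -1) {
      pos += 1;
    }
  }

  // The glorified if statement: choose a parser from the next character.
  function parse() {
    skipTrash();
    var ch = text.charAt(pos);
    if (ch === '"') return parseString();
    if (ch === "[") return parseArray();
    if (ch === "{") return parseObject();
    return parseLiteral();
  }

  function parseString() {
    var out = "";
    pos += 1; // step past the opening quote
    while (pos < text.length && text.charAt(pos) !== '"') {
      if (text.charAt(pos) === "\\") pos += 1; // keep the escaped character as-is
      out += text.charAt(pos);
      pos += 1;
    }
    pos += 1; // step past the closing quote
    return out;
  }

  function parseArray() {
    var out = [];
    pos += 1; // step past "["
    skipTrash();
    while (pos < text.length && text.charAt(pos) !== "]") {
      out.push(parse()); // a new starter character: recurse into parse
      skipTrash();
    }
    pos += 1; // step past "]"
    return out;
  }

  function parseObject() {
    var out = {};
    pos += 1; // step past "{"
    skipTrash();
    while (pos < text.length && text.charAt(pos) !== "}") {
      var key = parseString();
      out[key] = parse(); // parse skips the colon before reading the value
      skipTrash();
    }
    pos += 1; // step past "}"
    return out;
  }

  // Numbers, true, false and null are recognised with a regular expression.
  function parseLiteral() {
    var re = /^(?:true|false|null|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)/;
    var match = re.exec(text.slice(pos));
    if (!match) throw new SyntaxError("Unexpected character at position " + pos);
    pos += match[0].length;
    if (match[0] === "true") return true;
    if (match[0] === "false") return false;
    if (match[0] === "null") return null;
    return Number(match[0]);
  }

  return parse();
}
```

Calling parseJson('[1, {"a": "b"}]') walks the string exactly once; the position counter is shared by every function, which is what lets the nested calls pick up where the caller left off.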

It all sounds pretty simple and the nice thing is that the code is really readable, but at the time it took me about a day of hard thinking to get my head round the concept. Still, it was time well spent, as I think I learned quite a lot during the process. Unfortunately, one of the things I learned here is that the browser implementation of the JavaScript regular expression engine is a real letdown on most browsers. The language seems to support groupings using parentheses, allowing lookahead and lookbehind, but I certainly couldn't get them to work.
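For reference, the syntax in question looks roughly like this. It is a small sketch with made-up sample strings and variable names, not the code I was struggling with, and it only covers groups and lookahead; lookbehind is left out because I can't vouch for it being supported.

```javascript
// Capturing groups: each pair of parentheses captures part of the match.
var date = /(\d{4})-(\d{2})-(\d{2})/.exec("Saturday, 2011-02-19");
// date[1] is "2011", date[2] is "02", date[3] is "19"

// Non-capturing group: (?:...) groups without capturing.
var nc = /(?:ab)+(c)/.exec("ababc");
// nc[0] is "ababc", nc[1] is "c"

// Positive lookahead: digits only when followed by "px"; "px" is not consumed.
var px = "10em 20px".match(/\d+(?=px)/);
// px[0] is "20"

// Negative lookahead: digits not followed by "px" (or by another digit).
var notPx = "10em 20px".match(/\d+(?!px|\d)/);
// notPx[0] is "10"

// A quoted string with backslash escapes, matched without any lookbehind:
// each step consumes either an ordinary character or a backslash pair.
var quoted = /"((?:[^"\\]|\\.)*)"/.exec('say "a \\"quoted\\" word" now');
// quoted[1] is the text between the outer quotes, backslashes included
```

The last pattern is the regular expression equivalent of the amEscaping check: because an escaped quote is consumed as a pair with its backslash, it can never be mistaken for the closing quote.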

2 comments:

  1. Look-aheads definitely work consistently well on every modern browser, as do capturing and non-capturing groups and many other useful Perl-esque regex features. Take a look at my CSS selector engine – I’ve used look-aheads quite extensively, and that code even works on IE6. Honest.

    Look-behinds, on the other hand, don’t work anywhere that I know of, for the simple reason that they aren’t part of ECMAScript, and not even Mozilla have decided that they should be added :).

  2. Ah cool, that's awesome news, as I am currently working on something that could really use them. I'll run through your code on Monday and steal as much as I can :)
