Python: Parse concatenated JSON

I have a giant file (a few GB in size). It contains concatenated JSON, i.e. several JSON documents back to back with no delimiters (not even a comma or newline).

Does anyone know of a way I can parse this? json.load(fileobj) and json.loads(line) both fail with "extra data" errors as soon as they reach the second JSON document in the concatenation.

Even better if the solution allows for character streaming because of the giant size, but that's not necessary.

Edit: Concatenated JSON is described at https://en.wikipedia.org/wiki/JSON_streaming#Concatenated_JSON_2



Solution 1:[1]

Read the file character by character, writing each character to a temporary file. Also keep track of the brace nesting depth.

Whenever you read a } character that brings the nesting depth back to zero, you've read an entire JSON object. Close the temporary file, load it with json.load(), and start a new file.

However, if any string value in the file contains { or } characters, the count will be thrown off and this solution is too naive to work; you'll need a "real" parser that at least tracks quoted strings and backslash escapes.
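As a rough sketch of what tracking strings and escapes could look like, here is a depth counter that ignores braces inside quoted strings. The name split_concatenated_json is made up for illustration, it works on a string already in memory rather than on temporary files, and it assumes every top-level value is an object or an array.

```python
import json
from typing import Iterator


def split_concatenated_json(text: str) -> Iterator[str]:
    """Yield each top-level JSON object or array in text as a substring.

    Hypothetical helper; assumes top-level values are objects or arrays.
    """
    depth = 0          # current nesting depth of {} and []
    in_string = False  # True while inside a double-quoted string
    escaped = False    # True right after a backslash inside a string
    start = None       # index where the current top-level value began

    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False   # this character is escaped, so skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue              # braces inside strings never count
        if ch == '"':
            in_string = True
        elif ch in "{[":
            if depth == 0:
                start = i
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0 and start is not None:
                yield text[start:i + 1]
                start = None


# Three documents back to back, with braces and an escaped quote in strings.
sample = '{"a": "}{"}{"b": [1, {"c": "\\"}"}]}[1, 2]'
docs = [json.loads(chunk) for chunk in split_concatenated_json(sample)]
print(docs)
```

The same three pieces of state (depth, in_string, escaped) could be carried across reads if you feed it characters from a file instead of one big string.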

Solution 2:[2]

The other answer suggests keeping track of brace nesting. That is hard to get right (although not nearly as bad as XML).

An easier solution is to realize that when parsing fails with "Extra data", the json.JSONDecodeError exception has a pos attribute which says where that extra data starts. That extra data is your second document. Hence, you want to parse the substring before pos on its own, then repeat the process on the rest.

Recursive solution to show the idea:

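import json

def parseConcatenatedJSON(s: str) -> list:
    """Return every JSON document found back to back in s."""
    try:
        return [json.loads(s)]  # base case: exactly one document left
    except json.JSONDecodeError as jde:
        if jde.msg != "Extra data":
            raise  # genuinely malformed input, not a concatenation
        head = s[:jde.pos]  # the first complete document
        tail = s[jde.pos:]  # everything after it
        return [json.loads(head)] + parseConcatenatedJSON(tail)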
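
The recursive version parses every document twice and will hit Python's recursion limit on a file with many documents. A possible alternative, sketched below, uses the standard library's json.JSONDecoder.raw_decode, which returns the parsed value together with the index where it stopped, and reads the file in chunks so the whole thing never has to sit in memory. The names iter_concatenated_json and chunk_size are made up for illustration, and the sketch assumes every top-level value is an object or an array (a bare number split across two chunks would be misread as two numbers).

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_concatenated_json(fileobj: TextIO, chunk_size: int = 65536) -> Iterator[Any]:
    """Yield each JSON document from fileobj, reading chunk_size characters at a time.

    Hypothetical helper; assumes top-level values are objects or arrays.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_size)  # empty string means end of file
        buffer += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos == len(buffer):
                break
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # end of file and the remainder still does not parse
                break      # document is probably incomplete; read more data
            yield obj
        buffer = buffer[pos:]  # keep only the unparsed remainder
        if not chunk:
            return


# Usage with an in-memory stream and a deliberately tiny chunk size.
stream = io.StringIO('{"a": 1}{"b": [2, 3]} {"c": "}{"}')
docs = list(iter_concatenated_json(stream, chunk_size=4))
print(docs)
```

With a real file you would pass the object returned by open(path, encoding="utf-8") instead of the StringIO.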

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 John Gordon
Solution 2