Python conversion from JSON to JSONL
I want to convert a standard JSON object into a format where each line contains a separate, self-contained, valid JSON object. See JSON Lines.
JSON_file =
[{u'index': 1,
  u'no': u'A',
  u'met': u'1043205'},
 {u'index': 2,
  u'no': u'B',
  u'met': u'000031043206'},
 {u'index': 3,
  u'no': u'C',
  u'met': u'0031043207'}]
To JSONL:
{u'index': 1, u'no': u'A', u'met': u'1043205'}
{u'index': 2, u'no': u'B', u'met': u'000031043206'}
{u'index': 3, u'no': u'C', u'met': u'0031043207'}
My current solution is to read the JSON file as a text file and remove the [ from the beginning and the ] from the end, producing a valid JSON object on each line rather than one nested object spanning many lines.
Is there a more elegant solution? I suspect something could go wrong using string manipulation on the file.
The motivation is to read JSON files into an RDD on Spark. See the related question: Reading JSON with Apache Spark - `corrupt_record`
Solution 1:[1]
Your input appears to be a sequence of Python objects; it is certainly not a valid JSON document.
If you have a list of Python dictionaries, then all you have to do is dump each entry into a file separately, followed by a newline:
import json

with open('output.jsonl', 'w') as outfile:
    for entry in JSON_file:
        json.dump(entry, outfile)
        outfile.write('\n')
By default, the json module outputs JSON without embedded newlines.
Assuming your A, B and C names are really strings, that would produce:
{"index": 1, "met": "1043205", "no": "A"}
{"index": 2, "met": "000031043206", "no": "B"}
{"index": 3, "met": "0031043207", "no": "C"}
If you started with a JSON document containing a list of entries, just parse that document first with json.load() / json.loads().
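Putting that together, a minimal end-to-end sketch (the file name output.jsonl and the inline document are illustrative, not from the original post): parse the JSON document first, then dump each entry on its own line.

```python
import json

# Illustrative input: a JSON document whose top level is a list of objects
source = ('[{"index": 1, "no": "A", "met": "1043205"},'
          ' {"index": 2, "no": "B", "met": "000031043206"}]')
entries = json.loads(source)  # use json.load(infile) for a file object

with open('output.jsonl', 'w') as outfile:
    for entry in entries:
        json.dump(entry, outfile)  # one compact JSON object...
        outfile.write('\n')        # ...terminated by a newline
```

Each resulting line is independently parseable, which is exactly what JSON Lines requires.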
Solution 2:[2]
The jsonlines package is made exactly for this use case:
import jsonlines

items = [
    {'a': 1, 'b': 2},
    {'a': 123, 'b': 456},
]

with jsonlines.open('output.jsonl', 'w') as writer:
    writer.write_all(items)
(Yes, I wrote it years after the original question was posted.)
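If you would rather not add a dependency, the same write (plus a read-back round-trip check) can be sketched with the stdlib json module alone; the file name output.jsonl is just an example:

```python
import json

items = [
    {'a': 1, 'b': 2},
    {'a': 123, 'b': 456},
]

# Write: one json.dumps() result per line
with open('output.jsonl', 'w') as f:
    for item in items:
        f.write(json.dumps(item) + '\n')

# Read back: each line parses independently of the others
with open('output.jsonl') as f:
    restored = [json.loads(line) for line in f]

assert restored == items
```

This is essentially what the jsonlines package does for you, with extra validation and convenience methods on top.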
Solution 3:[3]
A simple way to do this is with the jq command in your terminal.
To install jq on Debian and derivatives:
$ sudo apt-get install jq
CentOS/RHEL users should run:
$ sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install jq -y
Basic usage:
$ jq -c '.[]' some_json.json >> output.jsonl
If you need to handle huge files, I strongly recommend the --stream flag, which makes jq parse your JSON in streaming mode. Note that with --stream the filter receives stream events rather than whole values, so the incantation changes:
$ jq -cn --stream 'fromstream(1|truncate_stream(inputs))' some_json.json >> output.jsonl
But if you need to do this inside a Python program, you can use bigjson, a useful library that parses JSON in streaming mode:
$ pip3 install bigjson
To read a huge JSON file (in my case, it was 40 GB):
import json
import bigjson

# Read the JSON file in streaming mode; keep it open while iterating
with open('input_file.json', 'rb') as f:
    json_data = bigjson.load(f)

    # Open the output file
    with open('output_file.jsonl', 'w') as outfile:
        # Iterate over the input JSON
        for data in json_data:
            # Convert each element to a plain Python dict
            dict_data = data.to_python()
            # Write it to the output file as one line
            outfile.write(json.dumps(dict_data) + "\n")
If you want, try parallelizing this code to improve performance, and post the result here :)
Documentation and source code: https://github.com/henu/bigjson
Solution 4:[4]
If you don't want a library, it's easy enough to do using json directly.
source
[
{"index": 1,"no": "A","met": "1043205"},
{"index": 2,"no": "B","met": "000031043206"},
{"index": 3,"no": "C","met": "0031043207"}
]
code
import json

with open("test.json", 'r') as infile:
    data = json.load(infile)

if len(data) > 0:
    print(json.dumps([t for t in data[0]]))
    for record in data:
        print(json.dumps([v for (k, v) in record.items()]))
result
["index", "no", "met"]
[1, "A", "1043205"]
[2, "B", "000031043206"]
[3, "C", "0031043207"]
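Note that the code above flattens each record into a header row plus value rows, not JSON Lines proper. If you want one self-contained JSON object per line, a minimal variation of the same json-only approach works; here the source document is recreated inline so the sketch is self-contained (test.json and test.jsonl are the example's own names):

```python
import json

# Recreate the source document so this example runs on its own
with open('test.json', 'w') as f:
    f.write('[{"index": 1, "no": "A", "met": "1043205"},'
            ' {"index": 2, "no": "B", "met": "000031043206"},'
            ' {"index": 3, "no": "C", "met": "0031043207"}]')

with open('test.json') as infile:
    data = json.load(infile)

# One complete JSON object per line -- true JSON Lines output
with open('test.jsonl', 'w') as outfile:
    for record in data:
        outfile.write(json.dumps(record) + '\n')
```

The choice between the two formats comes down to the consumer: the header-plus-rows form is compact and CSV-like, while one-object-per-line is what tools such as Spark expect from a .jsonl file.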
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | |
Solution 4 | Konchog |