Parse a file which is a list of objects in Python [closed]

I have a JSON-like file in the format below. I would like to store the BLEU scores in one list and the chrF2++ scores in another list.

The file format:

[
{
 "name": "BLEU",
 "score": 38.8,
 "signature": "nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "75.0/45.5/30.0/22.2 (BP = 1.000 ratio = 1.000 hyp_len = 12 ref_len = 12)",
 "nrefs": "1",
 "case": "lc",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
},
{
 "name": "chrF2++",
 "score": 49.6,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "2",
 "space": "no",
 "version": "2.0.0"
}
]
[
{
 "name": "BLEU",
 "score": 19.2,
 "signature": "nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "61.5/33.3/18.2/5.0 (BP = 0.926 ratio = 0.929 hyp_len = 13 ref_len = 14)",
 "nrefs": "1",
 "case": "lc",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
},
{
 "name": "chrF2++",
 "score": 38.8,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "2",
 "space": "no",
 "version": "2.0.0"
}
]
....

I tried:

import json
import sys

bleuScores = []
chrfScores = []

with open(sys.argv[1]) as f:
    for jsonObj in f:  # iterates over the file one line at a time
        list_of_scores = json.loads(jsonObj)
        print(list_of_scores)
        bleuScores.append(list_of_scores[0])
        chrfScores.append(list_of_scores[1])

but it did not work.



Solution 1:[1]

Your data format is almost JSON, except that the file contains multiple lists back to back, with no structure around them.

Your format, abbreviated:

[
  {"some": "dict"}
]
[
  {"some": "dict"}
]

Valid JSON:

[
  [
    {"some": "dict"}
  ],
  [
    {"some": "dict"}
  ]
]

So one approach is to wrap the full content in square brackets and to replace every closing square bracket that is followed by nothing but whitespace (including newlines) and then an opening square bracket with ],[

Of course, a limitation of this approach is that a string value like "oh ] [ no" would also be modified. Excluding anything in double quotes might be an added requirement, but that goes beyond the scope of your question.

A solution might look like:

import re
import json


def fix_content(s):
    # Match "]" and "[" separated by any amount of whitespace, including none.
    s = re.sub(r'\]\s*\[', '],\n[', s)
    return f'[{s}]'


with open('mess.json') as f:
    data = json.loads(fix_content(f.read()))
    for some_list in data:
        for d in some_list:
            print(d)
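
As a quick check that does not need a file on disk, the same idea can be run on an in-memory string. This is only a sketch: the sample below is an abbreviated stand-in for the real file, and the pattern used here is a variant that tolerates any amount of whitespace between the brackets.

```python
import json
import re


def fix_content(s):
    # Join back-to-back lists with a comma, then wrap everything in one outer list.
    s = re.sub(r'\]\s*\[', '],\n[', s)
    return f'[{s}]'


# Abbreviated sample in the same shape as the file from the question.
sample = '''[
  {"name": "BLEU", "score": 38.8},
  {"name": "chrF2++", "score": 49.6}
]
[
  {"name": "BLEU", "score": 19.2},
  {"name": "chrF2++", "score": 38.8}
]
'''

data = json.loads(fix_content(sample))
print(len(data))            # number of top-level lists recovered
print(data[1][0]['score'])  # BLEU score of the second block
```

Each element of data is one of the original lists, so the nested loop shown above works on it unchanged.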

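To get those two lists of scores:

# Map each block's metric names to their scores, then pull out one list per metric.
score_maps = [{d['name']: d['score'] for d in part} for part in data]
BLEUs = [m['BLEU'] for m in score_maps]
chrF2s = [m['chrF2++'] for m in score_maps]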
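
If brackets inside string values are a concern, the limitation mentioned earlier can be sidestepped by not editing the text at all. The sketch below is an alternative, not part of the original answer: it uses the standard library's json.JSONDecoder.raw_decode to read one top-level value at a time. The helper name parse_concatenated and the sample string are made up for illustration, and it assumes the file contains nothing but JSON values separated by whitespace.

```python
import json


def parse_concatenated(text):
    """Return every top-level JSON value found back to back in text."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


# Hypothetical input whose first string value contains "] [".
sample = '[{"name": "BLEU", "note": "oh ] [ no"}]\n[{"name": "BLEU", "note": "ok"}]\n'
parts = parse_concatenated(sample)
print(len(parts))
print(parts[0][0]['note'])
```

Because the JSON parser itself decides where each value ends, string contents are never touched. The result has the same list-of-lists shape as data above.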

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1