How do I convert it to a dict?
The list I have:
[
    "Mathematics-2 (21SMT-125)",
    "Mid-Semester Test-1",
    "40",
    "23.5",
    "Mid-Semester Test-2",
    "40",
    "34",
    "Disruptive Technologies - 2 (21ECH-103)",
    "Experiment-1",
    "20",
    "19",
    "Experiment-2",
    "20",
    "17",
    "Experiment-3",
    "20",
    "18.5",
]
This list of strings is parsed from HTML using bs4.
The format I want to convert it into:
{
    "Subject": {
        "Mathematics-2 (21SMT-125)": {
            "Mid-Semester Test-1": [40, 23.5],
            "Mid-Semester Test-2": [40, 34]
        },
        "Disruptive Technologies - 2 (21ECH-103)": {
            "Experiment-1": [20, 19],
            "Experiment-2": [20, 17],
            "Experiment-3": [20, 18.5]
        }
    }
}
Solution 1:[1]
The problem is that the list you provided is a flat list of items with no indicator of their hierarchical position in the desired structure.
One approach you could consider: if the entries that represent a parent object (Mathematics, etc.) are the only entries that contain parentheses, you could iterate over your list and use either string matching or a regex to identify each parent and create a top-level object for it. Each non-numeric entry that follows becomes a key under that parent, and the next two entries are added as its value, stored as a list.
This assumes that each child entry is always followed by exactly two values. If the number of values isn't fixed but they're always numeric, you could use a regex to tell numeric entries from non-numeric ones and keep adding items to the value list until you hit another non-numeric entry, which would be treated as the next sibling in the hierarchy.
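The approach described above could be sketched roughly as follows. This is only an illustration under the stated assumptions (parent entries are the only ones containing parentheses, and scores are plain integer or decimal strings); the names `build_nested` and `to_number` are made up for this example.

```python
import re

# Assumption: a score is a plain integer or decimal string such as "40" or "23.5".
NUMERIC = re.compile(r"\d+(?:\.\d+)?")


def to_number(text):
    """Return an int for whole numbers and a float for decimals."""
    return float(text) if "." in text else int(text)


def build_nested(items):
    result = {}
    current_parent = None  # dict of the subject currently being filled
    current_values = None  # score list of the child currently being filled
    for item in items:
        if NUMERIC.fullmatch(item):
            if current_values is None:
                raise ValueError(f"score {item!r} appears before any child entry")
            current_values.append(to_number(item))
        elif "(" in item:
            # Assumption: only parent entries contain parentheses.
            current_parent = result.setdefault(item, {})
            current_values = None
        else:
            if current_parent is None:
                raise ValueError(f"child {item!r} appears before any parent entry")
            current_values = current_parent.setdefault(item, [])
    return result


sample = [
    "Mathematics-2 (21SMT-125)",
    "Mid-Semester Test-1", "40", "23.5",
    "Mid-Semester Test-2", "40", "34",
    "Disruptive Technologies - 2 (21ECH-103)",
    "Experiment-1", "20", "19",
]
print({"Subject": build_nested(sample)})
```

Because scores are appended until the next non-numeric entry, the same loop handles a variable number of values per child.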
Solution 2:[2]
I would review the approach and check whether the information can be extracted with bs4 in a smarter way. Try scraping in several steps: first reach the subject, then the semester test or experiment, then the grades.
If that's not possible and the data returned from bs4 cannot be changed, the only thing you can do is determine whether each string is the name of a subject, the name of a test or experiment, or a grade/score, and walk the list with a while loop. The subject name seems to end with a special code, which can be distinguished from the name of a test or experiment using a regexp, and a grade/score can always be parsed as a number.
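The three-way distinction could look something like the sketch below. The regular expression is an assumption based only on the two subject codes in the question, and `classify` is a hypothetical helper name, so both would need adjusting for real data.

```python
import re

# Assumption: a subject name ends with a code in parentheses, e.g. "(21SMT-125)".
SUBJECT_CODE = re.compile(r"\([A-Z0-9]+-\d+\)$")


def classify(text):
    """Label a scraped string as 'subject', 'score', or 'assessment'."""
    if SUBJECT_CODE.search(text.strip()):
        return "subject"
    try:
        float(text)
    except ValueError:
        # Not a subject and not a number: treat it as a test/experiment name.
        return "assessment"
    return "score"


for token in ["Mathematics-2 (21SMT-125)", "Mid-Semester Test-1", "40", "23.5"]:
    print(f"{token!r}: {classify(token)}")
```

A loop over the list can then branch on the returned label to decide whether to start a new subject, start a new test, or record a score.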
Solution 3:[3]
For data exactly like yours (where a string with a ( denotes a top-level entry, and there are always two numbers per entry), you could come up with a state machine sort of thing like this -- but like I commented, you really should improve your parsing code instead, since the HTML you're scraping your data off is likely already structured.
def is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def parse_inp(inp):
    flat_map = {}
    stack = []
    x = 0
    while x < len(inp):
        if "(" in inp[x]:
            stack.clear()
        if is_float(inp[x]) and is_float(inp[x + 1]):
            flat_map[tuple(stack)] = (float(inp[x]), float(inp[x + 1]))
            x += 2
            stack.pop(-1)
            continue
        stack.append(inp[x])
        x += 1
    return flat_map

def nest_flat_map(flat_map):
    root = {}
    for key_path, values_list in flat_map.items():
        dst = root
        for key in key_path[:-1]:
            dst = dst.setdefault(key, {})
        dst[key_path[-1]] = values_list
    return root

inp = [
    # ... data from original post
]
nested_map = nest_flat_map(parse_inp(inp))
print(nested_map)
This outputs the expected structure (reformatted here for readability):
{
    "Mathematics-2 (21SMT-125)": {
        "Mid-Semester Test-1": (40.0, 23.5),
        "Mid-Semester Test-2": (40.0, 34.0),
    },
    "Disruptive Technologies - 2 (21ECH-103)": {
        "Experiment-1": (20.0, 19.0),
        "Experiment-2": (20.0, 17.0),
        "Experiment-3": (20.0, 18.5),
    },
}
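The structure above differs slightly from the format requested in the question: there is no top-level "Subject" key, and the scores are tuples of floats rather than lists. If the exact shape matters, a small post-processing step along these lines might help. This is a sketch only; `to_requested_shape` is a hypothetical helper, and it assumes exactly two levels of nesting.

```python
def to_requested_shape(nested):
    """Wrap the mapping under "Subject" and turn (40.0, 23.5) into [40, 23.5]."""

    def tidy(value):
        # Whole-number floats become ints so that 40.0 is emitted as 40.
        return int(value) if float(value).is_integer() else value

    return {
        "Subject": {
            subject: {name: [tidy(v) for v in scores] for name, scores in tests.items()}
            for subject, tests in nested.items()
        }
    }


example = {"Mathematics-2 (21SMT-125)": {"Mid-Semester Test-1": (40.0, 23.5)}}
print(to_requested_shape(example))
# {'Subject': {'Mathematics-2 (21SMT-125)': {'Mid-Semester Test-1': [40, 23.5]}}}
```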
Solution 4:[4]
You can use a fuzzy form of itertools.groupby to find the groups in this list of strings. This assumes that every class ends with the pattern "(classref-section)", and that it is followed by test or homework names each followed by one or more numeric scores.
source_data = [
    "Mathematics-2 (21SMT-125)",
    "Mid-Semester Test-1",
    "40",
    "23.5",
    "Mid-Semester Test-2",
    "40",
    "34",
    "Disruptive Technologies - 2 (21ECH-103)",
    "Experiment-1",
    "20",
    "19",
    "Experiment-2",
    "20",
    "17",
    "Experiment-3",
    "20",
    "18.5",
]
from collections import defaultdict
import itertools
import json
import re

class_id_pattern = re.compile(r"\([A-Z0-9]+-\d+\)")

def is_class_reference(s):
    return bool(class_id_pattern.match(s.rsplit(" ", 1)[-1]))

def group_by_class(s):
    if is_class_reference(s):
        group_by_class.current_class = s
    return group_by_class.current_class

group_by_class.current_class = ""

def convert_numeric(s):
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return None

def is_score(s):
    return convert_numeric(s) is not None

def is_test(s):
    return not is_score(s)

def group_by_test(s):
    if is_test(s):
        group_by_test.current_test = s
    return group_by_test.current_test

group_by_test.current_test = ""

accum = defaultdict(lambda: defaultdict(list))
for class_name, class_name_and_tests in itertools.groupby(source_data, key=group_by_class):
    class_name, *tests = class_name_and_tests
    for test_name, test_name_and_scores in itertools.groupby(tests, key=group_by_test):
        test_name, *scores = test_name_and_scores
        accum[class_name][test_name].extend(convert_numeric(s) for s in scores)

print(json.dumps(accum, indent=4))
Prints:
{
    "Mathematics-2 (21SMT-125)": {
        "Mid-Semester Test-1": [
            40,
            23.5
        ],
        "Mid-Semester Test-2": [
            40,
            34
        ]
    },
    "Disruptive Technologies - 2 (21ECH-103)": {
        "Experiment-1": [
            20,
            19
        ],
        "Experiment-2": [
            20,
            17
        ],
        "Experiment-3": [
            20,
            18.5
        ]
    }
}
Read more about fuzzy groupby in my blog post: https://thingspython.wordpress.com/2020/11/11/fuzzy-groupby-unusual-restaurant-part-ii/
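The key functions in this solution keep their state in function attributes, which means the state is shared across calls. One possible variation, shown here only as a sketch, keeps the state in a closure so each grouping pass gets its own fresh key function. The name `sticky_key` and the toy data are invented for this illustration.

```python
import itertools


def sticky_key(is_header):
    """Build a groupby key that repeats the most recent header seen."""
    current = None

    def key(item):
        nonlocal current
        if is_header(item):
            current = item
        return current

    return key


# Toy data: entries containing "(" act as headers for the items that follow.
data = ["A (X-1)", "t1", "5", "B (Y-2)", "t2", "7", "8"]

groups = {}
for header, group in itertools.groupby(data, key=sticky_key(lambda s: "(" in s)):
    _, *members = group  # the first element of each group is the header itself
    groups[header] = members

print(groups)
# {'A (X-1)': ['t1', '5'], 'B (Y-2)': ['t2', '7', '8']}
```

Note that items appearing before the first header would be grouped under `None`, and two identical adjacent headers would be merged into one group.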
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | bindlegrunt |
| Solution 2 | Robert Radzik |
| Solution 3 | AKX |
| Solution 4 |
