Solr: indexing nested JSON files + some fields independent of UniqueKey (need new core?)

I am working on an NLP project and have a large amount of text data to index with Solr. I have already created an initial index (Solr core) with the fields title, authors, publication date, and abstract. There is an ID that is unique to each article (PMID). Since then, I have extracted more information from the dataset, and I am stuck on how to incorporate this new information into the existing index. I don't know how to approach the problem and would appreciate suggestions.

The new information is currently stored in JSON files that look like this:

{id: {entity: [[33, 39, 0, subj], [103, 115, 1, obj], ...],
      another_entity: [[88, 95, 0, subj], [444, 449, 1, obj], ...],
      ...},
another id,
...}

where the first two integers are the start and end character offsets of the entity, the third is the index of the sentence the entity appears in, and the last element is its dependency tag.

Is there a way to have something like subfields in Solr? Since the id is the same as the unique key in the main index I was thinking of adding a field entities, but then this field would need to have its own subfields start character, end character, sentence index, dependency tag. I have come across Nested Child Documents and I am considering changing the structure of the extracted information to:

{id: {entity: [{start:33, end:39, sent_idx:0, dep_tag:'subj'}, 
               {start:103, end:115, sent_idx:1, dep_tag:'obj'}, ...],
      another_entity: [{}, {}, ...],
      ...},
another id,
...}
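The restructuring described above could be sketched in Python as follows. This is only a rough illustration: the `entities` key, the `entity_name` field, the way child ids are built, and the sample PMID `12345` are all assumptions made for the example, and whether Solr accepts this exact layout depends on the Solr version and schema in use.

```python
import json


def to_nested_docs(raw):
    """Convert {pmid: {entity: [[start, end, sent_idx, dep_tag], ...]}}
    into a list of parent documents, each carrying its mentions as children."""
    docs = []
    for pmid, entities in raw.items():
        children = []
        for name, mentions in entities.items():
            for i, (start, end, sent_idx, dep_tag) in enumerate(mentions):
                children.append({
                    # Hypothetical id scheme: each child gets its own unique key.
                    "id": f"{pmid}_{name}_{i}",
                    "entity_name": name,  # assumed field name
                    "start": start,
                    "end": end,
                    "sent_idx": sent_idx,
                    "dep_tag": dep_tag,
                })
        # "entities" is an assumed name for the field holding the children.
        docs.append({"id": pmid, "entities": children})
    return docs


# Made-up sample in the original list-of-lists layout.
raw = {"12345": {"entity": [[33, 39, 0, "subj"], [103, 115, 1, "obj"]]}}
docs = to_nested_docs(raw)
print(json.dumps(docs, indent=2))
```

One thing to keep in mind with this layout: sending a document whose unique key already exists in the index replaces the existing document, so the original fields (title, abstract, and so on) would have to be included again or added through some form of partial update.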

Having keys for the nested values, I should be able to use the methods linked above, though I am still unsure whether I am on the right track here. Is there a better way to approach this? All fields should be searchable. I am familiar with Python, and so far I have been using the subprocess library to post documents to Solr from a Python script:

import subprocess as sp
sp.Popen(f"./post -c {core_name} {json_path}", shell=True, cwd=SOLR_BIN_DIR)
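A slightly more defensive variant of the same call is sketched below. It passes the command as an argument list instead of a shell string, so paths containing spaces are not split, and it waits for the tool to finish. The helper names `build_post_command` and `post_file`, and the sample core name and path, are hypothetical; the `post` tool and its `-c` option are the ones already used above.

```python
import subprocess as sp


def build_post_command(core_name, json_path):
    # Argument-list form: no shell parsing is involved.
    return ["./post", "-c", core_name, str(json_path)]


def post_file(core_name, json_path, solr_bin_dir):
    # run() blocks until the post tool exits; check=True raises
    # CalledProcessError if the tool returns a non-zero exit code.
    return sp.run(
        build_post_command(core_name, json_path),
        cwd=solr_bin_dir,
        check=True,
        capture_output=True,
        text=True,
    )


# Only build and show the command here; calling post_file() needs a Solr install.
cmd = build_post_command("my_core", "/data/entities.json")
print(cmd)
```

With `capture_output=True`, the tool's messages end up in the returned object's `stdout` and `stderr`, which makes failed uploads easier to spot than with a fire-and-forget `Popen`.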

Additionally, I want to index some information that is not linked to a specific PMID (does not have the same unique key), so I assume I need to create a new Solr core for it? Does it mean I have to switch to SolrCloud mode? So far I have been using a simple, single core.

Example of such information (abbreviations and the respective long form - also stored in a JSON file):

{"IEOP": "immunoelectroosmophoresis", 
"ELISA": "enzyme-linked immunosorbent assay", 
"GAGs": "glycosaminoglycans", 
...}
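Since these records have no PMID, each one would need some unique key of its own wherever it ends up being indexed. A minimal sketch of turning the mapping into one document per abbreviation is shown below; the `abbr_` id prefix and the field names `doc_type`, `short_form`, and `long_form` are assumptions for illustration, not an established schema.

```python
def abbreviations_to_docs(abbreviations):
    """Turn {short form: long form} into a list of flat documents."""
    return [
        {
            "id": f"abbr_{short}",       # hypothetical unique-key scheme
            "doc_type": "abbreviation",  # assumed marker to tell record types apart
            "short_form": short,
            "long_form": long_form,
        }
        for short, long_form in abbreviations.items()
    ]


abbreviations = {
    "IEOP": "immunoelectroosmophoresis",
    "ELISA": "enzyme-linked immunosorbent assay",
    "GAGs": "glycosaminoglycans",
}
docs = abbreviations_to_docs(abbreviations)
for doc in docs:
    print(doc)
```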

I would appreciate any input - thank you!

S.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
