What is the purpose of storing index information in multiple bin files in a circular way?

I am doing research and refactoring on an NLP-related project, which basically functions as a local search engine.

Indexing phase:

The app parses the corpus (stored in a local MongoDB) and generates in-memory index information for all the needed words, then it writes all of the index info into a specified number of .bin files in a circular fashion. For example, if we have 1000 words to index and the number is specified as 20, this ends up generating 20 .bin files named 0.bin, 1.bin, ..., 19.bin.

The index info of each word is then assigned to a file as follows (0-999 are the word indices):

  • 0 goes into 0.bin
  • 1 goes into 1.bin
  • 2 goes into 2.bin
  • ......
  • 20 goes into 0.bin <--- back to 0.bin
  • ......
  • 998 goes into 18.bin
  • 999 goes into 19.bin

The rule is (word_index % n), where n equals 20 in this case. The app also keeps track of each word's data offset and length within its assigned file, and writes this info into another two .h5 files.
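To make the layout concrete, here is a minimal sketch of that write path in Python, assuming each word's index info has already been serialized to bytes. The names build_index, offsets.h5, and lengths.h5 are hypothetical; the question only says the offsets and lengths end up in two .h5 files.

```python
import h5py
import numpy as np

def build_index(index_info, n=20, out_dir="."):
    """Write each word's serialized index info into n .bin files
    in a circular (round-robin) fashion, and record per-word
    offset/length tables in two .h5 files.

    index_info: list of bytes, one entry per word (assumption:
    the per-word index data is already serialized).
    All file and dataset names here are hypothetical.
    """
    num_words = len(index_info)
    offsets = np.zeros(num_words, dtype=np.int64)
    lengths = np.zeros(num_words, dtype=np.int64)

    # Open all n bin files up front: 0.bin, 1.bin, ..., (n-1).bin.
    files = [open(f"{out_dir}/{i}.bin", "wb") for i in range(n)]
    try:
        for word_idx, payload in enumerate(index_info):
            f = files[word_idx % n]        # the circular assignment rule
            offsets[word_idx] = f.tell()   # byte offset within its own file
            lengths[word_idx] = len(payload)
            f.write(payload)
    finally:
        for f in files:
            f.close()

    # Persist the lookup tables; the project reportedly uses two .h5 files.
    with h5py.File(f"{out_dir}/offsets.h5", "w") as h:
        h.create_dataset("offset", data=offsets)
    with h5py.File(f"{out_dir}/lengths.h5", "w") as h:
        h.create_dataset("length", data=lengths)
```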

Querying phase:

It first loads all of the offset and length info from the .h5 files, then gets the word index, uses the formula mentioned above to locate which file the word's index info was stored in, and reads the data based on the corresponding offset and length.
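A matching sketch of the query path, under the same hypothetical file and dataset names as above: word_index % n names the .bin file, and the stored offset/length give the exact byte range to read.

```python
import h5py

def load_tables(out_dir="."):
    # Load the per-word offset/length tables built during indexing.
    with h5py.File(f"{out_dir}/offsets.h5", "r") as h:
        offsets = h["offset"][:]
    with h5py.File(f"{out_dir}/lengths.h5", "r") as h:
        lengths = h["length"][:]
    return offsets, lengths

def read_word_info(word_idx, offsets, lengths, n=20, out_dir="."):
    # word_idx % n selects the .bin file; offset/length give the byte range.
    with open(f"{out_dir}/{word_idx % n}.bin", "rb") as f:
        f.seek(int(offsets[word_idx]))
        return f.read(int(lengths[word_idx]))
```

So each lookup touches exactly one file and reads a single contiguous byte range.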


Questions:

  1. Why write the index info in such a way that it spreads evenly across n files? Why not write all of it into a single .bin file? What's the difference?
  2. How should the value of n be determined? Based on the size of the index data?
  3. Can I store the index info directly in MongoDB (or another database service) rather than writing it into files? Which would be more efficient during the query phase? By the way, the index contains not only location information but also extra info about each word.

Thanks :)


