Elasticsearch re-index all vs join

I'm fairly new to Elasticsearch and its concepts. I would like to understand how to model in Elasticsearch what I currently have in my relational DB.

The scenario is the following.

I have an index "data":

{
   "id": "00001",
   "content": "some text here ..",
   "type": "T1",
   "categories": ["A", "A1", "B"]
}

The requirement says that data can be queried by:

  • a full-text search on the content field
  • a filter on a specific type or category

So far, so simple, so good.
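
A minimal sketch of what such a request body might look like, assuming `content` is mapped as `text` and `type` and `categories` as `keyword` (the mapping is not shown here). The helper name `build_search_body` is hypothetical. It only builds the Query DSL body as a Python dict and does not contact a cluster.

```python
from typing import Any, Dict, List, Optional


def build_search_body(
    text: str,
    doc_type: Optional[str] = None,
    category: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a bool query: full-text match on `content`, exact filters on the rest."""
    filters: List[Dict[str, Any]] = []
    if doc_type is not None:
        filters.append({"term": {"type": doc_type}})
    if category is not None:
        # A term filter on an array field matches if any element equals the value.
        filters.append({"term": {"categories": category}})
    return {
        "query": {
            "bool": {
                # Scored full-text part of the search.
                "must": [{"match": {"content": text}}],
                # Unscored exact-match restrictions; all given filters must hold.
                "filter": filters,
            }
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_search_body("some text", category="B"), indent=2))
```

The resulting dict could be sent as the body of `POST data/_search`. If both a type and a category are passed, this sketch requires both to match.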

The data will not be complete at creation time. Categories may be added to or removed from a document later, so many updates/re-indexes might happen along the way.

For example:

First, the data is created:

{
   "id": "00001",
   "content": "some text here ..",
   "type": "T1",
   "categories": ["A"]
}

Later it is decided that all data with type=T1 must belong to both categories A and B:

{
   "id": "00001",
   "content": "some text here ..",
   "type": "T1",
   "categories": ["A", "B"]
}

If I have a billion hits for type=T1, I would have to update/re-index a billion entries. Maybe that is how things are supposed to work, and this is where my question comes in.

Is it OK to re-index all the data just to add/remove a category, or would it be possible to have a second, much smaller index that holds only this association and somehow join both indexes at query time?

Something like this:

Data:

{
   "id": "00001",
   "content" : "some text here ..",
   "type": "T1"
}

DataCategories:

{
   "type": "T1",
   "categories": ["A", "B"]
}

Is it acceptable/possible?



Solution 1:[1]

This is a common scenario, but unfortunately there is no 1:1 mapping of RDBMS features onto text search engines like Lucene/Elasticsearch.
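
The two-index idea from the question can still be approximated in application code: first resolve the category to the types that carry it using the small association data, then filter the large index by those types. The sketch below is one hedged way to do that, not a built-in join. The in-memory dict `type_categories` stands in for the small DataCategories index, both function names are hypothetical, and `type` is assumed to be a `keyword` field.

```python
from typing import Any, Dict, List


def types_for_category(
    type_categories: Dict[str, List[str]], category: str
) -> List[str]:
    """Return every type whose association record lists the given category."""
    return sorted(t for t, cats in type_categories.items() if category in cats)


def build_category_search(
    text: str, category: str, type_categories: Dict[str, List[str]]
) -> Dict[str, Any]:
    """Step 2 of the application-side join: filter `data` by the resolved types."""
    types = types_for_category(type_categories, category)
    return {
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],
                # `terms` matches documents whose `type` equals any listed value.
                "filter": [{"terms": {"type": types}}],
            }
        }
    }


if __name__ == "__main__":
    # Stand-in for the small association index (hypothetical contents).
    associations = {"T1": ["A", "B"], "T2": ["A"]}
    print(build_category_search("some text", "B", associations))
```

With this layout, putting T1 into category B means changing one small association record instead of every T1 document, at the cost of an extra lookup per search.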

Possible options:

1 - For the best query performance, reindex. This may not be practical, depending on how frequently your data changes.

2 - Consider Parent-Child. It is a slower option, but it will often meet performance requirements. The category could be a parent document, each having several thousand children.

3 - If it is only category renaming, consider storing IDs for the categories and translating them to text in the application.

4 - Whether to update documents in place depends on the number of documents to be updated: for a few thousand, run an update query; for more, reindex.
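
As an illustration of option 4, here is a minimal sketch of a request body for the Update By Query API (`POST data/_update_by_query`) that adds a category to every document of a given type. The helper name `build_add_category_update` is hypothetical, and the sketch assumes `type` and `categories` are `keyword` fields and that `categories` is stored as an array in `_source`. It only builds the body as a Python dict.

```python
from typing import Any, Dict

# Painless script: create the list if the field is missing, then append.
ADD_CATEGORY_SCRIPT = (
    "if (ctx._source.categories == null) { ctx._source.categories = []; } "
    "ctx._source.categories.add(params.category);"
)


def build_add_category_update(doc_type: str, category: str) -> Dict[str, Any]:
    """Body for _update_by_query: add `category` to all docs of `doc_type`."""
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"type": doc_type}}],
                # Skip documents that already carry the category.
                "must_not": [{"term": {"categories": category}}],
            }
        },
        "script": {
            "lang": "painless",
            "source": ADD_CATEGORY_SCRIPT,
            # Passing the value as a param avoids building script text by hand.
            "params": {"category": category},
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_add_category_update("T1", "B"), indent=2))
```

Each matched document is still rewritten internally, so for very large result sets this costs roughly as much as reindexing those documents.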

Suggested reading - https://www.elastic.co/blog/managing-relations-inside-elasticsearch

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nirmal