MongoDB schema design: when to use the extended reference pattern vs. relying on the aggregation pipeline

I'm hoping to understand more generally when the extended reference pattern should be used instead of relying on MongoDB's aggregation pipeline. I know the answer to MongoDB schema design questions is often "it depends on your use case", so I'll root the question in the following scenario:

  • We're modelling lab testing data, where tests are performed on samples, and samples belong to a "set" of samples.
  • It's a one-to-many relationship: many samples belong to one set. The upper bound on samples per set is roughly 20.
  • Sets and samples are accessed and updated together ~60-70% of the time; the other ~30-40% of the time, samples are filtered/sorted/paginated/updated independently of their set.
  • Both sets and samples are operated on frequently.
  • Some queries on sets must determine if, and how many, samples in each set have a particular status.

Based on this scenario, plain referencing is not feasible because of the last query mentioned. The extended reference pattern seems natural here, as the set could contain an array of sample references along with each sample's status. However, since sets and samples are accessed together so frequently, it also seems natural to embed the samples as nested subdocuments in the set, given that the entirety of each of the set's samples is needed whenever a set is accessed.
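
For concreteness, here is a rough sketch of the two document shapes I'm weighing (field names like sampleId and name are illustrative, not from a real schema):

// Extended reference pattern: the set keeps lightweight references to its
// samples, duplicating only the fields needed for set-level queries (status).
{
  _id: ObjectId(),
  name: "Set A",
  samples: [
    { sampleId: ObjectId(), status: "Broken" },
    { sampleId: ObjectId(), status: "OK" },
  ],
}

// Full embedding: each sample lives entirely inside its set document.
{
  _id: ObjectId(),
  name: "Set A",
  samples: [
    { _id: ObjectId(), status: "Broken", breakDate: ISODate() /* ...~15 fields */ },
    { _id: ObjectId(), status: "OK" /* ... */ },
  ],
}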

I've been looking into using aggregation pipelines for querying/operating on nested subdocuments, i.e. nesting the samples in the set document, and the API seems robust enough to support every query and operation needed. However, I would intuitively suspect that querying/operating on nested subdocuments this frequently would be inefficient relative to keeping the samples in their own collection. I've also been reading up on aggregation pipeline limitations, and there is a 100MB memory limit on each stage of the pipeline. With enough careful indexing and upfront $match stages to narrow down the data, I don't think this limit should ever be hit, but there does seem to be a pretty tangible scalability tradeoff.

For reference, here is an example of an aggregate query I have in mind:

[
  // Get all sets that have samples with a status of "Broken"
  { $match: { "samples.status": "Broken" } },
  // Filter out samples in the sets that don't have a status of "Broken"
  {
    $project: {
      samples: {
        $filter: {
          input: "$samples",
          as: "sample",
          cond: { $eq: ["$$sample.status", "Broken"] },
        },
      },
    },
  },
  // Create a new document in memory for each sample
  { $unwind: "$samples" },
  // Sort the samples based on break date
  { $sort: { "samples.breakDate": -1 } },
  // Operations relevant to pagination
  {
    $facet: {
      count: [{ $count: "count" }],
      entities: [{ $skip: 1000 }, { $limit: 50 }, { $replaceRoot: { newRoot: "$samples" } }],
    },
  },
  // Improving the shape of the output
  {
    $project: {
      count: { $arrayElemAt: ["$count.count", 0] },
      entities: "$entities",
    },
  },
]

The most aggressive approach would be to duplicate the samples, keeping them both in their own collection and as nested subdocuments in the set. Sample documents are quite small (~15 fields), so this would make the space tradeoff slightly more palatable; however, data consistency then becomes an issue. Sets and samples are updated frequently, so each sample update would require multiple write operations and the use of multi-document transactions.
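
To make that consistency cost concrete, a single status change would look roughly like this (Node.js driver style; the collection names, field names, and surrounding variables such as client, sampleId, and setId are assumptions, not real code from the project):

// Sketch: update a sample's status in both places inside a multi-document
// transaction so the duplicated copies can't drift apart.
const session = client.startSession();
try {
  await session.withTransaction(async () => {
    // Canonical copy in the samples collection
    await db.collection("samples").updateOne(
      { _id: sampleId },
      { $set: { status: "Broken", breakDate: new Date() } },
      { session }
    );
    // Duplicated copy embedded in the parent set
    await db.collection("sets").updateOne(
      { _id: setId, "samples._id": sampleId },
      { $set: { "samples.$.status": "Broken", "samples.$.breakDate": new Date() } },
      { session }
    );
  });
} finally {
  await session.endSession();
}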

There doesn't seem to be a clear winner. Aside from doing some real world testing on the performance of each approach, what are some general guidelines to consider when deciding between the extended reference pattern and using aggregation pipelines to query/operate on nested subdocuments?



Solution 1:[1]

The 100MB limit applies only to RAM usage per pipeline stage. If you allow disk use, the aggregation will still work, but with a major degradation in performance.
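
For example, in mongosh the option is passed per aggregation call (collection name taken from your scenario):

// 'pipeline' is the aggregation array from the question. Stages that exceed
// the 100MB in-memory limit spill to temporary files on disk instead of
// aborting the aggregation.
db.sets.aggregate(pipeline, { allowDiskUse: true })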

You touched on the most important questions, so it really is about real-world testing. Generate a sample dataset with mgodatagen or a similar tool, test your typical queries, play with indexes, and compare performance. Make sure your test server is scaled down proportionally to the size of the test dataset.
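
A rough sketch of that loop in mongosh, with an index key guessed from your example pipeline:

// Multikey index to support the initial $match on embedded sample status
db.sets.createIndex({ "samples.status": 1 })

// Inspect how the pipeline ('pipeline' being the array from the question)
// actually executes against the test dataset
db.sets.explain("executionStats").aggregate(pipeline)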

If you are using mongoose, take into account the document versioning it adds on top of transactions. You will need to handle errors related to stale data at the application level if "sets and samples are updated frequently" is frequent enough to cause concurrent updates.
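
Roughly, that application-level handling could look like the following with mongoose; the Set model and its fields are assumptions based on your scenario, and whether a given update actually bumps the version key depends on how versioning/optimistic concurrency is configured:

// If save() detects the embedded samples array changed underneath it, it
// rejects with a VersionError; catch it and retry with freshly loaded data.
async function updateSampleStatus(setId, sampleId, status, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      const set = await Set.findById(setId);
      set.samples.id(sampleId).status = status;
      await set.save();
      return;
    } catch (err) {
      if (err.name !== "VersionError" || i === attempts - 1) throw err;
      // Stale document: loop around, reload, and try again
    }
  }
}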

Apart from partial data duplication, there is also preaggregation, which works wonders for queries like "how many samples in each set have a particular status".
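
A minimal sketch of that idea, assuming a brokenCount counter field maintained on each set (the field name is made up):

// Maintain the counter whenever a sample's status flips to "Broken", so the
// set-level question becomes a plain indexed find instead of an aggregation.
db.sets.updateOne(
  { _id: setId, samples: { $elemMatch: { _id: sampleId, status: { $ne: "Broken" } } } },
  { $set: { "samples.$.status": "Broken" }, $inc: { brokenCount: 1 } }
)

// "Which sets have broken samples, and how many?" is then trivial:
db.sets.find({ brokenCount: { $gt: 0 } }, { brokenCount: 1 })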

Basically, in the case of MongoDB schema design, "it depends on your use case" really means "how much data you have", "what queries you run", "how often you insert/query/update documents", "how much you can spend on the cluster", and so on, all in combination. Which, essentially, means doing real-world testing.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Alex Blex