Multi-schema data pipeline

We want to create a Spark-based streaming data pipeline that consumes from a source (e.g. Kinesis), applies some basic transformations, and writes the data to a file-based sink (e.g. S3). We have thousands of different event types coming in, and the transformations would take place on a set of common fields. Once the events are transformed, they need to be split by writing them to different output locations according to the event type. This pipeline is described in the figure below:

[Figure: pipeline diagram]
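A minimal sketch of that pipeline shape (not the final solution), assuming a Kinesis source connector that exposes format("kinesis"); the stream name, region, and S3 paths are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-schema-pipeline").getOrCreate()

# 1. Consume: Kinesis connectors deliver the payload as binary in a `data` column.
raw = (spark.readStream
            .format("kinesis")                      # connector-specific
            .option("streamName", "events-stream")  # placeholder
            .option("region", "eu-west-1")          # placeholder
            .load()
            .selectExpr("CAST(data AS STRING) AS json"))

# 2. Transform: work on the common fields, e.g. extract the event type.
events = raw.withColumn("type", F.get_json_object("json", "$.type"))

# 3. Sink: write to S3; how to split by event type is the open question below.
query = (events.writeStream
               .format("parquet")
               .option("path", "s3://bucket/events/")             # placeholder
               .option("checkpointLocation", "s3://bucket/ckpt/") # placeholder
               .start())
```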

Goals:

  • To infer the schema safely, so that transformations can be applied on the merged schema. The assumption is that the event types are compatible with one another (i.e. their schema structures do not conflict when merged), but the schema of any of them can change at unpredictable times, and the pipeline should handle such changes dynamically.
  • To split the output after the transformations while keeping each event type's original schema.

What we considered:

  • Schema inference seems to work fine on sample data, but is it safe for production use cases and for a large number of different event types?
  • Simply using partitionBy("type") when writing out is not enough, because the written files would carry the merged schema rather than each event type's own schema (illustrated below).
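A tiny batch-mode illustration of that merged-schema behaviour, with two hypothetical event types:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two event types with disjoint payload fields, read with schema inference.
df = spark.read.json(spark.sparkContext.parallelize([
    '{"type": "click", "url": "/home"}',
    '{"type": "purchase", "amount": 9.99}',
]))
df.printSchema()
# root
#  |-- amount: double (nullable = true)
#  |-- type: string (nullable = true)
#  |-- url: string (nullable = true)
#
# Every row now carries the union of all columns, so
# df.write.partitionBy("type").parquet(...) would persist null-padded files.
```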


Solution 1:

Doing the same here (casting everything to string, validating with marshmallow, and then using from_json in a foreach, as in https://www.waitingforcode.com/apache-spark-structured-streaming/two-topics-two-schemas-one-subscription-apache-spark-structured-streaming/read) seems the most reasonable approach.
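A minimal sketch of that approach, reusing the events stream from the skeleton above. MARSHMALLOW_SCHEMAS and SPARK_SCHEMAS are hypothetical per-type registries (marshmallow Schema objects and the matching Spark StructTypes); the validation UDF and the per-type from_json split inside foreachBatch are one way to realize the linked pattern, not its exact code:

```python
import json
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

@F.udf(returnType=BooleanType())
def is_valid(json_str, event_type):
    # marshmallow's Schema.validate returns a dict of errors; empty means valid.
    try:
        return not MARSHMALLOW_SCHEMAS[event_type].validate(json.loads(json_str))
    except Exception:
        return False

def process_batch(batch_df, batch_id):
    valid = batch_df.filter(is_valid(F.col("json"), F.col("type")))
    for t in [r["type"] for r in valid.select("type").distinct().collect()]:
        # Parse each type with its own schema so the written files keep the
        # original structure instead of the merged super-schema.
        typed = (valid.filter(F.col("type") == t)
                      .select(F.from_json("json", SPARK_SCHEMAS[t]).alias("e"))
                      .select("e.*"))
        typed.write.mode("append").parquet(f"s3://bucket/out/type={t}/")

query = (events.writeStream
               .foreachBatch(process_batch)
               .option("checkpointLocation", "s3://bucket/ckpt/")
               .start())
```

One design note: foreachBatch is used here rather than a row-level foreach so that each micro-batch can be filtered and written as a DataFrame per event type.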

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Alberto C