Join Kinesis Streams

I have two Kinesis streams and I would like to create a third stream that is the intersection of these two streams. My goal is to have a stream processor respond to an event on the resulting third stream without having to write a consumer that performs this intersection.

A record on stream a would be:

{
    "customer_id": 3,
    "first_name":"Marcy",
    "last_name":"Shurtleff"
}

and a record on stream b would be:

{
    "payment_id": 10001,
    "customer_id": 3,
    "amount":234.56,
    "date":"2018-09-07T10:25:43.511Z"
}

I would like to perform a join (like I can in KSQL with Kafka) that will join stream a.customer_id to stream b.customer_id resulting in:

{
    "customer_id": 3,
    "first_name":"Marcy",
    "last_name":"Shurtleff",
    "payment_id": 10001,
    "amount":234.56,
    "date":"2018-09-07T10:25:43.511Z"
}

(or whatever sql-like projection I choose).
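To make the intended semantics concrete, here is a minimal Python sketch of that join over a bounded batch of records. The function name `inner_join` and the sample lists are illustrative only, and a real stream-to-stream join would also need a time window, which is left out here.

```python
def inner_join(customers, payments, key="customer_id"):
    """Inner-join two lists of dict records on a shared key (hash join)."""
    # Index the customer records by the join key.
    by_key = {}
    for customer in customers:
        by_key.setdefault(customer[key], []).append(customer)

    # Emit one merged record per matching (customer, payment) pair.
    joined = []
    for payment in payments:
        for customer in by_key.get(payment[key], []):
            joined.append({**customer, **payment})
    return joined


# Sample records that share a customer_id (illustrative data).
stream_a = [{"customer_id": 3, "first_name": "Marcy", "last_name": "Shurtleff"}]
stream_b = [{"payment_id": 10001, "customer_id": 3, "amount": 234.56,
             "date": "2018-09-07T10:25:43.511Z"}]

result = inner_join(stream_a, stream_b)
```

A payment whose `customer_id` has no matching customer record produces no output, which is the behavior I want from the third stream.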

I know this is possible with Kafka and KSQL, but is this possible with Kinesis?

Kinesis Data Analytics will not help as you cannot use more than one stream as a datasource in that product and you can only perform joins on 'in-application' streams.



Solution 1:[1]

I recently implemented a solution that does exactly what you are asking using Kinesis Data Analytics (KDA). It is true that a KDA application takes only one stream as its input data source, so when you are dealing with multiple streams you have to standardize the schema of the data flowing into KDA. To work around this, a Python snippet inside a Lambda function can flatten and standardize any event by converting its entire payload to a JSON-encoded string. The image below shows how my whole solution is deployed:

[Image: deployment diagram of the whole solution]

The process of standardizing and flattening the streams is illustrated in detail below:

[Image: standardizing and flattening the events from both streams]

Note that after this stage both JSON events have the same schema and no nested fields, yet all information is preserved. In addition, the ssn field is placed in the header so it can be used as the join key inside the KDA application.
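A minimal sketch of that Lambda step could look like the following. It is not the exact code from my solution: the envelope field names (`join_key`, `source`, `payload`) and the constants `SOURCE` and `KEY_FIELD` are assumptions made for illustration, and the key field is set to `customer_id` to match the question (in my case it was `ssn`). The only thing it relies on from AWS is the standard shape of a Kinesis-triggered Lambda event, where each record carries base64-encoded data under `Records[].kinesis.data`.

```python
import base64
import json

# Hypothetical per-function settings: one Lambda per upstream stream.
SOURCE = "customers"        # label for the stream this Lambda reads from
KEY_FIELD = "customer_id"   # field promoted to the header as the join key


def standardize(event: dict, source: str, key_field: str) -> dict:
    """Wrap any event in a flat envelope that has the same schema for every stream."""
    return {
        "join_key": str(event[key_field]),  # header field used for the join in KDA
        "source": source,                   # tells KDA which stream the event came from
        "payload": json.dumps(event),       # entire original event as a JSON-encoded string
    }


def handler(lambda_event: dict, context=None) -> list:
    """Decode each Kinesis record and return the standardized envelopes."""
    standardized = []
    for record in lambda_event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        standardized.append(standardize(json.loads(raw), SOURCE, KEY_FIELD))
    # Forwarding the envelopes to the single stream that feeds KDA is omitted here.
    return standardized
```

Because every envelope has the same three flat fields, both upstream streams can be written into one stream and used as the single KDA source. Inside KDA the records can then be split by `source` and joined on `join_key`, with the original fields recovered from `payload`.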

For more information about this solution, check this article I wrote: https://medium.com/@guilhermeepassos/joining-and-enriching-multiple-sets-of-streaming-data-with-kinesis-data-analytics-24b4088b5846

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
