Writing into multiple BigQuery tables using Spark Dataset based on one column

We have a streaming job which reads data from Pub/Sub and writes to BigQuery. It builds a Dataset with a field payloadname that has 16 distinct values. I have a requirement to write this Dataset into 16 different BigQuery tables, one per payloadname value. How can I do it?

One way to do that is to cache this Dataset, filter it on payloadname, and write each filtered Dataset to its table sequentially, but that takes longer than our batch interval.
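The sequential approach can be sketched as a filter-and-write loop. This is a plain-Java simulation, not the real Spark job: the `Payload` record stands in for a row of the Dataset, and `writeToBigQuery` is a hypothetical stub for the filtered-Dataset write.

```java
import java.util.*;
import java.util.stream.*;

public class SequentialFanOut {
    // Hypothetical stand-in for one row of basePayloadDf.
    record Payload(String payloadName, String body) {}

    // Stub for the per-table BigQuery write; the real job would call the
    // Spark BigQuery connector on a filtered Dataset here instead.
    static final Map<String, List<Payload>> written = new HashMap<>();

    static void writeToBigQuery(String table, List<Payload> rows) {
        written.put(table, rows);
    }

    public static void main(String[] args) {
        // Cached input (analogous to basePayloadDf after persist()).
        List<Payload> cached = List.of(
                new Payload("orders", "o1"),
                new Payload("users", "u1"),
                new Payload("orders", "o2"));

        // The distinct payload names drive one write per target table.
        Set<String> names = cached.stream()
                .map(Payload::payloadName)
                .collect(Collectors.toSet());

        for (String name : names) {             // sequential: one table at a time
            List<Payload> subset = cached.stream()
                    .filter(p -> p.payloadName().equals(name))
                    .collect(Collectors.toList());
            writeToBigQuery("dataset." + name, subset);
        }
        System.out.println(written.size());     // 2 distinct tables written
    }
}
```

With 16 payload names this loop triggers 16 separate write actions back to back, which is where the time goes.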

Does anyone know a way to write the filtered Datasets in parallel, or a better way to deal with this problem?
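One option worth considering: Spark actions can be triggered from multiple driver threads against the same cached Dataset, so the 16 writes can be submitted to a thread pool instead of run in a loop. A minimal sketch of that dispatch pattern, with the actual filtered write stubbed out as a hypothetical `writeTable` helper:

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelFanOut {
    // Stub for one table's filtered write; a thread-safe sink so the
    // concurrent "writes" can be observed. Real code would filter the
    // cached Dataset on payloadname and call the BigQuery connector here.
    static final Map<String, Integer> written = new ConcurrentHashMap<>();

    static void writeTable(String name) {
        written.put("dataset." + name, 1);
    }

    public static void main(String[] args) throws Exception {
        List<String> payloadNames = new ArrayList<>();
        for (int i = 1; i <= 16; i++) payloadNames.add("payload" + i);

        // Each per-table write becomes its own task; the pool size caps
        // how many writes run concurrently.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> futures = new ArrayList<>();
        for (String name : payloadNames) {
            futures.add(pool.submit(() -> writeTable(name)));
        }
        for (Future<?> f : futures) f.get();   // propagate any write failure
        pool.shutdown();

        System.out.println(written.size());    // 16 tables written
    }
}
```

In a real Spark job, setting `spark.scheduler.mode=FAIR` helps the concurrently submitted jobs share executor resources instead of queueing FIFO.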

Sample Code:

private static void prepareAndWriteNormalizeTablesToBigQuery(Dataset<BasePayload> basePayloadDf,
        String projectId, String credentials) throws Exception {
    method1(basePayloadDf, projectId, credentials);
    method2(basePayloadDf, projectId, credentials);
    // ... and so on, one method per target table (16 in total)
}

The above methods write the Dataset sequentially, and the filter condition is applied inside each individual method. As I said, it needs to write into 16 tables, and a 5-minute batch is taking around 7 minutes to complete. Volume per batch is around 80,000 records. Please also note that we do a lot of computation to prepare basePayloadDf, such as uncompressing, decrypting and deserialising the data.
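Because preparing basePayloadDf is expensive, it matters that the preparation runs once rather than once per table; in Spark that is what `persist()` on the Dataset before the fan-out achieves. A plain-Java sketch of the idea, with the uncompress/decrypt/deserialise pipeline stubbed as a hypothetical `prepareBasePayload`:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PersistOnce {
    static final AtomicInteger prepRuns = new AtomicInteger();

    // Stand-in for the expensive uncompress/decrypt/deserialise pipeline
    // that builds basePayloadDf. Without caching, Spark would re-evaluate
    // this lineage for every one of the 16 writes.
    static String prepareBasePayload() {
        prepRuns.incrementAndGet();
        return "basePayloadDf";
    }

    public static void main(String[] args) {
        // Analogous to basePayloadDf.persist(): materialise once, reuse.
        String cached = prepareBasePayload();

        for (int table = 0; table < 16; table++) {
            String input = cached;   // each per-table "write" reads the cached result
        }
        System.out.println(prepRuns.get()); // 1: preparation ran exactly once
    }
}
```

Without the cache, each of the 16 writes would redo the preparation work, multiplying the 7-minute batch time rather than shrinking it.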



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
