JSON flattening in AWS Glue ETL job creates inferred schema with duplicated columns

I'm relatively new to AWS Glue and using the visual AWS Glue studio at the moment. Kind of a niche issue I'm having here...

Context:

I'm building an ETL job that, among other things, should parse/flatten JSON from a string column, replacing it with separate typed fields that I can select and load into my data warehouse table.

Approach:

I first extract my data from the Glue catalog as a DynamicFrame (in this case only one table). Then I'm trying to use the approach of unboxing and unnesting.

Let's call that json column data:

def transformTable(glueContext, dfc) -> DynamicFrameCollection:
    # Select the single input frame from the collection
    dyf = dfc.select(list(dfc.keys())[0])
    # Parse the JSON string column into a struct
    dyf = Unbox.apply(frame=dyf, path="data", format="json")
    # Flatten the struct into top-level columns
    dyf = UnnestFrame.apply(frame=dyf)
    return DynamicFrameCollection({"TransformedTable": dyf}, glueContext)

(Then I have a step to select the right frame from the frame collection, and then I can apply mapping to my fields and load.)

My issue:

  • Glue automatically infers the data types in my frame's schema (rather successfully)
  • but it splits certain fields into several when the data type is ambiguous (similar to make_cols in the resolveChoice method). For example, I end up with two fields in the output schema, price_int and price_double, where price_int contains only the values that happened to be round numbers and null values everywhere else, etc.

So it seems like the default behavior of this method is to split columns in case of data-type ambiguity (make_cols).

I understand that I could write a resolveChoice for each field, but with this approach the fields are already split into separate columns in the output schema.
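For reference, resolveChoice accepts a list of (path, action) specs, so one option would be to build that list programmatically rather than hand-writing a line per field. A minimal sketch, assuming the standard DynamicFrame.resolveChoice signature; the build_specs helper and the field names are hypothetical:

```python
def build_specs(field_names, action="project:string"):
    """Build a resolveChoice specs list applying the same action
    (here: project every ambiguous field to string) to each field."""
    return [(name, action) for name in field_names]

# Hypothetical usage inside the Glue transform -- assumes `dyf`
# is the unboxed DynamicFrame:
# specs = build_specs(["price", "quantity", "discount"])
# dyf = dyf.resolveChoice(specs=specs)

print(build_specs(["price", "quantity"]))
# -> [('price', 'project:string'), ('quantity', 'project:string')]
```

This only helps once you know which fields are problematic; the list could itself be derived from the inferred schema rather than maintained by hand.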

Note: There are dozens of fields in this json, so I'm trying to devise a blanket solution that automatically makes all the fields of the json available in the schema to select and map in the next step, and avoid having to add one line of code for each field I want to extract. (And the json structure will grow with new fields in the future, so I'm trying to limit future ETL maintenance...)

Questions/help needed:

  • Any idea if there's a way to change this default behavior (like in the resolveChoice method)?
  • Alternatively, is there a way to apply a kind of default resolveChoice to all problematic fields from the json unboxing? For instance, I could force all problematic fields into string (similar to 'project:string'), and then reformat if needed in the applyMapping step. But resolveChoice seems to need to be applied field by field...
  • What's a different/better approach I could try? I would like to keep it as dynamic/automated as possible... e.g.:
  1. I think I could maybe extract specific fields from the JSON line by line, but I'm not sure how (looks like the Unbox method is already splitting columns by format). And as explained, it's dozens of fields and growing... so it requires updating the code regularly, instead of just ticking boxes in the list of available fields.
  2. The Relationalize method could be an option, but it creates distinct frames and this quickly becomes much more complex (there are actually several columns with JSON, which all need to be flattened...).
  3. Creating crawlers or classifiers that run automatically on a schedule to extract the schema from that specific string column of a table could be an option as well...
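As a sketch of option 1, an alternative to Unbox is to convert to a Spark DataFrame and parse the string column with from_json, supplying a schema in which every leaf is a string, so no choice columns can appear. The column name `data`, the sampling size, and the wiring below are all assumptions; the pure-Python helper just collects the union of top-level keys from sampled rows so the schema keeps up as new fields arrive:

```python
import json

def union_of_keys(json_rows):
    """Collect the union of top-level keys across sampled JSON
    strings (rows missing a key simply contribute nothing)."""
    keys = set()
    for row in json_rows:
        keys.update(json.loads(row).keys())
    return sorted(keys)

# Hypothetical Spark usage -- assumes `dyf` is the input DynamicFrame
# and `data` is the JSON string column:
# from pyspark.sql import functions as F
# from pyspark.sql.types import StructType, StructField, StringType
#
# df = dyf.toDF()
# sample = [r["data"] for r in df.select("data").limit(100).collect()]
# schema = StructType([StructField(k, StringType()) for k in union_of_keys(sample)])
# df = df.withColumn("parsed", F.from_json("data", schema)).select("*", "parsed.*")

print(union_of_keys(['{"price": 1}', '{"price": 2.5, "qty": 3}']))
# -> ['price', 'qty']
```

Since every field is read as a string, any reformatting/casting can then happen once, explicitly, in the subsequent applyMapping step.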

Thanks in advance!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
