AWS Glue - IllegalArgumentException: Duplicate value for path
I have a messy data source where some fields can arrive under two different names but should map to a single conformed field name in the output.
For example, the data source contains either `update_date` or `modified_date`, and both should map to `timestamp`.
The two field names are never present simultaneously on the same row of data.
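To make the intent concrete, the conformance I'm after is equivalent to this coalescing logic, sketched here in plain Python on row dicts (the field names come from the example above; the helper name is just illustrative):

```python
def conform_row(row, aliases=("update_date", "modified_date"), target="timestamp"):
    """Copy whichever alias field is present into the target field.

    The aliases are mutually exclusive per row, as in my data.
    """
    # Keep every field except the alias columns themselves.
    out = {k: v for k, v in row.items() if k not in aliases}
    for name in aliases:
        if name in row and row[name] is not None:
            out[target] = row[name]
            break
    return out

print(conform_row({"id": 1, "update_date": "2023-01-01"}))
# {'id': 1, 'timestamp': '2023-01-01'}
print(conform_row({"id": 2, "modified_date": "2023-02-02"}))
# {'id': 2, 'timestamp': '2023-02-02'}
```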
The Glue script looks like this (some lines have been redacted for brevity):
```python
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Data Catalog table
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="crawl_rawdata",
    transformation_ctx="DataCatalogtable_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=DataCatalogtable_node1,
    mappings=[
        ...
        ("update_date", "string", "timestamp", "string"),
        ...
        ("modified_date", "string", "timestamp", "string"),
        ...
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="orc",
    connection_options={
        "path": "s3://mybucket/data-lake/glue/",
        "compression": "snappy",
        "partitionKeys": [ ... ],
    },
    transformation_ctx="S3bucket_node3",
)

job.commit()
```
How can I make this mapping work?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow