Connecting to DocumentDB from AWS Glue

For a current ETL job, I am trying to create a Python Shell job in Glue. The transformed data needs to be persisted in DocumentDB, but I am unable to access DocumentDB from Glue.

Since the DocumentDB cluster resides in a VPC, I thought of creating a VPC interface endpoint to access DocumentDB from Glue, but DocumentDB is not one of the services supported by interface endpoints. I have seen SSH tunneling suggested as an option, but I do not want to do that.

So, I want to know: is there a way to connect to DocumentDB from Glue?



Solution 1:[1]

Create a dummy JDBC connection in AWS Glue. You do not need to run a test connection; simply attaching the connection causes ENIs to be created inside the VPC. Attach this connection to your Python shell job and it will be able to reach your VPC resources, including the DocumentDB cluster.
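Once the ENIs are in place, the Python shell job can talk to the cluster with a plain MongoDB client. Here is a minimal sketch using pymongo; the endpoint, credentials, and database/collection names are placeholders, and the Amazon RDS CA bundle (global-bundle.pem) has to be shipped alongside the job since DocumentDB enforces TLS by default:

import pymongo

# Placeholder endpoint and credentials; replace with your cluster's values.
# global-bundle.pem is the Amazon RDS CA bundle, which must be made
# available to the job (e.g. bundled with the script).
client = pymongo.MongoClient(
    "mongodb://youruser:yourpassword@yourdocdbcluster.cluster-xxxx.us-east-1.docdb.amazonaws.com:27017/",
    tls=True,
    tlsCAFile="global-bundle.pem",
    retryWrites=False,  # DocumentDB does not support retryable writes
)

collection = client["yourdbname"]["yourcollectionname"]
collection.insert_one({"status": "connected"})
client.close()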

Solution 2:[2]

Have you tried using the MongoDB connection type under Glue connections? Since DocumentDB is MongoDB-compatible, you can connect to it through that option.
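A minimal read sketch under that approach, assuming a glue_context set up as in a standard Glue Spark job; all connection values below are placeholders:

# Reading a DocumentDB collection through Glue's MongoDB-compatible connector
read_docdb_options = {
    "uri": "mongodb://yourdocumentdbcluster.amazonaws.com:27017",
    "database": "yourdbname",
    "collection": "yourcollectionname",
    "username": "###",
    "password": "###",
    "ssl": "true",               # DocumentDB clusters enforce TLS by default
    "ssl.domain_match": "false"
}

dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options=read_docdb_options
)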

Solution 3:[3]

I have been able to connect to DocumentDB from Glue and ingest data from a CSV in S3. Here is the script to do that:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Constants
data_catalog_database = 'sample-db'
data_catalog_table = 'data'

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

spark_context = SparkContext()
glue_context = GlueContext(spark_context)
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read from the Data Catalog table created by the crawler
## @type: DataSource
## @args: [database = "sample-db", table_name = "data"]
## @return: dynamic_frame
## @inputs: []
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database=data_catalog_database,
    table_name=data_catalog_table
)

# Replace the endpoint and credentials with your cluster's values
documentdb_write_uri = 'mongodb://yourdocumentdbcluster.amazonaws.com:27017'
write_documentdb_options = {
    "uri": documentdb_write_uri,
    "database": "yourdbname",
    "collection": "yourcollectionname",
    "username": "###",
    "password": "###",
    "ssl": "true",               # DocumentDB clusters enforce TLS by default
    "ssl.domain_match": "false"
}

# Write the DynamicFrame to DocumentDB
glue_context.write_dynamic_frame.from_options(
    dynamic_frame,
    connection_type="documentdb",
    connection_options=write_documentdb_options
)

job.commit()

In summary:

  1. Create a crawler that infers the schema of your data in S3 and creates a table in the Data Catalog (a scripted version is sketched below).
  2. Read from that database and table and write the resulting DynamicFrame into your DocumentDB collection.
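If it helps, step 1 can also be scripted with boto3; the crawler name, role ARN, and S3 path below are placeholders:

import boto3

glue = boto3.client("glue")

# Hypothetical crawler: infers the CSV schema in S3 and writes a table
# into the 'sample-db' Data Catalog database used by the job above
glue.create_crawler(
    Name="sample-csv-crawler",
    Role="arn:aws:iam::123456789012:role/YourGlueCrawlerRole",
    DatabaseName="sample-db",
    Targets={"S3Targets": [{"Path": "s3://your-bucket/your-csv-prefix/"}]},
)
glue.start_crawler(Name="sample-csv-crawler")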

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Eman
Solution 2: Shubham Jain
Solution 3: gab