Send data from S3 to Postgres RDS with Glue
I'm trying to create an automated pipeline with AWS. I'm able to get my CSV file into my S3 bucket, and that automatically triggers a Lambda function that starts my Glue job. The Glue job then turns the CSV into a DataFrame with PySpark. You cannot use psycopg2, pandas, or sqlalchemy, or else Glue will give an error saying the module doesn't exist. I have a Postgres RDS instance set up in AWS RDS. This is what I have so far:
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Job arguments passed in by the Lambda that starts this job
args = getResolvedOptions(sys.argv, ["VAL1", "VAL2"])
file_name = args["VAL1"]
bucket_name = args["VAL2"]

# Read the CSV from S3 into a Spark DataFrame
file_path = "s3a://{}/{}".format(bucket_name, file_name)
df = spark.read.csv(file_path, sep=",", inferSchema=True, header=True)
df = df.drop("index")  # drop() returns a new DataFrame, so reassign it

url = "my rds endpoint link"
I have tried almost a dozen solutions before asking on Stack Overflow, so any help would be amazing.
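For context, the Lambda that kicks off the Glue job looks roughly like this. The job name and the way I pull the bucket/key out of the S3 event are just placeholders, not my exact code:

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Pull the bucket and object key out of the S3 put event
    record = event["Records"][0]["s3"]
    bucket_name = record["bucket"]["name"]
    file_name = record["object"]["key"]

    # Start the Glue job and pass the file location as job arguments
    glue.start_job_run(
        JobName="my-glue-job",  # placeholder job name
        Arguments={
            "--VAL1": file_name,
            "--VAL2": bucket_name,
        },
    )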
Solution 1:[1]
I have used this df.write approach before. Starting where you left off with your PySpark DataFrame:
jdbc_url = 'jdbc:postgresql://<instance_name>.xxxxxxxxx.us-west-2.rds.amazonaws.com:5432/<db_name>'

(df.write.format('jdbc')
    .option('url', jdbc_url)  # pass the variable, not the string 'jdbc_url'
    .option('user', 'myUsername')
    .option('password', 'myPassword')
    .option('dbtable', 'myTable')
    .option('driver', 'org.postgresql.Driver')
    .mode('append')
    .save())
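As far as I know, Glue's Spark environment bundles JDBC drivers for the databases it supports natively, PostgreSQL included, so you shouldn't need to attach extra jars for this. The usual gotcha is networking: the Glue job needs a Glue connection (or an equivalent VPC, subnet, and security-group setup) that can reach the RDS instance, otherwise the write will time out.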
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Bob Haffner |