What is the best practice to load data to Redshift with AWS Glue?



We are using Glue and Redshift for our ETL purposes.

Our source is AWS RDS and we are using Redshift as our data warehouse.

For ETL we are using Glue. We write our scripts in PySpark (we are not using crawlers; we connect to RDS directly from the PySpark script).

Currently, we first create a job that writes the DataFrame to S3 as JSON. Then a second job reads the JSON from S3 as a DynamicFrame and writes it to Redshift.

Also, in the PySpark ETL script we build our dimension and fact tables as SQL queries, which we then run through spark.sql().

What is the best practice that we should follow?
Should we remove the step of writing data to S3, convert our DataFrame to a DynamicFrame, and write directly to Redshift?
Also, we read data from the source as a DataFrame using spark.read.format("jdbc").options(url=, driver=, dbtable=, user=, password=). Is this a good practice if we have to use job scheduling and job bookmarking?
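For reference, here is a simplified sketch of how that JDBC read can be assembled (all connection values and the MySQL driver are placeholders; the partitioning options are an addition worth considering, since without them Spark pulls the whole table through a single connection):

```python
# Sketch of JDBC read options for spark.read.format("jdbc").
# All values here are placeholders, not our real connection details.

def build_jdbc_options(url, table, user, password,
                       partition_column=None, lower=None, upper=None,
                       num_partitions=None):
    """Assemble the option dict passed to spark.read.format('jdbc')."""
    opts = {
        "url": url,                            # e.g. jdbc:mysql://host:3306/db
        "driver": "com.mysql.cj.jdbc.Driver",  # assumption: MySQL-family RDS
        "dbtable": table,
        "user": user,
        "password": password,
    }
    if partition_column:
        # Optional: lets Spark split the read into parallel range queries.
        opts.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower),
            "upperBound": str(upper),
            "numPartitions": str(num_partitions),
        })
    return opts

# Usage inside the Glue/PySpark script (not executed here):
# df = spark.read.format("jdbc").options(**build_jdbc_options(...)).load()
```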

asked 4 years ago · 2,123 views
1 Answer

Hi TiwaryShashwat,

Depending on how complex (or not) the transforms in your Glue jobs are, it might be easier to just export or unload the source data from your RDS instance to S3 in a format that can be loaded into Redshift with a COPY command. You can do most lightweight transforms in the SELECT portion of your unload or export, and even partly in the COPY command itself.
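As a hedged sketch, the COPY end of that approach could be scripted like this (the table, bucket prefix, and IAM role ARN are all illustrative assumptions, and the format option should match whatever your RDS export actually produces):

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, fmt="PARQUET"):
    """Build a Redshift COPY command for data unloaded from RDS to S3.

    Identifiers are placeholders; adjust FORMAT to the export's actual
    output (CSV, JSON, Parquet, ...).
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS {fmt};"
    )

# Example with placeholder values; run the result via your SQL client
# or the Redshift Data API, not shown here.
stmt = build_copy_statement(
    "staging.orders",
    "s3://my-etl-bucket/rds-export/orders/",
    "arn:aws:iam::123456789012:role/redshift-copy-role",
)
```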

Another option is to export or unload from RDS into an S3 key structure compatible with an external table, and map an external table definition on top of it in the Glue Data Catalog. Then, with a simple CREATE EXTERNAL SCHEMA ... declaration in Redshift, you can query the data in S3 directly using Redshift Spectrum. If you choose this route, I'd suggest some form of time-based partitioning of that external table, both to make data management in S3 easier and to make your Redshift Spectrum queries faster where you can use a time-based filter predicate on the partitioning column.
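A rough sketch of the DDL that route involves (schema name, catalog database, role ARN, and the dt partition layout are all assumptions to adapt):

```python
# Hedged sketch of the Spectrum setup; run these statements in Redshift.

CREATE_SCHEMA = """\
CREATE EXTERNAL SCHEMA IF NOT EXISTS rds_export
FROM DATA CATALOG
DATABASE 'glue_catalog_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role';"""

def add_partition_ddl(table, dt, location):
    """DDL to register one time-based partition (dt as 'YYYY-MM-DD')."""
    return (
        f"ALTER TABLE rds_export.{table} "
        f"ADD IF NOT EXISTS PARTITION (dt='{dt}') "
        f"LOCATION '{location}';"
    )

# Example with a placeholder bucket layout:
ddl = add_partition_ddl(
    "orders", "2021-06-01",
    "s3://my-etl-bucket/rds-export/orders/dt=2021-06-01/",
)
```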

While the route you're taking now with DataFrames in Glue Spark can work, you may well run into scaling problems, depending on your budget for scaling the Glue job and the amount of data you're operating on: your current process requires holding the whole DataFrame in memory on the Glue Spark cluster.
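On the direct-write question: even if you skip your explicit S3 step, Glue's Redshift writer still stages the data in S3 behind the scenes (the redshift_tmp_dir argument) and issues a COPY for you. A minimal sketch, with connection name, database, table, and paths all as placeholder assumptions:

```python
# Sketch of the connection_options for writing a DynamicFrame straight
# to Redshift from a Glue job. Names and paths are placeholders.

def redshift_write_config(database, table):
    """connection_options for glueContext.write_dynamic_frame.from_jdbc_conf."""
    return {
        "dbtable": table,     # target table in Redshift
        "database": database, # target database
    }

cfg = redshift_write_config("dw", "public.fact_orders")

# Usage inside a Glue job (not executed here); Glue stages the frame in
# the tmp dir on S3 and runs COPY into the target table for you:
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=dyf,
#     catalog_connection="redshift-conn",            # Glue connection name
#     connection_options=cfg,
#     redshift_tmp_dir="s3://my-etl-bucket/glue-tmp/",
# )
```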

Please let me know if this helps.

answered 4 years ago
