I am trying to use AWS Glue Studio to build a simple ETL workflow. Basically, I have a bunch of CSV files in different directories in S3. I want those CSVs to be accessible via a database and have chosen Redshift for the job. The directories will be updated every day with new CSV files. The file structure is:
YYYY-MM-DD (e.g. 2023-03-07)
|---- groupName1
|     |---- groupName1.csv
|---- groupName2
|     |---- groupName2.csv
...
|---- groupNameN
|     |---- groupNameN.csv
We will be keeping historical data, so every day I will have a new date-based directory.
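To make the layout concrete, this is the kind of per-day, per-group load I'd expect to run by hand with the Redshift Data API (a hypothetical sketch; the cluster, database, user, table name, bucket, and IAM role are all placeholders for my actual setup):

```python
# Hypothetical sketch: one COPY per group directory for a single day's data.
# Cluster details, table name, bucket, and IAM role ARN are placeholders.
import boto3

client = boto3.client("redshift-data")

date_prefix = "2023-03-07"
group = "groupName1"

client.execute_statement(
    ClusterIdentifier="my-cluster",  # placeholder
    Database="dev",                  # placeholder
    DbUser="admin",                  # placeholder
    Sql=f"""
        COPY {group}
        FROM 's3://my-bucket/{date_prefix}/{group}/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """,
)
```

Doing this for hundreds of groups every day is obviously what I'd want Glue to automate.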
I've read that AWS Glue can automatically copy data on a schedule, but I can't see my Redshift databases or tables (screenshot below). I'm using my AWS admin account, and I do have the AWSGlueConsoleFullAccess permission (screenshot below).
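From what I've read, Glue needs a JDBC connection to the cluster before it can list Redshift databases and tables at all, which may be what I'm missing. A minimal sketch of what I believe that setup looks like via boto3 (the connection name, JDBC URL, credentials, and network values are placeholders):

```python
# Minimal sketch of the Glue JDBC connection I believe is required before
# Glue can see Redshift databases/tables. All values are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "my-redshift-connection",  # placeholder
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
            "USERNAME": "admin",
            "PASSWORD": "...",  # elided
        },
        # Glue needs network access into the cluster's VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",           # placeholder
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder
            "AvailabilityZone": "us-east-1a",                 # placeholder
        },
    }
)
```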
So if I have hundreds of new .csv files every day in new directories in S3, what is a recommended approach to scalably load that data into Redshift tables? Also, what is the best way of creating those hundreds of Redshift tables to begin with?