Create an Athena table that query multiple csv files in one S3 bucket

0

Hi all,

I hope you are doing amazing!! I'm trying to figure out how I can create a table with Athena that combines all the data in my CSV files stored in an S3 bucket. These CSV files share the same schema and they are updated regularly by my lambda function. How can I create such an Athena table that can combine them together?

Any leads would be appreciated! Many thanks!

asked 2 months ago129 views
1 Answer
0
Accepted Answer

To create an Athena table that queries multiple CSV files in one S3 bucket, you can follow these steps:

  1. Open the Athena console in AWS.

  2. In the query editor, choose "Create" next to "Tables and views", then select "S3 bucket data".

  3. Fill out the "Create Table From S3 bucket data" form with the following information:

    • Table name
    • Database name (create a new one if needed)
    • Location of your S3 bucket containing the CSV files
  4. Define the schema of your table based on the structure of your CSV files. Make sure it matches the columns in your files.

  5. For the data format, select "CSV".

  6. In the "Table properties" section, you may want to add: 'skip.header.line.count'='1' if your CSV files have headers.

  7. Choose "Create table" to finish.

This approach will create a table that includes all CSV files in the specified S3 location. Athena will treat all files with the same schema in that location as part of the same table. You don't need to do anything special to combine multiple files; Athena will automatically consider all matching files in the bucket when you query the table.

If your files are updated regularly by a Lambda function, the table will automatically include new data as it's added to the S3 bucket. You don't need to update the table definition each time new files are added.

To optimize performance and reduce costs, consider partitioning your data if it makes sense for your use case. For example, if your data has a date component, you could organize your files into a folder structure like 'year=YYYY/month=MM/day=DD/' and create a partitioned table in Athena. This would allow Athena to scan only the relevant partitions when querying data for specific date ranges.

Remember to ensure that your IAM permissions are set up correctly to allow Athena to access your S3 bucket and that your CSV files maintain a consistent schema across all files for the table to work correctly.
Sources
Create tables using AWS Glue or the Athena console - Amazon Athena
Query data from multiple sources in S3 on Athena? | AWS re:Post

profile picture
answered 2 months ago
profile picture
EXPERT
reviewed 2 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions