Is there a way to use existing bucket data to load data to an S3 table?


Hi,

A question related to the newly launched S3 Tables: How can I create an S3 table using existing data in a bucket path? The examples I see here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-getting-started.html either insert data into an S3 table or read from a path to create one. If I have an existing pipeline that writes to a regular S3 bucket, do I need to keep loading that data into the S3 table periodically?

I have an S3 path to which data is written every hour. Currently, I need to set up a crawler, partition projection, etc., to build a Glue table. I want to onboard this dataset to an S3 table to avoid creating crawlers, compaction scripts, and so on. But the documentation doesn't explain how to connect my bucket to an S3 table.

Any help on this would be great.

asked 2 months ago · 513 views
2 Answers

While the S3 Tables feature is designed to optimize storage and querying of tabular data, it doesn't directly provide a method to create an S3 table using existing data in a regular S3 bucket path. However, you can still leverage S3 Tables for your use case with some adjustments to your workflow.

To use S3 Tables with your existing data pipeline, you would need to modify your process to write data directly to the S3 table bucket instead of a regular S3 bucket. S3 Tables use a specific bucket type called a table bucket, which stores tables as subresources and supports the Apache Iceberg format.

Here's a general approach you could consider:

  1. Create a new table bucket using the S3 console or programmatically (see the sketch after this list).

  2. Set up your table structure within the table bucket. This would involve creating a namespace and table using SQL queries through a service like Amazon EMR with Apache Iceberg or another supported analytics engine.

  3. Modify your existing hourly data pipeline to write directly to this new S3 table instead of the regular S3 bucket. This would likely involve changes to your data writing process to ensure it's compatible with the S3 Tables format (Apache Iceberg).

  4. For your existing data, you would need to migrate it into the new S3 table format. This might involve writing a script to read from your current S3 path and insert the data into the new S3 table using supported methods like SQL INSERT statements or by loading data files.
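
As a rough sketch of steps 1 and 2 with boto3 (bucket, namespace, and table names here are hypothetical, and a recent boto3 version with the s3tables client is assumed):

```python
import boto3

# All names below are hypothetical -- substitute your own.
s3tables = boto3.client("s3tables", region_name="us-east-1")

# Step 1: create the table bucket.
bucket = s3tables.create_table_bucket(name="my-table-bucket")
table_bucket_arn = bucket["arn"]

# Step 2: create a namespace, then an empty Iceberg table inside it.
# Defining the table's schema and loading data is typically done through
# an analytics engine (Spark on EMR, Athena, ...), not this API.
s3tables.create_namespace(
    tableBucketARN=table_bucket_arn,
    namespace=["hourly_events"],
)
s3tables.create_table(
    tableBucketARN=table_bucket_arn,
    namespace="hourly_events",
    name="events",
    format="ICEBERG",
)
```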

By using S3 Tables, you can indeed avoid the need for separate crawlers, compaction scripts, and manual partition management. S3 Tables provide automated table optimization, built-in support for Iceberg format, and integration with AWS analytics services.

It's worth noting that when you create a table bucket through the S3 console, it's automatically integrated with AWS analytics services. This integration adds your tables to the AWS Glue Data Catalog, allowing you to work with them using services like Amazon Athena, Amazon Redshift, and Amazon QuickSight without additional setup.
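
For example, once that integration is in place, the table can be queried from Athena by fully qualifying it under the s3tablescatalog catalog. A minimal sketch (all names hypothetical):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# "s3tablescatalog/<table-bucket>" is how the Glue integration exposes
# S3 tables to Athena; adjust names and the output location as needed.
athena.start_query_execution(
    QueryString='SELECT COUNT(*) FROM "s3tablescatalog/my-table-bucket"'
                '."hourly_events"."events"',
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```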

While this approach requires some initial setup and changes to your existing pipeline, it could provide long-term benefits in terms of performance, manageability, and integration with AWS analytics services.
Sources
Working with Amazon S3 Tables and table buckets - Amazon Simple Storage Service
Tutorial: Getting started with S3 Tables - Amazon Simple Storage Service

answered 2 months ago

Hi,

Yes, you can load S3 Tables from existing bucket content. I am personally doing it right now via Athena:

  1. My Python script first reads data items in S3
  2. Then, using boto3.client("athena"), it runs DML queries via start_query_execution (https://boto3.amazonaws.com/v1/documentation/api/1.26.93/reference/services/athena/client/start_query_execution.html) to run SQL INSERTs that load the data into an S3 table

It's currently briefly detailed here (with SELECT): https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-athena.html
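
A minimal sketch of this pattern (all names are hypothetical, and it assumes your existing data is already registered as a Glue external table that Athena can query, with a schema compatible with the target S3 table):

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical: "source_db"."raw_events" is the Glue external table over
# the existing bucket path; the INSERT target is an S3 table exposed via
# the s3tablescatalog integration.
QUERY = """
INSERT INTO "s3tablescatalog/my-table-bucket"."hourly_events"."events"
SELECT * FROM "awsdatacatalog"."source_db"."raw_events"
"""

qid = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the DML statement finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
print(state)
```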

Stay tuned: I will publish my full code after some polishing in the next few days (hopefully).

Best,

Didier

answered 2 months ago
  • Hi Didier, do you have your code published? I also need to migrate existing data in S3 (JSON files), do some transformations, and insert the transformed JSON into S3 Tables. From my research so far, a Spark job is a potential solution, but I was curious whether Athena can be used, as it's much more lightweight. Thanks
