While the S3 Tables feature is designed to optimize storage and querying of tabular data, it doesn't directly provide a method to create an S3 table using existing data in a regular S3 bucket path. However, you can still leverage S3 Tables for your use case with some adjustments to your workflow.
To use S3 Tables with your existing data pipeline, you would need to modify your process to write data directly to the S3 table bucket instead of a regular S3 bucket. S3 Tables use a specific bucket type called a table bucket, which stores tables as subresources and supports the Apache Iceberg format.
Here's a general approach you could consider:
- Create a new table bucket using the S3 console or programmatically.
- Set up your table structure within the table bucket. This would involve creating a namespace and table using SQL queries through a service like Amazon EMR with Apache Iceberg or another supported analytics engine.
- Modify your existing hourly data pipeline to write directly to this new S3 table instead of the regular S3 bucket. This would likely involve changes to your data writing process to ensure it's compatible with the S3 Tables format (Apache Iceberg).
- For your existing data, you would need to migrate it into the new S3 table format. This might involve writing a script to read from your current S3 path and insert the data into the new S3 table using supported methods like SQL INSERT statements or by loading data files.
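The final migration step can be prototyped without a Spark job by generating Athena DML. A minimal sketch, assuming hypothetical catalog, database, table, and partition-column names (the actual submission to Athena via boto3 is omitted so the helper stays self-contained):

```python
from typing import Optional

def build_migration_insert(target_catalog: str, target_db: str,
                           source_db: str, table: str,
                           partition_filter: Optional[str] = None) -> str:
    """Build an Athena INSERT INTO ... SELECT statement that copies rows
    from a regular external table into an S3 Tables (Iceberg) table.

    All identifiers here are hypothetical examples; substitute your own.
    """
    sql = (f'INSERT INTO "{target_catalog}"."{target_db}"."{table}" '
           f'SELECT * FROM "AwsDataCatalog"."{source_db}"."{table}"')
    if partition_filter:
        # Migrate one hourly partition at a time to keep each query small.
        sql += f" WHERE {partition_filter}"
    return sql

# Example: copy a single hour's worth of data (names are placeholders).
stmt = build_migration_insert("s3tablescatalog", "perf1", "external_3tb_db",
                              "call_center", "dt = '2025-01-01-00'")
```

Running one such statement per hourly partition also gives you a natural retry unit if a copy fails partway through.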
By using S3 Tables, you can indeed avoid the need for separate crawlers, compaction scripts, and manual partition management. S3 Tables provide automated table optimization, built-in support for Iceberg format, and integration with AWS analytics services.
It's worth noting that when you create a table bucket through the S3 console, it's automatically integrated with AWS analytics services. This integration adds your tables to the AWS Glue Data Catalog, allowing you to work with them using services like Amazon Athena, Amazon Redshift, and Amazon QuickSight without additional setup.
While this approach requires some initial setup and changes to your existing pipeline, it could provide long-term benefits in terms of performance, manageability, and integration with AWS analytics services.
Sources
- Working with Amazon S3 Tables and table buckets - Amazon Simple Storage Service
- Tutorial: Getting started with S3 Tables - Amazon Simple Storage Service
Hi,
Yes, you can load S3 Tables from existing bucket content. I am personally doing it right now via Athena:
- My Python script first reads data items in S3
- Then, using `boto3.client("athena")`, it runs DML queries via https://boto3.amazonaws.com/v1/documentation/api/1.26.93/reference/services/athena/client/start_query_execution.html to issue SQL INSERTs that load the data into an S3 table
It's currently briefly detailed here (with SELECT): https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-athena.html
Stay tuned: I will publish my full code after some polishing in the next few days (hopefully).
Best,
Didier
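For reference, the `start_query_execution` pattern described above boils down to assembling the request parameters and submitting the DML. A sketch of the parameter assembly, with placeholder database, table, and output-location names (the actual call, shown commented out, requires AWS credentials):

```python
def athena_query_params(sql: str, database: str, output_s3: str,
                        workgroup: str = "primary") -> dict:
    """Assemble the kwargs for athena_client.start_query_execution().
    All resource names passed in here are placeholders, not real resources."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
        "WorkGroup": workgroup,
    }

params = athena_query_params(
    "INSERT INTO my_namespace.my_table SELECT * FROM staging_table",
    database="my_namespace",
    output_s3="s3://my-query-results/athena/",  # placeholder results bucket
)
# With credentials configured, the submission would be:
#   import boto3
#   athena = boto3.client("athena")
#   response = athena.start_query_execution(**params)
#   query_id = response["QueryExecutionId"]
```

Since `start_query_execution` is asynchronous, a real script would poll `get_query_execution` until the state reaches SUCCEEDED or FAILED before issuing the next INSERT.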
If you have structured data in S3 and don't want to write code to move it to S3 Tables, just create two tables: a regular Athena external table and an S3 table. Then run `INSERT INTO` the S3 table `SELECT * FROM` the Athena external table.
In my case, working with the TPC-DS dataset:
S3 Table
```sql
CREATE TABLE perf1.call_center (
cc_call_center_sk int,
cc_call_center_id string,
cc_rec_start_date date,
cc_rec_end_date date,
cc_closed_date_sk int,
cc_open_date_sk int,
cc_name string,
cc_class string,
cc_employees int,
cc_sq_ft int,
cc_hours string,
cc_manager string,
cc_mkt_id int,
cc_mkt_class string,
cc_mkt_desc string,
cc_market_manager string,
cc_division int,
cc_division_name string,
cc_company int,
cc_company_name string,
cc_street_number string,
cc_street_name string,
cc_street_type string,
cc_suite_number string,
cc_city string,
cc_county string,
cc_state string,
cc_zip string,
cc_country string,
cc_gmt_offset decimal(5, 2),
cc_tax_percentage decimal(5, 2))
TBLPROPERTIES (
'table_type'='iceberg',
'write_compression'='zstd'
);
```
External Table
```sql
CREATE EXTERNAL TABLE `call_center`(
`cc_call_center_sk` int,
`cc_call_center_id` string,
`cc_rec_start_date` date,
`cc_rec_end_date` date,
`cc_closed_date_sk` int,
`cc_open_date_sk` int,
`cc_name` string,
`cc_class` string,
`cc_employees` int,
`cc_sq_ft` int,
`cc_hours` string,
`cc_manager` string,
`cc_mkt_id` int,
`cc_mkt_class` string,
`cc_mkt_desc` string,
`cc_market_manager` string,
`cc_division` int,
`cc_division_name` string,
`cc_company` int,
`cc_company_name` string,
`cc_street_number` string,
`cc_street_name` string,
`cc_street_type` string,
`cc_suite_number` string,
`cc_city` string,
`cc_county` string,
`cc_state` string,
`cc_zip` string,
`cc_country` string,
`cc_gmt_offset` decimal(5,2),
`cc_tax_percentage` decimal(5,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://BUCKET/TPC-DS/3TB/call_center'
```
Insert Statement
```sql
INSERT INTO "s3tblnamespace"."call_center"
SELECT * FROM "AwsDataCatalog"."external_3tb_db"."call_center"
```
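The same copy can be repeated for every table in the dataset; a small loop generating one INSERT per table name keeps this scriptable (the namespace and database identifiers below reuse the example above, and the TPC-DS table list is truncated for brevity):

```python
# A few TPC-DS table names for illustration; extend with the full set.
TABLES = ["call_center", "catalog_page", "customer", "item", "store_sales"]

def copy_statement(table: str) -> str:
    """Build the INSERT ... SELECT that copies one external table into
    the S3 table of the same name."""
    return (f'INSERT INTO "s3tblnamespace"."{table}" '
            f'SELECT * FROM "AwsDataCatalog"."external_3tb_db"."{table}"')

statements = [copy_statement(t) for t in TABLES]
```

Each statement can then be submitted to Athena in turn, which avoids hand-writing dozens of near-identical queries.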
Hello, can you share the code you used to read data items in S3 and then run DML queries with `boto3.client("athena")` to load them into an S3 table via SQL inserts?

Hi Didier, do you have your code published? I also need to migrate existing data in S3 (JSON files), apply some transformations, and insert the transformed JSON into S3 Tables. From my research so far, a Spark job is a potential solution, but I was curious whether Athena could be used instead, as it is much more lightweight. Thanks