You can create the table with either the L1 construct (CfnTable) or the L2 construct (Table), whichever you prefer.
If you are familiar with CloudFormation, the L1 construct may be easier to use.
https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_glue_alpha/Table.html
https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_glue/CfnTable.html
I'm a little late on this, but I was able to create a Glue table using AWS CDK (level 1 constructs - aws_cdk/aws_glue using Python). I first generated the table using the CREATE EXTERNAL TABLE Athena DDL command. I examined the generated table, then examined its properties to reverse engineer what was needed in the CDK. To make sure this worked, I also deleted the table generated by the CREATE EXTERNAL TABLE DDL, then ran the CDK separately.
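The "examine the generated table and compare" step can be partly automated. Below is a minimal sketch, assuming both table definitions are already available as plain dictionaries; `flatten`, `diff_tables`, and `MISSING` are hypothetical helper names introduced here, not part of any AWS library.

```python
from typing import Any

MISSING = "<missing>"  # placeholder for a key that exists on only one side


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts/lists into {"a.b[0].c": value} pairs."""
    flat: dict[str, Any] = {}
    if isinstance(obj, dict) and obj:
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    elif isinstance(obj, list) and obj:
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        # Scalars and empty containers are kept as leaf values.
        flat[prefix] = obj
    return flat


def diff_tables(reference: dict, candidate: dict) -> dict[str, tuple[Any, Any]]:
    """Return {path: (reference_value, candidate_value)} for every difference."""
    left, right = flatten(reference), flatten(candidate)
    return {
        path: (left.get(path, MISSING), right.get(path, MISSING))
        for path in sorted(left.keys() | right.keys())
        if left.get(path, MISSING) != right.get(path, MISSING)
    }
```

One way to obtain the two dictionaries is the `Table` entry of the boto3 Glue client's `get_table(DatabaseName=..., Name=...)` response, fetched once for the DDL-created table and once for the CDK-created one. Fields such as `CreateTime` and `UpdateTime` will always differ, so the interesting entries are those under `Parameters` and `StorageDescriptor`.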
I had previously created my Glue database using the following CDK:
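from aws_cdk.aws_glue import CfnDatabase, CfnTable

# Defined inside a Stack, so `self` is the stack and `self.account` its account ID.
my_database: CfnDatabase = CfnDatabase(
    self,
    "my_database",
    catalog_id=self.account,
    database_input={
        "name": "my_database"
    }
)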
Here's my CDK for the table definition:
CfnTable(
    self,
    "my_table",
    catalog_id=self.account,
    database_name=my_database.database_input.name,
    table_input={
        "name": "my_table",
        "tableType": "EXTERNAL_TABLE",
        "parameters": {
            # "delta.lastUpdateVersion": "2",
            # "delta.lastCommitTimestamp": "1716930161932",
            "EXTERNAL": "TRUE",
            # "spark.sql.sources.schema.part.0": "{\"type\":\"struct\",\"fields\":" + json.dumps(
            #     my_data_schema) + "}",
            "spark.sql.sources.partitionProvider": "catalog",
            "spark.sql.sources.provider": "delta",
            # "spark.sql.sources.schema.numParts": "1",
            "table_type": "delta"
        },
        "partitionKeys": [{"name": "my_partition_field_column", "type": "string"}],
        "storageDescriptor": {
            "columns": my_data_schema,
            "location": "s3://mybucket/path-to-delta",
            "inputFormat": "org.apache.hadoop.mapred.SequenceFileInputFormat",
            "outputFormat": "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat",
            "serdeInfo": {
                "name": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "serializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "parameters": {
                    "path": "s3://mybucket/path-to-delta",
                    "serialization.format": "1"
                }
            }
        }
    }
)
The major annoyance is that I had to include my schema, because Glue could not determine it automatically (unlike with the CREATE EXTERNAL TABLE command). The schema I have in my_data_schema looks something like this:
from typing import Union

my_data_schema: list[dict[str, Union[str, bool, dict]]] = [
    {"name": "field_1", "type": "decimal(10,2)", "nullable": True, "metadata": {}},
    {"name": "field_2", "type": "string", "nullable": True, "metadata": {}},
    ...
]
Obviously, it would be nice if Glue could just determine the schema from the Delta table metadata, but the Athena queries failed with errors when I did not include the columns property under the storageDescriptor property.
I was able to comment out/delete fields like delta.lastUpdateVersion, delta.lastCommitTimestamp, spark.sql.sources.schema.part.0, and spark.sql.sources.schema.numParts as they didn't seem to be needed for my Athena queries to work. I didn't test out every single field so there may be a few others that could be removed without causing issues.
I'm still hoping I don't have to explicitly define the schema in the CDK code. If anyone can comment on how I can get all this to work without defining the schema in the CDK, I would really appreciate it!
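One possible way to avoid hand-writing the column list is to derive it from the schema that Delta itself records. The sketch below assumes you already have the `schemaString` value from the `metaData` action in the table's `_delta_log`, and that only primitive column types are involved. The function name `glue_columns_from_schema_string` and the `SPARK_TO_GLUE` mapping are hypothetical helpers introduced here, and whether Athena accepts the result for a given table has not been verified.

```python
import json

# Spark/Delta primitive type names that differ from the Hive-style names used in Glue.
SPARK_TO_GLUE = {"integer": "int", "long": "bigint", "short": "smallint", "byte": "tinyint"}


def glue_columns_from_schema_string(schema_string, partition_columns=()):
    """Split a Delta schemaString into (data columns, partition keys) as name/type dicts."""
    all_columns = []
    for field in json.loads(schema_string)["fields"]:
        spark_type = field["type"]
        if not isinstance(spark_type, str):
            # Nested struct/array/map types are JSON objects; this sketch does not translate them.
            raise ValueError(f"unsupported nested type for column {field['name']!r}")
        all_columns.append(
            {"name": field["name"], "type": SPARK_TO_GLUE.get(spark_type, spark_type)}
        )

    by_name = {column["name"]: column for column in all_columns}
    # Assumption: partition columns belong under partitionKeys rather than columns.
    partition_keys = [by_name[name] for name in partition_columns]
    columns = [c for c in all_columns if c["name"] not in set(partition_columns)]
    return columns, partition_keys
```

The two returned lists have the same name/type shape as the `columns` and `partitionKeys` values in the table definition above. Note that this only moves the schema lookup to synth time: the CDK app would need read access to the Delta log, and a schema change in the table would still require a redeploy.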
Hi Riku, thank you for your answer. I actually tried both before posting my question; I have now edited the question to make that clearer.
I don't quite understand why the Athena table is empty. Are the S3 bucket and folder paths correct? Also, are you using a crawler to retrieve data from S3 in addition to creating Glue tables? https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
I tried the code above, and the Glue table shows the right location, but when I query it from Athena it returns no results, whereas when I created the table with the CREATE EXTERNAL TABLE command everything worked perfectly. No, I am not using a crawler.
First, make sure that the table you created in Glue contains data. If the data is correctly in the Glue table, it seems to me that there is a problem with the configuration on the Athena side. Have you selected the correct database on the Athena side?
Yes, everything is right. It actually seems there is an issue with my CDK code, because the table is there and the location is right, but no data shows up when querying. Have you tried running my dummy CDK code on your own?