I figured out what the issue was: the column names in my parquet files were mixed case. If the dataframe had a column name like SomeColumnName, I could write the dataframe to a table in S3 Tables, but the table wouldn't show up in the Lake Formation catalog. Once I renamed all the dataframe columns so that they contained only lowercase characters (something like somecolumnname or some_column_name), I could see the table in both S3 Tables and the Lake Formation catalog.
I used the following code to clean my dataframe before writing to S3.
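from functools import reduce

# Convert each column name to lowercase, putting "_" before every capital after the first character
oldColumns = df.schema.names
newColumns = []
for name in oldColumns:
    newColumns.append(name[0].lower() + ''.join('_' + x.lower() if x.isupper() else x for x in name[1:]))
# Rename the columns one at a time, pairing each old name with its new name by index
df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), df)
df.printSchema()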
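The snippet above puts an underscore before every capital letter, so a name like userID would become user_i_d. Here is a rough sketch of a variant that keeps runs of capitals together and also fails early if two columns would end up with the same lowercase name. The helper names to_snake_case and build_rename_map are my own for illustration and are not part of any AWS or Spark API.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a CamelCase or mixedCase column name to lowercase snake_case."""
    # Boundary between a lowercase letter or digit and a capital: someColumn -> some_Column
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Boundary between a run of capitals and a capitalised word: HTTPServer -> HTTP_Server
    s = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", s)
    return s.lower()


def build_rename_map(columns):
    """Map each original column name to its snake_case form, rejecting collisions."""
    mapping = {}
    seen = {}
    for old in columns:
        new = to_snake_case(old)
        if new in seen:
            raise ValueError(
                f"columns {seen[new]!r} and {old!r} both map to {new!r}"
            )
        seen[new] = old
        mapping[old] = new
    return mapping


if __name__ == "__main__":
    for example in ["SomeColumnName", "userID", "already_lower"]:
        print(example, "->", to_snake_case(example))
```

With a Spark dataframe, the mapping could then be applied in one call, for example df = df.toDF(*[mapping[c] for c in df.columns]) where mapping = build_rename_map(df.columns).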
When working with AWS Glue to load parquet files from S3 to S3 Tables, there are several potential reasons why your tables might not be visible in Lake Formation after creation.
The issue appears to be related to how you're creating the S3 table from the parquet file. When you create a dataframe directly in your code and write it to an S3 table, it works correctly. However, when reading from an existing parquet file in S3, the table is created but not visible in Lake Formation.
Here are some potential solutions to consider:
- Make sure your Spark session is properly configured with all the necessary S3 Tables Catalog for Apache Iceberg configurations. Your Spark session should include configurations like:
  - spark.sql.extensions
  - spark.sql.defaultCatalog
  - spark.sql.catalog.s3tablesbucket
  - spark.sql.catalog.s3tablesbucket.catalog-impl
  - spark.sql.catalog.s3tablesbucket.warehouse
- Ensure that the data lake location is properly registered in Lake Formation. If the S3 location isn't registered as a data lake location in Lake Formation, the tables might not appear.
- Check that your AWS Glue job's IAM role has the necessary Lake Formation permissions (SELECT and DESCRIBE) on the tables you're trying to access.
- Be aware that AWS Glue and Lake Formation don't support mixed-case column names and convert all column names to lowercase. If your parquet files have mixed-case column names, this could cause issues.
- Try using a different approach to create the table. Instead of using the DataFrame API directly, you could try using Spark SQL to create the table first and then insert the data:
spark.sql(f"CREATE TABLE IF NOT EXISTS {namespace}.{table} USING iceberg AS SELECT * FROM parquet.`{s3_path}`")
- Consider explicitly specifying the schema when creating the table rather than inferring it from the parquet file.
These approaches might help ensure that your S3 tables are properly registered with Lake Formation when creating them from parquet files in S3.
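For the Spark session settings in the first item of the list above, a minimal sketch of how the five keys fit together might look like this. The helper name s3tables_spark_conf and the table bucket ARN are hypothetical placeholders, and the class names are the ones commonly shown for the Iceberg Spark extensions and the S3 Tables catalog, so verify them against the current AWS documentation for your Glue version.

```python
def s3tables_spark_conf(catalog_name: str, table_bucket_arn: str) -> dict:
    """Build the Spark settings for an Iceberg catalog backed by an S3 table bucket."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables the Iceberg SQL extensions in Spark
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Makes unqualified namespace.table names resolve in this catalog
        "spark.sql.defaultCatalog": catalog_name,
        # Registers the catalog as an Iceberg Spark catalog
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Assumed implementation class for the S3 Tables catalog
        f"{prefix}.catalog-impl": "software.amazon.s3tables.iceberg.S3TablesCatalog",
        # The warehouse is the ARN of the table bucket
        f"{prefix}.warehouse": table_bucket_arn,
    }


# Placeholder ARN for illustration only; replace with your own table bucket ARN
conf = s3tables_spark_conf(
    "s3tablesbucket",
    "arn:aws:s3tables:us-east-1:111122223333:bucket/example-bucket",
)
for key, value in conf.items():
    print(f"{key}={value}")
```

Each pair can then be passed to SparkSession.builder.config(key, value) before calling getOrCreate(), or supplied through the Glue job's --conf parameter.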
Sources
Troubleshooting - AWS Glue
S3 tables catalog integration limitations - AWS Lake Formation
Running ETL jobs on Amazon S3 tables with AWS Glue - Amazon Simple Storage Service
