Skip to content

Athena table does not work after copying underlying data in s3

0

Scenario:

I have an athena table: myTable

CREATE EXTERNAL TABLE `myTable`(
  `column1` bigint, 
  `column2` bigint, 
   ...
)
PARTITIONED BY ( 
  `columnA` string, 
  `columnB` string, 
   ...)

ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket/prefix/.../myTable/'
TBLPROPERTIES (
  'classification'='parquet')

I can query this table in athena.

I copy the data from s3://bucket/prefix/.../myTable/ to s3://bucket/prefix/.../myNewTable/

(same bucket, same high level prefix, same s3 permissions should apply)

I create a new athena table:

CREATE EXTERNAL TABLE `myNewTable`(
  `column1` bigint, 
  `column2` bigint, 
   ...
)
PARTITIONED BY ( 
  `columnA` string, 
  `columnB` string, 
   ...)

ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket/prefix/.../myNewTable/'
TBLPROPERTIES (
  'classification'='parquet')

I update the partitions in the glue metadata:

msck repair table myNewTable

Partitions not in metastore: (correct list of partitions)

Partitions missing from filesystem: (same list)

But when I go to query

select * from myNewTable limit 10

I get zero results.

No errors. Just no results.

What am I missing?

Doesn't seem to be permission related, because it's the same bucket and same overall prefix.

I pointed a glue crawler at the prefix for the new table, and that creates a table that works.

If I take the create table statement from the crawler-generated table and run it manually - same thing: No errors but no results in my query.

"Show partitions" shows the correct partitions (same in all 3 examples)

It seems like there is something in the glue metastore that is not lining up correctly, but I cannot figure out what it is.

asked a year ago180 views

2 Answers
0
Accepted Answer

I determined the root cause here:

One of my partition columns is a date in the format "YYYY-MM-DD HH:MM:SS"

If I replace that whitespace by removing it or replacing it with something like an underscore, this whole problem goes away.

But it is interesting that I can have the white space in the original table (created with data wrangler) and in the glue-created table (created with the crawler), but it fails if I create a copy manually.

answered a year ago

0

When you're getting zero results (not errors) from your Athena query after copying data to a new location, there are several potential issues to consider:

  1. S3 Location Format: Athena is very particular about the S3 location format in table definitions. Make sure your location path doesn't include any wildcards or file patterns. The location should be a directory path ending with a forward slash, not pointing to specific files.

  2. Partition Path Structure: When you copied the data, ensure that the partition directory structure was preserved exactly. The partition directories must match the partition column names and values defined in your table.

  3. MSCK REPAIR TABLE Behavior: The "Partitions missing from filesystem" message suggests that Glue is expecting partitions that don't exist in the physical location. This could indicate a mismatch between the partition structure in your table definition and the actual directory structure in S3.

  4. File Format Compatibility: Ensure the parquet files were copied correctly and maintain their format integrity.

  5. Partition Format Issues: If your partitions use date/time values or numbers with specific formatting (like leading zeros), there might be a mismatch between how the partitions are defined and how they're physically stored.

Since a Glue crawler successfully creates a working table, this suggests the data is accessible and properly formatted. The difference is likely in how the partitions are defined or recognized.

Try these approaches:

  • Compare the table properties between the crawler-generated table and your manually created one
  • Check if there are any hidden properties or settings in the crawler-generated table
  • Verify the exact partition path structure in S3 matches what's expected in your table definition
  • Consider using ALTER TABLE ADD PARTITION statements to manually add partitions instead of MSCK REPAIR

Since the crawler-generated table works, you might want to use that as your starting point rather than trying to recreate the table definition manually.
Sources
Athena query to table stored as parquet files on s3 returns empty result | AWS re:Post
Athena partition projection BUG | AWS re:Post
Amazon Athena returns no records for Amazon Connect CTR data on S3 | AWS re:Post

answered a year ago

  • I have taken all of these suggestions into consideration. I can use the EXACT syntax from the crawler table and simply change the name, and am unable to query the new table. I could really use some ideas, here.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.