Here is a working example: https://docs.aws.amazon.com/athena/latest/ug/avro-serde.html, where the files are stored in LOCATION 's3://athena-examples-us-west-1/flight/avro/'. Sadly the files are not accessible in the S3 bucket, so I'm unable to clarify why those work and mine do not.
If you are getting "HIVE_CANNOT_OPEN_SPLIT: Not an Avro data file", the files in your S3 bucket are not in the correct format. I'm using Apache Avro in C# (https://github.com/apache/avro) to generate my files, and it seems there is something wrong with them; even the most basic files are not recognized.
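For anyone hitting the same error: a valid Avro object container file starts with the magic bytes "Obj" plus a version byte of 1, and "Not an Avro data file" is what the reader throws when that magic is missing. A minimal Python sketch to check a downloaded file (the file name is a placeholder):
AVRO_MAGIC = b"Obj\x01"   # "Obj" + version byte 1, per the Avro spec
GZIP_MAGIC = b"\x1f\x8b"  # signature of an external gzip wrapper

def check_file(path):
    # Read just the leading bytes and classify the file.
    with open(path, "rb") as f:
        head = f.read(4)
    if head == AVRO_MAGIC:
        print(path, "looks like an Avro container file")
    elif head.startswith(GZIP_MAGIC):
        print(path, "is gzip-compressed, not a raw Avro container")
    else:
        print(path, "has unrecognized leading bytes:", head)

check_file("myfile.avro")  # placeholder file name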
Hi Chris,
Please go through the steps and documentation links below; I hope they help you resolve your issue.
1. Ensure Correct Compression Configuration
Make sure the Firehose transformation Lambda is correctly compressing the files with gzip. Double-check the Lambda function to confirm it correctly handles compression after converting the data to Avro format.
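For reference, a minimal sketch of a Firehose transformation handler that gzips its output after the Avro conversion (to_avro_bytes is a hypothetical stand-in for your existing conversion logic, not an AWS API):
import base64
import gzip

def to_avro_bytes(payload):
    # Hypothetical stand-in for your existing record-to-Avro conversion.
    raise NotImplementedError

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"])
        avro_bytes = to_avro_bytes(raw)
        compressed = gzip.compress(avro_bytes)  # gzip after converting to Avro
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(compressed).decode("utf-8"),
        })
    return {"records": output}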
2. Check S3 File Structure and Permissions
Verify the structure and permissions of the files in S3. Ensure the files are indeed in Avro format and properly gzipped. Sometimes, incorrect file permissions can cause issues with Athena.
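One way to verify the format without downloading every file is to read the leading bytes of each object with a ranged GET (a sketch; bucket and prefix are placeholders matching the DDL below):
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="XXXXXXXXXX", Prefix="archive/"):
    for obj in page.get("Contents", []):
        # Ranged GET: fetch only the first four bytes of the object.
        head = s3.get_object(Bucket="XXXXXXXXXX", Key=obj["Key"],
                             Range="bytes=0-3")["Body"].read()
        if head == b"Obj\x01":              # Avro container magic
            kind = "avro"
        elif head.startswith(b"\x1f\x8b"):  # gzip signature
            kind = "gzip"
        else:
            kind = "unknown " + repr(head)
        print(obj["Key"], obj["Size"], kind)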
3. Correct SerDe and Compression Settings in Athena DDL
Make sure the DDL statement in Athena correctly specifies the SerDe and accounts for the gzip compression. Here's an example with placeholders for your schema and the correct S3 path (note that the value of avro.schema.literal must be valid JSON, so it cannot contain // comments; list every field explicitly in the "fields" array):
CREATE EXTERNAL TABLE mytablename (
  -- Specify your columns here, e.g.:
  id STRING,
  name STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.literal' = '{
    "type": "record",
    "name": "YourRecordName",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "name", "type": "string"}
    ]
  }'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://XXXXXXXXXX/archive/'
TBLPROPERTIES ('compressionType'='gzip');
4. Validate the Avro Schema
Ensure the Avro schema provided in avro.schema.literal exactly matches the schema used when writing the Avro files in your Firehose Lambda transformation. Any mismatch can cause Athena to fail when trying to read the files.
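One way to rule out a mismatch is to write a tiny file with exactly the schema from the DDL and query it through the same table. A sketch using the avro pip package (the file name and values are arbitrary):
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# The same schema string as in avro.schema.literal above.
SCHEMA = avro.schema.parse('''{
  "type": "record",
  "name": "YourRecordName",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string"}
  ]
}''')

writer = DataFileWriter(open("test.avro", "wb"), DatumWriter(), SCHEMA)
writer.append({"id": "1", "name": "test"})
writer.close()
# Upload test.avro under one of the table's partition prefixes and run a
# SELECT; if this file reads fine, the problem is in how the real files
# are written.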
5. Validate File Format
You mentioned manually validating the schema by downloading the files. Ensure that the files are correctly gzipped and in Avro format. You can use tools like avro-tools to inspect the schema of an Avro file (decompress first if the file is externally gzipped, since avro-tools expects a raw Avro container):
gunzip yourfile.avro.gz
avro-tools getschema yourfile.avro
This command will output the schema, allowing you to confirm it matches the schema defined in your Athena table.
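If avro-tools isn't handy, the fastavro package can run the same check in Python and also report the file's internal codec (a sketch; the file name is a placeholder):
import json
from fastavro import reader

with open("yourfile.avro", "rb") as f:
    avro_reader = reader(f)
    # Internal Avro codec (null, deflate, snappy), which is distinct
    # from any external gzip applied to the whole file.
    print("codec:", avro_reader.codec)
    # Compare this output with avro.schema.literal in the Athena DDL.
    print(json.dumps(avro_reader.writer_schema, indent=2))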
6. Recreate the Table
If the table was previously created and has been giving issues, consider dropping and recreating it to ensure there are no residual misconfigurations:
DROP TABLE IF EXISTS mytablename;
Then re-run the CREATE EXTERNAL TABLE statement from step 3 above. Because the table is partitioned, remember to load the partitions again afterwards, for example with MSCK REPAIR TABLE mytablename; if your S3 prefixes follow the Hive style (year=.../month=.../day=...), or by adding each partition explicitly with ALTER TABLE ADD PARTITION.
https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html
https://docs.aws.amazon.com/athena/latest/ug/
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
https://docs.aws.amazon.com/lambda/latest/dg/troubleshooting-execution.html
Thanks @Pandurangaswamy, I've added the gzip compression but get the same error. Can you confirm whether I need to change the "New line delimiter" setting from Not enabled to Enabled so that each "line" becomes a new object? As I understand it, AVRO does not need a newline character after each object, but I'm not sure whether Hive looks for one when splitting the rows.