HIVE_METASTORE_ERROR when running an Athena query to select the first 10 rows from a partitioned table created by a Glue Crawler.

0
  • At a high level, I have a Glue Crawler that generates a table based on CloudTrail logs in an S3 bucket. The problem is that Athena always generates an error when trying to run the preview query over the generated table. The preview query is just selecting the first 10 rows in the table. I can't get any Athena queries to work on this table.
  • The error I'm receiving is very similar to the first error under the short description heading on this page: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-metastore-error/
  • I did notice that some of the column names in my partition did have hyphens because the folders in S3 for the regions have hyphens. After converting the folders with hyphens to underscores in S3 and rerunning the Glue Crawler, I noticed the partition column names that were previously hyphens were now underscores; however, I still received the same hive metatstore error. At this point, I am stumped.
asked 2 years ago1391 views
1 Answer
0

Hello,

The above error message suggests that Hive was expecting ':' colon but found back-slash. This back-slash '' is working as an escape character for the colon ':'. This backslash is added by the crawler while creating a schema with special characters. Currently, Athena does not support any special character in the schema apart from underscore '_'

It is indeed that the structure of CloudTrail logs is quite standardized. However, at a closer look, structure of nested children fields might vary. Below are nested field structure in the "RequestParameter" field. As you can see, logs from different service API calls have different parameters.

S3 Request: {"bucketName":"xxxx","Host":"s3.eu-west-1.amazonaws.com","key":"xxxx"}

IAM Request: {"resourceArn":"arn:aws:sagemaker:eu-west-1:xxxx:xxxx"}

Thus, for CloudTrail logs, we'd recommend to use the CREATE TABLE DDL mentioned in the below document link[1] [-] https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html#create-cloudtrail-table

It is because this DDL defines a basic structure for the CloudTrail JSON records, but leaves the structure of nested fields flexible (eg. treating the whole RequestParameter field as JSON string). On the other hand, when crawling the data, Glue crawlers will try its best to recognize all the fields it see.

Therefore, I would suggest the below workaround:

  1. Create the table with Athena DDL. (from the Doc above)
  2. Run a Glue crawler to add the partitions to the table.

Step 1. Create the table with Athena DDL. (from the Doc above):

CREATE EXTERNAL TABLE cloudtrail_logs_testing (

eventversion STRING,

useridentity STRUCT<

... ...

PARTITIONED BY (region string, year string, month string, day string)

ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'

STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'

OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION 's3://CloudTrail_bucket_name/AWSLogs/Account_ID/CloudTrail/';

Step 2: Run a Glue crawler to add the partitions to the table

Please consider that this crawler should be created differently. Please find the steps below to create and run the Glue crawler.

a) Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/. Choose Crawlers in the navigation pane.

b) Choose Add crawler, provide 'Crawler name' and click on next.

c) Under 'Crawler source type', select 'Existing catalog tables' and click on next.

d) Under 'Available tables', click on 'Add' button which is adjacent to the table which was created in Athena i.e. 'cloudtrail_logs_testing' and click on next.

e) Select 'Choose an existing IAM role' and select your IAM role from the drop down list of 'IAM role' and click on next.

f) Under 'Frequency', choose as per your requirement and click on next. I have chosen 'Run on demand'.

g) On 'Configure the crawler's output' page, under 'Configuration options (optional)' select the following options for each section.

  - Under 'When the crawler detects schema changes in the data store, how should AWS Glue handle table updates in the data catalog section:

select 'Ignore the change and don't update the table in the data catalog' option and check the box for 'Update all new and existing partitions with metadata from the table' option.

  - Under 'How should AWS Glue handle deleted objects in the data store?' section, select 'Ignore the change and don't update the table in the data catalog' and click on next

h) Review all the steps and click on 'Finish' and Run the crawler to add all partition automatically.

Once crawler executed, You can query the data from Athena.

AWS
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions