使用 Amazon Athena 查询 AWS Config 文件时,我收到以下错误:“Error: HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key.”
简短描述
在以下情况下,通常会出现此错误:
- AWS Config 资源具有多个采用相同名称的标签。
- 其中一部分标签采用大写,而另一部分采用小写。
例如,以下记录使用 tc:Name 和 tc:name JSON 键:
{
"fileVersion": "1.0",
"configSnapshotId": "35eced35-a13a-45b7-81e4-446e35616e70",
"configurationItems": [
{
"tags": { "tc:Name": "6", "tc:name": "abc6-38" }
},
{
"tags": { "tc:Name": "6", "tc:name": "abc6-38" }
},
{
"tags": { "tc:Name": "6" }
},
{
"tags": { "tc:name": "6" }
}
]
}
解决方法
运行类似于以下示例的 CREATE TABLE 语句。此语句将会创建一个 Athena 表,将 case.insensitive 设为 false,并将列名称映射至与列名称不同的 JSON 键。在运行此语句之前,请注意以下几点:
- 在 LOCATION(位置)字段中,将 s3://awsexamplebucket/AWSLogs/ 替换为 Amazon Simple Storage Service(Amazon S3)存储桶的名称。
- 将所有映射属性替换为列名称和 JSON 键(例如,mapping.fileversion'='fileVersion')。
CREATE EXTERNAL TABLE aws_config_configuration_snapshot (
fileversion STRING,
configsnapshotid STRING,
configurationitems ARRAY < STRUCT <
configurationItemVersion : STRING,
configurationItemCaptureTime : STRING,
configurationStateId : BIGINT,
awsAccountId : STRING,
configurationItemStatus : STRING,
resourceType : STRING,
resourceId : STRING,
resourceName : STRING,
ARN : STRING,
awsRegion : STRING,
availabilityZone : STRING,
configurationStateMd5Hash : STRING,
configuration : STRING,
supplementaryConfiguration : MAP < STRING, STRING >,
tags: MAP < STRING, STRING >,
resourceCreationTime : STRING > >
)
PARTITIONED BY ( dt STRING , region STRING )
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'case.insensitive'='false,
'mapping.fileversion'='fileVersion',
'mapping.configsnapshotid'='configSnapshotId',
'mapping.configurationitems'='configurationItems',
'mapping.configurationitemversion'='configurationItemVersion',
'mapping.configurationitemcapturetime'='configurationItemCaptureTime',
'mapping.configurationstateid'='configurationStateId',
'mapping.awsaccountid'='awsAccountId',
'mapping.configurationitemstatus'='configurationItemStatus',
'mapping.resourcetype'='resourceType',
'mapping.resourceid'='resourceId',
'mapping.resourcename'='resourceName',
'mapping.arn'='ARN',
'mapping.awsregion'='awsRegion',
'mapping.availabilityzone'='availabilityZone',
'mapping.configurationstatemd5hash'='configurationStateMd5Hash',
'mapping.supplementaryconfiguration'='supplementaryConfiguration',
'mapping.configurationstateid'='configurationStateId'
)
LOCATION 's3://awsexamplebucket/AWSLogs/';
如果您拥有已加载分区的表,则可以将新的 SerDe 属性添加到表中。使用如下语句:
ALTER TABLE aws_config_configuration_snapshot SET TBLPROPERTIES (
'case.insensitive'='false',
'mapping.fileversion'='fileVersion',
'mapping.configsnapshotid'='configSnapshotId',
'mapping.configurationitems'='configurationItems',
'mapping.configurationitemversion'='configurationItemVersion',
'mapping.configurationitemcapturetime'='configurationItemCaptureTime',
'mapping.configurationstateid'='configurationStateId',
'mapping.awsaccountid'='awsAccountId',
'mapping.configurationitemstatus'='configurationItemStatus',
'mapping.resourcetype'='resourceType',
'mapping.resourceid'='resourceId',
'mapping.resourcename'='resourceName',
'mapping.arn'='ARN',
'mapping.awsregion'='awsRegion',
'mapping.availabilityzone'='availabilityZone',
'mapping.configurationstatemd5hash'='configurationStateMd5Hash',
'mapping.supplementaryconfiguration'='supplementaryConfiguration',
'mapping.configurationstateid'='configurationStateId')
当表准备就绪时,使用 configurationItem.tags['TAGNAME'] 访问标签。例如,要访问 tc:Name 标签,请运行以下查询:
SELECT configurationItem.tags['tc:Name']
FROM your_table
CROSS JOIN unnest(configurationItems) AS t(configurationItem)
WHERE configurationItem.tags['tc:Name'] IS NOT NULL
相关信息
在 Amazon Athena 中通过使用 JSONSerDe 的嵌套 JSON 和映射创建表
为什么我在 Amazon Athena 中尝试读取 JSON 数据时遇到错误?
Hive-JSON-Serde(Github)