CTAS can't handle empty arrays or maps

0

When using the new CTAS functionality to convert JSON to PARQUET, I kept getting this error:

GENERIC_INTERNAL_ERROR: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead. You may need to manually clean the data at location '{scrubbed}' before retrying. Athena will not delete data in your account.

This query ran against the "sampledb" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: bfe7eeaf-fe0a-4936-8515-8a9c2d1b22d0.

I eventually realized this always happens if any of the JSON is an empty array "[]" or there is a map type that is an empty map "{}". Is this a know limitation? It would be great if this were a little more flexible/robust.

A sample table:

CREATE EXTERNAL TABLE sampledb.test_empty_object (
  `name` string,
  `fields` map <string, string>,
  `friends` array <string>,
  `data`struct <
     name: string,
     value: string
  >
)
ROW FORMAT  serde 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://somewhere'

sample CTAS query

CREATE TABLE sampledb.test_empty_array_parquet
WITH (
     format = 'PARQUET', 
     external_location = 's3://somewhere'
)
AS SELECT *
FROM sampledb.test_empty_array

It handles "data" being {} just fine, but if fields = {} or friends = [], the error above is thrown.

Edited by: JeffreyBarrus on Oct 25, 2018 9:26 AM

질문됨 6년 전1819회 조회
3개 답변
0

I had the same issue and solved that with creating a view and adding some size checks for array columns like this one:

IF(CARDINALITY(my_array_column) = 0, NULL, my_array_column) AS my_array_column

hakan
답변함 5년 전
0

I am having the same problem but unfortunately the above suggestion does not work because it appears to happen with nested arrays as well which i cannot check the cardinality of and replace with null if empty without unnesting everything first (if that would even be possible due to multiple struct & array columns and levels of nesting).

답변함 5년 전
0

Admittedly responding to an old question, but a simple solution is to use ORC as format instead of PARQUET (format='ORC'), which doesn't have an issue with empty values.

ngbusca
답변함 3년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠