Duplicate records in Athena


Hi All,

Can anyone tell me the root cause of duplicated data in Athena query results?

For example, if the underlying S3 file has 100 records, Athena returns 200, or some other multiple of 100.

We load the Athena table from a Glue job using an INSERT OVERWRITE query.

Thanks in advance

Asked a year ago · 270 views
2 Answers

Hi,

Some additional information on the S3 structure, the Athena DDL, and the Glue job (in particular, how it implements the INSERT OVERWRITE) would be needed to answer the question correctly.

The behaviour you describe seems to point to additional files or partitions being present in the Athena table location. Athena reads every file under the table and partition locations, so any extra copy of the data comes back as extra rows.
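One way to check for extra files is to list the object keys under the table location and count them per partition folder. The sketch below is only an illustration: the `sales/` prefix, the `load_date` partition column, and the sample keys are hypothetical, and `list_keys` assumes boto3 is installed and AWS credentials are configured.

```python
from collections import defaultdict


def files_per_partition(keys, table_prefix):
    """Group S3 object keys by the partition folder they sit in."""
    groups = defaultdict(list)
    for key in keys:
        if not key.startswith(table_prefix) or key.endswith("/"):
            continue  # outside the table location, or a folder marker
        relative = key[len(table_prefix):]
        folder, _, _filename = relative.rpartition("/")
        groups[folder].append(key)
    return dict(groups)


def list_keys(bucket, prefix):
    """Yield every object key under the prefix (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the grouping helper works without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Hypothetical keys standing in for the output of list_keys(bucket, "sales/")
sample_keys = [
    "sales/load_date=2023-01-01/part-0000.csv",
    "sales/load_date=2023-01-01/part-0000_copy.csv",
    "sales/load_date=2023-01-02/part-0000.csv",
]
for folder, files in files_per_partition(sample_keys, "sales/").items():
    print(folder, len(files))
```

A partition folder with more than one file, or more folders than expected, would be a likely source of the repeated rows.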

AWS Expert
answered a year ago
  • S3 structure: CSV files with a | delimiter.
    DDL: The table is created with TextInputFormat as the input format and HiveIgnoreKeyTextOutputFormat as the output format, with the delimiter set to | in the table properties.
    Glue job: A PySpark script that reads data from one S3 file, converts it to a DataFrame, adds a partition column, and writes the result to another S3 bucket. After the write, the partition is added to the Athena table manually with an ALTER TABLE query. No partition contains more than one file.


It looks like every new partition is being added to the table while the older ones stay registered. If you don't want duplicates, drop the older partitions.
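As a rough sketch of that cleanup, the snippet below builds the `ALTER TABLE ... DROP PARTITION` statements for every registered partition except the ones you want to keep. The table name `sales`, the string-typed partition column `load_date`, and the partition values are all hypothetical; the statements would still have to be run in Athena.

```python
def stale_partitions(registered, keep):
    """Return the registered partition values that are not in the keep list."""
    return sorted(set(registered) - set(keep))


def drop_partition_statements(table, partition_column, stale_values):
    """Build one ALTER TABLE ... DROP PARTITION statement per stale value.

    Assumes a single string-typed partition column.
    """
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({partition_column} = '{value}')"
        for value in stale_values
    ]


# Hypothetical partition values; in practice take them from SHOW PARTITIONS.
registered = ["2023-01-01", "2023-01-02", "2023-01-03"]
latest = ["2023-01-03"]

for statement in drop_partition_statements(
    "sales", "load_date", stale_partitions(registered, latest)
):
    print(statement)
```

Dropping a partition from an external table removes only the metadata entry, so the files stay in S3 and have to be deleted separately if they are no longer needed.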

Tasio
AWS Expert
answered a year ago
