Duplicate records in Athena

Hi All,

Can anyone tell me the root cause of duplicated data in Athena query results?

For example, if the underlying S3 file has 100 records, Athena shows 200, or some other multiple of 100.

We load the Athena table using a Glue job with an INSERT OVERWRITE query.

Thanks in advance

Topics

Analytics

Tags

Amazon Athena, Analytics, AWS Glue

Language

English

rePost-User-5479930

asked a year ago, 270 views

2 Answers


Hi,

Some additional information on the S3 structure, the Athena DDL, and the Glue job, including how it implements the INSERT OVERWRITE, would be needed to answer the question correctly.

The behaviour you describe suggests that additional files or partitions are present in the Athena table location.
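
One way to check for extra files is to list the object keys under the table location and group them by partition directory. The sketch below is only an illustration: the key names, the `load_date` partition column, and the `warehouse/orders/` prefix are hypothetical, and the list of keys is assumed to have been collected separately from an S3 object listing.

```python
from collections import defaultdict


def files_per_partition(keys, table_prefix):
    """Group data-file keys by the partition directory they live in."""
    groups = defaultdict(list)
    for key in keys:
        # Skip objects outside the table location and directory placeholders.
        if not key.startswith(table_prefix) or key.endswith("/"):
            continue
        relative = key[len(table_prefix):]
        directory, _, filename = relative.rpartition("/")
        groups[directory].append(filename)
    return dict(groups)


def suspicious_partitions(keys, table_prefix):
    """Return partition directories that hold more than one file."""
    grouped = files_per_partition(keys, table_prefix)
    return {d: sorted(f) for d, f in grouped.items() if len(f) > 1}


# Hypothetical listing; real keys would come from the table's S3 location.
example_keys = [
    "warehouse/orders/load_date=2023-01-01/part-0000.csv",
    "warehouse/orders/load_date=2023-01-02/part-0000.csv",
    "warehouse/orders/load_date=2023-01-02/part-0000_copy.csv",
]
print(suspicious_partitions(example_keys, "warehouse/orders/"))
```

Any directory reported here contains more files than a single-file-per-partition load would produce, which would be worth inspecting before looking at the partition metadata itself.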

EXPERT

Fabrizio@AWS

answered a year ago

rePost-User-5479930
a year ago
S3 structure: CSV file with | as the delimiter.
DDL: The table is created with TextInputFormat as the input format and HiveIgnoreKeyTextOutputFormat as the output format, with a table property setting the delimiter to |.
Glue job: A PySpark script that reads data from one S3 file, converts it into a DataFrame, adds a partition column, and stores the result in another S3 bucket. After the data is stored, the partition is added manually to the Athena table with an ALTER TABLE query. No partition contains more than one file.

It looks like every new partition is being added to the table, so each load stays queryable alongside the previous ones. If you don't want duplicates, you should drop the older partitions.
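
As a rough sketch of that cleanup, the helper below builds one `ALTER TABLE ... DROP IF EXISTS PARTITION` statement for each partition value that should no longer be queried. The table name `orders`, the partition column `load_date`, and the values are hypothetical, since the thread does not name them; the statements are only generated as strings and would still need to be run in Athena.

```python
def drop_partition_statements(table, partition_column, values, keep):
    """Build a DROP PARTITION statement for every value not listed in keep."""
    statements = []
    for value in sorted(values):
        if value in keep:
            continue
        # Assumption: partition values are plain strings without quotes.
        if "'" in value:
            raise ValueError(f"unsupported partition value: {value!r}")
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS "
            f"PARTITION ({partition_column} = '{value}')"
        )
    return statements


# Hypothetical partition values; keep only the most recent load.
existing = ["2023-01-01", "2023-01-02", "2023-01-03"]
for statement in drop_partition_statements(
    "orders", "load_date", existing, keep={"2023-01-03"}
):
    print(statement)
```

For an external table, dropping a partition removes only the metadata entry, so the files in S3 remain and have to be deleted separately if they are no longer needed.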

EXPERT

Tasio

answered a year ago
