There are four reasons why a crawler would create separate tables:
- The source files might not all be of the same format (CSV, Parquet, or JSON). Check whether the source files in your folder are all of the same type.
- The source files might use different compression types (Snappy, gzip, bzip2). Make sure the compression type is the same for all of your source files.
- The source files might not have the same schema. As said in the earlier answer, for the crawler to detect a single schema for both folders, the 70% threshold must be met. That is, the schemas of the two sources must be at least 70% similar.
- The structure of the Amazon S3 partitions is different for the two sets of files. I do not think this is the issue here, because the structure of your S3 partitions is almost the same, but please do check this.
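The first two checks can be roughed out from a listing of object keys. This is a minimal sketch that guesses the format and compression from file-name suffixes only. The crawler itself inspects file contents, so treat it as a quick sanity check. The function names and the example keys are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Suffixes recognised by this sketch (an assumption, not an exhaustive list).
FORMATS = {"csv", "json", "parquet"}
CODECS = {"snappy", "gzip", "gz", "bzip2", "bz2"}


def classify_key(key):
    """Guess (format, codec) from the suffixes of an S3 object key."""
    suffixes = [s.lstrip(".").lower() for s in PurePosixPath(key).suffixes]
    fmt = next((s for s in suffixes if s in FORMATS), "unknown")
    codec = next((s for s in suffixes if s in CODECS), "none")
    return fmt, codec


def group_keys(keys):
    """Group keys by their guessed (format, codec) pair."""
    groups = defaultdict(list)
    for key in keys:
        groups[classify_key(key)].append(key)
    return dict(groups)


# Hypothetical keys, loosely modelled on the layout discussed in this thread.
example_keys = [
    "b/c/products/version_0/part-00000-example.snappy.parquet",
    "b/c/products/version_1/part-00000-example.snappy.parquet",
    "b/c/products/version_0/_temporary/0_$folder$",
]

for (fmt, codec), members in group_keys(example_keys).items():
    print(fmt, codec, len(members))
```

If more than one group shows up, or some keys land in the "unknown" group, those objects are worth a closer look before rerunning the crawler.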
You can find the exact reason your crawler is creating multiple tables by checking the crawler logs. Log in to the console -> select your crawler -> choose Logs to view the logs of your crawler.
For more details please follow this article.
Hi Chaitu, Thank you for your answer.
I have checked everything and all seems good...
My S3 structure is as follows:
s3://a/b/c/products
- /version_0
- _temporary
- 0_$folder$
- part-00000-c5... ...c000.snappy.parquet
- _temporary
- /version_1
- _temporary
- 0_$folder$
- part-00000-29... ...c000.snappy.parquet
- _temporary
What is making the crawler create multiple tables in this case? The schemas are exactly the same. I tried Table Levels 4, 5, and 6. Nothing worked...
For the folders to be treated as one schema, their schemas need to be similar enough to meet a threshold. See the examples in the reference link below, which talk about a partition threshold higher than 70%. In your case, I am assuming there are only 2 schemas for each version that you have.
The crawler infers the schema at folder level and compares the schemas across all folders. If the schemas that are compared match, that is, if the partition threshold is higher than 70%, then the schemas are denoted as partitions of a table. If they don’t match, then the crawler creates a table for each folder, resulting in a higher number of tables.
Reference: https://aws.amazon.com/premiumsupport/knowledge-center/glue-crawler-detect-schema/
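To make the 70% idea concrete, here is a toy sketch that compares two folder schemas using a simple Jaccard overlap of (column name, type) pairs. This is only an illustration of a threshold comparison. It is not the crawler's actual similarity algorithm, which this thread does not describe. The function names and the example columns are hypothetical.

```python
THRESHOLD = 0.70  # the 70% figure quoted in the answers above


def schema_similarity(schema_a, schema_b):
    """Jaccard overlap of (column name, type) pairs; an assumed toy measure."""
    cols_a = set(schema_a.items())
    cols_b = set(schema_b.items())
    if not cols_a and not cols_b:
        return 1.0  # two empty schemas are treated as identical
    return len(cols_a & cols_b) / len(cols_a | cols_b)


def would_share_table(schema_a, schema_b, threshold=THRESHOLD):
    """True if the toy similarity meets the threshold (inclusive, by assumption)."""
    return schema_similarity(schema_a, schema_b) >= threshold


# Hypothetical schemas for three folders.
version_0 = {"id": "bigint", "name": "string", "price": "double"}
version_1 = {"id": "bigint", "name": "string", "price": "double"}
version_2 = {"id": "string", "title": "string", "price": "double"}

print(would_share_table(version_0, version_1))  # identical schemas
print(would_share_table(version_0, version_2))  # only one column in common
```

Under this toy measure, identical schemas score 1.0 and would be grouped, while the third schema shares only one of five distinct columns with the first and would not.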
Hi, do the 2 versions of the table have a similar schema or are there many differences?