[BUG Report]: Reading gzip compressed files from Redshift Spectrum after Glue Crawler is run on bucket.

The Glue crawler adds a table property called compressionType, which Redshift Spectrum does not understand. We have to add the property manually under the key compression_type; the value stays the same, i.e. gzip or whichever other compression format applies.
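The manual fix described above can be sketched in Python as a small helper that copies the crawler's value to the key Spectrum expects. This is only an illustration: the function name `add_spectrum_compression_key` is hypothetical, and it assumes the table properties are available as a plain dictionary (for example the `Parameters` map of a Glue table). It does not call AWS itself.

```python
from typing import Dict

# Key names as given in the report above.
CRAWLER_KEY = "compressionType"    # written by the Glue crawler
SPECTRUM_KEY = "compression_type"  # the key Redshift Spectrum reads


def add_spectrum_compression_key(parameters: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the table properties with compression_type filled in.

    Hypothetical helper: the value is copied unchanged from compressionType,
    and an existing compression_type entry is never overwritten.
    """
    updated = dict(parameters)  # leave the caller's dictionary untouched
    value = updated.get(CRAWLER_KEY)
    if value is not None and SPECTRUM_KEY not in updated:
        updated[SPECTRUM_KEY] = value
    return updated


if __name__ == "__main__":
    crawled = {"classification": "csv", "compressionType": "gzip"}
    print(add_spectrum_compression_key(crawled))
```

The updated dictionary would then have to be written back to the Data Catalog, for instance through the Glue UpdateTable API; that step is left out here because it depends on your account setup.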

Yash
asked a year ago · 523 views
1 answer
Accepted answer

https://stackoverflow.com/questions/48827394/aws-glue-crawler-reading-a-gzip-file-of-csv
https://docs.aws.amazon.com/redshift/latest/dg/t_loading-gzip-compressed-data-files-from-S3.html
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html

This bug is now well documented, but the points below may offer some additional insight.

According to the official AWS Glue crawler documentation, the built-in classifiers should handle gzip-compressed CSV files transparently. For loading gzip-compressed data files, the Amazon Redshift documentation suggests including the corresponding compression option (GZIP, LZOP, or BZIP2) in the COPY command.

There are some best practices to follow when working with Amazon Redshift Spectrum. For storage optimization, use a columnar file format and compression so that more records fit into each storage block. Redshift Spectrum supports Gzip, Snappy, LZO, BZ2, and Brotli (only for Parquet). Avoid very large files (greater than 512 MB) for formats and compression codecs that can't be split, such as Avro or Gzip. Instead, keep file sizes uniform across all partitions to help reduce skew.
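As a rough illustration of the 512 MB guideline, the sketch below flags files that are both non-splittable and over the threshold. The function name `oversized_unsplittable` is hypothetical, the input is assumed to be a list of (object key, size in bytes) pairs you have already collected from your bucket listing, and recognising Gzip and Avro by the `.gz` and `.avro` suffixes is an assumption about how your files are named.

```python
from typing import Iterable, List, Tuple

# Threshold from the best-practice guidance quoted above.
MAX_UNSPLITTABLE_BYTES = 512 * 1024 * 1024

# Assumption: non-splittable files are identifiable by these suffixes.
UNSPLITTABLE_SUFFIXES = (".gz", ".avro")


def oversized_unsplittable(files: Iterable[Tuple[str, int]]) -> List[str]:
    """Return the keys of non-splittable files larger than 512 MB."""
    return [
        key
        for key, size in files
        if key.lower().endswith(UNSPLITTABLE_SUFFIXES)
        and size > MAX_UNSPLITTABLE_BYTES
    ]


if __name__ == "__main__":
    listing = [
        ("logs/part-0001.csv.gz", 700 * 1024 * 1024),
        ("logs/part-0002.csv.gz", 100 * 1024 * 1024),
        ("logs/part-0003.parquet", 900 * 1024 * 1024),
    ]
    print(oversized_unsplittable(listing))
```

Files reported by such a check are candidates for being rewritten as several smaller files of similar size.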

AWS provides a quick reference (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html) for identifying and addressing common issues with Amazon Redshift Spectrum queries. These include large file sizes, slow network throughput, access throttling by Amazon S3 or AWS KMS, exceeded resource limits, and incompatible data formats, among others.

EXPERT
answered a year ago