[Bug Report]: Reading gzip-compressed files from Redshift Spectrum after a Glue crawler has run on the bucket


The Glue crawler adds a table property called compressionType, which Redshift Spectrum does not recognize. As a workaround, we have to manually add a property with the key compression_type and the same value (gzip or another compression format).
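The manual workaround described above could be scripted roughly as follows. This is only a sketch, assuming boto3 is available and the table lives in the Glue Data Catalog; the helper names `with_spectrum_compression` and `patch_table`, and the exact set of keys kept in `TABLE_INPUT_KEYS`, are illustrative assumptions rather than anything taken from AWS documentation for this issue.

```python
# Keys assumed to be accepted by Glue's UpdateTable in TableInput. Read-only
# fields returned by GetTable (CreateTime, DatabaseName, ...) are dropped.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "ViewOriginalText", "ViewExpandedText",
    "TableType", "Parameters",
}


def with_spectrum_compression(table: dict) -> dict:
    """Build a TableInput dict where compression_type mirrors compressionType."""
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    params = dict(table_input.get("Parameters", {}))  # copy, do not mutate input
    if "compressionType" in params and "compression_type" not in params:
        params["compression_type"] = params["compressionType"]
    table_input["Parameters"] = params
    return table_input


def patch_table(database: str, name: str) -> None:
    """Fetch a crawler-created table and write back the extra property."""
    import boto3  # imported lazily so the helper above works without AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=with_spectrum_compression(table),
    )
```

Calling `patch_table("my_database", "my_table")` (both names hypothetical) would apply the change to one table. The original compressionType property is left in place.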

Yash
Asked 1 year ago, 524 views
1 Answer
Accepted Answer

https://stackoverflow.com/questions/48827394/aws-glue-crawler-reading-a-gzip-file-of-csv
https://docs.aws.amazon.com/redshift/latest/dg/t_loading-gzip-compressed-data-files-from-S3.html
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html

This bug is now well documented, but the links above may offer some additional insights.

According to the official AWS Glue crawler documentation, the built-in classifiers should handle gzip-compressed CSV files transparently. For loading compressed data files, the Amazon Redshift documentation suggests including the corresponding compression option (GZIP, LZOP, or BZIP2) in the COPY command.

There are also some best practices to follow when working with Amazon Redshift Spectrum. For storage optimization, it is recommended to use a columnar file format and to compress the data so that more records fit into each storage block. Redshift Spectrum supports Gzip, Snappy, LZO, BZ2, and Brotli (Parquet only). For formats and compression codecs that can't be split, such as Avro or Gzip, avoid very large files (greater than 512 MB). Instead, keep file sizes uniform across all partitions to help reduce skew.
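As a rough illustration of the file-size advice above, the following sketch flags objects that are both non-splittable and larger than 512 MB. Detecting the format by file extension, and the names `oversized_unsplittable`, `UNSPLITTABLE_SUFFIXES`, and `MAX_UNSPLITTABLE_BYTES`, are assumptions made for this example.

```python
MAX_UNSPLITTABLE_BYTES = 512 * 1024 * 1024  # the 512 MB guideline from the text

# Assumption: the format is recognizable from the object key's extension.
UNSPLITTABLE_SUFFIXES = (".gz", ".avro")


def oversized_unsplittable(objects):
    """Return the keys of non-splittable objects above the size guideline.

    `objects` is an iterable of (key, size_in_bytes) pairs.
    """
    return [
        key
        for key, size in objects
        if key.lower().endswith(UNSPLITTABLE_SUFFIXES)
        and size > MAX_UNSPLITTABLE_BYTES
    ]
```

The (key, size) pairs could come from an S3 listing, for example the `Key` and `Size` fields that boto3's `list_objects_v2` returns for each object.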

AWS provides a quick reference (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html) for identifying and addressing common issues with Amazon Redshift Spectrum queries. Potential issues include large file sizes, slow network throughput, access throttling by Amazon S3 or AWS KMS, exceeded resource limits, and incompatible data formats, among others.

Expert
Answered 1 year ago
