[BUG Report]: Reading gzip compressed files from Redshift Spectrum after Glue Crawler is run on bucket.


The Glue crawler adds a table property called compressionType, which Redshift Spectrum does not recognize. As a workaround, we have to add the property manually under the key compression_type, keeping the same value (gzip or whichever other compression format applies).
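The workaround described above could be scripted against the Glue Data Catalog roughly as follows. This is only a sketch, not an official fix: the helper names `add_spectrum_compression` and `patch_table` are hypothetical, and the `glue` argument is assumed to be a boto3 Glue client (created with `boto3.client("glue")`) with permission to call `get_table` and `update_table`.

```python
from typing import Any, Dict

GLUE_KEY = "compressionType"       # property written by the crawler
SPECTRUM_KEY = "compression_type"  # property Redshift Spectrum reads

# Subset of the table definition that update_table accepts as TableInput.
# get_table also returns read-only fields (CreateTime, DatabaseName, ...)
# that must not be sent back.
TABLE_INPUT_FIELDS = (
    "Name", "Description", "Owner", "Retention",
    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters",
)


def add_spectrum_compression(parameters: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the table parameters with compression_type set
    to the crawler's compressionType value, if it is not already set."""
    patched = dict(parameters)
    if GLUE_KEY in patched and SPECTRUM_KEY not in patched:
        patched[SPECTRUM_KEY] = patched[GLUE_KEY]
    return patched


def patch_table(glue: Any, database: str, table_name: str) -> Dict[str, Any]:
    """Copy compressionType to compression_type on one catalog table.

    `glue` is assumed to be a boto3 Glue client.
    """
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    table_input = {k: table[k] for k in TABLE_INPUT_FIELDS if k in table}
    table_input["Parameters"] = add_spectrum_compression(
        table.get("Parameters", {})
    )
    glue.update_table(DatabaseName=database, TableInput=table_input)
    return table_input
```

A call might look like `patch_table(boto3.client("glue"), "my_database", "my_table")`, where both names are placeholders for your own catalog database and table.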

Yash
asked 10 months ago, 499 views
1 Answer
Accepted Answer

https://stackoverflow.com/questions/48827394/aws-glue-crawler-reading-a-gzip-file-of-csv
https://docs.aws.amazon.com/redshift/latest/dg/t_loading-gzip-compressed-data-files-from-S3.html
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html

This bug is now well documented, but the links above may still offer some useful insights.

According to the official AWS Glue crawler documentation, the built-in classifiers should handle gzip-compressed CSV files transparently. For loading gzip-compressed data files, the Amazon Redshift documentation suggests including the corresponding compression option (GZIP, LZOP, or BZIP2) in the COPY command. Note that COPY loads data into Redshift tables, so that option does not apply to Spectrum external tables.

There are some best practices to follow when working with Amazon Redshift Spectrum. For storage optimization, it's recommended to use a columnar file format and to use compression so that more records fit into each storage block. Redshift Spectrum supports Gzip, Snappy, LZO, BZ2, and Brotli (only for Parquet). It is also recommended to avoid very large files (greater than 512 MB) for formats and compression codecs that can't be split, such as Avro or Gzip, and to keep file sizes uniform across all partitions to help reduce skew.
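The 512 MB guideline above can be checked against a file listing with a few lines of code. This is a minimal sketch under stated assumptions: the function name `oversized_unsplittable` is hypothetical, files are identified as unsplittable purely by their `.gz` or `.avro` suffix, and the input is a list of (key, size in bytes) pairs that you would obtain yourself, for example from an S3 listing.

```python
from typing import Iterable, List, Tuple

# Threshold taken from the best-practice guidance quoted above.
MAX_UNSPLITTABLE_BYTES = 512 * 1024 * 1024

# Assumption: suffix alone identifies files that cannot be split.
UNSPLITTABLE_SUFFIXES = (".gz", ".avro")


def oversized_unsplittable(objects: Iterable[Tuple[str, int]]) -> List[str]:
    """Return the keys of unsplittable files larger than 512 MB."""
    return [
        key
        for key, size in objects
        if key.lower().endswith(UNSPLITTABLE_SUFFIXES)
        and size > MAX_UNSPLITTABLE_BYTES
    ]
```

Any key it returns is a candidate for being rewritten as several smaller files of similar size.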

AWS provides a quick reference (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html) for identifying and addressing common issues with Amazon Redshift Spectrum queries. Potential issues include large file sizes, slow network throughput, access throttling by Amazon S3 or AWS KMS, exceeded resource limits, and incompatible data formats, among others.

EXPERT
answered 10 months ago
