[Bug Report]: Reading gzip-compressed files from Redshift Spectrum after a Glue crawler is run on the bucket


The Glue crawler adds a table property called compressionType, which Redshift Spectrum does not recognize. We have to manually add the same value (gzip or another compression format) under the key compression_type.
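
For anyone hitting the same issue, here is a minimal sketch of that manual step in Python. It assumes boto3 is installed and credentials are configured, and that copying the crawler's compressionType value unchanged into compression_type is all that is needed, as described above. The function names and the TABLE_INPUT_KEYS set are my own illustration, not part of any AWS API; the key set reflects the TableInput fields I am aware of and may need adjusting for your SDK version.

```python
from copy import deepcopy

# Fields that Glue's update_table accepts in TableInput (assumed list).
# get_table also returns read-only fields such as DatabaseName or CreateTime,
# which have to be dropped before sending the table back.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def with_spectrum_compression(parameters):
    """Return a copy of the table parameters with compression_type
    mirrored from the crawler's compressionType, if it is not set yet."""
    updated = dict(parameters)
    crawler_value = updated.get("compressionType")
    if crawler_value and "compression_type" not in updated:
        updated["compression_type"] = crawler_value
    return updated


def build_table_input(table):
    """Turn a get_table result into a TableInput with the extra property."""
    table_input = {k: deepcopy(v) for k, v in table.items() if k in TABLE_INPUT_KEYS}
    table_input["Parameters"] = with_spectrum_compression(table.get("Parameters", {}))
    return table_input


def patch_glue_table(database, name):
    """Read the table from the Glue Data Catalog and write it back patched."""
    import boto3  # imported here so the helpers above work without the AWS SDK

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=name)["Table"]
    glue.update_table(DatabaseName=database, TableInput=build_table_input(table))
```

Calling patch_glue_table("my_database", "my_table") (placeholder names) after the crawler finishes would apply the change to one table. Whether a later crawler run keeps the added property is something I have not verified, so it may need to be reapplied.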

Yash
Asked a year ago, 524 views
1 Answer
Accepted Answer

https://stackoverflow.com/questions/48827394/aws-glue-crawler-reading-a-gzip-file-of-csv
https://docs.aws.amazon.com/redshift/latest/dg/t_loading-gzip-compressed-data-files-from-S3.html
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html

This bug is now well documented, but the links above may still offer some insights.

According to the official AWS Glue crawler documentation, the built-in classifiers should handle gzip-compressed CSV files transparently. For loading compressed data files, the Amazon Redshift documentation suggests including the corresponding compression option (GZIP, LZOP, or BZIP2) in the COPY command.

There are some best practices to follow when working with Amazon Redshift Spectrum. For storage optimization, it's recommended to use a columnar file format and to use compression to fit more records into each storage block. Redshift Spectrum supports Gzip, Snappy, LZO, BZ2, and Brotli (only for Parquet). It's also recommended to avoid very large files (greater than 512 MB) for formats and compression codecs that can't be split, such as Avro or Gzip, and to keep file sizes uniform across all partitions to help reduce skew.
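
As a rough illustration of the file-size advice, the sketch below flags non-splittable objects above the 512 MB threshold. The function name, the suffix list, and the (key, size) input shape are assumptions of mine; in practice the pairs could come from an S3 listing.

```python
# Threshold taken from the best-practices advice above: 512 MB.
MAX_UNSPLITTABLE_BYTES = 512 * 1024 * 1024


def oversized_unsplittable(objects, suffixes=(".gz",), limit=MAX_UNSPLITTABLE_BYTES):
    """Return the keys of objects that use a non-splittable suffix
    and are larger than the limit.

    objects: iterable of (key, size_in_bytes) pairs.
    suffixes: file endings treated as non-splittable (assumed; extend as needed).
    """
    endings = tuple(suffixes)
    return [
        key
        for key, size in objects
        if key.lower().endswith(endings) and size > limit
    ]
```

Any keys it returns are candidates for being split into smaller, similarly sized files before querying them through Spectrum.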

AWS provides a quick reference (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html) for identifying and addressing common issues with Amazon Redshift Spectrum queries. These include large file sizes, slow network throughput, access throttling by Amazon S3 or AWS KMS, exceeded resource limits, and incompatible data formats, among others.

Expert
Answered a year ago
