- https://stackoverflow.com/questions/48827394/aws-glue-crawler-reading-a-gzip-file-of-csv
- https://docs.aws.amazon.com/redshift/latest/dg/t_loading-gzip-compressed-data-files-from-S3.html
- https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
- https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html
This bug is now well documented, but the following may still offer some insight.
According to the official AWS Glue crawler documentation, the built-in classifiers should be able to handle gzip-compressed CSV files, and the decompression should be transparent. For loading gzip-compressed data files, the Amazon Redshift documentation suggests including the matching compression option (GZIP, LZOP, or BZIP2) in the COPY command.
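To make the COPY suggestion concrete, here is a minimal sketch that only builds the statement text. The helper name `build_copy_statement`, the table name, the bucket, and the role ARN are all hypothetical placeholders, and the sketch assumes the files are CSV; it does not connect to a cluster.

```python
# Compression options named in the paragraph above.
ALLOWED_CODECS = {"GZIP", "LZOP", "BZIP2"}


def build_copy_statement(table, s3_uri, iam_role_arn, compression="GZIP"):
    """Return a Redshift COPY statement for compressed CSV files in S3.

    All arguments are illustrative placeholders; validate or quote them
    properly before using anything like this against a real cluster.
    """
    codec = compression.upper()
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"unsupported compression option: {compression}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CSV\n"
        f"{codec};"
    )


if __name__ == "__main__":
    # Hypothetical table, bucket, and role, for illustration only.
    print(
        build_copy_statement(
            "public.events",
            "s3://example-bucket/events/",
            "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
        )
    )
```

The point is simply that the compression keyword has to match how the files were actually written; a gzip file loaded without `GZIP` will not be decompressed by COPY.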
There are some best practices to follow when working with Amazon Redshift Spectrum. For storage optimization, it's recommended to use a columnar file format and to use compression so that more records fit into each storage block. Redshift Spectrum supports Gzip, Snappy, LZO, BZ2, and Brotli (only for Parquet). For formats and compression codecs that can't be split, such as Avro or Gzip, it's recommended to avoid very large files (greater than 512 MB). Instead, use a uniform file size across all partitions to help reduce skew.
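One way to act on the file-size advice is to split a large CSV into several gzip parts of roughly equal size before uploading. The following is a rough sketch using only the Python standard library; the function name `split_csv_to_gzip_parts` and the `max_bytes` threshold are assumptions for illustration, and the size limit applies to the uncompressed row bytes in each part, not to the compressed output.

```python
import gzip


def _compress_part(header, rows):
    """Gzip one part: the header line followed by the encoded rows."""
    return gzip.compress(header.encode("utf-8") + b"".join(rows))


def split_csv_to_gzip_parts(lines, header, max_bytes):
    """Split CSV rows into gzip parts of roughly uniform size.

    `lines` are data rows (each ending in a newline), `header` is the
    header line repeated at the top of every part, and `max_bytes` caps
    the uncompressed row bytes per part. A single row larger than
    `max_bytes` still gets a part of its own.
    """
    parts = []
    rows, size = [], 0
    for line in lines:
        encoded = line.encode("utf-8")
        # Start a new part when adding this row would exceed the cap.
        if rows and size + len(encoded) > max_bytes:
            parts.append(_compress_part(header, rows))
            rows, size = [], 0
        rows.append(encoded)
        size += len(encoded)
    if rows:
        parts.append(_compress_part(header, rows))
    return parts


if __name__ == "__main__":
    data = [f"{i},{i * i}\n" for i in range(1000)]
    chunks = split_csv_to_gzip_parts(data, "id,value\n", max_bytes=2000)
    print(f"wrote {len(chunks)} gzip parts")
```

Each part repeats the header, so a crawler or external table definition that skips one header line per file would see consistent input across all parts.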
AWS provides a quick reference (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html) for identifying and addressing common issues you might encounter with Amazon Redshift Spectrum queries. Potential issues include large file sizes, slow network throughput, access throttling by Amazon S3 or AWS KMS, exceeded resource limits, and incompatible data formats, among others.