Getting this error:

```
Error opening Hive split s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2017-17/subset=crawldiagnostics/part-00189-ac1cf8ef-3644-4b49-ac73-0fe6bef46adf.c000.gz.parquet (offset=0, length=36266581): com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: J88VWPD7BXAFP01E; S3 Extended Request ID: 89ZPRgZ/qx3n0Gs4zFonvfA50JUPjf9ep5vHxhCKHIFXwVr70vgbnLSL9Ctx22GNikrR+p/3gQU=; Proxy: null), S3 Extended Request ID: 89ZPRgZ/qx3n0Gs4zFonvfA50JUPjf9ep5vHxhCKHIFXwVr70vgbnLSL9Ctx22GNikrR+p/3gQU=

This query ran against the "ccindex" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 066d4ff2-89ce-4683-a4c0-71e2f45348cb
```
I know how to merge text files, but I'm not sure how to merge Parquet files.
Some ideas:
If you use your data mostly with Athena or Hive, you could use a CTAS statement to create a new table and use bucketing to limit the number of files per partition. This obviously applies if your table is already partitioned; by filtering on single partitions you can avoid the above error.
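As a rough illustration of that approach, the CTAS below rewrites one partition's data into a small, fixed number of Parquet files. The table name `merged_table`, the output location, and the bucketing column are placeholders to adapt; the `crawl` and `subset` values are taken from the failing query above.

```sql
-- Sketch: compact one partition of the Common Crawl index into a
-- bounded number of files via Athena CTAS with bucketing.
-- merged_table, the external_location, and the bucketing column
-- are illustrative and should be adjusted to your setup.
CREATE TABLE merged_table
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/cc-index-merged/',
  bucketed_by = ARRAY['url_host_name'],
  bucket_count = 10
) AS
SELECT *
FROM ccindex
WHERE crawl = 'CC-MAIN-2017-17'
  AND subset = 'crawldiagnostics';
```

Because `bucket_count` caps the number of output files per partition, subsequent queries against the new table open far fewer S3 objects, which is what keeps the request rate below the SlowDown threshold.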
Alternatively, you can have a look at this KB article https://aws.amazon.com/premiumsupport/knowledge-center/emr-concatenate-parquet-files/ or this external blog post https://medium.com/bigspark/compaction-merge-of-small-parquet-files-bef60847e60b