My Amazon Athena query failed with the "HIVE_CANNOT_OPEN_SPLIT" error.
Short description
An Amazon Simple Storage Service (Amazon S3) bucket can support 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix. This limit is combined across all users and services for an AWS account.
By default, Amazon S3 automatically scales to support high request rates. As your request rate increases, your S3 bucket is automatically partitioned to support the higher rate. However, if your requests exceed the current threshold, then you receive HTTP 5xx "Slow Down" errors.
For example, suppose that you run an Athena query on a prefix that contains 10,000 files. Athena uses GET/HEAD requests to try to read all 10,000 files at the same time. However, because the prefix supports only up to 5,500 GET/HEAD requests per second, your S3 requests are throttled and you get an HTTP 5xx "Slow Down" error.
Resolution
Use one or more of the following methods so that your requests aren't throttled.
Distribute S3 objects and requests across multiple prefixes
To distribute objects and requests across multiple prefixes, partition your data. Don't store a large number of files under a single S3 prefix. Instead, distribute the S3 objects across multiple prefixes. For more information, see Partitioning data in Athena.
For example, don't store all the files under s3://my-athena-bucket/my-athena-data-files. Instead, partition the data and store the files under the following individual prefixes:
s3://my-athena-bucket/jan
s3://my-athena-bucket/feb
s3://my-athena-bucket/mar
You can further partition the data under these prefixes to increase the distribution of objects, such as s3://my-athena-bucket/jan/01.
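For example, the following Python (boto3) sketch creates a partitioned Athena table over the monthly prefixes shown in the preceding example and adds one partition. The database, table, and column names are placeholders for illustration:

import time

import boto3

athena = boto3.client("athena")

# Placeholder database, table, and column names. The partition column maps each
# month prefix (s3://my-athena-bucket/jan/, s3://my-athena-bucket/feb/, ...) to
# its own partition, so Athena reads only the prefixes that a query needs.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_athena_db.my_athena_data_files (
    id string,
    amount double
)
PARTITIONED BY (month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-athena-bucket/'
"""

add_partition = """
ALTER TABLE my_athena_db.my_athena_data_files
ADD IF NOT EXISTS PARTITION (month = 'jan') LOCATION 's3://my-athena-bucket/jan/'
"""

for statement in (ddl, add_partition):
    query_id = athena.start_query_execution(
        QueryString=statement,
        ResultConfiguration={"OutputLocation": "s3://my-athena-bucket/query-results/"},
    )["QueryExecutionId"]
    # start_query_execution is asynchronous, so wait for each statement to
    # finish before the next one runs.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"
        ]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

With a layout like this, a query that filters on month = 'jan' sends GET/HEAD requests only to the s3://my-athena-bucket/jan/ prefix.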
Reduce the number of files in each prefix
You might get the "HIVE_CANNOT_OPEN_SPLIT" error when you query an S3 bucket that has a large number of small objects. For example, if there's one 100 MB file in an S3 bucket, then Athena must make one GET request to read the file. However, for 1,000 files that are each 100 KB, Athena must make 1,000 GET requests to read the same 100 MB of data. The larger number of requests can then exceed the S3 request limits.
To reduce the number of Amazon S3 requests, reduce the number of files. For example, use the S3DistCp tool to merge a large number of small files (each less than 128 MB) into a smaller number of large files. For more information, see the Optimize file sizes section in Top 10 performance tuning tips for Amazon Athena.
Example command:
s3-dist-cp --src=s3://my_athena_bucket_source/smallfiles/ --dest=s3://my_athena_bucket_target/largefiles/ --groupBy='.*(.csv)'
Note: In the preceding command, replace my_athena_bucket_source with the source S3 bucket where the small files exist. Also, replace my_athena_bucket_target with the destination S3 bucket where the output is stored.
To optimize query performance and cost, use the groupBy option to aggregate small files into fewer large files.
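S3DistCp runs as a step on an Amazon EMR cluster. The following Python (boto3) sketch submits the preceding command as a step; the cluster ID and bucket names are placeholders:

import boto3

emr = boto3.client("emr")

# Placeholder cluster ID and bucket names. command-runner.jar runs the
# s3-dist-cp command on the cluster's primary node.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "merge-small-csv-files",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src=s3://my_athena_bucket_source/smallfiles/",
                    "--dest=s3://my_athena_bucket_target/largefiles/",
                    "--groupBy=.*(.csv)",
                ],
            },
        }
    ],
)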
Note: S3DistCp doesn't support concatenation for Apache Parquet files. Use PySpark instead. For more information, see How can I concatenate Parquet files in Amazon EMR?
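For Parquet data, the following PySpark sketch shows one way to merge many small files into fewer, larger files. The bucket paths and target file count are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-parquet-files").getOrCreate()

# Placeholder source and destination paths.
df = spark.read.parquet("s3://my_athena_bucket_source/small-parquet-files/")

# coalesce() reduces the number of output files without a full shuffle;
# 10 is an arbitrary target for illustration.
df.coalesce(10).write.mode("overwrite").parquet(
    "s3://my_athena_bucket_target/large-parquet-files/"
)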
Check if versioning is activated for your S3 bucket
When you delete an object from an S3 bucket that uses versioning, Amazon S3 doesn't permanently remove the object. Instead, Amazon S3 inserts a delete marker. If many files in your S3 bucket have delete markers, then you might get the "HIVE_CANNOT_OPEN_SPLIT" error. When you run a query on a bucket that has versioning turned on, Athena must check the different versions of each object. As the query is processed, Athena then decides whether to include a particular object.
To resolve this issue, remove the delete markers from your S3 bucket. You can delete the markers directly, or use an S3 Lifecycle rule that removes expired object delete markers.
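For example, the following Python (boto3) sketch removes the delete markers under one prefix. The bucket and prefix names are placeholders. Note that removing a delete marker makes the previous version of the object current again, so confirm that this is what you want before you run it:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefix names.
bucket = "my-athena-bucket"
prefix = "jan/"

paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    markers = [
        {"Key": marker["Key"], "VersionId": marker["VersionId"]}
        for marker in page.get("DeleteMarkers", [])
    ]
    if markers:
        # Each page returns at most 1,000 versions, and delete_objects
        # accepts up to 1,000 keys, so one call per page is enough.
        s3.delete_objects(Bucket=bucket, Delete={"Objects": markers})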
Check if other applications use the same S3 prefix
Use the Amazon CloudWatch 5xxErrors metric and S3 server access logs to check whether other applications used the same S3 prefix when you ran the Athena query. Multiple applications that try to read data from the same S3 prefix at the same time can result in throttled requests. Don't schedule applications that access the same prefix at the same time. Also, use different S3 prefixes for the Athena data source and the application data source.
Activate S3 request metrics for the prefix, or create a CloudWatch request metrics configuration for all objects in your S3 bucket, to monitor the API call rate for that prefix over time. Use this information together with the S3 server access logs to find the application that made the calls to the prefix.
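The following Python (boto3) sketch activates request metrics for one prefix and then reads the GET request rate from CloudWatch. The bucket name, prefix, and filter ID are placeholders, and the metrics take a short time to become available after you activate them:

import datetime

import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

# Placeholder bucket name, prefix, and metrics filter ID.
bucket = "my-athena-bucket"
filter_id = "jan-prefix-requests"

# Activate S3 request metrics that are filtered on the prefix.
s3.put_bucket_metrics_configuration(
    Bucket=bucket,
    Id=filter_id,
    MetricsConfiguration={"Id": filter_id, "Filter": {"Prefix": "jan/"}},
)

# After the metrics are available, check the GET request rate for the prefix.
end = datetime.datetime.now(datetime.timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="GetRequests",
    Dimensions=[
        {"Name": "BucketName", "Value": bucket},
        {"Name": "FilterId", "Value": filter_id},
    ],
    StartTime=end - datetime.timedelta(hours=1),
    EndTime=end,
    Period=60,
    Statistics=["Sum"],
)
print(response["Datapoints"])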
To further distribute your data within a single partition, you can also bucket your data. For more information, see What is bucketing?
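For example, the following Python (boto3) sketch uses an Athena CREATE TABLE AS SELECT (CTAS) query to bucket data on a high-cardinality column. The database, table, and column names are placeholders:

import boto3

athena = boto3.client("athena")

# Placeholder database, table, and column names. bucketed_by and bucket_count
# spread the rows across a fixed number of files within each partition.
query = """
CREATE TABLE my_athena_db.sales_bucketed
WITH (
    external_location = 's3://my-athena-bucket/sales_bucketed/',
    format = 'PARQUET',
    bucketed_by = ARRAY['customer_id'],
    bucket_count = 10
) AS
SELECT * FROM my_athena_db.sales
"""

athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://my-athena-bucket/query-results/"},
)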
Related information
How do I troubleshoot an HTTP 500 or 503 error from Amazon S3?
Troubleshooting in Athena
Performance tuning in Athena