Not able to read S3 Parquet file
Hi Team, I'm trying to read Parquet files from S3, but I get the following error. I'm not sure whether the data inside the Parquet file is corrupt or whether I'm unable to read the file due to a datatype mismatch. Any help would be much appreciated.
df = spark.read.parquet("s3://xxxxxxx/edo_sms_replica_us_stg/event_t/TESTFILES/LOAD00000CAD.parquet")

An error was encountered:
Invalid status code '404' from http://ip-xx.xx.xx..awscorp.siriusxm.com:8998/sessions/168 with error payload: {"msg":"Session '168' not found."}
Hi - You can either use the "Query with S3 Select" option in the S3 console (if the compressed file size is less than 140 MB), or use the s3api CLI (https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html) to validate that the Parquet file is a valid one:
aws s3api select-object-content \
--bucket my-bucket \
--key my-data-file.parquet \
--expression "select * from s3object limit 100" \
--expression-type 'SQL' \
--input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json"
Another option is to use an AWS Glue Crawler to catalog the Parquet file and query it via Athena - https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html
This is the error we get -
An error was encountered: An error occurred while calling o91.parquet. :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7) (ip-xx.xx.xxx.awscorp.siriusxm.com executor 11):
org.apache.spark.sql.AnalysisException: Parquet type not yet supported: INT32 (TIME_MILLIS)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotImplemented$1(ParquetSchemaConverter.scala:104)
Thanks Gokul. But I'm not able to read the Parquet file using S3 Select, either in the console or from the CLI. In the console, S3 Select says "Successfully returned 0 records" (the file size is 40 MB). In the AWS CLI, the output is always the aws command usage text; neither case shows an error. How do I figure out whether the file is invalid? Why is the file not being read?