By using AWS re:Post, you agree to the Terms of Use
/Not able to read S3 Parquet file/

Not able to read S3 Parquet file

0

Hi Team, I'm trying to read Parquet files in S3, but I get the following error. Please help. I'm not sure if the data inside the parquet file is corrupt or I'm unable to read the file due to datatype mismatch. Any help would be much appreciated.

df = spark.read.parquet("s3://xxxxxxx/edo_sms_replica_us_stg/event_t/TESTFILES/LOAD00000CAD.parquet") An error was encountered: Invalid status code '404' from http://ip-xx.xx.xx..awscorp.siriusxm.com:8998/sessions/168 with error payload: {"msg":"Session '168' not found."}

asked a month ago7 views
1 Answers
0

Hi - You can either use "Query with S3 Select" option from the S3 console if the compressed file size is less than 140 MB Or use the s3api (https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html) CLI to validate if the parquet file is a valid one.

aws s3api select-object-content \
    --bucket my-bucket \
    --key my-data-file.parquet \
    --expression "select * from s3object limit 100" \
    --expression-type 'SQL' \
    --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json"

Another option is to use AWS Glue Crawler to load the parquet file and query via Athena - https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html

EXPERT
answered a month ago
  • Thanks Gokul. But, I'm not able to read the parquet file using S3 select in the coonsole or form API. In S3 select - it says "Successfully returned 0 records" (the file size is 40MB). In AWS CLI, the output is always "aws command usage option", no output or error. No error is displayed in both cases. How do I figure out if the file is invalid? Why is the file not being read? .

  • This is the error we get -

    An error was encountered: An error occurred while calling o91.parquet. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7) (ip-xx.xx.xxx.awscorp.siriusxm.com executor 11): org.apache.spark.sql.AnalysisException: Parquet type not yet supported: INT32 (TIME_MILLIS) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotImplemented$1(ParquetSchemaConverter.scala:104)

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions