Loading data from S3 into EMR Studio

I am trying to load data from an S3 bucket into a DataFrame in EMR Studio.

"df = spark.read.csv("s3://HIDDEN-sandbox/HIDDEN/avod_title_content_metrics_w/", header = True)"

When I run "df.show(5)" I am getting an error.

"An error occurred while calling o133.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11) ([2600:1f18:5882:2d00:f38e:2fec:343d:400b] executor 2): org.apache.spark.SparkFileNotFoundException: No such file or directory 's3://HIDDEN-sandbox/HIDDEN/avod_title_content_metrics_w/TESTDATA.CSV' It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved."

asked a year ago · 382 views
1 Answer

The error you're encountering suggests that there might be an issue with accessing the S3 bucket or the specific file you're trying to read. Here are a few things to check and try:

  1. Permissions: Ensure that your EMR cluster has the necessary permissions to access the S3 bucket. The IAM role associated with your EMR cluster should have read access to the specified S3 bucket.

  2. File existence: Verify that the file 'TESTDATA.CSV' actually exists in the specified S3 path; the error message indicates that this specific file cannot be found. A stale file listing is a common cause: the file may have been deleted or renamed after the DataFrame was first defined.

  3. File format: If your files are not actually in CSV format, you might need to use a different reader method, such as 'parquet' or 'json' instead of 'csv'.

  4. Wildcard usage: If you have multiple CSV files in the directory, you might want to use a wildcard to read all of them:

    df = spark.read.csv("s3://prod-fin-sandbox/bothamj/avod_title_content_metrics_w/*.csv", header = True)

  5. S3A connector: Ensure you're using the S3A connector by prefixing your path with 's3a://' instead of 's3://':

    df = spark.read.csv("s3a://prod-fin-sandbox/bothamj/avod_title_content_metrics_w/", header = True)

  6. Refresh: As the error message suggests, if the underlying files have been updated since the DataFrame was defined, run 'REFRESH TABLE tableName' in Spark SQL (if reading through a table) or simply recreate the DataFrame so Spark re-lists the files.

  7. EMR configuration: Make sure your EMR cluster is properly configured to work with S3, including an EMR release with EMRFS set up so the cluster can read from the bucket.
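As a quick way to combine checks 2 and 4, you can list the objects under the prefix with boto3 before pointing Spark at the path. This is only a sketch: the bucket and prefix below are placeholders standing in for the redacted path in the question, and `csv_keys` is a small hypothetical helper, not part of boto3.

```python
def csv_keys(listing):
    """Return the .csv object keys from a boto3 list_objects_v2 response."""
    return [obj["Key"] for obj in listing.get("Contents", [])
            if obj["Key"].lower().endswith(".csv")]

if __name__ == "__main__":
    import boto3  # assumes AWS credentials are available on the cluster

    # Placeholder bucket/prefix -- substitute the redacted values from the question.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-sandbox-bucket",
                              Prefix="my-prefix/avod_title_content_metrics_w/")
    keys = csv_keys(resp)
    if not keys:
        print("No .csv objects found under the prefix -- see checks 2 and 3 above.")
    else:
        for key in keys:
            print(key)
```

Note that list_objects_v2 returns at most 1,000 keys per call; if the prefix holds more objects than that, use a boto3 paginator to walk the full listing.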

If none of these solutions work, you might want to check the EMR logs for more detailed error messages, which could provide additional insights into the problem.
Sources
Troubleshooting Amazon S3 - Amazon Data Firehose
EMR File System (EMRFS) - Amazon EMR
Launch a Spark job in a transient EMR cluster using a Lambda function - AWS Prescriptive Guidance

answered a year ago
