You're right, the handling of malformed/corrupt records in AWS Glue DynamicFrames is not as transparent or easy to access as in Spark DataFrames.
Since DynamicFrames are built on top of Spark DataFrames, the corrupt records are still being captured somewhere, but Glue does not expose an easy way to access them directly.
Here are some options to deal with corrupt records in Glue:
- Use a DynamicFrame filter transformation to filter out the corrupt records into a separate DynamicFrame. You can check for null values or empty strings in required columns to find corrupt records.
- Convert the DynamicFrame to a Spark DataFrame using `toDF()`, then access the `_corrupt_record` column directly.
- Handle exceptions from transformations using `errorsAsDynamicFrame()`, as you mentioned, but convert the error DynamicFrame to a DataFrame to get the corrupt record details.
- Write a custom Glue transform that accesses the underlying Spark DataFrame directly using `getDataFrame()` and extracts the corrupt records.
- As a last resort, read the Glue job logs to try to find details of corrupt records. But this is messy.
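For the first option, the check itself can be an ordinary Python predicate. The sketch below is an illustration under assumptions, not Glue's own mechanism: the field names in `REQUIRED_FIELDS` and the function `looks_corrupt` are hypothetical, and it assumes each record supports dict-style access by field name.

```python
from typing import Any, Iterable, Mapping

# Hypothetical required columns; replace with the fields your data must have.
REQUIRED_FIELDS = ("id", "event_time")


def looks_corrupt(record: Mapping[str, Any],
                  required: Iterable[str] = REQUIRED_FIELDS) -> bool:
    """Return True if any required field is missing, null, or a blank string."""
    for name in required:
        try:
            value = record[name]
        except KeyError:
            return True  # field absent from this record
        if value is None:
            return True
        if isinstance(value, str) and not value.strip():
            return True
    return False


# Plain dicts stand in for Glue records here.
sample = [
    {"id": 1, "event_time": "2024-01-01T00:00:00Z"},
    {"id": None, "event_time": "2024-01-01T00:00:00Z"},
    {"id": 3, "event_time": "  "},
    {"event_time": "2024-01-02T00:00:00Z"},
]
flags = [looks_corrupt(r) for r in sample]
```

In a Glue job the same function could be passed to the Filter transform, for example `Filter.apply(frame=dyf, f=looks_corrupt)` for the rejects and `Filter.apply(frame=dyf, f=lambda r: not looks_corrupt(r))` for the clean rows. It only catches records that made it into the DynamicFrame with missing or empty values.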
The best option is generally to filter out corrupt records as a separate DynamicFrame, then write that out to S3 or a reject folder for further processing/debugging.
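A rough sketch of the split-and-write-out idea, assuming the job can read the source with Spark's own JSON reader rather than through a DynamicFrame. The helper names `permissive_json_options` and `read_and_split` are hypothetical; the `mode` and `columnNameOfCorruptRecord` options are standard Spark reader options. The PySpark import is done inside the function so the helpers can be defined outside a Spark runtime.

```python
from typing import Dict

CORRUPT_COL = "_corrupt_record"  # Spark's default name for the raw-text column


def permissive_json_options(corrupt_col: str = CORRUPT_COL) -> Dict[str, str]:
    """Reader options that keep malformed rows instead of dropping them."""
    return {"mode": "PERMISSIVE", "columnNameOfCorruptRecord": corrupt_col}


def read_and_split(spark, path: str, schema):
    """Read JSON with plain Spark and return (good, rejects) DataFrames.

    Assumption: `schema` is a StructType that includes a nullable string
    field named CORRUPT_COL, so Spark has somewhere to put the raw text.
    """
    from pyspark.sql.functions import col  # lazy import: needs a Spark runtime

    df = spark.read.options(**permissive_json_options()).schema(schema).json(path)
    # Spark rejects queries that reference only the corrupt-record column
    # on an uncached file scan, so cache before filtering.
    df = df.cache()
    good = df.filter(col(CORRUPT_COL).isNull()).drop(CORRUPT_COL)
    rejects = df.filter(col(CORRUPT_COL).isNotNull()).select(CORRUPT_COL)
    return good, rejects
```

From there, `rejects.write.text(...)` can write the raw lines to a reject folder, and `DynamicFrame.fromDF(good, glue_context, "good")` brings the clean rows back into Glue.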
It's an area that could be improved in Glue. But with some work
Regarding filtering out corrupt records in a DynamicFrame: the problem is that the DynamicFrame already seems to drop the corrupt records when it is created. The corrupt records are nowhere to be found, and the only residue is errorsAsDynamicFrame(), a separate nested frame that is of little use for pinpointing a corrupt record, especially when there are many of them. The DynamicFrame record inside errorsAsDynamicFrame() is not the raw record.
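If the raw record is what matters, one workaround is to classify the lines yourself so the original text is never lost. A minimal sketch in plain Python, assuming newline-delimited JSON input; `split_json_lines` and the reject layout (`line`, `raw`, `error`) are made up for illustration and are not part of Glue.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def split_json_lines(
    lines: Iterable[str],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate parseable JSON objects from rejects, keeping each reject's raw text."""
    good: List[Dict[str, Any]] = []
    rejects: List[Dict[str, Any]] = []
    for line_no, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text.strip():
            continue  # skip blank lines rather than reporting them
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            rejects.append({"line": line_no, "raw": text, "error": exc.msg})
            continue
        if isinstance(record, dict):
            good.append(record)
        else:
            rejects.append({"line": line_no, "raw": text, "error": "not a JSON object"})
    return good, rejects


sample = ['{"id": 1}', '{"id": 2', '', '[1, 2]', '{"id": 3}']
good, rejects = split_json_lines(sample)
```

Each reject keeps its line number and untouched text, so it can be written to a reject location and traced back to the source file. In a Spark job the raw lines could come from `spark.read.text(...)`, which returns one string column named `value` per line.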