- Newest
- Most votes
- Most comments
The issue you're experiencing is related to how AWS Glue processes JSON files containing arrays. By default, Glue may only read the first entry in each JSON array, which explains why you're only seeing the first record from each file.
To resolve this and process all records in your JSON arrays, you can try the following approaches:
-
Use a JSON SerDe: When creating your table, specify a JSON SerDe (Serializer/Deserializer) that can handle JSON arrays. The OpenX JSON SerDe or the Hive JSON SerDe can be used for this purpose. You'll need to set this in your table properties.
-
Use AWS Glue ETL jobs: Instead of relying solely on the Crawler, you can create an AWS Glue ETL job to read and process your JSON files. In the job, you can use the
format_options
parameter with themultiLine
option set to "true" when reading the JSON data. This allows Glue to properly handle multi-line JSON records. -
Modify your JSON structure: If possible, consider modifying your JSON structure to a newline-delimited JSON format, where each JSON object is on a separate line. This format is often easier for Glue and other big data tools to process.
-
Use AWS Glue DataBrew: AWS Glue DataBrew can be an alternative for preparing your JSON data. It has built-in transformations that can help flatten nested JSON structures.
-
Custom ETL script: Write a custom ETL script (in PySpark or Scala) that explicitly handles the JSON array structure and flattens it into individual records.
If you decide to use an AWS Glue ETL job, you can use the JSON format options to properly read your data. For example:
datasource0 = glueContext.create_dynamic_frame.from_options( format_options = {"multiLine": "true"}, connection_type = "s3", format = "json", connection_options = {"paths": ["s3://your-bucket/your-path"], "recurse":True}, transformation_ctx = "datasource0" )
This approach should allow you to process all records in your JSON arrays, not just the first entry in each file.
Sources
Using the JSON format in AWS Glue - AWS Glue
Parsing a string column containing JSON data - AWS Glue
Relevant content
- asked 2 years ago
- asked 9 months ago
- Accepted Answerasked a year ago
- AWS OFFICIALUpdated 3 years ago
- AWS OFFICIALUpdated 4 days ago
- AWS OFFICIALUpdated 3 years ago
- AWS OFFICIALUpdated 7 months ago
Option 2 - running a Glue job read the source data correctly, and "jsonPath": "$[*]" allowed me to extract array into columns in a new table in the same operation