AWS Glue XML: interpret a single struct as an array
I've been working on this for a week and I'm close to giving up, so I need some help understanding it. I have an S3 bucket full of XML files, and I am writing a PySpark ETL job in AWS Glue to convert them to Parquet so I can query them in Athena.
Within each XML file, there is an XML tag called ORDER_LINE. This tag is supposed to hold an array of items; however, in many files there is only one item. XML has no concept of arrays, so when I pass these files into my ETL job, Glue infers the field as a Choice type in the schema, where it can be either an array or a struct. I need to coerce it into an array type in every case. Here's a list of everything I've tried:
1. Using ResolveChoice to cast to an array. This doesn't work because a struct cannot be cast to an array.
2. Using ResolveChoice with "make_struct", then a Map.apply() step that replaces the field with [struct] whenever the "struct" branch has data. This doesn't work, and the Map docs hint that it does not support the Python `map` function for arrays.
3. Converting the dynamic frame to a data frame, then using PySpark withColumn(when(struct.isNotNull, [struct]).otherwise(array)) to wrap the struct in an array, or to use the array as-is, depending on which one is not null. This doesn't work because Glue infers the schema of the structs, and the fields in the structs end up in a different order. Even though the set of fields is the same, Spark can't combine the results because the schemas are not exactly identical.
4. Converting to a data frame, then using a PySpark UDF to transform the data. This worked on a small dev sample set but failed on the production dataset. The error message was extremely cryptic and I wasn't able to find the cause. Maybe this could work, but I don't fully understand how to operate on the data in PySpark.
5. Passing "withSchema" in format_options when creating the dynamic frame from XML. The intention is to define the schema up front, but running this gives an error:
```
com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `java.util.LinkedHashMap<java.lang.Object,java.lang.Object>` out of VALUE_TRUE token
at [Source: (String)" [...] (through reference chain: com.amazonaws.services.glue.schema.types.StructType["fields"]->java.util.ArrayList[0]->com.amazonaws.services.glue.schema.types.Field["properties"])
```
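For reference, the per-row logic that attempts 3 and 4 are aiming for can be sketched in plain Python, independent of Glue or Spark. This is only an illustration under assumptions: `FIELD_ORDER`, `reorder`, and `coerce_order_lines` are hypothetical names, the field list is a placeholder for the elided real fields, and rows are modelled as plain dicts.

```python
from typing import Any, Dict, List, Optional

# Hypothetical canonical field order; the real field names are elided in the post.
FIELD_ORDER = ["FIELD1", "FIELD2"]


def reorder(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return the record with keys in FIELD_ORDER, filling missing ones with None."""
    return {name: record.get(name) for name in FIELD_ORDER}


def coerce_order_lines(
    struct_value: Optional[Dict[str, Any]],
    array_value: Optional[List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Collapse the two branches of a make_struct choice into a single list."""
    if array_value is not None:
        # Already a list: only normalize the field order of each item.
        return [reorder(item) for item in array_value]
    if struct_value is not None:
        # A single item: normalize it and wrap it in a one-element list.
        return [reorder(struct_value)]
    # Neither branch is populated: treat it as an empty list of lines.
    return []
```

Normalizing the field order in the same step is what sidesteps the mismatch described in attempt 3. If this were wrapped in a Spark UDF, it would presumably need an explicit return type describing the array of structs, and incoming `Row` values would have to be converted (for example with `Row.asDict()`) before being treated as dicts.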
So my question is: how do I make the XML data source for Glue always interpret a tag as an array instead of a Choice, or, failing that, how do I combine the two branches? Even Stack Overflow failed me here, and the forum post https://forums.aws.amazon.com/thread.jspa?messageID=931586&tstart=0 went unanswered.
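To make the desired behaviour concrete, here is a minimal sketch outside of Glue, using only the standard library's `xml.etree.ElementTree`. The two sample documents and the `order_lines` helper are hypothetical; only the ORDER and ORDER_LINE tag names come from my data. It flattens one level of child tags only, so nested structs are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; only the ORDER and ORDER_LINE tags come from the post.
ONE_LINE = "<ORDER><ORDER_LINE><FIELD1>1</FIELD1></ORDER_LINE></ORDER>"
TWO_LINES = (
    "<ORDER>"
    "<ORDER_LINE><FIELD1>1</FIELD1></ORDER_LINE>"
    "<ORDER_LINE><FIELD1>2</FIELD1></ORDER_LINE>"
    "</ORDER>"
)


def order_lines(xml_text):
    """Return every ORDER_LINE child as a list of dicts, even when there is only one."""
    root = ET.fromstring(xml_text)
    # findall always returns a list, so one item and many items look the same.
    return [
        {child.tag: child.text for child in line}
        for line in root.findall("ORDER_LINE")
    ]
```

Here a file with a single ORDER_LINE yields a one-element list rather than a bare struct, which is the shape I want Glue to produce for every file.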
Here's a snippet of my pyspark code:
```
import sys
import json
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.gluetypes import (
    StructType,
    Field,
    StringType,
    IntegerType,
    ArrayType,
)
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(
    sys.argv,
    [
        "JOB_NAME",
        "source_bucket_name",
        "target_bucket_name",
    ],
)

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

source_bucket_name = args["source_bucket_name"]
target_bucket_name = args["target_bucket_name"]

schema = StructType(
    [
        # [fields removed as they are sensitive]
        Field(
            "ORDER_LINE",
            ArrayType(
                StructType(
                    [
                        Field("FIELD1", IntegerType(), True),
                        Field(
                            "FIELD2",
                            StructType([Field("CODE", StringType(), True)]),
                            True,
                        ),
                        Field(
                            "FIELD#",
                            StructType([Field("CODE", StringType(), True)]),
                            True,
                        ),
                        # [fields removed]
                    ]
                )
            ),
            True,
        ),
    ]
)

datasource0 = glueContext.create_dynamic_frame.from_options(
    "s3",
    {"paths": [f"s3://{source_bucket_name}"]},
    format="xml",
    format_options={
        "rowTag": "ORDER",
        "withSchema": json.dumps(schema.jsonValue()),
    },
    transformation_ctx="datasource0",
)

# [more steps after this]
```
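The error in attempt 5 names `Field["properties"]` and a `VALUE_TRUE` token, which suggests that somewhere in the serialized schema a `properties` entry is a boolean where a map is expected. Assuming that reading is right, a small diagnostic like the sketch below could be run on the output of `schema.jsonValue()` to locate such entries. `find_bad_properties` is a hypothetical helper, and the `sample` document is hand-written for illustration; its layout is an assumption, not the confirmed Glue format, which is why the walker does not depend on it.

```python
import json
from typing import Any, Iterator


def find_bad_properties(node: Any, path: str = "$") -> Iterator[str]:
    """Yield the path of every "properties" entry that is not a JSON object."""
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}"
            if key == "properties" and not isinstance(value, dict):
                yield child_path
            else:
                yield from find_bad_properties(value, child_path)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_bad_properties(item, f"{path}[{index}]")


# Hypothetical serialized schema, hand-written for illustration only.
sample = json.loads(
    '{"dataType": "struct", "fields": ['
    '{"name": "FIELD1", "container": {"dataType": "int"}, "properties": true},'
    '{"name": "FIELD2", "container": {"dataType": "string"}, "properties": {}}'
    "]}"
)
bad_paths = list(find_bad_properties(sample))
```

In the job itself this would be called as `list(find_bad_properties(schema.jsonValue()))` before building the dynamic frame. If any paths are reported, one thing worth checking is whether the third positional argument to `Field` (the `True` values in my schema) is being stored as `properties` rather than as a nullability flag. That is a guess from the error text, not something I have confirmed.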