Glue PySpark - cannot read DynamicFrame from S3 with provided schema

When creating a DynamicFrame from JSON directly (and not from the Glue data catalog), you can have Glue infer the schema, or you can provide one.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-json-home.html#aws-glue-programming-etl-format-simd-json-reader

The page above explains that it can be done, and links to the XML page for a syntax example:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-xml-home.html#aws-glue-programming-etl-format-xml-withschema

The syntax example provided is:

from awsglue.gluetypes import *

schema = StructType([ 
  Field("id", IntegerType()),
  Field("name", StringType()),
  Field("nested", StructType([
    Field("x", IntegerType()),
    Field("y", StringType()),
    Field("z", ChoiceType([IntegerType(), StringType()]))
  ]))
])

datasource0 = create_dynamic_frame_from_options(
    connection_type, 
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml", 
    format_options={"withSchema": json.dumps(schema.jsonValue())},
    transformation_ctx = ""
)

I attempted to replicate this and it consistently fails with a JSON serialization error.

Here is my sample data:

{
	"id": "74J77",
	"name": "foo",
	"attributes": [
		{
			"attributeId": "bar",
			"value": "baz"
		}
	]
}

And here is my Glue script. I'm running it in an interactive notebook with Glue version 4.0:

from awsglue.gluetypes import *
import json
from awsglue import DynamicFrame

schema = StructType([
    Field("id", StringType()),
    Field("name", StringType()),
    Field("attributes", ArrayType(MapType(StringType(), StringType()))),
])

df = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://tedmo-debug-bucket/jsontest.json"]},
    format="json",
    format_options= {"withSchema": json.dumps(schema.jsonValue()), "optimizePerformance": True}
)

Executing this leads to the following error: TypeError: Object of type StringType is not JSON serializable

Expecting that maybe there was just a typo in the docs, I tried a few other variations as well to no avail:

format_options= {"withSchema": schema.jsonValue(), "optimizePerformance": True}

format_options= {"withSchema": schema, "optimizePerformance": True}

format_options= {"withSchema": json.dumps(schema.jsonValue())}

The problem seems to be that schema.jsonValue() references StringType in the properties of the map. If I don't use MapType, it more or less seems to work.

Is this a bug in Glue, or am I doing something wrong?

EDIT: For future readers: I solved the syntax issue with Gonzalo's help. Notably, Glue and Spark both have a MapType(), but they behave slightly differently: Glue's takes only the value type, while Spark's takes both a key type and a value type.

However, there was a broader problem: my data was fundamentally ambiguous in type, and DynamicFrames fundamentally don't have rigid schemas. So I abandoned DynamicFrame entirely and did all my work in Spark DataFrames/RDDs, which do allow you to enforce a defined schema.

Ted
asked 5 months ago, 217 views
1 Answer
Accepted Answer

That kind of Map always has string keys, so it expects only the value type. By passing two arguments, you are creating a property that cannot be serialized.

Field("attributes", ArrayType(MapType(StringType()))),
AWS EXPERT
answered 5 months ago
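
The mechanism described in the answer can be illustrated without Glue at all. The sketch below is a minimal reproduction, not the awsglue source: `StringType` and `MapType` here are hypothetical stand-ins, and the assumption (based on the answer above) is that the second positional argument of Glue's MapType is stored as a properties object that is later handed to json.dumps unchanged.

```python
import json


class StringType:
    """Hypothetical stand-in for awsglue.gluetypes.StringType."""

    def jsonValue(self):
        return {"dataType": "string"}


class MapType:
    """Hypothetical stand-in: value type first, then a properties dict (assumed)."""

    def __init__(self, valueType, properties=None):
        self.valueType = valueType
        self.properties = properties if properties is not None else {}

    def jsonValue(self):
        # The properties object is emitted as-is, so anything stored in it
        # must already be JSON serializable.
        return {
            "dataType": "map",
            "valueType": self.valueType.jsonValue(),
            "properties": self.properties,
        }


def try_dump(map_type):
    """Return the JSON string, or the TypeError message if serialization fails."""
    try:
        return json.dumps(map_type.jsonValue())
    except TypeError as exc:
        return f"TypeError: {exc}"


# Two positional arguments: the second one lands in `properties`.
broken = try_dump(MapType(StringType(), StringType()))
# One argument: only the value type is given; keys are implicitly strings.
fixed = try_dump(MapType(StringType()))
print(broken)
print(fixed)
```

With the two-argument form, json.dumps meets a StringType instance it cannot encode, which matches the error reported in the question; the one-argument form serializes cleanly.
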
  • Thank you. That solved the immediate problem, but it raises a couple of follow-ups, because this isn't working how I expected.

    The reason I wanted to specify a schema is that I was witnessing some inconsistency in how Glue inferred a schema. I am reading a structure similar to the attributes block in the JSON above - an array of dicts with predictable keys. What I am seeing is, sometimes, Glue will infer that as an array of structs with fields for each key. Other times, it will interpret it as an array of maps. This instability is causing problems and I need to stabilize the interpretation of the schema.

    I expected that specifying the schema in the format_options would do this, but it doesn't. The sample data is always read as a struct.

    So,

    1. What exactly does providing a schema do?
    2. What determines how Glue interprets a JSON list of dicts?
  • Remember you are using a "DynamicFrame"; that schema is meant more as a read optimization (notice the parameter). For strong schema enforcement, use a DataFrame.

  • Alright, that makes sense - there's no concept of enforcing a rigid schema in a DynamicFrame. So what about question 2? What could cause my data, which is pretty consistent in format, to be inconsistently interpreted as a map or a struct? Conceptually both fit but clearly Glue has some heuristic to make the distinction.

  • I don't know. I would guess it uses a map, because that's a much safer guess.
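
The suggestion in the comments to use a DataFrame for strict schemas can be sketched roughly as follows. This is an illustration under assumptions, not the code the asker ended up with: the helper names `attributes_schema_json` and `read_with_schema` are made up here, and the choice to model each element of `attributes` as a struct (rather than a map) is only one of the two interpretations discussed above. The schema is written in the dict form that Spark's StructType.fromJson accepts.

```python
def attributes_schema_json():
    """Schema for the sample record, as a dict for Spark's StructType.fromJson."""

    def field(name, type_):
        return {"name": name, "type": type_, "nullable": True, "metadata": {}}

    # Assumption: every element of "attributes" has exactly these two keys.
    attribute = {
        "type": "struct",
        "fields": [field("attributeId", "string"), field("value", "string")],
    }
    return {
        "type": "struct",
        "fields": [
            field("id", "string"),
            field("name", "string"),
            field(
                "attributes",
                {"type": "array", "elementType": attribute, "containsNull": True},
            ),
        ],
    }


def read_with_schema(spark, path):
    """Read JSON with a fixed schema instead of letting Spark infer one."""
    # Imported here so the schema helper above can be used without pyspark installed.
    from pyspark.sql.types import StructType

    schema = StructType.fromJson(attributes_schema_json())
    # multiLine is needed because the sample record spans several lines.
    return spark.read.schema(schema).option("multiLine", True).json(path)
```

In a Glue job or notebook, the session would typically come from `glueContext.spark_session`, for example `read_with_schema(glueContext.spark_session, "s3://tedmo-debug-bucket/jsontest.json")`. If the map interpretation is preferred instead, the element type could be swapped for `{"type": "map", "keyType": "string", "valueType": "string", "valueContainsNull": True}`.
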
