Glue DynamicFrame.resolveChoice match_catalog resolves values arbitrarily


Hi, I have an int type field in my Glue table. In my JSON data this field only ever takes one value (99) or is missing, yet when I load the data as a dynamic frame the field is read as Field(startSE, ChoiceType([DoubleType({}),IntegerType({})], {}), {}). When I resolve it using resolveChoice(choice = 'match_catalog', ...), the field is not resolved to Field(startSE, IntegerType({}), {}) as I would expect, but is split into Field(startSE_int, IntegerType({}), {}) and Field(startSE_double, DoubleType({}), {}).

I have no trouble casting the field to an int directly with resolveChoice(specs = [('scores.startSE','cast:int')]), so there doesn't seem to be a reason why resolveChoice can't match the type provided in the Glue table. The field is nested in a struct, but other ChoiceType fields in the same struct resolve without any issue. I run into the same behavior if I instead declare the field as a double in the Glue table.
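For reference, the two calls described above can be sketched as follows. This is a minimal sketch, assuming dyf is a DynamicFrame already loaded from the catalog; the helper names and the database/table_name arguments are hypothetical placeholders, not names taken from the original job.

```python
def resolve_with_catalog(dyf, database, table_name):
    """Ask Glue to resolve every ChoiceType to the type declared in the catalog table."""
    # database and table_name are placeholders for the real catalog names.
    return dyf.resolveChoice(
        choice="match_catalog",
        database=database,
        table_name=table_name,
    )


def resolve_with_cast(dyf, path="scores.startSE", target="int"):
    """Resolve one nested field explicitly by casting it to the target type."""
    return dyf.resolveChoice(specs=[(path, "cast:" + target)])
```

The first helper is the call that produces the split startSE_int / startSE_double fields in this question; the second is the explicit cast that resolves to a single int field.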

After calling resolveChoice(choice = 'match_catalog', ...) the two resulting fields look like this. The first 24 rows are NaN (i.e., the field is missing from the first 24 records in the dynamic frame). But for those JSON records that do have a value for scores.startSE, the value is resolved arbitrarily as either an int or a double. The value is written as the integer 99 in the JSON itself, and even values from records in the same partition are split arbitrarily between int and double.

    scores_startSE_int  scores_startSE_double
0-23               NaN                    NaN
24                99.0                    NaN
25                 NaN                   99.0
26                99.0                    NaN
27                 NaN                   99.0
28                99.0                    NaN
29                 NaN                   99.0
30                 NaN                   99.0
31                99.0                    NaN
32                 NaN                   99.0
33                99.0                    NaN
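
If the data has already been split like this, the two columns can be merged back into one value after the fact. The following is only a plain-Python sketch of that idea, not a fix for the resolveChoice behavior; the helper names are hypothetical, and the rows are a small hand-written sample shaped like the table above.

```python
import math


def is_missing(value):
    """True for None or a float NaN (how missing values appear in the table)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def merge_split_field(int_value, double_value):
    """Collapse the *_int / *_double pair into one int, or None if both are missing."""
    for value in (int_value, double_value):
        if not is_missing(value):
            return int(value)
    return None


# Hand-written sample rows: one with the field missing, one per split column.
rows = [
    {"scores_startSE_int": float("nan"), "scores_startSE_double": float("nan")},
    {"scores_startSE_int": 99.0, "scores_startSE_double": float("nan")},
    {"scores_startSE_int": float("nan"), "scores_startSE_double": 99.0},
]
merged = [
    merge_split_field(r["scores_startSE_int"], r["scores_startSE_double"])
    for r in rows
]
print(merged)  # [None, 99, 99]
```
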
  • @bridgedownstream did you ever find a solution for this? I'm having the same issue.

  • @hix76 No, I am still dealing with this issue. It's being triaged on my team, but I'm hoping that Glue can resolve the issue on their end before I need to create and test my own MRE. If you come across any solutions please do share.

asked 2 years ago · 2107 views
2 Answers

Hello,

  1. To simulate your issue on my end, I uploaded sample JSON data like the following to my S3 bucket:

data1:

{
    "empid": 1,
    "empname": "messi",
    "game":{
        "score": 100
    }
}

data2:

{
    "empid": 2,
    "empname": "Ronaldo",
    "game":{
        "score": NaN
    }
}

  2. I created a Glue catalog table on the above S3 path with a schema like the one below:

[Image: Glue catalog table schema]

  3. Then I created a Glue 3.0 job (Python) using the script below:
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ResolveChoice
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Load the table from the Glue Data Catalog as a DynamicFrame
sdf = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    transformation_ctx="sdf",
)

# Schema and data before resolving the choice type
print(sdf.schema())
sdf.printSchema()
sdf.toDF().show()

# Resolve every ChoiceType to the type declared in the same catalog table
df1 = ResolveChoice.apply(
    sdf, choice="match_catalog", database="db", table_name="table"
)

# Schema and data after resolving the choice type
print(df1.schema())
df1.printSchema()
df1.toDF().show()

job.commit()

  4. After running the job, I could see that the score column in my data was initially interpreted as a choice type of double and integer. This is because Spark treats the missing/NaN values as the double data type, as explained here and here.

  5. However, in my ResolveChoice call I instruct the DynamicFrame to match the choice type columns to the schema of the same table. This works fine in my case, as shown below.

Before ResolveChoice()

StructType([Field(empid, IntegerType({}), {}),Field(empname, StringType({}), {}),Field(game, StructType([Field(score, ChoiceType([DoubleType({}),IntegerType({})], {}), {})], {}), {})], {}) 

root
|-- empid: int
|-- empname: string
|-- game: struct
|    |-- score: choice
|    |    |-- double
|    |    |-- int

+-----+-------+-------------+
|empid|empname|         game|
+-----+-------+-------------+
|    4| mbappe|{{NaN, null}}|
|    1|  messi|{{null, 100}}|
|    3| Neymar|{{null, 100}}|
|    2|Ronaldo|{{NaN, null}}|
+-----+-------+-------------+

After ResolveChoice()

StructType([Field(empid, IntegerType({}), {}),Field(empname, StringType({}), {}),Field(game, StructType([Field(score, IntegerType({}), {})], {}), {})], {}) 

root
|-- empid: int
|-- empname: string
|-- game: struct
|    |-- score: int

+-----+-------+-----+
|empid|empname| game|
+-----+-------+-----+
|    4| mbappe|  {0}|
|    1|  messi|{100}|
|    3| Neymar|{100}|
|    2|Ronaldo|  {0}|
+-----+-------+-----+
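
The explanation above, that 100 is read as an int while NaN is read as a double, can be illustrated with a small standalone sketch. It uses Python's json module, which is an assumption made for illustration only: it is not Spark's JSON reader or Glue's schema inference, and the helper names are hypothetical. It shows how one field path can end up with two observed types, which is what the choice type in the schema records.

```python
import json


def type_name(value):
    """Map a parsed JSON scalar to a label like those in the schema output."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    return "string"


def collect_types(record, prefix="", seen=None):
    """Record every scalar type observed at each dotted field path."""
    seen = {} if seen is None else seen
    for key, value in record.items():
        path = prefix + key
        if isinstance(value, dict):
            collect_types(value, path + ".", seen)
        else:
            seen.setdefault(path, set()).add(type_name(value))
    return seen


# The two sample documents from step 1 (Python's json accepts the NaN literal).
documents = [
    '{"empid": 1, "empname": "messi", "game": {"score": 100}}',
    '{"empid": 2, "empname": "Ronaldo", "game": {"score": NaN}}',
]
seen = {}
for doc in documents:
    collect_types(json.loads(doc), seen=seen)

# Fields observed with more than one type correspond to choice fields.
choice_fields = {path: sorted(types) for path, types in seen.items() if len(types) > 1}
print(choice_fields)  # {'game.score': ['double', 'int']}
```
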

AWS
SUPPORT ENGINEER
answered 2 years ago

Hi @Chiranjeevi_N, it's unfortunate that the issue could not be reproduced with the example dataset. I have a few follow-up questions:

  1. If Spark treats missing values as NaN (a double), then it makes sense to use a double type field rather than an int type field in cases where that field may be missing from the source data. But why would I also see the same field splitting behavior (scores_startSE_int and scores_startSE_double) when scores_startSE is declared as a double? The DynamicFrame knows there are missing values, so there should be no ambiguity between int and double types. Yet DynamicFrame.resolveChoice still creates this int/double split even though there are missing values in the source data and the field type is double.
  2. In the case where scores_startSE is int type in the Glue table schema, why would the DynamicFrame have trouble resolving scores_startSE to a single int field when calling DynamicFrame.resolveChoice? I was able to do this manually by using the specs parameter: resolveChoice(specs = [('scores.startSE','cast:int')]). But when doing this in an automated way via DynamicFrame.resolveChoice(choice="match_catalog", ...) a split field was created.
answered 2 years ago
