Glue DynamicFrame.resolveChoice match_catalog resolves values arbitrarily

Hi, I have an int type field in my Glue table. In my JSON data this field only ever takes one value (99) or is missing, yet when I load the data as a dynamic frame the field is read as Field(startSE, ChoiceType([DoubleType({}),IntegerType({})], {}), {}). When I resolve it using resolveChoice(choice = 'match_catalog', ...), the field is not resolved to Field(startSE, IntegerType({}), {}) as I would expect, but is instead split into Field(startSE_int, IntegerType({}), {}) and Field(startSE_double, DoubleType({}), {}).

I have no trouble casting the field directly to an int with resolveChoice(specs = [('scores.startSE','cast:int')]), so there doesn't seem to be a reason why resolveChoice can't match the type given in the Glue table. The field sits inside a struct, but other ChoiceType fields in the same struct resolve without any issue. I see the same behavior if I instead declare the field as a double in the Glue table.

After calling resolveChoice(choice = 'match_catalog', ...), the two resulting fields look like the table below. The first 24 rows are NaN (i.e., the field is missing from the first 24 records in the dynamic frame). For the JSON records that do have a value for scores.startSE, however, the value is arbitrarily resolved as either an int or a double. The value is written as the integer 99 in the JSON itself, and even records from the same partition are split arbitrarily between int and double.

    scores_startSE_int  scores_startSE_double
0-23               NaN                    NaN
24                99.0                    NaN
25                 NaN                   99.0
26                99.0                    NaN
27                 NaN                   99.0
28                99.0                    NaN
29                 NaN                   99.0
30                 NaN                   99.0
31                99.0                    NaN
32                 NaN                   99.0
33                99.0                    NaN
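
For illustration only, here is a minimal plain-Python sketch of how the two split fields could be folded back into a single integer field when the cast:int spec is not an option. It works on ordinary dicts rather than on a DynamicFrame, and the helper names is_missing and coalesce_start_se are hypothetical; only the field names startSE_int and startSE_double come from the schema described above.

```python
import math


def is_missing(value):
    """Treat both None and NaN as 'no value present'."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def coalesce_start_se(record):
    """Return a copy of the record with the split fields merged into startSE."""
    merged = dict(record)
    candidates = (merged.pop("startSE_int", None), merged.pop("startSE_double", None))
    # Take the first candidate that actually holds a value, if any.
    value = next((v for v in candidates if not is_missing(v)), None)
    merged["startSE"] = None if value is None else int(value)
    return merged


# Rows shaped like the table above: missing, int-resolved, double-resolved.
rows = [
    {"startSE_int": None, "startSE_double": None},
    {"startSE_int": 99, "startSE_double": None},
    {"startSE_int": None, "startSE_double": 99.0},
]
merged_rows = [coalesce_start_se(row) for row in rows]
print(merged_rows)
```

Inside a Glue job the same per-record logic would presumably have to be applied through a record-level transform or a Spark column expression; that part is an assumption and is not shown here.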

asked a month ago, 51 views
1 Answer

Hello,

  1. To reproduce your issue on my end, I uploaded sample JSON data like the following to my S3 bucket.

data1:

{
    "empid": 1,
    "empname": "messi",
    "game":{
        "score": 100
    }
}

data2:

{
    "empid": 2,
    "empname": "Ronaldo",
    "game":{
        "score": NaN
    }
}

  2. I created a Glue catalog table on the above S3 path with a schema like the one below.

[Screenshot of the Glue catalog table schema; image not available]

  3. I then created a Glue 3.0 job (Python) using the script below:

import sys
from awsglue.transforms import ResolveChoice
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

sdf = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    transformation_ctx="sdf",
)

print(sdf.schema())
sdf.printSchema()
sdf.toDF().show()

# Resolve choice types against the column types declared in the catalog table
df1 = ResolveChoice.apply(
    sdf,
    choice="match_catalog",
    database="db",
    table_name="table",
)

print(df1.schema())
df1.printSchema()
df1.toDF().show()

job.commit()

  4. After running the job, I could see that the score column in my data was initially interpreted as a choice type of double and integer. This is because Spark treats missing/NaN values as the double data type, as explained here and here.

  5. However, in my ResolveChoice call, I instruct the DynamicFrame to match the choice-type columns to the schema of the same catalog table. This works fine in my case, as shown below.

Before ResolveChoice()

StructType([Field(empid, IntegerType({}), {}),Field(empname, StringType({}), {}),Field(game, StructType([Field(score, ChoiceType([DoubleType({}),IntegerType({})], {}), {})], {}), {})], {}) 

root
|-- empid: int
|-- empname: string
|-- game: struct
|    |-- score: choice
|    |    |-- double
|    |    |-- int

+-----+-------+-------------+
|empid|empname|         game|
+-----+-------+-------------+
|    4| mbappe|{{NaN, null}}|
|    1|  messi|{{null, 100}}|
|    3| Neymar|{{null, 100}}|
|    2|Ronaldo|{{NaN, null}}|
+-----+-------+-------------+

After ResolveChoice()

StructType([Field(empid, IntegerType({}), {}),Field(empname, StringType({}), {}),Field(game, StructType([Field(score, IntegerType({}), {})], {}), {})], {}) 

root
|-- empid: int
|-- empname: string
|-- game: struct
|    |-- score: int

+-----+-------+-----+
|empid|empname| game|
+-----+-------+-----+
|    4| mbappe|  {0}|
|    1|  messi|{100}|
|    3| Neymar|{100}|
|    2|Ronaldo|  {0}|
+-----+-------+-----+
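
The point above about NaN values producing a double next to the integer can be illustrated outside Glue. The sketch below is only an analogy and not Spark's or Glue's inference code: Python's standard json module also happens to accept the NaN literal and parses it as a float, so the same field ends up with two observed types. The helper name observed_types is hypothetical; the sample records mirror data1 and data2.

```python
import json

# One JSON document per line, mirroring data1 and data2 above.
records = [
    '{"empid": 1, "empname": "messi", "game": {"score": 100}}',
    '{"empid": 2, "empname": "Ronaldo", "game": {"score": NaN}}',
]


def observed_types(lines, path):
    """Collect the Python type names seen at a nested key path."""
    seen = set()
    for line in lines:
        value = json.loads(line)
        for key in path:
            # Walk down the nested dicts; stop with None if the key is absent.
            value = value.get(key) if isinstance(value, dict) else None
        if value is not None:
            seen.add(type(value).__name__)
    return seen


score_types = observed_types(records, ["game", "score"])
print(sorted(score_types))  # ['float', 'int']
```

A field that is a plain integer in every record where it holds a real value can therefore still show up with two candidate types once NaN values are present.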

SUPPORT ENGINEER
answered 24 days ago
