2 Answers
Hello,
- I used sample JSON data like the following to reproduce your issue on my end, and uploaded it to my S3 bucket
data1:
{
  "empid": 1,
  "empname": "messi",
  "game": {
    "score": 100
  }
}
data2:
{
  "empid": 2,
  "empname": "Ronaldo",
  "game": {
    "score": NaN
  }
}
- I created a Glue catalog table on the above S3 path with a schema like below
- Next, I created a Glue 3.0 job (Python) using the script below
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job setup
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the catalog table into a DynamicFrame
sdf = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    transformation_ctx="sdf",
)

# Schema and data before resolving the choice type
print(sdf.schema())
sdf.printSchema()
sdf.toDF().show()

# Cast each choice column to the type declared in the catalog table
df1 = ResolveChoice.apply(sdf, choice="match_catalog", database="db", table_name="table")

# Schema and data after resolving the choice type
print(df1.schema())
df1.printSchema()
df1.toDF().show()

job.commit()
- After running the job, I could see that the score column in my data was initially interpreted as a choice type of double and int. This is because Spark treats missing/NaN values as the double datatype, as explained here and here
- However, in my ResolveChoice call, I instruct the DynamicFrame to match the choice type columns to the schema of the same catalog table. This works fine in my case, as shown below
Before ResolveChoice()
StructType([Field(empid, IntegerType({}), {}),Field(empname, StringType({}), {}),Field(game, StructType([Field(score, ChoiceType([DoubleType({}),IntegerType({})], {}), {})], {}), {})], {})
root
|-- empid: int
|-- empname: string
|-- game: struct
|    |-- score: choice
|    |    |-- double
|    |    |-- int
+-----+-------+-------------+
|empid|empname|         game|
+-----+-------+-------------+
|    4| mbappe|{{NaN, null}}|
|    1|  messi|{{null, 100}}|
|    3| Neymar|{{null, 100}}|
|    2|Ronaldo|{{NaN, null}}|
+-----+-------+-------------+
After ResolveChoice()
StructType([Field(empid, IntegerType({}), {}),Field(empname, StringType({}), {}),Field(game, StructType([Field(score, IntegerType({}), {})], {}), {})], {})
root
|-- empid: int
|-- empname: string
|-- game: struct
|    |-- score: int
+-----+-------+-----+
|empid|empname| game|
+-----+-------+-----+
|    4| mbappe|  {0}|
|    1|  messi|{100}|
|    3| Neymar|{100}|
|    2|Ronaldo|  {0}|
+-----+-------+-----+
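For readers without a Glue environment, the behavior described above can be approximated in plain Python. This is only an analogy, not Glue or Spark code: it relies on Python's json module accepting the non-standard NaN literal and parsing it as a float, and the helper cast_score_to_int is a hypothetical name that maps NaN to 0 purely to mirror the {0} rows in the output above. It shows why one field ends up with two types, and that casting to int hides which rows were NaN.

```python
import json
import math

# Two records shaped like data1 and data2 above. Python's json module accepts
# the non-standard NaN literal by default and parses it as a float.
raw_records = [
    '{"empid": 1, "empname": "messi", "game": {"score": 100}}',
    '{"empid": 2, "empname": "Ronaldo", "game": {"score": NaN}}',
]
records = [json.loads(r) for r in raw_records]

# Collect the Python types seen for game.score across all records.
# Two types for one field is the situation a DynamicFrame reports as a choice.
score_types = {type(rec["game"]["score"]).__name__ for rec in records}


def cast_score_to_int(value):
    """Hypothetical helper: cast to int, mapping NaN to 0 as in the output above."""
    if isinstance(value, float) and math.isnan(value):
        return 0
    return int(value)


resolved = [cast_score_to_int(rec["game"]["score"]) for rec in records]

print(sorted(score_types))  # ['float', 'int']
print(resolved)             # [100, 0]
```

Note that after the cast, a NaN row and a genuine score of 0 can no longer be told apart, which may matter if missing scores need to be distinguished downstream.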
Hi @Chiranjeevi_N, it's unfortunate that the issue could not be reproduced with the example dataset. I have a few follow-up questions:
- If Spark treats missing values as NaN (a double), then it makes sense to use a double type field rather than an int type field in cases where that field may be missing from the source data. But why would I also experience the same field splitting behavior (scores_startSE_int and scores_startSE_double) when scores_startSE is double type? The DynamicFrame knows there are missing values, so there can't be an ambiguity between int and double types. Yet DynamicFrame.resolveChoice creates this int/double ambiguity although there are missing values in the source data and the field type is double.
- In the case where scores_startSE is int type in the Glue table schema, why would the DynamicFrame have trouble resolving scores_startSE to a single int field when calling DynamicFrame.resolveChoice? I was able to do this manually by using the specs parameter: resolveChoice(specs = [('scores.startSE','cast:int')]). But when doing this in an automated way via DynamicFrame.resolveChoice(choice="match_catalog", ...) a split field was created.
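The manual workaround quoted above (a specs list with cast:int) could be generated for several fields at once rather than written by hand. A minimal sketch, assuming a hypothetical helper named build_cast_specs and that the caller already knows which dotted field paths come back as choices; only the (path, "cast:type") tuple format is taken from the call quoted above.

```python
def build_cast_specs(field_paths, target_type):
    """Hypothetical helper: build a resolveChoice `specs` list.

    field_paths: dotted paths such as "scores.startSE".
    target_type: a type name such as "int" or "double".
    """
    return [(path, f"cast:{target_type}") for path in field_paths]


specs = build_cast_specs(["scores.startSE"], "int")
print(specs)  # [('scores.startSE', 'cast:int')]

# Inside a Glue job, with `dyf` being an existing DynamicFrame (assumed):
# resolved = dyf.resolveChoice(specs=specs)
# resolved.printSchema()
```

This keeps the explicit cast behavior that worked manually, at the cost of maintaining the field list instead of deriving it from the catalog.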
answered 3 years ago

@bridgedownstream did you ever find a solution for this? I'm having the same issue.
@hix76 No, I am still dealing with this issue. It's being triaged on my team, but I'm hoping that Glue can resolve the issue on their end before I need to create and test my own MRE. If you come across any solutions please do share.