2 Answers
Hello,
- I used some sample JSON data like the below to simulate your issue at my end and uploaded it into my S3 bucket:
data1:
{
  "empid": 1,
  "empname": "messi",
  "game": {
    "score": 100
  }
}
data2:
{
  "empid": 2,
  "empname": "Ronaldo",
  "game": { "score": NaN }
}
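Note that NaN is not valid JSON per the spec, but many parsers accept it as an extension and decode it as a floating-point value, which is why a numeric field containing NaN ends up inferred as a double. A quick illustration using Python's standard json module (which accepts the NaN literal by default):

```python
import json
import math

# Python's json parser accepts the non-standard NaN literal by default
# and decodes it as float("nan"), i.e. a floating-point value.
record = json.loads('{"empid": 2, "empname": "Ronaldo", "game": {"score": NaN}}')

score = record["game"]["score"]
print(type(score).__name__)  # float
print(math.isnan(score))     # True
```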
- I created a Glue catalog table on the above S3 path with a schema like the one below
- Then I created a Glue 3.0 (Python) job using the script below
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table from the Glue Data Catalog
sdf = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    transformation_ctx="sdf",
)

# Schema and data before resolving the choice type
print(sdf.schema())
sdf.printSchema()
sdf.toDF().show()

# Resolve choice-type columns against the catalog schema of the same table
df1 = ResolveChoice.apply(sdf, choice="match_catalog", database="db", table_name="table")

# Schema and data after resolving the choice type
print(df1.schema())
df1.printSchema()
df1.toDF().show()

job.commit()
- After running the job, I could see that the score column was initially interpreted as a Choice type between Double and Integer. This is because missing/NaN values are treated as the Double datatype by Spark, as explained here and here.
- However, the resolveChoice call instructs the DynamicFrame to match the choice-type columns to the schema of the same catalog table. This works fine in my case, as shown below.
Before ResolveChoice()
StructType([Field(empid, IntegerType({}), {}),Field(empname, StringType({}), {}),Field(game, StructType([Field(score, ChoiceType([DoubleType({}),IntegerType({})], {}), {})], {}), {})], {})
root
|-- empid: int
|-- empname: string
|-- game: struct
| |-- score: choice
| | |-- double
| | |-- int
+-----+-------+-------------+
|empid|empname| game|
+-----+-------+-------------+
| 4| mbappe|{{NaN, null}}|
| 1| messi|{{null, 100}}|
| 3| Neymar|{{null, 100}}|
| 2|Ronaldo|{{NaN, null}}|
+-----+-------+-------------+
After ResolveChoice()
StructType([Field(empid, IntegerType({}), {}),Field(empname, StringType({}), {}),Field(game, StructType([Field(score, IntegerType({}), {})], {}), {})], {})
root
|-- empid: int
|-- empname: string
|-- game: struct
| |-- score: int
+-----+-------+-----+
|empid|empname| game|
+-----+-------+-----+
| 4| mbappe| {0}|
| 1| messi|{100}|
| 3| Neymar|{100}|
| 2|Ronaldo| {0}|
+-----+-------+-----+
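The {0} rows above come from casting NaN to int: in Spark, casting a double NaN to an integer type yields 0 (Java primitive-cast semantics) rather than null. A minimal pure-Python sketch of that mapping (the helper name is my own, not a Glue API):

```python
import math

def cast_double_to_int(x):
    """Mimic Spark's double -> int cast semantics for this example:
    NaN becomes 0; finite values are truncated toward zero."""
    if x is None:
        return None
    if math.isnan(x):
        return 0
    return int(x)

scores = [float("nan"), 100.0, 100.0, float("nan")]
print([cast_double_to_int(s) for s in scores])  # [0, 100, 100, 0]
```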
Hi @Chiranjeevi_N, it's unfortunate that the issue couldn't be reproduced with the example dataset. I have a few follow-up questions:
- If Spark treats missing values as NaN (a double), then it makes sense to use a `double` type field rather than an `int` type field in cases where that field may be missing from the source data. But why would I also experience the same field-splitting behavior (`scores_startSE_int` and `scores_startSE_double`) when `scores_startSE` is `double` type? The `DynamicFrame` knows there are missing values, so there can't be an ambiguity between `int` and `double` types. Yet `DynamicFrame.resolveChoice` creates this `int`/`double` ambiguity even though there are missing values in the source data and the field type is `double`.
- In the case where `scores_startSE` is `int` type in the Glue table schema, why would the `DynamicFrame` have trouble resolving `scores_startSE` to a single `int` field when calling `DynamicFrame.resolveChoice`? I was able to do this manually by using the `specs` parameter: `resolveChoice(specs=[('scores.startSE', 'cast:int')])`. But when doing this in an automated way via `DynamicFrame.resolveChoice(choice="match_catalog", ...)`, a split field was created.
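For reference, the manual `specs` workaround resolves a single nested field addressed by a dotted path. A rough pure-Python analogue of applying a `cast:int` spec to a nested record (the helper and its behavior are illustrative only, not the Glue implementation; NaN maps to 0, matching the output shown in the answer above):

```python
import math

def apply_cast_int_spec(record, path):
    """Illustrative sketch: walk a dotted path into a nested dict and
    cast the leaf value to int, mapping a NaN double to 0."""
    *parents, leaf = path.split(".")
    node = record
    for key in parents:
        node = node[key]
    value = node[leaf]
    node[leaf] = 0 if isinstance(value, float) and math.isnan(value) else int(value)
    return record

row = {"scores": {"startSE": float("nan")}}
print(apply_cast_int_spec(row, "scores.startSE"))  # {'scores': {'startSE': 0}}
```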
answered 2 years ago
@bridgedownstream, did you ever find a solution for this? I'm having the same issue.
@hix76 No, I am still dealing with this issue. It's being triaged on my team, but I'm hoping that Glue can resolve the issue on their end before I need to create and test my own MRE. If you come across any solutions please do share.