AWS Glue data type conversion


Currently, I am using AWS Glue to extract data from MongoDB and push all the data to a JSON file in S3. I use the create_dynamic_frame function to extract data from MongoDB and the write_dynamic_frame function to write data to S3, but I get this error:

    Error Category: UNCLASSIFIED_ERROR; Failed Line Number: 64; An error occurred while calling o140.pyWriteDynamicFrame. Invalid field: 'createdAt'. The dataType 'timestamp' is invalid for 'BsonString{value='2024-01-18T10:31:26.985Z'}'.

However, in the createdAt column I do not see any value like 2024-01-18T10:31:26.985Z, and when I remove the createdAt column, the error still exists. I do not know whether there is a problem with AWS Glue or not. Can anyone help me with this?

This is the schema in MongoDB:

    {
      "timestamp": { "$date": "2024-05-12T12:28:41.000Z" },
      "metadata": {
        "batchId": "6640b5ff911679d0912bc1a8",
        "boxId": "BSC-05",
        "jobProceedDate": { "$date": "2024-05-12T12:28:47.183Z" },
        "locationCode": "190C",
        "microControllerId": "1000000008f78d2d",
        "uuid": { "$oid": "6640b5ff911679d0912bc1aa" },
        "workOrderId": 22615
      },
      "armHeightLeft": null,
      "createdAt": { "$date": "2024-05-12T12:28:47.125Z" },
      "pumpSwitchLeft": 0,
      "flowMeterRightAccumulation": null,
      "temperatureLeft": null,
      "flowMeterLeftAccumulation": null,
      "pumpSwitchRight": 0,
      "altitude": 28.7,
      "longitude": 105.1073293,
      "speed": 2.366,
      "temperatureRight": 28.9,
      "bearing": 29.8,
      "armHeightRight": 176,
      "flowMeterLeftRate": null,
      "pumpSwitchMain": 0,
      "latitude": -4.695701817,
      "flowMeterRightRate": null,
      "reportDate": { "$date": "2024-05-12T12:28:45.931Z" },
      "gpsStatus": "stable",
      "_id": { "$oid": "6640b5ff911679d0912bc1a9" },
      "utcTimestamp": { "$date": "2024-05-12T12:28:41.000Z" }
    }

rinney
asked 21 days ago · 118 views
2 Answers
Accepted Answer

Can you please share the code in more detail? I'd specifically like to see how you are establishing your connections to MongoDB and S3. Are you using the Glue connector for MongoDB?

It's possible that if you are not creating the "createdAt" column, it is being added as part of the connection options. Since you are writing to S3, that could be part of it, or perhaps it's a "hidden" field in MongoDB that you are not aware of. (I don't even know if that's possible with Mongo, but I have seen it happen when people work with DB views and then use tables for ETL.) Many people put columns like "createdAt" on schemas to track inserts so they can model downstream transaction semantics, especially in systems like Mongo where there aren't cascading semantics like in an RDBMS.

Also, I typically recommend converting to a (Spark) DataFrame, and before writing to S3, printing the schema and the first ten or so rows to see what is going on. I generally find Spark DataFrames to be more reliable, in terms of doing what I expect them to do.

answered 21 days ago
AWS
SUPPORT ENGINEER
reviewed 14 days ago
  • This is the sample script I use for ETL:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Script generated for node MongoDB
    MongoDB_node1717387510167 = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options={
            "connectionName": "MongoDB Data Connection",
            "database": "boom_db_mongodb",
            "collection": "device_data",
            "partitioner": "com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner",
            "partitionerOptions.partitionSizeMB": "10",
            "partitionerOptions.partitionKey": "_id",
            "disableUpdateUri": "false",
        },
        transformation_ctx="MongoDB_node1717387510167",
    )

    # Script generated for node Amazon S3
    AmazonS3_node1717387588570 = glueContext.write_dynamic_frame.from_options(
        frame=MongoDB_node1717387510167,
        connection_type="s3",
        format="json",
        connection_options={
            "path": "s3://glue-bucket-s3-storage-output-ggf2/aws-glue-output-dir/",
            "partitionKeys": [],
        },
        transformation_ctx="AmazonS3_node1717387588570",
    )

    job.commit()

    I think the problem comes from our MongoDB, because the data is time series and it has a TTL set for 14 days. Do you have any idea, given that I still want to use the dynamic frame?

  • You can see that the error always resembles this: "Error Category: UNCLASSIFIED_ERROR; Failed Line Number: 19; An error occurred while calling o113.pyWriteDynamicFrame. Invalid field: 'createdAt'. The dataType 'timestamp' is invalid for 'BsonString{value='2024-01-18T10:31:26.985Z'}'." Even though the value 2024-01-18T10:31:26.985Z does not exist in MongoDB, it is still returned, and this is what I am wondering about. As I told you before, my DB has a TTL to automatically remove documents 14 days after the creation date. It is now June 3, 2024, but the error with 2024-01-18T10:31:26.985Z still exists.


Between the create and write dynamic frame statements, do this:

print("schema before converting to dataframe")
MongoDB_node1717387510167.printSchema()
src_df = MongoDB_node1717387510167.toDF()
print("schema after converting to dataframe")
src_df.printSchema()

Now, if this shows a "createdAt" column somewhere in the output, then somehow it is getting into your dynamic frame. My guess, as I stated earlier, is that createdAt is part of the underlying Mongo collection and you just don't know about it.
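You can also check for this kind of schema conflict outside of Glue. Here is a plain-Python sketch (the `sample_docs` documents are hypothetical, modeled on the question's schema) that groups the observed types per field across a few exported documents; a field mapping to more than one type is exactly the sort of conflict that surfaces as an invalid-dataType error on write:

```python
# Hypothetical sample: a few documents exported from the collection.
sample_docs = [
    {"_id": "a1", "createdAt": {"$date": "2024-05-12T12:28:47.125Z"}, "speed": 2.366},
    {"_id": "a2", "createdAt": "2024-01-18T10:31:26.985Z", "speed": 1.2},
]

# Collect the set of Python types observed for each field.
field_types = {}
for doc in sample_docs:
    for key, value in doc.items():
        field_types.setdefault(key, set()).add(type(value).__name__)

# A field with more than one observed type is a schema conflict:
# here createdAt appears both as a date object (dict in extended
# JSON) and as a bare string, matching the BsonString in the error.
conflicts = {k: v for k, v in field_types.items() if len(v) > 1}
print(conflicts)  # → {'createdAt': {'dict', 'str'}}
```

Running this over a larger export (or a `find()` sample via pymongo) would tell you whether old documents really do hold `createdAt` as a string.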

How to remedy:

# make sure to select only the columns you want/expect
final_df = src_df.select(...)
# then convert back to a dynamic frame and write out as your script
# already does, but use final_df
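If the check shows that some documents carry createdAt as a BSON date while others carry it as an ISO-8601 string, another remedy is to normalize the field to a single type before writing. A minimal plain-Python sketch of the idea (`normalize_created_at` is a hypothetical helper, not a Glue API):

```python
from datetime import datetime, timezone

def normalize_created_at(value):
    # Hypothetical helper: coerce a createdAt value that may arrive
    # either as a real datetime (BSON date) or as an ISO-8601 string
    # (the BsonString form from the error) into one datetime type.
    if isinstance(value, datetime):
        return value
    # fromisoformat in Python < 3.11 does not accept a trailing 'Z',
    # so replace it with an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

mixed_values = [
    datetime(2024, 5, 12, 12, 28, 47, 125000, tzinfo=timezone.utc),
    "2024-01-18T10:31:26.985Z",  # the string form reported by Glue
]
normalized = [normalize_created_at(v) for v in mixed_values]
```

In a Glue job, the analogous fix is usually applied on the DynamicFrame itself, e.g. with resolveChoice to cast createdAt to a single type, rather than row by row in Python.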
answered 13 days ago
  • Hi Nathan Wayne Hanks, thanks for your answer, but let me explain a little. This job ran successfully, and then around one month later this error appeared in my Glue ETL job, which means the original MongoDB schema already had the createdAt field. Even when I removed the createdAt field from the Mongo schema before actually writing to S3, this error was always there :(. My MongoDB has a TTL set on the collection, and it is now June, yet the error reports 'BsonString{value='2024-01-18T10:31:26.985Z'}'. That data is too old: since the TTL set for documents is 14 days, it must have disappeared from my MongoDB. Assuming the error comes from the createdAt field, why does that value always appear, even after I remove the field? Do you have any idea?

  • Could we have a call or a meeting about this problem? You can reply with your email here so that I can contact you.
