Can you please share the code in more detail? I'd specifically like to see how you are establishing your connections to MongoDB and S3. Are you using the Glue connector for MongoDB?
It's possible that if you are not creating the "createdAt" column yourself, it is being added as part of the connection options. Since you are writing to S3, that could be part of it, or perhaps it's a "hidden" field in MongoDB that you are not aware of. (I don't even know whether that's possible with Mongo, but I have seen it when people work with DB views and then use tables for ETL.) Many people put columns like "createdAt" on schemas to track inserts so they can model downstream transaction semantics, especially in systems like Mongo where there are no cascading semantics as in an RDBMS.
Also, I typically recommend converting to a (Spark) DataFrame and, before writing to S3, printing the schema and the first ten or so rows to see what is going on. I generally find Spark DataFrames to be more reliable, in terms of doing what I expect them to do.
Between the create and write dynamic frame statements, do this:
print("schema before converting to dataframe")
MongoDB_node1717387510167 .printSchema()
src_df = MongoDB_node1717387510167 .toDF()
print("schema after converting to dataframe")
src_df.printSchema()
Now, if this shows a "createdAt" column somewhere in the output, it is somehow getting into your dynamic frame. My guess, as I stated earlier, is that createdAt is part of the underlying Mongo collection and you just don't know about it.
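One quick way to confirm that from the Mongo side is to query the collection directly. A minimal sketch with pymongo (the connection URI, database, and collection names here are placeholders):

from pymongo import MongoClient

# Placeholder URI -- replace with your real connection string
client = MongoClient("mongodb://<host>:27017")
coll = client["<database>"]["<collection>"]

# Does any document still carry a createdAt field?
print(coll.find_one({"createdAt": {"$exists": True}}))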
How to remedy:
# make sure to select only the columns you want/expect
final_df = src_df.select(...)
# then convert back to a dynamic frame and write out as your script already does, but use final_df
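For example (a sketch only; the column names below are placeholders for whatever your collection actually contains):

from awsglue.dynamicframe import DynamicFrame

# Keep only the columns you expect -- "_id", "deviceId", "value" are placeholder names
final_df = src_df.select("_id", "deviceId", "value")

# Convert back to a dynamic frame so the existing write_dynamic_frame call can use it
final_dyf = DynamicFrame.fromDF(final_df, glueContext, "final_dyf")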
Hi Nathan Wayne Hanks, thanks a lot for your answer, but let me explain a bit more. This job ran successfully, and then about a month later this error appeared in my Glue ETL job. That means the original MongoDB schema already had the createdAt field, and even when I remove the createdAt field from the Mongo schema before writing to S3, the error is still there :(. My MongoDB collection has a TTL set, and it is now June, yet the error reports 'BsonString{value='2024-01-18T10:31:26.985Z'}'. That data is far too old: the TTL on the collection is 14 days, so those documents should have disappeared from MongoDB. Assuming the error comes from the createdAt field, why does that data keep appearing, and why is the error still there even after I remove the field? Do you have any idea?
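For what it's worth, one way to double-check on the Mongo side whether any old documents remain is a query like this (a sketch with pymongo; the URI is a placeholder, the cutoff date is just an example, and it assumes createdAt is stored as an ISO-8601 string so lexicographic comparison works):

from pymongo import MongoClient

# Placeholder URI -- replace with the real connection string
client = MongoClient("mongodb://<host>:27017")
coll = client["boom_db_mongodb"]["device_data"]

# Example cutoff roughly 14 days before June 3, 2024
stale = coll.count_documents({"createdAt": {"$lt": "2024-05-20T00:00:00.000Z"}})
print("documents older than the TTL window:", stale)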
Could we have a call or meeting about this problem? You can reply with your email here so that I can contact you.
This is the sample script I use for the ETL:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Script generated for node MongoDB
MongoDB_node1717387510167 = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options={
        "connectionName": "MongoDB Data Connection",
        "database": "boom_db_mongodb",
        "collection": "device_data",
        "partitioner": "com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "_id",
        "disableUpdateUri": "false",
    },
    transformation_ctx="MongoDB_node1717387510167",
)
# Script generated for node Amazon S3
AmazonS3_node1717387588570 = glueContext.write_dynamic_frame.from_options(
    frame=MongoDB_node1717387510167,
    connection_type="s3",
    format="json",
    connection_options={
        "path": "s3://glue-bucket-s3-storage-output-ggf2/aws-glue-output-dir/",
        "partitionKeys": [],
    },
    transformation_ctx="AmazonS3_node1717387588570",
)
job.commit()

I think the problem comes from our MongoDB, because the data is time series and the collection has a TTL set for 14 days. Do you have any idea how to handle this while still using the dynamic frame?
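If the goal is to keep working with a dynamic frame while making sure createdAt never reaches S3, one option (a sketch only, using Glue's DropFields transform between the read and the write) would be:

from awsglue.transforms import DropFields

# Drop the problematic field; everything else stays a dynamic frame
cleaned_dyf = DropFields.apply(
    frame=MongoDB_node1717387510167,
    paths=["createdAt"],
    transformation_ctx="cleaned_dyf",
)

# Then pass cleaned_dyf instead of MongoDB_node1717387510167 to write_dynamic_frame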
You can see that the error always resembles this: "Error Category: UNCLASSIFIED_ERROR; Failed Line Number: 19; An error occurred while calling o113.pyWriteDynamicFrame. Invalid field: 'createdAt'. The dataType 'timestamp' is invalid for 'BsonString{value='2024-01-18T10:31:26.985Z'}'." Even though the value 2024-01-18T10:31:26.985Z no longer exists in MongoDB, it still shows up in the error, and that is what I am wondering about. As I told you before, my DB has a TTL that automatically removes documents 14 days after their creation date. It is now June 3, 2024, but the error with 2024-01-18T10:31:26.985Z still appears.
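If the field does still come through and the complaint is the string-vs-timestamp mismatch, another option (again only a sketch, assuming the dynamic frame's resolveChoice cast handles this case) might be to force createdAt to a single type before the write:

# Force createdAt to one consistent type; "cast:string" keeps the ISO-8601 text as-is
resolved_dyf = MongoDB_node1717387510167.resolveChoice(
    specs=[("createdAt", "cast:string")]
)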