Glue Pyspark with MongoDB - "Partitioning failed... Document does not contain key avgObjSize"


I am using AWS Glue 4.0 (PySpark) to get data from several MongoDB Atlas collections, using the GlueContext.create_dynamic_frame.from_options method. For all collections except one it has worked fine on every run (hundreds), but one collection sometimes throws an error. Below is the summarised traceback from CloudWatch:

File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o99.getDynamicFrame.
: com.mongodb.spark.sql.connector.exceptions.MongoSparkException: Partitioning failed.
...
Caused by: org.bson.BsonInvalidOperationException: Document does not contain key avgObjSize

This is how I am getting the dynamic frame:

collection_dynamic_frame = my_glue_context.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options={
        "connectionName": my_connection_name,
        "database": my_database,
        "collection": my_collection,
    },
)

This has only happened when the job is deployed and running on a schedule, and I have been unable to re-create when manually running the job (both deployed, and in my local docker setup).
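In the meantime I am considering a guard, sketched here under the assumption that the failure is tied to the collection's state at read time: check the document count first and skip the read when there is nothing to fetch. Only the small pure function below is real code; the pymongo call, my_uri, and the other names are illustrative placeholders, not code from the actual job:

```python
# Minimal sketch (assumption: the partitioner fails on certain collection
# states, so an empty collection should be skipped rather than read).

def should_read(document_count):
    """Return True only when the collection has documents worth reading."""
    return document_count > 0

# Inside the Glue job the count could come from a lightweight pymongo call
# (assuming pymongo is shipped via --additional-python-modules):
#
# from pymongo import MongoClient
# count = MongoClient(my_uri)[my_database][my_collection].estimated_document_count()
# if should_read(count):
#     collection_dynamic_frame = my_glue_context.create_dynamic_frame.from_options(...)
```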

Thank you for any help you can give!

  • To me it sounds like a bug in the Spark connector. Is it possible that the collection is empty and the connector doesn't handle that correctly?

po
asked a month ago · 119 views
2 Answers

Hello,

As per the stack trace, I believe the job encountered an issue while trying to read one of the documents in the collection. I checked various external sources for the possible cause of this error, and from this link it can be understood that a BsonInvalidOperationException occurs in either of two scenarios:

  1. the document does not contain the expected key
  2. the value is not of the expected type

Now, as per the stack trace, it seems that one of the documents in this particular collection does not contain the key 'avgObjSize'.

As you say the issue is transient, it might be that the job fails with this error whenever that particular document is encountered. It might also be unexpected behaviour on the connector's side. You could try using the latest MongoDB Spark connector for your job and see if the issue still repeats.

AWS
SUPPORT ENGINEER
Chaitu
answered a month ago
EXPERT
reviewed a month ago

Hi, I'm the original question asker but for some reason (skill issue) can't sign into my re:Post account. Thanks for the replies.

@Gonzales Herreros - Yes, I think you are correct, based on some experimenting I did yesterday. It seems to happen when the collection is empty (someone else on my team was emptying it in the testing environment and I didn't catch it). And yes, I think it's a bug in the connector: it happens in this setup (using a Glue connection object with MongoDB Atlas), but doesn't happen when, for example, using a URI string + username + password (which is what is being used in my local/testing pipeline setup, and why it wasn't caught there). In general, both from my experience and from what I've read from others, the Glue connector seems to have a few issues with Mongo (for example, it can't do pushdown predicates), although I'm not sure if they are issues with the underlying PySpark tools or the Glue implementations. Either way, I would probably suggest not using Glue to get data from MongoDB until these things are resolved.
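For reference, the URI + username + password setup that avoided the bug for me looked roughly like this. It's a sketch: every value is a placeholder, and the option keys are the ones documented for Glue's "mongodb" connection type, so check them against the current Glue docs:

```python
# Hedged sketch of a URI-based MongoDB read (no Glue connection object).
# All names (my_uri, my_database, etc.) are placeholders.

def build_mongo_options(uri, database, collection, username, password):
    """Assemble the connection_options dict for a URI-based MongoDB read."""
    return {
        "connection.uri": uri,  # e.g. "mongodb+srv://cluster0.example.mongodb.net"
        "database": database,
        "collection": collection,
        "username": username,
        "password": password,
    }

# Usage inside the Glue job (illustrative, needs a real GlueContext):
# frame = my_glue_context.create_dynamic_frame.from_options(
#     connection_type="mongodb",
#     connection_options=build_mongo_options(
#         my_uri, my_database, my_collection, my_user, my_password
#     ),
# )
```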

@Chaitu avgObjSize is metadata for Mongo collections - https://www.mongodb.com/docs/manual/reference/command/dbStats/ It isn't a key that we intentionally added to the documents (which is what I originally thought it meant, which confused me!). Yes, I think you're absolutely correct that using another connector is a good idea; there seem to be a few issues with the Glue connector and MongoDB.

PO
answered a month ago
