Unable to read data from MongoDB using PySpark or Python


I am trying to read data from a 3-node MongoDB cluster (replica set) using PySpark and native Python on AWS EMR. I am facing issues while executing the code within the AWS EMR cluster as explained below, but the same code works fine on my local Windows machine.

  • Spark version - 2.4.8
  • Scala version - 2.11.12
  • MongoDB version - 4.4.8
  • mongo-spark-connector version - mongo-spark-connector_2.11:2.4.4
  • Python version - 3.7.10

Through PySpark (issue: PySpark returns an empty DataFrame)

Below are the commands used to run the PySpark job in local and cluster mode.

  1. local mode: (screenshot of the command)

  2. cluster mode: (screenshot of the command)

With both modes I am not able to read data from MongoDB (the DataFrame comes back empty), even though telnet to the MongoDB hosts works from every node of the Spark cluster. From the logs, I can confirm that Spark is able to communicate with MongoDB, yet the PySpark job still returns an empty DataFrame. Please find below a screenshot of the PySpark logs showing a successful connection to MongoDB.

(screenshot of the PySpark logs showing a successful connection to MongoDB)

Below is the code snippet:

from pyspark import SparkContext
from pyspark.sql import SparkSession  # missing in the original snippet

sc = SparkContext()
# The MongoDB input URI carries the replica set, auth source, and the db.table namespace
spark = (
    SparkSession(sc).builder
    .appName("MongoDbToS3")
    .config("spark.mongodb.input.uri",
            "mongodb://username:password@host1,host2,host3/db.table?replicaSet=ABCD&authSource=admin")
    .getOrCreate()
)
data = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
data.show()

Please let me know if there is anything I am doing wrong or missing in the PySpark code.
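As a sanity check, the database and collection can also be passed as explicit read options instead of being embedded in the URI, which makes it easier to confirm the connector is pointed at the collection you expect. Below is a minimal sketch against mongo-spark-connector 2.x; the hostnames, credentials, and the db/table names are placeholders:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MongoDbToS3")
    # Connection-level settings only; database/collection are passed on the reader below
    .config("spark.mongodb.input.uri",
            "mongodb://username:password@host1,host2,host3/?replicaSet=ABCD&authSource=admin")
    .getOrCreate()
)

df = (
    spark.read.format("com.mongodb.spark.sql.DefaultSource")
    .option("database", "db")        # explicit database name
    .option("collection", "table")   # explicit collection name
    .load()
)

print(df.count())  # a zero count means no documents were read at all
df.printSchema()   # an empty schema also indicates the connector saw no documents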

Through native Python code (issue: the code gets stuck if batch_size > 1; if batch_size = 1 it prints the first 24 documents and then the cursor hangs)

I am using the PyMongo driver to connect to MongoDB through native Python code. The issue is that when I try to fetch and print MongoDB documents with a batch_size of 1000, the code hangs forever and then gives a network timeout error. But if I set batch_size = 1, the cursor fetches the first 24 documents and then hangs again. We observed that the 25th document is very big (around 4 KB) compared to the first 24 documents, so we tried skipping it; the cursor then started fetching the next documents but got stuck again at some other position. In short, whenever the document size is large, the cursor gets stuck.

Can you please help me understand the issue?

Is there anything blocking on the networking side or the MongoDB side?
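To help narrow down whether the problem is on the networking side or the MongoDB side, a quick connectivity check from the same EMR node may be useful. This is a minimal sketch (the connection string is a placeholder) that forces a round trip to the server before any data is fetched:

import pymongo

# Placeholder connection string; same replica set and credentials as above
client = pymongo.MongoClient(
    "mongodb://username:password@host1,host2,host3/?replicaSet=ABCD&authSource=admin",
    serverSelectionTimeoutMS=10000,
)

# 'ping' forces a round trip, so routing/firewall problems surface here as a
# ServerSelectionTimeoutError instead of a silent hang later on
print(client.admin.command("ping"))

# Which replica-set members the driver actually discovered
print(client.nodes)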

Below is the failing code snippet:

from datetime import datetime
import json
#import boto3
from bson import json_util
import pymongo


client = pymongo.MongoClient("mongodb://username@host:port/?authSource=admin&socketTimeoutMS=3600000&maxIdleTimeMS=3600000")

# Database Name
db = client["database_name"]

# Collection Name
quoteinfo__collection = db["collection_name"]

results = quoteinfo__collection.find({}).batch_size(1000)
doc_count = quoteinfo__collection.count_documents({})

print("documents count from collection: ", doc_count)
print(results)
record_increment_no = 1

for record in results:
    print(record)
    print(record_increment_no)
    record_increment_no = record_increment_no + 1
results.close()

Below are the output screenshots:

For batch_size = 1000 (code hangs and gives a network timeout error): (screenshots)

For batch_size = 1 (prints documents only up to the 24th, then the cursor hangs): (screenshot)
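Since the cursor seems to stall around the larger documents, one way to confirm whether document size is really the trigger is to measure the BSON size of each document as it is fetched. Below is a rough sketch, assuming PyMongo 3.9+ where bson.encode is available; connection details are placeholders:

import bson
import pymongo

client = pymongo.MongoClient("mongodb://username@host:port/?authSource=admin")
collection = client["database_name"]["collection_name"]

# Walk the collection with a small batch size and print the BSON size of every
# document, so the exact document where the cursor stalls can be identified
for i, doc in enumerate(collection.find({}).batch_size(10), start=1):
    size_bytes = len(bson.encode(doc))
    print(i, doc.get("_id"), size_bytes)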

2 Answers
Accepted Answer

Hi All,

There were some issues with AWS account peering between our dev AWS account and the MongoDB-hosted AWS account, as explained below.

  1. Traffic for one of the routes was flowing through VPC peering instead of the transit gateway.
  2. The MongoDB IPs did not fall under the CIDR ranges in the route table.

After adding transit gateway routes for MongoDB IP1 and MongoDB IP2, we are able to read data properly with any batch size for any collection.
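For anyone debugging a similar setup, the route table can also be checked programmatically to confirm that traffic destined for the MongoDB CIDRs goes through the transit gateway rather than a VPC peering connection. A rough boto3 sketch; the route table ID and CIDR ranges are placeholders:

import boto3

ec2 = boto3.client("ec2")

ROUTE_TABLE_ID = "rtb-0123456789abcdef0"          # placeholder route table ID
MONGODB_CIDRS = {"10.20.0.0/16", "10.30.0.0/16"}  # placeholder MongoDB CIDR ranges

response = ec2.describe_route_tables(RouteTableIds=[ROUTE_TABLE_ID])
for table in response["RouteTables"]:
    for route in table["Routes"]:
        cidr = route.get("DestinationCidrBlock")
        if cidr in MONGODB_CIDRS:
            # A healthy setup references a TransitGatewayId, not a
            # VpcPeeringConnectionId, for traffic destined to MongoDB
            print(cidr,
                  "tgw:", route.get("TransitGatewayId"),
                  "pcx:", route.get("VpcPeeringConnectionId"))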

answered a year ago

Thank you for your question; however, this requires us to look at the cluster and logs so that we can troubleshoot further. Please open a case with Premium Support so that we can debug the issue and look into your resources accordingly. We are unable to share details on the cluster here due to privacy and security concerns.

AWS
answered a year ago
