Questions tagged with Networking & Content Delivery


Unable to read data from mongoDB using Pyspark or Python

I am trying to read data from a 3-node MongoDB cluster (replica set) using PySpark and native Python on AWS EMR. The code fails inside the AWS EMR cluster as explained below, but the same code works fine on my local Windows machine.

* Spark version - 2.4.8
* Scala version - 2.11.12
* MongoDB version - 4.4.8
* mongo-spark-connector version - mongo-spark-connector_2.11:2.4.4
* Python version - 3.7.10

**Through PySpark** (issue: PySpark returns an empty DataFrame)

Below are the commands used to run the PySpark job in local and cluster mode.

1. local mode: ![Enter image description here](/media/postImages/original/IMFZchDBzOQ2-DZZp1Itjx4Q)
2. cluster mode: ![Enter image description here](/media/postImages/original/IMVxDsFE1RRXG-EgBR8QTe5g)

In both modes I am not able to read data from MongoDB (the DataFrame is empty), even though telnet to MongoDB works from every node of the Spark cluster. From the logs, I can confirm that Spark is able to communicate with MongoDB, yet the PySpark job still returns an empty DataFrame. Please find the screenshots below:

![pyspark logs - connecting to mongoDB successfully](/media/postImages/original/IMpXkrBFETTG6IaMA66wnQLA)

![Enter image description here](/media/postImages/original/IMHYoDVNMHR2uqZwSQbDRJzw)

Below is the code snippet:

```
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import sys
import json

sc = SparkContext()

# The input URI points the connector at the replica set, database and collection.
spark = SparkSession(sc).builder.appName("MongoDbToS3").config("spark.mongodb.input.uri", "mongodb://username:password@host1,host2,host3/db.table/?replicaSet=ABCD&authSource=admin").getOrCreate()

# Load the collection as a DataFrame through the MongoDB Spark connector.
data = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
```

Please let me know if I am doing anything wrong or missing something in the PySpark code.

**Through native Python code** (issue: the code gets stuck if batch_size > 1; if batch_size = 1 it prints the first 24 MongoDB documents and then the cursor hangs)

I am using the pymongo driver to connect to MongoDB from native Python code.
The issue is that when I try to fetch and print MongoDB documents with a batch_size of 1000, the code hangs for a long time and then fails with a network timeout error. If I set batch_size = 1, the cursor fetches the first 24 documents and then hangs again. We observed that the 25th document is much larger (around 4 KB) than the first 24. When we skipped the 25th document, the cursor fetched the following documents but got stuck again at a later position. So it appears that the cursor gets stuck whenever a document is large.

Can you please help me understand the issue? Is anything blocking this on the networking side or on the MongoDB side?

Below is the code snippet:

```
from datetime import datetime
import json
#import boto3
from bson import json_util
import pymongo

client = pymongo.MongoClient("mongodb://username@host:port/?authSource=admin&socketTimeoutMS=3600000&maxIdleTimeMS=3600000")

# Database Name
db = client["database_name"]

# Collection Name
quoteinfo__collection = db["collection_name"]

results = quoteinfo__collection.find({}).batch_size(1000)
doc_count = quoteinfo__collection.count_documents({})

print("documents count from collection: ", doc_count)
print(results)

# Print every document together with its running position in the cursor.
record_increment_no = 1
for record in results:
    print(record)
    print(record_increment_no)
    record_increment_no = record_increment_no + 1

results.close()
```

Below are the output screenshots.

batch_size = 1000 (code hangs and gives a network timeout error):

![Enter image description here](/media/postImages/original/IMumguUSSGQWSkLwIqtOQ10Q)

![Enter image description here](/media/postImages/original/IMf14CYH8nS3u6zj9I7Ec4hA)

batch_size = 1 (prints documents only up to the 24th and then the cursor hangs):

![Enter image description here](/media/postImages/original/IMlUdoBC6LRGqD-TJL6RUH1A)
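To make the "large documents stall the cursor" observation easier to confirm, here is a minimal sketch that logs the approximate size of each document and how long the loop waited for it. The helper names `approx_size_bytes` and `timed_documents` are hypothetical (they are not part of pymongo), and the size is the length of a JSON encoding, which only approximates the BSON size on the wire. The sketch runs on stand-in data; with pymongo, the cursor would be passed in its place.

```python
import json
import time
from typing import Any, Callable, Iterable, Iterator, Tuple


def approx_size_bytes(doc: Any) -> int:
    """Approximate size: length of the UTF-8 JSON encoding (not the BSON size)."""
    # default=str covers values json cannot encode natively (ObjectId, datetime).
    return len(json.dumps(doc, default=str).encode("utf-8"))


def timed_documents(
    docs: Iterable[Any],
    size_of: Callable[[Any], int] = approx_size_bytes,
) -> Iterator[Tuple[int, int, float, Any]]:
    """Yield (position, size, seconds_waited, document) for each document."""
    iterator = iter(docs)
    position = 0
    while True:
        started = time.monotonic()
        try:
            # With a database cursor, this is where a network fetch can block.
            doc = next(iterator)
        except StopIteration:
            return
        waited = time.monotonic() - started
        position += 1
        yield position, size_of(doc), waited, doc


if __name__ == "__main__":
    # Stand-in data. Assumption: with pymongo you would pass the cursor instead,
    # e.g. timed_documents(quoteinfo__collection.find({}).batch_size(1)).
    sample = [{"_id": 1, "quote": "a"}, {"_id": 2, "quote": "b" * 4000}]
    for position, size, waited, _doc in timed_documents(sample):
        print(f"doc {position}: ~{size} bytes, waited {waited:.3f}s")
```

With batch_size = 1 each document should need its own fetch, so the logged wait lines up with individual documents. With larger batches the wait shows up only on the first document of each batch. The last line printed before the hang gives the position of the last document that arrived, so the stall is on the fetch that follows it.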
asked a month ago