GC overhead limit exceeded
I have a modest-size dataset, and I am running a Jupyter Notebook in SageMaker (instance type ml.c5.xlarge with 200 GB instance storage). I receive the error message "GC overhead limit exceeded". Everything ran fine with a smaller dataset. BTW, I need to go through the dataframe one row at a time using df.collect(), which seems to be an expensive operation... Would you suggest another way of accomplishing this? I would appreciate your kind help.
What is the size of your dataset? The instance you chose has 8 GB of RAM.
Additionally, based on your error you seem to be running Spark. Am I right to assume you are running Spark "locally" on that notebook? Please be mindful that Spark divides the heap between Reserved Memory, User Memory, and Spark Memory, so not all 8 GB are available to handle your data at any given time.
As for avoiding iterating through one line at a time, it's hard to advise without knowing what you want to achieve, but in general the first thing to try is vectorizing your operations (https://www.geeksforgeeks.org/vectorization-in-python/).
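As a rough illustration of the vectorization idea (plain NumPy here, not Spark-specific; the column contents are made up for the example):

```python
import numpy as np

# Row-at-a-time: a Python-level loop creates a temporary object per row.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([2, 5, 1])

totals_loop = []
for p, q in zip(prices, quantities):
    totals_loop.append(p * q)

# Vectorized: one NumPy operation over whole arrays, no Python loop
# and far fewer short-lived objects for the garbage collector.
totals_vec = prices * quantities  # array([ 20., 100.,  30.])
```

The same principle applies in Spark: expressing the work as column-level operations lets the engine process data in bulk instead of row by row.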
The "GC overhead limit exceeded" error indicates that the JVM spent a lot of time on garbage collection but recovered very little memory, so it throws this error to signal that your program is making little progress and is wasting time on useless garbage collection work. Iterating through the dataframe might be the problem: you may be creating many temporary objects for each row, and they can't be garbage collected in time. What framework are you using? And what are you trying to do by going through the dataframe row by row? Maybe you can process multiple rows in a batch, for example with vectorization or matrix operations, as georgios_s suggested in the comment.
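If rows really must be pulled to the driver, a common pattern in PySpark is to stream them with df.toLocalIterator() instead of materializing everything with df.collect(), and then process them in batches. A minimal sketch of the batching part (pure Python; the Spark dataframe and row contents are hypothetical, so a plain range stands in for the row iterator here):

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterator.

    With Spark you would feed this df.toLocalIterator() rather than
    df.collect(), so only one batch of rows is held in driver memory
    at a time instead of the whole dataframe.
    """
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Stand-in for df.toLocalIterator(): any lazy row iterator works.
rows = iter(range(10))
batches = [b for b in batched(rows, 4)]
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each batch can then be handed to a vectorized function, which cuts both the per-row Python overhead and the number of short-lived objects the garbage collector has to chase.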