GC overhead limit exceeded

I have a modest-size dataset, and I am running Jupyter Notebook in SageMaker (instance type ml.c5.xlarge with a 200 GB instance volume). I receive the error message "GC overhead limit exceeded". Everything ran fine with a small data size. BTW, I need to go through the dataframe one row at a time using df.collect(), which seems to be an expensive operation... Would you suggest another way of accomplishing this? I would appreciate your kind help.
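
A simplified sketch of what I am doing now (the column name and per-row computation are just placeholders):

```python
# Simplified sketch of the current approach (placeholder column name).
# df.collect() pulls every row back to the driver as a Python list,
# which creates many temporary objects and can exhaust driver memory.
results = []
for row in df.collect():
    results.append(row["value"] * 2)   # some per-row computation
```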

  • What is the size of your dataset? The instance you chose has 8GB of RAM.

    Additionally, based on your error you seem to be running Spark, am I right to assume you are running Spark "locally" on that notebook? Please be mindful that Spark allocates memory between Reserved Memory, User Memory and Spark Memory so not all 8GB are available to handle your data at any given time.

    For advice on how to avoid iterating through one line at a time: it's hard to say without knowing what you want to achieve, but in general the first thing to look at is vectorizing your operations (https://www.geeksforgeeks.org/vectorization-in-python/). See the sketch below for what that can look like with a Spark DataFrame.
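
    For example, a per-row computation like the one in the question can usually be pushed down to Spark itself (a rough sketch; the column name and output path are made up):

    ```python
    from pyspark.sql import functions as F

    # Express the per-row computation as a column expression instead of
    # collecting rows to the driver; Spark runs this on the executors.
    df_out = df.withColumn("doubled", F.col("value") * 2)

    # If the result needs to leave Spark, write it out (or take a small
    # sample) rather than calling collect() on the full dataframe.
    df_out.write.mode("overwrite").parquet("s3://my-bucket/output/")  # hypothetical path
    ```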

1 Answer

The "GC overhead limit exceeded" error indicates that the JVM spent a lot of time on garbage collection but recovered very little memory, so it throws this error to let you know that your program is making little progress and is mostly busy doing useless garbage collection work.

Iterating through the dataframe row by row might be the problem: you may be creating a lot of temporary objects as you go through each line, and they cannot be garbage collected fast enough. What framework are you using, and what are you trying to achieve by going through the dataframe row by row? Maybe you can process multiple lines in a batch instead, for example using some vectorization or matrix operation as georgios_s suggested in the comment.
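
If the per-row logic cannot be expressed with built-in column functions, one batch-oriented option is mapInPandas, which hands you a chunk of rows at a time as a pandas DataFrame (a rough sketch, assuming PySpark 3.0+; the column names and computation are placeholders):

```python
def process_batch(batches):
    # Each element is a pandas DataFrame holding one batch of rows,
    # so the work happens in vectorized chunks instead of row by row.
    for pdf in batches:
        pdf["score"] = pdf["value"] * 2   # placeholder computation
        yield pdf

# The schema string must describe the columns returned by process_batch.
result = df.mapInPandas(process_batch, schema="value double, score double")
```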

AWS
S Lyu
answered 2 years ago
