Jupyter kernel dies on SageMaker notebook instance when running join operation on large DataFrames using pd.merge

I am running a large pandas merge (join) operation in a Jupyter notebook on a SageMaker notebook instance (ml.t3.large, i.e., 8 GB of memory).

    import pandas as pd

    df1 = pd.DataFrame({
        'ID': [1, 2, 3],
        'Name': ['A', 'B', 'C'],
        ....
    })

    df1.shape
    # (3000000, 10)

    df2 = pd.DataFrame({
        'ID': [],
        'Name': [],
        ....
    })

    df2.shape
    # (50000, 12)

    # Join data
    df_merge = pd.merge(
        df1,
        df2,
        left_on=['ID', 'Name'],
        right_on=['ID', 'Name'],
        how='left',
    )

When I run this merge, the kernel dies within a minute or so. How can I make this operation more memory-efficient?

  • Usually a kernel will die for one of two reasons: 1) it runs out of memory, or 2) there is a bug in the code or in a library. Try running this with a subset of your dataset (see the sketch below) and check that it completes without error; that rules out a bug. Then choose an instance type with more memory (ml.t3.xlarge has 16 GB of RAM) and see whether that is enough for your dataset.
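
A minimal sketch of that subset test, assuming df1 and df2 are already loaded as in the question (the 100,000-row sample size is an arbitrary choice, not a recommendation):

    import pandas as pd

    # Verify the merge logic on a small sample first to rule out a code bug.
    # df1 and df2 are assumed to be the frames from the question.
    df1_sample = df1.sample(n=100_000, random_state=42)  # subset of the 3M rows

    test_merge = pd.merge(
        df1_sample,
        df2,
        on=['ID', 'Name'],  # same key names on both sides, so on= suffices
        how='left',
    )
    print(test_merge.shape)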

1 Answer

Hello,

A SageMaker kernel can die due to high resource utilisation or due to an issue within the code or a third-party library.

Please check the system resource utilisation to ensure that the operation is running at appropriate load levels.

To check the SageMaker notebook instance's resources, enter the following commands in the notebook terminal (a Python equivalent for the memory check is sketched after this list):

  • To check memory utilisation: free -h
  • To check CPU utilisation: top
  • To check disk utilisation: df -h
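
If you prefer not to leave the notebook, a minimal sketch using psutil (assuming it is installed; pip install psutil otherwise) reports the same numbers as free -h:

    # Report total/available memory from inside the notebook,
    # equivalent to running free -h in the terminal.
    import psutil

    mem = psutil.virtual_memory()
    print(f"total:     {mem.total / 1e9:.1f} GB")
    print(f"available: {mem.available / 1e9:.1f} GB")
    print(f"used:      {mem.percent} %")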

If you see high CPU, memory, or disk utilisation, please try these solutions:

  • Restart the notebook instance and try again.
  • Review your SageMaker notebook instance type to verify that it's properly scoped and configured for your jobs.

If a resource crunch is confirmed, switch to a larger instance type and check whether the issue is resolved. Before upsizing, you may also be able to reduce the merge's peak memory usage; see the sketch below.
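
As one possible alternative to a larger instance, here is a sketch of two common memory reductions for this kind of merge, assuming df1 and df2 from the question (the chunk size is an arbitrary starting point to tune against your memory budget):

    import pandas as pd

    # 1) Shrink the join key: a 64-bit ID column often fits in 32 bits or less.
    df1['ID'] = pd.to_numeric(df1['ID'], downcast='integer')
    df2['ID'] = pd.to_numeric(df2['ID'], downcast='integer')

    # 2) Merge the large frame in chunks so only one slice of the result is
    #    materialised at a time. A left merge is chunk-safe because each
    #    left-hand row is matched independently of the others.
    chunk_size = 500_000
    parts = []
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        parts.append(chunk.merge(df2, on=['ID', 'Name'], how='left'))

    df_merge = pd.concat(parts, ignore_index=True)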

[+] https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-troubleshoot-connectivity/

You may also refer to the following documentation on checking CPU, memory, and disk utilisation: https://aws.amazon.com/premiumsupport/knowledge-center/open-sagemaker-jupyter-notebook/#:~:text=High%20CPU%20or%20memory%20utilization

Please also try running the code cell by cell, if possible, to identify the issue at a more granular level and to confirm that the code itself works correctly; the sketch below shows one way to measure where the memory actually goes.
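
A small sketch of that granular check, assuming the df1 and df2 frames from the question (deep=True counts the Python string objects, which usually dominate):

    # Measure per-frame and per-column memory so the expensive step is obvious.
    for name, df in [('df1', df1), ('df2', df2)]:
        mb = df.memory_usage(deep=True).sum() / 1e6
        print(f"{name}: shape={df.shape}, ~{mb:.0f} MB")

    # A column-level breakdown often points at oversized object/string columns.
    print(df1.memory_usage(deep=True))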

Additionally, please try the following to fix the issue:

  • Close the active sessions and clear the browser cache/cookies. When you have a large number of active sessions, the kernel might take longer to load in the browser.
  • Open the SageMaker Notebook in a different browser. Check if the kernel connects successfully.
  • Restart your notebook instance.

I hope this helps!

AWS
answered a year ago
