Jupyter kernel dies on SageMaker notebook instance when running join operation on large DataFrames using pd.merge


I am running a large pandas merge (join) operation in a Jupyter notebook on a SageMaker notebook instance (ml.t3.large, i.e. 8 GB of memory).

    import pandas as pd

    df1 = pd.DataFrame({
                        'ID': [1, 2, 3],
                        'Name': ['A', 'B', 'C'],
                        ....
                       })

    df1.shape
    # (3000000, 10)

    df2 = pd.DataFrame({
                        'ID': [],
                        'Name': [],
                        ....
                       })

    df2.shape
    # (50000, 12)

    # Join data
    df_merge = pd.merge(
                         df1,
                         df2,
                         left_on=['ID', 'Name'],
                         right_on=['ID', 'Name'],
                         how='left'
                       )

When I run this operation, the kernel dies within a minute or so. How can I optimize this operation for memory efficiency?

  • Usually a kernel dies for one of two reasons: 1) it runs out of memory, or 2) there is a bug in the code or a library. Try running this with a subset of your dataset and see if it runs to completion without error; that would rule out a bug. Then choose an instance type with more memory (ml.t3.xlarge has 16 GB of RAM) and see if that is enough for your dataset.
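Beyond resizing the instance, the merge itself can be made lighter on memory. A minimal sketch (using tiny hypothetical stand-in frames, since the real columns are elided in the question): downcast low-cardinality string keys to `category` dtype, then merge the large frame in chunks so only one slice of the result is materialised at a time.

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-ins for the large frames in the question.
df1 = pd.DataFrame({"ID": np.arange(6), "Name": list("ABCABC"), "x": range(6)})
df2 = pd.DataFrame({"ID": [0, 1, 2], "Name": ["A", "B", "C"], "y": [10, 20, 30]})

# Shrink the string join key before merging: category dtype stores each
# distinct string once, which can cut memory substantially.
for df in (df1, df2):
    df["Name"] = df["Name"].astype("category")

# Merge the big frame in chunks so only one slice is in memory at a time.
chunk_size = 2
parts = []
for start in range(0, len(df1), chunk_size):
    chunk = df1.iloc[start:start + chunk_size]
    parts.append(chunk.merge(df2, on=["ID", "Name"], how="left"))

df_merge = pd.concat(parts, ignore_index=True)
```

Chunking trades speed for a lower peak footprint; the final `concat` still needs room for the full result, so writing each chunk to disk (e.g. Parquet) instead is an option if even that is too large.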

Asked 1 year ago · Viewed 868 times
1 Answer

Hello,

A SageMaker kernel can die because of high resource utilisation or an issue in the code or a third-party library.

Please check system resource utilisation to ensure that the operations are running at appropriate load levels.

To check SageMaker notebook instance resources, enter the following commands in the Notebook terminal:

  • To check memory utilisation: free -h
  • To check CPU utilisation: top
  • To check disk utilisation: df -h
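The same memory check can be run from inside the notebook itself. A sketch using the Python standard library's resource module (assuming a Linux instance, where ru_maxrss is reported in kilobytes):

```python
import resource

# Peak resident set size of this process (the notebook kernel).
# On Linux, ru_maxrss is in kilobytes; on macOS it is in bytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak memory used by this kernel: {peak_kb / 1024:.1f} MB")
```

Running this in a cell right after the merge attempt shows how close the kernel got to the instance's 8 GB before it was killed.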

If you see high CPU, memory, or disk utilisation, please try these solutions:

  • Restart the notebook instance and try again.
  • Review your SageMaker notebook instance type to verify that it's properly scoped and configured for your jobs.

If a resource crunch is observed, switch to a larger instance type and check whether the issue is resolved.

[+] https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-troubleshoot-connectivity/

You may also refer to the following documentation for checking CPU, memory, and disk utilisation: https://aws.amazon.com/premiumsupport/knowledge-center/open-sagemaker-jupyter-notebook/#:~:text=High%20CPU%20or%20memory%20utilization

Please try running the code cell by cell, if possible, to identify the issue at a more granular level and confirm that the code itself works.
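While isolating the code, it also helps to measure how much memory each DataFrame actually holds before merging; memory_usage(deep=True) counts the bytes of string columns too, not just the pointers (illustrated here with a small hypothetical frame):

```python
import pandas as pd

# Hypothetical small frame; substitute your real df1 and df2.
df1 = pd.DataFrame({"ID": range(1000), "Name": ["A"] * 1000})

# deep=True measures the actual storage of object (string) columns.
bytes_used = df1.memory_usage(deep=True).sum()
print(f"df1 uses {bytes_used / 1e6:.2f} MB")
```

If the two input frames together already approach the instance's RAM, the merged result (which duplicates matched rows) will not fit, regardless of how the merge is written.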

Additionally, please try the following to fix the issue:

  • Close the active sessions and clear the browser cache/cookies. When you have a large number of active sessions, the kernel might take longer to load in the browser.
  • Open the SageMaker Notebook in a different browser. Check if the kernel connects successfully.
  • Restart your notebook instance.

I hope this helps!

AWS
Answered 1 year ago
