Use instance (NVMe) storage in SageMaker Studio notebooks


I'm exploring and pre-processing some raw data in SageMaker Studio which is split across many (20k+) small files, and running into slow speeds because SMStudio's main user storage is backed by EFS rather than an EBS volume as used on SageMaker Notebook Instances (NBIs). Navigating and manipulating this dataset is slower in Studio because of the extra overhead of communicating at the filesystem level (EFS is mounted over NFS) rather than with a plain block storage device.

I know there's a little ephemeral block storage available to notebooks under /tmp which can help with these issues (as used here, in fact), but thought it would be more scalable to make use of the proper NVMe instance storage available with ml.m5d.* instances in Studio to work with bigger datasets.

Only trouble is, I'm not sure how to use these instance storage volume(s) from notebooks? When I run !df -aTh, the NVMe device only seems to be mounted on some very specific points as shown below:

Filesystem        Type      Size  Used Avail Use% Mounted on
[...] nfs4      8.0E   57G  8.0E   1% /root
/dev/nvme0n1p1    xfs       124G   14G  111G  11% /opt/.sagemakerinternal
/dev/nvme0n1p1    xfs       124G   14G  111G  11% /etc/resolv.conf
/dev/nvme0n1p1    xfs       124G   14G  111G  11% /etc/hostname
/dev/nvme0n1p1    xfs       124G   14G  111G  11% /etc/hosts
/dev/nvme0n1p1    xfs       124G   14G  111G  11% /var/log/studio
/dev/nvme0n1p1    xfs       124G   14G  111G  11% /var/log/apps
/dev/nvme0n1p1    xfs       124G   14G  111G  11% /opt/ml/metadata/resource-metadata.json

Should I be creating a new mount somehow to access the storage? Any particular best-practices to follow?

  • Did you find an answer? I'm keen to use the 8TB of instance storage on ml.p4d.24xlarge instances but am not sure how, if /tmp isn't preconfigured.

Asked 2 years ago · 270 views
1 Answer

This may not apply to SageMaker Studio notebooks, so it may not be a direct answer to the question, and the situation may have changed since the question was asked. As of now, though, the documentation says that for a SageMaker Training job, an instance type with NVMe storage uses that storage instead of allocating space on a gp2 EBS volume.

From Use Amazon SageMaker Training Storage Paths for Training Datasets, Checkpoints, Model Artifacts, and Outputs > Tips and Considerations for Setting Up Storage Paths:

  • When using an ML instance with NVMe SSD volumes, SageMaker doesn't provision Amazon EBS gp2 storage. Available storage is fixed to the NVMe-type instance's storage capacity. SageMaker configures storage paths for training datasets, checkpoints, model artifacts, and outputs to use the entire capacity of the instance storage. For example, ML instance families with the NVMe-type instance storage include ml.p4d, ml.g4dn, and ml.g5. When using an ML instance with the EBS-only storage option and without instance storage, you must define the size of EBS volume through the volume_size parameter in the SageMaker estimator class (or VolumeSizeInGB if you are using the ResourceConfig API). For example, ML instance families that use EBS volumes include ml.c5 and ml.p2. To look up instance types and their instance storage types and volumes, see Amazon EC2 Instance Types.
  • The default paths for SageMaker training jobs are mounted to Amazon EBS volumes or NVMe SSD volumes of the ML instance. When you adapt your training script to SageMaker, make sure that you use the default paths listed in the previous topic about SageMaker Environment Variables and Default Paths for Training Storage Locations. We recommend that you use the /tmp directory as a scratch space for temporarily storing any large objects during training. This means that you must not use directories that are mounted to small disk space allocated for system, such as /user and /home, to avoid out-of-space errors.
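As a rough sketch of the first tip above: the helper below builds keyword arguments that could be passed to a SageMaker estimator, leaving out volume_size for the NVMe families named in the quote. The function name training_storage_kwargs and the family set are hypothetical; the set only contains the three families the documentation lists as examples, so it is not exhaustive.

```python
# Families named as NVMe examples in the quoted documentation.
# Assumption: this set is illustrative only and NOT a complete list.
NVME_FAMILIES_FROM_DOCS = {"ml.p4d", "ml.g4dn", "ml.g5"}


def training_storage_kwargs(instance_type: str, volume_size_gb: int) -> dict:
    """Return estimator keyword arguments for the given instance type.

    Hypothetical helper: volume_size is only included for instance
    families that are not in the NVMe example set above.
    """
    # "ml.p4d.24xlarge" -> "ml.p4d"
    family = instance_type.rsplit(".", 1)[0]
    kwargs = {"instance_type": instance_type, "instance_count": 1}
    if family not in NVME_FAMILIES_FROM_DOCS:
        # EBS-only instance: the EBS volume size has to be stated.
        kwargs["volume_size"] = volume_size_gb
    return kwargs
```

For example, training_storage_kwargs("ml.c5.xlarge", 100) includes volume_size, while training_storage_kwargs("ml.p4d.24xlarge", 100) does not.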

I have confirmed this on ml.p4d.24xlarge instances with mount and df as follows:

mount (extract):

overlay on / type overlay (rw,relatime,lowerdir=/mnt/docker-data/overlay2/...)
/dev/mapper/sagemaker_vg-sagemaker_lv on /tmp type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/checkpoints type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/input type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/model type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/output type ext4 (rw,relatime,stripe=128)

df -h (extract):

Filesystem                             Size  Used Avail Use% Mounted on
overlay                                120G   28G   93G  23% /
/dev/mapper/sagemaker_vg-sagemaker_lv  6.8T   44M  6.5T   1% /tmp
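To repeat this check without reading the output by eye, the mount lines can be grouped by backing device. This is a minimal sketch: mount_points_by_device is a hypothetical helper, and it assumes the usual "<device> on <mount point> type <fstype> (<options>)" layout with no spaces in the mount point.

```python
from collections import defaultdict


def mount_points_by_device(mount_output: str) -> dict[str, list[str]]:
    """Group mount points by backing device from `mount`-style lines."""
    groups = defaultdict(list)
    for line in mount_output.splitlines():
        parts = line.split()
        # Expected shape: <device> on <mount point> type <fstype> (<options>)
        if len(parts) >= 5 and parts[1] == "on" and parts[3] == "type":
            groups[parts[0]].append(parts[2])
    return dict(groups)


# The mount extract from this answer, copied as-is.
sample = """\
overlay on / type overlay (rw,relatime,lowerdir=/mnt/docker-data/overlay2/...)
/dev/mapper/sagemaker_vg-sagemaker_lv on /tmp type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/checkpoints type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/input type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/model type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/output type ext4 (rw,relatime,stripe=128)
"""

for device, points in mount_points_by_device(sample).items():
    print(device, points)
```

On the extract above, /tmp and the four /opt/ml paths all end up under the same /dev/mapper/sagemaker_vg-sagemaker_lv device, with only / on the overlay.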

This suggests:

  • The 8× NVMe drives are being combined together into a single, large logical volume
  • SageMaker manages mounting said logical volume for you, with the space not needed for the container volumes made available for use for inputs, outputs, checkpoints, models and temporary storage via the appropriate mount points
  • An oversized primary volume / isn't as necessary on instances with NVMe storage as it is on EBS-only instances, as long as the training script is configured to use the correct paths

So, for storage of temporary data and suchlike on training jobs, writing to /tmp is sufficient to use the local NVMe instance storage.
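A minimal sketch of what that could look like in a training script: copy a directory of small files into a fresh folder under /tmp, after checking there is enough free space. The name stage_to_scratch and its parameters are hypothetical, and the free-space check is a simple best-effort comparison rather than a guarantee.

```python
import shutil
import tempfile
from pathlib import Path


def stage_to_scratch(src_dir, scratch_root="/tmp"):
    """Copy src_dir into a new temporary folder under scratch_root.

    Hypothetical helper. Returns the path of the staged copy.
    """
    src = Path(src_dir)
    # Total size of all regular files under the source directory.
    needed = sum(f.stat().st_size for f in src.rglob("*") if f.is_file())
    free = shutil.disk_usage(scratch_root).free
    if needed > free:
        raise OSError(
            f"Not enough space in {scratch_root}: need {needed} bytes, have {free}"
        )
    # A fresh directory avoids clashing with anything already in scratch_root.
    dest = Path(tempfile.mkdtemp(prefix="staged_", dir=scratch_root))
    target = dest / src.name
    shutil.copytree(src, target)
    return target
```

A script might call stage_to_scratch on its input directory once at start-up and read from the returned path afterwards. Anything written there is ephemeral and is lost when the job ends.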

Answered 3 months ago
Reviewed 23 days ago