1 Answer
This may not apply to SageMaker Studio Notebooks, so it may not be a direct answer to the question, and the situation may have changed since the question was asked. However, as of now the documentation says that for a SageMaker Training job, an instance type with NVMe storage uses that storage instead of allocating space on a gp2 EBS volume:
- When using an ML instance with NVMe SSD volumes, SageMaker doesn't provision Amazon EBS gp2 storage. Available storage is fixed to the NVMe-type instance's storage capacity. SageMaker configures storage paths for training datasets, checkpoints, model artifacts, and outputs to use the entire capacity of the instance storage. For example, ML instance families with the NVMe-type instance storage include ml.p4d, ml.g4dn, and ml.g5. When using an ML instance with the EBS-only storage option and without instance storage, you must define the size of EBS volume through the volume_size parameter in the SageMaker estimator class (or VolumeSizeInGB if you are using the ResourceConfig API). For example, ML instance families that use EBS volumes include ml.c5 and ml.p2. To look up instance types and their instance storage types and volumes, see Amazon EC2 Instance Types.
- The default paths for SageMaker training jobs are mounted to Amazon EBS volumes or NVMe SSD volumes of the ML instance. When you adapt your training script to SageMaker, make sure that you use the default paths listed in the previous topic about SageMaker Environment Variables and Default Paths for Training Storage Locations. We recommend that you use the /tmp directory as a scratch space for temporarily storing any large objects during training. This means that you must not use directories that are mounted to small disk space allocated for system, such as /user and /home, to avoid out-of-space errors.
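To make the quoted documentation concrete: on an EBS-only instance family you size the gp2 volume through `volume_size` on the estimator. A minimal sketch using the SageMaker Python SDK — the image URI, role ARN, and instance choices here are placeholders, not values from the question:

```python
from sagemaker.estimator import Estimator

# Placeholder values -- substitute your own training image and execution role.
estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role="<your-sagemaker-execution-role-arn>",
    instance_count=1,
    # EBS-only family (e.g. ml.c5): volume_size sets the gp2 volume, in GB.
    instance_type="ml.c5.xlarge",
    volume_size=100,
)
# On an NVMe-backed family such as ml.p4d or ml.g5, volume_size is not used;
# the instance storage capacity is what you get.
```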
I have confirmed this on `ml.p4d.24xlarge` instances with `mount` and `df`, as follows.

`mount` (extract):

```
...
overlay on / type overlay (rw,relatime,lowerdir=/mnt/docker-data/overlay2/...)
/dev/mapper/sagemaker_vg-sagemaker_lv on /tmp type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/checkpoints type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/input type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/model type ext4 (rw,relatime,stripe=128)
/dev/mapper/sagemaker_vg-sagemaker_lv on /opt/ml/output type ext4 (rw,relatime,stripe=128)
...
```
`df -h` (extract):

```
Filesystem                             Size  Used Avail Use% Mounted on
overlay                                120G   28G   93G  23% /
/dev/mapper/sagemaker_vg-sagemaker_lv  6.8T   44M  6.5T   1% /tmp
```
This suggests:
- The 8× NVMe drives are combined into a single, large logical volume.
- SageMaker manages mounting that logical volume for you, with the space not needed for the container volumes made available for inputs, outputs, checkpoints, models, and temporary storage via the appropriate mount points.
- An oversized primary volume (`/`) isn't as necessary on instances with NVMe storage as it is on EBS-only instances, as long as the training script is configured to use the correct paths.
So, for storage of temporary data and suchlike in training jobs, writing to `/tmp` is sufficient to use the local NVMe instance storage.
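In a training script, that pattern might look like the following sketch: create a scratch directory under `/tmp`, check the available capacity first, and clean up afterwards. This is generic Python that runs on any POSIX host; nothing here is SageMaker-specific beyond the choice of `/tmp` as the scratch root.

```python
import os
import shutil
import tempfile

# On an NVMe-backed SageMaker training instance, /tmp sits on the large
# sagemaker_vg-sagemaker_lv logical volume, so it is a safe scratch area.
scratch_dir = tempfile.mkdtemp(dir="/tmp")

# Check free space before writing large intermediate files.
usage = shutil.disk_usage(scratch_dir)
print(f"/tmp free space: {usage.free / 1e9:.1f} GB")

# Write a temporary artifact, then remove the whole scratch directory.
path = os.path.join(scratch_dir, "intermediate.bin")
with open(path, "wb") as f:
    f.write(b"\0" * 1024)
shutil.rmtree(scratch_dir)
```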
answered 8 months ago
Did you find an answer? I'm keen to use the 8 TB of instance storage on `ml.p4d.24xlarge` instances but am not sure how, if `/tmp` isn't preconfigured.