
Questions tagged with AWS Deep Learning AMIs



1 answer · 0 votes · 2 views · asked 2 months ago

Not able to clone a conda environment on deep learning amis

I am trying to clone a conda TensorFlow environment on an EC2 instance, but I am unable to do so. I need this because I am installing some libraries that are not compatible with my current environment. I have a working environment now and I want to keep it for running other experiments while I meddle with the cloned version and the new library installations. I get the errors below when trying to clone:

```
conda create --name tensorflow2_p38_clone --clone tensorflow2_p38
Source:      /home/ubuntu/anaconda3/envs/tensorflow2_p38
Destination: /home/ubuntu/anaconda3/envs/tensorflow2_p38_clone
Packages: 445
Files: 53101

Preparing transaction: done
Verifying transaction: /
SafetyError: The package for nb_conda located at /home/ubuntu/anaconda3/pkgs/nb_conda-2.2.1-py38h578d9bd_4
appears to be corrupted. The path 'lib/python3.8/site-packages/nb_conda/envmanager.py'
has an incorrect size.
  reported size: 6179 bytes
  actual size: 6209 bytes

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: conda-forge/linux-64::pyqt5-sip-4.19.18-py38h709712a_8, conda-forge/linux-64::sip-4.19.25-py38h709712a_1
  path: 'bin/sip'

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: conda-forge/linux-64::pyqt5-sip-4.19.18-py38h709712a_8, conda-forge/linux-64::sip-4.19.25-py38h709712a_1
  path: 'include/python3.8/sip.h'

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: conda-forge/linux-64::pyqt5-sip-4.19.18-py38h709712a_8, conda-forge/linux-64::sip-4.19.25-py38h709712a_1
  path: 'lib/python3.8/site-packages/__pycache__/sipconfig.cpython-38.pyc'

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: conda-forge/linux-64::pyqt5-sip-4.19.18-py38h709712a_8, conda-forge/linux-64::sip-4.19.25-py38h709712a_1
  path: 'lib/python3.8/site-packages/__pycache__/sipdistutils.cpython-38.pyc'

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: conda-forge/linux-64::pyqt5-sip-4.19.18-py38h709712a_8, conda-forge/linux-64::sip-4.19.25-py38h709712a_1
  path: 'lib/python3.8/site-packages/sipconfig.py'

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: conda-forge/linux-64::pyqt5-sip-4.19.18-py38h709712a_8, conda-forge/linux-64::sip-4.19.25-py38h709712a_1
  path: 'lib/python3.8/site-packages/sipdistutils.py'

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: conda-forge/noarch::jupyter_client-7.0.6-pyhd8ed1ab_0, conda-forge/noarch::nbclient-0.5.8-pyhd8ed1ab_0
  path: 'bin/jupyter-run'

done
Executing transaction: |
For Linux 64, Open MPI is built with CUDA awareness but this support is disabled by default.
To enable it, please set the environment variable OMPI_MCA_opal_cuda_support=true before
launching your MPI processes. Equivalently, you can set the MCA parameter in the command line:
mpiexec --mca opal_cuda_support 1 ...

In addition, the UCX support is also built but disabled by default. To enable it, first
install UCX (conda install -c conda-forge ucx). Then, set the environment variables
OMPI_MCA_pml="ucx" OMPI_MCA_osc="ucx" before launching your MPI processes. Equivalently,
you can set the MCA parameters in the command line:
mpiexec --mca pml ucx --mca osc ucx ...
Note that you might also need to set UCX_MEMTYPE_CACHE=n for CUDA awareness via UCX.
Please consult UCX's documentation for detail.

- Config option `kernel_spec_manager_class` not recognized by `EnableNBExtensionApp`.
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: OK
\ Enabling nb_conda_kernels...
CONDA_PREFIX: /home/ubuntu/anaconda3/envs/tensorflow2_p38_clone
Status: enabled
/ Config option `kernel_spec_manager_class` not recognized by `EnableNBExtensionApp`.
Enabling notebook extension nb_conda/main...
- Validating: OK
Enabling tree extension nb_conda/tree...
- Validating: OK
Config option `kernel_spec_manager_class` not recognized by `EnableServerExtensionApp`.
Enabling: nb_conda
- Writing config: /home/ubuntu/anaconda3/envs/tensorflow2_p38_clone/etc/jupyter
- Validating...
    nb_conda 2.2.1 OK
```

I had the same issue previously on a different EC2 instance as well, which I ignored at the time.
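One direction an answer might take (a sketch only, not from the original post; the spec file name is arbitrary and the cache path is taken from the SafetyError above) is to remove the corrupted cached package and rebuild the environment from an explicit package list instead of using `--clone`:

```
# Hypothetical workaround sketch, not from the original question:
# 1) delete the corrupted cached package reported by the SafetyError so conda re-downloads it,
# 2) dump the exact package URLs of the working environment,
# 3) recreate a new environment from that explicit spec instead of cloning file-by-file.
rm -rf /home/ubuntu/anaconda3/pkgs/nb_conda-2.2.1-py38h578d9bd_4*    # corrupted cache entry
conda list --explicit -n tensorflow2_p38 > tf2_p38_spec.txt          # exact package URLs of the working env
conda create --name tensorflow2_p38_clone --file tf2_p38_spec.txt    # rebuild from the spec
```

The SafetyError and ClobberError lines are generally warnings rather than hard failures, and the transcript above does reach the "Executing transaction" phase, so it may also be worth checking with `conda env list` whether the clone was in fact created despite the messages.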
0 answers · 0 votes · 4 views · asked 3 months ago

How do I make an ECS cluster spawn GPU instances with more root volume than default?

I need to deploy an ML app that needs GPU access for its response times to be acceptable (it uses some heavy networks that run too slowly on CPU). The app is containerized and uses an nvidia/cuda base image so that it can make use of its host machine's GPU. The image alone weighs ~10 GB, and during startup it pulls several ML models and data sets that take up roughly another ~10 GB of disk.

We were previously running this app on Elastic Beanstalk, but we realized it doesn't support GPU usage, even when specifying a Deep Learning AMI, so we migrated to ECS, which provides more configurability than the former. However, we soon ran into a new problem: **selecting a g4dn instance type when creating a cluster, which defaults the AMI to an ECS GPU one, turns the Root EBS Volume Size field into a Data EBS Volume Size field.** This causes the instance's 22 GB root volume (the only one that comes formatted and mounted) to be too small for pulling our image and downloading the data it needs during startup. The other volume (of whatever size I specify in the new Data EBS Volume Size field during creation) is not mounted and therefore not accessible by the container. Additionally, g4dn instances come with a 125 GB SSD that is not mounted either. If either of these were usable, or if it were possible to enlarge the root volume (which it is when using the default non-GPU AMI), ECS would be the perfect solution for us at this time.

At the moment we work around this issue by creating an *empty* cluster in ECS and then manually creating and attaching an Auto Scaling group to it, since when using a Launch Configuration or Launch Template the root volume's size can be specified correctly, even when using the exact same ECS GPU AMI that ECS does. However, this is a tiresome process, and it makes us lose valuable ECS functionality such as automatically spawning a new instance during a rolling update to maintain capacity.

Am I missing something here? Is this a bug that will be fixed at some point? If it's not, is there a simpler way to achieve what I need, maybe by specifying a custom launch configuration for the ECS cluster or by automatically mounting the SSD on instance launch? Any help is more than appreciated. Thanks in advance!
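One shape an answer might take (a sketch only; the template name, cluster name, instance type, and 100 GB volume size below are placeholders, not from the question) is to keep the ECS GPU-optimized AMI but set the root volume size yourself in a launch template via the AWS CLI, then back the cluster with an Auto Scaling group built from that template:

```
# Hypothetical sketch: launch template with an enlarged root volume for the ECS GPU AMI.
# "gpu-ecs-lt" and "my-gpu-cluster" are placeholder names.

# 1) Resolve the current ECS GPU-optimized AMI (Amazon Linux 2) from its public SSM parameter.
AMI_ID=$(aws ssm get-parameters \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id \
  --query 'Parameters[0].Value' --output text)

# 2) User data so instances launched from the template join the cluster.
cat > userdata.sh <<'EOF'
#!/bin/bash
echo "ECS_CLUSTER=my-gpu-cluster" >> /etc/ecs/ecs.config
EOF

# 3) Launch template with a 100 GB gp3 root volume.
#    /dev/xvda is assumed to be this AMI's root device name; verify with
#    `aws ec2 describe-images --image-ids "$AMI_ID"` before relying on it.
aws ec2 create-launch-template \
  --launch-template-name gpu-ecs-lt \
  --launch-template-data "{
    \"ImageId\": \"$AMI_ID\",
    \"InstanceType\": \"g4dn.xlarge\",
    \"BlockDeviceMappings\": [
      {\"DeviceName\": \"/dev/xvda\",
       \"Ebs\": {\"VolumeSize\": 100, \"VolumeType\": \"gp3\"}}
    ],
    \"UserData\": \"$(base64 -w0 userdata.sh)\"
  }"
```

An Auto Scaling group created from this template can then be registered with the cluster as an ECS capacity provider, which should restore managed scaling behaviour (such as maintaining capacity during rolling updates) that a hand-attached group lacks.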
0 answers · 0 votes · 4 views · asked 4 months ago