Which CUDA version exists on p2.xlarge when I use it in Sagemaker?


Hi,
On my host machine (NVIDIA GeForce GTX 1080 Ti GPU, CUDA 9.0, Ubuntu 16.04):

I have built my Docker image with:

  1. NVIDIA CUDA 9.0 + cuDNN 7 devel image (base image)
  2. installed Python 2.7 and Darknet YOLO (from the PJReddie site)
  3. built Darknet from source with GPU=1, OPENMP=1, CUDNN=1
  4. deployed a Flask-based RESTful service that returns predictions for images (object labels and bounding boxes)

Using this Docker image, I:

  1. trained the Darknet model on my image data and produced a weights file (yolov2.backup)
  2. tested the model with test images, and it works
  3. confirmed that the Darknet YOLO model uses the host machine's GPU

Then, to use this custom image for inference in SageMaker, I did the following:

  1. pushed the Docker image to ECR and created the Model, Endpoint Configuration, and Endpoint in SageMaker
  2. endpoint configuration: p2.xlarge, min instances = 1, max instances = 1
  3. the /ping and /invocations requests ran without error end to end
  4. the RESTful client was Python code running in a SageMaker Jupyter notebook (a minimal sketch of the invocation is shown after this list)
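
For reference, the notebook client is essentially doing the following (a minimal sketch; the endpoint name and content type are placeholders for my actual values):

    import boto3

    # Minimal sketch of the notebook client; endpoint name and content type are
    # placeholders for my actual values.
    runtime = boto3.client("sagemaker-runtime")

    with open("Test.png", "rb") as f:
        payload = f.read()

    response = runtime.invoke_endpoint(
        EndpointName="darknet-yolo-endpoint",   # placeholder
        ContentType="application/x-image",      # placeholder
        Body=payload,
    )

    # The service returns an image; save it to inspect the bounding boxes.
    with open("Test_prediction.png", "wb") as f:
        f.write(response["Body"].read())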

The issue I am facing:

No predictions are generated during execution. The web service returns an image in the response when called via Python's 'requests' API, but the returned image does not show any bounding boxes. I suspect that the CUDA setup on the P2 instance does not match the CUDA 9.0 in my Docker image. My understanding is that, for CUDA to run properly, the CUDA version inside the container and the driver on the host must be compatible.
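
To verify this, I plan to log the driver and CUDA toolkit versions from inside the container at startup. A rough sketch of what I have in mind, assuming nvidia-smi and nvcc are on the container's PATH:

    from __future__ import print_function
    import subprocess
    import sys

    def log_cuda_info():
        # Print the driver and CUDA toolkit versions so they end up in the
        # container's stdout (and, as I understand it, in CloudWatch).
        for cmd in (["nvidia-smi"], ["nvcc", "--version"]):
            try:
                out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
                print(out.decode("utf-8"))
            except (OSError, subprocess.CalledProcessError) as exc:
                print("Could not run {0}: {1}".format(" ".join(cmd), exc))
        sys.stdout.flush()  # make sure nothing is held back in a buffer

    log_cuda_info()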

When I test the web service and Darknet predictions inside the Docker container on my local host machine, everything runs fine, but not on SageMaker.

Questions:

  1. Can you please tell me which CUDA version exists on the p2.xlarge instance, so that I can build my Docker image accordingly?
  2. Please suggest a way to print debug information to the CloudWatch logs from my Docker image, so that I can see what is going on inside the container in the SageMaker environment. The print statements in my Flask web service never show up, so each time I need to debug I build a new Docker image; a better approach would save a lot of time. (A rough sketch of what I have in mind is shown below the questions.)
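
For example, something along these lines is what I would like to do in my Flask service, on the assumption that SageMaker forwards the container's stdout/stderr to the endpoint's CloudWatch log stream (the route bodies are just placeholders for my existing code):

    from __future__ import print_function
    import logging
    import sys

    from flask import Flask, Response, request

    # Log to stderr so the output is not lost in a stdout buffer; my understanding
    # is that SageMaker forwards the container's stdout/stderr to CloudWatch.
    logging.basicConfig(stream=sys.stderr, level=logging.INFO)
    logger = logging.getLogger("darknet-inference")

    app = Flask(__name__)

    @app.route("/ping", methods=["GET"])
    def ping():
        return Response(status=200)

    @app.route("/invocations", methods=["POST"])
    def invocations():
        logger.info("Received %d bytes", len(request.data))
        # ... run the Darknet prediction here and log the raw detections ...
        return Response(status=200)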

Thanks

asked 5 years ago · 2781 views
5 Answers

Hello SachinAWS,

I believe that the CUDA version on the p2s should be CUDA 9.0. Our deep learning containers currently utilize CUDA 9.0 and I believe they run fine on the p2. Here is our GPU TensorFlow image that uses CUDA 9.0: https://github.com/aws/sagemaker-tensorflow-container/blob/master/docker/1.12.0/final/py2/Dockerfile.gpu#L2

I will reach out to the team that manages the host for your inferences on SageMaker to confirm the CUDA version.

answered 5 years ago

According to the team, the CUDA version is determined at the container level, while the drivers are installed on the host itself.

As for debugging, please consider using the Python SDK with local mode. This spins up your Docker container locally in much the same way it is served within SageMaker. You can then inspect the container with docker logs, and you avoid the latency of waiting for serving instances to be provisioned. A minimal sketch of local-mode deployment is shown below the links.

https://github.com/aws/sagemaker-python-sdk#local-mode
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_local_mode_mnist.ipynb
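
As a rough sketch (the image URI and role below are placeholders, and the exact argument names vary between SDK versions), local-mode deployment of a bring-your-own container could look like this:

    from sagemaker.local import LocalSession
    from sagemaker.model import Model

    # Local mode runs the container on your own machine (docker-compose under the
    # hood), so you can debug with `docker logs` instead of redeploying an endpoint.
    session = LocalSession()

    model = Model(
        image_uri="<account>.dkr.ecr.<region>.amazonaws.com/darknet-yolo:latest",  # placeholder
        role="arn:aws:iam::<account>:role/<your-sagemaker-role>",                  # placeholder
        sagemaker_session=session,
    )

    # "local" runs on the CPU; "local_gpu" uses the GPU if nvidia-docker is available.
    model.deploy(initial_instance_count=1, instance_type="local_gpu")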

Please let me know if there is anything to clarify.

answered 5 years ago

Hi Daniel,
Thank you for your response. Even though my issue is still unresolved, I am now trying out model training inside SageMaker itself and will use the generated model for inference. I strongly hope this fixes the problem during inference.

FYI, here is a sample inference log from the Darknet YOLO model on SageMaker.
(Notice that it loads the model weights file but prints neither predictions nor any error. The prediction time of 0.035261 seconds also suggests the GPU was used, which further indicates this is not a SageMaker problem.)

sample output:

layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   608 x 608 x   3   ->   608 x 608 x  32  0.639 BFLOPs
    1 max          2 x 2 / 2   608 x 608 x  32   ->   304 x 304 x  32
    2 conv     64  3 x 3 / 1   304 x 304 x  32   ->   304 x 304 x  64  3.407 BFLOPs
    3 max          2 x 2 / 2   304 x 304 x  64   ->   152 x 152 x  64
    4 conv    128  3 x 3 / 1   152 x 152 x  64   ->   152 x 152 x 128  3.407 BFLOPs
    5 conv     64  1 x 1 / 1   152 x 152 x 128   ->   152 x 152 x  64  0.379 BFLOPs
    6 conv    128  3 x 3 / 1   152 x 152 x  64   ->   152 x 152 x 128  3.407 BFLOPs
    7 max          2 x 2 / 2   152 x 152 x 128   ->    76 x  76 x 128
    8 conv    256  3 x 3 / 1    76 x  76 x 128   ->    76 x  76 x 256  3.407 BFLOPs
    9 conv    128  1 x 1 / 1    76 x  76 x 256   ->    76 x  76 x 128  0.379 BFLOPs
   10 conv    256  3 x 3 / 1    76 x  76 x 128   ->    76 x  76 x 256  3.407 BFLOPs
   11 max          2 x 2 / 2    76 x  76 x 256   ->    38 x  38 x 256
   12 conv    512  3 x 3 / 1    38 x  38 x 256   ->    38 x  38 x 512  3.407 BFLOPs
   13 conv    256  1 x 1 / 1    38 x  38 x 512   ->    38 x  38 x 256  0.379 BFLOPs
   14 conv    512  3 x 3 / 1    38 x  38 x 256   ->    38 x  38 x 512  3.407 BFLOPs
   15 conv    256  1 x 1 / 1    38 x  38 x 512   ->    38 x  38 x 256  0.379 BFLOPs
   16 conv    512  3 x 3 / 1    38 x  38 x 256   ->    38 x  38 x 512  3.407 BFLOPs
   17 max          2 x 2 / 2    38 x  38 x 512   ->    19 x  19 x 512
   18 conv   1024  3 x 3 / 1    19 x  19 x 512   ->    19 x  19 x1024  3.407 BFLOPs
   19 conv    512  1 x 1 / 1    19 x  19 x1024   ->    19 x  19 x 512  0.379 BFLOPs
   20 conv   1024  3 x 3 / 1    19 x  19 x 512   ->    19 x  19 x1024  3.407 BFLOPs
   21 conv    512  1 x 1 / 1    19 x  19 x1024   ->    19 x  19 x 512  0.379 BFLOPs
   22 conv   1024  3 x 3 / 1    19 x  19 x 512   ->    19 x  19 x1024  3.407 BFLOPs
   23 conv   1024  3 x 3 / 1    19 x  19 x1024   ->    19 x  19 x1024  6.814 BFLOPs
   24 conv   1024  3 x 3 / 1    19 x  19 x1024   ->    19 x  19 x1024  6.814 BFLOPs
   25 route  16
   26 conv     64  1 x 1 / 1    38 x  38 x 512   ->    38 x  38 x  64  0.095 BFLOPs
   27 reorg              / 2    38 x  38 x  64   ->    19 x  19 x 256
   28 route  27 24
   29 conv   1024  3 x 3 / 1    19 x  19 x1280   ->    19 x  19 x1024  8.517 BFLOPs
   30 conv     75  1 x 1 / 1    19 x  19 x1024   ->    19 x  19 x  75  0.055 BFLOPs
   31 detection
mask_scale: Using default '1.000000'
Loading weights from backup/yolov2.backup...Done!
/root/darknet/data2/Test.png: Predicted in 0.035261 seconds.

Expected output (in addition to the output above):

object1: 66%
object1: 65%
object2: 56%
object3: 74%
object3: 61%
object4: 92%

Edited by: SachinAws on Feb 20, 2019 5:31 AM

answered 5 years ago

Thank you for the answers

answered 5 years ago

Sharing my observations:

  1. The Darknet YOLO model is quite dependent on the GPU environment in which it was trained; that is, the model generally gives reliable predictions only in the environment where it was trained.
  2. Darknet's yolo.cfg file contains a few parameters that must be set differently for training and for testing (inference). The Docker image should have the parameters appropriate for the phase it is used in enabled; otherwise the model produces no predictions and prints no error. (A small sketch of how I switch these parameters is shown below the list.)
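
For example, in my setup I switch the standard yolov2.cfg between training and inference settings with a small script like the one below (this assumes the usual batch/subdivisions layout from the PJReddie cfg files):

    import re

    def set_inference_params(cfg_in="yolov2.cfg", cfg_out="yolov2-test.cfg"):
        # Force batch=1 and subdivisions=1, the usual settings for inference;
        # the training values (e.g. batch=64) are left in the original file.
        with open(cfg_in) as f:
            cfg = f.read()
        cfg = re.sub(r"^batch\s*=\s*\d+", "batch=1", cfg, flags=re.MULTILINE)
        cfg = re.sub(r"^subdivisions\s*=\s*\d+", "subdivisions=1", cfg, flags=re.MULTILINE)
        with open(cfg_out, "w") as f:
            f.write(cfg)

    if __name__ == "__main__":
        set_inference_params()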

Thank you.

answered 5 years ago
