We are using tensorflow.neuron to compile a TensorFlow 1.x SavedModel so that it can run on AWS Inferentia (Inf1) EC2 instances. We do this by calling:
tensorflow.neuron.saved_model.compile(model_dir, compiled_model_dir)
36 subgraphs compile successfully, while two produce warnings:
WARNING:tensorflow:Failed to fuse subgraph {subgraph neuron_op_cd169bb88475d5d0 with input tensors ["<tf.Tensor 'encoder/c0/_52:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/c_70/_53:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/h0/_54:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/c_40/_55:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/h_10/_56:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/c_20/_57:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/c_50/_58:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/c_10/_59:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/c_60/_60:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/h_60/_61:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/h_50/_62:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/h_30/_63:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'downsampler2/3/add0/_64:0' shape=(1, 16, 176, 128) dtype=float32>", "<tf.Tensor 'downsampler2/3/elu/Elu0/_65:0' shape=(1, 16, 176, 128) dtype=float32>", "<tf.Tensor 'encoder/c_30/_66:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/h_40/_67:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/h_70/_68:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/h_20/_69:0' shape=(1, 88, 128) dtype=float32>"], output tensors ["<tf.Tensor 'encoder/stack:0' shape=(1, 16, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_0/cell_0/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_0/cell_0/conv_lstm_cell/mul_127:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_1/cell_1/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_1/cell_1/conv_lstm_cell/mul_127:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 
'encoder/encoder/multi_rnn_cell/cell_2/cell_2/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_2/cell_2/conv_lstm_cell/mul_127:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_3/cell_3/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_3/cell_3/conv_lstm_cell/mul_127:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_4/cell_4/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_4/cell_4/conv_lstm_cell/mul_127:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_5/cell_5/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_5/cell_5/conv_lstm_cell/mul_127:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_6/cell_6/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_6/cell_6/conv_lstm_cell/mul_127:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_7/cell_7/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0' shape=(1, 88, 128) dtype=float32>", "<tf.Tensor 'encoder/encoder/multi_rnn_cell/cell_7/cell_7/conv_lstm_cell/mul_127:0' shape=(1, 88, 128) dtype=float32>"]} with '/home/ec2-user/tensorflow_venv/bin/neuron-cc compile /tmp/tmpqa4twpbj/neuron_op_cd169bb88475d5d0/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpqa4twpbj/neuron_op_cd169bb88475d5d0/graph_def.neff --io-config "{\"inputs\": {\"encoder/c0/_52:0\": [[1, 88, 128], \"float32\"], \"encoder/c_70/_53:0\": [[1, 88, 128], \"float32\"], \"encoder/h0/_54:0\": [[1, 88, 128], \"float32\"], \"encoder/c_40/_55:0\": 
[[1, 88, 128], \"float32\"], \"encoder/h_10/_56:0\": [[1, 88, 128], \"float32\"], \"encoder/c_20/_57:0\": [[1, 88, 128], \"float32\"], \"encoder/c_50/_58:0\": [[1, 88, 128], \"float32\"], \"encoder/c_10/_59:0\": [[1, 88, 128], \"float32\"], \"encoder/c_60/_60:0\": [[1, 88, 128], \"float32\"], \"encoder/h_60/_61:0\": [[1, 88, 128], \"float32\"], \"encoder/h_50/_62:0\": [[1, 88, 128], \"float32\"], \"encoder/h_30/_63:0\": [[1, 88, 128], \"float32\"], \"downsampler2/3/add0/_64:0\": [[1, 16, 176, 128], \"float32\"], \"downsampler2/3/elu/Elu0/_65:0\": [[1, 16, 176, 128], \"float32\"], \"encoder/c_30/_66:0\": [[1, 88, 128], \"float32\"], \"encoder/h_40/_67:0\": [[1, 88, 128], \"float32\"], \"encoder/h_70/_68:0\": [[1, 88, 128], \"float32\"], \"encoder/h_20/_69:0\": [[1, 88, 128], \"float32\"]}, \"outputs\": [\"encoder/stack:0\", \"encoder/encoder/multi_rnn_cell/cell_0/cell_0/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0\", \"encoder/encoder/multi_rnn_cell/cell_0/cell_0/conv_lstm_cell/mul_127:0\", \"encoder/encoder/multi_rnn_cell/cell_1/cell_1/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0\", \"encoder/encoder/multi_rnn_cell/cell_1/cell_1/conv_lstm_cell/mul_127:0\", \"encoder/encoder/multi_rnn_cell/cell_2/cell_2/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0\", \"encoder/encoder/multi_rnn_cell/cell_2/cell_2/conv_lstm_cell/mul_127:0\", \"encoder/encoder/multi_rnn_cell/cell_3/cell_3/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0\", \"encoder/encoder/multi_rnn_cell/cell_3/cell_3/conv_lstm_cell/mul_127:0\", \"encoder/encoder/multi_rnn_cell/cell_4/cell_4/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0\", \"encoder/encoder/multi_rnn_cell/cell_4/cell_4/conv_lstm_cell/mul_127:0\", \"encoder/encoder/multi_rnn_cell/cell_5/cell_5/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0\", \"encoder/encoder/multi_rnn_cell/cell_5/cell_5/conv_lstm_cell/mul_127:0\", \"encoder/encoder/multi_rnn_cell/cell_6/cell_6/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0\", 
\"encoder/encoder/multi_rnn_cell/cell_6/cell_6/conv_lstm_cell/mul_127:0\", \"encoder/encoder/multi_rnn_cell/cell_7/cell_7/conv_lstm_cell/LayerNorm_79/batchnorm/add_1:0\", \"encoder/encoder/multi_rnn_cell/cell_7/cell_7/conv_lstm_cell/mul_127:0\"]}" --neuroncore-pipeline-cores 4'
WARNING:tensorflow:Failed to fuse subgraph {subgraph neuron_op_1d250d6fb9ac33f with input tensors ["<tf.Tensor 'upsampler1/GatherV20/_77:0' shape=(1, 32, 88, 128) dtype=float32>"], output tensors ["<tf.Tensor 'upsampler1/LayerNorm/batchnorm/add_1:0' shape=(1, 32, 88, 128) dtype=float32>", "<tf.Tensor 'upsampler1/0/downsample/BiasAdd:0' shape=(1, 32, 88, 32) dtype=float32>"]} with '/home/ec2-user/tensorflow_venv/bin/neuron-cc compile /tmp/tmpqa4twpbj/neuron_op_1d250d6fb9ac33f/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpqa4twpbj/neuron_op_1d250d6fb9ac33f/graph_def.neff --io-config "{\"inputs\": {\"upsampler1/GatherV20/_77:0\": [[1, 32, 88, 128], \"float32\"]}, \"outputs\": [\"upsampler1/LayerNorm/batchnorm/add_1:0\", \"upsampler1/0/downsample/BiasAdd:0\"]}" --neuroncore-pipeline-cores 4'
These two subgraphs take around five hours to compile and use an immense amount of RAM in the process (80-100 GB). This is surprising, since the model itself is a fairly small ConvLSTM of around 62 MB (17 MB when optimized for inference). The compilation does eventually finish, but when we test the resulting model on Inf1 the performance is terrible: more than twice as slow as the same model on Amazon Elastic Inference.
Looking closer at the logs, I noticed that only a small subset of operations is actually placed on Neuron:
2024-04-03 09:07:54.612207: I tensorflow/neuron/grappler/convert/segment.cc:456] There are 56 ops of 4 different types in the graph that are not compiled by neuron-cc: GatherV2, Elu, NoOp, Placeholder, (For more information see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/neuron-cc-ops/neuron-cc-ops-tensorflow.html).
2024-04-03 09:08:24.373061: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:786] Optimization results for grappler item: graph_to_optimize
2024-04-03 09:08:24.373111: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:788] aws_neuron_static_shape_inference: Graph size after: 13757 nodes (0), 19247 edges (0), time = 242.094ms.
2024-04-03 09:08:24.373117: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:788] aws_neuron_fuse_supported_operators: Graph size after: 111 nodes (-13646), 158 edges (-19089), time = 29833.4043ms.
INFO:tensorflow:Number of operations in TensorFlow session: 13757
INFO:tensorflow:Number of operations after tf.neuron optimizations: 13758
INFO:tensorflow:Number of operations placed on Neuron runtime: 487
As you can see, there are 56 unsupported operations in the graph, which seems insignificant out of a total of 13757 operations, and yet only 487 operations end up placed on the Neuron runtime. Why is that?
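To make the mismatch concrete, here is a quick sanity check of the proportions, using only the counts from the log lines above (the variable names are ours, and the arithmetic is purely illustrative):

```python
# Counts taken from the compilation log above.
total_ops = 13757        # operations in the TensorFlow session
unsupported_ops = 56     # ops neuron-cc cannot compile (GatherV2, Elu, NoOp, Placeholder)
placed_ops = 487         # operations placed on the Neuron runtime

# If only the unsupported ops were kept off the device, we would expect
# nearly the whole graph to be placed on Neuron.
expected_placed = total_ops - unsupported_ops
print(f"expected placement: {expected_placed / total_ops:.1%}")  # ~99.6%
print(f"actual placement:   {placed_ops / total_ops:.1%}")       # ~3.5%
```

So even allowing for every unsupported op to pull some neighbors off the device, the gap between roughly 99.6% expected and 3.5% actual placement is what we cannot explain.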
We also compiled an unoptimized version of the model for 1 and 4 Neuron cores, with similar results:
Unoptimized 4x neuron cores:
INFO:tensorflow:Number of operations in TensorFlow session: 187829
INFO:tensorflow:Number of operations after tf.neuron optimizations: 13967
INFO:tensorflow:Number of operations placed on Neuron runtime: 578
Unoptimized 1x neuron cores:
INFO:tensorflow:Number of operations in TensorFlow session: 187829
INFO:tensorflow:Number of operations after tf.neuron optimizations: 13967
INFO:tensorflow:Number of operations placed on Neuron runtime: 602
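The pattern holds across all three builds. A small script summarizing the placement fractions (counts copied from the logs above; labels are ours), measured against the post-optimization graph, since that is the graph the fuser actually partitions:

```python
# (session ops, ops after tf.neuron optimizations, ops placed on Neuron)
builds = {
    "optimized, 4 cores":   (13757, 13758, 487),
    "unoptimized, 4 cores": (187829, 13967, 578),
    "unoptimized, 1 core":  (187829, 13967, 602),
}

for name, (session_ops, optimized_ops, placed) in builds.items():
    # Fraction of the optimized graph that ends up on the Neuron runtime.
    print(f"{name}: {placed / optimized_ops:.1%} of ops placed on Neuron")
```

In every configuration only about 3-4% of the optimized graph lands on Neuron, regardless of graph optimization or core count.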
Why are so few operations making it onto the Neuron runtime?
Regards,
Patrik Ohlsson
Doremir Music Research
Thank you for your answer! We've submitted a support ticket and will hopefully be able to provide the information needed to solve this issue.