Redis connection timeout...


Hi,

I'm trying to train a DeepRacer model using the vanilla GitHub setup (including the Redis setup in training_worker.py). Redis starts up fine every time, but I get a Redis connection error when the simulation completes the first episode of 20 runs. The SageMaker startup logs and the RoboMaker error logs are below. I have searched high and low to find out what is going wrong with no luck, so any assistance is appreciated!

Excerpt of the SageMaker CloudWatch logs on startup related to Redis:

22:14:50 Invoking script with the following command:
22:14:50 /usr/bin/python training_worker.py --RLCOACH_PRESET deepracer --aws_region us-east-1 --s3_bucket sagemaker-us-east-1-158249709041 --s3_prefix rl-deepracer-sagemaker-190413-220444
22:14:51 211:C 13 Apr 2019 22:14:51.317 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
22:14:51 211:C 13 Apr 2019 22:14:51.317 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=211, just started
22:14:51 211:C 13 Apr 2019 22:14:51.317 # Configuration loaded
22:14:51 211:M 13 Apr 2019 22:14:51.319 * Running mode=standalone, port=6379.
22:14:51 211:M 13 Apr 2019 22:14:51.319 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
22:14:51 211:M 13 Apr 2019 22:14:51.319 # Server initialized
22:14:51 211:M 13 Apr 2019 22:14:51.319 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
22:14:51 211:M 13 Apr 2019 22:14:51.319 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
22:14:51 211:M 13 Apr 2019 22:14:51.319 * Ready to accept connections
22:14:56 Redis server started successfully!
22:14:57 Uploaded IP address information to S3: 172.31.27.217

Excerpt from the RoboMaker CloudWatch logs at the point where it fails (note that I've customized some of the messages):

22:17:27 Step No=7.00 Step Reward=8.1870 On track?=True Dist from centre=0.02 Frame Progress=0.0000 Total Progress=6.6870 To Finish=93.31 X, Y, Z=2.17, 0.60, 0.00
22:17:27 Action sent to Ros/Gazebo...
22:17:27 Message recieved from Ros/Gazebo... on_track=True
22:17:27 Step No=8.00 Step Reward=8.1870 On track?=True Dist from centre=0.02 Frame Progress=0.0000 Total Progress=6.6870 To Finish=93.31 X, Y, Z=2.19, 0.60, 0.00
22:21:49 Traceback (most recent call last):
22:21:49 File "/home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 484, in connect
22:21:49 sock = self._connect()
22:21:49 File "/home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 541, in _connect
22:21:49 raise err
22:21:49 File "/home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 529, in _connect
22:21:49 sock.connect(socket_address)
22:21:49 TimeoutError: Errno 110 Connection timed out


Further down the error log...

22:21:49 File "/home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 489, in connect
22:21:49 raise ConnectionError(self._error_message(e))
22:21:49 redis.exceptions.ConnectionError: Error 110 connecting to 172.31.27.217:6379. Connection timed out.
22:21:50 ================================================================================REQUIRED process agent-9 has died!
22:21:50 process has died pid 133, exit code 1, cmd /home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/opt/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/home/robomaker/.ros/log/8bfee468-5e39-11e9-a84a-0242a9fe0103/agent-9.log.
22:21:50 log file: /home/robomaker/.ros/log/8bfee468-5e39-11e9-a84a-0242a9fe0103/agent-9*.log
22:21:50 Initiating shutdown!
22:21:50 ================================================================================
22:21:51 agent-9 killing on exit

Edited by: axb2035 on Apr 15, 2019 2:53 AM

Edited by: axb2035 on Apr 15, 2019 2:56 AM

axb2035
asked 5 years ago · 3730 views
2 Answers

Here are three possibilities I thought of after reading up on Redis:

  1. It's a permissions problem between SageMaker and RoboMaker.
  2. The RoboMaker machine doesn't have the right information to connect to Redis.
  3. The information transferred to Redis breaches some default Redis setting.

Any further ideas, suggestions, or thoughts on the theories above?

Thanks in advance!

Edited by: axb2035 on Apr 16, 2019 3:01 PM

axb2035
answered 5 years ago

Since it's a timeout rather than a refused connection, it's likely to be a security group issue, with the RoboMaker instance unable to reach that port. The default notebook at https://github.com/awslabs/amazon-sagemaker-examples/blob/master/reinforcement_learning/rl_deepracer_robomaker_coach_gazebo/rl_deepracer_coach_robomaker.ipynb does set up all the network routes for you, but something could be going wrong in the assumptions it makes. It tries to do the routing through a VPC, and you should be able to inspect that configuration. You could also try setting up two tiny EC2 instances inside the VPC and see whether they can communicate on port 6379.
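If you do spin up test instances, a small probe like the one below is one way to check the port. It is only a sketch using the Python standard library, and the function name `probe` is made up for illustration. It separates a timeout (packets silently dropped, which is what a security group or routing problem usually looks like) from a refusal (host reachable but nothing listening on the port).

```python
import socket


def probe(host, port, timeout=5.0):
    """Try a plain TCP connection and classify the outcome.

    Hypothetical helper for illustration; not part of the DeepRacer code.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No reply at all: typical of a firewall or security group drop.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. no route to host or a DNS failure.
        return "error: {}".format(exc)
```

Calling `probe("172.31.27.217", 6379)` from a machine in the same subnet as the RoboMaker job would target the address from the logs in the question. A result of "timeout" points at networking, while "refused" would suggest Redis is not listening on that interface.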

As for why it only times out after some steps have run: RoboMaker (ROS in the background) uses Redis to sync the two TensorFlow neural networks so it can do clipped proximal policy optimisation. The RoboMaker instance does a bunch of training on one network, and then sends it to SageMaker through Redis (maybe S3 as well) to do the policy training. RoboMaker doesn't need Redis immediately, so it runs a few steps and episodes before the timeout is triggered.
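Because the failure only shows up once Redis is first needed, it can help while debugging to check reachability before the rollout starts, so the error appears in seconds instead of minutes. The following is a rough sketch under that assumption. `wait_for_port` and its parameters are hypothetical names, and nothing like this exists in the stock DeepRacer scripts as far as I know.

```python
import socket
import time


def wait_for_port(host, port, deadline_s=30.0, attempt_timeout_s=3.0, pause_s=1.0):
    """Return once host:port accepts a TCP connection; raise if it never does.

    Hypothetical helper for illustration; not part of the DeepRacer code.
    """
    stop_at = time.monotonic() + deadline_s
    last_error = None
    while time.monotonic() < stop_at:
        try:
            with socket.create_connection((host, port), timeout=attempt_timeout_s):
                return
        except OSError as exc:
            # Remember the most recent failure and retry after a short pause.
            last_error = exc
            time.sleep(pause_s)
    raise ConnectionError(
        "{}:{} not reachable within {}s (last error: {})".format(
            host, port, deadline_s, last_error
        )
    )
```

Called at the top of the rollout worker with the IP address read from S3, this would make a blocked port fail up front with a clear message.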

crr0004
answered 5 years ago
