Hi,
I'm trying to train a DeepRacer model using the vanilla GitHub setup (including the Redis setup in training_worker.py). Redis starts fine every time, but I get a Redis connection error when the simulation completes the first episode of 20 runs. The SageMaker startup logs and the RoboMaker error logs are below. I have searched high and low to find out what is going wrong, with no luck. Any assistance appreciated!
Excerpt from the SageMaker CloudWatch logs at startup, related to Redis:
22:14:50 Invoking script with the following command:
22:14:50 /usr/bin/python training_worker.py --RLCOACH_PRESET deepracer --aws_region us-east-1 --s3_bucket sagemaker-us-east-1-158249709041 --s3_prefix rl-deepracer-sagemaker-190413-220444
22:14:51 211:C 13 Apr 2019 22:14:51.317 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
22:14:51 211:C 13 Apr 2019 22:14:51.317 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=211, just started
22:14:51 211:C 13 Apr 2019 22:14:51.317 # Configuration loaded
22:14:51 211:M 13 Apr 2019 22:14:51.319 * Running mode=standalone, port=6379.
22:14:51 211:M 13 Apr 2019 22:14:51.319 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
22:14:51 211:M 13 Apr 2019 22:14:51.319 # Server initialized
22:14:51 211:M 13 Apr 2019 22:14:51.319 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
22:14:51 211:M 13 Apr 2019 22:14:51.319 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
22:14:51 211:M 13 Apr 2019 22:14:51.319 * Ready to accept connections
22:14:56 Redis server started successfully!
22:14:57 Uploaded IP address information to S3: 172.31.27.217
Excerpt from the RoboMaker CloudWatch logs at the point where it fails (note that I've customized some of the messages):
22:17:27 Step No=7.00 Step Reward=8.1870 On track?=True Dist from centre=0.02 Frame Progress=0.0000 Total Progress=6.6870 To Finish=93.31 X, Y, Z=2.17, 0.60, 0.00
22:17:27 Action sent to Ros/Gazebo...
22:17:27 Message recieved from Ros/Gazebo... on_track=True
22:17:27 Step No=8.00 Step Reward=8.1870 On track?=True Dist from centre=0.02 Frame Progress=0.0000 Total Progress=6.6870 To Finish=93.31 X, Y, Z=2.19, 0.60, 0.00
22:21:49 Traceback (most recent call last):
22:21:49 File "/home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 484, in connect
22:21:49 sock = self._connect()
22:21:49 File "/home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 541, in _connect
22:21:49 raise err
22:21:49 File "/home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 529, in _connect
22:21:49 sock.connect(socket_address)
22:21:49 TimeoutError: Errno 110 Connection timed out
Further down the error log...
22:21:49 File "/home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 489, in connect
22:21:49 raise ConnectionError(self._error_message(e))
22:21:49 redis.exceptions.ConnectionError: Error 110 connecting to 172.31.27.217:6379. Connection timed out.
22:21:50 ================================================================================REQUIRED process agent-9 has died!
22:21:50 process has died pid 133, exit code 1, cmd /home/robomaker/workspace/bundle-store/b56d2ca1-05a5-4aa6-b6d2-84b6766f0d8b/opt/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/home/robomaker/.ros/log/8bfee468-5e39-11e9-a84a-0242a9fe0103/agent-9.log.
22:21:50 log file: /home/robomaker/.ros/log/8bfee468-5e39-11e9-a84a-0242a9fe0103/agent-9*.log
22:21:50 Initiating shutdown!
22:21:50 ================================================================================
22:21:51 agent-9 killing on exit
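For anyone trying to reproduce this: the traceback above ends in a plain TCP connect timing out, so one way to narrow it down is to test whether the Redis port is reachable at all from another host, independent of the redis client. Below is a minimal sketch of such a check using only the Python standard library. The function name `can_connect` is made up for illustration and is not part of the DeepRacer code; the address and port you pass in would be the ones from the logs above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration only; not part of the DeepRacer repo.
    """
    try:
        # create_connection resolves the host and opens a TCP socket,
        # raising OSError (including timeouts) if the connect fails.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("172.31.27.217", 6379)` from a machine in the same network as the simulation would show whether the port is reachable. A `False` result only says the TCP connect failed or timed out; it does not say why.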
Edited by: axb2035 on Apr 15, 2019 2:53 AM
Edited by: axb2035 on Apr 15, 2019 2:56 AM