This post is both a bug report and a question.
We are trying to use SageMaker to train a model, and the setup is quite standard. Since we have a lot of images, the image download time would be extremely long unless we change the input_mode to FastFile. After that, however, I struggled to load images inside the container.
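For context, this is roughly how the estimator is configured; the image URI, role, and instance type below are placeholders, not our actual values:

```python
from sagemaker.estimator import Estimator

# Placeholder values; input_mode='FastFile' is the only setting relevant here.
estimator = Estimator(
    image_uri='<training-image-uri>',
    role='<execution-role-arn>',
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    input_mode='FastFile',  # stream objects from S3 on demand instead of downloading everything up front
)
```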
In my dataset, many samples have names that contain Chinese characters. While debugging why I could not load files properly, I found that when SageMaker mounts the data from S3, it does not handle the encoding correctly. Here is an image name and the corresponding path inside the training container:
七年级上_第10章分式_七年级上_第10章分式_1077759_title_0-0_4_mathjax
/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png
This is not neat, but I can still construct the correct path inside the container. The problem is that I cannot read the file even though the path exists.
What I mean is:

os.path.exists('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png')

returns True, but

cv2.imread('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png')

returns None. I then tried to open the file directly, and fortunately that raises an error. The code is

with open('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png', 'rb') as f: a = f.read()

and it gives me the error

OSError: [Errno 107] Transport endpoint is not connected
I tried loading a file in the same folder whose name contains no Chinese characters, and everything worked in that case, so I am confident the Chinese characters in the filenames are the cause. Is there a quick workaround so that I don't need to rename maybe 80% of the data in S3?