How to access and/or mount Amazon public datasets to EC2


I have an EC2 instance running in us-east-1 that needs to be able to access/manipulate data available in the KITTI Vision Benchmark public dataset. I'd like to make this data available to the instance, but would also like to be able to reuse it with other instances in the future (more like a mounted S3 approach).

I understand that I can view the bucket and recursively download the data to a local folder using the AWS CLI from within the instance:

aws s3 ls --no-sign-request s3://avg-kitti/

aws s3 sync --no-sign-request s3://avg-kitti/ . or aws s3 cp --no-sign-request s3://avg-kitti/ . --recursive

However, this feels like a brute-force approach: it would likely require me to increase my EBS volume size, and it would limit my reuse of the data elsewhere (unless I were to snapshot the volume and reuse that). I did find some Stack Overflow answers mentioning that some of the open datasets are available as snapshots you can copy and attach as a volume, but the KITTI Vision Benchmark dataset appears to live in S3, so I don't think there is a snapshot for it the way there is for the EBS-hosted datasets.

That being said, is there an easier way to copy the public data over to an existing S3 bucket of my own and then mount my instance to that? I have played around with s3fs and feel like that might be my best bet, but I am worried about 1) the cost of copying/downloading all the data from the public bucket to my own, 2) the best approach for reusing this data on other instances, and 3) simply not knowing whether there is a better/cheaper way to make this data available without downloading it now or having to download it again in the future.
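For reference, the s3fs experiment I mentioned looks roughly like this; the mount point is just a placeholder, and I'm assuming the public bucket can be mounted anonymously with s3fs's public_bucket option:

sudo mkdir -p /mnt/kitti

s3fs avg-kitti /mnt/kitti -o public_bucket=1 -o ro -o url=https://s3.us-east-1.amazonaws.com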

JLux
asked 2 years ago · 405 views
2 Answers

You could use AWS Storage Gateway as an Amazon S3 File Gateway. The File Gateway is deployed in your VPC on an EC2 instance and serves up an NFS mount in front of your S3 bucket.
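For example, once the gateway and its NFS file share are created for the bucket, mounting the share from another EC2 instance looks roughly like this (the gateway IP, bucket name, and mount point below are placeholders):

sudo mkdir -p /mnt/kitti

sudo mount -t nfs -o nolock,hard 10.0.1.25:/my-kitti-bucket /mnt/kitti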

AWS
EXPERT
kentrad
answered 2 years ago

You can copy data between S3 buckets using the AWS CLI: aws s3 cp s3://source-bucket/ s3://destination-bucket/ --recursive (or aws s3 sync), but there will be a cost in terms of API requests, and possibly data transfer if either bucket is not in the region you run the command from.
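For example, pulling the public dataset into your own bucket could look something like this (the destination bucket name is a placeholder; if the source bucket is in a different region from your CLI default you may also need --source-region):

aws s3 sync s3://avg-kitti/ s3://my-kitti-copy/avg-kitti/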

Even if you do copy the data to another S3 bucket I don't think that solves the problem that you're describing - you want "filesystem" access to the data.

You could copy the data to EFS or FSx for Lustre, but either of those is going to have a cost associated with it as well.

s3fs is useful, but you do need to be a little careful because it doesn't allow for multiple writers, and performance may be an issue.

The best answer is to ensure that your code accesses S3 directly to grab the objects of interest and then manipulates them locally, rather than downloading the entire dataset to a location (be that EBS, your own S3 bucket, EFS, etc.). That will (most probably) involve code changes, but it has the lowest AWS service cost and requires the fewest workarounds.
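As a rough sketch of that pattern, you could list the public bucket and copy down only the prefixes a particular job needs (the prefix below is a placeholder; use the ls output to find the real ones):

aws s3 ls --no-sign-request s3://avg-kitti/

aws s3 cp --no-sign-request s3://avg-kitti/some-sequence-prefix/ ./data/some-sequence/ --recursive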

AWS
EXPERT
answered 2 years ago
  • ML code uses this data for model training and evaluation - for training cycles I will need to load it to the GPU, process it, and be able to repeat model training cycles. I'd be re-pulling and unzipping this data multiple times unless I just pull it down and store it locally. Maybe I'll look into copying it over to EFS - are download/data transfer rates the same for this compared to just transferring to a separate S3 bucket?
