[AI/ML] Data acquisition and preprocessing


Hi,

A customer who loads e-bike data into S3 wants to get AI/ML insights from the sensor data. The sensor readings are posted to S3 buckets in a format like this:

    timestamp1, sensorA, sensorB, sensorC, ..., sensorZ
    timestamp2, sensorA, sensorB, sensorC, ..., sensorZ
    timestamp3, sensorA, sensorB, sensorC, ..., sensorZ
    ...

Rows like these are batched into files of about 4 KB each.

The plan I have is to:

  • Read the S3 objects.
  • Parse each S3 object with Lambda. I considered Glue, but I want to put the data into DynamoDB, which Glue does not seem to support; Glue also seems more expensive.
  • Put the data into DynamoDB with the bike ID as the partition key and the timestamp as the sort key (a rough sketch of this parse-and-write step is shown after this list).
  • Use SageMaker to train on the DynamoDB data. Choosing a model and doing time-series inference will be a separate discussion.
  • If we need to retrain, use the DynamoDB data rather than the raw S3 data; I think reading from DynamoDB will be faster than re-reading the raw S3 objects.
  • We can also filter out bad input or apply small corrections to the DynamoDB data (shifting timestamps to the correct time, etc.).
  • Produce inference output from the trained model.
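
For illustration, here is a minimal sketch of the parse-and-write Lambda, assuming the function is triggered by S3 ObjectCreated events, the bike ID is encoded in the object key, and a DynamoDB table named EbikeSensorData exists (all of these names are hypothetical):

    import urllib.parse
    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("EbikeSensorData")  # hypothetical table name

    def lambda_handler(event, context):
        # Triggered by S3 ObjectCreated events for each new ~4 KB sensor file.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

            # Assumption: the bike ID is encoded in the object key, e.g. "bike-123/2021-06-01.csv".
            bike_id = key.split("/")[0]

            for line in body.splitlines():
                if not line.strip():
                    continue
                timestamp, *sensor_values = [v.strip() for v in line.split(",")]
                table.put_item(Item={
                    "bike_id": bike_id,      # partition key
                    "timestamp": timestamp,  # sort key
                    "sensors": sensor_values,
                })
        return {"statusCode": 200}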

What do you think? Would you agree, or would you approach the problem differently? Would you rather train from S3 directly, via Athena or direct S3 access? Or would you use Glue and Redshift? About 100 MB of data should be enough to train the model we have in mind, so Glue and Redshift may be overkill. Also, the Korea region does not currently support Timestream, so the closest thing to a time-series database available in Korea may be DynamoDB.

Please share your thoughts.

Thanks!

1 Answer
Accepted Answer

Thoughts about DynamoDB

Compared with S3, DynamoDB costs roughly 5x more per GB of data stored. On top of that, you pay RCU/WCU (read/write capacity) costs.

I would recommend keeping the data in S3. Not only is it more cost effective, but with S3 you do not have to worry about RCU/WCU costs or DynamoDB throughput.

SageMaker notebooks and training instances can read directly from S3, and S3 offers high throughput. I don't think you will have a problem with 100 MB datasets.
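
For example, a notebook can stream a prepared CSV straight from S3 into pandas; the bucket and key below are hypothetical placeholders:

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")

    # Hypothetical bucket/key holding the prepared training data.
    obj = s3.get_object(Bucket="ebike-sensor-data", Key="training/sensors.csv")

    # The S3 object body is file-like, so pandas can read it directly.
    df = pd.read_csv(obj["Body"], header=None)
    print(df.shape)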

If you need to prep/transform your data, you can do the transformations "in place" in S3 using Glue, Athena, Glue DataBrew, Glue Studio, etc.
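
As one possible sketch, an Athena CTAS query can rewrite the raw CSV files as cleaned Parquet under another S3 prefix; the database, table, column, and S3 locations below are all assumptions:

    import boto3

    athena = boto3.client("athena")

    # Hypothetical Glue/Athena database and table defined over the raw sensor files
    # (e.g. created by a Glue crawler); "ts" is an assumed timestamp column.
    query = """
    CREATE TABLE ebike.sensors_clean
    WITH (format = 'PARQUET',
          external_location = 's3://ebike-sensor-data/clean/') AS
    SELECT *
    FROM ebike.sensors_raw
    WHERE ts IS NOT NULL
    """

    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "ebike"},
        ResultConfiguration={"OutputLocation": "s3://ebike-sensor-data/athena-results/"},
    )
    print(response["QueryExecutionId"])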

Glue and DynamoDB

I considered Glue, but I want to put the data into DynamoDB, which Glue does not seem to support.

Glue supports both Python and Spark jobs. If you use a Glue Python job, you can import the boto3 (AWS SDK) library and write to DynamoDB.
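
A Glue Python shell job is essentially just a Python script with boto3 available, so writing to DynamoDB looks the same as it would anywhere else. A minimal sketch, assuming a hypothetical EbikeSensorData table and rows already parsed earlier in the job:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("EbikeSensorData")  # hypothetical table name

    def write_rows(rows):
        """rows: iterable of (bike_id, timestamp, sensor_values) tuples parsed earlier in the job."""
        # batch_writer buffers and retries the underlying BatchWriteItem calls.
        with table.batch_writer() as batch:
            for bike_id, timestamp, sensor_values in rows:
                batch.put_item(Item={
                    "bike_id": bike_id,      # partition key
                    "timestamp": timestamp,  # sort key
                    "sensors": sensor_values,
                })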

Other strategies

How is your customer ingesting the sensor data / how is it being written to S3? Are they using AWS IoT Core?

Regardless, the pattern you've described thus far is:

Device -> Sensor data in S3 -> Transform with Lambda -> store data in DynamoDB

An alternative approach you could consider is using Kinesis Data Firehose with Lambda transformations. This allows you to do "in-line" parsing/transformation of your data before it is ever written to S3, thus removing the need to re-read the data from S3 and apply transformations after the fact. Firehose also allows you to write the delivered data in formats such as Parquet, which can help with cost and subsequent query performance.
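
A Firehose transformation Lambda follows a fixed contract: each incoming record carries base64-encoded data and must be returned with its recordId, a result of "Ok", "Dropped", or "ProcessingFailed", and the transformed data. A minimal sketch, with a made-up clean-up step:

    import base64

    def lambda_handler(event, context):
        output = []
        for record in event["records"]:
            payload = base64.b64decode(record["data"]).decode("utf-8")

            # Hypothetical clean-up: trim whitespace around each comma-separated field.
            cleaned = ",".join(field.strip() for field in payload.split(","))

            output.append({
                "recordId": record["recordId"],
                "result": "Ok",  # or "Dropped" to filter the record out
                "data": base64.b64encode((cleaned + "\n").encode("utf-8")).decode("utf-8"),
            })
        return {"records": output}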

If you want to store both raw data and transformed data, you can use a "fanout" pattern with Kinesis Streams/Firehose, where one output is raw data to S3 and the other is a transformed stream.

answered 3 years ago
