S3 Dataset versioning with SageMaker?


Is there a standard for tracking or versioning ML datasets in S3? Basically, what setup lets you trace a given model training execution back to a given dataset? I'm interested in proven or state-of-the-art ideas.

Asked 6 years ago, 1791 views
3 Answers

Nowadays, there are third-party tools that can be used alongside SageMaker. One example is Data Version Control (DVC); we discussed how to integrate it with SageMaker Processing jobs and SageMaker Training jobs in this blogpost. Alternatively, you can leverage SageMaker Pipelines when your data preparation step runs as a processing step within a pipeline execution. Pipelines lets you version data programmatically through execution-specific variables such as ExecutionVariables.PIPELINE_EXECUTION_ID, which is the unique ID of a pipeline run. For example, you can build a unique S3 key for the output datasets that ties them to a specific pipeline run. We also discussed this possibility in this blogpost.
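To make the key layout concrete, here is a minimal sketch in plain Python of an S3 location that embeds the pipeline execution ID, plus a helper that recovers the ID from a dataset URI. The function names, the bucket, the base prefix, and the trailing `output` segment are all hypothetical choices for illustration. Inside a real pipeline definition the ID is only known at run time, so the same string would presumably be assembled with `sagemaker.workflow.functions.Join` over `ExecutionVariables.PIPELINE_EXECUTION_ID` rather than with an f-string or `str.join`.

```python
from urllib.parse import urlparse


def versioned_output_uri(bucket: str, base_prefix: str, execution_id: str) -> str:
    """Build an S3 URI whose key contains the pipeline execution ID.

    The layout <base_prefix>/<execution_id>/output is an assumed convention,
    not something SageMaker enforces.
    """
    parts = [bucket.strip("/"), base_prefix.strip("/"), execution_id, "output"]
    return "s3://" + "/".join(parts)


def execution_id_from_uri(uri: str, base_prefix: str) -> str:
    """Recover the execution ID from a URI built by versioned_output_uri."""
    key_parts = urlparse(uri).path.strip("/").split("/")
    prefix_parts = base_prefix.strip("/").split("/")
    has_prefix = key_parts[: len(prefix_parts)] == prefix_parts
    if not has_prefix or len(key_parts) <= len(prefix_parts):
        raise ValueError(f"{uri!r} does not live under prefix {base_prefix!r}")
    # The segment right after the base prefix is the execution ID.
    return key_parts[len(prefix_parts)]


if __name__ == "__main__":
    uri = versioned_output_uri("my-bucket", "datasets/churn", "abc123")
    print(uri)
    print(execution_id_from_uri(uri, "datasets/churn"))
```

Because the execution ID is part of the key, any training job that reads from that URI can be traced back to the pipeline run that produced its data.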

Answered 2 years ago
Accepted Answer

Unfortunately, SageMaker has no built-in way to manage dataset versions or track which models used them. However, you can use SageMaker Search to compare data locations across experiments. In that case, if your dataset isn't too big, I recommend defining a standard S3 layout: for each new dataset, create a new prefix in S3 following your own naming logic. With SageMaker Search you can then find all your jobs and compare the datasets they used.
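As a rough sketch of the idea above, the snippet below defines one possible prefix convention and builds the request arguments for the SageMaker `search` API call that looks up training jobs by input location. The `datasets/<name>/<version>/` layout and both function names are hypothetical. The filter property name `InputDataConfig.DataSource.S3DataSource.S3Uri` is an assumption about how the training job input URI is exposed to search, so check it against the Search API documentation before relying on it.

```python
def dataset_prefix(bucket: str, dataset_name: str, version: str) -> str:
    """Return the S3 prefix for one dataset version (assumed layout)."""
    return f"s3://{bucket}/datasets/{dataset_name}/{version}/"


def training_job_search_request(s3_prefix: str, max_results: int = 50) -> dict:
    """Build keyword arguments for the SageMaker search call.

    Matches training jobs whose input S3 URI contains the given prefix.
    """
    return {
        "Resource": "TrainingJob",
        "SearchExpression": {
            "Filters": [
                {
                    # Assumed property name; verify against the API docs.
                    "Name": "InputDataConfig.DataSource.S3DataSource.S3Uri",
                    "Operator": "Contains",
                    "Value": s3_prefix,
                }
            ]
        },
        "SortBy": "CreationTime",
        "SortOrder": "Descending",
        "MaxResults": max_results,
    }


if __name__ == "__main__":
    prefix = dataset_prefix("my-bucket", "churn", "2020-01-15")
    request = training_job_search_request(prefix)
    print(request)
    # With AWS credentials configured, the request could be sent like this:
    #   import boto3
    #   response = boto3.client("sagemaker").search(**request)
```

Keeping one prefix per dataset version means a single search per prefix lists every training job that consumed that version.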

Answered 6 years ago
Reviewed 2 months ago

As Paolo_DF suggested, DVC has become a best practice for versioning your datasets, models, and scripts. I use SageMaker Studio to train and deploy my custom ML model, and I found Paolo's blogpost challenging to follow. Although I respect their expertise, I prefer not to use Training jobs or SageMaker Experiments. Instead, I would suggest the following resources: https://dvc.org/doc/user-guide/integrations/sagemaker https://medium.com/analytics-vidhya/versioning-data-and-models-in-ml-projects-using-dvc-and-aws-s3-286e664a7209

Answered 5 months ago
