Keep training with manifest file


Let's say I have trained an object detection model using Pipe and manifest files. Let's say I have the model artifacts saved as a .tar.gz as well as the model endpoint ready. Let's say also that the performance of the model is not enough and I have collected and labeled more pictures. I can create new manifest files that point either to the new labeled pictures only or to the global dataset of labeled pictures (I learned how to combine different manifest files into one).

How do I take my "pre-trained" .tar.gz model and keep training it using the new manifest files?

What I have tried:

  1. The documentation (https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html) seems clear that what it calls "incremental training" is not possible in Pipe mode, and this alone seems to rule out manifest files. This seems quite shortsighted, because it would mean training from scratch every time, which can cost a lot.
  2. Incremental training is possible with RecordIO files and "File" mode instead of Pipe. I followed several notebooks to create RecordIO files from the manifest files I already have. I am not sure I did it correctly: when I train, the performance metrics behave oddly (the plots show dots that are not connected to each other) and training is extremely slow (several hours for a few epochs), to the point that I had to stop it. I am probably doing something wrong when building the RecordIO files (though maybe not, since the training job does run), but the point is that this approach is not user-friendly and is error-prone.
fascani
asked a year ago
1 Answer

Hello,

I understand that you want to continue training your pre-trained model using the new manifest files.

One workaround is Amazon SageMaker JumpStart, which supports incremental training, so training with both the old and the new data takes much less time than starting from scratch. JumpStart also supports model tuning with SageMaker Automatic Model Tuning, which automates the search for the best hyperparameter configuration for a model. [1]

You are right that you can perform incremental training if you use RecordIO files and "File" mode instead of Pipe mode, as stated in the documentation [3].

With incremental training, you can use the artifacts from an existing model and an expanded dataset to train a new model. Incremental training saves both time and resources.
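As an illustration of how an expanded dataset might be assembled from an old and a new manifest, here is a minimal sketch. It assumes the manifests are JSON Lines text and that each record identifies its image with a "source-ref" key (the key name is a parameter, and the sample records and bucket paths are hypothetical). It is not an official SageMaker utility.

```python
import json


def merge_manifests(*manifests, key="source-ref"):
    """Merge JSON Lines manifest texts into one, keeping one record per image.

    If the same image appears more than once, the record from the later
    manifest wins (useful when an image has been relabeled).
    """
    merged = {}
    for text in manifests:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            merged[record[key]] = record
    return "".join(json.dumps(record) + "\n" for record in merged.values())


# Hypothetical sample data: image 1 is relabeled in the new manifest.
old_manifest = (
    '{"source-ref": "s3://your-bucket/img/1.jpg", "bounding-box": {"annotations": []}}\n'
)
new_manifest = (
    '{"source-ref": "s3://your-bucket/img/1.jpg", "bounding-box": {"annotations": [{"class_id": 0}]}}\n'
    '{"source-ref": "s3://your-bucket/img/2.jpg", "bounding-box": {"annotations": []}}\n'
)

combined = merge_manifests(old_manifest, new_manifest)
print(combined, end="")
```

The combined text can then be written to a single manifest file and uploaded to S3.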

Incremental training takes a new dataset together with an already trained model, supplied as a model.tar.gz file, as input, so the existing model can be retrained rather than trained from scratch. [7]

Please see the End-to-End Incremental Training Image Classification Example [2].

Note also that, among the built-in algorithms, incremental training is supported only for Object Detection - MXNet [4], Image Classification - MXNet [5], and the Semantic Segmentation Algorithm [6].
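To make the two inputs concrete, here is a rough sketch of the InputDataConfig section of a CreateTrainingJob request, with one channel for the RecordIO training data and a "model" channel for the existing model.tar.gz, both in File mode as required by [3]. The helper name and the S3 URIs are hypothetical placeholders; the resulting list would be passed as the InputDataConfig argument of the request.

```python
def build_input_data_config(train_uri, model_uri):
    """Build the channel list for an incremental training job (File mode)."""

    def channel(name, uri, content_type):
        return {
            "ChannelName": name,
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": content_type,
            "InputMode": "File",
        }

    return [
        # New (or expanded) training data in RecordIO format
        channel("train", train_uri, "application/x-recordio"),
        # Previously trained model used to seed the new training job
        channel("model", model_uri, "application/x-sagemaker-model"),
    ]


channels = build_input_data_config(
    "s3://your-bucket/training_data/",                # placeholder
    "s3://your-bucket/model-artifacts/model.tar.gz",  # placeholder
)
print([c["ChannelName"] for c in channels])
```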
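For the built-in Object Detection algorithm, the documentation [4] states that the new job and the pre-trained model must use the same base_network and num_classes settings. A small pre-flight check along these lines can catch a mismatch before a job is launched. The function name and the sample hyperparameter values are illustrative assumptions, not part of the SageMaker SDK.

```python
def incompatible_hyperparameters(pretrained, new, keys=("base_network", "num_classes")):
    """Return {name: (pretrained_value, new_value)} for settings that differ.

    Values are compared as strings, because SageMaker hyperparameters are
    passed to training jobs as strings.
    """
    return {
        key: (pretrained.get(key), new.get(key))
        for key in keys
        if str(pretrained.get(key)) != str(new.get(key))
    }


# Hypothetical hyperparameters of the original job and of the new job.
pretrained_hp = {"base_network": "resnet-50", "num_classes": "3", "epochs": "30"}
new_hp = {"base_network": "resnet-50", "num_classes": 4, "epochs": "10"}

print(incompatible_hyperparameters(pretrained_hp, new_hp))
```

An empty result means the two settings match; a non-empty result (here, num_classes) suggests the model cannot be used to seed the new job as configured.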

Here are more detailed steps to use a pre-trained model with a new manifest file for incremental training in AWS SageMaker:


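  1. Prepare your training data: Ensure that your new training data is properly formatted and organized. Each training sample is represented by one JSON line in the manifest file, containing the following information:
  • "source-ref" field: The S3 URI of the image for the training sample.
  • A label attribute (for example, "bounding-box"; the name is whatever your labeling job used): The annotations for the training sample, optionally accompanied by a matching "-metadata" field.

For example, a line in an object detection manifest file could look like:

{"source-ref": "s3://your-bucket/training_data/sample1.jpg", "bounding-box": {"image_size": [{"width": 500, "height": 400, "depth": 3}], "annotations": [{"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128}]}, "bounding-box-metadata": {"class-map": {"0": "cat"}, "type": "groundtruth/object-detection"}}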
  2. Create an Amazon S3 bucket: If you don't already have one, create an S3 bucket in the AWS Management Console. This bucket will be used to store your training data, model artifacts, and other resources.
  3. Upload your new manifest file and training data: Upload your new manifest file and the associated training data files to the S3 bucket from the previous step.
  4. Create a SageMaker training job: Use the SageMaker Python SDK or the AWS Management Console to create a new training job. Here's an example of creating a training job using the SageMaker Python SDK:
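import sagemaker
from sagemaker import get_execution_role, image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

sagemaker_session = sagemaker.Session()
role = get_execution_role()

# Container image of the built-in algorithm that produced the pre-trained model
training_image = image_uris.retrieve(
    framework='your-framework',
    region=sagemaker_session.boto_region_name,
    version='your-framework-version',
)

# New training data in RecordIO format (incremental training needs File mode [3])
train_input = TrainingInput('s3://your-bucket/training_data/', content_type='application/x-recordio', input_mode='File')
# Previously trained model; this S3 path is a placeholder for your own model.tar.gz
model_input = TrainingInput('s3://your-bucket/pretrained/model.tar.gz', content_type='application/x-sagemaker-model', input_mode='File')

estimator = Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type='your-instance-type',
    output_path='s3://your-bucket/model-artifacts/',
    sagemaker_session=sagemaker_session,
)
# Set the algorithm's required hyperparameters with estimator.set_hyperparameters(...) first
# The "model" channel seeds training with the existing weights
estimator.fit({'train': train_input, 'model': model_input})

Replace 'your-framework' with the name of the built-in algorithm used for the pre-trained model (e.g., 'object-detection') and 'your-framework-version' with the specific version you require. Set 'your-instance-type' to an appropriate SageMaker instance type for your training job, and point the "model" channel at the S3 location of your existing model.tar.gz.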

  5. Start the training job: Start the SageMaker training job by running the code snippet above. Provided the existing model.tar.gz is supplied as the "model" channel, this launches the training instances, loads the pre-trained model, and performs incremental training on the new data.
  6. Monitor the training job: Monitor the progress of your training job through the SageMaker console or programmatically using the SDK. You can view metrics, logs, and other useful information to assess the training job's status and performance.
  7. Retrieve the trained model: Once the training job has completed, you can access the trained model artifacts in the specified S3 location ('s3://your-bucket/model-artifacts/' in the code snippet above). These artifacts contain the updated weights and parameters resulting from the incremental training.
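The monitoring and retrieval steps above can also be scripted. The following is a minimal sketch that polls the DescribeTrainingJob API until the job reaches a terminal state and then reports the final metrics and the S3 location of the model artifacts. The function name and the job name are hypothetical, and the client is passed in so that the sketch does not assume any particular credentials setup.

```python
import time


def wait_for_training_job(client, job_name, poll_seconds=30):
    """Poll a SageMaker training job until it finishes and summarize the result.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    while True:
        desc = client.describe_training_job(TrainingJobName=job_name)
        status = desc["TrainingJobStatus"]
        if status in ("Completed", "Failed", "Stopped"):
            return {
                "status": status,
                # Final value of each metric emitted by the algorithm, if any
                "metrics": {
                    m["MetricName"]: m["Value"]
                    for m in desc.get("FinalMetricDataList", [])
                },
                # S3 URI of the resulting model.tar.gz (absent if the job failed)
                "model_artifacts": desc.get("ModelArtifacts", {}).get("S3ModelArtifacts"),
            }
        time.sleep(poll_seconds)


# Example usage (requires AWS credentials; the job name is a placeholder):
#   import boto3
#   summary = wait_for_training_job(boto3.client("sagemaker"), "your-training-job-name")
#   print(summary["status"], summary["model_artifacts"])
```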

If you have difficulty with any of the points above or still run into issues, please reach out to AWS Support [8] (SageMaker) with a detailed description of your issue or use case, and we would be happy to assist you further.

References:

[1] Incremental training with Amazon SageMaker JumpStart - https://aws.amazon.com/blogs/machine-learning/incremental-training-with-amazon-sagemaker-jumpstart/

[2] End-to-End Incremental Training Image Classification Example - https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-incremental-training-highlevel.html

[3] For incremental training, you need to use file input mode - https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html#:~:text=For%20InputMode%2C%20choose%20File.%20For%20incremental%20training%2C%20you%20need%20to%20use%20file%20input%20mode

[4] Object Detection - MXNet - https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection.html

[5] Image Classification - MXNet - https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html

[6] Semantic Segmentation Algorithm - https://docs.aws.amazon.com/sagemaker/latest/dg/semantic-segmentation.html

[7] Bring your own pre-trained MXNet or TensorFlow models into Amazon SageMaker - https://aws.amazon.com/blogs/machine-learning/bring-your-own-pre-trained-mxnet-or-tensorflow-models-into-amazon-sagemaker/

[8] Creating support cases and case management - https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case

AWS
answered a year ago
