I am currently working on a kmeans clustering algorithm for my dataset.
Currently what I have done is to create a preprocess.py that preprocesses my data and stores it in an S3 bucket, plus a training step launched via the Estimator SDK:
import os

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

input_data = ParameterString(
    name="InputDataUrl",
    default_value="s3://ml-pipeline-jobs/input_files/mydata.csv",
)
# processing step for feature engineering
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=f"{base_job_prefix}/sklearn-billofwork-preprocess",
    sagemaker_session=pipeline_session,
    role=role,
)
step_args = sklearn_processor.run(
    outputs=[
        ProcessingOutput(
            output_name="train_preprocessed",
            source="/opt/ml/processing/train/data_final",
        ),
    ],
    code=os.path.join(BASE_DIR, "preprocess.py"),
    arguments=["--input-data", input_data],
)
step_process = ProcessingStep(
    name="PreprocessBillOfWorkData",
    step_args=step_args,
)
# image for the built-in KMeans algorithm
image_uri = sagemaker.image_uris.retrieve(
    framework="kmeans",
    region=region,
    py_version="py3",
    instance_type=training_instance_type,
)
kmeans = Estimator(
    image_uri=image_uri,
    sagemaker_session=pipeline_session,
    role=role,
    instance_type=training_instance_type,
    instance_count=1,
)
kmeans.set_hyperparameters(
    k=40,
    feature_dim=27295,
)
# the output name here must match the ProcessingOutput name above
step_args_preprocess = TrainingInput(
    s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
        "train_preprocessed"
    ].S3Output.S3Uri,
    content_type="text/csv",
)
step_train = TrainingStep(
    name="TrainBowModel",
    estimator=kmeans,
    inputs={
        "train": step_args_preprocess,
    },
)
Now what I would like to do is add a step that accepts part of the output data from step_process, also accepts a .py file that does some additional preprocessing on that data, and then performs the predict function.
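I imagine the step itself could look roughly like the sketch below. Here predict.py is a hypothetical script I would still have to write, and I am not sure how it would load the built-in KMeans model artifact inside the sklearn container, which is really the heart of my question:

from sagemaker.processing import ProcessingInput

# hypothetical second processing step: takes the preprocessed data plus the
# trained model artifact, and runs a custom predict.py
step_args_predict = sklearn_processor.run(
    inputs=[
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs[
                "train_preprocessed"
            ].S3Output.S3Uri,
            destination="/opt/ml/processing/input",
        ),
        ProcessingInput(
            # model.tar.gz produced by the training step
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name="predictions",
            source="/opt/ml/processing/output",
        ),
    ],
    code=os.path.join(BASE_DIR, "predict.py"),  # hypothetical script
)
step_predict = ProcessingStep(
    name="PredictBillOfWorkData",
    step_args=step_args_predict,
)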
I was able to complete everything up to the training step using the AWS SDKs, but I am not sure how to proceed after this.
I studied how inferencing is done using the AWS SDK and it seems there are four different options (real-time endpoints, serverless inference, asynchronous inference, and batch transform).
But I am clueless about which one exactly suits my type of problem.
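From what I read, batch transform seems closest to my case, since I want to score a whole dataset that already sits in S3 rather than serve single requests. My untested guess is something like the sketch below (instance types and the output path are placeholders):

from sagemaker.inputs import TransformInput
from sagemaker.model import Model
from sagemaker.transformer import Transformer
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.steps import TransformStep

# register the trained artifact with the same KMeans image
model = Model(
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    role=role,
)
step_create_model = ModelStep(
    name="CreateKMeansModel",
    step_args=model.create(instance_type="ml.m5.large"),  # placeholder
)

# batch transform over the already-preprocessed data
transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type="ml.m5.large",  # placeholder
    instance_count=1,
    output_path="s3://ml-pipeline-jobs/predictions/",  # placeholder
    sagemaker_session=pipeline_session,
)
step_transform = TransformStep(
    name="BatchPredict",
    transformer=transformer,
    inputs=TransformInput(
        data=step_process.properties.ProcessingOutputConfig.Outputs[
            "train_preprocessed"
        ].S3Output.S3Uri,
        content_type="text/csv",
        split_type="Line",
    ),
)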
Kindly guide me, please. Thanks.
Thanks for sharing the information @tomonori Shimomura. You answered part of my question. Yes, indeed I used the entry_point method to run my custom file. The second part, where I wanted to use the outputs from step_process, I achieved by writing the data to an S3 bucket and reading it from the inference file using the boto3 client.
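For anyone who lands here later, the relevant part of my inference entry point looks roughly like this (bucket and key names are placeholders, and the combining logic is simplified):

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def input_fn(request_body, request_content_type):
    # pull the extra data written by step_process (placeholder bucket/key)
    obj = s3.get_object(
        Bucket="ml-pipeline-jobs",
        Key="processed/extra_features.csv",  # placeholder
    )
    extra = pd.read_csv(io.BytesIO(obj["Body"].read()))
    payload = pd.read_csv(io.StringIO(request_body), header=None)
    # the additional preprocessing that combines the request payload with
    # the extra data goes here; returning the payload alone as a placeholder
    return payload.values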