
Questions tagged with Amazon SageMaker



SageMaker AutoML generates ExpiredTokenException

Hi, I can train models using different AWS SageMaker estimators, but when I use the SageMaker AutoML Python SDK the following error occurs about 15 minutes into the model training process:

```
botocore.exceptions.ClientError: An error occurred (ExpiredTokenException) when calling the DescribeAutoMLJob operation: The security token included in the request is expired
```

The role used to create the AutoML object is associated with the following AWS pre-defined policies as well as one inline policy. Can you please let me know what I'm missing that's causing this ExpiredTokenException error?

- AmazonS3FullAccess
- AWSCloud9Administrator
- AWSCloud9User
- AmazonSageMakerFullAccess

Inline policy:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["iam:PassRole"],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "sagemaker.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:DescribeModel",
                "sagemaker:InvokeEndpoint",
                "sagemaker:ListTags",
                "sagemaker:DescribeEndpoint",
                "sagemaker:CreateModel",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:CreateEndpoint",
                "sagemaker:DeleteModel",
                "sagemaker:DeleteEndpointConfig",
                "sagemaker:DeleteEndpoint",
                "cloudwatch:PutMetricData",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:CreateLogGroup",
                "logs:DescribeLogStreams",
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage"
            ],
            "Resource": "*"
        }
    ]
}
```

Thanks, Stefan
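The timing of the failure points at the temporary credentials used by the SDK expiring while `fit()` is still polling `DescribeAutoMLJob`, rather than at a missing permission. A minimal sketch of one common workaround, assuming the environment refreshes its credentials (for example, an instance profile or execution role): start the job without blocking, then poll with a freshly created boto3 client each time. The role ARN, bucket, target column, and job name below are placeholders.

```python
import time
import boto3
from sagemaker.automl.automl import AutoML

# Placeholders - substitute your own role, data location, and target column.
role_arn = "arn:aws:iam::111122223333:role/MySageMakerExecutionRole"
automl = AutoML(role=role_arn, target_attribute_name="label", max_candidates=10)

# Kick off the job without blocking; a blocking fit() keeps one botocore
# session polling DescribeAutoMLJob, which fails once its token expires.
automl.fit(inputs="s3://my-bucket/train.csv", wait=False, job_name="my-automl-job")

# Poll with a fresh client each iteration so refreshed credentials are used.
while True:
    status = boto3.client("sagemaker").describe_auto_ml_job(
        AutoMLJobName="my-automl-job"
    )["AutoMLJobStatus"]
    print(status)
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)
```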
0
answers
0
votes
0
views
AWS-User-9933965
asked 18 hours ago

GroundTruth text labelling - hide data columns, and methods of quality control

I have a CSV of sentences which I'd like labelled, and have identified GroundTruth labelling jobs as a way to do this. Having spent some time exploring the service, I have some questions:

**1)** I can't find a way to display only particular columns to the labellers - e.g. if the dataset has a column of IDs for each sentence, this ideally shouldn't be shown to labellers.

**2)** There is either single labelling or multi labelling, but I would like a way to have two sets of single-selection labels, where one captures the difficulty of assigning the label:

- Select one for binary classification: a) Yes, b) No
- Select one for difficulty of classification: c) Easy, d) Medium, e) Hard

Can this be done using custom HTML? Is there a guide to writing this - the template it gives you doesn't seem to render as-is. (See the template sketch after this question.)

**3)** There appears to be a maximum of $1.20 payment per task. Is this the case, and why?

**4)** Having not used Mechanical Turk before, are there ways of ensuring people take the work seriously and don't just select random answers? I can see there's an option to have x number of people answer the same question, but is there also a way to put in unambiguous questions to which we already have a 'pre_agreed_label' every nth question, and remove people from the task if they get them wrong?

Thanks!
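Regarding **2)** (and indirectly **1)**, since only the fields your input manifest and pre-annotation Lambda expose to the template are visible to workers), a custom worker task template can render two independent single-selection groups with the Crowd HTML Elements. A minimal sketch with placeholder bucket/key names, embedding the template as a Python string for upload; `{{ task.input.taskObject }}` is an assumption about what your pre-annotation Lambda returns:

```python
import boto3

# Sketch of a custom template: two independent single-selection groups
# rendered with Crowd HTML Elements (crowd-radio-group / crowd-radio-button).
template = """
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <p>{{ task.input.taskObject }}</p>

  <p>Binary classification</p>
  <crowd-radio-group>
    <crowd-radio-button name="label_yes" value="yes">Yes</crowd-radio-button>
    <crowd-radio-button name="label_no" value="no">No</crowd-radio-button>
  </crowd-radio-group>

  <p>Difficulty of classification</p>
  <crowd-radio-group>
    <crowd-radio-button name="difficulty_easy" value="easy">Easy</crowd-radio-button>
    <crowd-radio-button name="difficulty_medium" value="medium">Medium</crowd-radio-button>
    <crowd-radio-button name="difficulty_hard" value="hard">Hard</crowd-radio-button>
  </crowd-radio-group>
</crowd-form>
"""

# Upload the template and point UiConfig.UiTemplateS3Uri at this object when
# creating the labelling job (bucket and key are placeholders).
boto3.client("s3").put_object(
    Bucket="my-labeling-bucket",
    Key="templates/two-questions.liquid.html",
    Body=template.encode("utf-8"),
)
```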
0
answers
0
votes
3
views
AWS-User-6024204
asked 20 days ago

Giving weights to event types in amazon personalize

1) For the VIDEO_ON_DEMAND domain, some use cases include multiple event types. For example, the 'Top picks for you' use case includes two event types, 'watch' and 'click'. Is 'watch' given more weight than 'click' when training the model? In general, when there is more than one event type, do domain recommenders give more weight to some event types?

2) In our use case, we have a platform that recommends video content. However, we have multiple event types, and some events need to be given more weight than others. Below is the list of our event types in the order of their importance:

SHARE > LIKE > WATCH_COMPLETE > WATCH_PARTIAL > STARTED > SKIP

So when training the model, we would want 'SHARE' to have more weight than 'LIKE', 'LIKE' to have more weight than 'WATCH_COMPLETE', and so on. I was looking into custom solutions. It looks like there is no way to give weights when using Personalize's custom solutions, as mentioned in this [post](https://stackoverflow.com/questions/69456739/any-way-to-tell-aws-personalize-that-some-interactions-count-more-than-others/69483117#69483117).

---

**So when using Amazon Personalize, should we use domain recommenders or build custom solutions for our use case?**

**If we cannot give weights to different event types using Personalize, then what are the alternatives?** Should we use Amazon SageMaker and build models from scratch?

*Open to any and all suggestions.* Thank you!
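One approach sometimes used as a partial workaround (this is an assumption about your data pipeline, not a Personalize weighting feature): encode the ordering as a numeric EVENT_VALUE column before importing the interactions dataset, so that low-importance events can be filtered out at training time with eventValueThreshold. Personalize does not turn these values into per-event-type weights, so whether this approximates your requirement is an open question. A sketch with a hypothetical mapping:

```python
import pandas as pd

# Interaction log with an EVENT_TYPE column (file name is a placeholder).
interactions = pd.read_csv("interactions.csv")

# Hypothetical mapping encoding the stated importance ordering numerically.
event_value = {
    "SHARE": 6, "LIKE": 5, "WATCH_COMPLETE": 4,
    "WATCH_PARTIAL": 3, "STARTED": 2, "SKIP": 1,
}
interactions["EVENT_VALUE"] = interactions["EVENT_TYPE"].map(event_value)

# The eventValueThreshold in a solution config can then exclude events below
# a chosen importance level; it does not weight the remaining ones.
interactions.to_csv("interactions_with_value.csv", index=False)
```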
1
answers
0
votes
6
views
DharaniR
asked a month ago

Multi-file source_dir bundle with SM Training Compiler (distributed)

I'm hoping to use SageMaker Training Compiler with a (Hugging Face Trainer API, PyTorch) program split across **multiple .py files** for maintainability. The job needs to run on multiple GPUs (although at the current scale, multi-device single-node would be acceptable). Following [the docs](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-enable.html#training-compiler-enable-pysdk), I added the `distributed_training_launcher.py` launcher script to my `source_dir` bundle, and passed in the true training script via a `training_script` hyperparameter.

...But when the job tries to start, I get:

```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 90, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 86, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_gpus)
AttributeError: module 'train' has no attribute '_mp_fn'
```

Any ideas what might be causing this? Is there some particular limitation or additional requirement for training scripts that are written over multiple files?

I also tried running in single-GPU mode (`p3.2xlarge`) instead - directly calling the train script instead of the distributed launcher - and saw the below error which seems to originate within [TrainingArguments](https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#transformers.TrainingArguments) itself? Not sure why it's trying to call a 'tensorflow/compiler' compiler when running in PT..?

**EDIT: Turns out the below error can be solved by explicitly setting `n_gpus` as mentioned on the [troubleshooting doc](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-troubleshooting.html#training-compiler-troubleshooting-missing-xla-config) - but that takes me back to the error message above**

```
  File "/opt/ml/code/code/config.py", line 124, in __post_init__
    super().__post_init__()
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 761, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 975, in device
    return self._setup_devices
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1754, in __get__
    cached = self.fget(obj)
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 918, in _setup_devices
    device = xm.xla_device()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
    devices = get_xla_supported_devices(
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 137, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 19, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:273 : Missing XLA configuration
```
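The first traceback shows `torch_xla`'s `xla_spawn.py` importing the module named by `training_script` and calling `mod._mp_fn`, so the top-level training script needs to expose a `_mp_fn` entry point; helper modules in the bundle don't. A minimal sketch of that shape, assuming the real training logic lives in the other files (the `main` function and module layout here are placeholders, not your code):

```python
# train.py - the script passed via the training_script hyperparameter.
# Other modules in source_dir (e.g. data.py, modeling.py) can be imported
# normally; only this top-level module needs to define _mp_fn.

def main():
    # ... build TrainingArguments / Trainer and call trainer.train() ...
    pass


def _mp_fn(index):
    # Per-process entry point that xla_spawn.py looks for; 'index' is the
    # local process index supplied by xmp.spawn.
    main()


if __name__ == "__main__":
    main()
```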
0
answers
0
votes
5
views
EXPERT
Alex_T
asked a month ago

AWS SageMaker Endpoint Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check

**Links to the AWS notebooks for reference**

https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/xgboost_bring_your_own_model/xgboost_bring_your_own_model.ipynb
https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/xgboost_script_mode_local_training_and_serving/code/inference.py

I am using the example from the notebooks to create and deploy an endpoint to AWS SageMaker Cloud. I have passed all the checks locally, and when I attempt to deploy the endpoint I run into the issue.

**Code**

In my local notebook (my personal machine, NOT a SageMaker notebook):

```
import pandas
import xgboost
from xgboost import XGBRegressor
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV

print(xgboost.__version__)
1.0.1

# Fit model
r.fit(X_train.toarray(), y_train.values)
xgbest = r.best_estimator
```

**AWS SageMaker Endpoint code**

```
import boto3
import pickle
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from time import gmtime, strftime

region = boto3.Session().region_name
role = 'arn:aws:iam::111:role/xxx-sagemaker-role'
bucket = 'ml-model'
prefix = "sagemaker/xxx-xgboost-byo"
bucket_path = "https://s3-{}.amazonaws.com/{}".format('us-west-1', 'ml-model')

client = boto3.client(
    's3',
    aws_access_key_id=xxx,
    aws_secret_access_key=xxx
)
client.list_objects(Bucket=bucket)
```

**Save the model**

```
# save the model, either xgbest
model_file_name = "xgboost-model"
# using save_model
# xgb_model.save_model(model_file_name)
pickle.dump(xgbest, open(model_file_name, 'wb'))

!tar czvf xgboost_model.tar.gz $model_file_name
```

**Upload to S3**

```
key = 'xgboost_model.tar.gz'
with open('xgboost_model.tar.gz', 'rb') as f:
    client.upload_fileobj(f, bucket, key)
```

**Import model**

```
# Import model into hosting
container = get_image_uri(boto3.Session().region_name, "xgboost", "0.90-2")
print(container)
xxxxxx.dkr.ecr.us-west-1.amazonaws.com/sagemaker-xgboost:0.90-2-cpu-py3
```

```
%%time

model_name = model_file_name + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_url = "https://s3-{}.amazonaws.com/{}/{}".format(region, bucket, key)

from sagemaker.xgboost import XGBoost, XGBoostModel
from sagemaker.session import Session
from sagemaker.local import LocalSession

sm_client = boto3.client(
    "sagemaker",
    region_name="us-west-1",
    aws_access_key_id='xxxx',
    aws_secret_access_key='xxxx'
)

# Define session
sagemaker_session = Session(sagemaker_client = sm_client)

models3_uri = "s3://ml-model/xgboost_model.tar.gz"

xgb_inference_model = XGBoostModel(
    model_data=models3_uri,
    role=role,
    entry_point="inference.py",
    framework_version="0.90-2",
    # Cloud
    sagemaker_session = sagemaker_session
    # Local
    # sagemaker_session = None
)

#serializer = StringSerializer(content_type="text/csv")

predictor = xgb_inference_model.deploy(
    initial_instance_count = 1,
    # Cloud
    instance_type="ml.t2.large",
    # Local
    # instance_type = "local",
    serializer = "text/csv"
)

if xgb_inference_model.sagemaker_session.local_mode == True:
    print('Deployed endpoint in local mode')
else:
    print('Deployed endpoint to SageMaker AWS Cloud')

/Applications/Anaconda/anaconda3/lib/python3.9/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
   3354         if status != "InService":
   3355             reason = desc.get("FailureReason", None)
-> 3356             raise exceptions.UnexpectedStatusException(
   3357                 message="Error hosting endpoint {endpoint}: {status}. Reason: {reason}.".format(
   3358                     endpoint=endpoint, status=status, reason=reason

UnexpectedStatusException: Error hosting endpoint sagemaker-xgboost-xxxx: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
```
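One thing worth checking (a hedged guess from the code shown, not a confirmed root cause): the artifact was pickled from an XGBRegressor trained with xgboost 1.0.1 but is served with the 0.90-2 framework container, and the endpoint only passes the ping health check if the serving container can load the artifact. A minimal sketch of a `model_fn` that matches how the artifact was saved (file name taken from the question; the rest of `inference.py` is assumed to stay as in the linked sample):

```python
# inference.py - sketch of a model_fn matching pickle.dump of "xgboost-model".
import os
import pickle


def model_fn(model_dir):
    # The ping check fails if loading raises, so the xgboost version inside
    # the serving image must be able to unpickle this object.
    with open(os.path.join(model_dir, "xgboost-model"), "rb") as f:
        return pickle.load(f)
```

Aligning `framework_version` with the xgboost version used to train (and checking the CloudWatch log for the exact load error) is the usual next step.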
1
answers
0
votes
5
views
aws-user-2268062
asked a month ago

passing a numpy array to predict_fn when making inference for xgboost model

I have a model that's trained locally and deployed to SageMaker to make inferences / invoke the endpoint. When I try to make predictions, I get the following exception:

```
raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional
```

My `model` is an xgboost model with some pre-processing (variable encoding) and hyper-parameter tuning. Here's what the `model` object looks like:

```
XGBRegressor(colsample_bytree=xxx, gamma=xxx, learning_rate=xxx, max_depth=x, n_estimators=xxx, subsample=xxx)
```

My test data is a string of float values which is turned into an array, as the data must be passed as a numpy array:

```
testdata = [........., 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2000, 200, 85, 412412, 123, 41, 552, 50000, 512, 0.1, 10.0, 2.0, 0.05]
```

I have tried to reshape the numpy array from 1d to 2d; however, that doesn't work because the number of features in the test data and in the trained model do not match. My question is: how do I pass a numpy array with the same length as the number of features in the trained model? I am able to make predictions by passing test data as a list locally.

More info on the inference script here: https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/xgboost_script_mode_local_training_and_serving/code/inference.py

```
Traceback (most recent call last):
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_functions.py", line 93, in wrapper
    return fn(*args, **kwargs)
  File "/opt/ml/code/inference.py", line 75, in predict_fn
    prediction = model.predict(input_data)
  File "/miniconda3/lib/python3.6/site-packages/xgboost/sklearn.py", line 448, in predict
    test_dmatrix = DMatrix(data, missing=self.missing, nthread=self.n_jobs)
  File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 404, in __init__
    self._init_from_npy2d(data, missing, nthread)
  File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 474, in _init_from_npy2d
    raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional
```
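`XGBRegressor.predict` needs a 2-D array shaped `(n_samples, n_features)`, and if the model was fit on encoded features, the raw test row has to go through the same fitted encoder before it has the right number of columns. A minimal client-side sketch under those assumptions (the `encoder` object and the shortened feature list are placeholders):

```python
import numpy as np

# Placeholder raw row; in practice this is the full feature vector.
testdata = [0.0, 1.0, 2000, 200, 85, 0.1, 10.0, 2.0, 0.05]

# If the model was fit on encoder.transform(...) output, apply the same
# fitted encoder here so the column count matches the trained model:
# payload = encoder.transform([testdata]).toarray()        # shape (1, n_features)

# Otherwise, promote the 1-D vector to a single-row 2-D array before sending.
payload = np.asarray(testdata, dtype=np.float32).reshape(1, -1)
print(payload.shape)  # (1, n_features)
```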
2
answers
0
votes
19
views
aws-user-2268062
asked a month ago

mxnet error encountered in Lambda Function

I trained and deployed a semantic segmentation network (ml.p2.xlarge) using SageMaker. I wanted to use an AWS Lambda function to send an image to this endpoint and get a mask in return; however, when I use invoke_endpoint it gives an mxnet error in the logs. Funnily, when I use the deployed model from a transformer object from inside the SageMaker notebook, the mask is returned properly. Here is my Lambda function code:

```
import json
import boto3

s3r = boto3.resource('s3')

def lambda_handler(event, context):
    # TODO implement
    bucket = event["body"]
    key = 'image.jpg'
    local_file_name = '/tmp/'+key
    s3r.Bucket(bucket).download_file(key, local_file_name)

    runtime = boto3.Session().client('sagemaker-runtime')

    with open('/tmp/image.jpg', 'rb') as imfile:
        imbytes = imfile.read()

    # Now we use the SageMaker runtime to invoke our endpoint, sending the review we were given
    response = runtime.invoke_endpoint(
        EndpointName='semseg-2021-12-03-10-05-58-495',
        ContentType='application/x-image',
        Body=bytearray(imbytes))  # The actual image

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read()

    return {
        'statusCode': 200,
        'body': json.dumps(result)
    }
```

Here are the errors I see in the logs:

```
mxnet.base.MXNetError: [10:26:14] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x.4276.0/AL2_x86_64/generic-flavor/src/3rdparty/dmlc-core/src/recordio.cc:12: Check failed: size < (1 << 29U) RecordIO only accept record less than 2^29 bytes
```
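A hedged thing to try rather than a diagnosis (the log line alone doesn't say whether the request or the response record trips the size check): explicitly request a PNG mask via the `Accept` header, since the protobuf tensor output of the semantic segmentation algorithm is far larger than a PNG, and the notebook predictor may be setting this for you. Note also that `json.dumps` on raw bytes raises, so the mask needs base64-encoding before it goes into the Lambda response. Sketch with placeholder names:

```python
import base64
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def invoke_semseg(endpoint_name, image_bytes):
    # Request the PNG mask output explicitly instead of the default/protobuf form.
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/x-image",
        Accept="image/png",
        Body=image_bytes,
    )
    mask_bytes = response["Body"].read()
    # Raw PNG bytes are not JSON-serializable; base64-encode them for the body.
    return {
        "statusCode": 200,
        "body": json.dumps({"mask": base64.b64encode(mask_bytes).decode("ascii")}),
    }
```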
1
answers
0
votes
17
views
YashJain
asked a month ago

Error Invoking endpoint deployed locally using SageMaker SDK for a xgboost model

I am deploying a SageMaker endpoint locally for an xgboost model and running into some issues when invoking the endpoint. I am able to successfully deploy the endpoint in local mode using the following code sample:

```
from sagemaker.xgboost import XGBoost, XGBoostModel
from sagemaker.session import Session

xgb_inference_model = XGBoostModel(
    model_data=models3_uri,
    role=role,
    entry_point="inference.py",
    framework_version="0.90-2",
    sagemaker_session = None  # sagemaker_session if cloud / prod mode
)

print('Deploying endpoint in local mode')
predictor = xgb_inference_model.deploy(
    initial_instance_count = 1,
    #instance_type="ml.m5.xlarge",
    instance_type = "local",
)
```

I have the `inference.py` that includes the functions for accepting input, making predictions, and producing output. Link here: https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/xgboost_script_mode_local_training_and_serving/code/inference.py

The issue I am running into is with the type of data that `input_fn` accepts. I have tried passing a numpy array / dataframe / bytes object as input data, but still get the error.

```
def input_fn(request_body, request_content_type):
    """
    The SageMaker XGBoost model server receives the request data body and the content type,
    and invokes the `input_fn`.

    Return a DMatrix (an object that can be passed to predict_fn).
    """
    # Handle numpy array type
    if request_content_type == "application/x-npy":
        print(type(request_body))
        array = np.load(BytesIO(request_body))
        return xgb.DMatrix(request_body)
    if request_content_type == "text/csv":
        print("request body", request_body)
        # change request_body to Pandas DataFrame
        return xgb_encoders.libsvm_to_dmatrix(request_body)
        # perform encoding on the input data here
    else:
        raise ValueError("Content type {} is not supported.".format(request_content_type))
```

Encoder object: the training data is encoded before fit, and I'm doing the same with the test data. Posting here for reference. I tried making predictions using NumpySerializer and CSVSerializer [1]. Both don't work -

```
from sagemaker.serializers import NumpySerializer
predictor.serializer = NumpySerializer()

testpoint = encoder.transform(df).toarray()
print(testpoint)
[[0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00.........1.000e+01 2.000e+00 5.312e-02]]
```

Traceback with CSVSerializer(), when passing a body of type text/csv:

```
Exception on /invocations [POST]
Traceback (most recent call last):
  File "/opt/ml/code/inference.py", line 35, in input_fn
    return xgb_encoders.libsvm_to_dmatrix(request_body)
  packages/sagemaker_xgboost_container/encoder.py", line 65, in libsvm_to_dmatrix
TypeError: a bytes-like object is required, not 'str'
```

Traceback with NumpySerializer() when passing a <class 'numpy.ndarray'> type body:

```
Exception on /invocations [POST]
Traceback (most recent call last):
  raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('O'),)
```

[1] https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html
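Both tracebacks point at a mismatch between the serializer and what `input_fn` expects: `CSVSerializer` sends comma-separated text rather than the libsvm records that `libsvm_to_dmatrix` parses, and an object-dtype array cannot be converted to a DMatrix. One consistent pairing, assuming the client keeps `predictor.serializer = CSVSerializer()` and sends a numeric 2-D array (for example `encoder.transform(df).toarray().astype("float32")`), is to parse CSV directly in `input_fn`. A sketch, not the sample's official code:

```python
# inference.py sketch: accept text/csv as sent by sagemaker.serializers.CSVSerializer.
import numpy as np
import xgboost as xgb


def input_fn(request_body, request_content_type):
    if request_content_type == "text/csv":
        # The model server may hand over bytes or str depending on the stack.
        if isinstance(request_body, bytes):
            request_body = request_body.decode("utf-8")
        rows = [
            [float(value) for value in line.split(",")]
            for line in request_body.strip().splitlines()
        ]
        return xgb.DMatrix(np.asarray(rows, dtype=np.float32))
    raise ValueError("Content type {} is not supported.".format(request_content_type))
```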
2
answers
0
votes
14
views
aws-user-2268062
asked a month ago

Using SageMaker SDK to deploy a open source xgboost model locally

I have a locally trained model that I am trying to debug locally in a Docker container before deploying / creating an endpoint on SageMaker. I am following the documentation that AWS customer service provided; however, I am running into an issue with creating the endpoint config. Here's the code snippet:

```
from sagemaker.xgboost import XGBoost, XGBoostModel
from sagemaker.session import Session

sm_client = boto3.client(
    "sagemaker",
    aws_access_key_id='xxxxxx',
    aws_secret_access_key='xxxxxx'
)

sagemaker_session = Session(sagemaker_client = sm_client)

xgb_inference_model = XGBoostModel(
    model_data=model_url,
    role=role,
    entry_point="inference.py",
    framework_version="0.90-2",
    sagemaker_session = sagemaker_session
)

print('Deploying endpoint in local mode')
predictor = xgb_inference_model.deploy(
    initial_instance_count = 1,
    instance_type = "local"
)
```

Traceback:

```
20 print('Deploying endpoint in local mode')
21 predictor = xgb_inference_model.deploy(
22     initial_instance_count = 1,

ClientError: An error occurred (ValidationException) when calling the CreateEndpointConfig operation: 1 validation error detected: Value 'local' at 'productionVariants.1.member.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.r5d.12xlarge, ml.r5.12xlarge, ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.r5d.24xlarge,
```

Here's the documentation link: https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/xgboost_script_mode_local_training_and_serving/xgboost_script_mode_local_training_and_serving.py
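The ValidationException shows the deploy call reaching the real CreateEndpointConfig API: a `Session` built around a remote `sagemaker` client always targets the cloud, even with `instance_type="local"`. A minimal sketch of the local-mode variant, mirroring the linked sample rather than prescribing your exact setup (`model_url` and `role` are taken from the question):

```python
from sagemaker.local import LocalSession
from sagemaker.xgboost import XGBoostModel

# LocalSession routes CreateModel/CreateEndpoint calls to local Docker
# containers instead of the SageMaker control plane.
sagemaker_session = LocalSession()
sagemaker_session.config = {"local": {"local_code": True}}

xgb_inference_model = XGBoostModel(
    model_data=model_url,            # assumed s3:// URI or local tarball path
    role=role,
    entry_point="inference.py",
    framework_version="0.90-2",
    sagemaker_session=sagemaker_session,
)

predictor = xgb_inference_model.deploy(
    initial_instance_count=1,
    instance_type="local",
)
```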
1
answers
0
votes
18
views
aws-user-2268062
asked a month ago

XGBoost Reports Not Generated

Hi! I have been trying to create a model using XGBoost, and was able to successfully run/train the model. However, I have not been able to generate the training reports. I have included the rules parameter as follows: `rules=[Rule.sagemaker(rule_configs.create_xgboost_report())]`. I am following this tutorial, but I am using objective "multi:softmax" instead of the "binary:logistic" used in the example. When I run the model everything is fine, but only the Profiler Report gets generated and I do not see the XGBoost report under the rule-output folder. According to the tutorial it should be under the same file path. Here is my code for the model if it helps any:

```
s3_output_location='s3://{}/{}/{}'.format(bucket, prefix, 'xgboost_model')

container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "latest")

train_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "data/train.csv"), content_type="csv"
)
validation_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "data/validation.csv"), content_type="csv"
)

rules=[
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

xgb = sagemaker.estimator.Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    volume_size=5,
    output_path=s3_output_location,
    sagemaker_session=sagemaker.Session(),
    rules=rules
)

xgb.set_hyperparameters(
    max_depth=6,
    objective='multi:softmax',
    num_class=num_classes,
    gamma=800,
    num_round=250
)
```

Any help is appreciated! Thanks!
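One hedged thing to check (an assumption based on the code, not a confirmed cause): the `"latest"` tag resolves to the legacy built-in XGBoost image, while the XGBoost report rule is documented against the newer framework-style container versions, so pinning an explicit version is a small change to try. Sketch, with `"1.2-1"` as an illustrative choice rather than the only valid one:

```python
import boto3
import sagemaker

# Pin a specific framework version instead of the legacy "latest" tag so the
# training image supports the Debugger-based XGBoost report rule.
container = sagemaker.image_uris.retrieve(
    "xgboost", boto3.Session().region_name, version="1.2-1"
)
print(container)
```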
1
answers
0
votes
72
views
jughead
asked a month ago

botocore.exceptions.ClientError: An error occurred (ValidationException)

Hi, I want to deploy an MLflow image that contains a machine learning model to an AWS SageMaker endpoint. I executed the following code, which I found in <https://towardsdatascience.com/deploying-models-to-production-with-mlflow-and-amazon-sagemaker-d21f67909198>:

```
import mlflow.sagemaker as mfs

run_id = run_id  # the model you want to deploy - this run_id was saved when we trained our model
region = "us-east-1"  # region of your account
aws_id = "XXXXXXXXXXX"  # from the aws-cli output
arn = "arn:aws:iam::XXXXXXXXXXX:role/your-role"
app_name = "iris-rf-1"
model_uri = "mlruns/%s/%s/artifacts/random-forest-model" % (experiment_id, run_id)  # edit this path based on your working directory
image_url = aws_id + ".dkr.ecr." + region + ".amazonaws.com/mlflow-pyfunc:1.2.0"  # change to your mlflow version

mfs.deploy(app_name=app_name,
           model_uri=model_uri,
           region_name=region,
           mode="create",
           execution_role_arn=arn,
           image_url=image_url)
```

But I got the following error. I checked all policies and permissions attached to the IAM role. They all comply with what the error message complains about. I don't know what to do next. I'd appreciate your help. Thanks.

```
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not access model data at https://s3.amazonaws.com/mlflow-sagemaker-us-east-1-xxx/mlflow-xgb-demo-model-eqktjeoit5mxhmjn-abpanw/model.tar.gz. Please ensure that the role "arn:aws:iam::xxx:role/mlflow-sagemaker-dev" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:GetObject" permissions and that the object is located in us-east-1.
```
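For reference, the trust relationship the error message asks about is separate from the permission policies attached to the role: the role's assume-role policy document must name the SageMaker service principal. A minimal sketch of that document and one way to apply it with boto3 (the role name is the placeholder from the error message):

```python
import json
import boto3

# Trust relationship allowing SageMaker to assume the execution role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

boto3.client("iam").update_assume_role_policy(
    RoleName="mlflow-sagemaker-dev",
    PolicyDocument=json.dumps(trust_policy),
)
```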
1
answers
0
votes
1
views
farshad123
asked a year ago

Processing Job automatically created when I start a training job

Hi, I haven't used SageMaker for a while, and today I started a training job (with the same old settings I always used before), but this time I noticed that a processing job was automatically created and is running while my training job runs (I don't even know what a processing job is). I also checked the dashboard to be sure: this was not happening before. It's the second time (the first time was in December), but I've been using SageMaker for the last two years. Is this a wanted behaviour? I didn't find anything related in the documentation, but it's important to know because I don't want extra costs.

This is the image used by the processing job, with an instance type of ml.m5.2xlarge which I didn't set anywhere:

```
929884845733.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest
```

And this is how I launch my training job (the entrypoint script is basically Keras code for a MobileNetV3):

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker import get_execution_role

bucket = 'mybucket'
train_data = 's3://{}/{}'.format(bucket, 'train')
validation_data = 's3://{}/{}'.format(bucket, 'test')
s3_output_location = 's3://{}'.format(bucket)

hyperparameters = {'epochs': 130, 'batch-size': 512, 'learning-rate': 0.0002}
metrics = .. some regex here

tf_estimator = TensorFlow(entry_point='train.py',
                          role=get_execution_role(),
                          train_instance_count=1,
                          train_instance_type='ml.p2.xlarge',
                          train_max_run=172800,
                          output_path=s3_output_location,
                          framework_version='2.3.0',
                          py_version='py37',
                          metric_definitions=metrics,
                          hyperparameters=hyperparameters,
                          source_dir="data")

inputs = {'train': train_data, 'test': validation_data}
myJobName = 'myname'
tf_estimator.fit(inputs=inputs, job_name=myJobName)
```

Edited by: rokk07 on Jan 25, 2021 2:55 AM
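The `sagemaker-debugger-rules` image is the give-away: that companion processing job runs the built-in Debugger/Profiler rules that newer SDK versions enable by default for training jobs. A hedged sketch of turning them off; the exact parameters depend on your SDK version, so verify them against the version you have installed:

```python
from sagemaker.tensorflow import TensorFlow
from sagemaker import get_execution_role

# Opt out of the default Debugger hook and profiler so no companion
# sagemaker-debugger-rules processing job is launched alongside training.
tf_estimator = TensorFlow(
    entry_point="train.py",
    role=get_execution_role(),
    instance_count=1,                 # train_instance_count in older SDKs
    instance_type="ml.p2.xlarge",
    framework_version="2.3.0",
    py_version="py37",
    source_dir="data",
    debugger_hook_config=False,
    disable_profiler=True,
)
```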
1
answers
0
votes
0
views
rokk07
asked a year ago

[AI/ML] Data acquisition and preprocessing

Hi, a customer who loads e-bike data to S3 wants to get AI/ML insight from the sensor data. The e-bike sensor data are files of about 4KB each, posted in S3 buckets. The sensor data is put into a format like this:

```
timestamp1, sensorA, sensorB, sensorC, ..., sensorZ
timestamp2, sensorA, sensorB, sensorC, ..., sensorZ
timestamp3, sensorA, sensorB, sensorC, ..., sensorZ
...
```

Then these sensor data are put into one file of about 4KB in size. The plan I have is to:

* Read S3 objects.
* Parse each S3 object with Lambda. I thought about Glue, but wanted to put the data in DynamoDB, which Glue does not seem to support. Also, Glue seems to be more expensive.
* Put the data in DynamoDB with bike ID as the partition key and timestamp as the sort key (see the sketch after this question).
* Use SageMaker to learn from the DynamoDB data. There will be a separate discussion on choosing which model and making time-series inferences.
* If we need to re-learn, it will use the DynamoDB data, not S3. I think it will be faster to get data from DynamoDB than from the raw S3 data.
* Also, I think we can filter out some bad input or apply small modifications to the DynamoDB data (shifting timestamps to the correct time, etc.).
* Make inference output based on the model.

What do you think? Would you agree? Would you approach the problem differently? Would you rather learn from S3 directly via Athena or direct S3 access? Or would you rather use Glue and Redshift? The data, at about 100MB, would be sufficient to train the model we have in mind, so Glue and Redshift may be overkill. Currently, the Korea region does not support the Timestream database, so the closest time-series database option in Korea could be DynamoDB. Please share your thoughts. Thanks!
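A minimal sketch of the proposed table layout, assuming a table keyed by bike ID (partition key) and timestamp (sort key); the table, attribute names, and sample values are placeholders:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ebike-sensor-data")  # placeholder table name

# One parsed 4KB S3 object yields rows of (timestamp, sensorA, ..., sensorZ).
rows = [
    {"bike_id": "bike-001", "ts": "2021-06-01T08:00:00Z", "sensorA": "12.4", "sensorB": "0.7"},
    {"bike_id": "bike-001", "ts": "2021-06-01T08:00:01Z", "sensorA": "12.6", "sensorB": "0.6"},
]

# batch_writer buffers and retries PutItem calls; with bike_id as the
# partition key and ts as the sort key, per-bike time ranges can be queried
# directly when assembling training data.
with table.batch_writer() as writer:
    for row in rows:
        writer.put_item(Item=row)
```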
1
answers
0
votes
1
views
AWS-User-6598922
asked a year ago

Receiving consistent AccessDenied errors

I am trying to use SageMaker Notebook Instances, but consistently receive AccessDenied errors for commands that my IAM role should have access to (and for commands that worked the last time I tried several weeks ago). For example:

`aws s3 ls` results in `An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied` despite my role having the AmazonS3FullAccess policy attached.

Also, `aws ecr describe-repositories --repository-names "sagemaker-decision-trees"` results in `An error occurred (AccessDeniedException) when calling the DescribeRepositories operation: User: arn:aws:sts::XXXXXXXXXX:assumed-role/AmazonSageMaker-ExecutionRole-20201123T151452/SageMaker is not authorized to perform: ecr:DescribeRepositories on resource: arn:aws:ecr:us-east-2:XXXXXXXXXX:repository/sagemaker-decision-trees with an explicit deny` despite my role having the AmazonEC2ContainerRegistryFullAccess policy attached.

One thing that seems new is that "SageMaker" is appended to my user ARN. I can't remember seeing errors with this appended before.

Note: I've replicated these errors with several combinations of configurations:

* a new IAM role (which I created in the SageMaker console to have AmazonSageMakerFullAccess to any S3 bucket)
* a fresh notebook instance
* with (and without) a VPC

Also, these commands all work when run outside of a notebook instance (i.e. when run locally from my laptop). I'm guessing there's some problem with my account setup, but not sure what to try next. Thanks.

Edited by: DJAIndeed on Nov 24, 2020 8:35 AM
2
answers
0
votes
0
views
DJAIndeed
asked a year ago

What value should I set for directory_path for the Amazon SageMaker SDK with FSx as data source?

What value should I set for the **directory_path** parameter in **FileSystemInput** for the Amazon SageMaker SDK? Here is some information about my Amazon FSx for Lustre file system:

- My FSx ID is `fs-0684xxxxxxxxxxx`.
- My FSx has the mount name `lhskdbmv`.
- The FSx maps to an Amazon S3 bucket with files (without extra prefixes in their keys).

My attempts to describe the job and the results are the following:

**Attempt 1:**

```
fs = FileSystemInput(
    file_system_id='fs-0684xxxxxxxxxxx',
    file_system_type='FSxLustre',
    directory_path='lhskdbmv',
    file_system_access_mode='ro')
```

**Result:** `estimator.fit(fs)` returns `ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: FileSystem DirectoryPath 'lhskdbmv' for channel 'training' is not absolute or normalized. Please ensure you don't have a trailing "/", and/or "..", ".", "//" in the path.`

**Attempt 2:**

```
fs = FileSystemInput(
    file_system_id='fs-0684xxxxxxxxxxx',
    file_system_type='FSxLustre',
    directory_path='/',
    file_system_access_mode='ro')
```

**Result:** `ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: The directory path for FSx Lustre file system fs-068406952bf758bac is invalid. The directory path must begin with mount name of the file system.`

**Attempt 3:**

```
fs = FileSystemInput(
    file_system_id='fs-0684xxxxxxxxxxx',
    file_system_type='FSxLustre',
    directory_path='fsx',
    file_system_access_mode='ro')
```

**Result:** `ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: FileSystem DirectoryPath 'fsx' for channel 'training' is not absolute or normalized. Please ensure you don't have a trailing "/", and/or "..", ".", "//" in the path.`
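Reading the two validation errors together, the path apparently has to be both absolute and begin with the mount name, which suggests `'/<mount-name>'`. A hedged sketch of that combination, reusing the placeholder IDs from the question (verify against the FileSystemInput documentation for your SDK version):

```python
from sagemaker.inputs import FileSystemInput

# Absolute path starting with the Lustre mount name, e.g. "/lhskdbmv",
# optionally followed by a sub-directory such as "/lhskdbmv/train".
fs = FileSystemInput(
    file_system_id="fs-0684xxxxxxxxxxx",
    file_system_type="FSxLustre",
    directory_path="/lhskdbmv",
    file_system_access_mode="ro",
)

# estimator.fit(fs)   # as in the question
```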
1
answers
0
votes
2
views
EXPERT
Olivier_CR
asked a year ago