
Questions tagged with Amazon SageMaker



Not able to convert Hugging Face fine-tuned BERT model into AWS Neuron

Hi Team, I have a fine-tuned BERT model which was trained using the following libraries:

- torch == 1.8.1+cu111
- transformers == 4.19.4

I am not able to convert that fine-tuned BERT model into AWS Neuron and am getting the following compilation errors. Could you please help me resolve this issue?

**Note:** I am trying to compile the BERT model on a SageMaker notebook instance with the `conda_python3` conda environment.

**Installation:**

```
# Set pip repository to point to the Neuron repository
!pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

# Install Neuron PyTorch - Note: tried both options below.
# !pip install torch-neuron==1.8.1.* neuron-cc[tensorflow] "protobuf<4" torchvision sagemaker>=2.79.0 transformers==4.17.0 --upgrade
!pip install --upgrade torch-neuron neuron-cc[tensorflow] "protobuf<4" torchvision
```

**Model compilation:**

```python
import os
import tensorflow  # to work around a protobuf version conflict issue
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = 'model/'  # model artifacts are stored in the 'model/' directory

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, torchscript=True)

# create dummy input for max length 128
dummy_input = "dummy input which will be padded later"
max_length = 128
embeddings = tokenizer(dummy_input, max_length=max_length, padding="max_length",
                       truncation=True, return_tensors="pt")
neuron_inputs = tuple(embeddings.values())

# compile model with torch.neuron.trace and update config
model_neuron = torch.neuron.trace(model, neuron_inputs)
model.config.update({"traced_sequence_length": max_length})

# save tokenizer, neuron model and config for later use
save_dir = "tmpd"
os.makedirs(save_dir, exist_ok=True)
model_neuron.save(os.path.join(save_dir, "neuron_model.pt"))
tokenizer.save_pretrained(save_dir)
model.config.save_pretrained(save_dir)
```

**Model artifacts:** We got these model artifacts from a multi-label topic classification model:

- config.json
- model.tar.gz
- pytorch_model.bin
- special_tokens_map.json
- tokenizer_config.json
- tokenizer.json

**Error logs:**

```
INFO:Neuron:There are 3 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 565, fused = 548, percent fused = 96.99%
INFO:Neuron:Number of neuron graph operations 1601 did not match traced graph 1323 - using heuristic matching of hierarchical information
WARNING:tensorflow:From /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch_neuron/ops/aten.py:2022: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:Neuron:Compiling function _NeuronGraph$698 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ec2-user/anaconda3/envs/python3/bin/neuron-cc compile /tmp/tmpv4gg13ze/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpv4gg13ze/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Linear_5/aten_linear/Add:0"]} --verbose 35'
INFO:Neuron:Compile command returned: -9
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$698; falling back to native python function call
ERROR:Neuron:neuron-cc failed with the following command line call:
/home/ec2-user/anaconda3/envs/python3/bin/neuron-cc compile /tmp/tmpv4gg13ze/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpv4gg13ze/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Linear_5/aten_linear/Add:0"]}' --verbose 35
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch_neuron/convert.py", line 382, in op_converter
    item, inputs, compiler_workdir=sg_workdir, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch_neuron/decorators.py", line 220, in trace
    'neuron-cc failed with the following command line call:\n{}'.format(command))
subprocess.SubprocessError: neuron-cc failed with the following command line call:
/home/ec2-user/anaconda3/envs/python3/bin/neuron-cc compile /tmp/tmpv4gg13ze/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpv4gg13ze/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Linear_5/aten_linear/Add:0"]}' --verbose 35
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 565, compiled = 0, percent compiled = 0.0%
INFO:Neuron:The neuron partitioner created 1 sub-graphs
INFO:Neuron:Neuron successfully compiled 0 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 0.0%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron:Not compiled operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 97 [supported]
INFO:Neuron: => aten::add: 39 [supported]
INFO:Neuron: => aten::contiguous: 12 [supported]
INFO:Neuron: => aten::div: 12 [supported]
INFO:Neuron: => aten::dropout: 38 [supported]
INFO:Neuron: => aten::embedding: 3 [not supported]
INFO:Neuron: => aten::gelu: 12 [supported]
INFO:Neuron: => aten::layer_norm: 25 [supported]
INFO:Neuron: => aten::linear: 74 [supported]
INFO:Neuron: => aten::matmul: 24 [supported]
INFO:Neuron: => aten::mul: 1 [supported]
INFO:Neuron: => aten::permute: 48 [supported]
INFO:Neuron: => aten::rsub: 1 [supported]
INFO:Neuron: => aten::select: 1 [supported]
INFO:Neuron: => aten::size: 97 [supported]
INFO:Neuron: => aten::slice: 5 [supported]
INFO:Neuron: => aten::softmax: 12 [supported]
INFO:Neuron: => aten::tanh: 1 [supported]
INFO:Neuron: => aten::to: 1 [supported]
INFO:Neuron: => aten::transpose: 12 [supported]
INFO:Neuron: => aten::unsqueeze: 2 [supported]
INFO:Neuron: => aten::view: 48 [supported]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-97bba321d013> in <module>
     18
     19 # compile model with torch.neuron.trace and update config
---> 20 model_neuron = torch.neuron.trace(model, neuron_inputs)
     21 model.config.update({"traced_sequence_length": max_length})
     22
~/anaconda3/envs/python3/lib/python3.6/site-packages/torch_neuron/convert.py in trace(func, example_inputs, fallback, op_whitelist, minimum_segment_size, subgraph_builder_function, subgraph_inputs_pruning, skip_compiler, debug_must_trace, allow_no_ops_on_neuron, compiler_workdir, dynamic_batch_size, compiler_timeout, _neuron_trace, compiler_args, optimizations, verbose, **kwargs)
    182             logger.debug("skip_inference_context - trace with fallback at {}".format(get_file_and_line()))
    183         neuron_graph = cu.compile_fused_operators(neuron_graph, **compile_kwargs)
--> 184         cu.stats_post_compiler(neuron_graph)
    185
    186         # Wrap the compiled version of the model in a script module. Note that this is
~/anaconda3/envs/python3/lib/python3.6/site-packages/torch_neuron/convert.py in stats_post_compiler(self, neuron_graph)
    491         if succesful_compilations == 0 and not self.allow_no_ops_on_neuron:
    492             raise RuntimeError(
--> 493                 "No operations were successfully partitioned and compiled to neuron for this model - aborting trace!")
    494
    495         if percent_operations_compiled < 50.0:
RuntimeError: No operations were successfully partitioned and compiled to neuron for this model - aborting trace!
```

Thanks a lot.
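A side note on the key line `Compile command returned: -9`: Python's `subprocess` reports a child killed by a signal as the negative signal number, and signal 9 is SIGKILL, which on Linux is most often the kernel's OOM killer terminating `neuron-cc` when the notebook instance runs out of memory. A minimal sketch (no Neuron dependency) demonstrating that convention:

```python
import signal
import subprocess
import sys

# A child process that kills itself with SIGKILL, the way the OOM killer would.
proc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

# subprocess reports death-by-signal as a negative returncode: -9 == -SIGKILL.
print(proc.returncode)   # -9
```

If that reading is right, retrying the compilation on a notebook instance with more RAM would be the first thing to try.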
0 answers · 0 votes · 5 views · asked 17 hours ago

Unable to configure SageMaker execution Role with access to S3 bucket in another AWS account

**Requirement:** Create a SageMaker GroundTruth labeling job with the input/output location pointing to an S3 bucket in another AWS account.

**High-level steps followed** (let's say *Account_A*: SageMaker GroundTruth labeling job, and *Account_B*: S3 bucket):

1. Create role *AmazonSageMaker-ExecutionRole* in *Account_A* with 3 policies attached:
   * AmazonSageMakerFullAccess
   * Account_B_S3_AccessPolicy: policy with the necessary S3 permissions to access the S3 bucket in Account_B
   * AssumeRolePolicy: assume-role policy for *arn:aws:iam::Account_B:role/Cross-Account-S3-Access-Role*
2. Create role *Cross-Account-S3-Access-Role* in *Account_B* with 1 policy and 1 trust relationship attached:
   * S3_AccessPolicy: policy with the necessary S3 permissions to access the S3 bucket in Account_B
   * TrustRelationship: for principal *arn:aws:iam::Account_A:role/AmazonSageMaker-ExecutionRole*

**Error:** While trying to create the SageMaker GroundTruth labeling job with the IAM role *AmazonSageMaker-ExecutionRole*, it throws: *AccessDenied: Access Denied - The S3 bucket 'Account_B_S3_bucket_name' you entered in Input dataset location cannot be reached. Either the bucket does not exist, or you do not have permission to access it. If the bucket does not exist, update Input dataset location with a new S3 URI. If the bucket exists, give the IAM entity you are using to create this labeling job permission to read and write to this S3 bucket, and try your request again.*
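One hedged guess: the labeling job accesses the bucket with the execution role's own credentials and does not perform the `sts:AssumeRole` hop into Account_B, so the chained role never comes into play. The usual cross-account pattern is a bucket policy on the Account_B bucket that grants the Account_A role direct access. A sketch of such a policy (account ID, role name, and bucket name are placeholders):

```python
import json

# Hypothetical bucket policy attached to the Account_B bucket, granting the
# Account_A execution role direct read/write access (no role chaining needed).
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSageMakerExecutionRole",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::ACCOUNT_A_ID:role/AmazonSageMaker-ExecutionRole"
            },
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::account-b-bucket",
                "arn:aws:s3:::account-b-bucket/*",
            ],
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

With a policy like this in place, the assume-role policy and the Account_B role may not be needed at all for the labeling job itself.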
2 answers · 0 votes · 68 views · asked 15 days ago

Inconsistent keras model.summary() output shapes on AWS SageMaker and EC2

I have the following model in a jupyter notebook:

```python
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import layers

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

SIZE = (549, 549)
SHUFFLE = False
BATCH = 32
EPOCHS = 20

# DataGenerator is a custom data generator defined elsewhere in the notebook
train_datagen = DataGenerator(train_files, batch_size=BATCH, dim=SIZE, n_channels=1, shuffle=SHUFFLE)
test_datagen = DataGenerator(test_files, batch_size=BATCH, dim=SIZE, n_channels=1, shuffle=SHUFFLE)

inp = layers.Input(shape=(*SIZE, 1))
x = layers.Conv2D(filters=549, kernel_size=(5, 5), padding="same", activation="relu")(inp)
x = layers.BatchNormalization()(x)
x = layers.Conv2D(filters=549, kernel_size=(3, 3), padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Conv2D(filters=549, kernel_size=(1, 1), padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Conv2D(filters=549, kernel_size=(3, 3), padding="same", activation="sigmoid")(x)

model = Model(inp, x)
model.compile(loss=tf.keras.losses.binary_crossentropy, optimizer=Adam())
model.summary()
```

SageMaker and EC2 are both running tensorflow 2.7.1. The EC2 instance is a p3.2xlarge with the Deep Learning AMI GPU TensorFlow 2.7.0 (Amazon Linux 2) 20220607. The SageMaker notebook uses ml.p3.2xlarge and the conda_tensorflow2_p38 kernel. The notebook is in an FSx Lustre file system that is mounted to both SageMaker and EC2, so it is definitely the same code running on both machines.

nvidia-smi output on SageMaker:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    24W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

nvidia-smi output on EC2:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P0    51W / 300W |   2460MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11802      C   /bin/python3.8                    537MiB |
|    0   N/A  N/A     26391      C   python3.8                        1921MiB |
+-----------------------------------------------------------------------------+
```

The model.summary() output on SageMaker is:

```
Model: "model"
_________________________________________________________________
 Layer (type)                     Output Shape              Param #
=================================================================
 input_1 (InputLayer)             [(None, 549, 549, 1)]     0
 conv2d (Conv2D)                  (None, 549, 549, 1)       7535574
 batch_normalization (BatchNormalization)
                                  (None, 549, 549, 1)       4
 conv2d_1 (Conv2D)                (None, 549, 549, 1)       2713158
 batch_normalization_1 (BatchNormalization)
                                  (None, 549, 549, 1)       4
 conv2d_2 (Conv2D)                (None, 549, 549, 1)       301950
 batch_normalization_2 (BatchNormalization)
                                  (None, 549, 549, 1)       4
 conv2d_3 (Conv2D)                (None, 549, 549, 1)       2713158
=================================================================
Total params: 13,263,852
Trainable params: 13,263,846
Non-trainable params: 6
```

The model.summary() output on EC2 is (notice the shape change):

```
Model: "model"
_________________________________________________________________
 Layer (type)                     Output Shape              Param #
=================================================================
 input_1 (InputLayer)             [(None, 549, 549, 1)]     0
 conv2d (Conv2D)                  (None, 549, 549, 549)     14274
 batch_normalization (BatchNormalization)
                                  (None, 549, 549, 549)     2196
 conv2d_1 (Conv2D)                (None, 549, 549, 549)     2713158
 batch_normalization_1 (BatchNormalization)
                                  (None, 549, 549, 549)     2196
 conv2d_2 (Conv2D)                (None, 549, 549, 549)     301950
 batch_normalization_2 (BatchNormalization)
                                  (None, 549, 549, 549)     2196
 conv2d_3 (Conv2D)                (None, 549, 549, 549)     2713158
=================================================================
Total params: 5,749,128
Trainable params: 5,745,834
Non-trainable params: 3,294
_________________________________________________________________
```

One other interesting thing: if I change my model on the EC2 instance to:

```python
inp = layers.Input(shape=(*SIZE, 1))
x = layers.Conv2D(filters=1, kernel_size=(5, 5), padding="same", activation="relu")(inp)
x = layers.BatchNormalization()(x)
x = layers.Conv2D(filters=1, kernel_size=(3, 3), padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Conv2D(filters=1, kernel_size=(1, 1), padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Conv2D(filters=1, kernel_size=(3, 3), padding="same", activation="sigmoid")(x)

model = Model(inp, x)
model.compile(loss=tf.keras.losses.binary_crossentropy, optimizer=Adam())
```

my model.summary() output becomes:

```
Model: "model_2"
_________________________________________________________________
 Layer (type)                     Output Shape              Param #
=================================================================
 input_3 (InputLayer)             [(None, 549, 549, 1)]     0
 conv2d_8 (Conv2D)                (None, 549, 549, 1)       26
 batch_normalization_6 (BatchNormalization)
                                  (None, 549, 549, 1)       4
 conv2d_9 (Conv2D)                (None, 549, 549, 1)       10
 batch_normalization_7 (BatchNormalization)
                                  (None, 549, 549, 1)       4
 conv2d_10 (Conv2D)               (None, 549, 549, 1)       2
 batch_normalization_8 (BatchNormalization)
                                  (None, 549, 549, 1)       4
 conv2d_11 (Conv2D)               (None, 549, 549, 1)       10
=================================================================
Total params: 60
Trainable params: 54
Non-trainable params: 6
_________________________________________________________________
```

In the last model the shape is similar to SageMaker, but the trainable parameter count is very low. Any ideas as to why the output shape is different and why this is happening with the filters? When I run this model on my personal computer, the shape is the same as on EC2. I think there might be an issue with SageMaker.
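The parameter counts in the two summaries suggest a data-format mismatch rather than a SageMaker bug: a Conv2D layer has `filters * (kH * kW * in_channels + 1)` parameters, and the SageMaker numbers match exactly what Keras would compute if it read the input as `channels_first` (549 input channels) while EC2 reads it as `channels_last` (1 input channel). Checking `tf.keras.backend.image_data_format()` (or `~/.keras/keras.json`) in the conda_tensorflow2_p38 kernel would confirm or rule this out. The arithmetic, as a sanity check:

```python
def conv2d_params(filters, kernel, in_channels):
    # weights: kernel * kernel * in_channels per filter, plus one bias per filter
    return filters * (kernel * kernel * in_channels + 1)

# channels_last (EC2): the trailing 1 in (549, 549, 1) is the channel axis
print(conv2d_params(549, 5, 1))    # 14274   -- matches the EC2 conv2d layer

# channels_first (suspected on SageMaker): the leading 549 is read as channels
print(conv2d_params(549, 5, 549))  # 7535574 -- matches the SageMaker conv2d layer
```

The 3x3 layers agree too: `conv2d_params(549, 3, 549)` gives 2,713,158, the value both summaries show once the channel counts line up that way.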
0 answers · 0 votes · 10 views · asked 19 days ago

FSxLustre FileSystemInput in Sagemaker TrainingJob leads to: InternalServerError

We are submitting a SageMaker training job with the SageMaker SDK and a custom docker image. The job finishes successfully for EFS [FileSystemInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.FileSystemInput) or [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput). Trying to use FileSystemInput with an FSxLustre configuration leads to the training job dying during the `Preparing the instances for training` stage:

```
InternalServerError: We encountered an internal error. Please try again.
```

This error is persistent upon re-submission. What we have figured out so far:

- the job errors before the training image is downloaded.
- specifying an invalid mount point leads to a proper error: `ClientError: Unable to mount file system: xxx directory path: yyy. Incorrect mount path. Please ensure the mount path specified exists on the filesystem.`
- the job finishes successfully when running locally with docker-compose ([Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase) with `instance_type="local"`).
- we can mount the FSx file system on an EC2 instance with the training job's VPC and security group.

How can we narrow the problem down further and get more information about the failure reason? Can you suggest likely problems that could cause this behavior?
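For reference, the FSx channel that FileSystemInput ultimately produces in CreateTrainingJob has the shape sketched below. One thing worth double-checking is that `DirectoryPath` begins with the file system's *mount name* (the random-looking token FSx assigns, e.g. `/abcdefgh/...`), since a path that merely exists when mounted on EC2 is not necessarily rooted the way SageMaker expects. All values here are hypothetical:

```python
# Hypothetical FSx for Lustre data source, in the shape CreateTrainingJob expects.
fsx_data_source = {
    "FileSystemDataSource": {
        "FileSystemId": "fs-0123456789abcdef0",   # placeholder file system ID
        "FileSystemType": "FSxLustre",
        "FileSystemAccessMode": "ro",
        # Should begin with the FSx mount name, not just any absolute path
        # that happens to resolve on a manually mounted EC2 instance.
        "DirectoryPath": "/abcdefgh/training-data",
    }
}

mount_name = fsx_data_source["FileSystemDataSource"]["DirectoryPath"].split("/")[1]
print(mount_name)  # abcdefgh
```

Beyond that, `describe-training-job` sometimes carries a more specific `FailureReason` than the console surfaces, so it may be worth inspecting the failed job through the API as well.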
1 answer · 0 votes · 18 views · asked 21 days ago

Deploy YOLOv5 in sagemaker - ModelError: InvokeEndpoint operation: Received server error (0)

I'm trying to deploy a custom-trained YOLOv5 model in SageMaker for inference. (Note: the model was not trained in SageMaker.) I followed this doc for deploying the model and the inference script: [SageMaker docs](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#bring-your-own-model)

```
ModelError                                Traceback (most recent call last)
<ipython-input-7-063ca701eab7> in <module>
----> 1 result1=predictor.predict("FILE0032.JPG")
      2 print(result1)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant, inference_id)
    159             data, initial_args, target_model, target_variant, inference_id
    160         )
--> 161         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
    162         return self._handle_response(response)
    163
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    399                 "%s() only accepts keyword arguments." % py_operation_name)
    400             # The "self" in this scope is referring to the BaseClient.
--> 401             return self._make_api_call(operation_name, kwargs)
    402
    403         _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    729                 error_code = parsed_response.get("Error", {}).get("Code")
    730                 error_class = self.exceptions.from_code(error_code)
--> 731                 raise error_class(parsed_response, operation_name)
    732             else:
    733                 return parsed_response
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://ap-south-1.console.aws.amazon.com/cloudwatch/home?region=ap-south-1#logEventViewer:group=/aws/sagemaker/Endpoints/pytorch-inference-2022-06-14-11-58-04-086 in account 772044684908 for more information.
```

After researching `InvokeEndpoint`, I tried this:

```python
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name='ap-south-1')
endpoint_name = 'pytorch-inference-2022-06-14-11-58-04-086'

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=bytes('{"features": ["This is great!"]}', 'utf-8')  # Replace with your own data.
)
print(response['Body'].read().decode('utf-8'))
```

But this didn't help either; detailed output:

```
ReadTimeoutError                          Traceback (most recent call last)
<ipython-input-8-b5ca204734c4> in <module>
     12 response = sagemaker_runtime.invoke_endpoint(
     13     EndpointName=endpoint_name,
---> 14     Body=bytes('{"features": ["This is great!"]}', 'utf-8')  # Replace with your own data.
     15 )
     16
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    399                 "%s() only accepts keyword arguments." % py_operation_name)
    400             # The "self" in this scope is referring to the BaseClient.
--> 401             return self._make_api_call(operation_name, kwargs)
    402
    403         _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    716             apply_request_checksum(request_dict)
    717             http, parsed_response = self._make_request(
--> 718                 operation_model, request_dict, request_context)
    719
    720         self.meta.events.emit(
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _make_request(self, operation_model, request_dict, request_context)
    735     def _make_request(self, operation_model, request_dict, request_context):
    736         try:
--> 737             return self._endpoint.make_request(operation_model, request_dict)
    738         except Exception as e:
    739             self.meta.events.emit(
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/endpoint.py in make_request(self, operation_model, request_dict)
    105         logger.debug("Making request for %s with params: %s",
    106                      operation_model, request_dict)
--> 107         return self._send_request(request_dict, operation_model)
    108
    109     def create_request(self, params, operation_model=None):
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/endpoint.py in _send_request(self, request_dict, operation_model)
    182             request, operation_model, context)
    183         while self._needs_retry(attempts, operation_model, request_dict,
--> 184                                 success_response, exception):
    185             attempts += 1
    186             self._update_retries_context(
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/endpoint.py in _needs_retry(self, attempts, operation_model, request_dict, response, caught_exception)
    306             event_name, response=response, endpoint=self,
    307             operation=operation_model, attempts=attempts,
--> 308             caught_exception=caught_exception, request_dict=request_dict)
    309         handler_response = first_non_none_response(responses)
    310         if handler_response is None:
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/hooks.py in emit(self, event_name, **kwargs)
    356     def emit(self, event_name, **kwargs):
    357         aliased_event_name = self._alias_event_name(event_name)
--> 358         return self._emitter.emit(aliased_event_name, **kwargs)
    359
    360     def emit_until_response(self, event_name, **kwargs):
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/hooks.py in emit(self, event_name, **kwargs)
    227         handlers.
    228         """
--> 229         return self._emit(event_name, kwargs)
    230
    231     def emit_until_response(self, event_name, **kwargs):
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/hooks.py in _emit(self, event_name, kwargs, stop_on_response)
    210         for handler in handlers_to_call:
    211             logger.debug('Event %s: calling handler %s', event_name, handler)
--> 212             response = handler(**kwargs)
    213             responses.append((handler, response))
    214             if stop_on_response and response is not None:
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/retryhandler.py in __call__(self, attempts, response, caught_exception, **kwargs)
    192         checker_kwargs.update({'retries_context': retries_context})
    193
--> 194         if self._checker(**checker_kwargs):
    195             result = self._action(attempts=attempts)
    196             logger.debug("Retry needed, action of: %s", result)
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/retryhandler.py in __call__(self, attempt_number, response, caught_exception, retries_context)
    266
    267         should_retry = self._should_retry(attempt_number, response,
--> 268                                           caught_exception)
    269         if should_retry:
    270             if attempt_number >= self._max_attempts:
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/retryhandler.py in _should_retry(self, attempt_number, response, caught_exception)
    292             # If we've exceeded the max attempts we just let the exception
    293             # propogate if one has occurred.
--> 294             return self._checker(attempt_number, response, caught_exception)
    295
    296
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/retryhandler.py in __call__(self, attempt_number, response, caught_exception)
    332         for checker in self._checkers:
    333             checker_response = checker(attempt_number, response,
--> 334                                        caught_exception)
    335             if checker_response:
    336                 return checker_response
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/retryhandler.py in __call__(self, attempt_number, response, caught_exception)
    232         elif caught_exception is not None:
    233             return self._check_caught_exception(
--> 234                 attempt_number, caught_exception)
    235         else:
    236             raise ValueError("Both response and caught_exception are None.")
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/retryhandler.py in _check_caught_exception(self, attempt_number, caught_exception)
    374         # the MaxAttemptsDecorator is not interested in retrying the exception
    375         # then this exception just propogates out past the retry code.
--> 376         raise caught_exception
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/endpoint.py in _do_get_response(self, request, operation_model, context)
    247             http_response = first_non_none_response(responses)
    248             if http_response is None:
--> 249                 http_response = self._send(request)
    250         except HTTPClientError as e:
    251             return (None, e)
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/endpoint.py in _send(self, request)
    319
    320     def _send(self, request):
--> 321         return self.http_session.send(request)
    322
    323
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/httpsession.py in send(self, request)
    449             raise ConnectTimeoutError(endpoint_url=request.url, error=e)
    450         except URLLib3ReadTimeoutError as e:
--> 451             raise ReadTimeoutError(endpoint_url=request.url, error=e)
    452         except ProtocolError as e:
    453             raise ConnectionClosedError(
ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.ap-south-1.amazonaws.com/endpoints/pytorch-inference-2022-06-14-11-58-04-086/invocations"
```
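One detail that stands out in the first snippet: `predictor.predict("FILE0032.JPG")` sends the filename *string* as the payload, not the image, so the container may be hanging while trying to decode it until the invocation times out. A hedged sketch of sending the actual bytes with an explicit content type (the endpoint name is from the traceback; the content type assumes `input_fn` accepts `application/x-image`):

```python
def load_image_bytes(path):
    """Read the raw bytes of an image file; this is what the container should receive."""
    with open(path, "rb") as f:
        return f.read()


def invoke_with_image(endpoint_name, image_path, region="ap-south-1"):
    import boto3  # deferred so load_image_bytes stays usable without AWS access

    client = boto3.client("sagemaker-runtime", region_name=region)
    return client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/x-image",  # must match what input_fn expects
        Body=load_image_bytes(image_path),
    )
```

If the timeout persists even with correct bytes, the CloudWatch log group from the error message is the place to see whether the model ever loaded inside the container.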
2 answers · 0 votes · 22 views · asked 22 days ago

Extending Docker image for SageMaker Inference

I'm trying to create my own Docker image for use with SageMaker Batch Transform by extending an existing one. Following the documentation at https://docs.aws.amazon.com/sagemaker/latest/dg/prebuilt-containers-extend.html, I have created the following to run Detectron2:

```
FROM 763104351884.dkr.ecr.eu-west-2.amazonaws.com/pytorch-inference:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker

############# Installing latest builds ############
RUN pip install --upgrade torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/torch_stable.html

ENV FORCE_CUDA="1"
# Build D2 only for Turing (G4) and Volta (P3) architectures. Use P3 for batch transforms and G4 for inference on endpoints
ENV TORCH_CUDA_ARCH_LIST="Turing;Volta"

# Install Detectron2
RUN pip install \
    --no-cache-dir pycocotools~=2.0.0 \
    --no-cache-dir https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/detectron2-0.6%2Bcu113-cp38-cp38-linux_x86_64.whl

# Set a fixed model cache directory. Detectron2 requirement
ENV FVCORE_CACHE="/tmp"

############# SageMaker section ##############
ENV PATH="/opt/ml/code:${PATH}"
COPY inference.py /opt/ml/code/inference.py
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM inference.py
```

I then create a model (`create-model`) with this image using the following configuration:

```
{
    "ExecutionRoleArn": "arn:aws:iam::[redacted]:role/model-role",
    "ModelName": "model-test",
    "PrimaryContainer": {
        "Environment": {
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/code",
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_REGION": "eu-west-2",
            "MMS_DEFAULT_RESPONSE_TIMEOUT": "500"
        },
        "Image": "[redacted].dkr.ecr.eu-west-2.amazonaws.com/my-image:latest",
        "ModelDataUrl": "s3://[redacted]/training/output/model.tar.gz"
    }
}
```

And submit a batch transform job (`create-transform-job`) using the following configuration:

```
{
    "MaxPayloadInMB": 16,
    "ModelName": "model-test",
    "TransformInput": {
        "ContentType": "application/x-image",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "ManifestFile",
                "S3Uri": "s3://[redacted]/manifests/input.manifest"
            }
        }
    },
    "TransformJobName": "transform-test",
    "TransformOutput": {
        "S3OutputPath": "s3://[redacted]/predictions/"
    },
    "TransformResources": {
        "InstanceCount": 1,
        "InstanceType": "ml.m5.large"
    }
}
```

Both of the above commands submit fine, but the transform job doesn't complete. When I look in the logs, the errors I'm getting seem to indicate that it's not using my inference script (`inference.py`, specified above) but is instead using the default script (`default_pytorch_inference_handler.py`) and therefore can't find the model. What am I missing so that it uses my inference script, and hence my model?
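A possible explanation, offered as a guess: with the prebuilt PyTorch inference image, the entry point baked into the image competes with the contents of `model.tar.gz`, and the toolkit most reliably picks up a custom handler when the archive itself contains a `code/` directory with the script. A sketch of packaging the archive that way (file names are placeholders):

```python
import os
import tarfile
import tempfile

# Build a model.tar.gz with the layout the PyTorch inference toolkit recognizes:
#   model.pth            <- model weights (placeholder name)
#   code/inference.py    <- entry point defining model_fn / input_fn / etc.
workdir = tempfile.mkdtemp()
os.makedirs(os.path.join(workdir, "code"))
open(os.path.join(workdir, "model.pth"), "wb").close()  # dummy weights file
with open(os.path.join(workdir, "code", "inference.py"), "w") as f:
    f.write("# model_fn / input_fn / predict_fn / output_fn go here\n")

archive = os.path.join(workdir, "model.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(os.path.join(workdir, "model.pth"), arcname="model.pth")
    tar.add(os.path.join(workdir, "code"), arcname="code")

with tarfile.open(archive) as tar:
    members = sorted(m.name for m in tar.getmembers())
print(members)  # ['code', 'code/inference.py', 'model.pth']
```

If the archive is packaged this way, the `SAGEMAKER_PROGRAM`/`SAGEMAKER_SUBMIT_DIRECTORY` environment overrides may no longer be necessary at all.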
1 answer · 0 votes · 41 views · asked 22 days ago

Sagemaker Pipelines - Batch Transform job using generated predictions as input for the model

Hi all! We're trying to implement a very simple SageMaker Pipeline with 3 steps:

* **ETL:** for now it only runs a simple query
* **Batch transform:** uses the ETL's result and generates predictions with a batch transform job
* **Report:** generates an HTML report

When running the batch transform job alone, everything runs OK. But when running all the steps in a Pipeline, the batch transform job fails. What we see in the logs is that the job takes the dataset generated in the ETL step, generates the predictions and saves them correctly in S3 (this is where we would expect the job to stop), but then it resends those predictions to the endpoint as if they were a new input, so the step fails: the model receives an array of 1 column, mismatching the number of features it was trained with. There's not much info out there on this, and SageMaker is painfully hard to debug. Has anyone experienced anything like this?

Our model and transformer code:

```python
model = XGBoostModel(
    model_data=f"s3://{BUCKET}/{MODEL_ARTIFACTS_PATH}/artifacts.gzip",
    role=get_execution_role(),
    entry_point="predict.py",
    framework_version="1.3-1",
)
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{BUCKET}/{PREDICTIONS_PATH}/",
    accept="text/csv",
)
step = TransformStep(
    name="Batch",
    transformer=transformer,
    inputs=TransformInput(
        data=etl_step.properties.ProcessingOutputConfig.Outputs[
            "dataset"
        ].S3Output.S3Uri,
        content_type="text/csv",
        split_type="Line",
    ),
    depends_on=[etl_step],
)
```

And our inference script:

```python
def input_fn(request_body, content_type):
    return pd.read_csv(StringIO(request_body), header=None).values


def predict_fn(input_obj, model):
    """Function which takes the result of input_fn and generates predictions."""
    return model.predict_proba(input_obj)[:, 1]


def output_fn(predictions, content_type):
    return ",".join(str(pred) for pred in predictions)
```
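Not a confirmed diagnosis of the re-send behaviour, but one detail worth checking in the inference script above: `output_fn` joins all predictions with commas into a single line, while the transform input is split with `split_type="Line"`. A sketch of a line-per-prediction variant (an assumption, not the author's fix):

```python
# Hedged sketch: with split_type="Line" on the input, emitting one prediction
# per line keeps the Batch Transform output aligned with the input rows.
def output_fn(predictions, content_type):
    # one value per line instead of a single comma-joined row
    return "\n".join(str(p) for p in predictions)
```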
1
answers
0
votes
35
views
asked a month ago

SageMaker Multi Model endpoint creation fails while creating for model built on container sagemaker-scikit-learn:0.23-1-cpu-py3

I am working on a use case where I am using a SageMaker multi-model endpoint for inference, with models trained on the Databricks MLflow platform. Deploying a model trained on Databricks MLflow to a single-model SageMaker endpoint worked fine, but creating a multi-model endpoint for the 'sagemaker-scikit-learn:0.23-1-cpu-py3' container fails with the error below.

Code snippet:

```python
name = "sample-mme"
sagemaker_client = boto3.client('sagemaker')
model_path = "s3://test-bucket/multi-models"
execution_role_arn = "IAM://sample-role"
BASE_IMAGE = image_uris.retrieve(
    region=region, framework="sklearn", version='0.23-1', image_scope='inference'
)
container = {
    'Image': BASE_IMAGE,
    'ModelDataUrl': model_path,
    'Mode': 'MultiModel',
    'MultiModelConfig': {
        'ModelCacheSetting': 'Enabled'
    }
}
model_response = sagemaker_client.create_model(
    ModelName=name,
    ExecutionRoleArn=execution_role_arn,
    Containers=[container]
)
config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=f'{name}-config',
    ProductionVariants=[
        {
            'InstanceType': instance_type,
            'InitialInstanceCount': instance_count,
            'InitialVariantWeight': 1,
            'ModelName': name,
            'VariantName': 'AllTraffic'
        }
    ]
)
response = sagemaker_client.create_endpoint(
    EndpointName=f'{name}-endpoint',
    EndpointConfigName=f'{name}-config'
)
```

Endpoint creation takes a long time and fails with the following error message:

sagemaker_containers._errors.ImportModuleError: 'NoneType' object has no attribute 'startswith'

Please provide me with some help to fix this. Also, my understanding is that I can train a model on the Databricks MLflow platform using sklearn libraries, store the model artifacts "model.tar.gz" under the S3 directory holding all the models, create a multi-model endpoint in SageMaker using that S3 directory as the model path with the above code, and, once the endpoint is ready, run inference by providing the target model.
Please let me know if my understanding is correct and share any relevant documents to follow for my use case.
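One hypothesis for the `'NoneType' object has no attribute 'startswith'` import error is that the serving container doesn't know which entry script to load from each `model.tar.gz`. A sketch of a container definition that supplies this via environment variables; the image URI and script name are placeholders, and whether this applies to MLflow-packaged artifacts is an assumption to verify:

```python
# Placeholder image URI (us-east-1 scikit-learn inference image) and script
# name; both are assumptions for illustration.
BASE_IMAGE = "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"

container = {
    "Image": BASE_IMAGE,
    # For MultiModel mode this must be an S3 *prefix* (ending in "/") that
    # holds one model.tar.gz per model, not a path to a single archive.
    "ModelDataUrl": "s3://test-bucket/multi-models/",
    "Mode": "MultiModel",
    "MultiModelConfig": {"ModelCacheSetting": "Enabled"},
    "Environment": {
        # Tell the serving toolkit which script inside each archive to run.
        "SAGEMAKER_PROGRAM": "inference.py",
        "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
    },
}
```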
1
answers
0
votes
30
views
asked a month ago

Sagemaker instances keep awakening and charge the credit

I tried Data Wrangler in SageMaker last month and closed the service. A few weeks later I noticed my credit was being charged $1 every hour, and realized that Data Wrangler auto-saves the flow every minute. So, I deleted the unsaved flow and shut down all the services and instances according to the advice on these two links:

* https://docs.aws.amazon.com/sagemaker/latest/dg/studio-lab-use-shutdown.html
* https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html

Then I left SageMaker untouched for the whole month of May, and just got back to the console yesterday. This is what I found in May's bill: Amazon SageMaker RunInstance $531.74

| Detail | Usage | Total |
| --- | --- | --- |
| $0.00 for Host:ml.m5.xlarge per hour under monthly free tier | 125.000 Hrs | $0.00 |
| $0.00 for Notebk:ml.t2.medium per hour under monthly free tier | 107.056 Hrs | $0.00 |
| $0.00 per Data Wrangler Interactive ml.m5.4xlarge hour under monthly free tier | 25.000 Hrs | $0.00 |
| $0.23 per Hosting ml.m5.xlarge hour in US East (N. Virginia) | 88.997 Hrs | $20.47 |
| $0.922 per Data Wrangler Interactive ml.m5.4xlarge hour in US East (N. Virginia) | 554.521 Hrs | $511.27 |

In another attempt, I installed an extension to automatically shut down idle kernels and set the limit to 10 min, following the advice here: https://aws.amazon.com/blogs/machine-learning/save-costs-by-automatically-shutting-down-idle-resources-within-amazon-sagemaker-studio/

Checking the cost in the usage report, it turns out the service was shut down after installing the extension, but then it woke itself up about 5 hours later (during my sleep time). There's still a cost from Studio, although less than before.

| Service | Operation | UsageType | StartTime | EndTime | UsageValue |
| --- | --- | --- | --- | --- | --- |
| AmazonSageMaker | RunInstance | USE1-Studio_DW:KernelGateway-ml.m5.4xlarge | 5/24/2022 23:00 | 5/25/2022 0:00 | 1 |
| AmazonSageMaker | RunInstance | USE1-Studio_DW:KernelGateway-ml.m5.4xlarge | 5/25/2022 0:00 | 5/25/2022 1:00 | 1 |
| AmazonSageMaker | RunInstance | USE1-Studio_DW:KernelGateway-ml.m5.4xlarge | 5/25/2022 1:00 | 5/25/2022 2:00 | 1 |
| AmazonSageMaker | RunInstance | USE1-Studio_DW:KernelGateway-ml.m5.4xlarge | 5/25/2022 2:00 | 5/25/2022 3:00 | 0.76484417 |
| AmazonSageMaker | RunInstance | USE1-Studio_DW:KernelGateway-ml.m5.4xlarge | 5/25/2022 8:00 | 5/25/2022 9:00 | 0.36636722 |
| AmazonSageMaker | RunInstance | USE1-Studio_DW:KernelGateway-ml.m5.4xlarge | 5/25/2022 9:00 | 5/25/2022 10:00 | 0.38959556 |

During this time, I'm sure there were no running instances, running apps, kernel sessions or terminal sessions. I even deleted the user profile. The last thing I haven't tried is setting up a scheduled shutdown, because I don't think the services should make our lives this difficult. Any advice on an effective way to completely shut down the SageMaker instances? Thanks.
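For anyone wanting to verify programmatically that nothing in Studio is still running, a sketch using the SageMaker `list_apps`/`delete_app` APIs (the domain ID is a placeholder, pagination is omitted, and the `boto3` client creation is commented out so the helper can be exercised with a stub):

```python
# Hedged sketch: delete all in-service KernelGateway apps, the app type that
# Data Wrangler sessions bill against.
def shut_down_kernel_apps(sm, domain_id):
    deleted = []
    for app in sm.list_apps(DomainIdEquals=domain_id)["Apps"]:
        if app["AppType"] == "KernelGateway" and app["Status"] == "InService":
            sm.delete_app(
                DomainId=app["DomainId"],
                UserProfileName=app["UserProfileName"],
                AppType=app["AppType"],
                AppName=app["AppName"],
            )
            deleted.append(app["AppName"])
    return deleted

# import boto3
# shut_down_kernel_apps(boto3.client("sagemaker"), "d-xxxxxxxxxxxx")
```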
1
answers
0
votes
66
views
asked a month ago

Textract to multi column pdf files

I am using the code below, which I took from an example ([https://aws.amazon.com/pt/blogs/machine-learning/automatically-extract-text-and-structured-data-from-documents-with-amazon-textract/]()). In the example it handles only the 2-column case; in the code, where there is a division by 2, if my file has 4 columns, for example, I just change that and it works. But how can I detect the number of columns automatically, so that I no longer need this manual input? In summary, I want to use this code for PDF files that have more than 2 columns. How can I do that?

```python
import boto3

# Document
s3BucketName = "amazon-textract-public-content"
documentName = "blogs/two-column-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

#print(response)

# Detect columns and print lines
columns = []
lines = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        column_found = False
        for index, column in enumerate(columns):
            bbox_left = item["Geometry"]["BoundingBox"]["Left"]
            bbox_right = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]
            bbox_centre = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"] / 2
            column_centre = column['left'] + column['right'] / 2
            if (bbox_centre > column['left'] and bbox_centre < column['right']) or (column_centre > bbox_left and column_centre < bbox_right):
                # Bbox appears inside the column
                lines.append([index, item["Text"]])
                column_found = True
                break
        if not column_found:
            columns.append({'left': item["Geometry"]["BoundingBox"]["Left"],
                            'right': item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]})
            lines.append([len(columns) - 1, item["Text"]])

lines.sort(key=lambda x: x[0])
for line in lines:
    print(line[1])
```
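One possible way to remove the manual column count, sketched below: cluster the x-axis centres of the LINE blocks and count the clusters. The `gap` threshold (minimum horizontal distance between column centres, as a fraction of page width) is an assumption you would tune per document layout:

```python
# Hypothetical helper: estimate the number of columns from the Textract
# response itself by clustering LINE bounding-box centres along the x-axis.
def estimate_columns(blocks, gap=0.1):
    centres = sorted(
        b["Geometry"]["BoundingBox"]["Left"] + b["Geometry"]["BoundingBox"]["Width"] / 2
        for b in blocks
        if b["BlockType"] == "LINE"
    )
    clusters = []
    for c in centres:
        if clusters and c - clusters[-1][-1] <= gap:
            clusters[-1].append(c)  # close to the previous centre: same column
        else:
            clusters.append([c])    # large horizontal jump: new column
    return len(clusters)

# num_columns = estimate_columns(response["Blocks"])
```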
1
answers
0
votes
62
views
asked a month ago

ClientError: An error occurred (UnknownOperationException) when calling the CreateHyperParameterTuningJob operation: The requested operation is not supported in the called region.

Hi Dears, I am building an ML model using the DeepAR algorithm, and I hit this error when I reached this point:

Error: ClientError: An error occurred (UnknownOperationException) when calling the CreateHyperParameterTuningJob operation: The requested operation is not supported in the called region.

Code:

```python
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from sagemaker import image_uris

container = image_uris.retrieve(region='af-south-1', framework="forecasting-deepar")

deepar = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    use_spot_instances=True,  # use spot instances
    max_run=1800,  # max training time in seconds
    max_wait=1800,  # seconds to wait for spot instance
    output_path="s3://{}/{}".format(bucket, output_path),
    sagemaker_session=sess,
)

freq = "D"
context_length = 300
deepar.set_hyperparameters(
    time_freq=freq, context_length=str(context_length), prediction_length=str(prediction_length)
)

hyperparameter_ranges = {
    "mini_batch_size": IntegerParameter(100, 400),
    "epochs": IntegerParameter(200, 400),
    "num_cells": IntegerParameter(30, 100),
    "likelihood": CategoricalParameter(["negative-binomial", "student-T"]),
    "learning_rate": ContinuousParameter(0.0001, 0.1),
}
objective_metric_name = "test:RMSE"

tuner = HyperparameterTuner(
    deepar,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=10,
    strategy="Bayesian",
    objective_type="Minimize",
    max_parallel_jobs=10,
    early_stopping_type="Auto",
)

s3_input_train = sagemaker.inputs.TrainingInput(
    s3_data="s3://{}/{}/train/".format(bucket, prefix), content_type="json"
)
s3_input_test = sagemaker.inputs.TrainingInput(
    s3_data="s3://{}/{}/test/".format(bucket, prefix), content_type="json"
)

tuner.fit({"train": s3_input_train, "test": s3_input_test}, include_cls_metadata=False)
tuner.wait()
```

Can you please help in solving the error? I have to do this in the af-south-1 region. Thanks, Basem
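UnknownOperationException generally means the called API is not offered in that region, so if hyperparameter tuning is unavailable in af-south-1 the tuning job may have to run elsewhere (with data staying where it must). A trivial, hypothetical fallback helper; the `SUPPORTED_TUNING_REGIONS` set is an assumption to fill in from your own testing or the AWS regional services list, not an authoritative list:

```python
# Crude sketch: pick the first region from a preference list where the
# feature is known (from your own verification) to work.
SUPPORTED_TUNING_REGIONS = {"us-east-1", "eu-west-1"}  # assumption, fill in

def pick_region(preferred):
    for region in preferred:
        if region in SUPPORTED_TUNING_REGIONS:
            return region
    raise ValueError("no supported region in preference list")
```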
1
answers
0
votes
16
views
asked 2 months ago

XGBoost Error: Allreduce failed - 100GB Dask Dataframe on AWS Fargate ECS cluster dies with 1T of memory.

Overview: I'm trying to train an XGBoost model on a bunch of parquet files sitting in S3 using Dask, by setting up a Fargate cluster and connecting it to a Dask cluster. The total dataframe size comes to about 140 GB of data. I scaled up a Fargate cluster with these properties:

* Workers: 40
* Total threads: 160
* Total memory: 1 TB

So there should be enough memory to hold the data. Each worker has 9+ GB with 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does cause the task bytes per worker to get a little high, but never above the threshold where it would fail. Next I run xgb.dask.train, which uses the xgboost package, not the dask_ml.xgboost package. Very quickly the workers die and I get the error `XGBoostError: rabit/internal/utils.h:90: Allreduce failed`. When I attempted this with a single file of only 17 MB, I would still get this error, but only a couple of workers die. Does anyone know why this happens, since I have double the memory of the dataframe?

```python
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
y_train = y_train
y_test = y_test

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")])
```
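A failed Allreduce is often just the symptom of a worker that died (usually out of memory) mid-training, so partition sizing can matter more than total cluster memory. A rough, assumption-laden helper for picking a partition count before building the `DaskDMatrix`; the ~100 MB target follows common Dask guidance, not anything XGBoost-specific:

```python
# Rule-of-thumb sketch: keep partitions small enough that no single task can
# blow up a worker's memory; ceil-divide total size by a target partition size.
def suggest_npartitions(total_bytes, target_mb=100):
    return max(1, -(-total_bytes // (target_mb * 2**20)))  # ceil division

# e.g. df = df.repartition(npartitions=suggest_npartitions(140 * 2**30))
```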
1
answers
0
votes
12
views
asked 2 months ago

Amazon SageMaker Data Wrangler now supports additional M5 and R5 instances for interactive data preparation

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. SageMaker Data Wrangler runs on ml.m5.4xlarge by default. SageMaker Data Wrangler includes built-in data transforms and analyses written in PySpark so you can process large data sets (up to hundreds of gigabytes (GB) of data) efficiently on the default instance. Starting today, you can use additional M5 or R5 instance types with more CPU or memory in SageMaker Data Wrangler to improve performance for your data preparation workloads. Amazon EC2 M5 instances offer a balance of compute, memory, and networking resources for a broad range of workloads. Amazon EC2 R5 instances are the memory-optimized instances. Both M5 and R5 instance types are well suited for CPU- and memory-intensive applications such as running built-in transforms for very large data sets (up to terabytes (TB) of data) or applying custom transforms written in Pandas on medium data sets (up to tens of GBs). To learn more about the newly supported instances with Amazon SageMaker Data Wrangler, visit the [blog](https://aws.amazon.com/blogs/machine-learning/process-larger-and-wider-datasets-with-amazon-sagemaker-data-wrangler/) or the [AWS document](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-data-flow.html), and the [pricing page](https://aws.amazon.com/sagemaker/pricing/). To get started with SageMaker Data Wrangler, visit the [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html).
0
answers
0
votes
11
views
asked 2 months ago

Data Wrangler Full Outer Join Not Working As Expected Nor Concatenate

I've got two CSV files loaded into Data Wrangler that are intended to augment each other. The tables have some columns that are the same (in name) and some that are not; many of the rows are missing entries for many of the columns. The two tables represent separate datasets. Consider the example below:

Table 1:

| Filename | LabelA | LabelB |
| --- | --- | --- |
| ./A/001.dat | 1 | 1 |
| ./A/002.dat | 0 | 1 |

Table 2:

| Filename | LabelB | LabelC |
| --- | --- | --- |
| ./B/001.dat | | 0 |
| ./B/002.dat | 0 | 1 |

I am looking to merge / concatenate the two tables. The problem is that neither Data Wrangler join nor concatenate seems to work (at least as expected). Desired result:

| Filename | LabelA | LabelB | LabelC |
| --- | --- | --- | --- |
| ./A/001.dat | 1 | 1 | |
| ./A/002.dat | 0 | 1 | |
| ./B/001.dat | | | 0 |
| ./B/002.dat | | 0 | 1 |

When using a "Full Outer" join and asking to combine the "Filename" and "LabelB" columns, it takes all the values from Table 1 OR Table 2, even if Table 1 does not have that entry (for example, some rows will have Filename = <nothing> rather than Filename = ./B/001.dat). When using concatenate, Data Wrangler errors on the fact that it cannot match EVERY column between the tables. In my real data there are many columns and many rows, which precludes a manual process of joining without merging columns and then going through a renaming and merging process one by one. How do I get these tables to simply merge? I feel I must be missing something obvious. I am about to give up on Data Wrangler and do it all in a Python script using pandas, but I thought I should give Data Wrangler a try while learning the MLOps process.
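For comparison, the desired result in the question is exactly what a row-wise concatenation with a column union produces in pandas; a short sketch with illustrative data mirroring the tables above:

```python
import pandas as pd

# pd.concat unions the columns and fills missing cells with NaN,
# which matches the desired merged table.
t1 = pd.DataFrame({"Filename": ["./A/001.dat", "./A/002.dat"],
                   "LabelA": [1, 0],
                   "LabelB": [1, 1]})
t2 = pd.DataFrame({"Filename": ["./B/001.dat", "./B/002.dat"],
                   "LabelB": [None, 0],
                   "LabelC": [0, 1]})
merged = pd.concat([t1, t2], ignore_index=True)
# merged has columns Filename, LabelA, LabelB, LabelC; rows from t2 carry
# NaN in LabelA, rows from t1 carry NaN in LabelC
```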
1
answers
0
votes
8
views
asked 2 months ago

How can we create a Lambda which uses a Braket D-Wave device?

We are trying to deploy a Lambda with some code which works in a notebook. The code is rather simple and uses the D-Wave DW_2000Q_6 device. The problem is that when we execute the Lambda (a container Lambda, due to size problems), it gives us the following error:

```json
{
  "errorMessage": "[Errno 30] Read-only file system: '/home/sbx_user1051'",
  "errorType": "OSError",
  "stackTrace": [
    " File \"/var/lang/lib/python3.8/imp.py\", line 234, in load_module\n return load_source(name, filename, file)\n",
    " File \"/var/lang/lib/python3.8/imp.py\", line 171, in load_source\n module = _load(spec)\n",
    " File \"<frozen importlib._bootstrap>\", line 702, in _load\n",
    " File \"<frozen importlib._bootstrap>\", line 671, in _load_unlocked\n",
    " File \"<frozen importlib._bootstrap_external>\", line 843, in exec_module\n",
    " File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed\n",
    " File \"/var/task/lambda_function.py\", line 6, in <module>\n from dwave.system.composites import EmbeddingComposite\n",
    " File \"/var/task/dwave/system/__init__.py\", line 15, in <module>\n import dwave.system.flux_bias_offsets\n",
    " File \"/var/task/dwave/system/flux_bias_offsets.py\", line 22, in <module>\n from dwave.system.samplers.dwave_sampler import DWaveSampler\n",
    " File \"/var/task/dwave/system/samplers/__init__.py\", line 15, in <module>\n from dwave.system.samplers.clique import *\n",
    " File \"/var/task/dwave/system/samplers/clique.py\", line 32, in <module>\n from dwave.system.samplers.dwave_sampler import DWaveSampler, _failover\n",
    " File \"/var/task/dwave/system/samplers/dwave_sampler.py\", line 31, in <module>\n from dwave.cloud import Client\n",
    " File \"/var/task/dwave/cloud/__init__.py\", line 21, in <module>\n from dwave.cloud.client import Client\n",
    " File \"/var/task/dwave/cloud/client/__init__.py\", line 17, in <module>\n from dwave.cloud.client.base import Client\n",
    " File \"/var/task/dwave/cloud/client/base.py\", line 89, in <module>\n class Client(object):\n",
    " File \"/var/task/dwave/cloud/client/base.py\", line 736, in Client\n @cached.ondisk(maxage=_REGIONS_CACHE_MAXAGE)\n",
    " File \"/var/task/dwave/cloud/utils.py\", line 477, in ondisk\n directory = kwargs.pop('directory', get_cache_dir())\n",
    " File \"/var/task/dwave/cloud/config.py\", line 455, in get_cache_dir\n return homebase.user_cache_dir(\n",
    " File \"/var/task/homebase/homebase.py\", line 150, in user_cache_dir\n return _get_folder(True, _FolderTypes.cache, app_name, app_author, version, False, use_virtualenv, create)[0]\n",
    " File \"/var/task/homebase/homebase.py\", line 430, in _get_folder\n os.makedirs(final_path)\n",
    " File \"/var/lang/lib/python3.8/os.py\", line 213, in makedirs\n makedirs(head, exist_ok=exist_ok)\n",
    " File \"/var/lang/lib/python3.8/os.py\", line 213, in makedirs\n makedirs(head, exist_ok=exist_ok)\n",
    " File \"/var/lang/lib/python3.8/os.py\", line 223, in makedirs\n mkdir(name, mode)\n"
  ]
}
```

It seems that the library tries to write to some files which are not in the /tmp folder. I'm wondering if it is possible to do this, and if not, what the alternatives are.

Imports used:

```python
import boto3
from braket.ocean_plugin import BraketDWaveSampler
from dwave.system.composites import EmbeddingComposite
from neal import SimulatedAnnealingSampler
```
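One common workaround pattern for read-only Lambda filesystems, sketched below: redirect every home/cache location to `/tmp` before the first `dwave` import, since the stack trace shows the cache directory being created at import time. Which of these variables `homebase` actually honours is an assumption to verify:

```python
import os

# Assumption: the cache path is resolved from HOME / XDG_* at import time, so
# these must be set before importing dwave. /tmp is the only writable path in
# Lambda.
for var in ("HOME", "XDG_CACHE_HOME", "XDG_CONFIG_HOME", "XDG_DATA_HOME"):
    os.environ[var] = "/tmp"

# Only now import the libraries that create cache directories on import:
# from dwave.system.composites import EmbeddingComposite
```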
1
answers
0
votes
23
views
asked 2 months ago

How to create (Serverless) SageMaker Endpoint using exiting tensorflow pb (frozen model) file?

Note: I am a senior developer, but I am very new to the topic of machine learning.

I have two frozen TensorFlow model weight files: `weights_face_v1.0.0.pb` and `weights_plate_v1.0.0.pb`. I also have some Python code using TensorFlow 2 that loads the models and handles basic inference. The models detect faces and license plates respectively, and the surrounding code converts an input image to a numpy array and applies blurring to the areas of the image that had detections.

I want a SageMaker endpoint so that I can run inference on the models. I initially tried a regular (container-based) Lambda function, but that is too slow for our use case. A SageMaker endpoint should give us GPU inference, which should be much faster.

I am struggling to find out how to do this. From what I can tell from the documentation and some YouTube videos, I need to create my own Docker container. As a start, I can use for example `763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.8.0-gpu-py39-cu112-ubuntu20.04-sagemaker`. However, I can't find any solid documentation on how I would implement my other code. How do I send an image to SageMaker? What tells it to convert the image to a numpy array? How does it know the tensor names? How do I install additional requirements? How can I use the detections to apply blurring to the image, and how can I return the resulting image?

Can someone please point me in the right direction? I searched a lot but can't find any example code or blog that explains this process. Thank you in advance! Your help is much appreciated.
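For the TensorFlow Serving based SageMaker images, the usual customization point is an `inference.py` with `input_handler`/`output_handler` functions placed in a `code/` directory alongside the model. A hypothetical sketch, assuming the model accepts b64-encoded image strings; your tensor/input names may well differ:

```python
# Hypothetical inference.py sketch for the TF Serving container: the container
# calls input_handler before, and output_handler after, the TF-Serving REST
# request. Payload shapes follow the TFS REST API ("instances", {"b64": ...}).
import base64
import json

def input_handler(data, context):
    """Turn incoming image bytes into the JSON payload TF Serving expects."""
    if context.request_content_type == "application/x-image":
        payload = data.read()
        return json.dumps({
            "instances": [{"b64": base64.b64encode(payload).decode("utf-8")}]
        })
    raise ValueError(f"unsupported content type {context.request_content_type}")

def output_handler(response, context):
    """Pass TF Serving's JSON response (the detections) back to the caller."""
    return response.content, "application/json"
```

Extra pip dependencies can go in a `code/requirements.txt` next to the script; post-processing such as the blurring would live in `output_handler` or in the client, a design choice this sketch leaves open.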
1
answers
0
votes
7
views
asked 2 months ago

Unauthorized AWS account racked up charges on stolen credit card.

My mother was automatically signed up for an AWS account, or someone used her credentials to sign up. She did not know that she had been signed up, and it sat unused for 3 years. Last month, she got an email from AWS for "unusual activity" and she asked me to help her look into it. Someone racked up $800+ in charges in 10 days for AWS services she has never heard of, let alone used (SageMaker and Lightsail were among them). The card on the AWS account is a credit card that was stolen years ago and has since been cancelled, so when AWS tried to charge the card, it didn't go through. My experience with AWS customer service has been unhelpful so far. Mom changed her AWS password in time so we could get into the account and contact support. I deleted the instances so that the services incurring charges are now stopped. But now AWS is telling me to put in a "valid payment method" or else they will not review the fraudulent bill. They also said that I have to set up additional AWS services (Cost Management, Amazon CloudWatch, CloudTrail, WAF, security services) before they'll review the bill. I have clearly explained to them that this entire account is unauthorized and we want to close it ASAP, so adding further services and a payment method doesn't make sense. Why am I being told to use more AWS services when my goal is to use zero? Why do I have to set up "preventative services" when the issue I'm trying to resolve is a PAST issue of fraud? They also asked me to write back and confirm that "we have read and understood the AWS Customer Agreement and shared responsibility model." Of course we haven't, because we didn't even know the account existed! Any advice or input on this situation? It's extremely frustrating to be told that AWS won't even look into the issue unless I set up these additional AWS services and give them a payment method. This is a clear case of identity fraud. We want this account shut down. Support Case # is xxxxxxxxxx.
Edit- removed case ID -Ann D
1
answers
0
votes
35
views
asked 3 months ago

IncompleteSignature error while using Sklearn SDK

Currently, we are trying to train an SK-Learn model from a Python script running on a local computer, by uploading data to an S3 bucket.

```python
from sagemaker.amazon.amazon_estimator import get_image_uri

# container = retrieve(framework='sklearn', region='us-east-1', version="0.23-1")
container = sagemaker.image_uris.get_training_image_uri('us-east-1', 'sklearn', framework_version='0.23-1')

sklearn_estimator = SKLearn(
    entry_point="script.py",
    # role=get_execution_role(),
    role=role_aws,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="rf-scikit",
    metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 100,
        "min-samples-leaf": 3,
        "features": "MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude",
        "target": "target",
    },
    sagemaker_session=session,
    image_uri=container,
    image_uri_region='us-east-1',
    # output_path=model_output_path,
)

# launch training job, with asynchronous call
path_train_test = 's3://' + bucket_name + '/' + prefix
sklearn_estimator.fit({"train": path_train_test, "test": path_train_test}, wait=False)
```

This fails with:

ClientError: An error occurred (IncompleteSignature) when calling the GetCallerIdentity operation: Credential must have exactly 5 slash-delimited elements, e.g. keyid/date/region/service/term, got 'https://elasticmapreduce.us-east-1b.amazonaws.com//20220406/us-east-1/sts/aws4_request'

The access key and the secret key are passed through the session object via a client and passed to the SK-Learn estimator:

```python
client_sagemaker = boto3.client(
    'sagemaker',
    aws_access_key_id=accesskey,
    aws_secret_access_key=access_secret,
)
session = sagemaker.Session(sagemaker_client=client_sagemaker)
```

The same access key worked for the XGBoost model (already available in SageMaker). Any ideas about the reason?
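The error text itself is a clue: a URL (`https://elasticmapreduce...`) ended up where a signing credential was expected, which suggests a variable mix-up when building the client. A crude, hypothetical pre-flight check; the length/format rules are assumptions based on typical AWS key shapes, not an official validation:

```python
# Hedged sketch: sanity-check that what you are about to pass as credentials
# at least looks like an AWS key pair rather than, say, an endpoint URL.
def looks_like_aws_key(access_key: str, secret_key: str) -> bool:
    return (
        access_key.isalnum() and len(access_key) == 20
        and "://" not in secret_key and len(secret_key) == 40
    )
```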
0
answers
0
votes
7
views
asked 3 months ago

Is it possible to use smddp in notebook?

I recently tried smddp v1.4.0 on a SageMaker notebook instance (not SageMaker Studio), using the 8-GPU instance type `ml.p3.16xlarge`, by directly using `smddp` as the backend in the training script. I launched the estimator by setting `instance_type` to `local_gpu` and ended up with an smddp error. The corresponding errors are attached below, indicating an initialization error.

```
42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 636, in <module>
42u1m0wni0-algo-1-36bbw |     main()
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 178, in main
42u1m0wni0-algo-1-36bbw |     dist.init_process_group(backend=args.dist_backend)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
42u1m0wni0-algo-1-36bbw |     store, rank, world_size = next(rendezvous_iterator)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _env_rendezvous_handler
42u1m0wni0-algo-1-36bbw |     rank = int(_get_env_or_raise("RANK"))
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
42u1m0wni0-algo-1-36bbw |     raise _env_error(env_var)
42u1m0wni0-algo-1-36bbw | ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw | Environment variable SAGEMAKER_INSTANCE_TYPE is not set
42u1m0wni0-algo-1-36bbw | Running smdistributed.dataparallel v1.4.0
42u1m0wni0-algo-1-36bbw | Error in atexit._run_exitfuncs:
42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp
42u1m0wni0-algo-1-36bbw |     hm.shutdown()
42u1m0wni0-algo-1-36bbw | RuntimeError: Was this script started with smddprun?
42u1m0wni0-algo-1-36bbw | For more info on using smddprun, run smddprun -h
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR Reporting training FAILURE
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
42u1m0wni0-algo-1-36bbw | ExitCode 1
42u1m0wni0-algo-1-36bbw | ErrorMessage "ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw | Environment variable SAGEMAKER_INSTANCE_TYPE is not set
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp
    hm.shutdown()
RuntimeError: Was this script started with smddprun?
For more info on using smddprun, run smddprun -h"
```

The original goal was to launch single-node smddp for debugging. Does smddp only support being launched via the AWS Python SDK, rather than from the notebook? Or have I done something incorrectly?
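The rendezvous for smddp is set up by the SageMaker training toolkit on real training instances (via `smddprun`, which exports `RANK` etc.), which is consistent with local mode failing here. A hedged sketch of falling back to a stock backend when not running on SageMaker; the env-var check mirrors the "SAGEMAKER_INSTANCE_TYPE is not set" line in the log and is an assumption, not documented behaviour:

```python
import os

# Sketch: use smddp only when the SageMaker toolkit environment is present,
# otherwise fall back to a stock torch.distributed backend for local debugging.
def pick_backend(requested: str = "smddp") -> str:
    on_sagemaker = "SAGEMAKER_INSTANCE_TYPE" in os.environ  # assumption
    if requested == "smddp" and not on_sagemaker:
        return "nccl"   # or "gloo" on CPU-only machines
    return requested

# dist.init_process_group(backend=pick_backend(args.dist_backend))
```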
0
answers
0
votes
9
views
asked 3 months ago