Questions in Machine Learning & AI

Browse through the questions and answers listed below or filter and sort to narrow down your results.

How to use SageMaker with FastFile input mode when file names contain Chinese characters?

This post is both a bug report and a question. We are trying to use SageMaker to train a model and everything is quite standard. Since we have a lot of images, we suffer from very long image download times unless we change the input_mode to FastFile. I then struggled to load images successfully inside the container.

In my dataset there are a lot of samples whose names contain Chinese characters. While debugging why I could not load the files, I found that when SageMaker mounts the data from S3, it does not handle the encoding correctly. Here is an image name and the corresponding path inside the training container:

`七年级上_第10章分式_七年级上_第10章分式_1077759_title_0-0_4_mathjax`

`/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png`

This is not neat, but I can still build the right path inside the container. The problem is that I cannot read the file even though the path exists: `os.path.exists('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png')` returns True, but `cv2.imread()` on the same path returns None.

I then tried to open the file directly, and fortunately that raises an error. The code is `with open('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png', 'rb') as f: a = f.read()` and it fails with `OSError: [Errno 107] Transport endpoint is not connected`.

I tried loading a file in the same folder whose name doesn't contain any Chinese characters, and everything works in that case, so I'm confident the Chinese characters in the filenames are causing the problem. Is there a quick workaround so I don't have to rename roughly 80% of the data in S3?
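A possible workaround sketch (an assumption, not a confirmed fix): read the affected objects directly from S3 with boto3 instead of going through the FastFile mount. The bucket name below is a placeholder, and this assumes the training role has read access to the object.

```python
import boto3
import cv2
import numpy as np

# Placeholder bucket; the key mirrors the file name from the question.
bucket = "my-training-bucket"
key = "validation/七年级上_第10章分式_七年级上_第10章分式_1077759_title_0-0_4_mathjax.png"

s3 = boto3.client("s3")

# Read the object bytes straight from S3, bypassing the FastFile FUSE mount.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

# Decode the image from the in-memory buffer instead of a file path.
image = cv2.imdecode(np.frombuffer(body, dtype=np.uint8), cv2.IMREAD_COLOR)
```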
0
answers
0
votes
9
views
asked a day ago

Rekognition: error when trying to detect faces with an S3 object name containing a colon (:)

Rekognition generally works fine, but when I use a filename containing a colon (:) in the S3Object, it returns an error. This is very problematic for me because all my files already contain colons and I can't rename them.

This works fine:

```
{
  "Image": {
    "S3Object": {
      "Bucket": "console-sample-images",
      "Name": "skateboard.jpg"
    }
  }
}
```

But a name with a colon, like this, gives me an error:

```
{
  "Image": {
    "S3Object": {
      "Bucket": "console-sample-images",
      "Name": "skate:board.jpg"
    }
  }
}
```

Error output: `{"name":"Error","content":"{\"__type\":\"InvalidS3ObjectException\",\"Code\":\"InvalidS3ObjectException\",\"Message\":\"Unable to get object metadata from S3. Check object key, region and/or access permissions.\"}","message":"faultCode:Server.Error.Request faultString:'null' faultDetail:'null'","rootCause":{"errorID":2032,"target":{"bytesLoaded":174,"dataFormat":"text","bytesTotal":174,"data":"{\"__type\":\"InvalidS3ObjectException\",\"Code\":\"InvalidS3ObjectException\",\"Message\":\"Unable to get object metadata from S3. Check object key, region and/or access permissions.\"}"},"text":"Error #2032: Stream Error. URL: https://rekognition.eu-west-1.amazonaws.com","currentTarget":{"bytesLoaded":174,"dataFormat":"text","bytesTotal":174,"data":"{\"__type\":\"InvalidS3ObjectException\",\"Code\":\"InvalidS3ObjectException\",\"Message\":\"Unable to get object metadata from S3. Check object key, region and/or access permissions.\"}"},"type":"ioError","bubbles":false,"eventPhase":2,"cancelable":false},"errorID":0,"faultCode":"Server.Error.Request","faultDetail":null,"faultString":""}`

Is there a workaround for this problem (for example, encoding the ':' a certain way)? Thank you for your help.
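A diagnostic sketch (an assumption, not a confirmed fix): first confirm the exact key is reachable with a plain S3 HeadObject call using the same raw string, then pass that same un-encoded key to Rekognition. The bucket and key come from the question; the region is inferred from the error URL.

```python
import boto3

bucket = "console-sample-images"
key = "skate:board.jpg"   # raw key, no URL encoding
region = "eu-west-1"      # assumed from the error URL

# 1) Verify the object is reachable under exactly this key.
s3 = boto3.client("s3", region_name=region)
s3.head_object(Bucket=bucket, Key=key)

# 2) Call Rekognition with the same raw key.
rekognition = boto3.client("rekognition", region_name=region)
response = rekognition.detect_faces(
    Image={"S3Object": {"Bucket": bucket, "Name": key}},
    Attributes=["DEFAULT"],
)
print(len(response["FaceDetails"]), "face(s) detected")
```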
1
answers
0
votes
14
views
asked 5 days ago

PyTorch Lightning progress bar not working in SageMaker JupyterLab

Hello! We started using SageMaker JupyterLab to run a few deep learning experiments we previously ran on Google Colab Pro+. Training starts fine and everything seems to work; however, the progress bar only ever shows:

**Validation sanity check: 0it [00:00, ?it/s] Training: 0it [00:00, ?it/s]**

The progress bar was working fine on Google Colab. I tried uninstalling ipywidgets as [suggested here](https://github.com/PyTorchLightning/pytorch-lightning/issues/11208), but still no luck. Does anyone have an idea of how to fix the problem? Below is a copy of the LightningModule and logging callback I am using.

```
class T5FineTuner(pl.LightningModule):
    def __init__(self, hparams):
        super(T5FineTuner, self).__init__()
        self.hparams = hparams
        self.model = T5ForConditionalGeneration.from_pretrained(hparams['model_name_or_path'])
        self.tokenizer = T5Tokenizer.from_pretrained(hparams['tokenizer_name_or_path'])

    def hparams(self):
        return self.hparams

    def is_logger(self):
        return True  # self.trainer.proc_rank <= 0

    def forward(self, input_ids, attention_mask=None, decoder_input_ids=None,
                decoder_attention_mask=None, labels=None):
        return self.model(
            input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
            labels=labels,
        )

    def _step(self, batch):
        labels = batch["target_ids"]
        labels[labels[:, :] == self.tokenizer.pad_token_id] = -100
        outputs = self(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            labels=labels,
            decoder_attention_mask=batch['target_mask']
        )
        loss = outputs[0]
        return loss

    def training_step(self, batch, batch_idx):
        loss = self._step(batch)
        tensorboard_logs = {"train_loss": loss}
        return {"loss": loss, "log": tensorboard_logs}

    def training_epoch_end(self, outputs):
        avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
        tensorboard_logs = {"avg_train_loss": avg_train_loss}
        # return {"avg_train_loss": avg_train_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        loss = self._step(batch)
        return {"val_loss": loss}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        tensorboard_logs = {"val_loss": avg_loss}
        return {"avg_val_loss": avg_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

    def configure_optimizers(self):
        "Prepare optimizer and schedule (linear warmup and decay)"
        model = self.model
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparams['weight_decay'],
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters, lr=self.hparams['learning_rate'], eps=self.hparams['adam_epsilon'])
        self.opt = optimizer
        return [optimizer]

    def optimizer_step(self, epoch=None, batch_idx=None, optimizer=None, optimizer_idx=None,
                       optimizer_closure=None, second_order_closure=None, on_tpu=False,
                       using_native_amp=False, using_lbfgs=False):
        # if self.trainer.use_tpu:
        #     xm.optimizer_step(optimizer)
        # else:
        optimizer.step(closure=optimizer_closure)
        optimizer.zero_grad()
        self.lr_scheduler.step()

    def get_tqdm_dict(self):
        tqdm_dict = {"loss": "{:.3f}".format(self.trainer.avg_loss), "lr": self.lr_scheduler.get_last_lr()[-1]}
        return tqdm_dict

    def train_dataloader(self):
        train_dataset = get_dataset(tokenizer=self.tokenizer, type_path="translated_train", args=self.hparams)
        dataloader = DataLoader(train_dataset, batch_size=self.hparams['train_batch_size'],
                                drop_last=True, shuffle=True, num_workers=4)
        t_total = (
            (len(dataloader.dataset) // (self.hparams['train_batch_size'] * max(1, self.hparams['n_gpu'])))
            // self.hparams['gradient_accumulation_steps']
            * float(self.hparams['num_train_epochs'])
        )
        scheduler = get_linear_schedule_with_warmup(
            self.opt, num_warmup_steps=self.hparams['warmup_steps'], num_training_steps=t_total
        )
        self.lr_scheduler = scheduler
        return dataloader

    def val_dataloader(self):
        val_dataset = get_dataset(tokenizer=self.tokenizer, type_path="test_2k", args=self.hparams)
        return DataLoader(val_dataset, batch_size=self.hparams['eval_batch_size'], num_workers=4)


logger = logging.getLogger(__name__)


class LoggingCallback(pl.Callback):
    def on_validation_end(self, trainer, pl_module):
        logger.info("***** Validation results *****")
        if pl_module.is_logger():
            metrics = trainer.callback_metrics
            # Log results
            for key in sorted(metrics):
                if key not in ["log", "progress_bar"]:
                    logger.info("{} = {}\n".format(key, str(metrics[key])))

    def on_test_end(self, trainer, pl_module):
        logger.info("***** Test results *****")
        if pl_module.is_logger():
            metrics = trainer.callback_metrics
            # Log and save results to file
            output_test_results_file = os.path.join(pl_module.hparams["output_dir"], "test_results.txt")
            with open(output_test_results_file, "w") as writer:
                for key in sorted(metrics):
                    if key not in ["log", "progress_bar"]:
                        logger.info("{} = {}\n".format(key, str(metrics[key])))
                        writer.write("{} = {}\n".format(key, str(metrics[key])))
```
0
answers
0
votes
2
views
asked 6 days ago

ClientError: An error occurred (UnknownOperationException) when calling the CreateHyperParameterTuningJob operation: The requested operation is not supported in the called region.

Hi, I am building an ML model using the DeepAR algorithm and hit this error when I reached the hyperparameter tuning step:

Error: `ClientError: An error occurred (UnknownOperationException) when calling the CreateHyperParameterTuningJob operation: The requested operation is not supported in the called region.`

Code:

```python
import sagemaker
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from sagemaker import image_uris

container = image_uris.retrieve(region="af-south-1", framework="forecasting-deepar")

deepar = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    use_spot_instances=True,  # use spot instances
    max_run=1800,             # max training time in seconds
    max_wait=1800,            # seconds to wait for spot instance
    output_path="s3://{}/{}".format(bucket, output_path),
    sagemaker_session=sess,
)

freq = "D"
context_length = 300

deepar.set_hyperparameters(
    time_freq=freq, context_length=str(context_length), prediction_length=str(prediction_length)
)

hyperparameter_ranges = {
    "mini_batch_size": IntegerParameter(100, 400),
    "epochs": IntegerParameter(200, 400),
    "num_cells": IntegerParameter(30, 100),
    "likelihood": CategoricalParameter(["negative-binomial", "student-T"]),
    "learning_rate": ContinuousParameter(0.0001, 0.1),
}

objective_metric_name = "test:RMSE"

tuner = HyperparameterTuner(
    deepar,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=10,
    strategy="Bayesian",
    objective_type="Minimize",
    max_parallel_jobs=10,
    early_stopping_type="Auto",
)

s3_input_train = sagemaker.inputs.TrainingInput(
    s3_data="s3://{}/{}/train/".format(bucket, prefix), content_type="json"
)
s3_input_test = sagemaker.inputs.TrainingInput(
    s3_data="s3://{}/{}/test/".format(bucket, prefix), content_type="json"
)

tuner.fit({"train": s3_input_train, "test": s3_input_test}, include_cls_metadata=False)
tuner.wait()
```

Can you please help in solving the error? I have to run this in the af-south-1 region. Thanks, Basem
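A small diagnostic sketch (an assumption, not a confirmed explanation): the error message suggests the CreateHyperParameterTuningJob API itself is being rejected for the region, so one quick check is to call a tuning-related API directly against af-south-1 and see whether it responds. The call below is a read-only probe and does not create any resources.

```python
import boto3
from botocore.exceptions import ClientError

sm = boto3.client("sagemaker", region_name="af-south-1")

try:
    # Listing tuning jobs exercises the same API family without creating anything.
    sm.list_hyper_parameter_tuning_jobs(MaxResults=1)
    print("Hyperparameter tuning APIs responded in af-south-1")
except ClientError as err:
    # An UnknownOperationException here would point at a regional limitation
    # rather than a problem with the estimator or tuner configuration.
    print("Tuning API call failed:", err.response["Error"]["Code"])
```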
1
answers
0
votes
6
views
asked 6 days ago

XGBoost error: Allreduce failed - 140 GB Dask DataFrame on an AWS Fargate ECS cluster with 1 TB of memory dies

Overview: I'm trying to run an XGBoost model on a bunch of Parquet files sitting in S3, using Dask on a Fargate cluster connected as a Dask cluster. The total DataFrame size is about 140 GB. I scaled up a Fargate cluster with these properties:

Workers: 40
Total threads: 160
Total memory: 1 TB

So there should be enough memory to hold the data and tasks. Each worker has 9+ GB with 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does push the task bytes per worker a little high, but never above the threshold where it would fail. Next I run xgb.dask.train, which uses the xgboost package, not the dask_ml.xgboost package. Very quickly the workers die and I get the error `XGBoostError: rabit/internal/utils.h:90: Allreduce failed`. When I tried this with a single file of only 17 MB, I still got this error, but only a couple of workers died. Does anyone know why this happens, given that I have roughly double the memory of the DataFrame?

```
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
y_train = y_train
y_test = y_test

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
```
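A hedged sketch of one common memory-pressure mitigation (an assumption, not a confirmed fix for the Allreduce failure): repartition and persist the collections before building the DaskDMatrix, so each worker starts training from evenly sized, already-materialized chunks. This assumes X_train and y_train are still a Dask DataFrame and Series at this point; the partition size is an arbitrary illustration.

```python
import xgboost as xgb

# Smaller, evenly sized partitions reduce per-worker memory spikes;
# the 256 MB target here is an arbitrary example, not a recommendation.
X_train = X_train.repartition(partition_size="256MB")

# Materialize the repartitioned data on the cluster so training starts
# from in-memory chunks instead of re-reading Parquet from S3.
X_train = X_train.persist()
y_train = y_train.persist()

dtrain = xgb.dask.DaskDMatrix(client, X_train.to_dask_array(), y_train)
```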
1
answers
0
votes
6
views
asked 12 days ago

Amazon SageMaker Data Wrangler now supports additional M5 and R5 instances for interactive data preparation

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. SageMaker Data Wrangler runs on an ml.m5.4xlarge instance by default. SageMaker Data Wrangler includes built-in data transforms and analyses written in PySpark so you can process large data sets (up to hundreds of gigabytes (GB) of data) efficiently on the default instance. Starting today, you can use additional M5 or R5 instance types with more CPU or memory in SageMaker Data Wrangler to improve performance for your data preparation workloads. Amazon EC2 M5 instances offer a balance of compute, memory, and networking resources for a broad range of workloads. Amazon EC2 R5 instances are memory-optimized instances. Both M5 and R5 instance types are well suited for CPU- and memory-intensive applications such as running built-in transforms on very large data sets (up to terabytes (TB) of data) or applying custom transforms written in pandas to medium-sized data sets (up to tens of GB). To learn more about the newly supported instances with Amazon SageMaker Data Wrangler, visit the [blog](https://aws.amazon.com/blogs/machine-learning/process-larger-and-wider-datasets-with-amazon-sagemaker-data-wrangler/) or the [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-data-flow.html), and the [pricing page](https://aws.amazon.com/sagemaker/pricing/). To get started with SageMaker Data Wrangler, visit the [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html).
0
answers
0
votes
6
views
asked 15 days ago

Data Wrangler Full Outer Join and Concatenate Not Working As Expected

I've got two CSV files loaded into Data Wrangler that are intended to augment each other. The tables have some columns that are the same (in name) and some that are not, and many of the rows are missing entries for many of the columns. The two tables represent separate datasets. Consider the example below:

Table 1:

| Filename | LabelA | LabelB |
| --- | --- | --- |
| ./A/001.dat | 1 | 1 |
| ./A/002.dat | 0 | 1 |

Table 2:

| Filename | LabelB | LabelC |
| --- | --- | --- |
| ./B/001.dat | | 0 |
| ./B/002.dat | 0 | 1 |

I am looking to merge / concatenate the two tables. The problem is that neither Data Wrangler's join nor its concatenate seems to work (at least as expected). Desired result:

| Filename | LabelA | LabelB | LabelC |
| --- | --- | --- | --- |
| ./A/001.dat | 1 | 1 | |
| ./A/002.dat | 0 | 1 | |
| ./B/001.dat | | | 0 |
| ./B/002.dat | | 0 | 1 |

When I use a "Full Outer" join and ask it to combine the "Filename" and "LabelB" columns, it takes all the values from Table 1 OR Table 2 even if Table 1 does not have that entry (for example, some rows end up with Filename = <nothing> rather than Filename = ./B/001.dat). When I use concatenate, Data Wrangler errors because it cannot match EVERY column between the tables. In my real data there are many columns and many rows, which precludes a manual process of joining without merging columns and then renaming and merging one column at a time. How do I get these tables to simply merge? I feel I must be missing something obvious. I am about to give up on Data Wrangler and do it all in a Python script using pandas (see the sketch below), but I thought I should give Data Wrangler a try while learning the MLOps process.
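For reference, a minimal pandas sketch of the desired result described above (this is the fallback the question mentions, not a Data Wrangler solution); the file names are placeholders:

```python
import pandas as pd

# Placeholder file names for the two CSVs described in the question.
table1 = pd.read_csv("table1.csv")  # columns: Filename, LabelA, LabelB
table2 = pd.read_csv("table2.csv")  # columns: Filename, LabelB, LabelC

# Stacking the rows while taking the union of columns produces the desired
# table: rows from both files, with missing labels left empty (NaN).
merged = pd.concat([table1, table2], ignore_index=True, sort=False)
print(merged)
```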
1
answers
0
votes
1
views
asked 16 days ago

Rekognition search faces API endpoint

Hi everyone! Currently I've accomplished detecting all the faces in a collection and then generating sub-galleries for each subject with all their associated photos, using the Ruby SDK '~> 1.65'. To do this, I index the faces of all photos within a collection, list all the faces (https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/Rekognition/Client.html#list_faces-instance_method), then take each recognized face_id and search for the faces related to that face_id (https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/Rekognition/Client.html#search_faces-instance_method), and finally remove the face id used for the call along with all the returned ones, so I can tell where one detected subject ends and the next begins. My issue is that the search faces API returns different results depending on which face id you pass in the request. For example, if 10 face ids are detected that belong to one person (1, 2, 3, ..., 10), the search faces call with face id = 1 should return face ids (2, 3, 4, ..., 10), but if you continue doing this with the other face ids that is not always the case; in some scenarios the search faces call with face id = 3 has returned only a subset of those, such as (4, 5, 6). Is there another way to achieve this and avoid this kind of "error"? If not, this is a real concern for us, because the outcome depends on the order in which we call search faces with different face ids, and sometimes it looks like more than one subject was detected with almost the same photos when in reality it's the same person. Thanks in advance!
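For illustration, a minimal sketch of the grouping loop described above, written with boto3 rather than the Ruby SDK (the collection id and similarity threshold are placeholder assumptions):

```python
import boto3

rekognition = boto3.client("rekognition")
collection_id = "my-collection"  # placeholder

# Collect every face id indexed in the collection (paginated with NextToken).
face_ids = []
token = None
while True:
    kwargs = {"CollectionId": collection_id, "MaxResults": 1000}
    if token:
        kwargs["NextToken"] = token
    page = rekognition.list_faces(**kwargs)
    face_ids.extend(face["FaceId"] for face in page["Faces"])
    token = page.get("NextToken")
    if not token:
        break

# Group faces into subjects: pick an unassigned face, search for its matches,
# and remove the whole group from the pool so it is not counted twice.
remaining = set(face_ids)
subjects = []
while remaining:
    seed = remaining.pop()
    matches = rekognition.search_faces(
        CollectionId=collection_id,
        FaceId=seed,
        FaceMatchThreshold=90,  # placeholder threshold
        MaxFaces=100,
    )["FaceMatches"]
    group = {seed} | {m["Face"]["FaceId"] for m in matches}
    remaining -= group
    subjects.append(group)
```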
3
answers
0
votes
5
views
asked 19 days ago

How can we create a Lambda which uses a Braket D-Wave device?

We are trying to deploy a Lambda with some code that works in a notebook. The code is rather simple and uses the D-Wave DW_2000Q_6 device. The problem is that when we execute the Lambda (a container-image Lambda, due to package-size limits), it gives us the following error:

```json
{
  "errorMessage": "[Errno 30] Read-only file system: '/home/sbx_user1051'",
  "errorType": "OSError",
  "stackTrace": [
    " File \"/var/lang/lib/python3.8/imp.py\", line 234, in load_module\n return load_source(name, filename, file)\n",
    " File \"/var/lang/lib/python3.8/imp.py\", line 171, in load_source\n module = _load(spec)\n",
    " File \"<frozen importlib._bootstrap>\", line 702, in _load\n",
    " File \"<frozen importlib._bootstrap>\", line 671, in _load_unlocked\n",
    " File \"<frozen importlib._bootstrap_external>\", line 843, in exec_module\n",
    " File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed\n",
    " File \"/var/task/lambda_function.py\", line 6, in <module>\n from dwave.system.composites import EmbeddingComposite\n",
    " File \"/var/task/dwave/system/__init__.py\", line 15, in <module>\n import dwave.system.flux_bias_offsets\n",
    " File \"/var/task/dwave/system/flux_bias_offsets.py\", line 22, in <module>\n from dwave.system.samplers.dwave_sampler import DWaveSampler\n",
    " File \"/var/task/dwave/system/samplers/__init__.py\", line 15, in <module>\n from dwave.system.samplers.clique import *\n",
    " File \"/var/task/dwave/system/samplers/clique.py\", line 32, in <module>\n from dwave.system.samplers.dwave_sampler import DWaveSampler, _failover\n",
    " File \"/var/task/dwave/system/samplers/dwave_sampler.py\", line 31, in <module>\n from dwave.cloud import Client\n",
    " File \"/var/task/dwave/cloud/__init__.py\", line 21, in <module>\n from dwave.cloud.client import Client\n",
    " File \"/var/task/dwave/cloud/client/__init__.py\", line 17, in <module>\n from dwave.cloud.client.base import Client\n",
    " File \"/var/task/dwave/cloud/client/base.py\", line 89, in <module>\n class Client(object):\n",
    " File \"/var/task/dwave/cloud/client/base.py\", line 736, in Client\n @cached.ondisk(maxage=_REGIONS_CACHE_MAXAGE)\n",
    " File \"/var/task/dwave/cloud/utils.py\", line 477, in ondisk\n directory = kwargs.pop('directory', get_cache_dir())\n",
    " File \"/var/task/dwave/cloud/config.py\", line 455, in get_cache_dir\n return homebase.user_cache_dir(\n",
    " File \"/var/task/homebase/homebase.py\", line 150, in user_cache_dir\n return _get_folder(True, _FolderTypes.cache, app_name, app_author, version, False, use_virtualenv, create)[0]\n",
    " File \"/var/task/homebase/homebase.py\", line 430, in _get_folder\n os.makedirs(final_path)\n",
    " File \"/var/lang/lib/python3.8/os.py\", line 213, in makedirs\n makedirs(head, exist_ok=exist_ok)\n",
    " File \"/var/lang/lib/python3.8/os.py\", line 213, in makedirs\n makedirs(head, exist_ok=exist_ok)\n",
    " File \"/var/lang/lib/python3.8/os.py\", line 223, in makedirs\n mkdir(name, mode)\n"
  ]
}
```

It seems that the library tries to write to files that are not in the /tmp folder. I'm wondering whether it is possible to make this work in Lambda, and if not, what the alternatives are.

Imports used:

```python
import boto3
from braket.ocean_plugin import BraketDWaveSampler
from dwave.system.composites import EmbeddingComposite
from neal import SimulatedAnnealingSampler
```
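A possible workaround sketch (an assumption: it relies on the library's cache-directory lookup honoring the standard home/XDG environment variables, which I have not verified for this package): point the cache at Lambda's writable /tmp before the dwave imports run.

```python
import os

# Lambda only allows writes under /tmp; redirect anything that resolves
# a home or cache directory there *before* importing the D-Wave stack.
os.environ["HOME"] = "/tmp"
os.environ["XDG_CACHE_HOME"] = "/tmp/.cache"

import boto3
from braket.ocean_plugin import BraketDWaveSampler
from dwave.system.composites import EmbeddingComposite
from neal import SimulatedAnnealingSampler


def handler(event, context):
    # Placeholder handler body; the actual sampling code from the notebook goes here.
    return {"status": "imports succeeded"}
```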
1
answers
0
votes
6
views
asked a month ago

How to create a (serverless) SageMaker endpoint using an existing TensorFlow .pb (frozen model) file?

Note: I am a senior developer, but I am very new to machine learning. I have two frozen TensorFlow model weight files: `weights_face_v1.0.0.pb` and `weights_plate_v1.0.0.pb`. I also have some Python code using TensorFlow 2 that loads the models and handles basic inference. The models detect faces and license plates respectively, and the surrounding code converts an input image to a numpy array and blurs the image in the areas that had detections. I want a SageMaker endpoint so that I can run inference on the models. I initially tried a regular (container-based) Lambda function, but that is too slow for our use case. A SageMaker endpoint should give us GPU inference, which should be much faster. I am struggling to find out how to do this. From what I can tell from reading the documentation and watching some YouTube videos, I need to create my own Docker container. As a start, I can use for example `763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.8.0-gpu-py39-cu112-ubuntu20.04-sagemaker`. However, I can't find any solid documentation on how to plug in my other code. How do I send an image to SageMaker? What converts the image to a numpy array? How does it know the tensor names? How do I install additional requirements? How can I use the detections to blur the image, and how can I return the resulting image? Can someone please point me in the right direction? I searched a lot but can't find any example code or blog posts that explain this process. Thank you in advance! Your help is much appreciated.
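For orientation, a minimal sketch of the pre/post-processing hooks the SageMaker TensorFlow Serving containers look for: an `inference.py` with `input_handler`/`output_handler`, typically packaged with the model (e.g. under a `code/` directory) together with a `requirements.txt` for extra dependencies. The payload format and field names below are illustrative assumptions, not a drop-in solution for the blurring workflow.

```python
# inference.py - packaged alongside the model; a sketch only.
import io
import json

import numpy as np
from PIL import Image  # extra dependency, would go in code/requirements.txt


def input_handler(data, context):
    """Convert the incoming request into the JSON body TensorFlow Serving expects."""
    if context.request_content_type == "application/x-image":
        image = Image.open(io.BytesIO(data.read())).convert("RGB")
        instance = np.asarray(image, dtype=np.float32).tolist()
        return json.dumps({"instances": [instance]})
    raise ValueError("Unsupported content type: {}".format(context.request_content_type))


def output_handler(response, context):
    """Relay TensorFlow Serving predictions back to the client."""
    # Post-processing (e.g. blurring detected regions and re-encoding the image)
    # would happen here; this sketch just passes the raw predictions through.
    return response.content, "application/json"
```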
1
answers
0
votes
2
views
asked a month ago