
Questions tagged with Containers


Browse through the questions and answers listed below or filter and sort to narrow down your results.

Intermittent ConnectTimeoutError accessing SSM

My app uses SSM Parameter Store on Fargate instances and locally in a Docker container. We're accessing it with Boto3 from Python. Multiple developers on my team, in different countries, have seen a very intermittent issue, cropping up maybe once every 1–4 weeks, where for 10 minutes or so, calls to SSM will fail with this error:

```
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://ssm.us-east-2.amazonaws.com/"
```

The ECS instances do not see the issue as far as I'm aware; this is only a problem when we're accessing the endpoint via Boto3 from our home networks. It occurs to me now that I haven't verified whether all users see the problem at the same time, or if it's just one user at a time. I will try to test this the next time I see it.

I have tried:

1. Reducing the number of calls we make to SSM. It's now down to about 2/sec per user at the maximum, with effectively no other users concurrently hitting the API. So we're never getting anywhere near the [40 requests/second limit](https://docs.aws.amazon.com/general/latest/gr/ssm.html#limits_ssm). Looking at the logs, the most I can see is 12 requests in *one minute.* We're just not using this very aggressively, so it doesn't seem possible that the problem is throttling. All of our calls are paginated calls to GetParametersByPath, and we are using `WithDecryption=true`.
2. Changing the Boto3 retry mode from legacy to [standard](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#standard-retry-mode) (see the sketch below). This is probably a good thing to do anyway, but it doesn't seem to have fixed the problem.

The only reliable solution I've come up with is to wait. Eventually, the endpoint comes back and my application begins working again. But this is really an unacceptable level of service interruption, and I feel like I must be doing something wrong. Is there a setting I have overlooked? Does anyone have any troubleshooting suggestions for things to try when I inevitably see the problem again?
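For reference, here is a minimal sketch of the setup described above — standard retry mode plus an explicit connect timeout on the SSM client, and the paginated GetParametersByPath call. The region, parameter path, and timeout values are illustrative placeholders, not our exact configuration:

```python
# Sketch only: standard retry mode, explicit timeouts, and a paginated
# GetParametersByPath call. Path and timeout values are placeholders.
import boto3
from botocore.config import Config

config = Config(
    region_name="us-east-2",
    connect_timeout=5,   # seconds to wait for the connection before ConnectTimeoutError
    read_timeout=10,
    retries={"mode": "standard", "max_attempts": 5},
)
ssm = boto3.client("ssm", config=config)

# GetParametersByPath returns at most 10 parameters per page, so paginate.
parameters = []
paginator = ssm.get_paginator("get_parameters_by_path")
for page in paginator.paginate(Path="/my/app/", Recursive=True, WithDecryption=True):
    parameters.extend(page["Parameters"])
```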
1 answer · 0 votes · 85 views
asked 2 months ago

"Unable to load the service index for source" error for a CodeArtifact NuGet feed using a Dockerfile

Hi! We're using AWS CodeArtifact for storing our packages, and when we try to build a Docker image from our Dockerfile it fails because it's unable to load the source during the restore process. We have a web API in .NET that we want to deploy using AWS Fargate. This project runs smoothly from Visual Studio 2022 using Docker, but we can't build the image from PowerShell after adding our packages from CodeArtifact. Our approach to include the credentials in the build is to pass the NuGet.Config stored on the host using BuildKit.

This is our Dockerfile:

```dockerfile
#See https://aka.ms/containerfastmode to understand how Visual Studio uses this Dockerfile to build your images for faster debugging.
FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base
WORKDIR /app
EXPOSE 80
ENV ASPNETCORE_URLS=http://+:49151

FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY . .
WORKDIR /src
COPY ["Src/Presentation/Project.API/Project.API.csproj", "Src/Presentation/Project.API/"]
RUN --mount=type=cache,id=nuget,target=/root/.nuget/packages \
    --mount=type=secret,id=nugetconfig \
    dotnet restore "Src/Presentation/Project.API/Project.API.csproj" \
    --configfile /run/secrets/nugetconfig
COPY . .
WORKDIR "/src/Src/Presentation/Project.API"
RUN --mount=type=cache,id=nuget,target=/root/.nuget/packages \
    dotnet build "Project.API.csproj" -c Release -o /app/build \
    --no-restore

FROM build AS publish
RUN --mount=type=cache,id=nuget,target=/root/.nuget/packages \
    dotnet publish "Project.API.csproj" -c Release -o /app/publish \
    --no-restore

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "Project.API.dll"]
```

Our script in PowerShell:

```powershell
docker buildx build --secret id=nugetconfig,src=$HOME\AppData\Roaming\NuGet\NuGet.Config -f "Src\Presentation\Project.API\Dockerfile" -t my-dotnet-image .
```

Output:

```
#14 [build 6/9] RUN --mount=type=cache,id=nuget,target=/root/.nuget/packages --mount=type=secret,id=nugetconfig dotnet restore "Src/Presentation/Project.API/Project.API.csproj" --configfile /run/secrets/nugetconfig
#14 1.175 Determining projects to restore...
#14 2.504 /src/Src/Presentation/Project.API/Project.API.csproj : error NU1301: Unable to load the service index for source https://domain-123456789012.d.codeartifact.us-east-2.amazonaws.com/nuget/repository/v3/index.json.
```

What are we missing here? Once we have the Dockerfile working we want to use CDK for deploying the Docker image alongside our infrastructure. We're using the **aws codeartifact login** command to authenticate with the service. Thanks!
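For extra context on the credential flow: tokens issued by **aws codeartifact login** are temporary (12 hours at most), so a NuGet.Config captured earlier can go stale by the time it is mounted as a BuildKit secret. Below is a minimal Python (boto3) sketch of regenerating the config with a fresh token right before running `docker buildx build`. It is only a sketch — the domain, account ID, region, and repository names reuse the placeholders from the error output above, and the file name `nuget.codeartifact.config` is made up for illustration:

```python
# Sketch only: write a NuGet.Config containing a fresh CodeArtifact token
# before the Docker build. Domain, owner account, region, and repository
# are the placeholders from the error message above, not real values.
import boto3

codeartifact = boto3.client("codeartifact", region_name="us-east-2")
token = codeartifact.get_authorization_token(
    domain="domain",
    domainOwner="123456789012",
    durationSeconds=43200,  # 12 hours is the maximum token lifetime
)["authorizationToken"]

# CodeArtifact NuGet feeds accept user name "aws" with the token as the password.
nuget_config = f"""<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="codeartifact"
         value="https://domain-123456789012.d.codeartifact.us-east-2.amazonaws.com/nuget/repository/v3/index.json" />
  </packageSources>
  <packageSourceCredentials>
    <codeartifact>
      <add key="Username" value="aws" />
      <add key="ClearTextPassword" value="{token}" />
    </codeartifact>
  </packageSourceCredentials>
</configuration>
"""

# The build would then mount this file with --secret id=nugetconfig,src=...
with open("nuget.codeartifact.config", "w", encoding="utf-8") as f:
    f.write(nuget_config)
```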
1 answer · 0 votes · 74 views
asked 2 months ago

Error for Training job catboost-classification-model, ErrorMessage "TypeError: Cannot convert 'xxx' to float"

When I performed the following AWS tutorial, I got an error when training the model: **https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb**

The error that occurred is:

```
UnexpectedStatusException: Error for Training job jumpstart-catboost-classification-model-2022-07-22-07-33-18-038: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "TypeError: Cannot convert 'b'BROOKLYN'' to float
```

**These are all the files that I have uploaded to the S3 bucket**: Amazon S3 --> Buckets --> R-sandbox-sagemaker --> ml/ --> train/, and in the train folder 'data.csv' and 'categorical_index.json' are uploaded based on the mentioned tutorial. The data point "BROOKLYN" is in a categorical column, and its index is already included in the JSON file to tell CatBoost that it is categorical data. The data has 55 categorical columns; only two of them are integers, all the others are strings.

Could you give me some advice on how to solve it? Here are the full code and traceback for the issue:

```python
!pip install sagemaker ipywidgets --upgrade --quiet

import sagemaker, boto3, json
from sagemaker import get_execution_role

aws_role = get_execution_role()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

## 2.1 Retrieve Training Artifacts
# Retrieve the training docker container, the training algorithm source, and the tabular algorithm.
# Note that model_version="*" fetches the latest model.
# Currently, not all the object detection models in jumpstart support finetuning. Thus, we manually select a model
# which supports finetuning.
from sagemaker import image_uris, model_uris, script_uris

train_model_id, train_model_version, train_scope = "catboost-classification-model", "*", "training"
training_instance_type = "ml.m5.xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type,
)
# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)
# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

## 2.2 Set Training Parameters
# Sample training data is available in this bucket
training_data_bucket = "R-sandbox-sagemaker"
training_data_prefix = "ml"
training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)
# [Optional] Override default hyperparameters with custom values
hyperparameters["iterations"] = "500"  # The same hyperparameter is named as "iterations" for CatBoost
print(hyperparameters)

## 2.3. Train with Automatic Model Tuning
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

use_amt = True

if train_model_id == "lightgbm-classification-model":
    hyperparameter_ranges = {
        "learning_rate": ContinuousParameter(1e-4, 1, scaling_type="Logarithmic"),
        "num_boost_round": IntegerParameter(2, 30),
        "early_stopping_rounds": IntegerParameter(2, 30),
        "num_leaves": IntegerParameter(10, 50),
        "feature_fraction": ContinuousParameter(0, 1),
        "bagging_fraction": ContinuousParameter(0, 1),
        "bagging_freq": IntegerParameter(1, 10),
        "max_depth": IntegerParameter(5, 30),
        "min_data_in_leaf": IntegerParameter(5, 50),
    }

if train_model_id == "catboost-classification-model":
    hyperparameter_ranges = {
        "learning_rate": ContinuousParameter(0.00001, 0.1, scaling_type="Logarithmic"),
        "iterations": IntegerParameter(50, 1000),
        "early_stopping_rounds": IntegerParameter(1, 10),
        "depth": IntegerParameter(1, 10),
        "l2_leaf_reg": IntegerParameter(1, 10),
        "random_strength": ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
    }

## 2.4. Start Training
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

training_job_name = name_from_base(f"jumpstart-{'catboost-classification-model'}-training")

# Create SageMaker Estimator instance
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    #hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a SageMaker Training job by passing s3 path of the training data
tabular_estimator.fit(
    {"training": training_dataset_s3_path}, logs=True, job_name=training_job_name
)
```

```
2022-07-22 07:33:18 Starting - Starting the training job...
2022-07-22 07:33:46 Starting - Preparing the instances for trainingProfilerReport-1658475198: InProgress
2022-07-22 07:35:06 Downloading - Downloading input data...
2022-07-22 07:35:46 Training - Downloading the training image...
2022-07-22 07:36:11 Training - Training image download completed. Training in progress..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-07-22 07:36:14,025 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-07-22 07:36:14,027 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-07-22 07:36:14,036 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-07-22 07:36:14,041 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-07-22 07:36:15,901 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.8 -m pip install -r requirements.txt
Processing ./catboost/tenacity-8.0.1-py3-none-any.whl
Processing ./catboost/plotly-5.1.0-py2.py3-none-any.whl
Processing ./catboost/graphviz-0.17-py3-none-any.whl
Processing ./catboost/catboost-1.0.1-cp38-none-manylinux1_x86_64.whl
Processing ./sagemaker_jumpstart_script_utilities-1.0.0-py2.py3-none-any.whl
Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from plotly==5.1.0->-r requirements.txt (line 2)) (1.16.0)
Requirement already satisfied: numpy>=1.16.0 in /opt/conda/lib/python3.8/site-packages (from catboost==1.0.1->-r requirements.txt (line 4)) (1.19.1)
Requirement already satisfied: scipy in /opt/conda/lib/python3.8/site-packages (from catboost==1.0.1->-r requirements.txt (line 4)) (1.7.1)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.8/site-packages (from catboost==1.0.1->-r requirements.txt (line 4)) (3.4.3)
Requirement already satisfied: pandas>=0.24.0 in /opt/conda/lib/python3.8/site-packages (from catboost==1.0.1->-r requirements.txt (line 4)) (1.2.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.8/site-packages (from pandas>=0.24.0->catboost==1.0.1->-r requirements.txt (line 4)) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.8/site-packages (from pandas>=0.24.0->catboost==1.0.1->-r requirements.txt (line 4)) (2021.3)
Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.8/site-packages (from matplotlib->catboost==1.0.1->-r requirements.txt (line 4)) (8.3.2)
Requirement already satisfied: pyparsing>=2.2.1 in /opt/conda/lib/python3.8/site-packages (from matplotlib->catboost==1.0.1->-r requirements.txt (line 4)) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.8/site-packages (from matplotlib->catboost==1.0.1->-r requirements.txt (line 4)) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.8/site-packages (from matplotlib->catboost==1.0.1->-r requirements.txt (line 4)) (1.3.2)
tenacity is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.
Installing collected packages: plotly, graphviz, sagemaker-jumpstart-script-utilities, catboost
  Attempting uninstall: plotly
    Found existing installation: plotly 5.3.1
    Uninstalling plotly-5.3.1:
      Successfully uninstalled plotly-5.3.1
Successfully installed catboost-1.0.1 graphviz-0.17 plotly-5.1.0 sagemaker-jumpstart-script-utilities-1.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
2022-07-22 07:36:32,568 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-07-22 07:36:32,604 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "model": "/opt/ml/input/data/model",
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {},
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "model": {
            "ContentType": "application/x-sagemaker-model",
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "jumpstart-catboost-classification-model-2022-07-22-07-33-18-038",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/catboost/transfer_learning/classification/v1.1.3/sourcedir.tar.gz",
    "module_name": "transfer_learning",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.m5.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.m5.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "transfer_learning.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={}
SM_USER_ENTRY_POINT=transfer_learning.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.m5.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.m5.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"model":{"ContentType":"application/x-sagemaker-model","RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["model","training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=transfer_learning
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/catboost/transfer_learning/classification/v1.1.3/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"model":"/opt/ml/input/data/model","training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{},"input_config_dir":"/opt/ml/input/config","input_data_config":{"model":{"ContentType":"application/x-sagemaker-model","RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"jumpstart-catboost-classification-model-2022-07-22-07-33-18-038","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/catboost/transfer_learning/classification/v1.1.3/sourcedir.tar.gz","module_name":"transfer_learning","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.m5.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.m5.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"transfer_learning.py"}
SM_USER_ARGS=[]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_MODEL=/opt/ml/input/data/model
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.8 transfer_learning.py
INFO:root:Validation data is not found. 20.0% of training data is randomly selected as validation data. The seed for random sampling is 200.
Traceback (most recent call last):
  File "_catboost.pyx", line 2167, in _catboost.get_float_feature
  File "_catboost.pyx", line 1125, in _catboost._FloatOrNan
  File "_catboost.pyx", line 949, in _catboost._FloatOrNanFromString
TypeError: Cannot convert 'b'BROOKLYN'' to float
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "transfer_learning.py", line 221, in <module>
    ru
```
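For what it's worth, here is the local sanity check I plan to run on the uploaded files before retrying the job. It is only a sketch: it assumes `data.csv` has no header row and that `categorical_index.json` contains the list of categorical column indices, either as a bare list or wrapped in a one-key dictionary — that assumption, and whether the indices should count the target column in position 0, should be double-checked against the tutorial notebook:

```python
# Sketch only: flag every column in data.csv that cannot be parsed as a float
# but is not declared in categorical_index.json.
import json
import pandas as pd

df = pd.read_csv("data.csv", header=None)  # JumpStart tabular format: no header row

with open("categorical_index.json") as f:
    payload = json.load(f)
# Accept either a bare list of indices or a single-key dictionary wrapping that list.
declared = set(payload if isinstance(payload, list) else next(iter(payload.values())))

def has_non_numeric_values(col: pd.Series) -> bool:
    """True if the column holds a non-null value that cannot be parsed as a float."""
    coerced = pd.to_numeric(col, errors="coerce")
    return bool((coerced.isna() & col.notna()).any())

non_numeric = {int(idx) for idx in df.columns if has_non_numeric_values(df[idx])}
print("Non-numeric columns not declared categorical:", sorted(non_numeric - declared))
```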
2 answers · 0 votes · 100 views
asked 2 months ago