XGBoost Error: Allreduce failed - 100GB Dask DataFrame on AWS Fargate ECS cluster dies with 1 TB of memory.
Overview: I'm trying to train an XGBoost model on a set of parquet files sitting in S3 using Dask, by standing up an AWS Fargate cluster and connecting a Dask client to it.
The total DataFrame size is about 140 GB. I scaled up a Fargate cluster with these properties:

Workers: 40
Total threads: 160
Total memory: 1 TB

So there should be plenty of memory to hold the data. Each worker has 9+ GB of memory and 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does push the task bytes per worker a little high, but never above the threshold where it would fail.
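For context, the cluster and data loading look roughly like this (a minimal sketch using dask_cloudprovider's FargateCluster; the bucket path and exact resource arguments are illustrative, not my exact configuration):

import dask.dataframe as dd
from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster

# 40 workers x 4 vCPU each; worker_mem is in MB, sized so the cluster totals ~1 TB
cluster = FargateCluster(n_workers=40, worker_cpu=4096, worker_mem=25600)
client = Client(cluster)

# Read every parquet file under the prefix into one Dask DataFrame (~140 GB total)
df = dd.read_parquet("s3://my-bucket/my-prefix/")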
Next I run xgb.dask.train, which uses the native Dask API in the xgboost package (not the dask_ml.xgboost package). Very quickly the workers die and I get the error:

XGBoostError: rabit/internal/utils.h:90: Allreduce failed

When I attempted this with a single file of only 17 MB, I would still get this error, though only a couple of workers died. Does anyone know why this happens, given that the cluster has several times the memory of the DataFrame?
import xgboost as xgb

# Convert the Dask DataFrames to Dask arrays for XGBoost
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()

# Build the distributed DMatrix across the workers
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

# Train with xgboost's native Dask API; evals takes a list of (DMatrix, name) tuples
output = xgb.dask.train(client, {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
                        dtrain, num_boost_round=100, evals=[(dtrain, "train")])
Hi, regarding the issue you are seeing, we would need additional information about the Fargate tasks that are running and the failures you are encountering. I recommend opening a case with the AWS Premium Support ECS Fargate team so that we can discuss the specific details of the issue along with the configuration in your use case. You can open a support case with AWS using this link: https://console.aws.amazon.com/support/home?#/case/create
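When you open the case, it helps to attach the recent Dask worker logs from before the workers died. One way to collect them (a minimal sketch; it assumes a live dask.distributed Client still connected to the scheduler, and the scheduler address is a placeholder):

from dask.distributed import Client

client = Client("<scheduler-address>")  # placeholder; reuse your existing client if you have one

# Collect recent log lines from every worker to attach to the support case
logs = client.get_worker_logs()  # maps worker address -> list of (level, message) pairs
for worker, lines in logs.items():
    print(worker)
    for level, message in lines:
        print(f"  [{level}] {message}")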