Overview: I'm trying to train an XGBoost model on a bunch of parquet files sitting in S3 using Dask, by spinning up a Dask cluster on Fargate and connecting a client to it.
The dataframe is about 140 GB in total. I scaled the Fargate cluster up to:
Workers: 40
Total threads: 160
Total memory: 1 TB
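For context, the cluster setup and data load look roughly like this (the bucket path and the exact FargateCluster arguments below are placeholders, not my real config):

from dask_cloudprovider.aws import FargateCluster
from dask.distributed import Client
import dask.dataframe as dd

# Spin up the Fargate-backed Dask cluster (worker_cpu is in Fargate CPU units,
# worker_mem is in MB; the numbers here are illustrative, not my exact settings)
cluster = FargateCluster(
    n_workers=40,
    worker_cpu=4096,
    worker_mem=26624,
)
client = Client(cluster)

# Read all the parquet files straight from S3 into one Dask DataFrame (~140 GB total)
df = dd.read_parquet("s3://my-bucket/path/to/data/*.parquet")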
So there should be enough memory to hold the data and its tasks. Each worker has 9+ GB with 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does push the task bytes per worker up a bit, but never above the threshold where it would fail.
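This is roughly how I sanity-check that the scheduler actually sees that capacity before loading anything (just a check against client.scheduler_info(), nothing special):

# Count workers, threads, and total memory the scheduler reports
info = client.scheduler_info()
workers = info["workers"]
total_threads = sum(w["nthreads"] for w in workers.values())
total_mem_gb = sum(w["memory_limit"] for w in workers.values()) / 1e9
print(f"workers={len(workers)}, threads={total_threads}, memory={total_mem_gb:.0f} GB")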
Next I run xgb.dask.train, which uses the xgboost package itself, not the dask_ml.xgboost package. Very quickly the workers die and I get the error XGBoostError: rabit/internal/utils.h:90: Allreduce failed. When I tried this with a single file of only 17 MB, I still got the error, just with only a couple of workers dying. Does anyone know why this happens, given that the cluster has far more memory than the dataframe needs?
import xgboost as xgb

# Convert the Dask DataFrames to Dask arrays for XGBoost
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()

# Build the distributed DMatrix across the workers
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
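For completeness, this is what I'd do with the result afterwards (it never gets this far because the workers die during training); X_test is only converted above for this step:

# Pull the trained booster out of the result dict and predict on the held-out split
booster = output["booster"]
predictions = xgb.dask.predict(client, booster, X_test)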