XGBoost Error: Allreduce failed - 100GB Dask Dataframe on AWS Fargate ECS cluster dies with 1T of memory.

0

Overview: I'm trying to run an XGboost model on a bunch of parquet files sitting in S3 using dask by setting up a fargate cluster and connecting it to a Dask cluster.

Total dataframe size totals to about 140 GB of data. I scaled up a fargate cluster with properties:

Workers: 40 Total threads: 160 Total memory: 1 TB So there should be enough data to hold the data tasks. Each worker has 9+ GB with 4 Threads. I do some very basic preprocessing and then I create a DaskDMatrix which does cause the task bytes per worker to get a little high, but never above the threshold where it would fail.

Next I run xgb.dask.train which utilizes the xgboost package not the dask_ml.xgboost package. Very quickly, the workers die and I get the error XGBoostError: rabit/internal/utils.h:90: Allreduce failed. When I attempted this with a single file with only 17MB of data, I would still get this error but only a couple workers die. Does anyone know why this happens since I have double the memory of the dataframe?

X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
y_train = y_train
y_test = y_test

dtrain = xgb.dask.DaskDMatrix(client,X_train, y_train)

output = xgb.dask.train( client, {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"}, dtrain, num_boost_round=100, evals=[(dtrain, "train")])`

1개 답변
0

Hi, Regarding the issue that you are seeing, we would need additional information related to the Fargate tasks that are running and the failures that you are seeing. I would recommend you to open a case with AWS Premium Support ECS Fargate team so that we can discuss more on the specific details of the issue along with the configuration in your use case. You can open a support case with AWS using the link: https://console.aws.amazon.com/support/home?#/case/create

AWS
지원 엔지니어
답변함 2년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인