SageMaker XGBoost Parquet Example Code Fails and Errors out. Bug?

0

Hi, I'm trying to run the SageMaker XGBoost Parquet example linked here. I followed the exact same steps but using my own data. I uploaded my data, converted it to a pandas df. The train_df shape is (15279798, 32) while the test_df shape is (150848, 32). I then converted it to parquet files and uploaded it to an S3 bucket - per example instructions.

My error is as follows:

Failure reason
AlgorithmError: framework error: Traceback (most recent call last): File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/data_utils.py", line 422, in _get_parquet_dmatrix_pipe_mode data = np.vstack(examples) File "<__array_function__ internals>", line 6, in vstack File "/miniconda3/lib/python3.7/site-packages/numpy/core/shape_base.py", line 283, in vstack return _nx.concatenate(arrs, 0) File "<__array_function__ internals>", line 6, in concatenate ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 32 and the array at index 1 has size 9 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train entrypoint() File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/training.py", line 94, in main train(framework.tr

But I'm confused because the train and test are the same shape and I added no extra code. My code below:

# requires PyArrow installed
train.to_parquet("Xgb_train.parquet")
test.to_parquet("Xgb_test.parquet")

%%time
sagemaker.Session().upload_data(
    "Xgb_train.parquet", bucket=bucket, key_prefix=prefix + "/" + "Ptrain"
)

sagemaker.Session().upload_data(
    "Xgb_test.parquet", bucket=bucket, key_prefix=prefix + "/" + "Ptest"
)

container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-2")

%%time
import time
from time import gmtime, strftime

job_name = "xgboost-parquet-example-training-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", job_name)

# Ensure that the training and validation data folders generated above are reflected in the "InputDataConfig" parameter below.

create_training_params = {
    "AlgorithmSpecification": {"TrainingImage": container, "TrainingInputMode": "Pipe"},
    "RoleArn": role,
    "OutputDataConfig": {"S3OutputPath": bucket_path + "/" + prefix + "/single-xgboost"},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.m5.2xlarge", "VolumeSizeInGB": 20},
    "TrainingJobName": job_name,
    "HyperParameters": {
        "max_depth": "5",
        "eta": "0.2",
        "gamma": "4",
        "min_child_weight": "6",
        "subsample": "0.7",
        "objective": "reg:linear",
        "num_round": "10",
        "verbosity": "2",
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": bucket_path + "/" + prefix + "/Ptrain",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "application/x-parquet",
            "CompressionType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": bucket_path + "/" + prefix + "/Ptest",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "application/x-parquet",
            "CompressionType": "None",
        },
    ],
}


client = boto3.client("sagemaker", region_name=region)
client.create_training_job(**create_training_params)
print(client)
status = client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
print(status)
while status != "Completed" and status != "Failed":
    time.sleep(60)
    status = client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
    print(status)
asked 2 years ago652 views
1 Answer
0
Accepted Answer

I just changed my bucket name and file names. It worked now.

answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions