Bad RMSE when predicting Price with Linear Regression


Hi. I have a dataset of price data. It looks like this:

Price  Branch  ItemCode  Discount  DateTimeOfPrice
10     002     52345436  0.33      2022-03-24 14:00

The dataset has about 1M records.

I feature-engineered it in the following way:

Price  Discount  ItemCode  Year  Month  Day  Hour  Branch1  Branch2  Branch3
10     0.33      52345436  2022  03     24   14    0        1        0

Each component of DateTimeOfPrice got its own column.

We have 3 branches. To keep the algorithm from treating the Branch column as some kind of priority (ordinal) value, I created 3 new columns, one per branch. If the item belongs to branch 2, the Branch2 column gets the value 1; otherwise it is 0.
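A minimal sketch of the encoding described above, assuming pandas. The column names follow the question; the sample rows and the helper name `engineer_features` are illustrative, not taken from the real dataset or pipeline.

```python
import pandas as pd

# Illustrative rows only; the real dataset has about 1M records.
raw = pd.DataFrame({
    "Price": [10.0, 12.5, 9.0],
    "Branch": [2, 1, 3],
    "ItemCode": [52345436, 52345436, 11111111],
    "Discount": [0.33, 0.10, 0.00],
    "DateTimeOfPrice": ["2022-03-24 14:00", "2022-03-25 09:00", "2022-04-01 18:00"],
})

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Split the timestamp into parts and one-hot encode Branch (hypothetical helper)."""
    out = df.copy()
    ts = pd.to_datetime(out.pop("DateTimeOfPrice"))
    out["Year"] = ts.dt.year
    out["Month"] = ts.dt.month
    out["Day"] = ts.dt.day
    out["Hour"] = ts.dt.hour
    # One 0/1 column per branch, so no ordering between branches is implied.
    branch = pd.get_dummies(out.pop("Branch"), prefix="Branch", prefix_sep="").astype(int)
    return pd.concat([out, branch], axis=1)

features = engineer_features(raw)
print(features)
```

Note that ItemCode is still a plain number here, so a linear model would treat it as a quantity in the same way it would have treated an unencoded Branch column.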

I ran the built-in Linear Learner and XGBoost algorithms and also SageMaker Autopilot. In all the runs, the best RMSE was 60, and prediction/validation sometimes gives a result that is far from the actual value. I also tried running XGBoost from a notebook with the following parameters:

hyperparams = {
    "max_depth": "7",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "100",
    "eval_metric": "rmse",
    "verbosity": "2",
}

Still, the RMSE is around 60.

Please advise what can be done to improve the metric and the predictions.
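For reference, RMSE is the square root of the mean squared error, so it is expressed in the same units as Price. A minimal sketch, assuming NumPy and made-up numbers, that computes it and compares a model against a naive "always predict the mean" baseline:

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values for illustration; not taken from the dataset in the question.
actual = np.array([100.0, 250.0, 80.0, 400.0])
predicted = np.array([130.0, 200.0, 90.0, 340.0])

# Naive baseline: predict the mean price for every row.
baseline = np.full_like(actual, actual.mean())

model_rmse = rmse(actual, predicted)
baseline_rmse = rmse(actual, baseline)
print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

Comparing against a baseline like this gives the value 60 some context relative to the spread of prices.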

3 Answers

Since I see you have a timestamp field in your data, would it be fair to assume your use case is mainly aimed at forecasting future prices - rather than estimating missing historical prices at different points in time?

If so, plain tabular regression (the Autopilot regression task type) is probably not a good way to tackle this problem; forecasting techniques would likely work better. You could explore:

  • SageMaker Canvas, which offers a forecasting model (see the docs here to make sure your input timestamp is recognised so that Canvas shows you the forecasting option)
  • Amazon Forecast, a dedicated managed forecasting service separate from SageMaker
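Forecasting tools generally work on a time series per item (an item identifier, a timestamp, and a target value) rather than on independent tabular rows. A rough sketch of reshaping the raw price records into one row per item, branch, and day, assuming pandas; the sample rows, the daily frequency, the mean aggregation, and the output column names are all assumptions for illustration:

```python
import pandas as pd

# Illustrative rows only; column names follow the question.
raw = pd.DataFrame({
    "Price": [10.0, 12.0, 11.0, 20.0],
    "Branch": [2, 2, 2, 1],
    "ItemCode": [52345436, 52345436, 52345436, 52345436],
    "DateTimeOfPrice": [
        "2022-03-24 09:00",
        "2022-03-24 14:00",
        "2022-03-25 10:00",
        "2022-03-24 12:00",
    ],
})

# Truncate each timestamp to its calendar day (assumed forecast frequency).
raw["PriceDate"] = pd.to_datetime(raw["DateTimeOfPrice"]).dt.normalize()

# One row per (item, branch, day); several prices on the same day are averaged.
daily = (
    raw.groupby(["ItemCode", "Branch", "PriceDate"], as_index=False)["Price"]
       .mean()
       .rename(columns={"Price": "ItemPrice"})
       .sort_values(["ItemCode", "Branch", "PriceDate"])
       .reset_index(drop=True)
)
print(daily)
```

Whether a mean, last, or minimum price per day is the right aggregation depends on the use case.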
AWS
EXPERT
Alex_T
answered 2 years ago
  • I followed your suggestion and used SageMaker Canvas.

    I modified the data structure in the following way, creating 5 records only:

    ItemPrice Branch Discount ItemCode PriceDate

    I chose ItemCode as the "id" and grouped by Branch. However, the prediction score is very poor: 22%.

    According to the analysis, the reason is the Discount column. So I removed it and ran the process again, and the score was even lower :(


I followed your suggestion and used SageMaker Canvas.

I modified the data structure in the following way:

ItemPrice  Branch  Discount  ItemCode  PriceDate
Data       Data    Data      Data      Data
Data       Data    Data      Data      Data

I chose ItemCode as the "id" and grouped by Branch. However, the prediction score is very poor: 22%.

According to the analysis, the reason is the Discount column. So I removed it and ran the process again, and the score was even lower :(

AWS
answered 2 years ago

Before you start building a model, I suggest doing some data exploration. Does your data have seasonality? Some items are simply not seasonal.
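One quick way to look for seasonality is to compare averages by calendar unit and to check the autocorrelation at the suspected period. A minimal sketch, assuming pandas and NumPy; the series below is synthetic with a deliberate weekly cycle, so a real check would substitute the price history of a single item:

```python
import numpy as np
import pandas as pd

# Synthetic daily prices with a built-in 7-day cycle (illustration only).
days = pd.date_range("2022-01-01", periods=56, freq="D")
price = 100 + 10 * np.sin(2 * np.pi * np.arange(56) / 7)
series = pd.Series(price, index=days)

# Average price per weekday (0 = Monday); large differences hint at a weekly pattern.
by_weekday = series.groupby(series.index.dayofweek).mean()

# Autocorrelation at the suspected period versus an unrelated lag.
lag7 = series.autocorr(lag=7)
lag3 = series.autocorr(lag=3)

print(by_weekday.round(2))
print(f"autocorr lag 7: {lag7:.2f}, lag 3: {lag3:.2f}")
```

A high autocorrelation at lag 7 together with a low or negative one at other lags suggests a weekly cycle; a flat profile suggests the item is not seasonal at that period.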

AWS
answered 2 years ago
