Using Spotfire for Predictive Analytics (Regression Modeling)

Posted on Updated on

We are building a model using Linear Regression to forecast sales `Sales` ~ `Order Quantity` + `Discount` + `Shipping Cost` + `Profit` + `Unit Price` + `Product Base Margin`

This is the model with “Sales” as the Response variable and all the subsequent columns after the “~” considered as Predictor variables.

Let us click on “OK” to examine the results for the model In the Model Summary pane, we can check the summary metrics:

Residual standard error: 1421 on 8329 degrees of freedom
(63 observations deleted due to missingness)
Multiple R-squared: 0.8421, Adjusted R-squared: 0.842
F-statistic: 7406 on 6 and 8329 DF, p-value: 0

Below is the significance of the model parameters:

Residual Standard Error: A lower value indicates the model is better fit for our data.

Adjusted R-Squared: This is a commonly used measure of fit of a regression equation. It penalizes the addition of too many variables while rewarding a good fit of the regression equation. A higher Adjusted R-Squared value represents the model to be a better fit.

p-value: Predictors with this value closer to zero are better contributing to the model

Some of the other factors which will influence our model are Collinearity and multicollinearity, and Variance Inflation Factor (VIF), AIC and BIC values can help assess our model.

Collinearity is a case of an independent variable being a linear function of another. And in Mulitcollinearity, a variable is a linear function of two or more variables. These issues can increase the likelihood of making false conclusions from our estimates.

High VIF means that multicollinearity significantly impacts the equation whereas lower AIC and BIC are better. The Table of Coefficients will have various p-values for various predictors (also called Regressors). Lower p-values will give the significance of each predictor in the model If there are patterns in the the “Residuals vs. Fitted” plot, then the current model could be improved. A simple horizontal bar signifying the relative importance of each predictor used in the model. Discount is the least important predictor. If the normal QQ plot closely approximates to the line y=x, then the model fits the data well.

In the above plot, the larger values represent points (data points) which are more influential and have to be further investigated.

Depending on these various factors, the model has to go through a series of investigative steps till a satisfactory level of fit is reached.

In addition to the knowledge of statistics, domain specific understanding is also quite crucial in assessing the inputs and the results. For example when analyzing sales, we examine specific types of sales broken into tiers depending on various criteria such as quarter of the year, geographic factors, economic indicators, seasonal influences etc.

We can exclude the outliers which will skew our results. Further, appropriate weights could be distributed on each input parameter to identify whether the specific type of sale is profitable to our business.