Hypothesis

In this analysis, we are interested in whether wine would be more expensive if its points (X1) (rated by Wine Enthusiast) increases, adjusting for categorical variables variety (X2) and continent of production (X3).


Distribution of Outcomes : Wine Prices over the World



We detect a highly skewed pattern of the price distribution, thus our response variable Price needs to be log transformed before fitting into the regression model. After log transformation, the distribution becomes much more normal:



Modeling


We try to fit a final model in the following form:

\[ log(Price) = \beta_1 + \beta_2X_{Points} + \beta_3X_{Variety} + \beta_4X_{Continent}\]

Price: The cost for a bottle of the wine.

Points: The number of points Wine Enthusiast rated the wine on a scale of 1-100.

Variety: The top four types of grapes used to make the wine (i.e. Chardonnay, Pinot Noir, Cabernet Sauvignon, Red Blend).

Continent: The continent that the wine is produced from.

Since log(Price) and Points are numerical variables, we only need to worry about categorical predictor Variety, which includes 632 unique levels. To simplify our model, we choose to select the most representative four varieties: Chardonnay, Pinot Noir, Cabernet Sauvignon, Red Blend. Here, we define representative as having more than ten thousands counts of wines for the particular variety.


Interaction: Points and Variety


After log-transforming of independent variable Price, we observe a linear trend in Log_Price vs. Points. This implies a positive correlation between Points and Log_Price. The trend lines for the 4 most frequent varieties are very close together. So there is a similar positive correlation between price and points for each variety. We also observe that lines of Cabernet Sauvignon and Chardonnay cross with each other, so let’s take a look at p-value of each term in Log_Price ~ Points * Variety:


term estimate p.value
(Intercept) -8.2792558 0.0000000
points 0.1310157 0.0000000
varietyPinot Noir 2.8857093 0.0000000
varietyCabernet Sauvignon 0.0542255 0.7309678
varietyRed Blend 1.2574494 0.0000000
points:varietyPinot Noir -0.0293591 0.0000000
points:varietyCabernet Sauvignon 0.0016767 0.3497581
points:varietyRed Blend -0.0131835 0.0000000


Since two interaction terms Points * Pinot Noir and Points * Red Blend are significant at alpha = 0.05, we further check the change of determination coefficient before and after adding the interaction term to decide whether to include it in our final model:

x
0.429
x
0.433

Without interaction term, coefficient of determination (\(R^2\)) equals to 0.429 suggests a moderate association between points, variety and log(price). Although there is a significant p-value of Points * Variety interactions, the coefficient of determination does not change much (0.429 to 0.433) after interaction term is added. Therefore, we consider the interaction as not significant, which means that the relationship between Points and Price does not vary by different Variety. So we delete the interaction term Points * Variety from model.


Complete Model: Log_Price vs. Points + Variety + Continent


Qustion: Is variable Continent significant enough to be added in the model?

res.df rss df sumsq statistic p.value
49478 12565.96 NA NA NA NA
49473 11632.08 5 933.875 794.382 0

Yes. By doing an ANOVA test, we find that p-value of the F-test equals to zero, which suggests that the larger model Log_Price ~ Points + Variety + Continent is superior over smaller model Log_Price ~ Points + Variety, which means continent is indeed a significant variable. Therefore, we add continent into our model and get the following estimates:

term estimate p.value
(Intercept) -6.656 0.000
points 0.113 0.000
varietyPinot Noir 0.278 0.000
varietyCabernet Sauvignon 0.248 0.000
varietyRed Blend 0.059 0.000
continentAfrica -0.341 0.000
continentAsia -0.049 0.064
continentEurope 0.089 0.000
continentOceania -0.241 0.000
continentSouth America -0.411 0.000


The complete model:

\[ log(Price) = -6.656 + 0.113Points + 0.278Pinot Noir +0.248Cabernet Sauvignon + 0.059Red Blend\] \[ - 0.341Africa - 0.049Asia + 0.089 Europe - 0.241Oceania -0.411SouthAmerica\]

However, because interaction between Continent and Points is significant, we decide to stratify the analysis by Continent for easier interpretation.


Stratified Analysis by Continent


continent (Intercept) points varietyPinot Noir varietyCabernet Sauvignon varietyRed Blend
North America -4.977 0.093 0.343 0.403 0.171
Oceania -8.044 0.126 0.330 0.110 0.068
Europe -11.964 0.175 0.225 -0.182 -0.171
Asia -4.120 0.082 -0.013 0.319 0.228
South America -6.018 0.101 0.202 0.016 0.439
Africa -8.301 0.128 0.573 0.122 0.133


Findings




Profound Analysis: Closer Look in North America


Non-random distribution of variances of residual tells us that bootstrapping needs to be down before fitting the model for Price vs. Points of wine in North America.


Bootstrap 1000 times


term mean_est sd_est
(Intercept) -4.981 0.069
points 0.093 0.001
varietyCabernet Sauvignon 0.403 0.007
varietyPinot Noir 0.343 0.006
varietyRed Blend 0.171 0.010

After 1000 times bootstraps, we get an almost 17 percent decrease (0.007 to 0.006) in standard error of Pinot Noir estimate, which provides us a more accurate test statistic in predicting the range of price by change of variety. Therefore, the estimated price range of Pinot Noir in North America becomes narrower after bootstraps.