Recently, I found a data about red wine quality from here. Which chemical composition might effect the red wine quality? As a big fan of red wine, I started my journey to build a simple statistical model and tried to find out what factors effect the quality.

Image for post


Before We Start

Data Description

This data was original from the UCI Machine Learning website. It contains 11 different chemical columns and 1 dependent variable column — “quality” .

Causality Thinking

Using hypothesis testing and multiple regression, we looked to see if these independent variables were significant in affecting the dependent variable. We could simply set null hypothesis are all of the columns are significant variables.

When we start a model, we have to make sure there is no ambiguous in causality. Does A effect B or B effect A? As in this case, we would like to find what effect the quality but not quality effect other things. Also, we have to identify junk variables in this step. Are all the factors in our data make sense? Sometimes it is possible that some factor seems effect our dependent variable but actually not. Leave junk variables in the data might reduce the precision and increase error of our analysis.

Check All columns:

fixed acidity: The predominant fixed acids found in wines are tartaric, malic, citric, and succinic. Their respective levels found in wine can vary greatly and wines with high acidity might taste sour which effect the quality.

volatile acidity: The amount of acetic acid in wine, which at too high of levels can lead to an unpleasant taste.

citric acid: Similar as above, very less in wine.

residual sugar: This effect the sweetness of the wine.

chlorides: This will stay in the form of potassium salt in wine.

free sulfur dioxide: The SO2 exists in equilibrium, too much will effect the health.

total sulfur dioxide: Amount of free and bound forms of S02

density: During the fermentation process, fructose is converted into alcohol, the density is reduced also influence the taste.

pH: Describes how acidic or basic a wine is, most wines are between 3–4. Monitoring the total acidity of red wine is an important indicator of its quality and an important influencing factor of its taste.

sulphates: It has the effects of selection, clarification, anti-oxidation, acidification, dissolution, etc. it used to keep it fresh and taste, and maintain the wine flavor.

alcohol: The percent alcohol content of the wine

As we found above, the causality check is finished and all of the factors make sense in the effectiveness in wine quality. No junk variables, no missing data in our data. Also, you could do simple scatter plot of each factor vs quality as correlation check. Then we could start our next step!

Multicollinearity Check

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.

You might certain sure that either A or C effect B, or both them do. But which one matters more? What if A also changes with C? You cannot sure, neither cannot computer.

As the information about each column above, we can easily find out that pH value is a result of amount acid in the wine and the density is also related to the alcohol. Free sulfur dioxide also is a part of total sulfur dioxide.

To verify our assumption, just do simple scatter plots and calculate the correlation between them.

#data #statistics #math #regression #excel #data analysis

A Statistical Analysis for Beginner — Red Wine Quality
1.30 GEEK