Predicting Wine Quality

Using Multiple Ensemble and

Balancing Methods

Professor: Dr. Bin Liu

Co-author: Nicholas Gmernicki

Data Mining - Spring 2024

Executive Summary

Wine is a large industry, with a market value of $483.6 billion in 2023 (IMARC, 2023). However, despite its long history, the market has recently faced issues that anyone operating in it must take into account. The two key issues, both of which have become more evident in recent years, are an overall decrease in the demand for wine (Rob, 2024) and a decrease in global wine production volume (OIV, 2023).

The decrease in production volume is most evident in the Southern Hemisphere and in Europe, where climate issues have led to drought and the lowest crop yields in years. The United States, by contrast, has seen an increase in its wine production, but this increase has outpaced consumer demand. In the future, this overproduction may lead to waste as excess wine sits in storage.

Given these problems in the market, selling a good quality wine is essential, and understanding what makes a wine “good” requires examining the factors that make up a wine. With this in mind, the objective of this paper is to develop a predictive model that classifies produced wine, helping a winery determine which products to keep and sell and which to discard.

The data for the project was taken from the UCI Machine Learning Repository (initially found on Kaggle; both datasets were then sourced directly from UCI). There were two datasets, containing samples of red and white wines, for a total of 6,497 observations of twelve features, including the original target feature, quality.

We used a standard system design method to build our classifier, starting with data pre-processing, then algorithm development, post-processing (results and evaluation), and information gained (conclusions and lessons learned).

During pre-processing, the target feature quality_label was created in order to build a classifier model. When the datasets were combined, an indicator feature for the color of the wine (color_indicator) was added, and any duplicates found in the data were removed.

To deal with the imbalanced data, we used the imbalanced-learn package, applying Random Oversampling with the Random Forest ensemble method. Resampling the training data explicitly is a more controlled way of balancing the data than relying on the package's built-in balanced classifiers. With this model we had a 6% false positive rate, using 24% of the data as a test set. Alcohol, density, and chlorides were found to be the most important features in determining the quality of the wine.

Business Problem

Wine is one of the oldest products in the world, having a history stretching back thousands of years. This long-standing history of consumption means that there are different problems to consider when entering (or maintaining a position in) the wine market. According to Jean-Michel Valette, a Master of Wine, there are three key concepts to keep in mind: brand-building, brand strength, and distribution (Rieger, 2017).

While brand strength is important, brand-building and distribution have the most to do with wine quality. Brand-building focuses on making people aware of your brand, getting them to try your products, and turning them into repeat customers (Rieger, 2017). Distribution involves making sure your product meets regulatory standards and satisfies customers. To build repeat customers and maintain distribution, the product, in this case wine, needs to be of good quality.

Wine quality is used to determine price, with a positive relationship between the two: the higher the quality, the greater the price. Many factors contribute to wine quality before the grapes are even made into wine. Viniculture takes into consideration the environment (such as temperature, sunlight, and soil), growing practices (canopy management, pruning, harvesting), and variations in the winemaking method, whose standard process comprises maceration, fermentation, extraction, and aging (JJ Buckley Fine Wines, 2018).

Determining the quality of wine itself depends on a variety of factors, which must all be balanced and taken into consideration when rating or reviewing wines. According to Charters and Pettigrew, the main gustatory (taste) dimensions identified by wine drinkers are taste, smoothness, mouthfeel and body, drinkability, structural balance, concentration, complexity, and interest (Charters & Pettigrew, 2007).

Data Description

The data is from a study that focuses on the red and white variants of vinho verde, from the north-west region of Portugal known as Minho. The data was collected from May 2004 to February 2007 utilizing protected samples, automatically collected by the iLab computer system, which manages the process of wine sample testing including “laboratory and sensory analysis” (Cortez, Cerdeira, Almeida, Matos, & Reis, 2009). Each row is a unique wine sample with all the related tests.

Included in the datasets are 1,599 red wine samples and 4,898 white wine samples, for a total of 6,497 samples. There are eleven features that can be used to predict the target feature, quality. There is no variable to label each sample; only the tests and results are included. With the exception of quality, which is an ordinal score, all features are continuous. The tests include fixed_acidity, measured in grams of tartaric acid per cubic decimeter (g/dm³); volatile_acidity, in grams of acetic acid per cubic decimeter (g/dm³); citric_acid (g/dm³); residual_sugar (g/dm³); chlorides, in grams of sodium chloride per cubic decimeter (g/dm³); free_sulfur_dioxide (mg/dm³); total_sulfur_dioxide (mg/dm³); density (g/cm³); pH (on a 0-14 scale); sulfates, in grams of potassium sulfate per cubic decimeter (g/dm³); and alcohol (% vol.).

System Design

Figure 1: System Design Process

We followed the standard process for system design which is outlined in Figure 1. We took the original data from the wine study, pre-processed it, created multiple models during algorithm development, determined the best model(s) during post-processing, and then determined the usefulness of our predictive model.

During pre-processing, we examined the data for missing values, outliers, duplicates, and class imbalance to prepare for the next step in the process. In algorithm development, we first split the data into training and test sets and then evaluated multiple models, including AdaBoost, Bagging, and Random Forest, to determine which would be best. For post-processing, we evaluated the chosen model using the ROC curve and a confusion matrix.

Data Pre-Processing

Prior to developing our model, we combined the red wine and white wine datasets into one dataset and created an additional binary feature color_indicator where “1” indicates a red wine and “0” indicates a white wine. We decided to combine the datasets so there was a larger amount of data for the model to work with.
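A minimal sketch of this step in pandas, assuming local copies of the two semicolon-separated UCI files (the filenames are ours, not from the paper):

```python
import pandas as pd

# Load the red and white vinho verde datasets (assumed local copies;
# the UCI files are semicolon-separated).
red = pd.read_csv("winequality-red.csv", sep=";")
white = pd.read_csv("winequality-white.csv", sep=";")

# color_indicator: 1 = red wine, 0 = white wine.
red["color_indicator"] = 1
white["color_indicator"] = 0

# Combine both datasets into one dataframe.
wine = pd.concat([red, white], ignore_index=True)
```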

There are no missing values in the dataset, so there is no need to drop or impute data for any of the features.

Three features have outliers: total_sulfur_dioxide, residual_sugar, and sulfates. Despite these outliers, given the highly controlled conditions of the study (from point of origin to laboratory testing), we decided to keep them in the model. Anything added during the wine-making process contributes to the overall level of each factor and thus might be important in determining whether the wine is of “good” quality.

The original target feature, quality, is an ordinal score on a 0-10 scale. To simplify the model, we set the cutoff for “good” quality wine at 7 and above; anything below that is considered “bad” quality. The new target feature, quality_label, is binary, with 1 for “good” quality and 0 for “bad” quality. The original target feature quality was then dropped from the dataset.

There are 1,177 duplicates in the dataset, which we dropped, leaving 5,320 observations. We made this decision after considering that the duplicates may result from testing identical samples, given the closed nature of the study. We decided that even with the removal of almost 1,200 observations, the data was still robust enough to run models on. Both the binarization and deduplication steps are sketched below.
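A minimal sketch of these two steps, continuing from the combined dataframe above (the ordering of the two operations is our assumption):

```python
# Binarize the target: quality of 7 or above is "good" (1), else "bad" (0),
# then drop the original quality column.
wine["quality_label"] = (wine["quality"] >= 7).astype(int)
wine = wine.drop(columns="quality")

# Remove duplicate rows (the paper reports 1,177 duplicates,
# leaving 5,320 observations).
wine = wine.drop_duplicates()
```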

Algorithm Development

The first step of algorithm development is to use sklearn's train_test_split to divide the dataset into training and testing sets for evaluating the different classification models, as sketched below. We investigated multiple supervised learning model types, pre- and post-balancing, including Decision Tree, Boosting, Bagging, Random Forest, and AdaBoost.
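A sketch of the split; the roughly twenty-four percent test size comes from the paper, while the stratification and random seed are our assumptions:

```python
from sklearn.model_selection import train_test_split

X = wine.drop(columns="quality_label")
y = wine["quality_label"]

# Hold out roughly 24% of the data for testing, as reported in the paper.
# stratify=y keeps the class ratio consistent across both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.24, stratify=y, random_state=42
)
```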

A Decision Tree is a model that takes multiple features into account in order to predict the target variable. The tree is created by establishing a root node and then recursively partitioning the data into subsets based on the classification features. The depth of the tree (the number of levels) varies from analysis to analysis.

Boosting is an ensemble method that combines weak learners (models that predict only slightly better than random chance) into a strong learner (a model with strong predictive performance), which reduces bias.

Bagging is another ensemble method that utilizes bootstrapping (random sampling with replacement) to create datasets and trains models on those datasets. This method helps to reduce variance and avoid overfitting the data.

Random Forest is an ensemble method similar to bagging in how multiple models are built and aggregated, but instead of using all features when splitting nodes in a decision tree, Random Forest randomly selects subsets of features for each node split.

AdaBoost is an ensemble method that combines the outputs of many weak learners into a weighted sum representing the final boosted classifier. It is most often used for binary classification. With AdaBoost, the weights of previously misclassified training examples are increased so that later learners focus on them. The individual learners may be weak, but together they often yield a strong learner. AdaBoost is less prone to overfitting than some other methods because of how it handles its learners. The model types above can all be fit with scikit-learn, as sketched below.
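A sketch of fitting the unbalanced baselines; the hyperparameters are illustrative defaults, not values from the paper:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)

# Baseline (pre-balancing) models; AdaBoost covers the Boosting family.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
```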

The data is imbalanced: sorting the data by quality_label showed 1,210 “good” quality wines and 5,102 “bad” quality wines. To balance the data, we initially used the imbalanced-learn library's built-in balanced versions of the Bagging, Random Forest, and “easy ensemble” (Boosting) classifiers. After observing the results of these methods, we decided against relying on them alone, as they offered little information about, and little control over, how the sampling was performed within each model.
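The paper does not name the exact classes used; one plausible reading, using imbalanced-learn's balanced ensemble estimators, is:

```python
from imblearn.ensemble import (
    BalancedBaggingClassifier,
    BalancedRandomForestClassifier,
    EasyEnsembleClassifier,
)

# Each estimator resamples internally during fitting, so the balancing
# strategy is baked into the classifier rather than controlled explicitly.
balanced_models = {
    "balanced_bagging": BalancedBaggingClassifier(random_state=0),
    "balanced_rf": BalancedRandomForestClassifier(random_state=0),
    "easy_ensemble": EasyEnsembleClassifier(random_state=0),
}
for name, model in balanced_models.items():
    model.fit(X_train, y_train)
```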

Based on these results, we decided to use imbalanced-learn to balance the training dataset (leaving the test data untouched) using the Synthetic Minority Oversampling Technique (SMOTE), Random Oversampling, and Random Undersampling.

The Synthetic Minority Oversampling Technique (SMOTE) addresses the imbalance between the minority and majority classes by synthesizing new minority-class examples from existing ones. A random example is taken from the minority class, its k nearest neighbors are found, one of them is randomly selected, and the synthetic example is placed at a random point between the two (Chawla, Bowyer, Hall, & Kegelmeyer, 2002).
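A sketch of applying SMOTE to the training split only; k_neighbors=5 is the library default, not a value from the paper:

```python
from imblearn.over_sampling import SMOTE

# Resample only the training split; the test set stays untouched.
smote = SMOTE(k_neighbors=5, random_state=0)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
```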

Random Oversampling is a method that randomly duplicates examples from the minority class; Random Undersampling randomly deletes examples from the majority class. Both are referred to as “naive resampling” methods because they make no assumptions about the data. In our experiments, Random Oversampling minimized false positives, which we chose to focus on in order to reduce the number of bad wines classified as good wines.
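A sketch of the two naive resamplers applied to the same training split:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Oversampling duplicates minority-class ("good") rows until the classes
# are even; undersampling instead discards majority-class ("bad") rows.
ros = RandomOverSampler(random_state=0)
X_train_over, y_train_over = ros.fit_resample(X_train, y_train)

rus = RandomUnderSampler(random_state=0)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
```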

Results and Evaluation

After balancing the training dataset with SMOTE, Random Oversampling, and Random Undersampling and training AdaBoost, Bagging, and Random Forest on each, we determined the best combination to be Random Oversampling with the Random Forest ensemble method. With a test set of about twenty-four percent of the total data, the chosen method had a roughly 6% false positive rate.

Figure 2: Confusion Matrix

Before deciding on the best model, we observed the Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) scores of the nine models we ran: AdaBoost, Bagging, and Random Forest, each paired with each of the three balancing methods. AUC scores are interpreted on a scale from 0.5 to 1, where 1 indicates the model predicts the results perfectly and 0.5 indicates predictions no better than random chance. All of our models had AUC scores greater than 0.70, indicating that each classifier's ability to predict whether a wine was good (1) or bad (0) was at minimum acceptable. A sketch of how these scores can be computed for the chosen model follows Figure 3.

Figure 3: AUC, False Positive, and True Positive Score Results
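A sketch of computing the AUC and false positive rate for the chosen model; the estimator settings are our assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

# Fit the chosen model on the randomly oversampled training data.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train_over, y_train_over)

# AUC from predicted probabilities; false positive rate from the
# confusion matrix as fp / (fp + tn).
auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
tn, fp, fn, tp = confusion_matrix(y_test, rf.predict(X_test)).ravel()
print(f"AUC: {auc:.2f}  false positive rate: {fp / (fp + tn):.1%}")
```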

The best AUC scores were 0.81, belonging to Bagging and Random Forest, both with the Random Undersampling balancing method. We did not choose these models: although they had the highest AUC scores, we wanted to minimize the false positive rate, and the model that did so best was Random Forest with Random Oversampling, which still had a respectable AUC score of 0.78. This follows from how Random Oversampling works: instead of deleting the majority class (“bad”), it duplicates the minority class (“good”), giving the model more opportunities to learn both “good” and “bad” wines.

Examining the feature importances of our chosen model (extracted as sketched after Figure 5), we observe that the most significant feature in predicting whether a wine is good is alcohol. After alcohol content, the most significant features are density and chlorides. Density is important because it helps measure the sugar content of the wine, and therefore how much of the sugar has been fermented into alcohol (Bouchon Family Wines, 2021). Chlorides are an indicator of salinity in wine, which does not make a wine salty but in fact increases the “perception of sweetness” and lowers the presence of acid (Russan, 2022). A good wine, in this case, therefore has a higher alcohol content (based on actual alcohol volume and density) and a higher chloride content (saltiness).

Figure 4: Receiver Operating Characteristic (ROC) Curve

Figure 5: Feature Importance
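A sketch of extracting the impurity-based feature importances from the Random Forest fitted in the previous sketch:

```python
import pandas as pd

# Rank features by the Random Forest's impurity-based importances;
# per the paper, alcohol, density, and chlorides rank highest.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```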

For our model, which examined both red and white wines together, the feature importances show that color is of little to no importance in determining the quality of wine. Further examination of how the two wines differ (based on the other features) would make an interesting study.

Conclusions and Lessons Learned

In order for businesses to continue to grow and sell their products, they must adapt to new trends, market to the right consumers, and deliver a consistent product of good quality. This is no different for wine. Wineries have to maintain their vineyards and make good quality wines to keep their customers returning for more. The goal of this project was to build a supervised learning model that could help a winery determine what is a “good” wine based on lab results, to mitigate any potential loss from selling a “bad” wine.

After we found our data and processed it, we observed that it was imbalanced and determined that it needed to be balanced for our model to be successful. The initial balancing of the data did not yield good results due to the lack of control over the balancing method. We thus learned to use more controlled methods of balancing: SMOTE, Random Oversampling, and Random Undersampling.

After running the AdaBoost, Bagging, and Random Forest methods, each with the three different methods of balancing, we chose the Random Forest model with Random Oversampling. This model did not have the highest area under the curve (AUC) score, but it did have the lowest false positive rate, one of the criteria we used to determine the most useful model, as we wanted to minimize the number of times a “bad” wine was labeled “good”. This reflects our focus on maintaining a quality product for the winery.

For future investigation, it may be worth looking into more taste factors and how they relate to the lab results. This would make it easier to determine how to market the wines, for example by salinity (via chlorides) and sweetness (via residual sugar and density).

In this study, there was more white wine data than red wine data. Red wine quality varied between 3 and 8, so there were more bad quality wines than good quality wines (by our definition). In addition, the balancing was done after we combined the datasets, so upon further evaluation our model may not have been fair to both colors. It would therefore be worth including more higher-quality red wines in a future study, to better understand both wines and whether there really is no difference between the two.

References
