This was the first project of the Machine Learning course for my Master’s degree.

Classification Problems

The two datasets chosen for this assignment are the Census Income and Statlog (German Credit Data) datasets from the UCI Machine Learning Repository. I will refer to them as the census and credit datasets respectively. The census dataset has 14 features and 48,842 instances. The credit dataset has 20 features and 1,000 instances. Part of the reason I chose them is the relatively large difference between the number of instances in each. The difference should be instructive regarding the concept of sample complexity, and I expected algorithms to perform worse, in terms of accuracy, on the credit dataset. The number of features in each set is fairly close (credit has roughly 40% more); however, additional features can significantly increase an algorithm's runtime, so the feature count is also a factor to consider when evaluating performance.

The census dataset seeks to classify whether an individual will earn an income of over $50,000 a year. It includes features related to demographic information such as age, education, and race. The credit dataset classifies whether an individual seeking credit is a good or bad credit risk. The features describe personal information about the applicant such as credit history, job history, and assets owned. Both sets include a mix of numerical and categorical data types. The census set has missing values while the credit set is complete.

The census data was already separated into a training set and test set. The test set contained approximately one third of the total instances. I decided to use this convention for the credit data and placed two thirds of the data in the training set, and one third in the test set.

Both datasets have the benefit of representing problems that are intuitively easy to understand, so the meaning of the outputs of the various algorithms is not overly abstract.

Analysis Overview

I used the Weka application developed by the University of Waikato to implement the classification algorithms. Specifically, I used the Explorer GUI it provides. Using Weka required converting the datasets to the ARFF file format.
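As a minimal sketch of the setup, the same data can also be loaded through Weka's Java API; the .arff file names below are placeholders for wherever the converted files are stored:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // Hypothetical file names; substitute the paths of the converted ARFF files
        Instances train = new DataSource("census_train.arff").getDataSet();
        Instances test  = new DataSource("census_test.arff").getDataSet();

        // Weka needs to know which attribute is the class label;
        // here it is assumed to be the last attribute
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        System.out.println("Training instances: " + train.numInstances());
        System.out.println("Test instances: " + test.numInstances());
    }
}
```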

Decision Tree

The decision tree learning algorithm can be implemented in Weka using the included J48 classifier. The J48 classifier is an implementation of the C4.5 algorithm, which is an adaptation of the ID3 algorithm covered in lecture. C4.5 allows for the use of both continuous and discrete values, which is applicable to both data sets since they contain monetary features like capital-gains and credit-amount. It also handles incomplete data, which is helpful for the census set.

C4.5 also incorporates pruning by default. After the tree is formed, branches are eliminated based on their impact on overall performance. Weka uses a specified confidence factor to determine when to prune (initially set at 0.25).

I was curious to determine the impact of the confidence factor on performance for each data set. A smaller confidence factor increases the amount of pruning done by the algorithm by making reductions in training set accuracy more acceptable. Using Weka's Experimenter interface to adjust the confidence factor for pruning over 10-fold cross-validation yielded the following plot:

A confidence factor of around 0.15 produced the best results for both data sets, indicating that the data benefited from additional pruning up to a point. This is more evident with the small Credit data set. As the confidence factor increases, the accuracy approaches that of an unpruned tree because less and less pruning is performed.
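The same sweep can be reproduced programmatically rather than through the Experimenter; the snippet below is a minimal sketch (the file path, random seed, and confidence-factor grid are my own assumptions) that measures the 10-fold cross-validation error at several confidence factors:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PruningExperiment {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the converted training data
        Instances train = new DataSource("credit_train.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        for (float cf : new float[]{0.05f, 0.15f, 0.25f, 0.35f, 0.45f}) {
            J48 tree = new J48();
            tree.setConfidenceFactor(cf);   // smaller values prune more aggressively

            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(tree, train, 10, new Random(1));
            System.out.printf("CF = %.2f  CV error = %.2f%%%n", cf, eval.pctIncorrect());
        }
    }
}
```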

After running the J48 algorithm with 10-fold cross-validation and on the test set using the 0.15 confidence factor, I got the following results:

Dataset Name | Training Set Error | Validation Set Error | Test Set Error | Processing Time | Size of Tree
Census | 0.1257 | 0.1367 | 0.1396 | 1.80 s | 406
Credit | 0.2192 | 0.2928 | 0.2575 | 0.01 s | 32

The test set error for the Census data is roughly half that for the Credit data. This could be due in part to the difference in sample size. I believe the validation set error for the credit set is especially high because 10 folds were used, meaning each validation fold had approximately 60 instances. This is substantially fewer than the roughly 3,200 instances per fold in the Census cross-validation.

The degree of difference in the runtime of the algorithm on the two sets was rather surprising. The time complexity of C4.5 is O(m·n²), where m is the number of instances and n is the number of features [1]. Based on that estimate, I expected the Census set's runtime to be approximately 21 times larger than the Credit set's ((32,000·14²)/(666·21²)). What I observed was that the Census runtime was 180 times longer than the Credit runtime. This observed ratio is closer to the ratio of instances alone (32,000/666 ≈ 48) than to the O(m·n²) prediction, suggesting that the runtime of Weka's J48 implementation depends primarily, and perhaps roughly linearly, on the number of instances. Additional data sets would be needed to investigate the runtime further.

In order to track the performance of the algorithm with an increasingly large training set, I created the following learning curves:

The learning curve for the Census data was in line with my expectations. The training error continues to fall as the sample size increases. When the percentage of instances reaches around 60%, the testing error begins to increase slightly, which is probably caused by overfitting. The fact that the error decreases so gradually may be attributed to the successful pruning techniques employed by the C4.5 algorithm.

The percentage of incorrect classifications on the test set for the Credit curve continues to decline with additional instances. This trend could be a result of under-fitting and indicates that the tree algorithm could benefit from a larger training set. Interestingly, the training error hits a minimum with approximately 70% of the instances. The size of the tree decreases with additional instances due to pruning, which has the intended consequence of improving the algorithm's ability to generalize to the test set.

Neural Networks

I created Artificial Neural Networks, or ANNs, for both sets in Weka using the Multilayer Perceptron classifier. The implementation uses a sigmoid activation function. It employs backpropagation to increase the overall accuracy of the model by adjusting, at each iteration, the edge weights that contribute most to the error.
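For reference, the sigmoid unit and the backpropagation weight update (including the momentum term discussed further below) take the standard form:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \Delta w_{ij}^{(t)} = -\eta\,\frac{\partial E}{\partial w_{ij}} + \alpha\,\Delta w_{ij}^{(t-1)}$$

where $\eta$ is the learning rate, $E$ is the error on the current training example, and $\alpha$ is the momentum factor.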

The Multilayer Perceptron classifier offers several configuration options for training an ANN. A key setting is the number and size of the hidden layers used. Weka's course on the subject suggests that one hidden layer is acceptable for most problems [2]. However, there does not appear to be a definitive guideline for ANN structure (e.g., a University of Maryland study found larger networks performed better in some cases [3]), so some experimentation was necessary. I tested a few different settings for the smaller Credit set using 10-fold cross-validation while keeping the other configuration settings fixed as a control. This yielded the following results:

Data Set | Hidden Layers | Processing Time (s) | Validation Error (%)
Credit | 0 | 0.75 | 24.32
Credit | a | 9.71 | 29.88
Credit | t | 20.02 | 30.48
Credit | 1 | 0.77 | 28.98
Credit | 5 | 1.86 | 28.82
Credit | 1,1 | 0.6 | 27.63
Credit | 5,5 | 1.94 | 29.12
Credit | a,a | 16.97 | 30.03
Credit | t,t | 56.39 | 27.63
Credit | 5,5,5 | 2.69 | 29.28

Increasing the size of the network did not reliably reduce the error on the validation data. In fact, the lowest error was observed when no hidden layer was present. This configuration is a standard perceptron, with each input attribute mapped directly to the good or bad credit risk output. A perceptron should work well for linearly separable data, so there may be a hyperplane that approximately divides the Credit data.

I was not able to test as many layer variations for the Census data due to the large size of the training set. The run time became prohibitively long. The layer configurations I was able to generate were:

Data Set | Hidden Layers | Processing Time (s) | Validation Error (%)
Census | 0 | 63.44 | 15.32
Census | 1 | 49.31 | 17.09
Census | 5 | 161.65 | 16.03
Census | a | 1413.64 | 17.10
Census | t | 2944.27 | 17.68
Census | 1,1 | 47.88 | 17.82

Again, the best performer was a network with no hidden layers. The ratio of runtimes between the Census and Credit sets was approximately 2 to 3 times the ratio of their sample sizes. Part of the reason the Census set required a proportionally longer execution time is that it has 1.75 times as many nodes in its input layer (nominal attributes are expanded into multiple binary input nodes).

Another variable that affects the algorithm is the training time, i.e., the number of training iterations (epochs). The number of iterations is set at 500 by default in Weka. I tested a few variations using the 0-hidden-layer setting for both data sets. The Credit set performed better on cross-validation with a slightly higher training time (550), while the Census set performed better with a lower training time (350). The improved performance for the Census data with fewer iterations reflects the fact that the ANN algorithm has the potential to overfit the longer it runs. The weights tend to grow with additional iterations, which increases the complexity of the model.

The last variable I investigated was momentum (default value 0.2). The use of backpropagation can result in the algorithm converging on a local minimum of the error instead of the global minimum. The momentum factor can potentially help overcome this by encouraging the weight change to continue in the same direction as the previous iteration. The classification for the Credit set did not benefit from adjusting the momentum to 0.15 or 0.25 (accuracy decreased by 0.60% in both cases).
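Putting these settings together, the following is a minimal sketch of the configuration used for the Credit set (the file path and seed are assumptions). The hidden-layer string "0" means no hidden layer, while "a" and "t" in the tables above are Weka's wildcards for (attributes + classes)/2 and attributes + classes respectively:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CreditMLP {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("credit_train.arff").getDataSet(); // placeholder path
        train.setClassIndex(train.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("0");    // "0" = no hidden layer; "a" and "t" are Weka wildcards
        mlp.setTrainingTime(550);    // number of training epochs
        mlp.setMomentum(0.2);        // Weka's default momentum
        mlp.setLearningRate(0.3);    // Weka's default learning rate

        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(mlp, train, 10, new Random(1));
        System.out.printf("CV error = %.2f%%%n", eval.pctIncorrect());
    }
}
```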

The learning curve for the Census shows that the training, validation, and test data decrease for the most part with additional training data. However the behaviour with the full set is strange in that the training error actually exceeds the validation error. The explanation may be that the lack of hidden layers limited the algorithm from overfitting on the training data. It could also be an anomaly. The test and validation error in the Credit training curve plateaus at around 60% of the instances present.

Boosting

Boosting is a form of ensemble learning that combines the hypotheses of weak learners to create a stronger learner that can tackle classification problems. Boosting can be performed in Weka using the AdaBoostM1 classifier. For each iteration, the algorithm places more emphasis on instances that were classified incorrectly in previous iterations by assigning higher weights.
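Concretely, the reweighting scheme in AdaBoost.M1 [4] works as follows: if the hypothesis $h_t$ built at iteration $t$ has weighted training error $\epsilon_t$, then

$$\beta_t = \frac{\epsilon_t}{1-\epsilon_t}, \qquad w_i \leftarrow \begin{cases} w_i\,\beta_t & \text{if } h_t \text{ classifies instance } i \text{ correctly} \\ w_i & \text{otherwise} \end{cases}$$

and the weights are renormalized, so misclassified instances carry relatively more weight in the next round. The final classifier weights each hypothesis by $\log(1/\beta_t)$.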

The AdaBoostM1 algorithm requires that the weak learners have an error rate that is less than ½ (better than chance) [4]. For my implementation, I am using the J48 decision tree algorithm as the base learner. At its simplest level, a decision tree can always perform a binary classification with a training error rate equal to or less than ½ by returning the most common value in the target concept with a single node.
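A minimal sketch of this configuration with the Weka API follows (the file path is assumed; 150 iterations is the value used for the Credit runs discussed below):

```java
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostedTree {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("credit_train.arff").getDataSet(); // placeholder path
        train.setClassIndex(train.numAttributes() - 1);

        J48 base = new J48();
        base.setConfidenceFactor(0.15f);     // same pruning setting as before

        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(base);           // weak learner to be boosted
        boost.setNumIterations(150);         // number of boosting rounds
        boost.setWeightThreshold(100);       // 100 = no weight-based instance pruning

        boost.buildClassifier(train);
        System.out.println(boost);
    }
}
```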

I experimented with the effect the number of iterations had on the cross-validation error (plot below).

Boosting proved successful for the Credit data and reduced the error to 23.6% with 150 iterations. This shows that the AdaBoost algorithm was able to correctly classify instances that the base decision tree had misclassified.

However, AdaBoost did not manage to improve upon the J48 algorithm for the Census data set. Two iterations yielded the same validation error as the regular algorithm (13.7%), and the error then increases with subsequent iterations until it stabilizes at around 15.5% after about 30 iterations (the algorithm appears to terminate at 35 iterations). Part of the reason for this decreasing performance is the J48 algorithm overfitting the data as boosting progresses. This is evident as the size of the decision tree grows from 408 nodes to between 3,000 and 6,000 nodes in subsequent iterations. The Census data set is also significantly larger than the Credit set, so it has more potential for noisy data or outliers, which can hurt the boosting method because those instances receive too much weight.

There is a parameter called weightThreshold in Weka's AdaBoostM1 implementation that controls the pruning of instances as the algorithm progresses. If an instance has a distribution weight less than 1 − weightThreshold/100, it is omitted in the following iteration. The primary goal of pruning in boosting appears to be reducing execution time rather than reducing overfitting, as pruning does in C4.5 [5]. Changing the threshold to 90 for the Credit set cut the runtime in half (1.2 s to 0.6 s), but it reduced the accuracy by around 3%. For the purposes of this assignment I am not overly concerned with runtime, so the trade-off in accuracy is not worth it.

For the Census set, reducing the weight threshold to a very small number like 20 did improve both the accuracy and the runtime. However, this improvement occurs because the algorithm terminates after a few iterations and closely resembles the non-boosted tree.

I generated the following learning curves with 150 iterations for the Credit set and 30 iterations for the Census set (weight threshold set at 100 for both):

A similar pattern in the learning curves is observed for both datasets. The training error decreases consistently to around zero with additional training instances. Validation and test error fall very gradually, and significant reductions are not made after approximately 50% of the training instances are used. This may be due to the iterative nature of boosting, which is able to extract more information from fewer samples.

Support Vector Machines

Support Vector Machine (or SVM) algorithms seek to classify data by finding a hyperplane that separates the classes with the maximum margin. I used Weka's Sequential Minimal Optimization (or SMO) classifier as the SVM implementation.

For this algorithm I chose to use Weka's Resample filter on the Census data in order to reduce the number of samples in the training and test sets to 10% of their original size. The filter seeks to preserve the dataset's original class distribution. I initially attempted to use the full dataset, but a single fold of cross-validation took over 30 minutes to process, making experimentation very challenging.
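A minimal sketch of this reduction through the Weka API follows (the file path is a placeholder, and I assume the supervised version of the filter with the bias-to-uniform option left at 0, which keeps the original class distribution):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class ReduceCensus {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("census_train.arff").getDataSet(); // placeholder path
        train.setClassIndex(train.numAttributes() - 1);  // supervised filters need the class set

        Resample resample = new Resample();
        resample.setSampleSizePercent(10.0);   // keep 10% of the instances
        resample.setBiasToUniformClass(0.0);   // 0 preserves the original class distribution
        resample.setInputFormat(train);

        Instances reduced = Filter.useFilter(train, resample);
        System.out.println("Reduced to " + reduced.numInstances() + " instances");
    }
}
```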

The SVM algorithm involves finding the dual of the maximum-margin optimization problem. One component of the dual is the kernel, which can be configured based on domain knowledge of the classification problem at hand. I experimented with RBF, linear, and 2nd-order polynomial kernels. A linear kernel proved most successful for both sets, with errors of 24.6% for Credit and 15.9% for the reduced Census set.

I next experimented with the SMO algorithm’s complexity parameter, c. It is a multiplier for the error in the dual optimization problem. A higher c should reduce the training error while potentially decreasing the algorithm’s ability to generalize. Modifying the parameter for the Credit data did not reduce the error. The Census data performed best when c was set to 100.
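The following is a minimal sketch of the best configuration found for the reduced Census set (linear kernel, c = 100); the file path and seed are assumptions:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CensusSVM {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("census_reduced.arff").getDataSet(); // placeholder path
        train.setClassIndex(train.numAttributes() - 1);

        PolyKernel linear = new PolyKernel();
        linear.setExponent(1.0);        // exponent 1 gives a linear kernel

        SMO smo = new SMO();
        smo.setKernel(linear);
        smo.setC(100.0);                // complexity parameter c

        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(smo, train, 10, new Random(1));
        System.out.printf("CV error = %.2f%%%n", eval.pctIncorrect());
    }
}
```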

The learning curves for the best configurations are shown below.

The learning curves for the training, validation, and testing sets were remarkably stable for the Census set, which may be related to the high complexity parameter that heavily penalizes errors on the training data.

k-Nearest Neighbors

I ran the k-Nearest Neighbors (or kNN) algorithm in Weka using the IBk classifier. The kNN algorithm is a type of instance-based learning that queries the dataset to find the k instances with the smallest distances from the target input. These k instances then vote on the class of the target input.

The Weka implementation I am using finds neighbors with a brute-force search: it calculates the distance between the target and every instance and retrieves the k closest. The distance functions I tested are Euclidean and Manhattan distance.

The choice of k is an important factor in the success of the kNN algorithm. The graph below shows the relationship between the size of k and the cross-validation error percentage for Euclidean and Manhattan distance on both sets:

Both sets performed slightly better with k set to 20 and Manhattan distance, resulting in 28.1% error for Credit and 16.4% for Census. The difference between Manhattan and Euclidean distance performance was negligible for the Census data; most of the attributes in the Census set are categorical, so the difference between the two distance calculations is limited.
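In the API this corresponds to the IBk classifier with a LinearNNSearch (brute-force) neighbour search; the snippet below is a minimal sketch with k = 20 and Manhattan distance (file path and seed assumed):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.ManhattanDistance;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.LinearNNSearch;

public class CreditKNN {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("credit_train.arff").getDataSet(); // placeholder path
        train.setClassIndex(train.numAttributes() - 1);

        LinearNNSearch search = new LinearNNSearch();          // brute-force neighbour search
        search.setDistanceFunction(new ManhattanDistance());   // Manhattan instead of default Euclidean

        IBk knn = new IBk();
        knn.setKNN(20);                                        // number of neighbours
        knn.setNearestNeighbourSearchAlgorithm(search);

        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(knn, train, 10, new Random(1));
        System.out.printf("CV error = %.2f%%%n", eval.pctIncorrect());
    }
}
```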

The IBk classifier gives the option to apply distance weighting to the votes. However, enabling it did not improve performance and led to some instances being left unassigned.

Learning Curves:

The kNN algorithm for the Credit data set does not have a smooth learning curve; there is a fairly large decrease in error between 40% and 60% of the instances. The only explanation I can think of is that the data set is relatively small, so k = 20 is a relatively significant portion of the total available data at the lower percentages.

Final Analysis

The five learning algorithms had similar performance to each other on both datasets. The error on the testing set was between 14.0% and 16.1% for the Census set and between 24.0% and 26.7% for the Credit set. The best performing algorithm for the Census set was the Decision Tree, and the best for the Credit set was kNN. However, the testing set alone is not the only indicator of successful classification; cross-validation also serves as a valuable tool for gauging algorithm performance, since the data in each fold is withheld from the model being created and acts very much like test data. When looking at the average of the validation and testing error, the best algorithm for the Credit set is actually the Neural Network (the Decision Tree still has the lowest average error for the Census set).

Another factor to consider when evaluating the algorithms is their required execution times. For the Census set there was a wide range of execution times, with the Decision Tree taking only 1.8 seconds while the SVM took 2 minutes. There was also some variation with the Credit data, but the most successful configurations of each algorithm all took less than 10 seconds to run. For example, the ANN took only 0.78 seconds with the Credit data.

Something interesting that I discovered is that both datasets performed best with a linear configuration. For the ANN algorithm, both sets had the greatest performance with no hidden layers, which suggests the data may be close to linearly separable. With the SVM algorithm, both sets had the greatest performance with a linear kernel. This was somewhat unexpected, as data sets with far fewer features than instances often benefit from being mapped to a higher-dimensional feature space [6].

Citations

[1] Su, Jiang, and Harry Zhang, “A fast decision tree learning algorithm,” In AAAI, vol. 6, pp. 500-505. 2006. http://www.cs.unb.ca/~hzhang/publications/AAAI06.pdf

[2] Witten, Ian H, “More Data Mining with Weka (5.2: Multilayer Perceptrons),” March 5, 2014. https://www.youtube.com/watch?v=mo2dqHbLpQo

[3] C. Tamon and J. Xiang, “On the Boosting Pruning Problem,” Machine Learning: ECML 2000 Lecture Notes in Computer Science, pp. 404–412, 2000. http://people.clarkson.edu/~ttamon/ps-dir/tx-ecml00.pdf

[4] Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” ICML’96 Proceedings of the Thirteenth International Conference on International Conference on Machine Learning, pp. 148–156, Jan. 1996. http://web.eecs.utk.edu/~leparker/Courses/CS425-528-fall10/Handouts/AdaBoost.M1.pdf

[5] Margineantu, Dragos D. and Thomas G. Dietterich, “Pruning adaptive boosting,” In ICML, vol. 97, pp. 211-218, 1997. http://web.engr.oregonstate.edu/~tgd/publications/ml97-pruning-adaboost.pdf

[6] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin, “A practical guide to support vector classification,” pp. 1-16, 2003. https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf