This is another project for the Machine Learning course for my Master’s degree.

Datasets

The two datasets chosen for this assignment are the Census Income and Statlog (German Credit Data) datasets from the UCI Machine Learning Repository. I will refer to them as the Census and Credit datasets respectively. The original Census dataset has 14 features and 2,998 instances. The original Credit dataset has 20 features and 666 instances. Part of the reason I chose them is the relatively large difference in the number of instances between the two.

The Census dataset seeks to classify whether an individual earns an income of over $50,000 a year. It includes features related to demographic information such as age, education, and race. The Credit dataset classifies whether an individual seeking credit is a good or bad credit risk. The features describe personal information about the applicant such as credit history, job history, and assets owned. Both sets include a mix of numerical and qualitative data types. For this assignment I used Weka's NominalToBinary filter to transform the features with qualitative values, which increased the number of attributes to 104 for the Census set and 59 for the Credit set. I also normalized the numerical values to the range between 0 and 1.
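
For reference, the preprocessing can be done with a few lines of Weka code. This is only a sketch of the steps described above; the file name is a placeholder and the exact filter options may differ from what I used.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.Normalize;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        // Load one of the datasets (the file name here is just a placeholder).
        Instances data = DataSource.read("census.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Expand each nominal attribute into binary indicator attributes.
        NominalToBinary toBinary = new NominalToBinary();
        toBinary.setInputFormat(data);
        Instances binarized = Filter.useFilter(data, toBinary);

        // Scale every numeric attribute into the range [0, 1].
        Normalize normalize = new Normalize();
        normalize.setInputFormat(binarized);
        Instances prepared = Filter.useFilter(binarized, normalize);

        System.out.println("Attributes after filtering: " + prepared.numAttributes());
    }
}
```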

Both datasets have the benefit of representing problems that are intuitively easy to understand, so the meaning of the outputs of the various algorithms is not overly abstract.

Clustering Algorithms

I ran the k-means and Expectation Maximization clustering algorithms using the implementations found in the ABAGAIL Java library (the same is true for the rest of the algorithms referenced in this assignment).

k-means Clustering Algorithm

The k-means clustering algorithm groups data points by computing their proximity to k centers over multiple iterations. After each iteration the centers are recomputed as the mean of the points currently assigned to them. The algorithm terminates when the centers converge. This implementation uses Euclidean distance.
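
As an illustration of the update described above, a single k-means iteration can be sketched in plain Java. This is not the ABAGAIL implementation, just a minimal version of the assign-then-recompute step:

```java
public class KMeansStep {
    /** Squared Euclidean distance between two feature vectors. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }

    /** One k-means iteration: assign points to the nearest center, then recompute the means. */
    static double[][] step(double[][] data, double[][] centers) {
        int k = centers.length, d = data[0].length;
        double[][] sums = new double[k][d];
        int[] counts = new int[k];
        for (double[] point : data) {
            int nearest = 0;
            for (int c = 1; c < k; c++) {
                if (distance(point, centers[c]) < distance(point, centers[nearest])) nearest = c;
            }
            counts[nearest]++;
            for (int j = 0; j < d; j++) sums[nearest][j] += point[j];
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] > 0) {
                for (int j = 0; j < d; j++) sums[c][j] /= counts[c];
            } else {
                sums[c] = centers[c].clone(); // leave an empty cluster's center where it was
            }
        }
        return sums;
    }
}
```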

To start off, I set k equal to 2. Both the Credit and Census datasets have two labels, so generating two clusters allows for interesting comparisons to be made to the classification problem. The tables below show how the data points are labeled versus how they are clustered:

Census <=50K >50K
Cluster 0 1491 112
Cluster 1 751 644
Credit Cluster 0 Cluster 1
Good 275 186
Bad 95 110

The clusters generated by k-means are not actually labeled, but for the sake of comparison a label can be associated with the cluster that contains the largest share of data points with that label. In terms of classification, the clusters had 71.2% accuracy for the Census data and 57.8% accuracy for the Credit data. While these results are better than chance, they are worse than labeling every data point with the most common (majority) classification, as implemented by the ZeroR algorithm in Weka. ZeroR yields 74.8% accuracy for the Census data and 69.2% accuracy for the Credit set.
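
For reference, here is how the cluster accuracies and the ZeroR baseline above can be computed from a 2x2 confusion table. With two clusters and two labels, associating each label with its best cluster amounts to picking the better of the two one-to-one matchings; the Census table is hard-coded as an example:

```java
public class ClusterAccuracy {
    /**
     * counts[c][l] = number of instances in cluster c with true label l.
     * For two clusters and two labels, try both one-to-one matchings of
     * labels to clusters and report the better one.
     */
    static double twoClusterAccuracy(int[][] counts) {
        int total = counts[0][0] + counts[0][1] + counts[1][0] + counts[1][1];
        int matchA = counts[0][0] + counts[1][1];
        int matchB = counts[0][1] + counts[1][0];
        return (double) Math.max(matchA, matchB) / total;
    }

    /** ZeroR baseline: always predict the most common label. */
    static double zeroR(int[][] counts) {
        int label0 = counts[0][0] + counts[1][0];
        int label1 = counts[0][1] + counts[1][1];
        return (double) Math.max(label0, label1) / (label0 + label1);
    }

    public static void main(String[] args) {
        int[][] census = { {1491, 112}, {751, 644} };   // rows = clusters, cols = labels
        System.out.println(twoClusterAccuracy(census)); // ~0.712
        System.out.println(zeroR(census));              // ~0.748
    }
}
```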

One of the reasons k-means is a relatively weak classifier here is that it treats all of the features with equal importance, since they all factor equally into the distance calculation from the centers. This is especially problematic for these datasets where the NominalToBinary filter is applied, because one feature can be expanded into many binary features, which may dilute the effect of potentially more important features.

I ran the k-means algorithm 5 times for each dataset in order to account for the effect of the initial randomly selected centers. The ending clusters were the same for 4 of the 5 runs for the Census data, whereas only 2 of the runs had the same clusters for the Credit data. This indicates that there may be more variability in the Credit set.

Expectation Maximization Clustering Algorithm

The Expectation Maximization algorithm uses soft clustering to group data points. It is called soft because it does not assign a given point to a single cluster, but rather calculates the likelihood that the point belongs to each of the clusters. The clusters are defined by k Gaussians, and their means are recalculated each iteration, with each point weighted by the likelihood that it belongs to the cluster.
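
The E-step of this procedure can be sketched as follows. For simplicity the sketch assumes spherical Gaussians with mixing weights; the library implementation may use full covariance matrices, so treat this only as an illustration of the soft assignment:

```java
public class EMStep {
    /** Log-density of x under a spherical Gaussian with the given mean and variance. */
    static double logGaussian(double[] x, double[] mean, double variance) {
        double squared = 0;
        for (int i = 0; i < x.length; i++) squared += (x[i] - mean[i]) * (x[i] - mean[i]);
        return -0.5 * (x.length * Math.log(2 * Math.PI * variance) + squared / variance);
    }

    /** E-step for one point: the probability that each cluster generated it. */
    static double[] responsibilities(double[] x, double[][] means,
                                     double[] variances, double[] weights) {
        int k = means.length;
        double[] logs = new double[k];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < k; c++) {
            logs[c] = Math.log(weights[c]) + logGaussian(x, means[c], variances[c]);
            max = Math.max(max, logs[c]);
        }
        // Exponentiate relative to the maximum for numerical stability, then normalize.
        double sum = 0;
        double[] r = new double[k];
        for (int c = 0; c < k; c++) { r[c] = Math.exp(logs[c] - max); sum += r[c]; }
        for (int c = 0; c < k; c++) r[c] /= sum;
        return r;
    }
}
```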

I ran the algorithm on both datasets with the number of clusters set to 2, for the same reasons as before. The comparison between the labels and clusters is shown below (best result of 5 restarts):

Census <=50K >50K
Cluster 0 1497 112
Cluster 1 745 644
Credit Good Bad
Cluster 0 354 122
Cluster 1 107 83

The EM clusters had 71.4% accuracy for the Census data and 65.6% accuracy for the Credit data. The result was roughly the same as k-means for the Census data, while the result for the Credit data improved by roughly 8 percentage points. Based on this, it is possible that the Credit data is more aligned with a Gaussian distribution. Again the restarts for the Census data yielded more consistent clusters (all 5 were the same), while the clusters for the Credit data were more variable (only 2 were the same). Another possible reason for this outcome may be that the Census data has more samples than the Credit data (approximately 4.5 times as many), giving the EM algorithm more data points to base its clustering on.

Dimensionality Reduction Algorithms

Principal Component Analysis

Principal Component Analysis helps illuminate the underlying structure of a dataset by finding the directions of maximum variance (the eigenvectors of the data's covariance matrix). Variance in this case can be seen as equivalent to information about the data [1]. There is one eigenvector for each dimension in the dataset. If an eigenvector has an eigenvalue equal to 0, it provides no information and can be removed from the dataset to reduce the dimensionality.

To start off, I found the eigenvalues for both datasets. A graph of the distribution of eigenvalues, rounded to the nearest tenth, is shown below:

For both sets the majority of components had eigenvalues below 0.15, but none of the eigenvalues actually equaled 0 (the lowest value was 0.01). I believe the high number of low eigenvalues corresponds largely to the nominal features that were transformed into binary. Many of those dimensions equal 0 for all but a handful of instances, so it is not hard to see that they carry little variance and provide little insight into the data.

In order to use PCA to reduce the dimensionality of the datasets, I set the threshold for eigenvalues to 0.05. This reduced the number of features in the Census set from 104 to 25 and the number of features in the Credit set from 59 to 36. Datasets with the same dimensions as the originals can be reconstructed from the PCA transformations. Comparing the original and reconstructed sets gives a sum of squared error (SSE) of 1,779 for the Census set and 163 for the Credit set. The SSE is significantly higher for the Census set because more features are removed. When compared to the results for Random Projections below, it can be seen that PCA performs reconstruction extremely well (approximately 20 times better).
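
The reconstruction error used here, and again for the RP and LDA comparisons below, is just the element-wise sum of squared differences between the original and reconstructed data matrices; a minimal sketch:

```java
public class ReconstructionError {
    /** Sum of squared differences between the original and reconstructed data matrices. */
    static double sse(double[][] original, double[][] reconstructed) {
        double total = 0;
        for (int i = 0; i < original.length; i++) {
            for (int j = 0; j < original[i].length; j++) {
                double diff = original[i][j] - reconstructed[i][j];
                total += diff * diff;
            }
        }
        return total;
    }
}
```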

Independent Component Analysis

Independent Component Analysis seeks to transform the features of a dataset so that they are statistically independent of one another. ICA accomplishes this by minimizing the Gaussianity (maximizing the non-Gaussianity) of the projections of the data [2]. One measure that can be used to gauge this non-Gaussianity is kurtosis. Kurtosis measures the spikiness (or tail heaviness) of a distribution; a Gaussian has an excess kurtosis of zero, so projections with high kurtosis are strongly non-Gaussian.
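
The kurtosis of a projected column is just its standardized fourth moment. The sketch below computes excess kurtosis (zero for a Gaussian); whether the library reports raw or excess kurtosis is an assumption on my part:

```java
public class Kurtosis {
    /** Excess kurtosis of one column: E[(x - mean)^4] / variance^2 - 3 (zero for a Gaussian). */
    static double excessKurtosis(double[] column) {
        int n = column.length;
        double mean = 0;
        for (double v : column) mean += v;
        mean /= n;
        double m2 = 0, m4 = 0;
        for (double v : column) {
            double d = v - mean;
            m2 += d * d;
            m4 += d * d * d * d;
        }
        m2 /= n;
        m4 /= n;
        return m4 / (m2 * m2) - 3.0;
    }
}
```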

Kurtosis can be used to help choose the number of components to keep with ICA. The graph below shows the average kurtosis of the columns of the ICA projection as a function of the number of components kept.

The kurtosis for the Credit set peaks at 6.8 when only 9 components are kept. Interestingly, for the Census set the kurtosis increases only while the first 10 components are removed (94 remaining) and then decreases as fewer components are kept. The Census set also starts out with relatively high kurtosis, indicating that it may have a non-Gaussian distribution to begin with. It is also possible that the kurtosis measurements are initially high because they are being thrown off by outliers [3].

Random Projections

Random Projection projects the data onto randomly chosen directions in a lower-dimensional space, so each feature in the reduced space is a random linear combination of the original features. I was not entirely sure how to select the number of components to retain with the RP algorithm, so I experimented with different numbers and calculated the sum of squared error between the original dataset and the dataset reconstructed from the RP output. The graph is shown below:

There is a clear relationship between the reconstruction SSE and the number of components removed. This indicates that information may be lost due to the linear transformations taking place. However, depending on what the reduced dataset is being used for, the reconstruction error may not matter. If the reduced datasets are used for classification, then a wrapper feature-selection approach could be used: the reduced dataset would serve as the training set for a learner, and the resulting accuracy would be the measure of a successful reduction.

I used 5 restarts for each of the component configurations. As the name of the algorithm implies, the projections are randomly generated and differ for each run. By and large, the SSE for each restart of the same component configuration fell within a relatively narrow band. With additional time, I would like to explore the use of the Johnson-Lindenstrauss Lemma to help determine which random projections preserve the most information about the data [4].
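
For concreteness, the projection step described at the start of this section can be sketched as multiplying the data by a random Gaussian matrix. The 1/sqrt(k) scaling is a common convention and an assumption here; the ABAGAIL filter may generate its projection matrix differently:

```java
import java.util.Random;

public class RandomProjection {
    /** Project each row of the data onto k random Gaussian directions. */
    static double[][] project(double[][] data, int k, long seed) {
        int d = data[0].length;
        Random rng = new Random(seed);

        // Random projection matrix with N(0, 1) entries, scaled by 1 / sqrt(k).
        double[][] projection = new double[d][k];
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < k; j++) {
                projection[i][j] = rng.nextGaussian() / Math.sqrt(k);
            }
        }

        // Multiply: each reduced feature is a random linear combination of the originals.
        double[][] reduced = new double[data.length][k];
        for (int row = 0; row < data.length; row++) {
            for (int j = 0; j < k; j++) {
                for (int i = 0; i < d; i++) {
                    reduced[row][j] += data[row][i] * projection[i][j];
                }
            }
        }
        return reduced;
    }
}
```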

Linear Discriminant Analysis

Linear Discriminant Analysis finds projections based on the label information of a dataset. It finds the linear combination of variables that best separates the classes, where best is defined as maximizing the Fisher criterion J(w) = (w^T S_B w) / (w^T S_W w), the ratio of between-class scatter to within-class scatter [5]. LDA is unique compared to the previous dimensionality reduction algorithms in that it produces at most C - 1 feature projections, where C is the number of classes. For the datasets used in this assignment, that means it returns only one feature, because both sets have a binary classification.
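
Evaluating the criterion for a candidate direction is just a pair of quadratic forms. The sketch below assumes the scatter matrices S_B and S_W have already been computed; for the two-class case, the optimal direction is proportional to S_W^-1 (mu_1 - mu_0):

```java
public class FisherCriterion {
    /** J(w) = (w^T S_B w) / (w^T S_W w) for a candidate projection direction w. */
    static double fisher(double[] w, double[][] betweenScatter, double[][] withinScatter) {
        return quadraticForm(w, betweenScatter) / quadraticForm(w, withinScatter);
    }

    static double quadraticForm(double[] w, double[][] matrix) {
        double total = 0;
        for (int i = 0; i < w.length; i++) {
            for (int j = 0; j < w.length; j++) {
                total += w[i] * matrix[i][j] * w[j];
            }
        }
        return total;
    }
}
```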

I ran the ABAGAIL implementation of LDA on both sets and compared the original and reconstructed sets. The SSE was 12,426 for the Census set and 4,623 for the Credit set. These results are rather interesting because they are better than the reconstruction from the RP algorithm with half of the components removed. Intuitively, I was expecting the reconstruction to be very bad because the dimensionality is reduced to 1. RP may reconstruct worse because the features are combined at random, distorting the original information, whereas LDA combines the features in a meaningful way that is more easily reversed.

Clustering with Dimensionality Reduction Algorithms

I ran the k-means and EM clustering algorithms on the datasets generated by the dimensionality reduction algorithms covered in the previous section. Again I used 2 clusters, or k = 2. The table below shows the comparison between labels and clusters by algorithm and by dataset (L0-C0 = number of instances with Label 0 in Cluster 0):

DR Cluster Set Time Iter. L0-C0 L0-C1 L1-C0 L1-C1 Accur.
ICA EM Census 429.56 1000 64 2178 24 732 0.73
ICA EM Credit 0.073 2 461 0 205 0 0.69
ICA k-means Census 0.002 2 29 2213 4 752 0.74
ICA k-means Credit 0.001 16 456 5 205 0 0.68
LDA EM Census 0.857 41 1251 991 738 18 0.58
LDA EM Credit 0.873 164 199 262 185 20 0.67
LDA k-means Census 0.004 8 791 1451 696 60 0.72
LDA k-means Credit 0.005 9 147 314 170 35 0.73
PCA EM Census 1.114 3 743 1499 644 112 0.71
PCA EM Credit 1.934 27 266 195 93 112 0.57
PCA k-means Census 0.005 4 1491 751 112 644 0.71
PCA k-means Credit 0.002 12 275 186 95 110 0.58
RP EM Census 10.471 4 741 1501 644 112 0.72
RP EM Credit 2.442 27 236 225 169 36 0.59
RP k-means Census 0.015 10 1512 730 122 634 0.72
RP k-means Credit 0.04 30 345 116 103 102 0.67

As with the initial clustering exercise, I performed 5 restarts for each of the reduced datasets. The table displays the most accurate run for each combination of reduction algorithm, clustering algorithm, and dataset, in terms of classification accuracy.
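
The experimental loop behind the table is, roughly, reduce once and then cluster several times, keeping the best run. The Reducer and Clusterer interfaces below are hypothetical stand-ins for the ABAGAIL filters and clusterers, used only to show the shape of the procedure:

```java
public class ReducedClusteringExperiment {
    /** Hypothetical stand-ins for the ABAGAIL filters and clusterers used in the report. */
    interface Reducer   { double[][] reduce(double[][] data); }
    interface Clusterer { int[] cluster(double[][] data, int k, long seed); }

    /** Reduce the data once, run several clustering restarts, and keep the best accuracy. */
    static double bestOfRestarts(double[][] data, int[] labels, Reducer reducer,
                                 Clusterer clusterer, int k, int restarts) {
        double[][] reduced = reducer.reduce(data);
        double best = 0.0;
        for (int seed = 0; seed < restarts; seed++) {
            long start = System.nanoTime();
            int[] assignment = clusterer.cluster(reduced, k, seed);
            System.out.printf("restart %d took %.3f s%n", seed, (System.nanoTime() - start) / 1e9);
            best = Math.max(best, accuracy(assignment, labels));
        }
        return best;
    }

    /** Two-cluster, two-label accuracy: score the better one-to-one matching. */
    static double accuracy(int[] clusters, int[] labels) {
        int[][] counts = new int[2][2];
        for (int i = 0; i < labels.length; i++) counts[clusters[i]][labels[i]]++;
        int matchA = counts[0][0] + counts[1][1];
        int matchB = counts[0][1] + counts[1][0];
        return (double) Math.max(matchA, matchB) / labels.length;
    }
}
```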

ICA yields the highest accuracy for EM on the Census data. However, it also takes significantly longer than the other configurations. It seems that in trying to maximize mutual independence, ICA created a handful of outlier data points; both EM and k-means group all but a few of the instances into one cluster. I would not consider this a good clustering. It makes sense that the ICA dataset would not work well with EM, as ICA is trying to minimize the Gaussianity of the projections while EM clusters with k Gaussians. ICA may also not have been successful if the underlying sources of the data are not mutually independent. Conceptually, I would not expect the Credit or Census data to have independent sources, as they are based on demographic information; they do not fit the mold of the cocktail party example given in lecture.

K-means with LDA produces the most accurate clustering for the Credit data. The reduction LDA performs is based on separating the classes, so it already does part of the work for the clustering algorithm. LDA is similar to a Support Vector Machine with a linear kernel, which also performed well on the Credit set in Assignment 1. LDA is not as effective with EM, though I am not sure why, since LDA also assumes the classes are Gaussian.

The k-means clusters for the PCA-reduced set were the same as the clusters from the original set, indicating that the reduction did not remove relevant features. The results for the Credit set were also more consistent over the restarts, which may be due to PCA removing some of the noise in the data. The EM clusters were also the same for the Census set, but not for the Credit set, where accuracy decreased. It is possible that the performance on the PCA set would improve with additional restarts.

The clustering done on the RP reduced sets did not match any of the original clusterings. Nor did any of the clusterings in the restarts match each other. This is to be expected, as the data sets are projected randomly, so no two sets are transformed the same way. Even with the variability, on average they performed about the same as the clustering on the original sets in terms of classification accuracy. The features are combined in such a way that they still retain enough core information about the data to make meaningful classifications over multiple restarts.

Neural Network with Dimensionality Reduction Algorithms

I chose to combine the neural network with the dimensionality reduction algorithms on the Credit dataset, since it has less data and is quicker to train. The NN from Assignment 1 has 5 hidden nodes and runs for 500 iterations. It achieves a training accuracy of 94.7% and a test accuracy of 72.2%. Running the same NN on the reduced sets gives the following results:

Algorithm Time Training Accuracy (%)
None 1.07 94.74
LDA 0.62 79.73
ICA 0.58 72.07
RP 1.15 87.84
PCA 0.75 96.2

Only the PCA dimensionality reduction improved the training accuracy, by roughly 1.5 percentage points, which is impressive considering it also cut the number of features from 59 to 36. It was able to weed out features that were not contributing significantly and may even have been skewing the data.

The LDA and ICA sets both performed relatively poorly. LDA is understandable because it contains only one feature, so any relationships between features that would normally be captured by a NN are lost. I believe the issue with ICA is that its fundamental assumption, that the data comes from mutually independent sources, does not hold for the Credit set.

The RP set performs fairly well, but interestingly takes about the same time as the unfiltered dataset. The other reduced sets train more quickly, illustrating the benefit of fewer features. I believe the curse of dimensionality, or the effect the number of features has on computational complexity, would become more apparent with a larger dataset.

Neural Network with Clustering

In order to incorporate the clustering results into the neural network, I modified each of the reduced datasets to include an additional feature containing the cluster each instance belongs to (a sketch of this augmentation step follows the table). This produced the following results:

DR Cluster Time Training Accuracy (%)
ICA EM 1.36 71.02
ICA k-means 1.14 71.32
LDA EM 1.1 79.58
LDA k-means 1.09 79.73
PCA EM 1.01 96.25
PCA k-means 1.53 98.05
RP EM 1.02 89.19
RP k-means 1.65 87.39
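
The augmentation step referenced above is just appending one column; a minimal sketch, assuming the cluster assignments are available as an int array:

```java
public class AddClusterFeature {
    /** Append each instance's cluster id as one extra numeric feature column. */
    static double[][] withClusterFeature(double[][] data, int[] clusterIds) {
        double[][] augmented = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            double[] row = new double[data[i].length + 1];
            System.arraycopy(data[i], 0, row, 0, data[i].length);
            row[data[i].length] = clusterIds[i];
            augmented[i] = row;
        }
        return augmented;
    }
}
```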

All in all, the new results were not significantly different from the NN results in the previous section. PCA with k-means clustering increased its accuracy by almost 2 percentage points, showing that the clusters did contain some useful information for the NN. For ICA, the accuracy actually decreased slightly with the addition of the cluster column. As seen previously, the clusters are not especially effective for ICA (grouping the vast majority of samples into one cluster), so this carried through to the NN. For the most part k-means did better than EM, but the difference was not significant.

I also trained NNs on datasets containing only the clusters from the reduced sets and the labels. However, the NNs were not able to exceed the ZeroR results and classified all instances as the most common label. The clusters alone do not appear to be a rich enough dataset. As seen in the clustering results above, the clusters do not align closely with the labels, so it is not surprising the NNs performed poorly.

Citations

[1] “Lecture 15: Principal Component Analysis.” [Online]. Available: http://people.duke.edu/~hpgavin/SystemID/References/Gillies-PCA-notes.pdf.

[2] “ICA for dummies,” Arnaud Delorme. [Online]. Available: http://arnauddelorme.com/ica_for_dummies/.

[3] “Measures of Non-Gaussianity,” Newton-Raphson method (multivariate). [Online]. Available: http://fourier.eng.hmc.edu/e161/lectures/ica/node4.html.

[4] R. Ward, “Dimension reduction via random projections,” University of Texas at Austin. [Online]. Available: https://www.ma.utexas.edu/users/rachel/madison14.pdf.

[5] R. Gutierrez-Osuna, “L10: Linear discriminants analysis.” [Online]. Available: http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf.