Supervised Learning for Binary Classification on US Adult Income

In this project, various binary classification methods have been used to make predictions about US adult income level in relation to social factors including age, gender, education, and marital status. We first explore descriptive statistics for the dataset and deal with missing values. After that, we examine some widely used classification methods, including logistic regression, discriminant analysis, support vector machine, random forest, and boosting. Meanwhile, we also provide suitable R functions to demonstrate applications. Various metrics such as ROC curves, accuracy, recall and F-measure are calculated to compare the performance of these models. We find the boosting is the best method in our data analysis due to its highest AUC value and the highest prediction accuracy. In addition, among all predictor variables, we also find three variables that have the largest impact on the US adult income level.


Objective
The inequality of wealth and income is a huge concern around the globe, and governments in different countries are using different interventions to address income inequality. In this project, we aim to utilize some of the existing classification methods to help understand these inequality issues. Our strategy is to train a binary classifier, denoted as Y, to predict the whether a person earns more than $50K or not per year based on the social factors and to find out what factors influence the income level the most.

Description of Dataset and Challenges of the project
The US Adult income dataset was extracted from the 1994 US Census Database. The data set consists of anonymous social information such as occupation, age, native country, race, capital gain, capital loss, education, work class and others. The data set includes figures on 48,842 different records and 14 attributes for 42 nations. The 14 attributes consist of 8 categorical and 6 continuous attributes containing information, where the detailed information is summarized in Table 1. The dataset can be accessed from UCI repository [1] and Kaggle [2]. The challenges of this dataset were discussed in some of the past literature. First, the dataset contains around 7.4% of missing values, and the handling of these missing data is of great challenge to the prediction outcome. Second, the number of observations is around 48k, which is considered computational expensive for some of the classification algorithms. Third, half of the predictor variables are multi-level categorical, which reduces the amount of information preserved in data. Besides, many similar observations have different response classes, posing a challenge to the accuracy of the prediction. In this project, we address each challenge above to our best knowledge in order to best predict the US adult income level.

Project-related literature
Supervised learning is perhaps a powerful and useful approach in contemporary machine learning and statistical analysis, whose main purpose is to characterize the relationship of the main interested response and the information of predictors, and then use the predictors to do modeling and prediction. Classification, whose response variable is taken as binary outcome, is one of common applications in the framework of supervised learning.
For the US adult income dataset, it was especially preferred in machine learning models because it is large enough to allow enough room for train and test sets, and to make discerning small differences in performance reliable. In the early approaches, [3] employed Gradient boosting classifier method; [4] made the usage of Extreme Gradient Boosting (XGBOOST) for prediction tasks; [5] implemented Principal Component Analysis (PCA) to generate and evaluate income prediction data based on the current population survey provided by the U.S. Census Bureau. [6] tried to replicate Bayesian networks, decision tree induction and lazy classifier for the dataset and presented a comparative analysis of the predictive performances. In addition to the existing approaches, there are a lot of machine learning strategies that might be suitable to analyze this dataset, such as discriminant analysis, support vector machine (SVM), random forest, and neural nets [7][8][9].

Main Contributions
Different from existing work, in this paper we analyze the US adult income data by examining some popular machine learning methods, including logistic regression, discriminant analysis, support vector machine (SVM), random forest, and boosting. In addition to discussing detailed algorithms, we also provide some commonly used R functions to implement the methods. In addition, based on the boosting method, we further summarize the relative importance and identify most important variables that determine the level of adult income. To the best of knowledge, it was not explored in existing work.

Data cleaning
After browsing the data, we find that the two variables Education and Education.num are simply two different representations of education level. There is no need to keep both of them when we are training our models. Considering that Education.num is a numerical variable and larger value of it means higher education level, we decide to remove Education from our dataset and keep Education.num only. Moreover, we also find that there are several variables having values '?' that is unreasonable, so we first treat the character '?' as missing value and then check whether it is appropriate to remove all observations with missing values.
By exploring the missing values, we find that about 7.4% of observations have missing values. All the missing values occur in the variables occupation, workclass and native.country, and the proportions of missingness are shown in Figure 1.  Figure 2, we can see that for those observations with missing values, a larger proportion of them have a low salary level, compared to those observations without missing values. It means that the missing data mechanism here is not missing completely at random (MCAR). Thus, it is inappropriate to remove all the observations with missing values.
Therefore, for this dataset, we are not going to remove the observations with the character ' ?'. Instead, we will treat '?' as a new level 'unknown' and do further analysis using all the observations in the original data.

Split training set and testing set
In order to avoid overfitting and to measure the performance of our models in a more reasonable way, we randomly split our data into 70% of the training set and 30 % of the testing set. We use the training set to build the models, while the testing set is used to measure the performance of our models.

Exploratory data analysis
The training dataset contains 48,800 observations with 15 variables that are denoted as . A description of the variables is shown in Table 1. 76.07% of the data is from the low salary group with annual income less than or equal to 50k, while the other 23.93% is from the high salary group with annual income over 50K.  Figure 3 shows the distribution of each categorical variable for both salary levels, and it provides us with an initial idea of how different attributes of individuals vary between people from the low salary group and the high salary group. Most of the observations shown in Figure 3 are not out of ordinary. We can see that most people from the high-income group own corporations while most people from the low-income group are working class. The percentage of people with higher education (Bachelors and above) is also much higher for the high-income group. The marriage status of people from the low-income group seems more "complicated", it seems that money can indeed ruin your marriage sometimes. A higher percentage of people with high income are white-collar and gold-collar workers, while people with low income are more involved in blue-collar jobs. The majority of people with high income are husbands, while the relationship of lowincome people is more diverse. This is consistent with the marital-status of these two groups. In terms of race, we can see a higher percentage of the white race with high-income and a higher percentage of the black race with low-income. This is consistent with the social phenomenon that the white race being the dominant race in middle-and upper-class in the U.S. In terms of gender, it is no surprise that the percentage of males is much higher for the high-income group.

Numerical variables
We explore the relationship between numerical variables to see if there is any significant correlation. As shown in Figure 4, we do not see high correlation between any pairs of variables. In addition, we can also see the distribution of all numerical variables in Figure 5. We realize for most observations, their capital.gain and capital.loss is 0. Besides, from the boxplots, we can see the distribution of these variables is scattered. Moreover, the histograms show that no continuous variables is normal distribution.

Model introduction
Logistic regression models the probability of response using the logit function and the parameters are obtained from the maximum likelihood estimates. As a generalized linear model, logistic regression is a widely used technique because it is highly interpretable and does not require too many computational resources [10]. However, the disadvantage of logistic regression is that it cannot solve non-linear problems and is highly relied on the choice of predictors with good performance [10].

Model Fitting
The income data has the response salary that falls into one of two categories, "<= 50K" or ">50K" that allows us to fit it to a logistic regression model. We use glm(family= binomial) function to fit three logistic regression models to compare the results. They are (i) the model with only intercept, (ii) the model only with numeric variables, and (iii) the model with all variables except education.num, respectively. The third model with all the variables except education gives the best performance since all the variables are shown as significant in impacting the salary and it has the lowest AIC and MSE. We used halfnorm() to check the outliers of the third model, which is also our final model, and we find that the model fits very well since there are no detected outliers in the model.

Linear discriminant analysis 4.2.1 Model introduction
Although logistic regression is a simple and powerful linear classification algorithm, it has limitations such as instability for well-separated classes. Linear Discriminant Analysis (LDA) can address this limitation with prior probability obtained for each class from Bayes' theorem. The classification of an observation is done in the following two steps: (1) Identify the distribution for input for each of the class ( = with = 0,1).
(2) Flip the distribution using Bayes theorem to calculate the probability where Pr( = | = ) represents probability that an observation belongs to response class = and Pr( = | = ) represents probability of = , for a particular response class = [11]. The distribution of = needs to be calculated from the historical data for every response class = . For example, if we want to predict the salary level (>50K or <=50K) of a person based on predictors, we need to first identify the distribution of these characteristics from historical data for the two salary level (>50K or <=50K). Then using Bayes theorem to find the probability of this person belonging to each level (>50K or <=50K) from the distribution of the set of characteristics. The level that gets the highest probability is the output class and a prediction is made. Note that the LDA assumes that predictors are normally distributed and that the different classes have class-specific means and equal variance/covariance.

Model fitting
Before fitting the model, we need to check for the normality and equal-variance assumptions of the LDA model. Apparently, these two assumptions are both violated so it is expected that the LDA may not perform very well. Next, we use the lda() function in the MASS package to train the model and measure its performance by making predictions on the test data. The prediction performance of the model is shown in Section 5.1.

Model Introduction
Quadratic discriminant analysis (QDA) [11] provides an alternative approach as LDA. Similar to LDA, QDA assumes that the observations from each class are drawn from a Gaussian distribution, and plug estimates for the parameters into Bayes' theorem in order to perform prediction. The difference between QDA and LDA is the homogeneity and heterogeneity of variance between each class, LDA assumes that the covariance matrices are the same for all classes, while QDA assumes the covariance matrices are different. Under this assumption, we have that assigns an observation = to the class for which is largest. By plugging the estimators for Σk, µk and πk into (1), we can assign each observation into a class.
The quantity x appears to be quadratic term in (1), this is where QDA get its name.

Model fitting
Before using qda() function in package MASS, we need to filtrate the variables to make sure the data can be fitted into the model. By testing each variable, we find that the variables workclass, education, and native.country will lead to rank deficiency, which may be due to that the factorized data conflicts with the assumption that each variable follows a normal distribution. So we remove those three variables from the original data and fit the QDA model.
Similarly, we make predictions on the test data to measure the performance of the QDA model. We plot the ROC curve and compute different metrics for classification to measure the performance, as shown in Section 5.1.

Model introduction
SVM [12] constructs a hyperplane f (x) = β0 + x T β or set of hyperplanes in a high-dimensional space, which can be used for binary and multi-class classification problems. The estimation of parameters of binary classification is obtained from the cost minimization problem, and the class is determined from the decision function � ( ) = ( � 0 + � ). SVM can also be generalized for classification of nonlinear decision boundaries, in which case functions of the predictors in quadratic, cubic or radial terms will be used to address the non-linearity.

Model fitting
The first consideration for fitting the SVM is to choose an appropriate kernel function. Given the size of the training dataset n and the number of covariates q, we think a linear kernel with a running time of O(n 2 q) is appropriate for this classification task, since the dataset does not inherit explicit internal structures, and the training time for other kernel functions will be much more computationally expensive especially when the size of the dataset is large.
The function tune() is used to perform a ten-fold cross-validation on a set of cost parameters ranging from 0.05 to 5. The best model was chosen with an optimal cost parameter C=0.1. The prediction performance of the model is shown in Section 5.1.

Model introduction
Random forest [13] is a popular method in machine learning. It constructs many classification and regression trees (CARTs) [14] randomly, and each CART is independent. When the forest is constructed, it can be used to make predictions based on new inputs. The class of an input is determined by every single CART. Therefore, the predicted class of the input is the most frequent class generated by CARTs.
A CART is trained by a sample generated from the original sample. Since the random forest consists of many CARTs, bootstrap [15] is applied. Bootstrap generates multiple samples from the original sample with replacement. Each sample has many features. Random forest randomly selects different feature subsets, and then each CART in the forest uses ID3 algorithm [16], C4.5 algorithm [17] or Gini index [14] to select the most important feature from a feature subset to split the tree.
Gini index represents the probability of a randomly selected feature being misclassified. The form is given by with pk being the probability of a input being classified in a particular class and K being the number of categories in a sample. Then the Gini index of a sample D is where CK is the number of observations in categorical K, and ND is the sample size. If a feature A separates the sample D into two samples D1 and D2, then A smaller Gini index indicates a better result. Therefore, a CART selects features that reduce maximum Gini index to split from the root to the leaves. In this way, many CARTs are generated to form the random forest.

Model fitting
In the R language, there is a package called randomForest, and in this package, a built-in function named randomForest() can be used to fit the random forest model. Since random forest consists of many CARTs and randomly selects feature subsets, mtry and ntree are two important parameters need to be determined. mtry indicates size of feature subset selected in each split, and ntree demonstrates number of trees trained in the model. First, we fit the data with default mtry and set ntree = 500. We find the error rate becomes constant after 100 trees. Thus, we can choose any number which is greater than 100. In this model, we set ntree = 300. Next, we need to find the best mtry. Because this dataset contains 14 predictors, and the variable education has been excluded, we try mtry from 1 to 12. The results demonstrates the average error reaches the lowest point when mtry equals to 3. Finally, we fit the data into the random forest model with mtry = 3 and ntree = 300, and the AUC of the model is 0.91.

Model introduction
Boosting is a general approach that can be applied to many statistical learning methods for regression or classification [13]. However, we only consider the boosting method to improve the performance of decision trees here.
By using the boosting approach, we grow multiple trees sequentially: each tree is grown using information from previously grown trees. The algorithm of applying boosting for regression trees is shown below [13]: (1) Set ̂( ) = 0 and = for all i in the training set.
(2) For = 1, ⋯ , , repeat: 1) Fit a tree ̂ with splits to the training data ( , ). Here d, λ and B are tuning parameters, which would determine the performance of the boosting method, and we are going to set d = 1 and use cross-validation to determine the optimal value of λ and B in the following model fitting part.

Model fitting
First of all, we use the gbm.step() function in the R package dismo to determine the number of trees needed for this this problem by a 10-fold cross-validation [18]. This algorithm identifies the optimal number of trees as that at which the holdout deviance is minimized [19]. As a result, the optimal number of trees we get is 10000. After that, we optimize the learning rate by setting the number of trees as 10000 using 5-fold cross-validation to minimize the CV error. The optimal learning rate we get is 0.03. Thus, we decide to set B = 10000 and λ = 0.03 to train our final boosting model by using the training data.

Combined ROC curves
Since this is a binary classification problem, to assess the performance of several models in Section 4, we consider to use receiver operating characteristic (ROC) curves and area under curve (AUC) values to measure the performance of our models. The combined ROC curves of our all six models are displayed in Figure  6.

Prediction performance
Since the response variable salary is unbalanced, where 76.1% of individuals have an income level of "≤ 50K" and the other 23.9% of individuals have an income level of " > 50 K", we decide to treat " > 50 K" as the positive level and compute the measures for performance of classification in Table 2, where the definitions of several commonly used criteria, such as accuracy, recall, specificity, precision, and F-measure, can be found in [10] and [20]. From Table 2, we can see that the boosting model has the highest prediction accuracy, while the random forest model has the highest F-measure. However, the prediction accuracy of the random forest model is only 76.67%, which is the second-worst among all six models, while the boosting model has a second-best Fmeasure. Therefore, we decide to choose the boosting model as our final model and make further improvements and interpretations based on this model.

Boosting prediction
When we use this boosting model to make predictions on the test data, the results we get are predicting probabilities. So we need to optimize the cutoff using cross-validation such that the cutoff maximizes the accuracy. By using the cross-validation approach, the optimal cutoff we get is 0.5022067, which is very close to 0.5. So we use this optimal cutoff to make prediction on the testing data and the final confusion matrix we get is shown in Figure 7.
From the confusion matrix, we can see that the predictive accuracy of our final model is about 86.34%, which is a bit higher than the accuracy by using default cutoff 0.5.

Interpretation
According to the boosting model, we summarize the relative importance of each variable as shown in Figure 8. Moreover, we also fit a decision tree model to explain the data intuitively. The decision tree model is displayed in Figure 9.  Figure 9, we can find that for educated (education.num ≥ 12) married individuals (relation = Husband, Wife), their income tends to be at a higher level. But for married individuals who have not received much education, their income level is highly correlated with their capital gain. Besides, the income of those un-married individuals (relation = Not-in-family, Other-relative, Own-child, Unmarried) is mainly determined by their capital gain.

Conclusion
In this paper, we explore the factors that affect the income of adults. We apply logistic regression, LDA, QDA, SVM, random forest and boosting to solve this problem. After comparing the performance of these models, we find that boosting provides the highest AUC value, which indicates that it is the best model for fitting the data. Moreover, we find relationship, native.country, and capital.gain have the largest impact on salary levels.
Although we solved the problem successfully, we still have some extensions. For example, most variables in this dataset are categorical and some of them have many levels. We think it would improve the performance of our models by using some suitable methods to reduce the number of levels for some categorical predictors before constructing predicted models. Besides, the performance of LDA and QDA is bad because the assumption of normality is invalid. Some methods can be employed to transform variables into normal distribution, which we do not consider in this project. Moreover, more advanced data science methods like XG-boost and neural network [21][22][23] or Artificial Intelligence approaches [24] can also be applied to deal with this problem, and they might have a better prediction performance than the current models. In addition, we only consider the relationship between salary and the other covariates in this project. But the relationship between covariates might also be interesting, we can adopt graphical models [25] to look into the dependence between predictors.