Multiple correspondence analysis, Clustering and Tandem Analysis through a basic income analysis example


Abstract

Okay… So there were several basic income experiments launched in 2017, Finland started a two-year experiment by giving 2,000 unemployed citizens approximately $600 a month. In the Silicon Valley, Y Combinator, announced in mid-2016 that it would begin paying out monthly salaries between $1,000 and $2,000 a month to 100 families in Oakland, while in Utrecht, Netherland 250 Dutch citizens will receive about $1,100 per month. These are just three of the already launched experiments, and their aim is to measure how basic income could provide new structure for social security and to see how people’s productivity levels change when they receive a guaranteed salary.

But how people think about basic income? Are we supportive of it or we fear it? Who is the most likely to vote for it? Is there a difference between people according to their education or job status who are more pro or contra of this idea? This study aims to answer these question by using a semi-supervised approach, Clustering and a Tandem analysis to classify people according to their characteristics and their opinion of basic income.


Basic income

Data

This dataset contains information of a survey linked to the basic income, conducted in Europe in 2016. The results have shown that two third of European would vote for the universal basic income.

The goal of this article is to show which type of person is more or less likely to vote in favour of the basic income, and to show which individual caracteristics affect the most this decision.

Then, I realise a cluster analysis (only with 100 individuals as the original size of the dataset- 9649 observations make it difficult to visualise the clusters.)

Data source: https://www.kaggle.com/daliaresearch/basic-income-survey-european-dataset/data

Calling the dataset

rm(list=ls())
income <- read.csv("/Users/mehdidellagi/Downloads/basic_income_dataset_dalia.csv")

Data cleaning

I organise the data in the following structure:

I keep only the following variables (all of them categorical)

  1. country
  2. gender
  3. rural
  4. dem_education_level (rename it education)
  5. dem_full_time_job (rename it job)
  6. dem_has_children (rename it children)
  7. question_bbi_2016wave4_basicincome_awareness (rename it awareness)
  8. question_bbi_2016wave4_basicincome_vote (rename it vote)

I also create a new categorical variable ‘age_group’:

  1. age_group
data <- income[, c(1, 4:10, 14)]

# Renaming the variables
colnames(data)[1] <- "country"
colnames(data)[4] <- "education"
colnames(data)[5] <- "job"
colnames(data)[6] <- "children"
colnames(data)[7] <- "awareness"
colnames(data)[8] <- "vote"
# colnames(data)[data$question_bbi_2016wave4_basicincome_vote] <- "vote"

# The variable "age_group" 
levels(data$age_group)[levels(data$age_group)=="14_25"] <- "young"
levels(data$age_group)[levels(data$age_group)=="26_39"] <- "adult"
levels(data$age_group)[levels(data$age_group)=="40_65"] <- "aged"

summary(data)
##     country        gender       rural       education     job      
##  DE     :1420   female:4555   rural:2771   high  :3270   no :3947  
##  FR     :1248   male  :5094   urban:6878   low   :1815   yes:5702  
##  GB     :1199                              medium:3578             
##  IT     :1138                              no    : 323             
##  ES     :1005                              NA's  : 663             
##  PL     : 860                                                      
##  (Other):2779                                                      
##  children                                 awareness   
##  no :4600   I have heard just a little about it:2305  
##  yes:5049   I have never heard of it           :1566  
##             I know something about it          :3389  
##             I understand it fully              :2389  
##                                                       
##                                                       
##                                                       
##                                vote      age_group   
##  I would not vote                :1159   young:1940  
##  I would probably vote against it:1439   adult:2813  
##  I would probably vote for it    :3272   aged :4896  
##  I would vote against it         : 960               
##  I would vote for it             :2819               
##                                                      
## 
attach(data)

Summary of the data

for (i in 1:ncol(data)) {
  plot(data[,i], main=colnames(data)[i],
       ylab = "Count", col="darkblue", las = 1, col.main = "darkblue")
  }

Capture d’écran 2018-03-22 à 10.07.24.png

MCA – Multiple Correspondance Analysis

The Multiple correspondence analysis (MCA) is an extension of the simple correspondence analysis (see PCA article) for summarizing and visualizing a data table containing more than two categorical variables. It can also be seen as a generalization of principal component analysis when the variables to be analyzed are categorical instead of quantitative. I do not explain MCA here in more details as the idea is identical with the idea of the PCA analysis, however, for further explanation, see: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/.

I choose ‘country’ and ‘age_group’ as complementary variables.

#install.packages("FactoMineR")
#install.packages("factoextra")

#library("FactoMineR")
#library("factoextra")
res.mca <- MCA(data, quali.sup = c(1, 9), ncp = 5, graph = FALSE)

Eigenvalues

eigenvalues <- get_eigenvalue(res.mca)
head(round(eigenvalues, 2))
     eigenvalue variance.percent cumulative.variance.percent
Dim.1       0.24            11.23                       11.23
Dim.2       0.18             8.31                       19.54
Dim.3       0.17             7.88                       27.42
Dim.4       0.15             7.13                       34.56
Dim.5       0.15             6.91                       41.47
Dim.6       0.15             6.90                       48.36

We see that the variance explained by the first two dimensions are quite weak, however, knowing the volume of the data and its nature, I still think it is okay.

We can also graph the variance explained by the dimensions:

fviz_screeplot(res.mca)

Capture d’écran 2018-03-22 à 10.09.49.png

MCA plots

Analyses of the variables

Variables’ contributions to each dimensions
fviz_contrib(res.mca, choice = "var", axes = 1)

Capture d’écran 2018-03-22 à 10.10.38.png

We see that the following categories are important for the first dimension:
1. ‘I would not vote’
2. ‘I have never heard of it’
3. ‘job_no’
4. ‘job_yes’
5. ‘education_high’

And for the second dimension:

fviz_contrib(res.mca, choice = "var", axes = 2)

Capture d’écran 2018-03-22 à 10.10.57.png

The categories showing the comprehension of the basic income and the willingness to vote contribute more than average.

Now, we show the quality of representation of each categories. We see that they are relatively low, this might be explained by the volume of the data.

fviz_cos2(res.mca, choice = "var", axes = 1:2)
Capture d’écran 2018-03-22 à 10.11.11.png
Plot of active categories
plot(res.mca, invisible = c("quali.sup", "ind"), cex=1, col.var = "darkblue", title = "Active categories", cex.main=2, col.main= "darkblue")

Capture d’écran 2018-03-22 à 10.13.39.png

We see that the categories, such that low education, no job and no prior knowledge of the basic income are very close to the category: ‘I would not vote’. These categories are positives for the two dimensions. (The category ‘education.NA’ is just between the categories ‘education_no’ and ‘education_low’, which is quite logic, as people with lower level of education are less likely to admit their level of education.)

The categories ‘education_high’, ‘urban’, ‘I would vote against it’ are negative on the first dimension but positive on the second. This makes us think that the more we move to the right on the first dimension, the less educated individuals we find, who know less about the basic income and the more likely is that they are not employed.

Finally, the second dimension seem to seperate the people who are sure about voting: higher on the second dimension, we find the individuals who are sure about not voting or sure about voting, whereas below we find the individuals who are not certain to vote at all.

Individuals

Firstly, extract the individuals.

ind <- get_mca_ind(res.mca)
ind
Multiple Correspondence Analysis Results for individuals
 ===================================================
  Name       Description                       
1 "$coord"   "Coordinates for the individuals" 
2 "$cos2"    "Cos2 for the individuals"        
3 "$contrib" "contributions of the individuals"

The following table shows the quality of representation of the first few individuals on each of the dimensions. Teh quality of representation is low, but it seems normal with more than 9000 individuals.

head(ind$cos2)

       Dim 1      Dim 2       Dim 3        Dim 4       Dim 5
1 0.18927320 0.01985636 0.013295201 0.0427709980 0.051268598
2 0.42784689 0.05153570 0.011548365 0.0003802403 0.060352387
3 0.26607336 0.02267226 0.035181580 0.1081329399 0.001559895
4 0.07712656 0.12171939 0.027984932 0.0172307050 0.004742808
5 0.09180824 0.17111081 0.005836723 0.0721050205 0.009465591
6 0.04586928 0.02825110 0.011520517 0.1159209085 0.035885994

Next, we show the contributions of the top 20 individuals (again, it is relatively low, but it is explained by the quantity of the data).

fviz_contrib(res.mca, choice = "ind", axes = 1:2, top = 20)

Capture d’écran 2018-03-22 à 10.15.45.png

Finally, we show the individuals by two color different showing their job status.

fviz_mca_ind(res.mca, label = "none", habillage=job, title="MCA individues according to their job status", addEllipses = TRUE, ellipse.level = 0.95)

Capture d’écran 2018-03-22 à 10.16.35.png

Then we show the individuals and the categories together.

plot(res.mca, label = c("quali.sup", "var"), select = "cos2 10", cex=1, col.var = "darkblue", col.quali.sup = "brown3",  col.ind = "seashell3", title = "Individuals with active and supplementary categories", cex.main=2, col.main= "darkblue")

Capture d’écran 2018-03-22 à 10.17.47.png

Clustering

In this section, I do a Hierarchical clustering, then a K-means analysis.

Clustering is a technique for finding similarity groups in a data, called clusters. It attempts to group individuals in a population together by similarity, but not driven by a specific purpose, therefore it is often called an unsupervised learning, as we don???t have prescribed labels in the data and no class values denoting a priori grouping of the data instances are given.

We can estimate the similarity between two data instances by the distance between them. We can use

  • euclidean (pythagorean) distance
  • manhattan (sum of absolute differences between coordinates) distance
  • mahalanobis distance (distance from the mean by standard deviation)
  • Pearson correlation or Spearman correlation.

In the following, I use Euclidean distance.

Given a set of 100 items to be clustered, and an 100 x 100 distance (or similarity) matrix, the basic process of the hierarchical clustering is the following:

  • We assign each item to its own cluster, so we have 100 clusters. each containing one item. For now, therefore, the distances between the clusters are equal to the distances between the items.
  • Next, we find the 2 closest clusters and we merge them, so that we have one cluster with two items now. Now we have 99 (100-1) clusters.
  • We compute again the distances between this new cluster and all of the other clusters.
  • We repeate the second and third steps until we have a single cluster of size 100.

We see that really, the basic idea of clustering is really really simple.

Now let’s implement it in practice:

Firstly, I create a randomly selected subset of the data for a smaller, more visible clustering.

# I choose the observations in France
data_cluster <- data[data[, "country"] == "FR",]
# Gives us always the same subset 
set.seed(330)
# Sample for random rows
random_rows <- sample(rownames(data_cluster), size = 100, replace = FALSE)
data_hcpc <- data[random_rows, 2:9]

And now we make the MCA plots:

res2.mca <- MCA(data_hcpc, quali.sup = 8, ncp = 5, graph = FALSE)
res.hcpc <- HCPC(res2.mca, min = 3, nb.clust = -1, graph = TRUE)
# Keep the axis for kmeans 
km=data.frame(res2.mca$ind$coord)

Capture d_écran 2018-03-22 à 10.20.16Capture d_écran 2018-03-22 à 10.20.27Capture d_écran 2018-03-22 à 10.20.34

We see that the clustering proposes three clusters.
To see what separates most these clusters, we ask R to show us some individual observations of each clusters:

names(res.hcpc)
[1] "data.clust" "desc.var"   "desc.axes"  "call"       "desc.ind"

On affiche des individus de chaque cluster pour mieux comprendre l’analyse.

print(res.hcpc$desc.ind)
$para
Cluster: 1
     3696      4199      3995      3890      3453 
0.3615251 0.3802195 0.4362871 0.5758796 0.6145176 
------------------------------------------------------------------------------------------------------------ 
Cluster: 2
     3744      3555      4013      4268      3568 
0.4524800 0.4533426 0.4708304 0.5124976 0.5155082 
------------------------------------------------------------------------------------------------------------ 
Cluster: 3
     3701      4358      4042      3625      3387 
0.2743815 0.3004191 0.4752501 0.5307634 0.5432425 

$dist
Cluster: 1
    3796     3810     4184     4139     3484 
1.820734 1.820734 1.627126 1.605598 1.508880 
------------------------------------------------------------------------------------------------------------ 
Cluster: 2
    4225     3850     4021     4326     4072 
2.005300 2.005300 1.668767 1.551963 1.516803 
------------------------------------------------------------------------------------------------------------ 
Cluster: 3
    3992     3545     4397     3766     4358 
2.444105 2.121605 1.757267 1.630860 1.451258

For the first cluster, we show four individuals randomly:

c1 <- c(which(rownames(data_hcpc) == 3696), which(rownames(data_hcpc) == 4199), which(rownames(data_hcpc) == 3995), which(rownames(data_hcpc) == 3890))
cl1 <- data_hcpc[c1,]
print(cl1)
gender
rural
education
job
children
awareness
vote
age_group
3696 male urban medium yes yes I have never heard of it I would probably vote against it aged
4199 male urban medium yes yes I have heard just a little about it I would not vote aged
3995 female urban high yes yes I have never heard of it I would vote against it adult
3890 female urban medium no yes I have heard just a little about it I would probably vote against it aged

We see that the first cluster contain observations of mostly employed people who would probably vote against basic income and have not heard much about it.  We do the same for the second cluster:

c2 <- c(which(rownames(data_hcpc) == 3744) , which(rownames(data_hcpc) == 3555), which(rownames(data_hcpc) == 4013), which(rownames(data_hcpc) == 4268))
cl2 <- data_hcpc[c2,]
print(cl2)
gender
rural
education
job
children
awareness
vote
age_group
3744 male urban low yes yes I know something about it I would vote for it aged
3555 female urban medium yes yes I understand it fully I would probably vote for it aged
4013 female urban high yes no I know something about it I would vote for it young
4268 female urban high no yes I understand it fully I would probably vote for it aged

We see that the second cluster contains individuals who know more about the basic income and would vote for it, mostly females living in a urban area, with children.

And finally for the third cluster:

c3 <- c(which(rownames(data_hcpc) == 3701) , which(rownames(data_hcpc) == 4358), which(rownames(data_hcpc) == 4042), which(rownames(data_hcpc) == 3625))
cl3 <- data_hcpc[c3,]
print(cl3)
gender
rural
education
job
children
awareness
vote
age_group
3701 female rural medium no no I have never heard of it I would not vote young
4358 male rural low no no I have never heard of it I would not vote young
4042 male rural low no no I know something about it I would not vote young
3625 male urban medium no no I have never heard of it I would not vote young

The third cluster contains the observations of people who would not vote at all, who have ni children, ni job. 

K-means

K-means clustering is also a type of unsupervised learning. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.

It can be a good idea to ask 3 clusters in the K-means analysis, like this, we could compare the results of the Hierarchical Clustering and K-means.

set.seed(333)
groupes.kmeans <- kmeans(km, centers = 3, nstart = 5)
print(groupes.kmeans)
K-means clustering with 3 clusters of sizes 19, 63, 18

Cluster means:
       Dim.1       Dim.2       Dim.3      Dim.4        Dim.5
1  0.8876059  0.10548369  0.07053484 -0.1326389  0.008544427
2 -0.1605333 -0.09936673 -0.19486188  0.1266472 -0.079478257
3 -0.3750506  0.23643967  0.60756314 -0.3032576  0.269154784

Clustering vector:
4503 4139 4517 4368 4048 4235 4219 4076 4222 4321 3870 4155 3384 3798 3810 4233 4174 4042 3544 3768 4286 3935 3545 3991 3886 4199 4195 3766 
   2    3    2    3    2    3    2    3    2    3    1    2    2    2    3    2    2    1    2    2    2    2    1    2    1    3    2    1 
4026 3555 3821 3563 4397 3696 3970 3598 3509 3944 3890 4072 4358 3353 3992 4265 3728 4378 4021 4247 3532 4107 4523 3625 3605 3742 4149 4405 
   2    2    2    2    1    3    2    2    2    2    3    2    1    1    1    2    1    2    2    1    2    1    2    1    2    2    2    3 
3744 3904 3332 3453 4326 3727 3701 3869 3685 4011 3570 3505 4268 4184 4024 3968 3843 4427 3484 3604 3740 4256 3885 3796 3683 4225 3394 3607 
   2    2    1    3    2    3    1    2    2    2    2    3    2    3    2    2    1    2    3    2    1    2    2    3    2    2    2    2 
4160 4539 4469 3568 3480 3850 3995 3840 4013 3649 3859 3387 4410 4278 3488 3370 
   3    1    2    2    2    2    3    2    2    2    2    1    2    2    2    2 

Within cluster sum of squares by cluster:
[1] 12.60606 44.27574 15.51596
 (between_SS / total_SS =  32.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"

We see that R proposes 3 clusters with 19, 63, 18 observations respectively.

We can also show the optimal number of clusters. We see that after the third cluster, the total within sum of quares gained is lower then before therefore we can keep 3 clusters for the analysis.

fviz_nbclust(km, hcut, method = "wss")

Capture d’écran 2018-03-22 à 10.24.33.png

Finally, we represent our K-means also on a graph:

fviz_cluster(groupes.kmeans, data = km, palette = "jco", repel= TRUE, main = "Kmeans", ggtheme = theme_classic())

Capture d’écran 2018-03-22 à 10.25.31.png

Well we see that the clusters do not actually show the clear seperation of the data.

dist.hcpc <- dist(km)
cah.ward <- hclust(dist.hcpc, method = "ward.D2")
cah.ward
Call:
hclust(d = dist.hcpc, method = "ward.D2")

Cluster method   : ward.D2 
Distance         : euclidean 
Number of objects: 100

Finally we compare the two types of Clustering:

groupes.cah <- cutree(cah.ward, k=3)
print(table(groupes.cah, groupes.kmeans$cluster))
groupes.cah  1  2  3
          1  6 63  1
          2  0  0 15
          3 13  0  2

On voit que:

  1. cluster 1 (k-means) corresponds to cluster 2 (CAH) with a difference of 7 observations
  2. cluster 2 (k-means) correspond to cluster 3 exactly
  3. cluster 3 (k-means) corresponds to cluster 1 with a difference of 2 observations.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

%d bloggers like this: