Okay… So there were several basic income experiments launched in 2017, Finland started a two-year experiment by giving 2,000 unemployed citizens approximately $600 a month. In the Silicon Valley, Y Combinator, announced in mid-2016 that it would begin paying out monthly salaries between $1,000 and $2,000 a month to 100 families in Oakland, while in Utrecht, Netherland 250 Dutch citizens will receive about $1,100 per month. These are just three of the already launched experiments, and their aim is to measure how basic income could provide new structure for social security and to see how people’s productivity levels change when they receive a guaranteed salary.

But how people think about basic income? Are we supportive of it or we fear it? Who is the most likely to vote for it? Is there a difference between people according to their education or job status who are more pro or contra of this idea? This study aims to answer these question by using a semi-supervised approach, Clustering and a Tandem analysis to classify people according to their characteristics and their opinion of basic income.

Basic income

Data

This dataset contains information of a survey linked to the basic income, conducted in Europe in 2016. The results have shown that two third of European would vote for the universal basic income.

The goal of this article is to show which type of person is more or less likely to vote in favour of the basic income, and to show which individual caracteristics affect the most this decision.

Then, I realise a cluster analysis (only with 100 individuals as the original size of the dataset- 9649 observations make it difficult to visualise the clusters.)

Data source: https://www.kaggle.com/daliaresearch/basic-income-survey-european-dataset/data

Calling the dataset

rm(list=ls())
income <- read.csv("/Users/mehdidellagi/Downloads/basic_income_dataset_dalia.csv")

Data cleaning

I organise the data in the following structure:

I keep only the following variables (all of them categorical)

country
gender
rural
dem_education_level (rename it education)
dem_full_time_job (rename it job)
dem_has_children (rename it children)
question_bbi_2016wave4_basicincome_awareness (rename it awareness)
question_bbi_2016wave4_basicincome_vote (rename it vote)

I also create a new categorical variable ‘age_group’:

age_group

data <- income[, c(1, 4:10, 14)]

# Renaming the variables
colnames(data)[1] <- "country"
colnames(data)[4] <- "education"
colnames(data)[5] <- "job"
colnames(data)[6] <- "children"
colnames(data)[7] <- "awareness"
colnames(data)[8] <- "vote"
# colnames(data)[data$question_bbi_2016wave4_basicincome_vote] <- "vote"

# The variable "age_group" 
levels(data$age_group)[levels(data$age_group)=="14_25"] <- "young"
levels(data$age_group)[levels(data$age_group)=="26_39"] <- "adult"
levels(data$age_group)[levels(data$age_group)=="40_65"] <- "aged"

summary(data)

##     country        gender       rural       education     job      
##  DE     :1420   female:4555   rural:2771   high  :3270   no :3947  
##  FR     :1248   male  :5094   urban:6878   low   :1815   yes:5702  
##  GB     :1199                              medium:3578             
##  IT     :1138                              no    : 323             
##  ES     :1005                              NA's  : 663             
##  PL     : 860                                                      
##  (Other):2779                                                      
##  children                                 awareness   
##  no :4600   I have heard just a little about it:2305  
##  yes:5049   I have never heard of it           :1566  
##             I know something about it          :3389  
##             I understand it fully              :2389  
##                                                       
##                                                       
##                                                       
##                                vote      age_group   
##  I would not vote                :1159   young:1940  
##  I would probably vote against it:1439   adult:2813  
##  I would probably vote for it    :3272   aged :4896  
##  I would vote against it         : 960               
##  I would vote for it             :2819               
##                                                      
##

attach(data)

Summary of the data

for (i in 1:ncol(data)) {
  plot(data[,i], main=colnames(data)[i],
       ylab = "Count", col="darkblue", las = 1, col.main = "darkblue")
  }

MCA – Multiple Correspondance Analysis

The Multiple correspondence analysis (MCA) is an extension of the simple correspondence analysis (see PCA article) for summarizing and visualizing a data table containing more than two categorical variables. It can also be seen as a generalization of principal component analysis when the variables to be analyzed are categorical instead of quantitative. I do not explain MCA here in more details as the idea is identical with the idea of the PCA analysis, however, for further explanation, see: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/.

I choose ‘country’ and ‘age_group’ as complementary variables.

#install.packages("FactoMineR")
#install.packages("factoextra")

#library("FactoMineR")
#library("factoextra")
res.mca <- MCA(data, quali.sup = c(1, 9), ncp = 5, graph = FALSE)

Eigenvalues

eigenvalues <- get_eigenvalue(res.mca)

head(round(eigenvalues, 2))

     eigenvalue variance.percent cumulative.variance.percent
Dim.1       0.24            11.23                       11.23
Dim.2       0.18             8.31                       19.54
Dim.3       0.17             7.88                       27.42
Dim.4       0.15             7.13                       34.56
Dim.5       0.15             6.91                       41.47
Dim.6       0.15             6.90                       48.36

We see that the variance explained by the first two dimensions are quite weak, however, knowing the volume of the data and its nature, I still think it is okay.

We can also graph the variance explained by the dimensions:

fviz_screeplot(res.mca)

MCA plots

Analyses of the variables

Variables’ contributions to each dimensions

fviz_contrib(res.mca, choice = "var", axes = 1)

We see that the following categories are important for the first dimension:
1. ‘I would not vote’
2. ‘I have never heard of it’
3. ‘job_no’
4. ‘job_yes’
5. ‘education_high’

And for the second dimension:

fviz_contrib(res.mca, choice = "var", axes = 2)

The categories showing the comprehension of the basic income and the willingness to vote contribute more than average.

Now, we show the quality of representation of each categories. We see that they are relatively low, this might be explained by the volume of the data.

fviz_cos2(res.mca, choice = "var", axes = 1:2)

Plot of active categories

plot(res.mca, invisible = c("quali.sup", "ind"), cex=1, col.var = "darkblue", title = "Active categories", cex.main=2, col.main= "darkblue")

We see that the categories, such that low education, no job and no prior knowledge of the basic income are very close to the category: ‘I would not vote’. These categories are positives for the two dimensions. (The category ‘education.NA’ is just between the categories ‘education_no’ and ‘education_low’, which is quite logic, as people with lower level of education are less likely to admit their level of education.)

The categories ‘education_high’, ‘urban’, ‘I would vote against it’ are negative on the first dimension but positive on the second. This makes us think that the more we move to the right on the first dimension, the less educated individuals we find, who know less about the basic income and the more likely is that they are not employed.

Finally, the second dimension seem to seperate the people who are sure about voting: higher on the second dimension, we find the individuals who are sure about not voting or sure about voting, whereas below we find the individuals who are not certain to vote at all.

Individuals

Firstly, extract the individuals.

ind <- get_mca_ind(res.mca)

ind

Multiple Correspondence Analysis Results for individuals
 ===================================================
  Name       Description                       
1 "$coord"   "Coordinates for the individuals" 
2 "$cos2"    "Cos2 for the individuals"        
3 "$contrib" "contributions of the individuals"

The following table shows the quality of representation of the first few individuals on each of the dimensions. Teh quality of representation is low, but it seems normal with more than 9000 individuals.

head(ind$cos2)

       Dim 1      Dim 2       Dim 3        Dim 4       Dim 5
1 0.18927320 0.01985636 0.013295201 0.0427709980 0.051268598
2 0.42784689 0.05153570 0.011548365 0.0003802403 0.060352387
3 0.26607336 0.02267226 0.035181580 0.1081329399 0.001559895
4 0.07712656 0.12171939 0.027984932 0.0172307050 0.004742808
5 0.09180824 0.17111081 0.005836723 0.0721050205 0.009465591
6 0.04586928 0.02825110 0.011520517 0.1159209085 0.035885994

Next, we show the contributions of the top 20 individuals (again, it is relatively low, but it is explained by the quantity of the data).

fviz_contrib(res.mca, choice = "ind", axes = 1:2, top = 20)

Finally, we show the individuals by two color different showing their job status.

fviz_mca_ind(res.mca, label = "none", habillage=job, title="MCA individues according to their job status", addEllipses = TRUE, ellipse.level = 0.95)

Then we show the individuals and the categories together.

plot(res.mca, label = c("quali.sup", "var"), select = "cos2 10", cex=1, col.var = "darkblue", col.quali.sup = "brown3",  col.ind = "seashell3", title = "Individuals with active and supplementary categories", cex.main=2, col.main= "darkblue")

Clustering

In this section, I do a Hierarchical clustering, then a K-means analysis.

Clustering is a technique for finding similarity groups in a data, called clusters. It attempts to group individuals in a population together by similarity, but not driven by a specific purpose, therefore it is often called an unsupervised learning, as we don???t have prescribed labels in the data and no class values denoting a priori grouping of the data instances are given.

We can estimate the similarity between two data instances by the distance between them. We can use

euclidean (pythagorean) distance
manhattan (sum of absolute differences between coordinates) distance
mahalanobis distance (distance from the mean by standard deviation)
Pearson correlation or Spearman correlation.

In the following, I use Euclidean distance.

Given a set of 100 items to be clustered, and an 100 x 100 distance (or similarity) matrix, the basic process of the hierarchical clustering is the following:

We assign each item to its own cluster, so we have 100 clusters. each containing one item. For now, therefore, the distances between the clusters are equal to the distances between the items.
Next, we find the 2 closest clusters and we merge them, so that we have one cluster with two items now. Now we have 99 (100-1) clusters.
We compute again the distances between this new cluster and all of the other clusters.
We repeate the second and third steps until we have a single cluster of size 100.

We see that really, the basic idea of clustering is really really simple.

Now let’s implement it in practice:

Firstly, I create a randomly selected subset of the data for a smaller, more visible clustering.

# I choose the observations in France
data_cluster <- data[data[, "country"] == "FR",]
# Gives us always the same subset 
set.seed(330)
# Sample for random rows
random_rows <- sample(rownames(data_cluster), size = 100, replace = FALSE)
data_hcpc <- data[random_rows, 2:9]

And now we make the MCA plots:

res2.mca <- MCA(data_hcpc, quali.sup = 8, ncp = 5, graph = FALSE)
res.hcpc <- HCPC(res2.mca, min = 3, nb.clust = -1, graph = TRUE)
# Keep the axis for kmeans 
km=data.frame(res2.mca$ind$coord)

We see that the clustering proposes three clusters.
To see what separates most these clusters, we ask R to show us some individual observations of each clusters:

names(res.hcpc)

[1] "data.clust" "desc.var"   "desc.axes"  "call"       "desc.ind"

On affiche des individus de chaque cluster pour mieux comprendre l’analyse.

print(res.hcpc$desc.ind)

$para
Cluster: 1
     3696      4199      3995      3890      3453 
0.3615251 0.3802195 0.4362871 0.5758796 0.6145176 
------------------------------------------------------------------------------------------------------------ 
Cluster: 2
     3744      3555      4013      4268      3568 
0.4524800 0.4533426 0.4708304 0.5124976 0.5155082 
------------------------------------------------------------------------------------------------------------ 
Cluster: 3
     3701      4358      4042      3625      3387 
0.2743815 0.3004191 0.4752501 0.5307634 0.5432425 

$dist
Cluster: 1
    3796     3810     4184     4139     3484 
1.820734 1.820734 1.627126 1.605598 1.508880 
------------------------------------------------------------------------------------------------------------ 
Cluster: 2
    4225     3850     4021     4326     4072 
2.005300 2.005300 1.668767 1.551963 1.516803 
------------------------------------------------------------------------------------------------------------ 
Cluster: 3
    3992     3545     4397     3766     4358 
2.444105 2.121605 1.757267 1.630860 1.451258

For the first cluster, we show four individuals randomly:

c1 <- c(which(rownames(data_hcpc) == 3696), which(rownames(data_hcpc) == 4199), which(rownames(data_hcpc) == 3995), which(rownames(data_hcpc) == 3890))
cl1 <- data_hcpc[c1,]
print(cl1)

	gender	rural	education	job	children	awareness	vote	age_group
3696	male	urban	medium	yes	yes	I have never heard of it	I would probably vote against it	aged
4199	male	urban	medium	yes	yes	I have heard just a little about it	I would not vote	aged
3995	female	urban	high	yes	yes	I have never heard of it	I would vote against it	adult
3890	female	urban	medium	no	yes	I have heard just a little about it	I would probably vote against it	aged

We see that the first cluster contain observations of mostly employed people who would probably vote against basic income and have not heard much about it. We do the same for the second cluster:

c2 <- c(which(rownames(data_hcpc) == 3744) , which(rownames(data_hcpc) == 3555), which(rownames(data_hcpc) == 4013), which(rownames(data_hcpc) == 4268))
cl2 <- data_hcpc[c2,]
print(cl2)

	gender	rural	education	job	children	awareness	vote	age_group
3744	male	urban	low	yes	yes	I know something about it	I would vote for it	aged
3555	female	urban	medium	yes	yes	I understand it fully	I would probably vote for it	aged
4013	female	urban	high	yes	no	I know something about it	I would vote for it	young
4268	female	urban	high	no	yes	I understand it fully	I would probably vote for it	aged

We see that the second cluster contains individuals who know more about the basic income and would vote for it, mostly females living in a urban area, with children.

And finally for the third cluster:

c3 <- c(which(rownames(data_hcpc) == 3701) , which(rownames(data_hcpc) == 4358), which(rownames(data_hcpc) == 4042), which(rownames(data_hcpc) == 3625))
cl3 <- data_hcpc[c3,]
print(cl3)

	gender	rural	education	job	children	awareness	vote	age_group
3701	female	rural	medium	no	no	I have never heard of it	I would not vote	young
4358	male	rural	low	no	no	I have never heard of it	I would not vote	young
4042	male	rural	low	no	no	I know something about it	I would not vote	young
3625	male	urban	medium	no	no	I have never heard of it	I would not vote	young

The third cluster contains the observations of people who would not vote at all, who have ni children, ni job.

K-means

K-means clustering is also a type of unsupervised learning. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.

It can be a good idea to ask 3 clusters in the K-means analysis, like this, we could compare the results of the Hierarchical Clustering and K-means.

set.seed(333)
groupes.kmeans <- kmeans(km, centers = 3, nstart = 5)

print(groupes.kmeans)

K-means clustering with 3 clusters of sizes 19, 63, 18

Cluster means:
       Dim.1       Dim.2       Dim.3      Dim.4        Dim.5
1  0.8876059  0.10548369  0.07053484 -0.1326389  0.008544427
2 -0.1605333 -0.09936673 -0.19486188  0.1266472 -0.079478257
3 -0.3750506  0.23643967  0.60756314 -0.3032576  0.269154784

Clustering vector:
4503 4139 4517 4368 4048 4235 4219 4076 4222 4321 3870 4155 3384 3798 3810 4233 4174 4042 3544 3768 4286 3935 3545 3991 3886 4199 4195 3766 
   2    3    2    3    2    3    2    3    2    3    1    2    2    2    3    2    2    1    2    2    2    2    1    2    1    3    2    1 
4026 3555 3821 3563 4397 3696 3970 3598 3509 3944 3890 4072 4358 3353 3992 4265 3728 4378 4021 4247 3532 4107 4523 3625 3605 3742 4149 4405 
   2    2    2    2    1    3    2    2    2    2    3    2    1    1    1    2    1    2    2    1    2    1    2    1    2    2    2    3 
3744 3904 3332 3453 4326 3727 3701 3869 3685 4011 3570 3505 4268 4184 4024 3968 3843 4427 3484 3604 3740 4256 3885 3796 3683 4225 3394 3607 
   2    2    1    3    2    3    1    2    2    2    2    3    2    3    2    2    1    2    3    2    1    2    2    3    2    2    2    2 
4160 4539 4469 3568 3480 3850 3995 3840 4013 3649 3859 3387 4410 4278 3488 3370 
   3    1    2    2    2    2    3    2    2    2    2    1    2    2    2    2 

Within cluster sum of squares by cluster:
[1] 12.60606 44.27574 15.51596
 (between_SS / total_SS =  32.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"

We see that R proposes 3 clusters with 19, 63, 18 observations respectively.

We can also show the optimal number of clusters. We see that after the third cluster, the total within sum of quares gained is lower then before therefore we can keep 3 clusters for the analysis.

fviz_nbclust(km, hcut, method = "wss")

Finally, we represent our K-means also on a graph:

fviz_cluster(groupes.kmeans, data = km, palette = "jco", repel= TRUE, main = "Kmeans", ggtheme = theme_classic())

Well we see that the clusters do not actually show the clear seperation of the data.

dist.hcpc <- dist(km)
cah.ward <- hclust(dist.hcpc, method = "ward.D2")
cah.ward

Call:
hclust(d = dist.hcpc, method = "ward.D2")

Cluster method   : ward.D2 
Distance         : euclidean 
Number of objects: 100

Finally we compare the two types of Clustering:

groupes.cah <- cutree(cah.ward, k=3)
print(table(groupes.cah, groupes.kmeans$cluster))

groupes.cah  1  2  3
          1  6 63  1
          2  0  0 15
          3 13  0  2

On voit que:

cluster 1 (k-means) corresponds to cluster 2 (CAH) with a difference of 7 observations
cluster 2 (k-means) correspond to cluster 3 exactly
cluster 3 (k-means) corresponds to cluster 1 with a difference of 2 observations.

2 thoughts on “Multiple correspondence analysis, Clustering and Tandem Analysis through a basic income analysis example”

Add yours

Hai Belveal says:

September 1, 2020 at 11:32 am

Great post, you have pointed out some great details , I too believe this s a very fantastic website.

LikeLike

Brooks says:

January 28, 2022 at 4:15 pm

We stumbled over here by a different page and thought I might check things out.
I like what I see so now i am following you. Look forward to looking at your
web page yet again.

LikeLike

Multiple correspondence analysis, Clustering and Tandem Analysis through a basic income analysis example

Basic income

Data

Calling the dataset

Data cleaning

Summary of the data

MCA – Multiple Correspondance Analysis

Eigenvalues

MCA plots

Analyses of the variables

Variables’ contributions to each dimensions

Plot of active categories

Individuals

Clustering

K-means

2 thoughts on “Multiple correspondence analysis, Clustering and Tandem Analysis through a basic income analysis example”

Add yours

Leave a comment Cancel reply

Author: Reka Halmai

Basic income

Data

Calling the dataset

Data cleaning

Summary of the data

MCA – Multiple Correspondance Analysis

Eigenvalues

MCA plots

Analyses of the variables

Variables’ contributions to each dimensions

Plot of active categories

Individuals

Clustering

K-means

Share this:

Related

2 thoughts on “Multiple correspondence analysis, Clustering and Tandem Analysis through a basic income analysis example”

Add yours

Leave a comment Cancel reply

Author: Reka Halmai