Several basic income experiments were launched in 2017. Finland started a two-year experiment giving 2,000 unemployed citizens approximately $600 a month. In Silicon Valley, Y Combinator announced in mid-2016 that it would begin paying monthly stipends of $1,000 to $2,000 to 100 families in Oakland, while in Utrecht, in the Netherlands, 250 Dutch citizens will receive about $1,100 per month. These are just three of the experiments already launched; their aim is to test how basic income could provide a new structure for social security, and to see how people's productivity changes when they receive a guaranteed income.
But what do people think about basic income? Are we supportive of it, or do we fear it? Who is most likely to vote for it? Do people differ, according to their education or job status, in how much they are for or against the idea? This study aims to answer these questions using an unsupervised approach: a tandem analysis (MCA followed by clustering) that groups people according to their characteristics and their opinion of basic income.
Basic income
Data
This dataset contains information from a survey on basic income, conducted in Europe in 2016. The results show that two thirds of Europeans would vote for a universal basic income.
The goal of this article is to show which type of person is more or less likely to vote in favour of basic income, and to show which individual characteristics affect this decision the most.
Then, I perform a cluster analysis (on a subset of only 100 individuals, since the full dataset of 9,649 observations makes the clusters difficult to visualise).
Data source: https://www.kaggle.com/daliaresearch/basic-income-survey-european-dataset/data
Loading the dataset
rm(list = ls())
# Read the survey data; factors are needed for the level recoding and the MCA below
income <- read.csv("/Users/mehdidellagi/Downloads/basic_income_dataset_dalia.csv", stringsAsFactors = TRUE)
Data cleaning
I organise the data in the following structure:
I keep only the following variables (all of them categorical)
- country
- gender
- rural
- dem_education_level (rename it education)
- dem_full_time_job (rename it job)
- dem_has_children (rename it children)
- question_bbi_2016wave4_basicincome_awareness (rename it awareness)
- question_bbi_2016wave4_basicincome_vote (rename it vote)
I also recode the levels of the categorical variable ‘age_group’:
- age_group
data <- income[, c(1, 4:10, 14)]
# Renaming the variables
colnames(data)[1] <- "country"
colnames(data)[4] <- "education"
colnames(data)[5] <- "job"
colnames(data)[6] <- "children"
colnames(data)[7] <- "awareness"
colnames(data)[8] <- "vote"
# The variable "age_group"
levels(data$age_group)[levels(data$age_group)=="14_25"] <- "young"
levels(data$age_group)[levels(data$age_group)=="26_39"] <- "adult"
levels(data$age_group)[levels(data$age_group)=="40_65"] <- "aged"
summary(data)
##     country         gender        rural       education     job      
##  DE     :1420   female:4555   rural:2771   high  :3270   no :3947  
##  FR     :1248   male  :5094   urban:6878   low   :1815   yes:5702  
##  GB     :1199                              medium:3578             
##  IT     :1138                              no    : 323             
##  ES     :1005                              NA's  : 663             
##  PL     : 860                                                      
##  (Other):2779                                                      
##  children                                awareness    
##  no :4600   I have heard just a little about it:2305  
##  yes:5049   I have never heard of it            :1566  
##             I know something about it           :3389  
##             I understand it fully               :2389  
##                                                        
##                                 vote      age_group   
##  I would not vote                :1159   young:1940  
##  I would probably vote against it:1439   adult:2813  
##  I would probably vote for it    :3272   aged :4896  
##  I would vote against it         : 960               
##  I would vote for it             :2819               
attach(data)
Summary of the data
for (i in 1:ncol(data)) {
plot(data[,i], main=colnames(data)[i],
ylab = "Count", col="darkblue", las = 1, col.main = "darkblue")
}
MCA – Multiple Correspondence Analysis
Multiple correspondence analysis (MCA) is an extension of simple correspondence analysis (see the PCA article) for summarizing and visualizing a data table containing more than two categorical variables. It can also be seen as a generalization of principal component analysis when the variables to be analyzed are categorical rather than quantitative. I do not explain MCA in more detail here, as the idea is analogous to that of PCA; for further explanation, see: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/.
I choose ‘country’ and ‘age_group’ as supplementary (qualitative) variables.
#install.packages("FactoMineR")
#install.packages("factoextra")
#library("FactoMineR")
#library("factoextra")
res.mca <- MCA(data, quali.sup = c(1, 9), ncp = 5, graph = FALSE)
Eigenvalues
eigenvalues <- get_eigenvalue(res.mca)
head(round(eigenvalues, 2))
      eigenvalue variance.percent cumulative.variance.percent
Dim.1       0.24            11.23                       11.23
Dim.2       0.18             8.31                       19.54
Dim.3       0.17             7.88                       27.42
Dim.4       0.15             7.13                       34.56
Dim.5       0.15             6.91                       41.47
Dim.6       0.15             6.90                       48.36
We see that the variance explained by the first two dimensions is quite weak; however, given the size and the nature of the data, I consider it acceptable.
We can also graph the variance explained by the dimensions:
fviz_screeplot(res.mca)
MCA plots
Analysis of the variables
Variables’ contributions to each dimension
fviz_contrib(res.mca, choice = "var", axes = 1)
We see that the following categories are important for the first dimension:
1. ‘I would not vote’
2. ‘I have never heard of it’
3. ‘job_no’
4. ‘job_yes’
5. ‘education_high’
And for the second dimension:
fviz_contrib(res.mca, choice = "var", axes = 2)
The categories reflecting comprehension of basic income and willingness to vote contribute more than average to the second dimension.
Now, we show the quality of representation of each category. We see that it is relatively low, which might be explained by the volume of the data.
fviz_cos2(res.mca, choice = "var", axes = 1:2)
Plot of active categories
plot(res.mca, invisible = c("quali.sup", "ind"), cex=1, col.var = "darkblue", title = "Active categories", cex.main=2, col.main= "darkblue")
We see that categories such as low education, no job and no prior knowledge of basic income are very close to the category ‘I would not vote’. These categories are positive on both dimensions. (The category ‘education.NA’ sits between the categories ‘education_no’ and ‘education_low’, which is quite logical, as people with a lower level of education may be less likely to disclose it.)
The categories ‘education_high’, ‘urban’ and ‘I would vote against it’ are negative on the first dimension but positive on the second. This suggests that the further we move to the right along the first dimension, the less educated the individuals are, the less they know about basic income, and the more likely they are to be unemployed.
Finally, the second dimension seems to separate people by how certain they are about voting: higher on the second dimension we find the individuals who are sure about voting (for or against), whereas below we find the individuals who are not certain they would vote at all.
Individuals
Firstly, extract the individuals.
ind <- get_mca_ind(res.mca)
ind
Multiple Correspondence Analysis Results for individuals
 ===================================================
  Name       Description                       
1 "$coord"   "Coordinates for the individuals" 
2 "$cos2"    "Cos2 for the individuals"        
3 "$contrib" "contributions of the individuals"
The following table shows the quality of representation of the first few individuals on each of the dimensions. The quality of representation is low, but this seems normal with more than 9,000 individuals.
head(ind$cos2)
       Dim 1      Dim 2       Dim 3        Dim 4       Dim 5
1 0.18927320 0.01985636 0.013295201 0.0427709980 0.051268598
2 0.42784689 0.05153570 0.011548365 0.0003802403 0.060352387
3 0.26607336 0.02267226 0.035181580 0.1081329399 0.001559895
4 0.07712656 0.12171939 0.027984932 0.0172307050 0.004742808
5 0.09180824 0.17111081 0.005836723 0.0721050205 0.009465591
6 0.04586928 0.02825110 0.011520517 0.1159209085 0.035885994
Next, we show the contributions of the top 20 individuals (again, they are relatively low, which is explained by the quantity of data).
fviz_contrib(res.mca, choice = "ind", axes = 1:2, top = 20)
Finally, we show the individuals in two different colours according to their job status.
fviz_mca_ind(res.mca, label = "none", habillage = job, title = "MCA individuals according to their job status", addEllipses = TRUE, ellipse.level = 0.95)
Then we show the individuals and the categories together.
plot(res.mca, label = c("quali.sup", "var"), select = "cos2 10", cex=1, col.var = "darkblue", col.quali.sup = "brown3", col.ind = "seashell3", title = "Individuals with active and supplementary categories", cex.main=2, col.main= "darkblue")
Clustering
In this section, I perform a hierarchical clustering, followed by a K-means analysis.
Clustering is a technique for finding similarity groups in data, called clusters. It attempts to group individuals in a population together by similarity, without being driven by a specific target; this is why it is called unsupervised learning, as we don't have prescribed labels in the data and no class values denoting an a priori grouping of the data instances are given.
We can estimate the similarity between two data instances by the distance between them. We can use:
- Euclidean (Pythagorean) distance
- Manhattan distance (sum of absolute differences between coordinates)
- Mahalanobis distance (distance that accounts for the variance and correlation of the variables)
- Pearson or Spearman correlation.
In the following, I use Euclidean distance.
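As a quick illustration of these distance measures (on a toy numeric matrix, not on the survey data), base R's dist() computes the Euclidean and Manhattan versions directly:

# Toy illustration: pairwise distances between three points in two dimensions
pts <- matrix(c(0, 0,
                3, 4,
                6, 8), ncol = 2, byrow = TRUE)
dist(pts, method = "euclidean")   # e.g. d(p1, p2) = sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")   # e.g. d(p1, p2) = |3 - 0| + |4 - 0| = 7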
Given a set of 100 items to be clustered and a 100 x 100 distance (or similarity) matrix, the basic process of hierarchical clustering is the following:
- We assign each item to its own cluster, so we have 100 clusters, each containing one item. At this point, the distances between the clusters are simply the distances between the items.
- Next, we find the 2 closest clusters and merge them, so one cluster now contains two items. We now have 99 (100 - 1) clusters.
- We recompute the distances between this new cluster and all of the other clusters.
- We repeat the second and third steps until we have a single cluster of size 100.
The basic idea of hierarchical clustering is really quite simple.
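To make the agglomerative process concrete, here is a minimal sketch on five toy points (the data are arbitrary, purely for demonstration); the merge history stored in the hclust object is exactly the sequence of steps described above:

# Tiny illustration of the agglomerative process on 5 random points
set.seed(1)
toy <- matrix(rnorm(10), ncol = 2)        # 5 points in two dimensions
d   <- dist(toy)                          # 5 x 5 Euclidean distance matrix
hc  <- hclust(d, method = "ward.D2")      # repeatedly merge the two closest clusters
hc$merge                                  # which clusters were merged at each step
hc$height                                 # the distance at which each merge happened
plot(hc, main = "Toy dendrogram")         # the full merge history as a tree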
Now let’s implement it in practice:
Firstly, I create a randomly selected subset of the data for a smaller, more visible clustering.
# I choose the observations in France
data_cluster <- data[data[, "country"] == "FR",]
# Set the seed so we always get the same subset
set.seed(330)
# Sample 100 random rows
random_rows <- sample(rownames(data_cluster), size = 100, replace = FALSE)
data_hcpc <- data[random_rows, 2:9]
And now we run the MCA and the hierarchical clustering on principal components (HCPC):
res2.mca <- MCA(data_hcpc, quali.sup = 8, ncp = 5, graph = FALSE)
res.hcpc <- HCPC(res2.mca, min = 3, nb.clust = -1, graph = TRUE)
# Keep the MCA coordinates of the individuals for the K-means analysis
km <- data.frame(res2.mca$ind$coord)
We see that the clustering proposes three clusters.
To see what separates these clusters the most, we ask R to show us some individual observations from each cluster:
names(res.hcpc)
[1] "data.clust" "desc.var" "desc.axes" "call" "desc.ind"
We display some individuals from each cluster to better understand the analysis.
print(res.hcpc$desc.ind)
$para
Cluster: 1
     3696      4199      3995      3890      3453 
0.3615251 0.3802195 0.4362871 0.5758796 0.6145176 
--------------------------------------------------
Cluster: 2
     3744      3555      4013      4268      3568 
0.4524800 0.4533426 0.4708304 0.5124976 0.5155082 
--------------------------------------------------
Cluster: 3
     3701      4358      4042      3625      3387 
0.2743815 0.3004191 0.4752501 0.5307634 0.5432425 

$dist
Cluster: 1
    3796     3810     4184     4139     3484 
1.820734 1.820734 1.627126 1.605598 1.508880 
--------------------------------------------------
Cluster: 2
    4225     3850     4021     4326     4072 
2.005300 2.005300 1.668767 1.551963 1.516803 
--------------------------------------------------
Cluster: 3
    3992     3545     4397     3766     4358 
2.444105 2.121605 1.757267 1.630860 1.451258 
For the first cluster, we show four of its most representative individuals (the paragons listed above):
c1 <- c(which(rownames(data_hcpc) == 3696), which(rownames(data_hcpc) == 4199), which(rownames(data_hcpc) == 3995), which(rownames(data_hcpc) == 3890))
cl1 <- data_hcpc[c1,]
print(cl1)
|      | gender | rural | education | job | children | awareness | vote | age_group |
|------|--------|-------|-----------|-----|----------|-----------|------|-----------|
| 3696 | male   | urban | medium | yes | yes | I have never heard of it            | I would probably vote against it | aged  |
| 4199 | male   | urban | medium | yes | yes | I have heard just a little about it | I would not vote                 | aged  |
| 3995 | female | urban | high   | yes | yes | I have never heard of it            | I would vote against it          | adult |
| 3890 | female | urban | medium | no  | yes | I have heard just a little about it | I would probably vote against it | aged  |
We see that the first cluster contains observations of mostly employed people who would probably vote against basic income and have not heard much about it. We do the same for the second cluster:
c2 <- c(which(rownames(data_hcpc) == 3744) , which(rownames(data_hcpc) == 3555), which(rownames(data_hcpc) == 4013), which(rownames(data_hcpc) == 4268))
cl2 <- data_hcpc[c2,]
print(cl2)
|      | gender | rural | education | job | children | awareness | vote | age_group |
|------|--------|-------|-----------|-----|----------|-----------|------|-----------|
| 3744 | male   | urban | low    | yes | yes | I know something about it | I would vote for it          | aged  |
| 3555 | female | urban | medium | yes | yes | I understand it fully     | I would probably vote for it | aged  |
| 4013 | female | urban | high   | yes | no  | I know something about it | I would vote for it          | young |
| 4268 | female | urban | high   | no  | yes | I understand it fully     | I would probably vote for it | aged  |
We see that the second cluster contains individuals who know more about basic income and would vote for it, mostly women living in an urban area, with children.
And finally for the third cluster:
c3 <- c(which(rownames(data_hcpc) == 3701) , which(rownames(data_hcpc) == 4358), which(rownames(data_hcpc) == 4042), which(rownames(data_hcpc) == 3625))
cl3 <- data_hcpc[c3,]
print(cl3)
|      | gender | rural | education | job | children | awareness | vote | age_group |
|------|--------|-------|-----------|-----|----------|-----------|------|-----------|
| 3701 | female | rural | medium | no | no | I have never heard of it  | I would not vote | young |
| 4358 | male   | rural | low    | no | no | I have never heard of it  | I would not vote | young |
| 4042 | male   | rural | low    | no | no | I know something about it | I would not vote | young |
| 3625 | male   | urban | medium | no | no | I have never heard of it  | I would not vote | young |
The third cluster contains observations of people who would not vote at all, and who have neither children nor a job.
K-means
K-means clustering is also a type of unsupervised learning. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
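To make the "iterative assignment" idea concrete, here is a minimal, hand-rolled sketch of the two alternating K-means steps. Note that simple_kmeans is a hypothetical helper written only for illustration; the analysis itself relies on R's built-in kmeans().

# A bare-bones K-means loop: assign points to the nearest centroid, then
# recompute each centroid as the mean of its points (sketch, not production code)
simple_kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]          # random initial centroids
  cluster <- rep(1L, nrow(x))
  for (i in seq_len(iters)) {
    # assignment step: Euclidean distance from every point to every centroid
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    cluster <- max.col(-d)                                   # nearest centroid per point
    if (length(unique(cluster)) < k) break                   # stop if a cluster empties (sketch only)
    # update step: centroid = mean of the points assigned to it
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster = cluster, centers = centers)
}

Running, for example, simple_kmeans(km, k = 3) on the MCA coordinates would give a rough analogue of the kmeans() call used below.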
It can be a good idea to request 3 clusters in the K-means analysis; this way, we can compare the results of the hierarchical clustering and the K-means.
set.seed(333)
groupes.kmeans <- kmeans(km, centers = 3, nstart = 5)
print(groupes.kmeans)
K-means clustering with 3 clusters of sizes 19, 63, 18

Cluster means:
       Dim.1       Dim.2       Dim.3      Dim.4        Dim.5
1  0.8876059  0.10548369  0.07053484 -0.1326389  0.008544427
2 -0.1605333 -0.09936673 -0.19486188  0.1266472 -0.079478257
3 -0.3750506  0.23643967  0.60756314 -0.3032576  0.269154784

Clustering vector:
4503 4139 4517 4368 4048 4235 4219 4076 4222 4321 3870 4155 3384 3798 3810 4233 4174 4042 3544 3768 4286 3935 3545 3991 3886 4199 4195 3766 
   2    3    2    3    2    3    2    3    2    3    1    2    2    2    3    2    2    1    2    2    2    2    1    2    1    3    2    1 
4026 3555 3821 3563 4397 3696 3970 3598 3509 3944 3890 4072 4358 3353 3992 4265 3728 4378 4021 4247 3532 4107 4523 3625 3605 3742 4149 4405 
   2    2    2    2    1    3    2    2    2    2    3    2    1    1    1    2    1    2    2    1    2    1    2    1    2    2    2    3 
3744 3904 3332 3453 4326 3727 3701 3869 3685 4011 3570 3505 4268 4184 4024 3968 3843 4427 3484 3604 3740 4256 3885 3796 3683 4225 3394 3607 
   2    2    1    3    2    3    1    2    2    2    2    3    2    3    2    2    1    2    3    2    1    2    2    3    2    2    2    2 
4160 4539 4469 3568 3480 3850 3995 3840 4013 3649 3859 3387 4410 4278 3488 3370 
   3    1    2    2    2    2    3    2    2    2    2    1    2    2    2    2 

Within cluster sum of squares by cluster:
[1] 12.60606 44.27574 15.51596
 (between_SS / total_SS =  32.5 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
We see that the three K-means clusters contain 19, 63, and 18 observations respectively.
We can also examine the optimal number of clusters. We see that after the third cluster, the reduction in the total within-cluster sum of squares is much smaller than before, so we keep 3 clusters for the analysis.
fviz_nbclust(km, hcut, method = "wss")
Finally, we represent our K-means also on a graph:
fviz_cluster(groupes.kmeans, data = km, palette = "jco", repel= TRUE, main = "Kmeans", ggtheme = theme_classic())
We see that the clusters do not show a very clear separation of the data. To compare with the hierarchical approach, we also run Ward's hierarchical clustering directly on the MCA coordinates:
dist.hcpc <- dist(km)
cah.ward <- hclust(dist.hcpc, method = "ward.D2")
cah.ward
Call:
hclust(d = dist.hcpc, method = "ward.D2") 

Cluster method   : ward.D2 
Distance         : euclidean 
Number of objects: 100 
Finally, we compare the two clusterings:
groupes.cah <- cutree(cah.ward, k=3)
print(table(groupes.cah, groupes.kmeans$cluster))
groupes.cah  1  2  3
          1  6 63  1
          2  0  0 15
          3 13  0  2
We see that:
- cluster 1 (K-means) corresponds to cluster 3 (CAH), with a difference of 6 observations
- cluster 2 (K-means) corresponds exactly to cluster 1 (CAH): all 63 observations coincide
- cluster 3 (K-means) corresponds to cluster 2 (CAH), with a difference of 3 observations.
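As a rough way to quantify this agreement, we can match each CAH cluster with its best-fitting K-means cluster and compute the share of observations on which the two partitions then agree (a simple purity-style measure; an adjusted Rand index would be a more principled alternative):

# Share of observations on which the two partitions agree,
# matching each CAH cluster to its best-fitting K-means cluster
tab <- table(groupes.cah, groupes.kmeans$cluster)
sum(apply(tab, 1, max)) / sum(tab)   # from the table above: (63 + 15 + 13) / 100 = 0.91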