Principal Component Analysis through the Happiness Index example
What determines happiness? Why are some countries happier than others? In 2017, Norway topped the global happiness ranking, published annually by the United Nations Sustainable Development Solutions Network. In this article, we use their data to show the correlations between the variables used in this index, and we analyse the countries with the help of Principal Component Analysis.
In this section, we get familiar with Principal Component Analysis (PCA) through an example that analyses data from the happiness index, available here: https://www.kaggle.com/unsdsn/world-happiness.
Load the data, keeping in mind that the first row holds the column names (the read.csv default) and that row.names = 1 turns the first column (the countries) into row names.
happiness_2017 <- read.csv("/Users/mokuska/Documents/Site_Web/Happiness/2017.csv", row.names = 1)
We need the following packages:
pkgs <- c("FactoMineR", "factoextra", "corrplot", "mice")
# install.packages(pkgs)
library("FactoMineR") # Multivariate data analysis package
library("factoextra") # For the PCA graphs
library("corrplot")   # For correlation graphs/plots
library("mice")       # Checking for missing values
A bit of data cleaning…
Firstly, we clean the dataset and get rid of the columns that we cannot use. Secondly, we check the data for any missing values with the md.pattern function.
act_col <- c(2, 5:10)
happiness_new <- happiness_2017[, act_col]
md.pattern(happiness_new)
##      Happiness.Score Economy..GDP.per.Capita. Family
## [1,]               1                        1      1
## [2,]               0                        0      0
##      Health..Life.Expectancy. Freedom Generosity
## [1,]                        1       1          1
## [2,]                        0       0          0
##      Trust..Government.Corruption.
## [1,]                             1 0
## [2,]                             0 0
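As a cross-check that does not depend on mice, missing values can also be counted in base R. The sketch below runs on a small made-up data frame (`toy` and its values are hypothetical, they are not the happiness data), so take it as an illustration of the idea rather than part of the original analysis.

```r
# Hypothetical toy data standing in for happiness_new; the values are made up.
toy <- data.frame(
  Happiness.Score = c(7.5, 6.1, NA, 4.2),
  Family          = c(1.5, 1.3, 1.1, NA),
  Freedom         = c(0.6, 0.5, 0.4, 0.3)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that have no missing value at all
complete_rows <- sum(complete.cases(toy))

print(na_per_column)
print(complete_rows)
```

On the real data, the same call would be colSums(is.na(happiness_new)); a vector of zeros would agree with the md.pattern output above.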
Now that our data is clean and tidy, we are ready to start our Principal Component Analysis. But what is Principal Component Analysis?
PCA – a bit of explanation
PCA finds the principal components of the data. But what does that even mean? PCA can be used to reduce the dimensions of a data set. Dimension reduction is analogous to being philosophically reductionist: it reduces the data down to its basic components, stripping away any unnecessary parts.
PCA is a technique that finds underlying variables (known as principal components) that best differentiate our data points. Principal components are the directions along which the data points are most spread out, that is, the directions of greatest variance; they are the underlying structure in the data. This is easiest to explain by way of an example.
Imagine that you want to explain to somebody what a “happy country” means: you want to show them which nations are the happiest and explain why. What do you say? People are happy because they can afford their basic needs? People are happy because they are free? Because they have access to education? Healthcare? Because they are family-centred, or simply because of their culture?
Knowing the variables that best differentiate our data has several advantages:
1. Easier visualisation, since plotting the data along the right variables makes it easier to see and understand.
2. More visible clusters.
So let’s try to explain to somebody what we think a happy country is.
First of all, it is a good idea to show how the variables relate to each other. The correlation matrix shows the correlation of each variable with every other one. For instance, we see that the Economy (GDP per capita) score and Health (life expectancy) are highly correlated, while no two variables are negatively correlated with one another.
M <- cor(happiness_new)
corrplot(M, method = "ellipse")
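Reading a correlation plot by eye works well, but the same questions (which pair is most correlated, are there any negative correlations?) can also be answered in code. Here is a minimal base R sketch; it uses the built-in mtcars data purely as a stand-in, and the names `M_demo`, `M_upper`, `strongest_pair` and `n_negative` are only illustrative.

```r
# mtcars is a built-in data set, used here only as a stand-in for happiness_new
M_demo <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Keep each pair once by blanking the diagonal and the lower triangle
M_upper <- M_demo
M_upper[lower.tri(M_upper, diag = TRUE)] <- NA

# Locate the strongest positive correlation
idx <- which(M_upper == max(M_upper, na.rm = TRUE), arr.ind = TRUE)
strongest_pair <- c(rownames(M_upper)[idx[1, "row"]],
                    colnames(M_upper)[idx[1, "col"]])

# Count the pairs of variables that are negatively correlated
n_negative <- sum(M_upper < 0, na.rm = TRUE)

print(strongest_pair)
print(n_negative)
```

Replacing M_demo with the matrix M from above would apply the same check to the happiness variables.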
Asking for PCA
The FactoMineR package makes it very simple to run a principal component analysis: a single line of code is enough. But what is going on behind this command?
In this example we are measuring seven things: Happiness score, GDP per capita, Family, Health, Freedom, Generosity and Trust in Government (Corruption). There are 7 variables, so it is a 7-dimensional data set. For simplicity, think about only 3 variables first, say GDP per capita, Family and Health. We would need 3 dimensions to plot our data: GDP on the x axis, Family on the y axis and Health on the z axis. Now imagine that the data points form a flat, oval-shaped cloud, so that most of them lie close to a single plane within this 3D graph. We can then visualise the data by looking at a 2D graph of that plane, with minimal loss of information. Mathematically, we are interested in finding the eigenvectors and eigenvalues of the correlation matrix; check this video for an intuitive introduction: https://www.youtube.com/watch?v=PFDu9oVAE-g. (I strongly recommend all the videos of this channel; they make maths really fun.)
Now imagine that our 7-dimensional data set forms an abstract object that we can model in only 6 dimensions without a large loss of information. You might ask yourself: why would I want to do that? The reason is simple: we cannot do much with 7 dimensions, we cannot even visualise them. So we want to reduce our dimensions to two, so that we can comfortably look at our data with a minimal loss of information.
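The 3D picture can be tried out on simulated data. In the sketch below, the three variables x, y and z are made up (they are not GDP, Family and Health), and z is built to be almost a combination of the other two, so the cloud lies close to a plane. The eigenvalues of the correlation matrix then show that the third direction carries almost no variance.

```r
set.seed(1)
n <- 200

# Simulated stand-in data: z is almost fully determined by x and y,
# so the points lie close to a plane in 3D
x <- rnorm(n)
y <- rnorm(n)
z <- x + y + rnorm(n, sd = 0.05)
toy3d <- data.frame(x, y, z)

# Eigen-decomposition of the correlation matrix
e <- eigen(cor(toy3d))

# Share of the total variance carried by each direction
share <- e$values / sum(e$values)
print(round(share, 3))
```

The columns of e$vectors are the directions themselves; the first two span the plane on which we would draw the 2D graph.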
happy.pca <- PCA(happiness_new, graph = FALSE)
So here you go, our famous eigenvalues. The following table also shows the variance explained by each dimension. What does that mean? Simply that the first dimension explains 54.2% of the total variation of the data, the second one explains 19.26%, and so on. The variance explained decreases from one dimension to the next, which does not surprise us: the goal of PCA is to show as much information as possible on the first axis, then as much as possible on the second given what the first already captures (so we want the second dimension to be uncorrelated with the first), and so on.
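Both properties, decreasing variance and uncorrelated dimensions, are easy to check numerically. The sketch below uses base R's prcomp on the built-in USArrests data as a stand-in (it is not the FactoMineR call used in this article, and `pca_demo` is just an illustrative name).

```r
# USArrests is a built-in data set, used here only as a stand-in
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Variance carried by each component (squared standard deviations)
variances <- pca_demo$sdev^2

# Correlations between the component scores of the observations
score_cor <- cor(pca_demo$x)

print(round(variances, 3))  # decreasing from the first to the last component
print(round(score_cor, 3))  # identity matrix: the components are uncorrelated
```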
Notice also that we have 7 dimensions (as many as variables). The first 6 dimensions cannot explain all the variation of the data by themselves, but they do a great deal, since they explain 98% of it. Do we really need the 7th dimension to see the remaining 2%? I do not think so. Moreover, we are already quite comfortable with 2 dimensions, which explain a great deal (73.47%) of all variation. This is not bad at all!
eig.val <- get_eigenvalue(happy.pca)
eig.val
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.7949305        54.213294                    54.21329
## Dim.2  1.3481674        19.259534                    73.47283
## Dim.3  0.6725884         9.608406                    83.08123
## Dim.4  0.5438140         7.768771                    90.85000
## Dim.5  0.3606272         5.151817                    96.00182
## Dim.6  0.1444689         2.063841                    98.06566
## Dim.7  0.1354037         1.934339                   100.00000
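The two percentage columns follow directly from the eigenvalues: each eigenvalue is divided by their sum (7 here, one unit of variance per standardised variable), and the cumulative column is a running total. A short sketch that redoes this arithmetic with the eigenvalues copied from the table above:

```r
# Eigenvalues copied from the table above
eig <- c(3.7949305, 1.3481674, 0.6725884, 0.5438140,
         0.3606272, 0.1444689, 0.1354037)

# Share of the total variance explained by each dimension, in percent
variance_percent <- 100 * eig / sum(eig)

# Running total of the explained variance
cumulative_percent <- cumsum(variance_percent)

print(round(cbind(eig, variance_percent, cumulative_percent), 2))
```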
Of course, we can also visualise the variance explained by each dimension, as the graph below shows.
fviz_eig(happy.pca, addlabels = TRUE, ylim = c(0, 60), linecolor = "red", barfill = "darkblue", barcolor = "darkblue")
Showing the variables.
Now we are interested in two things: how the variables relate to the dimensions, and how they relate to each other. The main indicator here is distance: the distance between a variable’s arrow and an axis, the distance between two variables’ arrows, and the distance between the tip of an arrow and the circle. So what do all these distances indicate?
- If an arrow is close to dimension 1 (the horizontal axis), it correlates well with it. In our case, Happiness seems to move together with the right side of the first dimension, whereas Generosity is a better indicator of the second dimension (the vertical axis).
- If two arrows are close to each other, the variables are similar: they are positively correlated.
- If the tip of a variable’s arrow is close to the circle, the variable is well explained by the first two dimensions (for example Happiness); if the arrow is shorter, it is not as well explained (for instance Corruption).
var <- get_pca_var(happy.pca)
fviz_pca_var(happy.pca, col.var = "darkblue")
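Where do the arrows come from? For a PCA on standardised variables, the coordinates of a variable are its correlations with the dimensions, which is why no arrow can leave the circle of radius 1. The sketch below checks this with base R's prcomp on the built-in USArrests data, a stand-in for the happiness data; `var_coord`, `var_cor` and `arrow_length` are illustrative names, not objects returned by factoextra.

```r
# USArrests is a built-in data set, used here only as a stand-in
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations
var_coord <- sweep(pca_demo$rotation, 2, pca_demo$sdev, "*")

# The same numbers, obtained as correlations between variables and scores
var_cor <- cor(USArrests, pca_demo$x)

# Length of each arrow in the plane of the first two components
arrow_length <- sqrt(rowSums(var_coord[, 1:2]^2))

print(round(var_coord[, 1:2], 3))
print(round(arrow_length, 3))
```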
Cos2 shows the quality of representation
How well are the variables represented in the first two dimensions? The cos2 value shows just that: it is an indicator of the quality of representation of a variable.
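The following graph shows how well each variable is represented in the first two dimensions. We see that they are quite well explained; just look at the Happiness score, whose cos2 on the first two dimensions comes to roughly 0.89 (its contributions in the table further down, multiplied by the eigenvalues and divided by 100).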
fviz_cos2(happy.pca, choice ="var", axes = 1:2, top = 10, color = "darkblue" )
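For standardised variables, cos2 is simply the squared coordinate of a variable on a dimension, and the quality of representation on the first two dimensions is the sum of the two cos2 values. A minimal sketch with base R's prcomp on the built-in USArrests data (again only a stand-in; `cos2` and `quality_12` are illustrative names):

```r
# USArrests is a built-in data set, used here only as a stand-in
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates (correlations between variables and components)
var_coord <- sweep(pca_demo$rotation, 2, pca_demo$sdev, "*")

# cos2: squared coordinates; over all components they add up to 1 per variable
cos2 <- var_coord^2

# Quality of representation on the first two components
quality_12 <- rowSums(cos2[, 1:2])

print(round(sort(quality_12, decreasing = TRUE), 3))
```

A value close to 1 means the first two dimensions tell almost the whole story for that variable; a low value means most of its variation lives on the later dimensions.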
Contribution of the variables
How much does each variable contribute to each axis? The contribution shows that. For instance, the Happiness Score, GDP, Family and Health contribute 23%, 20%, 17% and 19% to the first dimension respectively. This will be important when we interpret the dimensions.
##                                   Dim.1     Dim.2        Dim.3      Dim.4
## Happiness.Score               23.233218  0.422551  0.303085884  0.2289653
## Economy..GDP.per.Capita.      20.423545  7.182351  0.002091468  6.6203425
## Family                        17.042155  3.871564  3.559805027 14.4535811
## Health..Life.Expectancy.      18.865755  5.860330  2.924262190 11.2688937
## Freedom                       11.398763 16.095782  4.829612378 44.5576870
## Generosity                     1.184413 44.895329 48.076908177  4.3516251
## Trust..Government.Corruption.  7.852151 21.672094 40.304234877 18.5189053
##                                     Dim.5
## Happiness.Score                0.01107323
## Economy..GDP.per.Capita.       1.29142607
## Family                        51.08282720
## Health..Life.Expectancy.      17.86134007
## Freedom                       19.55819568
## Generosity                     0.32802390
## Trust..Government.Corruption.  9.86711386
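A few lines of arithmetic help to read this table. The contributions to one dimension add up to 100%, so with 7 variables an "average" variable would contribute 100/7, about 14.3%. The sketch below redoes this for Dim.1 with the numbers copied from the table above. It also assumes the usual relation for a PCA on standardised variables, contribution = 100 × cos2 / eigenvalue, to get back the cos2 values; that relation is my reading of how FactoMineR defines contributions, not something shown in the output.

```r
# Dim.1 contributions (in %) copied from the table above
contrib_dim1 <- c(
  Happiness.Score               = 23.233218,
  Economy..GDP.per.Capita.      = 20.423545,
  Family                        = 17.042155,
  Health..Life.Expectancy.      = 18.865755,
  Freedom                       = 11.398763,
  Generosity                    = 1.184413,
  Trust..Government.Corruption. = 7.852151
)

# Eigenvalue of Dim.1, copied from the eigenvalue table
eig_dim1 <- 3.7949305

# Contributions to one dimension add up to 100%
total <- sum(contrib_dim1)

# Variables contributing more than the average of 100 / 7
expected <- 100 / length(contrib_dim1)
above_average <- names(contrib_dim1)[contrib_dim1 > expected]

# Assumed relation: contribution = 100 * cos2 / eigenvalue
cos2_dim1 <- contrib_dim1 * eig_dim1 / 100

print(total)
print(above_average)
print(round(cos2_dim1, 3))
```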
Contribution of the top 5 variables
Here we also show it graphically (only for the five variables that contribute most).
fviz_contrib(happy.pca, choice = "var", axes = 1, top = 5)
And so, finally, we can draw the PCA plot, that is, all of our observations (the countries) on a two-dimensional system. The result is not very readable, so we apply another method to show only the 50 countries best represented on the first two dimensions.
ind <- get_pca_ind(happy.pca)
ind
## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the individuals"
## 2 "$cos2"    "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
fviz_pca_ind(happy.pca, pointsize = "cos2", pointshape = 22, fill = "blue", repel = TRUE)
plot(happy.pca, select = "cos2 50", cex=1, col.ind = "darkblue", title = "50 countries with highest cos2", cex.main=2, col.main= "darkblue")
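What does selecting by cos2 actually do? For an observation, cos2 on a dimension is its squared coordinate divided by its squared distance to the origin over all dimensions, so the observations with the highest cos2 on the first two dimensions are the ones the 2D plot shows most faithfully. A minimal sketch of that selection with base R's prcomp on the built-in USArrests data (US states stand in for the countries; `ind_cos2`, `quality_12` and `top10` are illustrative names):

```r
# USArrests is a built-in data set; its 50 states stand in for the countries
pca_demo <- prcomp(USArrests, scale. = TRUE)
scores <- pca_demo$x

# cos2 of each observation on each component:
# squared score divided by the squared distance to the origin
ind_cos2 <- scores^2 / rowSums(scores^2)

# Quality of representation on the first two components
quality_12 <- rowSums(ind_cos2[, 1:2])

# The 10 best represented observations
top10 <- names(sort(quality_12, decreasing = TRUE))[1:10]
print(top10)
```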
This plot is much prettier, cleaner and more comprehensible!
So what is going on in these two dimensions? We see, for instance, that New Zealand, Australia, Norway, Denmark, etc. on the right side of the graph are close to each other, meaning that they show similar characteristics in the variables we studied. We could almost say that they form a cluster of happy countries with high GDP per capita, high life expectancy, little corruption, etc.
Eastern European countries seem to form another cluster with Turkey, Greece and Spain. So why are they different from the first bloc of countries? First of all, they score lower on the first dimension and, as they lie below the first dimension’s axis, they also score worse on the second dimension.
Finally, African countries seem to stay together. So, knowing all that we said about the contributions of each variable to the dimensions and the quality of representation, how could we interpret the dimensions?
I think that the further right a country is on the first dimension, the higher its Happiness score, GDP per capita, Health (life expectancy) and Family variables. The higher a country is on the second dimension’s axis, the less corruption there is in that country. When the dimensions are difficult to interpret, we can look at observations that are characteristically different from the others and go back to the contribution table; however, it is not always possible to interpret the dimensions of a PCA.
So, keeping in mind this reading of the two axes, the first bloc of countries seems to be happier, with a higher level of freedom and lower corruption; they live longer and they have a higher GDP per capita. Eastern European countries, Spain, Greece and Turkey not only score worse on the happiness, health and family variables, but corruption also seems to be much higher in these countries. Finally, African countries seem to be the least happy, with the lowest income, less freedom and the lowest life expectancy.