Terrorism around the Word- Study with R
“Everybody’s worried about stopping terrorism. Well, there is a really easy way: stop participating in it.” – Noam Chomsky
According to the Wikipedia, the English word “terror“, just like the French “terreur”, derives from that Latin word “terrere” and means to fright, alarm, anguish, fear, panic. Indeed, we all fear terrorism that is the more and more part of our life. But do we understand the global picture? Who attacks who, where and why? Why do we see the more and more suicide attacks? This study focuses on answering these questions by an Exploratory Data Analysis, semi-supervised learning and a supervised Logit model.
In this project I will attempt to review the evolution of the number, frequency and types of terrorist attacks around the world. The Global Terrorism Database served as the data set and it can be found sur Kaggle: https://www.kaggle.com/START-UMD/gtd/data.
I will focus on the MENA area (Middle East and North Africa region) as the highest and deadliest attacks are taken place here, moreover this study aims to explain a relatively new type of behaviour: the suicide among terrorists. Since we start to observe suicidal attack from the years of 2000, we only analyse this period.
2. To start.. a bit of data cleaning
2. 1. Reduction of the original data set
Calling the data
The packages needed (need to install them if they are not yet installed on the R used by the command, install.packages(“name_of_package”))
library(tidyr) # Tidy data - data cleaning library(dplyr) # Data manipulation - data frames library(FactoMineR) # Multivariate Exploratory Data Analysis library(factoextra) # Visualisation of Multivariate Exploratory Data Analysis library(corrplot) # Correlation plot, visualisation library(purrr) # Functional Programming library(tm) # Text Mining library(wordcloud) # For wordclouds library(RColorBrewer) # Palettes library(ggplot2) # Data visualisation library(maps) # Creating maps
The original data set of more than 170 000 observations and 135 variables makes it difficult to analyse the data. This study only focuses on certain variables therefore any supplementary variables are deleted.
TERmini <- select(TER, eventid, iyear, country_txt, region_txt, latitude, longitude, crit1, crit2, crit3, multiple, success, suicide, attacktype1_txt, targtype1_txt, gname, weaptype1_txt, nkill, nwound, property, INT_MISC, INT_LOG, propextent_txt, motive,targtype1, weaptype1)
I only check for the missing variables that are important for the study.
NA_<-sapply(TERmini, function(x) (sum(is.na(x))/count(TERmini))*100) data.frame(NA_)
The above code shows the percentage of missing variables. If it is below 5%, the rate of missing variables are considered insignificant. Fortunately, almost all of the variables of importance have a percentage of missing values below 5%.
The variable nkill shows the number of people killed in the attack. This variable has 5.7% missing observation, however, I assume that if an observation is missing, it is likely to be zero as it is more plausible to exclude here observations with nobody died than victims dying in an attack. The same treatment is applied for the variable “nwound”, the number of people wounded in an attack.
Finally I delete all observation with missing values of the “weaptype_1” (type of weapon used) and “int_mics” (whether the attack was against international targets).
TERmini$nkill[is.na(TERmini$nkill)]=0 TERmini$nwound[is.na(TERmini$nwound)]=0 ind<- which(TERmini$weaptype1_txt=="") TERmini<-TERmini[-ind,] inddd<- which(TERmini$INT_MISC==-9) TERmini<- TERmini[-inddd,] inde<-which(is.na(TERmini$INT_MISC)) TERmini <-TERmini[-inde,]
Renaming certain variables and changing their factors
# Changing the factors TERmini$region_txt<-as.factor(TERmini$region_txt) TERmini$weaptype1_txt<-as.factor(TERmini$weaptype1_txt) TERmini$attacktype1_txt<-as.factor(TERmini$attacktype1_txt) TERmini$country_txt<-as.factor(TERmini$country_txt) TERmini$targtype1_txt <-as.factor(TERmini$targtype1_txt) TERmini$crit1<-as.numeric(TERmini$crit1) TERmini$crit2<-as.numeric(TERmini$crit2) TERmini$crit3<-as.numeric(TERmini$crit3) TERmini$multiple<-as.numeric(TERmini$multiple) TERmini$success<-as.numeric(TERmini$success) TERmini$INT_MISC<-as.numeric(TERmini$INT_MISC) TERmini$property<-as.numeric(TERmini$property) # Creation of a new variable, suicide_txt TERmini$Suicide_txt <- ifelse(TERmini$suicide==1,"Suicide","No suicide") # Renaming colnames(TERmini)<-"year" colnames(TERmini)<-"country" colnames(TERmini)<-"region" colnames(TERmini)<-"attack" colnames(TERmini)<-"target" colnames(TERmini)<-"group" colnames(TERmini)<-"weapon" colnames(TERmini)<-"killed" colnames(TERmini)<-"wounded" levels(TERmini$region)[levels(TERmini$region)=="Middle East & North Africa"] <- "MENA" levels(TERmini$attack)[levels(TERmini$attack)=="Hostage Taking (Kidnapping)"] <- "Hostage" levels(TERmini$attack)[levels(TERmini$attack)=="Facility/Infrastructure Attack"] <- "Facility" levels(TERmini$group)[levels(TERmini$group)=="Islamic State of Iraq and the Levant (ISIL)"] <- "IS" levels(TERmini$weapon)[levels(TERmini$weapon)=="Explosives/Bombs/Dynamite"] <- "Explosives" levels(TERmini$target)[levels(TERmini$target)=="Private Citizens & Property"] <- "Citizens" levels(TERmini$target)[levels(TERmini$target)=="Government (General)"] <- "Government" levels(TERmini$weapon)[levels(TERmini$weapon)=="Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)"] <- "Vehicle"
Finally, reducing the data base for observation between :
- 2011-2017 (data11_16)
- 2000-2017 (dat00_16)
- For the 5 biggest terrorist groups (data)
index=2000) data00_16 <- TERmini[index, ] index1=2011) data11_16 <- TERmini[index1, ] data <- subset(data11_16, data11_16$group == "Taliban" | data11_16$group == "Boko Haram" | data11_16$group == "Maoists" | data11_16$group == "IS" | data11_16$group == "Al-Shabaab")
List of variables of importance after data cleaning
- iyear : year of attack
- country_txt : the country in which the attack took place
- region_txt : the region geographic in which the attack took place
- city : the city in which the attack took place
- longitude : these two variables were used for a visualization of the attacks world-wide
- summary : short description of the attack
- multiple : dummy indicating if the attack was a multiple attack
- attacktype1_txt: type of attack
- targtype1_txt : the type of target attacked
- targsubtype1_txt : the type of target attacked- more precisedly
- gname : terrorist group
- weaptype1_txt : the type of weapon used in the attack
- nkill : number of victims died in the attack
- nwound : number of victims wounded in the attack
2.2. Visual representation
Visual representation of the whole period 1970 – 2016
blue.bold.text <- element_text(face = "bold", color = "Darkblue") blue.text <- element_text(color = "Darkblue") p <- ggplot(data = TER, aes(x = iyear)) + geom_histogram(aes(y = ..count..), binwidth = 1, color= 'white', fill = "darkblue") + scale_x_continuous(name = "Year",breaks = seq(from = 1970, to = 2017, by = 5), limits = c(1970, 2017)) + scale_y_continuous(name = "Effectif") + ggtitle(label = "Number of terrorist attacks worldwide") + theme(title = blue.bold.text, axis.title = blue.text) + theme(axis.text.x= element_text(angle=45, hjust=1)) p
summarise(nr_of_attacks = n())
## 1 170350
We observe a strong upward trend regarding the number of terrorist attacks per year since 1970. We see also that the data set misses observations completely for the year 1993. This does not cause problems in the analysis as the study focuses on a latter time period.
The total number of attack between 1970 and 2016 is 170 350 (excluding 1993).
ggplot(data = TERmini) + geom_bar(aes(x=year,y=killed), stat = 'identity', fill='darkblue') + ylab('Total') + ggtitle('Victims died in terrorist attack worldwide') + scale_x_continuous(name= 'Year', breaks = seq(from = 1970, to = 2017, by = 5))+ theme(title = blue.bold.text, axis.title = blue.text)+ theme(axis.text.x= element_text(angle=45, hjust=1))
We see that the increase of the number of people died in terrorsit attacks is increasing but at a faster rate then the number of attacks increasing, leaving us wondering whether the attacks are more focused on the number of lives taken now than 30 years ago.
We see that the increase of the number of people died in terrorsit attacks is increasing but at a faster rate then the number of attacks increasing, leaving us wondering whether the attacks are more focused on the number of lives taken now than 30 years ago.
##  382459
In total, 382 459 victims died as a result of terrorist attacks since 1970 (1993 excluded).
Suicide attacks between 1970-2016
# Creation of a data frame with the number of attack and number of suicidal attack per year year_suicide <- data.frame(table(TERmini$Suicide_txt, TERmini$year)) year_suicide$suicide <- year_suicide$Var1 year_suicide$year <- year_suicide$Var2 year_suicide <- subset(year_suicide, select = -c(Var1, Var2)) # Find the suicide rates for each year yearly_suicide_rate <- year_suicide$Freq[year_suicide$suicide == 'Suicide'] / (year_suicide$Freq[year_suicide$suicide == 'Suicide'] + year_suicide$Freq[year_suicide$suicide == 'No suicide']) #Turn it into a dataframe yearly_suicide_rate <- data.frame(yearly_suicide_rate) yearly_suicide_rate$rate <- yearly_suicide_rate$yearly_suicide_rate yearly_suicide_rate <- subset(yearly_suicide_rate, select = -(yearly_suicide_rate)) # Adding years to the dataframe yearly_suicide_rate$year <- unique(year_suicide$year)
We represent here the number of suicidal attacks per year:
ggplot(data = year_suicide, aes(x=year, y=Freq, fill = suicide)) + geom_bar(stat = 'identity') + ylab('Number') + ggtitle("Number of suicidal and non-suicidal attacks") + scale_x_discrete(name= 'Year', breaks = seq(from = 1970, to = 2017, by = 5))+ theme(title = blue.bold.text, axis.title = blue.text)+ theme(axis.text.x= element_text(angle=45, hjust=1))
We see the phenomenon “suicidal attack” was almost non-existent before the 2000s, and increasingly present ever since. This shows us that this deadly technic is a relatively approach of terrorist groups.
ggplot(data = yearly_suicide_rate, aes(x=year, y=rate)) + geom_point(colour = 'darkblue', size=2) + geom_line(aes(group=1)) + ylab('Rate of suicidal attacks ') + ggtitle("Percentage of suicidal attacks worldwide") + scale_x_discrete(name= 'Year', breaks = seq(from = 1970, to = 2017, by = 5))+ theme(title = blue.bold.text, axis.title = blue.text)+ theme(axis.text.x= element_text(angle=45, hjust=1))
The above graph shows the same information but as a rate of percentage.
As the number of suicidal attacks is negligible before 2000, we only focus for the period after this year.
index=2000) data00_16 <- TERmini[index, ]
Dispersion of terrorism in the world
# Deletion of missing observations with missing laltitude or longitude- only for this graph indixxx<-which(is.na(data00_16$latitude)) data00_16_sans<- data00_16[-indixxx,] world <- map_data("world") ggplot() + geom_polygon(data = world, aes(x = long, y = lat, group = group))+ geom_point(data = data00_16_sans, aes(x = longitude, y = latitude), colour = "red", size = 0.01)
This map shows that most of the attacks have taken place in the Middle-East, India and Subsaharan-Africa. Other dense regions are South Asia and and the North part of South-America.
Countries the most attacked
T1<-as.data.frame(head(sort(table(data00_16$country), decreasing = TRUE), 10)) T2<-as.data.frame(head(sort(table(data00_16$country), decreasing = TRUE), 10) / dim(data00_16) * 100) TT<-cbind(T1,T2[,2]) colnames(TT)=c("Country","Number of attack", "Percentage") TT
We see above the countries with the highest numbers of attacks are indeed those of MENA, furthermore Pakistan, India, and the Phillippines. Iraq takes alone more than 21% of all attacks wordlwide.
Now we turn to the three most affected countries to see in the evolution of the number of attacks per year.
index<- which(data00_16$country=='Iraq') Iraq <- data00_16[index, ] p <- ggplot(data = Iraq, aes(x = year)) + geom_histogram(aes(y = ..count..), binwidth = 1, color= 'white', fill = "darkblue") + scale_x_continuous(name = "Year",breaks = seq(from = 2000, to = 2017, by = 1), limits = c(2000, 2017)) + scale_y_continuous(name = "Number of attack") + ggtitle(label = "Number of attack in Iraq") + theme(title = blue.bold.text, axis.title = blue.text) + theme(axis.text.x= element_text(angle=45, hjust=1)) p index<- which(data00_16$country=='Pakistan') Pakistan <- data00_16[index, ] p <- ggplot(data = Pakistan, aes(x = year)) + geom_histogram(aes(y = ..count..), binwidth = 1, color= 'white', fill = "darkblue") + scale_x_continuous(name = "Year",breaks = seq(from = 2000, to = 2017, by = 1), limits = c(2000, 2017)) + scale_y_continuous(name = "Number of attacks") + ggtitle(label = "Number of attack in Pakistan") + theme(title = blue.bold.text, axis.title = blue.text) + theme(axis.text.x= element_text(angle=45, hjust=1)) p index<- which(data00_16$country=='Afghanistan') Afghanistan <- data00_16[index, ] p <- ggplot(data = Pakistan, aes(x = year)) + geom_histogram(aes(y = ..count..), binwidth = 1, color= 'white', fill = "darkblue") + scale_x_continuous(name = "Year",breaks = seq(from = 2000, to = 2017, by = 1), limits = c(2000, 2017)) + scale_y_continuous(name = "Number of attacks") + ggtitle(label = "Number of attacks in Afghanistan") + theme(title = blue.bold.text, axis.title = blue.text) + theme(axis.text.x= element_text(angle=45, hjust=1)) p
We see on the first graph that the number of attacks in Iraq has increased enormously since 2003, the start of the Second Persian Gulf War. In case of Pakistan and Afghanistan, the number of attacks increased steadily from 2001 until 2014. The strong increase of the number of attacks in case of Afghanistan coincides with the 11 September, 2001 terrorist attack and the start of the Afghanistan war.
The most active organisations
T1<-as.data.frame(head(sort(table(data00_16$group), decreasing = TRUE), 10)) T2<-as.data.frame(head(sort(table(data00_16$group), decreasing = TRUE), 10) / dim(data00_16) * 100) TT<-cbind(T1,T2[,2]) colnames(TT)=c("Organisation","Frequency", "Percentage") TT
We see the ten biggest terrorist groups on the above list, that is the 10 most active terrorist groups between 2011 and 2016. It is clear that the Taliban and IS committed the most attacks, while the Al-Shabaab military group and Boko Haram are responsible each for more than 2000 attacks.
First we focus on the types of targets
ggplot(data=data00_16, aes(x=target)) + geom_histogram(stat='count', fill= 'darkblue') + theme(axis.text.x= element_text(angle=45, hjust=1)) + labs(title="Number of attacks per target type (2000-2016)")+ xlab('Target type') + ylab('Total')+ theme(title = blue.bold.text, axis.title = blue.text))+ theme(title = blue.bold.text, axis.title = blue.text)
We see that the most attacks target citizens. This supports the idea that terrorist attacks aim to effect the most people possible to create fear and insecurity in the society. The second target group is military followed by police targets.
T1<-as.data.frame(head(sort(table(data00_16$target), decreasing = TRUE), 10)) T2<-as.data.frame(head(sort(table(TERmini$target), decreasing = TRUE), 10) / dim(TERmini) * 100) TT<-cbind(T1,T2[,2]) colnames(TT)=c("Country","Frequency", "Percentage") TT
Citizen targeted attacks account for more than 23 % of the total attacks. Almost 15 % of attacks are against military while 13% against government targets. These three biggest target groups account for more than 50 % of all attacks.
ggplot(data=data00_16, aes(x=weapon)) + geom_histogram(stat='count', fill= 'darkblue') + theme(axis.text.x= element_text(angle=45, hjust=1)) + labs(title="Number of attack per weapon type (2000-2016)")+ xlab("Weapon type") + ylab('Total')+ theme(title = blue.bold.text, axis.title = blue.text))+ theme(title = blue.bold.text, axis.title = blue.text)
T1<-as.data.frame(head(sort(table(data00_16$weapon), decreasing = TRUE), 10)) T2<-as.data.frame(head(sort(table(data00_16$weapon), decreasing = TRUE), 10) / dim(TERmini) * 100) TT<-cbind(T1,T2[,2]) colnames(TT)=c("Country","Frequency", "Percentage") TT) TT
## 9 Other 61 0.03592314 ## 10 Biological 27 0.01590040
In fact firearms and explosives account for more than 50 % of all attacks.
Finally, we turn our focus to the motivations of attacks. As this information is represented in the form of text, I will use a wordcloud.
Aims of attacks
wordcloud(data00_16$motive, max.words = 100, min.freq = 45, rot.per=.15,colors=brewer.pal(8, "Dark2"), random.order = FALSE)
This wordcloud tells us an interesting story: the most common aim of attacks are simply unknown. This leaves us wondering whether our security agencies should use more resources on better understanding these terrorist groups. How can we fight something that we do not even understand?
3. Non-supervised models
3.1 Multiple Correspondant Analysis (MCA)
In this section, we present the results of the MCA analysis, that only focus on the suicide attacks between 2011 and 2016. We end up with 13 994 observations. This method allows us to represent the observations (in this case, the terrorist attacks) in space, in which all dimensions correspond to several variables. The aim of the MCA analysis is, (as explained in the theory part,) to reduce the number of dimensions of a large number of observations, thus allowing a more compact and simple representation and comprehension of the observations.
The terrorism database that we use in this study, has more modalities that can be conveniently represent in an MCA. (For instance the number of different terrorist groups is several thousands. This would most probably not contribute to a deeper understanding of the phenomenon of terrorist attacks as even with the MCA, we would need many dimensions to represent all different terrorist groups, moreover, as a consequence of the heterogeneity of the observations.)
Therefore we only keep : – the five weapon types the most used,
- the five biggest terrorist groups,
- and the five most common target types.
Our MCA base contains therefore 13 360 observations.
# 5 weapon the most used data.mca <- subset(data, data$weapon == "Explosives" | data$weapon == "Firearms" | data$weapon == "Unknown" | data$weapon == "Incendiary" | data$weapon == "Melee" | data$weapon == "Chemical") # 5 most used attack types data.mca2 <- subset(data.mca, data.mca$attack == "Bombing/Explosion" | data.mca$attack == "Armed Assault" | data.mca$attack == "Unknown" | data.mca$attack == "Hostage" | data.mca$attack == "Assassination" | data.mca$attack == "Facility") # 5 most attacked target type data.mca3 <- subset(data.mca2, data.mca2$target == "Citizens" | data.mca2$target == "Military" | data.mca2$target == "Police" | data.mca2$target == "Government" | data.mca2$target == "Business" | data.mca2$target == "Unknown") c <- c(4, 12, 13, 14, 15, 16) data.mca <- data.mca3[, c]
We have a glance at the data:
region suicide attack target group South Asia :5747 Min. :0.0000 Bombing/Explosion:6112 Citizens :4200 Taliban :4726 Sub-Saharan Africa:3923 1st Qu.:0.0000 Armed Assault :3510 Military :3574 IS :3689 MENA :3668 Median :0.0000 Unknown :1482 Police :3183 Al-Shabaab:2237 Southeast Asia : 9 Mean :0.1237 Hostage :1284 Government:1244 Boko Haram:1685 Western Europe : 9 3rd Qu.:0.0000 Assassination : 649 Business : 665 Maoists :1023 Central Asia : 2 Max. :1.0000 Facility : 323 Unknown : 494 : 0 (Other) : 2 (Other) : 0 (Other) : 0 (Other) : 0 weapon Explosives:6833 Firearms :3919 Unknown :2066 Incendiary: 298 Melee : 223 Chemical : 21 (Other) : 0
We also need to factorise the variables so that MCA functions.
# Creation of a qualitative variable for suicide data.mca$type <- ifelse(data.mca$suicide==1,"Suicidaire","Non-suicidaire") data.mca$region<-as.factor(data.mca$region) data.mca$attack<-as.factor(data.mca$attack) data.mca$target<-as.factor(data.mca$target) data.mca$group<-as.factor(data.mca$group) data.mca$weapon<-as.factor(data.mca$weapon) data.mca$type<-as.factor(data.mca$type) # Clear the quantitative suicide variable data.mca <- select(data.mca, -suicide)
Our first representation of the MCA reveales some difficulties linked to a representation of a similar volume of observations: with as many observations, our MCA is simply not visible.
res.mca <- MCA(data.mca, quali.sup = c(1), ncp = 5, graph = FALSE) plot(res.mca, cex=0.7, col.var = "darkblue", title = "Active categories", cex.main=2, col.main= "darkblue")
As the representation of this MCA did not get us any closer to the comprehension of data, the study focuses in what follows on the dimensions, variables and individuals. After a better understanding of all these, we make a second attemp to represent our MCA.
Summary of the dimensions
The dimensions can include several variables, according to the highest explained variance for the observations. That is, the first dimension explains the highest variance explainable by one sigle dimension, the second dimensions aims to explain the second highest variance of observations, knowing that it is perpendicular to the second, etc.
Below we represent the eigenvalues of all dimensions and the percentage of explained variance by them. We see for instance that the first dimension explains 11.73 percent of all variation, while the second account only for 9.84%, etc.
eigenvalues <- get_eigenvalue(res.mca) head(round(eigenvalues, 2))
eigenvalue variance.percent cumulative.variance.percent Dim.1 0.47 11.73 11.73 Dim.2 0.39 9.84 21.57 Dim.3 0.37 9.16 30.73 Dim.4 0.30 7.38 38.11 Dim.5 0.28 6.90 45.01 Dim.6 0.26 6.47 51.48
The first two dimensions explain only 21,57% of variation of the observations. Considering the number of observations, however, this seems right.
The below graph represent the percentage explained by each dimension. As it is explained previously, the first dimension explains the most variation, followed by the second, third, etc..
Analysis of the variables
Contribution of the variables to the dimensions
To better understand our dimensions, it can be useful to see which variables contribute the most to each dimensions. We can choose a graphical representations to do that, we can see on the below graph that Explosion, Explosives and Firearms seem to contribute the most to the first dimension. This makes us thinking that the first dimension represents mostly the weapon type. We also that the second dimension’s biggest contributing variable is the “Unknown” attack type.
var <- get_mca_var(res.mca)
corrplot(var$contrib, is.corr = FALSE)
For more visibility, we represent contributions to the first and second dimensions. Variables that are above the red dotted line contribute more than the average.
fviz_contrib(res.mca, choice = "var", axes = 1)
fviz_contrib(res.mca, choice = "var", axes = 2)
It can be also useful to see the quality of representation of each variables. Below is a table representing the quality of representation of each variables on each dimension. For instance we see that Bombing/Explosion and Explosives are well-represented on the first dimension.
Again, it can be useful to see the quality of representation of each variable to the first two dimension (represented together). This is in fact the cos^2, the higher it is, the better represented is the variable.
fviz_cos2(res.mca, choice = "var", axes = 1:2)
Finally, we represent the different active variables/ categories on a better, clearer MCA graph. Naturally, as we can better understand two-dimensional graphs, we only focus on the first two dimensions.
plot(res.mca, invisible = c( "ind"), cex=1, col.var = "darkblue", title = "Active categories", cex.main=2, col.main= "darkblue")
Now we can try to translate the first two dimensions.
We see for instance that the attack type “Unknown” and the weapon type “Unknown” are represented on the bottom of the graph. We have seen before that these variables have the biggest contributions to the second dimension, furthermore that their quality of representation is high. So we conclude that if an observation is in the bottom of the graph, it is more likely that the type of weapon and the type of attack are unknown. Furthermore, they are very close together, the closer variables are on this graph, the more they correlate. We can think of this as the distance between the two variables is small, implying that they are similar.
We also observe that the Explosives and Bombing/Explosions are on the left side of the graph, furthermore that they are close on to another, implying similarity. On the right side, we see Firearms and Armed assults.
Finally, this study is addressing the question of suicide, this variable being on the left side of the first dimension, implying that observations that will be situated on this die will be more likely to be suicidaire. We also see that suicide is closest to unknown target type and Bombing/Explosion, implying that the most suicide attacks are indeed Bombing/Explosives attacks.
We can also see that MENA is the closest to Chemical weapons, (so that chemical attacks are more likely here than other parts of the world).
Individual observations – attacks
Now, after a better understanding of the first two dimensions and how the different variables are correlated, we turn to the individual observations, that is, the attacks themselves.
ind <- get_mca_ind(res.mca) ind
## Multiple Correspondence Analysis Results for individuals ## =================================================== ## Name Description ## 1 "$coord" "Coordinates for the individuals" ## 2 "$cos2" "Cos2 for the individuals" ## 3 "$contrib" "contributions of the individuals"
We can represent the quality of representation of each attack. Here, however, as we have more then 13 000 observations, we do not expect a large cos^2.
We show only the first few individuals. We see that the quality of representation is generally very small and (2-3%) although we have a much higher cos^2 for certain attacks.
Dim 1 Dim 2 Dim 3 Dim 4 Dim 5 99600 0.26269351 0.1253694656 0.0696668345 0.01969511 0.145961263 99611 0.03210075 0.0003395013 0.0003588428 0.14970651 0.070537832 99612 0.02501009 0.0680279128 0.0669620482 0.01146119 0.033724086 99619 0.02387862 0.0044075809 0.0053810941 0.28557389 0.125952973 99637 0.21295419 0.0309962574 0.0701373677 0.03006849 0.001055433 99649 0.02169697 0.0225874010 0.0288138269 0.12520376 0.103929290
Just as for the variables, we can also represent the contribution of each observations. We expect a small contribution, however, as a consequence of the large number of observations. Indeed we see that the contribution is small and quite homogeneous.
fviz_contrib(res.mca, choice = "ind", axes = 1:2, top = 20)
Finally we represent the individuals on an MCA graph, seperately for the suicidal(blue) and non-suicidal attacks (red). We can see that the MCA graph distinguished between these attacks quite clearly, and that the suicidal attacks are indeed on the left side of the first dimension, as expected.
This also makes us think of the attack type – Bombing/ Explosives, that situated on the left side of the first dimension also. These two phenomenons, suicidal and explosive attacks correlate strongly.
fviz_mca_ind(res.mca, label = "none", habillage=data.mca$type, title="MCA graph representation of suicidal and non-suicidal attacks", cex.main=4, addEllipses = TRUE, ellipse.level = 0.95)
Finally we turn to a common represetation of the individuals and the variables;
plot(res.mca, label = c("quali.sup", "var"), select = "cos2 30", cex=1.2, col.var = "darkblue", col.quali.sup = "brown3", col.ind = "seashell3", title = "Individuals with active and supplementary categories", cex.main=2, col.main= "darkblue")
This confirms the analysis of the dimensions made previously.
I would like to make a simple Hierarchical Clustering here from the MCA made previously. The only problem is that having more than 13 000 observations makes it impossible to represent the clusters at the end of this section, therefore I will choose 200 observations randomly, and carry out the Clustering only on this 200 observations.
set.seed(300) random_rows <- sample(rownames(data.mca), size = 200, replace = FALSE) data_hcpc <- data.mca[random_rows, ]
res2.mca <- MCA(data_hcpc, quali.sup = c(1, 6), ncp = 5, graph = FALSE) res.hcpc <- HCPC(res2.mca, min = 3, nb.clust = -1, graph = TRUE)
We create a table to show 4 observations of each clusters.
c1 <- c((which(rownames(data_hcpc) == 156643)), (which(rownames(data_hcpc) == 133063)), (which(rownames(data_hcpc) == 144457)), (which(rownames(data_hcpc) == 132794)), (which(rownames(data_hcpc) == 127701))) cl2 <- data_hcpc[c1,] print(cl2)
We see that the first cluster includes attacks carried out mostly by Bombing/Explosives, against the police or citizens, mostly by Explosives, in MENA or South Asia.
c2 <- c((which(rownames(data_hcpc) == 164863)), (which(rownames(data_hcpc) == 136786)), (which(rownames(data_hcpc) == 147038)), (which(rownames(data_hcpc) == 146353)), (which(rownames(data_hcpc) == 135790))) cl2 <- data_hcpc[c2,] print(cl2)
|136786||Sub-Saharan Africa||Unknown||Military||Boko Haram||Unknown||Non-suicidaire|
The second cluster contains attacks with unknown weapon types, non-suicidal and mostly military targeted attacks.
c3 <- c((which(rownames(data_hcpc) == 101866 )), (which(rownames(data_hcpc) == 168373)), (which(rownames(data_hcpc) == 133220)), (which(rownames(data_hcpc) == 121599)), (which(rownames(data_hcpc) == 163226))) cl3 <- data_hcpc[c3,] print(cl3)
|101866||Sub-Saharan Africa||Armed Assault||Citizens||Al-Shabaab||Firearms||Non-suicidaire|
|168373||Sub-Saharan Africa||Armed Assault||Citizens||Al-Shabaab||Firearms||Non-suicidaire|
The third cluster contains non-suicidal attacks carried out by firearms.
c4<- c((which(rownames(data_hcpc) == 148231)), (which(rownames(data_hcpc) == 156248)), (which(rownames(data_hcpc) == 151572)), (which(rownames(data_hcpc) == 164419)), (which(rownames(data_hcpc) == 141649))) cl4 <- data_hcpc[c4,] print(cl4)
|151572||Sub-Saharan Africa||Facility||Citizens||Boko Haram||Incendiary||Non-suicidaire|
The fourth cluster contains mostly ‘incendiary’, non-suicidal, facility targeted attacks.
We also make a k-means classification. We keep in mind that k-mean for a given data set can give us different results as it starts randomly from an observation. To be able to compare the k-mean results with outr classification made earlier, we choose 4 clusters.
set.seed(13) groupes.kmeans <- kmeans(km, centers = 4, nstart = 5) print(groupes.kmeans)
K-means clustering with 4 clusters of sizes 5, 81, 25, 89 Cluster means: Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 1 1.0157455 3.85344781 0.70046022 0.30378384 0.06657276 2 0.7248291 -0.29081709 -0.02711369 0.05404063 0.17294555 3 -0.5529222 -0.14616286 1.62931536 0.09630131 -0.29570187 4 -0.5614250 0.08924738 -0.47234805 -0.09330048 -0.07807761 Clustering vector: 166282 156989 159632 155224 152096 102369 156695 140747 138613 165396 167291 129733 158945 146718 150239 2 4 2 3 2 2 2 4 2 4 4 4 4 2 2 169740 164419 150858 149456 140903 169763 140346 138422 106680 108972 103452 142302 164942 152838 121243 4 1 4 4 2 3 2 4 2 2 2 2 4 3 2 141953 153783 166831 133935 109302 138039 134705 156643 148231 157532 133063 101866 155285 163293 133548 3 2 4 4 4 2 2 4 1 4 4 2 4 4 2 163226 144457 131568 111933 155921 133567 159085 147038 169936 165792 146353 138133 164236 162049 105462 2 4 4 2 4 2 4 3 2 2 3 2 3 3 4 132794 148524 149500 127701 120058 147371 136786 141649 158117 157721 148985 162132 123417 107253 141071 4 2 2 4 4 4 3 1 3 4 4 4 4 2 4 128712 135790 166113 111475 151572 159919 117756 119700 167326 135630 122030 150006 150804 165521 136442 2 3 3 4 1 3 4 2 4 2 3 2 2 3 2 165953 128763 161111 141763 155743 134462 151387 116128 147330 119094 161665 131439 167462 159073 106797 4 2 4 4 3 4 2 2 2 4 2 4 4 2 2 170036 142266 116768 144218 116901 139141 146994 122110 102999 142586 104219 122347 154990 118817 103075 4 4 4 4 2 3 2 2 2 4 2 2 3 4 2 138974 133220 169743 133126 155350 164863 131160 151666 165039 117751 119921 154721 166382 165911 110474 2 2 4 2 4 3 3 2 2 4 4 2 4 2 4 121347 138936 156857 132614 138770 139645 168373 101970 131386 120137 128888 121599 163024 139275 160756 4 4 2 4 4 2 2 2 4 3 4 2 2 2 4 165286 108089 168977 124345 137001 150157 128637 125759 119487 149947 154107 137480 131692 170136 137542 4 4 4 3 3 2 4 4 2 4 2 4 4 4 2 157890 139864 150343 108428 169570 156248 131391 154091 112929 115316 166411 137334 149501 145462 151804 2 4 2 4 4 1 2 2 2 2 3 4 4 4 4 136247 165690 162011 108725 158547 155175 109304 134744 169604 134639 166351 158728 129390 119122 152257 4 4 2 2 2 4 4 4 4 4 2 3 4 2 4 105739 149755 157427 166875 158072 2 2 4 4 2 Within cluster sum of squares by cluster:  5.910466 105.512136 18.023409 61.506283 (between_SS / total_SS = 57.8 %)
We see that we have 4 clusters with 5, 81, 25, 89 observation respectively.
We ask whether the optimal number of clusters is indeed 4. To answer this question, we show the Optimal number of clusters.
fviz_nbclust(km, hcut, method = "wss")
We see that we can indeed have 4 clusters as having 5 clusters (so one additionally) would deduct our total within sum squared by less than the first 4.
We also represent the for clusters below, and we see that there is one (the first) which is quite different from the others.
fviz_cluster(groupes.kmeans, data = km, palette = "jco", repel=
TRUE, main = "Kmeans", ggtheme = theme_classic())
dist.hcpc <- dist(km) cah.ward <- hclust(dist.hcpc, method = "ward.D2") cah.ward
## ## Call: ## hclust(d = dist.hcpc, method = "ward.D2") ## ## Cluster method : ward.D2 ## Distance : euclidean ## Number of objects: 200
Now we ask the question, how does our K-mean and Hierarchical Clustering correspond? Are they quite similar or very different? To answer this question, we show how the two correspond.
groupes.cah <- cutree(cah.ward, k=4) print(table(groupes.cah, groupes.kmeans$cluster))
## ## groupes.cah 1 2 3 4 ## 1 0 81 0 5 ## 2 0 0 0 84 ## 3 0 0 24 0 ## 4 5 0 1 0
We see that:
- cluster 1 (k-means) correspond to cluster 2 (CAH), with a difference of 5 observations,
- cluster 2 (k-means) correspond to cluster 4 (CAH) exactly,
- cluster 3 (k-means) correspond to cluster 3 (CAH) exacty,
- cluster 4 (k-means) correspond to cluster 1 (CAH), with only observations difference.
Altogether we see that our two methods of clustering gave us very similar results.
Supervised model – Logit
The aims of this study is to explain the suicide, therefore our explained variable is indeed a dummy, taking the value of 1 if the attack is suicidal and 0 otherwise.
We could use logit or probit model depending on the underlying distribution of the error term, for the easier translation of the coeffitients I’ll go forward with a logit. This is a binomial model estimated by maximum likelihood estimation.
In this section, there is no need to to limit our observations to the data presented in the MCA section, rather, we focus on all atacks between 2011 and 2016. Finally we have 15 150 observations. With a first method of stepwise, we select the most pertinant variables. Finally we have to be cautious about the percentage of suicidal and non-suicidal attacks: for a good estimation of the logit, we need a balanced data set. Therefore we choose all suicidal attacks in a base, and we add the same number of non-suicidal, randomly chosen attack to this base, finally, the logit will bne estimated on this, constructed data set.
set.seed(300) index1<- which(data$suicide==1) YYY<-data[index1,] sample1<-sample(nrow(YYY),size=nrow(YYY)*0.6) suicide1<- YYY[sample1,] index2<-which(data$suicide==0) XXX<-data[index2,] set.seed(300) sample2<- sample(nrow(XXX),size=nrow(suicide1)) suicide2<- XXX[sample2,] train<-rbind(suicide1,suicide2) suicide11<- YYY[-sample1,] xx<- XXX[-sample2,] suicide22<-xx[sample(nrow(xx), size=nrow(train)-nrow(suicide11)),] test<- rbind(suicide11,suicide22) sum(test$suicide==1)
##  771
The training dataset contains 1541 observations, half of that being suicidal and the other half non-suicidal. This training data set will be used to estimate the probability of a suicidal attack.
data$propextent_txt<-as.factor(data$propextent_txt) data$weapon<-as.factor(data$weapon) data$target<-as.factor(data$target)
model<- glm(suicide~ target + multiple+ success + attack + wounded+ killed+ property , family=binomial(link="logit"), data=train) # quand on ajoute weaptype ou targettype le mod?le ne marche pas
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0000 0.0397 0.6244 0.5000 0.7872 1.0000
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.819e+00 1.667e+00 -3.491 0.000482 *** targetBusiness -1.040e-01 1.268e+00 -0.082 0.934611 targetEducational Institution -1.257e+00 1.404e+00 -0.895 0.370805 targetFood or Water Supply -3.902e-01 1.894e+00 -0.206 0.836800 targetGovernment (Diplomatic) 2.538e+00 1.737e+00 1.461 0.143960 targetGovernment 5.538e-01 1.263e+00 0.438 0.661102 targetJournalists & Media 1.162e+00 1.663e+00 0.699 0.484699 targetMilitary 4.225e-01 1.248e+00 0.339 0.734956 targetNGO 2.485e-01 1.543e+00 0.161 0.872068 targetOther -1.258e-01 1.433e+00 -0.088 0.930056 targetPolice 1.130e-01 1.251e+00 0.090 0.928022 targetCitizens -8.894e-01 1.249e+00 -0.712 0.476407 targetReligious Figures/Institutions 4.620e-02 1.291e+00 0.036 0.971444 targetTelecommunication -1.801e+01 1.487e+03 -0.012 0.990337 targetTerrorists/Non-State Militia 9.904e-01 1.293e+00 0.766 0.443679 targetTourists 5.476e-01 2.670e+00 0.205 0.837480 targetTransportation -3.876e-01 1.352e+00 -0.287 0.774372 targetUnknown -8.993e-01 1.267e+00 -0.710 0.477991 targetUtilities -1.043e+00 1.357e+00 -0.769 0.441909 targetViolent Political Party 4.733e-02 1.779e+00 0.027 0.978780 multiple 1.066e-01 1.382e-01 0.771 0.440635 success -1.582e+00 2.013e-01 -7.859 3.86e-15 *** attackAssassination 5.602e+00 1.130e+00 4.958 7.11e-07 *** attackBombing/Explosion 7.453e+00 1.096e+00 6.800 1.05e-11 *** attackFacility -1.114e+01 8.813e+02 -0.013 0.989919 attackHijacking -1.023e+01 6.523e+03 -0.002 0.998749 attackHostage Taking (Barricade Incident) 8.397e+00 1.302e+00 6.451 1.11e-10 *** attackHostage 4.703e+00 1.152e+00 4.083 4.44e-05 *** attackUnarmed Assault -1.373e+01 2.317e+03 -0.006 0.995270 attackUnknown -1.222e+01 5.476e+02 -0.022 0.982203 wounded 7.989e-02 8.439e-03 9.467 < 2e-16 *** killed 1.142e-02 3.867e-03 2.953 0.003149 ** property -4.560e-02 1.548e-02 -2.947 0.003214 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1)
Null deviance: 3205.1 on 2311 degrees of freedom Residual deviance: 1769.9 on 2279 degrees of freedom AIC: 1835.9 Number of Fisher Scoring iterations: 17
We see on the above table that the following variables increases the probability of a suicidal attack: – If the government, media or non-state militia was attacked. – If the attack was multiple. – If the attack was an assassination, an explosion, and if hostages were taken. – The number of wounded and killed people is also positively correlated with the probability of suicide.
We therefore explain this by the followings: Terrorists who aim to hurt as many people as possible are more likely/ willing to suicide. Similarly, attackers involved in multiple attacks are more likely to agree to a possible suicide.
What is interesting is that success affects negatively the probability of suicide. There is a hypothesis saying that they are negatively correlated as more experience terrorists do not commit suicide and as they have more experience, their rate of success is higher.