https://www.cartoonmovement.com/p/7403

Terrorism around the Word- Study with R


“Everybody’s worried about stopping terrorism. Well, there is a really easy way: stop participating in it.”  –  Noam Chomsky

Abstract

According to the Wikipedia, the English word “terror“, just like the French “terreur”, derives from that Latin word “terrere” and means to fright, alarm, anguish, fear, panic. Indeed, we all fear terrorism that is the more and more part of our life. But do we understand the global picture? Who attacks who, where and why? Why do we see the more and more suicide attacks? This study focuses on answering these questions by an Exploratory Data Analysis, semi-supervised learning and a supervised Logit model.


1. Introduction

In this project I will attempt to review the evolution of the number, frequency and types of terrorist attacks around the world. The Global Terrorism Database served as the data set and it can be found sur Kaggle: https://www.kaggle.com/START-UMD/gtd/data.

I will focus on the MENA area (Middle East and North Africa region) as the highest and deadliest attacks are taken place here, moreover this study aims to explain a relatively new type of behaviour: the suicide among terrorists. Since we start to observe suicidal attack from the years of 2000, we only analyse this period.

2. To start.. a bit of data cleaning

2. 1. Reduction of the original data set

Calling the data

The packages needed (need to install them if they are not yet installed on the R used by the command, install.packages(“name_of_package”))

library(tidyr) # Tidy data - data cleaning 
library(dplyr) # Data manipulation - data frames 
library(FactoMineR) # Multivariate Exploratory Data Analysis 
library(factoextra) # Visualisation of Multivariate Exploratory Data Analysis 
library(corrplot) # Correlation plot, visualisation
library(purrr) # Functional Programming 
library(tm) # Text Mining 
library(wordcloud) # For wordclouds 
library(RColorBrewer) # Palettes 
library(ggplot2) # Data visualisation
library(maps) # Creating maps

Data reduction

The original data set of more than 170 000 observations and 135 variables makes it difficult to analyse the data. This study only focuses on certain variables therefore any supplementary variables are deleted.

TERmini <- select(TER, eventid, iyear, country_txt, region_txt, latitude, longitude, crit1, crit2, crit3, multiple, success, suicide, attacktype1_txt, targtype1_txt, gname, weaptype1_txt, nkill, nwound, property, INT_MISC, INT_LOG, propextent_txt, motive,targtype1, weaptype1)

Missing observations

I only check for the missing variables that are important for the study.

NA_<-sapply(TERmini, function(x) (sum(is.na(x))/count(TERmini))*100)
data.frame(NA_)

The above code shows the percentage of missing variables. If it is below 5%, the rate of missing variables are considered insignificant. Fortunately, almost all of the variables of importance have a percentage of missing values below 5%.

The variable nkill shows the number of people killed in the attack. This variable has 5.7% missing observation, however, I assume that if an observation is missing, it is likely to be zero as it is more plausible to exclude here observations with nobody died than victims dying in an attack. The same treatment is applied for the variable “nwound”, the number of people wounded in an attack.

Finally I delete all observation with missing values of the “weaptype_1” (type of weapon used) and “int_mics” (whether the attack was against international targets).

TERmini$nkill[is.na(TERmini$nkill)]=0
TERmini$nwound[is.na(TERmini$nwound)]=0
ind<- which(TERmini$weaptype1_txt=="") 
TERmini<-TERmini[-ind,]

inddd<- which(TERmini$INT_MISC==-9)
TERmini<- TERmini[-inddd,] 
inde<-which(is.na(TERmini$INT_MISC))
TERmini <-TERmini[-inde,]

Renaming certain variables and changing their factors

# Changing the factors 
TERmini$region_txt<-as.factor(TERmini$region_txt)
TERmini$weaptype1_txt<-as.factor(TERmini$weaptype1_txt)
TERmini$attacktype1_txt<-as.factor(TERmini$attacktype1_txt)
TERmini$country_txt<-as.factor(TERmini$country_txt)
TERmini$targtype1_txt <-as.factor(TERmini$targtype1_txt)
TERmini$crit1<-as.numeric(TERmini$crit1)
TERmini$crit2<-as.numeric(TERmini$crit2)
TERmini$crit3<-as.numeric(TERmini$crit3)
TERmini$multiple<-as.numeric(TERmini$multiple)
TERmini$success<-as.numeric(TERmini$success)
TERmini$INT_MISC<-as.numeric(TERmini$INT_MISC)
TERmini$property<-as.numeric(TERmini$property)

# Creation of a new variable, suicide_txt
TERmini$Suicide_txt <- ifelse(TERmini$suicide==1,"Suicide","No suicide")

# Renaming
colnames(TERmini)[2]<-"year"
colnames(TERmini)[3]<-"country"
colnames(TERmini)[4]<-"region"
colnames(TERmini)[13]<-"attack"
colnames(TERmini)[14]<-"target"
colnames(TERmini)[15]<-"group"
colnames(TERmini)[16]<-"weapon"
colnames(TERmini)[17]<-"killed"
colnames(TERmini)[18]<-"wounded"

levels(TERmini$region)[levels(TERmini$region)=="Middle East & North Africa"] <- "MENA"
levels(TERmini$attack)[levels(TERmini$attack)=="Hostage Taking (Kidnapping)"] <- "Hostage"
levels(TERmini$attack)[levels(TERmini$attack)=="Facility/Infrastructure Attack"] <- "Facility"
levels(TERmini$group)[levels(TERmini$group)=="Islamic State of Iraq and the Levant (ISIL)"] <- "IS"
levels(TERmini$weapon)[levels(TERmini$weapon)=="Explosives/Bombs/Dynamite"] <- "Explosives"
levels(TERmini$target)[levels(TERmini$target)=="Private Citizens & Property"] <- "Citizens"
levels(TERmini$target)[levels(TERmini$target)=="Government (General)"] <- "Government"
levels(TERmini$weapon)[levels(TERmini$weapon)=="Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)"] <- "Vehicle"

Finally, reducing the data base for observation between :

  • 2011-2017 (data11_16)
  • 2000-2017 (dat00_16)
  • For the 5 biggest terrorist groups (data)
index=2000)
data00_16 <- TERmini[index, ]

index1=2011)
data11_16 <- TERmini[index1, ]
data <- subset(data11_16, data11_16$group == "Taliban" | data11_16$group == "Boko Haram" | data11_16$group == "Maoists" | data11_16$group == "IS" | data11_16$group == "Al-Shabaab")

List of variables of importance after data cleaning

  1. iyear : year of attack
  2. country_txt : the country in which the attack took place
  3. region_txt : the region geographic in which the attack took place
  4. city : the city in which the attack took place
  5. lattitude
  6. longitude : these two variables were used for a visualization of the attacks world-wide
  7. summary : short description of the attack
  8. multiple : dummy indicating if the attack was a multiple attack
  9. attacktype1_txt: type of attack
  10. targtype1_txt : the type of target attacked
  11. targsubtype1_txt : the type of target attacked- more precisedly
  12. gname : terrorist group
  13. weaptype1_txt : the type of weapon used in the attack
  14. nkill : number of victims died in the attack
  15. nwound : number of victims wounded in the attack

2.2. Visual representation

Visual representation of the whole period 1970 – 2016

blue.bold.text <- element_text(face = "bold", color = "Darkblue")
blue.text <- element_text(color = "Darkblue")

p <- ggplot(data = TER, aes(x = iyear)) +
 geom_histogram(aes(y = ..count..), binwidth = 1, color= 'white', fill = "darkblue") +
 scale_x_continuous(name = "Year",breaks = seq(from = 1970, to = 2017, by = 5), limits = c(1970, 2017)) +
 scale_y_continuous(name = "Effectif") +
 ggtitle(label = "Number of terrorist attacks worldwide") + theme(title = blue.bold.text, axis.title = blue.text) + 
 theme(axis.text.x= element_text(angle=45, hjust=1))
p

TER %>%

summarise(nr_of_attacks = n())

# nr_of_attacks

## 1 170350

We observe a strong upward trend regarding the number of terrorist attacks per year since 1970. We see also that the data set misses observations completely for the year 1993. This does not cause problems in the analysis as the study focuses on a latter time period.

The total number of attack between 1970 and 2016 is 170 350 (excluding 1993).

ggplot(data = TERmini) + geom_bar(aes(x=year,y=killed), stat = 'identity', fill='darkblue') +
  ylab('Total') + ggtitle('Victims died in terrorist attack worldwide') + 
  scale_x_continuous(name= 'Year', breaks = seq(from = 1970, to = 2017, by = 5))+
  theme(title = blue.bold.text, axis.title = blue.text)+
  theme(axis.text.x= element_text(angle=45, hjust=1))

We see that the increase of the number of people died in terrorsit attacks is increasing but at a faster rate then the number of attacks increasing, leaving us wondering whether the attacks are more focused on the number of lives taken now than 30 years ago.

We see that the increase of the number of people died in terrorsit attacks is increasing but at a faster rate then the number of attacks increasing, leaving us wondering whether the attacks are more focused on the number of lives taken now than 30 years ago.

sum(TERmini$killed)
## [1] 382459

In total, 382 459 victims died as a result of terrorist attacks since 1970 (1993 excluded).

Suicide attacks between 1970-2016

# Creation of a data frame with the number of attack and number of suicidal attack per year
year_suicide <- data.frame(table(TERmini$Suicide_txt, TERmini$year))
year_suicide$suicide <- year_suicide$Var1
year_suicide$year <- year_suicide$Var2
year_suicide <- subset(year_suicide, select = -c(Var1, Var2))
# Find the suicide rates for each year
yearly_suicide_rate <- year_suicide$Freq[year_suicide$suicide == 'Suicide'] / (year_suicide$Freq[year_suicide$suicide == 'Suicide'] + year_suicide$Freq[year_suicide$suicide == 'No suicide'])
#Turn it into a dataframe
yearly_suicide_rate <- data.frame(yearly_suicide_rate)
yearly_suicide_rate$rate <- yearly_suicide_rate$yearly_suicide_rate
yearly_suicide_rate <- subset(yearly_suicide_rate, select = -(yearly_suicide_rate))
# Adding years to the dataframe
yearly_suicide_rate$year <- unique(year_suicide$year)

We represent here the number of suicidal attacks per year:

ggplot(data = year_suicide, aes(x=year, y=Freq, fill = suicide)) + 
  geom_bar(stat = 'identity') +
  ylab('Number') + ggtitle("Number of suicidal and non-suicidal attacks") + 
  scale_x_discrete(name= 'Year', breaks = seq(from = 1970, to = 2017, by = 5))+
  theme(title = blue.bold.text, axis.title = blue.text)+
  theme(axis.text.x= element_text(angle=45, hjust=1))

We see the phenomenon “suicidal attack” was almost non-existent before the 2000s, and increasingly present ever since. This shows us that this deadly technic is a relatively approach of terrorist groups.

ggplot(data = yearly_suicide_rate, aes(x=year, y=rate)) + 
  geom_point(colour = 'darkblue', size=2) +
  geom_line(aes(group=1)) +
  ylab('Rate of suicidal attacks ') + ggtitle("Percentage of suicidal attacks worldwide") + 
  scale_x_discrete(name= 'Year', breaks = seq(from = 1970, to = 2017, by = 5))+
  theme(title = blue.bold.text, axis.title = blue.text)+
  theme(axis.text.x= element_text(angle=45, hjust=1))

The above graph shows the same information but as a rate of percentage.

As the number of suicidal attacks is negligible before 2000, we only focus for the period after this year.

index=2000)
data00_16 <- TERmini[index, ]

Dispersion of terrorism in the world

# Deletion of missing observations with missing laltitude or longitude- only for this graph
indixxx<-which(is.na(data00_16$latitude))
data00_16_sans<- data00_16[-indixxx,]

world <- map_data("world")
ggplot() + geom_polygon(data = world, aes(x =  long, y = lat, group = group))+  
geom_point(data = data00_16_sans, aes(x = longitude, y = latitude), colour = "red", size = 0.01) 

This map shows that most of the attacks have taken place in the Middle-East, India and Subsaharan-Africa. Other dense regions are South Asia and and the North part of South-America.

Countries the most attacked

T1<-as.data.frame(head(sort(table(data00_16$country), decreasing = TRUE), 10))
T2<-as.data.frame(head(sort(table(data00_16$country), decreasing = TRUE), 10) / dim(data00_16)[1] * 100)
TT<-cbind(T1,T2[,2])
colnames(TT)=c("Country","Number of attack", "Percentage")
TT

Capture d’écran 2018-03-22 à 22.33.16.png

We see above the countries with the highest numbers of attacks are indeed those of MENA, furthermore Pakistan, India, and the Phillippines. Iraq takes alone more than 21% of all attacks wordlwide.

Now we turn to the three most affected countries to see in the evolution of the number of attacks per year.

index<- which(data00_16$country=='Iraq')
Iraq <- data00_16[index, ]

p <- ggplot(data = Iraq, aes(x = year)) +
 geom_histogram(aes(y = ..count..), binwidth = 1, color= 'white', fill = "darkblue") +
 scale_x_continuous(name = "Year",breaks = seq(from = 2000, to = 2017, by = 1), limits = c(2000, 2017)) +
 scale_y_continuous(name = "Number of attack") +
 ggtitle(label = "Number of attack in Iraq") + theme(title = blue.bold.text, axis.title = blue.text) + 
 theme(axis.text.x= element_text(angle=45, hjust=1))
p

index<- which(data00_16$country=='Pakistan')
Pakistan <- data00_16[index, ]

p <- ggplot(data = Pakistan, aes(x = year)) +
 geom_histogram(aes(y = ..count..), binwidth = 1, color= 'white', fill = "darkblue") +
 scale_x_continuous(name = "Year",breaks = seq(from = 2000, to = 2017, by = 1), limits = c(2000, 2017)) +
 scale_y_continuous(name = "Number of attacks") +
 ggtitle(label = "Number of attack in Pakistan") + theme(title = blue.bold.text, axis.title = blue.text) + 
 theme(axis.text.x= element_text(angle=45, hjust=1))
p

index<- which(data00_16$country=='Afghanistan')
Afghanistan <- data00_16[index, ]

p <- ggplot(data = Pakistan, aes(x = year)) +
 geom_histogram(aes(y = ..count..), binwidth = 1, color= 'white', fill = "darkblue") +
 scale_x_continuous(name = "Year",breaks = seq(from = 2000, to = 2017, by = 1), limits = c(2000, 2017)) +
 scale_y_continuous(name = "Number of attacks") +
 ggtitle(label = "Number of attacks in Afghanistan") + theme(title = blue.bold.text, axis.title = blue.text) + 
 theme(axis.text.x= element_text(angle=45, hjust=1))
p

We see on the first graph that the number of attacks in Iraq has increased enormously since 2003, the start of the Second Persian Gulf War. In case of Pakistan and Afghanistan, the number of attacks increased steadily from 2001 until 2014. The strong increase of the number of attacks in case of Afghanistan coincides with the 11 September, 2001 terrorist attack and the start of the Afghanistan war.

The most active organisations

T1<-as.data.frame(head(sort(table(data00_16$group), decreasing = TRUE), 10))
T2<-as.data.frame(head(sort(table(data00_16$group), decreasing = TRUE), 10) / dim(data00_16)[1] * 100)
TT<-cbind(T1,T2[,2])
colnames(TT)=c("Organisation","Frequency", "Percentage")
TT

Capture d_écran 2018-03-22 à 22.30.29

We see the ten biggest terrorist groups on the above list, that is the 10 most active terrorist groups between 2011 and 2016. It is clear that the Taliban and IS committed the most attacks, while the Al-Shabaab military group and Boko Haram are responsible each for more than 2000 attacks.

Attack characteristics

First we focus on the types of targets

ggplot(data=data00_16, aes(x=target)) +
 geom_histogram(stat='count', fill= 'darkblue') +
 theme(axis.text.x= element_text(angle=45, hjust=1)) +
 labs(title="Number of attacks per target type (2000-2016)")+ 
 xlab('Target type') + ylab('Total')+ 
 theme(title = blue.bold.text, axis.title = blue.text))+ 
  theme(title = blue.bold.text, axis.title = blue.text)

We see that the most attacks target citizens. This supports the idea that terrorist attacks aim to effect the most people possible to create fear and insecurity in the society. The second target group is military followed by police targets.

T1<-as.data.frame(head(sort(table(data00_16$target), decreasing = TRUE), 10))
T2<-as.data.frame(head(sort(table(TERmini$target), decreasing = TRUE), 10) / dim(TERmini)[1] * 100)
TT<-cbind(T1,T2[,2])
colnames(TT)=c("Country","Frequency", "Percentage")
TT
Country
Frequency
Percentage
Citizens 27168 23.460164
Military 16033 15.013515
Police 15094 13.504155
Government 11253 11.951804
Business 8165 11.644396
Unknown 3993 3.898544
Transportation 3001 3.433898
Religious Figures/Institutions 2843 2.805538
Educational Institution 2786 2.461618
Utilities 2144 2.446896

Citizen targeted attacks account for more than 23 % of the total attacks. Almost 15 % of attacks are against military while 13% against government targets. These three biggest target groups account for more than 50 % of all attacks.

Weapon types

ggplot(data=data00_16, aes(x=weapon)) +
 geom_histogram(stat='count', fill= 'darkblue') +
 theme(axis.text.x= element_text(angle=45, hjust=1)) +
 labs(title="Number of attack per weapon type (2000-2016)")+ 
 xlab("Weapon type") + ylab('Total')+ 
 theme(title = blue.bold.text, axis.title = blue.text))+ 
  theme(title = blue.bold.text, axis.title = blue.text)
Most attacks are carried out by explosives, followed by firearms.
T1<-as.data.frame(head(sort(table(data00_16$weapon), decreasing = TRUE), 10))
T2<-as.data.frame(head(sort(table(data00_16$weapon), decreasing = TRUE), 10) / dim(TERmini)[1] * 100)
TT<-cbind(T1,T2[,2])
colnames(TT)=c("Country","Frequency", "Percentage")
TT)
TT

Country
Frequency
Percentage
Explosives 57176 33.67116786
Firearms 29373 17.29787347
Unknown 7123 4.19476229
Incendiary 4544 2.67597920
Melee 1823 1.07357176
Chemical 178 0.10482489
Sabotage Equipment 92 0.05417916
Vehicle 84 0.04946793
Other 61 0.03592314
Biological 27 0.01590040

## 9               Other        61  0.03592314
## 10         Biological        27  0.01590040

In fact firearms and explosives account for more than 50 % of all attacks.

Finally, we turn our focus to the motivations of attacks. As this information is represented in the form of text, I will use a wordcloud.

Aims of attacks

wordcloud(data00_16$motive, max.words = 100, min.freq = 45, rot.per=.15,colors=brewer.pal(8, "Dark2"),
random.order = FALSE)

This wordcloud tells us an interesting story: the most common aim of attacks are simply unknown. This leaves us wondering whether our security agencies should use more resources on better understanding these terrorist groups. How can we fight something that we do not even understand?

3. Non-supervised models

3.1 Multiple Correspondant Analysis (MCA)

In this section, we present the results of the MCA analysis, that only focus on the suicide attacks between 2011 and 2016. We end up with 13 994 observations. This method allows us to represent the observations (in this case, the terrorist attacks) in space, in which all dimensions correspond to several variables. The aim of the MCA analysis is, (as explained in the theory part,) to reduce the number of dimensions of a large number of observations, thus allowing a more compact and simple representation and comprehension of the observations.

The terrorism database that we use in this study, has more modalities that can be conveniently represent in an MCA. (For instance the number of different terrorist groups is several thousands. This would most probably not contribute to a deeper understanding of the phenomenon of terrorist attacks as even with the MCA, we would need many dimensions to represent all different terrorist groups, moreover, as a consequence of the heterogeneity of the observations.)

Therefore we only keep : – the five weapon types the most used,

  • the five biggest terrorist groups,
  • and the five most common target types.

Our MCA base contains therefore 13 360 observations.

# 5 weapon the most used
data.mca <- subset(data, data$weapon == "Explosives" | data$weapon == "Firearms" | data$weapon == "Unknown" | data$weapon == "Incendiary" | data$weapon == "Melee" | data$weapon == "Chemical")

# 5 most used attack types 
data.mca2 <- subset(data.mca, data.mca$attack == "Bombing/Explosion" | data.mca$attack == "Armed Assault" | data.mca$attack == "Unknown" | data.mca$attack == "Hostage" | data.mca$attack == "Assassination" | data.mca$attack == "Facility")

# 5 most attacked target type
data.mca3 <- subset(data.mca2, data.mca2$target == "Citizens" | data.mca2$target == "Military" | data.mca2$target == "Police" | data.mca2$target == "Government" | data.mca2$target == "Business" | data.mca2$target == "Unknown")

c <- c(4, 12, 13, 14, 15, 16)
data.mca <- data.mca3[, c]

We have a glance at the data:

summary(data.mca)
                region        suicide                     attack            target            group     
 South Asia        :5747   Min.   :0.0000   Bombing/Explosion:6112   Citizens  :4200   Taliban   :4726  
 Sub-Saharan Africa:3923   1st Qu.:0.0000   Armed Assault    :3510   Military  :3574   IS        :3689  
 MENA              :3668   Median :0.0000   Unknown          :1482   Police    :3183   Al-Shabaab:2237  
 Southeast Asia    :   9   Mean   :0.1237   Hostage          :1284   Government:1244   Boko Haram:1685  
 Western Europe    :   9   3rd Qu.:0.0000   Assassination    : 649   Business  : 665   Maoists   :1023  
 Central Asia      :   2   Max.   :1.0000   Facility         : 323   Unknown   : 494             :   0  
 (Other)           :   2                    (Other)          :   0   (Other)   :   0   (Other)   :   0  
        weapon    
 Explosives:6833  
 Firearms  :3919  
 Unknown   :2066  
 Incendiary: 298  
 Melee     : 223  
 Chemical  :  21  
 (Other)   :   0

We also need to factorise the variables so that MCA functions.

# Creation of a qualitative variable for suicide
data.mca$type <- ifelse(data.mca$suicide==1,"Suicidaire","Non-suicidaire")

data.mca$region<-as.factor(data.mca$region)
data.mca$attack<-as.factor(data.mca$attack)
data.mca$target<-as.factor(data.mca$target)
data.mca$group<-as.factor(data.mca$group)
data.mca$weapon<-as.factor(data.mca$weapon)
data.mca$type<-as.factor(data.mca$type)

# Clear the quantitative suicide variable
data.mca <- select(data.mca, -suicide)

Our first representation of the MCA reveales some difficulties linked to a representation of a similar volume of observations: with as many observations, our MCA is simply not visible.

res.mca <- MCA(data.mca, quali.sup = c(1), ncp = 5, graph = FALSE)
plot(res.mca, cex=0.7, col.var = "darkblue", title = "Active categories", cex.main=2, col.main= "darkblue")

As the representation of this MCA did not get us any closer to the comprehension of data, the study focuses in what follows on the dimensions, variables and individuals. After a better understanding of all these, we make a second attemp to represent our MCA.

Summary of the dimensions

The dimensions can include several variables, according to the highest explained variance for the observations. That is, the first dimension explains the highest variance explainable by one sigle dimension, the second dimensions aims to explain the second highest variance of observations, knowing that it is perpendicular to the second, etc.

Below we represent the eigenvalues of all dimensions and the percentage of explained variance by them. We see for instance that the first dimension explains 11.73 percent of all variation, while the second account only for 9.84%, etc.

eigenvalues <- get_eigenvalue(res.mca)
head(round(eigenvalues, 2))
      eigenvalue variance.percent cumulative.variance.percent
Dim.1       0.47            11.73                       11.73
Dim.2       0.39             9.84                       21.57
Dim.3       0.37             9.16                       30.73
Dim.4       0.30             7.38                       38.11
Dim.5       0.28             6.90                       45.01
Dim.6       0.26             6.47                       51.48

The first two dimensions explain only 21,57% of variation of the observations. Considering the number of observations, however, this seems right.

The below graph represent the percentage explained by each dimension. As it is explained previously, the first dimension explains the most variation, followed by the second, third, etc..

fviz_screeplot(res.mca)

MCA graphics

Analysis of the variables

Contribution of the variables to the dimensions

To better understand our dimensions, it can be useful to see which variables contribute the most to each dimensions. We can choose a graphical representations to do that, we can see on the below graph that Explosion, Explosives and Firearms seem to contribute the most to the first dimension. This makes us thinking that the first dimension represents mostly the weapon type. We also that the second dimension’s biggest contributing variable is the “Unknown” attack type.

var <- get_mca_var(res.mca)
corrplot(var$contrib, is.corr = FALSE)

For more visibility, we represent contributions to the first and second dimensions. Variables that are above the red dotted line contribute more than the average.

fviz_contrib(res.mca, choice = "var", axes = 1)

fviz_contrib(res.mca, choice = "var", axes = 2)

It can be also useful to see the quality of representation of each variables. Below is a table representing the quality of representation of each variables on each dimension. For instance we see that Bombing/Explosion and Explosives are well-represented on the first dimension.

corrplot(var$cos2, is.corr=FALSE)

Again, it can be useful to see the quality of representation of each variable to the first two dimension (represented together). This is in fact the cos^2, the higher it is, the better represented is the variable.

fviz_cos2(res.mca, choice = "var", axes = 1:2)

Active categories

Finally, we represent the different active variables/ categories on a better, clearer MCA graph. Naturally, as we can better understand two-dimensional graphs, we only focus on the first two dimensions.

plot(res.mca, invisible = c( "ind"), cex=1, col.var = "darkblue", title = "Active categories", cex.main=2, col.main= "darkblue") 

Now we can try to translate the first two dimensions.

We see for instance that the attack type “Unknown” and the weapon type “Unknown” are represented on the bottom of the graph. We have seen before that these variables have the biggest contributions to the second dimension, furthermore that their quality of representation is high. So we conclude that if an observation is in the bottom of the graph, it is more likely that the type of weapon and the type of attack are unknown. Furthermore, they are very close together, the closer variables are on this graph, the more they correlate. We can think of this as the distance between the two variables is small, implying that they are similar.

We also observe that the Explosives and Bombing/Explosions are on the left side of the graph, furthermore that they are close on to another, implying similarity. On the right side, we see Firearms and Armed assults.

Finally, this study is addressing the question of suicide, this variable being on the left side of the first dimension, implying that observations that will be situated on this die will be more likely to be suicidaire. We also see that suicide is closest to unknown target type and Bombing/Explosion, implying that the most suicide attacks are indeed Bombing/Explosives attacks.

We can also see that MENA is the closest to Chemical weapons, (so that chemical attacks are more likely here than other parts of the world).

Individual observations – attacks

Now, after a better understanding of the first two dimensions and how the different variables are correlated, we turn to the individual observations, that is, the attacks themselves.

ind <- get_mca_ind(res.mca)
ind
## Multiple Correspondence Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"

We can represent the quality of representation of each attack. Here, however, as we have more then 13 000 observations, we do not expect a large cos^2.

We show only the first few individuals. We see that the quality of representation is generally very small and (2-3%) although we have a much higher cos^2 for certain attacks.

head(ind$cos2)
 Dim 1        Dim 2        Dim 3      Dim 4       Dim 5
99600 0.26269351 0.1253694656 0.0696668345 0.01969511 0.145961263
99611 0.03210075 0.0003395013 0.0003588428 0.14970651 0.070537832
99612 0.02501009 0.0680279128 0.0669620482 0.01146119 0.033724086
99619 0.02387862 0.0044075809 0.0053810941 0.28557389 0.125952973
99637 0.21295419 0.0309962574 0.0701373677 0.03006849 0.001055433
99649 0.02169697 0.0225874010 0.0288138269 0.12520376 0.103929290

Just as for the variables, we can also represent the contribution of each observations. We expect a small contribution, however, as a consequence of the large number of observations. Indeed we see that the contribution is small and quite homogeneous.

fviz_contrib(res.mca, choice = "ind", axes = 1:2, top = 20)

Finally we represent the individuals on an MCA graph, seperately for the suicidal(blue) and non-suicidal attacks (red). We can see that the MCA graph distinguished between these attacks quite clearly, and that the suicidal attacks are indeed on the left side of the first dimension, as expected.

This also makes us think of the attack type – Bombing/ Explosives, that situated on the left side of the first dimension also. These two phenomenons, suicidal and explosive attacks correlate strongly.

fviz_mca_ind(res.mca, label = "none", habillage=data.mca$type, title="MCA graph representation of suicidal and non-suicidal attacks", cex.main=4, addEllipses = TRUE, ellipse.level = 0.95)

Finally we turn to a common represetation of the individuals and the variables;

plot(res.mca, label = c("quali.sup", "var"), select = "cos2 30", cex=1.2, col.var = "darkblue", col.quali.sup = "brown3", col.ind = "seashell3", title = "Individuals with active and supplementary categories", cex.main=2, col.main= "darkblue")

This confirms the analysis of the dimensions made previously.

Clustering

I would like to make a simple Hierarchical Clustering here from the MCA made previously. The only problem is that having more than 13 000 observations makes it impossible to represent the clusters at the end of this section, therefore I will choose 200 observations randomly, and carry out the Clustering only on this 200 observations.

set.seed(300)
random_rows <- sample(rownames(data.mca), size = 200, replace = FALSE)

data_hcpc <- data.mca[random_rows, ]
res2.mca <- MCA(data_hcpc, quali.sup = c(1, 6), ncp = 5, graph = FALSE)
res.hcpc <- HCPC(res2.mca, min = 3, nb.clust = -1, graph = TRUE)

We create a table to show 4 observations of each clusters.

c1 <- c((which(rownames(data_hcpc) == 156643)), (which(rownames(data_hcpc) == 133063)), (which(rownames(data_hcpc) == 144457)), (which(rownames(data_hcpc) == 132794)), (which(rownames(data_hcpc) == 127701)))
cl2 <- data_hcpc[c1,]
print(cl2)
region
attack
target
group
weapon
type
156643 MENA Bombing/Explosion Police IS Explosives Non-suicidaire
133063 MENA Bombing/Explosion Police IS Explosives Suicidaire
144457 South Asia Bombing/Explosion Citizens Taliban Explosives Suicidaire
132794 South Asia Bombing/Explosion Citizens Taliban Explosives Suicidaire
127701 South Asia Bombing/Explosion Citizens Taliban Explosives Non-suicidaire


We see that the first cluster includes attacks carried out mostly by Bombing/Explosives, against the police or citizens, mostly by Explosives, in MENA or South Asia. 

c2 <- c((which(rownames(data_hcpc) ==  164863)), (which(rownames(data_hcpc) == 136786)), (which(rownames(data_hcpc) == 147038)), (which(rownames(data_hcpc) == 146353)), (which(rownames(data_hcpc) == 135790)))
cl2 <- data_hcpc[c2,]
print(cl2)
region
attack
target
group
weapon
type
164863 South Asia Unknown Citizens Taliban Unknown Non-suicidaire
136786 Sub-Saharan Africa Unknown Military Boko Haram Unknown Non-suicidaire
147038 MENA Unknown Military IS Unknown Non-suicidaire
146353 MENA Unknown Military IS Unknown Non-suicidaire
135790 MENA Unknown Military IS Unknown Non-suicidaire


The second cluster contains attacks with unknown weapon types, non-suicidal and mostly military targeted attacks.

c3 <- c((which(rownames(data_hcpc) ==  101866 )), (which(rownames(data_hcpc) == 168373)), (which(rownames(data_hcpc) == 133220)), (which(rownames(data_hcpc) == 121599)), (which(rownames(data_hcpc) == 163226)))
cl3 <- data_hcpc[c3,]
print(cl3)
region
attack
target
group
weapon
type
101866 Sub-Saharan Africa Armed Assault Citizens Al-Shabaab Firearms Non-suicidaire
168373 Sub-Saharan Africa Armed Assault Citizens Al-Shabaab Firearms Non-suicidaire
133220 MENA Armed Assault Government IS Firearms Non-suicidaire
121599 South Asia Hostage Military Taliban Firearms Non-suicidaire
163226 South Asia Assassination Citizens Taliban Firearms Non-suicidaire


The third cluster contains non-suicidal attacks carried out by firearms. 

c4<- c((which(rownames(data_hcpc) ==  148231)), (which(rownames(data_hcpc) == 156248)), (which(rownames(data_hcpc) == 151572)), (which(rownames(data_hcpc) == 164419)), (which(rownames(data_hcpc) == 141649)))
cl4 <- data_hcpc[c4,]
print(cl4)
region
attack
target
group
weapon
type
148231 South Asia Facility Citizens Maoists Incendiary Non-suicidaire
156248 South Asia Facility Police Taliban Incendiary Non-suicidaire
151572 Sub-Saharan Africa Facility Citizens Boko Haram Incendiary Non-suicidaire
164419 South Asia Facility Business Maoists Incendiary Non-suicidaire
141649 South Asia Facility Business Maoists Unknown Non-suicidaire


The fourth cluster contains mostly ‘incendiary’, non-suicidal, facility targeted attacks. 

K-means

We also make a k-means classification. We keep in mind that k-mean for a given data set can give us different results as it starts randomly from an observation. To be able to compare the k-mean results with outr classification made earlier, we choose 4 clusters.

set.seed(13)
groupes.kmeans <- kmeans(km, centers = 4, nstart = 5)
print(groupes.kmeans)
K-means clustering with 4 clusters of sizes 5, 81, 25, 89

Cluster means:
       Dim.1       Dim.2       Dim.3       Dim.4       Dim.5
1  1.0157455  3.85344781  0.70046022  0.30378384  0.06657276
2  0.7248291 -0.29081709 -0.02711369  0.05404063  0.17294555
3 -0.5529222 -0.14616286  1.62931536  0.09630131 -0.29570187
4 -0.5614250  0.08924738 -0.47234805 -0.09330048 -0.07807761

Clustering vector:
166282 156989 159632 155224 152096 102369 156695 140747 138613 165396 167291 129733 158945 146718 150239 
     2      4      2      3      2      2      2      4      2      4      4      4      4      2      2 
169740 164419 150858 149456 140903 169763 140346 138422 106680 108972 103452 142302 164942 152838 121243 
     4      1      4      4      2      3      2      4      2      2      2      2      4      3      2 
141953 153783 166831 133935 109302 138039 134705 156643 148231 157532 133063 101866 155285 163293 133548 
     3      2      4      4      4      2      2      4      1      4      4      2      4      4      2 
163226 144457 131568 111933 155921 133567 159085 147038 169936 165792 146353 138133 164236 162049 105462 
     2      4      4      2      4      2      4      3      2      2      3      2      3      3      4 
132794 148524 149500 127701 120058 147371 136786 141649 158117 157721 148985 162132 123417 107253 141071 
     4      2      2      4      4      4      3      1      3      4      4      4      4      2      4 
128712 135790 166113 111475 151572 159919 117756 119700 167326 135630 122030 150006 150804 165521 136442 
     2      3      3      4      1      3      4      2      4      2      3      2      2      3      2 
165953 128763 161111 141763 155743 134462 151387 116128 147330 119094 161665 131439 167462 159073 106797 
     4      2      4      4      3      4      2      2      2      4      2      4      4      2      2 
170036 142266 116768 144218 116901 139141 146994 122110 102999 142586 104219 122347 154990 118817 103075 
     4      4      4      4      2      3      2      2      2      4      2      2      3      4      2 
138974 133220 169743 133126 155350 164863 131160 151666 165039 117751 119921 154721 166382 165911 110474 
     2      2      4      2      4      3      3      2      2      4      4      2      4      2      4 
121347 138936 156857 132614 138770 139645 168373 101970 131386 120137 128888 121599 163024 139275 160756 
     4      4      2      4      4      2      2      2      4      3      4      2      2      2      4 
165286 108089 168977 124345 137001 150157 128637 125759 119487 149947 154107 137480 131692 170136 137542 
     4      4      4      3      3      2      4      4      2      4      2      4      4      4      2 
157890 139864 150343 108428 169570 156248 131391 154091 112929 115316 166411 137334 149501 145462 151804 
     2      4      2      4      4      1      2      2      2      2      3      4      4      4      4 
136247 165690 162011 108725 158547 155175 109304 134744 169604 134639 166351 158728 129390 119122 152257 
     4      4      2      2      2      4      4      4      4      4      2      3      4      2      4 
105739 149755 157427 166875 158072 
     2      2      4      4      2 

Within cluster sum of squares by cluster:
[1]   5.910466 105.512136  18.023409  61.506283
 (between_SS / total_SS =  57.8 %)

We see that we have 4 clusters with 5, 81, 25, 89 observation respectively.

We ask whether the optimal number of clusters is indeed 4. To answer this question, we show the Optimal number of clusters.

fviz_nbclust(km, hcut, method = "wss")

We see that we can indeed have 4 clusters as having 5 clusters (so one additionally) would deduct our total within sum squared by less than the first 4.

We also represent the for clusters below, and we see that there is one (the first) which is quite different from the others.

fviz_cluster(groupes.kmeans, data = km, palette = "jco", repel=
TRUE, main = "Kmeans", ggtheme = theme_classic())

dist.hcpc <- dist(km)
cah.ward <- hclust(dist.hcpc, method = "ward.D2")
cah.ward
## 
## Call:
## hclust(d = dist.hcpc, method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 200

Now we ask the question, how does our K-mean and Hierarchical Clustering correspond? Are they quite similar or very different? To answer this question, we show how the two correspond.

groupes.cah <- cutree(cah.ward, k=4)
print(table(groupes.cah, groupes.kmeans$cluster))
##            
## groupes.cah  1  2  3  4
##           1  0 81  0  5
##           2  0  0  0 84
##           3  0  0 24  0
##           4  5  0  1  0

We see that:

  1. cluster 1 (k-means) correspond to cluster 2 (CAH), with a difference of 5 observations,
  2. cluster 2 (k-means) correspond to cluster 4 (CAH) exactly,
  3. cluster 3 (k-means) correspond to cluster 3 (CAH) exacty,
  4. cluster 4 (k-means) correspond to cluster 1 (CAH), with only observations difference.

Altogether we see that our two methods of clustering gave us very similar results.

Supervised model – Logit

The aims of this study is to explain the suicide, therefore our explained variable is indeed a dummy, taking the value of 1 if the attack is suicidal and 0 otherwise.

We could use logit or probit model depending on the underlying distribution of the error term, for the easier translation of the coeffitients I’ll go forward with a logit. This is a binomial model estimated by maximum likelihood estimation.

In this section, there is no need to to limit our observations to the data presented in the MCA section, rather, we focus on all atacks between 2011 and 2016. Finally we have 15 150 observations. With a first method of stepwise, we select the most pertinant variables. Finally we have to be cautious about the percentage of suicidal and non-suicidal attacks: for a good estimation of the logit, we need a balanced data set. Therefore we choose all suicidal attacks in a base, and we add the same number of non-suicidal, randomly chosen attack to this base, finally, the logit will bne estimated on this, constructed data set.

set.seed(300)
index1<- which(data$suicide==1)
YYY<-data[index1,]
sample1<-sample(nrow(YYY),size=nrow(YYY)*0.6)
suicide1<- YYY[sample1,]
index2<-which(data$suicide==0)
XXX<-data[index2,]
set.seed(300)
sample2<- sample(nrow(XXX),size=nrow(suicide1))
suicide2<- XXX[sample2,]
train<-rbind(suicide1,suicide2)

suicide11<- YYY[-sample1,]
xx<- XXX[-sample2,]
suicide22<-xx[sample(nrow(xx), size=nrow(train)-nrow(suicide11)),]
test<- rbind(suicide11,suicide22)
sum(test$suicide==1)

## [1] 771

The training dataset contains 1541 observations, half of that being suicidal and the other half non-suicidal. This training data set will be used to estimate the probability of a suicidal attack.

data$propextent_txt<-as.factor(data$propextent_txt)
data$weapon<-as.factor(data$weapon)
data$target<-as.factor(data$target)
model<- glm(suicide~ target + multiple+ success + attack + wounded+ killed+ property  , family=binomial(link="logit"), data=train)  # quand on ajoute weaptype ou targettype le mod?le ne marche pas 
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model$fitted.values)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0397  0.6244  0.5000  0.7872  1.0000
#summary(test$suicide)
summary(model)
Coefficients:
                                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                               -5.819e+00  1.667e+00  -3.491 0.000482 ***
targetBusiness                            -1.040e-01  1.268e+00  -0.082 0.934611    
targetEducational Institution             -1.257e+00  1.404e+00  -0.895 0.370805    
targetFood or Water Supply                -3.902e-01  1.894e+00  -0.206 0.836800    
targetGovernment (Diplomatic)              2.538e+00  1.737e+00   1.461 0.143960    
targetGovernment                           5.538e-01  1.263e+00   0.438 0.661102    
targetJournalists & Media                  1.162e+00  1.663e+00   0.699 0.484699    
targetMilitary                             4.225e-01  1.248e+00   0.339 0.734956    
targetNGO                                  2.485e-01  1.543e+00   0.161 0.872068    
targetOther                               -1.258e-01  1.433e+00  -0.088 0.930056    
targetPolice                               1.130e-01  1.251e+00   0.090 0.928022    
targetCitizens                            -8.894e-01  1.249e+00  -0.712 0.476407    
targetReligious Figures/Institutions       4.620e-02  1.291e+00   0.036 0.971444    
targetTelecommunication                   -1.801e+01  1.487e+03  -0.012 0.990337    
targetTerrorists/Non-State Militia         9.904e-01  1.293e+00   0.766 0.443679    
targetTourists                             5.476e-01  2.670e+00   0.205 0.837480    
targetTransportation                      -3.876e-01  1.352e+00  -0.287 0.774372    
targetUnknown                             -8.993e-01  1.267e+00  -0.710 0.477991    
targetUtilities                           -1.043e+00  1.357e+00  -0.769 0.441909    
targetViolent Political Party              4.733e-02  1.779e+00   0.027 0.978780    
multiple                                   1.066e-01  1.382e-01   0.771 0.440635    
success                                   -1.582e+00  2.013e-01  -7.859 3.86e-15 ***
attackAssassination                        5.602e+00  1.130e+00   4.958 7.11e-07 ***
attackBombing/Explosion                    7.453e+00  1.096e+00   6.800 1.05e-11 ***
attackFacility                            -1.114e+01  8.813e+02  -0.013 0.989919    
attackHijacking                           -1.023e+01  6.523e+03  -0.002 0.998749    
attackHostage Taking (Barricade Incident)  8.397e+00  1.302e+00   6.451 1.11e-10 ***
attackHostage                              4.703e+00  1.152e+00   4.083 4.44e-05 ***
attackUnarmed Assault                     -1.373e+01  2.317e+03  -0.006 0.995270    
attackUnknown                             -1.222e+01  5.476e+02  -0.022 0.982203    
wounded                                    7.989e-02  8.439e-03   9.467  < 2e-16 ***
killed                                     1.142e-02  3.867e-03   2.953 0.003149 ** 
property                                  -4.560e-02  1.548e-02  -2.947 0.003214 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 3205.1  on 2311  degrees of freedom
Residual deviance: 1769.9  on 2279  degrees of freedom
AIC: 1835.9

Number of Fisher Scoring iterations: 17

We see on the above table that the following variables increases the probability of a suicidal attack: – If the government, media or non-state militia was attacked. – If the attack was multiple. – If the attack was an assassination, an explosion, and if hostages were taken. – The number of wounded and killed people is also positively correlated with the probability of suicide.

We therefore explain this by the followings: Terrorists who aim to hurt as many people as possible are more likely/ willing to suicide. Similarly, attackers involved in multiple attacks are more likely to agree to a possible suicide.

What is interesting is that success affects negatively the probability of suicide. There is a hypothesis saying that they are negatively correlated as more experience terrorists do not commit suicide and as they have more experience, their rate of success is higher.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

%d bloggers like this: