Using Python to Analyze the Persecution of the Coptic Orthodox Church

Photo by Rafik Wahba on Unsplash

Coptic Orthodoxy is one of the oldest denominations in church history. Since its founding in the first century, the Copts have suffered great persecution and martyrdom. The Synaxarium, a daily reading in the church of the lives of the saints and martyrs, has countless records of persecution in the church’s lifetime. I hope to draw a picture of documented incidents from 1980–2018 with the Global Terrorism Database (GTD)¹.

I first came across the GTD in a class I took called The Theory and Politics of Terrorism for my intelligence analysis minor. Intelligence analysis has always interested me, and I believe that data science complements it well. I was fascinated that this data was so robustly collected and organized; it was set up quite well to be analyzed. With the systematic format of this data and my knowledge of terrorism, I knew I was ready to dive into the dataset to analyze attacks against Copts.

My approach…

First, I will clean up the data set to keep only the most important variables that relate to the Coptic church. Then, I will look into the incident summaries and news-source headlines of the Coptic incidents with text analysis via NLTK. I will also plot some histograms to better describe the variety of attacks. Next, I will map the incidents with an interactive map built with plotly to see where they occur. To understand whether there’s a trend in when the attacks occur, I will plot additional histograms of the years, months, and days of the attacks. Finally, I will look into the motives of these attacks with text analysis once more to describe the motive summary and ransom note text data.

Data Curation:

I first downloaded the entire GTD data file. I removed any extraneous columns that I believed wouldn’t add much to this analysis; the file has over 100 features, some of them very detailed. To read more about them, take a look at the GTD codebook.

import pandas as pd

path = 'GTD_all.csv'
df = pd.read_csv(path)

# columns that are not needed for this analysis
col_remove = ['INT_ANY', 'INT_IDEO', 'INT_LOG', 'INT_MISC', 'approxdate', 'attacktype2', 'attacktype2_txt',
              'attacktype3', 'attacktype3_txt', 'claim2', 'claim3', 'claimmode2', 'claimmode2_txt', 'claimmode3', 'claimmode3_txt', 'compclaim', 'corp2', 'corp3', 'dbsource', 'doubtterr', 'eventid', 'extended',
              'gname3', 'gsubname3', 'guncertain1', 'guncertain2', 'guncertain3', 'natlty3', 'natlty3_txt', 'nhostkidus', 'nwoundus', 'ransomamtus', 'ransompaidus', 'related', 'resolution', 'specificity',
              'targsubtype3', 'targsubtype3_txt', 'targtype3', 'targtype3_txt', 'vicinity', 'weapsubtype3', 'weapsubtype3_txt', 'weapsubtype4', 'weapsubtype4_txt', 'weaptype3', 'weaptype3_txt', 'weaptype4',
              ]
df = df.drop(columns=col_remove)

I then searched columns such as the incident summary, location, motive, property damage, additional/ransom notes, targets, entity, and source citations for the word ‘coptic’. I chose these columns because they were the ones that would include the term ‘Coptic’ specifically; other text columns wouldn’t relate to my word search and would not be worth parsing. Searching that set of columns surfaced incidents involving Coptic Orthodox churches, Coptic monasteries, and Coptic individuals and businesses, among others. I concatenated the results of these searches and deleted any duplicates, which left me with 86 columns and 64 rows.

key_word = ['coptic']     # identify word to search
pattern = '|'.join(key_word)   # pattern passed to str.contains

# specify which features to look at
search = ['summary', 'location', 'motive', 'propcomment', 'addnotes', 'target1', 'target2', 'target3', 'scite1',
          'scite2', 'scite3', 'corp1', 'ransomnote']

# lowercase words in all of above columns so search function catches all rows with 'coptic'
for i in search:
    df[i] = df[i].str.lower()

# function to search through one of the above columns and return the matching rows
def SearchWords(ColName):
    return df[df[ColName].str.contains(pattern, na=False)]

# running the search on each column and collecting the results
searched = [SearchWords(i) for i in search]

# concatenating all searches together to 1 df called church
church = pd.concat(searched)

# dropping duplicates from searching for 'coptic' in same row
church = church.drop_duplicates()
First 5 rows of concatenated ‘church’ data frame.

Now that our data is ready I’ll analyze the data frame to answer 5 important questions: who? what? where? when? and why?

Who?


To understand who the Copts are from these incidents, I looked into 5 variables to analyze their text: summary, motive, target1, propcomment, and scite1. These had valuable information on the incidents, and with a little cleaning up they could provide some insight. I chose to analyze each column as one long string, and I wrote one function to do all the heavy lifting at once for each of the columns. My TextAnalysis function does the following:

  1. Removes punctuation
  2. Tokenizes each word
  3. Filters out stop words
  4. Gets parts of speech
  5. Lemmatizes the strings
  6. Plots results! (with a word cloud for fun :)

import nltk
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from wordcloud import WordCloud

# if using a Jupyter notebook, run this magic in its own cell first:
# %matplotlib inline

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def TextAnalysis(column):
    # Putting all of column content to 1 string
    sentence = church[column].tolist()
    sentence = str(sentence)

    # Initializing punctuations string (raw string so the backslash stays literal)
    punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

    # Removing punctuations in string
    for ele in punc:
        sentence = sentence.replace(ele, "")

    # Tokenizing word to remove stop words
    tokenized_word = word_tokenize(sentence)

    # Stop words, plus nan and al from empty values and beginning of city name
    stop_words = set(stopwords.words('english'))
    stop_words.add('nan')
    stop_words.add('al') # is interpreted as single word from city name Al Arish

    # Filtering sentence to remove stop words
    filtered_sent = []
    for w in tokenized_word:
        if w not in stop_words:
            filtered_sent.append(w)

    # Making filtered sentence single string to lemmatize
    filtered_sent = ' '.join(filtered_sent)

    # Init Lemmatizer
    lemmatizer = WordNetLemmatizer()
    # Lemmatize string with the appropriate POS tag
    sentfinal = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(filtered_sent)]
    # Freq of words in final clean sentence
    fdist = FreqDist(sentfinal)

    # Making sentence a string from a list for word cloud
    str_sentfinal = ' '.join(map(str, sentfinal))

    # Printing top 5 most common words and freq.
    print(fdist.most_common(5))

    fig1 = plt.figure()

    # Plotting top 10 words
    title = "Top Words in {} Column".format(column.capitalize())
    fdist.plot(10, cumulative=False, title=title)
    fdist_fig_name = str(column) + '_fdist.jpg'
    fig1.savefig(fdist_fig_name, bbox_inches="tight")

    # Making word cloud
    wordcloud = WordCloud(width=700, height=700,
                          background_color='white',
                          min_font_size=10).generate(str_sentfinal)
    # plot the WordCloud image
    plt.figure(figsize=(5, 5), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    titlewc = "Word Cloud From {} Column".format(column.capitalize())
    plt.title(titlewc)
    wc_fig_name = str(column) + '_wordcloud.jpg'
    plt.savefig(wc_fig_name, bbox_inches="tight")

# summary, motive, target1, scite1, propcomment
for col in ['summary', 'motive', 'target1', 'scite1', 'propcomment']:
    TextAnalysis(col)
Word cloud of ‘scite1’ column (left) and Frequency Distribution plot of ‘summary’ column (right)

The word cloud for ‘scite1’, the primary citation used to document the incident, stuck out particularly. The prominence of the words kill, church, Egypt, and attack is alarming and indicates the severity of the attacks on Copts, primarily in Egypt.

Additionally, when looking at the top ‘Summary’ words, ‘claimed’ and ‘responsibility’ stuck out to me, so I wanted to see who claimed responsibility.

findword = 'claimed'
numwords = 20
claimed = []
# Making entire column single string to analyze
summs = church['summary'].tolist()
summs = str(summs)
# Collecting the 20 words before and after the word claimed wherever it is found
for i in summs.split('\n'):
    z = i.split(' ')
    for x in [x for (x, y) in enumerate(z) if findword in y]:
        claimed.append(' '.join(z[max(x-numwords,0):x+numwords+1]))
# Printing each context window
for context in claimed:
    print(context)


I was surprised to see that, in most cases, the words preceding ‘claimed’ are ‘no group’.

So I went one step further and looked at when the word ‘however’ was in these incidents:

however = []
other = []
howev = 'however'
# among the contexts containing 'claimed', separating those that also contain 'however'
for i in claimed:
    if howev in i:
        however.append(i)
    else:
        other.append(i)

Many are attributed to Islamic extremist groups.

What?


One question you may be asking is how do we know the attacks are terrorist attacks? Firstly, according to the GTD, the 3 following criteria must be met:

  1. The incident must be intentional — the result of a conscious calculation on the part of a perpetrator. 
  2. The incident must entail some level of violence or immediate threat of violence, including property violence as well as violence against people.
  3. The perpetrators of the incidents must be sub-national actors. The database does not include acts of state terrorism.

Additionally, the GTD defines acts of terrorism with the following criteria:

Criteria from the GTD Codebook.

Here we see that all 3 criteria are met for every attack against Copts:

Histograms of the above 3 criteria, which define an attack as terrorism.
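
The same check can be done numerically rather than by eye. The sketch below is only an illustration: it assumes the three criteria are stored as 0/1 flags in columns named crit1, crit2, and crit3, and it uses a small invented frame (church_demo) in place of the real church data frame.

```python
import pandas as pd

# Hypothetical stand-in for the `church` data frame; only the three
# criterion flags are included and the values are invented.
church_demo = pd.DataFrame({
    "crit1": [1, 1, 1],
    "crit2": [1, 1, 1],
    "crit3": [1, 1, 1],
})

crit_cols = ["crit1", "crit2", "crit3"]

# Share of incidents meeting each criterion (1.0 means every row is flagged)
share_met = church_demo[crit_cols].eq(1).mean()

# True only if every incident meets all three criteria
all_met = bool(church_demo[crit_cols].eq(1).all().all())

print(share_met)
print(all_met)
```

Swapping church_demo for church would run the same check on the real incidents.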

Next we’ll look at the attacks themselves. The most common attacks we see are armed assault, bombings, and kidnappings.
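
A ranking like this can be produced with a simple tally. The following is a minimal sketch, assuming the attack type is stored as text in a column named attacktype1_txt; the tally helper and the church_demo frame are hypothetical, and the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `church`; the labels mimic GTD-style attack
# types but the rows are invented.
church_demo = pd.DataFrame({
    "attacktype1_txt": [
        "Armed Assault", "Bombing/Explosion", "Armed Assault",
        "Hostage Taking (Kidnapping)", "Armed Assault", "Bombing/Explosion",
    ],
})

def tally(frame, column):
    """Count how often each category appears, most frequent first."""
    return frame[column].value_counts(dropna=True)

attack_counts = tally(church_demo, "attacktype1_txt")
print(attack_counts)
```

Calling attack_counts.plot(kind="bar") would turn the tally into a bar chart, and the same helper could be pointed at another categorical column, such as the one recording how responsibility was claimed.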

Where the method of claiming responsibility is known, the claims were mostly announced on a website or by video.

Where?


Most attacks are in Egypt, but there have been significant incidents in Libya. One of the largest attacks was the beheading of 21 men, 20 Egyptians and 1 Ghanaian, who were martyred on the Mediterranean coast in Sirte, Libya². This case seems to be missing from the database.

Interactive Map of Attacks Against Coptic Orthodox Individuals/Churches from 1980–2018

When?


Using a histogram to see the frequency of the attacks by month, I was able to see that most attacks happen in January and December. This is probably because Coptic Orthodox Christians celebrate New Year’s Eve at church, as well as their Christmas, which falls on January 7th.

Upon looking at the most common dates for attacks in these months, I found that they did indeed increase at the end of December and on January 6th and 7th, Christmas Eve and Christmas Day, when Coptic churches pray a midnight liturgy.
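
The month and day counts described above could be computed along these lines. This is a hedged sketch: it assumes the month and day are stored as integers in columns named imonth and iday, and the church_demo frame and its dates are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `church`; all dates are invented.
church_demo = pd.DataFrame({
    "imonth": [1, 1, 12, 4, 12, 1, 5],
    "iday":   [7, 7, 31, 15, 31, 6, 2],
})

# Incidents per calendar month, ordered January to December
by_month = church_demo["imonth"].value_counts().sort_index()

# Most frequent (month, day) pairs within January and December
winter = church_demo[church_demo["imonth"].isin([1, 12])]
by_day = winter.groupby(["imonth", "iday"]).size().sort_values(ascending=False)

print(by_month)
print(by_day.head())
```

Calling by_month.plot(kind="bar") would draw the monthly histogram, and by_day lists the dates that recur most often in the two peak months.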

The next most common months, which are in the spring, may correspond to when Copts pray their Easter liturgies, whose date varies every year.

Additionally, we see an overall increase in attacks from 1980 to 2018; however, this could be due to more attacks being recorded, as news sources have documented these incidents more since the 2000s.

Why?


This is arguably the most important question to ask when studying data of this nature: why would anyone commit these attacks? When we look at the word frequencies and the word cloud for motives, we see they could be attributed to violence as well as religious tensions with a minority group such as the Copts. We also see that the motive is often unknown; it’s hard to ever really say why these attacks happen with the frequency they do.

Frequency Distribution plot of ‘motive’ column (left) and Word cloud of ‘motive’ column (right)

Additionally, I took a look at ransom notes to see if that played a large part.

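# keeping only incidents where a ransom was paid
rans = church[church['ransompaid'] > 0]
# None removes the column-width limit so long text is shown in full
pd.set_option("display.max_colwidth", None)
rans2 = rans[['target1', 'summary', 'motive', 'ransomamt', 'ransompaid']]
rans2   # displays the table in a notebook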
Incidents with ransom note (-99 indicates an unknown value)

These 2 targets may have been wealthy, which could explain why a ransom note was attached to their cases, but this doesn’t seem common: these are the only 2 cases in the data where a ransom note or paid ransom amount was included.


The Copts have a long history of persecution, and it would be contemptible to try to box the bloodshed into a numerical analysis. Behind each row is the name and face of someone who was unjustly targeted; this persecution should not go unrecognized.

The goal of this article was to draw an understanding of the documented attacks against the Coptic Orthodox Church found in the GTD and bring awareness to their persecution using a number of libraries in Python.

An organization that encourages, aids, and uplifts Copts, and Christians overall, affected by this oppression is Take Heart. You can donate to their cause here.

[1]: The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world since 1970 (currently updated through 2018).

[2]: More information on the kidnapping of the 21 martyrs can be found here.


MS Data Science @ UVA | powered by coffee and good company