Let there be word clouds: most common words per Bible book

2023-08-20

One of my goals for this year is to read the Bible cover to cover. At the same time, I'm learning about natural language processing for a project at work. That made me curious about the word usage per book, and in particular how it evolves from book to book. In this blog post I share the word clouds and how I generated them.

All the code to generate the word clouds can be found in this repo. It consists of 2 parts:

  1. Getting the word counts
  2. Generating the word clouds

I'm reading the Bible in Dutch (Bijbel in Gewone Taal), so the word clouds are also in Dutch ;-)

Get word counts

To get the word counts, I use the spaCy library with the Dutch pipeline. spaCy is a natural language processing library that can tokenize a text and tag each token, for example with its part of speech. At first I tried my own tokenization using regexes, but spaCy is far easier and gives much better results.

import spacy
nlp = spacy.load('nl_core_news_sm')

Tokenizing the text is then pretty straightforward: join all the lines together, pass the result through the nlp pipeline, and get the tokens. For the word cloud, I'm only interested in the nouns, proper nouns, adjectives and verbs.

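from collections import Counter

# Part-of-speech tags to keep: nouns, proper nouns, adjectives and verbs
poss = ('NOUN', 'PROPN', 'ADJ', 'VERB')

def get_counts(book):
    # df is assumed to be a pandas DataFrame with 'book' and 'text' columns
    text = ' '.join(df[df['book'] == book]['text'])
    doc = nlp(text)
    # Keep only the tokens whose part-of-speech tag is in poss
    tokens = (token.text for token in doc if token.pos_ in poss)
    return Counter(tokens)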

I use the handy collections.Counter class to count the tokens. This gives me a dictionary-like object that maps each word to its count. It looks something like this:

>>> c = get_counts('Genesis')
>>> c.most_common(10)
[('zei', 366),
 ('Jakob', 246),
 ('God', 223),
 ('Jozef', 189),
 ('Heer', 152),
 ('ging', 139),
 ('Abraham', 138),
 ('zoon', 131),
 ('land', 119),
 ('vader', 119)]
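To try the filter-and-count step without loading the spaCy model, here is a minimal sketch. The hand-tagged (word, tag) pairs and the count_tagged helper are made up for illustration; they are not part of the repo.

```python
from collections import Counter

# Hypothetical, hand-tagged (word, part-of-speech) pairs that stand in
# for the tokens spaCy would produce.
tagged = [
    ('God', 'PROPN'), ('zei', 'VERB'), ('tegen', 'ADP'), ('Abraham', 'PROPN'),
    ('en', 'CCONJ'), ('Abraham', 'PROPN'), ('zei', 'VERB'), ('het', 'DET'),
    ('land', 'NOUN'),
]

poss = ('NOUN', 'PROPN', 'ADJ', 'VERB')

def count_tagged(pairs, keep=poss):
    """Count only the words whose tag is in `keep`."""
    return Counter(word for word, pos in pairs if pos in keep)

counts = count_tagged(tagged)
print(counts.most_common(2))
```

Function words such as 'tegen', 'en' and 'het' are dropped because their tags are not in poss. get_counts does the same thing, except that it reads token.text and token.pos_ from the spaCy tokens.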

Generate word cloud

There really is a library for everything. For word clouds I used the wordcloud library, which is pretty straightforward to use.

import wordcloud as wc

def get_wordcloud(counts, colormap='rainbow'):
    wordcloud = wc.WordCloud(
        width=800,
        height=500,
        colormap=colormap,
        background_color=None,
        mode='RGBA',
    ).generate_from_frequencies(counts)

    return wordcloud
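generate_from_frequencies takes a mapping from word to frequency, so a Counter can be passed in directly. If you want to see exactly which words end up in the cloud, one option is to trim the counts first. The top_n helper below is a hypothetical addition, not something from the repo, and WordCloud's own max_words parameter can limit the number of words as well.

```python
from collections import Counter

def top_n(counts, n=50):
    """Return a plain dict with only the n most frequent words."""
    return dict(Counter(counts).most_common(n))

# A few of the Genesis counts shown earlier, used as sample input.
example = Counter({'zei': 366, 'Jakob': 246, 'God': 223, 'Jozef': 189})
trimmed = top_n(example, n=2)
print(trimmed)
```

The trimmed dict can then be passed to get_wordcloud in place of the full Counter.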

What's left is to loop over all the books and plot them:

import matplotlib.pyplot as plt

for book in books:
    counts = get_counts(book)
    wordcloud = get_wordcloud(counts)

    plt.imshow(wordcloud)
    plt.axis('off')  # hide the axes around the image
    plt.show()

Old Testament

Below are the word clouds for all books in the Old Testament.

Genesis
Exodus
Leviticus
Numeri
Deuteronomium
Jozua
Richteren
Ruth
1-Samuel
2-Samuel
1-Koningen
2-Koningen
1-Kronieken
2-Kronieken
Ezra
Nehemia
Esther
Job
Psalmen
Spreuken
Prediker
Hooglied
Jesaja
Jeremia
Klaagliederen
Ezechiël
Daniël
Hosea
Joël
Amos
Obadja
Jona
Micha
Nahum
Habakuk
Sefanja
Haggai
Zacharia
Maleachi

New Testament

Below are the word clouds for all books in the New Testament.

Matteüs
Marcus
Lucas
Johannes
Handelingen
Romeinen
1-Korintiërs
2-Korintiërs
Galaten
Efeziërs
Filippenzen
Kolossenzen
1-Tessalonicenzen
2-Tessalonicenzen
1-Timoteüs
2-Timoteüs
Titus
Filemon
Hebreeën
Jakobus
1-Petrus
2-Petrus
1-Johannes
2-Johannes
3-Johannes
Judas
Openbaring