Let there be word clouds: most common words per Bible book

2023-08-20

One of my goals for this year is to read the Bible cover to cover. At the same time, I'm learning about natural language processing for a project at work. So I got curious about the word usage per book, and particularly about how it evolves from book to book. In this blog post I share the word clouds and how I generated them.

All the code to generate the word clouds can be found in this repo. It consists of 2 parts:

Getting the word counts
Generating the word clouds

I'm reading the Bible in Dutch (Bijbel in Gewone Taal), so the word clouds are also in Dutch ;-)

Get word counts

To get the word counts, I use the spaCy library with its Dutch pipeline. spaCy is a natural language processing library that can tokenize a text and tag each token with its part of speech. At first I tried my own tokenization using regexes, but this is way easier and much better.

import spacy
nlp = spacy.load('nl_core_news_sm')
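
For comparison, a regex-only approach could look roughly like the sketch below. It is a minimal illustration on a short sample sentence, not the original attempt, and the helper name regex_tokens is made up for this example.

```python
import re
from collections import Counter

def regex_tokens(text):
    # \w+ matches runs of letters, digits and underscores (Unicode-aware in Python 3)
    return re.findall(r'\w+', text.lower())

sample = 'In het begin maakte God de hemel en de aarde.'
counts = Counter(regex_tokens(sample))
print(counts.most_common(1))
```

The most frequent token here is the article "de". A regex has no notion of word classes, so function words like this dominate the counts unless you maintain a stop word list by hand. Part-of-speech tags make them easy to drop.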

Tokenizing the text is then pretty straightforward: join all the lines together, pass the result through the nlp pipeline, and get the tokens. For the word cloud, I'm only interested in the nouns, proper nouns, adjectives and verbs.

from collections import Counter

poss = ('NOUN', 'PROPN', 'ADJ', 'VERB')

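def get_counts(book):
    # df is assumed to be a pandas DataFrame with a 'book' and a 'text' column
    text = ' '.join(df[df['book'] == book]['text'])
    doc = nlp(text)
    # keep only the tokens whose part-of-speech tag is in poss
    tokens = (token.text for token in doc if token.pos_ in poss)
    return Counter(tokens)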

I use the handy collections.Counter class to count the tokens. This gives me a dictionary-like object, and its most common entries look something like this:

>>> c = get_counts('Genesis')
>>> c.most_common(10)
[('zei', 366),
 ('Jakob', 246),
 ('God', 223),
 ('Jozef', 189),
 ('Heer', 152),
 ('ging', 139),
 ('Abraham', 138),
 ('zoon', 131),
 ('land', 119),
 ('vader', 119)]
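
To see the filter-and-count step on its own, without loading a spaCy model, here is a minimal sketch. The (word, tag) pairs are hand-written stand-ins for spaCy tokens, and the helper name count_content_words is made up for this illustration.

```python
from collections import Counter

POSS = ('NOUN', 'PROPN', 'ADJ', 'VERB')

# Hand-tagged (text, pos) pairs standing in for spaCy tokens (made-up sample)
tagged = [
    ('God', 'PROPN'), ('zei', 'VERB'), (':', 'PUNCT'),
    ('Er', 'ADV'), ('moet', 'AUX'), ('licht', 'NOUN'), ('komen', 'VERB'),
    ('.', 'PUNCT'), ('En', 'CCONJ'), ('God', 'PROPN'), ('zei', 'VERB'),
]

def count_content_words(pairs, keep=POSS):
    # Count only the words whose part-of-speech tag is in `keep`
    return Counter(text for text, pos in pairs if pos in keep)

counts = count_content_words(tagged)
print(counts.most_common(2))
```

Punctuation, auxiliaries and conjunctions never reach the Counter, so only content words end up in the cloud.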

Generate word cloud

There really is a library for everything. For word clouds I used the wordcloud library, which is pretty straightforward to use.

import wordcloud as wc

def get_wordcloud(counts, colormap='rainbow'):
    wordcloud = wc.WordCloud(
        width=800,
        height=500,
        colormap=colormap,
        background_color=None,
        mode='RGBA',
    ).generate_from_frequencies(counts)

    return wordcloud

What's left is to just loop over all the books and plot them:

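import matplotlib.pyplot as plt

for book in books:
    counts = get_counts(book)
    wordcloud = get_wordcloud(counts)

    plt.imshow(wordcloud)
    plt.axis('off')  # hide the axis ticks around the image
    plt.show()  # draw this book's cloud before moving on to the next one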
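
The clouds below are laid out three per row. One way to get such a layout with matplotlib is sketched here. The helpers grid_shape and plot_grid are hypothetical names, and the figure size is an arbitrary choice, so treat this as a starting point rather than the exact code behind the images.

```python
import math

def grid_shape(n_items, ncols=3):
    """Return (nrows, ncols) needed to fit n_items in a grid."""
    return math.ceil(n_items / ncols), ncols

def plot_grid(clouds, titles, ncols=3):
    """Draw each word cloud in its own grid cell, titled with the book name."""
    # Imported here so grid_shape can be used without matplotlib installed
    import matplotlib.pyplot as plt

    nrows, ncols = grid_shape(len(clouds), ncols)
    # squeeze=False keeps axes two-dimensional, even for a single row
    fig, axes = plt.subplots(nrows, ncols,
                             figsize=(4 * ncols, 2.5 * nrows), squeeze=False)
    for ax in axes.flat:
        ax.axis('off')  # hide ticks, also in cells that stay empty
    for ax, cloud, title in zip(axes.flat, clouds, titles):
        ax.imshow(cloud)
        ax.set_title(title)
    return fig

print(grid_shape(39))  # number of rows and columns for 39 books
```

With 39 books in the Old Testament and 27 in the New Testament, both grids fill up completely at three columns.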

Old Testament

Below are the word clouds for all books in the Old Testament.


Genesis	Exodus	Leviticus
Numeri	Deuteronomium	Jozua
Richteren	Ruth	1-Samuel
2-Samuel	1-Koningen	2-Koningen
1-Kronieken	2-Kronieken	Ezra
Nehemia	Esther	Job
Psalmen	Spreuken	Prediker
Hooglied	Jesaja	Jeremia
Klaagliederen	Ezechiël	Daniël
Hosea	Joël	Amos
Obadja	Jona	Micha
Nahum	Habakuk	Sefanja
Haggai	Zacharia	Maleachi

New Testament

Below are the word clouds for all books in the New Testament.


Matteüs	Marcus	Lucas
Johannes	Handelingen	Romeinen
1-Korintiërs	2-Korintiërs	Galaten
Efeziërs	Filippenzen	Kolossenzen
1-Tessalonicenzen	2-Tessalonicenzen	1-Timoteüs
2-Timoteüs	Titus	Filemon
Hebreeën	Jakobus	1-Petrus
2-Petrus	1-Johannes	2-Johannes
3-Johannes	Judas	Openbaring