Let there be word clouds: most common words per Bible book

Word cloud for the entire Bible.

One of my goals for this year is to read the Bible cover to cover. At the same time, I’m learning about natural language processing for a project at work. So I got curious about the word usage per book, and in particular how it evolves from book to book. In this blog post I share the word clouds and how I generated them.

All the code to generate the word clouds can be found in this repo. It consists of 2 parts:

  1. Getting the word counts
  2. Generating the word clouds

I’m reading the Bible in Dutch (Bijbel in Gewone Taal), so the word clouds are also in Dutch ;-)

Get word counts

To get the word counts, I use the spaCy library with the Dutch pipeline. spaCy is a natural language processing library that can tokenize a text and tag each token, for example with its part of speech. At first I tried my own tokenization using regexes, but spaCy is far easier and gives much better results.

import spacy
nlp = spacy.load('nl_core_news_sm')
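
For comparison, a regex-only tokenizer might look like the sketch below. It is an illustration rather than the original attempt; the function name `regex_counts` and the sample sentence are made up.

```python
import re
from collections import Counter

def regex_counts(text):
    """Count lowercase word tokens found with a simple regex."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware in Python 3)
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

# Made-up sample sentence, not a quote from the translation
sample = "De man zei dat de zoon naar het land ging."
print(regex_counts(sample).most_common(1))  # [('de', 2)]
```

The regex has no idea what kind of word each token is, so function words such as "de" end up on top. That is the gap the part-of-speech tags from spaCy fill.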

Tokenizing the text is then pretty straightforward: join all the lines together, pass the result through the nlp pipeline, and iterate over the tokens. For the word cloud, I’m only interested in the nouns, proper nouns, adjectives and verbs.

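from collections import Counter

# Part-of-speech tags to keep: nouns, proper nouns, adjectives and verbs
poss = ('NOUN', 'PROPN', 'ADJ', 'VERB')

def get_counts(book):
    # df is assumed to be a pandas DataFrame with 'book' and 'text' columns
    text = ' '.join(df[df['book'] == book]['text'])
    doc = nlp(text)
    # Keep only the tokens whose part-of-speech tag is in poss
    tokens = (token.text for token in doc if token.pos_ in poss)
    return Counter(tokens)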
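
One variation worth considering: counting `token.text` treats inflected forms such as "zei" and "zegt" as separate words. spaCy tokens also expose a `lemma_` attribute, and counting that instead would merge them. The sketch below uses hand-tagged stand-in tokens so it runs without the Dutch model; `Token` and `count_lemmas` are hypothetical names, and how well the small Dutch model lemmatizes in practice is not something checked here.

```python
from collections import Counter, namedtuple

# Stand-in for a spaCy token: real tokens expose the same three attributes.
Token = namedtuple("Token", ["text", "lemma_", "pos_"])

POSS = ("NOUN", "PROPN", "ADJ", "VERB")

def count_lemmas(tokens, poss=POSS):
    """Count lemmas instead of surface forms for tokens with a wanted POS tag."""
    return Counter(t.lemma_ for t in tokens if t.pos_ in poss)

# Hand-tagged example tokens (assumed values, not model output)
tokens = [
    Token("Hij", "hij", "PRON"),
    Token("zei", "zeggen", "VERB"),
    Token("en", "en", "CCONJ"),
    Token("zegt", "zeggen", "VERB"),
    Token("Jakob", "Jakob", "PROPN"),
]
print(count_lemmas(tokens))  # Counter({'zeggen': 2, 'Jakob': 1})
```

Because the function only reads `lemma_` and `pos_`, it should also accept a real spaCy `Doc` in place of the list.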

I use the handy collections.Counter class to count the tokens. This gives me a dictionary-like object that looks something like this:

>>> c = get_counts('Genesis')
>>> c.most_common(10)
[('zei', 366),
 ('Jakob', 246),
 ('God', 223),
 ('Jozef', 189),
 ('Heer', 152),
 ('ging', 139),
 ('Abraham', 138),
 ('zoon', 131),
 ('land', 119),
 ('vader', 119)]
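
Counters can also be added together, so per-book counts can be combined into one overall count, for example for a cloud covering several books at once. A small sketch with made-up numbers (`counts_per_book` is a hypothetical name, not something from the repo):

```python
from collections import Counter

# Toy per-book counts; the numbers are made up for illustration.
counts_per_book = {
    "Genesis": Counter({"God": 5, "land": 3, "zoon": 2}),
    "Exodus": Counter({"God": 4, "volk": 6}),
}

# Counter.update adds counts to existing keys rather than replacing them.
total = Counter()
for counts in counts_per_book.values():
    total.update(counts)

print(total.most_common(2))  # [('God', 9), ('volk', 6)]
```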

Generate word cloud

There really is a library for everything; for word clouds I used the wordcloud package. It’s pretty straightforward to use.

import wordcloud as wc

def get_wordcloud(counts, colormap='rainbow'):
    wordcloud = wc.WordCloud(
        width=800,
        height=500,
        colormap=colormap,
        background_color=None,
        mode='RGBA',
    ).generate_from_frequencies(counts)

    return wordcloud
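
If very common words such as "zei" or "ging" crowd out the more interesting ones, they can be dropped from the counts before generating the cloud. As far as I know, the `stopwords` parameter of `WordCloud` is ignored by `generate_from_frequencies`, so the filtering has to happen on the counts themselves. A minimal sketch; `drop_words` is a hypothetical helper, and the counts are a few of the Genesis values shown earlier:

```python
from collections import Counter

def drop_words(counts, exclude):
    """Return a new Counter without the words in `exclude` (case-insensitive)."""
    exclude = {w.lower() for w in exclude}
    return Counter({w: n for w, n in counts.items() if w.lower() not in exclude})

counts = Counter({"zei": 366, "Jakob": 246, "God": 223, "ging": 139})
filtered = drop_words(counts, ["zei", "ging"])
print(filtered.most_common())  # [('Jakob', 246), ('God', 223)]
```

The result is still a Counter, so it can be passed to `get_wordcloud` unchanged.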

What’s left is to just loop over all the books and plot them:

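import matplotlib.pyplot as plt

for book in books:
    counts = get_counts(book)
    wordcloud = get_wordcloud(counts)

    plt.figure()  # new figure per book, so each cloud gets its own plot
    plt.imshow(wordcloud)
    plt.axis('off')  # hide the axes around the image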

Old Testament

Below are the word clouds for all books in the Old Testament.

Old Testament
Genesis
Exodus
Leviticus
Numeri
Deuteronomium
Jozua
Richteren
Ruth
1-Samuel
2-Samuel
1-Koningen
2-Koningen
1-Kronieken
2-Kronieken
Ezra
Nehemia
Esther
Job
Psalmen
Spreuken
Prediker
Hooglied
Jesaja
Jeremia
Klaagliederen
Ezechiël
Daniël
Hosea
Joël
Amos
Obadja
Jona
Micha
Nahum
Habakuk
Sefanja
Haggai
Zacharia
Maleachi

New Testament

Below are the word clouds for all books in the New Testament.

New Testament
Matteüs
Marcus
Lucas
Johannes
Handelingen
Romeinen
1-Korintiërs
2-Korintiërs
Galaten
Efeziërs
Filippenzen
Kolossenzen
1-Tessalonicenzen
2-Tessalonicenzen
1-Timoteüs
2-Timoteüs
Titus
Filemon
Hebreeën
Jakobus
1-Petrus
2-Petrus
1-Johannes
2-Johannes
3-Johannes
Judas
Openbaring