One of my goals for this year is to read the Bible cover-to-cover. At the same time, I'm learning about natural language processing for a project at work. So I got curious about the word usage per book, and particularly how that word usage evolves from book to book. In this blog post I share the word clouds and how I generated them.
All the code to generate the word clouds can be found in this repo. It consists of two parts:
- Getting the word counts
- Generating the word clouds
I’m reading the bible in Dutch (Bijbel in Gewone Taal), so the word clouds are also in Dutch ;-)
Get word counts #
To get the word counts, I use the spaCy library with the Dutch pipeline. spaCy is a natural language processing library that can tokenize and classify a text. At first I tried my own tokenization using regexes, but spaCy is way easier and works much better.
Tokenizing the text is then pretty straightforward. Just join all the lines together, pass them through the NLP pipeline, and get the tokens. For the word cloud, I'm only interested in the nouns, proper nouns, adjectives and verbs.
I use the handy collections.Counter class to count the tokens. This gives me a dictionary that looks something like this:
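For example, with a hypothetical token list like the one the tokenization step produces:

```python
from collections import Counter

# Hypothetical output of the tokenization step above.
tokens = ["god", "hemel", "aarde", "god", "hemel", "god"]

counts = Counter(tokens)
# Counter is a dict subclass mapping each word to its frequency:
# Counter({'god': 3, 'hemel': 2, 'aarde': 1})
```

`Counter.most_common(n)` is also handy for a quick sanity check of the top words per book.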
Generate word cloud #
There really is a library for everything: for word clouds I used this one. It's pretty straightforward to use.
What’s left is to just loop over all the books and plot them:
Old Testament #
Below are the word clouds for all books in the Old Testament.
New Testament #
Below are the word clouds for all books in the New Testament.