Let there be word clouds: most common words per Bible book
2023-08-20
One of my goals for this year is to read the Bible cover to cover. At the same time, I'm learning about natural language processing for a project at work. So I got curious about the word usage per book, and particularly how it evolves from book to book. In this blog post I share the word clouds and how I generated them.
All the code to generate the word clouds can be found in this repo. It consists of 2 parts:
- Getting the word counts
- Generating the word clouds
I'm reading the Bible in Dutch (Bijbel in Gewone Taal), so the word clouds are also in Dutch ;-)
Get word counts
To get the word counts, I use the spaCy library with the Dutch pipeline. SpaCy is a natural language processing library that can tokenize and classify a text. At first I tried my own tokenization using regexes, but this is way easier and much better.
import spacy
nlp = spacy.load('nl_core_news_sm')
Tokenizing the text is then pretty straightforward. Just join all the lines together, pass the result through the nlp pipeline, and get the tokens. For the word cloud, I'm only interested in the nouns, proper nouns, adjectives and verbs.
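from collections import Counter

# Part-of-speech tags to keep: nouns, proper nouns, adjectives and verbs
poss = ('NOUN', 'PROPN', 'ADJ', 'VERB')

def get_counts(book):
    # df is a pandas DataFrame with a 'book' and a 'text' column (one row per line of text)
    text = ' '.join(df[df['book'] == book]['text'])
    doc = nlp(text)
    # Keep only the tokens whose part-of-speech tag is in poss
    tokens = (token.text for token in doc if token.pos_ in poss)
    return Counter(tokens)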
I use the handy collections.Counter class to count the tokens. It behaves like a dictionary, and its most_common method returns the top words with their counts:
>>> c = get_counts('Genesis')
>>> c.most_common(10)
[('zei', 366),
 ('Jakob', 246),
 ('God', 223),
 ('Jozef', 189),
 ('Heer', 152),
 ('ging', 139),
 ('Abraham', 138),
 ('zoon', 131),
 ('land', 119),
 ('vader', 119)]
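To make the filter-and-count step concrete without loading a spaCy model, here is a minimal sketch on a hand-made list of (word, part-of-speech) pairs. The sample tokens and the `count_tokens` helper are hypothetical and only there for illustration. In the real code the tags come from spaCy's `token.pos_`.

```python
from collections import Counter

# Same part-of-speech filter as in get_counts: nouns, proper nouns, adjectives, verbs.
POSS = ('NOUN', 'PROPN', 'ADJ', 'VERB')

# Hypothetical, hand-tagged sample; real tags would come from spaCy's token.pos_.
SAMPLE = [
    ('God', 'PROPN'), ('zei', 'VERB'), ('tegen', 'ADP'), ('Jakob', 'PROPN'),
    ('en', 'CCONJ'), ('Jakob', 'PROPN'), ('zei', 'VERB'), ('het', 'DET'),
    ('land', 'NOUN'), ('zei', 'VERB'),
]

def count_tokens(tagged, poss=POSS):
    """Count the words whose part-of-speech tag is in poss."""
    return Counter(text for text, pos in tagged if pos in poss)

counts = count_tokens(SAMPLE)
print(counts.most_common(2))  # [('zei', 3), ('Jakob', 2)]
```

Function words like 'tegen', 'en' and 'het' are dropped by the filter, which is why they never show up in the word clouds.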
Generate word cloud
There really is a library for everything. For word clouds I used this one, the wordcloud package, and it's pretty straightforward to use.
import wordcloud as wc

def get_wordcloud(counts, colormap='rainbow'):
    wordcloud = wc.WordCloud(
        width=800,
        height=500,
        colormap=colormap,
        background_color=None,
        mode='RGBA',
    ).generate_from_frequencies(counts)
    return wordcloud
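`generate_from_frequencies` takes a mapping from word to frequency, and since a `Counter` is a dict subclass it can be passed in as is. If you'd rather trim the counts yourself before plotting, something like the sketch below would do it. The `top_words` helper is hypothetical and not part of the repo. `WordCloud` also has its own `max_words` parameter for the same purpose.

```python
from collections import Counter

def top_words(counts, n=100):
    """Return a plain dict with the n most frequent words and their counts."""
    return dict(Counter(counts).most_common(n))

# A few of the Genesis counts from the output in the previous section, used as sample data.
genesis = Counter({'zei': 366, 'Jakob': 246, 'God': 223, 'Jozef': 189})
trimmed = top_words(genesis, n=2)
print(trimmed)  # {'zei': 366, 'Jakob': 246}
```

The trimmed dict can then be handed to `get_wordcloud` in place of the full counts.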
What's left is to just loop over all the books and plot them:
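import matplotlib.pyplot as plt

# books is assumed to be the list of book names, in reading order
for book in books:
    counts = get_counts(book)
    wordcloud = get_wordcloud(counts)
    plt.figure()  # one figure per book, so the clouds don't overwrite each other
    plt.imshow(wordcloud)
    plt.axis('off')  # hide the axis ticks around the image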
Old Testament
Below are the word clouds for all books in the Old Testament.
New Testament
Below are the word clouds for all books in the New Testament.