Dark Knight Word Cloud

3 minute read

Fun WordCloud Project

This project shows what can be done with Natural Language Processing, data visualization and a little bit of Python magic. For this quick demonstration I am using the script from a Batman movie in the form of a text (.txt) document. I’ll also be using an image of Batman’s symbol to form the shape of the word cloud. All files can be found in the following GitHub repository click here.

Jupyter Lab

My IDE of choice is Jupyter Lab. In order to have the maximum amount of fun with this project please make sure that you have the following libraries installed:

  • wordcloud: This is how we will create the word cloud.
  • matplotlib: This is the library that shows the word cloud.

I will also be using numpy to render the color of the image and PIL to load our image; These are Python native libraries therefore, I didn’t have to install these.

Step 1: Import dependencies

As you are probably very familiar with if you have used Python in any capacity, the first step of any project is to import the libraries that you’ll be using. This is an industry best practice and a great way to ensure that you don’t forget any tools that you’ll need. Here’s an example of how everything should look:

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt #to display our wordcloud
from PIL import Image #to load our image
import numpy as np #to get the color of our image

Note: I recommend separating each step into their own cells in Jupyter Lab. This allows you to know exactly where errors are happening. If you get an error at this step it is likely because of a library not being installed.

Step 2: Import script and set stop words

After setting up our library toolkit, we are now ready to import the script (batman.txt) and set up stop words. You’re probably asking yourself what stop words are at this point. Stop words are the words that usually occur frequently in text that don’t add any significant meaning to the article or document like “I”, “am”, or “the”. The wordcloud is going to show the most frequent words in our Batman script and we don’t want insignificant words to “pollute” our work of art. Notice that in the code snippet above we imported “STOPWORDS” with WordCloud. This gives us access to wordcloud’s built-in stopwords library. There is an option to add our own words to the existing library but for the purposes of this project the existing library is good enough. This is what the code should look like:

text = open('batman.txt', 'r').read()
stopwords = set(STOPWORDS)

Note: You may notice that next to the batman.txt file there is an “r”. This is to utilize the read function.

Step 3: Customize appearance

This is the part of the project that I enjoyed the most. We are setting up the appearance of our wordcloud. The first thing that we need to do is set up a custom mask. This will allow us to force the word cloud into the shape of our chosen image which happens to be the batman logo. In order to do this we create a variable named “custom_mask” and we call the numpy array function to assist us in opening our batman symbol.

Next we instantiate our wordcloud by creating a variable named “wc”, setting it equal to WordCloud and filling in the required parameters of “background_color”, “stopwords”, and “mask”. We then actually create the word cloud by using the “generate” function and passing in the “text” variable that we made earlier. This is how the code should look:

custom_mask = np.array(Image.open('batman.png')) 
wc = WordCloud(background_color = 'black',
               stopwords = stopwords,
               mask = custom_mask,)

wc.generate(text)

Plotting

Now we get to see the fruit of our labor. We use the pyplot library to show our word cloud. Notice that the axis has been turned off. This is because we aren’t plotting a graph.

plt.imshow(wc, interpolation = 'bilinear')
plt.axis('off')
plt.show()

Saving the image

Finally, we save our file to a .png image for later viewing. I hope that you found this to be helpful. Thank you for reading and please look at my other projects.

wc.to_file('Batman_wordcloud_black.png')

Final Product