Making a Word Cloud from Reddit Comments
We as humans must admit: We can be identified quite uniquely by analyzing the words we use. You are about to read more about how we can visualize the frequencies of words in typed text.
According to an article by The Economist from 2013, the vocabulary of adults in their native language ranges from 20,000 to 35,000 words.
But what do you think happens if an automated tool took note of a writer's favorite words, ranked them by popularity and evaluated the result? Exactly. Doing so allows for some profiling of the writer's opinions, style and perhaps even personality traits. Although I have no experience in that field, I can imagine the examination and comparison of written text being used in criminal investigations as well.
Having built up a decently large comment history on Reddit over the last couple of years, I found myself interested in how much the wording of my Reddit comments would reveal about me.
Fortunately, the Reddit API allows for browsing a user's comments without much difficulty. There are many API wrappers for all kinds of programming languages out there. Following my enthusiasm for Python, I will be using a wrapper called Python Reddit API Wrapper (PRAW).
Data without good visualization can be quite boring. One way to highlight word frequencies is a tag cloud, also called a word cloud. It is essentially a visual front end to a weighted list built from the number of occurrences of each word. The more often a word appears in a text, the larger it is drawn.
Using a Python library like word_cloud by amueller, it's a walk in the park to generate a tag cloud.
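To make the "weighted list" idea concrete, here is a minimal sketch using only the standard library. The function name word_weights and the choice to scale every count against the most frequent word are my own assumptions for illustration; this is not a description of how word_cloud computes its font sizes internally.

```python
import re
from collections import Counter

def word_weights(text, top_n=5):
    """Return the top_n words with weights relative to the most frequent word."""
    # Lowercase first so "Python" and "python" count as the same word
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return []
    max_count = max(counts.values())
    # A weight of 1.0 marks the most frequent word; rarer words get smaller weights
    return [(word, count / max_count) for word, count in counts.most_common(top_n)]

print(word_weights("the cat and the hat and the bat", top_n=2))
```

A renderer would then map each weight to a font size, so the word with weight 1.0 ends up largest.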
Putting the API and library together
Let's start by configuring PRAW. As described in its documentation, you must register your app to access the Reddit API. Registration supplies you with a client ID and a secret; pass these as parameters when initializing PRAW. As for the user_agent, pick a unique, descriptive string.
import praw
# Placeholder credentials: use the ID and secret of your registered app, both as strings
reddit = praw.Reddit(user_agent='reddit-comment-cloud by chmey.com', client_id='2374143735', client_secret='uwalrireg')
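Hard-coding the ID and secret is fine for a quick experiment, but it is easy to publish them by accident. One possible alternative, sketched here under assumptions, is to read them from environment variables. The variable names REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET and REDDIT_USER_AGENT and the helper load_reddit_credentials are hypothetical; they are not defined by PRAW.

```python
import os

def load_reddit_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from a mapping of settings."""
    # Hypothetical variable names, chosen only for this example
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [key for key in required if key not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Fall back to the user agent used in this post
        "user_agent": env.get("REDDIT_USER_AGENT", "reddit-comment-cloud by chmey.com"),
    }
```

With the variables set in your shell, the object could then be created with praw.Reddit(**load_reddit_credentials()).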
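Once created, the reddit object can be used to query the comment history of a Redditor. As far as I know, there's no comments.all(), so I just queried by new and set the limit to None. Keep in mind that Reddit listings are generally capped at roughly the 1,000 most recent items, so very active accounts may not be covered in full. To keep different capitalizations of a word from being counted separately, comment.body is converted to lowercase.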
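text = ""
# username must hold the Redditor's name as a string
for comment in reddit.redditor(name=username).comments.new(limit=None):
    # The trailing space keeps the words of adjacent comments apart
    text += comment.body.lower() + " "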
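A common alternative to growing a string in a loop is to join all comment bodies in one go, with a space as separator so that the last word of one comment never fuses with the first word of the next. The sketch below assumes only that each comment has a body attribute; the helper name collect_text is made up, and SimpleNamespace objects stand in for real PRAW comments so the example runs without network access.

```python
from types import SimpleNamespace

def collect_text(comments):
    """Lowercase every comment body and join them with single spaces."""
    return " ".join(comment.body.lower() for comment in comments)

# Stand-ins for PRAW comment objects, for illustration only
sample = [
    SimpleNamespace(body="Python is great"),
    SimpleNamespace(body="GREAT word clouds"),
]
print(collect_text(sample))  # python is great great word clouds
```

With PRAW, the same helper could be fed the listing directly, for example collect_text(reddit.redditor(name=username).comments.new(limit=None)).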
After compiling the text, we are ready to render the word cloud. It's as simple as creating an instance of the WordCloud class from the module. The class offers some interesting configuration options, four of which are used here. There are many more in the module's documentation.
from wordcloud import WordCloud
wordcloud = WordCloud(width=1280, height=720, margin=0, prefer_horizontal=0.8).generate(text)
For saving the word cloud object to an image, I'm using the Python library matplotlib. The options shown here generate a word cloud without any white margin or borders in decent image quality.
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig(username + '.png', dpi=200, bbox_inches='tight', pad_inches=0)
Finally, you'll find the generated image saved as a .png file in the current working directory.
What's next?
I'm working on a website wrapper for the module so everyone can generate their word cloud from the web!
For fun, I'm rendering a word cloud out of this post and leaving it here! 😜