Guides
Creating WordClouds in Python from a single-column in Pandas dataframe
In this tutorial, you will learn how to create a WordCloud of your own in Python and customise it as you see fit. Specifically, we will focus on how to generate a WorldCloud from a column in Pandas dataframe. This type of visualisation will be quite handy for exploring text data and making your presentation more lively.
What is a WordCloud?
A WorldCloud /Word Cloud (also known as a tag cloud or word art) is a simple visualisation of data, in which words are shown in varying sizes depending on how often they appear in your text/data.
There are many free word cloud generators online that can help you perform text analysis, and spot trends and patterns at a glance. Python is not the only tool capable of creating such visuals. So if you need to make a word cloud visualisation quickly and you are not working with your data in Python, then this tutorial is not for you.
Tutorial structure
In this tutorial, we will use a rose data scraped from Helpmefind.com, but you can use any other data set to learn:
How to create a WorldCloud from a column in Pandas dataframe
Mask your WordCloud into any shape of your choice
WorldCloud install
In order to create a WorldCloud viz in Python you will need to install below packages:
First four packages are data analytics staples, so don't require an introduction.
The pillow
library is a package that enables image reading. You can find a tutorial for pillow here. Pillow is a wrapper for PIL - Python Imaging Library. You will need this library to read in an image as the mask for the WordCloud.
The wordcloud
library is the one responsible for creating WorldClouds. It can be a little tricky to install. If you only need it for plotting a basic WordCloud, then running one of the commands below would be sufficient.
pip install wordcloud
or conda install -c conda-forge wordcloud
for Anaconda-Navigator.
However, if you are after the library with the ability to mask the cloud into any shape of your choice, you will need the latest version that requires a different method of installation.
git clone https://github.com/amueller/word_cloud.git cd word_cloud pip install .
Import Libraries & Create a Dataframe
# Importing libraries import numpy as np import pandas as pd from os import path from PIL import Image from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator import matplotlib.pyplot as plt # Run below comand if you are running IPython % matplotlib inline
Magic command %matplotlib inline will make your plot outputs appear and be stored within the notebook.
Next step is to import your csv file with data and create Pandas dataframe from it.
# Creating dataframe from csv # Dataframe name is df1 df1 = pd.read_csv('Helpmefind_roses_merged_cleanned.csv') # My file did not have titles so I manually assigned them df1.columns = ["Unnamed: 0","NAME","URL","SYNONYM","ARS","ORIG","CLASS","BLOOM","HABIT","GROWING","PARENTAGE","COMMENTS","BRED_BY","INTRODUCED_FULL", "INTRODUCED", "YEAR_CREATED", "BRED_BY_DOB", "ALL_YEARS", "COUNTRY", "EXH_NAME","REGISTR_NAME", "PETALS", "COLOR", "PETALS2", "PETALS3", "ALL_PETALS_BRACKETS", "MAIN_COLOR" ] # Then I removed column "Unnamed: 0" del df1['Unnamed: 0'] # Listing basic info about the file # Just to check that everything imported correctly from the csv file df1.info()
Below you can see info about my dataset:
In this article, I will demonstrate how easy it is to make your own worldcloud visuals from a column in Pandas datafreame. As an example, I will use a column named BLOOM from my dataframe named df1. Below you can see the first ten rows from the dataframe.
How column BLOOM looks like
Create a simple WordCloud visual from a column in Pandas dataframe
In this step, we create two important strings for our WorldCloud generation.
WordCloud.generate(text) method will generate wordcloud from text.
The WordCloud method expects a text file / a string on which it will count the word instances. But we want to use it on Pandas dataframe, so we will need to transform our input for it to work with the WordCloud.generate()
method.
The ‘text’ is the string that will be used to store all the words from your column in a single line of text.
This you can do in the following way:
Firstly, you will need to create a text list of all words in column bloom.
text = " ".join(review for review in df.YOUR_COLUMN_NAME.astype(str))
Secondly, you will need to print how many words are in the text list that you just created from the Pandas column.
print ("There are {} words in the combination of all cells in column YOUR_COLUMN_NAME.".format(len(text)))
Next step is to create a list of stop words.
The ‘stopwords’ list is used to store all the words that are very commonly used in the English language such as ‘the’, ‘a’, ‘an’, ‘in’. Plus you can add any other words that you don't want to go in your world cloud. These words will be later filtered while generating the word cloud.
stopwords = set(STOPWORDS)
If you need to add any custom word exclusions use below:
stopwords.update(["bloom", "rose", "petals", "Average", "diameter","flushes", 'throughout', "Blooms", "form"])
In the above example, I am excluding words: bloom, rose, petals, average, diameter, flushes, throughout, blooms, and form. You will need to replace those with your custom word exclusions.
Now we are finally ready to generate a WorldCloud image. It is time to run
WorldClound().generate(text)
It accepts the following Parameters:
font_path : string - Font path to the font that will be used (OTF or TTF).
width : int(default=400) - Width of the canvas.
heightint : int(default=200) - Height of the canvas.
prefer_horizontal : float (default=0.90) -The ratio of times to try horizontal fitting as opposed to vertical. If prefer_horizontal < 1, the algorithm will try rotating the word if it doesn’t fit. (There is currently no built-in way to get only vertical words.)
mask : nd-array or None (default=None) - If not None, gives a binary mask on where to draw words. If mask is not None, width and height will be ignored and the shape of mask will be used instead. All white (#FF or #FFFFFF) entries will be considerd “masked out” while other entries will be free to draw on. [This changed in the most recent version!]
contour_width: float (default=0) - If mask is not None and contour_width > 0, draw the mask contour.
contour_color: color value (default=”black”) - Mask contour color.
scale : float (default=1) - Scaling between computation and drawing. For large word-cloud images, using scale instead of larger canvas size is significantly faster, but might lead to a coarser fit for the words.
min_font_size : int (default=4) - Smallest font size to use. Will stop when there is no more room in this size.
font_step : int (default=1) - Step size for the font. font_step > 1 might speed up computation but give a worse fit.
max_words : number (default=200) - The maximum number of words.
stopwords : set of strings or None - The words that will be eliminated. If None, the build-in STOPWORDS list will be used. Ignored if using generate_from_frequencies.
background_color : color value (default=”black”) - Background color for the word cloud image.
max_font_size : int or None (default=None) - Maximum font size for the largest word. If None, height of the image is used.
mode : string (default=”RGB”) - Transparent background will be generated when mode is “RGBA” and background_color is None.
relative_scaling : float (default=’auto’) - Importance of relative word frequencies for font-size. With relative_scaling=0, only word-ranks are considered. With relative_scaling=1, a word that is twice as frequent will have twice the size. If you want to consider the word frequencies and not only their rank, relative_scaling around .5 often looks good. If ‘auto’ it will be set to 0.5 unless repeat is true, in which case it will be set to 0.
color_func : callable, default=None - Callable with parameters word, font_size, position, orientation, font_path, random_state that returns a PIL color for each word. Overwrites “colormap”. See colormap for specifying a matplotlib colormap instead. To create a word cloud with a single color, use
color_func=lambda *args, **kwargs: "white"
. The single color can also be specified using RGB code. For examplecolor_func=lambda *args, **kwargs: (255,0,0)
sets color to red.regexp : string or None (optional) - Regular expression to split the input text into tokens in process_text. If None is specified,
r"\w[\w']+"
is used. Ignored if using generate_from_frequencies.collocations : bool, default=True - Whether to include collocations (bigrams) of two words. Ignored if using generate_from_frequencies.
colormap : string or matplotlib colormap, default=”viridis” - Matplotlib colormap to randomly draw colors from for each word. Ignored if “color_func” is specified.
normalize_plurals : bool, default=True - Whether to remove trailing ‘s’ from words. If True and a word appears with and without a trailing ‘s’, the one with trailing ‘s’ is removed and its counts are added to the version without trailing ‘s’ – unless the word ends with ‘ss’. Ignored if using generate_from_frequencies.
repeat : bool, default=False - Whether to repeat words and phrases until max_words or min_font_size is reached.
include_numbers : bool, default=False - Whether to include numbers as phrases or not.
min_word_length : int, default=0 - Minimum number of letters a word must have to be included.
collocation_threshold: int, default=30 - Bigrams must have a Dunning likelihood collocation score greater than this parameter to be counted as bigrams. Default of 30 is arbitrary.
If you want to generate a world cloud with stopwords excluded, with a white background, a width of 800, and a height of 400 then run below.
wordcloud = WordCloud(stopwords=stopwords, background_color="white", width=800, height=400).generate(text)
Please remember that you can specify your look of the word cloud using parameters.
Now that you generated a world cloud, it is time to display it in your notebook using matplotlib.
plt.axis("off")
plt.figure( figsize=40,20)
plt.tight_layout(pad=0)
plt.imshow(wordcloud, interpolation='bilinear')
plt.show()
You may wonder why we use the argument interpolation="bilinear"
in the plt.imshow()
. This is to make the displayed image appear more smoothly.
Let's combine all lines of code that we need to run in order to make a WordCloud vis for column BLOOM.
# Python program to generate WordCloud for column BLOOM # Text of all words in column bloom text = " ".join(review for review in df1.BLOOM.astype(str)) print ("There are {} words in the combination of all cells in column BLOOM.".format(len(text))) # Create stopword list: # remove words that we want to exclude stopwords = set(STOPWORDS) stopwords.update(["bloom", "rose", "petals", "Average", "diameter","flushes", 'throughout', "Blooms", "form"]) # Generate a word cloud image wordcloud = WordCloud(stopwords=stopwords, background_color="white", width=800, height=400).generate(text) # Display the generated image: # the matplotlib way: plt.axis("off") plt.figure( figsize=(40,20)) plt.tight_layout(pad=0) plt.imshow(wordcloud, interpolation='bilinear') plt.show()
Above code will produce the following message and a Worldcloud based on words in column BLOOM inside Dataframe df1.
Mask your wordcloud into any shape of your choice
Now, let's create a rose-shaped word cloud for words in column BLOOM.
Seriously, you can use any shape you want with the WorldCloud method in Python.
To create a shape for your WordCloud, first, you need to find a JPG file to become the mask. Below is a simple rose shape that I found on the internet:
Rose shape for WorldCloud
Not all mask images are the same so expect different outcomes. Some may not work with the WordCloud function. To make sure that your mask works, let's take a look at it in the NumPy array form:
# converting jpg image to a numpy array rose_mask = np.array(Image.open("1599661.jpg")) rose_mask
rose_mask array
The way the masking functions works is that it requires all-white part of the mask should be 255. If you use a JPG file then you should get 255 in your array, while PNG file will populate your array with 0 (integer type). This value represents the "intensity" of the pixel. Values of 255 are pure white, whereas values of 1 are black. Please note that WorldCloud will not work with the PNG image straight away and you will need to transform it in order for it to work correctly.
Rose_mask array is populated with the 255, so everything is ready for the next stage.
Lets' make a wordcloud with the selected shape. Notice in the WordCloud
function, there is a mask
argument that takes in the mask that you created from the JPG image above. The contour_width
and contour_color
are arguments that allow you to adjust the outline characteristics of the cloud.
# Create a word cloud image wc = WordCloud(background_color="white", max_words=100, mask=rose_mask, stopwords=stopwords, contour_width=3, contour_color='green') # Generate a wordcloud wc.generate(text) # store to file wc.to_file("rose_bloom.png") # show plt.figure(figsize=[20,10]) plt.imshow(wc, interpolation='bilinear') plt.axis("off") plt.show()
Hurray! You created a wordcloud in the shape of a rose!
Worldcloud in the shape of a rose
Continue Reading
Apps
Timestripe - my new favourite productivity app
March 5, 2023
Guides
How to scrape tables from websites using Pandas read_html() function
February 2, 2023
Guides
Drop all duplicate rows across multiple columns in Python Pandas
January 28, 2023
Guides
How to create effective prompts for AI image generation
August 15, 2022
Guides
Generate Huge Datasets With Fake Data Easily and Quickly using Python and Faker
April 16, 2022
Guides
How to change or update a specific cell in Python Pandas Dataframe
March 25, 2021
Guides
How to add a row at the top in Pandas dataframe
March 22, 2021
Guides
Python Regex examples - How to use Regex with Pandas
September 9, 2020
Guides
Python regular expressions (RegEx) simple yet complete guide for beginners
September 15, 2020
Guides
8 Python Pandas Value_counts() tricks that make your work more efficient
May 31, 2020
Guides
Exploring Correlation in Python: Pandas, SciPy
May 5, 2020
Guides
How to add new columns to Pandas dataframe?
March 22, 2020
Guides
Delete column/row from a Pandas dataframe using .drop() method
February 2, 2020
Guides
How to visualize data with Matplotlib from a Pandas Dataframe
November 15, 2019
Guides
The ultimate beginners guide to Group by in Python Pandas
August 8, 2019
Guides
Guide to renaming columns with Python Pandas
July 2, 2019
Guides
How to suppress scientific notation in Pandas
July 12, 2019
Guides
The complete beginners guide to Pandas
June 29, 2019
Guides
Data project #1: Stockmarket analysis
June 29, 2019
Guides
Use Jupyter notebooks anywhere
June 10, 2019