Text mining methods let us highlight the most frequently used keywords in a body of text. One way to do this is a word cloud, also referred to as a text cloud or tag cloud, which is a visual representation of text data.
Creating a word cloud in R is simple once you know the steps. The text mining package (tm) and the word cloud generator package (wordcloud) help us analyze texts and quickly visualize the keywords as a word cloud.
3 reasons you should use word clouds to present your text data
- Word clouds add simplicity and clarity. The most frequently used keywords stand out better in a word cloud.
- Word clouds are a potent communication tool. They are easy to understand, easy to share, and impactful.
- Word clouds are more visually engaging than a table of data.
Steps to create a word cloud in R
- Create a text file
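- Copy and paste the text into a plain text file (e.g. genesis.txt)
- Save the file in a folder named data next to your R script; the import code in the text mining step reads every file in that folder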
- Install and load required libraries
# Install
install.packages("tm")            # for text mining
install.packages("SnowballC")     # for text stemming
install.packages("wordcloud")     # word-cloud generator
install.packages("RColorBrewer")  # color palettes
install.packages("rstudioapi")    # rstudio environment reference
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("rstudioapi")
- Text mining
- Load text
The text is loaded using the Corpus() function from the text mining (tm) package. A corpus is a collection of documents (in our case, it is built from a single text file).
- We start by importing the text file created in Step 1
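# Set the working directory to the folder of the script open in RStudio
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# Read the lines of every file found in the data subfolder
filePath <- paste(getwd(), "/data", sep = "")
files <- as.character(list.files(path = filePath))
data <- unname(sapply(paste(filePath, .Platform$file.sep, files, sep = ""), readLines))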
- Load the data as a corpus
dataCorpus <- Corpus(VectorSource(data))
- Inspect the content of the document
inspect(dataCorpus)
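To see how Corpus() and VectorSource() treat their input, the following minimal sketch builds a corpus from a small in-memory character vector instead of a file. The sample sentences and variable names are made up for illustration, and the sketch assumes the tm package is installed.

```r
library(tm)

# Hypothetical sample input: two short strings standing in for lines of a file
sample_lines <- c("In the beginning there was text.",
                  "Then the text was turned into a corpus.")

# VectorSource() treats each element of the vector as one document
sample_corpus <- Corpus(VectorSource(sample_lines))

# Number of documents in the corpus
n_docs <- length(sample_corpus)
print(n_docs)

# Text of the first document
first_doc <- content(sample_corpus[[1]])
print(first_doc)
```

Because each element becomes its own document, a file read line by line with readLines() may end up as one document per line rather than one document per file.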
- Text transformation
Transformations are applied with the tm_map() function, for example to replace special characters in the text with spaces.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
dataCorpus1 <- tm_map(dataCorpus, toSpace, "/")
dataCorpus2 <- tm_map(dataCorpus1, toSpace, "@")
dataCorpus3 <- tm_map(dataCorpus2, toSpace, "\\|")
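The replacement that toSpace performs can be tried on a plain string with base R alone. This is only a sketch: to_space is a hypothetical stand-in for the function wrapped by content_transformer(), and the sample string is invented.

```r
# Hypothetical plain-string version of the function wrapped by content_transformer()
to_space <- function(x, pattern) gsub(pattern, " ", x)

# Invented sample text containing the three characters replaced in the tutorial
sample_text <- "mail user@example.com, see docs/intro | notes"

step1 <- to_space(sample_text, "/")
step2 <- to_space(step1, "@")
# "|" means "or" in a regular expression, so it is escaped to match it literally
step3 <- to_space(step2, "\\|")
print(step3)
```

The pattern is treated as a regular expression, which is why the pipe character needs the double backslash while "/" and "@" do not.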
- Cleaning the text
The tm_map() function is also used to remove unnecessary white space, to convert the text to lower case, and to remove common stopwords such as “the” and “we”.
Stopwords carry almost no information because they are so common in a language, so it is useful to remove them before further analysis. For stopwords(), the supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish. Language names are case sensitive.
You can also remove numbers and punctuation by passing the removeNumbers and removePunctuation functions to tm_map().
Another important preprocessing step is stemming, which reduces words to their root form. In other words, it strips suffixes so that related words share a common origin. For example, stemming reduces the words “moving” and “moved” to the root word “move”; whether a derived form such as “movement” is reduced as well depends on the stemming algorithm.
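As a small illustration of stopword removal, the sketch below applies removeWords() and stopwords() from tm directly to a single string. The sample sentence is invented, and the white space is tidied with base R rather than with a tm function.

```r
library(tm)

# Hypothetical sample sentence
sentence <- "we saw the light and it was good"

# removeWords() deletes every listed word; stopwords("english") supplies the list
without_stopwords <- removeWords(sentence, stopwords("english"))

# Deleting words leaves gaps, so collapse repeated spaces and trim the ends
cleaned <- trimws(gsub("\\s+", " ", without_stopwords))
print(cleaned)
```

Printing stopwords("english") on its own is a quick way to check which words would be dropped before applying the list to a whole corpus.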
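To check what a stemmer actually produces, the words can be passed to wordStem() from SnowballC directly. This is a minimal sketch that assumes the SnowballC package is installed; the variable names are illustrative.

```r
library(SnowballC)

# Words from the example in the text
words <- c("moving", "moved", "movement")

# Stem each word with the English Snowball stemmer
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))
```

Stemmers work by rule, so related words are not always collapsed to exactly the same root; printing the result is the easiest way to see what a given stemmer does.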
The R code below can be used to clean your text:
# Convert the text to lower case
dataCorpus4 <- tm_map(dataCorpus3, content_transformer(tolower))
# Remove numbers
dataCorpus5 <- tm_map(dataCorpus4, removeNumbers)
# Remove common English stopwords
dataCorpus6 <- tm_map(dataCorpus5, removeWords, stopwords("english"))
# Remove your own stopwords, specified as a character vector
dataCorpus7 <- tm_map(dataCorpus6, removeWords, c("said", "will"))
# Remove punctuation
dataCorpus8 <- tm_map(dataCorpus7, removePunctuation)
# Eliminate extra white space
dataCorpus9 <- tm_map(dataCorpus8, stripWhitespace)
# Text stemming (optional, disabled here; uncomment to create dataCorpus10)
# dataCorpus10 <- tm_map(dataCorpus9, stemDocument)
- Build a term-document matrix
A term-document matrix is a table containing the frequency of the words. Row names are words (terms) and column names are documents. The function TermDocumentMatrix() from the text mining package can be used as follows:
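# dataCorpus9 is the cleaned corpus; use dataCorpus10 instead if you enabled stemming
dtm <- TermDocumentMatrix(dataCorpus9)
m <- as.matrix(dtm)
# Sum each term's row across documents, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)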
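The layout of a term-document matrix can be reproduced by hand in base R, which makes it easier to see why rowSums() yields word frequencies. This is only a sketch on two invented documents; it does not use tm and skips all cleaning steps.

```r
# Two hypothetical documents, already lower case and free of punctuation
docs <- c(doc1 = "the cat sat on the mat", doc2 = "the dog sat")

# Split each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct word, sorted alphabetically
terms <- sort(unique(unlist(tokens)))

# One column per document, one row per term, each cell a count
tdm <- vapply(tokens,
              function(tk) tabulate(match(tk, terms), nbins = length(terms)),
              integer(length(terms)))
rownames(tdm) <- terms
print(tdm)

# Total frequency of each term across all documents, most frequent first
freq <- sort(rowSums(tdm), decreasing = TRUE)
print(head(freq, 3))
```

Here "the" appears three times in total and "sat" twice, so they lead the sorted frequencies, just as the most common words lead the data frame d built from the real matrix.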
- Generate the word cloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
You can clone this repository to get the finished product.