Introduction

This demo will cover the basics of clustering, topic modeling, and classifying documents in R using both unsupervised and supervised machine learning techniques. We will also spend some time discussing and comparing some different methodologies.

The data used in this tutorial is a set of documents from Reuters on different topics. This is a classic dataset for learning textmining and is available all over the internet (including here under Reuters-21578 R8 - All Terms - Training).

We will assume that you are familiar with basic textmining in R (as shown in the Intro to Textmining tutorial) including loading/cleaning text data and creating document-term matrices.

After a brief introduction to/discussion of unsupervised and supervised machine learning, we will continue to a coded example. Please feel free to skip through the conceptual piece if you are comfortable with the topics.

Unsupervised or Supervised Machine Learning?

Often the goal of textmining is to differentiate between documents. Differentiation can help answer questions like: what kinds of things are mentioned in these documents? which of these documents is about a certain topic? and what is the sentiment of these documents?

We can leverage unsupervised and/or supervised machine learning algorithms to help us with this differentiation, but which type of algorithm/methodology we choose will be driven by our specific question, the data we have available, and the resources we are willing/able to expend.

The key “divining rod” when determining which kind of methodology to apply is whether or not your data is tagged. Supervised algorithms require training and test data that is pre-coded (tagged) with the variable of interest. For example, if we are performing sentiment analysis, we would want a set of documents that a person manually went through and tagged as positive or negative. The supervised algorithm will then use this information to train a model. Even better, if we can hold out some of our tagged data, we can test the model to see how accurate it is. Supervised methodologies can include human-readable models such as Bayes Nets and Trees, as well as “black box” methods such as Support Vector Machines and Maximum Entropy models.

If we do not have access to tagged data and are unwilling/unable to tag it ourselves, we will have to settle for unsupervised techniques. These techniques try to tell the difference between documents without any prior knowledge. It is not hard to guess that unsupervised methods are rarely comparable in accuracy to supervised methods.

It may seem that supervised methods are always preferable to unsupervised methods, but this is not necessarily true. For one, it may be very expensive to tag a suitable set of representative training/test documents. Furthermore, your question may be such that an unsupervised method with low accuracy is suitable.

For example, suppose I have 1000 documents that are either about sports or R programming. I can run a quick unsupervised clustering algorithm that should reasonably separate the documents, especially when they deal with concepts as different as these. Even in this case, however, I would need to go through and manually check the cluster results before placing any confidence in them (what if there are a bunch of documents about using R to analyze sports?). If I was going to manually read the documents anyway, the initial split provided by the clustering algorithm may speed up my workflow.

After reviewing your data and your question, you should be able to determine which method is right for your application. In general, you may want to first try unsupervised learning as an exploratory analysis before diving into supervised learning.

The remainder of this tutorial will go through unsupervised and supervised analysis on the same dataset.

Loading and Cleaning the Data

As mentioned above, this tutorial will use Reuters articles as the data source. Unlike the individual .txt documents that we often have to deal with, the Reuters data is conveniently packaged in a single tab-delimited .txt file. We can read it in like this:

library(tm)            # corpora and document-term matrices
library(wordcloud)     # optional; not required for the examples below
library(Rgraphviz)     # optional; not required for the examples below
library(RColorBrewer)  # optional; not required for the examples below
library(topicmodels)   # LDA()
library(plyr)          # count(), ddply()
library(ggplot2)       # plotting
library(RTextTools)    # create_container(), train_model(), classify_model()
library(e1071)         # naiveBayes()

x <- read.table('r8-train-all-terms.txt', header=FALSE, sep='\t')

Now we have a data frame where each row represents a document, with one column containing the document text (V2) and another containing the topic tag (V1). The tags we have are:

unique(x$V1)
## [1] earn     acq      trade    ship     grain    crude    interest money-fx
## Levels: acq crude earn grain interest money-fx ship trade

Our dataset contains 5,485 documents. This is good - especially for supervised analysis - because we will have plenty of documents on which to base our model. In the interest of limiting computational expense, let's restrict ourselves to three of the document tags: trade, crude, and money-fx.

x <- x[which(x$V1 %in% c('trade','crude', 'money-fx')),]
nrow(x)
## [1] 710

Now we are down to 710 documents about our three selected subjects.

We can make a corpus from the column containing the document text:

source <- VectorSource(x$V2)
corpus <- Corpus(source)

We will take the standard steps to clean and prepare the data:

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('english'))

Note that even when the data is tagged like this, we still create the corpus from only the document text. We keep the tags associated with the documents in our “x” data frame to be used later.

Unsupervised Analysis

Clustering

This tutorial will show how to use k-means clustering. K-means clusters the observations in a dataset by comparing them across many variables; in the textmining case, those variables come from word frequencies. Let's first create a document-term matrix:

mat <- DocumentTermMatrix(corpus)

At this point, we could just move on to our clustering with this matrix, but we will instead create a TF-IDF-weighted version of it. This weighting - short for term frequency-inverse document frequency - takes into account how often a term is used across the entire corpus as well as in a single document. The logic here is that if a term appears frequently throughout the corpus, it is probably not very useful for differentiating documents. Conversely, if a word appears rarely in the corpus, it may be an important differentiator even if it only occurs a few times in a document.

mat4 <- weightTfIdf(mat)
mat4 <- as.matrix(mat4)
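To get a feel for what the weighting does, we can compare raw counts with TF-IDF weights for a couple of terms. The terms “said” and “oil” are chosen only because we know they appear in this corpus (they show up in the topic-model output later); any terms in the matrix would do. We would expect a near-ubiquitous word like “said” to receive much lower weights than a more topic-specific word like “oil”:

as.matrix(mat)[1:5, c('said', 'oil')]   # raw counts for the first five documents
mat4[1:5, c('said', 'oil')]             # the corresponding TF-IDF weights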

Finally, we will normalize each document's TF-IDF vector to unit Euclidean length. This is one of many scoring schemes in textmining and you should take some time to explore which method is right for your particular application (or find a bunch and try them out!).

norm_eucl <- function(m)
  m/apply(m,1,function(x) sum(x^2)^.5)
mat_norm <- norm_eucl(mat4)
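As a quick, optional sanity check, every row of mat_norm should now have Euclidean length (approximately) 1:

summary(apply(mat_norm, 1, function(x) sqrt(sum(x^2))))   # should all be ~1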

Now we can run the k-means algorithm. The only thing we need to specify is the number of centroids in the model. In our case, we know there are 3 different groups, but in practice you will need to do some hypothesizing and testing to arrive at a sensible number. There are also some methods in the fpc package that help find an optimal number of centroids, but these are very computationally expensive with large datasets.
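A lighter-weight heuristic is the “elbow” method: run k-means for a range of k values and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below is purely illustrative (the range 2:8 and nstart = 3 are arbitrary choices), and it re-runs the clustering several times, so skip it if speed matters.

wss <- sapply(2:8, function(k) kmeans(mat_norm, k, nstart=3)$tot.withinss)   # within-cluster SS for k = 2..8
plot(2:8, wss, type='b', xlab='number of clusters (k)',
     ylab='total within-cluster sum of squares')                            # look for the "elbow"

Since we already know this subset covers three topics, we will proceed with k = 3: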

set.seed(5)
k <- 3
kmeansResult <- kmeans(mat_norm, k)

Now we have a model object called kmeansResult. You will often see people plot the results of k-means in bivariate analyses - with one variable on the y axis and one on the x axis. In our case we have roughly 8,000 variables (terms), which is far too many dimensions to visualize directly.
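One common workaround (not part of the original analysis, just a sketch) is to project the documents onto their first two principal components and color the points by cluster. This discards a lot of information but can give a rough visual sense of whether the clusters separate at all:

pca <- prcomp(mat_norm)                     # principal components of the normalized TF-IDF matrix
plot(pca$x[,1], pca$x[,2], col=kmeansResult$cluster,
     xlab='PC1', ylab='PC2', main='k-means clusters in two principal components')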

The most informative output of this model is which cluster each document was placed in:

kmeansResult$cluster[1:5]
## 1 2 3 4 5 
## 3 3 3 3 3
count(kmeansResult$cluster)
##   x freq
## 1 1   52
## 2 2  266
## 3 3  392

We can see that documents 1 through 5 are all in cluster 3. We have 52 documents in cluster 1, 266 documents in cluster 2, and 392 documents in cluster 3.

In true unsupervised analysis, this is all we could get. We would know which documents were grouped together, but would need to dive into the actual documents to see what (if anything) this means.

Model Performance

In this case, we know which documents should have been grouped together (which you won't have in a real application), so let's see how the k-means did as a proof of concept.

We will put the results of the clustering into the original table (x), then graph the results to see how the clusters match up against the known tags.

result <- data.frame('actual'=x$V1, 'predicted'=kmeansResult$cluster)
result <- result[order(result[,1]),]

result$counter <- 1
result.agg <- aggregate(counter~actual+predicted, data=result, FUN='sum')

result.agg
##     actual predicted counter
## 1    crude         1       1
## 2 money-fx         1      51
## 3    crude         2       2
## 4 money-fx         2      73
## 5    trade         2     191
## 6    crude         3     250
## 7 money-fx         3      82
## 8    trade         3      60
ggplot(data=result.agg, aes(x=actual, y=predicted, size=counter)) + geom_point()

We can see that the clustering did OK (my subjective opinion). 98% of the documents in cluster 1 pertain to money-fx, 72% of the documents in cluster 2 pertain to trade, and 64% of the documents in cluster 3 pertain to crude. On average, roughly 78% of the documents in each cluster correspond to the “correct” tag.

Again, whether or not this is sufficiently accurate depends on your application. If you only want to review documents pertaining to money-fx, you would be well served by looking at cluster 1. If, however, you are only interested in documents about crude, you would get a lot of documents in cluster 3 that are about other topics.
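For reference, the percentages quoted above can be computed directly from the aggregated results. This is just a convenience sketch using plyr's ddply(); the “purity” of a cluster here is simply the share of its documents carrying the cluster's most common tag:

purity <- ddply(result.agg, 'predicted', summarise,
                majority_tag = actual[which.max(counter)],   # most common actual tag in the cluster
                share = max(counter) / sum(counter))         # share of documents with that tag
purity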

Topic Modeling

Apart from clustering, we will also perform another type of unsupervised analysis: topic modeling. We are going to use the LDA() function from the topicmodels package (Latent Dirichlet Allocation).

Unlike k-means, which is discriminative in spirit (it tries to tell documents apart based directly on their contents), LDA is a generative model (it builds a probabilistic model of how the words in each document were generated/written). LDA estimates which words are likely to be generated by a given topic, then infers a document's topics from these probabilities. The most probable term in each topic can also serve as a rough label. Like k-means, LDA requires us to supply the number of topics.

These are high level descriptions and are potentially oversimplified - please look these up for more information.

Note that LDA takes the DocumentTermMatrix because the algorithm requires plain frequency weighting (not TF-IDF).

k <- 3
lda <- LDA(mat, k)
terms(lda)
## Topic 1 Topic 2 Topic 3 
##  "said"   "oil" "trade"
x <- topics(lda)
new.df <- data.frame('response'=names(x), 'topic'=x, row.names=NULL)
count(new.df, vars='topic')
##   topic freq
## 1     1  281
## 2     2  196
## 3     3  233

We can see from the results that LDA did a pretty good job here. Two of the topics are headed by “oil” and “trade”, which correspond surprisingly well to the tags “crude” and “trade”. The third topic's top term is much less informative. It appears that the model was not able to isolate the money-fx category, or possibly that the word “said” appears so often in these documents that it dominates that topic's most probable terms.
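To dig into that third topic, we can look at more than just the single most probable term, and at the per-document topic probabilities:

terms(lda, 10)                # ten most probable terms for each topic
head(posterior(lda)$topics)   # per-document topic probabilities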

Again, whether or not this is useful for your application depends on what question you are trying to answer. If you wanted a general sense of the documents you had, you got 2 good descriptions of topics and 1 description that you would have to look into more - that’s not bad. If you needed definitive answers, you still have a lot of work to do to validate the LDA output. Either way, this type of light-weight, unsupervised exploratory analysis likely has some utility.

Supervised Analysis

Now let's apply some real horsepower with supervised analysis. Remember, these methods are only applicable if you have tagged data. In practice, people will often tag their own data (or hire others/poor graduate students to do so) in situations where accuracy is very important.

In supervised analysis we need to split our data into training and test sets - both of which must be tagged. We first train a model on the training data, then apply it to the test data and see how well it performs (otherwise, we could just be overfitting to the training data). If our data is sufficiently large and representative, the accuracy numbers should give us a good idea of how well the model will perform on data we don't have.

A good rule of thumb is to use 80% of the data to train and 20% to test. You should also look into n-fold cross-validation for further machine learning applications. For a quick example: if we use 10-fold cross-validation, our data is split into 10 groups. The algorithm we choose runs 10 times, each time holding out a different group (fold) as the test data and training on the remaining nine. The results are then combined to give a final accuracy number. This has the benefit of using all of the data to both train and evaluate the model, but it is computationally more expensive. A minimal sketch of how such folds could be constructed is shown below.
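Here is that sketch; the fold count of 10 and the use of our 710-document subset are just for illustration:

set.seed(1)
folds <- split(sample(1:710), rep(1:10, length.out = 710))   # 10 random folds of document indices
sapply(folds, length)                                        # each fold would take a turn as the test set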

For now, let's stick to the 80-20 split. In our case, that is 568 documents for training and 142 for testing.

We will need to reprocess the data and randomize it to make sure we get a good split:

set.seed(10)
x <- read.table('r8-train-all-terms.txt', header=FALSE, sep='\t')
x.rand <- x[sample(1:nrow(x)),]
x.rand <- x.rand[which(x.rand$V1 %in% c('trade','crude', 'money-fx')),]
source <- VectorSource(x.rand$V2)
corpus <- Corpus(source)

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('english'))

mat <- DocumentTermMatrix(corpus)

mat4 <- weightTfIdf(mat)
mat4 <- as.matrix(mat4)

Let's first try a naive Bayes model, using naiveBayes() from the e1071 package.

classifier <- naiveBayes(mat4[1:568,], x.rand$V1[1:568])
predicted <- predict(classifier, mat4[569:710,])
table(as.character(x.rand$V1[569:710]), as.character(predicted))
##           
##            crude money-fx trade
##   crude       14       35     0
##   money-fx     0       44     2
##   trade        0       22    25
recall_accuracy(as.character(x.rand$V1[569:710]), as.character(predicted))
## [1] 0.584507

As you can see by the confusion matrix and the recall accuracy, this is not a great model.

In a confusion matrix, the rows represent the actual group of each document, while the columns represent the predicted group. You can see that this model classified many of the documents as money-fx regardless of their actual group.

To quantify the results another way, the recall accuracy tells us: of the documents that were truly in a given class, how many were correctly labeled as that class by the algorithm. Precision (not shown above for simplicity) is another interesting measure: of the documents that were labeled as a given class, how many actually belong there. A third measure, F1, combines precision and recall. Please look into these if you are interested in validating your models; a quick sketch of computing all three from the confusion matrix follows.
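Since the confusion matrix has actual classes on the rows and predicted classes on the columns, the per-class recall, precision, and F1 fall out of a few lines of base R (this is just a convenience sketch):

cm <- table(as.character(x.rand$V1[569:710]), as.character(predicted))
recall    <- diag(cm) / rowSums(cm)   # correct predictions / actual class sizes
precision <- diag(cm) / colSums(cm)   # correct predictions / predicted class sizes
f1        <- 2 * precision * recall / (precision + recall)
round(rbind(recall, precision, f1), 2)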

A major benefit of supervised analysis is that there are many different machine learning algorithms that can be applied. Let's try a tree next. Note that RTextTools lets us create a container with our training and test data already defined. We can then pass this container to train_model and classify_model.

container <- create_container(mat, x.rand$V1, trainSize=1:568,testSize=569:710, virgin=FALSE)
model <- train_model(container, 'TREE')   # the kernel argument only applies to SVM, so it is omitted here
results <- classify_model(container, model)
table(as.character(x.rand$V1[569:710]), as.character(results[,"TREE_LABEL"]))
##           
##            crude money-fx trade
##   crude       47        2     0
##   money-fx     0       43     3
##   trade        0        5    42
recall_accuracy(x.rand$V1[569:710], results[,"TREE_LABEL"])
## [1] 0.9295775

The results here look much better, with a recall accuracy of about 93%.

Now let's try one last model, Support Vector Machines. SVM tends to work well for textmining classification and it also happens to be very fast.

container <- create_container(mat, x.rand$V1, trainSize=1:568,testSize=569:710, virgin=FALSE)
model <- train_model(container, 'SVM',kernel='linear')
results <- classify_model(container, model)
table(as.character(x.rand$V1[569:710]), as.character(results[,"SVM_LABEL"]))
##           
##            crude money-fx trade
##   crude       49        0     0
##   money-fx     0       45     1
##   trade        1        2    44
recall_accuracy(x.rand$V1[569:710], results[,"SVM_LABEL"])
## [1] 0.971831

It seems that SVM is the best model here, with a recall accuracy of roughly 97%.

If our data is representative, we would expect that we could classify documents of these three classes with 97% accuracy. Depending on your question/application, it may have been well worth your time to tag these 710 documents manually - especially if you will need to repeat this classification task in the future.
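If you want additional assurance that this number is not an artifact of the particular 80/20 split we happened to choose, the n-fold cross-validation mentioned earlier is the natural next step. RTextTools ships a cross_validate() helper for this; the call below assumes a (container, nfold, algorithm) signature, so treat it as a sketch and check ?cross_validate in your installed version:

cv_results <- cross_validate(container, 5, 'SVM')   # 5-fold cross-validation; re-trains the model five times
cv_results                                          # per-fold and mean accuracy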

Conclusion

The methods shown here should be enough to get you moving in the right direction in textmining classification, clustering, and topic modeling. If you were previously unfamiliar with any of the topics mentioned here (especially those related to machine learning), it may be worth your time to do some additional research into the specific assumptions and limitations of each method/algorithm.

This tutorial purposefully did not cover sentiment analysis (mostly because it is overused and sort of boring), but because of its popularity, I will briefly discuss it here. People unfamiliar with textmining often struggle to frame a sentiment analysis problem: Should they count up “negative” and “positive” words from a dictionary? How can we relate those counts to the documents? What about sarcasm?

After reading this tutorial, hopefully you went straight to the answer: frame sentiment analysis as a classification problem. The best sentiment analyses involve running supervised text classification algorithms on data tagged as positive or negative. It is tempting to use a classifier that someone else trained on a different corpus, but keep in mind that the kinds of positive and negative language/usage in your corpus are probably fairly unique to it. When discussing certain topics or using certain media, people will use different degrees of sarcasm, colloquial language, “inside jokes”, and so on. In these cases, it is probably not useful to try to guess at these predictors - instead, take the time to tag your data and let supervised machine learning do the hard work for you! A sketch of what that pipeline might look like follows.
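To make that concrete, the pipeline is nearly identical to the SVM example above. The data frame reviews, with a text column and a sentiment column tagged positive/negative, is entirely hypothetical (and assumed to already be in random order); everything else reuses functions already shown:

# hypothetical tagged data: reviews$text and reviews$sentiment (a factor: positive/negative)
corpus <- Corpus(VectorSource(reviews$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
dtm <- DocumentTermMatrix(corpus)

n <- nrow(reviews)
train_idx <- 1:floor(0.8 * n)            # 80% to train
test_idx  <- (floor(0.8 * n) + 1):n      # 20% to test
container <- create_container(dtm, reviews$sentiment,
                              trainSize=train_idx, testSize=test_idx, virgin=FALSE)
model   <- train_model(container, 'SVM', kernel='linear')
results <- classify_model(container, model)
table(as.character(reviews$sentiment[test_idx]), as.character(results[,"SVM_LABEL"]))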

Thank you for reading through this tutorial - feel free to email me with any questions, errors, or suggestions for additional material.