28 Jan 2019. Machine learning using Keras in R


In this blog post we will take a look at binary classification using Keras. We’ll cover installing Keras, understanding your data, defining a model, and testing it.

This blog post assumes you’ve got the latest version of R installed which, at the time of writing, is either R 3.5.2 or Microsoft R Open 3.5.3. It also assumes you have intermediate or better programming skills and a basic understanding of machine learning.

Installing Keras

To start out you should install the keras package in your R environment. This can be achieved by running the install.packages("keras") command. Once the package is installed you can load it with the library(keras) command.

With the library loaded you can then install Keras itself by running install_keras(), which sets up the necessary environment on your local machine. In this guide we will also make use of the magrittr package, which provides the pipe operator (%>%).

Getting the dataset

The basis of any machine learning project is getting the right data. I chose the enriched hotel reviews dataset from Kaggle.

This dataset contains clearly labeled reviews and will serve as an ideal platform to start with sentiment analysis.

Once you have downloaded the dataset and placed it in your working directory you can load it using the following command.

fn_hotel_reviews <- "hotel_Reviews_Enriched.csv"
if (file.exists(fn_hotel_reviews)) {
    hotel_reviews <- read.csv(file = fn_hotel_reviews,
                              header = TRUE,
                              quote = '"',
                              dec = '.')
    print("Reviews read in successfully")
} else {
    print(paste("File not found:", fn_hotel_reviews))
}
It is important to note that the Kaggle dataset contains values in quotes. This is to make sure that, in locales that use the comma for their decimal separator, the values are retained properly.
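To see why that matters, here is a minimal sketch with an inline string standing in for the CSV file (hypothetical values, base R only):

```r
# In a comma-decimal locale the score 8.3 is written as "8,3";
# quoting the field stops read.csv from treating that comma as a separator.
csv_text <- 'Review,Score\n"Great stay","8,3"'
df <- read.csv(text = csv_text, header = TRUE, quote = '"', dec = ',')
df$Score  # a single numeric column holding 8.3
```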

Selecting the data

With our dataset loaded we can start selecting the columns that are of interest. In this case the dataset contains several columns that we can use for our sentiment analysis.

* Positive_Review. 
* Negative_Review.
* Review_Is_Positive.

These three columns tell us all we need to know, so we take them and leave the rest.

selected_columns <- c("Positive_Review", "Negative_Review", "Review_Is_Positive")
hotel_reviews <- hotel_reviews[, selected_columns]

We then shuffle the dataset and split it into positive and negative reviews. For the negative reviews we take the second and third columns, as they contain the negative review text and our label.

hotel_reviews <- hotel_reviews[sample(nrow(hotel_reviews)), ]
negative_reviews <- hotel_reviews[which(hotel_reviews$Review_Is_Positive == 0), c(2, 3)]
positive_reviews <- hotel_reviews[which(hotel_reviews$Review_Is_Positive == 1), c(1, 3)]

In my case I ended up with 9785 negative reviews and 10215 positive reviews. This will make our final neural network lean slightly to the positive side, but with a larger, balanced dataset that can be overcome.
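If the skew bothers you, one option is to downsample the larger class before combining. A minimal sketch with toy stand-in dataframes (the counts mirror the ones above; the real review dataframes would slot straight in):

```r
# Toy stand-ins shaped like the review dataframes (hypothetical counts)
negative_toy <- data.frame(Review = rep("bad", 9785), Label = 0)
positive_toy <- data.frame(Review = rep("good", 10215), Label = 1)

# Downsample the larger class so both classes contribute equally
n <- min(nrow(negative_toy), nrow(positive_toy))
balanced <- rbind(negative_toy[sample(nrow(negative_toy), n), ],
                  positive_toy[sample(nrow(positive_toy), n), ])
table(balanced$Label)  # 9785 of each
```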

We then set the column names for both dataframes and stitch them together row-wise.

colnames(negative_reviews) <- c("Review", "Label")
colnames(positive_reviews) <- c("Review", "Label")
mixed_reviews <- rbind(negative_reviews, positive_reviews)
mixed_reviews <- mixed_reviews[sample(nrow(mixed_reviews)), ]


Because rbind leaves us with roughly 10k negative reviews followed by 10k positive reviews, we shuffle the combined dataframe again.

Tokenizing the data

With Keras installed and the dataset ready we can start tokenizing our data. Tokenizing turns words into integers that can be fed into our machine learning model. According to lingholic, the average English speaker uses about 1000 words in 90% of their writing. For this exercise we can therefore limit ourselves to the top 1000 words used in our training set. Bigger vocabularies might net more accurate results but will also take longer to train.

vocab_size <- 1000
tokenizer <- text_tokenizer(num_words = vocab_size)

We can then fit our tokenizer to the data and turn our reviews into a document-term matrix. In this case we use the term frequency–inverse document frequency, or tf-idf for short. The tf-idf is a number intended to reflect how important a word is to a document in a collection or corpus.

tokenizer <- fit_text_tokenizer(tokenizer, mixed_reviews$Review)
x_train <- texts_to_matrix(tokenizer, mixed_reviews$Review, mode = "tfidf")
y_train <- to_categorical(mixed_reviews$Label)
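To get a feel for what the tf-idf weighting does, here is a toy version in base R. This uses the textbook formula; Keras applies its own smoothing internally, so the exact numbers will differ:

```r
# Two tiny "documents" and a three-word vocabulary
docs  <- list(c("good", "room", "good"), c("bad", "room"))
vocab <- c("good", "bad", "room")

# Term frequency per document, inverse document frequency per word
tf  <- t(sapply(docs, function(d) sapply(vocab, function(w) sum(d == w) / length(d))))
idf <- log(length(docs) / sapply(vocab, function(w) sum(sapply(docs, function(d) w %in% d))))
tfidf <- sweep(tf, 2, idf, `*`)

# "room" appears in every document, so idf = log(2/2) = 0 and its
# tf-idf weight vanishes: common words carry little information
tfidf
```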

Defining the model

With the data tokenized we can start thinking about our model. A neural network is built up out of multiple layers of neurons, each with its own activation function and weights.

batch_size <- 512

model <- keras_model_sequential() %>%
    layer_dense(units = batch_size, input_shape = c(vocab_size), activation = 'relu') %>%
    layer_dropout(rate = 0.2) %>%
    layer_dense(units = (batch_size / 2), activation = 'relu') %>%
    layer_dense(units = 2, activation = 'sigmoid')

In the model above, we have four layers. The input layer has 512 neurons that take in our 1000 most frequent words. This layer has a rectified linear unit as activation function. This activation function can be described as max(0,x) where x is the input to the neuron.
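The rectified linear unit is simple enough to write out yourself in base R:

```r
# ReLU: pass positive inputs through, clamp negative inputs to zero
relu <- function(x) pmax(0, x)
relu(c(-2, -0.5, 0, 1.5, 3))  # 0.0 0.0 0.0 1.5 3.0
```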

After that, the dropout layer randomly zeroes 20% of the previous layer’s activations during training to prevent overfitting. We then have another dense layer with the same activation and finally a layer with a sigmoid function. Practically speaking this maps a probability below .5 to a 0 and anything higher to a 1.

After the model is defined we can compile it.

compile(model, loss = 'binary_crossentropy', optimizer = 'adam', metrics = c('accuracy'))  

We are interested in getting ones or zeroes, so we select a binary loss function and the adam optimizer. There are many more optimizers and loss functions available, so do read up on them! In short, a loss function assigns a “cost” to an event; the optimizer then adjusts the network’s weights to minimize that cost.
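To make the idea of a “cost” concrete, here is the binary cross-entropy formula written out in base R (a sketch of the math only; Keras computes this internally with extra numerical safeguards):

```r
# Binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))
binary_crossentropy <- function(y_true, y_pred) {
  -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
}

binary_crossentropy(c(1, 0), c(0.9, 0.1))  # confident and right: low cost
binary_crossentropy(c(1, 0), c(0.1, 0.9))  # confident and wrong: high cost
```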

After it is compiled we can start training the model.

history <- model %>% fit(x_train, y_train,
                         batch_size = batch_size,
                         epochs = 2,
                         verbose = 1,
                         validation_split = 0.1)

An epoch in Keras is defined as “one pass over the entire dataset.” As we have a relatively small dataset, two passes should be more than sufficient to produce accurate results. The validation_split parameter tells Keras to reserve 10% of the training data as validation data to check the accuracy of the model.

That’s it! With our model fitted and built we can now see the accuracy of our model by running history. The output should look similar to this:

Trained on 18,000 samples, validated on 2,000 samples (batch_size=32, epochs=2)
Final epoch (plot to see history):
val_loss: 0.1409
val_acc: 0.9483
loss: 0.1052
acc: 0.9607 

This means that our model was about 95% accurate at predicting the sentiment of the reviews!

Verification and prediction

You don’t just have to take my, or Keras’, word for it. You can check the accuracy of your model by loading in a new batch of data and using the evaluate function. This will give you the accuracy and loss values for your dataset.

# test_posts and test_tags hold a new batch of labeled reviews
x_test <- texts_to_matrix(tokenizer, test_posts, mode = "tfidf")
y_test <- to_categorical(test_tags)

score <- evaluate(model, x_test, y_test, batch_size = batch_size, verbose = 1)

We can also predict the classes for new data.

verify_post <- c("this hotel was very nice", "the waiter was bad and my bathroom was leaky")
x_verify <- texts_to_matrix(tokenizer, verify_post, mode = "tfidf")

prediction <- model %>% predict_classes(x_verify)

If you liked this blog post, or have any more questions about Keras or R, feel free to get in touch.