Recurrent Neural Networks (RNNs)
1. Background
Sequences are extremely important for understanding what’s happening around us. As you read this text, you make sense of it by combining previous words with new ones into a meaningful sequence. Similarly, for many processes it’s incredibly important to use data from previous ‘states’ in order to predict future ones. To use another example, when we listen to someone, our brain is constantly trying to complete their sentence, and it often does.
Recurrent Neural Networks are a class of algorithms that allow us to train machines to do just that!
The main innovation of RNNs is that they allow information to persist: memory units pass information from one step to the next. You can think of it as information being passed over many time steps, or simply as a loop repeating over time with information being updated.
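To make that loop over time concrete, here is a minimal sketch of a single vanilla RNN step in NumPy. The weight names (W_xh, W_hh, b_h) and the dimensions are illustrative assumptions for this sketch only, not part of the LSTM model we build below; the point is simply that the hidden state h carries information from one step to the next while the same weights are reused at every step.
import numpy as np
# illustrative dimensions (assumptions for this sketch only)
input_dim, hidden_dim = 3, 5
W_xh = np.random.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden ("recurrent") weights
b_h = np.zeros(hidden_dim)
h = np.zeros(hidden_dim)  # memory carried across time steps
sequence = [np.random.randn(input_dim) for _ in range(4)]
for x_t in sequence:
    # each step mixes the new input with the previous state
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)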
Long Short-Term Memory units, or LSTMs for short, have been extremely successful in solving many problems in speech recognition, text generation, captioning, and more.
In this tutorial we build a simple RNN model using LSTMs that predicts future text characters by training on J.M. Barrie’s Peter Pan novel. This example is essentially an abridged version of a problem set from the Udacity deep learning nanodegree.
For a more in-depth explanation I recommend you watch this video by Nando de Freitas and read this excellent article.
2.1 Loading and Cleaning the Dataset
import urllib.request
# download the Project Gutenberg text and decode the raw bytes into a string
response = urllib.request.urlopen('http://www.gutenberg.org/files/16/16-0.txt')
data = response.read().decode('utf-8')
# write data to file
filename = "peterpan.txt"
file_ = open(filename, 'w')
file_.write(data)
file_.close()
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# The text for training can be obtained at http://www.gutenberg.org/ebooks/16
# read in the text, transforming everything to lower case
text = open('peterpan.txt').read().lower()
print('our original text has ' + str(len(text)) + ' characters')
### replace '\n' and '\r' characters with spaces
text = text[1302:]  # skip the file's opening frontmatter before the story begins
text = text.replace('\n',' ')  # replacing '\n' with a space keeps adjacent words from running together
text = text.replace('\r',' ')
### print out the first 1000 characters of the raw text to get a sense of what we need to throw out
text[:1000]
### TODO: list all unique characters in the text and remove any non-english ones
import string
allowed_chars = string.ascii_lowercase + ' ' + '!' + ',' + '.' + ':' + ';' + '?'
# replace any character that is not in the allowed set with a space
for char in set(text):
    if char not in allowed_chars:
        text = text.replace(char, ' ')
# shorten the extra dead space created above
text = text.replace('  ', ' ')
### print out the first 2000 characters of the cleaned text to verify the cleaning worked
text[:2000]
# count the number of unique characters in the text
chars = sorted(list(set(text)))
# print statistics about the cleaned corpus
print("this corpus has " + str(len(text)) + " total characters")
print("this corpus has " + str(len(chars)) + " unique characters")
2.2 Cutting data into input/output pairs
### TODO: fill out the function below that transforms the input text and window size into a set of input/output pairs for use with our RNN model
def window_transform_text(text, window_size, step_size):
    # containers for input/output pairs
    inputs = []
    outputs = []
    # slide a window of length window_size across the text, stepping by step_size;
    # each window is an input and the character immediately after it is the output
    for i in range(window_size, len(text), step_size):
        inputs.append(text[i - window_size:i])
        outputs.append(text[i])
    return inputs, outputs
# run your text window-ing function
window_size = 100
step_size = 5
inputs, outputs = window_transform_text(text,window_size,step_size)
# print out a few of the input/output pairs to verify that we've made the right kind of stuff to learn from
print('input = ' + inputs[2])
print('output = ' + outputs[2])
print('--------------')
print('input = ' + inputs[100])
print('output = ' + outputs[100])
# print out the number of unique characters in the dataset
chars = sorted(list(set(text)))
print ("this corpus has " + str(len(chars)) + " unique characters")
print ('and these characters are ')
print (chars)
2.3 One-hot encoding characters
The easiest way to think of this problem is as a classification problem! Essentially all we’re doing is predicting what the next character will be, and we know that there are 33 unique forms it can take. So, if we think of characters as classes, then we just need to predict which class the next character will belong to!
Since the number of classes is relatively small, only 33, we can simply use one-hot encoding. However, note that for larger models with many more classes this becomes inefficient, and using an embedding layer would be better.
# this dictionary is a function mapping each unique character to a unique integer
chars_to_indices = dict((c, i) for i, c in enumerate(chars)) # map each unique character to unique integer
# this dictionary is a function mapping each unique integer back to a unique character
indices_to_chars = dict((i, c) for i, c in enumerate(chars)) # map each unique integer back to unique character
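As a quick illustration (just an optional sanity check, not part of the pipeline below), here is how a single character round-trips through these two dictionaries and what its one-hot vector of length len(chars) looks like:
# pick an example character that we know survives the cleaning above
example_char = 'a'
idx = chars_to_indices[example_char]          # character -> integer
print(example_char, '->', idx, '->', indices_to_chars[idx])
# its one-hot encoding: a vector of length len(chars) with a single 1
one_hot = np.zeros(len(chars))
one_hot[idx] = 1
print(one_hot)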
Now we can transform our input/output pairs – consisting of characters – into equivalent input/output pairs made up of one-hot encoded vectors. In the next cell we provide a function for doing just this: it takes in the raw character input/outputs and returns their numerical versions. In particular, the numerical input is given as $\bf{X}$ and the numerical output as $\bf{y}$.
# transform character-based input/output into equivalent numerical versions
def encode_io_pairs(text, window_size, step_size):
    # number of unique chars
    chars = sorted(list(set(text)))
    num_chars = len(chars)
    # cut up text into character input/output pairs
    inputs, outputs = window_transform_text(text, window_size, step_size)
    # create empty vessels for one-hot encoded input/output
    X = np.zeros((len(inputs), window_size, num_chars), dtype=bool)
    y = np.zeros((len(inputs), num_chars), dtype=bool)
    # loop over inputs/outputs, transform each pair, and store in X/y
    for i, sentence in enumerate(inputs):
        for t, char in enumerate(sentence):
            X[i, t, chars_to_indices[char]] = 1
        y[i, chars_to_indices[outputs[i]]] = 1
    return X, y
Now run the one-hot encoding function in the cell below to transform our input/output pairs!
# use your function
window_size = 100
step_size = 5
X,y = encode_io_pairs(text,window_size,step_size)
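As a quick optional check (not part of the original notebook), you can confirm that X has shape (number of pairs, window_size, number of unique characters) and y has shape (number of pairs, number of unique characters):
print('X shape: ' + str(X.shape))
print('y shape: ' + str(y.shape))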
3.1 Setting up the RNN
With our dataset loaded and the input/output pairs extracted and transformed, we can now begin setting up our RNN for training. Again we will use Keras to quickly build a single-hidden-layer RNN – where our hidden layer consists of LSTM modules.
Time to get to work: build a 3-layer RNN model to the following specification:
- layer 1 should be an LSTM module with 200 hidden units –> note this should have input_shape = (window_size, len(chars)) where len(chars) = number of unique characters in your cleaned text
- layer 2 should be a linear module, fully connected, with len(chars) hidden units –> where len(chars) = number of unique characters in your cleaned text
- layer 3 should be a softmax activation (since we are solving a multiclass classification problem)
Use the categorical_crossentropy loss
This network can be constructed using just a few lines – as with the RNN network you made in part 1 of this notebook. See, e.g., the general Keras documentation and the LSTM documentation in particular for examples of how to quickly use Keras to build neural network models.
### necessary functions from the keras library
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import keras
import random
# TODO build the required RNN model: a single LSTM hidden layer with softmax activation, categorical_crossentropy loss
model = Sequential()
model.add(LSTM(200, input_shape=(window_size, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
# initialize optimizer
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
# compile model --> make sure initialized optimizer and callbacks - as defined above - are used
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
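To confirm the architecture matches the specification above (an optional check, not part of the original notebook), you can print a summary of the compiled model:
model.summary()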
3.2 Training our RNN model for text generation
With our RNN set up we can now train it! Let’s begin by trying it out on a small subset of the full dataset. In the next cell we take the first 10,000 input/output pairs from our training database to learn on.
# a small subset of our input/output pairs
Xsmall = X[:10000,:,:]
ysmall = y[:10000,:]
# train the model
model.fit(Xsmall, ysmall, batch_size=500, epochs=40,verbose = 1)
# save weights (make sure the model_weights directory exists first)
import os
os.makedirs('model_weights', exist_ok=True)
model.save_weights('model_weights/best_RNN_small_textdata_weights.hdf5')
3.3 Predicting text from trained model
# function that uses the trained model to predict a desired number of future characters
def predict_next_chars(model, input_chars, num_to_predict):
    # create output
    predicted_chars = ''
    for i in range(num_to_predict):
        # convert this round's input characters to numerical (one-hot) input
        x_test = np.zeros((1, window_size, len(chars)))
        for t, char in enumerate(input_chars):
            x_test[0, t, chars_to_indices[char]] = 1.
        # make this round's prediction
        test_predict = model.predict(x_test, verbose=0)[0]
        # translate the numerical prediction back to a character
        r = np.argmax(test_predict)  # index of the most likely next character
        d = indices_to_chars[r]
        # update predicted_chars and slide the input window forward by one character
        predicted_chars += d
        input_chars += d
        input_chars = input_chars[1:]
    return predicted_chars
With your trained model, try a few subsets of the complete text as input – note that the length of each must be exactly equal to the window size. For each subset, use the function above to predict the 100 characters that follow it.
# TODO: choose an input sequence and use the prediction function in the previous Python cell to predict 100 characters following it
# get appropriately sized chunks of characters from the text
start_inds = [0, 500, 1000]
# load in weights
model.load_weights('model_weights/best_RNN_small_textdata_weights.hdf5')
for s in start_inds:
    start_index = s
    input_chars = text[start_index: start_index + window_size]
    # use the prediction function
    predict_input = predict_next_chars(model, input_chars, num_to_predict=100)
    # print out input characters
    print('------------------')
    input_line = 'input chars = ' + '\n' + input_chars + '"' + '\n'
    print(input_line)
    # print out predicted characters
    line = 'predicted chars = ' + '\n' + predict_input + '"' + '\n'
    print(line)
This looks OK, but not great. Now let’s try the same experiment with a larger chunk of the data – the first 100,000 input/output pairs.
Tuning RNNs on a typical character dataset like the one we use here is a computationally intensive endeavour and thus time-consuming on a typical CPU. Using a reasonably sized cloud-based GPU can speed up training by a factor of 10. Also, because of the long training time, it is highly recommended that you carefully write the output of each step of your process to file. That way all of your results are saved even if you close the web browser you’re working in: the process will continue running in the background, but variables/output in the notebook will not update when you open it again.
In the next cell we show you how to create a text file in Python and record data to it. This sort of setup can be used to record your final predictions.
### A simple way to write output to file
f = open('my_test_output.txt', 'w')  # create an output file to write to
f.write('this is only a test ' + '\n')  # write some output text
x = 2
f.write('the value of x is ' + str(x) + '\n')  # record a variable's value
f.close()
# print out the contents of my_test_output.txt
f = open('my_test_output.txt', 'r')  # open the output file for reading
print(f.read())
f.close()
With this recording device in place we can now more safely perform experiments on larger portions of the text. In the next cell we will use the first 100,000 input/output pairs to train our RNN model.
First we fit our model to the dataset, then generate text using the trained model in precisely the same generation method applied before on the small dataset.
Note: your generated words should be – by and large – more realistic than with the small dataset, but you won’t be able to generate perfect English sentences even with this amount of data. A rule of thumb: your model is working well if you generate sentences that largely contain real English words.
NOTE: If you’re running this on a CPU, the following piece of code may consume significant resources and could take several hours to complete.
# a larger subset of our input/output pairs
Xlarge = X[:100000,:,:]
ylarge = y[:100000,:]
# TODO: fit to our larger dataset
model.fit(Xlarge, ylarge, batch_size=500, epochs=100, verbose=1)
# save weights
model.save_weights('model_weights/best_RNN_large_textdata_weights.hdf5')
# TODO: choose an input sequence and use the prediction function in the previous Python cell to predict 100 characters following it
# get appropriately sized chunks of characters from the text
start_inds = [0, 500, 1000]
# save output (make sure the text_gen_output directory exists first)
import os
os.makedirs('text_gen_output', exist_ok=True)
f = open('text_gen_output/RNN_large_textdata_output.txt', 'w')  # create an output file to write to
# load weights
model.load_weights('model_weights/best_RNN_large_textdata_weights.hdf5')
for s in start_inds:
    start_index = s
    input_chars = text[start_index: start_index + window_size]
    # use the prediction function
    predict_input = predict_next_chars(model, input_chars, num_to_predict=100)
    # print out and record input characters
    line = '-------------------' + '\n'
    print(line)
    f.write(line)
    input_line = 'input chars = ' + '\n' + input_chars + '"' + '\n'
    print(input_line)
    f.write(input_line)
    # print out and record predicted characters
    predict_line = 'predicted chars = ' + '\n' + predict_input + '"' + '\n'
    print(predict_line)
    f.write(predict_line)
f.close()