This notebook contains the code samples found in Chapter 8, Section 1 of Deep Learning with R. Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.
Let’s put these ideas into practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a language model. You could use any sufficiently large text file or set of text files: Wikipedia, The Lord of the Rings, and so on. In this example we will use some of the writings of Nietzsche, the late-19th-century German philosopher (translated to English). The language model we learn will thus be specifically a model of Nietzsche’s writing style and topics of choice, rather than a more generic model of the English language.
Let’s start by downloading the corpus and converting it to lowercase:
library(keras)
library(stringr)
path <- get_file(
  "nietzsche.txt",
  origin = "https://s3.amazonaws.com/text-datasets/nietzsche.txt"
)
text <- tolower(readChar(path, file.info(path)$size))
cat("Corpus length:", nchar(text), "\n")
Corpus length: 600893
Next, you’ll extract partially overlapping sequences of length `maxlen`, one-hot encode them, and pack them in a 3D array `x` of shape `(sequences, maxlen, unique_characters)`. Simultaneously, you’ll prepare an array `y` containing the corresponding targets: the one-hot-encoded characters that come after each extracted sequence.
maxlen <- 60 # Length of extracted character sequences
step <- 3 # We sample a new sequence every `step` characters
text_indexes <- seq(1, nchar(text) - maxlen, by = step)
# This holds our extracted sequences
sentences <- str_sub(text, text_indexes, text_indexes + maxlen - 1)
# This holds the targets (the follow-up characters)
next_chars <- str_sub(text, text_indexes + maxlen, text_indexes + maxlen)
cat("Number of sequences: ", length(sentences), "\n")
Number of sequences: 200278
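To see how the vectorized `str_sub()` call above produces overlapping windows and their follow-up characters, here is a small standalone sketch (not part of the original listing) on a toy string; the string and the window length of 4 with a step of 2 are made up for illustration:

toy <- "the quick brown fox"
toy_indexes <- seq(1, nchar(toy) - 4, by = 2)
str_sub(toy, toy_indexes, toy_indexes + 3)      # overlapping windows: "the ", "e qu", "quic", ...
str_sub(toy, toy_indexes + 4, toy_indexes + 4)  # follow-up characters: "q", "i", ...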
# List of unique characters in the corpus
chars <- unique(sort(strsplit(text, "")[[1]]))
cat("Unique characters:", length(chars), "\n")
Unique characters: 57
# Dictionary mapping unique characters to their index in `chars`
char_indices <- 1:length(chars)
names(char_indices) <- chars
# Next, one-hot encode the characters into binary arrays.
cat("Vectorization...\n")
Vectorization...
x <- array(0L, dim = c(length(sentences), maxlen, length(chars)))
y <- array(0L, dim = c(length(sentences), length(chars)))
for (i in 1:length(sentences)) {
  sentence <- strsplit(sentences[[i]], "")[[1]]
  for (t in 1:length(sentence)) {
    char <- sentence[[t]]
    x[i, t, char_indices[[char]]] <- 1
  }
  next_char <- next_chars[[i]]
  y[i, char_indices[[next_char]]] <- 1
}
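Before training, a quick sanity check (not in the original listing) can confirm that the arrays have the expected shapes and that every timestep and every target is one-hot encoded:

dim(x)                       # 200278 60 57: (sequences, maxlen, unique characters)
dim(y)                       # 200278 57: (sequences, unique characters)
all(rowSums(x[1, , ]) == 1)  # every timestep of the first sequence is one-hot
sum(y[1, ]) == 1             # the first target is one-hot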
This network is a single LSTM layer followed by a dense classifier with a softmax over all possible characters. But note that recurrent neural networks aren’t the only way to do sequence data generation; 1D convnets have also proven extremely successful at this task in recent times.
model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(units = length(chars), activation = "softmax")
Since our targets are one-hot encoded, we will use `categorical_crossentropy` as the loss to train the model:
# Note: `lr` was renamed to `learning_rate` in more recent versions of keras
optimizer <- optimizer_rmsprop(lr = 0.01)
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer
)
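If you want to inspect the resulting architecture and its parameter counts before training, you can print a summary of the model:

summary(model)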
Given a trained model and a seed text snippet, we generate new text by repeatedly:

1. Drawing from the model a probability distribution for the next character, given the text generated so far.
2. Reweighting the distribution to a certain temperature.
3. Sampling the next character at random according to the reweighted distribution.
4. Adding the new character at the end of the available text.
This is the code we use to reweight the original probability distribution coming out of the model, and draw a character index from it (the “sampling function”):
sample_next_char <- function(preds, temperature = 1.0) {
  preds <- as.numeric(preds)
  # Reweight the distribution: low temperature sharpens it,
  # high temperature flattens it
  preds <- log(preds) / temperature
  exp_preds <- exp(preds)
  preds <- exp_preds / sum(exp_preds)
  # Draw a single sample from the reweighted multinomial distribution
  which.max(t(rmultinom(1, 1, preds)))
}
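To build intuition for the temperature parameter, here is a small standalone sketch (not from the original text) that applies the same reweighting step as `sample_next_char`, without the sampling, to a made-up three-character distribution:

reweight <- function(preds, temperature) {
  # Same reweighting as in sample_next_char, minus the sampling step
  preds <- exp(log(preds) / temperature)
  preds / sum(preds)
}
p <- c(0.7, 0.2, 0.1)
round(reweight(p, 0.2), 3)  # 0.998 0.002 0.000 -- sharper, more predictable
round(reweight(p, 1.0), 3)  # 0.700 0.200 0.100 -- unchanged
round(reweight(p, 1.2), 3)  # 0.645 0.227 0.128 -- flatter, more surprising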
Finally, the following loop repeatedly trains the model and generates text. After every epoch, it generates text using a range of different temperatures. This lets you see how the generated text evolves as the model begins to converge, as well as the impact of temperature on the sampling strategy.
for (epoch in 1:60) {
  cat("epoch", epoch, "\n")
  # Fit the model for 1 epoch on the available training data
  model %>% fit(x, y, batch_size = 128, epochs = 1)
  # Select a text seed at random
  start_index <- sample(1:(nchar(text) - maxlen - 1), 1)
  seed_text <- str_sub(text, start_index, start_index + maxlen - 1)
  cat("--- Generating with seed:", seed_text, "\n\n")
  for (temperature in c(0.2, 0.5, 1.0, 1.2)) {
    cat("------ temperature:", temperature, "\n")
    cat(seed_text, "\n")
    generated_text <- seed_text
    # We generate 400 characters
    for (i in 1:400) {
      # One-hot encode the characters generated so far
      sampled <- array(0, dim = c(1, maxlen, length(chars)))
      generated_chars <- strsplit(generated_text, "")[[1]]
      for (t in 1:length(generated_chars)) {
        char <- generated_chars[[t]]
        sampled[1, t, char_indices[[char]]] <- 1
      }
      # Sample the next character
      preds <- model %>% predict(sampled, verbose = 0)
      next_index <- sample_next_char(preds[1, ], temperature)
      next_char <- chars[[next_index]]
      # Slide the window: append the new character and drop the first
      generated_text <- paste0(generated_text, next_char)
      generated_text <- substring(generated_text, 2)
      cat(next_char)
    }
    cat("\n\n")
  }
}