--- title: "One-hot encoding of words or characters" output: html_notebook: theme: cerulean highlight: textmate --- ```{r setup, include=FALSE} knitr::opts_chunk$set(warning = FALSE, message = FALSE) ``` *** This notebook contains the first code sample found in Chapter 6, Section 1 of [Deep Learning with R](https://www.manning.com/books/deep-learning-with-r). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments. *** One-hot encoding is the most common, most basic way to turn a token into a vector. You already saw it in action in our initial IMDB and Reuters examples from chapter 3 (done with words, in our case). It consists in associating a unique integer index to every word, then turning this integer index i into a binary vector of size N, the size of the vocabulary, that would be all-zeros except for the i-th entry, which would be 1. Of course, one-hot encoding can be done at the character level as well. To unambiguously drive home what one-hot encoding is and how to implement it, here are two toy examples of one-hot encoding: one for words, the other for characters. Word level one-hot encoding (toy example): ```{r} # This is our initial data; one entry per "sample" # (in this toy example, a "sample" is just a sentence, but # it could be an entire document). samples <- c("The cat sat on the mat.", "The dog ate my homework.") # First, build an index of all tokens in the data. token_index <- list() for (sample in samples) # Tokenizes the samples via the strsplit function. In real life, you'd also # strip punctuation and special characters from the samples. for (word in strsplit(sample, " ")[[1]]) if (!word %in% names(token_index)) # Assigns a unique index to each unique word. Note that you don't # attribute index 1 to anything. token_index[[word]] <- length(token_index) + 2 # Vectorizes the samples. You'll only consider the first max_length # words in each sample. max_length <- 10 # This is where you store the results. results <- array(0, dim = c(length(samples), max_length, max(as.integer(token_index)))) for (i in 1:length(samples)) { sample <- samples[[i]] words <- head(strsplit(sample, " ")[[1]], n = max_length) for (j in 1:length(words)) { index <- token_index[[words[[j]]]] results[[i, j, index]] <- 1 } } ``` Character level one-hot encoding (toy example): ```{r} samples <- c("The cat sat on the mat.", "The dog ate my homework.") ascii_tokens <- c("", sapply(as.raw(c(32:126)), rawToChar)) token_index <- c(1:(length(ascii_tokens))) names(token_index) <- ascii_tokens max_length <- 50 results <- array(0, dim = c(length(samples), max_length, length(token_index))) for (i in 1:length(samples)) { sample <- samples[[i]] characters <- strsplit(sample, "")[[1]] for (j in 1:length(characters)) { character <- characters[[j]] results[i, j, token_index[[character]]] <- 1 } } ``` Note that Keras has built-in utilities for doing one-hot encoding text at the word level or character level, starting from raw text data. This is what you should actually be using, as it will take care of a number of important features, such as stripping special characters from strings, or only taking into the top N most common words in your dataset (a common restriction to avoid dealing with very large input vector spaces). 
Using Keras for word-level one-hot encoding:

```{r}
library(keras)

samples <- c("The cat sat on the mat.", "The dog ate my homework.")

# Creates a tokenizer, configured to only take into account the 1,000
# most common words, then builds the word index.
tokenizer <- text_tokenizer(num_words = 1000) %>%
  fit_text_tokenizer(samples)

# Turns strings into lists of integer indices
sequences <- texts_to_sequences(tokenizer, samples)

# You could also directly get the one-hot binary representations. Vectorization
# modes other than one-hot encoding are supported by this tokenizer.
one_hot_results <- texts_to_matrix(tokenizer, samples, mode = "binary")

# How you can recover the word index that was computed
word_index <- tokenizer$word_index

cat("Found", length(word_index), "unique tokens.\n")
```

A variant of one-hot encoding is the so-called "one-hot hashing trick", which you can use when the number of unique tokens in your vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these indices in a dictionary, you can hash words into vectors of fixed size, typically with a very lightweight hashing function. The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can start generating token vectors right away, before you've seen all of the available data). The one drawback is that it's susceptible to hash collisions: two different words may end up with the same hash, and subsequently any machine-learning model looking at these hashes won't be able to tell the difference between them. The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.

Word-level one-hot encoding with the hashing trick (toy example):

```{r}
library(hashFunction)

samples <- c("The cat sat on the mat.", "The dog ate my homework.")

# The words will be stored as vectors of size 1,000. Note that if you have
# close to 1,000 words (or more), you'll start seeing many hash collisions,
# which will decrease the accuracy of this encoding method.
dimensionality <- 1000
max_length <- 10

results <- array(0, dim = c(length(samples), max_length, dimensionality))

for (i in 1:length(samples)) {
  sample <- samples[[i]]
  words <- head(strsplit(sample, " ")[[1]], n = max_length)
  for (j in 1:length(words)) {
    # Hashes the word into a "random" integer index between 1 and 1,000.
    index <- (abs(spooky.32(words[[j]])) %% dimensionality) + 1
    results[i, j, index] <- 1
  }
}
```
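To make the collision risk concrete, here is a small sketch (again an addition to the book's code, reusing the `spooky.32` hash from above) that hashes every unique word in the toy samples and checks whether any two of them landed in the same bucket. With only ten unique words and 1,000 buckets a collision is unlikely here, but as the vocabulary size approaches `dimensionality`, the duplicate check will increasingly come back `TRUE`.

```{r}
# Collects the unique words across the samples, using the same naive
# space-based tokenization as above (punctuation stays attached).
vocabulary <- unique(unlist(strsplit(samples, " ")))

# Hashes each word into a bucket between 1 and `dimensionality`,
# exactly as in the encoding loop above.
hashed_indices <- sapply(vocabulary, function(word)
  (abs(spooky.32(word)) %% dimensionality) + 1)

hashed_indices

# TRUE would mean at least two distinct words share a bucket.
any(duplicated(hashed_indices))
```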