Enriching Word Vectors with Subword Information [PAPER SUMMARY]


Our Goals can be listed as follows :

  1. For a vocabulary size of ‘W’, we need to learn a vector representation for every wordw’.
  2. The word vector representation should consider the context of the surrounding words.
  3. The word vector representation should also consider the internal morphological structure of the word.


The first thing to be done here is setting up a word dictionary with an index for every word.

Next, we define a window size which will give us the context words for every target word. For example, a window size of 2 for the target word “army in the sentence: “I have an army. We have a hulk” , the context words would be:

( “have”, “an”, “we”, “have” )

The next step is to calculate the score for context between words. Let’s assume a function S(w,c) → R to numerically score the context between the target word <w> and the context word <c>. Where <w> and <c> are the vector representations for these words and <R> represents the single dimension real numbers space.

Further, we can define the probability of <c> occurring as a context word as a softmax function as follows :

Here and are the vectors representing the target and context words and W is the vocabulary size.

But this probability distribution only considers the probability of occurrence of a single context word to that of the whole vocabulary. Thus this cannot be used to define our objective function to train the model. Predicting context words can be seen as a set of binary classification problems. Therefore, we use the binary logistic loss to get the following negative log-likelihood as the objective function:

The final objective function for training.

Now, to parameterize the model, we define the context score S(w,c) as the scalar product between the vectors and .

Until here, the approach is exactly the same as the one given by Mikolov. Clearly, the current approach only considers the neighboring context words for calculating the features and completely ignores the structure of the word itself.

The Subword Model:

To address this issue, we represent the word “w” as a bag of all possible character n-grams in the word. The word is padded by a set of unique symbols like the angled brackets: . This helps in distinguishing the suffixes and prefixes from the rest of the character sequences. We also add the complete word as a sequence into this bag of n-grams.

EXAMPLE: For n = 3, the word “where” will give us the following n-grams bag represented as Gw

Gw = [ ,  ]

Now, every character sequence in the n-grams bag is denoted by a vector and the word vector is given by the sum of the vectors of all the n-grams in that word. We also change the context score function to :

This allows the sharing of representations across words, thus allowing to learn reliable representation for rare words. This way, the subword information is utilized in the learning process of calculating word embeddings.

This model can be memory-wise costlier. So, the authors use Fowler-Noll-Vo hashing function (FNV-1a variant) to hash the character sequences. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.


Given, the negative log-likelihood as the objective function before, we can optimize to minimize it ( and hence maximize the likelihood ) by an optimization method. The authors have used Stochastic Gradient Descent with Linear Learning Rate decay for all their experiments. The same optimization method has been used by Mikolov et al in their word2vec model.

Implementation Details:

The authors have tried to maintain maximum similarity between their approach and Mikolov et al’s approach.

  1. Word vector size = 300
  2. Negative Sampling Size = 5 samples per positive sample
  3. Word rejection criteria = If the word occurs less than times in the corpus, it is removed from the dictionary.

This implementation is 1.5x slower than Mikolov et al’s SkipGram implementation.


Datasets: Wikipedia dumps in nine languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian.

Human similarity judgment: Computing Spearman’s rank correlation coefficient between human judgment and the cosine similarity between the vector representations.

Word analogy tasks: This is tightly related to the choice of the length of character n-grams that we consider. The analogy accuracy is split as Semantic (meaning-based) and Syntactic (syntax and grammar-based).

Other Experiments: Further the authors also experiment with Language Modelling tasks, n-grams size variation, morphemes and qualitative analysis.

read original article here