### GOALS:

#### Our goals can be listed as follows:

- For a *vocabulary size* of **W**, we need to learn a vector representation for every *word* **w**.
- The word vector representation should consider the context of the surrounding words.
- The word vector representation should also consider the internal morphological structure of the word.

### APPROACH:

The first thing to be done here is setting up a word dictionary with an index for every word.
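As an illustration (not the authors' implementation), building such a word-to-index dictionary might look like this in Python:

```python
# Sketch: assigning an integer index to every distinct word in a tiny corpus.
# The tokenization (lowercase, strip trailing periods) is a simplifying assumption.
corpus = "I have an army. We have a hulk".lower().split()

word_index = {}
for token in corpus:
    word = token.strip(".")
    # setdefault assigns the next free index only on first sight of a word
    word_index.setdefault(word, len(word_index))

print(word_index)  # e.g. {'i': 0, 'have': 1, 'an': 2, 'army': 3, ...}
```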

Next, we define a window size which gives us the context words for every target word. For example, with a *window size of* **2** for the target word "**army**" in the sentence "*I have an army. We have a hulk*", the context words would be: ( "have", "an", "we", "have" ).
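Extracting such a context window can be sketched as follows (a minimal illustration; the whitespace/punctuation tokenization is an assumption, not the paper's preprocessing):

```python
# Sketch: collecting the context words within a fixed window of a target word.
import re

def context_words(tokens, target_index, window=2):
    """Return the tokens up to `window` positions left and right of the target."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right

tokens = re.findall(r"[a-z]+", "I have an army. We have a hulk".lower())
print(context_words(tokens, tokens.index("army")))  # ['have', 'an', 'we', 'have']
```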

The next step is to calculate a score for the context between words. Let's assume a function **s(w, c) → R** that numerically scores the context between the *target word* **w** and the *context word* **c**, where **w** and **c** are the vector representations of these words and **R** is the *one-dimensional space of real numbers*.

Further, we can define the probability of **c** occurring as a context word as the following softmax function: **p(c | w) = exp(s(w, c)) / Σⱼ exp(s(w, j))**, where the sum in the denominator runs over all **W** words **j** in the vocabulary.
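As a quick numerical sketch, the softmax over context scores can be computed in pure Python (the score values below are made up for illustration):

```python
# Sketch: softmax over the context scores s(w, j) for every word j in the vocabulary.
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # hypothetical s(w, j) values for a 3-word vocabulary
print(softmax(scores))     # probabilities, largest for the highest score
```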

But this probability distribution only models the occurrence of a single context word against the whole vocabulary, so it cannot be used directly as our *objective function* to train the model. Predicting context words can instead be seen as a set of independent binary classification problems. Therefore, we use the binary logistic loss **ℓ(x) = log(1 + e^(−x))** to get the following *negative log-likelihood* as the *objective function*: for each target word **w**, we sum **ℓ(s(w, c))** over its true context words **c** and **ℓ(−s(w, n))** over a set of sampled negative words **n**.
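A minimal sketch of this objective for a single target word (scores are made-up numbers; a real trainer would compute them from the vectors and sample the negatives from the vocabulary):

```python
# Sketch: binary logistic loss l(x) = log(1 + exp(-x)); positive (true context)
# pairs contribute l(s), sampled negative pairs contribute l(-s).
import math

def logistic_loss(x):
    return math.log1p(math.exp(-x))

def target_objective(positive_scores, negative_scores):
    """Negative log-likelihood contribution of one target word."""
    return (sum(logistic_loss(s) for s in positive_scores)
            + sum(logistic_loss(-s) for s in negative_scores))

print(target_objective([1.2, 0.8], [-0.5, -1.1, 0.3]))  # lower is better
```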

Now, to parameterize the model, we *define* the **context score s(w, c)** as the *scalar product* **between the vectors w and c**: **s(w, c) = wᵀc**.

Up to this point, the approach is exactly the same as the one given by Mikolov et al. Clearly, this approach only considers the neighboring context words when calculating the features and completely ignores the structure of the word itself.

**The Subword Model:**

To address this issue, we represent the word **w** as *a bag of all possible character n-grams* in the word. The *word is padded with a pair of unique boundary symbols, the angle brackets* **<** and **>**. This helps in distinguishing the suffixes and prefixes from the rest of the character sequences. We also add the complete word itself as a sequence into this bag of n-grams.

**EXAMPLE:** For n = 3, the word "**where**" gives us the following n-gram bag, represented as *Gw*:

Gw = [ <wh, whe, her, ere, re>, <where> ]
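Constructing such an n-gram bag can be sketched in Python (an illustration of the idea, not the authors' implementation):

```python
# Sketch: build the bag of character n-grams for a word, with angle-bracket
# padding, and append the whole padded word as one extra sequence.
def ngram_bag(word, n=3):
    padded = "<" + word + ">"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    grams.append(padded)   # the complete word as its own sequence
    return grams

print(ngram_bag("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```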

Now, every character sequence **g** in the n-gram bag is denoted by a vector **z_g**, and the word vector for **w** is given by the sum of the vectors of all the n-grams in that word. We also change the context score function to: **s(w, c) = Σ_{g ∈ Gw} z_gᵀ v_c**, where **v_c** is the vector of the context word **c**.
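The subword score above can be sketched numerically as follows (the tiny two-dimensional vectors are made-up values for illustration):

```python
# Sketch: s(w, c) = sum over n-grams g in G_w of z_g . v_c,
# i.e. every n-gram vector is dotted with the context word's vector.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def subword_score(ngram_vectors, context_vector):
    """Context score of a word from its n-gram vectors."""
    return sum(dot(z, context_vector) for z in ngram_vectors)

z_vectors = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1]]   # hypothetical z_g values
v_c = [1.0, 0.5]                                    # hypothetical context vector
print(subword_score(z_vectors, v_c))
```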

This allows the sharing of representations across words, **thus allowing reliable representations to be learned even for rare words**. This way, the subword information is utilized in the learning process of calculating word embeddings.

This model can be more memory-intensive. So, the authors use the Fowler–Noll–Vo hashing function (FNV-1a variant) to hash the character sequences into a fixed number of buckets. Ultimately, a word is represented by its index in the word dictionary together with the set of hashed n-grams it contains.
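A sketch of 32-bit FNV-1a hashing into buckets follows; the bucket count of 2**21 is an assumption for illustration, the point being that n-grams share a fixed-size table instead of each getting its own vector row:

```python
# Sketch: 32-bit FNV-1a hash of a byte string, then reduced modulo a
# fixed number of buckets to index a shared n-gram table.
def fnv1a_32(data: bytes) -> int:
    h = 0x811C9DC5                          # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) % (1 << 32)    # FNV prime, wrapped to 32 bits
    return h

def ngram_bucket(ngram: str, buckets: int = 2**21) -> int:
    return fnv1a_32(ngram.encode("utf-8")) % buckets

print(ngram_bucket("<wh"), ngram_bucket("whe"))  # deterministic bucket indices
```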

**Optimization:**

Given the *negative log-likelihood* as the *objective function* above, we can optimize to *minimize* it (and hence *maximize the likelihood*). The authors have used **Stochastic Gradient Descent** with *linear learning rate decay* **for all their experiments. The same optimization method was used by Mikolov et al. in their word2vec model.**

**Implementation Details:**

The authors have tried to keep their setup as similar as possible to that of Mikolov et al.

- **Word vector size** = *300*
- **Negative sampling size** = *5 samples per positive sample*
- **Word rejection criterion** = *if a word occurs fewer than a threshold number of times in the corpus, it is removed from the dictionary*.

This implementation is **1.5x slower** than Mikolov et al’s SkipGram implementation.
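The SGD with linear learning-rate decay mentioned above can be sketched as a simple schedule (the starting rate of 0.025 is an assumption for illustration, not a value from the paper):

```python
# Sketch: a learning rate that decays linearly from lr0 to 0 over training.
def linear_decay(lr0, step, total_steps):
    """Current SGD learning rate after `step` of `total_steps` updates."""
    return lr0 * (1.0 - step / total_steps)

total = 1000
for step in (0, 500, 999):
    print(step, linear_decay(0.025, step, total))
```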

### EXPERIMENTS:

**Datasets:** *Wikipedia dumps in nine languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian.*

**Human similarity judgment:** *Computing Spearman's rank correlation coefficient between human judgment and the cosine similarity of the vector representations.*

**Word analogy tasks:** *This is tightly related to the choice of the length of character n-grams that we consider. The analogy accuracy is split into **Semantic** (meaning-based) and **Syntactic** (syntax and grammar-based) categories.*

**Other Experiments:** *Further, the authors also experiment with language modelling tasks, n-gram size variation, morphemes, and qualitative analysis.*