Beam Search & Attention for text summarization made easy (Tutorial 5)

1. Beam Search

1.A Intuition (why beam search)

Our task of text summarization can be seen as a conditional language model: we generate an output given an input sentence, so the output is conditioned on that input, which is why it is called conditional.

This is in contrast to a simple language model: a normal language model only outputs the probability of a certain sentence, and it can be used to generate novel sentences (its input can even be empty). In our case, however, we want to select the most likely output given an input sentence.
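
Writing this out for clarity (with x the input sentence and y_1 … y_T the generated words), the model scores an output sentence by the product of its per-word conditional probabilities, P(y_1, …, y_T | x) = P(y_1 | x) · P(y_2 | x, y_1) · … · P(y_T | x, y_1, …, y_{T-1}), and we are looking for the output y that maximizes this probability.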

So our architecture is actually divided into 2 parts, an encoder and a decoder (like discussed here): the encoder represents the input sentence as a vector and passes it to the next part of the architecture (the decoder).
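
As a rough picture of this encoder-decoder setup, here is a minimal sketch in Keras (just one way such a model could be wired; vocab_size, embed_dim and latent_dim are placeholder values, and the actual summarization models from the earlier tutorials are more elaborate):

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size = 10_000   # placeholder vocabulary size
embed_dim = 128       # placeholder embedding size
latent_dim = 256      # placeholder hidden size

# Encoder: reads the input sentence and summarizes it in its final states
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(vocab_size, embed_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generates the output sentence, conditioned on the encoder states
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(vocab_size, embed_dim)(decoder_inputs)
dec_outputs, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
decoder_outputs = Dense(vocab_size, activation="softmax")(dec_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
```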

But a problem arises in the decoder: which sentence should it select? There can be a huge number of possible outputs for a certain input sentence.

So how do we choose the most likely sentence from this large pool of possible outputs? For example:

  1. Jane is visiting Africa in September
  2. Jane is going to be visiting Africa in September
  3. In September Jane will visit Africa

One solution could be to generate 1 word at a time: generate the most likely first word, then the most likely next word, and so on. This is called greedy search, but it tends not to be the best approach, for 2 main reasons:

  1. in practice, greedy search doesn’t perform well
  2. picking the exact most likely full sentence would mean scoring every possible combination of words: with a vocabulary of 10k words and an output of length T, that is 10k^T candidate sentences, which is far too many to enumerate
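
Just to get a feel for how fast that blows up, here is a quick back-of-the-envelope check (the 10k vocabulary and 10-word output are illustrative numbers only):

```python
vocab_size = 10_000
output_len = 10

# number of distinct 10-word outputs an exact search would have to score
print(vocab_size ** output_len)  # 10**40 candidate sentences
```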

so instead we go for a more efficient, approximate search approach → Beam Search

1.B How to Beam Search

Now that we have understood the basic intuition behind beam search, how do we actually implement it?

Simply put, beam search differs from basic greedy search by considering multiple options at every step, not just 1; the number of options kept is controlled by a variable called the Beam Width.

In greedy search:

at each step we only consider the single most likely next word, but this does not always guarantee the best overall sentence, as

greedy search would output that the best continuation is “Jane is going …”, while “Jane is visiting …” is the better output

this is because the probability of the word “going” after “Jane is” is higher than that of “visiting”, even though the sentence that starts with “Jane is visiting” ends up more likely overall. This is why we need another approach, not greedy search: an approach that keeps multiple candidate words under consideration.
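
As a minimal sketch of what greedy decoding looks like in code (predict_next is an assumed helper that wraps the decoder and returns a probability distribution over the vocabulary given the words generated so far):

```python
import numpy as np

def greedy_decode(predict_next, start_token, end_token, max_len=30):
    """Greedy search: at every step keep only the single most likely next word.

    `predict_next` is an assumed helper that returns a probability
    distribution over the vocabulary given the words generated so far."""
    output = [start_token]
    for _ in range(max_len):
        probs = predict_next(output)       # shape: (vocab_size,)
        next_word = int(np.argmax(probs))  # only the top-1 word survives
        output.append(next_word)
        if next_word == end_token:
            break
    return output
```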

So in Beam Search (with Beam Width = 3) it would be:

in step 1, we choose the 3 most likely first words; then, for each of these 3 kept words, we look at the possible next words and keep the 3 most likely partial sentences overall; the same logic is repeated on these 3 candidate sentences until we build our best output.
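
Here is a small sketch of that loop, kept deliberately simple (it reuses the assumed predict_next helper from the greedy sketch and scores candidate sentences by their cumulative log-probability):

```python
import numpy as np

def beam_search_decode(predict_next, start_token, end_token,
                       beam_width=3, max_len=30):
    """Beam search: keep the `beam_width` most likely partial sentences
    at every step instead of just one.

    `predict_next` is the same assumed helper as in the greedy sketch:
    given the words generated so far, it returns a probability
    distribution over the whole vocabulary."""
    # each beam is (list_of_word_ids, cumulative_log_probability)
    beams = [([start_token], 0.0)]

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:        # finished sentences stay as they are
                candidates.append((tokens, score))
                continue
            probs = predict_next(tokens)       # shape: (vocab_size,)
            # the overall top `beam_width` continuations can only come from the
            # top `beam_width` words of each beam, so it is enough to look at those
            top_words = np.argsort(probs)[-beam_width:]
            for w in top_words:
                candidates.append((tokens + [int(w)],
                                   score + float(np.log(probs[w]))))
        # keep only the `beam_width` best partial sentences overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == end_token for tokens, _ in beams):
            break

    return beams[0][0]                         # word ids of the best sentence found
```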

So if our vocab is 10k, at each step we look for the most likely next words over the 10k-word vocabulary for each of the 3 kept sentences, so we only evaluate 3 × 10k = 30k candidates per step, instead of the exponential number an exact search would need.

1.C Effect of Beam Width

when Beam Width = 3 → keep 3 candidate sentences at each step

when Beam Width = 1 → keep only 1 candidate at each step → this is exactly greedy search

As the Beam Width increases → better results → but more computational resources are required
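
Using the beam_search_decode sketch from above (predict_next and the token ids are still just placeholders), the two extremes would look like this:

```python
# Beam Width = 1 collapses beam search back into greedy search
greedy_like = beam_search_decode(predict_next, start_token=1, end_token=2,
                                 beam_width=1)

# a wider beam explores more candidates per step: usually better outputs,
# but roughly beam_width times the compute and memory
wider = beam_search_decode(predict_next, start_token=1, end_token=2,
                           beam_width=10)
```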

read original article here