1. Beam Search
1.A Intuition (why beam search)
Our task of text summarization can be framed as a conditional language model: we generate an output sentence given an input sentence, so the output is conditioned on the input, which is why it is called conditional.
This contrasts with a plain language model, which only models the probability of a sentence and can be used to generate novel sentences (its input can be empty). In our case, we instead want to select the most likely output given an input sentence.
So our architecture is divided into 2 parts, an encoder and a decoder (as discussed here): the encoder represents the input sentence as a vector and passes it to the next part of the architecture (the decoder).
But a problem arises in the decoder: which sentence should we select? There can be a huge number of possible outputs for a given input sentence, so how do we choose the most likely sentence from this large pool? For example:
- Jane is visiting Africa in September
- Jane is going to be visiting Africa in September
- In September Jane will visit Africa
One solution is to generate 1 word at a time: pick the most likely first word, then the most likely next word, and so on. This is called greedy search, but it tends not to be the best approach, for 2 main reasons:
- in practice, greedy search often fails to find the most likely overall sentence
- exhaustively scoring every possible output is infeasible: with a 10k-word vocabulary there are 10k^T possible sentences of length T (already 10k^10 for a 10-word output), far too many to enumerate
so we turn to a faster, approximate search algorithm → Beam Search
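To make greedy search concrete before moving on, here is a minimal sketch. The `TOY_MODEL` table of conditional word probabilities is entirely invented for illustration; in a real summarizer these numbers would come from the decoder's softmax at each step.

```python
# A toy conditional distribution P(next word | prefix). The numbers are
# invented for illustration; a real decoder would produce them with a softmax.
TOY_MODEL = {
    (): {"jane": 1.0},
    ("jane",): {"is": 1.0},
    ("jane", "is"): {"going": 0.55, "visiting": 0.45},
    ("jane", "is", "going"): {"home": 0.3, "out": 0.3, "away": 0.4},
    ("jane", "is", "visiting"): {"africa": 0.95, "asia": 0.05},
}

def greedy_decode(next_word_probs, max_len=10):
    """Pick the single most likely word at every step."""
    prefix = []
    for _ in range(max_len):
        probs = next_word_probs.get(tuple(prefix), {})
        if not probs:  # no known continuation: stop
            break
        prefix.append(max(probs, key=probs.get))
    return prefix

print(greedy_decode(TOY_MODEL))
# → ['jane', 'is', 'going', 'away']
```

Note what happens: greedy search commits to "going" (0.55 > 0.45) and ends up with a sentence whose total probability (0.55 × 0.4 = 0.22) is lower than that of "jane is visiting africa" (0.45 × 0.95 ≈ 0.43). This is exactly the failure mode beam search is designed to avoid.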
1.B How to Beam Search
Now that we have understood the basic intuition behind beam search, how do we actually implement it?
Simply put, beam search differs from greedy search by considering multiple options at every step, not just 1. The number of options kept is controlled by a parameter called the Beam Width.
In greedy search:
at each step we only keep the single most likely word, but this does not always lead to the best sentence. The probability of the word "going" after "is" is bigger than that of "visiting" after "is", so greedy search commits to "Jane is going ...", even though the full sentence "Jane is visiting Africa in September" may be more likely overall. This is why we need an approach other than greedy search, one that keeps multiple words under consideration.
So with Beam Search it would look like this (Beam Width = 3):
in step 1, we choose the 3 most likely first words; then, for each of the 3 kept prefixes, we score every possible next word and keep only the 3 most likely prefixes overall; the same logic repeats step by step until we build our best output.
So if our vocab is 10k, at each step we score 3 × 10k = 30k candidate continuations (10k for each of the 3 kept prefixes), rather than enumerating every possible sentence.
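The steps above can be sketched as follows. As before, the `TOY_MODEL` probability table is invented for illustration, and the sketch sums log-probabilities rather than multiplying raw probabilities, which is the usual trick to avoid numerical underflow on long sentences:

```python
import math

# Toy conditional distribution P(next word | prefix); numbers are invented.
TOY_MODEL = {
    (): {"jane": 1.0},
    ("jane",): {"is": 1.0},
    ("jane", "is"): {"going": 0.55, "visiting": 0.45},
    ("jane", "is", "going"): {"home": 0.3, "out": 0.3, "away": 0.4},
    ("jane", "is", "visiting"): {"africa": 0.95, "asia": 0.05},
}

def beam_search(next_word_probs, beam_width=3, max_len=10):
    beams = [([], 0.0)]  # list of (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            probs = next_word_probs.get(tuple(prefix), {})
            if not probs:  # finished prefix: carry it over unchanged
                candidates.append((prefix, score))
                continue
            for word, p in probs.items():
                candidates.append((prefix + [word], score + math.log(p)))
        # Keep only the beam_width most likely prefixes overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

print(beam_search(TOY_MODEL, beam_width=3))
# → ['jane', 'is', 'visiting', 'africa']
```

Because the "going" branch spreads its probability over several continuations, the beam keeps "visiting" alive long enough to discover that "jane is visiting africa" scores higher overall, which is the sentence greedy search misses.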
1.C Effect of Beam Width
when Beam Width = 3 → consider 3 words each step
when Beam Width = 1 → consider 1 word each step → greedy search
As Beam Width increases → usually better results (more of the search space is explored) → but more computational resources are required
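The cost side of this trade-off is easy to quantify: each step scores roughly beam-width × vocabulary-size candidate continuations (using the 10k vocabulary from the example above), so cost grows linearly with the Beam Width:

```python
vocab_size = 10_000  # vocabulary size from the example above

def candidates_per_step(beam_width, vocab_size):
    # Every kept prefix is extended by every word in the vocabulary.
    return beam_width * vocab_size

for b in (1, 3, 10):
    print(b, candidates_per_step(b, vocab_size))
# beam width 1 scores 10k candidates per step (greedy search),
# width 3 scores 30k, width 10 scores 100k
```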