Problem: So inorder to train this net, we need to know what the target_sequence is. But how are we going to get that target_sequence? We certainly cannot annotate every timestep in the sequence of length W/2(that’s impossible!).
Solution: Guess the target_sequence using a reasonable heuristic. Yes you read it right. Make an initial guess for the target_sequence. Consider the input image above, what is the decoded output for that input?
decoded output →“STATE”
Say the length of the target_sequence/predicted_sequence(i.e output from the RNN) is 14. Now let’s guess the target_sequence to be “SSSTTAAAATTEEE” (we’ll look at blank character later).
Given that we have a target_sequence, we can train the network with that sequence for the first step, then use the network’s predictions as the new target_sequence(for next step) and iterate it over and over during training — Iterate and Estimate.
But there is one key problem here. How do we ensure that the network’s predicted target_sequence when decoded gives us the desired output, “STATE” in our case?
What if the RNN outputs “RRRRAAAEEELLLL”? We can’t use this output sequence as the new target_sequence for the next step in the training process. Hence we need to impose few restrictions on the sequence predicted by the RNN.
Constraint 1: Instead of taking into account the predictions of entire vocabulary(alphabets A to Z), consider only the alphabets that is part of the decoded output.
In our case, the alphabets S,T,A,E are part of the decoded output “STATE”.
Okay, this now ensures that our network produces a predicted_sequence containing only the alphabets in the decoded output. But there is no constraint on the sequence in which the alphabets should be outputted.
For eg: the RNN outputs “SSAAAATTTAAEEE”, which still contains only the target alphabets but when decoded produces SATAE instead of STATE.
Constraint 2: To ensure the network outputs a sequence which when decoded gives us the desired collapsed output, let’s impose the following constraint:
Not only consider the alphabets S,T,A,E but also order the individual alphabets in the sequence in which they occur in the decoded output(even if the alphabets have to be repeated).
And now fix that the first output should always be the top left symbol and last output should always be the bottom right symbol and the sequence should always be from top to bottom strictly i.e no upward movement. Every path from the top-left symbol to the bottom-right symbol is considered a valid sequence/alignment, since it will always decode to a valid ‘decoded_sequence’.
Visualize every output symbol at every timestep as a node in a graph and edges represented by arrows. Every node’s score is its probability from the network at that timestep and the edge scores are 1. The score of a sequence(or path) is the product of probabilities of all nodes contained in it.
Sub-Problem: Given a set of such valid sequences, find the most probable sequence?
Solution: This problem is solved using Viterbi’s algorithm(dynamic programming).
Once we ensure that the predicted sequence is a valid sequence(or alignment), we can use the ITERATE and ESTIMATE approach to train the network. One could also use a pretrained RNN trained on a similar task to make the training process more robust.
But all of this is heavily dependent on how the initial target sequence is guessed as well as how the target_sequence is computed during training(max probable sequence using Viterbi’s algorithm). As a result of which this approach is sub-optimal.
Instead of choosing the most probable sequence, we use an expectation over all the possible valid sequences. This again can be computed using dynamic programming and the algorithm used is known as CTC-Forward-Backward algorithm. I am not going to get into the details of Forward-Backward/Viterbi algorithm as it is still a grey area for me.
Lastly, consider the case when the output has repeated characters. Let the output word be “LETTER” and the predicted_sequence be “LLLLEETTTTEERR”. Now what happens when this predicted_sequence is collapsed?
We get the decoded output as “LETER” which is undesirable.
Therefore inorder to account for repetitive characters in the output domain, blank character(-) is included as part of the vocabulary. Basically a neural-net outputs a blank character when none of the symbols in the vocabulary is present at that timestep.
So instead of producing “LLLLEETTTTEERR”, an RNN with blank as part of the vocabulary might now output “-L-EE-TT-T-E-R” which when decoded produces “LETTER”.
Since a blank character has been included and it can happen to occur anywhere in the target_sequence, we arrange the target alphabets + blank as follows: