This time it is difficult: I have set myself an ambitious goal, to explain transformers to people who have no background in programming or in artificial intelligence. The challenge is great, but I will do my best.
This morning I got it into my head to learn a new language, Portuguese. I like the sound of the language, so I said to myself: let’s learn it! The first thing that came to mind was to take some Portuguese words translated from my mother tongue, Italian, in order to build a first elementary vocabulary.
It was amusing, because some of the words sounded a lot like Italian, and by leaning on Italian I tried to work out things like synonyms and antonyms, so the new words felt more familiar than they really were. In short, I tried to understand the semantics, the meaning of the relationships between words.
In practice, I used a language I know well – my mother tongue – associated the new Portuguese terms with it, and slowly learned the new language. Thanks to deep machine learning and the considerable increase in computing power, a similar process has taken place in the computational field.
The computer knows only one language, mathematics, so you have to refer to that if you want to “teach” the machine the interpretation of a human language.
It is important to remember that any problem that is solved with Deep Learning is a mathematical problem. The computer, for example, “sees” thanks to a Convolutional Neural Network. This CNN receives the images in the form of mathematical matrices, whether they are black and white or color, and then applies linear algebra rules. The same happens in tasks such as topic detection, sentiment analysis, etc.
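To make this concrete, here is a tiny sketch in Python (with invented numbers) of what “an image as a matrix” means and what a convolution filter does to it. Real CNNs work on far larger matrices and learn their filters instead of hard-coding them:

```python
import numpy as np

# A tiny grayscale "image": a 5x5 matrix of brightness values (0 = black, 255 = white)
image = np.array([
    [0,   0,   0,   0,   0],
    [0, 255, 255, 255,   0],
    [0, 255,   0, 255,   0],
    [0, 255, 255, 255,   0],
    [0,   0,   0,   0,   0],
], dtype=float)

# A 3x3 filter that reacts to vertical edges
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

# "Convolution": slide the filter across the image, taking a weighted sum at each stop
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out)  # large absolute values mark the vertical edges of the square
```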
The system, “listening” sequentially to the “words” – mathematical vectors – translates them into new representations, which serve both for language-to-language translation and for computational models that “understand” the language.
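As a toy illustration, this is what “words as mathematical vectors” might look like. The numbers below are invented for the example; real embeddings have hundreds of learned dimensions:

```python
# Each word becomes a list of numbers; similar words end up with similar numbers.
# These values are made up for illustration, not taken from any real model.
embedding = {
    "gato":  [0.91, 0.10, 0.30],   # Portuguese for "cat"
    "gatto": [0.89, 0.12, 0.28],   # Italian for "cat"
    "nave":  [0.05, 0.80, 0.55],   # "ship" in both languages
}

def similarity(u, v):
    """Dot product: a simple measure of how aligned two vectors are."""
    return sum(a * b for a, b in zip(u, v))

print(similarity(embedding["gato"], embedding["gatto"]))  # high: close meanings
print(similarity(embedding["gato"], embedding["nave"]))   # low: distant meanings
```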
Of course, this allowed me to translate small sentences from Portuguese into Italian, but when the sentences became longer, or even grew into a whole document, the word-for-word system no longer worked very well. I had to increase my ability to concentrate and try to better understand the context in which each new Portuguese word appeared, and of course relate it to my knowledge of Italian.
The approach, despite the increased level of attention, is still sequential, and it clearly shows its limits. In my case, with every new Portuguese word I try to read it through Italian, paying close attention, but certain ambiguities of the language I can hardly interpret correctly. The same happens on the computer, where complex semantic rules are hard for the model to capture.
I have to say that my level of translation from Portuguese to Italian has increased considerably: I can translate, albeit with some errors, sentences much longer than with the previous methods. At this point, however, I want more. I want to be faster and more precise, to understand the context much better, and to reduce ambiguities. I need some kind of parallel processing, as well as knowledge of context, and finally I need to capture long-term dependencies.
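That combination is exactly what the attention mechanism at the heart of transformers provides. Here is a minimal sketch of scaled dot-product attention on toy random vectors, just to show that every word is compared with every other word in parallel, not one at a time:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each word 'looks at' all the others at once."""
    scores = Q @ K.T / np.sqrt(K.shape[1])         # relevance of every word to every other
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row becomes probabilities
    return weights @ V                             # context-aware mixture of the word vectors

# Three toy word vectors of dimension 4, processed together rather than sequentially
words = np.random.rand(3, 4)
print(attention(words, words, words).shape)  # (3, 4): one context-enriched vector per word
```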
Let’s take a little example and look at these two sentences:
- I went to the bank to open an account.
- The ship had approached the bank.
The word “bank” is written exactly the same in both, but only the surrounding context tells us whether it is a financial institution or the side of a river; a word-for-word approach has no way to tell the two apart, as the sketch below shows.
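A contextual model like BERT gives “bank” a different vector in each sentence. The sketch below uses the Hugging Face transformers library (my choice for the demonstration, nothing more) to compare the two vectors:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "I went to the bank to open an account.",
    "The ship had approached the bank.",
]

vectors = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]    # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    vectors.append(hidden[tokens.index("bank")])         # the vector for "bank"

similarity = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"Similarity between the two 'bank' vectors: {similarity.item():.2f}")
# Well below 1.0: the model gives the same word two different "meanings"
```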
My next step in learning Portuguese was to read lots of books, listen to Portuguese television, watch Portuguese-language movies, and so on. I tried to significantly increase my vocabulary and to understand the language and its dependencies.
Learning the language this way, you realize that exposure to a wide variety of texts is exactly what Transfer Learning exploits. First, read lots of books to build a strong vocabulary and understanding of the language. Then, when some words in a sentence are masked or hidden, rely on that knowledge of the language and read the entire sentence both from left to right and from right to left (bidirectionally) to guess them.
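This game of guessing hidden words is exactly how BERT is trained (masked language modeling). With the Hugging Face transformers library, one convenient way to try it, you can play the game in a few lines:

```python
from transformers import pipeline

# A pre-trained BERT guesses the hidden word using the context on BOTH sides
fill = pipeline("fill-mask", model="bert-base-uncased")

for guess in fill("I went to the [MASK] to open an account."):
    print(f'{guess["token_str"]:>10}  score: {guess["score"]:.2f}')
# "bank" should appear at or near the top of the list
```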
Are you confused? Don’t worry, so am I. I will only try to explain the real advantages of BERT from a practical point of view.
Compared with everything we have seen so far, BERT is a big evolution. It collects all the features of the previous models, from word embeddings to transformers, with all the advantages they bring. And it adds other very interesting practical innovations:
- BERT is bidirectional: it doesn’t just “read” from left to right, it also reads the other way. This allows it to better “understand” words in context, not only ambiguous words but related words too. An example: “Mike has gone to the stage. He had a great time!” BERT understands that “he” refers to Mike, which is no small thing when solving language problems.
- While training, BERT doesn’t just “read”: it hides 15% of the words and tries to “guess” them. In this way it builds knowledge that goes beyond “reading”; it learns to anticipate a word from the surrounding context, and even to predict whether one sentence follows another (there is a small demonstration below). That is invaluable in an automatic question-and-answer system or a chatbot, for instance.
- BERT offers several generic pre-trained models that can be “loaded” and then fine-tuned to the specific case (e.g. topic detection or sentiment analysis), without needing a huge mass of data for the fine-tuning.
This last point matters enormously to anyone who has ever had to label data to train an NLP model.
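The “predict whether one sentence follows another” trick can also be tried directly. Here is a small sketch using the Hugging Face BertForNextSentencePrediction class, reusing the Mike example from above:

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

first = "Mike has gone to the stage."
candidates = ["He had a great time!", "Penguins live in Antarctica."]

for second in candidates:
    inputs = tokenizer(first, second, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # index 0 = "follows", index 1 = "does not follow"
    probability = torch.softmax(logits, dim=1)[0, 0].item()
    print(f"{second!r}: probability of being the next sentence = {probability:.2f}")
```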
I’ve been working in Natural Language Processing for several years – the real thing, not the keyword search that my competitors pass off as Text Mining – and I was impressed with BERT. Let me give you a small practical example. I built a sentiment model with a final F1 score of 89% from a dataset composed as follows:
- 1,270 happy
- 154 indifferent
- 26 angry
- 11 bored
- 3 frustrated
All this is only possible because, by using Transfer Learning and the generic models available from BERT, even classes with very few examples (in our case FRUSTRATED, with just 3) can be fine-tuned. It is practically as if you could load into your brain a model that summarizes the linguistic knowledge obtained by reading all of Wikipedia in Portuguese, and then do just a little fine-tuning for the specific case you want to solve. A leap forward in NLP!
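For the curious, the fine-tuning step can be sketched in a few lines. The snippet below is only an outline: the two toy texts stand in for my real dataset, which is not published here, and bert-base-multilingual-cased is just one reasonable starting checkpoint:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["happy", "indifferent", "angry", "bored", "frustrated"]

# Toy stand-ins for the real training data
train_texts = ["What a wonderful day!", "Nothing in this app ever works!"]
train_labels = [0, 4]  # indices into `labels`: happy, frustrated

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

class SentimentDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels in the format Trainer expects."""
    def __init__(self, texts, targets):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.targets = targets
    def __len__(self):
        return len(self.targets)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.targets[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-bert", num_train_epochs=3),
    train_dataset=SentimentDataset(train_texts, train_labels),
)
trainer.train()  # the pre-trained knowledge does most of the work
```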
I almost forgot: BERT comes with a whole range of pre-trained models for Transfer Learning, and it covers a large number of languages. For example, the BERT-Base Multilingual Cased model “has read texts” in 104 different languages, and it can be fine-tuned with your own small dataset in each of them.
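To see the multilingual model in action, the same fill-in-the-blank trick from before works unchanged in other languages; only the sentence changes:

```python
from transformers import pipeline

# One checkpoint, 104 languages; the mask token is [MASK], as in the English model
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

print(fill("Eu fui ao [MASK] para abrir uma conta.")[0]["token_str"])       # Portuguese
print(fill("Sono andato in [MASK] per aprire un conto.")[0]["token_str"])   # Italian
```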