BERT: Bidirectional Transformers for Language Understanding

One of the major advances in deep learning in 2018 was the development of effective NLP transfer learning methods such as ULMFiT, ELMo and BERT. Bidirectional Encoder Representations from Transformers, aka BERT, has shown such strong empirical performance that it will certainly remain a core method in NLP for years to come.

Although the original article is not a difficult read, it can be hard to follow without the necessary background. This post covers BERT: a general idea of the method, a short introduction, and a walk-through of fine-tuning the model on a binary classification task. I show the relevant code below, but the exact code can be found here on github. It is heavily based on the pytorch-transformers framework (link) and is implemented in PyTorch.

Introduction

Traditionally, there are two ways of doing transfer learning: we either transfer the language understanding by encoding it as embeddings, or we transfer the model itself, that is, the architecture of a deep neural network with its weights pre-initialized on another task.

Transfer learning based on pre-training a neural network on another task is highly successful and widely applied in computer vision. Since training a model from scratch would be very time-consuming, we only fine-tune the pre-trained model on the specific task. Intuitively, the model has already learned to recognise basic forms, so we can concentrate on the last few layers, which are much more task-specific and need to be adjusted.

Natural Language Processing (NLP) experienced a breakthrough in 2018 with the arrival of successful transfer learning methods. This means that now, with minimal training (fine-tuning), we can achieve results that previously required days, maybe weeks of training. Intuitively, of course, it makes sense to do transfer learning in NLP: once a model has learned the syntactic and semantic relationships in a language, it should perform better even on specific tasks in that language, right?

But why couldn't transfer learning be used before then? Well, previous attempts suffered from “catastrophic forgetting” when the model was fine-tuned on a domain-specific corpus or task. I find this name quite amusing, although it fits the situation well: you train your language model on a large Wikipedia corpus for days just to realise that, when applied to another dataset, it forgets everything it learned previously!

But why did this happen to NLP models while transfer learning worked so efficiently in computer vision (CV)? Well, nobody is really sure, but one reason might be that NLP models used to be shallower than CV ones. With shallow models, even the first few layers are already specialised, whereas CV models use many layers, the shallowest ones capturing only basic features. Therefore, when fine-tuning CV models, by focusing only on the last layers we can keep the less specific features while slightly changing the last, task-specific ones.

But now we have arrived at a new era: the era of transfer learning for high-performance language models! Let's now look at BERT and how exactly transfer learning is done with it.

BERT Introduction

Bidirectional Encoder Representations from Transformers (BERT) is an NLP transfer learning method based on the Transformer architecture. If you are not familiar with the Transformer, check my blog here, but in a nutshell the Transformer is a sequence-to-sequence model consisting of an Encoder and a Decoder unit. Instead of using recurrent networks, it builds heavily on the Attention mechanism. The Encoder takes a source sentence (a sequence) and projects it to a smaller hidden dimension, which is then fed into the Decoder. The Decoder “decodes” this representation and produces the translation of the source sentence.

With the Attention mechanism, the relationships between the input sequence's elements (the input words) are no longer defined by their relative distances but are encoded as importance weights (attention) computed between all pairs of words. Therefore the Transformer architecture can be trained faster than recurrent networks while keeping the ability to remember long sentences.
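To make the attention idea a bit more concrete, here is a minimal sketch of the scaled dot-product attention used inside the Transformer (plain PyTorch; the function and variable names are mine and purely illustrative, not taken from any library):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(queries, keys, values):
    # queries, keys, values: tensors of shape (seq_len, d_model)
    d_k = keys.size(-1)
    # similarity between every pair of positions, scaled by sqrt(d_k)
    scores = queries @ keys.transpose(-2, -1) / d_k ** 0.5
    # attention weights: how much each word "attends to" every other word
    weights = F.softmax(scores, dim=-1)
    # each output vector is a weighted mix of all the value vectors
    return weights @ values

# toy example: a "sentence" of 4 words, hidden size 8; self-attention uses x as q, k and v
x = torch.randn(4, 8)
print(scaled_dot_product_attention(x, x, x).shape)   # torch.Size([4, 8])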

Now you might say that you don't actually want to do translation, so the Transformer architecture is not needed! But hold on a little, and you will see the magic trick here. The Encoder unit encodes the input sentence, creating a hidden representation (this sounds fancy but it is really just a smaller-dimensional tensor) for each word. Then the Decoder decodes it by translating each word to the target language.

Therefore, by stopping at the Encoder level and disregarding the Decoder unit, we get a hidden representation for each word that encodes plenty of information about the language. Keep in mind that the Encoder uses the Attention mechanism, so it can actually encode the same word differently in different contexts. This all sounds good: we have an imaginary Encoder unit and we somehow input a sequence of words into it, but how do we train it?

Traditionally, language models are trained either by predicting the next word/character conditional on the previous few words/characters, or by predicting a word/character conditional on the following ones. We could do the same, but this method has a limitation: it is not bidirectional!

So you might say: why would we need a bidirectional language model?

Well, imagine that you want to predict whether “Teddy” refers to a person in the following two sentences:

“My Teddy bear is brown.”
“Teddy Roosevelt was the president of the US.”


This example illustrates why one-directional methods are not sufficient for all language modeling tasks. Now imagine that the task changes: you only need to predict the missing word (replaced by [MASK] below):

“My [MASK] bear is brown.”
“[MASK] Roosevelt was the president of the US.”

Now you have an intuition of how BERT works: it is bidirectional because the pre-training is done by masking some words in the input sequence and training the model on two tasks (a quick illustration of the first task follows after the list):

1) predicting the masked words (and only the masked words)
2) predicting whether two sentences actually follow each other (IsNextSentence)
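Just to illustrate the first task: a pre-trained BERT can already fill in such masks out of the box. Here is a minimal sketch with pytorch-pretrained-bert (assuming the 'bert-base-cased' weights download correctly; the predicted token in the comment is only what I would expect, not a verified output):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
model = BertForMaskedLM.from_pretrained('bert-base-cased')
model.eval()

text = "[CLS] [MASK] Roosevelt was the president of the US . [SEP]"
tokens = tokenizer.tokenize(text)
masked_index = tokens.index('[MASK]')
token_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    predictions = model(token_ids)                 # shape: (1, seq_len, vocab_size)
best_id = predictions[0, masked_index].argmax().item()
print(tokenizer.convert_ids_to_tokens([best_id]))  # hopefully something like ['Theodore']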

So now let’s go to the details.

BERT architecture

Pre-training

For the pre-training, BERT takes input sequences, that is, two sentences that follow each other (or not). The authors build sequence pairs so that 50% of them are sentences that actually follow each other, while in the remaining 50% the second sentence is randomly assigned.

Next, once each sentence pair (which I will refer to as a sequence) is created, there is an initial transformation of the sequence:

1) First, the authors add a [CLS] token at the beginning of each sequence

2) They also add a [SEP] token between the first and second sentence in each sequence, to indicate the start of the second sentence

3) They randomly choose 15% of the tokens to be predicted; however, they do not mask all of them. 10% of the chosen tokens are switched to another, randomly chosen token, 10% are kept unchanged, and the remaining 80% are replaced by [MASK] (a small sketch of this selection rule follows below).
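Here is my own illustration of that selection rule in plain Python (not the authors' preprocessing code):

import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    # choose ~15% of tokens as prediction targets; of those,
    # 80% -> [MASK], 10% -> a random token, 10% -> left unchanged
    masked, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if random.random() < select_prob:
            targets[i] = token                      # the model must predict these positions
            dice = random.random()
            if dice < 0.8:
                masked[i] = '[MASK]'
            elif dice < 0.9:
                masked[i] = random.choice(vocab)    # switched to a random token
            # else: token kept unchanged on purpose
    return masked, targets

print(mask_tokens("my teddy bear is brown".split(), vocab=["cat", "blue", "house"]))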

(Figure: example BERT input sequences with [CLS], [SEP] and [MASK] tokens)

Next, the words need to be embedded. The BERT architecture uses learned token embeddings with positional encodings (as in the Transformer architecture so far), plus a segment embedding indicating whether the token belongs to the first or the second sentence (this embedding is also learned).
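A rough sketch of how the three embeddings are combined before entering the Encoder (the dimensions correspond to the base model; the class is mine, not BERT's actual implementation):

import torch
import torch.nn as nn

class InputEmbeddingsSketch(nn.Module):
    # token + position + segment embeddings, summed element-wise
    def __init__(self, vocab_size=28996, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)   # positions are embedded too
        self.segment = nn.Embedding(2, hidden)          # 0 = first sentence, 1 = second sentence

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1)).unsqueeze(0)
        return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)

emb = InputEmbeddingsSketch()
out = emb(torch.randint(0, 28996, (1, 10)), torch.zeros(1, 10, dtype=torch.long))
print(out.shape)   # torch.Size([1, 10, 768])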

(Figure: the token, position and segment embeddings of the input sequence)

This is then fed into the Encoder unit, containing 12 or 24 encoder layers (depending on which BERT we are talking about). The language model is then trained on two things:

  1. Predicting the masked tokens (Masked Language Modeling, MLM)
  2. Predicting whether the two sentences really follow each other (Next Sentence Prediction)

The first objective is achieved by applying a fully connected layer with a GELU activation and layer normalization, followed by a softmax that returns a probability for each word of the vocabulary; the second is achieved by building a classifier on the output representation of the [CLS] token, followed by a softmax predicting IsNextSentence or NotNextSentence. These two objectives are trained together.

(Figure: the BERT pre-training architecture with the MLM and next-sentence heads)

BERT has been pre-trained on these two tasks on a massive corpus and can be used directly for a specific task by selecting the model head (classification, next sentence prediction, ..) and the type of architecture we wish to use.
It comes principally in two different architectures:

  • base architectures: 12 encoder layers in the encoder unit, a hidden dimension of 768 and 12 attention heads, around 110M parameters
  • large architectures: 24 encoder layers in the encoder unit, a hidden dimension of 1024 and 16 attention heads, over 340M parameters!

The models are available both cased and uncased (lowercased), so you can select the one you think is more relevant for your target task. They are also available for different languages (English, Chinese, a multilingual model). Please check the list of available models (here), but keep in mind that the large models are quite heavy to fine-tune and might give a memory error.

Fine-tuning

For fine-tuning, you load the model type and architecture you have chosen (for instance a base model for classification) and train for a few epochs (the authors recommend 2-4 epochs) on the target task and target corpus. That's it really, it sounds quite simple!

Let's then do a binary classification example on the IMDB corpus (selecting only the labeled examples, so training and evaluating the classifier on 50 000 labeled examples in total). You can download the dataset from here. In what follows I show excerpts of the code, but the full code can be found on my github. The code is based heavily on the pytorch-transformers examples, please check them on github (repo).

Example of binary classification with BERT 

First, install and import the package (this can be done simply with pip install pytorch-transformers) and organize your data into two (or three) tsv files in the bert_data_path. In this introduction I only use a training set and a hold-out evaluation set; however, you can also use a test set (where you predict unknown labels). It is also good practice to compare the training loss with a validation loss, and by defining a subset of the labeled data as a validation set the code can easily be modified to do so. I do not do it here, as training would become even longer!

I use the ‘bert-base-cased’ model in this example with the following parameters:

In [22]:
import torch 
%matplotlib inline
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [3]:
TASK_NAME = "classification"
params = {"DATA_DIR": "./imdb/", 
          "BERT_DATA_PATH": "./imdb/bert_data/",
          "BERT_MODEL": "bert-base-cased", 
          "PRE_TRAINED_MODEL": None,
          "TASK_NAME": f"{TASK_NAME}", 
          "OUTPUT_DIR": f"outputs/{TASK_NAME}/", 
          "REPORTS_DIR": f'reports/{TASK_NAME}_evaluation_report/', 
          "MAX_SEQ_LENGTH": 128, 
          "TRAIN_BATCH_SIZE": 32,
          "EVAL_BATCH_SIZE": 32, 
          "LEARNING_RATE": 2e-5, 
          "NUM_TRAIN_EPOCHS": 6, 
          "RANDOM_SEED": 42, 
          "GRADIENT_ACCUMULATION_STEPS": 1, 
          "WARMUP_PROPORTION":  0.1, 
          "OUTPUT_MODE": "classification", 
          "CONFIG_NAME": "config.json", 
          "CACHE_DIR": 'cache/', 
          "WEIGHTS_NAME": "pytorch_model.bin",  
          "DEVICE": torch.device("cuda" if torch.cuda.is_available() else "cpu")}

Here a little explanation might come in handy. I indicate that my model is the base cased model, with a maximum sequence length of 128 and a training batch size of 32. The maximum sequence length is quite small; if you have more memory you might be able to use a larger sequence length of 256 or 512. This is computationally more expensive but can give a better language understanding. Also, for the IMDB dataset it would be advantageous to have a larger sequence length, as the reviews are generally longer than 128 tokens. Shorter inputs are padded while longer inputs are truncated.
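As a toy illustration of this padding/truncation rule (my own helper with made-up token ids, not the library's code):

def pad_or_truncate(token_ids, max_seq_length=128, pad_id=0):
    # keep at most max_seq_length ids, pad the rest with pad_id
    ids = token_ids[:max_seq_length]
    mask = [1] * len(ids) + [0] * (max_seq_length - len(ids))   # 1 = real token, 0 = padding
    ids = ids + [pad_id] * (max_seq_length - len(ids))
    return ids, mask

ids, mask = pad_or_truncate([101, 146, 1567, 1142, 2523, 102], max_seq_length=8)
print(ids)    # [101, 146, 1567, 1142, 2523, 102, 0, 0]
print(mask)   # [1, 1, 1, 1, 1, 1, 0, 0]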

I use batch sizes of 32 in training and evaluation, I train for 6 epochs and use a gradient accumulation step of 1. If you have less memory, you can try decreasing the batch size to 16, but with batch sizes smaller than 16 the gradients might become very dependent on the chosen batch. In that case you can use gradient accumulation.

A little reminder about gradient accumulation:

As already said, when training large neural networks you can run out of memory, so you might reduce the batch size in order to be able to train your model. But small batch sizes can lead to inefficient training due to very noisy gradients. Gradient accumulation (averaging) is a technique that increases the effective mini-batch size arbitrarily by computing the gradient on several batches and then averaging them.
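A toy sketch of what gradient accumulation looks like in a PyTorch training loop (the tiny linear model is only there to make the snippet runnable, it is not the BERT model):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                    # stand-in model, just for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fct = nn.CrossEntropyLoss()
batches = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(8)]

accumulation_steps = 4                      # effective batch size = 4 * 4 = 16
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(batches):
    loss = loss_fct(model(inputs), labels)
    (loss / accumulation_steps).backward()  # gradients add up over the small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # one parameter update per accumulated batch
        optimizer.zero_grad()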

In my code I choose a gradient accumulation step of 1, that is, no gradient accumulation. Finally, the trained model will be saved as pytorch_model.bin.

Next, load the data and organize it in the following way:

- the subset 'unsup' is unlabeled, usually used for prediction (test). I will not use this data here.
- the subset 'train' contains 25K labeled examples; I use this for training.
- the hold-out validation set ('evaluation') will consist of the subset 'test'.
In [4]:
import pandas as pd 

df = pd.read_csv("imdb/imdb.csv", sep='|')
df.head()
Out[4]:
(Output: the first rows of the imdb dataframe)

Now we need to rearrange the data. The following function serves only to put the data in the format BERT requires as input:

  • The training and validation (evaluation) datasets need to have an “id” and a “label” column, followed by a dummy string column “alpha” and the actual “text” to classify. (The “id” should be unique.)
  • If we have a test set that we wish to use for prediction, it only needs an id and a text column; however, I use the test set here as a hold-out evaluation set and not really for prediction.

Therefore, for me, the test set is labeled and I use it for evaluation. That is, I train on the 25 000 reviews with ‘subset’ == ‘train’ in the original imdb dataframe and evaluate the model on the 25 000 labeled examples in the ‘test’ subset.

If you also want a set that you use purely for prediction (the subset == ‘unsup’, 50 000 reviews), you need to convert it in the same way as the train and evaluation sets, but keeping only two columns: “id” and “text”.
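The actual transformation is done by get_data_for_bert_class in my repo; as a hedged sketch, it amounts to something like the snippet below ('subset' is the column used above, but the raw column names 'label' and 'review' in imdb.csv are assumptions on my part):

# Hedged sketch of the rearranging (not the actual get_data_for_bert_class code).
# It builds the four columns described above: id, label, alpha, text.
train_raw = df[df['subset'] == 'train']
train_sketch = pd.DataFrame({
    'id': range(len(train_raw)),            # unique id per example
    'label': train_raw['label'].values,     # 0 / 1 sentiment label (assumed column name)
    'alpha': ['a'] * len(train_raw),        # dummy string column required by the format
    'text': train_raw['review'].values,     # the raw review text (assumed column name)
})
train_sketch.to_csv(params["BERT_DATA_PATH"] + "train.tsv", sep='\t', index=False, header=False)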

In [5]:
from lib.data import get_data_for_bert_class

# We can run this code with recreate=False for second time
train_bert, evaluation_bert = get_data_for_bert_class(params, recreate=True)
Saved files to ./imdb/bert_data/
In [6]:
train_bert.head()
Out[6]:
(Output: the first rows of train_bert)
In [7]:
evaluation_bert.head()
Out[7]:
(Output: the first rows of evaluation_bert)
In [8]:
train_bert.shape, evaluation_bert.shape
Out[8]:
((25000, 4), (25000, 4))

Data to features

Before we transform the data, we need to define a tokenizer, as the reviews will be truncated/padded and tokenized. I use the BertTokenizer for the base model and add it to my params dictionary:

In [9]:
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(params["BERT_MODEL"], do_lower_case=False)
params["TOKENIZER"] = tokenizer

Now I return to the data. Once we have the two datasets as tsv files, we can transform them into the input format the model expects: DataLoaders. For BERT we generally have a sequence, that is, a first and a second sentence. For classification we also need to provide the label of each example, and a unique id is needed as well. This is captured in the InputExample class. Notice that when we use the model only for classifying a review, we do not need to give a second sentence (text_b is None).

In [10]:
class InputExample(object):
    """
    A single training/test example for simple sequence classification in BERT.
    """

    def __init__(self, guid, text_a, text_b=None, label=None):
        """
        Constructs an InputExample
        
        :param guid: Id for the example
        :param text_a: (str) The untokenized text of the first sequence.
        :param text_b: (str), (optional) The untokenized text of the second sequence.
        :param label: (str), (optional) The label of the example. This should be
           specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

Next, the BinaryClassificationProcessor class reads the data (saved as .tsv files) and converts it to InputExample objects. If we wish to specify a test set for prediction, we only need to save the id and the text.

In [11]:
import os

from lib.input_data_processing import DataProcessor

class BinaryClassificationProcessor(DataProcessor):
    """
    Processor for binary classification dataset.
    """
  
    def get_examples(self, data_dir, set_type):
        """
        Gets the examples for the given set_type set. 
        
        :params data_dir: str, the path of the train.tsv/ test.tsv/ dev.tsv files 
        :params set_type: str, either 'train', 'evaluation' or 'test'
        """
        data = set_type + ".tsv"
        data = self._read_tsv(os.path.join(data_dir, data))
        return self._create_examples(data, set_type)

    def get_labels(self):
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """
        Creates examples for the training and dev sets.
        """
        examples = []
        for (i, line) in enumerate(lines):
            if len(line) == 4:
                guid = "%s-%s" % (set_type, i)
                text_a = line[3]
                label = line[1]
                examples.append(
                    InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
            else: 
                guid = "%s-%s" % (set_type, i)
                text_a = line[3]
                examples.append(
                    InputExample(guid=guid, text_a=text_a, text_b=None, label=None))
        return examples

After applying the BinaryClassificationProcessor to our data, it organizes each review as an InputExample with an id (“guid”), the review (“text_a”), an optional second sentence (“text_b”, None in this case) and the label (None for the test set, if there is one).

The cell below shows an example:

In [12]:
import os

processor = BinaryClassificationProcessor()
train_examples = processor.get_examples(params["BERT_DATA_PATH"], 'train')
train_examples[0].guid, train_examples[0].text_a[:128], train_examples[0].text_b, train_examples[0].label
Out[12]:
('train-0',
 'Loony Tunes have ventured (at least) twice into the future. The first time was with the brilliantly funny "Duck Dodgers". The la',
 None,
 '0')
Next, we transform the data into the form BERT uses:
  • first, we tokenize the truncated reviews with the WordPiece tokenizer. This gives the input_ids (the id of each token). We also create an input_mask: it is 1 for real tokens and 0 where the review was short and had to be padded.
  • we also create segment_ids. Following the original article, for each sentence pair the tokens belonging to the first or second sentence are indicated by 0/1 respectively (with a single review everything belongs to the first sentence).
  • finally we add the label to each example as well.

These four things combined define the InputFeatures class:

In [13]:
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, \
                 segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id

Therefore, I truncate/pad the reviews according to max_seq_len, tokenize them with the BertTokenizer and create InputFeatures objects from them. I do that for each dataset, creating the features described above. Finally, I create a dataloader from the features by putting them into a TensorDataset, defining a RandomSampler and combining them into a DataLoader. This object holds the data in the predefined batches, so it needs to be recreated when the batch size changes.
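The exact conversion lives in lib/convert_examples_to_features; as a hedged sketch, for a single-sentence example it boils down to something like this (reusing the InputExample/InputFeatures classes, tokenizer and params defined above):

def example_to_feature_sketch(example, tokenizer, max_seq_length, label_map):
    # tokenize, truncate, add the special tokens, pad, and build the three id lists
    tokens = tokenizer.tokenize(example.text_a)[:max_seq_length - 2]   # leave room for [CLS]/[SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)                   # 1 for real tokens
    segment_ids = [0] * len(input_ids)                  # single sentence -> all zeros
    padding = [0] * (max_seq_length - len(input_ids))   # 0-padding up to max_seq_length
    return InputFeatures(input_ids=input_ids + padding,
                         input_mask=input_mask + padding,
                         segment_ids=segment_ids + padding,
                         label_id=label_map[example.label])

feature = example_to_feature_sketch(train_examples[0], tokenizer,
                                    params["MAX_SEQ_LENGTH"], label_map={"0": 0, "1": 1})
print(len(feature.input_ids), feature.label_id)   # 128 0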

In [ ]:
from lib.convert_examples_to_features import create_features

# Run this code with recreate=False if you have already created the dataloaders with the given batchsize
train_features, train_examples, train_examples_len = create_features(params, 'train', recreate=True)
eval_features, eval_examples, eval_examples_len = create_features(params, 'evaluation', recreate=True)
In [15]:
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)

print("***** Transform train features into a dataloader *****")
print("  Number of training examples = ", train_examples_len)
print("  Batch size = ", params["TRAIN_BATCH_SIZE"])
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
# Input the train data into Tensors
train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=params["TRAIN_BATCH_SIZE"])

print("***** Transform validation features into a dataloader *****")
print("  Number of validation examples = ", eval_examples_len)
print("  Batch size = ", params["EVAL_BATCH_SIZE"])
all_input_ids_dev = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask_dev = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids_dev = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_label_ids_eval  = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)

validation_data = TensorDataset(all_input_ids_dev, all_input_mask_dev, all_segment_ids_dev, all_label_ids_eval)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=params["EVAL_BATCH_SIZE"])
***** Transform train features into a dataloader *****
  Number of training examples =  25000
  Batch size =  32
***** Transform validation features into a dataloader *****
  Number of validation examples =  25000
  Batch size =  32

Once we have transformed the input data into the desired format, I define the model. As I said previously, the model I chose is the base model with 12 encoder layers, a hidden size of 768 and 12 attention heads.

In [17]:
from pytorch_pretrained_bert import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(params["BERT_MODEL"], \
                                                      cache_dir=params["CACHE_DIR"], num_labels=2)
model.to(params["DEVICE"])
Out[17]:
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    ) ...
In [20]:
# Define some additional parameters 
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

from pytorch_pretrained_bert.optimization import BertAdam

num_train_optimization_steps = int(train_examples_len
     / params["TRAIN_BATCH_SIZE"] / params["GRADIENT_ACCUMULATION_STEPS"]) * params["NUM_TRAIN_EPOCHS"]


optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=params["LEARNING_RATE"],
                     warmup=params["WARMUP_PROPORTION"],
                     t_total=num_train_optimization_steps)

params["OPTIMIZER"] = optimizer

Now we can train our model

Finally, we can train the model according to the parameters defined in the params dictionary.
Note that here I use no evaluation data during training, but it is good practice to do so, in order to see not only the training loss but also the loss on a dataset we do not train on. (Simply define another DataLoader and pass it to the training loop as a validation_dataloader; but prepare yourself, training will take even longer!)
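For reference, a hedged sketch of how such a validation loss could be computed at the end of each epoch (my own helper, to be called from inside the epoch loop; it reuses the validation_dataloader built earlier):

from torch.nn import CrossEntropyLoss

def validation_loss(model, dataloader, device, num_labels=2):
    # average cross-entropy on the hold-out set, without any gradient updates
    loss_fct, total, steps = CrossEntropyLoss(), 0.0, 0
    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            input_ids, input_mask, segment_ids, label_ids = (t.to(device) for t in batch)
            logits = model(input_ids, segment_ids, input_mask, labels=None)
            total += loss_fct(logits.view(-1, num_labels), label_ids.view(-1)).item()
            steps += 1
    model.train()          # switch back to training mode
    return total / steps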

In [23]:
from torch.nn import CrossEntropyLoss
from time import time
from tqdm import tqdm_notebook, trange

num_labels = 2
train_loss = []
global_step = 0
nb_tr_steps = 0

start = time()
model.train()
loss_fct = CrossEntropyLoss()

for _ in trange(int(params["NUM_TRAIN_EPOCHS"]), desc="Epoch"):
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(tqdm_notebook(train_dataloader, desc="Iteration")):
        batch = tuple(t.to(params["DEVICE"]) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        logits = model(input_ids, segment_ids, input_mask, labels=None)
        loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
        train_loss.append(loss.item())  # store the value only, not the tensor with its graph
        
        if params["GRADIENT_ACCUMULATION_STEPS"] > 1:
            loss = loss / params["GRADIENT_ACCUMULATION_STEPS"]

        loss.backward()
        print("\r%f" % loss, end='')

        tr_loss += loss.item()
        nb_tr_examples += input_ids.size(0)
        nb_tr_steps += 1
        if (step + 1) % params["GRADIENT_ACCUMULATION_STEPS"] == 0:
            optimizer.step()
            optimizer.zero_grad()
            global_step += 1

end = time()
Epoch:   0%|          | 0/6 [00:00]
Epoch:  17%|█▋        | 1/6 [04:01<20:06, 241.25s/it]
Epoch:  33%|███▎      | 2/6 [08:03<16:06, 241.63s/it]
Epoch:  50%|█████     | 3/6 [12:06<12:05, 241.86s/it]
Epoch:  67%|██████▋   | 4/6 [16:08<08:04, 242.03s/it]
Epoch:  83%|████████▎ | 5/6 [20:11<04:02, 242.19s/it]
Epoch: 100%|██████████| 6/6 [24:13<00:00, 242.30s/it]

Evaluation

Now we evaluate the model on the whole validation set.

First, I load the model back (you don't need to reload it if you run the notebook in one go, but it is good practice to save each model!), then I directly use the evaluate_model function (check the github repo for the exact code).
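The exact evaluate_model is in the repo; in spirit it does something like the sketch below (here only accuracy is computed, whereas the real function also reports mcc, precision, recall and so on):

import numpy as np

def evaluate_model_sketch(model, params, dataloader, all_label_ids):
    # collect logits batch by batch, then compare argmax predictions with the true labels
    model.eval()
    logits_list = []
    with torch.no_grad():
        for batch in dataloader:
            input_ids, input_mask, segment_ids, _ = (t.to(params["DEVICE"]) for t in batch)
            logits = model(input_ids, segment_ids, input_mask, labels=None)
            logits_list.append(logits.detach().cpu().numpy())
    preds = np.concatenate(logits_list).argmax(axis=1)
    accuracy = (preds == all_label_ids.numpy()).mean()
    return {"accuracy": accuracy}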

In [43]:
# Load the pre-trained model (weights)
model = BertForSequenceClassification.from_pretrained("cache/model.tar.gz", \
                                                      cache_dir=params["CACHE_DIR"], num_labels=2)
model.to(params["DEVICE"])
In [27]:
from lib.model import evaluate_model

evaluate_model(model, params, validation_dataloader, all_label_ids_eval, num_labels=2)
{'task': 'classification', 'mcc': 0.7723200889712794, 'accuracy': 0.88616, 'precision': 0.8863454458139907, 'recall': 0.88592, 'f1-score': 0.8861326718412419, 'tp': 11074, 'tn': 11080, 'fp': 1420, 'fn': 1426, 'eval_loss': 0.5760936848581066}

And that’s it really! Not that difficult once the pretrained model can be used directly!

Some takeaways:

  1. Is this result good? Not really, although I did not train it for long. A good accuracy with a DL technique would be something like 0.94, not 0.89. We can even achieve better results than this with a simple TF-IDF!
  2. I find the model easy to use but a little heavy. I could not import and train the large model successfully, even on a p3.2xlarge AWS instance.
  3. The model might perform better with more fine-tuning. One thing I find problematic in this implementation is the max_seq_len: the reviews are quite long, and we disregard a big part of each review by dropping everything after max_seq_len.

Altogether, although the results are not great, I find the approach of this model very interesting. It gives a bidirectional (or rather non-directional) model that can be fine-tuned on several NLP tasks. Some improvements might be achieved by

  1. increasing the max_seq_len
  2. searching for the optimal learning rate (I used the one advised by the authors)
  3. searching for the optimal batch size, maybe using accumulated gradients
  4. fine-tuning for more epochs!

Finally, model size does matter: the large model would certainly give better results, although I think it might be impractical because of its size. I also read that fine-tuning the language model itself might improve results. Although I tried that (easy to do with pytorch-transformers, see here for more info), it did not improve my results.

Finally, although this is only an intuitive introduction to BERT, there are some other libraries (built on pytorch-transformers) that might be easier to use and are worth a try: this, this.

