In information theory, byte pair encoding (BPE), also known as digram coding, is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. The Wikipedia article has a good worked example of running BPE on a single string.
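For intuition, here is a minimal sketch of BPE in its original compression role. The function name bpe_compress, the sample string, and the placeholder symbols are illustrative choices (not from any particular library), and the placeholders are assumed not to occur in the input:
from collections import Counter

def bpe_compress(data, placeholders="ZYXWV"):
    """Repeatedly replace the most frequent pair of adjacent characters
    with an unused placeholder symbol, recording a replacement table."""
    table = {}
    for symbol in placeholders:
        pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats any more, nothing left to compress
        data = data.replace(best, symbol)
        table[symbol] = best
    return data, table

print(bpe_compress("aaabdaaabac"))
# the 11-character string shrinks to 'XdXac' plus a small replacement table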
This technique is also employed in natural language processing models, such as GPT-2, to tokenize word sequences. Why not tokenize a text simply by splitting it on spaces? Well, not every language uses spaces. And even for a language that does, space-based tokenization may not be a good idea:
- If we tokenize a text with heuristic rules (such as the presence of a space), some NLP tasks will run into OOV (out-of-vocabulary) words. For instance, BPE is used for translation tasks precisely because it avoids OOV words (you will understand how if you read on).
- If we use a tokenization technique that is based on whole words rather than on their subunits, we might not be able to see the similarity between strong -> stronger and weak -> weaker.
Altogether, BPE is used for tokenization in NLP tasks by working with the elements of the text, that is, the individual characters that make up its words. Let's see exactly how BPE tokenization is implemented!
BPE tokenization was proposed by Sennrich et al. (2015). The workflow of the BPE tokenization algorithm is described below:
- First, we define a large corpus to train our tokenizer and a maximum vocabulary size.
- Next, we append an end-of-word token, </w>, to each word. This special symbol marks that the preceding subword/character sits at the end of a word (just before a space). We also compute the frequency of each word.
- Count every pair of adjacent symbols across the corpus, merge the most frequent pair into a new subword, and add that subword to the vocabulary.
- Repeat step 3 until we reach the maximum subword vocabulary size (defined in step 1) or until the frequency of the most common pair is 1.
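Let's implement these steps. We start with a toy corpus of a few song lyrics, split every word into its characters, and append the end-of-word token: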
text = """
Lay lady lay,
Lay across my big brass bed
Lay lady lay
Lay across my big brass bed
"""
from collections import Counter

tokens = text.split()
# split each word into its characters and append the end-of-word marker </w>
words = [' '.join(word) + ' </w>' for word in tokens]
vocab = dict(Counter(words))
print(vocab)
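This prints the frequency of each word with its characters already separated, e.g. 'L a y </w>': 4 and 'b r a s s </w>': 2. The helper functions below implement the counting and merging steps; get_stats and merge_vocabulary follow the reference snippet from Sennrich et al. (2015), and get_vocab does the same preprocessing as above but reads the corpus from a file: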
import re, collections
def get_vocab(filename):
    """Read a corpus file and return {word split into characters + ' </w>': frequency}."""
    vocab = collections.defaultdict(int)
    with open(filename, 'r', encoding='utf-8') as fhand:
        for line in fhand:
            words = line.strip().split()
            for word in words:
                vocab[' '.join(list(word)) + ' </w>'] += 1
    return vocab

def get_stats(vocab):
    """Count how often each pair of adjacent symbols occurs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocabulary(pair, v_in):
    """Merge the given symbol pair into a single symbol everywhere in the vocabulary."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # match the pair only when it forms two whole, space-separated symbols
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

def get_tokens(vocab):
    """Collect the current set of tokens (symbols) and their frequencies."""
    tokens = collections.defaultdict(int)
    for word, freq in vocab.items():
        word_tokens = word.split()
        for token in word_tokens:
            tokens[token] += freq
    return tokens
# starting vocabulary built from the lyrics above (the output of the preprocessing step)
vocab = {'L a y </w>': 4, 'l a d y </w>': 2, 'l a y , </w>': 1, 'a c r o s s </w>': 2, 'm y </w>': 2, 'b i g </w>': 2, 'b r a s s </w>': 2, 'b e d </w>': 2, 'l a y </w>': 1}
# alternative toy vocabulary used as the example in the Sennrich et al. paper:
# vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
print('----------------')
print('Tokens Before BPE')
tokens = get_tokens(vocab)
print('Tokens: {}'.format(tokens))
print('# of tokens: {}'.format(len(tokens)))
print('----------------')
num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)       # most frequent adjacent pair
    vocab = merge_vocabulary(best, vocab)  # merge it everywhere in the vocabulary
    print('Iter: {}'.format(i))
    print('Best pair: {}'.format(best))
    tokens = get_tokens(vocab)
    print('Tokens: {}'.format(tokens))
    print('# of tokens: {}'.format(len(tokens)))
    print('----------------')
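And this is how BPE avoids OOV words: to tokenize new text, we simply replay the learned merge operations on each word, starting from single characters. The loop above does not store the merges, so the sketch below assumes they were collected (for example by appending best to a list inside the loop); bpe_tokenize and example_merges are illustrative names, not part of the original implementation:
def bpe_tokenize(word, merges):
    """Segment a word by replaying the learned merges in training order."""
    symbols = list(word) + ['</w>']           # start from single characters
    for pair in merges:                       # apply merges in the order learned
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [''.join(pair)]  # merge the pair in place
            else:
                i += 1
    return symbols

# the first two merges this toy run should learn from the lyrics vocabulary
example_merges = [('y', '</w>'), ('a', 'y</w>')]
print(bpe_tokenize('delay', example_merges))
# ['d', 'e', 'l', 'ay</w>'] -- 'delay' never appears in the corpus, yet every piece is a known symbol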