🤗 BERT tokenizer from scratch

tenexcoder

Posted on November 11, 2020

As part of the 🤗 Tokenizers 0.9 release, it has never been easier to create extremely fast and versatile tokenizers for your next NLP task.
There is no better way to showcase the library's new capabilities than to build a BERT tokenizer from scratch.

Tokenizer

First, BERT relies on WordPiece, so we instantiate a new Tokenizer with this model:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# start from an empty WordPiece model; setting unk_token lets the tokenizer
# encode out-of-vocabulary text as "[UNK]"
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

Next, BERT preprocesses text by lowercasing and stripping accents, so we set up a normalizer that chains Unicode normalization (NFD), lowercasing, and accent removal:

from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents

bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
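To check what the normalizer does, we can run it on a small string directly (a quick sketch using the normalize_str helper exposed by the Python bindings):

bert_tokenizer.normalizer.normalize_str("Héllò hôw are ü?")
# "hello how are u?"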

The pre-tokenizer is just splitting on whitespace and punctuation:

from tokenizers.pre_tokenizers import Whitespace

bert_tokenizer.pre_tokenizer = Whitespace()
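We can sanity-check this step on its own as well (a small sketch using the pre_tokenize_str helper, which returns (token, offsets) pairs):

bert_tokenizer.pre_tokenizer.pre_tokenize_str("Hello! How are you?")
# [("Hello", (0, 5)), ("!", (5, 6)), ("How", (7, 10)), ("are", (11, 14)), ("you", (15, 18)), ("?", (18, 19))]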

And the post-processing uses a template to wrap single sentences and sentence pairs with the [CLS] and [SEP] tokens BERT expects (we will see it in action on a sentence pair once the tokenizer is trained):

from tokenizers.processors import TemplateProcessing

bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

We can now train this tokenizer on wikitext, as in the Quicktour:

from tokenizers.trainers import WordPieceTrainer

trainer = WordPieceTrainer(
    vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(trainer, files)

model_files = bert_tokenizer.model.save("data", "bert-wiki")
bert_tokenizer.model = WordPiece.from_file(*model_files, unk_token="[UNK]")

bert_tokenizer.save("data/bert-wiki.json")
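As a quick check that training produced the vocabulary we asked for, we can query its size (a small sketch using get_vocab_size):

bert_tokenizer.get_vocab_size()
# at most the 30522 we passed to the trainer, including the special tokens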

Now that the BERT tokenizer has been configured and trained, we can reload it from the saved file:

from tokenizers import Tokenizer

bert_tokenizer = Tokenizer.from_file("data/bert-wiki.json")
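With the trained tokenizer loaded, we can also verify that the post-processing template behaves as expected on a sentence pair (a quick sketch; the exact subword splits depend on the trained vocabulary, but the special tokens and type IDs come from the template above):

output = bert_tokenizer.encode("How are you?", "I am fine.")
print(output.tokens)
# e.g. ["[CLS]", "how", "are", "you", "?", "[SEP]", "i", "am", "fine", ".", "[SEP]"]
print(output.type_ids)
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]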

Decoding

On top of encoding the input texts, a Tokenizer also has an API for decoding, that is, converting IDs generated by your model back into text. This is done by the methods decode() (for one predicted text) and decode_batch() (for a batch of predictions).
The decoder will first convert the IDs back to tokens (using the tokenizer's vocabulary) and remove all special tokens, then join those tokens with spaces.
If you used a model that adds special characters to represent subtokens of a given "word" (like the "##" in WordPiece), you will need to customize the decoder to treat them properly. Taking our bert_tokenizer from above, for instance, the default decoding gives:

output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(output.tokens)
# ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]

bert_tokenizer.decode(output.ids)
# "welcome to the tok ##eni ##zer ##s library ."

But by changing it to a proper decoder, we get:

from tokenizers import decoders

bert_tokenizer.decoder = decoders.WordPiece()
bert_tokenizer.decode(output.ids)
# "welcome to the tokenizers library."
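The same works for several predictions at once with decode_batch() (a short sketch; encode_batch returns one Encoding per input, and we pass the lists of IDs back to the decoder):

outputs = bert_tokenizer.encode_batch(
    ["Welcome to the 🤗 Tokenizers library.", "Decoding works in batches too."]
)
bert_tokenizer.decode_batch([o.ids for o in outputs])
# ["welcome to the tokenizers library.", "decoding works in batches too."]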

Resources

Documentation: https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#all-together-a-bert-tokenizer-from-scratch
Colab: https://colab.research.google.com/github/tenexcoder/huggingface-tutorials/blob/main/BERT_tokenizer_from_scratch.ipynb
Gist: https://gist.github.com/tenexcoder/85b38e17a5557f0bb7c44bda4a08271d

Credit

All credit goes to the Hugging Face Tokenizers documentation (see Resources above for more details).
I simply packaged the example in a digestible and shareable form.
