Journey Towards an English-Bodo Text-to-Text Transfer Transformer (T5)
The Text-to-Text Transfer Transformer (T5) is a language model based on the encoder-decoder transformer architecture, which is also commonly used in neural machine translation (NMT). Unlike a dedicated NMT system, however, T5 is first pre-trained on a monolingual corpus and is then adapted to specific NLP applications through downstream fine-tuning. For machine translation, the pre-training corpus should include text in both the source and target languages.
T5 is a general-purpose language model that can be fine-tuned for a variety of NLP tasks, such as translation, summarization, and question answering. To fine-tune T5 for a particular task, you simply provide it with a dataset of input-output examples; for many tasks (summarization, question answering, headline generation) the input and output are in the same language.
To fine-tune T5 for machine translation, you would instead provide it with a dataset of sentence pairs, one sentence in the source language and one in the target language. T5 then learns to translate between the two languages.
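Since T5 casts every task as text-to-text, translation examples are usually framed with a natural-language task prefix on the input, such as the "translate English to Bodo:" prefix used later in this post. A minimal sketch of what such training examples might look like (the placeholder sentences are purely illustrative, not taken from the actual dataset):
# Illustrative training examples in T5's text-to-text format.
# The sentences are placeholders, not entries from the real parallel corpus.
examples = [
    {"input": "translate English to Bodo: <English sentence>",
     "target": "<its Bodo translation>"},
    {"input": "translate Bodo to English: <Bodo sentence>",
     "target": "<its English translation>"},
]
for ex in examples:
    print(ex["input"], "->", ex["target"])
The same pre-trained checkpoint can therefore be pointed at different tasks simply by changing the prefix and the target text.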
T5 has been shown to achieve state-of-the-art results on a variety of NLP tasks, and is a powerful tool for developing new NLP applications.
In this section, I will discuss the process of developing a language model based on T5 and fine-tuning it for machine translation between English and Bodo. To do this, we will need monolingual corpora for both languages and an English-Bodo parallel corpus.
The monolingual corpus will be used to train the T5 model to learn the statistical relationships between words and phrases in each language. The parallel corpus will be used to fine-tune the T5 model for machine translation.
Once the T5 model has been trained and fine-tuned, we can use it to translate between English and Bodo. To do this, we simply need to provide the model with a sentence in one language, and it will output the corresponding sentence in the other language.
Here is a step-by-step guide to the process:
- Collect a monolingual corpus for English and a monolingual corpus for Bodo.
- Train a T5 model on the monolingual corpora.
- Collect a parallel corpus of English and Bodo sentences.
- Fine-tune the T5 model on the parallel corpus.
- Use the fine-tuned T5 model to translate between English and Bodo.
It is important to note that the quality of the machine translation model will depend on the quality and size of the training and fine-tuning corpora. Larger and higher-quality corpora will result in better translation performance.
Dataset
Datasets are an essential requirement for any NLP task. Monolingual datasets are used for pre-training, while parallel datasets are used for fine-tuning.
Pre-training is the process of training a language model on a large corpus of text in a single language. This allows the model to learn the statistical relationships between words and phrases in that language.
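Concretely, T5-style pre-training uses a span-corruption objective: random spans of the input are replaced with sentinel tokens and the model learns to reconstruct them. A conceptual sketch (the sentence is just an illustration; the run_t5_mlm_flax.py script used later builds such examples automatically from the corpus):
# Conceptual illustration of T5's span-corruption pre-training objective.
original = "the quick brown fox jumps over the lazy dog"

# Spans are dropped from the input and replaced with sentinel tokens ...
corrupted_input = "the quick <extra_id_0> jumps over the <extra_id_1> dog"

# ... and the target asks the model to reproduce the dropped spans in order.
target = "<extra_id_0> brown fox <extra_id_1> lazy <extra_id_2>"

print(corrupted_input, "=>", target)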
Fine-tuning is the process of adapting a pre-trained language model for a specific NLP task, such as machine translation or text summarization. To fine-tune a model, you need to provide it with a dataset of labelled input-output examples for that task.
For example, to fine-tune a model for machine translation, you would need to provide it with a dataset of pairs of sentences, one in the source language and one in the target language. The model would then learn to translate between the two languages.
Monolingual datasets are used for pre-training because they are much more common and easier to collect than parallel datasets. However, parallel datasets are essential for fine-tuning models for machine translation and other tasks that require the model to understand the relationship between two languages.
Bodo Dataset:
AI4Bharat has published a monolingual dataset, IndicCorp, and a parallel dataset released with IndicTrans2, the Bharat Parallel Corpus Collection (BPCC); together they cover all 22 scheduled Indian languages, including Bodo. We extracted the monolingual Bodo portion of IndicCorp and mined several articles from The Sentinel Bodo and Bodosa News, resulting in a combined monolingual Bodo dataset of about 100 MB. This dataset can be used to pre-train and fine-tune language models for a variety of NLP tasks, such as machine translation, text summarization, and question answering.
This dataset will be helpful for researchers and developers who are working on NLP tasks for the Bodo language.
English Dataset:
We collected the English monolingual and parallel datasets from AI4Bharat. The English monolingual dataset is about 22 GB in size.
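Before pre-training, the English and Bodo monolingual text has to be merged into the single training and validation files referenced later in this post (train_eng-brx_mono.txt and valid_eng-brx_mono.txt). A minimal sketch, assuming one sentence per line and hypothetical input file names (only the output file names come from the commands below):
import random

sources = ["english_mono.txt", "bodo_mono.txt"]  # assumed input file names

lines = []
for path in sources:
    with open(path, encoding="utf-8") as f:
        lines.extend(line.strip() for line in f if line.strip())

random.seed(42)
random.shuffle(lines)

split = int(0.99 * len(lines))  # hold out ~1% for validation
with open("train_eng-brx_mono.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[:split]))
with open("valid_eng-brx_mono.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[split:]))
For a corpus of roughly 22 GB you would process the files in a streaming fashion rather than loading everything into memory; the sketch is only meant to show the intended layout.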
Parallel Dataset
We use the English-Bodo parallel dataset from AI4Bharat BPCC to fine-tune our language model. The dataset contains about 118,000 parallel sentences. To fine-tune our model, we provide it with the parallel dataset and allow it to learn the relationship between the two languages. Once the model is fine-tuned, we can use it to translate sentences between English and Bodo.
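Assuming the parallel corpus is distributed as two aligned plain-text files with one sentence per line (the file names below are hypothetical), pairing the sentences might look like this:
# Hypothetical aligned files: line i of train.en corresponds to line i of train.brx.
with open("train.en", encoding="utf-8") as f_en, open("train.brx", encoding="utf-8") as f_brx:
    pairs = [(en.strip(), brx.strip()) for en, brx in zip(f_en, f_brx)
             if en.strip() and brx.strip()]

print(len(pairs), "parallel sentences")  # should be roughly 118,000
print(pairs[0])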
Experimental Setup
We conducted our experiments on a DGX system with NVIDIA V100 32 GB GPUs running Ubuntu 22.04 with CUDA 12.2. We used PyTorch 2.2 and Transformers 4.34 (development version, installed from source as shown below).
First, clone the Hugging Face Transformers repository and install it from source:
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e ./
cd ..
Tokenization
Tokenization is the first step in every natural language processing task. We used a customized tokenization script to tokenize the combined English and Bodo monolingual corpus before training our model.
Here is a summary of our tokenization process:
- We combined the English and Bodo monolingual corpora into a single corpus.
- We trained a customized tokenization script on the combined corpus.
- We used the trained tokenization script to tokenize the training and evaluation datasets for our model.
Our customized tokenization script matters because it tokenizes the English and Bodo text in a consistent way, which helps the model learn the statistical relationships between words and phrases in both languages.
Custom Tokenizer Script t5_tokenizer_model.py
#!/usr/bin/env python3
import json
from typing import Iterator, List, Union
from tokenizers import AddedToken, Regex, Tokenizer, decoders, normalizers, pre_tokenizers, trainers
from tokenizers.implementations.base_tokenizer import BaseTokenizer
from tokenizers.models import Unigram
from tokenizers.processors import TemplateProcessing
class SentencePieceUnigramTokenizer(BaseTokenizer):
"""
This class is a copy of `DeDLOC's tokenizer implementation <https://github.com/yandex-research/DeDLOC/blob/main/sahajbert/tokenizer/tokenizer_model.py>`__ .
Custom SentencePiece Unigram Tokenizer with NMT, NFKC, spaces and lower-casing characters normalization
Represents the Unigram algorithm, with the pretokenization used by SentencePiece
"""
def __init__(
self,
replacement: str = "▁",
add_prefix_space: bool = True,
unk_token: Union[str, AddedToken] = "<unk>",
eos_token: Union[str, AddedToken] = "</s>",
pad_token: Union[str, AddedToken] = "<pad>",
):
self.special_tokens = {
"pad": {"id": 0, "token": pad_token},
"eos": {"id": 1, "token": eos_token},
"unk": {"id": 2, "token": unk_token},
}
self.special_tokens_list = [None] * len(self.special_tokens)
for token_dict in self.special_tokens.values():
self.special_tokens_list[token_dict["id"]] = token_dict["token"]
tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.Sequence(
[
normalizers.Nmt(),
normalizers.NFKC(),
normalizers.Replace(Regex(" {2,}"), " "),
normalizers.Lowercase(),
]
)
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
[
pre_tokenizers.Metaspace(replacement=replacement, add_prefix_space=add_prefix_space),
pre_tokenizers.Digits(individual_digits=True),
pre_tokenizers.Punctuation(),
]
)
tokenizer.decoder = decoders.Metaspace(replacement=replacement, add_prefix_space=add_prefix_space)
tokenizer.post_processor = TemplateProcessing(
single=f"$A {self.special_tokens['eos']['token']}",
special_tokens=[(self.special_tokens["eos"]["token"], self.special_tokens["eos"]["id"])],
)
parameters = {
"model": "SentencePieceUnigram",
"replacement": replacement,
"add_prefix_space": add_prefix_space,
}
super().__init__(tokenizer, parameters)
def train(
self,
files: Union[str, List[str]],
vocab_size: int = 8000,
show_progress: bool = True,
):
"""Train the model using the given files"""
trainer = trainers.UnigramTrainer(
vocab_size=vocab_size,
special_tokens=self.special_tokens_list,
show_progress=show_progress,
)
if isinstance(files, str):
files = [files]
self._tokenizer.train(files, trainer=trainer)
self.add_unk_id()
def train_from_iterator(
self,
iterator: Union[Iterator[str], Iterator[Iterator[str]]],
vocab_size: int = 8000,
show_progress: bool = True,
):
"""Train the model using the given iterator"""
trainer = trainers.UnigramTrainer(
vocab_size=vocab_size,
special_tokens=self.special_tokens_list,
show_progress=show_progress,
)
self._tokenizer.train_from_iterator(iterator, trainer=trainer)
self.add_unk_id()
def add_unk_id(self):
tokenizer_json = json.loads(self._tokenizer.to_str())
tokenizer_json["model"]["unk_id"] = self.special_tokens["unk"]["id"]
self._tokenizer = Tokenizer.from_str(json.dumps(tokenizer_json))
Load the dataset
from datasets import load_dataset
dataset = load_dataset('text', data_files={'train': ['~/data/eng-brx-mono/train_eng-brx_mono.txt'],
                                           'test': '~/data/eng-brx-mono/valid_eng-brx_mono.txt'})
import datasets
import sys
from t5_tokenizer_model import SentencePieceUnigramTokenizer
vocab_size = 52_000
input_sentence_size = None
# Initialize a dataset
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")
# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        # Use the number of examples in the training split
        # (len(dataset) would only count the number of splits)
        input_sentence_size = len(dataset['train'])
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset['train']["text"][i: i + batch_length]
# Train tokenizer
tokenizer.train_from_iterator(
iterator=batch_iterator(input_sentence_size=input_sentence_size),
vocab_size=vocab_size,
show_progress=True,
)
# Save Tokenizer
tokenizer.save("./t5-eng-brx-base/tokenizer.json")
Load the configuration from T5 v1.1 base and save it to the project directory:
from transformers import T5Config
config = T5Config.from_pretrained("google/t5-v1_1-base", vocab_size=52_000)
config.save_pretrained("./t5-eng-brx-base")
Once the tokenizer and configuration are available, you can pre-train your T5 model with the Hugging Face example script:
export model="t5-eng-brx-base"
export model=${model:-"t5-eng-brx-base"}
CUDA_VISIBLE_DEVICES=0,1,2,3 python transformers/examples/flax/language-modeling/run_t5_mlm_flax.py \
--output_dir="./${model}" \
--model_type="t5" \
--do_train \
--do_eval \
--config_name="./${model}" \
--tokenizer_name="./${model}" \
--train_file="~/data/eng-brx-mono/train_eng-brx_mono.txt" \
--validation_file="~/data/eng-brx-mono/valid_eng-brx_mono.txt" \
--max_seq_length="512" \
--weight_decay="0.01" \
--per_device_train_batch_size="2" \
--per_device_eval_batch_size="2" \
--learning_rate="3e-4" \
--warmup_steps="1000" \
--overwrite_output_dir \
--num_train_epochs="18" \
--adam_beta1="0.9" \
--adam_beta2="0.98" \
--logging_steps="500" \
--save_steps="2500" \
--eval_steps="2500"
The training process will generate a file called flax_model.msgpack. This file contains the updated model weights and is refreshed every 2,500 steps (the save_steps value above), so you can use the checkpoint even if training does not run to completion.
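If you also want a PyTorch copy of the checkpoint alongside the Flax weights, you can load it with from_flax=True and save it back into the same directory; a minimal sketch using the paths from this post:
from transformers import T5ForConditionalGeneration

# Load the Flax checkpoint (flax_model.msgpack) into a PyTorch model ...
model = T5ForConditionalGeneration.from_pretrained("./t5-eng-brx-base", from_flax=True)

# ... and write out PyTorch weights next to it.
model.save_pretrained("./t5-eng-brx-base")
After this one-off conversion, later loads no longer need the from_flax=True flag.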
This concludes the pre-training stage.
We can now load the model and evaluate its performance on a translation task. However, since we have not yet fine-tuned the model for machine translation, it is likely to generate a high loss. This is because the model has not yet learned the relationship between the English and Bodo languages.
To improve the model's performance, we need to fine-tune it on a parallel dataset of English and Bodo sentences. This will allow the model to learn the statistical relationships between words and phrases in both languages.
Once the model has been fine-tuned, we can re-evaluate its performance on the translation task. We expect that the model will generate a much lower loss after fine-tuning.
# Load model directly
from transformers import AutoTokenizer, T5ForConditionalGeneration
model_path="./t5-eng-brx-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path, from_flax=True)
input_ids = tokenizer("translate English to Bodo: is an important cause of undernutrition. .", return_tensors="pt").input_ids
labels = tokenizer("खिलुनाया खम निउथ्रिसननि मोनसे गोनां जाहोन।", return_tensors="pt").input_ids
loss = model(input_ids=input_ids, labels=labels).loss
loss.item()
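To see what the pre-trained (but not yet fine-tuned) model actually produces, you can also run generation; expect the output to be poor at this stage, since the model has only been trained on the denoising objective:
# Generate a translation attempt with the pre-trained model
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))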
Fine-tuning for English-Bodo translation will be covered in a follow-up post. :)