
How to create and train a Transformer model from scratch using Hugging Face Transformers

Image by the editor | Midjourney

The Hugging Face Transformers library provides tools to easily load and use pre-trained language models (LMs) based on the Transformer architecture. But did you know that you can also use this library to implement and train your own Transformer model from scratch? This tutorial demonstrates how with a step-by-step example of sentiment classification.

Important NOTE: Training a Transformer model from scratch is computationally intensive, with a training loop typically taking hours at the very least. To run the code in this tutorial, it is highly recommended to have access to high-performance computing resources, whether on-premises or through a cloud provider.

Step-by-step process

Initial setup and loading of the dataset

Depending on the type of Python development environment you are working in, you may need to install Hugging Face's transformers and datasets libraries, as well as the accelerate library for training your Transformer model in a distributed computing environment.

!pip install transformers datasets
!pip install accelerate -U

After the required libraries are installed, we load the emotions dataset for sentiment classification of Twitter messages from the Hugging Face Hub:

from datasets import load_dataset
dataset = load_dataset('jeffnyman/emotions')
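
If you want to confirm what was downloaded, a quick inspection like the following sketch shows the available splits and the structure of a single record (the exact output depends on the dataset version):

# Optional: inspect the loaded splits and one raw example
print(dataset)              # shows the train/validation/test splits and their sizes
print(dataset['train'][0])  # e.g. {'text': '...', 'label': 0}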

Using the data to train a transformer-based LM requires tokenizing the text. The following code initializes a BERT tokenizer (BERT is a family of transformer models suitable for text classification tasks), defines a function to tokenize text data with padding and truncation, and applies it to the dataset in batches.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
  return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
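
As a quick sanity check (a sketch; the column names follow the BERT tokenizer's defaults), you can verify that the tokenized dataset now contains model-ready columns alongside the original fields:

# The tokenizer adds model-ready columns next to the original text and label
print(tokenized_datasets['train'].column_names)
# e.g. ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
print(tokenized_datasets['train'][0]['input_ids'][:10])  # first few token ids of one example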

Before we proceed with initializing the transformer model, we verify the unique labels in the dataset. Checking that every label belongs to a known, consistent set helps avoid GPU-related errors during training caused by out-of-range label indices. We will use this label set later.

unique_labels = set(tokenized_datasets['train']['label'])
print(f"Unique labels in the training set: {unique_labels}")

def check_labels(dataset):
  for label in dataset['train']['label']:
    if label not in unique_labels:
      print(f"Found invalid label: {label}")

check_labels(tokenized_datasets)

Next, we define a model configuration and then instantiate the transformer model with it. Here, we specify hyperparameters of the transformer architecture, such as the embedding size and number of attention heads, along with the previously computed number of unique labels, which is needed to build the final output layer for sentiment classification.

from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
  vocab_size=tokenizer.vocab_size,
  hidden_size=512,
  num_hidden_layers=6,
  num_attention_heads=8,
  intermediate_size=2048,
  max_position_embeddings=512,
  num_labels=len(unique_labels)
)

model = BertForSequenceClassification(config)
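
Since this configuration is smaller than the standard bert-base architecture, it can be useful to get a feel for the resulting model size. The following sketch simply counts the trainable parameters of the model we just instantiated:

# Count trainable parameters to gauge the model size defined by the config
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")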

We are almost ready to train our Transformer model. There are only two instances left to create: TrainingArguments, with specifications about the training loop such as the number of epochs, and Trainer, which brings together the model instance, the training arguments, and the data used for training and validation.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
  output_dir="./results",
  evaluation_strategy="epoch",
  learning_rate=2e-5,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  num_train_epochs=3,
  weight_decay=0.01,
)

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=tokenized_datasets["train"],
  eval_dataset=tokenized_datasets["test"],
)

It's time to train the model by calling the trainer's train() method. Sit back and relax, and remember that this step will take a long time to complete:
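
trainer.train()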

After training, your Transformer model should be ready to take input examples and predict their sentiment.
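
As an illustration (a minimal sketch, reusing the model and tokenizer from the previous steps and a hypothetical example sentence), inference amounts to tokenizing the text and taking the argmax over the model's output logits:

import torch

# Hypothetical example sentence; any short text works here
text = "I can't wait to see my friends this weekend!"

# Tokenize and move the inputs to the same device as the model
inputs = tokenizer(text, padding="max_length", truncation=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

model.eval()
with torch.no_grad():
  logits = model(**inputs).logits

predicted_label = logits.argmax(dim=-1).item()
print(f"Predicted label id: {predicted_label}")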

Troubleshooting

If you run into persistent problems when setting up or running the training loop, you may need to check the configuration of the GPU/CPU resources being used. For example, if you are using a CUDA GPU, you can get clearer error messages from the training loop by adding these instructions at the beginning of your code:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

These lines force CUDA operations to run synchronously rather than asynchronously, so errors are reported at the point where they actually occur, providing more immediate and accurate error messages for debugging.
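
If you are not sure whether a GPU is visible to PyTorch at all, a quick check like the following sketch can help before launching training:

import torch

# Report whether a CUDA device is visible and which one would be used
if torch.cuda.is_available():
  print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
  print("No CUDA GPU detected; training will fall back to CPU (much slower).")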

On the other hand, if you try this code in a Google Colab instance, there is a chance that you will see this error message during execution, even if you have previously installed the Accelerate library:

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`: Please run `pip install transformers[torch]` or `pip install accelerate -U`

To resolve this issue, try restarting your session from the Runtime menu: The Accelerate library typically requires a reset of the execution environment after installation.

Summary and conclusion

This tutorial showed the main steps to build your own transformer-based LM from scratch using Hugging Face libraries. The main steps and elements can be summarized as follows:

  • Loading the dataset and tokenizing the text data.
  • Initializing your model with a model configuration instance suited to the model type (language task) it is intended for, for example: BertConfig.
  • Setting up TrainingArguments and Trainer instances and running the training loop.

As a next step in your learning, we recommend that you explore how to make predictions and inferences with your newly trained model.

Ivan Palomares Carrascosa is a leading expert, author, speaker and consultant in the fields of AI, machine learning, deep learning and LLMs. He trains and guides others in using AI in the real world.
