Building a Seq2Seq Transformer Model for Language Translation: A Comprehensive Guide
Language translation has always been a fascinating task in the field of Natural Language Processing (NLP). With the advent of deep learning, models like the Transformer have revolutionized translation systems. In this blog, I’ll walk you through my recent project, where I implemented a Seq2Seq Transformer model for translating English to German using PyTorch and TorchText. This guide will cover everything from dataset preprocessing to model evaluation, with detailed code snippets along the way.
Introduction: Seq2Seq Transformer
What is a Seq2Seq Model?
A Sequence-to-Sequence (Seq2Seq) model maps an input sequence to an output sequence, often used for tasks like translation, summarization, and question answering. Unlike traditional recurrent architectures, the Transformer model employs self-attention mechanisms, making it highly efficient for parallel computation and long-range dependencies.
Why Use a Transformer for Translation?
The Transformer architecture, introduced in the famous paper “Attention is All You Need”, overcomes the limitations of Recurrent Neural Networks (RNNs) by leveraging:
- Self-attention mechanisms to capture relationships between words in a sequence.
- Parallelization for faster training compared to sequential models like RNNs and LSTMs.
Dataset: Multi30k
The project uses the Multi30k dataset, a popular benchmark for image captioning and machine translation tasks. It contains around 29,000 English-German sentence pairs for training, with additional validation and test sets.
Due to broken URLs in TorchText’s default dataset loader, I manually specified the dataset URLs to ensure successful downloading.
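For reference, here is a minimal sketch of that workaround, assuming a torchtext release that exposes the multi30k.URL dictionary; the mirror addresses below are one commonly used copy, so substitute your own if they become unavailable:

from torchtext.datasets import multi30k

# Point the loader at a working mirror of the Multi30k tarballs
# (example mirror; replace with your own copies if needed)
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"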
Step 1: Setting Up the Environment
Before diving into the code, make sure you have the following libraries installed:
pip install torch torchtext spacy
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
Step 2: Tokenization and Vocabulary Creation
The first step in processing text data is tokenization. I used SpaCy tokenizers for both English and German:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
SRC_LANGUAGE = 'en'
TGT_LANGUAGE = 'de'
# Tokenizers for English and German
token_transform = {
    SRC_LANGUAGE: get_tokenizer('spacy', language='en_core_web_sm'),
    TGT_LANGUAGE: get_tokenizer('spacy', language='de_core_news_sm')
}
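The vocabulary step below iterates over the training split, so the data needs to be loaded first. One minimal way to do this with torchtext's built-in Multi30k loader (materialized as a list so it can be iterated more than once) is:

from torchtext.datasets import Multi30k

# Load the training split as (English, German) sentence pairs.
# Requesting language_pair=(SRC_LANGUAGE, TGT_LANGUAGE) makes index 0 the source
# sentence and index 1 the target sentence in each sample.
train_data = list(Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE)))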
Next, I built the vocabulary for both languages:
# Map each language to its position in the (source, target) sentence pairs
language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

def yield_tokens(data_iter, language):
    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Building vocabularies
vocab_transform = {
    lang: build_vocab_from_iterator(yield_tokens(train_data, lang),
                                    specials=["<unk>", "<pad>", "<bos>", "<eos>"],
                                    min_freq=2)
    for lang in [SRC_LANGUAGE, TGT_LANGUAGE]
}
Special tokens such as <unk>, <pad>, <bos>, and <eos> are added to handle unknown words, padding, and sequence boundaries.
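These special tokens also need numeric indices, and they have to be applied when the sentence pairs are batched. Here is a minimal sketch of that step, assuming the batch-first convention used throughout the rest of the code (the index values simply follow the order of the specials list above, and the batch size is an illustrative choice):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Indices follow the order of the specials list passed to build_vocab_from_iterator
unk_idx, pad_idx, bos_idx, eos_idx = 0, 1, 2, 3
for lang in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[lang].set_default_index(unk_idx)  # map out-of-vocabulary words to <unk>

def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sentence, tgt_sentence in batch:
        src_ids = [bos_idx] + vocab_transform[SRC_LANGUAGE](token_transform[SRC_LANGUAGE](src_sentence)) + [eos_idx]
        tgt_ids = [bos_idx] + vocab_transform[TGT_LANGUAGE](token_transform[TGT_LANGUAGE](tgt_sentence)) + [eos_idx]
        src_batch.append(torch.tensor(src_ids))
        tgt_batch.append(torch.tensor(tgt_ids))
    # Pad every sequence to the longest one in the batch; tensors are (batch, seq_len)
    src_batch = pad_sequence(src_batch, batch_first=True, padding_value=pad_idx)
    tgt_batch = pad_sequence(tgt_batch, batch_first=True, padding_value=pad_idx)
    return src_batch, tgt_batch

train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate_fn)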
Step 3: Positional Encoding
Since Transformers do not have any inherent notion of word order, positional encoding is crucial to provide this information:
import math
import torch
from torch import nn
class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_len=5000):
        super(PositionalEncoding, self).__init__()
        encoding = torch.zeros(max_len, embed_size)
        positions = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2) * -(math.log(10000.0) / embed_size))
        encoding[:, 0::2] = torch.sin(positions * div_term)
        encoding[:, 1::2] = torch.cos(positions * div_term)
        # Register as a buffer so it moves with the model (e.g. .to(device)) but is never trained
        self.register_buffer('encoding', encoding.unsqueeze(0))  # (1, max_len, embed_size)

    def forward(self, x):
        # x: (batch, seq_len, embed_size)
        return x + self.encoding[:, :x.size(1), :]
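As a quick sanity check of the expected shapes (the module assumes batch-first inputs of shape (batch, seq_len, embed_size)):

pe = PositionalEncoding(embed_size=512)
dummy = torch.zeros(2, 10, 512)   # (batch, seq_len, embed_size)
print(pe(dummy).shape)            # torch.Size([2, 10, 512])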
Step 4: Seq2Seq Transformer Model
The core of the project is the Transformer-based Seq2Seq model, which consists of an encoder and a decoder:
class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers, num_decoder_layers, embed_size, num_heads,
                 src_vocab_size, tgt_vocab_size, feedforward_dim, dropout=0.1):
        super(Seq2SeqTransformer, self).__init__()
        # batch_first=True so all tensors are (batch, seq_len, ...), matching the positional encoding above
        self.transformer = nn.Transformer(d_model=embed_size, nhead=num_heads,
                                          num_encoder_layers=num_encoder_layers,
                                          num_decoder_layers=num_decoder_layers,
                                          dim_feedforward=feedforward_dim,
                                          dropout=dropout, batch_first=True)
        self.src_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size)
        self.generator = nn.Linear(embed_size, tgt_vocab_size)

    def forward(self, src, tgt, src_mask, tgt_mask):
        # src, tgt: (batch, seq_len) token indices
        src = self.positional_encoding(self.src_embedding(src))
        tgt = self.positional_encoding(self.tgt_embedding(tgt))
        output = self.transformer(src, tgt, src_mask, tgt_mask)
        return self.generator(output)
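For illustration, the model could be instantiated along these lines; the layer counts, head count, and dimensions below are example values rather than the exact configuration used in the project:

model = Seq2SeqTransformer(num_encoder_layers=3, num_decoder_layers=3, embed_size=512,
                           num_heads=8,
                           src_vocab_size=len(vocab_transform[SRC_LANGUAGE]),
                           tgt_vocab_size=len(vocab_transform[TGT_LANGUAGE]),
                           feedforward_dim=512)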
Step 5: Training the Model
The training process involves batching data, applying masks, and optimizing the model using the Adam optimizer:
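The loop below relies on a create_masks helper and a num_epochs setting. Here is a minimal sketch of what the helper might look like, assuming batch-first (batch, seq_len) inputs: an all-zeros additive mask for the encoder and a causal mask for the decoder so each position can only attend to earlier positions. The epoch count is an illustrative value.

num_epochs = 10  # illustrative value; adjust as needed

def create_masks(src, tgt):
    # src, tgt: (batch, seq_len) token indices
    src_seq_len, tgt_seq_len = src.size(1), tgt.size(1)
    # Encoder: no masking, so an all-zeros additive mask
    src_mask = torch.zeros(src_seq_len, src_seq_len, device=src.device)
    # Decoder: -inf above the diagonal blocks attention to future positions
    tgt_mask = torch.triu(torch.full((tgt_seq_len, tgt_seq_len), float('-inf'), device=tgt.device),
                          diagonal=1)
    return src_mask, tgt_mask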
from torch.optim import Adam
optimizer = Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for src, tgt in train_dataloader:
        optimizer.zero_grad()
        # Teacher forcing: the decoder sees the target shifted right and predicts the next token
        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]
        src_mask, tgt_mask = create_masks(src, tgt_input)
        output = model(src, tgt_input, src_mask, tgt_mask)
        loss = criterion(output.reshape(-1, output.size(-1)), tgt_output.reshape(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_dataloader)}")
Step 6: Evaluating the Model
After training, I evaluated the model using BLEU scores, a common metric for machine translation quality.
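The exact evaluation code lives in the repository; as a rough sketch, translations can be produced with greedy decoding and scored with torchtext's bleu_score. The max_len cap below is an illustrative choice, and the helper reuses the constants and create_masks function defined earlier.

from torchtext.data.metrics import bleu_score

def greedy_decode(model, src, max_len=50):
    # src: (1, src_len) tensor of source token indices
    model.eval()
    with torch.no_grad():
        ys = torch.tensor([[bos_idx]], device=src.device)   # start with <bos>
        for _ in range(max_len - 1):
            src_mask, tgt_mask = create_masks(src, ys)
            logits = model(src, ys, src_mask, tgt_mask)      # (1, current_len, vocab_size)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ys = torch.cat([ys, next_token], dim=1)
            if next_token.item() == eos_idx:                 # stop once <eos> is produced
                break
    return ys.squeeze(0).tolist()

# bleu_score expects tokenized candidates and lists of tokenized references:
# bleu = bleu_score(candidate_token_lists, [[ref] for ref in reference_token_lists])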
Results and Analysis
The model achieved a reasonable BLEU score on the validation set. The translations, though not perfect, were understandable and reflected the effectiveness of the Transformer.
Future Work
- Implement Beam Search: To improve the quality of translations.
- Experiment with Tokenization: Use subword tokenization techniques like Byte Pair Encoding (BPE).
- Support Additional Languages: Extend the model to support multilingual translation.
Conclusion
This project was an excellent opportunity to explore the inner workings of the Transformer architecture for machine translation. The Seq2Seq Transformer model demonstrated its capability to handle complex language tasks effectively. I encourage readers to try implementing this model themselves, experiment with different datasets, and explore further improvements.
Check out the complete code on my GitHub repository and feel free to connect with me on LinkedIn for discussions or collaboration opportunities.