
Translating Languages using PyTorch Transformers

July 18, 2024
Explore the transformative power of PyTorch Transformers in this blog. Dive into the technical intricacies of machine translation and discover how self-attention mechanisms and parallel processing are setting new benchmarks in linguistic accuracy and efficiency.
Nushirvan Naseer
Software Engineer
12 min read


Transformers are one of the most important innovations in modern deep learning. To get started with text generation models, you can use PyTorch's built-in transformer module.

Transformers are essential for language translation tasks because they offer several advantages over previous models, such as recurrent neural networks (RNNs):

  • Long-range dependencies: Transformers can learn long-range dependencies between words in a sentence, which is essential for accurate translation. RNNs, on the other hand, can struggle with long-range dependencies, especially in long sentences.
  • Parallel processing: Transformers can process all of the words in a sentence in parallel, which makes them much faster than RNNs. This is especially important for large datasets, such as those used for training language translation models.
  • Attention mechanism: Transformers use a self-attention mechanism to learn the relationships between words in a sentence. This allows them to focus on the most important words for translation, and to ignore irrelevant words.
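
To make the attention point concrete, here is a small illustrative sketch (toy shapes and random values, not part of the translation model built later) that passes a five-token "sentence" through PyTorch's nn.MultiheadAttention in a single parallel step and inspects the resulting attention weights:

import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch_size, emb_size = 5, 1, 16              # a toy 5-token "sentence"
tokens = torch.randn(seq_len, batch_size, emb_size)   # stand-in for token embeddings

# Self-attention: queries, keys and values all come from the same sequence,
# and every token attends to every other token in one parallel pass
attention = nn.MultiheadAttention(embed_dim=emb_size, num_heads=4)
output, weights = attention(tokens, tokens, tokens)

print(output.shape)   # torch.Size([5, 1, 16]) - one context-aware vector per token
print(weights.shape)  # torch.Size([1, 5, 5])  - how much each token attends to every other token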

Transformer Architecture

As a result of these advantages, Transformers have achieved state-of-the-art results on many language translation benchmarks. They are now the go-to model architecture for most language translation systems.

Foundational Resources

Here are some foundational resources for readers new to the topic of Transformers:

  • Attention Is All You Need: https://arxiv.org/abs/1706.03762 - The paper that introduced the Transformer model.
  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension: https://arxiv.org/abs/1910.13461 - The paper that introduced BART, a Transformer-based model widely used for machine translation and text summarization.
  • Transformers: https://huggingface.co/docs/transformers/ - A popular library for training and using Transformer models for a variety of natural language processing tasks, including machine translation.

Real-World Applications

Here are some specific examples of how language translation models are being used in the real world:

  • Google Translate: Google Translate is a popular online language translation service that uses Transformer models to translate text into over 100 languages. Google Translate is used by millions of people around the world to communicate, learn, and access information in different languages.
  • Netflix: Netflix uses language translation models to translate subtitles and dubbing for its movies and TV shows into over 30 languages. This helps Netflix to reach a global audience and to make its content more accessible to people from all over the world.
  • The United Nations: The United Nations uses language translation models to translate its documents and communications into six official languages: Arabic, Chinese, English, French, Russian, and Spanish. This helps the United Nations to communicate effectively with its members and to promote its work on a global scale.

At its core, the nn.Transformer module relies entirely on an attention mechanism, implemented as nn.MultiheadAttention, to draw global dependencies between input and output elements. What sets nn.Transformer apart is its modularity: each component, such as nn.TransformerEncoder, can be adapted and combined on its own, allowing for a highly customizable approach to model construction, as the sketch below illustrates.
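
For instance, here is a minimal sketch (with arbitrary, illustrative hyperparameters) of assembling a standalone encoder from these interchangeable parts, an nn.TransformerEncoderLayer stacked by nn.TransformerEncoder, independently of the full nn.Transformer:

import torch
import torch.nn as nn

# One encoder block: multi-head self-attention plus a feed-forward sublayer
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

# Stack six identical blocks into a standalone encoder
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.randn(10, 32, 512)   # (sequence length, batch size, d_model)
memory = encoder(src)            # encoded representations, same shape as src
print(memory.shape)              # torch.Size([10, 32, 512])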

What This Tutorial Shows

This tutorial shows how to train a German-to-English translation model from scratch using PyTorch's Transformer module, with the torchtext library providing access to the Multi30k dataset.

Data Sourcing and Processing

The torchtext library has utilities for creating datasets that can be easily iterated through for the purpose of building a language translation model. In this example, we show how to use torchtext's built-in datasets, tokenize a raw text sentence, build a vocabulary, and numericalize tokens into tensors. We will use the Multi30k dataset from torchtext, which yields pairs of source and target raw sentences.

To access torchtext datasets, please install torchdata following instructions at https://github.com/pytorch/data.

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List

# We need to modify the URLs for the dataset since the links to the original dataset are broken
# Refer to https://github.com/pytorch/text/issues/1756#issuecomment-1163664163 for more info
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Place-holders
token_transform = {}
vocab_transform = {}
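
As a quick sanity check (assuming torchdata is installed and the mirror URLs above are reachable), you can peek at the first raw sentence pair the training iterator yields:

# Inspect one raw (German, English) pair from the training split
train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
src_sentence, tgt_sentence = next(iter(train_iter))
print(src_sentence)  # a raw German sentence
print(tgt_sentence)  # its English translation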

Install the dependencies (including spaCy and its de_core_news_sm and en_core_web_sm language models) and then create the source and target language tokenizers:

token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')

# Helper function to yield list of tokens
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Build the vocabulary from the tokenized training data
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Set UNK_IDX as the default index, returned when a token is not found in the vocabulary
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)
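
With the vocabularies built, each vocab_transform maps a list of tokens to integer indices, and any token not seen during training falls back to UNK_IDX; a small illustrative check (the exact index values depend on the training data):

# The spaCy tokenizer splits the raw string; the vocab maps tokens to indices
tokens = token_transform[TGT_LANGUAGE]("Two men are playing football")
print(tokens)                                 # e.g. ['Two', 'men', 'are', 'playing', 'football']
print(vocab_transform[TGT_LANGUAGE](tokens))  # a list of integer indices, one per token

# An out-of-vocabulary token maps to UNK_IDX (0) thanks to set_default_index
print(vocab_transform[TGT_LANGUAGE](['Quixotic']))  # [0], assuming this token never appears in Multi30k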

Seq2Seq Network using Transformer

In the following sections, we'll construct a Seq2Seq network built upon the Transformer architecture. This network comprises three essential components.

Firstly, we have the embedding layer, which plays a crucial role in converting input indices into their respective input embeddings. These embeddings are then enriched with positional encodings, serving as vital information about the positions of the input tokens within the sequence.
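
For reference, the sinusoidal positional encoding from the original "Attention Is All You Need" paper, which the PositionalEncoding helper below implements, is:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the token's position in the sequence and i indexes pairs of embedding dimensions.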

The second component is the core Transformer model itself, which is responsible for handling the sequence-to-sequence transformation.

Lastly, the output from the Transformer model undergoes processing through a linear layer. This layer computes unnormalized probabilities for each token present in the target language, a critical step in generating meaningful language outputs.

from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Helper Module that adds positional encoding to the token embedding to introduce a notion of word order
class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

# Helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

# Seq2Seq Network
class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers: int, num_decoder_layers: int, emb_size: int,
                 nhead: int, src_vocab_size: int, tgt_vocab_size: int,
                 dim_feedforward: int = 512, dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)

    def forward(self, src: Tensor, trg: Tensor, src_mask: Tensor, tgt_mask: Tensor,
                src_padding_mask: Tensor, tgt_padding_mask: Tensor, memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(self.tgt_tok_emb(tgt)), memory, tgt_mask)

During training, we need a subsequent-word mask that prevents the model from looking at future words when making predictions. We will also need masks to hide source and target padding tokens. Below, let's define a function that takes care of both:

def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len), device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
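
To see what the subsequent-word mask looks like, print a small one: positions a token may attend to are 0.0, while future positions are -inf (output shown without the device suffix):

print(generate_square_subsequent_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])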

Now, it's time to set up our model's parameters and create an instance of it. Additionally, we'll define our loss function as the cross-entropy loss and the optimizer that we'll use for training:

torch.manual_seed(0)

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
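
As an optional sanity check before training, you can count the trainable parameters and confirm which device the model landed on:

num_params = sum(p.numel() for p in transformer.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")
print(f"Training device: {DEVICE}")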

Collation

In the "Data Sourcing and Processing" section, we saw that our data iterator provides us with pairs of raw strings. Now, to prepare these string pairs for input into our previously defined Seq2Seq network, we must transform them into batched tensors. In the following section, we introduce our collate function, which accomplishes this task by converting batches of raw strings into tensors suitable for direct processing by our model:

from torch.nn.utils.rnn import pad_sequence

# Helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# Function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# src and tgt language text transforms to convert raw strings into tensors of indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln],  # Tokenization
                                               vocab_transform[ln],  # Numericalization
                                               tensor_transform)     # Add BOS/EOS and create tensor

# Function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
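
To illustrate what collate_fn produces (assuming the transforms above are in place), each batch is a pair of index tensors shaped (sequence length, batch size) and padded with PAD_IDX; a quick check on the validation split:

from torch.utils.data import DataLoader

sample_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
sample_loader = DataLoader(sample_iter, batch_size=4, collate_fn=collate_fn)

src_batch, tgt_batch = next(iter(sample_loader))
print(src_batch.shape)  # torch.Size([longest_src_in_batch, 4])
print(tgt_batch.shape)  # torch.Size([longest_tgt_in_batch, 4])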

Now we define a training and evaluation loop that will be called for each epoch:

from torch.utils.data import DataLoader

def train_epoch(model, optimizer):
    model.train()
    losses = 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in train_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_dataloader))

def evaluate(model):
    model.eval()
    losses = 0

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in val_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

    return losses / len(list(val_dataloader))

With all the prerequisites in place, it's time to train the model and translate a sample sentence:

from timeit import default_timer as timer

NUM_EPOCHS = 18

for epoch in range(1, NUM_EPOCHS + 1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "
           f"Epoch time = {(end_time - start_time):.3f}s"))

# Greedy search decoder
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len - 1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys

# Actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")

print(translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu."))

Output

A group of people is standing in front of an igloo.

Translation Output Example
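
If you want to reuse the trained model without repeating the 18 training epochs, a simple option is to save and reload its weights; a minimal sketch (the file name here is arbitrary):

# Save the trained weights (the file name is arbitrary)
torch.save(transformer.state_dict(), "seq2seq_de_en.pt")

# Later: rebuild the same architecture and load the weights back
model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, NHEAD,
                           SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
model.load_state_dict(torch.load("seq2seq_de_en.pt", map_location=DEVICE))
model = model.to(DEVICE)

print(translate(model, "Eine Gruppe von Menschen steht vor einem Iglu."))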

Conclusion

In the blog above, we successfully created a sequence-to-sequence language translation model using Transformers. This should give you a good idea of how language generation models work. With a basic understanding of how Transformers work and how to implement a solution in a popular framework, you should be able to apply this knowledge to many other tasks, such as question answering, text classification, and named entity recognition. For more information, you can always check out the tutorials in the PyTorch documentation.
