Torchtext 0.7 shows Field is being deprecated. What is the alternative?

It looks like the previous paradigm of declaring Fields, Examples and using BucketIterator is deprecated and will move to legacy in 0.8. However, I can't seem to find an example of the new paradigm for custom datasets (as in, not the ones included in torchtext.datasets) that doesn't use Field. Can anyone point me at an up-to-date example?

Reference for deprecation:

https://github.com/pytorch/text/releases



Solution 1:[1]

It took me a little while to find the solution myself. The new paradigm is like so for prebuilt datasets:

from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)

or like so for custom built datasets:

from torch.utils.data import DataLoader
def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels
dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)

I've copied the examples from the Source
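
If you also want numericalized, padded tensors out of the DataLoader, a minimal sketch (assuming each example is a (label, raw text string) pair, as with a custom dataset, and that a tokenizer and vocab have already been built, e.g. as in Solution 4 below) could look like this:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = vocab['<pad>']  # assumes the vocab was built with the default '<pad>' special

def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        # numericalize the raw string into a tensor of token ids
        texts.append(torch.tensor([vocab[token] for token in tokenizer(txt)], dtype=torch.long))
        labels.append(int(label))  # assumes labels are ints or numeric strings
    # pad every sequence in the batch to the length of the longest one
    texts = pad_sequence(texts, batch_first=True, padding_value=PAD_IDX)
    return texts, torch.tensor(labels, dtype=torch.long)

dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)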

Solution 2:[2]

Browsing through torchtext's GitHub repo I stumbled over the README in the legacy directory, which is not documented in the official docs. The README links a GitHub issue that explains the rationale behind the change as well as a migration guide.

If you just want to keep your existing code running with torchtext 0.9.0, where the deprecated classes have been moved to the legacy module, you have to adjust your imports:

# from torchtext.data import Field, TabularDataset
from torchtext.legacy.data import Field, TabularDataset

Alternatively, you can import the whole torchtext.legacy module as torchtext as suggested by the README:

import torchtext.legacy as torchtext
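
Either way, the old Field-based workflow itself is unchanged. As a rough sketch of what that workflow looks like under the legacy import (file name, column layout and parameters here are purely illustrative, assuming a TSV file with a text column followed by a label column):

from torchtext.legacy.data import Field, TabularDataset, BucketIterator

TEXT = Field(sequential=True, lower=True, batch_first=True)
LABEL = Field(sequential=False, use_vocab=False)

# fields are assigned to columns in the order they appear in the file
train_data = TabularDataset(path='train.tsv', format='tsv',
                            fields=[('text', TEXT), ('label', LABEL)])

TEXT.build_vocab(train_data, min_freq=2)

train_iter = BucketIterator(train_data, batch_size=32,
                            sort_key=lambda ex: len(ex.text))

for batch in train_iter:
    print(batch.text.shape, batch.label.shape)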

Solution 3:[3]

There is a post regarding this. Instead of the deprecated Field and BucketIterator classes, it uses TextClassificationDataset along with a collator and other preprocessing. It reads a txt file, builds a dataset, and then a model. Inside the post, there is a link to a complete working notebook. The post is at: https://mmg10.github.io/pytorch/2021/02/16/text_torch.html. But you need the dev (nightly) build of PyTorch for it to work.

From the link above:

After tokenization and building the vocabulary, you can build the dataset as follows:

import torch
# helper transforms and dataset class from torchtext.experimental,
# as used in the linked post (import paths as of torchtext 0.9)
from torchtext.experimental.datasets.text_classification import TextClassificationDataset
from torchtext.experimental.functional import sequential_transforms, totensor, vocab_func

def data_to_dataset(data, tokenizer, vocab):
    # data is an iterable of (text, label) pairs
    data = [(text, label) for (text, label) in data]

    # tokenize, numericalize with the vocab, then convert to a LongTensor
    text_transform = sequential_transforms(tokenizer.tokenize,
                                           vocab_func(vocab),
                                           totensor(dtype=torch.long))
    # map the string labels '0'/'1' to ints, then convert to a LongTensor
    label_transform = sequential_transforms(lambda x: 1 if x == '1' else (0 if x == '0' else x),
                                            totensor(dtype=torch.long))

    transforms = (text_transform, label_transform)
    dataset = TextClassificationDataset(data, vocab, transforms)
    return dataset

The collator is a small class whose collate method pads each batch (the class name and imports below are added for completeness; the post defines its own):

import torch
import torch.nn as nn

class Collator:  # illustrative name
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx

    def collate(self, batch):
        text, labels = zip(*batch)
        labels = torch.LongTensor(labels)
        # pad every sequence in the batch to the length of the longest one
        text = nn.utils.rnn.pad_sequence(text, padding_value=self.pad_idx, batch_first=True)
        return text, labels

Then, you can build the dataloader with the typical torch.utils.data.DataLoader using the collate_fn argument.
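
For example (using the illustrative Collator class above and assuming the vocab contains the default '<pad>' token):

from torch.utils.data import DataLoader

collator = Collator(pad_idx=vocab['<pad>'])
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collator.collate)

for text, labels in dataloader:
    print(text.shape, labels.shape)
    break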

Solution 4:[4]

Well, it seems like the pipeline could look like this:

    import torchtext as TT
    import torch
    from collections import Counter
    from torch.utils.data import DataLoader
    from torchtext.vocab import Vocab

    # read the data

    with open('text_data.txt','r') as f:
        data = f.readlines()
    with open('labels.txt', 'r') as f:
        labels = f.readlines()

    
    tokenizer = TT.data.utils.get_tokenizer('spacy', 'en') # can remove 'spacy' and use a simple built-in tokenizer
    train_iter = zip(labels, data)
    counter = Counter()
    
    for (label, line) in train_iter:
        counter.update(tokenizer(line))
        
    vocab = TT.vocab.Vocab(counter, min_freq=1)

    text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
    # this is data-specific - adapt for your data
    label_pipeline = lambda x: 1 if x == 'positive\n' else 0
    
    class TextData(torch.utils.data.Dataset):
        '''
        very basic dataset for processing text data
        '''
        def __init__(self, labels, text):
            super(TextData, self).__init__()
            self.labels = labels
            self.text = text
            
        def __getitem__(self, index):
            return self.labels[index], self.text[index]
        
        def __len__(self):
            return len(self.labels)
    
    
    def tokenize_batch(batch, max_len=200):
        '''
        collate function to use in the DataLoader
        takes a batch from the text dataset and produces a tensor batch,
        converting text and labels through the global text_pipeline and label_pipeline
        max_len is a fixed length: shorter texts are padded with ones (the pad index),
        longer texts are truncated so that only the last max_len tokens are kept
        '''
        labels_list, text_list = [], []
        for _label, _text in batch:
            labels_list.append(label_pipeline(_label))
            text_holder = torch.ones(max_len, dtype=torch.int32)  # fixed-size tensor of max_len pad tokens
            processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
            pos = min(max_len, len(processed_text))  # was hard-coded to 200, which breaks for other max_len values
            text_holder[-pos:] = processed_text[-pos:]
            text_list.append(text_holder.unsqueeze(dim=0))
        return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)
    
    train_dataset = TextData(labels, data)
    
    train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, collate_fn=tokenize_batch)
    
    lbl, txt = next(iter(train_loader))
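
As a quick, purely illustrative sanity check (the embedding size is arbitrary), the resulting batches can be consumed like this:

import torch.nn as nn

# 1 is the pad value used in tokenize_batch, which matches the default '<pad>' index of the old Vocab
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32, padding_idx=1)
lbl, txt = next(iter(train_loader))
vectors = emb(txt.long())      # (batch_size, max_len, 32)
pooled = vectors.mean(dim=1)   # simple mean pooling over tokens -> (batch_size, 32)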

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Steven
Solution 2 Tobias Uhmann
Solution 3
Solution 4 crispengari