How do I split a custom dataset into spatially disjoint training and test datasets in Python?
My question is close to this thread, but the difference is that I want my training and test datasets to be spatially disjoint, so that no two samples come from the same geographical region. You can define the region by county, state, or a random geographical grid you create for your own dataset, among other options. An example of my dataset is like THIS, which is an instance segmentation task for satellite imagery.
I know PyTorch has this capability for random splitting:

import torch

train_size = int(0.75 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(
    full_dataset, [train_size, test_size])
However, perhaps what I want is a spatially_random_split functionality. The picture below also illustrates the question; in my case each point is an image with associated labels.
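There is no built-in spatially_random_split in PyTorch, but the idea can be sketched as follows. This is a minimal, hypothetical helper (the function name, the (lon, lat, payload) sample format, and the degree-based grid are all assumptions for illustration): bucket samples into coarse grid cells by their coordinates, then assign whole cells to train or test so the two sets never share a cell.

```python
import random

def spatially_random_split(samples, cell_size=1.0, test_fraction=0.25, seed=0):
    """Assign whole grid cells to train or test so the two splits
    share no geographic cell (hypothetical helper, not a PyTorch API).

    samples: iterable of (lon, lat, payload) tuples.
    cell_size: grid cell size in coordinate units (e.g. degrees).
    """
    # Bucket every sample by the grid cell containing its coordinates.
    cells = {}
    for lon, lat, payload in samples:
        key = (int(lon // cell_size), int(lat // cell_size))
        cells.setdefault(key, []).append(payload)

    # Randomly assign whole cells (not individual samples) to the test set.
    keys = sorted(cells)
    rng = random.Random(seed)
    rng.shuffle(keys)
    n_test = max(1, int(test_fraction * len(keys)))
    test_keys = set(keys[:n_test])

    train = [p for k in keys if k not in test_keys for p in cells[k]]
    test = [p for k in test_keys for p in cells[k]]
    return train, test
```

If the payloads are dataset indices, the two lists can then be wrapped with torch.utils.data.Subset to obtain the actual train and test datasets.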
Solution 1:[1]
I am not completely sure what your dataset and labels look like, but from what I see, why not cut the image into predefined chunk sizes as shown here - https://stackoverflow.com/a/63815878/4471672 - then save each chunk in a different folder according to its location, and sample randomly from whichever set you need (or know to be "spatially disjoint").
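The chunking step from the linked answer can be sketched roughly like this (a minimal example, assuming the image is a NumPy array; the (row, col) keys stand in for the per-location folder names):

```python
import numpy as np

def split_into_chunks(image, chunk=256):
    """Cut an (H, W, C) image into non-overlapping chunk x chunk tiles.

    Returns a dict mapping a location-style key (row, col) to each tile,
    so tiles can later be saved into per-location folders and sampled
    as spatially disjoint groups. Edge remainders smaller than `chunk`
    are simply dropped.
    """
    h, w = image.shape[:2]
    tiles = {}
    for r in range(0, h - chunk + 1, chunk):
        for c in range(0, w - chunk + 1, chunk):
            tiles[(r // chunk, c // chunk)] = image[r:r + chunk, c:c + chunk]
    return tiles
```

Sampling entire keys (locations) into the train or test side then guarantees that no tile from the same location appears in both sets.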
Solution 2:[2]
I found the answer via the TorchGEO library. Thank you all.

from torch.utils.data import DataLoader
from torchgeo.datasets import stack_samples
from torchgeo.samplers import RandomGeoSampler

sampler = RandomGeoSampler(dataset, size=256, length=10000)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler,
                        collate_fn=stack_samples)
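To make the train and test samplers spatially disjoint, each sampler can be restricted to a non-overlapping region of interest; TorchGEO's samplers take an optional roi argument for this (check the TorchGEO docs for the exact BoundingBox type, which also carries a time dimension). The helper below is a hypothetical sketch that just halves a rectangular extent along the x axis:

```python
def split_extent(minx, maxx, miny, maxy, train_fraction=0.75):
    """Split a rectangular extent into two spatially disjoint ROIs
    along the x axis (hypothetical helper; the returned plain tuples
    would be converted to TorchGEO BoundingBox objects before use).
    """
    cut = minx + train_fraction * (maxx - minx)
    train_roi = (minx, cut, miny, maxy)   # left train_fraction of the extent
    test_roi = (cut, maxx, miny, maxy)    # remaining right strip
    return train_roi, test_roi
```

Each ROI would then be passed to its own sampler, e.g. RandomGeoSampler(dataset, size=256, length=10000, roi=train_roi), so train and test patches are drawn from disjoint geographic strips.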
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Yev Guyduy |
| Solution 2 | Sheykhmousa |


