Split data into train and test in Python
I have a list of authors in this form:
list_authors = ['Abbott, Edwin Abbott, 1838-1926',
'Gatlin, Dana, 1884-1940',
'Riley, W. (William), 1866-1961',
...]
I also have a list of books and their authors in this form:
list_books = [(15, ['Melville, Herman, 1819-1891'],['Barrie, J. M. (James Matthew), 1860-1937'],['Cather, Willa, 1873-1947']),(27, ['Hardy, Thomas, 1840-1928']),(32, ['Gilman, Charlotte Perkins, 1860-1935']),...]
This is a list of tuples, where each tuple holds the book id followed by the book's authors (one or more, taken from list_authors).
The goal is to split the books so that no author appears in both the train and the test set.
I tried to use sklearn to split the authors like this:
train_authors, test_authors = train_test_split(list_authors, test_size=0.20, random_state=42)
This does not solve the edge case where a book has multiple authors (say two), with one ending up in the train split and the other in the test split.
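One possible way to handle that edge case is to stop splitting authors individually and split groups of books instead: books that share an author, directly or through a chain of co-authors, form one group, and each group goes entirely to train or entirely to test. The following is only a minimal sketch using the standard library. The helper names (book_authors, group_books_by_shared_authors, split_books) and the books with ids 40 and 41 are hypothetical, added for illustration, and the sketch assumes every book has at least one author.

```python
import random
from collections import defaultdict


def book_authors(book):
    """Flatten every author list that follows the book id into one set."""
    _book_id, *author_lists = book
    return {author for authors in author_lists for author in authors}


def group_books_by_shared_authors(books):
    """Group books so that books linked by shared authors end up together."""
    parent = {}  # union-find forest over author names

    def find(author):
        parent.setdefault(author, author)
        while parent[author] != author:
            parent[author] = parent[parent[author]]  # path halving
            author = parent[author]
        return author

    # First pass: merge all authors of the same book into one set.
    for book in books:
        first, *rest = sorted(book_authors(book))  # assumes >= 1 author
        root = find(first)
        for other in rest:
            parent[find(other)] = root

    # Second pass: collect books under the final root of their authors.
    groups = defaultdict(list)
    for book in books:
        groups[find(min(book_authors(book)))].append(book)
    return list(groups.values())


def split_books(books, test_size=0.20, seed=42):
    """Assign whole groups to test until roughly test_size of books is reached."""
    groups = group_books_by_shared_authors(books)
    random.Random(seed).shuffle(groups)
    target = test_size * len(books)
    train, test = [], []
    for group in groups:
        if len(test) + len(group) <= target:
            test.extend(group)
        else:
            train.extend(group)
    return train, test


list_books = [
    (15, ['Melville, Herman, 1819-1891'],
         ['Barrie, J. M. (James Matthew), 1860-1937'],
         ['Cather, Willa, 1873-1947']),
    (27, ['Hardy, Thomas, 1840-1928']),
    (32, ['Gilman, Charlotte Perkins, 1860-1935']),
    # Hypothetical books, added only to create an author overlap:
    (40, ['Hardy, Thomas, 1840-1928', 'Cather, Willa, 1873-1947']),
    (41, ['Gatlin, Dana, 1884-1940']),
]

train_books, test_books = split_books(list_books)
```

Because groups are never broken up, the test fraction is only approximate, and a single very large group of interlinked authors can leave the test split smaller than requested. If staying with scikit-learn is preferred, the same group membership could probably be turned into one label per book and passed as the groups argument of GroupShuffleSplit, which keeps each group on one side of the split.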
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
