How to use RandomizedSearchCV on a nested list consisting of one list?

I have built a sentence boundary detection classifier. For the sequence labeling I used a conditional random field (CRF). For hyperparameter optimization I would like to use RandomizedSearchCV. My training data consists of 6 annotated texts, which I merge into a single token list. For the implementation I followed an example from the documentation. Here is my simplified code:

import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
import scipy.stats

# my token list has length n
X_train = [feature_dict_token_1, ... , feature_dict_token_n]
# 3 types of tags: B-SEN for beginning of sentence, E-SEN for end of sentence, O for others
y_train = [tag_token_1, ..., tag_token_n]

# define fixed parameters and parameters to search
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

labels = ['B-SEN', 'E-SEN', 'O']

# use F1-score for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit([X_train], [y_train])

I used rs.fit([X_train], [y_train]) instead of rs.fit(X_train, y_train) because the documentation of crf.fit says that it needs a list of lists:

fit(X, y, X_dev=None, y_dev=None)

Parameters:
- X (list of lists of dicts) – Feature dicts for several documents (in a python-crfsuite format).
- y (list of lists of strings) – Labels for several documents.
- X_dev ((optional) list of lists of dicts) – Feature dicts used for testing.
- y_dev ((optional) list of lists of strings) – Labels corresponding to X_dev.

But when I use a list of lists, I get this error:

ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=1

I understand that this happens because I pass [X_train] and [y_train], and cross-validation cannot split a list that contains only one list. But with the flat X_train and y_train, crf.fit does not work. How can I fix this?
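To make the shape problem concrete, here is a minimal sketch with toy data. The cross-validator counts the outer list, so [X_train] is a single sample. One possible workaround, which is an assumption and not something confirmed by the documentation quoted above, is to cut the flat token list into several shorter sequences so that there are at least as many sequences as CV folds. The helper name chunk_sequences and the chunk size are hypothetical. Passing the six annotated texts as six separate sequences instead of merging them would give the same list-of-lists shape.

```python
def chunk_sequences(features, tags, chunk_size):
    """Split flat, parallel token lists into fixed-size sequences.

    Hypothetical helper: returns (list of lists of feature dicts,
    list of lists of tags). The last chunk may be shorter.
    """
    if len(features) != len(tags):
        raise ValueError("features and tags must have the same length")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    starts = range(0, len(features), chunk_size)
    X_seqs = [features[i:i + chunk_size] for i in starts]
    y_seqs = [tags[i:i + chunk_size] for i in starts]
    return X_seqs, y_seqs


# Toy stand-in for the real token list (feature dicts are made up)
X_train = [{"word": w} for w in ["Hi", "there", ".", "Bye", ".", "Ok", "."]]
y_train = ["B-SEN", "O", "E-SEN", "B-SEN", "E-SEN", "B-SEN", "E-SEN"]

# Wrapping the flat list yields exactly one sample for the cross-validator
print(len([X_train]))  # 1

# Chunking yields several samples while keeping the list-of-lists shape
X_seqs, y_seqs = chunk_sequences(X_train, y_train, chunk_size=3)
print(len(X_seqs))  # 3

# The search would then be fitted on the sequences, e.g. rs.fit(X_seqs, y_seqs)
```

Note that fixed-size chunks can cut through the middle of a sentence, so the chunk size is a trade-off between having enough samples for the folds and keeping enough context per sequence.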



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
