word2vec + XGBoostRegressor - error Check failed: preds.Size() == info.labels_.Size() (1 vs. 70812) : labels are not correctly providedpreds.size=1
I am quite new to NLP.
I am building a regression model for predicting discrete values (like price).
When I use XGBRegressor with word2vec features, fitting the model throws the error below.
My input to word2vec is a list of words:
[text, font, graphics, screenshot, gain]
from xgboost import XGBRegressor
xgb_model = XGBRegressor(
    objective = 'reg:squarederror',
    colsample_bytree = 0.5,
    learning_rate = 0.05,
    max_depth = 6,
    min_child_weight = 1,
    n_estimators = 1000,
    subsample = 0.7)
%time xgb_model.fit(list(x_train), y_train, early_stopping_rounds=5, verbose=False)
y_pred_xgb = xgb_model.predict(x_test)
XGBoostError Traceback (most recent call last)
in ()
10 subsample = 0.7)
11
---> 12 get_ipython().magic('time xgb_model.fit(list(x_train), y_train, early_stopping_rounds=5, verbose=False)')
13
14 y_pred_xgb = xgb_model.predict(x_test)
8 frames
<decorator-gen-53> in time(self, line, cell, local_ns)
<timed eval> in <module>()
/usr/local/lib/python3.7/dist-packages/xgboost/core.py in _check_call(ret)
174 """
175 if ret != 0:
--> 176 raise XGBoostError(py_str(_LIB.XGBGetLastError()))
177
178
XGBoostError: [01:43:27] /workspace/src/objective/regression_obj.cu:65: Check failed: preds.Size() == info.labels_.Size() (1 vs. 70812) : labels are not correctly providedpreds.size=1, label.size=70812
Stack trace:
[bt] (0) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x24) [0x7f45763dfcb4]
[bt] (1) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(xgboost::obj::RegLossObj<xgboost::obj::LinearSquareLoss>::GetGradient(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*)+0x21e) [0x7f45765ea84e]
[bt] (2) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x345) [0x7f4576479505]
[bt] (3) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x35) [0x7f45763dcaa5]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f45d2a66dae]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f45d2a6671f]
[bt] (6) /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x28c) [0x7f45d2c7a5dc]
[bt] (7) /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x109e3) [0x7f45d2c799e3]
[bt] (8) /usr/bin/python3(_PyObject_FastCallKeywords+0x92) [0x5559ff072902]
Solution 1:[1]
The error points to a problem with the dimensions of x_train: xgboost thinks you've given it 1 training example in x_train and 70812 labels in y_train.
Check the shape of x_train and verify that it is a 2-dimensional array whose first dimension is the number of training examples and whose second dimension is the size of the embedding. The length of y_train should match the first dimension of x_train.
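One way to run that check is to convert both inputs to NumPy arrays and compare their shapes before calling fit. The sketch below is only an illustration: `check_shapes` is a hypothetical helper (not part of xgboost), and the data is a small random stand-in for your real vectors and labels.

```python
import numpy as np

def check_shapes(x, y):
    """Return x and y as arrays, raising ValueError if their shapes do not line up.

    Hypothetical helper for illustration; not part of xgboost.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    if x.ndim != 2:
        raise ValueError(
            f"x must be 2-D (num_examples, embedding_dim), got shape {x.shape}"
        )
    if y.ndim != 1:
        raise ValueError(f"y must be 1-D (num_examples,), got shape {y.shape}")
    if x.shape[0] != y.shape[0]:
        raise ValueError(f"x has {x.shape[0]} rows but y has {y.shape[0]} labels")
    return x, y

# Stand-in data: 5 examples, each a 3-dimensional vector.
vectors = [np.random.rand(3) for _ in range(5)]
labels = np.random.rand(5)

x_ok, y_ok = check_shapes(vectors, labels)
print(x_ok.shape, y_ok.shape)  # (5, 3) (5,)
```

If the helper raises, the message shows which of the two conditions failed, which is usually enough to tell whether the features were flattened, nested one level too deep, or misaligned with the labels.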
When you say that your input to word2vec is a list of words, do you mean that each training example is a single word, or that each example is a list of words? With one word per example, the encoded dataset should have shape (num_examples, embedding_dim).
If each example is a sequence of words, you will have (num_examples, sequence_len, embedding_dim), which is one dimension too many for xgboost. You'll have to average the embeddings over each sequence, or use sentence embeddings instead.
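The averaging option could look roughly like the sketch below. It assumes you can look up a vector for each word; here a plain dict with made-up values stands in for the trained word2vec model, and `average_embedding` is a hypothetical helper. Returning a zero vector when no word is known is one possible choice, not the only one.

```python
import numpy as np

def average_embedding(words, vectors, embedding_dim):
    """Average the vectors of the words found in `vectors`.

    Falls back to a zero vector when none of the words are known
    (an assumption made for this sketch).
    """
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(embedding_dim)
    return np.mean(known, axis=0)

embedding_dim = 4
# Stand-in for a trained word2vec lookup (word -> vector); values are made up.
vectors = {
    "text": np.array([1.0, 0.0, 0.0, 0.0]),
    "font": np.array([0.0, 1.0, 0.0, 0.0]),
    "graphics": np.array([0.0, 0.0, 1.0, 0.0]),
}

# Each training example is a list of words of any length.
docs = [["text", "font"], ["graphics"], ["unknown"]]

# One averaged vector per example, stacked into a 2-D array.
x_train = np.vstack([average_embedding(d, vectors, embedding_dim) for d in docs])
print(x_train.shape)  # (3, 4)
```

With gensim, the trained model's `wv` attribute should be usable in place of the dict, since it supports both `word in model.wv` and `model.wv[word]`; check this against the version you have installed.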
For example, given some randomly initialized numpy arrays:
import numpy as np
num_examples = 70812
embedding_dim = 100
x_train = np.random.rand(num_examples, embedding_dim)
y_train = np.random.rand(num_examples)
print(x_train.shape, y_train.shape)
This should print (70812, 100) (70812,), where 70812 is the number of training examples and 100 is the size of each embedding vector.
Then you can fit the model as before:
from xgboost import XGBRegressor
xgb_model = XGBRegressor(
    objective = 'reg:squarederror',
    colsample_bytree = 0.5,
    learning_rate = 0.05,
    max_depth = 6,
    min_child_weight = 1,
    n_estimators = 1000,
    subsample = 0.7
)
xgb_model.fit(x_train, y_train)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Adam Montgomerie |
