XGBoost4J-Spark: training with VectorAssembler output and with a custom dense vector results in two completely different trained models
I am currently working with XGBoost4J. To use it, I have to transform my training data using VectorAssembler. My problems are as follows:
I transform my data using VectorAssembler, whose output is a feature vector column. When does VectorAssembler output a dense vector, and when does it convert the output to a sparse vector? How should I set the missing value to a non-zero value when using VectorAssembler?
To avoid the above, I also tried the following code to transform my training data:
val feature_col = array(testing.drop("cust_xref_id","dep_var").columns.map(col).map(_.cast(DoubleType)): _*)
val trainDF = train.select(train("dep_var").cast(DoubleType), train("cust_xref_id").cast(StringType), feature_col).map(r => (r.getAs[Double](0), r.getAs[String](1), org.apache.spark.ml.linalg.Vectors.dense(r.getAs[mutable.WrappedArray[Double]](2).toArray))).toDF("label","key","features")
where the key column is cust_xref_id and the label column is dep_var.
But this results in a drastic drop in model performance, and I am not sure what the issue might be.
- If I have a model where "0" is a meaningful value, how should I prevent it from being treated as a missing value in XGBoost4J training?
https://xgboost.readthedocs.io/en/release_0.90/jvm/xgboost4j_spark_tutorial.html
The link above asks us to replace 0 with some other value. Some of my features come from one-hot encoding; how should I handle those?
I am using versions 0.82 and 0.90.
Please help me resolve this issue.
Solution 1:[1]
I'm (very!) late to the party but the answer is in the following thread:
https://stackoverflow.com/a/61847377/5726057
where you can find both an explanation for your question (see Naveed's last comment) and the Scala code to solve it by creating a new MLlib Transformer that you can include in your pipelines.
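As a rough illustration of the underlying issue (this is not the code from the linked answer), the sketch below shows how Spark ML picks between dense and sparse storage, and one way to force dense vectors so that zeros stay explicit values. It assumes Spark's `spark-mllib` and `spark-sql` are on the classpath. The object name `DenseFeatures` and the helper `toDenseUdf` are hypothetical names chosen for this example.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Hypothetical helper object; the names here are illustrative only.
object DenseFeatures {

  // UDF that rewrites any ML vector as a DenseVector, so zeros are stored
  // explicitly instead of being left out of a sparse representation.
  lazy val toDenseUdf = udf((v: Vector) => v.toDense)

  def main(args: Array[String]): Unit = {
    // `compressed` chooses whichever storage format is smaller.
    // VectorAssembler applies the same choice to its output, which is why
    // rows with many zeros can come back as sparse vectors.
    val mostlyZero = Vectors.dense(0.0, 0.0, 0.0, 5.0).compressed
    val mostlyFull = Vectors.dense(1.0, 2.0, 0.0, 5.0).compressed

    // Few non-zeros: sparse storage is expected to be chosen.
    println(mostlyZero.getClass.getSimpleName)
    // Mostly non-zeros: dense storage is expected to be chosen.
    println(mostlyFull.getClass.getSimpleName)

    // Converting back to dense restores the explicit zeros.
    println(mostlyZero.toDense)
  }
}
```

On a DataFrame that already has an assembled `features` column, the helper could be applied as `df.withColumn("features", DenseFeatures.toDenseUdf(col("features")))` before training. Whether a zero is then treated as missing still depends on the `missing` parameter you pass to XGBoost4J-Spark, so check that setting against the tutorial for your version.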
Solution 2:[2]
When deploying ARM to a new data factory, it is automatically deployed in live mode.
If you published a new linked service from data factory, it will appear the next time you deploy.
If you want more flexibility in deploying data factory, take a look at SQLPlayer, a library for flexible data factory deployment.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Pablo |
| Solution 2 | 54m |
