'XGBoost model failed due to Closing connection _sid_af1c at exit
We use XGBoost model for regression prediction model, We use XGBoost as grid search hyper parameter tuning process,
We run this model on 90GB h2o cluster. This process now running over 1.2 years, but suddenly this process stop due to "Closing connection _sid_af1c at exit"
Training data set is 800 000, due to this error we decreased it to 500 000 but same error occurred.
ntrees - 300,400
depth - 8.10
variables - 382
I have attached H2o memory log and our application error log. Could you please support to fixed this issue.
----------------------------------------H2o Log [Start]----------------------
**We start H2o as 2 node cluster, but h2o log crated on one node.**
INFO water.default: ----- H2O started -----
INFO water.default: Build git branch: master
INFO water.default: Build git hash: 0588cccd72a7dc1274a83c30c4ae4161b92d9911
INFO water.default: Build git describe: jenkins-master-5236-4-g0588ccc
INFO water.default: Build project version: 3.33.0.5237
INFO water.default: Build age: 1 year, 3 months and 17 days
INFO water.default: Built by: 'jenkins'
INFO water.default: Built on: '2020-10-27 19:21:29'
WARN water.default:
WARN water.default: *** Your H2O version is too old! Please download the latest version from http://h2o.ai/download/ ***
WARN water.default:
INFO water.default: Found H2O Core extensions: [XGBoost, KrbStandalone]
INFO water.default: Processed H2O arguments: [-flatfile, /usr/local/h2o/flatfile.txt, -port, 54321]
INFO water.default: Java availableProcessors: 20
INFO water.default: Java heap totalMemory: 962.5 MB
INFO water.default: Java heap maxMemory: 42.67 GB
INFO water.default: Java version: Java 1.8.0_262 (from Oracle Corporation)
INFO water.default: JVM launch parameters: [-Xmx48g]
INFO water.default: JVM process id: [email protected]
INFO water.default: OS version: Linux 3.10.0-1127.10.1.el7.x86_64 (amd64)
INFO water.default: Machine physical memory: 62.74 GB
INFO water.default: Machine locale: en_US
INFO water.default: X-h2o-cluster-id: 1644769990156
INFO water.default: User name: 'root'
INFO water.default: IPv6 stack selected: false
INFO water.default: Possible IP Address: ens192 (ens192), xxxxxxxxxxxxxxxxxxxx
INFO water.default: Possible IP Address: ens192 (ens192), xxxxxxxxxxx
INFO water.default: Possible IP Address: lo (lo), 0:0:0:0:0:0:0:1%lo
INFO water.default: Possible IP Address: lo (lo), 127.0.0.1
INFO water.default: H2O node running in unencrypted mode.
INFO water.default: Internal communication uses port: 54322
INFO water.default: Listening for HTTP and REST traffic on http://xxxxxxxxxxxx:54321/
INFO water.default: H2O cloud name: 'root' on /xxxxxxxxxxxx:54321, discovery address /xxxxxxxxxxxx:57653
INFO water.default: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
INFO water.default: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 root@xxxxxxxxxxxx'
INFO water.default: 2. Point your browser to http://localhost:55555
INFO water.default: Log dir: '/tmp/h2o-root/h2ologs'
INFO water.default: Cur dir: '/usr/local/h2o/h2o-3.33.0.5237'
INFO water.default: Subsystem for distributed import from HTTP/HTTPS successfully initialized
INFO water.default: HDFS subsystem successfully initialized
INFO water.default: S3 subsystem successfully initialized
INFO water.default: GCS subsystem successfully initialized
INFO water.default: Flow dir: '/root/h2oflows'
INFO water.default: Cloud of size 1 formed [/xxxxxxxxxxxx:54321]
INFO water.default: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
INFO water.default: XGBoost extension initialized
INFO water.default: KrbStandalone extension initialized
INFO water.default: Registered 2 core extensions in: 2632ms
INFO water.default: Registered H2O core extensions: [XGBoost, KrbStandalone]
INFO hex.tree.xgboost.XGBoostExtension: Found XGBoost backend with library: xgboost4j_gpu
INFO hex.tree.xgboost.XGBoostExtension: XGBoost supported backends: [WITH_GPU, WITH_OMP]
INFO water.default: Registered: 217 REST APIs in: 353ms
INFO water.default: Registered REST API extensions: [Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4]
INFO water.default: Registered: 291 schemas in 112ms
INFO water.default: H2O started in 4612ms
INFO water.default:
INFO water.default: Open H2O Flow in your web browser: http://xxxxxxxxxxxx:54321
INFO water.default:
INFO water.default: Cloud of size 2 formed [mastera.xxxxxxxxxxxx.com/xxxxxxxxxxxx:54321, masterb.xxxxxxxxxxxx.com/xxxxxxxxxxxx:54321]
INFO water.default: Locking cloud to new members, because water.rapids.Session$1
INFO hex.tree.xgboost.task.XGBoostUpdater: Initial Booster created, size=448
ERROR water.default: Got IO error when sending a batch of bytes:
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:605)
at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:588)
----------------------------------------H2o Log [End]--------------------------------
----------------------------------------Application Log [Start]----------------------
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
Warning: Your H2O cluster version is too old (1 year, 3 months and 17 days)! Please download and install the latest version from http://h2o.ai/download/
-------------------------- ------------------------------------------------------------------
H2O_cluster_uptime: 19 mins 49 secs
H2O_cluster_timezone: Asia/Colombo
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.33.0.5237
H2O_cluster_version_age: 1 year, 3 months and 17 days !!!
H2O_cluster_name: root
H2O_cluster_total_nodes: 2
H2O_cluster_free_memory: 84.1 Gb
H2O_cluster_total_cores: 40
H2O_cluster_allowed_cores: 40
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.0 final
-------------------------- ------------------------------------------------------------------
-------------------------- ------------------------------------------------------------------
H2O_cluster_uptime: 19 mins 49 secs
H2O_cluster_timezone: Asia/Colombo
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.33.0.5237
H2O_cluster_version_age: 1 year, 3 months and 17 days !!!
H2O_cluster_name: root
H2O_cluster_total_nodes: 2
H2O_cluster_free_memory: 84.1 Gb
H2O_cluster_total_cores: 40
H2O_cluster_allowed_cores: 40
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.0 final
-------------------------- ------------------------------------------------------------------
release memory here...
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
Warning: Your H2O cluster version is too old (1 year, 3 months and 17 days)! Please download and install the latest version from http://h2o.ai/download/
-------------------------- ------------------------------------------------------------------
H2O_cluster_uptime: 19 mins 49 secs
H2O_cluster_timezone: Asia/Colombo
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.33.0.5237
H2O_cluster_version_age: 1 year, 3 months and 17 days !!!
H2O_cluster_name: root
H2O_cluster_total_nodes: 2
H2O_cluster_free_memory: 84.1 Gb
H2O_cluster_total_cores: 40
H2O_cluster_allowed_cores: 40
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.0 final
-------------------------- ------------------------------------------------------------------
Parse progress: |█████████████████████████████████████████████████████████| 100%
xgboost Grid Build progress: |████████Closing connection _sid_af1c at exit
H2O session _sid_af1c was not closed properly.
Closing connection _sid_9313 at exit
H2O session _sid_9313 was not closed properly.
----------------------------------------Application Log [End]----------------------
Solution 1:[1]
This typically means one of the nodes crashed, it can be due to many different reasons - memory is the most common one.
I see your machine has about 64GB of physical memory and H2O is getting 48GB out of that. XGBoost runs in native memory, not in the JVM memory. For XGBoost we recommend splitting the physical memory 50-50 to H2O and XGBoost.
You are running a development version of H2O (3.33) - I suggest upgrading to the latest stable.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Michal Kurka |
