Random Forest with p >> n and not enough memory
I am trying to perform Random Forest classification on genomic data with ~200k predictors and ~20 rows. The predictors have already been pruned for autocorrelation. I tried the 'ranger' R package, but it fails with an error that it cannot allocate a 164 GB vector (I have 32 GB of RAM).
- Is there an RF implementation that can handle this analysis within the available RAM (I would like to avoid increasing the swap)?
- Should I use a different algorithm instead (from what I have read, RF should cope well with p >> n)?
Solution 1:
If it's genomic data, are there a lot of zeroes? If so, you might be able to convert it into a sparse matrix using the Matrix package. I believe ranger has been able to work with sparse matrices for a while, and this can help a lot with memory issues.
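A minimal sketch of that approach, assuming a numeric predictor matrix `X` (samples in rows, the ~200k predictors in columns) and a factor `y` of class labels (both hypothetical names), and a ranger version recent enough to accept sparse input:

```r
library(Matrix)
library(ranger)

# Convert the dense predictor matrix to compressed sparse column
# format; zero entries are no longer stored explicitly.
X_sparse <- Matrix(X, sparse = TRUE)  # yields a dgCMatrix

# ranger's x/y interface accepts a dgCMatrix directly, so the
# forest is grown without allocating the full dense representation.
fit <- ranger(
  x         = X_sparse,
  y         = y,
  num.trees = 500,
  mtry      = floor(sqrt(ncol(X_sparse)))  # classification default
)

fit$prediction.error  # out-of-bag error estimate
```

The memory saving scales with sparsity: a dgCMatrix stores only the non-zero entries plus index vectors, so if most of the 200k predictors are zero for most samples, the footprint drops accordingly.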
As far as I know, ranger is the best R random forest package available for datasets where p >> n.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
