Random Forest with p >> n and not enough memory

I am trying to perform Random Forest classification on genomic data with ~200k predictors and only ~20 rows. The predictors have already been pruned for autocorrelation. I tried the 'ranger' R package, but it fails because it cannot allocate a 164 Gb vector (I have 32 Gb of RAM).

  1. Is there any RF implementation that can manage the analysis within the available RAM (I would like to avoid increasing swap)?
  2. Should I use a different algorithm instead (from what I have read, RF should handle p >> n reasonably well)?


Solution 1:[1]

If it's genomic data, are there a lot of zeroes? If so, you might be able to convert it into a sparse matrix using the Matrix package. I believe ranger has been able to work with sparse matrices for a while, and this can help a lot with memory issues.
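A minimal sketch of that conversion, assuming a dense numeric predictor matrix `X` (~20 rows x ~200k columns) and a factor of class labels `y` (both hypothetical names, standing in for your own data):

```r
library(Matrix)
library(ranger)

# Hypothetical inputs: X is a dense numeric predictor matrix, y is a
# factor of class labels. Matrix() with sparse = TRUE returns a
# dgCMatrix, which stores only the non-zero entries -- the memory
# savings are large only if most of X is actually zero.
X_sparse <- Matrix(X, sparse = TRUE)

# ranger accepts a dgCMatrix through its x/y interface, so no dense
# copy of the predictors is needed when growing the forest. With a
# factor y, ranger runs in classification mode.
fit <- ranger(x = X_sparse, y = y, num.trees = 500)
```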

As far as I know, ranger is the best R random forest package available for datasets where p >> n.
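Independently of sparsity, ranger also exposes a `save.memory` option that trades training speed for a smaller memory footprint during tree growing; its documentation recommends it precisely for memory problems. A sketch, reusing the hypothetical `X_sparse` and `y` from above:

```r
# save.memory = TRUE switches ranger to a slower, memory-saving
# splitting mode. Growing fewer and shallower trees (num.trees,
# max.depth) reduces the footprint further, at some cost in accuracy.
fit <- ranger(x = X_sparse, y = y,
              num.trees = 250,
              max.depth = 10,
              save.memory = TRUE)
```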

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

[1] Source: Stack Overflow
