Large scale Machine Learning [closed]
I need to run various machine learning techniques on a big dataset (10-100 billion records). The problems are mostly around text mining/information extraction and include various kernel techniques, but are not restricted to them (we use some Bayesian methods, bootstrapping, gradient boosting, regression trees -- many different problems and ways to solve them).
What would be the best implementation? I'm experienced in ML but do not have much experience with doing it on huge datasets. Are there any extensible and customizable machine learning libraries that utilize MapReduce infrastructure? Strong preference for C++, but Java and Python are OK. Should we use Amazon, Azure, or our own datacenter (we can afford it)?
Solution 1:[1]
Apache Mahout is what you are looking for; it is a library of scalable machine learning algorithms, many of which run on top of Hadoop MapReduce.
Solution 2:[2]
Late answer, but here is a good resource for large-scale data mining and machine learning: the GraphLab project. It consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of that API. The team is also actively developing new interfaces so that users can leverage the GraphLab API from other languages and technologies.
Solution 3:[3]
I'm not aware of any ML library that uses MapReduce directly. Maybe you can combine an ML library with a MapReduce framework? You might want to look into Hadoop's MapReduce: http://hadoop.apache.org/mapreduce/
You would have to implement the map and reduce methods yourself, and the fact that you use so many techniques might complicate this; a minimal sketch of what that code might look like is given below.
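To make that concrete, here is a minimal sketch written for Hadoop Streaming (which lets you run Python executables as the mapper and reducer instead of Java classes). Token counting stands in for a real feature-extraction job; the script name, argument convention, and whitespace tokenization are illustrative assumptions, not part of the original answer.

```python
#!/usr/bin/env python
# streaming_wordcount.py -- hypothetical script run as "map" or "reduce"
# under Hadoop Streaming; token counting stands in for a real
# feature-extraction job over the text records.
import sys

def mapper(stdin, stdout):
    # Emit one "token<TAB>1" line per token in the input text.
    for line in stdin:
        for token in line.strip().lower().split():
            stdout.write("%s\t1\n" % token)

def reducer(stdin, stdout):
    # Hadoop sorts mapper output by key, so all counts for a given
    # token arrive contiguously; sum them and emit the total.
    current, total = None, 0
    for line in stdin:
        token, count = line.rstrip("\n").split("\t", 1)
        if token != current:
            if current is not None:
                stdout.write("%s\t%d\n" % (current, total))
            current, total = token, 0
        total += int(count)
    if current is not None:
        stdout.write("%s\t%d\n" % (current, total))

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin, sys.stdout)
```

The same script would be passed to Hadoop Streaming as both the mapper and the reducer command (with the appropriate argument); the exact hadoop-streaming jar path depends on your installation.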
You can run it on your own cluster, or if you are doing research, maybe you could look into BOINC (http://boinc.berkeley.edu/).
On the other hand, maybe you can reduce your dataset. I have no idea what you are training on, but there is bound to be some redundancy in 10 billion records...
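If subsampling is acceptable, one simple way to shrink the data without ever holding it in memory is reservoir sampling. This is a generic sketch of the standard algorithm, not something proposed in the original answer:

```python
import random

def reservoir_sample(records, k, seed=0):
    # Uniform random sample of k records from a stream of unknown length,
    # using O(k) memory -- handy for prototyping models before scaling up.
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)   # each record kept with probability k/(i+1)
            if j < k:
                sample[j] = rec
    return sample
```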
Solution 4:[4]
I don't know of any ML libraries that can support 10 to 100 billion records; that's a bit extreme, so I wouldn't expect to find anything off the shelf. What I would recommend is that you take a look at the Netflix Prize winners: http://www.netflixprize.com//community/viewtopic.php?id=1537
The Netflix Prize dataset had over 100 million entries, so while it's not quite as big as your data set, you may still find their solutions applicable. What the BellKor team did was combine multiple algorithms (something similar to ensemble learning) and weight the "prediction" or output of each algorithm.
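For illustration only, here is a minimal sketch of that kind of weighted blending; the model names, predictions, and weights are made-up placeholders, not anything BellKor actually used. In practice the weights would themselves be fit on a held-out validation set.

```python
import numpy as np

# Hypothetical per-model predictions for the same three records, plus
# blending weights that would normally be learned on held-out data.
predictions = {
    "regression_trees": np.array([3.1, 4.0, 2.5]),
    "kernel_model":     np.array([3.4, 3.8, 2.9]),
    "bayesian_model":   np.array([3.0, 4.2, 2.4]),
}
weights = {"regression_trees": 0.5, "kernel_model": 0.3, "bayesian_model": 0.2}

# Blend each record's predictions as a weighted average over the models.
blended = sum(w * predictions[name] for name, w in weights.items())
print(blended)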
Solution 5:[5]
Take a look at http://hunch.net/?p=1068 for info on Vowpal Wabbit; it's a stochastic gradient descent library for large-scale applications.
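This is not Vowpal Wabbit's API, just a minimal sketch of why plain stochastic gradient descent copes with data at this scale: it streams over the records one at a time and keeps only the weight vector in memory. The logistic-regression setup and learning rate are illustrative assumptions.

```python
import numpy as np

def sgd_logistic(stream, n_features, lr=0.1):
    # Online logistic regression: one update per record, so memory use is
    # independent of the number of records -- the property that makes SGD
    # attractive for very large datasets.
    w = np.zeros(n_features)
    for x, y in stream:                      # x: feature vector, y: 0 or 1
        p = 1.0 / (1.0 + np.exp(-w.dot(x)))  # predicted probability
        w += lr * (y - p) * x                # gradient step on the log-loss
    return w

# Tiny usage example with made-up data; a real run would stream records
# from disk or over the network instead of materializing them in a list.
data = [(np.array([1.0, 0.5]), 1), (np.array([0.2, 1.5]), 0)]
print(sgd_logistic(iter(data), n_features=2))
```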
Solution 6:[6]
A friend of mine has worked on a similar project. He used Perl for the text mining and MATLAB for techniques such as Bayesian methods, latent semantic analysis and Gaussian mixture models...
Solution 7:[7]
See this list of large-scale machine learning resources (courses, papers, etc.): http://www.quora.com/Machine-Learning/What-are-some-introductory-resources-for-learning-about-large-scale-machine-learning
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mikos |
| Solution 2 | remi |
| Solution 3 | sibtx13 |
| Solution 4 | |
| Solution 5 | bsdfish |
| Solution 6 | |
| Solution 7 | alex |
