The service is pretty cool. Training works by uploading your CSV-formatted training data (with labels) to Google Storage, then making a call to train the Google service. Google has not said much about what kind of algorithms they are using behind the scenes, besides the fact that they are using a combination of proprietary and open-source ML algorithms. The service trains a variety of different models and then uses a voting scheme to decide which ones are optimal.
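For anyone wondering what "CSV formatted training data (with labels)" might look like, here is a minimal sketch. It assumes one example per row with the label in the first column; that layout, the function name `to_training_csv`, and the sample rows are my own assumptions for illustration, not something confirmed above.

```python
import csv
import io

def to_training_csv(rows):
    """Serialize (label, text) pairs as CSV text, one labeled example per line.

    Assumed layout: label first, then the example text.
    """
    buf = io.StringIO()
    # Quote every field so commas and quotes inside comments survive intact.
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for label, text in rows:
        writer.writerow([label, text])
    return buf.getvalue()

# Hypothetical training rows for a comment spam detector.
sample_rows = [
    ("spam", "buy cheap pills now"),
    ("ham", 'he said "nice post", thanks'),
]
print(to_training_csv(sample_rows))
```

The resulting text is what you would save to a file and upload; the upload and the training call themselves are not shown because I don't want to guess at the API.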
A few problems I see (or saw; I haven't used it in a few months) with the service are the following.
1. Currently, there is no way to pick your cross-validation folds. This can lead to severe overfitting if your data is not i.i.d.
2. They provide a numerical (double) accuracy number which corresponds to the accuracy estimated from training. How is this number calculated (area under the ROC curve, etc.)? They do not say.
3. Security issues: read the fine print of what happens when your data gets uploaded to Google Storage. It could be a cause for concern.
4. You are competing for resources. When I was testing the API, I would train two successive models with the same amount of data, and one call would complete (asynchronously) after 10 seconds while the next would take 10 minutes.
5. Currently there is no way to inject prior knowledge into your models. If you know your data is Gaussian, you might want to use an RBF kernel, but with this API you cannot, because it might pick the Naive Bayes classifier and not the SVM, etc.
In general, this service will probably work for the average spam detection problem, but if you really want a great system, you probably need to keep everything in house.
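To make point 1 concrete, this is roughly what picking your own folds looks like when you do keep things in house. It is a minimal sketch in plain Python, assuming the non-i.i.d. structure comes from rows that share a group (for example, several comments by the same author). The function `group_k_fold` and the `authors` data are hypothetical names for illustration.

```python
from collections import defaultdict

def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs so that no group is split across folds."""
    # Collect the row indices that belong to each group.
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    if k > len(by_group):
        raise ValueError("need at least as many groups as folds")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(k)]
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_group[g])

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train, test

# Hypothetical example: six comments written by three users.
authors = ["ann", "ann", "bob", "bob", "bob", "cat"]
for train, test in group_k_fold(authors, k=3):
    print("train:", train, "test:", test)
```

With random folds, comments by the same author can land on both sides of the split and the held-out score looks better than it should. Keeping each author entirely in one fold avoids that leak.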
It's not quite clear to me what this actually does. How does training work? Do I give it input and actual decisions so that it can decide based on trends or is it more of a fuzzy match to known data?
It has been a while since I checked it out, but you give it training data (examples + values) and then it can predict values for unseen examples, e.g. predict whether a comment on a blog post is spam.
Actually, the Google Prediction API is already really old; I don't know why it is showing up on HN now.
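As a rough picture of the "examples + values in, predictions out" workflow, here is a toy local stand-in. It uses a small word-count Naive Bayes classifier with add-one smoothing, which I chose purely for illustration; it is not a claim about which algorithm Google actually runs. The `train` and `predict` functions and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (label, text). Returns (per-label word counts, label counts)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label with the highest smoothed log-probability for the text."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior, then add one log-likelihood term per word.
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data: (value, example) pairs.
training_data = [
    ("spam", "buy cheap pills now"),
    ("spam", "cheap watches buy now"),
    ("ham", "great post thanks for sharing"),
    ("ham", "thanks I learned a lot from this post"),
]
model = train(training_data)
print(predict(model, "buy cheap watches"))          # spam
print(predict(model, "thanks for the great post"))  # ham
```

So it is learning from labeled examples and generalizing to unseen ones, not doing a fuzzy match against the stored data.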
Ouch, you're right, this is not relevant. I was playing with language detection and confused language identification in the Prediction API with their language API. Deleting it now.
Because "prediction" here doesn't mean "magical prediction of the future", it means "spotting and extrapolating patterns", and (in so far as the Efficient Market Hypothesis is true) there are no exploitable patterns to spot in stock-market data.
(Presumably the EMH is only approximately true, but it's probably close enough that Google can't make billions that way without a considerable risk of losing billions instead, and without a lot of effort that they might do better to put into making billions by more conventional means.)