Machine Learning Applied to Google's Rankings

paraschopra · on Oct 22, 2009

Though the analysis is interesting, it is not "Machine Learning". There is no test/training data set, no prediction, no model selection. It is just plain old, but extremely useful, correlation analysis.

If someone is interested in a similar kind of correlation (and regression) analysis for website conversion rate, have a look at a study I did recently - http://www.wingify.com/case-studies/predictive-web-analytics...

randfish · on Oct 22, 2009

Actually, that's inaccurate. Please do read through the entire post. Although it begins with correlation data, the conclusion reached halfway through the post is that more sophisticated models are required, which is where the machine learning is put into use - a large number of features about links and on-page elements as well as derivatives of these features - modeled against training data (10K SERPs), then shown against a different set of 10K SERPs.

paraschopra · on Oct 22, 2009

That is very, very hazy. In my opinion the article just says "a machine learning model that maps to the search results and produces a result that's considerably better correlated with rankings than any single metric" without any other information on what the model is, how it was built, or how valid it is. Please correct me if I missed something in the article, but where exactly is that machine learning model described?

ramanujan · on Oct 22, 2009

1) Rand Moz and other SEO people should do an extremely thorough study of 23andMe's site. Their product may be of questionable value, but if there is one site which has the skeleton key to SEO, it is an ecommerce site run by Google's Wife. Any kind of convention or trick that they use is likely to be preferred by Google.

2) I've messed around with this problem myself a bit. In general, predicting rank as a function of page properties is equivalent to replicating Google's own search ranking (i.e. if your predicted rank \hat{Y} = the true rank Y for input features X then you can basically rank pages as google does from signals on web pages, though of course you'll be doing it in batch without all the semi-realtime crawling that goog now does).

That said, you can pretty easily get something decent that will (a) give you an overall estimate of rank and (b) at least tell you quantitatively whether a given feature impacts rankings. This can settle a lot of debates among SEO people.

3) Specific proposal: calculate a non-parametric measure of correlation between empirical page rank and each of the features mentioned in this post (http://www.seomoz.org/article/search-ranking-factors ) on a sample of say 100k keywords. Examination of individual scatterplots will also be informative.

Now you can do a more abstract analysis. Construct a table where rows correspond to features and there are two columns: the empirical non-parametric correlation with PageRank and the estimate in the SEOMoz post on ranking factors of that feature's importance.

Make a scatterplot here (and calculate just one more non-parametric correlation) to see how good the experts were at determining how much each feature contributed to rank.
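A minimal sketch of the proposal above, assuming Spearman's rho as the non-parametric correlation (Kendall's tau would serve equally well). The feature names, page values, and expert scores below are made-up toy numbers, not data from the SEOmoz post, and the function names are hypothetical. Taking the absolute value of each correlation to measure "strength" is also an assumption of this sketch.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation; raises if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def compare_with_experts(feature_values, positions, expert_scores):
    """Build the (feature, empirical rho, expert score) table and return it
    together with the rank correlation between the two columns."""
    table = [(name, spearman(values, positions), expert_scores[name])
             for name, values in feature_values.items()]
    strength = [abs(rho) for _, rho, _ in table]  # sign ignored (assumption)
    expert = [score for _, _, score in table]
    return table, spearman(strength, expert)


# Toy data: six results for one keyword, position 1 is the top result.
POSITIONS = [1, 2, 3, 4, 5, 6]
FEATURES = {
    "inbound_links": [90, 70, 80, 40, 30, 10],
    "title_keyword": [1, 1, 0, 1, 0, 0],
    "page_length": [500, 900, 300, 700, 400, 800],
}
EXPERT_SCORES = {"inbound_links": 0.9, "title_keyword": 0.6, "page_length": 0.2}

table, agreement = compare_with_experts(FEATURES, POSITIONS, EXPERT_SCORES)
```

On real data you would pool many keywords (or average the per-keyword correlations) before comparing against the expert estimates; a single SERP is far too small a sample.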

martian · on Oct 22, 2009

23andMe is still using meta keywords, so they might be a little out of date.

jgrahamc · on Oct 22, 2009

If you're going to throw around a term like 'machine learning' then it would be nice if you were to explain what you were doing. The article says:

We (well, technically, Ben) run them through a machine learning model that maps to the search results and produces a result that's considerably better correlated with rankings than any single metric.

kurtosis · on Oct 22, 2009

If anyone is interested in what a real "machine learning" approach to the problem of learning a ranking function from data looks like, see this paper:

Burges et al., "Learning to Rank using Gradient Descent": http://research.microsoft.com/apps/pubs/?id=69183

This is most definitely an active research area, though, and the papers citing this one should be pretty interesting.
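For a feel of the idea, here is a rough sketch of a pairwise ranking loss trained by gradient descent. It is a simplification, not the paper's method as published: the paper trains a neural network, while this sketch assumes a plain linear scoring function, and the feature vectors are invented toy values. Each training pair says "the first page should rank above the second", and the loss is the cross entropy of sigmoid(score difference) against that preference.

```python
import math


def score(weights, features):
    """Linear scoring function (an assumption; the paper uses a neural net)."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_loss(weights, better, worse):
    """-log(sigmoid(s_better - s_worse)), written as log(1 + exp(-diff))."""
    diff = score(weights, better) - score(weights, worse)
    return math.log1p(math.exp(-diff))


def train(pairs, n_features, learning_rate=0.1, epochs=200):
    """Stochastic gradient descent over (better, worse) feature pairs."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = score(weights, better) - score(weights, worse)
            prob = 1.0 / (1.0 + math.exp(-diff))  # P(better ranks first)
            grad_scale = -(1.0 - prob)            # d(loss) / d(diff)
            for k in range(n_features):
                weights[k] -= learning_rate * grad_scale * (better[k] - worse[k])
    return weights


# Toy pairs: each tuple is (features of higher-ranked page, features of lower).
PAIRS = [
    ([3.0, 1.0], [1.0, 1.0]),
    ([2.0, 0.0], [1.0, 1.0]),
    ([4.0, 1.0], [2.0, 0.0]),
]

initial_loss = sum(pairwise_loss([0.0, 0.0], b, w) for b, w in PAIRS)
learned = train(PAIRS, n_features=2)
final_loss = sum(pairwise_loss(learned, b, w) for b, w in PAIRS)
```

After training, sorting pages by `score(learned, features)` gives the predicted ordering. Evaluating that ordering on pairs held out from training is what separates this from a plain correlation study.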

carbocation · on Oct 22, 2009

If you look at their model vs the correct result, their error appears logarithmic. This is what I would expect from a linear model that is trying to approximate a function known to be logarithmic. (The 1-10 PageRank values we see are logarithms of the actual internal Google values, or so it is said.)
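To illustrate the kind of systematic error a straight line makes on a logarithmic target, here is a small sketch. It does not reproduce SEOmoz's model or data; it just fits an ordinary least-squares line to log10(x) over an arbitrarily chosen range and inspects the residuals.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Arbitrary illustrative range; the true target is logarithmic by construction.
xs = list(range(1, 101))
ys = [math.log10(x) for x in xs]

slope, intercept = fit_line(xs, ys)
# Residual = true value minus the line's prediction.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

Because the logarithm is concave, the line overshoots at both ends of the range and undershoots in the middle, so the residuals trace a curve rather than scattering randomly around zero. A curved residual pattern like that is a hint that the target should be transformed (or the model made nonlinear) before fitting.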

meatbag · on Oct 22, 2009

This submission implies that SEOmoz is at least slightly interested in peer review, which would be a very good thing for them.

randfish · on Oct 22, 2009

Absolutely! I can't promise we can divulge everything we're doing, but I know that Ben (who runs these models) would love outside opinions and critiques. You can reach him via Ben at SEOmoz.org.