Though the analysis is interesting, it is not "Machine Learning". There is no test/training data set, no prediction, no model selection. It is just plain old, but extremely useful, correlation analysis.
Actually, that's inaccurate. Please do read through the entire post. Although it begins with correlation data, the conclusion reached halfway through the post is that more sophisticated models are required, which is where the machine learning is put into use - a large number of features about links and on-page elements as well as derivatives of these features - modeled against training data (10K SERPs), then shown against a different set of 10K SERPs.
That is very, very hazy. In my opinion the article just says "a machine learning model that maps to the search results and produces a result that's considerably better correlated with rankings than any single metric" without any other information on what the model is, how it was built, or how valid it is. Please correct me if I missed something in the article, but where exactly is that machine learning model described?
1) Rand Moz and other SEO people should do an extremely thorough study of 23andMe's site. Their product may be of questionable value, but if there is one site which has the skeleton key to SEO, it is an ecommerce site run by Google's Wife. Any kind of convention or trick that they use is likely to be preferred by Google.
2) I've messed around with this problem myself a bit. In general, predicting rank as a function of page properties is equivalent to replicating Google's own search ranking (i.e. if your predicted rank \hat{Y} equals the true rank Y for input features X, then you can basically rank pages as Google does from signals on web pages, though of course you'll be doing it in batch without all the semi-realtime crawling that Google now does).
That said, you can pretty easily get something decent that will (a) give you an overall estimate of rank and (b) at least tell you quantitatively whether a given feature impacts rankings. This can settle a lot of debates among SEO people.
3) Specific proposal: calculate a non-parametric measure of correlation between empirical page rank and each of the features mentioned in this post (http://www.seomoz.org/article/search-ranking-factors) on a sample of say 100k keywords. Examination of individual scatterplots will also be informative.
Now you can do a more abstract analysis. Construct a table where rows correspond to features and there are two columns: the empirical non-parametric correlation with page rank, and the estimate of that feature's importance in the SEOMoz post on ranking factors.
Make a scatterplot here (and calculate just one more non-parametric correlation) to see how good the experts were at determining how much each feature contributed to rank.
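For what it's worth, here is a rough sketch of the proposal above in Python, using Spearman's rank correlation as the non-parametric measure (one possible choice; Kendall's tau would work too). Everything in the toy data is invented for illustration: the feature names, the feature values, and the expert_score numbers are hypothetical and are not taken from the SEOmoz survey. SERP position is negated so that a positive correlation means "more of this feature goes with a better ranking".

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data for a single keyword (all values are made up).
positions = list(range(1, 11))          # SERP positions 1..10
goodness = [-p for p in positions]      # negate so higher = better ranking
features = {
    "linking_domains": [900, 700, 750, 400, 300, 350, 120, 90, 60, 10],
    "keyword_in_title": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
    "page_length_words": [500, 1200, 300, 800, 950, 400, 1100, 600, 700, 1000],
}
# Hypothetical expert importance scores, one per feature.
expert_score = {"linking_domains": 0.8, "keyword_in_title": 0.6, "page_length_words": 0.3}

# Step 1: one empirical correlation per feature.
empirical = {name: spearman(vals, goodness) for name, vals in features.items()}
for name, rho in empirical.items():
    print(f"{name:20s} empirical rho = {rho:+.2f}  expert = {expert_score[name]:.2f}")

# Step 2: one more correlation, experts vs. the empirical numbers.
names = list(features)
agreement = spearman([empirical[n] for n in names], [expert_score[n] for n in names])
print(f"expert/empirical agreement (Spearman) = {agreement:+.2f}")
```

On real data you would compute the per-feature correlation for each keyword's SERP and then average (or pool) across the sample, rather than using one ten-result list as done here.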
If you're going to throw around a term like 'machine learning' then it would be nice if you were to explain what you were doing. The article says:
We (well, technically, Ben) run them through a machine learning model that maps to the search results and produces a result that's considerably better correlated with rankings than any single metric.
If you look at their model vs the correct result, their error appears logarithmic. This is what I would expect from a linear model that is trying to approximate a function known to be logarithmic. (The 1-10 PageRank values we see are logarithms of the actual internal Google values, or so it is said.)
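To illustrate the point about a linear model chasing a logarithmic target, here is a small sketch. It has nothing to do with SEOmoz's actual model, which isn't described; it just fits an ordinary least-squares line to y = log10(x) on an arbitrary range and prints the residuals. The range 1..100 and the helper name fit_line are my own choices.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


# A target that is logarithmic in the input (arbitrary illustrative range).
xs = list(range(1, 101))
ys = [math.log10(x) for x in xs]

slope, intercept = fit_line(xs, ys)
# Residual = true value minus the line's prediction.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

for x in (1, 10, 50, 100):
    print(f"x = {x:3d}  residual = {residuals[x - 1]:+.3f}")
```

The residuals are not random scatter: the line predicts too high at both ends of the range and too low in the middle, which is the kind of systematic, curved error you would expect to see when the underlying relationship is logarithmic.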
Absolutely! I can't promise we can divulge everything we're doing, but I know that Ben (who runs these models) would love outside opinions and critiques. You can reach him via Ben at SEOmoz.org.
If anyone is interested in a similar kind of correlation (and regression) analysis for website conversion rate, have a look at the study I did recently - http://www.wingify.com/case-studies/predictive-web-analytics...