More

aaronjg · on March 16, 2017

Apparently it is a scientific program that was made into a web app https://www.ufz.de/index.php?en=39156

aaronjg · on March 6, 2017

This sort of approach is also used in reinforcement learning https://arxiv.org/abs/1702.01182

aaronjg · on Feb 26, 2017

There is some research on these ecosystems [1], and there dynamics [2,3]. The ecosystem is made up of a photosynthetic algae as a producer, a protozoa as a consumer, and a bacteria as a decomposer, and can be stable for over 1,000 days.

[1] http://ir.obihiro.ac.jp/dspace/bitstream/10322/221/1/Prot.Vo...

[2] http://www.cell.com/cell/pdfExtended/S0092-8674(12)00515-6

[3] http://journals.aps.org/prx/abstract/10.1103/PhysRevX.5.0410...

pvaldes · on Feb 26, 2017

Stable for 1000 days is not the same thing as 'will remain clean for 1000 days'. Have this phones a undisclosed problem with overheating that would suggest to add water as heat buffer?

aaronjg · on Jan 9, 2017

When is this from? The last paper in the reference is 2010.

Smerity · on Jan 9, 2017

I work in the field and I'm with aaronjg - this is ancient in the scheme of deep learning and very far from modern best practices. I honestly find it confusing when I see links like this hit the top of the page with such a strong number of upvotes. There is more modern and better run material now. Even if this is being referred to historically it should involve a timestamp.

technics256 · on Jan 9, 2017

Where would you recommend one to find the more modern and better material? Thanks.

Smerity · on Jan 9, 2017

For general introductory material in this style from Stanford, CS231n (fairly general but specialization in vision) and CS224d (specialization in DL for NLP) are great. The material for both of these are online for free and the video lectures (taken down due to legal challenges regarding accessibility) are available if you look hard enough ;)

If you're particularly after unsupervised deep learning, I'd recommend you do one or both of the above (or equivalent) and then read relevant recent papers.

http://cs231n.stanford.edu

http://cs224d.stanford.edu

Pyxl101 · on Jan 9, 2017

A link to an equivalent type of page that teaches modern material would be helpful.

Smerity · on Jan 9, 2017

See my reply to technics256 and feel free to ask other questions :)

deepGem · on Jan 9, 2017

The latest GitHub commit was done 3 years ago. So the course is probably that old. However, I don't think the age matters. I am referring this tutorial in parallel with cs231n and it's so far been good. At least for Convnets.

aaronjg · on Jan 20, 2015

I wrote about the problem with sequential testing in online experiment three years ago on the Custora blog [1]. And Evan Miller wrote about it two years before me on his blog [2]. I'm glad to see Optimizely finally getting on board. Communicating statistical significance to marketers is always challenging, and I'm sure this will lead to better decisions being made.

[1] http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-te...

[2] http://www.evanmiller.org/how-not-to-run-an-ab-test.html

dargani123 · on Jan 20, 2015

Hey guys,

Answering addressing a few comments right here. I think the industry deserves a lot of credit in its efforts to help those wanting to run A/B tests. Many people were aware these were issues and many actually tried to fix it (us included). There are many blog posts in the community about why continuous monitoring is dangerous, why you should use a sample size calculator, how to properly set a Minimum Detectable Effect etc... We were part (and definitely not the first) of this group as we published a sample size calculator and spent a lot of time working with our clients on running tests with a safe testing procedure.

However, after doing this and looking more closely attempting to quantify the effect of these efforts we saw an opportunity for a simpler solution that could help even more people. Sequential Testing was this solution, and it's had success in other applications. We wanted to bring sequential testing to A/B testing and take the hard work out of doing it correctly. Specifically, we have built on that groundwork laid in 50's and 60's by providing an always valid notion of p-value that customers are looking for.

While traditional sequential testing combats the continuous monitoring problem well, they require you to have an intimate understanding of the solution that can pose cognitive hurdles for those not well-versed in statistics. You have to either know your target effect size, or have in mind a maximum allowable number of visitors and understand how changes in these will affect the run time of your test. What’s more, it is not straightforward to translate results to standard measures of significance such as p-values. This is actually where the biggest research contribution of Stats Engine comes in. We allow you to run a test, detect a range of effect sizes and provide an always valid FDR-adjusted p-value as opposed to a set of stopping rules that bounds Type 1 error at say 5%. The error rates are valid no matter how the user chooses to interact with the A/B test. Also, FDR control itself has only been around over the last 20-25 years.

Our biggest industry contribution is probably much simpler in us moving a lot of the market to sequential testing more generally. We are happy to be in the position to help build on research and bring this to practical applications.

aaronjg · on June 12, 2014

Google Cache Link

http://webcache.googleusercontent.com/search?q=cache:kvX1Y03...

aaronjg · on July 16, 2013

Gayle Laakman has a good post about self-publishing, and a lot of the hidden downsides, and why going the traditional route is often better (and more profitable).

http://www.technologywoman.com/2012/07/09/the-dirty-truth-ab...

aaronjg · on April 17, 2013

First of all, great work. It looks like you boosted your conversion rate from 0.19% to 0.43%. Which is a 125% improvement, or with confidence intervals, 55% - 179% improvement.

However, before everybody goes out and puts puppies on their homepages, they need to realize that there are a bunch of things being tested.

Image vs. no image: Is it possible that having any image at all improves the conversion. You should test with other pictures: perhaps some animals, people, nature, and see if the puppy is what makes it work.

Call to action: The 'puppy' version also features a more succinct call to action in "Sign up now" rather than "Start your 30 days free trial." Perhaps this also contributes some of the difference.

Button size: The button size in the 'puppy' version is smaller. Perhaps this has some effect as well.

Length of text: The 'puppy' version has more description of what is involved in the free trial. It says "Pick a plan & sign up in 60 seconds. Upgrade, downgrade cancel at any time." vs. the no puppy version that says "Start you 30 days free trial."

Vertical vs. Horizontal Layout: The 'puppy' version has a vertical layout of the text and button, where they are stacked on top of each other rather than left or right.

So there are at least five different changes made between these two designs. Clearly the second design wins on conversions, but it's not entirely clear to me why it wins.

SeanDav · on April 17, 2013

Not sure how you can get much of value from an A/B test with multiple changes, especially if one is claiming that only 1 of those changes is what is responsible for all the improvement.

harlanlewis · on April 17, 2013

If nothing else, they have a hypothesis to test in the next experiment.

Even when possible to isolate and remove ancillary changes to improve split test purity, it's often not beneficial. If there's a significant number of changes, achieving statistical significance across the full matrix probably isn't even possible.

But that's ok, because limiting changes to a single test queue restricts your ability to move fast and try lots of stuff, which is beneficial. So when you test, try cheap multivariate methods (there's a bunch!) to quickly understand how interactions between multiple changes affect results.

nathanfp · on April 17, 2013

You can iterate on the other tests over time. Many A/B tests start with a larger change that may include multiple variables but with that baseline increase, now they can go ahead and test Dog vs Cat vs Human as the image. Or can test a variety of different text sizes and lengths. This seems like a fantastic start, with plenty of room for further iteration and improvement.

aaronjg · on Feb 12, 2013

That approach was discussed by Anscombe, and I wrote up a summary in the Custora Blog. However just because an approach is frequentist or 'ad-hoc' does not necessarily mean that there is anything wrong with it. The bayesian approach requires making assumptions about the number of visitors to your site after you stop the test, which isn't really any less adhoc than picking an error cut off.

http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-te...

btilly · on Feb 12, 2013

I like that article, but have one major qualm about it. Everything that you do in a Bayesian model depends on the prior. Yet you often see - as there - someone tell you, "Here is the rule to use" but without telling you the prior.

However the prior actually matters. For instance when you look at what Nate Silver did, most of the mathematical horsepower went to determining a really good prior to use based on historical data. And armed with that he both can and does make inferences. (Which he's willing to publish.)

That said, the Bayesian approach is conceptually so much better that Bayesian with a questionable prior can be better than a frequentist approach.

Finally the fact that a Bayesian approach needs a somewhat arbitrary planning horizon does not particularly bother me. Financial theory tells us that businesses really should apply a discounting factor to future projected income, and when you apply an exponentially decaying discounting factor, the weighted number of future visitors generally comes out to a finite number. And yes, there are a lot of arbitrary factors in how you get to that number. But you can generally do it in a reasonable enough way to be way less sloppy in your A/B test than every other part of the business is. Heck - you can just say that your planning horizon is 1 year, and use the expected number of visitors in that time as a cutoff.

Anyways I'd like to eventually get into this kind of issue with this series. But whether I can, I don't know. It certainly will be hard if I keep on trying to pitch it to the level of mathematical background that I've been aiming for so far.

yummyfajitas · on Feb 12, 2013

It is no less ad hoc, but it is clearer. Instead of making assumptions about the right p-values, you make assumptions about real world quantities.

And if you have actual data/projections about future visitors, it is less ad hoc.

aaronjg · on Feb 11, 2013

I have a problem when companies start claiming personalization to this extreme level. How can they claim that they know that an individual "Gets bored and checks email at 4pm."

They are able to look at their customers and see when they open emails, even report on the average time, but people are so much more noisy than they make it seem.

It's interesting that this attitude of pinning customers to a specific thing is so ingrained in their mentality that they bucket their customers: Johnson only ever drinks water, Aubrey rides his bike every day, rain or shine.

In reality people are complex and multifaceted, and it is important to acknowledge this when marketing to them.