I’m at The System seminar and I just heard Trevor Claiborne from the Google Website Optimizer team speak. He casually mentioned they recommend ONE HUNDRED conversions/actions PER OPTION you are testing.
This is in direct contrast to the standard rules of thumb circulating in the direct marketing world. (Most say 30, or 40 at most, is plenty.)
Now, with my statistical background, rules of thumb have always bothered me in the first place. I’d rather see the appropriate statistical test applied, but I DO understand the practical need to move at the speed of business and use heuristics and shortcuts.
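For readers wondering what an "appropriate statistical test" might look like here: the post doesn't name one, but a common choice for comparing two conversion rates is a two-proportion z-test. The sketch below is a minimal version of that test using only the Python standard library. The function name and the sample figures are made up for illustration, and the normal approximation it relies on gets shaky when conversion counts are very small.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates.

    conv_a / conv_b are conversion counts, n_a / n_b are visitor counts.
    Uses the pooled normal approximation, so it is only a rough guide
    when either option has very few conversions.
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: roughly 3% vs 4% conversion at two sample sizes.
print(two_proportion_z_test(30, 1000, 40, 1000))
print(two_proportion_z_test(100, 3333, 133, 3333))
```

With these illustrative inputs the same lift is not significant at the 5% level with 30 vs 40 conversions, but is with 100 vs 133, which is the general direction the 100-conversion advice points in.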
But I was incredibly surprised to hear Trevor say ONE HUNDRED in contrast to the 30 we’re all used to hearing. So I asked whether this was based upon their observations across all their advertisers … I got a bit of a disclaimer, and then an answer I took for yes.
Here’s why I think this is a radical statement and the implications for us all.
First of all, Google knows.
I’m not doubting their observations.
They must SEE something we don’t. And I’m guessing what that is … what it MUST be, is that there is MUCH MORE NOISE on the internet than there is in traditional direct marketing environments.
This explains why so many clients report confusion after having declared a winner … then later find their profits haven’t really risen correspondingly.
So we probably should all lean in the direction of longer lasting, more statistically robust tests … which translates to MORE actions before making a decision.
On the other hand, the vast majority of clients I’ve spoken with don’t have the necessary traffic to accomplish this in any reasonable time frame.
So what’s the solution?
FEWER AND MORE WELL CHOSEN TESTING OPTIONS RUN LONGER. It really doesn’t make sense to throw everything against the wall. Do your research first and foremost to determine what MIGHT be likely to improve your conversions. This means surveying people as they’re exiting your page, installing live chats, monitoring competitive sites, reading repeatedly appearing sales copy in your market and identifying very strong candidates for conversion enhancement.
A plain old A-B test with well thought out inputs is probably more valuable than testing umpteen different options.
Of course, Taguchi starts to look more attractive in this situation (a method of compressing dozens of tests into one), but you have to remember that the same rules apply … we need MORE actions than we’re used to considering before we get really robust and stable results.
Which means your inputs for Taguchi testing are even more important.
Bottom line?
Google says there’s LOTS AND LOTS of noise online, and you have to test carefully and watch the results for a long time to be sure of them.
Believe them, and design your tests wisely since they’re not as “disposable” as you might have previously thought.
G


{ 20 comments }
Thanks for this heads up Glenn, and I agree with you: if this comes from Google, they will know.
It does make some sense. With figures less than 50 there is a lot of room for error. They are also correct about the noise levels. The web looks more and more attractive as a cost effective marketing tool, and more people are jumping on, pushing the noise levels up.
As the bigger companies start to divert more ad revenue in this direction the little guys have got to raise their game to stay ahead. The answer is better research and better testing.
All the best
Thanks Glenn,
This is something I’ve observed in my campaigns.
The more successful (in terms of conversion/cost & ad position) are the ads I DON’T fuss over, i.e. do the usual A/B split testing in the beginning and leave to simmer. Remove 0-click and 0-conversion keywords.
You end up with only the top performer keywords in each adgroup, simmer.
One thing. Does age of a campaign mean anything to Google?
Glenn,
I see a big swing based on the day of the week too. Not just Sunday vs. Wednesday, but from Tuesday to Thursday as well. No matter how many “actions” I have, I’ve found I have to run at least 7 days as well.
Enjoy the System – it’s always great!
-Mark Ingles
I’ve also got a statistical background. My degree was Computer Science and Statistics – there’s a match made for Google Adwords testing.
I agree about the “noise” comment insofar as it really depends on what you are testing.
If you have highly targeted keywords (long tailed that really qualify the customer) then I can see that 30 or such conversions is going to be enough.
The problem starts when your keywords are not so highly targeted. Now the human interpretation of what your Ad is offering them against the keywords they entered creates the noise that Google refers to.
As soon as you introduce the human factor into any control experiment, the control goes out of the window.
So yes, I would always work on a much larger sample of conversion data before I made a concrete business decision of the Ad to back.
Great post.
Thanks
Mike
Hey Glenn.
Good input, sir. I’m going through your and Terry’s Blueprint course now and I want to say thanks for putting this product out. I’m putting my 3 question survey up this weekend and will be A/B split testing my opt-in page. I tracked all my tips for a year when I delivered pizzas in college, so I’m excited to get back to my old surveying ways. Now all I have to do is get ample traffic to get to this magic 100 # from Google. Thanks for this!
One question on survey software: I have HotTopic’s free QuizMaker software I got from Eben Pagan and have created a survey with it; do you have any experience with it? If not, I’ll use the one you recommend in the Truthprints course.
Thanks, Glenn. You and Sharon enjoy your weekend! I trust the rest of the seminar will be fruitful.
David Newby
http://www.YourProsperityPower.com/FreeReport
Glenn –
Thank you for that. I’ve actually always heard 15-20 – so this is quite different. It’s even more reason to go for the A/B split test (rather than multi-variate).
This reminds me of some linguistics research I read back in college. I think we figured that the studies needed a minimum sample size of 30 to be statistically “correct” – yet many were published with 10 or 12 participants…
I suppose sample size really isn’t (or shouldn’t be) an issue on the internet, though.
Liane
Hi Glenn, and greetings from New Zealand
Great info – thanks. I work with stats all day long and we often advise production managers (large dairy factories) that 30 samples is the minimum they should use to get the mean and sd of a normally-distributed process. Ideally, we say 120 samples for the k-values to drop so that our confidence intervals are as high as we require; in practice it’s somewhere in between.
All gobbledy-gook for non-math people, but could you explain why it is 100+ for standard true/false results, which I’m assuming this AB testing is?
What are the stats measures?
Thanks
Stephen Barrett
Glenn,
I definitely think you have a point about needing more clicks/action to get a DEFINITE answer .. but sometimes that is not what you need to know !
For example if you are testing a control against a new variation, I believe there are 3 possible outcomes of a test -
1. New Variation beats the Control
2. Control beats the New Variation; OR
3. There is no SIGNIFICANT statistical difference … so the smart thing to do is to keep the control and test a new variation !
I think the real question is working out when option 3 applies.
Do you wait for enough actions to get a high (e.g. 99%) statistical certainty in results (which might be what Google are advocating with 100 actions per test), or are you happy with 80% certainty if it means a shorter test period and being able to go through more tests in the same period of time?
In practical terms, this might be the difference between running 4 tests in 1 month, and finding the ad/sales page combination with a 0.5% increase in the conversion rate … as opposed to running the first test for longer and waiting for a 99% statistical certainty – and finally finding that ad/sales combination which increases the conversion rate by 0.5% BUT you find it 3 months down the track!
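To put rough numbers on that trade-off, here is a sketch of the textbook sample-size formula for comparing two proportions. It is only an approximation, and several things in it are assumptions not taken from the comment: the 5% baseline conversion rate, reading "a 0.5% increase" as 5.0% going to 5.5%, the 80% power setting, and the function name itself.

```python
import math
from statistics import NormalDist


def visitors_per_variation(base_rate, new_rate, confidence=0.95, power=0.8):
    """Approximate visitors needed per variation to detect base_rate -> new_rate.

    Standard two-sided normal-approximation formula. `power` (the chance of
    detecting a real difference) is an assumed setting, not from the thread.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + new_rate * (1 - new_rate)
    effect = new_rate - base_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: a 5.0% control against a 5.5% variation.
for level in (0.80, 0.95, 0.99):
    print(level, visitors_per_variation(0.05, 0.055, confidence=level))
```

Under these assumptions the 99% requirement needs well over twice the traffic of the 80% one, which is the cost being weighed against running more tests in the same period.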
Something to think about,
Suneel
When I am looking at Adwords campaigns and I see that an ad variation hasn’t been clicked in 4000 impressions, how many more impressions should I keep it going before I realize that the copy is a dog that don’t hunt? Especially when I have an ad that at 4000 impressions has 7 clicks. I think that if you have samples that are running closer together, then it would make sense to run it out to 100 clicks each.
As always great insights
I’ve been doing some pretty in-depth A/B split testing for over a year now with over 100,000 total sales, and I’ve seen some VERY interesting things happen.
I personally do all of my split tests out to between 1000 to 2000 total sales. Luckily I have a high daily sales volume to be able to run tests out to that many sales and still finish the test in under a week.
But the interesting thing I’ve seen in all of these high volume split tests is that the “winner” of the test may change many more times throughout the test than you would think if you stopped it earlier.
Sometimes I see Site A up 10% after 100 sales and then Site B is winning by 8 or 10% after 500 sales, and then Site A is back winning again after 1000 sales.
The only time I think you can truly tell a winner after as small of a data set as 100 sales is when one site version is really crushing the other site version by at least 20% or more.
From all of my data, when one version beats another by at least 20%, that’s when almost every subset of data at 50 sales, 100 sales, 500 sales, etc are all winners as well. When you get under 10% difference between 2 versions, they seem to “flip-flop” back and forth quite a bit more than most people would think.
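This kind of flip-flopping is easy to reproduce with simulated traffic. The sketch below is not based on the commenter's data: it feeds visitors to two hypothetical site versions with fixed true conversion rates and counts how often the leader in cumulative sales changes. The function name, rates, visitor counts, and seeds are all assumptions chosen for illustration.

```python
import random


def count_lead_changes(rate_a, rate_b, visitors_per_arm, seed=0):
    """Simulate an A/B test and count how many times the sales leader flips.

    Each arm receives one visitor per step; a visitor converts with the
    arm's true rate. Ties do not count as a change of leader.
    Returns (lead_changes, sales_a, sales_b).
    """
    rng = random.Random(seed)
    sales_a = sales_b = 0
    leader = None
    changes = 0
    for _ in range(visitors_per_arm):
        sales_a += rng.random() < rate_a
        sales_b += rng.random() < rate_b
        if sales_a == sales_b:
            continue
        current = "A" if sales_a > sales_b else "B"
        if leader is not None and current != leader:
            changes += 1
        leader = current
    return changes, sales_a, sales_b


# Hypothetical rates: B is truly 5% better than A, then 30% better.
for seed in range(3):
    print("5% lift: ", count_lead_changes(0.050, 0.0525, 20000, seed))
    print("30% lift:", count_lead_changes(0.050, 0.0650, 20000, seed))
```

Running it over a handful of seeds gives a feel for how long a small true difference can stay hidden behind lead changes, compared with a large one.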
Also, interesting things happen when you split test opt-in capture vs trying directly for the immediate sale without trying to get an optin…
The non optin page will get a much higher initial conversion rate for the first 500 or so sales, and the opt-in version of the site will be losing badly. However, once you run that to 1000 or 1500 sales or more, many times the version of the site that captures optins will catch up to the non optin version, as many of those optins start to convert to sales through the autoresponders series.
It’s all very interesting!
Mike G.
http://www.smallbusinessinternetsecrets.net/
I have certainly seen the kind of flip-flopping and variation that Mike just reported.
I want to back up Suneel, too, and emphasize that business needs often trump waiting for ideal statistical information. For instance, I end a lot of tests not because I think the sample size is adequate and there’s a clear winner — there rarely is — but because there is no clear winner. I just don’t have time to wait around for two fairly well-matched ads to duke it out! Either one of them is obviously dominating, or not.
I have seen a lot of flip-flopping, sure. But on the other hand, I’ve never seen a larger sample size reveal a major lead, either. Significant dominance doesn’t emerge from tests that were previously roughly equal.
There are many circumstances in which I have very low confidence that two ads will be different in the future — maybe they will and maybe they won’t, but who cares? Because I have quite a high confidence that neither of them is going to rock my world. By ending such tests, I might miss an incrementally superior ad, which would have proven its power in a longer test. But I’m not interested in running extremely long tests to find incrementally superior ads.
I only take winners that are superior enough to give me greater confidence sooner.
Does 30 come from the direct marketing world, or did Perry Marshall make it up?
I’ve questioned the 30 rule of thumb from the beginning, and have long suspected it was based on mis-applying the rule of thumb that the sum of 30 trials of a binomial (coin toss) test does a good job of converging to a normal bell shaped distribution. Not 30 successes, 30 trials. But that’s when the probability of success and failure are both close to .5.
An A/B split test is a different animal, and there should be an accurate statistical measure of what is enough. And Google’s rule of thumb of 100 successes may be correct, on average. But it should also depend on the CTR: closer to .5 likely converges faster.
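One way to see why a rule stated in successes is plausible, and how the rate enters, is the normal approximation to the binomial. This is a back-of-the-envelope sketch, not a statement of how Google arrived at its number. Here n is the number of trials (clicks or visitors), p the true rate, and k = np the expected number of successes; the symbols are introduced only for this illustration.

```latex
\mathrm{SE}(\hat{p}) \approx \sqrt{\frac{p(1-p)}{n}},
\qquad
\frac{\mathrm{SE}(\hat{p})}{p} \approx \sqrt{\frac{1-p}{np}} = \sqrt{\frac{1-p}{k}},
\qquad k = np
```

For small p the relative error is roughly 1/sqrt(k): about 18% with 30 successes and 10% with 100. At a fixed number of successes, the factor sqrt(1-p) shrinks as p moves toward .5, which agrees with the point that higher rates converge faster.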
Thanks Glenn, good post as ever.
I was there as you know, and I agree that when G speaks, we need to listen, as they have all the testing data in the world.
So I personally will be adopting the “100 clicks” rule in what I do with my advertiser accounts.
Another thing I learned from the presentation was to adopt (where congruent with your own pages) Google’s own “look and feel”, text, bullets, flow etc. since they have tested these to destruction (literally) and what you see is their clear winner after a testing regime no other organisation on Earth can undertake.
PS for Roslyn Garavaglia – yes, Google takes the age of your account, campaigns, ad groups, ad texts, and keywords into account all the time – it’s a crucial metric for them and why establishing good CTR as quickly as possible is so important. Good question!
I think 100 is way too much if you see a dog, LP or Ad not performing.. I think 30 is more realistic.
On a side note – does anyone know if Google Website Optimizer hooks into Google Conversion Metrics? For example, if you are an affiliate running the Website Optimizer on LPs but can only track click-outs to the merchants, not actual sales (as the conversion) … even though Google conversion is tracking for keywords, but not for Google Optimizer?
Wow! Glenn,
That’s big news. I appreciate the tip.
I think I’ll take a look at the test results at the 30 actions level and the 100 actions level and see if there is a difference.
100 clicks sounds doable but 100 CONVERSIONS could be a tad bit difficult, especially if we’re in a market where each conversion takes quite a bit of effort.
It’s good to be able to hear the thoughts of the Google Gods though so it’s definitely something worth bearing in mind when executing campaigns.
Good post Doc!
Would love to see some shares/uploads of the test results to A/B Tests community site. We are also on Twitter @abtests
Cmon, 100 clicks is NOTHING. Absolutely nothing. Even if they all came from a single keyword and one ad variation. Conversions fluctuate by time of the day (before work/lunch/after work/late night surfers), day of the week, whether it’s downtown or suburbs, time of the month (after salary/before salary), weather, holidays – thousands of reasons; I agree that 100 conversions is a minimum you MUST acquire for each combination. If you see a consistent difference of 20% or more, you could act faster, say, after 60 or 80 conversions, but only if you’re 100% sure what is causing the improvement. If you’re not experienced, just get inspired by the landing pages of your 2 biggest/best competitors and run an A/B test first on totally different versions – then repeat the process IF you’re seeing this consistent 20%+ difference.
If you still think you’re wiser than Google, get yourself a couple of popular books on probability theory and statistics; you will see that even 1000 trials is not really statistically significant.
What Google might be saying, really, is that you need 10,000 clicks to be sure (based on 100 conversions per variation if we run a simple A/B test at a 5% average conversion rate). Thus for opt-in conversions or simple leadgen forms where the conversion rate averages 20-40% we can settle for 30-50 conversions IMO, since it’s much easier to convince someone to give up their email/name than to pull out a credit card and make a purchase, so even little changes can have serious effects.
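The comment doesn't show how the click figure is derived, so here is the arithmetic as a small sketch. The function name and the default of two variations are assumptions for illustration. Taking the stated inputs at face value (100 conversions per variation, 5% conversion rate, a simple A/B test), it works out to 2,000 clicks per variation, or 4,000 in total; a 10,000-click total would correspond to a rate closer to 2%.

```python
import math


def clicks_needed(conversions_per_variation, conversion_rate, variations=2):
    """Total clicks required to expect the target conversions in every variation."""
    if not 0 < conversion_rate <= 1:
        raise ValueError("conversion_rate must be in (0, 1]")
    # The tiny offset guards against floating-point noise pushing ceil() up by one.
    per_variation = math.ceil(conversions_per_variation / conversion_rate - 1e-9)
    return per_variation * variations


print(clicks_needed(100, 0.05))  # simple A/B test at a 5% conversion rate
print(clicks_needed(100, 0.02))  # same target at a 2% conversion rate
print(clicks_needed(30, 0.30))   # opt-in style page at a 30% conversion rate
```

The last line illustrates the comment's other point: at opt-in style conversion rates, even a modest conversion target is reached with very few clicks.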
Hey Glenn! I KNEW it!!!
A pox on these propellorhead ‘know nothing know-it-alls’!
Sure, lots of this stuff is counter-intuitive, but it still boils down to some common sense!
Thanks a Ton!
Owen
I recall my shock 6 years ago when running conversion tests on a campaign that got hundreds of optins and dozens of sales in a day that it still took absolutely no less than two full weeks for me to get statistically stable & reliable numbers that didn’t shift back and forth.
Seemed to suggest even then that the ‘Rule of 30’ was awfully nice for clicks, but no good for the complexity of conversions in most markets.