A/B testing, or split testing, is a method of comparing two versions of something — such as a web page, email or advert — by showing each to a random share of your audience and measuring which performs better against a chosen goal. Because the audience is split randomly and only one element differs, it isolates the effect of that change.

What is statistical significance in A/B testing?

Statistical significance is a measure of how unlikely an observed difference between two versions would be if there were really no difference at all. A common threshold is 95% confidence, meaning that if the two versions truly performed the same, a gap this large would show up by chance only about 5% of the time. It is a confidence check, not absolute proof, and should be paired with a sensible sample size.

How large a sample size do I need?

It depends on your current conversion rate and how big a change you want to detect — smaller effects need larger samples. The key principle is to decide the required sample before you start and wait until you reach it, rather than stopping the moment a result looks good. Free sample-size calculators can estimate the number for you.

Why do A/B tests give misleading results?

Usually because of avoidable mistakes: stopping the test as soon as it looks favourable, not gathering enough data, changing several things at once so you cannot tell what worked, running it too briefly to capture normal variation, or focusing on a difference too small to matter in practice.

A/B Testing Explained: A Plain-English Guide

Marketing is full of confident opinions about what works — which headline, which colour, which subject line. A/B testing replaces those opinions with evidence. Instead of arguing, you run a fair experiment and let real behaviour decide. It is one of the most powerful tools in marketing, but only when done properly; done carelessly, it produces confident-sounding nonsense. This guide covers the four things that separate a real test from wishful thinking: hypotheses, sample size, significance and pitfalls.

What it is

A/B testing — also called split testing — is a method of comparing two versions of something by showing each to a random portion of your audience and measuring which performs better against a defined goal. Version A is usually the existing "control"; version B is the variation with one change.

The power comes from a single design choice: because users are split randomly and only one element differs between A and B, any difference in results beyond random chance can be attributed to that element. Everything else (the season, your traffic mix, the day of the week) affects both versions equally on average and cancels out. That is what makes A/B testing a genuine experiment rather than a guess, and why it underpins serious conversion rate optimisation.
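One common way to get a stable random split is to hash each visitor's identifier, so the same person always sees the same version. The sketch below assumes a string user ID and a made-up experiment name ("cta-test"); neither comes from any particular testing tool.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "cta-test") -> str:
    """Assign a user to 'A' or 'B' with roughly 50/50 odds.

    Hashing the experiment name together with the user ID means the same
    user always lands in the same group for this experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # a number from 0 to 99
    return "A" if bucket < 50 else "B"

# Illustrative check: the split should come out close to even.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1
print(counts)
```

Including the experiment name in the hash keeps a user's group in one test from determining their group in the next.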

The purpose of an A/B test is not to prove you were right. It is to find out what is true. Going in hoping for a particular result is the first step towards fooling yourself.

Start with a hypothesis

A good test begins before you touch anything, with a clear hypothesis. A hypothesis is not "let's try a green button"; it is a specific, testable statement with three parts:

The change — what exactly you will alter (and only that).
The expected effect — what you predict will happen and, ideally, why.
The metric — the single number you will judge it by.

For example: "Changing the call-to-action from 'Submit' to 'Get my free quote' will increase form completions, because it states the benefit and reduces hesitation." That is testable. It names the change, the expected outcome and the metric — form completions — you will measure.

Defining the metric in advance is vital. If you decide afterwards which number to look at, you will always find one that flatters the result. It is the same discipline that makes marketing ROI measurement trustworthy: choose what counts as success before you see the data.
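If it helps to make that commitment concrete, the three parts of a hypothesis can be written down as a small record before the test starts. This is only an illustrative sketch; the class and field names are invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the plan cannot be edited after the fact
class Hypothesis:
    change: str           # what exactly you will alter
    expected_effect: str  # what you predict will happen, and why
    metric: str           # the single number you will judge it by

plan = Hypothesis(
    change="CTA text: 'Submit' -> 'Get my free quote'",
    expected_effect="More form completions, because the label states the benefit",
    metric="form_completions",
)
print(plan.metric)
```

Because the record is frozen, any attempt to swap the metric once results arrive raises an error instead of passing quietly.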

Sample size: gather enough data

Here is the rule beginners break most often: a result means nothing until you have enough data. A version that looks like a runaway winner after twenty visitors is almost certainly noise. To know whether a difference is real, you need a sufficient sample size — enough people in each group to see a stable pattern rather than random fluctuation.

How much is enough depends on two things:

Factor	Effect on required sample
Your current conversion rate	Lower rates need larger samples
The size of effect you want to detect	Smaller improvements need larger samples

A tiny improvement is genuinely hard to detect: distinguishing a real 1% lift from random noise takes a lot of data. A big improvement reveals itself sooner. The practical move is to estimate the required sample size before you start, using one of the many free A/B test calculators, and then commit to running the test until you reach it. Measurements become more stable as samples grow; this is a basic statistical principle, the kind the Office for National Statistics relies on across its work.
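For a sense of what those calculators do, here is a rough sketch using the standard normal-approximation formula for comparing two conversion rates. The 5% significance level and 80% power defaults are common conventions assumed here, not figures from this guide, and real calculators may use slightly different formulas.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed in EACH group for a two-sided test.

    baseline      -- current conversion rate, e.g. 0.05 for 5%
    relative_lift -- smallest improvement worth detecting, e.g. 0.20 for +20%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Smaller effects and lower baseline rates both push the number up.
print(sample_size_per_variant(0.05, 0.20))  # 5% baseline, +20% relative lift
print(sample_size_per_variant(0.05, 0.10))  # same baseline, smaller lift
print(sample_size_per_variant(0.02, 0.20))  # lower baseline, same lift
```

With a 5% baseline and a 20% relative lift, this comes out at roughly eight thousand visitors per version, which is why low-traffic pages are so hard to test.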

Statistical significance: is the difference real?

Once you have enough data, you need a way to judge whether the difference between A and B is real or just chance. That is what statistical significance estimates.

In plain terms, significance answers: "If there were truly no difference between these versions, how likely is it I'd see a gap this large by luck alone?" A common threshold is 95% confidence, which means accepting a result only when that likelihood is below roughly 5%. Most testing tools calculate this for you and report it as a confidence level or p-value.
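As a rough illustration of what a tool does behind the scenes, the sketch below computes a p-value with a two-proportion z-test, one common method among several. The visitor and conversion counts are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates.

    Assumes both groups are large enough for the normal approximation
    and that at least one conversion and one non-conversion occurred.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate if A and B were identical
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: A converts 400 of 8,000 visitors, B converts 480 of 8,000.
p = two_proportion_p_value(400, 8_000, 480, 8_000)
print(f"p-value: {p:.4f}")
print("Clears the 95% bar" if p < 0.05 else "Does not clear the 95% bar")
```

A p-value below 0.05 is what tools mean when they report "95% confidence".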

Two cautions keep you honest here:

Significance is a confidence check, not proof. Even at 95% confidence, there is a real chance the result is wrong. It reduces the odds of fooling yourself; it does not eliminate them.
Significance is not the same as importance. A result can be statistically significant yet so small it makes no practical difference to your business. Always ask how big the effect is, not just whether it cleared the confidence bar.

Two questions, not one: "Is this difference likely real?" and "Is it big enough to care about?" A change can pass the first and fail the second.
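The two questions can be checked separately. The sketch below builds an approximate 95% confidence interval for the difference in conversion rates, then compares it with a minimum worthwhile lift. That threshold (half a percentage point) and the traffic figures are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def lift_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                  confidence: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval for (rate of B minus rate of A)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err

MIN_WORTHWHILE_LIFT = 0.005  # hypothetical: 0.5 percentage points

# Hypothetical high-traffic test: 5.00% vs 5.25% on 200,000 visitors each.
low, high = lift_interval(10_000, 200_000, 10_500, 200_000)

likely_real = low > 0 or high < 0         # interval excludes "no difference"
big_enough = low >= MIN_WORTHWHILE_LIFT   # even the cautious estimate matters

print(f"Lift is between {low:.4f} and {high:.4f}")
print(f"Likely real: {likely_real}; big enough to care about: {big_enough}")
```

In this example the difference is likely real but falls short of the chosen threshold, so it passes the first question and fails the second.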

The pitfalls that ruin tests

Most failed A/B tests are not bad luck; they are avoidable mistakes. Watch for these:

Stopping early. Ending the test the moment it looks favourable — sometimes called "peeking" — is the cardinal sin. Results swing wildly early on; an apparent winner today can reverse by next week. Set your sample size and run to it.
Testing too many changes at once. If you change the headline, the image and the button together, a better result tells you something worked, but not what. Isolate one variable so you actually learn.
Running it too briefly. A test that does not span normal cycles — weekdays versus weekends, paydays, seasonal swings — can capture an unrepresentative slice of behaviour.
Ignoring external events. A sale, an outage, a press mention or a holiday can distort results. Note what else was happening during the test.
Chasing tiny effects on low traffic. If a page gets fifty visits a month, you will never gather enough data to test it meaningfully. Test where the volume is.
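To see why stopping early is so damaging, the simulation sketch below runs A/A tests, where both versions are identical, so every "winner" is a false alarm. It compares checking once at the end with checking at every interim look and stopping at the first favourable one. The conversion rate, traffic volume, number of looks and random seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):  # no variation yet, so nothing to compare
        return 1.0
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(runs: int = 300, visitors: int = 1_000, check_every: int = 100,
             rate: float = 0.10, seed: int = 42) -> tuple[float, float]:
    """Return (false-alarm rate when peeking, false-alarm rate at the end)."""
    rng = random.Random(seed)
    peeking_alarms = fixed_alarms = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        saw_winner_while_peeking = False
        p = 1.0
        for n in range(1, visitors + 1):
            conv_a += rng.random() < rate  # both versions share the same true rate
            conv_b += rng.random() < rate
            if n % check_every == 0:       # an interim look at the results
                p = p_value(conv_a, n, conv_b, n)
                if p < 0.05:
                    saw_winner_while_peeking = True
        peeking_alarms += saw_winner_while_peeking
        fixed_alarms += p < 0.05           # the last look is the planned end point
    return peeking_alarms / runs, fixed_alarms / runs

peek_rate, fixed_rate = simulate()
print(f"False winners when peeking: {peek_rate:.1%}")
print(f"False winners when waiting: {fixed_rate:.1%}")
```

Waiting for the planned sample keeps false alarms near the nominal 5%, while stopping at the first favourable look typically pushes the rate well above it.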

Avoiding these is mostly about patience and honesty. The whole value of testing evaporates if you bend the process to get the answer you wanted. That evidence-led humility carries across all of marketing measurement — as CM Beyer argues in its guide to measuring marketing ROI without overcomplicating it, a few numbers you trust and act on beat a pile of figures you quietly massage.

The bottom line

A/B testing turns marketing arguments into experiments: split your audience at random, change one thing, and measure. Begin with a clear hypothesis that names the change, the expected effect and the metric; gather a large enough sample size before drawing conclusions; use statistical significance to gauge whether a difference is real, while also asking whether it is big enough to matter; and steer clear of the classic pitfalls, above all stopping early. Done with patience and honesty, A/B testing is how you replace opinion with evidence — and keep improving for good.
