Statistical Significance

WHY

A numeric result is statistically significant when it is very unlikely to have occurred by natural variation (aka chance) alone; by convention, when the p-value is below 0.05.

WHEN

In general, a significance check is required for every numeric result, e.g.:

  • Comparison of two cohorts
  • A change in a measure (is it natural variation or a real shift in the metric?)
  • Summarization of data: average, median

HOW

UseCase1

CohortA's average is higher than CohortB's average, but is the difference statistically significant?

Tool: https://www.evanmiller.org/ab-testing/t-test.html Note that it also supports entering summary values (mean, std, user count) instead of a list of raw values.
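
The same check can be scripted; a minimal sketch using SciPy's two-sample t-test (the cohort values are made-up illustration data):

  # Welch's t-test on raw values; scipy.stats.ttest_ind_from_stats covers
  # the case where only summary values (mean, std, count) are available.
  from scipy import stats

  cohort_a = [12.1, 14.3, 11.8, 15.0, 13.2, 14.8, 12.9]
  cohort_b = [11.0, 12.4, 10.9, 13.1, 11.7, 12.2, 11.5]

  # equal_var=False gives Welch's t-test, which does not assume equal variances.
  t_stat, p_value = stats.ttest_ind(cohort_a, cohort_b, equal_var=False)
  print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
  # Convention: p < 0.05 -> the difference in averages is significant.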

UseCase2

Does the success rate differ between two groups? (aka proportion testing, very commonly used in A/B testing)

Tool: https://www.evanmiller.org/ab-testing/chi-squared.html
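
A local equivalent, sketched with SciPy's chi-squared test on a 2x2 contingency table (the counts are made-up examples):

  from scipy.stats import chi2_contingency

  # Rows: group A / group B; columns: successes / failures.
  observed = [[120, 880],   # group A: 12.0% success rate
              [150, 850]]   # group B: 15.0% success rate

  chi2, p_value, dof, expected = chi2_contingency(observed)
  print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
  # p < 0.05 -> the success rates of the two groups differ significantly.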

UseCase3

A metric's trend has changed: is it just natural (expected) variation, or is the value so unlikely (given its history) that it's a real shift?

Tool: time-series analysis with CausalImpact (an R package). Feed it the metric's history plus control series not affected by the change, and it validates whether the shift in trend is significant.
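
A minimal sketch, assuming the pycausalimpact Python port of the R package and synthetic data with a known shift at t=70:

  import numpy as np
  import pandas as pd
  from causalimpact import CausalImpact  # pip install pycausalimpact

  np.random.seed(0)
  x = 100 + np.cumsum(np.random.randn(100))   # control series (unaffected)
  y = 1.2 * x + np.random.randn(100)          # metric of interest
  y[70:] += 10                                # real shift starting at t=70

  data = pd.DataFrame({"y": y, "x": x})       # response must be the first column
  ci = CausalImpact(data, [0, 69], [70, 99])  # pre-period, post-period
  print(ci.summary())  # reports whether the post-period shift is significant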

UseCase4

CohortA's median is higher than CohortB's median, but is the difference statistically significant?

Tool: Mood's median test http://www.real-statistics.com/non-parametric-tests/moods-median-test-two-samples/ or the Wilcoxon rank-sum test http://www.real-statistics.com/non-parametric-tests/wilcoxon-rank-sum-test/ (equivalent to the Mann-Whitney test http://www.real-statistics.com/non-parametric-tests/mann-whitney-test/; for more than two groups use the Kruskal-Wallis test instead)
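
Both tests are also available in SciPy; a minimal sketch (the cohort values are made up):

  from scipy.stats import median_test, mannwhitneyu

  cohort_a = [3, 5, 7, 9, 11, 13, 15, 4, 8]
  cohort_b = [2, 4, 5, 6, 8, 9, 10, 3, 7]

  # Mood's median test.
  stat, p_mood, grand_median, table = median_test(cohort_a, cohort_b)
  print(f"Mood's median test: p = {p_mood:.4f}")

  # Wilcoxon rank-sum, i.e. the Mann-Whitney U test.
  u_stat, p_mw = mannwhitneyu(cohort_a, cohort_b, alternative="two-sided")
  print(f"Mann-Whitney: p = {p_mw:.4f}")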

UseCase5

How many users are needed before a change can be confidently declared significant? Very commonly needed for A/B tests, but also useful for getting a feel for a minimum cohort size.

Tool: https://www.evanmiller.org/ab-testing/sample-size.html
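
The same calculation can be scripted with statsmodels power analysis; a minimal sketch (the baseline and target success rates are example assumptions):

  from statsmodels.stats.proportion import proportion_effectsize
  from statsmodels.stats.power import NormalIndPower

  # Detect a lift from a 12% to a 15% success rate,
  # at the usual 5% significance level with 80% power.
  effect = proportion_effectsize(0.12, 0.15)
  n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8,
                                   alternative="two-sided")
  print(f"~{n:.0f} users needed per group")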

UseCase6

The average is x, but the variation within the underlying values is large, so how confident are we in it?

Tool: report the average with a 95% confidence interval: [x - 1.96*std/sqrt(n), x + 1.96*std/sqrt(n)], where std is the standard deviation of the values and n is their count (std/sqrt(n) is the standard error of the mean).
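
A minimal sketch of that interval with NumPy (the values are made up):

  import numpy as np

  values = np.array([12.1, 14.3, 11.8, 15.0, 13.2, 14.8, 12.9, 13.5])
  mean = values.mean()
  sem = values.std(ddof=1) / np.sqrt(len(values))  # standard error of the mean

  low, high = mean - 1.96 * sem, mean + 1.96 * sem
  print(f"average = {mean:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
  # A wide interval means the average alone is not a reliable summary.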

WHAT CAUSED IT?

Isolate potential causes as hypotheses and test them.

A/B testing method: randomize or hold constant every factor (seasonality, time of day, type of user, etc.) except the hypothesis being tested, e.g. a UX change, so that when a significant change happens we know it was caused by the UX change.
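
One common way to get that randomization in practice is deterministic hash-based bucketing of user IDs; a minimal sketch (the experiment name and 50/50 split are illustrative assumptions):

  import hashlib

  def assign_group(user_id: str, experiment: str = "ux_change") -> str:
      # Hashing user_id + experiment gives a stable pseudo-random bucket,
      # independent of seasonality, time of day, user type, etc.
      digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
      return "A" if int(digest, 16) % 2 == 0 else "B"

  print(assign_group("user-42"))  # the same user always lands in the same group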
