Class 0x0C: Hypothesis testing and significance tests

Different cases of hypothesis and significance testing

Simple hypothesis tests

Test statistic:
one or more statistics t, each a function of the observations x.
Critical region:
a region defined by some limits on the test statistic(s). This is the "rejection region" for H_0.
Significance level \alpha of a critical region:
Probability for a result to be in the critical region if H_0 is true ("Type I error").
False negative probability \beta of a critical region:
Probability for a result to be outside the critical region if H_1 is true ("Type II error"). 1-\beta is called the statistical power of the test.
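
Written out explicitly (using W for the critical region and g(t|H) for the p.d.f. of the test statistic under hypothesis H, notation introduced here for convenience), these definitions read

\alpha = \int_W g(t|H_0)\,dt , \qquad \beta = \int_{\bar{W}} g(t|H_1)\,dt = 1 - \int_W g(t|H_1)\,dt .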

Constructing a good test statistic

Example 1: unrealistic light bulb models

Suppose we have one model H_0 that the p.d.f. for light bulb lifetime T is given by

\frac{dP}{dT} = f_0(T) = \mu_0^{-1} e^{-T/\mu_0}

for some known \mu_0, and another model H_1 which is the same except that the mean is \mu_1, also known.

What's the Neyman-Pearson test statistic?

t = \frac{\prod_i f_0(T_i)}{\prod_i f_1(T_i)}.

Since we're just going to compare it to a value k, we can just as well use

\log t = \sum_i \log f_0(T_i) - \sum_i \log f_1(T_i).

In this case, this is simply

\log t = N \log(\mu_1/\mu_0) + (\mu_1^{-1}-\mu_0^{-1})\sum T_i.
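
To see this, note that each observation contributes

\log f_0(T_i) - \log f_1(T_i) = \left(-\log\mu_0 - T_i/\mu_0\right) - \left(-\log\mu_1 - T_i/\mu_1\right) = \log(\mu_1/\mu_0) + (\mu_1^{-1}-\mu_0^{-1})\,T_i ,

and summing over the N observations gives the expression above.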

Let's take \mu_1 < \mu_0. The critical region defined by t \leq k (that is, \log t \leq \log k) can be rewritten as

\frac{1}{N} \sum T_i \leq \frac{N^{-1}\log k - \log(\mu_1/\mu_0)}{\mu_1^{-1}-\mu_0^{-1}}.

Stated in words: "reject" the larger-lifetime hypothesis if the observed mean is smaller than some amount. Adjust that amount to get the desired \alpha. This can be done assuming a Gaussian distribution for the mean if N is large; otherwise, evaluate it analytically or using MC methods.
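
For instance, here is a minimal C++ sketch (not part of classCh_example1.cc) of the Gaussian-approximation version of this calculation: under H_0 the mean of N exponential lifetimes is approximately Gaussian with mean \mu_0 and standard deviation \mu_0/\sqrt{N} (for an exponential the standard deviation equals the mean), so the cut sits at \mu_0 plus the \alpha-quantile of that Gaussian. The standard normal quantile is obtained by bisection on std::erfc.

    #include <cmath>
    #include <iostream>

    // standard normal quantile, by bisection on Phi(z) = 0.5*erfc(-z/sqrt(2))
    double normalQuantile(double p)
    {
      double lo = -10.0, hi = 10.0;
      for (int i = 0; i < 100; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (0.5 * std::erfc(-mid / std::sqrt(2.0)) < p) lo = mid;
        else                                            hi = mid;
      }
      return 0.5 * (lo + hi);
    }

    int main()
    {
      const double mu0   = 1.0;   // mean lifetime under H_0
      const int    N     = 100;   // number of observed bulbs
      const double alpha = 0.05;  // desired significance level (placeholder)

      // Under H_0 the sample mean is approximately Gaussian with mean mu0 and
      // standard deviation mu0/sqrt(N), so a left-tail cut with probability
      // alpha sits at mu0 + z_alpha*mu0/sqrt(N), where z_alpha < 0 for alpha < 0.5.
      const double cut = mu0 + normalQuantile(alpha) * mu0 / std::sqrt(double(N));

      std::cout << "reject H_0 if the sample mean is <= " << cut << "\n";
      return 0;
    }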

Histograms of the test statistics from example 1

These plots were made with the code in classCh_example1.cc with N=100, \mu_0=1.0, and \mu_1=0.67.

Figure: MC histograms and c.d.f. estimates (normalized cumulative sums).
Figure: MC histograms superposed.
Figure: c.d.f. of hypothesis 0 and 1-c.d.f. of hypothesis 1.

For any given value x chosen as the decision criterion on the test statistic, the value of \alpha is given by the c.d.f. of the test statistic assuming H_0 (red curve above), and \beta is given by 1-c.d.f. of the test statistic assuming H_1 (blue curve above). Generally one chooses \alpha first and then finds the necessary value of x for the decision. The Neyman-Pearson lemma says that \beta is as low as it can be for that \alpha. Note that there is in general no particular advantage to setting \alpha=\beta, although there may be reason to do so in some cases.
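
For reference, here is a minimal self-contained C++ sketch (simpler than, and not identical to, classCh_example1.cc) that estimates \alpha and \beta by MC for a chosen decision value x on the sample mean; filling histograms of the test statistic, as in the figures above, is a straightforward extension.

    #include <iostream>
    #include <random>

    int main()
    {
      const int    N = 100;       // bulbs per dataset
      const int    M = 100000;    // MC datasets per hypothesis
      const double mu0 = 1.0, mu1 = 0.67;
      const double xcut = 0.84;   // chosen decision value x on the sample mean (placeholder)

      std::mt19937 rng(12345);
      // std::exponential_distribution is parameterized by the rate 1/mean
      std::exponential_distribution<double> expo0(1.0 / mu0), expo1(1.0 / mu1);

      int nBelow0 = 0, nBelow1 = 0;
      for (int m = 0; m < M; ++m) {
        double sum0 = 0.0, sum1 = 0.0;
        for (int i = 0; i < N; ++i) {
          sum0 += expo0(rng);     // dataset simulated under H_0
          sum1 += expo1(rng);     // dataset simulated under H_1
        }
        if (sum0 / N <= xcut) ++nBelow0;   // H_0 result lands in the critical region
        if (sum1 / N <= xcut) ++nBelow1;   // H_1 result lands in the critical region
      }

      // alpha = P(reject H_0 | H_0 true), beta = P(accept H_0 | H_1 true)
      std::cout << "alpha ~ " << double(nBelow0) / M << "\n";
      std::cout << "beta  ~ " << 1.0 - double(nBelow1) / M << "\n";
      return 0;
    }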

Example 2: More realistic light bulb models

Suppose we have one model H_0 that the p.d.f. for light bulb lifetime T is given by

\frac{dP}{dT} = f_0(T) = \mu^{-1} e^{-T/\mu}

for some unknown \mu, and another model H_1

\frac{dP}{dT} = f_1(T) = \left\{ \begin{array}{ll} (h-| T-\mu |)/h^2 & \text{if~} | T-\mu |<h,\\ 0 & \text{otherwise} \end{array}\right. ,

for some unknown \mu and h. Construct the test statistic as before, and compare the best fit for H_0 to the best fit for H_1.
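
In other words, the log of the likelihood ratio is now formed from the maximized likelihoods, something like

\log t = \max_{\mu} \sum_i \log f_0(T_i;\mu) - \max_{\mu,h} \sum_i \log f_1(T_i;\mu,h) ,

and its distribution under H_0 is no longer easy to obtain analytically.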

Here you might want to evaluate the significance level \alpha for a given k using an MC simulation.

Example 3: applying a signal/background cut

(... see discussion in hypothesis test section of [PDG-Stat] ...)

Two-hypothesis significance test

Why the statistical significance isn't the probability you'd like

[*]See Comment on Bayesian statistics.

Goodness-of-fit or one-hypothesis significance test

Again, we have a test statistic, which I'll call T.

The p-value

The p-value is what the hypothetical model H says the probability should be to find the statistic T in the region of equal or lesser compatibility with H than the observed value t: that is, p = P_T(t) assuming H is true.
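
For example, if larger values of T correspond to worse compatibility with H, this is

p = P(T \geq t\,|\,H) = 1 - F_T(t) ,

where F_T is the c.d.f. of T under H; if smaller values are worse (as for L_\text{max} in example 4 below), the inequality is reversed.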

[†]The proof is the reverse of the derivation of the inverse-distribution-function method of generating a random variable.

Why this statistical significance isn't the probability you'd like (II)

Example 4: exponential plus background

How good is the fit of the exponential-plus-background model to the data in the last assignment? Let's use the best-fit likelihood L_\text{max} as our test statistic. We'll get the p.d.f. for L_\text{max} using MC simulation.

Procedure for building up the p.d.f. of L_\text{max}:
    make a histogram to store the p.d.f.
    loop M times:
        simulate a dataset using the hypothesis
        fit the dataset
        "fill" the histogram using L_\text{max}

Procedure for simulating a dataset:
    loop N times:
        generate a random variable x according to the model p.d.f. for x
        (see class notes on MC simulation; use the inverse distribution method)
        store x in a vector of doubles to be used as the dataset
        (instead of reading x from a file)

Now just read off the p-value from this histogram: according to the simulation, if the hypothesis is true, what fraction of the L_\text{max} values would be worse (smaller) than what you got for the actual data?
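
Here is a minimal self-contained C++ sketch of this procedure. To keep it short, it uses a pure exponential in place of the exponential-plus-background model so that the "fit" step has a closed form (the maximum-likelihood estimate of \mu is the sample mean \bar{x}, giving \log L_\text{max} = -N(\log\bar{x}+1)); for the actual exercise, replace fitLogLmax with the fitter from the last assignment and simulate from the full model p.d.f. It also counts the fraction of simulated experiments whose \log L_\text{max} comes out worse (smaller) than the value for the real data directly, rather than filling a histogram first; the result is the same p-value. The parameter values are placeholders.

    #include <cmath>
    #include <cstdlib>
    #include <iostream>
    #include <random>
    #include <vector>

    // For a *pure* exponential p.d.f. f(x;mu) = exp(-x/mu)/mu, the fit has a
    // closed form: the maximum-likelihood estimate of mu is the sample mean,
    // and log L_max = -N*(log(mean) + 1).  For the real exercise, replace this
    // with the exponential-plus-background fitter from the last assignment.
    double fitLogLmax(const std::vector<double>& data)
    {
      double sum = 0.0;
      for (double x : data) sum += x;
      const double mean = sum / data.size();
      return -double(data.size()) * (std::log(mean) + 1.0);
    }

    int main(int argc, char** argv)
    {
      if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <log Lmax of the real data>\n";
        return 1;
      }
      const double logLmaxData = std::atof(argv[1]);

      const int    M  = 10000;  // number of simulated experiments (placeholder)
      const int    N  = 500;    // events per experiment: match the real dataset
      const double mu = 1.0;    // hypothesized (best-fit) mean (placeholder)

      std::mt19937 rng(42);
      std::uniform_real_distribution<double> uni(0.0, 1.0);

      int nWorse = 0;
      for (int m = 0; m < M; ++m) {
        // simulate a dataset with the inverse-distribution method: x = -mu*ln(1-u)
        std::vector<double> data;
        data.reserve(N);
        for (int i = 0; i < N; ++i) data.push_back(-mu * std::log(1.0 - uni(rng)));
        // a "worse" fit means a smaller best-fit likelihood than the real data gave
        if (fitLogLmax(data) <= logLmaxData) ++nWorse;
      }
      std::cout << "p-value ~ " << double(nWorse) / M << "\n";
      return 0;
    }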

Assignment

Choose either "option A" or "option B" below -- you do not have to do both.

Option A:
Complete example 4 above.
Option B:
See below.

Assignment option B: globular clusters

Are the 119 globular clusters in the Arp 1965 catalog uniformly distributed in \cos \theta, where \theta is galactic latitude?

Comment on Bayesian statistics

Even though P(H_0|\text{observation}) is meaningless as a probability in the sense of "the fraction of possible universes in which H_0 would turn out to be true given that we made these observations", some people like to use Bayes' law anyway to characterize and update what they call their "subjective degree of belief". Rather than using objective data for P(H_0), they use that term in Bayes' law to reflect their "prior subjective beliefs" ("priors" for short), deliberately introducing this as something that can only be changed by statistically significant evidence to the contrary. This approach has caused a lot of controversy over the years. In my opinion, there is nothing wrong with this as long as one evaluates the resulting p.d.f.s as carefully as possible and keeps in mind the limitations. However, it can go badly wrong if the "prior" pre-assigns very low probability to what the data actually ends up indicating, even if the "prior" is based on little or no relevant data: for an extreme case, see "The Logic of Intelligence Failure" by Bruce G. Blair [Blair2004], actually written by a proponent of this way of thinking. I won't talk about that further today.

References

In the following, (R) indicates a review, (I) indicates an introductory text, and (A) indicates an advanced text.

Probability:

PDG-Prob:

(R) "Probability", G. Cowan, in Review of Particle Physics, C. Amsler et al., PL B667, 1 (2008) and 2009 partial update for the 2010 edition (http://pdg.lbl.gov).

See also general references cited in PDG-Prob.

Statistics:

PDG-Stat:

(R) "Statistics", G. Cowan, in Review of Particle Physics, C. Amsler et al., PL B667, 1 (2008) and 2009 partial update for the 2010 edition (http://pdg.lbl.gov).

See also general references cited in PDG-Stat.

Larson:
(I) Introduction to Probability Theory and Statistical Inference, 3rd ed., H.J. Larson, Wiley (1982).
NumRecip:
(A) Numerical Recipes, W.H. Press, et al., Cambridge University Press (2007).

Other cited works:

Blair2004:
B.G. Blair, "The Logic of Intelligence Failure", Forum on Physics and Society Newsletter, April, 2004; http://www.aps.org/units/fps/newsletters/2004/april/article3.html browsed 2010/06/01.
KamLAND2002:
KamLAND collaboration, Phys. Rev. Lett. 90, 021802 (2003); arXiv:hep-ex/0212021.
KamLAND2004:
KamLAND collaboration, Phys. Rev. Lett. 94, 081801 (2005); arXiv:hep-ex/0406035.