
Hypothesis Test Calculator

Perform one-sample t-tests, two-sample t-tests, Z-tests, proportion tests, and chi-square goodness-of-fit tests instantly. Get the test statistic, exact p-value, critical values, and a full step-by-step solution — all explained in plain English.

Built by He Loves Math for statistics students, researchers, and data analysts who need fast, reliable, explained results.

The Core Formula

Every hypothesis test reduces to: calculate how far your data is from the null hypothesis in standard-error units.

$$\text{Test Statistic} = \frac{\text{Observed} - \text{Hypothesised}}{\text{Standard Error}}$$

Compare to a critical value (or compute the p-value) to decide: is this difference too large to be explained by chance alone?

Hypothesis Test Calculator — 5 Test Types

Select a test type, enter your values, and click Calculate. The calculator shows the test statistic, exact p-value, critical value, confidence interval (where applicable), and a full step-by-step breakdown.


P-values use statistical approximations. Critical values use embedded tables. Always verify important results with statistical software (R, SPSS, Python scipy.stats).

What Is Hypothesis Testing?

Hypothesis testing is one of the most fundamental tools in inferential statistics — the branch of statistics concerned with drawing conclusions about a population from sample data. Every day, scientists, doctors, engineers, and business analysts use hypothesis tests to make evidence-based decisions: Is this drug more effective than a placebo? Does this manufacturing process produce parts within tolerance? Is there a relationship between customer age and product preference?

The core idea is deceptively simple. You start with a default assumption called the null hypothesis (H₀) — usually that there is no effect, no difference, or no relationship. You then collect sample data and ask: if H₀ were true, how likely would it be to observe data as extreme as what we got? If the answer is "very unlikely" (probability below your chosen threshold α), you reject H₀ in favour of the alternative hypothesis (H₁).

Hypothesis testing never proves that H₀ is false. It only tells you how strongly your data weighs against it. This is a crucial philosophical point: you "reject" or "fail to reject" — you never "accept" or "prove."

The 5 Universal Steps of Hypothesis Testing

  1. State the hypotheses. Formulate H₀ (null hypothesis — the status quo) and H₁ (alternative — what you want to show). For a two-tailed t-test: H₀: μ = μ₀ vs. H₁: μ ≠ μ₀. The alternative determines whether your test is two-tailed, left-tailed, or right-tailed.
  2. Set the significance level (α). Choose α before collecting data — usually 0.05, sometimes 0.01 or 0.10. α is the maximum probability of making a Type I error (falsely rejecting H₀) you are willing to accept. Choosing after seeing the data is "p-hacking" and invalidates the test.
  3. Calculate the test statistic. Use the appropriate formula for your data type and research question. The test statistic converts your observed data into a standardised number that can be compared to a known probability distribution (t, Z, χ², F, etc.).
  4. Determine the p-value and/or critical value. The p-value is \( P(\text{test statistic this extreme} \mid H_0 \text{ true}) \). The critical value is the boundary: if |test statistic| > critical value, reject H₀. Both approaches always give the same decision.
  5. Make a decision and interpret. "Reject H₀" or "Fail to reject H₀." Then translate this statistical decision into a meaningful conclusion in plain language, acknowledging the effect size and practical significance.
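The five steps above can be sketched in a few lines of Python with scipy.stats (the sample data and hypothesised mean below are made up for illustration):

```python
from scipy import stats

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4, 5.1]  # hypothetical measurements
mu0 = 5.0        # Step 1: H0: mu = 5.0 vs H1: mu != 5.0 (two-tailed)
alpha = 0.05     # Step 2: significance level, chosen before looking at the data

# Steps 3-4: test statistic and two-tailed p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# Step 5: decision
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(t_stat, p_value, decision)
```

Here the sample mean is 5.1 and t ≈ 1.41 with df = 7, which is not extreme enough to reject H₀ at α = 0.05.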

Type I and Type II Errors

Type I Error (False Positive)

Rejecting H₀ when H₀ is actually true. Denoted by α — the significance level is literally the probability of making this error. Example: concluding a drug works when it doesn't.

$$P(\text{Type I}) = \alpha$$

Type II Error (False Negative)

Failing to reject H₀ when H₀ is actually false. Denoted by β. Example: concluding a drug doesn't work when it actually does. Power = 1 − β = probability of correctly detecting a real effect.

$$P(\text{Type II}) = \beta \qquad \text{Power} = 1 - \beta$$

There is always a trade-off: decreasing α (being more conservative about Type I errors) increases β (making Type II errors more likely), and vice versa. This is why sample size matters — larger n increases power, allowing you to reduce both types of error simultaneously.

The t-Test — When σ Is Unknown

The Student's t-test, developed by William Sealy Gosset in 1908 (published under the pseudonym "Student"), is the most widely used hypothesis test in science. It tests hypotheses about population means when the population standard deviation σ is unknown — which is almost always the case in real research.

One-Sample t-Test

Tests whether a single sample mean differs from a hypothesised value μ₀.

One-Sample t-Test $$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \qquad \text{df} = n - 1$$

Where \(\bar{x}\) = sample mean, \(\mu_0\) = hypothesised population mean, \(s\) = sample standard deviation, \(n\) = sample size, and \(s/\sqrt{n}\) = the standard error of the mean (SEM).

The 95% confidence interval for μ is:

$$\bar{x} \pm t_{\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}$$
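A worked one-sample example from these formulas, with hypothetical numbers (x̄ = 52, μ₀ = 50, s = 6, n = 36):

```python
import math
from scipy import stats

xbar, mu0, s, n = 52.0, 50.0, 6.0, 36
sem = s / math.sqrt(n)                  # standard error of the mean = 1.0
t = (xbar - mu0) / sem                  # t = 2.0
df = n - 1                              # df = 35
p = 2 * stats.t.sf(abs(t), df)          # two-tailed p-value, about 0.053

# 95% CI: xbar +/- t_crit * SEM
t_crit = stats.t.ppf(0.975, df)
ci = (xbar - t_crit * sem, xbar + t_crit * sem)
print(t, p, ci)
```

Note the consistency between the two approaches: p is just above 0.05, and the 95% CI just barely contains μ₀ = 50, so both lead to "fail to reject" at α = 0.05.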

Two-Sample Independent t-Test

Tests whether the means of two independent groups differ. There are two versions depending on whether population variances are assumed equal.

Pooled t-Test (Equal Variances) $$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}} \qquad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} \qquad \text{df} = n_1+n_2-2$$
Welch's t-Test (Unequal Variances — Generally Recommended) $$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad \text{df} \approx \frac{\left(\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1}+\dfrac{(s_2^2/n_2)^2}{n_2-1}}$$
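In scipy.stats, the same `ttest_ind` function computes both versions: `equal_var=True` gives the pooled test and `equal_var=False` gives Welch's test. The two groups below are hypothetical:

```python
from scipy import stats

group1 = [23.1, 24.5, 22.8, 25.0, 23.7, 24.2]
group2 = [21.0, 22.3, 20.8, 21.9, 22.5, 21.4]

pooled = stats.ttest_ind(group1, group2, equal_var=True)   # pooled s_p, df = n1+n2-2
welch = stats.ttest_ind(group1, group2, equal_var=False)   # Welch-Satterthwaite df
print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

With equal group sizes the two t statistics coincide; only the degrees of freedom (and hence the p-values) differ.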

The Chi-Square Test — For Categorical Data

The chi-square (χ²) test is used when analysing categorical data — data that falls into distinct categories rather than being measured on a continuous scale. It compares observed frequencies to expected frequencies.

Chi-Square Goodness-of-Fit $$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \qquad \text{df} = k - 1$$

Where \(O_i\) = observed frequency in category \(i\), \(E_i\) = expected frequency in category \(i\), and \(k\) = number of categories. The chi-square statistic is always non-negative (\(\chi^2 \geq 0\)), and the test is always right-tailed — we reject H₀ when χ² is large (large discrepancies between observed and expected).

Key assumption: All expected frequencies \(E_i \geq 5\). If this fails, consider combining adjacent categories or using Fisher's exact test.
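A quick goodness-of-fit sketch using `scipy.stats.chisquare`, with hypothetical counts from 120 rolls of a die tested against a fair-die expectation (all Eᵢ = 20, comfortably above 5):

```python
from scipy import stats

observed = [18, 22, 25, 17, 20, 18]   # hypothetical counts, sum = 120
expected = [20, 20, 20, 20, 20, 20]   # sum(O) must equal sum(E)

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
df = len(observed) - 1                # df = k - 1 = 5
print(chi2, p)                        # chi2 = 2.3, large p: no evidence of bias
```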

The Z-Test — When σ Is Known

Z-Test for Population Mean $$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
Z-Test for Population Proportion $$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \qquad \hat{p} = \frac{x}{n}$$

Z-tests are compared to the standard normal distribution (N(0,1)). Common critical values: two-tailed α=0.05 → ±1.960; two-tailed α=0.01 → ±2.576; right-tailed α=0.05 → 1.645.
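SciPy has no dedicated one-sample Z-test function, but the proportion formula above is easy to apply directly with `scipy.stats.norm`. The numbers here are hypothetical (58 successes in 100 trials, H₀: p₀ = 0.5):

```python
import math
from scipy.stats import norm

x, n, p0 = 58, 100, 0.5
p_hat = x / n
se = math.sqrt(p0 * (1 - p0) / n)       # standard error under H0
z = (p_hat - p0) / se                   # z = 1.6
p_value = 2 * norm.sf(abs(z))           # two-tailed p-value, about 0.11

z_crit = norm.ppf(0.975)                # 1.960 for two-tailed alpha = 0.05
print(z, p_value, z_crit)
```

Since |z| = 1.6 < 1.960 (equivalently, p ≈ 0.11 > 0.05), we fail to reject H₀.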

P-Value vs Critical Value Approach

Both approaches always lead to the same conclusion. Use whichever you find more intuitive:

| Approach | Method | Decision Rule |
| --- | --- | --- |
| Critical Value | Look up the critical value for your α and df | Reject H₀ if the absolute test statistic exceeds the critical value |
| P-Value | Compute the probability of the observed test statistic under H₀ | Reject H₀ if p-value < α |
| Confidence Interval | Compute the 95% CI for the parameter | Reject H₀ if μ₀ falls outside the CI (two-tailed only) |

Effect Size — Beyond Statistical Significance

Statistical significance tells you whether an effect exists. Effect size tells you how large it is. With large samples, even trivial effects become statistically significant. Always report effect size alongside p-values.

Cohen's d (for t-tests) $$d = \frac{\bar{x} - \mu_0}{s} \qquad \text{Benchmarks: } d=0.2 \text{ (small)},\; 0.5 \text{ (medium)},\; 0.8 \text{ (large)}$$
Cramér's V (for chi-square) $$V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1,\; c-1)}} \qquad V=0.1 \text{ (small)},\; 0.3 \text{ (medium)},\; 0.5 \text{ (large)}$$
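Both effect sizes are one-line computations; the inputs below are hypothetical (they are not tied to any earlier example):

```python
import math

# Cohen's d: xbar = 52, mu0 = 50, s = 6  ->  d = 0.33 (small-to-medium)
xbar, mu0, s = 52.0, 50.0, 6.0
d = (xbar - mu0) / s

# Cramer's V: chi2 = 10.5 from a 3x2 contingency table with n = 200
chi2, n, r, c = 10.5, 200, 3, 2
v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))
print(d, v)  # about 0.33 and 0.23
```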

Quick Reference: Which Test to Use?

| Research Question | Data Type | Test | Statistic |
| --- | --- | --- | --- |
| Is a sample mean different from a known value? (σ unknown) | Continuous | One-sample t-test | t with df = n−1 |
| Is a sample mean different from a known value? (σ known) | Continuous | Z-test for mean | Z ~ N(0,1) |
| Do two independent groups have different means? | Continuous | Two-sample t-test | t with df = n₁+n₂−2 |
| Is a sample proportion different from a known value? | Binary/proportion | One-proportion Z-test | Z ~ N(0,1) |
| Does an observed distribution match the expected one? | Categorical | Chi-square GoF | χ² with df = k−1 |
| Are two categorical variables independent? | Categorical | Chi-square test of independence | χ² with df = (r−1)(c−1) |
| Do ≥3 independent groups have different means? | Continuous | One-way ANOVA (F-test) | F with df = (k−1, N−k) |

Frequently Asked Questions

What is a hypothesis test?

A hypothesis test is a formal statistical procedure for evaluating whether sample data provides sufficient evidence to reject a default assumption (the null hypothesis, H₀). Using a test statistic and a chosen significance level α, you determine whether observed differences are large enough to be considered statistically significant — i.e., unlikely to have occurred by random chance if H₀ were true.

What does the p-value actually mean?

The p-value is the probability of observing a test statistic as extreme as (or more extreme than) your calculated value, assuming H₀ is true. It is NOT the probability that H₀ is true, nor the probability your result is due to chance. A p-value of 0.03 means: if H₀ were true, there's only a 3% chance of getting data this extreme. Since 3% < 5% (α), you reject H₀.

When should I use a t-test vs a Z-test?

Use a Z-test when the population standard deviation (σ) is known. This is rare in practice — usually σ is unknown. Use a t-test when σ must be estimated from the sample standard deviation (s). For large samples (n ≥ 30), the t-distribution closely approximates the normal distribution, so the choice matters less.
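The convergence claim is easy to verify: as n grows, the two-tailed t critical value approaches the Z critical value of 1.960.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)            # 1.960
for n in (5, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(n, round(t_crit, 3))            # shrinks toward 1.96 as n grows

print(round(z_crit, 3))
```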

What are Type I and Type II errors?

Type I error (false positive): rejecting H₀ when it is true. Probability = α. Type II error (false negative): failing to reject H₀ when it is false. Probability = β. Statistical power = 1 − β = probability of correctly detecting a real effect. Increasing sample size increases power while keeping α fixed.

What are degrees of freedom?

Degrees of freedom (df) is the number of values in a calculation that are free to vary. For a one-sample t-test: df = n − 1 (one df is "used up" estimating the mean). For a two-sample pooled t-test: df = n₁ + n₂ − 2. For chi-square GoF: df = k − 1 (k = categories). Higher df means less uncertainty in the estimated variance and more precise critical values.

What is statistical significance?

A result is statistically significant when p < α, meaning the observed effect is unlikely under H₀. Statistical significance only indicates that an effect exists — it says nothing about its size or practical importance. A large sample can make a tiny difference statistically significant. Always pair significance with effect size (Cohen's d, odds ratio, etc.).

What is two-tailed vs one-tailed testing?

Two-tailed tests detect differences in either direction (μ ≠ μ₀). One-tailed tests detect differences in only one direction (μ > μ₀ or μ < μ₀). Two-tailed tests are more conservative (harder to reject H₀) and are generally preferred unless there is a strong theoretical reason to expect a specific direction — and that reason must be stated before seeing the data.

What does "fail to reject" mean?

"Fail to reject H₀" means your data does not provide sufficient evidence to conclude that H₁ is true at your chosen α level. It does NOT mean H₀ is true or proven correct. It is possible that H₀ is false but your sample was too small, the effect is too small, or there is too much variability to detect it — all scenarios leading to a Type II error.
