Hypothesis Test Calculator
Perform one-sample t-tests, two-sample t-tests, Z-tests, proportion tests, and chi-square goodness-of-fit tests instantly. Get the test statistic, exact p-value, critical values, and a full step-by-step solution — all explained in plain English.
Built by He Loves Math for statistics students, researchers, and data analysts who need fast, reliable, explained results.
The Core Formula
Every hypothesis test reduces to: calculate how far your data is from the null hypothesis in standard-error units.
Compare to a critical value (or compute the p-value) to decide: is this difference too large to be explained by chance alone?
Hypothesis Test Calculator — 5 Test Types
Select a test type, enter your values, and click Calculate. The calculator shows the test statistic, exact p-value, critical value, confidence interval (where applicable), and a full step-by-step breakdown.
Note for the chi-square option: the goodness-of-fit test is always right-tailed, since χ² ≥ 0 and only large values indicate a discrepancy.
P-values use statistical approximations. Critical values use embedded tables. Always verify important results with statistical software (R, SPSS, Python scipy.stats).
What Is Hypothesis Testing?
Hypothesis testing is one of the most fundamental tools in inferential statistics — the branch of statistics concerned with drawing conclusions about a population from sample data. Every day, scientists, doctors, engineers, and business analysts use hypothesis tests to make evidence-based decisions: Is this drug more effective than a placebo? Does this manufacturing process produce parts within tolerance? Is there a relationship between customer age and product preference?
The core idea is deceptively simple. You start with a default assumption called the null hypothesis (H₀) — usually that there is no effect, no difference, or no relationship. You then collect sample data and ask: if H₀ were true, how likely would it be to observe data as extreme as what we got? If the answer is "very unlikely" (probability below your chosen threshold α), you reject H₀ in favour of the alternative hypothesis (H₁).
Hypothesis testing never proves that H₀ is false. It only tells you how strongly your data weigh against it. This is a crucial philosophical point: you "reject" or "fail to reject" — you never "accept" or "prove."
The 5 Universal Steps of Hypothesis Testing
- State the hypotheses. Formulate H₀ (null hypothesis — the status quo) and H₁ (alternative — what you want to show). For a two-tailed t-test: H₀: μ = μ₀ vs. H₁: μ ≠ μ₀. The alternative determines whether your test is two-tailed, left-tailed, or right-tailed.
- Set the significance level (α). Choose α before collecting data — usually 0.05, sometimes 0.01 or 0.10. α is the maximum probability of making a Type I error (falsely rejecting H₀) you are willing to accept. Choosing after seeing the data is "p-hacking" and invalidates the test.
- Calculate the test statistic. Use the appropriate formula for your data type and research question. The test statistic converts your observed data into a standardised number that can be compared to a known probability distribution (t, Z, χ², F, etc.).
- Determine the p-value and/or critical value. The p-value is \( P(\text{test statistic this extreme} \mid H_0 \text{ true}) \). The critical value is the boundary: if |test statistic| > critical value, reject H₀. Both approaches always give the same decision.
- Make a decision and interpret. "Reject H₀" or "Fail to reject H₀." Then translate this statistical decision into a meaningful conclusion in plain language, acknowledging the effect size and practical significance.
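The five steps above can be sketched in code. This is a minimal illustration using `scipy.stats` (which the article recommends for verification); the sample values and the hypothesised mean μ₀ = 50 are made-up numbers, not data from the article.

```python
# Sketch of the 5 steps for a one-sample, two-tailed t-test.
# The sample values and mu0 are illustrative assumptions.
from scipy import stats

# Step 1: hypotheses -- H0: mu = 50 vs. H1: mu != 50 (two-tailed)
mu0 = 50.0
sample = [52.1, 49.3, 55.0, 53.2, 51.8, 53.7, 50.9, 54.2]

# Step 2: significance level, chosen BEFORE looking at the data
alpha = 0.05

# Steps 3-4: test statistic and exact two-tailed p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# Step 5: decision and interpretation
decision = "Reject H0" if p_value < alpha else "Fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```

The decision rule in step 5 is mechanical; the interpretation (what rejecting H₀ means for the research question, and how large the effect actually is) still has to be written in plain language.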
Type I and Type II Errors
Type I Error (False Positive)
Rejecting H₀ when H₀ is actually true. Denoted by α — the significance level is literally the probability of making this error. Example: concluding a drug works when it doesn't.
Type II Error (False Negative)
Failing to reject H₀ when H₀ is actually false. Denoted by β. Example: concluding a drug doesn't work when it actually does. Power = 1 − β = probability of correctly detecting a real effect.
There is always a trade-off: decreasing α (being more conservative about Type I errors) increases β (making Type II errors more likely), and vice versa. This is why sample size matters — larger n increases power, allowing you to reduce both types of error simultaneously.
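The sample-size effect on power can be checked with a small Monte Carlo sketch: simulate many datasets from a population where H₀ is genuinely false, and count how often the test rejects. The true mean 0.5, σ = 1, and the sample sizes here are illustrative choices, not values from the article.

```python
# Monte Carlo estimate of power (1 - beta) for a two-tailed one-sample
# t-test of H0: mu = 0 when the true mean is 0.5 and sigma = 1.
# All scenario numbers are illustrative assumptions.
import random
from scipy import stats

def estimated_power(n, true_mean=0.5, alpha=0.05, trials=2000, seed=42):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        if p < alpha:
            rejections += 1
    return rejections / trials  # fraction of correct rejections

power_small = estimated_power(n=10)
power_large = estimated_power(n=50)
print(power_small, power_large)
```

With the same α, the larger sample detects the same real effect far more often, which is the "larger n increases power" claim made concrete.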
The t-Test — When σ Is Unknown
The Student's t-test, developed by William Sealy Gosset in 1908 (published under the pseudonym "Student"), is the most widely used hypothesis test in science. It tests hypotheses about population means when the population standard deviation σ is unknown — which is almost always the case in real research.
One-Sample t-Test
Tests whether a single sample mean differs from a hypothesised value μ₀:

\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}, \qquad df = n - 1 \]

Where \(\bar{x}\) = sample mean, \(\mu_0\) = hypothesised population mean, \(s\) = sample standard deviation, \(n\) = sample size, and \(s/\sqrt{n}\) = the standard error of the mean (SEM).
The 95% confidence interval for μ is:

\[ \bar{x} \pm t_{0.025,\,n-1} \cdot \frac{s}{\sqrt{n}} \]
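The one-sample t statistic and its confidence interval can be computed directly from these definitions and cross-checked against `scipy.stats`. The sample data and μ₀ below are illustrative.

```python
# One-sample t statistic and 95% CI computed from the formulas,
# verified against scipy's built-in test. Data are illustrative.
import math
from scipy import stats

sample = [2.3, 1.9, 2.7, 2.5, 2.1, 2.8, 2.4, 2.6, 2.2, 2.5]
mu0 = 2.0
n = len(sample)
xbar = sum(sample) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # sample SD
sem = s / math.sqrt(n)                                         # standard error

t_stat = (xbar - mu0) / sem              # t = (xbar - mu0) / (s / sqrt(n))
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-tailed critical value, alpha=0.05
ci = (xbar - t_crit * sem, xbar + t_crit * sem)

# The hand-computed statistic matches scipy exactly
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)
assert math.isclose(t_stat, t_scipy, rel_tol=1e-9)
print(t_stat, ci)
```

Note the CI here excludes μ₀ = 2.0, which is the confidence-interval form of rejecting H₀ in a two-tailed test.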
Two-Sample Independent t-Test
Tests whether the means of two independent groups differ. There are two versions, depending on whether the population variances are assumed equal: the pooled (Student's) t-test, with \( t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}} \) and \( df = n_1 + n_2 - 2 \), and Welch's t-test, which uses separate variances and an approximate df.
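Both versions are available through `scipy.stats.ttest_ind`, switched by the `equal_var` flag. The two groups below are illustrative data.

```python
# Pooled (Student's) vs Welch's two-sample t-test.
# Group measurements are illustrative assumptions.
from scipy import stats

group_a = [23.1, 25.4, 22.8, 26.0, 24.3, 25.1, 23.7]
group_b = [27.2, 28.9, 26.5, 29.3, 27.8, 28.1]

# Pooled test: assumes equal population variances, df = n1 + n2 - 2
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's test: no equal-variance assumption (a safer default)
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(p_pooled, p_welch)
```

When the group variances really are similar, as here, the two versions give nearly identical results; they diverge when variances or sample sizes differ substantially.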
The Chi-Square Test — For Categorical Data
The chi-square (χ²) test is used when analysing categorical data — data that falls into distinct categories rather than being measured on a continuous scale. It compares observed frequencies to expected frequencies:

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad df = k - 1 \]
Where \(O_i\) = observed frequency in category \(i\), \(E_i\) = expected frequency in category \(i\), and \(k\) = number of categories. The chi-square statistic is always non-negative (\(\chi^2 \geq 0\)), and the test is always right-tailed — we reject H₀ when χ² is large (large discrepancies between observed and expected).
Key assumption: All expected frequencies \(E_i \geq 5\). If this fails, consider combining adjacent categories or using Fisher's exact test.
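A classic goodness-of-fit example is testing whether a die is fair. The observed counts below are illustrative; note every expected count is 10, satisfying the \(E_i \geq 5\) assumption.

```python
# Chi-square goodness-of-fit: do 60 illustrative die rolls match a
# fair-die distribution (10 expected per face)?
from scipy import stats

observed = [8, 12, 9, 11, 6, 14]        # illustrative counts, sum = 60
expected = [10, 10, 10, 10, 10, 10]     # H0: fair die; all E_i >= 5

# chi2 = sum((O - E)^2 / E), compared to chi-square with df = k - 1 = 5
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)
```

Here χ² = 4.2 is well below the df = 5 critical value of 11.07 at α = 0.05, so the data are consistent with a fair die. (`scipy.stats.chisquare` requires the observed and expected totals to match.)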
The Z-Test — When σ Is Known
The Z test statistic, \( Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \), is compared to the standard normal distribution N(0, 1). Common critical values: two-tailed α=0.05 → ±1.960; two-tailed α=0.01 → ±2.576; right-tailed α=0.05 → 1.645.
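Because the reference distribution is the standard normal, a Z-test needs nothing beyond the standard library. The scenario numbers (μ₀, σ, x̄, n) below are illustrative.

```python
# One-sample, two-tailed Z-test with the Python standard library only.
# mu0, sigma, xbar, n are illustrative assumptions.
import math

mu0, sigma = 100.0, 15.0     # hypothesised mean; KNOWN population SD
xbar, n = 104.5, 40          # observed sample mean and size

z = (xbar - mu0) / (sigma / math.sqrt(n))

def norm_sf(x):
    # Standard normal survival function P(Z > x) via the error function
    return 0.5 * math.erfc(x / math.sqrt(2))

p_two_tailed = 2 * norm_sf(abs(z))
reject = abs(z) > 1.960      # two-tailed critical value at alpha = 0.05
print(z, p_two_tailed, reject)
```

This example lands just short of significance (|z| ≈ 1.90 < 1.960), a reminder that a p-value near 0.05 is weak evidence either way.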
P-Value vs Critical Value Approach
Both approaches always lead to the same conclusion. Use whichever you find more intuitive:
| Approach | Method | Decision Rule |
|---|---|---|
| Critical Value | Look up the critical value for your α and df; compare to \|test statistic\| | Reject H₀ if \|test stat\| > critical value |
| P-Value | Compute the probability of the observed test statistic under H₀ | Reject H₀ if p-value < α |
| Confidence Interval | Compute the 95% CI for the parameter | Reject H₀ if μ₀ falls outside the CI (two-tailed only) |
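That the three decision rules always agree can be verified on any dataset. The sample and μ₀ below are illustrative; the final assertion holds for any two-tailed one-sample t-test.

```python
# The three decision rules from the table, applied to one dataset,
# reach the same conclusion. Sample data are illustrative.
from scipy import stats

sample = [5.6, 6.1, 5.9, 6.4, 5.8, 6.2, 6.0, 6.3]
mu0, alpha = 5.5, 0.05
n = len(sample)

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
sem = stats.sem(sample)                 # standard error of the mean
xbar = sum(sample) / n
ci = (xbar - t_crit * sem, xbar + t_crit * sem)

reject_by_critical = abs(t_stat) > t_crit
reject_by_pvalue = p_value < alpha
reject_by_ci = not (ci[0] <= mu0 <= ci[1])
assert reject_by_critical == reject_by_pvalue == reject_by_ci
print(reject_by_pvalue)
```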
Effect Size — Beyond Statistical Significance
Statistical significance tells you whether an effect exists. Effect size tells you how large it is. With large samples, even trivial effects become statistically significant. Always report effect size alongside p-values.
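One common effect-size measure for mean comparisons is Cohen's d, the difference expressed in standard-deviation units. The data below are illustrative; the benchmarks (≈0.2 small, ≈0.5 medium, ≈0.8 large) are Cohen's conventional ones.

```python
# Cohen's d for a one-sample comparison: d = (xbar - mu0) / s.
# Sample values and mu0 are illustrative assumptions.
import math

sample = [72, 75, 71, 78, 74, 76, 73, 77]
mu0 = 70
n = len(sample)
xbar = sum(sample) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))

d = (xbar - mu0) / s   # effect size in SD units
print(round(d, 2))
```

Unlike a p-value, d does not shrink or grow with sample size, so it answers "how big is the effect?" rather than "is there one?".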
Quick Reference: Which Test to Use?
| Research Question | Data Type | Test | Statistic |
|---|---|---|---|
| Is sample mean different from a known value? (σ unknown) | Continuous | One-sample t-test | t with df=n−1 |
| Is sample mean different from a known value? (σ known) | Continuous | Z-test for mean | Z ~ N(0,1) |
| Do two independent groups have different means? | Continuous | Two-sample t-test | t with df=n₁+n₂−2 |
| Is a sample proportion different from a known value? | Binary/proportion | One-proportion Z-test | Z ~ N(0,1) |
| Does observed distribution match expected? | Categorical | Chi-square GoF | χ² with df=k−1 |
| Are two categorical variables independent? | Categorical | Chi-square test of independence | χ² with df=(r−1)(c−1) |
| Do ≥3 independent groups have different means? | Continuous | One-way ANOVA (F-test) | F with df=(k−1, N−k) |
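Of the tests in the table, the one-proportion Z-test is the only one not worked through above; it follows the same pattern. The counts and p₀ below are illustrative.

```python
# One-proportion Z-test: H0: p = 0.5 vs. two-sided H1.
# successes, n, and p0 are illustrative assumptions.
import math

successes, n, p0 = 58, 100, 0.5
p_hat = successes / n
se = math.sqrt(p0 * (1 - p0) / n)            # standard error under H0
z = (p_hat - p0) / se

# Two-tailed normal p-value: 2 * P(Z > |z|) via the error function
p_value = math.erfc(abs(z) / math.sqrt(2))
print(z, p_value)
```

Here 58/100 gives z = 1.6 and p ≈ 0.11, so this sample alone is not enough evidence that the true proportion differs from one half.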
Frequently Asked Questions
What is a hypothesis test?
A hypothesis test is a formal statistical procedure for evaluating whether sample data provides sufficient evidence to reject a default assumption (the null hypothesis, H₀). Using a test statistic and a chosen significance level α, you determine whether observed differences are large enough to be considered statistically significant — i.e., unlikely to have occurred by random chance if H₀ were true.
What does the p-value actually mean?
The p-value is the probability of observing a test statistic as extreme as (or more extreme than) your calculated value, assuming H₀ is true. It is NOT the probability that H₀ is true, nor the probability your result is due to chance. A p-value of 0.03 means: if H₀ were true, there's only a 3% chance of getting data this extreme. Since 3% < 5% (α), you reject H₀.
When should I use a t-test vs a Z-test?
Use a Z-test when the population standard deviation (σ) is known. This is rare in practice — usually σ is unknown. Use a t-test when σ must be estimated from the sample standard deviation (s). For large samples (n ≥ 30), the t-distribution closely approximates the normal distribution, so the choice matters less.
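The convergence of t to Z can be seen directly by comparing critical values as the degrees of freedom grow:

```python
# Two-tailed 5% critical values: t approaches the normal 1.960 as df grows,
# which is why the t-vs-Z choice matters little for large samples.
from scipy import stats

for df in [5, 30, 100, 1000]:
    print(df, round(stats.t.ppf(0.975, df=df), 3))

t30 = stats.t.ppf(0.975, df=30)      # ~2.042
z = stats.norm.ppf(0.975)            # ~1.960
```

Already at df = 30 the gap is under 0.1, consistent with the usual n ≥ 30 rule of thumb.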
What are Type I and Type II errors?
Type I error (false positive): rejecting H₀ when it is true. Probability = α. Type II error (false negative): failing to reject H₀ when it is false. Probability = β. Statistical power = 1 − β = probability of correctly detecting a real effect. Increasing sample size increases power while keeping α fixed.
What are degrees of freedom?
Degrees of freedom (df) is the number of values in a calculation that are free to vary. For a one-sample t-test: df = n − 1 (one df is "used up" estimating the mean). For a two-sample pooled t-test: df = n₁ + n₂ − 2. For chi-square GoF: df = k − 1 (k = categories). Higher df means less uncertainty in the estimated variance and more precise critical values.
What is statistical significance?
A result is statistically significant when p < α, meaning the observed effect is unlikely under H₀. Statistical significance only indicates that an effect exists — it says nothing about its size or practical importance. A large sample can make a tiny difference statistically significant. Always pair significance with effect size (Cohen's d, odds ratio, etc.).
What is two-tailed vs one-tailed testing?
Two-tailed tests detect differences in either direction (μ ≠ μ₀). One-tailed tests detect differences in only one direction (μ > μ₀ or μ < μ₀). Two-tailed tests are more conservative (harder to reject H₀) and are generally preferred unless there is a strong theoretical reason to expect a specific direction — and that reason must be stated before seeing the data.
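For a symmetric distribution like t, the two-tailed p-value is exactly twice the one-tailed p-value (when the effect lies in the predicted direction), which is why one-tailed tests reject more easily. The t statistic and df below are illustrative.

```python
# One- vs two-tailed p-values for the same t statistic.
# t_stat and df are illustrative assumptions.
from scipy import stats

t_stat, df = 2.10, 15
p_right = stats.t.sf(t_stat, df)          # one-tailed (right) p-value
p_two = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed p-value

assert abs(p_two - 2 * p_right) < 1e-12
print(p_right, p_two)
```

This particular statistic is significant one-tailed (p ≈ 0.027) but not two-tailed (p ≈ 0.053), illustrating exactly the temptation that makes after-the-fact one-tailed tests invalid.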
What does "fail to reject" mean?
"Fail to reject H₀" means your data does not provide sufficient evidence to conclude that H₁ is true at your chosen α level. It does NOT mean H₀ is true or proven correct. It is possible that H₀ is false but your sample was too small, the effect is too small, or there is too much variability to detect it — all scenarios leading to a Type II error.
Related Tools at He Loves Math
External reference: statsmodels (Python) · Khan Academy – Significance Tests
More Statistics Tools at He Loves Math
He Loves Math provides expert-built, student-friendly calculators for statistics, mathematics, science, and finance. Every result comes with the theory, formulas, and worked steps so you understand the answer — not just the number.
