
Hypothesis Test Calculator

Perform one-sample t-tests, two-sample t-tests, Z-tests, proportion tests, and chi-square goodness-of-fit tests instantly. Get the test statistic, exact p-value, critical values, and a full step-by-step solution — all explained in plain English.

Built by He Loves Math for statistics students, researchers, and data analysts who need fast, reliable, explained results.

The Core Formula

Every hypothesis test reduces to: calculate how far your data is from the null hypothesis in standard-error units.

$$\text{Test Statistic} = \frac{\text{Observed} - \text{Hypothesised}}{\text{Standard Error}}$$

Compare to a critical value (or compute the p-value) to decide: is this difference too large to be explained by chance alone?

Hypothesis Test Calculator — 5 Test Types

Select a test type, enter your values, and click Calculate. The calculator shows the test statistic, exact p-value, critical value, confidence interval (where applicable), and a full step-by-step breakdown.


P-values use statistical approximations. Critical values use embedded tables. Always verify important results with statistical software (R, SPSS, Python scipy.stats).

What Is Hypothesis Testing?

Hypothesis testing is one of the most fundamental tools in inferential statistics — the branch of statistics concerned with drawing conclusions about a population from sample data. Every day, scientists, doctors, engineers, and business analysts use hypothesis tests to make evidence-based decisions: Is this drug more effective than a placebo? Does this manufacturing process produce parts within tolerance? Is there a relationship between customer age and product preference?

The core idea is deceptively simple. You start with a default assumption called the null hypothesis (H₀) — usually that there is no effect, no difference, or no relationship. You then collect sample data and ask: if H₀ were true, how likely would it be to observe data as extreme as what we got? If the answer is "very unlikely" (probability below your chosen threshold α), you reject H₀ in favour of the alternative hypothesis (H₁).

Hypothesis testing never proves that H₀ is false. It only tells you how strongly your data weighs against it. This is a crucial philosophical point: you "reject" or "fail to reject" — you never "accept" or "prove."

The 5 Universal Steps of Hypothesis Testing

  1. State the hypotheses. Formulate H₀ (null hypothesis — the status quo) and H₁ (alternative — what you want to show). For a two-tailed t-test: H₀: μ = μ₀ vs. H₁: μ ≠ μ₀. The alternative determines whether your test is two-tailed, left-tailed, or right-tailed.
  2. Set the significance level (α). Choose α before collecting data — usually 0.05, sometimes 0.01 or 0.10. α is the maximum probability of making a Type I error (falsely rejecting H₀) you are willing to accept. Choosing after seeing the data is "p-hacking" and invalidates the test.
  3. Calculate the test statistic. Use the appropriate formula for your data type and research question. The test statistic converts your observed data into a standardised number that can be compared to a known probability distribution (t, Z, χ², F, etc.).
  4. Determine the p-value and/or critical value. The p-value is \( P(\text{test statistic this extreme} \mid H_0 \text{ true}) \). The critical value is the boundary: if |test statistic| > critical value, reject H₀. Both approaches always give the same decision.
  5. Make a decision and interpret. "Reject H₀" or "Fail to reject H₀." Then translate this statistical decision into a meaningful conclusion in plain language, acknowledging the effect size and practical significance.
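The five steps above can be sketched in a few lines of Python with scipy.stats (the sample data and hypothesised mean below are made up for illustration):

```python
from scipy import stats

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4, 5.1]  # hypothetical measurements
mu0 = 5.0        # Step 1: H0: mu = 5.0 vs H1: mu != 5.0 (two-tailed)
alpha = 0.05     # Step 2: significance level, chosen before looking at the data

# Steps 3-4: test statistic and two-tailed p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# Step 5: decision
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(t_stat, p_value, decision)
```

Here the sample mean is 5.1 and t ≈ 1.41 with df = 7, which is not extreme enough to reject H₀ at α = 0.05.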

Type I and Type II Errors

Type I Error (False Positive)

Rejecting H₀ when H₀ is actually true. Denoted by α — the significance level is literally the probability of making this error. Example: concluding a drug works when it doesn't.

$$P(\text{Type I}) = \alpha$$

Type II Error (False Negative)

Failing to reject H₀ when H₀ is actually false. Denoted by β. Example: concluding a drug doesn't work when it actually does. Power = 1 − β = probability of correctly detecting a real effect.

$$P(\text{Type II}) = \beta \qquad \text{Power} = 1 - \beta$$

There is always a trade-off: decreasing α (being more conservative about Type I errors) increases β (making Type II errors more likely), and vice versa. This is why sample size matters — larger n increases power, allowing you to reduce both types of error simultaneously.

The t-Test — When σ Is Unknown

The Student's t-test, developed by William Sealy Gosset in 1908 (published under the pseudonym "Student"), is the most widely used hypothesis test in science. It tests hypotheses about population means when the population standard deviation σ is unknown — which is almost always the case in real research.

One-Sample t-Test

Tests whether a single sample mean differs from a hypothesised value μ₀.

One-Sample t-Test $$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \qquad \text{df} = n - 1$$

Where \(\bar{x}\) = sample mean, \(\mu_0\) = hypothesised population mean, \(s\) = sample standard deviation, \(n\) = sample size, and \(s/\sqrt{n}\) = the standard error of the mean (SEM).

The 95% confidence interval for μ is:

$$\bar{x} \pm t_{\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}$$
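A worked one-sample example from these formulas, with hypothetical numbers (x̄ = 52, μ₀ = 50, s = 6, n = 36):

```python
import math
from scipy import stats

xbar, mu0, s, n = 52.0, 50.0, 6.0, 36
sem = s / math.sqrt(n)                  # standard error of the mean = 1.0
t = (xbar - mu0) / sem                  # t = 2.0
df = n - 1                              # df = 35
p = 2 * stats.t.sf(abs(t), df)          # two-tailed p-value, about 0.053

# 95% CI: xbar +/- t_crit * SEM
t_crit = stats.t.ppf(0.975, df)
ci = (xbar - t_crit * sem, xbar + t_crit * sem)
print(t, p, ci)
```

Note the consistency between the two approaches: p is just above 0.05, and the 95% CI just barely contains μ₀ = 50, so both lead to "fail to reject" at α = 0.05.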

Two-Sample Independent t-Test

Tests whether the means of two independent groups differ. There are two versions depending on whether population variances are assumed equal.

Pooled t-Test (Equal Variances) $$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}} \qquad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} \qquad \text{df} = n_1+n_2-2$$
Welch's t-Test (Unequal Variances — Generally Recommended) $$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad \text{df} \approx \frac{\left(\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1}+\dfrac{(s_2^2/n_2)^2}{n_2-1}}$$
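In scipy.stats, the same `ttest_ind` function computes both versions: `equal_var=True` gives the pooled test and `equal_var=False` gives Welch's test. The two groups below are hypothetical:

```python
from scipy import stats

group1 = [23.1, 24.5, 22.8, 25.0, 23.7, 24.2]
group2 = [21.0, 22.3, 20.8, 21.9, 22.5, 21.4]

pooled = stats.ttest_ind(group1, group2, equal_var=True)   # pooled s_p, df = n1+n2-2
welch = stats.ttest_ind(group1, group2, equal_var=False)   # Welch-Satterthwaite df
print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

With equal group sizes the two t statistics coincide; only the degrees of freedom (and hence the p-values) differ.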

The Chi-Square Test — For Categorical Data

The chi-square (χ²) test is used when analysing categorical data — data that falls into distinct categories rather than being measured on a continuous scale. It compares observed frequencies to expected frequencies.

Chi-Square Goodness-of-Fit $$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \qquad \text{df} = k - 1$$

Where \(O_i\) = observed frequency in category \(i\), \(E_i\) = expected frequency in category \(i\), and \(k\) = number of categories. The chi-square statistic is always non-negative (\(\chi^2 \geq 0\)), and the test is always right-tailed — we reject H₀ when χ² is large (large discrepancies between observed and expected).

Key assumption: All expected frequencies \(E_i \geq 5\). If this fails, consider combining adjacent categories or using Fisher's exact test.
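A quick goodness-of-fit sketch using `scipy.stats.chisquare`, with hypothetical counts from 120 rolls of a die tested against a fair-die expectation (all Eᵢ = 20, comfortably above 5):

```python
from scipy import stats

observed = [18, 22, 25, 17, 20, 18]   # hypothetical counts, sum = 120
expected = [20, 20, 20, 20, 20, 20]   # sum(O) must equal sum(E)

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
df = len(observed) - 1                # df = k - 1 = 5
print(chi2, p)                        # chi2 = 2.3, large p: no evidence of bias
```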

The Z-Test — When σ Is Known

Z-Test for Population Mean $$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
Z-Test for Population Proportion $$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \qquad \hat{p} = \frac{x}{n}$$

Z-tests are compared to the standard normal distribution (N(0,1)). Common critical values: two-tailed α=0.05 → ±1.960; two-tailed α=0.01 → ±2.576; right-tailed α=0.05 → 1.645.
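SciPy has no dedicated one-sample Z-test function, but the proportion formula above is easy to apply directly with `scipy.stats.norm`. The numbers here are hypothetical (58 successes in 100 trials, H₀: p₀ = 0.5):

```python
import math
from scipy.stats import norm

x, n, p0 = 58, 100, 0.5
p_hat = x / n
se = math.sqrt(p0 * (1 - p0) / n)       # standard error under H0
z = (p_hat - p0) / se                   # z = 1.6
p_value = 2 * norm.sf(abs(z))           # two-tailed p-value, about 0.11

z_crit = norm.ppf(0.975)                # 1.960 for two-tailed alpha = 0.05
print(z, p_value, z_crit)
```

Since |z| = 1.6 < 1.960 (equivalently, p ≈ 0.11 > 0.05), we fail to reject H₀.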

P-Value vs Critical Value Approach

Both approaches always lead to the same conclusion. Use whichever you find more intuitive:

| Approach | Method | Decision Rule |
| --- | --- | --- |
| Critical Value | Look up the critical value for your α and df | Reject H₀ if the absolute test statistic exceeds the critical value |
| P-Value | Compute the probability of the observed test statistic under H₀ | Reject H₀ if p-value < α |
| Confidence Interval | Compute the 95% CI for the parameter | Reject H₀ if μ₀ falls outside the CI (two-tailed only) |

Effect Size — Beyond Statistical Significance

Statistical significance tells you whether an effect exists. Effect size tells you how large it is. With large samples, even trivial effects become statistically significant. Always report effect size alongside p-values.

Cohen's d (for t-tests) $$d = \frac{\bar{x} - \mu_0}{s} \qquad \text{Benchmarks: } d=0.2 \text{ (small)},\; 0.5 \text{ (medium)},\; 0.8 \text{ (large)}$$
Cramér's V (for chi-square) $$V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1,\; c-1)}} \qquad V=0.1 \text{ (small)},\; 0.3 \text{ (medium)},\; 0.5 \text{ (large)}$$
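Both effect sizes are one-line computations; the inputs below are hypothetical (they are not tied to any earlier example):

```python
import math

# Cohen's d: xbar = 52, mu0 = 50, s = 6  ->  d = 0.33 (small-to-medium)
xbar, mu0, s = 52.0, 50.0, 6.0
d = (xbar - mu0) / s

# Cramer's V: chi2 = 10.5 from a 3x2 contingency table with n = 200
chi2, n, r, c = 10.5, 200, 3, 2
v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))
print(d, v)  # about 0.33 and 0.23
```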

Quick Reference: Which Test to Use?

| Research Question | Data Type | Test | Statistic |
| --- | --- | --- | --- |
| Is a sample mean different from a known value? (σ unknown) | Continuous | One-sample t-test | t with df = n−1 |
| Is a sample mean different from a known value? (σ known) | Continuous | Z-test for mean | Z ~ N(0,1) |
| Do two independent groups have different means? | Continuous | Two-sample t-test | t with df = n₁+n₂−2 |
| Is a sample proportion different from a known value? | Binary/proportion | One-proportion Z-test | Z ~ N(0,1) |
| Does an observed distribution match the expected one? | Categorical | Chi-square GoF | χ² with df = k−1 |
| Are two categorical variables independent? | Categorical | Chi-square test of independence | χ² with df = (r−1)(c−1) |
| Do ≥3 independent groups have different means? | Continuous | One-way ANOVA (F-test) | F with df = (k−1, N−k) |

Frequently Asked Questions

What is a hypothesis test?

A hypothesis test is a formal statistical procedure for evaluating whether sample data provides sufficient evidence to reject a default assumption (the null hypothesis, H₀). Using a test statistic and a chosen significance level α, you determine whether observed differences are large enough to be considered statistically significant — i.e., unlikely to have occurred by random chance if H₀ were true.

What does the p-value actually mean?

The p-value is the probability of observing a test statistic as extreme as (or more extreme than) your calculated value, assuming H₀ is true. It is NOT the probability that H₀ is true, nor the probability your result is due to chance. A p-value of 0.03 means: if H₀ were true, there's only a 3% chance of getting data this extreme. Since 3% < 5% (α), you reject H₀.

When should I use a t-test vs a Z-test?

Use a Z-test when the population standard deviation (σ) is known. This is rare in practice — usually σ is unknown. Use a t-test when σ must be estimated from the sample standard deviation (s). For large samples (n ≥ 30), the t-distribution closely approximates the normal distribution, so the choice matters less.
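The convergence claim is easy to verify: as n grows, the two-tailed t critical value approaches the Z critical value of 1.960.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)            # 1.960
for n in (5, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(n, round(t_crit, 3))            # shrinks toward 1.96 as n grows

print(round(z_crit, 3))
```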

What are Type I and Type II errors?

Type I error (false positive): rejecting H₀ when it is true. Probability = α. Type II error (false negative): failing to reject H₀ when it is false. Probability = β. Statistical power = 1 − β = probability of correctly detecting a real effect. Increasing sample size increases power while keeping α fixed.

What are degrees of freedom?

Degrees of freedom (df) is the number of values in a calculation that are free to vary. For a one-sample t-test: df = n − 1 (one df is "used up" estimating the mean). For a two-sample pooled t-test: df = n₁ + n₂ − 2. For chi-square GoF: df = k − 1 (k = categories). Higher df means less uncertainty in the estimated variance and more precise critical values.

What is statistical significance?

A result is statistically significant when p < α, meaning the observed effect is unlikely under H₀. Statistical significance only indicates that an effect exists — it says nothing about its size or practical importance. A large sample can make a tiny difference statistically significant. Always pair significance with effect size (Cohen's d, odds ratio, etc.).

What is two-tailed vs one-tailed testing?

Two-tailed tests detect differences in either direction (μ ≠ μ₀). One-tailed tests detect differences in only one direction (μ > μ₀ or μ < μ₀). Two-tailed tests are more conservative (harder to reject H₀) and are generally preferred unless there is a strong theoretical reason to expect a specific direction — and that reason must be stated before seeing the data.

What does "fail to reject" mean?

"Fail to reject H₀" means your data does not provide sufficient evidence to conclude that H₁ is true at your chosen α level. It does NOT mean H₀ is true or proven correct. It is possible that H₀ is false but your sample was too small, the effect is too small, or there is too much variability to detect it — all scenarios leading to a Type II error.
