AP® Statistics Free-Response Questions Past Paper 2025 Solution – He Loves Math

Q.1 Comparing Gas Mileage Distributions

Part A: Compare the distributions of gas mileage

To compare the two distributions, we should address center, shape, spread, and any unusual features (C.U.S.S.).

Center: The median gas mileage for cars from Country B (approximately 30 mpg) is substantially higher than the median for cars from Country A (approximately 18 mpg).
Shape: The distribution of gas mileage for Country A appears to be skewed to the right. The median (18) is closer to the first quartile (16) than to the third quartile (24), and the right whisker is longer than the left whisker. The presence of a high outlier also contributes to the right skew. The distribution for Country B appears to be skewed to the left, as its left whisker (from about 17 to 25) is noticeably longer than its right whisker (from about 36 to 40).
Spread: The variability in gas mileage differs between the two countries. Country B has a larger interquartile range (IQR ≈ 36 - 25 = 11 mpg) compared to Country A (IQR ≈ 24 - 16 = 8 mpg), indicating more variability in the middle 50% of its cars. The overall range for Country B (≈ 40 - 17 = 23 mpg) is similar to the range of Country A when its outlier is included (≈ 38 - 14 = 24 mpg).
Unusual Features: The distribution for Country A has one high outlier, a car with a gas mileage of approximately 38 mpg. There are no outliers shown for the Country B distribution.
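
The outlier claim for Country A can be checked with a short sketch. It assumes the boxplot flags outliers with the usual 1.5 × IQR rule and uses the approximate quartiles read from the plot above; the function name `is_outlier` is only illustrative.

```python
# Approximate quartiles read from the Country A boxplot (values quoted in the text).
q1, q3 = 16, 24
iqr = q3 - q1                     # interquartile range: 8 mpg

# Fences under the assumed 1.5 * IQR convention.
lower_fence = q1 - 1.5 * iqr      # 16 - 12 = 4.0
upper_fence = q3 + 1.5 * iqr      # 24 + 12 = 36.0

def is_outlier(value):
    """Return True if value lies outside the 1.5 * IQR fences."""
    return value < lower_fence or value > upper_fence

print(upper_fence, is_outlier(38))  # 36.0 True
```

Under that assumption, the car at about 38 mpg lies above the upper fence of 36 mpg, which agrees with it being drawn as an outlier.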

Part B: Mean vs. Median for Country A

For the distribution of gas mileage for Country A, one would expect the mean to be greater than the median.

Justification: The distribution of gas mileage for Country A is skewed to the right. This skewness is evident from the boxplot and is heavily influenced by the high outlier at approximately 38 mpg. In a right-skewed distribution, the mean is pulled toward the long tail (the higher values), making it greater than the median.
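
As a rough illustration of how a long right tail pulls the mean above the median, the sketch below uses a small hypothetical data set. These seven values are invented for illustration only and are not the actual Country A data; they merely mimic its right-skewed shape with one high value.

```python
from statistics import mean, median

# Hypothetical right-skewed mileages (mpg); NOT the real Country A data.
hypothetical_mileages = [14, 16, 17, 18, 19, 24, 38]

sample_mean = mean(hypothetical_mileages)      # pulled upward by the value 38
sample_median = median(hypothetical_mileages)  # middle value, unaffected by how large 38 is

print(round(sample_mean, 2), sample_median)
```

In this made-up example the mean is larger than the median, which is the pattern expected for a right-skewed distribution.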

Part C: The Combined Data Set

i. What is the range of the combined data set?

The range of the combined data set is 26 mpg.

Justification: The range is calculated as the maximum value minus the minimum value for the entire combined data set.

The overall maximum value is the maximum from Country B, which is approximately 40 mpg.
The overall minimum value is the minimum from Country A, which is approximately 14 mpg.
Combined Range = 40 mpg - 14 mpg = 26 mpg.
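
The same arithmetic can be written as a tiny sketch, using the approximate minimum and maximum values read from each boxplot (the dictionary names are only illustrative):

```python
# Approximate extremes read from the two boxplots (mpg).
country_a = {"min": 14, "max": 38}   # max includes the high outlier
country_b = {"min": 17, "max": 40}

combined_min = min(country_a["min"], country_b["min"])
combined_max = max(country_a["max"], country_b["max"])
combined_range = combined_max - combined_min

print(combined_range)  # 26
```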

ii. What is a possible value of the median of the combined data set?

A possible value for the median of the combined data set is 24.5 mpg. (Any value between 24 and 25 is acceptable with proper justification).

Justification: There are 100 cars from Country A and 100 cars from Country B, for a total of 200 cars in the combined data set. The median will be the average of the 100th and 101st values when all 200 mileages are sorted in increasing order.

From the boxplot for Country A, we know the third quartile (Q3) is at 24 mpg. This means approximately 75% of the cars from Country A (or 75 cars) have a gas mileage of 24 mpg or less.
From the boxplot for Country B, we know the first quartile (Q1) is at 25 mpg. This means approximately 25% of the cars from Country B (or 25 cars) have a gas mileage of 25 mpg or less.
If we combine the data, the first 100 values (the lower half of the combined data) will be composed of the 75 cars from Country A with mileage ≤ 24 mpg, plus the 25 lowest-mileage cars from Country B, which have mileage ≤ 25 mpg.
Therefore, the 100th value in the combined set must be close to 24 mpg, and the 101st value must be close to 25 mpg. The median, being the average of these two, must lie between 24 and 25 mpg.

Q.2 Aphid Infestation Sampling Methods

Part A: Evaluating Sampling Method I

No, sampling method I is not an appropriate sampling method for the farmer to use to estimate the proportion of all cabbage plants in the field that are damaged by aphids.

Justification: This method is a convenience sample because the farmer selected region 3 based on the ease of access from his house, not through a random process. Convenience samples are prone to bias because the sample chosen is not likely to be representative of the entire population (the whole cabbage field). For example, region 3 is the farthest region from the river. If the farmer's belief that aphid damage is greater closer to the river is correct, a sample from region 3 would likely underestimate the true proportion of damaged plants in the entire field. To obtain a valid estimate, every plant should have a chance of being selected, which this method does not ensure.

Part B: Evaluating the Result from Sampling Method II

If the farmer's belief is correct, the selection of row E is likely to provide an overestimate of the proportion of cabbage plants in the field that are damaged by aphids.

Justification: The farmer believes that the extent of aphid damage is greater in the regions closer to the river. Row E, containing regions 21 through 25, is the row located closest to the river. Therefore, the plants in row E are expected to have a higher proportion of aphid damage than the plants in the field as a whole. Because the sample consists entirely of plants from this high-damage area, the sample proportion of damaged plants will likely be higher than the true proportion of damaged plants for the entire field, resulting in an overestimate.

Part C: Implementing Sampling Method III

Sampling method III is a stratified random sample, where each row (A, B, C, D, and E) is a stratum. To implement this method, the farmer needs to perform a separate simple random sample within each of the five rows. A valid procedure is as follows:

For Row A: The regions in Row A are numbered 1, 2, 3, 4, and 5. The farmer will use a random number generator to generate one integer from 1 to 5, inclusive. The region corresponding to the selected integer will be included in the sample. For example, if the generator produces the number 4, region 4 is selected.
For Row B: The regions in Row B are numbered 6, 7, 8, 9, and 10. The farmer will use a random number generator to generate one integer from 6 to 10, inclusive. The corresponding region is selected for the sample.
For Row C: The regions are 11, 12, 13, 14, and 15. Use a random number generator to select one integer from 11 to 15. The corresponding region is selected.
For Row D: The regions are 16, 17, 18, 19, and 20. Use a random number generator to select one integer from 16 to 20. The corresponding region is selected.
For Row E: The regions are 21, 22, 23, 24, and 25. Use a random number generator to select one integer from 21 to 25. The corresponding region is selected.

The final sample will consist of the five regions selected through this process, with exactly one region from each row. The farmer would then examine every cabbage plant within these five selected regions for aphid damage.
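
The procedure above could be carried out with any random number generator. As a minimal sketch, assuming Python's standard `random` module stands in for the farmer's generator (the names `rows` and `select_regions` are hypothetical):

```python
import random

# Rows (strata) and the region numbers each one contains, as described above.
rows = {
    "A": range(1, 6),     # regions 1-5
    "B": range(6, 11),    # regions 6-10
    "C": range(11, 16),   # regions 11-15
    "D": range(16, 21),   # regions 16-20
    "E": range(21, 26),   # regions 21-25
}

def select_regions(rng=random):
    """Pick one region at random from each row (an SRS of size 1 per stratum)."""
    return {row: rng.choice(list(regions)) for row, regions in rows.items()}

sample = select_regions()
print(sample)  # one randomly chosen region per row; output varies run to run
```

Each run returns exactly one region per row, and every region within a row has the same 1-in-5 chance of being chosen.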

Q.3 Restaurant Music Playlist Probability

Part A: Basic Probability

i. Probability of selecting one rock song

The probability of a single event is the ratio of the number of favorable outcomes to the total number of possible outcomes.

$$ P(\text{rock song}) = \frac{\text{Number of rock songs}}{\text{Total number of songs}} = \frac{100}{1000} = 0.1 $$

ii. Probability of selecting two rock songs

Since any song can be replayed, the selections are independent events. The probability of two independent events both occurring is the product of their individual probabilities.

$$ P(\text{1st is rock AND 2nd is rock}) = P(\text{1st is rock}) \times P(\text{2nd is rock}) $$ $$ = (0.1) \times (0.1) = 0.01 $$

Part B: The Binomial Random Variable

i. Define the random variable and its distribution

The random variable of interest, let's call it $X$, is the number of rock songs played in a typical one-hour period.

The random variable $X$ follows a Binomial distribution. We can verify this using the four conditions for a binomial setting (B.I.N.S.):

Binary: Each trial (song selection) has two outcomes: success (the song is a rock song) or failure (it is not a rock song).
Independent: Trials are independent because the problem states any song can be replayed. Thus, the outcome of one selection does not affect the next.
Number of trials: There is a fixed number of trials, $n = 20$ songs in the one-hour period.
Success probability: The probability of success, $p$, is constant for each trial. From part A, $p = P(\text{rock song}) = 0.1$.

Therefore, the random variable is distributed as $ X \sim \text{Binomial}(n=20, p=0.1) $.

ii. What is the expected value?

The expected value (or mean) of a binomial random variable is given by the formula $E(X) = np$.

$$ E(X) = (20)(0.1) = 2 \text{ rock songs} $$

On average, Ms. Fey can expect 2 rock songs to be played in any given one-hour period.

Part C: Binomial Probability and Inference

i. Determine the probability of 4 or more rock songs

We want to find $P(X \ge 4)$. It is easier to calculate this using the complement rule: $P(X \ge 4) = 1 - P(X \le 3)$.

The probability $P(X \le 3)$ is the cumulative probability of getting 0, 1, 2, or 3 successes. The formula for a single binomial probability is $P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$.

So, $P(X \le 3) = P(X=0) + P(X=1) + P(X=2) + P(X=3)$. This can be calculated using a calculator's binomial cumulative distribution function (like `binomcdf`).

$$ P(X \le 3) = \text{binomcdf}(\text{trials: } 20, p\text{: } 0.1, \text{x-value: } 3) \approx 0.8670 $$ $$ P(X \ge 4) = 1 - P(X \le 3) \approx 1 - 0.8670 = 0.1330 $$

There is approximately a 13.3% chance that 4 or more rock songs will be played in a particular one-hour period.
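
For readers without a graphing calculator, the same cumulative probability can be reproduced from the binomial formula. This is a sketch using only Python's standard library; the helper names `binom_pmf` and `binom_cdf` are illustrative and play the role of the calculator's `binomcdf`.

```python
from math import comb

n, p = 20, 0.1  # 20 songs per hour, P(rock) = 0.1

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(x, n, p):
    """P(X <= x): sum of the pmf from 0 through x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

p_at_most_3 = binom_cdf(3, n, p)
p_at_least_4 = 1 - p_at_most_3   # complement rule

print(f"{p_at_most_3:.4f} {p_at_least_4:.4f}")  # 0.8670 0.1330
```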

ii. Does observing 4 rock songs provide strong evidence against randomness?

No, observing 4 rock songs does not provide strong evidence that the song selection process was not truly random.

Justification: The question is asking if observing 4 rock songs is an unusually high number, or a "rare event," under the assumption of a random selection process. The probability of observing an event as extreme as this or more extreme (i.e., 4 or more rock songs) is the value we calculated in the previous part: $P(X \ge 4) \approx 0.1330$.

An event with a probability of 0.1330 (or 13.3%) is not typically considered rare or statistically significant. (Statisticians often use a threshold, or significance level, of 0.05 or 5%). Since the probability of this outcome happening by chance is relatively high, it is a plausible result under the random model. Therefore, we do not have strong evidence to conclude that the selection process is not random.

Q.4 Hypothesis Test for a Proportion

Step 1: State

We want to perform a significance test to determine if there is convincing evidence for Karen's belief. The parameter of interest is p, the true proportion of all students at Karen's high school who use the app to help them with their homework at least once per week.

The hypotheses are formulated based on the national proportion p₀ = 0.22 and Karen's belief that the proportion at her school is greater.

Null Hypothesis: H₀: p = 0.22
Alternative Hypothesis: H_a: p > 0.22

We will use a significance level of α = 0.05, as stated in the problem.

Step 2: Plan

The appropriate inference procedure is a one-sample z-test for a proportion. We must verify the conditions for this test:

Random: The data comes from a "simple random sample of 130 students from her school." This condition is met.
10% Condition (for Independence): The sample size is n = 130. The school has more than 2,000 students. Since 10 × 130 = 1300, which is less than 2,000, it's reasonable to assume the sample is less than 10% of the school's population. This condition is met.
Large Counts Condition (for Normality): We check if the expected number of successes and failures are both at least 10, assuming the null hypothesis is true.
- n × p₀ = 130 × 0.22 = 28.6 ≥ 10
- n × (1 - p₀) = 130 × 0.78 = 101.4 ≥ 10
Since both values are at least 10, the sampling distribution of the sample proportion (p̂) is approximately Normal.

All conditions are satisfied, so we can proceed with the test.
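
The two numeric conditions can be checked mechanically. The sketch below uses the values from the problem; `population_lower_bound` is an illustrative name for the stated fact that the school has more than 2,000 students.

```python
n, p0 = 130, 0.22
population_lower_bound = 2000   # the school has MORE than this many students

# 10% condition: the sample must be at most 10% of the population.
ten_percent_ok = 10 * n <= population_lower_bound   # 1300 <= 2000

# Large counts condition, computed under the null hypothesis.
expected_successes = n * p0          # 28.6
expected_failures = n * (1 - p0)     # 101.4
large_counts_ok = expected_successes >= 10 and expected_failures >= 10

print(ten_percent_ok, large_counts_ok)  # True True
```

The Random condition cannot be verified numerically; it rests on the statement that the data come from a simple random sample.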

Step 3: Do

The sample proportion (p̂) is calculated from the sample data (x = 38, n = 130).

$$ \hat{p} = \frac{x}{n} = \frac{38}{130} \approx 0.2923 $$

The test statistic is the z-score, calculated as follows:

$$ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} = \frac{0.2923 - 0.22}{\sqrt{\frac{0.22(0.78)}{130}}} \approx \frac{0.0723}{0.0363} \approx 1.99 $$

Now, we find the P-value. Since this is a right-tailed test (H_a: p > 0.22), the P-value is the probability of observing a z-score this high or higher.

P-value = P(Z ≥ 1.99) ≈ 0.0233
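
The full calculation for this step can be reproduced from the sample data (x = 38, n = 130) and the null value p₀ = 0.22. This is a sketch that uses `statistics.NormalDist` from Python's standard library in place of a calculator or z-table.

```python
from math import sqrt
from statistics import NormalDist

x, n, p0 = 38, 130, 0.22

p_hat = x / n                               # sample proportion
standard_error = sqrt(p0 * (1 - p0) / n)    # uses p0 because H0 is assumed true
z = (p_hat - p0) / standard_error           # test statistic

# Right-tailed test (H_a: p > 0.22): area to the right of z under N(0, 1).
p_value = 1 - NormalDist().cdf(z)

print(f"p_hat={p_hat:.4f}, z={z:.2f}, p_value={p_value:.4f}")
```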

Step 4: Conclude

We compare the P-value to our significance level α = 0.05.

Because our P-value of approximately 0.0233 is less than α = 0.05, we reject the null hypothesis (H₀).

We have convincing statistical evidence to conclude that the proportion of all students at Karen's high school who use the app to help with their homework at least once per week is greater than the national proportion of 0.22. Therefore, we have sufficient evidence to support Karen's belief.

Q.5 Mean Number of Bedrooms

Part A: Sample Probabilities and Mean

i. Probability of Fewer than 3 Bedrooms

To find the probability that a randomly selected house from the sample had fewer than 3 bedrooms, we sum the proportions for houses with 1 and 2 bedrooms.

P(X < 3) = P(X=1) + P(X=2)

Using the values from the table:

P(X < 3) = 0.12 + 0.22 = 0.34

ii. Mean Number of Bedrooms for the Sample

The mean number of bedrooms for the sample, denoted by x̄, is calculated as the expected value of the discrete probability distribution. We multiply each number of bedrooms by its corresponding proportion and sum the results.

x̄ = Σ [x_i ⋅ P(x_i)]

Calculation:

x̄ = (1 ⋅ 0.12) + (2 ⋅ 0.22) + (3 ⋅ 0.28) + (4 ⋅ 0.22) + (5 ⋅ 0.14) + (6 ⋅ 0.02)
x̄ = 0.12 + 0.44 + 0.84 + 0.88 + 0.70 + 0.12
x̄ = 3.10

The mean number of bedrooms for the sample is 3.10.
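
Both Part A results follow from the table of proportions, and a short sketch makes the weighted sum explicit. The proportions are the ones used in the calculation above; the variable names are illustrative.

```python
# Proportion of sampled houses with each number of bedrooms (from the table).
bedroom_proportions = {1: 0.12, 2: 0.22, 3: 0.28, 4: 0.22, 5: 0.14, 6: 0.02}

# Part A(i): probability of fewer than 3 bedrooms.
p_fewer_than_3 = sum(prop for beds, prop in bedroom_proportions.items() if beds < 3)

# Part A(ii): mean as the proportion-weighted sum of bedroom counts.
sample_mean = sum(beds * prop for beds, prop in bedroom_proportions.items())

print(f"{p_fewer_than_3:.2f} {sample_mean:.2f}")  # 0.34 3.10
```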

Part B: Hypothesis Test Setup

i. Hypotheses for the Test

Rodney wants to test if the mean number of bedrooms in 2024 is different from the 2017 mean of 2.9. This requires a two-tailed test.

Let μ represent the true mean number of bedrooms in all newly built houses in Country B in 2024.

Null Hypothesis: H₀: μ = 2.9
Alternative Hypothesis: H_a: μ ≠ 2.9

ii. Type I Error in Context

A Type I error occurs when we reject the null hypothesis (H₀) when it is actually true.

In the context of this problem, a Type I error would be:

Concluding that the true mean number of bedrooms in newly built houses in 2024 is different from 2.9, when in reality, the true mean is still 2.9.

Part C: Confidence Interval and Conclusion

Keisha calculated a 97% confidence interval for the population mean μ as (3.01, 3.19). We are asked to use this interval to make a conclusion for Rodney's hypothesis test at a significance level of α = 0.03.

The relationship between a two-tailed hypothesis test and a confidence interval is given by:
Significance Level (α) = 1 - Confidence Level (C)

In this case, α = 1 - 0.97 = 0.03. This matches the significance level given for the test.

To make a conclusion, we check if the null hypothesis value, μ₀ = 2.9, falls within the 97% confidence interval (3.01, 3.19).

Justification: The value 2.9 is not contained within the 97% confidence interval (3.01, 3.19). A confidence interval provides a range of plausible values for the true population parameter. Since the hypothesized value of 2.9 is not a plausible value according to the interval, we have evidence against the null hypothesis.

Conclusion: Because the null value μ₀ = 2.9 is not in the confidence interval, we reject the null hypothesis H₀ at the α = 0.03 significance level. There is convincing statistical evidence to conclude that the mean number of bedrooms in newly built houses in Country B in 2024 is different from 2.9.
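
The decision rule used here can be summarized in a few lines. This sketch encodes only the logic described above: the interval may stand in for the two-sided test when α = 1 − C, and H₀ is rejected when the null value falls outside the interval.

```python
confidence_level = 0.97
alpha = 0.03
interval = (3.01, 3.19)   # Keisha's 97% confidence interval for the mean
mu_0 = 2.9                # null value from Rodney's hypotheses

# The interval can be used for the two-sided test only if alpha = 1 - C.
levels_match = abs((1 - confidence_level) - alpha) < 1e-9

null_value_in_interval = interval[0] <= mu_0 <= interval[1]
reject_null = levels_match and not null_value_in_interval

print(levels_match, null_value_in_interval, reject_null)  # True False True
```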

Q.6 Reading Comprehension Study Analysis

Part A: Conclusion of the Hypothesis Test

We are asked to state a conclusion for Stefan's two-sample t-test given a P-value and a significance level.

P-value: 0.002
Significance Level (α): 0.05
Hypotheses: H₀: μ_AM = μ_PM versus H_a: μ_AM ≠ μ_PM

Justification: We compare the P-value to the significance level. Since the P-value of 0.002 is less than the significance level of α = 0.05 (0.002 < 0.05), we reject the null hypothesis (H₀).

Conclusion in Context: There is convincing statistical evidence to conclude that the true mean reading comprehension score for all children similar to those in the study who read the story at 9 a.m. is different from the true mean reading comprehension score for those who read the story at 3 p.m.

Part B: Appropriateness of a Two-Sample t-Test

A two-sample t-test is used for comparing the means of two independent groups, while a paired t-test is used when the data points in the two groups are paired or matched in some way (e.g., a "before and after" measurement on the same subject).

Explanation: A two-sample t-test was appropriate because the 100 children were randomly assigned to two separate, independent groups. The 50 children in the 9 a.m. group are different individuals from the 50 children in the 3 p.m. group. There is no natural pairing between a specific child in the first group and a specific child in the second group. Therefore, the two samples are independent, justifying the use of a two-sample t-test for the difference in population means.

Part C: Practical Importance and Cohen's d

i. Calculate Cohen's d Coefficient

First, we calculate the pooled standard deviation (s_p). Because the group sizes are equal (n₁ = n₂ = 50), it is the square root of the average of the two sample variances.

$$ s_p = \sqrt{\frac{s_1^2 + s_2^2}{2}} = \sqrt{\frac{4.12^2 + 4.43^2}{2}} = \sqrt{\frac{16.9744 + 19.6249}{2}} = \sqrt{18.29965} \approx 4.278 $$

Next, we calculate Cohen's d coefficient.

$$ d = \frac{|\bar{x}_1 - \bar{x}_2|}{s_p} = \frac{|15.2 - 17.9|}{4.278} = \frac{2.7}{4.278} \approx 0.63 $$
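
The two-step calculation can be checked numerically. This is a sketch using the sample means (15.2 and 17.9) and standard deviations (4.12 and 4.43) from the problem, with the equal-sample-size form of the pooled standard deviation; the variable names are illustrative.

```python
from math import sqrt

mean_1, mean_2 = 15.2, 17.9   # sample mean scores for the two groups
sd_1, sd_2 = 4.12, 4.43       # sample standard deviations (n1 = n2 = 50)

# Pooled SD when both groups have the same size: root of the average variance.
pooled_sd = sqrt((sd_1**2 + sd_2**2) / 2)

# Cohen's d: absolute difference in means, in units of the pooled SD.
cohens_d = abs(mean_1 - mean_2) / pooled_sd

print(f"{pooled_sd:.3f} {cohens_d:.2f}")  # 4.278 0.63
```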

ii. Describe the Practical Importance

We compare our calculated Cohen's d to the guidelines in Table 2.

Interpretation: The calculated Cohen's d coefficient is approximately 0.63. This value falls within the interval 0.20 < d < 0.80.

Conclusion in Context: According to the provided guidelines, this indicates that the observed difference in mean reading comprehension scores between the 9 a.m. and 3 p.m. groups is somewhat meaningful in real life.

Part D: Hypothetical Scenario with Increased Standard Deviations

i. Effect on Cohen's d

We consider a new situation where the standard deviations (s₁ and s₂) are larger, but the means and sample sizes are unchanged.

Effect on Cohen's d: The Cohen's d coefficient would be smaller.

Explanation: The formula for Cohen's d is the absolute difference in means divided by the pooled standard deviation (s_p). The numerator, |x̄₁ - x̄₂|, would remain unchanged. However, since both s₁ and s₂ increase, the pooled standard deviation s_p (the denominator) would also increase. Dividing the same number by a larger number results in a smaller quotient. Thus, Cohen's d would decrease.
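
A quick numerical comparison shows the direction of the change. In the sketch below, the larger standard deviations (8.0 and 8.5) are hypothetical values chosen only for illustration; they do not come from the study.

```python
from math import sqrt

def cohens_d(mean_1, mean_2, sd_1, sd_2):
    """Cohen's d for two equal-sized groups (pooled SD = root of the mean variance)."""
    pooled_sd = sqrt((sd_1**2 + sd_2**2) / 2)
    return abs(mean_1 - mean_2) / pooled_sd

# Observed summary statistics from the study.
observed_d = cohens_d(15.2, 17.9, 4.12, 4.43)

# Same means, but hypothetical larger standard deviations (illustrative only).
hypothetical_d = cohens_d(15.2, 17.9, 8.0, 8.5)

print(round(observed_d, 2), round(hypothetical_d, 2))
```

With the numerator held fixed, the larger denominator gives a smaller coefficient, which matches the explanation above.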

ii. Effect on Practical Importance

We consider how the smaller Cohen's d from part D(i) affects the interpretation of practical importance.

Effect on Practical Importance: The observed difference in means would have less practical importance.

Explanation: According to the guidelines in Table 2, lower values of Cohen's d correspond to less practical importance. Since the new, hypothetical Cohen's d is smaller than the one calculated in Part C, it would indicate a less meaningful effect in real life. A larger spread (standard deviation) in the data for both groups makes the same difference in means less impactful or significant in a practical sense.