#A1. Data Loading, Preprocessing and Exploratory Data Analysis

#Task:

Load the Ames Housing dataset from ../AmesHousing.csv. Select the following variables: SalePrice, Gr.Liv.Area, Overall.Qual, Year.Built, Garage.Cars, Full.Bath, Total.Bsmt.SF, Neighborhood, Lot.Area, Bedroom.AbvGr, Year.Remod.Add. Handle missing values by removing any rows with NA values. Create a new variable HouseAge = max(Year.Built) - Year.Built.

# 1) Load CSV 

# Packages used throughout: dplyr/tidyr/purrr/tibble/ggplot2 via tidyverse,
# broom (tidy/glance), scales (dollar/percent), knitr (kable), GGally (ggpairs)
library(tidyverse)
library(broom)
library(scales)
library(knitr)
library(GGally)

ames <- read.csv("../AmesHousing.csv", stringsAsFactors = FALSE)  # path per the task


# 2) Keep only required columns

ames_sel <- ames %>%
  transmute(
    SalePrice,
    Gr.Liv.Area,
    Overall.Qual,
    Year.Built,
    Garage.Cars,
    Full.Bath,
    Total.Bsmt.SF,
    Neighborhood = factor(Neighborhood),
    Lot.Area,
    Bedroom.AbvGr,
    Year.Remod.Add
  )

# 3) Create HouseAge = max(Year.Built) - Year.Built

max_year <- max(ames_sel$Year.Built, na.rm = TRUE)
ames_sel <- ames_sel %>% mutate(HouseAge = max_year - Year.Built)

# 4) Count missing rows before dropping

n_before <- nrow(ames_sel)
ames_clean <- ames_sel %>% drop_na()
n_after <- nrow(ames_clean)
removed_na <- n_before - n_after # Q1
removed_na
## [1] 2

Q1: How many observations were removed due to missing values?

Two observations were removed due to missing values in the selected variables.
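To see which of the selected columns contributed the missing rows, a quick base-R check works. Sketched here on a toy data frame (hypothetical values), since the counts depend on the actual CSV; in the assignment, the same two lines would run on ames_sel.

```r
# Toy data frame standing in for ames_sel (hypothetical values);
# with the real data, run the two summary lines on ames_sel instead.
df <- data.frame(
  SalePrice   = c(200000, 150000, NA),
  Garage.Cars = c(2, NA, 1),
  Full.Bath   = c(2, 1, 2)
)

na_per_col   <- colSums(is.na(df))        # NA count in each column
rows_with_na <- sum(!complete.cases(df))  # rows that drop_na() would remove
na_per_col
rows_with_na
```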

#A2

# Q2: Mean & SD of SalePrice

sp_mean <- mean(ames_clean$SalePrice)
sp_sd   <- sd(ames_clean$SalePrice)
tibble(Mean_SalePrice = sp_mean, SD_SalePrice = sp_sd) %>%
  mutate(across(everything(), dollar_format())) %>%
  kable()
| Mean_SalePrice | SD_SalePrice |
|---|---|
| $180,841 | $79,889.90 |

Q2: What is the mean and standard deviation of SalePrice?

Mean SalePrice = $180,841; SD = $79,889.90.

#A3. Distribution checks (hist & Q-Q)

# Histogram
ggplot(ames_clean, aes(SalePrice)) +
  geom_histogram(bins = 40, fill = "skyblue", color = "white") +
  labs(title = "Distribution of Sale Price", x = "Sale Price", y = "Count")

# Q-Q plot
ggplot(ames_clean, aes(sample = SalePrice)) +
  stat_qq() + stat_qq_line(color = "red") +
  labs(title = "Q-Q Plot of Sale Price")

Q3: Does SalePrice follow a normal distribution? What does the Q-Q plot suggest?

SalePrice is not normally distributed. The histogram shows that house prices are right-skewed: most homes fall in the lower or middle price range, and a few very high-priced homes pull the tail to the right.

The Q-Q plot compares the data to a normal distribution. If prices were normal, the points would line up on the red line; instead they curve upward at the upper tail, showing that high-value houses sell for more than a normal pattern would predict.
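Since both plots point to right skew, a log transform of SalePrice is the usual remedy. A small simulation (lognormal draws standing in for prices, not the actual Ames data) illustrates the effect:

```r
set.seed(1)
# Simulated right-skewed "prices": lognormal, standing in for SalePrice
prices <- rlnorm(5000, meanlog = 12, sdlog = 0.4)

# Right skew pulls the mean above the median;
# after a log transform the two nearly coincide.
c(raw_mean = mean(prices), raw_median = median(prices))
c(log_mean = mean(log(prices)), log_median = median(log(prices)))
```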

#A4. Correlations with SalePrice

num_vars <- ames_clean %>%
  select(SalePrice, Gr.Liv.Area, Overall.Qual, Year.Built, Garage.Cars,
         Full.Bath, Total.Bsmt.SF, Lot.Area, Bedroom.AbvGr, Year.Remod.Add, HouseAge)

corrs <- num_vars %>%
  select(-SalePrice) %>%
  map_dbl(~ cor(.x, num_vars$SalePrice, use = "pairwise.complete.obs"))

corr_tbl <- enframe(corrs, name = "Variable", value = "Correlation_with_SalePrice") %>%
  arrange(desc(abs(Correlation_with_SalePrice)))

corr_tbl %>% kable(digits = 3)
| Variable | Correlation_with_SalePrice |
|---|---|
| Overall.Qual | 0.799 |
| Gr.Liv.Area | 0.707 |
| Garage.Cars | 0.648 |
| Total.Bsmt.SF | 0.632 |
| Year.Built | 0.558 |
| HouseAge | -0.558 |
| Full.Bath | 0.546 |
| Year.Remod.Add | 0.533 |
| Lot.Area | 0.266 |
| Bedroom.AbvGr | 0.144 |

Q4: Which three variables show the strongest correlation with SalePrice?

The three variables that show the strongest correlation with SalePrice are:

  1. Overall.Qual (r = 0.799) — higher overall quality strongly increases sale price.
  2. Gr.Liv.Area (r = 0.707) — larger living area is closely associated with higher prices.
  3. Garage.Cars (r = 0.648) — homes with more garage capacity tend to sell for more.

These positive correlations indicate that as these features increase, the sale price generally rises.

#A5. Scatterplots vs SalePrice

vars_for_scatter <- c("Gr.Liv.Area","Total.Bsmt.SF","Overall.Qual","Year.Built","Lot.Area")
ames_clean %>%
  select(SalePrice, all_of(vars_for_scatter)) %>%
  ggpairs(progress = FALSE)

Q5: Based on scatter plots, which variable appears to have the strongest linear relationship with SalePrice?

Based on the scatter plots, both Overall.Qual and Gr.Liv.Area show strong positive relationships with SalePrice. The most linear pattern appears for Gr.Liv.Area, while the highest numerical correlation (0.799) is with Overall.Qual. In both cases, larger or better-quality homes sell for higher prices.

#B. Build the Regression Model

model_full <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Year.Built + Garage.Cars +
                   Full.Bath + Total.Bsmt.SF + Lot.Area + Bedroom.AbvGr +
                   Year.Remod.Add + HouseAge + Neighborhood,
                 data = ames_clean)
summary(model_full)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Overall.Qual + Year.Built + 
##     Garage.Cars + Full.Bath + Total.Bsmt.SF + Lot.Area + Bedroom.AbvGr + 
##     Year.Remod.Add + HouseAge + Neighborhood, data = ames_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -466327  -14221     -59   13532  267645 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.126e+06  1.114e+05 -10.110  < 2e-16 ***
## Gr.Liv.Area          5.400e+01  2.216e+00  24.370  < 2e-16 ***
## Overall.Qual         1.543e+04  7.799e+02  19.779  < 2e-16 ***
## Year.Built           2.519e+02  4.746e+01   5.309 1.19e-07 ***
## Garage.Cars          1.059e+04  1.131e+03   9.367  < 2e-16 ***
## Full.Bath           -3.718e+03  1.696e+03  -2.192 0.028472 *  
## Total.Bsmt.SF        2.304e+01  1.855e+00  12.419  < 2e-16 ***
## Lot.Area             7.140e-01  8.867e-02   8.053 1.17e-15 ***
## Bedroom.AbvGr       -5.817e+03  9.826e+02  -5.921 3.59e-09 ***
## Year.Remod.Add       2.954e+02  4.135e+01   7.146 1.13e-12 ***
## HouseAge                    NA         NA      NA       NA    
## NeighborhoodBlueste -4.212e+03  1.217e+04  -0.346 0.729315    
## NeighborhoodBrDale  -3.916e+03  8.909e+03  -0.439 0.660334    
## NeighborhoodBrkSide  2.099e+04  7.705e+03   2.725 0.006474 ** 
## NeighborhoodClearCr  2.572e+04  8.316e+03   3.093 0.002003 ** 
## NeighborhoodCollgCr  1.734e+04  6.604e+03   2.625 0.008708 ** 
## NeighborhoodCrawfor  4.224e+04  7.491e+03   5.638 1.88e-08 ***
## NeighborhoodEdwards  1.156e+04  7.107e+03   1.626 0.104055    
## NeighborhoodGilbert  9.733e+03  6.853e+03   1.420 0.155661    
## NeighborhoodGreens   9.179e+03  1.330e+04   0.690 0.490070    
## NeighborhoodGrnHill  1.121e+05  2.405e+04   4.660 3.30e-06 ***
## NeighborhoodIDOTRR   1.178e+04  7.906e+03   1.490 0.136298    
## NeighborhoodLandmrk -5.468e+03  3.335e+04  -0.164 0.869795    
## NeighborhoodMeadowV  1.108e+04  8.604e+03   1.288 0.197801    
## NeighborhoodMitchel  1.359e+04  7.127e+03   1.907 0.056560 .  
## NeighborhoodNAmes    1.761e+04  6.852e+03   2.570 0.010229 *  
## NeighborhoodNoRidge  6.520e+04  7.558e+03   8.627  < 2e-16 ***
## NeighborhoodNPkVill  5.818e+02  9.360e+03   0.062 0.950439    
## NeighborhoodNridgHt  6.693e+04  6.789e+03   9.859  < 2e-16 ***
## NeighborhoodNWAmes   1.162e+04  7.055e+03   1.647 0.099612 .  
## NeighborhoodOldTown  7.729e+03  7.486e+03   1.032 0.301941    
## NeighborhoodSawyer   1.855e+04  7.174e+03   2.585 0.009774 ** 
## NeighborhoodSawyerW  9.144e+03  6.997e+03   1.307 0.191339    
## NeighborhoodSomerst  2.357e+04  6.692e+03   3.523 0.000434 ***
## NeighborhoodStoneBr  7.355e+04  7.800e+03   9.429  < 2e-16 ***
## NeighborhoodSWISU    1.263e+04  8.563e+03   1.475 0.140326    
## NeighborhoodTimber   3.086e+04  7.439e+03   4.148 3.45e-05 ***
## NeighborhoodVeenker  3.136e+04  9.257e+03   3.388 0.000714 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32740 on 2891 degrees of freedom
## Multiple R-squared:  0.8341, Adjusted R-squared:  0.8321 
## F-statistic: 403.8 on 36 and 2891 DF,  p-value: < 2.2e-16

Q6: Interpret the coefficient for Gr.Liv.Area. What does it tell us?

The coefficient for Gr.Liv.Area is 54.00, which means that, holding the other predictors constant, each additional square foot of living area is associated with an increase of about $54 in sale price on average.

This is a strong positive relationship; it shows that larger homes are worth significantly more.
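As a back-of-envelope use of this coefficient, the predicted price difference between two otherwise-identical houses that differ by 500 square feet is:

```r
# Price difference implied by the Gr.Liv.Area coefficient (54.00 $/sq ft),
# other predictors held constant
b_grliv     <- 54.00
delta_sqft  <- 500
delta_price <- b_grliv * delta_sqft
delta_price  # $27,000
```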

Q7: Interpret the coefficient for Overall.Qual. Is the effect meaningful?

The coefficient for Overall.Qual is 1.543e+04, i.e. $15,430. For each 1-point increase in overall quality (on a scale from 1-10), the sale price increases by about $15,430, keeping all other factors constant.

Yes, the effect is very meaningful. Quality has one of the strongest impacts on price: homes rated higher in quality sell for tens of thousands more, even when similar in size and age.

# Q8: predict the sale price for the specified house
new_house <- tibble(
  Gr.Liv.Area = 2000,
  Overall.Qual = 7,
  Year.Built = 2000,
  Garage.Cars = 2,
  Full.Bath = 2,
  Total.Bsmt.SF = 1000,
  Lot.Area = 10000,
  Bedroom.AbvGr = 3,
  Year.Remod.Add = 2000,
  HouseAge = max_year - 2000,
  Neighborhood = ames_clean$Neighborhood[1]  # choose any valid level; adjust if needed
)

pred_price <- predict(model_full, newdata = new_house, interval = "none")
dollar(pred_price)
##          1 
## "$228,845"

Q8: Calculate the estimated sale price for a house with: Living area=2000 sq ft, Quality=7, Year built=2000, Garage=2 cars, Bathrooms=2, Basement=1000 sq ft, Lot=10000 sq ft, Bedrooms=3

Point prediction for specified house For a house with:
- Living area = 2000 sq ft
- Overall quality = 7
- Year built = 2000
- Garage = 2 cars
- Full baths = 2
- Basement = 1000 sq ft
- Lot area = 10,000 sq ft
- Bedrooms = 3

The estimated sale price is $228,845.

This means that, based on the model, a house with these features would be expected to sell for around $228K in Ames, Iowa.
The prediction aligns with the general price range of mid-sized, good-quality homes in the dataset.

#C. Which Factors Affect Price Most? (Standardized Betas)

#Task: Calculate standardized coefficients by standardizing all predictor variables (mean=0, sd=1) and re-estimating the model. Create a table ranking predictors by absolute value of standardized coefficients. Create a coefficient plot showing estimates with 95% confidence intervals.

num_names <- c("Gr.Liv.Area","Overall.Qual","Year.Built","Garage.Cars","Full.Bath",
               "Total.Bsmt.SF","Lot.Area","Bedroom.AbvGr","Year.Remod.Add","HouseAge")

ames_std <- ames_clean %>%
  mutate(across(all_of(num_names), ~ as.numeric(scale(.x))))

model_std <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Year.Built + Garage.Cars +
                  Full.Bath + Total.Bsmt.SF + Lot.Area + Bedroom.AbvGr +
                  Year.Remod.Add + HouseAge + Neighborhood,
                data = ames_std)

std_coefs <- tidy(model_std) %>%
  filter(term %in% num_names) %>%   # numeric predictors only
  mutate(abs_beta = abs(estimate)) %>%
  arrange(desc(abs_beta))

std_coefs %>%
  select(term, estimate) %>%
  kable(digits = 3, caption = "Standardized Coefficients (Numeric Predictors)")
Standardized Coefficients (Numeric Predictors)

| term | estimate |
|---|---|
| Gr.Liv.Area | 27298.387 |
| Overall.Qual | 21763.902 |
| Total.Bsmt.SF | 10152.778 |
| Garage.Cars | 8054.747 |
| Year.Built | 7618.569 |
| Year.Remod.Add | 6161.802 |
| Lot.Area | 5628.326 |
| Bedroom.AbvGr | -4815.961 |
| Full.Bath | -2055.711 |
| HouseAge | NA |

Q9: Based on standardized coefficients, which three factors have the largest impact on SalePrice?

Based on the standardized coefficients, the three factors with the largest impact on SalePrice are:

  1. Gr.Liv.Area (Living Area) — 27,298
  2. Overall.Qual (Overall Quality) — 21,764
  3. Total.Bsmt.SF (Basement Area) — 10,153

These have the largest absolute standardized coefficients, meaning they contribute the most to predicting sale price.
Bigger homes, higher-quality construction, and larger basements have the strongest positive influence on sale price.

Q10: Why do we use standardized coefficients to compare variable importance?

We use standardized coefficients to compare variable importance because they put all predictors on the same scale (mean = 0, standard deviation = 1).
This removes the effect of different measurement units — for example, square feet, years, and dollars can’t be directly compared otherwise.

By standardizing, we can see which variables have the strongest influence on SalePrice regardless of their original units.
The larger the absolute standardized coefficient, the more important the variable is in explaining price variation.
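The mechanics behind this are direct: when only the predictor is scaled (as in the code above, where SalePrice stays in dollars), the standardized slope is simply the raw slope times sd(x). A small synthetic example (made-up data, not the Ames model) demonstrates the identity:

```r
set.seed(5520)
# Synthetic predictor/response illustrating the rescaling of a slope
x <- rnorm(200, mean = 1500, sd = 500)
y <- 100 + 54 * x + rnorm(200, sd = 30000)

fit_raw <- lm(y ~ x)
fit_std <- lm(y ~ scale(x))

b_raw <- coef(fit_raw)[["x"]]
b_std <- coef(fit_std)[["scale(x)"]]

# Standardized slope = raw slope * sd(x), exactly
all.equal(b_std, b_raw * sd(x))
```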

#D. Statistical Significance and Confidence Intervals

#Task: Review 95% confidence intervals for all coefficients. Create a significance summary table categorizing predictors (*** p<0.001, ** p<0.01, * p<0.05, . p<0.10, not significant).

ggplot(tidy(model_std, conf.int = TRUE) %>% filter(term %in% num_names),
       aes(x = reorder(term, estimate), y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.15) +
  coord_flip() +
  labs(title = "Standardized Coefficients (95% CIs)", x = "", y = "Standardized Beta")
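The significance summary table the task asks for can be built from the model's coefficient table. A minimal base-R sketch is below, illustrated on the built-in mtcars data so it runs standalone; in the assignment, model_full would take the place of the stand-in fit:

```r
# Significance stars from a fitted model's coefficient table
fit   <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # stand-in for model_full
coefs <- summary(fit)$coefficients

sig_tbl <- data.frame(
  term    = rownames(coefs),
  p.value = coefs[, "Pr(>|t|)"],
  signif  = cut(coefs[, "Pr(>|t|)"],
                breaks = c(0, 0.001, 0.01, 0.05, 0.10, 1),
                labels = c("***", "**", "*", ".", "not significant"),
                include.lowest = TRUE),
  row.names = NULL
)
sig_tbl
```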

Q11: For Overall.Qual, what is the 95% confidence interval and what does it mean?

From the model output, the 95% confidence interval for Overall.Qual is approximately [$13,900, $17,000], and it does not include zero.

Interpretation: we are 95% confident that, on average, each 1-point increase in overall quality (on a scale of 1-10) increases the house price by about $13,900 to $17,000, holding other factors constant. Since the interval is entirely above zero, the effect of Overall.Qual is positive and statistically significant, confirming that better-quality homes sell for much higher prices.

Q12: Does the confidence interval for Bedroom.AbvGr contain zero? What does this imply?

No. From the full-model output (estimate -5,817, SE 982.6, p = 3.59e-09), the 95% confidence interval for Bedroom.AbvGr is roughly [-7,743, -3,891], which does not contain zero; the standardized model gives the same conclusion, since scaling leaves the t-statistic unchanged.

Interpretation: after accounting for living area and overall quality, the effect of bedroom count on SalePrice is significantly negative. Holding total size fixed, adding bedrooms means smaller rooms, which is associated with a lower sale price.
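The interval can be checked by hand from the estimate and standard error printed in the full-model coefficient table (Bedroom.AbvGr: estimate -5,817, SE 982.6):

```r
# Approximate 95% CI from the printed coefficient table
est <- -5817
se  <- 982.6

ci <- est + c(-1, 1) * 1.96 * se
round(ci)  # both endpoints negative, so the interval excludes zero
```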

Q13: Which predictors are statistically significant at the 0.05 level?

At the 0.05 significance level, the predictors with p-values below 0.05 are statistically significant. From the regression output, these are:

- Gr.Liv.Area (p < 0.001)
- Overall.Qual (p < 0.001)
- Year.Built (p < 0.001)
- Garage.Cars (p < 0.001)
- Full.Bath (p = 0.028)
- Total.Bsmt.SF (p < 0.001)
- Lot.Area (p < 0.001)
- Bedroom.AbvGr (p < 0.001)
- Year.Remod.Add (p < 0.001)
- Several Neighborhood levels (BrkSide, ClearCr, CollgCr, Crawfor, NAmes, NoRidge, NridgHt, Sawyer, Somerst, StoneBr, Timber, Veenker) are also significant (p < 0.05).

Interpretation: these predictors have a statistically significant relationship with SalePrice, meaning they reliably explain variation in home prices. Neighborhood levels with p-values above 0.05 (e.g., Edwards, Gilbert, SWISU) are not significant and may have little independent effect on SalePrice.

Q14: Is any predictor NOT statistically significant? What should we consider doing with it?

Yes, several predictors are not statistically significant at the 0.05 level:

- Neighborhood levels such as Blueste, BrDale, Edwards, Gilbert, Greens, IDOTRR, Landmrk, MeadowV, NPkVill, OldTown, SawyerW, and SWISU.
- HouseAge shows NA because it is perfectly collinear with Year.Built (HouseAge = max(Year.Built) - Year.Built is an exact linear function of it), so lm() drops it automatically.

Interpretation: these variables do not have a consistent independent effect on SalePrice once the other predictors are included.

What to consider: we can remove the non-significant variables to simplify the model (improving interpretability and efficiency), or keep them temporarily if they have theoretical or practical importance (certain neighborhoods might still matter contextually). In general, non-significant predictors add noise but little predictive power, so model refinement or variable selection is recommended.

#E. Model Quality Metrics

#Task: Report R-squared, Adjusted R-squared, Residual Standard Error, F-statistic, and F-test p-value. Create four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.

gl <- glance(model_full)
gl %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value) %>%
  kable(digits = 4, caption = "Model Fit Statistics (Full Data)")
Model Fit Statistics (Full Data)

| r.squared | adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|---|
| 0.8341 | 0.8321 | 32739.02 | 403.8363 | 0 |
# Base R diagnostic plots

par(mfrow=c(2,2))
plot(model_full)

par(mfrow=c(1,1))

Q15: What percentage of variation in SalePrice is explained by the model?

R² = 0.8341, meaning the model explains about 83.4% of the variation in SalePrice. ✅ Interpretation: the predictors collectively do a strong job of explaining house prices in Ames, Iowa; over 80% of the price differences are captured by the model.

Q16: Why is Adjusted R² lower than R²? Which should we use when comparing models?

Adjusted R² = 0.8321 is slightly lower than R² because it penalizes adding predictors that do not improve the model much. ✅ We use Adjusted R² when comparing models with different numbers of predictors; it gives a fairer comparison by accounting for unnecessary variables.
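The penalty is visible in the formula itself. Plugging in the numbers reported above (n recovered from the residual and model degrees of freedom) reproduces the Adjusted R² up to rounding, since the printed R² is itself rounded:

```r
# adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.8341
n  <- 2928   # 2891 residual df + 36 slopes + 1 intercept
p  <- 36     # model df from the F-statistic line

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # close to the reported 0.8321
```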

Q17: What does the F-statistic tell us? Is the model statistically significant overall?

The F-statistic = 403.8, with a p-value < 2.2e-16, means the model is statistically significant overall. ✅ Interpretation: At least one predictor in the model has a significant relationship with SalePrice — the model as a whole performs much better than a model with no predictors.

Q18: Based on residual plots, does the model satisfy assumptions of linear regression (linearity, homoscedasticity, normality)?

- Linearity: the Residuals vs Fitted plot shows a mostly flat, horizontal pattern, so the relationship is roughly linear.
- Homoscedasticity: a slight fan shape appears in the Scale-Location plot, indicating mild heteroscedasticity (error spread increases with fitted values).
- Normality: the Q-Q plot shows small tail deviations; residuals are close to normal but not perfect.
- Outliers: a few high-leverage points exist (e.g., observations 2180, 2181).

✅ Conclusion: overall, the model meets the linear regression assumptions fairly well. Mild heteroscedasticity and non-normal tails do not seriously affect interpretation, but refinements such as modeling log(SalePrice) or handling outliers could improve it further.

#F. Prediction with Train-Test Split

#Task: Set seed to 5520. Create 75-25 train-test split. Re-estimate the model on training data. Make predictions on test set with 95% prediction intervals. Calculate RMSE, MAE, and MAPE. Create three plots: (1) Actual vs Predicted, (2) Residual plot, (3) Histogram of errors.

# ✅ Remove unused and rare Neighborhood levels
ames_clean <- ames_clean %>%
  filter(!Neighborhood %in% c("Landmrk", "GrnHill"))  # remove rare levels
ames_clean$Neighborhood <- droplevels(ames_clean$Neighborhood)

# ✅ Split data into train (75%) and test (25%)
set.seed(5520)
n <- nrow(ames_clean)
idx <- sample(1:n, size = floor(0.75 * n))
train <- ames_clean[idx, ]
test  <- ames_clean[-idx, ]

# ✅ Drop unused levels in both sets
train$Neighborhood <- droplevels(train$Neighborhood)
test$Neighborhood  <- droplevels(test$Neighborhood)

# ✅ Build regression model on training data
model_train <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Year.Built + Garage.Cars +
  Full.Bath + Total.Bsmt.SF + Lot.Area + Bedroom.AbvGr +
  Year.Remod.Add + HouseAge + Neighborhood,
  data = train)

# ✅ R² on training vs full model
gl_train <- glance(model_train)
r2_full  <- glance(model_full)$r.squared
r2_train <- gl_train$r.squared

# ✅ Predictions on test with 95% prediction intervals
pred_test <- as_tibble(predict(model_train, newdata = test, interval = "prediction", level = 0.95))
results <- bind_cols(test, pred_test) %>%
  rename(pred = fit) %>%
  mutate(
    error   = SalePrice - pred,
    abs_err = abs(error),
    pct_err = abs_err / SalePrice
  )

# ✅ Compute performance metrics
RMSE <- sqrt(mean(results$error^2))
MAE  <- mean(results$abs_err)
MAPE <- mean(results$pct_err)

tibble(R2_full = r2_full, R2_train = r2_train,
       RMSE = RMSE, MAE = MAE, MAPE = MAPE) %>%
  mutate(across(c(RMSE, MAE), dollar_format()),
         MAPE = percent(MAPE)) %>%
  kable(digits = 4, caption = "Train vs Full R² and Test Errors")
Train vs Full R² and Test Errors

| R2_full | R2_train | RMSE | MAE | MAPE |
|---|---|---|---|---|
| 0.8341 | 0.8382 | $34,913.23 | $20,489.49 | 12% |
# 1) Actual vs Predicted

ggplot(results, aes(x = pred, y = SalePrice)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  labs(title = "Actual vs Predicted (Test)", x = "Predicted", y = "Actual")

# 2) Residual plot

ggplot(results, aes(x = pred, y = error)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(title = "Residuals vs Predicted (Test)", x = "Predicted", y = "Residual")

# 3) Histogram of errors

ggplot(results, aes(error)) +
  geom_histogram(bins = 40) +
  labs(title = "Distribution of Prediction Errors (Test)", x = "Error", y = "Count")

Q19: Compare R² from training model to full model. Are they similar?

Yes, the training R² (≈ 0.83) is almost the same as the full model R² (≈ 0.83), showing that the model performs consistently well on both datasets and does not overfit.

✅ Interpretation: The points are tightly clustered around the 45° dashed line, meaning the model’s predicted prices are close to actual sale prices. The alignment shows high predictive accuracy (R² ≈ 0.83). A few points deviate at high prices — those are likely expensive homes the model slightly underpredicts.

Q20: What is the average prediction error (MAE) in dollars for the test set?

The MAE represents the average absolute difference between predicted and actual prices. In this model the MAE is about $20,489, meaning the model's predictions are off by roughly $20K on average.

✅ Interpretation: Residuals are mostly centered around zero → no major bias. There’s slightly more spread for higher predicted prices → mild heteroscedasticity (variance increasing with price). No strong curvature → linearity assumption looks acceptable.

Q21: Is your RMSE good or bad? How would you evaluate this?

The RMSE of about $34,913 is reasonably good: it is roughly 19% of the mean sale price ($180,841), so the typical prediction error is modest relative to home values. RMSE is best judged relative to the scale of the outcome and by comparison with alternative models, rather than in isolation.

✅ Interpretation: The histogram is roughly bell-shaped and centered around zero, showing residuals are fairly normal. A few extreme outliers exist, but overall the model’s errors are small and symmetric, which supports good model performance.
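One concrete way to judge the RMSE is to scale it by the typical sale price, using the numbers reported above:

```r
# RMSE relative to the mean sale price
rmse       <- 34913.23  # test-set RMSE from the metrics table
mean_price <- 180841    # mean SalePrice from A2

relative_rmse <- rmse / mean_price
round(relative_rmse, 3)  # roughly 0.19: typical error is about 19% of the mean price
```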

#G. Prediction Intervals, Uncertainty & Coverage

#Task: For a house with Living area=1800, Quality=7, Year=2005, Garage=2, Bathrooms=2, Basement=900, Lot=9000, Bedrooms=3: calculate point prediction, 95% prediction interval, and 95% confidence interval. Check coverage rate: what percentage of test set actual prices fall within their 95% prediction intervals? Visualize 100 sampled predictions with intervals.

house_g <- tibble(
  Gr.Liv.Area = 1800,
  Overall.Qual = 7,
  Year.Built = 2005,
  Garage.Cars = 2,
  Full.Bath = 2,
  Total.Bsmt.SF = 900,
  Lot.Area = 9000,
  Bedroom.AbvGr = 3,
  Year.Remod.Add = 2005,
  HouseAge = max_year - 2005,
  Neighborhood = train$Neighborhood[1]
)

point_pred <- predict(model_train, newdata = house_g, interval = "none")
pi_95 <- predict(model_train, newdata = house_g, interval = "prediction", level = 0.95)
ci_95 <- predict(model_train, newdata = house_g, interval = "confidence", level = 0.95)

tibble(
  Point = point_pred,
  CI_Lower = ci_95[, "lwr"], CI_Upper = ci_95[, "upr"],
  PI_Lower = pi_95[, "lwr"], PI_Upper = pi_95[, "upr"]
) %>%
  mutate(across(everything(), dollar_format())) %>%
  kable()
| Point | CI_Lower | CI_Upper | PI_Lower | PI_Upper |
|---|---|---|---|---|
| $211,782 | $204,192 | $219,373 | $148,408 | $275,157 |

Q22: What is the difference between a prediction interval and confidence interval? Which is wider and why?

A confidence interval (CI) shows where the average predicted price for similar houses is expected to fall, while a prediction interval (PI) shows where the actual price of one specific house could fall. The prediction interval is wider because it includes both the model uncertainty and the random variation in individual outcomes.

For a test house with the given features, the model predicts:

| Point | CI_Lower | CI_Upper | PI_Lower | PI_Upper |
|---|---|---|---|---|
| $211,782 | $204,192 | $219,373 | $148,408 | $275,157 |

Interpretation: The predicted sale price is about $211,782. The 95% confidence interval (CI) — between $204K and $219K — shows where the average sale price is expected to fall for similar houses. The 95% prediction interval (PI) — between $148K and $275K — is wider because it represents the range where the actual sale price of one specific house could fall. In short, while the model predicts around $212K, a single home with these characteristics could realistically sell anywhere between $148K and $275K.
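The width difference is a general property of linear models, not something specific to this dataset. A toy regression on simulated data shows it directly:

```r
set.seed(5520)
# Simulated regression: at the same x, the prediction interval is wider
# than the confidence interval because it also carries the residual noise.
x <- runif(100, 0, 10)
y <- 5 + 2 * x + rnorm(100, sd = 3)
fit <- lm(y ~ x)

newpt <- data.frame(x = 5)
ci <- predict(fit, newpt, interval = "confidence", level = 0.95)
pi <- predict(fit, newpt, interval = "prediction", level = 0.95)

ci_width <- ci[, "upr"] - ci[, "lwr"]
pi_width <- pi[, "upr"] - pi[, "lwr"]
pi_width > ci_width  # TRUE
```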

#G2. Coverage rate: % of test prices inside their PI

test_pi <- as_tibble(predict(model_train, newdata = test, interval = "prediction", level = 0.95))
coverage <- mean(test$SalePrice >= test_pi$lwr & test$SalePrice <= test_pi$upr)
percent(coverage)
## [1] "96%"

Q23: Is your empirical coverage rate close to 95%? What might explain differences?

Yes, the empirical coverage rate is 96%, which is very close to the expected 95%. This means that 96% of the actual house prices in the test data fall within the model’s 95% prediction intervals. The small difference (1%) can be explained by random sampling variation, slight non-normality in residuals, or outliers that slightly affect the prediction ranges. Overall, this shows the model’s intervals are well-calibrated and accurate.
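The scale of that sampling variation can be sketched with the binomial standard error; the test-set size used here is approximate (roughly 25% of the ~2,928 cleaned rows), since the exact n depends on the filtered data.

```r
# Sampling noise in an empirical coverage estimate at nominal 95%
n_test <- 732   # approximate test-set size (assumption)
se_cov <- sqrt(0.95 * 0.05 / n_test)
round(se_cov, 3)  # about 0.008, so 96% vs 95% is within ordinary sampling noise
```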

#G3. Visualize 100 sampled test predictions + intervals

# ✅ Set seed for reproducibility
set.seed(5520)

# ✅ Sample up to 100 test predictions for plotting
vis100 <- results %>%
  select(SalePrice, pred, lwr, upr) %>%
  slice_sample(n = min(100, nrow(results))) %>%
  mutate(id = row_number())

# ✅ Plot predicted vs actual with 95% prediction intervals
ggplot(vis100, aes(x = id, y = pred)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.2, color = "gray40") +
  geom_point(aes(y = SalePrice), shape = 4, color = "red", size = 2) +
  labs(
    title = "Sampled 100 Houses: Predicted (•) with 95% PI and Actual (×)",
    x = "Sample Index",
    y = "Price"
  ) +
  theme_minimal()

Q24: Looking at your visualization, are most actual prices captured within prediction intervals?

Yes — in the visualization, most actual prices (× markers) fall within the 95% prediction intervals (vertical bars). Only a few points lie outside the bands, which is expected for a well-fitted model. This confirms that the model’s prediction intervals accurately represent uncertainty and capture the majority of real house prices, showing good performance on the test data.

#BLUE Theory: Under the Gauss-Markov assumptions, ordinary least squares is the Best Linear Unbiased Estimator (BLUE). Our regression model appears to satisfy these assumptions reasonably well: the relationships are mostly linear, no retained predictor is perfectly collinear with the others (the redundant HouseAge was dropped automatically), and the errors are centered around zero with fairly constant variance. Because of that, the OLS estimates of house prices can be considered reliable; the diagnostic checks in Section E support this.

#Disclosure (LLM use): I used ChatGPT (GPT-5) to assist with coding and writing and suggest some interpretation templates. All outputs were reviewed and verified.