# Regression and Outliers

Outliers: abnormal data points. From a regression viewpoint there are two types: a point can be far from the other points, far from the model, or both. Remove it only if it is both.

How to check:
a. Compare the following before and after removing the outlier: leverage, Cook's distance, DFITS, coefficients, R², or even MSE (rule of thumb: 20% difference). If they differ, remove it.
b. Compare leverage, Cook's distance, and DFITS across all observations.
Leverage Hi: a point’s ability to drag the model if its y changed to y+1. Above 3p/n means significant leverage, where p is the number of terms (vars) and n is the number of observations. 0 means no leverage; 1 means the fitted y is dragged by 1.
Cook's distance Ci: influence if the point is removed, measured as the distance between the fitted points with vs. without the suspect point. The point could be an outlier, have large leverage, or both. Threshold: Ci > 1, or Ci above F(0.5, p, n-p). If so, we call it an influential point.
DFITS: number of sigma the fitted value changes if the point is removed; a high value means high influence.
Input normality for regression, and for ExpoSmoothing and Decomposition too? No.
Can autocorrelated vals be normally distributed?
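
A minimal sketch of the three diagnostics above, assuming the simplest case of a one-predictor regression y = b0 + b1*x (so p = 2). The function name `simple_ols_diagnostics` and the data are made up for illustration; the formulas are the standard closed forms for leverage, Cook's distance and DFITS, and the cut-offs are the 3p/n and Cook's > 1 rules from these notes.

```python
from math import sqrt

def simple_ols_diagnostics(x, y):
    """Leverage, Cook's distance and DFITS for the fit y = b0 + b1*x (p = 2)."""
    n, p = len(x), 2
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)

    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx           # leverage Hi
        cook = e * e / (p * mse) * h / (1 - h) ** 2   # Cook's distance Ci
        # Error variance with this point left out, then the R-student residual
        s2_del = ((n - p) * mse - e * e / (1 - h)) / (n - p - 1)
        t = e / sqrt(s2_del * (1 - h))
        dfits = t * sqrt(h / (1 - h))                 # scaled change in the fitted value
        rows.append({
            "h": h, "cook": cook, "dfits": dfits,
            "high_leverage": h > 3 * p / n,           # 3p/n rule from the notes
            "influential": cook > 1,                  # Cook's > 1 rule from the notes
        })
    return rows

# Hypothetical data: nine points near y = x, plus one point far out in x and off the line
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 9.2, 12.0]
diag = simple_ols_diagnostics(x, y)
for i, row in enumerate(diag):
    print(i, round(row["h"], 3), round(row["cook"], 3), round(row["dfits"], 3))
```

The leverages always add up to p, which is a quick sanity check on any output. With more than one predictor the same quantities come from the hat matrix, which this sketch does not cover.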

Make table: leverage, Cook's, DFITS: symbol, definition, formula, significance value, remove outlier?
Exam: check Minitab outlier results!

Remove outlier acf spike.
ARIMA 101, 111, 112. Say in doc: SAS adds acf1 to residuals.

White noise: randomness of residuals. High spikes mean the residuals are not random but autocorrelated; another way to check, like ACF & PACF.
For obs (ARIMA): ACF, PACF, Q, DW, Dickey-Fuller tests. Need several spikes to make obs bad.
For residuals of all models: ACF, PACF, white noise. One spike is enough to make the model bad.
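
A rough sketch of the residual check described above: sample ACF, spikes outside an approximate 95% band of ±2/sqrt(n), and the Q statistic in its Ljung-Box form. The function names and the residual series are hypothetical, and the band is the usual large-sample approximation rather than anything specific to Minitab or SAS.

```python
from math import sqrt

def acf(series, max_lag):
    """Sample autocorrelations r_1..r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

def significant_lags(series, max_lag):
    """Lags whose ACF falls outside the approximate 95% band."""
    bound = 2 / sqrt(len(series))
    return [k for k, rk in enumerate(acf(series, max_lag), start=1)
            if abs(rk) > bound]

def ljung_box_q(series, max_lag):
    """Q = n(n+2) * sum(r_k^2 / (n-k)); compare with a chi-square critical value."""
    n = len(series)
    return n * (n + 2) * sum(rk * rk / (n - k)
                             for k, rk in enumerate(acf(series, max_lag), start=1))

# Hypothetical residuals that flip sign every period (strongly autocorrelated)
resid = [1.0, -1.0] * 10
print(acf(resid, 3))
print(significant_lags(resid, 3))
print(ljung_box_q(resid, 1))
```

For these residuals a single lag already gives a Q far above 3.841, the 5% chi-square critical value for 1 degree of freedom, so they are not white noise.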

Not always necessary to remove outliers. But if I want to remove them: transformation of outliers. Study outliers only for regression.
Outliers are either far from the data (high leverage), far from the model (large residual), or both. If both: bad outlier, remove it. If not far from model: good, keep it. If far from model but not from data: it's OK, keep it.
Technique: fit with and without the outlier; if the coeffs change, it should be removed.
Make sure your data is not sparse, check: p vs n so you get significant vars with large DF (we got df=1 for sparse data).
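
The with/without technique can be sketched as follows, assuming a one-predictor regression and reusing the 20% rule of thumb from the notes as the tolerance. `fit_line`, `compare_without` and the data are hypothetical names and values for illustration only.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns coefficients and MSE."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return {"b0": b0, "b1": b1, "mse": sse / (n - 2)}

def compare_without(x, y, idx, tol=0.20):
    """Refit without observation idx and report relative changes.

    Assumes none of the full-fit values is exactly zero.
    """
    full = fit_line(x, y)
    reduced = fit_line(x[:idx] + x[idx + 1:], y[:idx] + y[idx + 1:])
    change = {k: abs(reduced[k] - full[k]) / abs(full[k]) for k in full}
    return change, any(c > tol for c in change.values())

# Hypothetical data: the last point is the suspect
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 9.2, 12.0]
change, flagged = compare_without(x, y, idx=9)
print(change, flagged)
```

Here the slope moves by well over 20% once the suspect point is dropped, so the point is flagged.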

Heteroscedasticity: residuals of all models. Obs & residuals of regression.
plot 1 2 1. For obs.
R-student residuals: standardized (studentized) residuals, used to check for heteroscedasticity.
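
One crude way to put a number on heteroscedasticity is to order the residuals by the independent variable and compare the variance of the upper half with the lower half. This is not from the notes and is not a formal test; it is only in the spirit of a split-sample variance comparison, and `funnel_ratio` and the data are hypothetical.

```python
def variance(values):
    """Sample variance with n-1 in the denominator."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def funnel_ratio(x, resid):
    """Variance of residuals at high x divided by variance at low x."""
    ordered = [e for _, e in sorted(zip(x, resid))]
    half = len(ordered) // 2
    return variance(ordered[-half:]) / variance(ordered[:half])

x = list(range(1, 11))
# Hypothetical residuals whose spread grows with x
widening = [0.1, -0.1, 0.2, -0.2, 0.3, -0.5, 0.8, -1.0, 1.5, -2.0]
# Hypothetical residuals with constant spread
steady = [0.5, -0.5] * 5
print(funnel_ratio(x, widening))
print(funnel_ratio(x, steady))
```

A ratio near 1 is consistent with constant variance; a ratio far from 1 is a hint to look at the residual plot more closely.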

Obs normality: mainly for regression, maybe ARIMA.
No assumptions for MA, ExpoSmooth, or Decomp obs. Residual normality: all models.

Show acf pacf residuals.

4.a linearity, heteroscedasticity. Typical values. Bring t table. Z2*rmse. Q1,2,3. Skewness~0?
Residuals: systematically related: funneling (variance keeps changing as the indep var increases). Evenly distributed.
Will give us. Print ARIMA choice criteria. Print past exam solutions.
Cook's values from 0..1.
Check for stat tests on deviation from assumptions (normality, mean=0?, outlier, linearity, funnel, autocorr).
ACF & PACF: stationarity, degree of autocorr (lag 1,2,3..), seasonality (helps in choosing , upward direction.
4.a. DW low: autocorr.
Overfit? If the PACF & ACF of ARIMA11 are the same as ARIMA22, then 22 is overfitting (not adding better ACF & PACF). Overfit is bad since it fails on big changes. Counterbalance AR with MA!
Use lag in regression to integrate AR into it & to check for seasonality: if the M12 coeff is high, there is 12-month seasonality!
Study Case 3-5.
Time: 1,2.. for each period (year-month).
Stepwise regression: add one var and check the coeffs.
Will bring: write an exec summary to management.
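
For the "DW low: autocorr" note, a minimal sketch of the Durbin-Watson statistic computed directly from its definition. The function name and the residual series are hypothetical; DW runs from 0 to 4, with values near 2 suggesting no lag-1 autocorrelation, low values positive autocorrelation, and high values negative autocorrelation.

```python
def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

# Hypothetical residuals that stay on one side for a while (positive autocorrelation)
sticky = [1, 1, 1, 1, -1, -1, -1, -1]
# Hypothetical residuals that flip sign every period (negative autocorrelation)
flipping = [1, -1] * 4
print(durbin_watson(sticky))
print(durbin_watson(flipping))
```

DW is roughly 2(1 - r1), where r1 is the lag-1 autocorrelation of the residuals, so it only speaks to lag 1.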

Outlier for observations and residuals? Q/DW for residuals too (they only need 1 ACF/PACF spike to be bad, no need for Q, i.e. avg ACF). Hetero is for residuals too; obs check not enough?