Prediction
Bivariate Normal Distribution
Let \((X,Y)\) be bivariate normal.
\[
f_{X,Y}(x,y) = \dfrac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\Big\{-\dfrac{1}{2(1-\rho^2)}\Big[\Big(\dfrac{x-\mu_X}{\sigma_X}\Big)^2 + \Big(\dfrac{y-\mu_Y}{\sigma_Y}\Big)^2 - 2\rho\Big(\dfrac{x-\mu_X}{\sigma_X}\Big)\Big(\dfrac{y-\mu_Y}{\sigma_Y}\Big)\Big]\Big\}.
\]
The parameters satisfy \(-\infty < \mu_X, \mu_Y < \infty\), \(\sigma_X, \sigma_Y > 0\), and \(-1 < \rho < 1\).
Marginals
\[
X \sim \mathcal N(\mu_X,\sigma_X^2)
\]
\[
Y \sim \mathcal N(\mu_Y,\sigma_Y^2).
\]
Exercise: Show
\[
f_X(x) = \dfrac{1}{\sqrt{2\pi}\sigma_X}\exp\Big\{-\dfrac{1}{2}\Big(\dfrac{x-\mu_X}{\sigma_X}\Big)^2\Big\}.
\]
Exercise: Show
\[
f_Y(y) = \dfrac{1}{\sqrt{2\pi}\sigma_Y}\exp\Big\{-\dfrac{1}{2}\Big(\dfrac{y-\mu_Y}{\sigma_Y}\Big)^2\Big\}.
\]
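As a quick numerical check of these exercises (a sketch, not a proof), we can code the joint density from the formula above, integrate out \(y\), and compare with the \(\mathcal N(\mu_X,\sigma_X^2)\) density. The function name f_joint is ours, and the parameter values are the illustrative ones used in the R example at the end of this section.

# Joint bivariate normal density, coded directly from the formula above
f_joint <- function(x, y, muX = 0, muY = 1, sX = 2, sY = 3, rho = 0.6) {
  zx <- (x - muX) / sX
  zy <- (y - muY) / sY
  exp(-(zx^2 + zy^2 - 2 * rho * zx * zy) / (2 * (1 - rho^2))) /
    (2 * pi * sX * sY * sqrt(1 - rho^2))
}
# Integrate out y at x = 1: should match the N(muX, sX^2) density at 1
fx_num <- integrate(function(y) f_joint(x = 1, y = y), -Inf, Inf)$value
c(numeric = fx_num, theory = dnorm(1, mean = 0, sd = 2))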
Conditionals
\[
Y\mid X=x \sim \mathcal N\!\Big(\mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X),\ \sigma_Y^2(1-\rho^2)\Big).
\]
\[
f_{Y\mid X}(y\mid x) = \dfrac{1}{\sqrt{2\pi}\,\sigma_Y\sqrt{1-\rho^2}}\exp\Big\{-\dfrac{1}{2}\dfrac{\big(y-\mu_Y-\rho\tfrac{\sigma_Y}{\sigma_X}(x-\mu_X)\big)^2}{\sigma_Y^2(1-\rho^2)}\Big\}.
\]
\[
\operatorname{E}[Y\mid X=x] = \mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X).
\]
\[
\operatorname{Var}(Y\mid X=x) = \sigma_Y^2(1-\rho^2).
\]
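The conditional moments can be checked by simulation (a sketch; the parameters are illustrative). Generating \(Y\) as its conditional mean plus independent \(\mathcal N(0,\sigma_Y^2(1-\rho^2))\) noise produces a valid bivariate normal pair, and we examine draws with \(X\) near \(x = 1\); the variable name slab is ours.

set.seed(2)
n <- 1e6
muX <- 0; muY <- 1; sX <- 2; sY <- 3; rho <- 0.6
X <- rnorm(n, muX, sX)
# Simulate Y from the conditional normal stated above
Y <- muY + rho * (sY / sX) * (X - muX) + rnorm(n, 0, sY * sqrt(1 - rho^2))
slab <- abs(X - 1) < 0.05        # draws with X near x = 1
c(cond_mean   = mean(Y[slab]),
  theory_mean = muY + rho * (sY / sX) * (1 - muX),
  cond_var    = var(Y[slab]),
  theory_var  = sY^2 * (1 - rho^2))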
Covariance and Correlation
\[
\operatorname{Cov}(X,Y) = \rho\sigma_X\sigma_Y.
\]
\[
\operatorname{Corr}(X,Y) = \rho.
\]
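One short way to see the covariance formula uses the conditional mean above together with \(\operatorname{Cov}(X,Y)=\operatorname{Cov}\big(X,\operatorname{E}[Y\mid X]\big)\), a consequence of the law of total covariance:
\[
\operatorname{Cov}(X,Y) = \operatorname{Cov}\Big(X,\ \mu_Y+\rho\dfrac{\sigma_Y}{\sigma_X}(X-\mu_X)\Big) = \rho\dfrac{\sigma_Y}{\sigma_X}\operatorname{Var}(X) = \rho\sigma_X\sigma_Y.
\]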
Note: For a bivariate normal, \(X\) and \(Y\) are independent if and only if \(\rho = 0\).
Prediction via Minimum MSE
We observe one random variable and, based on its observed value, predict a second random variable.
Let \(g(X)\) denote the predictor; we choose \(g\) so that \(g(X)\) tends to be close to \(Y\).
Criterion: minimize the mean squared error (MSE), i.e., minimize
\[
\operatorname{E}[(Y-g(X))^2].
\]
Claim: The best predictor is
\[
g(x)=\operatorname{E}[Y\mid X=x].
\]
Proof:
Condition on \(X\), then take expectations. Add and subtract \(\operatorname{E}[Y\mid X]\):
\[
\operatorname{E}[(Y-g(X))^2\mid X] = \operatorname{E}[(Y-\operatorname{E}[Y\mid X]+\operatorname{E}[Y\mid X]-g(X))^2\mid X]
\]
\[
= \operatorname{E}[(Y-\operatorname{E}[Y\mid X])^2\mid X] + \operatorname{E}[(\operatorname{E}[Y\mid X]-g(X))^2\mid X] + 2\operatorname{E}[(Y-\operatorname{E}[Y\mid X])(\operatorname{E}[Y\mid X]-g(X))\mid X].
\]
Note: given \(X\), \(\operatorname{E}[Y\mid X]-g(X)\) can be treated as a constant, so it factors out of the conditional expectation.
\[
\operatorname{E}[Y-\operatorname{E}[Y\mid X]\mid X]=0 \Rightarrow \text{cross term}=0.
\]
\[
\Rightarrow \operatorname{E}[(Y-g(X))^2\mid X] \ge \operatorname{E}[(Y-\operatorname{E}[Y\mid X])^2\mid X].
\]
Taking expectations of both sides (law of total expectation),
\[
\Rightarrow \operatorname{E}[(Y-g(X))^2] \ge \operatorname{E}[(Y-\operatorname{E}[Y\mid X])^2].
\]
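A small simulation illustrating the claim (a sketch, using the same illustrative bivariate normal parameters as the R example below): the conditional-mean predictor has smaller MSE than a competitor that ignores \(X\).

set.seed(3)
n <- 1e5
muX <- 0; muY <- 1; sX <- 2; sY <- 3; rho <- 0.6
X <- rnorm(n, muX, sX)
Y <- muY + rho * (sY / sX) * (X - muX) + rnorm(n, 0, sY * sqrt(1 - rho^2))
g_best <- muY + rho * (sY / sX) * (X - muX)   # E[Y | X]
g_alt  <- rep(muY, n)                         # constant predictor, ignores X
c(mse_best = mean((Y - g_best)^2),            # approx sY^2 * (1 - rho^2)
  mse_alt  = mean((Y - g_alt)^2))             # approx sY^2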
Exercise: Show
\[
\operatorname{E}[(X-a)^2] \text{ is minimized at } a=\operatorname{E}[X].
\]
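A numerical illustration of this exercise (not a proof): over a grid of candidate values \(a\), the empirical MSE is minimized near the sample mean. The variable names are ours.

set.seed(4)
x <- rnorm(1e5, mean = 3, sd = 2)
a_grid <- seq(1, 5, by = 0.01)
mse <- sapply(a_grid, function(a) mean((x - a)^2))
c(argmin = a_grid[which.min(mse)], sample_mean = mean(x))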
Best Linear Predictor
It sometimes happens that the joint density of \(X\) and \(Y\) is not completely known, or that \(\operatorname{E}[Y\mid X]\) is difficult to calculate. If, however, \(\mu_X,\ \mu_Y,\ \sigma^2_X,\ \sigma^2_Y,\ \sigma_{XY}\) are known, we can at least determine the Best Linear Predictor of \(Y\) on \(X\).
Assume \[
g(X)=a+bX \text{ and minimize } \operatorname{E}[(Y-(a+bX))^2].
\]
Answer: set both partial derivatives to zero and solve.
\[
\dfrac{\partial}{\partial a}\operatorname{E}[(Y-(a+bX))^2] = -2\operatorname{E}[Y] + 2a + 2b\operatorname{E}[X] = 0.
\]
\[
\dfrac{\partial}{\partial b}\operatorname{E}[(Y-(a+bX))^2] = -2\operatorname{E}[XY] + 2a\operatorname{E}[X] + 2b\operatorname{E}[X^2] = 0.
\]
\[
b = \dfrac{\operatorname{E}[XY]-\operatorname{E}[X]\operatorname{E}[Y]}{\operatorname{E}[X^2]-(\operatorname{E}[X])^2} = \dfrac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} = \rho\dfrac{\sigma_Y}{\sigma_X}.
\]
\[
a = \operatorname{E}[Y] - b\operatorname{E}[X] = \mu_Y - \rho\dfrac{\sigma_Y}{\sigma_X}\mu_X.
\]
\[
g(x) = a + b x = \mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X).
\]
\[
\text{(intercept } = \mu_Y - \rho\tfrac{\sigma_Y}{\sigma_X}\mu_X,\ \text{slope } = \rho\tfrac{\sigma_Y}{\sigma_X}).
\]
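As a sanity check (a sketch with illustrative parameters), an ordinary least squares fit of \(Y\) on \(X\) estimates the same intercept and slope, since OLS is the sample analogue of the best linear predictor.

set.seed(5)
n <- 1e5
muX <- 0; muY <- 1; sX <- 2; sY <- 3; rho <- 0.6
X <- rnorm(n, muX, sX)
Y <- muY + rho * (sY / sX) * (X - muX) + rnorm(n, 0, sY * sqrt(1 - rho^2))
coef(lm(Y ~ X))                                        # estimated (a, b)
c(a = muY - rho * (sY / sX) * muX, b = rho * sY / sX)  # theory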
Specialization to the Bivariate Normal
\[
g(x) = \operatorname{E}[Y\mid X=x] = \mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X).
\]
\[
\operatorname{MSE}(g) = \operatorname{E}[(Y-\mu_Y-\rho\tfrac{\sigma_Y}{\sigma_X}(X-\mu_X))^2] = \sigma_Y^2(1-\rho^2).
\]
Note: For \((X,Y)\) bivariate normal, \(\operatorname{E}[Y\mid X]\) is linear in \(X\), so the Best Predictor coincides with the Best Linear Predictor.
Note: For \((X,Y)\) bivariate normal, the Best Linear Predictor has MSE \(\operatorname{Var}(Y\mid X) = \sigma_Y^2(1-\rho^2)\).
Note: Thus, the larger \(|\rho|\), the lower the prediction error.
R Example
# Simulate from the bivariate normal and compare the empirical MSE of the
# best linear predictor with the theoretical value sigY^2 * (1 - rho^2)
set.seed(1)
n <- 2000
rho <- 0.6
muX <- 0; muY <- 1
sigX <- 2; sigY <- 3
Sigma <- matrix(c(sigX^2,            rho * sigX * sigY,
                  rho * sigX * sigY, sigY^2), 2, 2)
Z <- MASS::mvrnorm(n, mu = c(muX, muY), Sigma = Sigma)
X <- Z[, 1]; Y <- Z[, 2]
b <- rho * sigY / sigX          # slope of the best linear predictor
a <- muY - b * muX              # intercept
gX <- a + b * X
c(mean_mse = mean((Y - gX)^2),  # empirical MSE
  theory = sigY^2 * (1 - rho^2))
 mean_mse   theory 
 6.372515 5.760000