Prediction

Bivariate Normal Distribution

Let \((X,Y)\) be bivariate normal.

\[ f_{X,Y}(x,y) = \dfrac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\Big\{-\dfrac{1}{2(1-\rho^2)}\Big[\Big(\dfrac{x-\mu_X}{\sigma_X}\Big)^2 + \Big(\dfrac{y-\mu_Y}{\sigma_Y}\Big)^2 - 2\rho\Big(\dfrac{x-\mu_X}{\sigma_X}\Big)\Big(\dfrac{y-\mu_Y}{\sigma_Y}\Big)\Big]\Big\}. \]

\(-\infty < \mu_X < \infty\) and \(-\infty < \mu_Y < \infty.\)

\(0 < \sigma_X,\sigma_Y\) and \(-1 < \rho < 1.\)
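As a quick numerical sketch (the helper name dbvnorm is ours, not from any package), this density can be evaluated directly from the formula:

Code
dbvnorm <- function(x, y, muX = 0, muY = 0, sigX = 1, sigY = 1, rho = 0) {
  # Standardize and evaluate the quadratic form in the exponent
  zx <- (x - muX) / sigX
  zy <- (y - muY) / sigY
  q  <- (zx^2 + zy^2 - 2 * rho * zx * zy) / (1 - rho^2)
  exp(-q / 2) / (2 * pi * sigX * sigY * sqrt(1 - rho^2))
}
dbvnorm(0, 0, rho = 0.6)  # density at the mean when rho = 0.6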

Marginals

\[ X \sim \mathcal N(\mu_X,\sigma_X^2) \]

\[ Y \sim \mathcal N(\mu_Y,\sigma_Y^2). \]

Exercise: Show

\[ f_X(x) = \dfrac{1}{\sqrt{2\pi}\sigma_X}\exp\Big\{-\dfrac{1}{2}\Big(\dfrac{x-\mu_X}{\sigma_X}\Big)^2\Big\} \]

Exercise: Show

\[ f_Y(y) = \dfrac{1}{\sqrt{2\pi}\sigma_Y}\exp\Big\{-\dfrac{1}{2}\Big(\dfrac{y-\mu_Y}{\sigma_Y}\Big)^2\Big\}. \]
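As a numerical sanity check (not a solution to the exercises), one can integrate the joint density over \(y\) at a fixed \(x\) and compare with the claimed marginal; the parameter values below are arbitrary choices:

Code
# Integrate the joint density over y and compare with dnorm(x0, muX, sigX)
muX <- 0; muY <- 1; sigX <- 2; sigY <- 3; rho <- 0.6
f_joint <- function(y, x) {
  zx <- (x - muX) / sigX; zy <- (y - muY) / sigY
  q  <- (zx^2 + zy^2 - 2 * rho * zx * zy) / (1 - rho^2)
  exp(-q / 2) / (2 * pi * sigX * sigY * sqrt(1 - rho^2))
}
x0 <- 1.5
c(numerical   = integrate(f_joint, -Inf, Inf, x = x0)$value,
  closed_form = dnorm(x0, muX, sigX))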

Conditionals

\[ Y\mid X=x \sim \mathcal N\!\Big(\mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X),\ \sigma_Y^2(1-\rho^2)\Big). \]

\[ f_{Y\mid X}(y\mid x) = \dfrac{1}{\sqrt{2\pi}\,\sigma_Y\sqrt{1-\rho^2}}\exp\Big\{-\dfrac{1}{2}\dfrac{\big(y-\mu_Y-\rho\tfrac{\sigma_Y}{\sigma_X}(x-\mu_X)\big)^2}{\sigma_Y^2(1-\rho^2)}\Big\}. \]

\[ \operatorname{E}[Y\mid X=x] = \mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X). \]

\[ \operatorname{Var}(Y\mid X=x) = \sigma_Y^2(1-\rho^2). \]
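These conditional moments can be checked by Monte Carlo: simulate many pairs via the standard construction \(X = \mu_X + \sigma_X Z_1\), \(Y = \mu_Y + \sigma_Y(\rho Z_1 + \sqrt{1-\rho^2}\,Z_2)\) and look at the \(Y\)-values whose \(X\) lands near a chosen \(x_0\) (a sketch, with arbitrary parameter values):

Code
set.seed(42)
muX <- 0; muY <- 1; sigX <- 2; sigY <- 3; rho <- 0.6
n  <- 1e6
z1 <- rnorm(n); z2 <- rnorm(n)
X  <- muX + sigX * z1
Y  <- muY + sigY * (rho * z1 + sqrt(1 - rho^2) * z2)
x0   <- 1
near <- abs(X - x0) < 0.05   # sample points with X close to x0
c(mc_mean     = mean(Y[near]),
  theory_mean = muY + rho * (sigY / sigX) * (x0 - muX),
  mc_var      = var(Y[near]),
  theory_var  = sigY^2 * (1 - rho^2))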

Covariance and Correlation

\[ \operatorname{Cov}(X,Y) = \rho\sigma_X\sigma_Y. \]

\[ \operatorname{Corr}(X,Y) = \rho. \]

Note: For a bivariate normal, \(X\) and \(Y\) are independent if and only if \(\rho = 0\).
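A quick simulation check of the covariance and correlation formulas (illustrative only; parameter values are arbitrary):

Code
set.seed(7)
sigX <- 2; sigY <- 3; rho <- 0.6
z1 <- rnorm(1e6); z2 <- rnorm(1e6)
X <- sigX * z1
Y <- sigY * (rho * z1 + sqrt(1 - rho^2) * z2)
c(sample_cov = cov(X, Y), theory_cov = rho * sigX * sigY,
  sample_cor = cor(X, Y), theory_cor = rho)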

Prediction via Minimum MSE

We observe one random variable and, based on its observed value, predict a second random variable.

Let \(g(X)\) denote the predictor function.

Choose a function \(g(X)\) so that it tends to be close to \(Y\).

Criterion: minimize the mean squared error (MSE), i.e., minimize \[ \operatorname{E}[(Y-g(X))^2]. \]

Claim: The best predictor is

\[ g(x)=\operatorname{E}[Y\mid X=x]. \]

Proof:

Condition on \(X\) and then take expectations.

\[ \operatorname{E}[(Y-g(X))^2\mid X] = \operatorname{E}[(Y-\operatorname{E}[Y\mid X]+\operatorname{E}[Y\mid X]-g(X))^2\mid X] \]

\[ = \operatorname{E}[(Y-\operatorname{E}[Y\mid X])^2\mid X] + \operatorname{E}[(\operatorname{E}[Y\mid X]-g(X))^2|X] + 2\operatorname{E}[(Y-\operatorname{E}[Y\mid X])(\operatorname{E}[Y\mid X]-g(X))\mid X]. \]

Note: Given \(X\), \(\operatorname{E}[Y\mid X]-g(X)\) can be treated as a constant.

\[ \operatorname{E}[Y-\operatorname{E}[Y\mid X]\mid X]=0 \Rightarrow \text{cross term}=0. \]

\[ \Rightarrow \operatorname{E}[(Y-g(X))^2\mid X] \ge \operatorname{E}[(Y-\operatorname{E}[Y\mid X])^2\mid X]. \]

\[ \Rightarrow \operatorname{E}[(Y-g(X))^2] \ge \operatorname{E}[(Y-\operatorname{E}[Y\mid X])^2]. \]
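To see the claim in action outside the normal family, here is a small simulation of our own construction in which the conditional mean is known exactly: \(Y = X^2 + \varepsilon\) with \(X,\varepsilon\) independent standard normal, so \(\operatorname{E}[Y\mid X] = X^2\). The conditional-mean predictor beats the competitors:

Code
set.seed(3)
n <- 1e5
X <- rnorm(n)
Y <- X^2 + rnorm(n)
c(cond_mean = mean((Y - X^2)^2),     # near Var(eps) = 1
  blp       = mean((Y - mean(Y))^2), # here Cov(X, Y) = 0, so the BLP is the constant E[Y]
  naive     = mean((Y - X)^2))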

Exercise: Show

\[ \operatorname{E}[(X-a)^2] \text{ is minimized at } a=\operatorname{E}[X]. \]
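A numerical illustration of the exercise (not a proof): for a sample, the \(a\) minimizing \(\tfrac1n\sum_i (x_i - a)^2\) coincides with the sample mean, whatever the underlying distribution:

Code
set.seed(9)
x <- rexp(1e4, rate = 0.5)   # any distribution works
opt <- optimize(function(a) mean((x - a)^2), interval = range(x))
c(argmin = opt$minimum, sample_mean = mean(x))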

Best Linear Predictor

It sometimes happens that the joint density of \(X\) and \(Y\) is not completely known, or that \(\operatorname{E}[Y\mid X]\) is difficult to calculate. If, however, \(\mu_X, \mu_Y, \sigma^2_X, \sigma^2_Y, \sigma_{XY}\) are known, we can at least determine the Best Linear Predictor of \(Y\) on \(X\).

Assume \[ g(X)=a+bX \text{ and minimize } \operatorname{E}[(Y-(a+bX))^2]. \]

Answer:

\[ \dfrac{\partial}{\partial a}\operatorname{E}[(Y-(a+bX))^2] = -2\operatorname{E}[Y] + 2a + 2b\operatorname{E}[X] = 0. \]

\[ \dfrac{\partial}{\partial b}\operatorname{E}[(Y-(a+bX))^2] = -2\operatorname{E}[XY] + 2a\operatorname{E}[X] + 2b\operatorname{E}[X^2] = 0. \]

\[ b = \dfrac{\operatorname{E}[XY]-\operatorname{E}[X]\operatorname{E}[Y]}{\operatorname{E}[X^2]-(\operatorname{E}[X])^2} = \dfrac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} = \rho\dfrac{\sigma_Y}{\sigma_X}. \]

\[ a = \operatorname{E}[Y] - b\operatorname{E}[X] = \mu_Y - \rho\dfrac{\sigma_Y}{\sigma_X}\mu_X. \]

\[ g(x) = a + b x = \mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X). \]

\[ \text{(intercept } = \mu_Y - \rho\tfrac{\sigma_Y}{\sigma_X}\mu_X,\ \text{slope } = \rho\tfrac{\sigma_Y}{\sigma_X}). \]
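These formulas are easy to package and check against least squares: a small helper of our own (not from any package) computes the BLP coefficients from the five moments, and lm() recovers essentially the same line from simulated data, since the OLS slope is the sample Cov(X,Y)/Var(X):

Code
blp <- function(muX, muY, sigX, sigY, rho) {
  b <- rho * sigY / sigX
  c(intercept = muY - b * muX, slope = b)
}
set.seed(5)
z1 <- rnorm(1e5); z2 <- rnorm(1e5)
X <- 0 + 2 * z1                                  # muX = 0, sigX = 2
Y <- 1 + 3 * (0.6 * z1 + sqrt(1 - 0.6^2) * z2)   # muY = 1, sigY = 3, rho = 0.6
rbind(moments = blp(0, 1, 2, 3, 0.6),
      lm_fit  = coef(lm(Y ~ X)))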

Specialization to the Bivariate Normal

\[ g(x) = \operatorname{E}[Y\mid X=x] = \mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X). \]

\[ \operatorname{MSE}(g) = \operatorname{E}[(Y-\mu_Y-\rho\tfrac{\sigma_Y}{\sigma_X}(X-\mu_X))^2] = \sigma_Y^2(1-\rho^2). \]

Note: For \((X,Y)\) bivariate normal, \(\operatorname{E}[Y\mid X]\) is linear in \(X\), so the Best Predictor coincides with the Best Linear Predictor.

Note: For \((X,Y)\) bivariate normal, the Best Linear Predictor has MSE \(\operatorname{Var}(Y\mid X)=\sigma_Y^2(1-\rho^2)\).

Note: Thus the larger \(|\rho|\), the smaller the prediction error.

R Example

Code
set.seed(1)
n   <- 2000
rho <- 0.6
muX <- 0; muY <- 1
sigX <- 2; sigY <- 3

# Covariance matrix of (X, Y)
Sigma <- matrix(c(sigX^2,        rho*sigX*sigY,
                  rho*sigX*sigY, sigY^2), 2, 2)

# Draw n pairs from the bivariate normal (requires the MASS package)
Z <- MASS::mvrnorm(n, mu = c(muX, muY), Sigma = Sigma)
X <- Z[,1]; Y <- Z[,2]

# Best Linear Predictor g(X) = a + bX
b  <- rho * sigY / sigX
a  <- muY - b * muX
gX <- a + b * X

# Sample MSE of g(X) vs. the theoretical value sigY^2 * (1 - rho^2)
c(mean_mse = mean((Y - gX)^2),
  theory   = sigY^2 * (1 - rho^2))
mean_mse   theory 
6.372515 5.760000