Prediction

Bivariate Normal Distribution

Let \((X,Y)\) be bivariate normal.

\[ f_{X,Y}(x,y) = \dfrac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\Big\{-\dfrac{1}{2(1-\rho^2)}\Big[\Big(\dfrac{x-\mu_X}{\sigma_X}\Big)^2 + \Big(\dfrac{y-\mu_Y}{\sigma_Y}\Big)^2 - 2\rho\Big(\dfrac{x-\mu_X}{\sigma_X}\Big)\Big(\dfrac{y-\mu_Y}{\sigma_Y}\Big)\Big]\Big\}. \]

\(-\infty < \mu_X < \infty\) and \(-\infty < \mu_Y < \infty.\)

\(0 < \sigma_X,\sigma_Y\) and \(-1 < \rho < 1.\)
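As a quick numerical sketch (the helper name dbvnorm is ours, not from any package), this density can be evaluated directly from the formula:

Code
dbvnorm <- function(x, y, muX = 0, muY = 0, sigX = 1, sigY = 1, rho = 0) {
  # Standardize and evaluate the quadratic form in the exponent
  zx <- (x - muX) / sigX
  zy <- (y - muY) / sigY
  q  <- (zx^2 + zy^2 - 2 * rho * zx * zy) / (1 - rho^2)
  exp(-q / 2) / (2 * pi * sigX * sigY * sqrt(1 - rho^2))
}
dbvnorm(0, 0, rho = 0.6)  # density at the mean when rho = 0.6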

Marginals

\[ X \sim \mathcal N(\mu_X,\sigma_X^2) \]

\[ Y \sim \mathcal N(\mu_Y,\sigma_Y^2). \]

Exercise: Show

\[ f_X(x) = \dfrac{1}{\sqrt{2\pi}\sigma_X}\exp\Big\{-\dfrac{1}{2}\Big(\dfrac{x-\mu_X}{\sigma_X}\Big)^2\Big\} \]

Exercise: Show

\[ f_Y(y) = \dfrac{1}{\sqrt{2\pi}\sigma_Y}\exp\Big\{-\dfrac{1}{2}\Big(\dfrac{y-\mu_Y}{\sigma_Y}\Big)^2\Big\}. \]
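As a numerical sanity check (not a solution to the exercises), one can integrate the joint density over \(y\) at a fixed \(x\) and compare with the claimed marginal; the parameter values below are arbitrary choices:

Code
# Integrate the joint density over y and compare with dnorm(x0, muX, sigX)
muX <- 0; muY <- 1; sigX <- 2; sigY <- 3; rho <- 0.6
f_joint <- function(y, x) {
  zx <- (x - muX) / sigX; zy <- (y - muY) / sigY
  q  <- (zx^2 + zy^2 - 2 * rho * zx * zy) / (1 - rho^2)
  exp(-q / 2) / (2 * pi * sigX * sigY * sqrt(1 - rho^2))
}
x0 <- 1.5
c(numerical   = integrate(f_joint, -Inf, Inf, x = x0)$value,
  closed_form = dnorm(x0, muX, sigX))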

Conditionals

\[ Y\mid X=x \sim \mathcal N\!\Big(\mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X),\ \sigma_Y^2(1-\rho^2)\Big). \]

\[ f_{Y\mid X}(y\mid x) = \dfrac{1}{\sqrt{2\pi}\,\sigma_Y\sqrt{1-\rho^2}}\exp\Big\{-\dfrac{1}{2}\dfrac{\big(y-\mu_Y-\rho\tfrac{\sigma_Y}{\sigma_X}(x-\mu_X)\big)^2}{\sigma_Y^2(1-\rho^2)}\Big\}. \]

\[ \operatorname{E}[Y\mid X=x] = \mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X). \]

\[ \operatorname{Var}(Y\mid X=x) = \sigma_Y^2(1-\rho^2). \]
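These conditional moments can be checked by Monte Carlo: simulate many pairs via the standard construction \(X = \mu_X + \sigma_X Z_1\), \(Y = \mu_Y + \sigma_Y(\rho Z_1 + \sqrt{1-\rho^2}\,Z_2)\) and look at the \(Y\)-values whose \(X\) lands near a chosen \(x_0\) (a sketch, with arbitrary parameter values):

Code
set.seed(42)
muX <- 0; muY <- 1; sigX <- 2; sigY <- 3; rho <- 0.6
n  <- 1e6
z1 <- rnorm(n); z2 <- rnorm(n)
X  <- muX + sigX * z1
Y  <- muY + sigY * (rho * z1 + sqrt(1 - rho^2) * z2)
x0   <- 1
near <- abs(X - x0) < 0.05   # sample points with X close to x0
c(mc_mean     = mean(Y[near]),
  theory_mean = muY + rho * (sigY / sigX) * (x0 - muX),
  mc_var      = var(Y[near]),
  theory_var  = sigY^2 * (1 - rho^2))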

Covariance and Correlation

\[ \operatorname{Cov}(X,Y) = \rho\sigma_X\sigma_Y. \]

\[ \operatorname{Corr}(X,Y) = \rho. \]

Note: For a bivariate normal, \(X\) and \(Y\) are independent if and only if \(\rho = 0\).
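A quick simulation check of the covariance and correlation formulas (illustrative only; parameter values are arbitrary):

Code
set.seed(7)
sigX <- 2; sigY <- 3; rho <- 0.6
z1 <- rnorm(1e6); z2 <- rnorm(1e6)
X <- sigX * z1
Y <- sigY * (rho * z1 + sqrt(1 - rho^2) * z2)
c(sample_cov = cov(X, Y), theory_cov = rho * sigX * sigY,
  sample_cor = cor(X, Y), theory_cor = rho)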

Prediction via Minimum MSE

We observe one random variable and, based on its observed value, predict a second random variable.

Let \(g(X)\) denote the predictor function.

Choose a function \(g(X)\) so that it tends to be close to \(Y\).

Criterion: minimize the mean squared error (MSE), i.e., minimize \[ \operatorname{E}[(Y-g(X))^2]. \]

Claim: The best predictor is

\[ g(x)=\operatorname{E}[Y\mid X=x]. \]

Proof:

Condition on \(X\) and then take expectations.

\[ \operatorname{E}[(Y-g(X))^2\mid X] = \operatorname{E}[(Y-\operatorname{E}[Y\mid X]+\operatorname{E}[Y\mid X]-g(X))^2\mid X] \]

\[ = \operatorname{E}[(Y-\operatorname{E}[Y\mid X])^2\mid X] + \operatorname{E}[(\operatorname{E}[Y\mid X]-g(X))^2|X] + 2\operatorname{E}[(Y-\operatorname{E}[Y\mid X])(\operatorname{E}[Y\mid X]-g(X))\mid X]. \]

Note: Given \(X\), \(\operatorname{E}[Y\mid X]-g(X)\) can be treated as a constant.

\[ \operatorname{E}[Y-\operatorname{E}[Y\mid X]\mid X]=0 \Rightarrow \text{cross term}=0. \]

\[ \Rightarrow \operatorname{E}[(Y-g(X))^2\mid X] \ge \operatorname{E}[(Y-\operatorname{E}[Y\mid X])^2\mid X]. \]

\[ \Rightarrow \operatorname{E}[(Y-g(X))^2] \ge \operatorname{E}[(Y-\operatorname{E}[Y\mid X])^2]. \]
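To see the claim in action outside the normal family, here is a small simulation of our own construction in which the conditional mean is known exactly: \(Y = X^2 + \varepsilon\) with \(X,\varepsilon\) independent standard normal, so \(\operatorname{E}[Y\mid X] = X^2\). The conditional-mean predictor beats the competitors:

Code
set.seed(3)
n <- 1e5
X <- rnorm(n)
Y <- X^2 + rnorm(n)
c(cond_mean = mean((Y - X^2)^2),     # near Var(eps) = 1
  blp       = mean((Y - mean(Y))^2), # here Cov(X, Y) = 0, so the BLP is the constant E[Y]
  naive     = mean((Y - X)^2))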

Exercise: Show

\[ \operatorname{E}[(X-a)^2] \text{ is minimized at } a=\operatorname{E}[X]. \]
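A numerical illustration of the exercise (not a proof): for a sample, the \(a\) minimizing \(\tfrac1n\sum_i (x_i - a)^2\) coincides with the sample mean, whatever the underlying distribution:

Code
set.seed(9)
x <- rexp(1e4, rate = 0.5)   # any distribution works
opt <- optimize(function(a) mean((x - a)^2), interval = range(x))
c(argmin = opt$minimum, sample_mean = mean(x))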

Best Linear Predictor

It sometimes happens that the joint density of \(X\) and \(Y\) is not completely known, or that \(\operatorname{E}[Y\mid X]\) is difficult to calculate. If, however, \(\mu_X, \mu_Y, \sigma^2_X, \sigma^2_Y, \sigma_{XY}\) are known, we can at least determine the Best Linear Predictor of \(Y\) on \(X\).

Assume \[ g(X)=a+bX \text{ and minimize } \operatorname{E}[(Y-(a+bX))^2]. \]

Answer:

\[ \dfrac{\partial}{\partial a}\operatorname{E}[(Y-(a+bX))^2] = -2\operatorname{E}[Y] + 2a + 2b\operatorname{E}[X] = 0. \]

\[ \dfrac{\partial}{\partial b}\operatorname{E}[(Y-(a+bX))^2] = -2\operatorname{E}[XY] + 2a\operatorname{E}[X] + 2b\operatorname{E}[X^2] = 0. \]

\[ b = \dfrac{\operatorname{E}[XY]-\operatorname{E}[X]\operatorname{E}[Y]}{\operatorname{E}[X^2]-(\operatorname{E}[X])^2} = \dfrac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} = \rho\dfrac{\sigma_Y}{\sigma_X}. \]

\[ a = \operatorname{E}[Y] - b\operatorname{E}[X] = \mu_Y - \rho\dfrac{\sigma_Y}{\sigma_X}\mu_X. \]

\[ g(x) = a + b x = \mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X). \]

\[ \text{(intercept } = \mu_Y - \rho\tfrac{\sigma_Y}{\sigma_X}\mu_X,\ \text{slope } = \rho\tfrac{\sigma_Y}{\sigma_X}). \]
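These formulas are easy to package and check against least squares: a small helper of our own (not from any package) computes the BLP coefficients from the five moments, and lm() recovers essentially the same line from simulated data, since the OLS slope is the sample Cov(X,Y)/Var(X):

Code
blp <- function(muX, muY, sigX, sigY, rho) {
  b <- rho * sigY / sigX
  c(intercept = muY - b * muX, slope = b)
}
set.seed(5)
z1 <- rnorm(1e5); z2 <- rnorm(1e5)
X <- 0 + 2 * z1                                  # muX = 0, sigX = 2
Y <- 1 + 3 * (0.6 * z1 + sqrt(1 - 0.6^2) * z2)   # muY = 1, sigY = 3, rho = 0.6
rbind(moments = blp(0, 1, 2, 3, 0.6),
      lm_fit  = coef(lm(Y ~ X)))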

Specialization to the Bivariate Normal

\[ g(x) = \operatorname{E}[Y\mid X=x] = \mu_Y + \rho\dfrac{\sigma_Y}{\sigma_X}(x-\mu_X). \]

\[ \operatorname{MSE}(g) = \operatorname{E}[(Y-\mu_Y-\rho\tfrac{\sigma_Y}{\sigma_X}(X-\mu_X))^2] = \sigma_Y^2(1-\rho^2). \]

Note: For \((X,Y)\) bivariate normal, \(\operatorname{E}[Y\mid X]\) is linear in \(X\), so the Best Predictor coincides with the Best Linear Predictor.

Note: For \((X,Y)\) bivariate normal, the Best Linear Predictor has MSE \(\operatorname{Var}(Y\mid X)=\sigma_Y^2(1-\rho^2)\).

Note: Thus the larger \(|\rho|\), the smaller the prediction error.

R Example

Code
set.seed(1)
n   <- 2000
rho <- 0.6
muX <- 0; muY <- 1
sigX <- 2; sigY <- 3

# Covariance matrix of (X, Y)
Sigma <- matrix(c(sigX^2,        rho*sigX*sigY,
                  rho*sigX*sigY, sigY^2), 2, 2)

# Draw n pairs from the bivariate normal (requires the MASS package)
Z <- MASS::mvrnorm(n, mu = c(muX, muY), Sigma = Sigma)
X <- Z[,1]; Y <- Z[,2]

# Best Linear Predictor g(X) = a + bX
b  <- rho * sigY / sigX
a  <- muY - b * muX
gX <- a + b * X

# Sample MSE of g(X) vs. the theoretical value sigY^2 * (1 - rho^2)
c(mean_mse = mean((Y - gX)^2),
  theory   = sigY^2 * (1 - rho^2))
mean_mse   theory 
6.372515 5.760000