TL;DR
- Bayesian linear regression assumes the data follow a normal distribution given the parameters.
- The prior distributions for the regression coefficients and the variance are normal and inverse-gamma, respectively.
- Gibbs sampling is used to sample from the posterior distribution when direct computation is infeasible.
- The full conditional distributions for the coefficients and the variance are derived to facilitate sampling.
- The Gibbs sampling procedure alternates between sampling the regression coefficients and the variance to approximate the posterior distribution.
Bayesian Linear Regression and Gibbs Sampling
Bayesian linear regression is a statistical method in which the regression analysis is undertaken within the context of Bayesian inference. When performing Bayesian linear regression, we assume that the observed data $y$, given the parameters $\beta$ and $\sigma^2$, follows a normal distribution:

$$y | \beta, \sigma^2 \sim N(X\beta, \sigma^2 I)$$

Prior Distributions
The prior distributions for the parameters are set as follows.

For $\beta$:

$$\beta \sim N(\beta_0, \Lambda_0)$$

with the prior probability density function (pdf) given by:

$$f(\beta; \beta_0, \Lambda_0) = (2\pi)^{-\frac{k}{2}} \det(\Lambda_0)^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (\beta - \beta_0)' \Lambda_0^{-1} (\beta - \beta_0)\right)$$

where $k$ is the dimension of $\beta$.

For $\sigma^2$:

$$\sigma^2 \sim IG\left(\frac{a_0}{2}, \frac{b_0}{2}\right)$$

with the prior pdf given by:

$$f(x; \tfrac{a_0}{2}, \tfrac{b_0}{2}) = \frac{\left(\frac{b_0}{2}\right)^{\frac{a_0}{2}}}{\Gamma\left(\frac{a_0}{2}\right)}\, x^{-\left(\frac{a_0}{2} + 1\right)} \exp\left(-\frac{b_0}{2x}\right)$$
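To make the setup concrete before deriving the posterior, here is a minimal sketch in Python with NumPy. Nothing above prescribes these values; the simulated data, dimensions, and hyperparameters such as `beta_true`, `Lambda0`, `a0`, and `b0` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from the model y | beta, sigma^2 ~ N(X beta, sigma^2 I).
T, k = 200, 3                           # observations, regressors
X = rng.normal(size=(T, k))             # design matrix
beta_true = np.array([1.0, -2.0, 0.5])  # illustrative "true" coefficients
sigma2_true = 0.25
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=T)

# Prior hyperparameters: beta ~ N(beta0, Lambda0), sigma^2 ~ IG(a0/2, b0/2).
beta0 = np.zeros(k)
Lambda0 = 10.0 * np.eye(k)              # fairly diffuse prior covariance
a0, b0 = 2.0, 2.0                       # weak inverse-gamma prior
```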
Bayes' Theorem and Posterior Distribution

By Bayes' theorem, the posterior distribution is proportional to the product of the likelihood and the prior distributions:

$$f(\beta, \sigma^2 | y) \propto f(y | \beta, \sigma^2)\, \pi(\beta)\, \pi(\sigma^2)$$

Where:

The likelihood is given by:

$$f(y | \beta, \sigma^2) \propto (\sigma^2)^{-\frac{T}{2}} \exp\left(-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right)$$

where $T$ is the number of observations.

The prior for $\beta$ is:

$$\pi(\beta) \propto \exp\left(-\frac{1}{2} (\beta - \beta_0)' \Lambda_0^{-1} (\beta - \beta_0)\right)$$

The prior for $\sigma^2$ is:

$$\pi(\sigma^2) \propto (\sigma^2)^{-\left(\frac{a_0}{2} + 1\right)} \exp\left(-\frac{b_0}{2\sigma^2}\right)$$
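As a sanity check on this proportionality, the unnormalized log posterior can be evaluated directly. A minimal sketch continuing the NumPy setup above; the function name `log_posterior` is an illustrative choice.

```python
def log_posterior(beta, sigma2, y, X, beta0, Lambda0, a0, b0):
    """Unnormalized log posterior: log-likelihood plus log priors."""
    resid = y - X @ beta
    T = len(y)
    # log f(y | beta, sigma^2), up to an additive constant
    log_lik = -0.5 * T * np.log(sigma2) - resid @ resid / (2.0 * sigma2)
    # log pi(beta), up to an additive constant
    db = beta - beta0
    log_prior_beta = -0.5 * db @ np.linalg.solve(Lambda0, db)
    # log pi(sigma^2), up to an additive constant
    log_prior_sig2 = -(a0 / 2.0 + 1.0) * np.log(sigma2) - b0 / (2.0 * sigma2)
    return log_lik + log_prior_beta + log_prior_sig2
```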
Gibbs Sampling

Directly computing $\beta, \sigma^2 | y$ can be challenging, so we use Gibbs sampling, drawing in turn from the full conditional distributions $\beta | \sigma^2, y$ and $\sigma^2 | \beta, y$.

Full Conditional Distributions

The full conditional for $\beta$ is:

$$p(\beta | \sigma^2, y) \propto \exp\left(-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right) \exp\left(-\frac{1}{2} (\beta - \beta_0)' \Lambda_0^{-1} (\beta - \beta_0)\right)$$

Expanding the quadratic forms and dropping terms that do not involve $\beta$, this simplifies to:

$$p(\beta | \sigma^2, y) \propto \exp\left(-\frac{1}{2} \left[\beta' \left(\frac{X'X}{\sigma^2} + \Lambda_0^{-1}\right) \beta - 2\beta' \left(\frac{X'y}{\sigma^2} + \Lambda_0^{-1} \beta_0\right)\right]\right)$$
The full conditional for $\beta$ is a normal distribution. If we denote $\beta | \sigma^2, y \sim N(\beta_1, \Lambda_1)$, then:

$$\Lambda_1 = \left(\frac{X'X}{\sigma^2} + \Lambda_0^{-1}\right)^{-1}$$

$$\beta_1 = \Lambda_1 \left(\frac{X'y}{\sigma^2} + \Lambda_0^{-1} \beta_0\right)$$
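Drawing from this full conditional is then a single multivariate-normal sample once $\Lambda_1$ and $\beta_1$ are computed. A minimal sketch under the same NumPy setup; `sample_beta` is an assumed helper name.

```python
def sample_beta(sigma2, y, X, beta0, Lambda0, rng):
    """Draw beta from its full conditional N(beta1, Lambda1)."""
    Lambda0_inv = np.linalg.inv(Lambda0)
    # Lambda1 = (X'X / sigma^2 + Lambda0^{-1})^{-1}
    Lambda1 = np.linalg.inv(X.T @ X / sigma2 + Lambda0_inv)
    # beta1 = Lambda1 (X'y / sigma^2 + Lambda0^{-1} beta0)
    beta1 = Lambda1 @ (X.T @ y / sigma2 + Lambda0_inv @ beta0)
    return rng.multivariate_normal(beta1, Lambda1)
```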
The full conditional for $\sigma^2$ is:

$$p(\sigma^2 | \beta, y) \propto (\sigma^2)^{-\frac{T}{2}} \exp\left(-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right) (\sigma^2)^{-\left(\frac{a_0}{2} + 1\right)} \exp\left(-\frac{b_0}{2\sigma^2}\right)$$

This simplifies to an inverse-gamma distribution:

$$\sigma^2 | \beta, y \sim IG\left(\frac{a_1}{2}, \frac{b_1}{2}\right)$$

Where:

$$a_1 = a_0 + T$$

$$b_1 = b_0 + (y - X\beta)'(y - X\beta)$$
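Sampling from this conditional is equally direct. NumPy has no inverse-gamma sampler, so the sketch below draws the precision $1/\sigma^2$ from the corresponding gamma distribution and inverts it; `sample_sigma2` is again an assumed name.

```python
def sample_sigma2(beta, y, X, a0, b0, rng):
    """Draw sigma^2 from its full conditional IG(a1/2, b1/2)."""
    resid = y - X @ beta
    a1 = a0 + len(y)                 # a1 = a0 + T
    b1 = b0 + resid @ resid          # b1 = b0 + (y - X beta)'(y - X beta)
    # If sigma^2 ~ IG(a1/2, b1/2), then 1/sigma^2 ~ Gamma(a1/2, rate=b1/2),
    # i.e. Gamma(shape=a1/2, scale=2/b1) in NumPy's parameterization.
    return 1.0 / rng.gamma(shape=a1 / 2.0, scale=2.0 / b1)
```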
Gibbs Sampling Procedure

The Gibbs sampling procedure for obtaining samples from $\beta, \sigma^2 | y$ is as follows:

1. Initialize $\sigma^{2(0)}$ with a starting value.
2. Sample $\beta^{(1)}$ from $\beta | \sigma^{2(0)}, y$.
3. Sample $\sigma^{2(1)}$ from $\sigma^2 | \beta^{(1)}, y$.
4. Sample $\beta^{(2)}$ from $\beta | \sigma^{2(1)}, y$.
5. Sample $\sigma^{2(2)}$ from $\sigma^2 | \beta^{(2)}, y$.
6. Continue this process for $n$ iterations.

After $n$ iterations, we have $n$ correlated draws that approximate samples from the full posterior distribution; in practice, an initial burn-in portion is discarded before computing posterior summaries. A complete sketch of the sampler follows.
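Putting the two conditional samplers together gives the procedure above as code. A minimal sketch under the same assumptions as the earlier snippets; the iteration count, starting value, and burn-in length are arbitrary illustrative choices.

```python
def gibbs_sampler(y, X, beta0, Lambda0, a0, b0, n_iter=5000, rng=None):
    """Alternate between the full conditionals of beta and sigma^2."""
    if rng is None:
        rng = np.random.default_rng()
    k = X.shape[1]
    beta_draws = np.empty((n_iter, k))
    sigma2_draws = np.empty(n_iter)
    sigma2 = 1.0                     # sigma^2(0): arbitrary starting value
    for i in range(n_iter):
        beta = sample_beta(sigma2, y, X, beta0, Lambda0, rng)
        sigma2 = sample_sigma2(beta, y, X, a0, b0, rng)
        beta_draws[i] = beta
        sigma2_draws[i] = sigma2
    return beta_draws, sigma2_draws

beta_draws, sigma2_draws = gibbs_sampler(y, X, beta0, Lambda0, a0, b0)
burn = 1000                          # discard burn-in draws
print("posterior mean of beta:  ", beta_draws[burn:].mean(axis=0))
print("posterior mean of sigma2:", sigma2_draws[burn:].mean())
```

With the simulated data from the first snippet, the post-burn-in posterior means should land close to `beta_true` and `sigma2_true`.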