Nonparametric Models

Bayesian nonparametric models, such as DP mixture models, provide a flexible approach in which the model structure (e.g. the number of components) can adapt to the data. The capability and effectiveness of such models have been demonstrated in many applications.

This flexibility, however, raises a new question for probabilistic programming – how to express models whose size or structure can vary with the data. The Julia macro system makes it possible to address this problem in an elegant way, since macros operate on unevaluated expressions and thus naturally support lazy evaluation.

Dirichlet Process Mixture Model

The Dirichlet process mixture model (DPMM) is one of the most widely used models in Bayesian nonparametrics. The formulation of a DPMM is given by

D & \ \sim \ DP(\alpha B) \\
\theta_i & \ \sim \ D, \ \ x_i \ \sim \ G(\theta_i), \quad \forall i = 1, 2, \ldots, n

Here, \alpha is the concentration parameter, B is the base measure, and G denotes the component models that generate the observed samples. It has been shown that D is almost surely discrete, and can be expressed in the following form:

D = \sum_{k=1}^\infty \pi_k \delta_{\phi_k}

Hence, there is a positive probability that some of the components will be sampled repeatedly, and thus the number of distinct components is usually smaller than the number of samples. In practice, it is useful to construct a pool of components \phi_1, \phi_2, \ldots, and introduce an indicator z_i for each sample x_i, such that \theta_i = \phi_{z_i}. To make this explicit, we can reformulate the model as below

\pi & \ \sim \ StickBreak(\alpha) \\
\phi_k & \ \sim \ B, \quad \forall k = 1, 2, \ldots \\
z_i & \ \sim \ \pi, \ \ x_i \ \sim \ G(\phi_{z_i}), \quad \forall i = 1, 2, \ldots, n

Here, \pi, a sample from the stick breaking process, is an infinite sequence that sums to unity (i.e. \sum_{k=1}^\infty \pi_k = 1). The values of \pi are defined as

v_k & \ \sim \ Beta(1, \alpha), \quad \forall k = 1, 2, \ldots \\
\pi_1 &= v_1, \ \ \pi_k = v_k \prod_{l=1}^{k-1} (1 - v_l), \quad \forall k = 2, 3, \ldots
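
For concreteness, the stick breaking construction can be simulated directly in plain Julia. The sketch below uses the Distributions.jl package and truncates the infinite sequence at a finite level K – an assumption made only so the example terminates; the helper name stick_break is illustrative:

using Distributions

# Draw the first K weights of a sample pi from StickBreak(alpha).
# The remaining probability mass stays on the unbroken stick.
function stick_break(alpha::Float64, K::Int)
    v = rand(Beta(1.0, alpha), K)   # v_k ~ Beta(1, alpha)
    pi = similar(v)
    rest = 1.0                      # length of the remaining stick
    for k in 1:K
        pi[k] = v[k] * rest         # pi_k = v_k * prod_{l<k} (1 - v_l)
        rest *= 1.0 - v[k]
    end
    return pi
end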

The model specification for this formulation is given below

@model DPMixtureModel{B, G} begin
    @constant n::Int            # the number of observed samples
    @hyperparam alpha::Float64  # the concentration parameter

    pi ~ StickBreak(alpha)
    for k = 1 : Inf
        phi[k] ~ B
    end

    # samples
    for i = 1 : n
        z[i] ~ pi
        x[i] ~ G(phi[z[i]])
    end
end
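
While the specification above is declarative, the generative process it describes can also be simulated in plain Julia. The following sketch truncates the model at K components and fixes B and G to normal distributions purely for illustration; it reuses stick_break from the earlier sketch:

using Distributions

function sample_dpmm(alpha::Float64, n::Int; K::Int = 50)
    pi  = stick_break(alpha, K)                           # truncated weights
    phi = rand(Normal(0.0, 10.0), K)                      # phi_k ~ B
    z   = [rand(Categorical(pi ./ sum(pi))) for i in 1:n] # z_i ~ pi
    x   = [rand(Normal(phi[zi], 1.0)) for zi in z]        # x_i ~ G(phi_{z_i})
    return x, z
end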

Remarks:

  • The parametric setting makes it possible to use an arbitrary base distribution B and component model G here, with the same generic formulation.
  • In theory, \pi is an infinite sequence, and it is therefore not feasible to completely instantiate a sample path of \pi on a computer. This variable may be marginalized out during inference, and thus directly querying \pi is not allowed.
  • \phi has infinitely many components. However, only a finite number of them are needed during inference. The compiler should generate a lazy data structure that only memorizes the subset of components needed during the inference; in particular, phi[k] is constructed and memorized when there is an i such that k = z[i]. Packages such as LazySequences.jl have demonstrated that lazy data structures can be implemented efficiently in Julia. A sketch of such a structure is given after this list.
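
The following is a minimal sketch of such a lazy component pool, using Distributions.jl. The name LazyComponents and the Float64 value type (which assumes a univariate base distribution) are illustrative assumptions, not part of the DSL:

using Distributions

struct LazyComponents{D}
    base::D                    # the base distribution B
    pool::Dict{Int,Float64}    # memorized components, filled on demand
end
LazyComponents(base) = LazyComponents(base, Dict{Int,Float64}())

function Base.getindex(c::LazyComponents, k::Int)
    get!(c.pool, k) do
        rand(c.base)           # phi[k] ~ B, drawn on first access only
    end
end

phi = LazyComponents(Normal(0.0, 10.0))
phi[3]    # draws and memorizes the 3rd component
phi[3]    # returns the memorized value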

Hierarchical Dirichlet Processes

The hierarchical Dirichlet process (HDP) is an extension of the DP mixture model, which allows groups of data to be modeled by different DPs that share components. The formulation of the HDP is given below

D_0 & \ \sim \ DP(\alpha B) \\
D_k & \ \sim \ DP(\gamma_k D_0) \\
\theta_{ki} & \ \sim \ D_k, \quad x_{ki} \ \sim \ G(\theta_{ki}),
\quad \forall k = 1, \ldots, m, \ i = 1, \ldots, n_k

Using D_0 as the base measure for the DP associated with each group, all groups share the components of D_0 while still allowing a potentially infinite number of components. This formulation can be re-written (equivalently) using the stick breaking construction, as follows

\pi_0 & \ \sim \ StickBreak(\alpha) \\
\psi_j & \ \sim \ B, \quad \forall j = 1, 2, \ldots

Then for each k = 1, \ldots, m,

\pi_k & \ \sim \ StickBreak(\gamma_k) \\
u_{kt} & \ \sim \ \pi_0, \quad \phi_{kt} = \psi_{u_{kt}}, \quad t = 1, 2, \ldots \\
z_{ki} & \ \sim \ \pi_k, \quad x_{ki} \ \sim \ G(\phi_{k z_{ki}}), \quad i = 1, 2, \ldots, n_k

Here is a brief description of this procedure:

  1. To generate D_0, we first draw an infinite multinomial distribution \pi_0 from a stick breaking process with concentration parameter \alpha, and draw each component \psi_j from B. Then D_0 = \sum_{j=1}^\infty \pi_{0j} \delta_{\psi_j}.
  2. Then for each group (say the k-th one), we draw \pi_k from a stick breaking process and draw each component from D_0. Note that drawing a component \phi_{kt} from D_0 is equivalent to choosing one of the atoms in D_0, which can be done in two steps: draw u_{kt} from \pi_0 and then set \phi_{kt} = \psi_{u_{kt}}. In other words, the t-th component in the k-th group is identical to the u_{kt}-th component in D_0.
  3. Finally, to generate the i-th sample in the k-th group, denoted by x_{ki}, we first draw z_{ki} from \pi_k and use the corresponding component \phi_{kz_{ki}} to generate the sample.
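
A runnable sketch of this procedure (not the DSL) is given below. It truncates both stick breaking processes at a finite level K and fixes B and G to normal distributions for illustration; stick_break is the helper defined in the earlier sketch:

using Distributions

function sample_hdp(alpha, gamma, m::Int, ns::Vector{Int}; K::Int = 50)
    pi0 = stick_break(alpha, K)                          # step 1: weights of D0
    psi = rand(Normal(0.0, 10.0), K)                     # atoms psi_j ~ B
    x = Vector{Vector{Float64}}(undef, m)
    for k in 1:m                                         # step 2: the k-th group
        pik = stick_break(gamma, K)
        u   = rand(Categorical(pi0 ./ sum(pi0)), K)      # u_{kt} ~ pi0
        phi = psi[u]                                     # phi_{kt} = psi_{u_{kt}}
        z   = rand(Categorical(pik ./ sum(pik)), ns[k])  # step 3: z_{ki} ~ pi_k
        x[k] = [rand(Normal(phi[zi], 1.0)) for zi in z]  # x_{ki} ~ G(phi_{k z_{ki}})
    end
    return x
end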

This formulation can be expressed using the DSL as below:

@model HierarchicalDP{B, G} begin
    @constant m::Int           # the number of groups
    @constant ns::Vector{Int}  # the number of samples in each group
    @hyperparam alpha::Float64   # the base concentration
    @hyperparam gamma::Float64   # the group specific concentration

    # for D0
    pi0 ~ StickBreak(alpha)
    for j = 1 : Inf
        psi[j] ~ B
    end

    # each group
    for k = 1 : m
        pi[k] ~ StickBreak(gamma)

        # Dk
        for t = 1 : Inf
            u[k][t] ~ pi0
            phi[k][t] = psi[u[k][t]]
        end

        # samples
        for i = 1 : ns[k]
        z[k][i] ~ pi[k]
        x[k][i] ~ G(phi[k][z[k][i]])
        end
    end
end

Gaussian Processes

The Gaussian process (GP) is another important stochastic process that is widely used in Bayesian modeling. Formally, a Gaussian process is a function-valued distribution X_t: t \in T, where T can be an arbitrary domain, such that the values at any finite subset of locations are jointly normally distributed. A Gaussian process is characterized by a mean function \mu: T \rightarrow R and a positive definite covariance function \kappa: T \times T \rightarrow R. The covariance function is typically given in a parametric form. The following is one that is widely used

\kappa(s, t; \theta) = \theta_0 \delta_{s,t} + \theta_1 \exp ( -\theta_2 (s - t)^2 )
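
As a small concrete example, the sketch below evaluates this covariance function on a finite grid of inputs and assembles the corresponding Gram matrix; the values of theta and the grid are illustrative assumptions (note the 1-based indexing, so theta[1] plays the role of \theta_0):

kappa(s, t, theta) = theta[1] * (s == t) + theta[2] * exp(-theta[3] * (s - t)^2)

theta = (0.1, 1.0, 0.5)                         # (theta_0, theta_1, theta_2)
ts = 0.0:0.25:1.0
K = [kappa(s, t, theta) for s in ts, t in ts]   # positive definite Gram matrix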

In many applications, the GP is considered to be hidden, and the observations are noisy transformations of the samples generated from the GP, as

g & \ \sim \ GP(\mu, \kappa) \\
x_i & \ \sim \ B, \quad y_i \ \sim \ F(g(x_i)), \quad \forall i = 1, \ldots, n

The following model specification describes this model.

@model TransformedGaussProcess{B, F} begin
    @constant n::Int        # the number of observed samples

    # define mean and covariance function
    @param theta::Float^3
    mu(x) = 0.
    kappa(x, y) = theta[1] * delta(x, y) + theta[2] * exp(- theta[3] * abs2(x - y))

    # GP
    g ~ GaussianProcess(mu, kappa)

    # samples
    for i = 1 : n
        x[i] ~ B
        y[i] ~ F(g[x[i]])
    end
end

The GaussianProcess distribution here is a higher-order stochastic function, which takes two function arguments and generates another function. This is readily implementable in Julia, where functions are first-class citizens, as in many functional programming languages.
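
A minimal sketch of such a higher-order stochastic function in plain Julia is given below. It returns a closure that samples the GP lazily: values at previously seen inputs are memorized, and a value at a new input is drawn from the exact Gaussian conditional given the values drawn so far. This is an illustrative construction, not the compiler's implementation:

using LinearAlgebra

function gaussian_process(mu::Function, kappa::Function)
    xs = Float64[]                 # inputs sampled so far
    gs = Float64[]                 # memorized function values
    function g(x::Float64)
        i = findfirst(isequal(x), xs)
        i === nothing || return gs[i]          # return the memorized value
        if isempty(xs)
            m, v = mu(x), kappa(x, x)
        else
            K = [kappa(a, b) for a in xs, b in xs]
            k = [kappa(a, x) for a in xs]
            w = K \ k                          # conditional Gaussian update
            m = mu(x) + dot(w, gs .- mu.(xs))
            v = max(kappa(x, x) - dot(w, k), 0.0)
        end
        val = m + sqrt(v) * randn()
        push!(xs, x); push!(gs, val)
        return val
    end
    return g
end

g = gaussian_process(x -> 0.0, (s, t) -> exp(-0.5 * (s - t)^2))
g(0.3)    # draws a fresh value at a new input
g(0.3)    # returns the memorized value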