OpenPPL: A Proposal for Probabilistic Programming in Julia¶
This document proposes the design of OpenPPL, an open-source probabilistic programming framework implemented as a domain-specific language on top of Julia. It discusses the design of basic language constructs for model specification and queries, as well as directives to control algorithmic choices in inference.
Basics¶
OpenPPL is a domain-specific language built on top of Julia using macros. The language consists of two parts: model specification and query. In particular, a model specification formalizes a probabilistic model, which involves declaring variables and specifying relations between them, while a query specifies what is given and what is to be inferred.
Terminology¶
Here is a list of terms used throughout this document.
- Variable
- A variable generally refers to an entity that can take a value of a certain type. It can be a random variable directly associated with a distribution, a deterministic transformation of another variable, or just some value given by the user. The value of a variable can be given or unknown.
- Constant
- A value that is fixed when a query is constructed and remains fixed throughout the inference procedure. A constant is typically used to represent vector dimensions, model sizes, hyper-parameters, and so on. Note that model parameters are typically considered variables instead of constants. For example, a learning process can be formulated as a query that solves for the parameters given a set of observations, where the parameter values are iteratively updated.
- Domain
- The domain of a variable refers to the set of possible values that it may take. Any Julia type (e.g. Int, Float64) can be considered as a domain that contains any value of that type. This package also supports array domains and restrictive domains that contain only a subset of a specific type. Here are some examples:

    1..K             # integers between 1 and K
    0.0..Inf         # all non-negative real values
    Float64^n        # n-dimensional real vectors
    Float64^(m,n)    # matrices of size (m, n)
    (0.0..1.0)^m     # m-dimensional real vectors with all components in [0., 1.]
- Distribution
- From a programmatic standpoint, a distribution can be considered as a stochastic function that yields a random value in some domain. A distribution can accept zero or more arguments. A distribution should be stochastically pure, meaning that it always outputs the same value given the same arguments and the same state of the random number generator. Such purity makes it possible to reason about program structure and transform the model from one form to another.
- Factor
- A factor is a pure real-valued function. Here, “pure” means that it always outputs the same value given the same inputs. Factors are the core building blocks of a probabilistic model. A complex distribution is typically formulated as a set of variables connected by factors. Even a simple distribution (e.g. a normal distribution) involves a factor that connects the generated variable with its parameters (which may be considered variables with fixed values).
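As a concrete illustration of this last point, the following plain-Julia sketch shows the factor underlying a univariate normal distribution; the function name and signature are purely illustrative:

    # the factor connecting a generated value x to the parameters (mu, sig);
    # it is pure: the same inputs always yield the same output
    normal_factor(x::Float64, mu::Float64, sig::Float64) =
        exp(-abs2(x - mu) / (2 * sig^2)) / (sqrt(2pi) * sig)

    normal_factor(0.3, 0.0, 1.0)   # density of N(0, 1) evaluated at 0.3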
Getting Started: Gaussian Mixture Model¶
Here, we take the Gaussian mixture model as an example to illustrate how we can specify a probabilistic model using this Julia domain-specific language (DSL). A Gaussian mixture model is a generative model that combines several Gaussian components to approximate complex distributions (e.g. those with multiple modes). A Gaussian mixture model is characterized by a prior distribution pi over components and a set of Gaussian component parameters (mu_k, sig_k). The generative process is described as follows: for each sample, a component indicator z_i is drawn from Categorical(pi), and the sample x_i is then drawn from the corresponding Gaussian component N(mu_{z_i}, sig_{z_i}).
Model Specification¶
The model specification is given by
    @model GaussianMixtureModel begin
        # constant declaration
        @constant d::Int    # vector dimension
        @constant n::Int    # number of observations
        @hyperparam K::Int  # number of components

        # parameter declaration
        @param pi :: (0.0..1.0)^K    # prior proportions
        for k in 1 : K
            @param mu[k] :: Float64^d         # component mean
            @param sig[k] :: Float64^(d, d)   # component covariance
        end

        # sample generation process
        for i in 1 : n
            z[i] ~ Categorical(pi)
            x[i] ~ MultivariateNormal(mu[z[i]], sig[z[i]])
        end
    end
This model specification should be self-explanatory. However, it is still worth clarifying several aspects:
- The macro @model defines a model type named GaussianMixtureModel, and creates an environment (delimited by begin and end) for model formulation. All model types created by @model are subtypes of AbstractModel.
- The macro @constant declares d and n as constants. The values of these constants need not be given in the specification. Instead, they are needed upon query. Particularly, to construct a model, one can write

      mdl = GaussianMixtureModel()

  One can optionally fix the value of constants through keyword arguments in model construction, as below:

      mdl = GaussianMixtureModel(d = 2, K = 5)

  Note: fixing constants upon model construction is generally unnecessary. However, it might be useful under certain circumstances to simplify queries or restrict the model's use. Once a constant is fixed, it need not be specified again in the query.
- The macro @hyperparam declares hyper-parameters (here, K). Hyper-parameters are technically similar to constants, except that they typically refer to model configurations that may be changed during cross validation.
- Variables can be defined using the syntax variable-name :: domain. A for-loop can be used to declare multiple variables in the same domain. When the variable domain is clear from the context (e.g. the domains of z and x can be inferred from the distributions they are drawn from), the declaration can be omitted.
- The macro @param tags certain variables as parameters. This information will be used by the learning algorithm to determine which variables are the parameters to estimate.
- The statement variable-name ~ distribution introduces a conditional distribution over variables, which will be translated into a factor during model compilation.
Generic Specification: Finite Mixture Model¶
The Gaussian mixture model can be considered a special case of a generic family called the finite mixture model. Generally, the components of a finite mixture model can be arbitrary distributions. To capture the concept of a generic distribution family, we introduce generic specifications (or parametric specifications), which can take type arguments.
The specification of the generic finite mixture model is given by
    @model FiniteMixtureModel{G, ParamTypes} begin
        # constant declaration
        @hyperparam K::Int
        @constant n::Int

        # parameter declaration
        @param pi :: (0.0..1.0)^K    # prior proportions
        for k = 1 : K
            for j = 1 : length(ParamTypes)
                @param theta[k][j] :: ParamTypes[j]
            end
        end

        # sample generation process
        for i in 1 : n
            z[i] ~ Categorical(pi)
            x[i] ~ G(theta[z[i]]...)
        end
    end
One may consider the generic specification above as a specification template. To obtain a Gaussian mixture model specification, we can use the @modelalias macro, as below:
    @modelalias GaussianMixtureModel FiniteMixtureModel{G, ParamTypes} begin
        @constant d::Int
        @with G = MultivariateNormal
        @with ParamTypes[1] = Float64^d        # component mean
        @with ParamTypes[2] = Float64^(d, d)   # component covariance
    end

    mdl = GaussianMixtureModel()
The @modelalias macro allows introducing new constants and specializing the type parameters.
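The same template could serve other mixture families. As a hypothetical sketch (assuming a Poisson distribution is available to the DSL), a mixture of univariate Poisson components might be declared as:

    @modelalias PoissonMixtureModel FiniteMixtureModel{G, ParamTypes} begin
        @with G = Poisson
        @with ParamTypes[1] = 0.0..Inf   # the rate parameter of each component
    end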
Queries¶
In machine learning, the most common queries that people would make include
- learning: estimate model parameters
- prediction: predict the value or marginal distribution over unknown variables, given a learned model and observed variables.
- evaluation: evaluate log-likelihood of observations with a given model
- sampling: draw a set of samples of certain variables
To simplify these common queries, we provide several functions.
Query¶
Query refers to the task of inferring the value or marginal distributions of unknown variables, given a set of known variables.
    function query(rmdl::AbstractModel, knowns::Associative, qlist::Array, options)
        set_variables!(rmdl, knowns)
        q = compile_query(rmdl, qlist, options)  # returns a query function q
        return q()   # runs the query function and returns the results
    end

    function query(rmdl::AbstractModel, knowns::Associative, qlist)
        query(rmdl, knowns, qlist, default_options(rmdl))
    end
qlist is a list of variables, or functions over variables, that you want to infer. The function compile_query actually performs model compilation: it analyzes the model structure, chooses appropriate inference algorithms, and generates a closure q, which, when executed, actually performs the inference.
This query function is very flexible. One can use it for prediction, sampling, etc.:
    # let rmdl be a learned model

    # predict the value of z given observation x
    z_values = query(rmdl, {:x=>columns(data)}, :z)

    # infer the posterior marginal distributions over z given x
    z_marginal = query(rmdl, {:x=>columns(data)}, :(marginal(z)))

    # you can simultaneously infer only selected variables in a flexible way
    r = query(rmdl, {:x=>columns(data)}, {:(z[1]), :(z[2]), :(marginal(z[3]))})

    # draw 100 samples of z
    samples = query(rmdl, {:x=>columns(data)}, :(samples(z, 100)))
Note that inputs to the function are symbols like :z or expressions like :(marginal(z)), which indicate what we want to query. It is incorrect to pass z or marginal(z), as the value of z or marginal(z) is unavailable before the inference.
Learning¶
Learning refers to the task of estimating model parameters given observed data. This can be considered as a special kind of query, which infers the values of model parameters, given observed data.
    function learn_model(mdl::AbstractModel, data::Associative, options)
        rmdl = copy(mdl)
        set_variables!(rmdl, data)
        q = compile_query(rmdl, parameters(rmdl), options)
        set_variables!(rmdl, q())
        return rmdl
    end

    function learn_model(mdl::AbstractModel, data)
        learn_model(mdl, data, default_options(mdl))
    end

    # learn a GMM using learn_model
    # suppose data is a d-by-n matrix
    rmdl = learn_model(
        GaussianMixtureModel(K = 3, d = size(data,1), n = size(data,2)),
        {:x => columns(data)})
In the function learn_model, parameters(rmdl) returns the list of model parameters to use as the query list. The statement q = compile_query(rmdl, parameters(rmdl), options) then returns a query function q, such that q() executes the estimation procedure and returns the estimated model parameters. The following example shows how we can use this function to learn a Gaussian mixture model.
    function learn_gmm(data::Matrix{Float64}, K::Int)
        learn_model(
            GaussianMixtureModel(K = K, d = size(data,1), n = size(data,2)),
            {:x => columns(data)})
    end

    rmdl_K3 = learn_gmm(data, 3)
    rmdl_K5 = learn_gmm(data, 5)
Here, learn_gmm is a lightweight wrapper of learn_model.
Evaluation¶
Evaluation refers to the task of evaluating log-pdf of samples with respect to a learned model.
    # evaluate the logpdf of x with respect to a GMM
    lp = query(rmdl, {:x=>columns(data)}, :(logpdf(x)))
Options¶
The compilation options that control how a query is compiled can be specified through the options argument of the query or learn_model function. The following are some examples:
    rmdl = learn_model(mdl, data,
        {:method=>"variational_em", :max_iter=>100, :tolerance=>1.0e-6})
For sampling, we may use a different set of options
    options = {
        :method=>"gibbs_sampling",  # choose to use Gibbs sampling
        :burnin=>5000,              # the number of burn-in iterations
        :lag=>100}                  # the interval between two retained samples

    samples = query(rmdl, {:x=>columns(data)}, :(samples(z, 100)), options)
Query Functions with Arguments¶
It is often desirable in practice that a query function can be applied to different data sets without being re-compiled. For this purpose, we introduce the function make_query_function. The following example illustrates its use:
    # suppose rmdl is a learned model
    q = make_query_function(rmdl,
        (:data,),                 # the function q takes one argument, data
        {:x=>:(columns(data))},   # how the argument is bound to the model as a known value
        {:z},                     # what to query
        options)                  # compilation options

    # q can be used repeatedly for different datasets (without re-compilation)
    z1 = q(x1)
    z2 = q(x2)
Note that q is a closure that holds a reference to the learned model, so you don't have to pass the model as an argument into q. The following code uses this mechanism to generate a sampler:
    q = make_query_function(rmdl,
        (:data, :n),              # q takes two arguments: the observed data and the number of samples
        {:x=>:(columns(data))},
        {:(samples(z, n))},
        options)

    # draw 100 samples of z
    zs1 = q(x, 100)

    # draw another 200 samples of z
    zs2 = q(x, 200)
More Examples¶
This chapter shows several common examples to illustrate how the model specification DSL works in practice.
Latent Dirichlet Allocation¶
Latent Dirichlet Allocation (LDA) is a probabilistic model for topic modeling. The basic idea is to use a set of topics to describe documents. Key aspects of this model are summarized below:
Each document is considered as a bag of words, which means that only the frequencies of words matter, while the sequential order is ignored. In practice, each document is summarized by a histogram vector h_i over the vocabulary.
The model comprises a set of topics. We denote the number of topics by K. Each topic is characterized by a distribution over the vocabulary, denoted by beta_k. It is a common practice to place a Dirichlet prior over these distributions.
Each document is associated with a topic proportion vector theta_i, generated from a Dirichlet prior.
Each word in a document is generated independently. Specifically, a word is generated by first drawing a topic indicator z from the proportions theta_i, and then drawing the word from the corresponding topic distribution beta_z. Here, z indicates the topic associated with the word.
The model specification is then given by
    @model LatentDirichletAllocation begin
        # constants
        @constant m::Int            # vocabulary size
        @constant n::Int            # number of documents
        @constant nw::Vector{Int}   # numbers of words in documents
        @hyperparam K::Int          # number of topics
        @hyperparam a::Float64      # Dirichlet hyper-prior for beta

        # parameters
        @param alpha :: (0.0 .. Inf)^K
        for k in 1 : K
            @param beta[k] ~ Dirichlet(a)
        end

        # documents
        for i in 1 : n
            theta[i] ~ Dirichlet(alpha)
            let p = sum(beta[k] * theta[i][k] for k in 1 : K)
                h[i] ~ Multinomial(nw[i], p)
            end
        end
    end
Here, the keyword let introduces a local value p to denote the sum vector. It is important to note that p has a local scope (thus it is not visible outside the loop), and the value of p can be different for different i.
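To make this concrete, here is a small plain-Julia sketch of what p computes, assuming the topic distributions beta are stored as K vectors of length m:

    # mix the topic-word distributions by the document's topic proportions;
    # the result is a length-m distribution over the vocabulary
    function doc_word_dist(beta::Vector{Vector{Float64}}, theta::Vector{Float64})
        sum(beta[k] .* theta[k] for k in 1:length(theta))
    end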
The following function learns an LDA model from word histograms of training documents.
    function learn_lda(h::Matrix{Float64}, K::Int, alpha::Float64)
        learn_model(LatentDirichletAllocation(
            m = size(h, 1), n = size(h, 2), nw = sum(h, 1),
            K = K, a = alpha),
            {:h => columns(h)})
    end

    mdl = learn_lda(training_hists, K, alpha)
The following statement infers topic proportions of testing documents, given a learned LDA model.
    # suppose mdl is a learned LDA model
    theta = query(mdl, {:h => columns(testing_corpus)}, :theta)
Markov Random Fields¶
Unlike Bayesian networks, which can be factorized into a product of (conditional) distributions, Markov random fields are typically formulated in terms of potentials. Generally, an MRF formulation consists of two parts: identifying relevant cliques (small subsets of directly related variables) and assigning potential functions to them. In computer vision, Markov random fields are widely used in low-level vision tasks, such as image recovery and segmentation. Deep Boltzmann machines, which have become increasingly popular in recent years, are actually a special form of Markov random field. Here, we use a simple MRF model in the context of image denoising to demonstrate how one can use the model specification to describe an MRF.
From a probabilistic modeling standpoint, the task of image denoising can be considered as an inference problem based on an image model combined with an observation model. An image model captures the prior knowledge as to what a clean image may look like, while the observation model describes how the observed image is generated through a noisy imaging process. Here, we consider a simple setting: a Gaussian MRF prior plus white noise. A classical Gaussian MRF formulation for image modeling is the Gibbs distribution

    p(x; theta) = (1 / Z(theta)) * exp(-E(x; theta))

Here, E(x; theta) is the energy function, which is controlled by a parameter theta. The energy function can be devised in different ways. A typical design would encourage smoothness, that is, assign low energy when the intensity values of neighboring pixels are close to each other. For example, a classical formulation uses the energy function

    E(x; theta) = theta * sum of (x_i - x_j)^2 over all cliques (i, j)

Here, i and j are indices of pixels, and the clique set contains all edges between neighboring pixels. With the white noise assumption, each observed pixel value is given by y_i ~ Normal(x_i, sig).
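As a sanity check of this energy, the following plain-Julia sketch evaluates the smoothness term over a 4-neighborhood (the function name and looping scheme are illustrative):

    # smoothness energy of an image under the Gaussian MRF prior:
    # each pair of horizontal/vertical neighbors contributes theta * (difference)^2
    function mrf_energy(x::Matrix{Float64}, theta::Float64)
        m, n = size(x)
        E = 0.0
        for i in 1:m, j in 1:n
            i < m && (E += theta * (x[i+1, j] - x[i, j])^2)   # vertical neighbor
            j < n && (E += theta * (x[i, j+1] - x[i, j])^2)   # horizontal neighbor
        end
        return E
    end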
Below is the specification of the joint model:
    @model SimpleMRF begin
        @constant nimgs::Int   # the number of images
        @constant imgsizes::Vector{(Int, Int)}

        # parameters
        @param theta::Float64   # the Gaussian MRF parameter
        @param sig::Float64     # the variance of white noise

        for t in 1 : nimgs
            let m = imgsizes[t][1], n = imgsizes[t][2]
                x[t]::Float64^(m, n)   # the true image
                y[t]::Float64^(m, n)   # the observed noisy image

                let xt = x[t], yt = y[t]
                    # the image prior (Gaussian MRF)
                    for i in 2 : m-1, j in 2 : n-1
                        @fac exp(-theta * (xt[i,j] - xt[i,j-1])^2)
                        @fac exp(-theta * (xt[i,j] - xt[i,j+1])^2)
                        @fac exp(-theta * (xt[i,j] - xt[i-1,j])^2)
                        @fac exp(-theta * (xt[i,j] - xt[i+1,j])^2)
                    end

                    # the observation model
                    for i in 1 : m, j in 1 : n
                        yt[i,j] ~ Normal(xt[i,j], sig)
                    end
                end
            end
        end
    end
The following statement learns the model from a set of uncorrupted images
    # suppose imgs is an array of images
    mdl = learn_model(SimpleMRF(nimgs = length(imgs), imgsizes = map(size, imgs)),
                      {:x=>imgs})
In this specification, four potentials are used to connect a pixel to its left, right, upper, and lower neighbors. This approach would become quite cumbersome as the neighborhood grows. Many state-of-the-art denoising algorithms use much larger neighborhoods (e.g. 5 x 5, 9 x 9, etc.) to capture high-order texture structure. A representative example is the Field of Experts, where the MRF prior is defined using a set of K filters J_1, ..., J_K:

    p(x) proportional to the product, over all patches c and filters k,
         of exp(rho(dot(J_k, x_c), alpha_k))

Here, the product ranges over the set of all patches of a certain size (say 5 x 5), x_c denotes the pixel values over a small patch c, and dot(J_k, x_c) is the response of the k-th filter at patch c. The function rho is a robust potential that maps filter responses to potential values, controlled by a parameter alpha_k.
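For intuition, the following plain-Julia sketch evaluates the corresponding energy (the negated log-prior) given the filters and coefficients; the names and shapes are assumptions consistent with the specification below:

    # Field-of-Experts energy: for each w x w patch and each filter,
    # add the robust penalty alpha_k * log(1 + v^2) of the filter response v
    function foe_energy(x::Matrix{Float64}, J::Vector{Matrix{Float64}},
                        alpha::Vector{Float64})
        w = size(J[1], 1)
        ew = div(w - 1, 2)
        m, n = size(x)
        E = 0.0
        for k in eachindex(J), i in 1+ew:m-ew, j in 1+ew:n-ew
            v = sum(J[k] .* x[i-ew:i+ew, j-ew:j+ew])   # filter response at the patch
            E += alpha[k] * log(1 + v^2)               # equals -rho(v, alpha[k])
        end
        return E
    end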
The specification below describes this more sophisticated model, where local functions and local variables are used to simplify the specification.

    @model FieldOfExperts begin
        @constant K::Int    # the number of filters
        @constant w::Int    # patch size (w = 5 for 5 x 5 patches)
        @constant ew::Int = (w - 1) / 2   # half patch dimension

        @constant nimgs::Int   # the number of images
        @constant imgsizes::Vector{(Int, Int)}
        @constant sig :: Float64   # variance of white noise

        # parameters
        for k = 1 : K
            @param J[k] :: Float64^(w, w)   # filter kernel
            @param alpha[k] :: Float64      # filter coefficient
        end

        # the robust potential function
        rho(v, a) = -a * log(1 + v * v)

        for t in 1 : nimgs
            let m = imgsizes[t][1], n = imgsizes[t][2]
                x[t]::Float64^(m, n)   # the true image
                y[t]::Float64^(m, n)   # the observed noisy image

                let xt = x[t], yt = y[t]
                    # the image prior
                    for k in 1 : K, i in 1+ew : m-ew, j in 1+ew : n-ew
                        let c = vec(xt[i-ew:i+ew, j-ew:j+ew])
                            @fac exp(rho(dot(J[k], c), alpha[k]))
                        end
                    end

                    # the observation model
                    for i in 1 : m, j in 1 : n
                        yt[i,j] ~ Normal(xt[i,j], sig)
                    end
                end
            end
        end
    end
Below is a query function that learns a Field-of-Experts model.

    function learn_foe(imgs, w::Int, K::Int)
        # imgs: an array of images
        # w: the patch dimension
        # K: the number of filters
        mdl = FieldOfExperts(
            K = K, w = w,
            nimgs = length(imgs),
            imgsizes = map(size, imgs))
        learn_model(mdl, {:x=>imgs})
    end
Given a learned model, the following query function performs image denoising.
    function foe_denoise(mdl::FieldOfExperts, sig::Float64, noisy_im::Matrix{Float64})
        # sig: the noise variance (which is typically given in denoising tasks)
        # noisy_im: the observed noisy image
        query(mdl, {:sig=>sig, :y=>[noisy_im]}, :x)
    end

    denoised_im = foe_denoise(mdl, 0.1, noisy_im)
Conditional Random Fields¶
Structured prediction, which exploits the statistical dependencies between multiple entities within an instance, has become an important area in machine learning and related fields. The conditional random field (CRF) is a popular model in this area. Here, I consider a simple application of CRFs in computer vision. A visual scene usually comprises multiple objects, and there exist statistical dependencies between the scene category and the objects therein. For example, a bed is more likely in a bedroom than in a forest. A conditional random field that takes advantage of such relations can be formulated as

    p(s, o | x, y) = (1 / Z) * exp( dot(alpha_s, x)
        + sum over objects j of (dot(beta_{o_j}, y_j) + theta[s, o_j]) )
This formulation contains three potentials:
- dot(alpha_s, x) connects the scene class s to the observed scene feature x,
- dot(beta_{o_j}, y_j) connects each object label o_j to the corresponding object feature y_j,
- theta[s, o_j] captures the statistical dependencies between scene classes and object classes.
In addition, Z is the normalization constant, whose value depends on the parameters (alpha, beta, theta). Below is the model specification:
    @model SceneObjectCRF begin
        @constant M::Int   # the number of scene classes
        @constant N::Int   # the number of object classes
        @constant p::Int   # the scene feature dimension
        @constant q::Int   # the object feature dimension

        @constant nscenes::Int         # the number of scenes
        @constant nobjs::Vector{Int}   # numbers of objects in each scene

        for k in 1 : M
            @param alpha[k] :: Float64^p
        end
        for k in 1 : N
            @param beta[k] :: Float64^q
        end
        @param theta :: Float64^(M, N)

        for i in 1 : nscenes
            let n = nobjs[i]
                s[i] :: 1 .. M        # the scene class label
                o[i] :: (1 .. N)^n    # the object class labels
                x[i] :: Float64^p     # the scene feature vector
                for j in 1 : n
                    y[i][j] :: Float64^q   # the object features
                end

                @expfac dot(alpha[s[i]], x[i])

                for j in 1 : n
                    let k = o[i][j]
                        @expfac dot(beta[k], y[i][j])
                        @expfac theta[s[i], k]
                    end
                end
            end
        end
    end
Note here that @expfac f(x) is equivalent to @fac exp(f(x)). @expfac is introduced to simplify the syntax in cases where factors are specified in log-scale.
Deep Boltzmann Machines¶
A Boltzmann machine (BM) is a generative probabilistic model that describes data through hidden layers. In particular, deep belief networks and deep Boltzmann machines, which have become increasingly popular in machine learning and its application domains, can be constructed by stacking multiple layers of BMs. In a generic Boltzmann machine, the joint distribution over the hidden units h and visible units v is given by

    p(v, h) proportional to exp( v' * Lv * v + h' * J * h + v' * W * h )

where Lv and J hold the intra-layer connection weights and W holds the inter-layer weights. When Lv and J are zero matrices, this reduces to a restricted Boltzmann machine. By stacking multiple layers of BMs, we obtain a deep Boltzmann machine with L latent layers:

    p(v, h_1, ..., h_L) proportional to
        exp( v' * Lv * v + v' * W0 * h_1
             + sum over k of h_k' * J[k] * h_k
             + sum over k < L of h_k' * W[k] * h_{k+1} )
This probabilistic network, despite its complex internal structure, can be readily specified using the DSL as below
    @model DeepBoltzmannMachine begin
        @hyperparam L::Int                # the number of latent layers
        @hyperparam nnodes::Vector{Int}   # the number of nodes in each layer

        @constant d::Int   # the dimension of observed sample
        @constant n::Int   # the number of observed samples

        # declare coefficient matrices
        @param Lv :: Float64^(d, d)   # intra-layer weights of the visible layer
        @param W0 :: Float64^(d, nnodes[1])
        for k = 1 : L
            @param J[k] :: Float64^(nnodes[k], nnodes[k])
        end
        for k = 1 : L-1
            @param W[k] :: Float64^(nnodes[k], nnodes[k+1])
        end

        # declaration of variables
        for i = 1 : n
            obs[i] :: Float64^d
            for k = 1 : L
                latent[i][k] :: Float64^(nnodes[k])
            end
        end

        # samples
        for i = 1 : n
            let h = latent[i], v = obs[i]
                # intra-layer connections
                @expfac v' * Lv * v
                for k = 1 : L
                    @expfac h[k]' * J[k] * h[k]
                end

                # inter-layer connections
                @expfac v' * W0 * h[1]
                for k = 1 : L-1
                    @expfac h[k]' * W[k] * h[k+1]
                end
            end
        end
    end
To learn this model from a set of samples, one can write
    function learn_deepbm(x::Matrix{Float64}, nnodes::Vector{Int})
        # each column of x is a sample
        # nnodes specifies the number of nodes at each layer
        learn_model(DeepBoltzmannMachine(L = length(nnodes), nnodes = nnodes,
                                         d = size(x,1), n = size(x,2)),
                    {:obs=>columns(x)})
    end

    mdl = learn_deepbm(x, [100, 50, 20])
Nonparametric Models¶
Bayesian nonparametric models, such as DP mixture models, provide a flexible approach in which model structure (e.g. the number of components) can be adapted to data. The capability and effectiveness of such models have been proven in many applications.
This flexibility, however, leads to new questions for probabilistic programming: how to express models whose size and structure can vary. The Julia macro system makes it possible to address this problem in an elegant way due to its lazy evaluation nature.
Dirichlet Process Mixture Model¶
The Dirichlet process mixture model (DPMM) is one of the most widely used models in Bayesian nonparametrics. The formulation of a DPMM is

    D ~ DP(alpha, B)
    theta_i ~ D,  x_i ~ G(theta_i)   for each sample i

Here, alpha is the concentration parameter, B is the base measure, and G denotes the component model that generates the observed samples. It has been shown that D is almost surely discrete, and can be expressed in the following form:

    D = sum over k of pi_k * delta(phi_k),  with phi_k ~ B

Hence, there exists positive probability that some of the components will be repeatedly sampled, and thus the number of distinct components is usually smaller than the number of samples. In practice, it is useful to construct a pool of components phi_1, phi_2, ..., and introduce an indicator z_i for each sample x_i, such that x_i ~ G(phi_{z_i}). To make this explicit, we can reformulate the model as below. Here, pi, a sample from the stick-breaking process, is an infinite sequence that sums to unity (i.e. sum over k of pi_k = 1). The values of pi_k are defined as

    v_k ~ Beta(1, alpha),  pi_k = v_k * product over j < k of (1 - v_j)
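A truncated version of this stick-breaking construction is easy to simulate in plain Julia; the following sketch (assuming the Distributions.jl package) draws the first K weights:

    using Distributions   # assumed dependency, for Beta

    # draw the first K stick-breaking weights of a DP with concentration alpha;
    # the returned weights sum to 1 minus the mass of the remaining stick
    function stick_break(alpha::Float64, K::Int)
        v = rand(Beta(1.0, alpha), K)
        w = similar(v)
        rest = 1.0
        for k in 1:K
            w[k] = v[k] * rest
            rest *= 1.0 - v[k]
        end
        return w
    end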
The model specification for this formulation is given below
    @model DPMixtureModel{B, G} begin
        @constant n::Int            # the number of observed samples
        @hyperparam alpha::Float64  # the concentration parameter

        pi ~ StickBreak(alpha)
        for k = 1 : Inf
            phi[k] ~ B
        end

        # samples
        for i = 1 : n
            z[i] ~ pi
            x[i] ~ G(phi[z[i]])
        end
    end
Remarks:
- The parametric setting makes it possible to use an arbitrary base distribution B and component family G here, with the same generic formulation.
- In theory, pi is an infinite sequence, and therefore it is not feasible to completely instantiate a sample path of pi on a computer. This variable may be marginalized out in the inference, and thus directly querying pi is not allowed.
- phi has infinitely many components. However, only a finite number of them are needed in inference. The compiler should generate a lazy data structure that only memoizes the subset of components needed during the inference. In particular, phi[k] is constructed and memoized when there is an i such that k = z[i]. Some efforts (e.g. the LazySequences.jl package) have demonstrated that lazy data structures can be implemented efficiently in Julia.
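As a sketch of what such a lazy structure might look like (all names here are illustrative, not part of the proposal):

    # a lazily memoized component pool: phi[k] is drawn from the base
    # distribution on first access and reused afterwards
    struct LazyComponents{T}
        draw::Function      # () -> T, draws a fresh component from the base
        pool::Dict{Int,T}   # components materialized so far
    end

    LazyComponents(draw::Function, ::Type{T}) where {T} =
        LazyComponents{T}(draw, Dict{Int,T}())

    Base.getindex(c::LazyComponents, k::Int) = get!(c.draw, c.pool, k)

    # usage: a pool of Gaussian component means drawn lazily from N(0, 1)
    phi = LazyComponents(() -> randn(), Float64)
    phi[3]   # draws and stores a new component
    phi[3]   # returns the memoized value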
Hierarchical Dirichlet Processes¶
HDP is an extension of the DP mixture model, which allows groups of data to be modeled by different DPs that share components. The formulation of an HDP is

    D0 ~ DP(alpha, B)
    Dk ~ DP(gamma, D0)                      for each group k
    theta_{ki} ~ Dk,  x_{ki} ~ G(theta_{ki})

Using D0 as the base measure for the DP associated with each group, all groups share the components in D0 while allowing a potentially infinite number of components. This formulation can be re-written (equivalently) using the stick-breaking (Pitman-Yor) construction, as follows

    pi0 ~ StickBreak(alpha),  psi_j ~ B,  D0 = sum over j of pi0_j * delta(psi_j)

Then for each group k,

    pi_k ~ StickBreak(gamma),  u_{kt} ~ pi0,  phi_{kt} = psi_{u_{kt}}
    z_{ki} ~ pi_k,  x_{ki} ~ G(phi_{k, z_{ki}})

Here is a brief description of this procedure:
- To generate D0, we first draw an infinite multinomial distribution pi0 from a stick-breaking process with concentration parameter alpha, and draw each component psi_j from the base measure B. Then D0 = sum over j of pi0_j * delta(psi_j).
- Then for each group (say the k-th one), we draw pi_k from a stick-breaking process and draw each component of the group from D0. Note that drawing a component from D0 is equivalent to choosing one of the atoms in D0, which can be done in two steps: draw an index u_{kt} from pi0 and then set phi_{kt} = psi_{u_{kt}}. In other words, the t-th component in the k-th group is identical to the u_{kt}-th component in D0.
- Finally, to generate the i-th sample in the k-th group, denoted by x_{ki}, we first draw an indicator z_{ki} from pi_k and use the corresponding component phi_{k, z_{ki}} to generate the sample.
This formulation can be expressed using the DSL as below:
    @model HierarchicalDP{B, G} begin
        @constant m::Int            # the number of groups
        @constant ns::Vector{Int}   # the number of samples in each group
        @hyperparam alpha::Float64  # the base concentration
        @hyperparam gamma::Float64  # the group-specific concentration

        # for D0
        pi0 ~ StickBreak(alpha)
        for j = 1 : Inf
            psi[j] ~ B
        end

        # each group
        for k = 1 : m
            pi[k] ~ StickBreak(gamma)   # Dk
            for t = 1 : Inf
                u[k][t] ~ pi0
                phi[k][t] = psi[u[k][t]]
            end

            # samples
            for i = 1 : ns[k]
                z[k][i] ~ pi[k]
                x[k][i] ~ G(phi[k][z[k][i]])
            end
        end
    end
Gaussian Processes¶
The Gaussian process (GP) is another important stochastic process that is widely used in Bayesian modeling. Formally, a Gaussian process is a function-valued distribution over some domain (which can be arbitrary), such that any finite subset of the function's values is jointly normally distributed. A Gaussian process is characterized by a mean function mu and a positive definite covariance function kappa. The covariance function is typically given in a parametric form. The following widely used example combines a white-noise term and a squared-exponential term:

    kappa(x, y) = theta_1 * delta(x, y) + theta_2 * exp(-theta_3 * abs2(x - y))

In many applications, the GP is considered to be hidden, and the observations are a noisy transformation of samples generated from the GP:

    x_i ~ B,  y_i ~ F(g(x_i))
The following model specification describes this model.
    @model TransformedGaussProcess{B, F} begin
        @constant n::Int   # the number of observed samples

        # define the mean and covariance functions
        @param theta::Float64^3
        mu(x) = 0.
        kappa(x, y) = theta[1] * delta(x, y) +
                      theta[2] * exp(- theta[3] * abs2(x - y))

        # GP
        g ~ GaussianProcess(mu, kappa)

        # samples
        for i = 1 : n
            x[i] ~ B
            y[i] ~ F(g[x[i]])
        end
    end
The GaussianProcess distribution here is a higher-order stochastic function: it takes two function arguments and generates another function. This is readily implementable in Julia, where functions are first-class citizens, as in many functional programming languages.
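For intuition, evaluating a GP at finitely many inputs reduces to sampling a multivariate normal. The following plain-Julia sketch (names, the parameter values, and the jitter constant are illustrative) samples g at a grid of points using the kernel form above:

    using LinearAlgebra

    # sample a GP at the inputs xs: mean vector mu.(xs), covariance [kappa(xi, xj)]
    function sample_gp(mu::Function, kappa::Function, xs::Vector{Float64})
        m = [mu(x) for x in xs]
        C = [kappa(xi, xj) for xi in xs, xj in xs]
        Lchol = cholesky(Symmetric(C + 1e-10I)).L   # small jitter for stability
        return m + Lchol * randn(length(xs))
    end

    theta = (0.01, 1.0, 0.5)   # illustrative parameter values
    kappa(x, y) = theta[1] * (x == y) + theta[2] * exp(-theta[3] * abs2(x - y))
    g = sample_gp(x -> 0.0, kappa, collect(0.0:0.1:1.0))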
The Inference Framework¶
The diagram below outlines the overall architecture of the inference framework.
The entire process of generating inference code from the model and query specifications consists of four stages:
1. The @model macro converts a model specification into a model class, whose internal representation is a factor graph.
2. The @query macro reduces the factor graph based on the query. The reduction may involve the following steps (a sketch of the first step is given after this list):
- Simplify factors by absorbing known variables. For example, a second-order factor (i.e. a factor with two arguments) can be reduced to a first-order factor if the value of one of its variables is given.
- Eliminate irrelevant variables and factors: variables and factors that do not influence the conditional distribution of the queried variables can be safely removed. For example, consider a joint distribution that factorizes as p(x) p(y|x) p(z|y). When the value of y is given, the variable z is conditionally independent of x. Therefore, the variable z can be ignored in the inference of x.
The @query macro also generates a high-level description of the inference algorithm.
3. An inference compiler compiles the inference algorithm into low-level Julia code, taking into account the computational architecture of the target platform (e.g. CPU cores, GPU, cluster, cloud, etc.).
4. Finally, the Julia compiler emits LLVM instructions, which are then compiled into native code by the LLVM compiler.
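The factor-absorption step in stage 2 can be pictured in plain Julia as partial application (the names here are illustrative):

    # a second-order factor over (x, y)
    f(x, y) = exp(-(x - y)^2)

    # absorbing an observed value of y yields a first-order factor over x
    absorb(f::Function, yval) = x -> f(x, yval)

    g = absorb(f, 2.0)
    g(1.5) == f(1.5, 2.0)   # true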
Here, we focus on the first two steps, that is, compilation of model and query specifications into inference algorithms.
Gaussian Mixture Model Revisited¶
To illustrate the procedure of model compilation, let's revisit the Gaussian mixture model (GMM). Given a model specification, the @model macro creates a factor graph, which is a hyper-graph with factors connecting variables. The following diagram is the factor graph that represents a GMM with two components.
In this graph, each sample is associated with two factors: a mixture factor that connects the observed sample x_i, the component parameters, and the component indicator z_i; and a prior factor that specifies the prior of each component indicator. These two factors were directly specified in the model specification. In the model learning query, the data x are given. Therefore, the order of each mixture factor is reduced by one.
The most widely used learning algorithm for this purpose is Expectation Maximization, which comprises three updating steps, as outlined below.
Update the posterior probabilities of z_i conditioned on the prior pi, the component parameters (mu_k, sig_k), and the corresponding observed sample x_i, as

    q_i(k) proportional to pi_k * N(x_i; mu_k, sig_k)

This can be interpreted as an integration of a message from the prior factor (that is, pi_k) and a message from the corresponding mixture factor (that is, N(x_i; mu_k, sig_k)). Here, N(x_i; mu_k, sig_k) denotes the pdf at x_i with respect to the k-th Gaussian component.
Update the maximum likelihood estimation of pi, as

    pi_k = (1 / n) * sum over i of q_i(k)

This can be interpreted as a computational process that combines messages from each of the z_i (that is, the q_i(k)).
Update the maximum likelihood estimation of the component parameters mu_k and sig_k, as

    mu_k  = sum over i of q_i(k) * x_i / sum over i of q_i(k)
    sig_k = sum over i of q_i(k) * (x_i - mu_k) * (x_i - mu_k)' / sum over i of q_i(k)

Again, this is also a specific way of combining messages from each of the mixture factors associated with them.
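The three updates above can be condensed into a short plain-Julia sketch for a one-dimensional GMM (assuming the Distributions.jl package; initialization and convergence checks are deliberately simplified):

    using Distributions, Statistics

    function em_gmm(x::Vector{Float64}, K::Int; iters::Int = 100)
        n = length(x)
        p = fill(1.0 / K, K)                            # prior proportions pi
        mu = quantile(x, [(k - 0.5) / K for k in 1:K])  # crude initialization
        sig = fill(std(x), K)
        q = zeros(n, K)
        for _ in 1:iters
            # E-step: q_i(k) proportional to pi_k * N(x_i; mu_k, sig_k)
            for i in 1:n, k in 1:K
                q[i, k] = p[k] * pdf(Normal(mu[k], sig[k]), x[i])
            end
            q ./= sum(q, dims = 2)
            # M-step: combine the messages q_i(k) into parameter estimates
            nk = vec(sum(q, dims = 1))
            p = nk / n
            mu = (q' * x) ./ nk
            for k in 1:K
                sig[k] = sqrt(sum(q[:, k] .* (x .- mu[k]).^2) / nk[k])
            end
        end
        return p, mu, sig
    end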
The analysis above shows that the E-M algorithm can be viewed as a message-passing scheme that iteratively updates the states associated with each variable by exchanging messages between factors and variables. Actually, message passing is a very general paradigm. Many widely used inference algorithms, including mean-field variational inference, belief propagation, expectation propagation, Gibbs sampling, and Hamiltonian Monte Carlo, can be implemented as certain forms of message-passing schemes.
Inference via Message Passing¶
Generally, a message passing procedure consists of several stages:
Initialize the states associated with each variable. Depending on the chosen algorithm, the state associated with a variable can take different forms. For example, for a discrete variable, its associated state can be a value (in Gibbs sampling or maximum-likelihood estimation), a variational distribution (in variational inference), or a marginal distribution (in belief propagation or expectation propagation).
Iteratively invoke the following steps until convergence or other termination criteria are met.
- update the message from a factor to some of its incident variables (based on updated status of other incident variables).
- update variable states based on incoming messages.
Compute queried quantities based on variable states. For example, if a Gibbs sampling algorithm is used, the expectation of a variable can be approximated by the sample mean.
It is possible to generate such an inference procedure according to the structure of the factor graph. However, there remain several challenges to be addressed:
- Some intermediate quantities may be used in the computation of several different messages. For example, the posterior probabilities q_i(k) appear in multiple updating formulas for the GMM. It is desirable to identify these quantities and avoid unnecessary re-computation of the same value.
- Each message depends on the states of several variables, while the state of a variable may depend on several messages. A message or variable state only needs to be updated when the values it depends on have changed. To identify whether a variable/message needs to be updated, a natural idea is to build a dependency graph, where each node corresponds to either a variable state or a message. By time-stamping each node, it is not difficult to see whether a node needs to be updated by looking at the time-stamps of its neighboring nodes. (A small sketch of this idea follows the list.)
- Updating steps can be scheduled in numerous ways. Poor scheduling may result in slow convergence. Therefore, deriving a reasonable schedule is also important for achieving high efficiency.
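Here is a minimal plain-Julia sketch of the time-stamped dependency node mentioned above; the types and update rule are illustrative only:

    # a dependency-graph node holding either a variable state or a message
    mutable struct Node
        value::Any
        stamp::Int           # time at which this node was last recomputed
        deps::Vector{Node}   # nodes whose values this node depends on
    end

    # the node is stale if any dependency was recomputed after it
    needs_update(nd::Node) = any(d -> d.stamp > nd.stamp, nd.deps)

    # recompute the node's value from its dependencies when necessary
    function update!(nd::Node, f::Function, clock::Int)
        if needs_update(nd)
            nd.value = f([d.value for d in nd.deps])
            nd.stamp = clock
        end
        return nd
    end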