Background
A particularly valuable objective of probabilistic modelling is the
elucidation of genuinely meaningful dependencies in the data, a form of knowledge discovery. While it
is clearly advantageous if a model is able to make reliable
predictions of particular quantities based on a range of data
variables, it can be of even greater interest to infer which
variables are physically relevant in the predictive process, and to what
degree. Estimating a meaningful association between individual gene expression levels and disease, for example, may clearly be of value.
Even in cases where predictive modelling techniques are able to
provide accurate results, discovery of genuine causal
dependencies in data is typically highly problematic. There are three
main reasons why this is so:
1. The computational mechanisms of many predictive models, even when accurate, can often
be sufficiently opaque as to defy useful interpretation. Good
examples are a multi-layer neural network or a
polynomial support vector machine.
2. Most modelling techniques are designed to elicit arbitrary
dependencies, as long as they lead to accurate results. (For example,
given a particularly high-dimensional data set, cancer levels in
Wisconsin may be shown to be dependent on the prevailing phase of the
moon.)
3. Models can only be expected to extract statistical
relationships between variables, and cannot prove
causality.
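The second point can be illustrated with a small simulation. The following is a minimal sketch, assuming purely synthetic data: the sample size, the number of variables and the random seed are arbitrary choices made for illustration, and NumPy is assumed to be available. Every variable is independent noise, yet with enough candidate variables at least one will appear to be strongly associated with the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 5000  # few observations, many candidate variables

# Every candidate variable and the target are independent noise,
# so no genuine dependency exists anywhere in this data set.
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

# Pearson correlation of each column of X with y.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n_samples

max_abs_corr = float(np.abs(corr).max())
print(f"strongest spurious correlation: {max_abs_corr:.2f}")
```

A model that is free to exploit any of these variables would happily use the best-correlated one, even though it carries no real information about the target.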
Expertise
The issue of opaque prediction mechanisms (item 1 above) may be overcome by an
educated model choice, and this is an area where we have broad
experience. Discovering minimal statistical dependencies in data sets
(2, above) is an area where we have exceptional expertise, having developed
the "sparse Bayesian" modelling approach specifically for this purpose. The
issue of causality (3, above) cannot be solved explicitly, but to an extent can be worked
around through sensible design of experiments or data-gathering policies.
Key Technology
- Model transparency: the application of predictive techniques with
interpretable computational mechanisms.
- Application of advanced Bayesian techniques, notably automatic
relevance determination.
- Discovery of meaningful underlying relationships and dependencies
in data via appropriate prior probability models.
- Feature selection: elucidation of key variables and factors of
influence for predictive modelling tasks.
- For unsupervised modelling tasks, extraction of a minimal
underlying statistical model leads to more meaningful, parsimonious
and efficient representations, for example the elicitation of a minimal
set of underlying clusters.
- Probabilistic representation in terms of minimal latent variable
models, including principal components, independent components and
factor analysis.
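To make the idea of automatic relevance determination in the list above more concrete, the following is a minimal sketch for linear regression, assuming the standard fixed-point hyperparameter updates and NumPy. It is not the specific "sparse Bayesian" implementation referred to here: the function name `ard_regression`, the pruning threshold `alpha_max`, the iteration count and the synthetic data are all hypothetical choices made for illustration. Each weight receives its own Gaussian prior precision, and a variable is dropped once its precision grows so large that the weight is effectively pinned at zero.

```python
import numpy as np

def ard_regression(X, y, n_iter=300, alpha_max=1e6):
    """Linear regression with a separate Gaussian prior precision per weight."""
    n, d = X.shape
    alpha = np.ones(d)               # prior precision of each weight
    beta = 1.0 / np.var(y)           # noise precision (initial guess)
    active = np.ones(d, dtype=bool)  # variables not yet pruned
    mu = np.zeros(d)
    for _ in range(n_iter):
        Xa = X[:, active]
        # Gaussian posterior over the active weights.
        Sigma = np.linalg.inv(np.diag(alpha[active]) + beta * Xa.T @ Xa)
        mu_a = beta * Sigma @ Xa.T @ y
        # gamma_i in [0, 1] measures how well the data determine weight i.
        gamma = np.clip(1.0 - alpha[active] * np.diag(Sigma), 0.0, 1.0)
        # Fixed-point re-estimates of the hyperparameters.
        alpha[active] = gamma / (mu_a**2 + 1e-12)
        resid = y - Xa @ mu_a
        beta = (n - gamma.sum()) / (resid @ resid)
        mu[:] = 0.0
        mu[active] = mu_a
        # A very large precision pins the weight at zero: prune it for good.
        active &= alpha < alpha_max
    mu[~active] = 0.0
    return mu, alpha, active

# Synthetic example: 20 candidate variables, only the first two matter.
rng = np.random.default_rng(1)
n, d = 200, 20
X = rng.standard_normal((n, d))
true_w = np.zeros(d)
true_w[0], true_w[1] = 2.0, -1.5
y = X @ true_w + 0.1 * rng.standard_normal(n)

mu, alpha, active = ard_regression(X, y)
print("retained variables:", np.flatnonzero(active))
print("their weights:", np.round(mu[active], 3))
```

In a run like this the two genuinely relevant variables should be retained with weights close to their true values, while a number of the irrelevant ones are pruned. Some irrelevant variables may still survive with small weights, so the retained set is best read as a list of candidates rather than a proof of relevance.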