Knowledge Discovery

Background

Figure: A bivariate Bayesian "sparsity" prior.

One particularly valuable objective of probabilistic modelling is the elucidation of genuinely meaningful dependencies in the data, a form of knowledge discovery. While it is clearly advantageous for a model to make reliable predictions of particular quantities from a range of data variables, it can be of even greater interest to infer which variables are physically relevant to the prediction, and to what degree. Estimating a meaningful association between individual gene expression levels and disease, for example, may clearly be of value.

 

Even in cases where predictive modelling techniques are able to provide accurate results, discovery of genuine causal dependencies in data is typically highly problematic. There are three main reasons why this is so:

  1. The computational mechanisms of many predictive models, even accurate ones, are often too opaque to interpret usefully. Good examples are a multi-layer neural network or a support vector machine with a polynomial kernel.
  2. Most modelling techniques are designed to exploit any dependency in the data, however arbitrary, as long as it leads to accurate results. (For example, given a particularly high-dimensional data set, cancer incidence in Wisconsin may appear to depend on the prevailing phase of the moon.)
  3. Models can only be expected to extract statistical relationships between variables, and cannot prove causality.
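Item 2 can be illustrated with a minimal sketch. It assumes purely synthetic data: the sample count, the feature count and the use of NumPy are illustrative choices, not details of any real study. The target is generated independently of every input, yet with far more variables than samples at least one input still appears noticeably correlated with it.

```python
import numpy as np

# Synthetic data only: the target is independent of every input by construction.
rng = np.random.default_rng(0)
n_samples, n_features = 50, 1000
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Pearson correlation of each input column with the target.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n_samples

# The strongest apparent "dependency" is an artefact of searching many variables.
best = int(np.argmax(np.abs(corr)))
max_abs_corr = float(np.abs(corr).max())
print(f"input {best} has |correlation| {max_abs_corr:.2f} with an unrelated target")
```

A model that is free to exploit any such input will report it as predictive, which is why a mechanism for discounting spurious dependencies matters.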

Expertise

Figure: Estimated relevance of gene expression to the incidence of colon cancer.

The issue of opaque prediction mechanisms (item 1 above) may be overcome by an educated choice of model, and this is an area where we have broad experience. Discovering minimal statistical dependencies in data sets (item 2 above) is an area where we have exceptional expertise, having developed the "sparse Bayesian" modelling approach specifically for this purpose. The issue of causality (item 3 above) cannot be solved explicitly, but can to an extent be worked around through sensible design of experiments or data-gathering policies.

Key Technology

Figure: Evolution of dependency estimates over time.

  • Model transparency: the application of predictive techniques with interpretable computational mechanisms.
  • Application of advanced Bayesian techniques, notably automatic relevance determination.
  • Discovery of meaningful underlying relationships and dependencies in data via appropriate prior probability models.
  • Feature selection: elucidation of key variables and factors of influence for predictive modelling tasks.
  • For unsupervised modelling tasks, extraction of a minimal underlying statistical model, leading to more meaningful, parsimonious and efficient representations. For example, elicitation of a minimal set of underlying clusters.
  • Probabilistic representation in terms of minimal latent variable models, including principal components, independent components and factor analysis.
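To give a flavour of automatic relevance determination, the following is a minimal sketch of a textbook linear-regression variant, not a description of our production methods. It assumes a Gaussian prior with a separate precision `alpha[i]` for each input weight and re-estimates those precisions with the standard fixed-point updates. The function name `ard_regression`, the cap `alpha_cap` and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def ard_regression(X, y, n_iter=50, alpha_cap=1e6):
    """Textbook ARD for linear regression (illustrative sketch only)."""
    n, d = X.shape
    alpha = np.ones(d)  # one prior precision per input variable
    beta = 1.0          # noise precision
    for _ in range(n_iter):
        # Gaussian posterior over the weights given current alpha and beta.
        Sigma = np.linalg.inv(beta * X.T @ X + np.diag(alpha))
        mu = beta * Sigma @ X.T @ y
        # gamma[i] measures how well the data determine weight i (between 0 and 1).
        gamma = 1.0 - alpha * np.diag(Sigma)
        # Fixed-point re-estimation; the cap keeps the arithmetic finite.
        alpha = np.minimum(gamma / (mu**2 + 1e-12), alpha_cap)
        beta = (n - gamma.sum()) / np.sum((y - X @ mu) ** 2)
    return mu, alpha, beta

# Synthetic example: only inputs 0 and 3 actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

mu, alpha, beta = ard_regression(X, y)
for i in range(X.shape[1]):
    print(f"input {i}: weight {mu[i]:+.3f}, prior precision {alpha[i]:.3g}")
```

A large precision pins the corresponding weight close to zero, so inputs whose `alpha` grows large are effectively pruned from the model, while inputs that retain a small `alpha` are those the data support as relevant.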