Before forming Vector Anomaly we had been active in research for several years and had published a number of original scientific articles on topics related to machine learning and data analysis. Some of the more relevant, and hopefully more interesting, examples are listed below (with links to downloadable versions where available).
The "Relevance Vector Machine" prediction model
Tipping, M. E. (2001).
Sparse Bayesian learning and the relevance vector
machine.
Journal of Machine Learning Research 1, 211–244.
[Available online from the Journal of Machine Learning Research]
Abstract
This paper
introduces a general Bayesian framework for obtaining sparse solutions
to regression and classification tasks utilising models linear in the
parameters. Although this framework is fully general, we illustrate
our approach with a particular specialisation that we denote the "relevance vector machine" (RVM), a model of identical functional form
to the popular and state-of-the-art "support vector machine" (SVM). We
demonstrate that by exploiting a probabilistic Bayesian learning
framework, we can derive accurate prediction models which typically
utilise dramatically fewer basis functions than a comparable SVM while
offering a number of additional advantages. These include the benefits
of probabilistic predictions, automatic estimation of 'nuisance'
parameters, and the facility to utilise arbitrary basis functions
(e.g. non-'Mercer' kernels). We detail the Bayesian framework and
associated learning algorithm for the RVM, and give some illustrative
examples of its application along with some comparative benchmarks. We
offer some explanation for the exceptional degree of sparsity
obtained, and discuss and demonstrate some of the advantageous
features, and potential extensions, of Bayesian relevance
learning.
The Relevance Vector Machine: an optimised learning algorithm
Tipping, M. E. and A. C. Faul (2003).
Fast marginal likelihood maximisation for sparse
Bayesian models.
In C. M. Bishop and B. J. Frey (Eds.), Proceedings of the
Ninth International Workshop on Artificial Intelligence and Statistics, Key
West, FL, Jan 3-6.
[PDF]
[gzipped PostScript]
Abstract
The "sparse Bayesian" modelling approach, as exemplified by the "relevance vector machine", enables sparse classification and
regression functions to be obtained by linearly-weighting a small
number of fixed basis functions from a large dictionary of potential
candidates. Such a model conveys a number of advantages over the
related and very popular "support vector machine", but the necessary 'training' procedure — optimisation of the marginal likelihood
function — is typically much slower. We describe a new and
highly accelerated algorithm which exploits recently-elucidated
properties of the marginal likelihood function to enable
maximisation via a principled and efficient sequential addition and
deletion of candidate basis functions.
A popular introductory book chapter on Bayesian machine learning
Tipping, M. E. (2004).
Bayesian inference: An introduction to principles and
practice in machine learning.
In O. Bousquet, U. von Luxburg, and G. Rätsch (Eds.), Advanced Lectures on Machine Learning, pp.
41–62. Springer.
[PDF]
[gzipped PostScript]
Abstract
This article gives a basic introduction to the principles of Bayesian
inference in a machine learning context, with an emphasis on the
importance of marginalisation for dealing with uncertainty. We begin
by illustrating concepts via a simple regression task before
relating ideas to practical, contemporary, techniques with a
description of "sparse Bayesian" models and the "relevance vector
machine".
NeuroScale: a topographic approach to customised data visualisation
Lowe, D. and M. E. Tipping (1996).
Feed-forward neural networks and topographic mappings for
exploratory data analysis.
Neural Computing and Applications 4, 83–95.
[gzipped PostScript]
Abstract
A recent novel approach to the visualisation and analysis of datasets, and one
which is particularly applicable to those of a high dimension, is discussed
in the context of real applications. A feed-forward neural network is
utilised to effect a topographic, structure-preserving, dimension-reducing
transformation of the data, with an additional facility to incorporate
different degrees of associated subjective information. The properties of
this transformation are illustrated on synthetic and real data sets, including
the 1992 UK Research Assessment Exercise for funding in higher education. The
method is compared and contrasted to established techniques for feature
extraction, and related to topographic mappings, the Sammon projection and
the statistical field of multidimensional scaling.
Hierarchical Data Visualisation with Probabilistic Projection Models
Bishop, C. M. and M. E. Tipping (1998).
A hierarchical latent variable model for data
visualization.
IEEE Transactions on Pattern Analysis and Machine
Intelligence 20(3), 281–293.
[PDF]
Abstract
Visualisation has proven to be a powerful and widely-applicable tool for the
analysis and interpretation of multi-variate data. Most visualisation
algorithms aim to find a projection from the data space down to a
two-dimensional visualisation space. However, for complex data sets living in
a high-dimensional space it is unlikely that a single two-dimensional
projection can reveal all of the interesting structure. We therefore
introduce a hierarchical visualisation algorithm which allows the complete
data set to be visualised at the top level, with clusters and sub-clusters of
data points visualised at deeper levels. The algorithm is based on a
hierarchical mixture of latent variable models, whose parameters are
estimated using the expectation-maximization algorithm. We demonstrate the
principle of the approach on a toy data set, and we then apply the algorithm
to the visualisation of a synthetic data set in 12 dimensions obtained from a
simulation of multi-phase flows in oil pipelines, and to data in 36
dimensions derived from satellite images. A Matlab software implementation of
the algorithm is publicly available from the world-wide web.
Novel visualisation of binary data
Tipping, M. E. (1999b).
Probabilistic visualisation of high-dimensional binary
data.
In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems 11,
Cambridge, MA, pp. 592–598. MIT Press.
[gzipped PostScript]
Abstract
We present a probabilistic latent-variable framework for data visualisation, a
key feature of which is its applicability to binary and categorical data
types for which few established methods exist. A variational approximation to
the likelihood is exploited to derive a fast algorithm for determining the
model parameters. Illustrations of application to real and synthetic binary
data sets are given.
Enhanced visualisation via distance metrics based on mixture density models
Tipping, M. E. (1999a).
Deriving cluster analytic distance functions from
Gaussian mixture models.
In Proceedings of the Ninth International Conference on
Artificial Neural Networks (ICANN99), Volume 2, pp.
815–820. IEE.
[gzipped PostScript]
Abstract
The reliable detection of clusters in datasets of non-trivial dimensionality is
notoriously difficult. Clustering algorithms are generally driven by some
distance function (usually Euclidean) defined over pairs of examples, which
implicitly treats distances within and between clusters alike. In this paper,
a more effective distance measure is proposed, derived from an a
priori estimated Gaussian mixture model. Examples illustrate how the
proposed approach can effectively de-emphasise within-cluster structure, and
thus implicitly magnify the separation between regions of high data
density.
Bayesian analysis of gene expression data
Li, Y., C. Campbell, and M. E. Tipping (2002).
Bayesian automatic relevance determination algorithms for
classifying gene expression data.
Bioinformatics 18(10), 1332–1339.
Abstract
We investigate two Bayesian classification algorithms incorporating feature
selection. These algorithms are applied to classification of gene expression
data derived from cDNA microarrays. We demonstrate the effectiveness of the
algorithms on three gene expression datasets for cancer, showing they compare
well with alternative kernel-based techniques. By automatically incorporating
feature selection, accurate classifiers can be constructed utilising very few
features and with minimal hand-tuning. We argue that the feature selection is
meaningful and some of the highlighted genes appear to be medically
important.
A probabilistic model for principal components analysis
Tipping, M. E. and C. M. Bishop (1999a).
Probabilistic principal component analysis.
Journal of the Royal Statistical Society, Series
B 61(3), 611–622.
[Request a copy]
Abstract
Principal component analysis (PCA) is a ubiquitous technique for data analysis
and processing, but one which is not based upon a probability model. In this
paper we demonstrate how the principal axes of a set of observed data vectors
may be determined through maximum-likelihood estimation of parameters in a
latent variable model closely related to factor analysis. We consider the
properties of the associated likelihood function, giving an EM algorithm for
estimating the principal subspace iteratively, and discuss, with illustrative
examples, the advantages conveyed by this probabilistic approach to
PCA.
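As a rough illustration of the maximum-likelihood view of PCA described in this abstract, the following is a minimal sketch only. It assumes the well-known closed-form solution (leading eigenvectors of the sample covariance, with the noise variance set to the mean of the discarded eigenvalues) rather than the iterative EM algorithm the abstract mentions. The function name `ppca_ml` and its arguments are hypothetical and are not taken from the paper or from any accompanying software.

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form maximum-likelihood fit of a q-dimensional PPCA model.

    X is an (n_samples, d) array and q must satisfy 0 < q < d.
    Returns (W, sigma2, mu): the d x q weight matrix, the isotropic
    noise variance and the data mean. W is determined only up to a
    rotation of the latent space; the rotation is taken as identity.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 0 < q < d:
        raise ValueError("q must satisfy 0 < q < d")
    mu = X.mean(axis=0)
    # Maximum-likelihood sample covariance (normalised by n, not n - 1).
    S = np.cov(X, rowvar=False, bias=True)
    # eigh returns eigenvalues in ascending order; reverse to descending.
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Noise variance: average of the d - q discarded eigenvalues.
    sigma2 = eigvals[q:].mean()
    # Scale each retained eigenvector by sqrt(lambda_i - sigma2).
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return W, sigma2, mu
```

Under these assumptions, the model covariance `W @ W.T + sigma2 * np.eye(d)` reproduces the q largest eigenvalues of the sample covariance and replaces the remainder with `sigma2`.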
A probabilistic mixture model implementation of multiple principal components analyses
Tipping, M. E. and C. M. Bishop (1999b).
Mixtures of probabilistic principal component
analysers.
Neural Computation 11(2), 443–482.
[PDF]
[gzipped PostScript]
Abstract
Principal component analysis (PCA) is one of the most popular techniques for
processing, compressing and visualising data, although its effectiveness is
limited by its global linearity. While nonlinear variants of PCA have been
proposed, an alternative paradigm is to capture data complexity by a
combination of local linear PCA projections. However, conventional PCA does
not correspond to a probability density, and so there is no unique way to
combine PCA models. Previous attempts to formulate mixture models for PCA
have therefore to some extent been ad hoc. In this paper, PCA is formulated
within a maximum-likelihood framework, based on a specific form of Gaussian
latent variable model. This leads to a well-defined mixture model for
probabilistic principal component analysers, whose parameters can be
determined using an EM algorithm. We discuss the advantages of this model in
the context of clustering, density modelling and local dimensionality
reduction, and we demonstrate its application to image compression and
handwritten digit recognition.
Exploiting variational inference techniques to derive novel methods for robust Bayesian interpolation and "generalised" component analysis of data
Tipping, M. E. and N. D. Lawrence (2005).
Variational inference for Student-t models: Robust
Bayesian interpolation and generalised component analysis.
NeuroComputing 69, 123–141.
Abstract
We demonstrate how a variational approximation scheme enables effective
inference of key parameters in probabilistic signal models which employ the
Student-t distribution. Using the two scenarios of robust
interpolation and independent component analysis (ICA) as examples, we
illustrate the key feature of the approach: that the form of the noise
distribution in the interpolation case, and the source distributions in the
ICA case, can be inferred from the data concurrent with all other model
parameters.
Exploiting Bayesian methodology to improve information retrieval systems
Zaragoza, H., D. Hiemstra, and M. E. Tipping (2003).
Bayesian extension to the language model for ad hoc
information retrieval.
In Proceedings of the 26th International ACM SIGIR
Conference, pp. 4–9.
Abstract
We propose a Bayesian extension to the ad hoc Language
Model. Many smoothed estimators used for the multinomial
query model in ad hoc Language Models (including Laplace
and Bayes smoothing) are approximations to the Bayesian
predictive distribution. In this paper we derive the full
predictive distribution in a form amenable to implementation
by classical IR models, and then compare it to other
currently used estimators. In our experiments the proposed
model outperforms Bayes smoothing, and its combination
with linear interpolation smoothing outperforms all other
estimators.
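For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of "Bayes smoothing" as it is commonly formulated, with a Dirichlet prior centred on the collection model. It is not the full predictive distribution derived in the paper. The function name, the token-list inputs and the default value of `mu` are all assumptions made for illustration.

```python
from collections import Counter
import math

def bayes_smoothed_log_likelihood(query, doc, collection_counts, mu=1000.0):
    """Query log-likelihood under a Dirichlet-smoothed document model.

    query and doc are lists of tokens; collection_counts maps each term
    to its count over the whole collection. mu > 0 is the smoothing
    strength (an assumed default, not a value from the paper).
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    coll_len = sum(collection_counts.values())
    score = 0.0
    for term in query:
        # Collection (background) probability of the term.
        p_coll = collection_counts.get(term, 0) / coll_len
        # Smoothed estimate: (c(w, d) + mu * P(w | C)) / (|d| + mu).
        p = (doc_counts[term] + mu * p_coll) / (doc_len + mu)
        if p == 0.0:
            # Term unseen in the whole collection.
            return float("-inf")
        score += math.log(p)
    return score
```

Documents can then be ranked for a query by this score, with larger values of `mu` pulling each document model more strongly towards the collection model.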
A Bayesian approach to image super-resolution for concurrent estimation of upscaled image pixel values and registration parameters
Tipping, M. E. and C. M. Bishop (2003).
Bayesian image super-resolution.
In S. Becker, S. Thrun, and K. Obermayer (Eds.), Advances
in Neural Information Processing Systems 15. MIT Press.
[PDF]
[gzipped PostScript]
Abstract
The extraction of a single high-quality image from a set of low-resolution
images is an important problem which arises in fields such as remote sensing,
surveillance, medical imaging and the extraction of still images from video.
Typical approaches are based on the use of cross-correlation to register the
images followed by the inversion of the transformation from the unknown high
resolution image to the observed low resolution images, using regularization
to resolve the ill-posed nature of the inversion process. In this paper we
develop a Bayesian treatment of the super-resolution problem in which the
likelihood function for the image registration parameters is based on a
marginalization over the unknown high-resolution image. This approach allows
us to estimate the unknown point spread function, and is rendered tractable
through the introduction of a Gaussian process prior over images. Results
indicate a significant improvement over techniques based on MAP (maximum
a-posteriori) point optimization of the high resolution image and associated
registration parameters.