Science

Before forming Vector Anomaly we had been active in research for several years and had published a number of original scientific articles on topics related to machine learning and data analysis. Some of the more relevant, and hopefully more interesting, examples are listed below (with links to downloadable versions where available).

Sparse Bayesian learning and the relevance vector machine

The "Relevance Vector Machine" prediction model

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244. [Available online from the Journal of Machine Learning Research]

Abstract

This paper introduces a general Bayesian framework for obtaining sparse solutions to regression and classification tasks utilising models linear in the parameters. Although this framework is fully general, we illustrate our approach with a particular specialisation that we denote the "relevance vector machine" (RVM), a model of identical functional form to the popular and state-of-the-art "support vector machine" (SVM). We demonstrate that by exploiting a probabilistic Bayesian learning framework, we can derive accurate prediction models which typically utilise dramatically fewer basis functions than a comparable SVM while offering a number of additional advantages. These include the benefits of probabilistic predictions, automatic estimation of 'nuisance' parameters, and the facility to utilise arbitrary basis functions (e.g. non-'Mercer' kernels). We detail the Bayesian framework and associated learning algorithm for the RVM, and give some illustrative examples of its application along with some comparative benchmarks. We offer some explanation for the exceptional degree of sparsity obtained, and discuss and demonstrate some of the advantageous features, and potential extensions, of Bayesian relevance learning.

Fast marginal likelihood maximisation for sparse Bayesian models

The Relevance Vector Machine: an optimised learning algorithm

Tipping, M. E. and A. C. Faul (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop and B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, Jan 3-6. [PDF] [gzipped PostScript]

Abstract

The "sparse Bayesian" modelling approach, as exemplified by the "relevance vector machine", enables sparse classification and regression functions to be obtained by linearly weighting a small number of fixed basis functions from a large dictionary of potential candidates. Such a model conveys a number of advantages over the related and very popular "support vector machine", but the necessary 'training' procedure (optimisation of the marginal likelihood function) is typically much slower. We describe a new and highly accelerated algorithm which exploits recently elucidated properties of the marginal likelihood function to enable maximisation via a principled and efficient sequential addition and deletion of candidate basis functions.
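As a rough illustration of the "sequential addition and deletion" idea, the paper's analysis reduces the decision for each candidate basis function to two scalars, a sparsity factor s and a quality factor q. The sketch below shows only that decision rule, assuming s and q have already been computed from the data; the function name `update_alpha` and the action labels are hypothetical and are not taken from any published implementation.

```python
import math

def update_alpha(s, q, alpha):
    """Decide how to treat one candidate basis function (illustrative only).

    s     : sparsity factor for the candidate
    q     : quality factor for the candidate
    alpha : current hyperparameter; math.inf means "not in the model"
    Returns (action, new_alpha).
    """
    theta = q * q - s                  # sign of theta drives the decision
    in_model = math.isfinite(alpha)
    if theta > 0:
        # Marginal likelihood has a finite maximum at s^2 / (q^2 - s).
        new_alpha = s * s / theta
        return ("re-estimate" if in_model else "add", new_alpha)
    # Otherwise the maximum is at alpha = infinity: the function is pruned.
    return ("delete" if in_model else "skip", math.inf)
```

A full algorithm would loop over candidates, apply one such action per step, and update s and q for every candidate afterwards; that bookkeeping is omitted here.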

Bayesian inference: An introduction to principles and practice in machine learning

A popular introductory book chapter on Bayesian machine learning

Tipping, M. E. (2004). Bayesian inference: An introduction to principles and practice in machine learning. In O. Bousquet, U. von Luxburg, and G. Rätsch (Eds.), Advanced Lectures on Machine Learning, pp. 41–62. Springer. [PDF] [gzipped PostScript]

Abstract

This article gives a basic introduction to the principles of Bayesian inference in a machine learning context, with an emphasis on the importance of marginalisation for dealing with uncertainty. We begin by illustrating concepts via a simple regression task before relating ideas to practical, contemporary, techniques with a description of "sparse Bayesian" models and the "relevance vector machine".

Feed-forward neural networks and topographic mappings for exploratory data analysis

NeuroScale: a topographic approach to customised data visualisation

Lowe, D. and M. E. Tipping (1996). Feed-forward neural networks and topographic mappings for exploratory data analysis. Neural Computing and Applications 4, 83–95. [gzipped PostScript]

Abstract

A recent novel approach to the visualisation and analysis of datasets, and one which is particularly applicable to those of a high dimension, is discussed in the context of real applications. A feed-forward neural network is utilised to effect a topographic, structure-preserving, dimension-reducing transformation of the data, with an additional facility to incorporate different degrees of associated subjective information. The properties of this transformation are illustrated on synthetic and real data sets, including the 1992 UK Research Assessment Exercise for funding in higher education. The method is compared and contrasted to established techniques for feature extraction, and related to topographic mappings, the Sammon projection and the statistical field of multidimensional scaling.

A hierarchical latent variable model for data visualization

Hierarchical Data Visualisation with Probabilistic Projection Models

Bishop, C. M. and M. E. Tipping (1998). A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 281–293. [PDF]

Abstract

Visualisation has proven to be a powerful and widely-applicable tool for the analysis and interpretation of multi-variate data. Most visualisation algorithms aim to find a projection from the data space down to a two-dimensional visualisation space. However, for complex data sets living in a high-dimensional space it is unlikely that a single two-dimensional projection can reveal all of the interesting structure. We therefore introduce a hierarchical visualisation algorithm which allows the complete data set to be visualised at the top level, with clusters and sub-clusters of data points visualised at deeper levels. The algorithm is based on a hierarchical mixture of latent variable models, whose parameters are estimated using the expectation-maximization algorithm. We demonstrate the principle of the approach on a toy data set, and we then apply the algorithm to the visualisation of a synthetic data set in 12 dimensions obtained from a simulation of multi-phase flows in oil pipelines, and to data in 36 dimensions derived from satellite images. A Matlab software implementation of the algorithm is publicly available from the world-wide web.

Probabilistic visualisation of high-dimensional binary data

Novel visualisation of binary data

Tipping, M. E. (1999b). Probabilistic visualisation of high-dimensional binary data. In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems 11, Cambridge, MA, pp. 592–598. MIT Press. [gzipped PostScript]

Abstract

We present a probabilistic latent-variable framework for data visualisation, a key feature of which is its applicability to binary and categorical data types for which few established methods exist. A variational approximation to the likelihood is exploited to derive a fast algorithm for determining the model parameters. Illustrations of application to real and synthetic binary data sets are given.

Deriving cluster analytic distance functions from Gaussian mixture models

Enhanced visualisation via distance metrics based on mixture density models

Tipping, M. E. (1999a). Deriving cluster analytic distance functions from Gaussian mixture models. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN99), Volume 2, pp. 815–820. IEE. [gzipped PostScript]

Abstract

The reliable detection of clusters in datasets of non-trivial dimensionality is notoriously difficult. Clustering algorithms are generally driven by some distance function (usually Euclidean) defined over pairs of examples, which implicitly treats distances within and between clusters alike. In this paper, a more effective distance measure is proposed, derived from an a priori estimated Gaussian mixture model. Examples illustrate how the proposed approach can effectively de-emphasise within-cluster structure, and thus implicitly magnify the separation between regions of high data density.

Bayesian automatic relevance determination algorithms for classifying gene expression data

Bayesian analysis of gene expression data

Li, Y., C. Campbell, and M. E. Tipping (2002). Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics 18(10), 1332–1339.

Abstract

We investigate two Bayesian classification algorithms incorporating feature selection. These algorithms are applied to classification of gene expression data derived from cDNA microarrays. We demonstrate the effectiveness of the algorithms on three gene expression datasets for cancer, showing they compare well with alternative kernel-based techniques. By automatically incorporating feature selection, accurate classifiers can be constructed utilising very few features and with minimal hand-tuning. We argue that the feature selection is meaningful and some of the highlighted genes appear to be medically important.

Probabilistic principal component analysis

A probabilistic model for principal components analysis

Tipping, M. E. and C. M. Bishop (1999a). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622. [Request a copy]

Abstract

Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss, with illustrative examples, the advantages conveyed by this probabilistic approach to PCA.
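To give a flavour of the iterative estimation mentioned above, here is a minimal sketch of EM updates for probabilistic PCA, restricted to a single latent dimension so that it needs no matrix library. It assumes the sample covariance `S` is supplied directly; the function name `ppca_em_1d` and its arguments are hypothetical, and the restriction to one latent dimension is a simplification made purely for illustration.

```python
import math

def ppca_em_1d(S, w, sigma2, n_iter=500):
    """EM for probabilistic PCA with one latent dimension (illustrative sketch).

    S      : d x d sample covariance, as a list of lists
    w      : initial d-vector of loadings
    sigma2 : initial isotropic noise variance
    Returns the estimated (w, sigma2).
    """
    d = len(S)
    trace_S = sum(S[i][i] for i in range(d))
    for _ in range(n_iter):
        # M = w'w + sigma^2 is a scalar when there is one latent dimension.
        m = sum(wi * wi for wi in w) + sigma2
        Sw = [sum(S[i][j] * w[j] for j in range(d)) for i in range(d)]
        wSw = sum(w[i] * Sw[i] for i in range(d))
        # New loadings: S w (sigma^2 + w'Sw / M)^-1
        denom = sigma2 + wSw / m
        w_new = [x / denom for x in Sw]
        # New noise variance: (1/d) * trace(S - S w M^-1 w_new')
        sigma2 = (trace_S - sum(Sw[i] * w_new[i] for i in range(d)) / m) / d
        w = w_new
    return w, sigma2

# Toy covariance with variances 4 and 1 along the coordinate axes.
w_hat, noise_hat = ppca_em_1d([[4.0, 0.0], [0.0, 1.0]], [1.0, 1.0], 1.0)
```

On this toy input the loading vector should align with the first axis, the direction of largest variance, and the noise variance should settle at the variance of the discarded direction.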

Mixtures of probabilistic principal component analysers

A probabilistic mixture model implementation of multiple principal components analyses

Tipping, M. E. and C. M. Bishop (1999b). Mixtures of probabilistic principal component analysers. Neural Computation 11(2), 443–482. [PDF] [gzipped PostScript]

Abstract

Principal component analysis (PCA) is one of the most popular techniques for processing, compressing and visualising data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Previous attempts to formulate mixture models for PCA have therefore to some extent been ad hoc. In this paper, PCA is formulated within a maximum-likelihood framework, based on a specific form of Gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analysers, whose parameters can be determined using an EM algorithm. We discuss the advantages of this model in the context of clustering, density modelling and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.

Variational inference for Student-t models: Robust Bayesian interpolation and generalised component analysis

Exploiting variational inference techniques to derive novel methods for robust Bayesian interpolation and "generalised" component analysis of data

Tipping, M. E. and N. D. Lawrence (2005). Variational inference for Student-t models: Robust Bayesian interpolation and generalised component analysis. NeuroComputing 69, 123–141.

Abstract

We demonstrate how a variational approximation scheme enables effective inference of key parameters in probabilistic signal models which employ the Student-t distribution. Using the two scenarios of robust interpolation and independent component analysis (ICA) as examples, we illustrate the key feature of the approach: that the form of the noise distribution in the interpolation case, and the source distributions in the ICA case, can be inferred from the data concurrently with all other model parameters.

Bayesian extension to the language model for ad hoc information retrieval

Exploiting Bayesian methodology to improve information retrieval systems

Zaragoza, H., D. Hiemstra, and M. E. Tipping (2003). Bayesian extension to the language model for ad hoc information retrieval. In Proceedings of the 26th International ACM SIGIR Conference, pp. 4–9.

Abstract

We propose a Bayesian extension to the ad hoc Language Model. Many smoothed estimators used for the multinomial query model in ad hoc Language Models (including Laplace and Bayes-smoothing) are approximations to the Bayesian predictive distribution. In this paper we derive the full predictive distribution in a form amenable to implementation by classical IR models, and then compare it to other currently used estimators. In our experiments the proposed model outperforms Bayes-smoothing, and its combination with linear interpolation smoothing outperforms all other estimators.
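For readers unfamiliar with the two baseline estimators named above, the following sketch shows them in their textbook form. It assumes that "Bayes-smoothing" means the usual Dirichlet-prior estimator with a strength parameter `mu`; it does not implement the paper's full predictive distribution. All function names, the toy document, and the collection probabilities are hypothetical.

```python
from collections import Counter

def laplace_smoothed(word, doc_counts, doc_len, vocab_size):
    """Laplace estimate: add one pseudo-count for every vocabulary word."""
    return (doc_counts[word] + 1) / (doc_len + vocab_size)

def bayes_smoothed(word, doc_counts, doc_len, collection_prob, mu):
    """Dirichlet-prior estimate: add mu pseudo-counts, shared out
    according to the word's probability in the whole collection."""
    return (doc_counts[word] + mu * collection_prob[word]) / (doc_len + mu)

# Toy example (made-up data): a three-word document and a three-word vocabulary.
doc = "bayes model bayes".split()
counts = Counter(doc)
collection_prob = {"bayes": 0.25, "model": 0.25, "prior": 0.5}

p_unseen_laplace = laplace_smoothed("prior", counts, len(doc), len(collection_prob))
p_unseen_bayes = bayes_smoothed("prior", counts, len(doc), collection_prob, mu=2.0)
```

Both estimators give the unseen word "prior" a non-zero probability; the Dirichlet-prior version gives it more mass here because the word is common in the collection.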

Bayesian image super-resolution

A Bayesian approach to image super-resolution for concurrent estimation of upscaled image pixel values and registration parameters

Tipping, M. E. and C. M. Bishop (2003). Bayesian image super-resolution. In S. Becker, S. Thrun, and K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15. MIT Press. [PDF] [gzipped PostScript]

Abstract

The extraction of a single high-quality image from a set of low-resolution images is an important problem which arises in fields such as remote sensing, surveillance, medical imaging and the extraction of still images from video. Typical approaches are based on the use of cross-correlation to register the images followed by the inversion of the transformation from the unknown high resolution image to the observed low resolution images, using regularization to resolve the ill-posed nature of the inversion process. In this paper we develop a Bayesian treatment of the super-resolution problem in which the likelihood function for the image registration parameters is based on a marginalization over the unknown high-resolution image. This approach allows us to estimate the unknown point spread function, and is rendered tractable through the introduction of a Gaussian process prior over images. Results indicate a significant improvement over techniques based on MAP (maximum a-posteriori) point optimization of the high resolution image and associated registration parameters.