Data Analysis 3 — Sophistication (Knowledge & Innovation)

Sophistication: Taking Data Analysis to the Next Level

With an extensive background in both academic and commercial research, numerous published technical papers and patents held, our expertise extends well beyond the standard set of statistical tools. As a result, we can offer clients a two-fold advantage:

1. Knowledge

We have expert knowledge of advanced techniques, from Bayesian statistics to neural computing. What's more, we know how to integrate them successfully in the real world: from one-off bespoke systems to mass-market software applications with millions of users.

2. Innovation

When existing technology comes up short, we have the capacity to develop novel analytic methods to address uniquely demanding problems. Our "data relevance" model is one such innovation, and is being successfully applied worldwide in thousands of applications.

But "sophistication" need not imply "complication". As the example below demonstrates, a more sophisticated approach can actually lead to a simpler outcome in many applications.

Sophistication in Practice

This particular example hails from the field of medical diagnosis, although the "predictive modelling" approach we use is entirely generic in nature and the principles of "knowledge discovery" shown here can be applied across many application areas.

In this case, we are trying to establish whether any given individual is genetically predisposed to develop a particular disease. Given a sample of both diseased and healthy individuals, the objective is to infer a statistical model capable of predicting an individual's "disease outcome" based on their genetic data alone. Such a model would of course have considerable diagnostic value for patients who are currently healthy yet might be considered to possess "bad genes".

  • The Application
  • A Conventional Approach
  • The Next Level
  • Innovation

The schematic below illustrates the general structure of the predictive model for this example medical diagnosis application.

Schematic of a generic predictive model in a medical application

A patient's genetic information (gene expression data) is fed into the model as "Input Data" on the left. Calculations are then performed on those numbers, and a "Prediction" emerges on the right: the probability that the patient will develop the disease.
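The flow from input data to prediction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the gene weights, the patient values and the use of a logistic function to turn a weighted sum into a probability are all hypothetical choices made for this sketch, not details of the model shown in the schematic.

```python
import math


def predict_disease_probability(expression, weights, bias=0.0):
    """Combine gene expression levels into a probability of disease.

    Assumption for this sketch: the "calculations" are a weighted sum of
    the inputs, passed through a logistic function to give a value in (0, 1).
    """
    score = bias + sum(weights[gene] * expression[gene] for gene in weights)
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights for three genes and one hypothetical patient.
weights = {"G01": 0.8, "G02": -0.5, "G03": 0.3}
patient = {"G01": 1.2, "G02": 0.4, "G03": -0.7}

probability = predict_disease_probability(patient, weights)
print(f"Predicted probability of disease: {probability:.3f}")
```

A positive weight pushes the prediction towards "disease" as that gene's level rises, and a negative weight pushes it away.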

To infer the necessary "calculations" above, we need some data. In this example, 32 "candidate genes" (G01, G02, and so on up to G32) were isolated and measured from a mixed sample of 50 "patients" (some healthy, some diseased). The information in this sample was then used as the basis for developing a predictive model. The model should of course correctly classify the sample and, more importantly, should also give accurate predictions when applied to new patients.

From a statistical perspective, this would be considered a "regression" problem. As posed, it is a relatively challenging one, since there are almost as many input variables (32 genes) as there are data samples (50 patients). Ideally, we would have fewer input variables and more patients, because a model with many adjustable inputs and few samples can fit the sample well without predicting reliably for new patients.
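To see why the ratio of inputs to samples matters, consider the extreme case in which there are exactly as many inputs as samples. The sketch below uses hypothetical random data (8 "patients", 8 "genes", and labels chosen at random, so there is nothing real to learn) and a small hand-written linear solver. It is not the analysis performed in this example, only a demonstration of the general principle.

```python
import random


def solve_linear_system(matrix, targets):
    """Solve matrix * w = targets by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    # Work on an augmented copy so the caller's data is left untouched.
    rows = [list(matrix[i]) + [targets[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    # Back-substitution on the upper-triangular system.
    weights = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * weights[j] for j in range(i + 1, n))
        weights[i] = (rows[i][n] - known) / rows[i][i]
    return weights


random.seed(0)
n = 8  # hypothetical: 8 "patients" measured on 8 "genes"
inputs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
labels = [random.choice([0.0, 1.0]) for _ in range(n)]  # pure noise

weights = solve_linear_system(inputs, labels)
fitted = [sum(w * x for w, x in zip(weights, row)) for row in inputs]
max_error = max(abs(f - y) for f, y in zip(fitted, labels))
print(f"Largest error on the sample: {max_error:.2e}")
```

The model reproduces the random labels essentially exactly, even though the inputs carry no information about them. A perfect fit to the original sample is therefore no guarantee of accurate predictions for new patients.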

Using a standard "off-the-shelf" statistical modelling approach ("multiple regression" here), we obtain the results illustrated in the schematic.

Predictive model derived from conventional statistics

The model associates a number with each individual gene, expressing its estimate of the "influence" of that gene on the prediction of disease. In this particular case, all 32 genes make a contribution, variously positive and negative, to the model (some of which are shown).

With this model developed from "known" individuals (those in the original sample), it can now be applied to any new individual whose genetic data can be measured in order to assess their health status with respect to the disease.

Validating this model on a fresh, but known, "test sample" (a group of patients separate from the one used to develop the model) indicated that it was 100% accurate. Given the genetic data alone, it correctly predicted whether or not each "test patient" had the disease. Diagnostically, this appears to be great news! But is that the end of the story?
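The validation step described above amounts to comparing predictions with known outcomes on patients the model has not seen. A minimal sketch follows, in which the probabilities, the labels and the 0.5 decision threshold are all hypothetical values chosen for illustration rather than figures from this study.

```python
def accuracy(predicted_probabilities, true_labels, threshold=0.5):
    """Fraction of test patients whose disease status is predicted correctly."""
    if len(predicted_probabilities) != len(true_labels) or not true_labels:
        raise ValueError("need equal-length, non-empty predictions and labels")
    correct = sum(
        (p >= threshold) == bool(y)
        for p, y in zip(predicted_probabilities, true_labels)
    )
    return correct / len(true_labels)


# Hypothetical model outputs for five test patients (1 = diseased, 0 = healthy).
test_probabilities = [0.91, 0.12, 0.78, 0.05, 0.66]
test_labels = [1, 0, 1, 0, 1]

test_accuracy = accuracy(test_probabilities, test_labels)
print(f"Test accuracy: {test_accuracy:.0%}")
```

The key point is that the test patients play no part in developing the model, so the accuracy measured here reflects how it behaves on new cases.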

The next schematic shows the model obtained by applying a more sophisticated "sparse Bayesian" regression model (also known as a "relevance vector" model).

Predictive model derived via more advanced analysis

Here, the Bayesian model attempts to fulfil the same primary objective as the earlier approach, that of accurate prediction, while at the same time seeking to do so using as sparse a set of input data (i.e. as few genes) as possible.
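For readers who want a sense of where the sparsity comes from, the usual way of writing a sparse Bayesian prior gives every input weight its own precision hyperparameter. The notation below is introduced for this sketch only: w_i is the weight on gene i, alpha_i is its precision, and M is the number of candidate genes (32 in this example).

```latex
p(\mathbf{w} \mid \boldsymbol{\alpha})
  = \prod_{i=1}^{M} \mathcal{N}\!\left( w_i \mid 0,\ \alpha_i^{-1} \right)
```

When the hyperparameters are estimated from the data, many of the alpha_i become very large, which pins the corresponding weights at zero. The genes whose weights remain non-zero are the "relevant" ones.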

Like the previous multiple-regression approach, this model was also 100% accurate in its predictions. But, in this case, the modelling process reveals that knowledge of the levels of only two genes, G06 and G22, is sufficient to detect the disease.

The conventional approach gave us an accurate "diagnostic" model, but it offered only limited "knowledge" regarding the influential genes. The more sophisticated approach has actually led to a much simpler outcome in the final analysis. In practice, the knowledge that just two genes carry sufficient information to predict the disease offers a major clinical advantage and, crucially, may suggest some potential avenues for developing treatments.

How did we know that adopting a "sparse Bayesian" model would offer such advantageous results in this problem? A lucky guess perhaps? Trial and error? Well, we'd like to think that the effectiveness of our particular approach here is a consequence, at least in part, of the 15 years' experience we've acquired at the cutting edge of data analytic research and practice.

In fact, we originally invented the model used in this example. The "relevance vector" model has a number of advantageous statistical features (some seen here), and has thus attracted considerable international interest and been adopted within a broad range of practical applications. The list below represents an ad hoc selection of real-world examples taken from over 75,000 results returned by an internet search, which we believe offers a genuine indication of the global influence of our expertise. That expertise can of course be applied to the advantage of our clients.

A sample of many thousands of real-world applications of our model