Data Analysis 2 — Selectivity (Relevance & Automation)

Selectivity: Where is the Information?

More data is being collected, collated, communicated and accumulated than ever before. In principle, this is a good thing, but in practice it raises some challenging questions: Which set of data? Considering which time period? Applying which statistical techniques? Directed at which objectives? The space of possibilities implied by the conjunction of expanding data resources and analytic options is so combinatorially large that a conventional, manually supervised statistical approach is no longer capable of getting full value from the data.

So if the "data-rich, information-poor" scenario is to be avoided, we would argue that it is now essential to integrate a degree of intelligent automation within modern-day statistical analysis. One of the key features of our approach is the use of principled Bayesian statistical techniques to do precisely that — to automatically bring the relevant and informative analysis directly to the client's attention.

Selectivity in Practice

In practice, our statistical approach effectively answers the key questions at two levels:

1. Is there any relevant information in this analysis?

For any given analysis on any given set of data, we can optimise its effectiveness by automatically identifying all the key statistical information. We can then selectively communicate it, perhaps by highlighting elements in an associated table or graph etc.

2. Which analyses are most relevant, and which data is informative?

At a higher level, not only can we automatically highlight the key information in individual analyses, but we can also intelligently summarise and collate (perhaps in a stand-alone document) all combinations of data and analysis that are relevant to overall objectives.

In the following example, we illustrate the above two-level approach at work by considering some of the pertinent questions that might be asked of a sizable table of data gathered in the context of a typical retail sales analysis (analogous to the "Simplicity" example).

  • The Data
  • 1) Ranking
  • 2) Trends
  • 3) Influences
  • 4) Relationships
  • Integration

This example considers a typical analytic scenario where we have collected one full year's worth of data measuring weekly sales revenue across a range of products (180) for a number of retail outlets (70). With supporting data, the table comprises in the order of 1.5 million total entries.

A spreadsheet comprising 1,500,000 numbers

In the context of realising the full value of this data, there are many pertinent questions that we might choose to ask of it. Here are four typical examples:

  1. Which retail outlets are performing best (and worst) for each product?
    How can we rank outlets reliably and accurately (in terms of revenue)?
  2. Which outlets are improving (or deteriorating) over the year?
    Are there any trends we should be aware of?
  3. How do particular promotional schemes influence product revenues?
    Do sales increase (or decrease) as expected in line with promotions?
  4. Are there any inter-relationships in terms of product revenues?
    If one product is selling well, might another be selling badly?

Of course, we want to answer these analytic questions as accurately as possible in every case. But we must also pose a more general question here: with so much data, comprising many outlets and multiple products, and with a sizable range of analytic possibilities, how can we extract all the useful information from that table in an efficient and effective way?

This is a critical question, because if we desire comprehensive answers to the four questions posed above, then we have as many as 1,271,600 distinct analyses to both compute and assess!

Which retail outlets are performing best (and worst) for each product?

Can we not simply calculate the average weekly revenues for each retail outlet, rank those values, and then base our conclusions on that? Ideally, yes we might, but in the real world, if we wish to make meaningful decisions with a quantifiable degree of confidence, we must take account of some key underlying factors: varying quantities of data (perhaps some outlets have only been open a few weeks), natural week-by-week volatility (which can both mask and fake systematic differences), accuracy and resolution of our measurements etc. Indeed, when interpreting ranked data, it is always worth bearing in mind that even if the numbers were generated by the meaningless throwing of dice, there would still be an apparent ordering of averages ...
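The point about dice can be checked directly. The following is a minimal sketch, assuming ten hypothetical outlets whose weekly "revenues" are nothing more than throws of a fair die; the function name and its parameters are illustrative and do not come from the analysis described here.

```python
import random
import statistics


def rank_random_outlets(n_outlets=10, n_weeks=52, seed=0):
    """Rank hypothetical outlets whose weekly 'revenue' is a fair die throw."""
    rng = random.Random(seed)
    averages = {}
    for outlet in range(1, n_outlets + 1):
        throws = [rng.randint(1, 6) for _ in range(n_weeks)]
        averages[outlet] = statistics.mean(throws)
    # Sort from highest to lowest average, exactly as a naive ranking would.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = rank_random_outlets()
for position, (outlet, average) in enumerate(ranking, start=1):
    print(f"{position:2d}. outlet {outlet:2d}  average = {average:.2f}")
```

Every outlet here has the same underlying average, yet the printed list still has a "best" and a "worst" entry. Any ordering it shows is produced by chance alone.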

Analogous to the "Simplicity" example, we can do some calculations behind the scenes so as to better appreciate the significance of the averages. By doing so, we can direct attention to those that are most reliably indicative of underlying systematic differences. To illustrate this, the graph below shows a histogram of revenue for one product of interest, where the weekly data has been pooled together from all outlets to indicate the overall spread and so put the averages into context.

A histogram with key information highlighted

On the histogram, we have elected to "flag up" only the statistically relevant outlets, allowing us to see the "best" and "worst" examples at a glance. The position of the flag along the axis gives the average weekly sales revenue for that outlet, while the height of the flag indicates the "significance". (The histogram approach is visually appealing, but alternatively we could of course use the same calculations to highlight the relevant statistics within a table.)

This approach gives us a way to automatically visualise, and focus attention on, only those averages which are considered informative. With a large table of data (or several such tables), featuring many products and outlets that we might be interested in, the task of identifying the key features of the data may be quite dramatically simplified.
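One way a "significance" of this kind could be computed is sketched below. It is an illustration under stated assumptions, not the calculation used for the histogram. It assumes a normal approximation with a flat prior, so that the posterior for each outlet's underlying average is centred on its sample mean with a spread equal to the standard error, and it treats the pooled average as a fixed reference. The data, the function names and the 0.99 threshold are all hypothetical.

```python
import math
import statistics


def prob_above(values, reference):
    """Approximate probability that the underlying mean exceeds `reference`."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    z = (statistics.mean(values) - reference) / standard_error
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF


def flag_outlets(weekly_revenue, threshold=0.99):
    """Return only the outlets that differ convincingly from the pooled mean."""
    pooled = statistics.mean(v for values in weekly_revenue.values() for v in values)
    flags = {}
    for outlet, values in weekly_revenue.items():
        p = prob_above(values, pooled)
        if p >= threshold:
            flags[outlet] = ("best", p)
        elif p <= 1 - threshold:
            flags[outlet] = ("worst", 1 - p)
    return flags


# Hypothetical weekly revenues for four outlets.
weekly_revenue = {
    "A": [102, 106, 100, 104, 103, 101, 105, 99],   # close to the pooled mean
    "B": [120, 118, 123, 121, 119, 122, 120, 121],  # consistently high
    "C": [80, 82, 79, 81, 78, 83, 80, 81],          # consistently low
    "D": [160, 90],  # highest raw average, but only two volatile weeks
}

flags = flag_outlets(weekly_revenue)
for outlet, (label, probability) in flags.items():
    print(outlet, label, round(probability, 3))
```

Outlet D has the highest raw average of the four, but with only two volatile weeks of data it is not flagged. B and C are flagged because their differences are large relative to their week-by-week variability.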

Which outlets are improving (or deteriorating) over the year?

When data is collected over time, we are often interested in whether any trends are evident. In this example scenario, we may wish to identify those outlets that are performing better or worse overall over time (by aggregating across products), or discover whether any products are becoming more or less popular (aggregating over outlets). The graph below looks at one example of the former question, analysing total revenue over a six-month period for one outlet of interest.

Assessing trends in weekly data

Subjectively, there is a suggestion that the measurements are increasing over time, perhaps linearly. Then again, purely random data might, by chance, suggest the same thing, so perhaps revenue is static and there is no underlying trend? The volatile nature of the data here (i.e. the natural week-to-week variability) undermines the confidence of any subjective judgement.

An effective way to choose between alternative hypotheses such as those suggested here (and thereby reject spurious ones) is to apply Bayesian statistical techniques. In fact, Bayesian analysis in this case reveals that a postulated "step change" in the underlying average at Week #12 is actually the most probable explanation. This could be particularly informative if, for example, there had been a change in key personnel around that date.
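The comparison of hypotheses can be sketched in a few lines. This is a simplified stand-in rather than the analysis behind the graph: it approximates each model's evidence with the Bayesian Information Criterion (BIC), gives the three hypotheses equal prior probability, and uses synthetic data with a step built in after twelve weeks. The function names and the data are assumptions made for illustration.

```python
import math


def rss_constant(y):
    """Residual sum of squares around a single constant average."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def rss_linear(y):
    """Residual sum of squares around a least-squares straight line."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    return sum((v - y_mean - slope * (t - t_mean)) ** 2 for t, v in enumerate(y))


def rss_step(y, weeks_before_change):
    """Residual sum of squares with one average before the change, one after."""
    return rss_constant(y[:weeks_before_change]) + rss_constant(y[weeks_before_change:])


def model_probabilities(y, weeks_before_change):
    """Approximate posterior probabilities of the three hypotheses via BIC."""
    n = len(y)
    models = {  # name: (residual sum of squares, number of fitted averages/slopes)
        "constant": (rss_constant(y), 1),
        "linear": (rss_linear(y), 2),
        "step": (rss_step(y, weeks_before_change), 2),
    }
    log_evidence = {
        name: -0.5 * (n * math.log(rss / n) + k * math.log(n))
        for name, (rss, k) in models.items()
    }
    top = max(log_evidence.values())
    weights = {name: math.exp(v - top) for name, v in log_evidence.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Synthetic 26-week series: average 100 for 12 weeks, then 120, plus fixed "noise".
noise = [(7 * i) % 11 - 5 for i in range(26)]
revenue = [(100 if i < 12 else 120) + noise[i] for i in range(26)]

probs = model_probabilities(revenue, weeks_before_change=12)
for name, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:8s} {p:.4f}")
```

Because the step is built into the synthetic data, the "step" hypothesis comes out as the most probable. On real data the change point would not be known in advance, and a fuller treatment would consider a range of candidate weeks.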

If we have thousands of combinations of outlets and products (e.g. 12,600 here), it can be seen that the number of possible trends we may wish to detect and examine becomes prohibitively large very quickly. By adopting a principled Bayesian statistical approach, we can automatically detect the key features in the data without the need to resort to arbitrary heuristics and/or extensive, error-prone, manual study of analytic output.

How do particular promotional schemes influence product revenues?

Often with data such as this, there may be some further information at our disposal concerning each weekly table entry — an associated "class" or "outcome" perhaps. In this example, for each retail outlet we have some additional data to indicate which one of three different "promotional schemes" was in place during each week for which data was collected.

It is then natural to ask if the revenue of a given outlet might be influenced by the scheme in place at that time (this could be valuable information!). An intuitive way to visualise any putative influence initially is to plot the weekly data (for a single outlet/product combination) in "box" form as shown below, separated vertically according to scheme (with colour-coding for further clarity).

Detecting potential influences across outcomes

Viewing the graph qualitatively, there appears to be a correlation: for Scheme A (green) revenue is typically higher than the overall combined average, while for Scheme C (red) it is typically lower. As ever, we must guard against being misled by the natural variability of real-world data, where a link will quite often be suggested "by chance", so we'll want to perform this analysis in a statistically robust manner. But of course, supervising such analysis across the full set of 12,600 outlet-product combinations would be excessively onerous.

We can finesse this problem by once again appealing to Bayesian statistical methods. They provide us with the robustness to "chance" we require (particularly in the case where quantities of data are limited) while appealingly offering us direct probabilistic "yes/no" answers to the questions we pose. As a result, we can automatically determine which, if any, of the three schemes correlate with the product sales in all 12,600 cases. In this case, the intuition is reinforced as the Bayesian model supports systematic changes in average for Schemes A and C (as denoted by the blue arrows).
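A direct probabilistic "yes/no" answer of this kind might look as follows. The sketch is hedged in the same way as any illustration: it uses a BIC approximation to the Bayes factor between "weeks under this scheme share the overall average" and "weeks under this scheme have their own average", with equal prior odds. It is not the model behind the graph, and the data and names are hypothetical.

```python
import math


def rss(values):
    """Residual sum of squares around the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def prob_scheme_differs(revenue, schemes, scheme):
    """Approximate probability that `scheme` shifts the average revenue."""
    inside = [r for r, s in zip(revenue, schemes) if s == scheme]
    outside = [r for r, s in zip(revenue, schemes) if s != scheme]
    n = len(revenue)
    bic_same = n * math.log(rss(revenue) / n) + 1 * math.log(n)
    bic_differs = n * math.log((rss(inside) + rss(outside)) / n) + 2 * math.log(n)
    log_bayes_factor = 0.5 * (bic_same - bic_differs)
    return 1 / (1 + math.exp(-log_bayes_factor))


# Hypothetical 30 weeks for one outlet/product pair, cycling through the schemes.
base = {"A": 110, "B": 100, "C": 90}
schemes = ["ABC"[i % 3] for i in range(30)]
revenue = [base[s] + (7 * i) % 11 - 5 for i, s in enumerate(schemes)]

answers = {s: prob_scheme_differs(revenue, schemes, s) for s in "ABC"}
for scheme, p in answers.items():
    print(f"Scheme {scheme}: probability of an influence = {p:.3f}")
```

With this synthetic data, Schemes A and C receive probabilities close to 1 and Scheme B a probability well below one half. The same function could be looped over every outlet-product combination without manual supervision.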

This is very much a "win-win" scenario. Principled statistical "selectivity" techniques deliver increased statistical fidelity while simultaneously making large-scale analysis practical.

Are there any inter-relationships in terms of product revenues?

With revenue tabulated across a range of products, one question of interest centres on whether there might exist any inter-relationships between revenue for those products for any given outlet (or collectively). Initially we would probably limit ourselves to pairwise sales correlations and ask questions such as: does increased revenue in terms of one product for an outlet in one week typically imply decreased revenue in terms of another product that same week? That is, is there any evidence of ongoing "trade-offs" in product revenue?

As in other examples, we might initially consider undertaking a manually-supervised statistical analysis of the appropriate data. However, because we would need to look at product combinations, the task rapidly becomes unmanageable. In the case of 170 products here, there are 14,365 possible pair-wise combinations, for just one single outlet. To assess each pair accurately (or at all!) "by hand" is simply not realistic, so a serious degree of intelligent automation is essential if a fully comprehensive and reliable analysis is to be undertaken.

A small subset of the 14,000+ scatter plots

To illustrate the problem, we show a small fraction of the analysis on the right, limited to a subset of just 5 products, giving only 10 graphs ("scatter-plots") to assess.
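The pair counts quoted here follow from the standard formula for choosing two items from n, namely n(n - 1)/2. As a quick check, using the product count given in this part of the text:

```python
import math

n_products = 170  # product count quoted for the pairwise calculation
pairs_per_outlet = math.comb(n_products, 2)  # 170 * 169 / 2
subset_pairs = math.comb(5, 2)               # the 5-product subset shown

print(pairs_per_outlet)  # 14365
print(subset_pairs)      # 10
```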

To automate the task, we have again applied Bayesian statistical techniques to identify those graphs (specific product-pairs) where there are genuine correlations. In this case, Product #2 and Product #4 are identified as being negatively inter-related — when the outlet is generating more revenue from #2, it is at the expense of that made from #4 (on average).
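The automated screening of product pairs can be sketched as below. This is an assumption-laden illustration, not the method used to produce the result above: it scores each pair with a BIC approximation to the Bayes factor for "linearly related" against "unrelated", uses equal prior odds, and runs on randomly generated data in which a trade-off between Products #2 and #4 has been deliberately built in. The 0.95 threshold and all names are hypothetical.

```python
import itertools
import math
import random


def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prob_correlated(x, y):
    """Return (r, approximate probability that the pair is genuinely related)."""
    n = len(x)
    r = pearson_r(x, y)
    log_bayes_factor = 0.5 * (-n * math.log(1 - r * r) - math.log(n))
    return r, 1 / (1 + math.exp(-log_bayes_factor))


# Hypothetical 52 weeks of revenue for five products at one outlet.
rng = random.Random(42)
revenue = {p: [rng.gauss(100, 10) for _ in range(52)] for p in (1, 2, 3, 5)}
# Product 4 is constructed to fall when product 2 rises.
revenue[4] = [150 - 0.8 * (v - 100) + rng.gauss(0, 4) for v in revenue[2]]

results = {
    (a, b): prob_correlated(revenue[a], revenue[b])
    for a, b in itertools.combinations(sorted(revenue), 2)
}
flagged = {pair: rp for pair, rp in results.items() if rp[1] > 0.95}
for (a, b), (r, p) in flagged.items():
    print(f"Product #{a} and Product #{b}: r = {r:+.2f}, probability = {p:.3f}")
```

All ten pairs are scored, but only those with a high probability of a genuine relationship are reported, so the pair with the built-in trade-off is picked out without anyone inspecting the scatter plots.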

This kind of information may have considerable practical value, but to have discovered it "manually" in any sizable real-world application would have been practically impossible. By employing Bayesian methodology, we can offer an accurate statistical solution to the problem that can be efficiently automated without any need to resort to heuristics.

The previous examples illustrate the advantages, in terms of both accuracy and efficiency, of applying "intelligent statistical automation" techniques to various analytic questions in isolation. By bringing together all these ideas at a higher level, we can provide clients with the systematic means to realise the information potential, and full value, of their data more efficiently.

In the four prior examples, we typically adopted a Bayesian statistical approach. As well as offering the benefit of more robust and accurate analysis in each individual case, the adoption of a consistent statistical framework facilitates the flexible combination of all those analytic elements (and others) within a single, individually-tailored, integrated "end-product". Two compelling examples of the potential for such integration are Intelligent Reporting and Guided Interactivity.

A structured analytic report, automatically compiled

Intelligent Reporting

The concept of Intelligent Reporting is that "at the press of a button", we can generate a comprehensive, stand-alone, printable document (e.g. a PDF) summarising all the key analysis for a given scenario and/or set of data.

The illustration shows some specimen pages from an automatically typeset "performance report", which brings together highlights from a range of analyses with multiple levels of detail.
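The collation step behind such a report can be sketched very simply. The example below assumes that each individual analysis has already produced a probability for its finding. It keeps only the findings above a threshold and groups them by type of analysis. The findings, the threshold and the plain-text layout are all hypothetical; a real report would be typeset rather than printed.

```python
from collections import defaultdict


def compile_report(findings, threshold=0.95):
    """Collate findings at or above `threshold` into a plain-text report."""
    sections = defaultdict(list)
    for analysis, subject, summary, probability in findings:
        if probability >= threshold:
            sections[analysis].append((probability, subject, summary))
    lines = ["PERFORMANCE REPORT", ""]
    for analysis in sorted(sections):
        lines.append(analysis)
        # Most probable findings first within each section.
        for probability, subject, summary in sorted(sections[analysis], reverse=True):
            lines.append(f"  {subject}: {summary} (probability {probability:.2f})")
        lines.append("")
    return "\n".join(lines)


# Hypothetical output of the individual analyses.
findings = [
    ("Trends", "Outlet 17", "step change in average revenue", 0.98),
    ("Trends", "Outlet 03", "linear increase", 0.41),
    ("Relationships", "Products #2 and #4", "negative correlation", 0.99),
    ("Influences", "Scheme A", "higher average revenue", 0.97),
]

report = compile_report(findings)
print(report)
```

The weakly supported finding for Outlet 03 is left out, so the reader sees only the results considered informative.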

We can provide this type of report for clients on a "service" basis, or can develop a bespoke software application to implement a client-specific system.

Guided Interactivity

One of the services we offer is the development of analytic software applications to clients' individual specifications. Typically, such software might feature a user-interface that provides a range of options to manipulate data, undertake analytic procedures, and output results.

A software application, with guided interactivity

In the context of Guided Interactivity, while the user remains free to follow their path of choice through the numerous data choices and analytic options in the software, we can provide intelligent guidance, via the interface, directing the user towards, and highlighting, the significant data and analysis.

The illustration shows an "Expert Overview" screen from one such application. The highlights from six different analyses on a spreadsheet of 760,000 numbers are concisely summarised on a single screen, with the user offered hints to "click through" to more detailed interactive views as desired.