Data Analysis 1 — Simplicity (Insight & Understanding)

Simplicity: More Than Just a Concept

Effective data analysis should provide clarity, insight and understanding, not a collection of averages, variances, regression coefficients, p-values etc. that demand further interpretation. By emphasising principles of "simplicity" in our analytic approach, we focus on results that convey what the data really means. In practical terms, this means combining sophisticated statistical methodology with intuitive presentation techniques to deliver the key information required to support subsequent decision-making.

Simplicity in Practice

We can illustrate our approach at work in a generic "table-crunching" example, considering the common scenario where data concerning a set of "Individuals" is measured on an ongoing basis in respect of a number of "Variables". This type of analysis arises across a broad spectrum of real-world application domains, including marketing, engineering, medicine and sport. For the purposes of example here, we illustrate the tabulation of "weekly sales revenues", averaged over a 12-month period, for a number of "retail outlets" across a range of "products". The principles demonstrated are of course generic.

In such a scenario, there could easily be thousands of both outlets and products to analyse, perhaps spread across many tables. For practical purposes, we look only at a fragment of one possible table here, and consider three steps which can transform the "raw data" into meaningful information.

  • Table of Data
  • 1) Statistics
  • 2) Simplicity
  • 3) Decision-making

Data: An illustrative fragment of one example table of retail data is shown below.

              Outlet A  Outlet B  Outlet C  Outlet D
Product #1        3.31      2.49      2.89      2.91
Product #2        7.56      9.77     10.11      9.61
Product #3       10.65     12.56     12.83     12.98
Product #4       19.26     18.92     17.11     15.85
Product #5        0.04      0.04      0.00      0.00
Product #6        0.33      0.22      0.29      0.37
Product #7       41.54     32.48     39.93     46.65
Product #8       13.12      9.59     10.88     16.77

Each cell in the table contains specimen data representing the annually-averaged value of weekly sales revenue for four different "Outlets" across eight "Products". (Three cells are highlighted for ongoing reference.)

Looking at the raw data, it is unclear which numbers are significant in isolation, and therefore whether any particular comparison across products or outlets is valid and meaningful. By way of illustration, Outlet B has an average of 2.49 for Product #1, but how are we to interpret this?

  • Is a revenue average of 2.49 "good" or "bad" overall, in the context of all other outlets?
  • Is Outlet B's value of 2.49 for Product #1 better or worse in any sense than B's value of 0.04 for Product #5, given obviously different scales of measurement (underlying price)?
  • Is it fair to conclude that Outlet A's value of 3.31 for Product #1 is better than B's 2.49?

Steps 1), 2) and 3) consider how we might provide meaningful answers to key questions such as these.

Step 1) The averages in the original table are converted into statistical "scores".

              Outlet A  Outlet B  Outlet C  Outlet D
Product #1       -0.60     -1.46     -1.11     -1.01
Product #2       -1.59     -0.66     -0.54     -0.73
Product #3       -1.71     -1.07     -1.01     -0.94
Product #4       -0.13     -0.02      0.62      1.03
Product #5       -0.62     -0.71      0.24      0.20
Product #6       -0.16      0.36      0.04     -0.39
Product #7        0.71     -0.23      0.55      1.23
Product #8        0.29     -0.60     -0.28      1.21

The tabulated scores place each outlet's average within the context of the complete set of data, an essential step if the individual numbers are to be assessed in a meaningful way. The scoring calculation is statistically principled, and takes into account the differing quantities, the varying scales of measurement and the temporal (week-by-week) volatility of the underlying data. Quantifying these statistical features is a prerequisite to making reliable assessments across outlets and products.

A positive/negative score indicates whether each entry is better/worse than the overall average (across all outlets for that product), and the magnitude of the score indicates by how much (in statistical terms). Outlet B's value of 2.49 for Product #1, when expressed as a "score", is -1.46. This is negative and therefore below average overall, and further below average than Outlet A's score of -0.60.

Standardised scores also allow balanced comparison across products (as well as outlets), as the differing scales of measurement are implicitly factored out. Outlet B's average for Product #5 can be seen to be better than that for Product #1, although both are below the overall average. We can see that Product #1 represents the "worst performance" for Outlet B in this table.
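As a rough illustration of how standardisation removes differences of scale, the sketch below converts two rows of the raw table into scores by subtracting the row mean and dividing by the row standard deviation. This is an assumption made for illustration, not the calculation used to produce the table above: the tabulated scores are relative to the average across all outlets (not only the four shown) and also reflect week-by-week volatility, so the numbers produced here differ from those tabulated. The names `revenue` and `standardise` are hypothetical.

```python
from statistics import mean, stdev

# Two rows from the raw-data fragment: annually-averaged weekly sales revenue
# for Outlets A, B, C and D (in that order).
revenue = {
    "Product #1": [3.31, 2.49, 2.89, 2.91],
    "Product #5": [0.04, 0.04, 0.00, 0.00],
}


def standardise(row):
    """Return (value - row mean) / row standard deviation for each entry.

    Illustrative only: a plain z-score over the outlets supplied, which is
    an assumption and not the scoring method behind the tabulated scores.
    """
    centre = mean(row)
    spread = stdev(row)  # sample standard deviation (n - 1 denominator)
    return [(value - centre) / spread for value in row]


scores = {product: standardise(row) for product, row in revenue.items()}

for product, row in scores.items():
    print(product, [round(score, 2) for score in row])
```

Although Product #1 and Product #5 differ in scale by roughly a factor of one hundred, their scores land on the same dimensionless scale and can be compared directly.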

Step 2) The "scores" are transformed into easily interpretable "grades".

            Outlet A  Outlet B  Outlet C  Outlet D
Product #1  Grade -1  Grade -2  Grade -2  Grade -2
Product #2  Grade -3  Grade -1  Grade -1  Grade -1
Product #3  Grade -3  Grade -2  Grade -2  Grade -1
Product #4  Grade 0   Grade 0   Grade +1  Grade +2
Product #5  Grade -1  Grade -1  Grade 0   Grade 0
Product #6  Grade 0   Grade 0   Grade 0   Grade 0
Product #7  Grade +1  Grade 0   Grade +1  Grade +2
Product #8  Grade 0   Grade -1  Grade 0   Grade +2

The "scores" in Step 1) are informative, and encapsulate the key statistical features of the data, but they are not necessarily straightforward to interpret at first glance. Step 2, therefore, is a change of representation: for presentation purposes we convert the scores into "grades", on a -5 ("worst") to +5 ("best") scale, denoted symbolically by "Grade -5" and "Grade +5" respectively.

This is a classic example of analytic "simplicity" at work: without any loss of underlying statistical fidelity, the significance of all the data in the table is made immediately clear. We instantly see that in respect of Product #1, Outlet B appears to be slightly inferior to Outlet A. The apparent superiority of Outlet D overall is also immediately suggested. In fact, a considerable amount of information may now be derived from the table simply "at-a-glance".
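The change of representation can be sketched in a few lines. The banding rule below (one grade per half unit of score, truncated toward zero and capped at ±5) is an assumption inferred from comparing the score and grade tables; it reproduces the Outlet B column, but the calibration actually used is not stated. The function name `grade` and its parameters are hypothetical.

```python
import math


def grade(score, band=0.5, limit=5):
    """Map a standardised score to an integer grade in [-limit, +limit].

    Assumed rule: one grade per `band` of score, truncated toward zero,
    so any score with |score| < band maps to grade 0.
    """
    steps = math.trunc(score / band)
    return max(-limit, min(limit, steps))


# Outlet B's scores for Products #1 to #8, taken from the Step 1) table.
outlet_b_scores = [-1.46, -0.66, -1.07, -0.02, -0.71, 0.36, -0.23, -0.60]
outlet_b_grades = [grade(score) for score in outlet_b_scores]

print(outlet_b_grades)  # -> [-2, -1, -2, 0, -1, 0, 0, -1]
```

Truncating toward zero, rather than rounding, keeps small scores of either sign in the neutral band, which matches the run of zero grades for Product #6.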

This "grade" representation may appear heuristic, but the underlying statistical calculation is a principled one which can be interpreted mathematically: we can choose to calibrate "Better" to mean "one in five", and "Better Better" to mean "one in twenty-five", or use a -10 to +10 scale etc., according to the end requirements. Ultimately, this layer of simplicity is an informational enhancement of, and not a replacement for, the underlying data and detailed statistics.
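One way to read this calibration, assuming the scores behave like standard normal values and that "one in five" refers to the chance of a score exceeding a grade boundary, is to place each boundary at the matching normal quantile. The sketch below is illustrative only; `calibrated_thresholds` and `calibrated_grade` are hypothetical names, and the source does not state how its boundaries are set.

```python
from statistics import NormalDist


def calibrated_thresholds(odds=5, levels=5):
    """Score boundaries where level k is exceeded with probability odds**-k.

    Assumes scores follow a standard normal distribution (an assumption).
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf(1.0 - odds ** -k) for k in range(1, levels + 1)]


def calibrated_grade(score, thresholds):
    """Count how many boundaries |score| reaches, keeping the score's sign."""
    magnitude = sum(abs(score) >= boundary for boundary in thresholds)
    return magnitude if score >= 0 else -magnitude


thresholds = calibrated_thresholds()
print([round(boundary, 2) for boundary in thresholds[:2]])  # -> [0.84, 1.75]
print(calibrated_grade(-1.46, thresholds))  # -> -1
```

Changing `odds` or `levels` gives a different scale (for example ten levels for a -10 to +10 range) without altering the underlying scores.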

Step 3) Further statistical calculations enable accurate, direct comparison with Outlet A.

Outlet A Outlet B Outlet C Outlet D
Product #1 Worse
Product #2 Better Better Better
Product #3 Better Better Better
Product #4 Better
Product #5 Better
Product #6
Product #7 Worse
Product #8 Worse Better

We now extend the statistical methodology to enable more precise comparison, in particular to answer the original question: "Is Outlet A really better than Outlet B in terms of Product #1?"

Viewing the grades (or scores) gives us an immediate indication as to how one outlet compares to any other for any given product. However, we can improve upon that partially subjective judgement by applying Bayesian statistics to answer the question directly, and so better support any subsequent decision-making.

In the example here, Outlets B, C and D are each compared independently with Outlet A. Where calculations show that any of B, C or D is statistically superior to Outlet A, the corresponding entry is denoted "Better", and if inferior, "Worse". The entry is left blank if the averages cannot be statistically distinguished, which may be the case if there is too little underlying data, that data is highly variable, or the original averages were simply too close to differentiate.
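A minimal sketch of such a comparison follows, assuming weekly revenue figures are available for each outlet. It uses a flat-prior, large-sample normal approximation to the posterior of each outlet's mean and labels the candidate "Better" or "Worse" only when the posterior probability passes a threshold. The weekly figures, the 0.95 threshold and the function name `compare` are all hypothetical; the Bayesian model actually used is not described here.

```python
from statistics import NormalDist, mean, variance


def compare(candidate, reference, threshold=0.95):
    """Label `candidate` against `reference` using weekly figures.

    Assumes independent weeks and an approximately normal posterior for
    each outlet's mean (flat prior, large-sample approximation).
    """
    diff_mean = mean(candidate) - mean(reference)
    diff_sd = (variance(candidate) / len(candidate)
               + variance(reference) / len(reference)) ** 0.5
    # Posterior probability that the candidate's true mean exceeds the reference's.
    p_better = 1.0 - NormalDist(diff_mean, diff_sd).cdf(0.0)
    if p_better >= threshold:
        return "Better"
    if p_better <= 1.0 - threshold:
        return "Worse"
    return ""  # cannot be statistically distinguished


# Hypothetical weekly revenues for one product (not taken from the tables).
weekly_a = [3.1, 3.6, 3.2, 3.4, 3.3, 3.5, 3.0, 3.4]
weekly_b = [2.4, 2.7, 2.3, 2.6, 2.5, 2.6, 2.2, 2.6]  # lower and stable
weekly_c = [2.0, 4.1, 2.6, 3.9, 2.2, 3.8, 2.5, 4.0]  # similar mean, volatile

print(repr(compare(weekly_b, weekly_a)))  # -> 'Worse'
print(repr(compare(weekly_c, weekly_a)))  # -> ''
```

The volatile series is left blank even though its average is slightly below the reference, which mirrors the blank entries described above.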

We can now conclude: "Outlet B's value of 2.49 for Product #1 is inferior to that of A." This was suggested in Step 2), but is now confirmed by the more explicit goal-driven analysis given here.