Simplicity: More Than Just a Concept
Really effective data analysis should provide clarity, insight and understanding, not a collection of averages, variances, regression coefficients and p-values that demand further interpretation. By emphasising principles of "simplicity" in our analytic approach, we focus on providing results that comprehensively convey what the data really means. In practical terms, this means intelligently combining sophisticated statistical methodology with intuitive presentation techniques so as to deliver the key information required to support subsequent decision-making.
Simplicity in Practice
We can illustrate our approach at work in a generic "table-crunching" example, considering the common scenario where data concerning a set of "Individuals" is measured on an ongoing basis in respect of a number of "Variables". This type of analysis arises across a broad spectrum of real-world application domains, including marketing, engineering, medicine and sport. For the purposes of example here, we illustrate the tabulation of "weekly sales revenues", averaged over a 12-month period, for a number of "retail outlets" across a range of "products". The principles demonstrated are of course generic.
In such a scenario, there could easily be thousands of both outlets and products to analyse, perhaps spread across many tables. For practical purposes, we look only at a fragment of one possible table here, and consider three steps which can transform the "raw data" into meaningful information.
- Table of Data
- 1) Statistics
- 2) Simplicity
- 3) Decision-making
Data: An illustrative fragment of one example table of retail data is shown below.
| | Outlet A | Outlet B | Outlet C | Outlet D |
|---|---|---|---|---|
| Product #1 | 3.31 | 2.49 | 2.89 | 2.91 |
| Product #2 | 7.56 | 9.77 | 10.11 | 9.61 |
| Product #3 | 10.65 | 12.56 | 12.83 | 12.98 |
| Product #4 | 19.26 | 18.92 | 17.11 | 15.85 |
| Product #5 | 0.04 | 0.04 | 0.00 | 0.00 |
| Product #6 | 0.33 | 0.22 | 0.29 | 0.37 |
| Product #7 | 41.54 | 32.48 | 39.93 | 46.65 |
| Product #8 | 13.12 | 9.59 | 10.88 | 16.77 |
Each cell in the table contains specimen data representing the annually-averaged value of weekly sales revenue for four different "Outlets" across eight "Products". (Three cells are highlighted for ongoing reference.)
Looking at the raw data, it is unclear which numbers might be significant in isolation, and therefore whether any particular comparison across products or outlets is valid and meaningful. By way of illustration, we can see that Outlet B has an average of 2.49 for Product #1, but how are we to interpret this?
- Is a revenue average of 2.49 "good" or "bad" overall, in the context of all other outlets?
- Is Outlet B's value of 2.49 for Product #1 better or worse in any sense than B's value of 0.04 for Product #5, given obviously different scales of measurement (underlying price)?
- Is it fair to conclude that Outlet A's value of 3.31 for Product #1 is better than B's 2.49?
Steps 1), 2) and 3) consider how we might provide meaningful answers to key questions such as these.
Step 1) The averages in the original table are converted into statistical "scores".
| | Outlet A | Outlet B | Outlet C | Outlet D |
|---|---|---|---|---|
| Product #1 | -0.60 | -1.46 | -1.11 | -1.01 |
| Product #2 | -1.59 | -0.66 | -0.54 | -0.73 |
| Product #3 | -1.71 | -1.07 | -1.01 | -0.94 |
| Product #4 | -0.13 | -0.02 | 0.62 | 1.03 |
| Product #5 | -0.62 | -0.71 | 0.24 | 0.20 |
| Product #6 | -0.16 | 0.36 | 0.04 | -0.39 |
| Product #7 | 0.71 | -0.23 | 0.55 | 1.23 |
| Product #8 | 0.29 | -0.60 | -0.28 | 1.21 |
The tabulated scores place each outlet's average within the context of the complete set of data, an essential step in assessing the individual numbers in a meaningful way. The scoring calculation is statistically principled, and takes into account the differing quantities, the varying scales of measurement and the temporal (week-by-week) volatility of the underlying data. Quantifying these statistical features is a prerequisite to making reliable assessments across outlets and products.
A positive/negative score indicates whether each entry is better/worse than the overall average (across all outlets for that product), and the magnitude of the score indicates by how much (in statistical terms). We can see that Outlet B's value of 2.49 for Product #1, when expressed as a "score", is -1.46. This is negative and therefore below average overall, and more so than is Outlet A's value of -0.60.
Standardised scores also allow balanced comparison across products (as well as outlets), as the differing scales of measurement are implicitly factored out. Outlet B's average for Product #5 can be seen to be better than that for Product #1, although both are below the overall average. We can see that Product #1 represents the "worst performance" for Outlet B in this table.
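As a rough illustration of the idea behind a standardised score (and not the exact calculation used to produce the table above, which is not detailed here), the sketch below measures how far an outlet's average sits from the overall average in units of the outlet's own standard error. The function name `standardised_score`, the weekly figures and the overall average are all hypothetical, chosen purely for demonstration.

```python
import math
from statistics import mean, stdev


def standardised_score(weekly_values, overall_mean):
    """Distance of an outlet's average from the overall average,
    measured in units of the outlet's own standard error.

    Assumption: this z-style formula is illustrative only; the scores
    tabulated above come from a calculation the text does not detail.
    """
    n = len(weekly_values)
    outlet_mean = mean(weekly_values)
    # Week-by-week volatility, scaled down by the amount of data available.
    standard_error = stdev(weekly_values) / math.sqrt(n)
    return (outlet_mean - overall_mean) / standard_error


# Hypothetical weekly revenues for one outlet and one product.
weekly = [2.0, 3.0, 2.5, 2.5]
# Hypothetical average across all outlets for the same product.
score = standardised_score(weekly, overall_mean=3.0)
print(round(score, 2))  # negative: this outlet is below the overall average
```

Because the result is expressed in standard-error units rather than in revenue, scores computed this way for differently priced products sit on a common scale, which is what permits comparison across rows as well as columns.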
Step 2) The "scores" are transformed into easily interpretable "grades".
[Table: grades for Products #1 to #8 across Outlets A to D, shown as graphical symbols; the symbol images are not reproduced here.]
The "scores" in Step 1) are informative, and encapsulate the key statistical features of the data, but they are not necessarily straightforward to interpret at first glance. Step 2, therefore, is a change of representation: for presentation purposes we convert the scores into "grades" on a -5 ("worst") to +5 ("best") scale, denoted symbolically.
This is a classic example of analytic "simplicity" at work: without any loss of underlying statistical fidelity, the significance of all the data in the table is made immediately clear. We instantly see that in respect of Product #1, Outlet B appears to be slightly inferior to Outlet A. The apparent superiority of Outlet D overall is also immediately suggested. In fact, a considerable amount of information may now be derived from the table simply "at-a-glance".
This "grade" representation may appear heuristic, but the underlying statistical calculation is a principled one which can be interpreted mathematically: we can choose to calibrate one grade symbol to mean "one in five" and another to mean "one in twenty-five", or use a -10 to +10 scale etc. according to the end requirements. Ultimately, this layer of simplicity is an informational enhancement of, and not a replacement for, the underlying data and detailed statistics.
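One way such a calibration might work is sketched below. It assumes, purely for illustration, that each additional grade level corresponds to a further factor of five in rarity ("one in five", "one in twenty-five", and so on), and that rarity is measured by the one-sided tail probability of a standard normal score. Neither assumption is confirmed by the text, and the function name `grade` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def grade(score, base=5, max_grade=5):
    """Map a standardised score to an integer grade in [-max_grade, +max_grade].

    Assumption: grade k means the score is at least as rare as
    "one in base**k" under a standard normal distribution.
    """
    # One-sided probability of a score at least this far from zero.
    tail = 1.0 - NormalDist().cdf(abs(score))
    if tail <= 0.0:
        level = max_grade  # too extreme to resolve numerically: cap the grade
    else:
        level = min(max_grade, math.floor(math.log(1.0 / tail, base)))
    return level if score >= 0 else -level


# Scores for Product #1 taken from the Step 1) table (Outlets A and B).
print(grade(-0.60), grade(-1.46))
```

Under these assumptions the two Product #1 scores land on adjacent grades, with Outlet B one level below Outlet A. Changing `base` or `max_grade` gives a different scale, such as -10 to +10, without touching the underlying scores.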
Step 3) Further statistical calculations enable accurate, direct comparison with Outlet A.
[Table: Outlets B, C and D compared with Outlet A for Products #1 to #8; Outlet A's own column is marked with a dash, and the "superior"/"inferior" symbols are not reproduced here.]
We now extend the statistical methodology to enable more precise comparison, in particular to answer the original question: "Is Outlet A really better than Outlet B in terms of Product #1?"
Viewing the grades (or scores) gives us an immediate indication as to how one outlet compares to any other for any given product. However, we can improve upon that partially subjective judgement by applying Bayesian statistics to answer the question directly, and so better support any subsequent decision-making.
In the example here, Outlets B, C and D are all compared independently with Outlet A. Where calculations show that any of B, C or D is statistically superior to Outlet A, the corresponding entry is marked with one symbol, and if inferior, with another. The entry is left blank if the averages cannot be statistically distinguished, which may be the case if there is too little underlying data, if that data is highly variable, or if the original averages were simply too close to differentiate.
We can now conclude: "Outlet B's value of 2.49 for Product #1 is inferior to that of A." This was suggested in Step 2), but is now confirmed by the more explicit goal-driven analysis given here.
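A comparison of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the calculation behind the table: it assumes flat priors, a normal approximation for each outlet's true average, independence between outlets, and a 95% decision threshold. The function name `compare_with_reference` and the standard errors used in the example are hypothetical; only the two Product #1 averages come from the data table.

```python
import math
from statistics import NormalDist


def compare_with_reference(ref_mean, ref_se, other_mean, other_se, threshold=0.95):
    """Return (label, probability) for 'other' relative to the reference outlet.

    Assumptions: flat priors and a normal approximation, so each outlet's
    true mean has posterior Normal(sample mean, standard error squared),
    and the two outlets are independent.
    """
    # Posterior of the difference (other - reference) is also normal.
    diff = NormalDist(mu=other_mean - ref_mean, sigma=math.hypot(ref_se, other_se))
    p_better = 1.0 - diff.cdf(0.0)  # posterior probability that 'other' is better
    if p_better >= threshold:
        return "superior", p_better
    if p_better <= 1.0 - threshold:
        return "inferior", p_better
    return "", p_better  # blank: cannot be statistically distinguished


# Averages for Product #1 are from the table; the standard errors are hypothetical.
label, p = compare_with_reference(ref_mean=3.31, ref_se=0.25,
                                  other_mean=2.49, other_se=0.25)
print(label, round(p, 3))
```

With larger standard errors (less data, or more volatile weekly figures) the same two averages would fall into the blank category, which mirrors the three reasons given above for leaving an entry empty.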





