Summing Across Overlapping Dimensions “1:1″
** If you’re a database guru you’ll find this pretty boring – if you’re not it may actually be insightful **
How many people came into the store today? Let’s see, the store has 3 entrances. If we count-up the # of people who came in Door A, Door B and Door C…this should give us a reasonable estimate of total unique visitors to the store, right? Kind of like a website where you can only enter once per visit. Sure, I can come back later through another door or on the web my session came timeout, but it should be reasonably close.
What it we tried counting by the # times people pass through various departments in a department store? Let’s take 5 people and the departments they visit:
- John – tools, toys, food, pharma
- Jeff – sports, clothing, food
- Joe – pharmacy, food, tools
- Jesse – register for gift card
- James - tools, food, sports
We know there are 5 people, but let’s try counting by way of department.
We see that if we sum people across departments we get a count of 14 people, where only 5 unique people actually exist. The highly sophisticated Venn diagrams below the table where the circles represent some dimension highlights both good and bad dimensions for summing unique items. The department example falls under bad (both for summing and artistic sensibility).
Who cares? Well, in a world where data is increasingly defined by having multiple dimensions associated with each metric there’s quite some room for this type of error or similar errors in aggregating data. If a # ever seems really high or skewed and you’re operating across many dimensions…ask yourself, can the item generating the metric live in many dimensions?
Other situations where summing unique folks across dimensions could get complex:
Geography – person could have many homes. Zip codes crossing multiple counties?
Retail – multiple departments. Multiple purchases (hopefully) or multiple unique customers in same household.
Finance – multiple cards or accounts.
Gaming – played multiple games. Unique person played the same game during multiple sessions on the same date.
Travel – multiple bookings same vacation.
The most infamous example of all at this point for marketers is SAME PERSON MULTIPLE DEVICES (ha ha ha laughs the distracted consumer, catch me if you can!).
One,
Jeff

4 comments + Add a comment
Sergey
December 3, 2012 at 12:52 pm
Good post!
I faced such an issue. And sometimes it is hard to define are you good with your numbers, because of very complicated data structure. Seems that one of the best way to define such a bug is a simple testing. But it is understandable that you should be accurate with your data before testing anyway
Jerry
December 17, 2012 at 3:36 pm
Counting intersections of sets can be simple if you are familiar with basic counting concepts. Data which intersects is realistic, and reveals more options for customer segmentation. This would be more than a “nutshell” approach.
V Jeffrey James
February 19, 2013 at 6:42 pm
Jerry, good point – it’s rare that a 3NF table is useful for any BI reporting architect. I guess I was trying to highlight the importance of having a solid ‘data model’ before just building reports!
Jeff
Andrey Safonov
January 16, 2013 at 9:27 am
> It seems that one should calculate not the sum of visits, but the sum of “unique visitors” through all buckets.
> The problem is to define a unique customer id and to assign to the correct bucket an omnichannel customer.