Monday, February 4, 2013

Normalizing Geographic Data

Geographical data has always fascinated me, probably because maps are fascinating.  Overlaying a map with data to create a heat map provides an interesting perspective on behavior and markets based on location.  There is a beautiful map of Walmart’s growth over time on FlowingData.  This graphic provides a lot of insight into Walmart's growth, but it can also be misleading.  The concept of normalization can provide some clarity.

Normalization allows two metrics with different dimensions to be compared using a common base.  While shopping, we may have the choice between 100 paper plates for $2.99 or 250 plates for $5.99.  Simple division tells us that the first option is about $0.030 per plate and the second is about $0.024 per plate.  We normalized the price to the unit (plates) in order to compare the two offerings on a common basis.
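
The same arithmetic can be sketched in a few lines of Python.  This is only an illustration of the division described above; the function name `unit_price` is my own label, not something from a particular tool.

```python
def unit_price(total_price: float, quantity: int) -> float:
    """Return the price per single item (price normalized to the unit)."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return total_price / quantity


# The two offers from the paper plate example.
small_pack = unit_price(2.99, 100)
large_pack = unit_price(5.99, 250)

print(f"100-pack: ${small_pack:.3f} per plate")
print(f"250-pack: ${large_pack:.3f} per plate")
print("Better value:", "250-pack" if large_pack < small_pack else "100-pack")
```

Once both offers are expressed per plate, the comparison is a simple less-than check.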

The issue with geographic data is that heat maps are usually drawn on a standard map of the United States (or the world, a specific state, and so on), and these maps depict land area.  Since populations are not distributed uniformly across that area, it rarely makes sense to compare metrics using acreage as the common unit.

A common example is election data.  We all watch the results roll in, presented as red and blue shadings of states to indicate which way the voters in each state voted.



The 2012 election results are pictured above.  At first glance the heat map would suggest the red candidate is the favorite.  But unless votes are assigned to acres, this doesn’t make sense.  Populations are often concentrated near the coasts, yet on this map Wyoming, with fewer than 800k people, appears to carry more weight than, say, Maryland, with just under 6MM people.  Normalizing the data by the underlying population can provide more insight into the actual results.



Here the map (actually a cartogram) is morphed so that the area enclosed by each state boundary reflects the state's population, not its geographical area.  This modified map indicates the blue candidate won a large percentage of the vote.  Of course we know this to be true, but this graphic displays the data in a more appropriate context.  Similarly, we could modify the map to reflect Electoral College votes, but population is a rough proxy for those.

So what does this have to do with web analytics?  Web analytics is often about sales and markets.  Google Analytics, Adobe, Tableau, and a host of other platforms make it very easy to display your data on a geographic heat map.  But don't let the temptation to make something eye-catching lead you to create something that misleads your intended audience.

Let’s say we have 100k conversions in California and 100k conversions in Wyoming.  On a standard heat map, these two states would get the same color shading, indicating they are equal.  Yet if we normalize by population, we would see that we have a much larger presence in Wyoming and have barely tapped California.  The first approach might cause us to choose the same strategy for California and Wyoming, but the second would cause us to choose very different strategies.  This would hopefully be obvious when we look at just these two states, but if we expand the segmentation to zip code it might not be as clear.

Even if we do not have the software necessary to morph maps by population (or by the number of customers in our target segment), we can normalize the metrics we display by population.  Population data is readily available on the Census Bureau website.  Expressing each metric as a percentage of the population, instead of as a raw measured value (sales, views, and so on), is one easy way to accomplish this.
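
Here is a minimal sketch of that idea in Python, using the California and Wyoming conversion counts from the example.  The population figures are rough, rounded values I am assuming purely for illustration; pull real numbers from the Census Bureau before using anything like this.  The names `POPULATION`, `CONVERSIONS`, and `per_capita` are my own.

```python
# Rough, rounded populations assumed for illustration only (not from a
# specific Census release); replace with real Census Bureau figures.
POPULATION = {"California": 38_000_000, "Wyoming": 580_000}

# Raw conversion counts from the example: identical on a standard heat map.
CONVERSIONS = {"California": 100_000, "Wyoming": 100_000}


def per_capita(metric: dict, population: dict, per: int = 1_000) -> dict:
    """Normalize a raw metric to a rate per `per` residents."""
    return {state: metric[state] / population[state] * per for state in metric}


rates = per_capita(CONVERSIONS, POPULATION)

# Print states from highest to lowest normalized rate.
for state, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state}: {rate:.1f} conversions per 1,000 residents")
```

Shading the map by `rates` instead of `CONVERSIONS` would give the two states very different colors, even though the raw counts match.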

Other applications of this concept might be more product specific.  For instance, if you are selling skis, a heat map of sales might show western mountain states carrying a greater share of the total sales.  This makes sense intuitively.  But if you were to acquire data on the population of avid skiers by zip code, you might find significant marketing opportunities outside of what was expected.  For instance, Texas might look underrepresented compared to Colorado in raw sales, but its skiing population is also smaller.  Normalizing might show that conversions are actually better among Texas-based consumers than among Colorado-based consumers.  This could allow you to target this demographic with greater precision.
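
To make the ski example concrete, here is a small Python sketch.  Every number in it is made up for illustration (I do not have real ski sales or skier counts), and the `Region` class is a hypothetical helper.  It only shows how the ranking can flip once sales are normalized by the size of the target segment rather than viewed as raw totals.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    sales: int        # raw count, as shown on an un-normalized heat map
    avid_skiers: int  # size of the target segment (hypothetical figure)

    @property
    def sales_per_1000_skiers(self) -> float:
        """Sales normalized by the target-segment population."""
        return self.sales / self.avid_skiers * 1_000


# Made-up figures for illustration only.
regions = [
    Region("Colorado", sales=5_000, avid_skiers=500_000),
    Region("Texas", sales=1_500, avid_skiers=100_000),
]

top_by_raw_sales = max(regions, key=lambda r: r.sales)
top_by_rate = max(regions, key=lambda r: r.sales_per_1000_skiers)

print("Highest raw sales:", top_by_raw_sales.name)
print("Highest sales per 1,000 avid skiers:", top_by_rate.name)
```

With these invented numbers, Colorado leads on raw sales while Texas leads on sales per skier, which is the kind of opportunity a raw heat map would hide.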

Getting back to the Walmart graphic in the first paragraph, you might have noticed a lot of underrepresentation in the West.  Take another look at Nevada.  There are no stores anywhere in the middle of the state.  Hopefully by now you realize that normalizing this data would show that Walmart strategically positioned its stores in population centers, and that very few people live in the middle of the state.

Resources -
http://www.esri.com/news/arcuser/1000/files/normalize.pdf
http://www-personal.umich.edu/~mejn/election/2012/
http://adam.webanalyticsdemystified.com/2010/04/13/comparison-reports/
http://semphonic.blogs.com/semangel/2010/01/tactics-in-web-analytics-visitor-segmentation.html