Segment Reporting – Why your parents were right and you can’t always have everything you want!
If you don’t have a technical background, it can be hard to understand why web analytics tools often have strange and apparently unreasonable restrictions on the types of reports or analysis you can do. But as arbitrary as these restrictions may seem, they are born of a very real and fundamental set of problems. Those problems arise from a few simple facts about web measurement, and they may well have considerable influence on your tool selection.
Fact number one is that web analytic solutions intended for big enterprises often have to deal with almost unprecedented amounts of detail data. Big web sites generate billions of rows of data in a year. It’s a lot. Even the amount of data for a single day can be enormous. Because the raw data set CAN be so large, tools often can’t take the approach of using the low-level data to answer every question. Instead, many web analytic solutions rely on a technique developed well before web analytics to handle very large data sets in other fields. This approach (OLAP is the techie term) involves the creation of data “cubes” that pre-summarize specific relationships in the data.
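To make the cube idea concrete, here is a minimal sketch in Python of what pre-summarization looks like. The field names and data are made up purely for illustration, not taken from any particular vendor: reports read the small summary rather than re-scanning every raw row.

```python
from collections import Counter

# Each raw row is one page view: (date, campaign, visitor_id, page).
# The data and field names here are illustrative, not from any real tool.
raw_rows = [
    ("2007-05-01", "email",  "v1", "/home"),
    ("2007-05-01", "email",  "v1", "/product"),
    ("2007-05-01", "search", "v2", "/home"),
    ("2007-05-02", "email",  "v3", "/home"),
]

# Build a tiny "cube": page views pre-summarized by (date, campaign).
cube = Counter((date, campaign) for date, campaign, _, _ in raw_rows)

# A report now reads the summary instead of scanning the raw rows.
print(cube[("2007-05-01", "email")])   # -> 2 page views
```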
But here’s the deal – a cube can’t capture every possible relationship in the data – or it gets to be bigger than the original raw data. So the cube builder (in this case your web analytics vendor) has to make some tough decisions about what data to include. In some cases, these decisions are driven by basic analysis of what the vendors think is likely to be important: screen size by referring site – no; conversion by campaign – yes. But there are other factors to consider and the biggest is something called cardinality.
Cardinality is a measure of how many different values a variable can have. OLAP always worked great for basic customer analysis because most of the variables had very low cardinality (gender – two values, plus San Francisco residents – I’m one so I can say this; age – no more than about a hundred values and often reduced to four or five categories; income – usually reduced to three or four categories). The lower the cardinality, the more data compression you get with a cube and the more variables you can cross-tabulate and make available in the reporting. The higher the cardinality, the less performance benefit a cube provides unless lots of variable cross-tabulations are eliminated.
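A quick back-of-the-envelope calculation shows why cardinality matters so much to a cube builder. The cardinalities below are assumptions for the sake of illustration; the arithmetic is the point.

```python
import math

# Rough upper bound on cube size: the product of the dimensions' cardinalities.
# (Real cubes only store combinations that actually occur, but high-cardinality
# dimensions still destroy most of the compression.)
demographic_dims = {"gender": 2, "age_band": 5, "income_band": 4}            # assumed values
web_dims = {"page": 200_000, "referrer": 50_000, "search_term": 1_000_000}   # assumed values

print(math.prod(demographic_dims.values()))  # 40 possible cells - trivial to pre-compute
print(math.prod(web_dims.values()))          # 10,000,000,000,000,000 possible cells
```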
Unfortunately, and here’s key fact number two, web analytics is filled with important variables of very high cardinality. These include page names, content sections, search terms (internal and external), referring sites, paths, and our current focus – unique visitors for segmentation. You get the idea. Most of the things you actually care about on the web have thousands, tens of thousands, hundreds of thousands, or even millions of different values. It’s hell on cube designers and it makes life tough for a web analytics vendor.
So one of the biggest differences between web analytic solutions is the approach they’ve taken to solving the “big” data issue and how clever they’ve been at getting around the inherent limitations of their approach.
There are solutions that try to solve the “big” data issue by tackling the raw data directly. This approach has lots of advantages because it means they never have to say “no” to a question or report. The interface can give you almost unlimited access to the data. Indeed, if the data is stored in a truly open system like Oracle or SQL Server, then even when the interface can’t do something, the data is still readily available. This is a fantastic approach and very attractive – until, that is, the performance isn’t good or the hardware needed to get good performance is more than you can afford. If you don’t have enormous quantities of data, or you are prepared to invest in sufficient hardware, then direct-to-data solutions will almost always be significantly better than OLAP solutions.
If you do have lots of data and a more limited budget, however, then a big part of how attractive a web analytics solution will be is how good a job the vendor has done of hiding the OLAP limitations. You’re going to have to think very carefully about what’s important to your organization and really pin down the vendors on these issues – because I can tell you from personal experience that they are often very cagey about these questions.
Not all of the issues around cardinality and OLAP crop up around visitor segmentation (though it’s one of the most common failure areas). So first I want to just list a few other areas to keep in mind when you’re evaluating solutions:
- Cropping paths. Sites can generate a staggering number of unique paths. Does the solution crop paths? This can be crippling for sites with lots and lots of low-volume but important pages – publishing sites being the most common case.
- Cropping pages. This isn’t as big an issue except for sites that have unusual dynamic page generation. If your site has more than a few hundred thousand pages then you may need to be concerned about this.
- Cropping search terms. Not as big an issue unless your site is heavily dependent on search AND has very high spread in term usage (neither of which is actually all that unusual).
- Not including reporting cuts for numeric variables like avg. page time. This isn’t often an issue but if your data needs are unusual, then these kinds of limitations can be fatal.
For visitors and visitor segmentation, here are two of the most common short-cuts vendors take:
- Uniques Reporting: What a headache this is. Numbers like daily uniques are almost useless. Vendors often don’t de-dupe visitors except at the daily level – and almost never at custom levels. What’s more, they don’t always provide de-duped visitor counts at weekly and monthly levels, and when they do, they often don’t do it consistently across all reports. This creates more errors, havoc, and user confusion than almost any other area in web analytics – not only because it is inherently confusing, but because the vendors almost never tell you in any given report what level of de-duping has been applied (see the sketch after this list for why the numbers diverge). In my experience, you really have to press vendors on this issue – not only to understand what’s going on, but in some cases to get them to provide de-duped uniques for any period other than daily.
- Segment Building: You don’t always have access to every variable you’d like when it comes to creating segments. In most cases, you’ll get all the things you need, but there will be exceptions. Make sure, if you’re a commerce site, that you can build segments from e-Commerce variables – both on an individual order basis and in terms of total customer spend. Make sure you can build segments from custom variables. And make sure you can build segments based on both individual session variables (like referring site) and lifetime customer variables (like original referring site or original campaign).
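To see why the de-duping level matters so much, here is a small sketch with made-up visitor IDs showing that adding up daily uniques is not the same thing as a de-duped weekly count:

```python
# Visitor IDs seen each day (made-up data for illustration).
daily_visitors = {
    "Mon": {"v1", "v2", "v3"},
    "Tue": {"v2", "v3", "v4"},
    "Wed": {"v1", "v5"},
}

sum_of_daily_uniques = sum(len(v) for v in daily_visitors.values())  # 8
weekly_deduped = len(set().union(*daily_visitors.values()))          # 5

# Adding the daily numbers counts a repeat visitor once per day they return;
# only de-duping across the whole period gives a true weekly unique count.
print(sum_of_daily_uniques, weekly_deduped)
```

If a report doesn’t tell you which of those two kinds of numbers it is showing, you can’t reconcile it against anything else.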
Vendors who rely primarily on cubes will sometimes provide a tiered approach to data access. The cube will support basic reporting and analytic requirements with interactive report generation, while the user will also have the ability to submit longer running queries to a data warehouse. This can be a very effective approach, but it, too, raises important issues. Here are the biggest “gotchas” with tiered implementations:
- Limited Data Access: Oh is this frustrating! Just because you’re getting access to the data warehouse doesn’t necessarily mean you’re getting access to the raw data. Sometimes, you’re just getting query access to the cube. That has uses, but one of them isn’t answering any new questions. This really isn’t a tiering strategy at all – just a way of providing automated programmatic access to the cube.
- Limited Query Access: Remember that in most cases, you’re running your query on a vendor’s data warehouse, so it’s possible you could really kill their machines! Anyone who has ever written SQL (the most common database query language) has, at one time or another, mistakenly written a query that brought a machine to its knees – usually without even returning the right results (the sketch after this list shows the classic example). Vendors protect themselves against this by providing either restricted SQL – with key capabilities removed – or a restricted interface through which queries must be constructed. This means that while you may have significantly improved data access, there are lots of queries you just won’t be able to make.
- Queuing: Long queries are going to take a long time. Fair enough. But don’t assume that just because queries “usually process overnight,” yours will. If this is an important part of your overall analysis needs, then make sure you get Service Levels that match them. Our experience, particularly with larger sites, is that turnaround hasn’t always been what the client was led to believe.
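As an illustration of the kind of query vendors are protecting themselves against, here is a small sketch using SQLite. The table names and sizes are invented: drop the join condition and the row count becomes the product of the two table sizes.

```python
import sqlite3

# Two tiny tables standing in for visit-level and page-view-level data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (visit_id INTEGER)")
con.execute("CREATE TABLE page_views (view_id INTEGER)")
con.executemany("INSERT INTO visits VALUES (?)", [(i,) for i in range(1000)])
con.executemany("INSERT INTO page_views VALUES (?)", [(i,) for i in range(1000)])

# With no join condition this is a cross join: 1,000 x 1,000 = 1,000,000 rows.
# Run the same mistake against billions of raw rows and the warehouse grinds
# to a halt - which is why vendors restrict the SQL or the query interface.
rows = con.execute("SELECT COUNT(*) FROM visits, page_views").fetchone()[0]
print(rows)  # 1000000
```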
By now, I hope you’ve gotten the basic idea of this series on tool evaluation: in the world of web analytics tools, nothing is either simple or perfect. Just as with our earlier discussion of tool capabilities for constructing segments, there is no one perfect approach. And here, the very real trade-offs between interactive query performance and richness of data access ensure that vendors will always be working in a grey area that they believe best fits their key customers. Sometimes that means it behooves you to make sure you fit the vendor’s sweet spot in terms of your data quantity and reporting needs. Sometimes it means that to get the right tool you need to really understand exactly what you need to accomplish analytically before you spend dollar one. And sometimes it means you will need to push your vendor to get the tools and the contract you need to make your analytics productive.
Gary Angel is the author of the SEMAngel blog – Web Analytics and Search Engine Marketing practices and perspectives from a guru with 10 years of experience.