Tuesday, April 24, 2012

Getting the Best Value from Open-Ended Questions

An article by Carolyn Lindquist called “For Better Insights From Text Analytics, Elicit Better Comments” in the most recent edition of Quirks Marketing Research (April 2012) gives three recommendations for improving the quality of consumer responses to open-ended questions. These three recommendations are:

1. Target your questions

2. Ask why

3. Be sensitive to placement

Based on my own experience, these are worthwhile considerations when designing surveys. I think most quantitative researchers – including me! – can fall into the twin traps of asking too many open-ends in a single survey and not defining those open-ends as clearly as possible.

I’m a strong believer in what I consider “directed open-ends,” which means that the wording is specific to the situation rather than a catch-all “please list comments below.” For example, in concept tests, I strongly believe in asking for strengths and weaknesses separately, which makes the survey both easier to answer and easier to analyze. This is consistent with Carolyn’s recommendation to “target your questions” – the example she gives is to vary the open-ended question text according to the stated level of overall satisfaction.

I’m intrigued by Carolyn’s suggestion to ask “why” rather than “what” questions, as she has found that asking “why” (such as “please tell us why you were less than satisfied with your experience”) yields longer and more useful answers than asking “what” (as in “please tell us what we can do to improve your next experience”). In her experience, the responses to “what” questions contain less detail and emotion than the answers to “why” questions. I think this suggestion is worth testing out. However, this does not mean we should ask “why” after every rating question, as some clients have requested a few times over the years!

I also agree with her third recommendation on being sensitive to the placement of open-ended questions, although I don’t agree with her suggestion that open-ends should only be asked at the end of a survey. In my experience, open-ended questions should appear where they make the most sense in a survey, and a nice balance of quantitative rating questions and open-ends makes for a more pleasant and natural survey-taking experience. One caveat, though – I avoid placing too many open-ended questions sequentially, as too many open-ends in a row can make the survey feel longer than it actually is and contribute to respondent fatigue.

Thursday, April 19, 2012

Some Practical Advice on Statistical Testing

One thing that I am willing to admit is that I am a very “practical” researcher, meaning that I prefer to rely on the craft of analysis more than on statistical analyses when constructing a story. This is not to say that advanced statistical tools do not have their place within a researcher’s tool box, but they should not substitute for the attention required to carefully review the results and to dig deep through cross tabs to uncover the patterns in the data so as to create a relevant and meaningful story. Remember the adage – “the numbers don’t speak for themselves, the researcher has to speak for the numbers.”
A great example of this is the use – and misuse – of statistical testing. I would never claim to be a statistician but, over the years, I’ve found that the type of statistical testing that often accompanies data analysis has very limited uses. In a nutshell, statistical testing is great for warning analysts when apparent differences in percentages are not statistically significant. This is extremely important when deciding what action to take based on the results. However, such testing is of no use on its own when determining whether statistically significant differences are meaningful. In my experience, statistical significance works as a good test of difference, but such differences alone are insufficient when analyzing research data.
I love this comment from an article by Connie Schmitz on the use of statistics in surgical education and research that “Statistical analysis has a narrative role to play in our work. But to tell a good story, it has to make sense.” (http://www.facs.org/education/rap/schmitz0207.html) She points out that, with a large enough sample size, every comparison between findings can be labeled “significant,” as well as concluding that “it is particularly difficult to determine the importance of findings if one cannot translate statistical results back into the instrument’s original units of measure, into English, and then into practical terms.”
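To make Connie Schmitz’s point concrete, here is a minimal sketch – using my own hypothetical figures, not anything from her article – of how a trivially small gap tests as “statistically significant” once the sample is large enough:

```python
# Hypothetical illustration: with very large samples, even a 0.8-point gap
# between two groups tests as highly "significant".
from math import sqrt, erfc

def two_prop_z(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided, normal approximation
    return z, p_value

# 50.8% vs. 50.0% agreement, with 100,000 respondents per group
z, p = two_prop_z(0.508, 100_000, 0.500, 100_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # roughly z = 3.6, p < 0.001
# "Significant" by any conventional standard, yet a 0.8-point gap is unlikely
# to change any business decision.
```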
The idea of translating survey results into practical terms represents the very foundation of what I believe market research should be doing. This same idea is highlighted in an article by Terry Grapentine in the April 2011 edition of Quirks Marketing Research called “Statistical Significance Revisited.” Building on an even earlier Quirk’s article from 1994 called “The Use, Misuse and Abuse of Significance” (http://www.quirks.com/articles/a1994/19941101.aspx?searchID=29151901), he stresses that statistical testing does not render a verdict on the validity of the data being analyzed. He highlights examples of both sampling error and measurement error that can have major impacts on the validity of survey results yet would not at all affect the decision that a particular difference is “statistically significant.” I agree wholeheartedly with his conclusion that “unfortunately, when one includes the results of statistical tests in a report, doing so confers a kind of specious statement on a study’s ‘scientific’ precision and validity” while going on to point out that “precision and validity are not the same thing.”
Personally, I find it especially frustrating when research analysis is limited to pointing out each and every one of the statistically significant differences, with the reader expected to draw their own conclusions from this laundry list of differences. How can that possibly be helpful in deciding what action to take? In this case, the researcher has simply failed to fulfill one of their key functions – describing the results in a succinct, coherent and relevant manner. In contrast, I believe that I follow the recommendation of Terry Grapentine (and of Patrick Baldasare and Vikas Mittel before him) that researchers should be seeking and reporting on “managerial significance,” by focusing on the differences in survey results “whose magnitude have relevance to decision making.” This is quite a different approach than simply reciting back the results that are statistically different.
Going back to Connie Schmitz’s article, she closes with a great observation conveyed by Geoffrey Norman and David Streiner in their book PDQ Statistics:
“Always keep in mind the advice of Winifred Castle, a British statistician, who wrote that, ‘We researchers use statistics the way a drunkard uses a lamp post, more for support than illumination’.”

Tuesday, April 17, 2012

The Risks of Projecting Survey Results To A Larger Population

In my experience, most quantitative research results are analyzed on the basis of the survey results themselves – such as the percentage distributions on rating scales – without the need to project results onto the larger population that the sample represents. It is generally understood that, with reasonably rigorous sampling procedures, these distributions are reflective of the attitudes held by the population at large.

In some instances, though, it is important to project to the larger group, such as when creating estimates of product use based on concept results. In these cases, we face a special challenge – do we take consumers at their word and simply extrapolate their answers to the larger population or do we use some combination of common sense and experience to adjust the data?

Although there are many sophisticated models for translating interest in a new product or service into projections of first year use, most include “adjustments” to the survey data to account for typical consumer behavior, such as:

1. The typical 5-point purchase intent scale is weighted in order to more accurately predict what proportion of the population will actually try the product. For example, the proportion of those who would “definitely buy” might be given a weight of 80% to reflect a high, but not absolute, likelihood of buying whereas those who would “probably buy” might be given a weight of just 20%.

2. These results assume 100% awareness of the new product or service, so further adjustments are required to account for the anticipated build in awareness, usually as a result of advertising.

3. Some estimate of repeat purchase is required, often derived from consumer experience with the new product or service or from established market results.

We take these steps to mitigate the risk of simply applying the survey results to the total population, as this could wildly inflate potential use of a new product or service.
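To illustrate, here is a simplified sketch of the kind of adjusted projection described above. The 80%/20% weights match the illustrative figures mentioned earlier, but the awareness level, concept scores and repeat rate are hypothetical placeholders, not a calibrated forecasting model.

```python
# Simplified, hypothetical trial-and-repeat projection using the adjustments above.

def year_one_triers(population, pct_definitely, pct_probably,
                    awareness=0.60, w_definitely=0.80, w_probably=0.20):
    """Estimate first-year triers from 5-point purchase-intent results."""
    # 1. Weight stated intent down to an expected trial rate
    trial_rate = pct_definitely * w_definitely + pct_probably * w_probably
    # 2. Only consumers who become aware of the product can try it
    return population * awareness * trial_rate

# Hypothetical concept-test results: 20% "definitely buy", 35% "probably buy"
triers = year_one_triers(1_000_000, 0.20, 0.35)
print(f"Estimated triers: {triers:,.0f}")  # 138,000, versus 550,000 if the
                                           # top-two box were taken at face value

# 3. Apply an estimated repeat-purchase rate (here, 40% of triers buy again)
repeaters = triers * 0.40
print(f"Estimated repeat buyers: {repeaters:,.0f}")  # 55,200
```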

This issue came to my mind this weekend when reading a New York Times article called “The Cybercrime Wave That Wasn’t” (http://www.nytimes.com/2012/04/15/opinion/sunday/the-cybercrime-wave-that-wasnt.html) in which Dinei Florêncio and Cormac Herley of Microsoft Research conclude that, although some cybercriminals may do well, “cybercrime is a relentless, low-profit struggle for the majority.”

Part of their analysis questions the highly-touted estimates of the value of cybercrime, including a recent claim of annual losses among consumers of $114 billion worldwide. This estimate makes the value of such crime comparable to estimates of the global drug trade. As it turns out, however, Florêncio and Herley conclude that “such widely circulated cybercrime estimates are generated using absurdly bad statistical methods, making them wholly unreliable.” This is a very practical example of how results from what appear to be reasonably large research samples can run into critical problems of statistical reliability, whether through poor sampling, naïve extrapolation or other sorts of statistical errors. In the case of the cybercrime estimate, it appears that the loss figures reported by just one or two people in the research sample are being extrapolated to the entire population, which means that a couple of unrepresentative or exaggerated answers can swing the headline estimate by billions of dollars.
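A hypothetical back-of-the-envelope calculation (my own made-up numbers, not Florêncio and Herley’s) shows how fragile that kind of extrapolation is:

```python
# One extreme answer can dominate a per-capita extrapolation.
sample_size = 1_000              # survey respondents (hypothetical)
adult_population = 200_000_000   # population being projected to (hypothetical)

reported_losses = [0] * 998 + [200, 50_000]   # a single outlier claims $50,000

mean_loss = sum(reported_losses) / sample_size       # $50.20 per respondent
national_estimate = mean_loss * adult_population     # about $10 billion
print(f"Projected national losses: ${national_estimate:,.0f}")

# Drop the single $50,000 response and the projection collapses to roughly
# $40 million: essentially the entire headline number rests on one questionnaire.
```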

In this particular example, a more accurate approach would be to separate the “screening” sample – i.e., identifying those consumers who have been victims of cybercrime using an extremely large database – from the “outcome” sample. In other words, if the goal is to estimate the impact of cybercrime, the objective should be to find a reliable sample of victims and interview them on their experience, including the extent of their losses. This approach would provide a much more rigorous basis for estimating the total value of cybercrime. However, caution should still be exercised when projecting to the total population.
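A rough sketch of that two-stage logic, with every figure a made-up placeholder rather than real cybercrime data, might look like this:

```python
# Two-stage estimate: prevalence from a large screening sample, losses from
# a separate sample of verified victims. All numbers are hypothetical.

screened = 500_000           # very large screening sample
victims_found = 7_500        # respondents identified as cybercrime victims
prevalence = victims_found / screened        # 1.5%

avg_verified_loss = 350.0    # average documented loss per interviewed victim ($)

adult_population = 200_000_000
total_losses = adult_population * prevalence * avg_verified_loss
print(f"Estimated total losses: ${total_losses:,.0f}")  # about $1.05 billion
```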

The key learning is that anytime we have data we want to extrapolate, we need to think about how much we trust that data to be accurate. There are some things consumers can report with superb accuracy - where they ate lunch today, the size of their mortgage payment, how many pets are in their homes. Assuming a decent survey sample, data of this sort can be easily extrapolated to a larger population. But other kinds of data are less accurate, whether due to the limits of human recall or various other forms of bias. Studies have shown, for example, that survey respondents cannot accurately recall where they ate lunch a week or two ago (recall error), tend to under-report their alcohol consumption (social desirability bias) and over-estimate their future purchases of products we show them in concept tests.

So, if we wish to extrapolate from our survey data to a larger sample, we have to be honest about how accurate the results are, what sorts of bias might inflate or deflate the numbers, and what sorts of adjustments, if any, we should make. And when we see stories in the media with giant estimates of the prevalence of some sort of crime, social problem or behavioral trend, we need to take a moment to ask how they came up with those numbers. Often, with a little digging, we see problems in how these estimates were created, leading to the same need for logic and common sense that we find when dealing with our own market projections.

Monday, April 9, 2012

How and When Should I Use Statistical Testing?

Statistical testing is a common deliverable provided by market research vendors. But in some cases the users of the research findings may be uncertain about what the statistical testing really means and whether or not it should influence the way they use the data. Below are six key questions to keep in mind when using statistical testing.
1. What kind of data am I dealing with? Statistical testing can only be applied to quantitative data, such as survey data. There are no statistical tests for qualitative data, such as focus groups and in-depth interviews.
2. What am I trying to learn? Most statistical testing is used primarily to help decide which of the differences we see in our data are real in terms of the population we are interested in. For example, if your findings show that 45% of men like a new product concept and 55% of women like the concept, you need to decide if that difference is real – that is, whether the difference seen in your survey accurately reflects a difference between men and women that exists in the larger population of target consumers.
3. How certain do I need to be? Confidence intervals are the most common way of deciding whether percentage differences of this sort are meaningful. The size of a confidence interval is determined by the level of certainty we demand – usually 90% or 95% in market research, 95% or 99% in medical research – and the size of our sample relative to the population it is drawn from. The higher the level of certainty we demand, the wider the confidence interval will be – with a very high standard of certainty, we need a wide interval to be sure we have captured the true population percentage. Conversely, the bigger the sample, the narrower the confidence interval – as the sample gets bigger it becomes more and more like the target population and we become more certain that the differences we see are valid. (A worked example follows this list.)
4. How good is my sample? Most statistical tests rely on key assumptions about how you selected the sample of people from whom you collected your data. For tests like the confidence intervals described above, this key assumption is having some element of random selection built into your sample that makes it mathematically representative of the population you are studying. The further your sampling procedure strays from this assumption, the less valid your statistical testing will be. If you can make the case that your sample is not biased in any important ways relevant to your research questions, you can rely on your stats tests to identify meaningful differences. If you have doubts about your sample, use the tests with caution.
5. Does my data meet the other key assumptions of the test? Some stats tests assume particular data distributions, such as the bell-shaped curve that underlies confidence intervals. If your data are distributed in some other way – lop-sided toward the high or low end of the scale, or polarized – the stats test is worse than worthless; it will actually be misleading!
6. Does the stats testing seem to align with other things I know about the research topic? Stats tests should supplement your overall understanding of the data. They are not a substitute for common sense. Keep in mind that most data analysis software will produce stats tests automatically, whether or not the tests are appropriate for the particular data set you are using. Almost every experienced researcher has watched someone (or been someone) trying to explain a “finding” that was nothing more than a meaningless software output.
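As the worked example promised under question 3, here is a minimal sketch (the sample sizes are hypothetical) of a confidence interval for the gap between the 45% of men and the 55% of women who like the concept:

```python
# 95% confidence interval for the difference between two proportions.
from math import sqrt

def diff_ci(p1, n1, p2, n2, z=1.96):   # z = 1.96 for 95% confidence
    """Confidence interval for the difference between two proportions."""
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se

low, high = diff_ci(0.55, 200, 0.45, 200)   # 200 women and 200 men interviewed
print(f"95% CI for the difference: {low:+.1%} to {high:+.1%}")
# With 200 per group the interval runs from about +0.2% to +19.8%: it excludes
# zero, but only barely, so the "women like it more" story deserves caution.
# With 1,000 per group the same 10-point gap gives a much narrower interval.
```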
If you can provide honest, satisfactory answers to these six questions, stats testing can hugely improve your understanding of your data and help you identify its key themes. Just as important, these questions can keep you from wasting your time analyzing differences that aren’t really there.