
Thursday, March 28, 2013

Which Way to Compare? Part 1 – Why Percentage Distributions are Better than Averages

Our work for our clients, especially our employee satisfaction and engagement studies, often includes comparisons to national or industry norms or across groups within their organizations. These comparisons enable clients to see where they stand and help them set reasonable goals for organizational improvement. In recent weeks we’ve had several conversations regarding the pros and cons of different ways of expressing these comparative figures – as percentage distributions, as averages and as indexes. We strongly feel that percentage distributions offer the best approach in most cases. Today we’ll show why we prefer percentage distributions over averages, and in the next blog we’ll show why we also prefer percentages over indexes.

Averages offer the benefit of simplicity for the end users of data. If a survey question has a 5-point scale that is converted to the numbers 1 through 5, taking the numerical average of the responses produces a score between 1 and 5. It’s then a simple matter to compare across groups. If we put the “5” at the positive end of the scale, then those groups – workgroups, locations, divisions – with higher scores are doing better than those whose scores are lower. It’s easy to glance at a set of these average scores and identify priorities for improvement.

The problem with using averages, however, lies in the nature of the average (technically known as the arithmetic mean) as a statistic. An average is a measure of central tendency and carries an underlying assumption that the answers are more-or-less normally distributed. This assumption is often incorrect. It is not uncommon to find survey responses that are skewed toward one end of the scale or even polarized. Using a central tendency measure when there is no central tendency can reduce the utility of the information or even be misleading. A simple example can show why this is true. Imagine three work groups all answering the question “How much do you like your job?” using a 5-point scale. Each group has 10 employees:

In group one, all 10 employees choose the middle of the scale

In group two, 5 employees choose one end of the scale, and 5 choose the other end

In group three, 2 employees choose each of the 5 points on the scale

The average score for all three groups is a “3.” None of these groups has a central tendency and taking an average obscures an important feature of the data – the way the opinions are distributed. If these three work groups all reported to you, which information would be most actionable – knowing that they all have the same average score or knowing something about how the scores are distributed? We think the answer is pretty obvious.
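The arithmetic is easy to verify. Below is a minimal Python sketch (the responses are simply the hypothetical groups described above) showing that all three groups average exactly 3.0 even though their percentage distributions look nothing alike:

    from statistics import mean
    from collections import Counter

    # Hypothetical 5-point responses for the three work groups described above
    groups = {
        "Group 1 (all middle)":  [3] * 10,
        "Group 2 (polarized)":   [1] * 5 + [5] * 5,
        "Group 3 (flat spread)": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    }

    for name, scores in groups.items():
        counts = Counter(scores)
        # Percentage of the group choosing each scale point, 1 through 5
        dist = {point: f"{100 * counts.get(point, 0) / len(scores):.0f}%" for point in range(1, 6)}
        print(f"{name}: mean = {mean(scores):.1f}, distribution = {dist}")

All three means print as 3.0; only the distributions reveal that group two is polarized while group one is unanimous.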

Whether in market research or national politics, the difference between winning and losing is often in the percentage distribution, not the average. In the 2012 election, more votes were cast for Democratic candidates for the House of Representatives than for Republican candidates, but we have a Republican majority in the House because of the way those votes were distributed across congressional districts. Nate Silver made his reputation as a predictor of elections by understanding the details of percentage distributions of voter behavior. We feel our clients need and deserve the same level of information about the issues that are important to them. So even though average scores are easy to calculate and present, we think the extra effort of looking at percentage distributions is worth it.

Wednesday, August 22, 2012

The Perception Dilemma, Or, What Can We Do About Self-Report Bias?

A recent article in the Sunday New York Times called “Why Waiting Is Torture” (http://www.nytimes.com/2012/08/19/opinion/sunday/why-waiting-in-line-is-torture.html?pagewanted=all) brought to mind one of the key dilemmas in survey design – the simple fact that people often “misremember” their experiences (which is what we call “self-report bias”). How reliable can survey results be if respondents cannot accurately recall what happened?

The article itself is about the psychology of waiting in lines and some of the points are very interesting (although perhaps not surprising to researchers!):

1. According to Richard Larson at M.I.T., occupied time (such as walking to a specific location) feels shorter than unoccupied time (such as standing around waiting),

2. There is a tendency to overestimate the amount of time spent waiting in line (the article quotes an average of 36%),

3. A sense of uncertainty, such as not knowing how long you will be in line, increases the stress of waiting, while information and feedback on wait times or reasons for delays improve perceptions,

4. When there are multiple lines, customers focus on the lines they are “losing to” and not on the lines they are beating, and

5. The frustrations of waiting can be mitigated in the final moments by beating expectations, such as having the line suddenly speed up.

What implications do these findings have for survey design and analysis? In my experience, if we are trying to get an accurate record of an event – such as the amount of time spent waiting in line – a straightforward recall question is not always the best choice. There are actions we can take during research design, in developing our data collection tools and in analysis to deal with the problem of poor or inaccurate self-reports of behavior.

At the research design stage, we should ask whether a self-report on a survey question is the best way to collect the data. In some cases, we are better off using direct measures, such as observations of the behavior, instead of asking about it. At the questionnaire development stage, we can explore which ways of asking a question are more likely to limit bias; for example, asking people what hours they watched TV last night will produce a larger per-night (and more accurate) answer than asking them to estimate their total viewing hours per week. In the analysis stage we often know which direction the self-report bias will tend to lean – for example, people generally under-report their consumption of alcohol and over-report their church attendance. When we know these tendencies we can deal with them either by adjusting the answers up or down – if we know the appropriate adjustment to make – or by mentioning them when we report the findings or make recommendations.
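To make the analysis-stage adjustment concrete, here is a minimal sketch that deflates self-reported wait times using the roughly 36% average overestimate quoted in the Times article. The 1.36 factor is an assumption borrowed from that article purely for illustration; in a real project you would estimate the adjustment from your own calibration data (observed versus reported waits) before changing any numbers.

    # Hypothetical self-reported wait times, in minutes
    reported_waits = [10, 15, 8, 20, 12]

    # Assumed average overestimate of 36% (the figure quoted in the article above);
    # a real project would estimate this factor from observed vs. reported waits.
    OVERESTIMATE_FACTOR = 1.36

    adjusted_waits = [round(minutes / OVERESTIMATE_FACTOR, 1) for minutes in reported_waits]
    print(adjusted_waits)  # [7.4, 11.0, 5.9, 14.7, 8.8]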

The key here is to take the possibility of self-report bias into consideration and to have a plan for dealing with it. The existence of self-report bias does not invalidate research efforts, it is merely one of the many factors that research vendors and clients must take into consideration as they approach their projects.

Thursday, April 19, 2012

Some Practical Advice on Statistical Testing

One thing that I am willing to admit is that I am a very “practical” researcher, meaning that I prefer to rely on the craft of analysis when constructing a story more than on statistical analyses. This is not to say that advanced statistical tools do not have their place within a researcher’s tool box, but they should not substitute for the attention required to carefully review the results and to dig deep through cross tabs to uncover the patterns in the data so as to create a relevant and meaningful story. Remember the adage – “the numbers don’t speak for themselves, the researcher has to speak for the numbers.”
A great example of this is the use – and misuse – of statistical testing. I would never claim to be a statistician but, over the years, I’ve found that the type of statistical testing that often accompanies data analysis has very limited uses. In a nutshell, statistical testing is great for warning analysts when apparent differences in percentages are not significantly different. This is extremely important when deciding what action to take based on the results. However, such testing is of no use on its own when determining whether statistically significant differences are meaningful. In my experience, statistical significance works as a good test of difference, but such differences alone are insufficient when analyzing research data.
I love this comment from an article by Connie Schmitz on the use of statistics in surgical education and research that “Statistical analysis has a narrative role to play in our work. But to tell a good story, it has to make sense.” (http://www.facs.org/education/rap/schmitz0207.html) She points out that, with a large enough sample size, every comparison between findings can be labeled “significant,” as well as concluding that “it is particularly difficult to determine the importance of findings if one cannot translate statistical results back into the instrument’s original units of measure, into English, and then into practical terms.”
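Schmitz’s point about large samples is easy to demonstrate. The sketch below runs a standard two-proportion z-test (written out by hand so nothing is hidden) on a hypothetical one-point difference, 51% versus 50%. With 100 respondents per group the difference is nowhere near significant; with 100,000 per group it is highly “significant” yet no more meaningful for decision making.

    from math import sqrt, erfc

    def two_proportion_p_value(p1, n1, p2, n2):
        """Two-sided p-value for the difference between two proportions
        (normal approximation with a pooled standard error)."""
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        return erfc(abs(z) / sqrt(2))

    # A one-point difference: 51% vs. 50% (hypothetical figures)
    for n in (100, 100_000):
        p = two_proportion_p_value(0.51, n, 0.50, n)
        print(f"n = {n:>7,} per group: p = {p:.5f}")
    # n = 100:     p is about 0.89 - not remotely significant
    # n = 100,000: p is below 0.0001 - "significant", but just as trivial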
The idea of translating survey results into practical terms represents the very foundation of what I believe market research should be doing. This same idea is highlighted in an article by Terry Grapentine in the April 2011 edition of Quirks Marketing Research called “Statistical Significance Revisited.” Building on an even earlier Quirk’s article from 1994 called “The Use, Misuse and Abuse of Significance” (http://www.quirks.com/articles/a1994/19941101.aspx?searchID=29151901), he stresses that statistical testing does not render a verdict on the validity of the data being analyzed. He highlights examples of both sampling error and measurement error that can have major impacts on the validity of survey results that would not at all affect the decision that a particular difference is “statistically significant.” I agree wholeheartedly with his conclusion that “unfortunately, when one includes the results of statistical tests in a report, doing so confers a kind of specious statement on a study’s ‘scientific’ precision and validity” while going on to point out that “precision and validity are not the same thing.”
Personally, I find it especially frustrating when research analysis is limited to pointing out each and every one of the statistically significant differences, with the reader expected to draw their own conclusions from this laundry list of differences. How can that possibly be helpful in deciding what action to take? In this case, the researcher has simply failed to fulfill one of their key functions – describing the results in a succinct, coherent and relevant manner. In contrast, I believe that I follow the recommendation of Terry Grapentine (and of Patrick Baldasare and Vikas Mittel before him) that researchers should be seeking and reporting on “managerial significance,” by focusing on the differences in survey results “whose magnitude have relevance to decision making.” This is quite a different approach than simply reciting back the results that are statistically different.
Going back to Connie Schmitz’s article, she closes with a great observation conveyed by Geoffrey Norman and David Streiner in their book PDQ Statistics:
“Always keep in mind the advice of Winifred Castle, a British statistician, who wrote that, ‘We researchers use statistics the way a drunkard uses a lamp post, more for support than illumination’.”

Monday, April 9, 2012

How and When Should I Use Statistical Testing?

Statistical testing is a common deliverable provided by market research vendors. But in some cases the users of the research findings may be uncertain about what the statistical testing really means and whether or not it should influence the way they use the data. Below are six key questions to keep in mind when using statistical testing.
1. What kind of data am I dealing with? Statistical testing can only be applied to quantitative data, such as survey data. There are no statistical tests for qualitative data, such as focus groups and in-depth interviews.
2. What am I trying to learn? Most statistical testing is used primarily to help decide which of the differences we see in our data are real in terms of the population we are interested in. For example, if your findings show that 45% of men like a new product concept and 55% of women like the concept, you need to decide if that difference is real – that is, whether the difference seen in your survey accurately reflects a difference between men and women that exists in the larger population of target consumers.
3. How certain do I need to be? Confidence intervals are the most common way of deciding whether percentage differences of this sort are meaningful. The size of a confidence interval is determined by the level of certainty we demand – usually 90% or 95% in market research, 95% or 99% in medical research – and the size of our sample relative to the population it is drawn from. The higher the level of certainty we demand, the wider the confidence interval will be – with a very high standard of certainty, we need a wide interval to be sure we have captured the true population percentage. Conversely, the bigger the sample, the narrower the confidence interval – as the sample gets bigger it becomes more and more like the target population and we become more certain that the differences we see are valid. The short sketch after this list illustrates both of these effects.
4. How good is my sample? Most statistical tests rely on key assumptions about how you selected the sample of people from whom you collected your data. For tests like the confidence intervals described above, this key assumption is having some element of random selection built into your sample that makes it mathematically representative of the population you are studying. The further your sampling procedure strays from this assumption, the less valid your statistical testing will be. If you can make the case that your sample is not biased in any important ways relevant to your research questions, you can rely on your stats tests to identify meaningful differences. If you have doubts about your sample, use the tests with caution.
5. Does my data meet other key assumptions of the test? Some stats tests assume particular data distributions, such as the bell-shaped curve that underlies confidence intervals. If your data are distributed in some other way – lop-sided toward the high or low end of the scale or polarized – the stats test is worse than worthless; it will actually be misleading!
6. Does the stats testing seem to align with other things I know about the research topic? Stats tests should supplement your overall understanding of the data. They are not a substitute for common sense. Keep in mind that most data analysis software will produce stats tests automatically, whether or not the tests are appropriate for the particular data set you are using. Almost every experienced researcher has watched someone (or been someone) trying to explain a “finding” that was nothing more than a meaningless software output.
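To put points 2 and 3 into practice, here is a minimal sketch (hypothetical numbers, simple normal-approximation formulas) that builds a confidence interval around the 45%-versus-55% difference from question 2, at several confidence levels and two sample sizes. When the interval excludes zero, we treat the difference as real at that level of certainty.

    from math import sqrt

    # Common two-sided z multipliers for the usual confidence levels
    Z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

    def diff_ci(p1, n1, p2, n2, confidence):
        """Normal-approximation confidence interval for the difference p1 - p2."""
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        margin = Z[confidence] * se
        diff = p1 - p2
        return diff - margin, diff + margin

    # Hypothetical result from question 2: 55% of women vs. 45% of men like the concept
    for n in (100, 400):                      # respondents per group
        for conf in (0.90, 0.95, 0.99):
            low, high = diff_ci(0.55, n, 0.45, n, conf)
            verdict = "excludes zero" if low > 0 else "includes zero"
            print(f"n={n} per group, {conf:.0%} confidence: {low:+.3f} to {high:+.3f} ({verdict})")

With 100 respondents per group the interval includes zero at every confidence level, so we cannot call the ten-point difference real; with 400 per group it excludes zero even at 99%. Notice both effects from question 3: the interval widens as the demanded level of certainty rises and narrows as the sample grows.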
If you can provide honest, satisfactory answers to these six questions, stats testing can hugely improve your understanding of your data and help you identify its key themes. Just as importantly, these questions can keep you from wasting your time analyzing differences that aren’t really there.

Tuesday, January 24, 2012

How Many People Do I Need To Survey To Get Meaningful Answers?

A central decision for anyone considering a survey – or any other quantitative research – is figuring out how big the survey sample needs to be in order to produce meaningful answers to the research questions. Researchers focus on sample size because it ties together three core aspects of any research effort:

  • Cost – the bigger the sample, the more it will cost to collect, process and analyze the data
  • Speed – the bigger the sample, the longer it will take to collect it (big samples can sometimes be collected quickly, but usually only by further raising costs!)
  • Accuracy – the bigger the sample, the more certain we can be that we have correctly captured the perceptions/opinions/behavior/beliefs/feelings of the population we are interested in (the technical term for this is statistical reliability)

As we see from these three bullets, the decision about sample size essentially boils down to a trade-off between cost, speed and accuracy. So when we pick a sample size we are making a decision about how much accuracy we are going to purchase, within the framework of our budget and timing.

Fortunately for researchers, quantitative samples do not have to be enormous to provide findings that are accurate enough to answer most market research questions. Any unbiased sample (we’ll talk about sample bias in another blog entry) of 50 or more stands a halfway decent chance of giving you a reasonable view of the population it is drawn from and, as we increase the sample size, our confidence that we have the correct answer increases. We can show this effect by looking at the margin of error – the plus or minus number – for some common sample sizes. To keep it simple, all of these are calculated using the assumption that the sample is drawn from a large population (20,000 or more) and that we are using the 95% confidence level of statistical reliability (the most typical standard for statistical reliability used in market research). If we are looking at percentages:

  • A sample of 100 has a margin of error of ± 9.8%
  • A sample of 250 has a margin of error of ± 6.2%
  • A sample of 500 has a margin of error of ± 4.4%
  • A sample of 1,000 has a margin of error of ± 3.0%
  • A sample of 2,000 has a margin of error of ± 2.1%

Looking at these numbers you can see why national surveys, such as the big public opinion polls shown on TV or in newspapers, often have samples around 1,000 or so. Samples in that size range have small margins of error, and doubling the sample size wouldn’t make the margin of error much smaller – there’s no reason to spend money making the sample bigger for such a small gain in accuracy.
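For readers who want to reproduce the figures above, here is a minimal sketch of one standard way to compute these margins of error: the 95% confidence multiplier (1.96), the worst-case 50% response percentage, and a finite population correction for a population of 20,000. Different calculators handle the correction and the rounding slightly differently, which is why one figure comes out a tenth of a point lower than the list above.

    from math import sqrt

    def margin_of_error(n, population=20_000, z=1.96, p=0.5):
        """95% margin of error for a percentage, with a finite population correction."""
        basic = z * sqrt(p * (1 - p) / n)
        fpc = sqrt((population - n) / (population - 1))
        return basic * fpc

    for n in (100, 250, 500, 1000, 2000):
        print(f"n = {n:>5}: +/- {100 * margin_of_error(n):.1f}%")
    # Prints 9.8%, 6.2%, 4.3%, 3.0%, 2.1% for these five sample sizes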

These numbers also show why we often urge clients to spend a bit to make a small sample bigger, but not too big! The gains in accuracy are all at the beginning – moving from a sample of 100 to something larger is almost always a good idea, while adding anything over 1,000 usually is not. So the rule of thumb is: 100 is probably too small and much more than 1,000 is probably too big.

Of course, in real life it can be more complicated. We may need to examine sub-groups (age or income groups, political parties, geographic regions, etc.) within the population we are looking at. If a sub-group is small, we may need a bigger overall sample to capture enough of each of the sub-groups in order to provide an accurate picture of their views. So we have a rule of thumb about sub-groups, as well – don’t make decisions about any sub-group smaller than 30. For example, if we do a survey of households in a large urban area and we want to compare households by income level, we need to make our sample big enough to have at least 30 households in each of the income categories we want to compare. Assuming this is a typical city, there will be fewer households at the high end of the income distribution than at the low end, so we need to think about how to get enough of the high-end households to be able to make that comparison. So, if we want to be able to look at households with incomes over $100K, and 15% of the population has an income of $100K or more, we need a sample of at least 200 households to expect 30 or so of them to fall in that category.
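The arithmetic behind that 200-household figure is simple enough to sketch; the 30-respondent minimum and the 15% incidence are just the assumptions from the example above.

    from math import ceil

    def total_sample_needed(min_subgroup, incidence):
        """Total sample required so that a sub-group with the given incidence
        is expected to include at least `min_subgroup` respondents."""
        return ceil(min_subgroup / incidence)

    # Example from above: at least 30 high-income households at 15% incidence
    print(total_sample_needed(30, 0.15))   # 200

    # The same 30-respondent rule at other incidences
    for share in (0.05, 0.10, 0.25):
        print(f"{share:.0%} incidence -> total sample of {total_sample_needed(30, share)}")

Since 15% is only the expected share, in practice we would usually pad the total a bit, or set a quota for the smaller group, rather than count on hitting exactly 30.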

Using these rules of thumb, you can form an idea about how big your sample needs to be to answer your research questions - without spending more than you can afford.