Are polling databases a good solution to the sample size problem?

Working in strategic polling is a lot like being in college: you're constantly scrambling to meet crazy deadlines, dealing with too many obligations at once, and you barely have enough time, money and energy to put into any single one of them. So, as with writing papers and cramming for exams, you limit how many resources you put towards each poll. Instead of calling 5,000 people over the course of one week and hiring out the entire phone bank, you call 500 people, hire only part of the phone bank's hours, and do it over the course of two days. You get the data faster, it's way cheaper and your client is happy with what he has. And, really, it's only the difference of a few points on the margin of error (MoE), so who's really hurt?

This is, of course, completely wrong.  Think about it for a minute, and then click the full entry to see why.

Figured it out? The point of a poll is not to ask some set of arbitrary questions of some arbitrary group of people and then look at them as a lump. The point of a poll is to ask a set of carefully considered questions of a carefully constructed group of people in order to allow you to make projections about the whole universe. In other words, think of your n not as one big lump, but as the aggregate of lots and lots of little, overlapping lumps that interconnect to build your n. So, say you have a poll of 500 people that's broken out as follows:

  • 50% Male, 50% Female
  • 35% African-American, 40% White, 15% Latino, 10% Other/Don't Know/Declined to State
  • 15% Some College, 50% Bachelors Degree, 20% Some Post-Grad, 15% Other/Don't Know/Declined to State

That means that you'll have an n of 250 for all men and an n of 250 for all women. But you're not just going to look at men and women as men and women: you're going to look at them against all the other relevant variables. So, for example, you're going to want to see how those who are Male, African-American and have a Post-Grad degree answered your questions. So, let's do a little arithmetic.

500 Respondents * 50% Chance of being Male * 35% Chance of being African-American * 20% Chance of having a Post-Grad degree = 17.5 Cases; realistically, 18.
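
For anyone who wants to play with the numbers, here is a minimal sketch of that arithmetic in Python. The function name `expected_cell_size` is made up for this illustration, and the calculation assumes, as the back-of-the-envelope version does, that gender, race and education are independent of each other, which real samples won't quite honor.

```python
from math import prod


def expected_cell_size(n, *shares):
    """Expected number of respondents who fall into every listed category.

    Assumption: the categories are independent of one another, so the
    shares can simply be multiplied together.
    """
    return n * prod(shares)


# 500 completes: 50% male, 35% African-American, 20% post-grad
cell = expected_cell_size(500, 0.50, 0.35, 0.20)
print(f"Expected cases in the sub-universe: {cell:.1f}")
```

Swap in any other combination of shares from the breakout to see how quickly the cells shrink.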

Suddenly, that rationalization of going for 500 completes no longer seems like a good idea. The confidence that you have in those observations is so low that they're not even worth looking at - they're probably not worth the paper they're printed on. (I'm sure that every firm out there has a floor for how small some sub-universe can get before they'll stop examining it. I've seen numbers ranging from 50 to 100.)
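
To put a rough number on "so low", here is a sketch of the usual textbook margin of error for a proportion. It assumes simple random sampling, a 95% confidence level (z = 1.96), the worst case of a 50/50 split, and no weighting or design effect; none of those assumptions come from any particular firm's practice, and the function name is hypothetical.

```python
from math import sqrt


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion, as a fraction.

    Assumptions: simple random sample, normal approximation, 95%
    confidence by default, no design effect from weighting.
    """
    return z * sqrt(p * (1 - p) / n)


groups = [
    ("full poll", 500),
    ("all men or all women", 250),
    ("male, African-American, post-grad", 18),
]
for label, n in groups:
    moe_points = margin_of_error(n) * 100
    print(f"{label:35s} n={n:4d}  MoE = +/-{moe_points:.1f} points")
```

Under those assumptions the full poll sits at a bit over 4 points, while the 18-case cell is north of 20 points, and at that size the normal approximation itself is shaky.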

Now, everybody knows this, but it usually winds up getting dealt with in one of two ways:

  • People ignore it, and act as if the only relevant MoE is the MoE of the whole poll instead of the Margins of Error of all the sub-universes or
  • People acknowledge the problem, wring their hands, and say "What are you going to do? That's the nature of the biz."

Recently, however, people have come up with another solution: a database of polls. The way that this works is that some group of clients will come together and agree to share their data. They'll ask the pollster to run all these small polls, and then merge the datasets together into one big database that they can all look at and query at will. So, imagine that you have ten clients, each running a poll every week with the specs that I described above:

  • 50% Male, 50% Female
  • 35% African-American, 40% White, 15% Latino, 10% Other/Don't Know/Declined to State
  • 15% Some College, 50% Bachelors Degree, 20% Some Post-Grad, 15% Other/Don't Know/Declined to State

Now, if you wanted to look at the same sub-universe, i.e., Male, African-American and has a Post-Grad degree, you wind up with the following:

5000 Cases * 50% Chance of being Male * 35% Chance of being African American * 20% Chance of having a Post-Grad degree = 175 Cases.
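
Turned around, the same arithmetic tells you how many pooled polls it takes to clear a reporting floor. This is a sketch only: `polls_needed` is a name invented here, the floors of 50 and 100 are the ones mentioned earlier, and it assumes every poll has the same size and the same demographic mix.

```python
from math import ceil


def polls_needed(floor, per_poll_n, share):
    """Smallest number of identical polls whose pooled cell reaches `floor`.

    Assumption: every poll has `per_poll_n` completes and the same
    share of respondents in the sub-universe.
    """
    return ceil(floor / (per_poll_n * share))


# Male * African-American * Post-Grad, assuming independence
share = 0.50 * 0.35 * 0.20

for floor in (50, 100):
    print(f"floor of {floor}: {polls_needed(floor, 500, share)} polls")
```

With ten polls a week, this particular sub-universe clears either floor comfortably.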

Phew! That's way more stable and you can actually have confidence in the observation! Of course, this is not without its own problems. Let's assume that you have ten clients: The Sierra Club, NARAL, The American Federation of Musicians, The Society of Professional Journalists, The National Debate Tournament, The Council on American-Islamic Relations, The Organization of Chinese Americans, The GNU/GPL Foundation and J Street. Even if they're all polling universes that are relatively similar to the specs in this example (odds are that they aren't), do you really think that they're all asking the same questions? And if they are asking the same questions, are they asking the questions in the same way?

Of course not. That would make life easy.

Here's where the real problems arise. In order to take all these different datasets and merge them, you have to first look and see which questions and which data from the file they all have in common. (Don't worry about wording at this stage, we'll get to that later.) Since they're all your clients, you've probably managed to convince them to use a common data vendor for their projects, so that's one less headache you have to worry about, and you can just concentrate on the questions.   If their polls are going to be of any use to them, they're going to have a laser beam focus on answering specific questions, so there probably won't be too many questions that they have in common.  Let's say that each poll has 60 questions - I'll bet that you can maybe get 10 questions out of each poll, at the max, that can have anything in common with the others, and that's if you're lucky. Let's call 10 the upper limit, and say that you'll have no fewer than 3 from each poll.
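
The "what do they have in common" step is, mechanically, just a set intersection. Here is a toy sketch: the poll names and question tags are invented for illustration, and in real life somebody has to read every questionnaire and tag the questions by hand before anything like this is possible.

```python
# Hypothetical question tags for three client polls (illustration only).
polls = {
    "poll_1": {"gender", "race", "education", "pres_vote", "issue_a"},
    "poll_2": {"gender", "race", "education", "pres_vote", "issue_b"},
    "poll_3": {"gender", "race", "education", "issue_c"},
}

# Questions every poll asked in some form: candidates for the merged file.
common = set.intersection(*polls.values())
print(sorted(common))

# Questions asked by some polls but not all: lost in the merge.
dropped = set.union(*polls.values()) - common
print(sorted(dropped))
```

Note that `pres_vote` drops out here because a single poll skipped it, which is exactly why the shared core ends up so small.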

We'll call this derived dataset "D". Now, in order to put all the various responses into D, you'll have to look at how each question was asked, and figure out some way to make them all line up. This is where things get a little crazy.  Let's say that 6 of the polls ask the Presidential Vote question this way:

Q4. [Vote Decision] In the upcoming November elections, have you decided for which candidate you're going to vote?

  1. Yes
  2. Maybe/Not Sure
  3. No
  4. Other/Don't Know/Declined to State

If 1, Go to Q5, else Go to Q6.

Q5. [Vote By Party]

Do you intend to vote for

  1. The Democratic Candidate
  2. The Green Party Candidate
  3. The Republican Candidate
  4. The Libertarian Candidate
  5. Some Other Party Candidate
  6. Other/Don't Know/Declined to State

Q6. [Leaners]  Well, if you had to decide today, would you vote for

  1. The Democratic Candidate
  2. The Green Party Candidate
  3. The Republican Candidate
  4. The Libertarian Candidate
  5. Some Other Party Candidate
  6. Other/Don't Know/Declined to State

And the remaining four ask it this way:

Q4. In the upcoming November elections, are you going to vote for

  1. Senator Barack Obama - Strongly (Democrat)
  2. Senator Barack Obama - Lean (Democrat)
  3. Representative Cynthia McKinney - Strongly (Green)
  4. Representative Cynthia McKinney - Lean (Green)
  5. Senator John McCain - Strongly (Republican)
  6. Senator John McCain - Lean (Republican)
  7. Representative Bob Barr - Strongly (Libertarian)
  8. Representative Bob Barr - Lean (Libertarian)
  9. Some other candidate
  10. Other/Don't Know/Declined to State

Do not read party names. Only offer party names if asked by respondent.

See the problem? Both are asking about the Presidential Vote, but one way has it broken out into two questions and asks by Party, and the other has it combined into one question and asks by candidate name. Obviously, it's smarter to change the formatting of the six questions to match the formatting of the four, as you'll have everything contained in one question, but you have to ask yourself whether or not you're getting the same data.  Is someone who says that he's voting for the Democrat identical to someone who says that he's voting for Obama? 

Obviously, the answer to that is no. Without getting too deeply into the data, we saw from the recent election that the universe of Obama voters is a superset of the universe of Democratic voters. But, this is another case of accepting the limitations of the biz. This was a relatively simple standardisation, but I'm sure that you can imagine how things become more and more problematic as the questions start resembling each other less and less. Religion questions are a great example of this.
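
For the curious, here is roughly what folding the two-question party version into the one-question candidate version could look like. The codes follow the two questionnaires quoted above. The big assumption, which is mine and not something either questionnaire says, is that a decided voter (Q4 = 1) maps to "Strongly" and anyone answering the leaner question maps to "Lean". The function and table names are hypothetical.

```python
# Party code (Q5/Q6) -> (strongly code, lean code) in the candidate format.
PARTY_TO_CANDIDATE = {
    1: (1, 2),  # Democratic -> Obama
    2: (3, 4),  # Green -> McKinney
    3: (5, 6),  # Republican -> McCain
    4: (7, 8),  # Libertarian -> Barr
}


def recode_to_candidate_format(q4, q5=None, q6=None):
    """Map a party-format response (Q4/Q5/Q6) to the ten-code candidate Q4.

    Assumption: decided voters (Q4 == 1) count as "Strongly" and
    everyone routed to the leaner question (Q6) counts as "Lean".
    """
    decided = q4 == 1
    party = q5 if decided else q6
    if party in PARTY_TO_CANDIDATE:
        strongly, lean = PARTY_TO_CANDIDATE[party]
        return strongly if decided else lean
    if party == 5:
        return 9  # some other candidate
    return 10  # other / don't know / declined, or no answer recorded


print(recode_to_candidate_format(1, q5=1))  # decided, Democratic
print(recode_to_candidate_format(3, q6=3))  # undecided, leans Republican
```

The code runs fine either way; whether "decided Democrat" really means the same thing as "Obama - Strongly" is the part no recode can settle.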

The real issue, however, has to do with who's being asked the questions. Even if your ten clients are all inquiring into areas that have the demographic breakdown that I offered up, and they're weighting the results to match that, do you really think that the 200 white people who came from a Sierra Club poll of unlikely voters are identical to the 200 white people from a Society of Professional Journalists membership poll? Of course not. Forget issues of standardizing questions - this is the real problem. What polling databases are trying to do is pretend that because all these various polls match D, they were built similarly and asked questions of the same kinds of universes. There is absolutely no guarantee that that's true. Unlike questions, which can be modified by recoding, there's nothing that you can do to reconcile the fact that the 500 people that the National Debate Tournament wanted to poll are fantastically different from the 500 people that NARAL would poll, even if the universes are weighted to similar numbers. In short, you can't assume that the 2,000 white people you'd be getting a week would be sufficiently similar to warrant making any projections off them, and the same obtains for any other sub-universe you can define.
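
One way to at least see the problem before pooling is to compare the same sub-universe across two source polls. Below is a sketch using a plain two-proportion z-test. The counts are invented purely for illustration (they are not from any real poll), the function name is hypothetical, and the test assumes simple random samples, which weighted polls are not.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two sample proportions.

    Assumption: both groups are simple random samples; weighting and
    design effects are ignored.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Hypothetical: the same subgroup in two different clients' polls,
# 200 respondents each, answering the same recoded question.
yes_a, n_a = 60, 200
yes_b, n_b = 140, 200

z = two_proportion_z(yes_a, n_a, yes_b, n_b)
pooled_share = (yes_a + yes_b) / (n_a + n_b)
print(f"z = {z:.1f}, pooled share = {pooled_share:.0%}")
```

In this made-up case the pooled share looks like a tidy even split, while the two sources are nowhere near each other. A large |z| (well past 1.96) is a hint that the lumps you're stacking up aren't the same kind of lump.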

Now, granted, if they're all polling universes of people that are virtually identical, the problem goes away, but that's not likely to happen, as they all have different interests.  It's an interesting thing to try out, but I'm not so sure of the efficacy.

Thoughts?

DD

p.s. Not too long ago, my colleague Student Redux wrote a fantastic post that discussed problems inherent in polling polls.  I just realized after writing all this that I should have talked to him first to get his take on this problem.
