Hi, everyone. I woke up with a bit of insomnia, so I thought that I would post something that had occurred to me in my sleep. (Yes, I know that it makes me an inveterate dork that I think about these things in my sleep. I don't care.) Recently, my colleague, Student Redux, has been exploring methodological problems in sampling and weighting, and I had an idea that I wanted to explore as a possible solution. Now, granted, I'm not committed to this, and I'd love to have your comments as further exploration of the topic.
To do a little injustice to Student Redux's argument, the problem with stratified sampling is that getting counts based on the universe alone, without considering vote history, gives us an inaccurate projection of the subset of that universe who will actually turn out to vote on Election Day. Weighting isn't a good solution, because it can only correct universe parameters that are already represented in the sample, and because we would still ultimately be weighting either to counts determined by the initial sample universe or to whatever intuitively determined set of weights we imagine the likely voter universe to look like. The bottom line is that because we don't have a verifiable, testable means of determining voter turnout, we don't have a good way to sample or weight.
Part of the problem is that pollsters often wind up constructing models that are based on intuitions and gut checks. They carefully study a district, try to predict what the turnout is likely to be, and then sample and call those people. Unfortunately, this is rarely done in a quantitative manner. My solution is to quantify this method through modeling.
Using a well-built list, you can look at vote history to predict likelihood of turnout based on previous elections for individuals, and then start doing the same for groups. So, for example, you can look at a voter file for a Congressional District in Texas, and see that white people who've voted in general elections prior to the current one are 75% likely to vote in the current election, black people who've voted in the last two general elections prior to the current one are 62% likely to vote in the current one, and so on. You can start doing the same by geographical district, gender, age, etc., and really get into some good details. (Because you're still working with the entire voter file at large, and not a sample thereof, you still have enough records that the margin of error isn't likely to become a real problem.)
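To make that concrete, here is a minimal sketch of the group-level tabulation, assuming a voter file already loaded as a list of records. The field names (`race`, `prior_generals`, `voted_target`) and every value in the toy data are made up for illustration; a real file would have its own layout, and the "target" would be a completed election that you are learning from.

```python
from collections import defaultdict

# Toy stand-in for a voter file; every field name and value here is hypothetical.
# "prior_generals" = how many of the two preceding general elections the person voted in.
# "voted_target"   = whether they voted in the completed election we are learning from.
voter_file = [
    {"race": "white", "prior_generals": 2, "voted_target": True},
    {"race": "white", "prior_generals": 2, "voted_target": True},
    {"race": "white", "prior_generals": 2, "voted_target": True},
    {"race": "white", "prior_generals": 2, "voted_target": False},
    {"race": "black", "prior_generals": 2, "voted_target": True},
    {"race": "black", "prior_generals": 2, "voted_target": False},
    {"race": "black", "prior_generals": 0, "voted_target": False},
    {"race": "white", "prior_generals": 0, "voted_target": False},
]

def turnout_rates(records, key):
    """Return {cell: share of records in that cell who voted in the target election}."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [voted, total]
    for rec in records:
        cell = key(rec)
        counts[cell][1] += 1
        if rec["voted_target"]:
            counts[cell][0] += 1
    return {cell: voted / total for cell, (voted, total) in counts.items()}

# Cells here are (race, prior general-election votes); swap in geography, gender, age, etc.
rates = turnout_rates(voter_file, key=lambda r: (r["race"], r["prior_generals"]))
```

Changing the `key` function is all it takes to cut the file a different way, which is the "start doing the same by geographical district, gender, age" step.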
You can input all these factors into a model, and also create independent side rules, such as "regardless of race, voting in any three consecutive elections gives you a +5% chance to vote", to try to generate a turnout score. Obviously, you would have to evaluate things like the relationship between date of registration and first election voted in, as well as other factors, and put them up against race, gender, etc. Vote history isn't enough for what I'm talking about here. You want to look at almost every variable that you can at this stage. You may also want to evaluate the recency of the data. For example, we know that in 2008, massive registration and subsequent turnout are distorting the turnout models.
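As a rough sketch of how a base rate and a side rule might combine, assume the base rate for a person's group is already known and that "+5%" means five percentage points added on top (that reading is my assumption, as are all the function names). A real model would fit these adjustments from data rather than hard-code them.

```python
def voted_three_consecutive(history):
    """True if the history (list of booleans, oldest first) has three votes in a row."""
    run = 0
    for voted in history:
        run = run + 1 if voted else 0
        if run >= 3:
            return True
    return False

# Each side rule is (predicate on vote history, additive adjustment to the score).
# The single rule here mirrors the "+5%" example; the value is illustrative only.
SIDE_RULES = [(voted_three_consecutive, 0.05)]

def turnout_score(base_rate, history, rules=SIDE_RULES):
    """Base group rate plus any side-rule adjustments, clamped to the range [0, 1]."""
    score = base_rate
    for applies, adjustment in rules:
        if applies(history):
            score += adjustment
    return min(1.0, max(0.0, score))
```

Keeping the rules in a list makes it easy to add or drop one and see how the scores move, which matters if the point is for other people to be able to inspect the model.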
Having built this gigantic model, we can now compute some vote likelihood score, VLS. Having computed VLS, perhaps we can use either a very large, completely randomly sampled list with quotas set to universe counts, or a smaller list that's stratified and randomly sampled, using VLS to correct for over- and under-representation in the sample.
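One way the VLS correction could work, as a sketch only: treat the sum of VLS within each cell of the universe as the expected number of voters from that cell, and weight respondents so that the sample's cell shares match those expected shares. This is ordinary cell weighting with VLS-derived targets, which is my guess at a workable mechanism and not something I've tested; the names and numbers are hypothetical.

```python
from collections import Counter, defaultdict

def vls_weights(universe, sample_cells):
    """Per-cell weights so the sample matches VLS-implied turnout shares.

    universe:     iterable of (cell, vls) pairs, one per person in the voter file
    sample_cells: list of cell labels, one per completed interview
    """
    expected_voters = defaultdict(float)
    for cell, vls in universe:
        expected_voters[cell] += vls  # expected turnout contributed by this person
    total_expected = sum(expected_voters.values())

    sample_counts = Counter(sample_cells)
    n = len(sample_cells)
    return {
        cell: (expected_voters[cell] / total_expected) / (count / n)
        for cell, count in sample_counts.items()
    }

# Hypothetical universe: two equal-sized cells with different vote likelihoods.
universe = [("A", 0.8)] * 50 + [("B", 0.4)] * 50
sample = ["A"] * 5 + ["B"] * 5
weights = vls_weights(universe, sample)  # cell A is weighted up, cell B down
```

Note that a cell with no respondents gets no weight at all, so this still can't fix a group that is missing from the sample entirely.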
It's important to note a few things, here.
One of the problems that this does solve is that you no longer wind up with sub-samples that are way too small to have any confidence in the observation. You also don't wind up weighting to either universe counts or some sub-universe counts; you wind up weighting to a predictable Vote Likelihood Score. And now, because the VLS is external, quantified and available for everyone to see, you have an objective basis for criticising the weights of a poll.
I'll await my late night sleepy thought evisceration in the comments. Thoughts?
Dirty D
Comments
Something like what I'd been thinking...
Sat, 10/25/2008 - 03:59 — Student Redux
Though for model specifics, I was actually thinking that this could be a good place for a method like Nate's hierarchical linear modeling. We have plenty of data on voter turnout from previous elections, and plenty of historical, social, and economic data. It might be possible to predict turnout by demographic based on a set of such variables. (Best way to work the model would be through cross-validation, so split the population of electoral events in half when building the model.)
I wish we'd been talking about this a month ago. I think it might have actually been possible to kick out an adequate model in that time frame, and maybe make a stab at what turnout would look like for this year. I think we're too late to do that in time, now, but I think it's something we might be able to do for 2010. It'd be excellent to actually produce a coherent voter turnout model.
Nate's model is not as precise as what I'd want to do.
Sun, 10/26/2008 - 21:58 — Dirty D
From what little I know of Nate's model, he doesn't work off of voter files, but off of results from polls. What I am talking about is a score generated by data vendors for pollsters to use, or, more likely, generated by the two in conjunction. The scores would come from looking at the entire section of the voter file that pertains to the universe. Thus, you'd have different VLSs for Congressional races, statewide elected races, Senate races, etc.
We should talk about how to do this, though. Blue Leader could probably offer a lot of expertise.
Dirty D