Data Mining Epic Fail
The Ever Amazing Cory Doctorow managed to unearth this little beauty on how government usage of data mining is best described as a catastrophic failure:
They admit that far more Americans live their lives online, using everything from VoIP phones to Facebook to RFID tags in automobiles, than a decade ago, and the databases created by those activities are tempting targets for federal agencies. And they draw a distinction between subject-based data mining (starting with one individual and looking for connections) compared with pattern-based data mining (looking for anomalous activities that could show illegal activities).
But the authors conclude the type of data mining that government bureaucrats would like to do--perhaps inspired by watching too many episodes of the Fox series 24--can't work. "If it were possible to automatically find the digital tracks of terrorists and automatically monitor only the communications of terrorists, public policy choices in this domain would be much simpler. But it is not possible to do so."
There are several points in this to discuss. Let's start by conceding that the massive intrusion of the government into our daily lives by means of data collection and profiling is absolutely terrifying, etc. We all agree to this, and it's non-controversial. Of course, it's also non-trivial, but I think that we can take it as axiomatic that all of us agree that it's pretty scary. The question, though, is how scary it can be when they can't even do it correctly, and the answer is, even more terrifying. The fact that an aggressive lunatic with a rifle has terrible aim doesn't do anything to reassure the people around his target that they won't get hit. The phrase "false positive", while used correctly, somewhat euphemises what's actually happening. What this means is that the government are arresting and charging the wrong people.
Now, obviously, the government frequently make mistakes in their charges, but this is different for two reasons:
- These aren't cases of a local police department picking up the wrong homeless guy off the street and charging him with selling marijuana to college kids. These techniques are used by the top-level federal government agencies to try and find terrorists and violent conspirators. To borrow some words from De La Soul, "Stakes is high".
- Because of the fact that the grounds for charges are algorithmic, and not the result of normal human methods, the confidence that people have in them is much, much higher than the confidence that they would have had in other charges, regardless of the number of false positives.
The problem here is that regardless of the fact that these charges are based on algorithmic grounds, people are forgetting that it takes a human being to write these algorithms and interpret the results. What is under indictment here is neither subject nor pattern-based data-mining. What is under indictment is the capacity of the Bush administration to use these methods effectively.
If it were as simple as just conjoining lists of consumer data, political data and assorted other data and then running CHAID after CHAID, any idiot with a lot of money and a lot of processor power could do it. All you'd have to do is buy the lists and watch the machines work. The simple fact of the matter is that any result is only as good as the algorithm that produced it, and the algorithm is only as good as the programmer who produced it.
Let's take a very simple example. Let's suppose that I have a poll, and as part of that poll, I'm trying to determine whether or not someone is a techie. The way that I do this is by asking the following questions:
I am now going to ask you a series of questions. Please answer yes or no, and if you need me to repeat these instructions, please let me know.
Do you/Did you:
- Work in a technical capacity?
- Spend your personal time pursuing technology related hobbies?
- Adopt new technologies before most people?
- Study some technology related course of study when you were in school or at college?
- Do you frequently eat Red Vines, gummi bears and pre-packaged pastry products?
- Do you read XKCD?
- Are you fond of comic books?
- Do you spend more than ten hours a week playing video games?
- Is Chris Matthews a tool?
- Is the greatest threat to American sovereignty the proliferation of Amero loving liberals?
- Are blogs ruining journalism?
- Is Jeet Kune Do the baddest martial art of all time?
Once I have the responses to these questions, let's say that I'm operating in SPSS, and I choose to define a techie as anyone who answered yes to at least four out of twelve of these questions. For those of you who are interested, here's the syntax:
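* Count how many of the twelve questions each respondent answered Yes.
COUNT TECHIE = Q1 TO Q12 ("Yes").
VAR LAB TECHIE "TECHIES".
* Collapse the count: 0 to 3 becomes 1, 4 or more becomes 2.
RECODE TECHIE (LO THRU 3=1) (4 THRU HI=2).
VAL LAB TECHIE 1 "Not a techie" 2 "Techie".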
The problem here is that I'm getting a bad measurement. In the battery of questions above, 1 through 4 accurately measure things that make people techies, 5 through 8 measure whether or not someone participates in tech geek culture (which is not identical to being a techie - not all techies are tech geeks), and 9 through 12 are completely non-germane. If you define a techie as any person who answered "Yes" to any four of those twelve questions, you're going to get a whole bunch of false positives.
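To make the false positives concrete, here is a rough Python counterpart to the scoring rule. Only the four-out-of-twelve rule and the grouping of the questions come from the post. The three respondents, the function names, and the alternative rule (at least three "Yes" answers among questions 1 through 4) are made up for illustration.

```python
# Hypothetical respondents: each maps question number (1-12) to "Yes" or "No".
def make_answers(yes_questions):
    return {q: ("Yes" if q in yes_questions else "No") for q in range(1, 13)}

respondents = {
    "works_in_tech": make_answers({1, 2, 3, 4}),   # the germane questions
    "comic_fan": make_answers({5, 6, 7, 8}),       # geek culture only
    "opinionated": make_answers({9, 10, 11, 12}),  # non-germane questions only
}

def count_yes(answers, questions):
    """Count the "Yes" answers among the given question numbers."""
    return sum(1 for q in questions if answers[q] == "Yes")

def is_techie_any_four(answers):
    # The rule from the post: "Yes" to any four of the twelve questions.
    return count_yes(answers, range(1, 13)) >= 4

def is_techie_germane_only(answers, minimum=3):
    # Assumed alternative: score only questions 1-4; the cutoff of 3 is arbitrary.
    return count_yes(answers, range(1, 5)) >= minimum

for name, answers in respondents.items():
    print(name, is_techie_any_four(answers), is_techie_germane_only(answers))
```

Under the any-four rule all three respondents come out as techies, even though two of them never said "Yes" to a single question about technology.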
This is an extreme example, granted, but it illustrates the point: any algorithm is only as good as the person who wrote it. Frankly, is it any surprise that the Bush government just can't get it right?
Dirty D














Another key problem
Base rate.
It doesn't matter how well you code an algorithm for "finding terrorists": the USA has 300,000,000 people, and the number of terrorists IN the USA is astronomically small relative to that.
For a more accessible example, say we have a city of 100,000 people. There's a small terrorist cell in this city, 10 people in size. That means, if you pick anyone off the street, there's a 1/10,000 chance (.01%) that the individual is a terrorist.
Now every algorithm is going to have a false-positive rate and a false-negative rate, and these two rates are inextricably linked.
Say you want to make sure you don't accuse a lot of innocent people of being terrorists (you lower the false-positive rate). This means you up the threshold of identification for who's a terrorist suspect... and the higher you push that threshold, the more likely it is that a more normal-seeming terrorist is going to slip past your notice.
Conversely, say you want to make absolutely sure you find ALL the terrorists. That means you lower your threshold to look at more potential suspects. There you have it, "look at more potential suspects". You increase your false positives in your quest for a true positive.
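The trade-off can be sketched in a few lines of Python. The ten scored people below are invented for illustration (a higher score means "looks more suspicious"), and `errors_at` is a hypothetical helper, not anyone's real screening method.

```python
# Hypothetical (suspicion_score, is_terrorist) pairs.
people = [
    (0.95, True), (0.80, True), (0.55, True),
    (0.90, False), (0.70, False), (0.60, False),
    (0.40, False), (0.30, False), (0.20, False), (0.10, False),
]

def errors_at(threshold):
    """Return (false_positives, false_negatives) when flagging scores >= threshold."""
    false_pos = sum(1 for score, guilty in people if score >= threshold and not guilty)
    false_neg = sum(1 for score, guilty in people if score < threshold and guilty)
    return false_pos, false_neg

for threshold in (0.50, 0.75, 0.92):
    fp, fn = errors_at(threshold)
    print(f"threshold={threshold:.2f} false positives={fp} false negatives={fn}")
```

On this toy data, raising the threshold from 0.50 to 0.92 takes the false positives from 3 down to 0, but the missed terrorists go from 0 up to 2.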
Now let's assume that your algorithm correctly identifies terrorists 90% of the time, and only misidentifies innocent people 1% of the time. (Don't worry about the percentages - false positives and false negatives are linked, but they don't "sum to 100%".) Run this algorithm on the city of 100,000 people and you get:
10 * .9 = 9 terrorists
99,990 * .01 = roughly 1,000 innocent people
Well, you found most of the terrorists! But your list of about 1,009 suspects also contains roughly 1,000 innocent people. That's over 110 innocent people for every real terrorist.
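For anyone who wants to play with the numbers, here is the same arithmetic as a small Python sketch. The population, cell size, and both rates are the ones assumed in this example. The variable names are mine, and the 1% rate is applied to the 99,990 innocent residents only, so the counts can differ by a few people from rounded back-of-the-envelope figures.

```python
population = 100_000
terrorists = 10
hit_rate = 0.90             # chance a real terrorist is flagged
false_positive_rate = 0.01  # chance an innocent person is flagged

innocents = population - terrorists
true_positives = terrorists * hit_rate
false_positives = innocents * false_positive_rate
flagged = true_positives + false_positives

# Share of flagged people who really are terrorists.
precision = true_positives / flagged
innocents_per_terrorist = false_positives / true_positives

print(f"terrorists flagged: {true_positives:.0f}")
print(f"innocent people flagged: {false_positives:.0f}")
print(f"share of suspects who are terrorists: {precision:.2%}")
print(f"innocent suspects per real terrorist: {innocents_per_terrorist:.0f}")
```

Under these assumptions fewer than 1 in 100 suspects is an actual terrorist, which is the base rate problem in a single number.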
Real algorithms will be better than what I've suggested above, but the underlying problem remains. There are so few terrorists that, even when you use an algorithm to whittle down the suspects, it's STILL like searching for a needle in a haystack. More precise algorithms improve performance, but since prediction of human behavior is imperfect to begin with, how far can we really go to identify terrorists by behavior patterns alone?
The issue is fundamentally intractable. Any government's best bet with data-mining is to find the most effective algorithm they can and set the identification threshold at ROCK BOTTOM. Then use real human intelligence methods and see if there are cross-matches. Algorithmic intelligence alone is worthless.
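In code terms, the cross-matching step is nothing more than a set intersection. This is only a toy sketch: the identifiers and both lists are invented, and real intelligence leads obviously don't arrive as a tidy set of IDs.

```python
# Hypothetical IDs flagged by a low-threshold algorithm (deliberately a long list).
algorithm_flags = {"person_017", "person_042", "person_108", "person_256"}

# Hypothetical IDs that came up independently through human intelligence work.
human_leads = {"person_042", "person_300"}

# Only the people who show up in both sources get a closer look.
cross_matches = algorithm_flags & human_leads
print(sorted(cross_matches))
```

The algorithm's list never gets acted on directly here; it only narrows down where the human-sourced leads point.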
Words I thought I'd never write
I actually think you're being a little unfair to the Bush administration. After all, Britain under Blair and Brown has been a pioneer in this field, and they aren't exactly doing a bang-up job either. Nor are private companies or political campaigns all that effective.
I think the problem might not be in algorithm quality per se, and more the consequences of failure. If a campaign misidentifies you, you might get some mail that you're not interested in; if the government screws up, you'll never fly again.
UPDATE: Credit where credit's due--my thinking on the whole subject has been heavily influenced by Bruce Schneier.
Responding to both of your comments at once
So, I've decided to group the debate. Blue Leader, yes, you're right that Blair wasn't doing much better, and Brown hasn't made any substantial increases in quality. As I said in the original post, "Stakes is high". And I do think that there's a lot to be said for algorithmic failure. I'm very, very curious to see what the specific methods that they use are.
Student Redux, we're not making mutually exclusive claims. I reconcile our two takes on this as follows:
1. Whatever the Bush administration do, they do badly
2. Because of #1, there is a lot of human error going into the data mining.
3. The base rate problem would have made this a difficult endeavor to begin with, but the human error is making it worse.
My take is that this should never have been a problem if the government weren't so obsessed with acquiring data on people for acquisition's sake. Once they had these data, they had to nominally justify the use thereof, so they put together this half-assed operation. They had a tiny, tiny number of cases that they were working from - there really just aren't that many terrorists in the U.S. Working from this small number of base cases just didn't give them enough to base anything on, and as a result, things went off the rails.
I would also add that part of the problem here is that they are over-using data mining. The tool is exceptionally limited, and they're relying on it way too much.
Dirty D
Dirty D writes about polling, analytics, data and whatever else may cross his mind as being neat. Feel free to contact him by email : D I R T Y D AT O V E R D E T E R M I N E D DOT N E T.
I entirely agree
Was elucidating a further point, but I think the misuse of data-mining is probably the most critical part of this. I think of it in very simple terms. All of the information intelligence services needed to identify and stop the attacks of 9/11 was already in the system. The government failed to recognize and process that information. There is absolutely no reason to believe that increasing the amount of information available to the government (at least under policies anything like those in place from 20 Jan. 2001 to 11 Sept. 2001) will solve the problem. The government doesn't know what to do with the information it has. Adding in a huge data-mining scheme is just creating additional informational burden, with no evidence of increased effectiveness.
Can it be used well? Sure. Is it being used well? I doubt it, especially if the UK is having problems implementing their own system well.