As a person who likes to stay abreast of our ever-expanding government in my areas of specialization (energy and environment), I periodically survey the website of the U.S. Environmental Protection Agency (EPA) to see what they are funding with my taxpayer dollars.
Imagine my surprise when I encountered a novel Request for Proposals at their National Center for Environmental Research seeking to recruit people at non-profit institutions to dredge through EPA’s databases in order to gin up new things for the agency to worry about and possibly regulate.
The U.S. Environmental Protection Agency (EPA), as part of its Science to Achieve Results (STAR) program, is seeking applications proposing to use existing datasets from health studies to analyze health outcomes for which the link to air pollution is not well established, or to evaluate underlying heterogeneity in health responses among subgroups defined by susceptibility or extent and/or composition of exposure.
And, ever helpful, EPA gives some examples of what such data-dredging exercises might look like:
For example, while air pollution associations with respiratory and cardiovascular disease have been studied most extensively, evidence is beginning to emerge of possible air pollution impacts on additional health conditions including diabetes, neurological disorders, and reproductive and developmental outcomes. Studies also might evaluate factors that confer increased sensitivity to air pollution effects such as compromised health status, genetic variants, social and neighborhood conditions, higher exposure and others. In addition, some research groups have developed innovative methods and models to characterize exposure that might be applied to health effects analyses in other cohorts to understand whether certain sources or atmospheric components contribute to observed geographic heterogeneity in health-exposure associations.
Further, EPA has specific outcomes in mind. This is not random data dredging, which would be bad enough. This program seeks to fund directional data dredging that looks only for relationships suggesting that exposure to various air pollutants causes harm to human health. In EPA’s words:
EPA is interested in research to explain heterogeneity in health responses to air pollutants. Heterogeneity might be explained by: 1) Individual characteristics and other environmental/social conditions that increase the likelihood of an adverse health outcome among a subset of the population. [emphasis mine]
To pay for this innovative regulatory fishing expedition, EPA proposes to give away $1.4 million, in awards of up to $300,000, for projects that could last up to three years.
Now, there’s nothing wrong with trying to ensure that people’s health is protected from dangerous air pollutants (in fact, I’d argue that it’s a very legitimate function of government), but there is something wrong with organizing taxpayer-funded fishing expeditions to probe for new regulatory potential by seeking out obscure relationships in large databases. And those problems are intrinsic to data dredging, a frequently abused form of data mining.
Data dredging, according to Wikipedia, is “the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data. These relationships may be valid within the test set but have no statistical significance in the wider population.” Wikipedia gives a particularly relevant example: “Suppose that observers note that a particular town appears to be a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for the area of hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable will be significantly correlated with the cancer rate across the area.”
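Wikipedia’s cancer-cluster example is easy to demonstrate for yourself. Here is a small Python sketch (all numbers invented purely for illustration) that generates a “cancer rate” and 200 demographic variables for 50 areas of a hypothetical town, with every variable drawn completely independently of the cancer rate. Even so, testing each one at the usual 5% significance level dredges up roughly ten “significant” correlations by chance alone:

```python
import math
import random

random.seed(42)

n_areas = 50   # geographic areas in the hypothetical town
n_vars = 200   # unrelated demographic variables measured per area

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Cancer rates and all 200 variables are drawn independently:
# by construction, none of the variables has any real link to cancer.
cancer_rate = [random.gauss(0, 1) for _ in range(n_areas)]
variables = [[random.gauss(0, 1) for _ in range(n_areas)]
             for _ in range(n_vars)]

# Two-tailed 5% critical value of |r| for n = 50 (t ~ 2.01, 48 df)
r_crit = 0.279

false_hits = sum(1 for v in variables
                 if abs(pearson_r(cancer_rate, v)) > r_crit)
print(f"{false_hits} of {n_vars} unrelated variables look 'significant'")
```

Run it a few times with different seeds and some variables always clear the significance bar; test enough variables and “seek and ye shall find” is a mathematical guarantee.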
Or, as the Congressional Research Service explains (in the context of fishing for terrorists in air-travel databases):
Although data mining can help reveal patterns and relationships, it does not tell the user the value or significance of these patterns. These types of determinations must be made by the user. Similarly, the validity of the patterns discovered is dependent on how they compare to “real world” circumstances. For example, to assess the validity of a data mining application designed to identify potential terrorist suspects in a large pool of individuals, the user may test the model using data that includes information about known terrorists. However, while possibly re-affirming a particular profile, it does not necessarily mean that the application will identify a suspect whose behavior significantly deviates from the original model.
Another limitation of data mining is that while it can identify connections between behaviors and/or variables, it does not necessarily identify a causal relationship. For example, an application may identify that a pattern of behavior, such as the propensity to purchase airline tickets just shortly before the flight is scheduled to depart, is related to characteristics such as income, level of education, and Internet use. However, that does not necessarily indicate that the ticket purchasing behavior is caused by one or more of these variables. In fact, the individual’s behavior could be affected by some additional variable(s) such as occupation (the need to make trips on short notice), family status (a sick relative needing care), or a hobby (taking advantage of last minute discounts to visit new destinations).
In other words, with data dredging, it really is a situation of “Seek and ye shall find.”
It is one thing for scientists to identify sick populations and to investigate what might be making them sick. It is another thing entirely to sift through large databases in order to come up with correlations that may have no causal relationship, but that might, nonetheless, cause EPA to spend scarce taxpayer money researching the potential linkage, or worse, to endlessly dredge through databases in search of ever lower, ever more obscure health impacts to justify expanded regulation and EPA intrusion into the economy. This is one EPA funding proposal that should be scrapped. If EPA has more money than it knows what to do with, there’s always the crazy idea of giving it back to the taxpayer.
Right now in Florida an effort is under way desperately trying to find a cause for the “cancer cluster” in the Acreage, destroying property values all the way, and probably forcing the rest of us to pay to have them put on our water supply (one debunked theory is that their wells are connected). And all because people don’t understand randomness. People seem to think that if cancer victims all lived an equal distance apart, that would be “random.” Hogwash; in fact, if I saw people dying fixed distances apart I’d think “serial killer” immediately.
The problem is something called “egalitarian bias.”
Andrew – You’re correct in pointing out that people simply don’t understand the nature of probability and statistics, a failure easily demonstrated by the continued growth of Lotto revenues, and the continued profitability of casinos.
When it comes to things like cancer clusters, or any kind of illness cluster, people simply assume that there must be a cause, not understanding that such things will happen purely by random chance. I try to explain this to people by posing the following challenge to them. First, I get them to agree that flipping a coin is a 50:50 proposition, and that every toss is a unique event, unrelated to previous or future tosses. Then I have them take 1000 pennies, and toss them randomly onto a map. If they then look over the map, they’ll find areas with clusters of heads and tails that arise simply through random distribution.
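Ken’s penny challenge is simple to simulate. The sketch below (grid size and coin count are my own choices; Ken’s thought experiment doesn’t specify a map) scatters 1,000 fair coin tosses over a 10 × 10 map grid and then hunts for the cell that looks most like a “heads cluster.” One always turns up, from nothing but chance:

```python
import random

random.seed(7)

GRID = 10  # the "map" is a 10 x 10 grid of cells
heads = [[0] * GRID for _ in range(GRID)]
tails = [[0] * GRID for _ in range(GRID)]

# Toss 1000 pennies onto random spots on the map.
for _ in range(1000):
    x, y = random.randrange(GRID), random.randrange(GRID)
    if random.random() < 0.5:
        heads[y][x] += 1
    else:
        tails[y][x] += 1

# Find the cell that looks most like a "heads cluster".
h, t, cell = max(
    ((heads[y][x], tails[y][x], (x, y))
     for y in range(GRID) for x in range(GRID)),
    key=lambda c: c[0] - c[1],
)
print(f"Cell {cell}: {h} heads vs {t} tails -- a 'cluster' from pure chance")
```

Every coin is fair and every toss independent, yet scanning 100 cells for the most lopsided one reliably produces a spot where heads dominate, which is exactly how a demographically unremarkable town can become a “cancer cluster.”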
It’s even worse than that, Ken. Imagine if people insisted that there must be a “heads” cluster on a part of the map with a lot of pennies, because more heads are there than other places. Sure, and there are more tails there also. It should hardly surprise anyone that there are “lumpy” distributions in epidemiology, since there are “lumpy” distributions of human population in the first place. I imagine that NY City has the most cancer cases in the country, or close to it.
I don’t think some analysts even bother to normalize for population density.
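The point about normalization is worth making concrete. With hypothetical numbers (invented purely for illustration), a big city can have hundreds of times more cancer cases than a small town while the small town actually has the higher rate:

```python
# Hypothetical figures, purely for illustration: raw case counts
# track population size, while per-capita rates track actual risk.
areas = {
    "Big City":   {"population": 8_000_000, "cases": 4_000},
    "Small Town": {"population":    20_000, "cases":    15},
}

rates = {}
for name, d in areas.items():
    # Normalize to cases per 100,000 residents.
    rates[name] = d["cases"] / d["population"] * 100_000
    print(f"{name}: {d['cases']} cases, {rates[name]:.0f} per 100,000")
```

Here the city has 4,000 cases to the town’s 15, but the town’s rate (75 per 100,000) exceeds the city’s (50 per 100,000). An analysis that skips this step will always “discover” disease hotspots wherever people happen to live.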