Many areas of scientific research, such as psychology, medicine, and economics, make heavy use of tests of statistical significance. A popular significance level is 5%; this means that, if there were no real effect, results at least as extreme as those observed would be expected less than 5% of the time by random chance. While these tests are arguably valid and useful, the wide availability of automated experimentation and computerized data mining makes them ever easier to game, whether unintentionally or intentionally.
One well-known mistake is to omit necessary corrections when using the same data set to test multiple hypotheses. If a single false hypothesis is 5% likely to appear confirmed by random chance, then testing 14 independent hypotheses against the same data set is over 50% likely [1 - (1 - .05)^14 = 0.51] to produce at least one false positive result. Besides correction factors, a common solution is to have two separate data sets, where one set is used for data mining for possible hypotheses, and a second set is used to verify the result. On a large scale, however, poor hypotheses can still pass this second filter. This is especially true when the hypotheses themselves are being automatically generated and not based on any previously known plausible physical mechanisms.
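The arithmetic above can be checked with a short sketch. It assumes the tests are independent, which is what the formula 1 - (1 - alpha)^n requires. The function names are illustrative, and the Bonferroni adjustment is shown only as one common example of a correction factor, not necessarily the one any particular study should use.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive among independent tests.

    Assumes every tested hypothesis is actually false and that each test
    independently has probability `alpha` of passing by chance.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    return alpha / num_tests


# 14 uncorrected tests at the 5% level: about 0.51
uncorrected = familywise_error_rate(0.05, 14)

# The same 14 tests, each held to the stricter corrected level
corrected = familywise_error_rate(bonferroni_alpha(0.05, 14), 14)

print(f"uncorrected: {uncorrected:.3f}")
print(f"corrected:   {corrected:.3f}")
```

With the correction applied, the chance of at least one false positive across all 14 tests stays below the original 5% level, at the cost of making each individual test harder to pass.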
A related problem is known as the "file drawer effect": positive results are published, while negative results remain in the "file drawer". This creates a serial version of the problem above: if the same false hypothesis is tested 14 times, then by random chance it is over 50% likely to be confirmed at least once, and the negative results are never published. There is a movement to publish negative results, but this is also problematic because a negative result may be due to recognized poor experimental procedure. These problems reduce the usefulness of meta-studies, since they summarize and aggregate the results of positive studies without accurate information about how many negative studies were never published.
These issues are becoming better known as problematic results are found in areas such as clinical drug trials. A useful "canary in the coal mine" for statistical techniques is parapsychology. While confirmation of abilities such as precognition (seeing the future) is possible, a positive result is much more likely to indicate that statistical standards for research and publication in a field (in this case psychology) have fallen too low. Recently the paper "Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect" by Daryl J. Bem was accepted for publication in a prominent psychology journal. Critics claim the author has made numerous mistakes, including those discussed above. Bem found subjects could predict a future result correctly 53.1% of the time (where random chance was 50%). But as Wagenmakers et al. have pointed out, a similar large-scale test has been running for a long time: casino roulette. In European roulette the house edge is 2.7% [36 to 1 odds against winning; 35 to 1 payout], so gamblers with Bem's 3.1% edge would have cleaned out the casinos already. It is also suspicious that in most studies which do find psychic abilities, the measured effect hovers at the edge of the statistical significance value chosen for the study.
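The roulette figures can be reproduced with a small expected-value sketch. The 37-pocket wheel and the 35 to 1 single-number payout follow from the odds quoted above. Applying the 53.1% hit rate to an even-money bet such as red/black is an assumption made here purely for illustration, and the function name is hypothetical.

```python
def expected_return(p_win: float, payout: float) -> float:
    """Expected profit per unit staked.

    A win pays `payout` units (stake kept); a loss forfeits the 1-unit stake.
    """
    return p_win * payout - (1.0 - p_win)


POCKETS = 37  # European wheel: numbers 1-36 plus a single zero

# Single-number bet: 1 winning pocket, pays 35 to 1
single_number = expected_return(1 / POCKETS, 35)

# Even-money bet (e.g. red/black): 18 winning pockets, pays 1 to 1
even_money = expected_return(18 / POCKETS, 1)

# ASSUMPTION: a gambler who picks the winning colour 53.1% of the time
with_hit_rate = expected_return(0.531, 1)

print(f"single number: {single_number:+.3f}")
print(f"even money:    {even_money:+.3f}")
print(f"53.1% hitter:  {with_hit_rate:+.3f}")
```

Both ordinary bets lose 1/37 (about 2.7%) of the stake on average, which is the house edge. Under the stated assumption, the 53.1% hit rate turns that loss into a positive expected return on every bet, which is the core of the casino argument.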
Bem's response to critics is amusing. A major part of his defense is an appeal to authority: the referees of his paper accepted it for publication in a prominent journal, and his statistical techniques are commonly accepted in psychology. He defers to another author (Radin) for defense of the historical record of parapsychological research. Radin cites meta-studies and makes an even more dubious appeal to authority: the U.S. Patent Office.