Quality control of the dataset using the forward search



Years ago, statisticians were primarily concerned with developing statistical methods for the analysis of various kinds of data. They worked out the probability distributions and properties of new estimators, hypothesis tests, and predictive methodologies. They then applied these methods to the data presented to them by subject matter specialists, fitting the methods to the data as well as they could.

Some studies were intended not only for scientific or engineering audiences but also drew the attention of regulatory authorities. It became increasingly important to show that the analytical methods fit the data. Soon, statisticians were asked to help design studies so that the available analytical methods could be applied more appropriately to the resulting data.

Study protocols went from being nice-to-have guidelines to being serious plans for the conduct of the study. Greater and greater detail was inserted into them. Statistical analysis plans (SAPs) were soon required to accompany the protocol and to be synchronized with it to a greater and greater degree.

In 2014, Francis Collins and Lawrence Tabak, the Director of the National Institutes of Health and his principal deputy, published a report (reference below) addressing their concern that the system for ensuring the reproducibility of scientific research was failing. Their principal concern was that the statistical results of some preclinical studies could not be reproduced by subsequent investigators in the area. They assured their readers that they had found very few cases in which the lack of reproducibility was caused by misconduct. Far more often the cause lay in existing practices and procedures, a lack of laboratory staff training in statistical concepts and methods, failure to report critical aspects of experimental design, and similar factors.

Scientific progress relies on the premise that science is objective, not subjective: another competent investigator should be able to reproduce the results of a study simply by following the stated methods and using the same study materials. I will leave it to you to read their paper and those that reinforced or contradicted their prescriptions for corrective action.

This concern about study reproducibility is not limited to preclinical studies. It also applies to clinical trials and other large, expensive studies with substantial societal impact. If a second study does not fully support the first study, which one should we believe?

One drawback to the NIH approach is that it relies on the follow-up study to reveal the problem. That might happen months or years after the original study, after a great deal of money has been spent going down the wrong path. Also, the original study might be a large engineering study or a complicated clinical trial that itself takes many months or years to complete. Staff changes are bound to occur. Supplies used at the beginning of the study may not be identical to supplies used at the end. Study sites may come and go. In other words, the data analysis could be just fine, appropriate for the study as defined in the protocol and the statistical analysis plan, and yet the nature of the study may have introduced inappropriate or inconsistent data. The protocol is supposed to describe the universe to which the study (and the data) applies. But how accurate is that description in light of the actual data?

Quality control of the data concerns the correctness of the data once it has been collected. It requires proper, easily interpreted data entry devices and software that get the data correctly into a format for analysis, and proper storage of all of the data to prevent deliberate or inadvertent changes. These requirements are fairly obvious, but the means to meet them are not uniformly available to all investigators, even today.

Scientific integrity is the adherence of the study conduct to the protocol and the completeness of the protocol with regard to its scope. A study of 14- to 19-year-olds should not admit a 13-year-old who will turn 14 before the end of the study, no matter how compelling her medical case. But once she is in, her data is part of the study.

Unfortunately, it is virtually impossible to enumerate all the different ways that data quality control and scientific integrity of the study can be compromised. What we are describing here is a software tool that is available now and that can help point out in real time possible discrepancies in the first study with regard to data quality or scientific integrity. This tool can and should be used prior to the statistical analysis of the first study.

To complicate matters further, regulatory agencies and others worry that data quality control is only undertaken after some unexpected, commercially adverse findings are discovered in the study. Eliminating observations at this point has the appearance of sampling to a foregone conclusion or eliminating adverse results.

Diagnostics involves identifying inconsistent observations, determining their impact on the primary analyses of the study, and justifying their disposition: total acceptance of the observation, modification of value or classification prior to acceptance, or removal. In the name of proper scientific integrity, disposition of outliers must be well documented.
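
To make the idea concrete, here is a minimal sketch of a forward search for a straight-line regression, in the spirit of Atkinson and Riani. It is written in Python purely for illustration and is not the forsearch implementation or its interface. The function names (fit_line, squared_residuals, forward_search), the exhaustive least-median-of-squares choice of the starting pair, and the made-up data are all assumptions of this sketch. The search starts from a small subset that appears free of outliers, refits the model, and at each step keeps the observations with the smallest squared residuals, one more than before. Inconsistent observations tend to enter last, and the fitted coefficients shift when they do.

```python
from itertools import combinations
from statistics import median


def fit_line(points):
    """Ordinary least squares fit of y = a + b*x; returns (a, b).

    Assumes the points do not all share the same x value.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def squared_residuals(points, a, b):
    """Squared residual of every observation from the line y = a + b*x."""
    return [(y - (a + b * x)) ** 2 for x, y in points]


def forward_search(points):
    """Return one (subset size, subset indices, a, b) tuple per step."""
    n = len(points)

    # Step 1: pick the starting pair whose exact-fit line has the smallest
    # median squared residual over ALL observations (a simple robust start).
    best_score, subset = None, None
    for i, j in combinations(range(n), 2):
        if points[i][0] == points[j][0]:
            continue  # a vertical pair does not define a line
        a, b = fit_line([points[i], points[j]])
        score = median(squared_residuals(points, a, b))
        if best_score is None or score < best_score:
            best_score, subset = score, [i, j]

    # Step 2: refit on the current subset, then rebuild the subset from the
    # observations with the smallest squared residuals, one more each step.
    history = []
    while True:
        a, b = fit_line([points[i] for i in subset])
        history.append((len(subset), sorted(subset), a, b))
        if len(subset) == n:
            return history
        res = squared_residuals(points, a, b)
        ranked = sorted(range(n), key=lambda i: res[i])
        subset = ranked[: len(subset) + 1]


# Invented data: roughly y = 2 + 3x, with observation 7 deliberately corrupted.
points = [(0, 2.1), (1, 4.8), (2, 8.0), (3, 11.2), (4, 13.9),
          (5, 17.1), (6, 19.8), (7, 40.0), (8, 26.1), (9, 28.9)]

history = forward_search(points)
for size, indices, a, b in history:
    print(size, indices, round(a, 3), round(b, 3))
```

In this invented data set, the corrupted observation stays out of the subset until the final step. Comparing the fitted slope just before and just after it enters shows its impact on the analysis, which is the kind of evidence needed to document its disposition.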

I believe that statisticians, especially those who work with studies and data every day, will have a bigger role in the quality control and scientific integrity of studies in the years to come. The forsearch R extension will be a useful tool in this area.


Atkinson, A and M Riani. Robust Diagnostic Regression Analysis, Springer, New York, 2000.

Pinheiro, JC and DM Bates. Mixed-Effects Models in S and S-Plus, Springer, New York, 2000.

Collins, FS and LA Tabak. Policy: NIH plans to enhance reproducibility, Nature, 505, 612–613, 2014.