Notes on "When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment"
Null Hypothesis Significance Testing (NHST) is arguably the most common and widely used traditional statistical technique in neuroscience, psychology, and biomedical research, and one of the most frequently discussed targets in the recent replication crisis. Many people know that NHST and related tools have problems, but that knowledge is often quite shallow. Within my limited reading, this August 2017 article in Frontiers in Human Neuroscience combines depth and breadth unusually well, and it covers a great deal of detail, so I am writing up these notes.
The article appeared in a neuroscience journal, but it does not focus on neuroimaging methods; instead it digs into the foundational problems of NHST, bringing in neuroimaging examples along the way. The authors state their position at the outset: "while it (NHST) may have legitimate uses when there are precise quantitative predictions and/or as a heuristic, it should be abandoned as the cornerstone of research."
Today's statistics textbooks rarely explain the Fisher and Neyman-Pearson approaches clearly, and that is a real problem.
First, Fisher only ever used the concept of H0 and the p-value, where the p-value means pr(data or more extreme data | H0); I explained this in detail in my earlier post introducing Bayesian methods. Fisher treated the p-value as a heuristic piece of inductive evidence, to be weighed together with other evidence (such as effect size) when judging the plausibility of H0. Fisher proposed rejecting H0 when p ≤ 0.05, but gave no mathematical proof for that threshold. On his understanding, "a single significant result should not represent a 'scientific fact' but should merely draw attention to a phenomenon which seems worthy of further investigation including replication." Read that sentence a few times, then ask yourself how many research projects are actually launched and concluded according to that principle, to say nothing of the media hype layered on top.
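To make pr(data or more extreme data | H0) concrete, here is a minimal Python sketch (my own illustration, not from the paper; the helper name is mine) computing a one-sided binomial p-value by hand:

```python
from math import comb

def binom_p_value(n, k, p0=0.5):
    """pr(k or more successes in n trials | H0: success probability = p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# 9 heads in 10 flips of a supposedly fair coin: the p-value sums the
# probability of the observed data AND everything more extreme (9 or 10 heads)
p = binom_p_value(10, 9)
print(p)  # 11/1024 ≈ 0.0107
```

On Fisher's reading, this 0.0107 is a heuristic hint that the coin deserves further investigation, not a verdict that it is biased.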
Then there is the Neyman-Pearson side. H1, α, β, Type I error, and Type II error are all their inventions; together with H0 and the p-value, these components form a decision mechanism whose purpose is to choose α and β so as to balance the risks of Type I and Type II errors. Note that "the sole objective of the decision framework is long-run error minimization and only the critical threshold but not the exact p-value plays any role in achieving this goal."
Both the Fisher and the Neyman-Pearson approach rest on a core of long-run repeated testing. "If we only run a single experiment all we can claim is that if we had run a long series of experiments we would have 100α% false positives (Type I error) had H0 been true and 100β% false negatives (Type II error) had H1 been true provided we got the power calculations right." Pay special attention to the conditional wording: "if we had run", "would have". What tense is that? What does it mean? And yet, how many real studies draw conclusions from their data in a way consistent with that sentence?
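The long-run framing can be made tangible with a small simulation (a sketch of mine, assuming a one-sample z-test with known variance): when H0 is true in every experiment, roughly 100α% of a long series of experiments reject it.

```python
import random
from math import sqrt

random.seed(0)

def one_experiment(n=30):
    """Draw n observations from N(0, 1), so H0 ('mean is 0') is true by construction."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(xs) / n
    z = mean / (1 / sqrt(n))    # known sigma = 1, hence a z-test
    return abs(z) > 1.96        # two-sided rejection at alpha = 0.05

n_sims = 2000
false_positives = sum(one_experiment() for _ in range(n_sims))
print(false_positives / n_sims)  # close to 0.05: the long-run Type I error rate
```

Note that the 5% claim is a statement about this whole hypothetical series, not about any single run in it.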
For the Neyman-Pearson approach, choosing α and β looks simple, but the truth is just the opposite: "the Neyman-Pearson approach requires researchers to specify an effect size associated with H1 and compute power (1-β)", and much of the rigid, assembly-line practice actually contradicts the spirit of the Neyman-Pearson approach: "Researchers set H0 nearly always 'predicting' zero effect but do not quantitatively define H1. Hence, pre-experimental power cannot be calculated for most tests which is a crucial omission in the Neyman-Pearson framework. Researchers compute the exact p-value as Fisher did but also mechanistically reject H0 and accept the undefined H1 if p≤(α=0.05) without flexibility following the behavioral decision rule of Neyman and Pearson."
Keep the phrase "do not quantitatively define H1" in mind; we will need it shortly.
First consider this question: at what level should a study set its effect size in order to detect statistical significance credibly and effectively? The answer is: use your own judgment, because the target population is fundamentally unknown, no unique effect size can be determined, and the power and sample size that follow from it are even less certain. As a result, the Type I and Type II error figures we compute are themselves only probabilities; NHST cannot deliver a simple either/or verdict detached from the concrete context of the research question.
Yes, another key phrase has appeared: the concrete context of the research question. Many people treat NHST as a machine running a fixed program: pour the data in, it outputs results, plug the numbers into a few templates, and presto, those sentences are your analysis conclusions.
This is the superstition of "automatic statistical inference" mentioned in the paper. If things were really that simple, this post would not exist.
When discussing the logical incompleteness of NHST, the paper gives an entertaining example. Suppose H0: A is an American, and H1: A is not an American. The information (data) at hand: A is a member of the US Congress. From the US population and the number of members of Congress we can roughly estimate pr(data|H0) = pr(A is a member of Congress | A is an American) ≈ 10^-7, and since a non-American cannot become a member of Congress, pr(data|H1) = pr(A is a member of Congress | A is not an American) = 0.
If we ignore the concrete context of the research question entirely and simply apply NHST, the p-value is 10^-7, far below 0.05, so we reject H0 and accept H1: our fortunate A is not an American yet has miraculously become a member of Congress. This absurd conclusion is easy to puncture here, but in real, complex research it wears camouflage; mechanically applying NHST can absolutely produce fallacies of exactly this kind.
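The absurdity is easy to verify with Bayes' rule, which NHST never consults. A toy calculation using the article's numbers (the 50:50 prior is my arbitrary assumption):

```python
p_data_given_h0 = 1e-7   # pr(A is in Congress | A is American), from the article
p_data_given_h1 = 0.0    # non-Americans cannot serve in Congress

# With any non-zero prior on H0, the data make H0 certain:
prior_h0 = 0.5
posterior_h0 = (p_data_given_h0 * prior_h0) / (
    p_data_given_h0 * prior_h0 + p_data_given_h1 * (1 - prior_h0))
print(posterior_h0)  # 1.0 -- yet NHST rejects H0 because p << 0.05
```

The p-value conditions only on H0; it never asks how well the data fit H1, which is exactly where the fallacy hides.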
The real point of the example: "A widespread misconception flowing from the fuzzy attitude of NHST to H1 is that rejecting H0 allows for accepting a specific H1. This is what most practicing researchers do in practice when they reject H0 and argue for their specific H1 in turn. However, NHST only computes probabilities conditional on H0 and it does not allow for the acceptance of either H0, a specific H1 or a generic H1. Rather, it only allows for the rejection of H0. Hence, if we reject H0 we will have no idea about how well our data fits a specific H1."
What to stress in that long passage: the only thing NHST can do is reject H0. It cannot be used to accept H0, nor to accept H1 in any form.
An example: suppose we have two groups of participants and want to know whether drug A is effective. One group takes drug A, the other a placebo, and afterwards we compute the two groups' means on the target variable, a and b. Set H0 as "drug A has no effect", i.e., a - b = 0, and H1 as "drug A has an effect", i.e., a - b ≠ 0. There are only two possible outcomes: either p ≤ α, which can be read as rejecting H0, or p > α, in which case the correct interpretation is that the evidence is insufficient and no judgment can be made.
The p ≤ α case deserves elaboration. Rejecting H0 is the only thing NHST does here, and because this is a single experiment, strictly speaking we cannot accept H1. Imagine the hypothetical "long-run repeated testing" scenario: we repeatedly draw many samples from the same target population and compute a p-value each time, ending up with a large collection of p-values. Plotting their distribution reveals something interesting: "The only difference between the true H0 and true H1 situations is that when H0 is true in all experiments, the distribution of p-values is uniform between 0 and 1 whereas when H1 is true in all experiments p-values are more likely to fall on the left of the 0–1 interval, that is, their distribution becomes right skewed. The larger is the effect size and power the stronger is this right skew." This is the original sense in which p-values detect whether H0 or H1 is true.
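The uniform-versus-right-skewed behavior is easy to reproduce (again my own sketch, assuming a simple two-sided z-test with known variance; the effect size d = 0.5 is an arbitrary choice):

```python
import random
from math import sqrt, erf

random.seed(1)

def p_value(effect, n=30):
    """Two-sided z-test p-value for one simulated experiment (sigma = 1 known)."""
    mean = sum(random.gauss(effect, 1) for _ in range(n)) / n
    z = abs(mean) * sqrt(n)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # 2 * (1 - Phi(|z|))

sims = 2000
null_ps = [p_value(0.0) for _ in range(sims)]  # H0 true: p-values are uniform
alt_ps = [p_value(0.5) for _ in range(sims)]   # H1 true: p-values pile up on the left

frac = lambda ps: sum(p < 0.05 for p in ps) / len(ps)
print(frac(null_ps))  # ~0.05, as uniformity implies
print(frac(alt_ps))   # much larger (~0.78, the power at d = 0.5, n = 30)
```

A histogram of `alt_ps` would show the right skew the quote describes; the larger the effect and the power, the stronger the skew.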
Then perhaps you will ask: why do so many papers say something like "based on the NHST results, this study rejects H0 and accepts H1"? An excellent question. The answer is that researchers do not hold their conclusions to the strictest standard; they simply assume that data from one or a few experiments adequately represent the hypothetical long run of repetitions. Because massive replication is rarely feasible in reality, this assumption has some justification: it is a compromise that makes research practice workable. But the compromise is almost never emphasized in university courses outside statistics, and is often treated as self-evident.
Even if we adopt this lax assumption and write "based on the NHST results, this study rejects H0 and accepts H1", what we accept is not a fact but a possibility. In the drug A example, we accept that "drug A is effective" is relatively likely; we do not accept the fact that drug A is effective. Note also that the H1 we defined is the union of everything left after excluding H0, so rejecting "drug A has no effect" corresponds to "drug A has some effect". Yet even after accepting H1, we still have no idea how effective drug A actually is; we cannot say its effect equals a - b, nor pin it to any other particular value. This is the "do not quantitatively define H1" I asked you to remember.
Accepting "drug A is effective" means its effect can take infinitely many values. In other words, there remain infinitely many candidate hypotheses to test and rule out. "In most real world problems multiple alternative hypotheses compete to explain the data. However, by using NHST we can only reject H0 and argue for some H1 without any formal justification of why we prefer a particular hypothesis whereas it can be argued that it only makes sense to reject any hypothesis if another one better fits the data." Remember that NHST can rule out only one H0 at a time. This can be iterated indefinitely, ruling out one H0 this round and another the next, but each round must use different data, or you run into the famous multiple testing problem. You may know some statistical techniques for handling the multiple testing problem, but don't forget that drug A's effect has infinitely many possible values; once the number of hypotheses being compared grows large enough, existing techniques are simply powerless.
The knock-on effect: "Vague H1 definitions (the lack of quantitative predictions) enable researchers to avoid the falsification of their favorite hypotheses by intricately redefining them (especially in fields such as psychology and cognitive neuroscience where theoretical constructs are often vaguely defined) and never providing any definitive assessment of the plausibility of a favorite hypothesis in light of credible alternatives. This problem is reflected in papers aiming at the mere demonstration of often little motivated significant differences between conditions and post-hoc explanations of likely unexpected but statistically significant findings." The authors take fMRI as their example: because large-scale replication is prohibitively expensive, some neuroimaging studies explain unexpected data with stories about some compensatory mechanism, when the data are in fact the joint product of random error from many sources, and even results obtained with α-adjusting multiple testing corrections can be wrong: "empirical analyses of large fMRI data sets found that the most popular fMRI analysis software packages implemented erroneous multiple testing corrections and hence, generate much higher levels of false positive results than expected. This casts doubts on a substantial part of the published fMRI literature. Further, Carp reported that about 40% of 241 relatively recent fMRI papers actually did not report having used multiple testing correction. So, a very high percentage of fMRI literature may have been exposed to high false positive rates whether multiple testing correction was used or not."
Perhaps you think being able to reject H0 is enough; at the very least it is a perfect falsification tool, falsifying H0. I am afraid you will be disappointed. When researchers "do not quantitatively define H1", "groups are actually likely to differ and if sample size increases and variability in data decreases it will become easier and easier to reject any kind of H0 when following the NHST approach. In fact, with precise enough measurements, large enough sample size and repeated 'falsification' attempts H0 is guaranteed to be rejected on the long run even if the underlying processes generating the data in two experimental conditions are exactly the same. Hence, ultimately any H1 can be accepted, claiming support for any kind of theory." This can happen in a single experiment and across many experiments, especially under p-hacking tactics such as reporting only the significant portion of the data. Only long-run, large-scale replication helps reduce this kind of error.
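The "guaranteed rejection" point follows directly from the standard power approximation for a two-sided z-test; a small calculation (mine, not the paper's) shows how even a trivially small effect becomes "significant" once n is large enough:

```python
from math import sqrt, erf

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(d, n, z_crit=1.96):
    """Approximate power of a two-sided one-sample z-test for standardized
    effect size d (the negligible opposite tail is ignored)."""
    return phi(d * sqrt(n) - z_crit)

tiny = 0.01  # a trivially small standardized effect
for n in (100, 10_000, 1_000_000):
    print(n, round(power(tiny, n), 3))  # rejection probability climbs toward 1
```

In practice no two conditions differ by exactly zero, so with enough data any point-null H0 falls, and any pet H1 can then claim "support".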
Perhaps you recall the concept of a confidence interval and think it solves the problem, but I am afraid you will be disappointed again. First, confidence intervals are not part of the NHST machinery, and many studies never compute or report them, which is bad. Second, confidence intervals are no easier to understand than p-values: does a 95% confidence interval mean the probability that the true parameter lies in the interval is 95%? Go read carefully how a statistics textbook actually defines it. Third, the confidence interval and p-value from a single experiment give only an extremely vague direction; their value shows only against a backdrop of many repeated experiments. In reality, many people interpret and publicize the results of a single experiment, or very few replications, as if they were a long run of replications, and NHST is routinely part of that process.
Perhaps you still hope meta-analysis will save the day, since pooling and summarizing large amounts of data across studies sounds more reliable. This time you may not be too disappointed, provided most of the material you analyze is itself sound. Garbage in, garbage out: in a research culture that chases significance, meta-analyses run an even greater risk of exaggerating significance, and a meta-analysis built on NHST-style inputs cannot provide the magnitude of the target effect; it still makes only a binary either/or decision.
Like many techniques, NHST itself is not at fault; its users have drifted from the original intent of the Fisher and Neyman-Pearson approaches and distorted it.
Another key passage: "(1) NHST does not deliver final objective theoretical decisions, there is no theoretical justification for any α thresholds marking a boundary of informal surprise and NHST merely aims to minimize Type I error on the long run (and in fact, Neyman and Pearson (1933) considered their procedure a theory-free decision mechanism and Fisher considered it a heuristic). (2) NHST can only reject H0 (heuristically or in a theory-free manner) and (3) cannot provide support for any H1."
Recall how these sentences echo the opening points about the Fisher and Neyman-Pearson approaches: Fisher treated the p-value as weak, heuristic evidence, while the Neyman-Pearson approach used NHST to minimize Type I error in theory-free decisions. NHST can only reject H0, which is not the same as providing evidence for any H1, and trustworthy conclusions can only be built on a long run of many repeated experiments.
Theory-free decisions, yes: "supporting a specific alternative theory is just not possible in the NHST framework". fMRI research takes another hit: "with a bit of creativity fMRI 'activation' in many different (perhaps post-hoc defined) ROIs can easily be 'explained' by some theory when H0 ('no activation') is rejected in any of the ROIs."
The paper also raises a point textbooks almost never teach: the usefulness of the H0:H1 odds. The authors' example: "we may know about a single published study claiming to demonstrate H1 by showing a difference between appropriate experimental conditions. However, in conferences we may have also heard about 9 highly powered but failed replication attempts very similar to the original study. In this case we may assume that the odds of H0:H1 are 9:1, that is, pr(H1) is 1/10." This quantity is useless to NHST, because NHST almost entirely ignores prior professional knowledge in the field, and estimating it is extremely difficult: only a comprehensive, deep understanding of the field allows a well-grounded argument. But the hard work pays off: the H0:H1 odds can be used to compute the False Report Probability and the True Report Probability, two other critically important quantities. Moreover, "it is reasonable to assume that only the most risk avoidant studies have lower H0:H1 odds than 1, relatively conservative studies have low to moderate H0:H1 odds (1–10) while H0:H1 odds can be much higher in explorative research (50–100 or even higher)." As an illustration: in the exploratory stage of a drug trial, should the analysis strategy be aggressive or conservative? The answer is simple. At this stage the harm of a false positive outweighs that of a false negative, so the conservative strategy is the right choice, and the H0:H1 odds are a good indicator.
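The False Report Probability follows from the H0:H1 odds by Bayes' rule: FRP = pr(H0 true | significant result) = α·odds / (α·odds + power). A quick check with the article's 9:1 scenario (the α = 0.05 and power = 0.8 values are my assumptions, not given in the quoted text):

```python
def false_report_probability(odds_h0_h1, alpha=0.05, power=0.8):
    """FRP = pr(H0 true | p <= alpha), given prior odds H0:H1.
    alpha and power defaults are illustrative assumptions."""
    return (alpha * odds_h0_h1) / (alpha * odds_h0_h1 + power)

# The article's scenario: 9 failed replications vs 1 success, odds 9:1
print(false_report_probability(9))              # 0.36: over a third of 'discoveries' would be false
# Exploratory research with odds 100:1
print(round(false_report_probability(100), 2))  # 0.86: most 'discoveries' would be false
```

Even a well-powered test at α = 0.05 produces mostly false reports when H0 is a priori much more likely than H1, which is exactly why the odds matter for choosing a conservative strategy.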
Time for some conclusions:
NHST is not suited to big data. If you follow applied statistics, you probably know that psychology, education, neuroscience, and related fields have long been urged to increase sample sizes to safeguard statistical power, because far too many studies have samples that are simply too small. But anything can be overdone: if the sample is too large, then not only the target variable's effect but all kinds of residual noise and bias get amplified into significance, and the final conclusion is still a false positive.
NHST is not suited to causal inference and cannot be used as evidence of causation.
NHST is not suited to fields like neuroscience where different analysis decisions produce different results: "This is particularly a problem in neuroimaging where the complexity and idiosyncrasy of analyses is such that it is usually impossible to replicate exactly what happened and why during data analysis."
NHST is not suited to research settings that require magnitude information, and it is therefore unsuited to the systematic accumulation and synthesis of knowledge.
"NHST Is Unsuitable as the Cornerstone of Scientific Inquiry in Most Fields".
Whenever and wherever you want to use NHST, you should first justify why you are using it.
If no other statistical tool is available, then NHST is a natural choice; if others are available, their strengths and weaknesses must be compared carefully: "A more reasoned approach may be to consider explicitly what the consequences ('costs') are of a false-positive, true-positive, false-negative, and true-negative result. Explicit modeling can suggest that the optimal combination of Type 1 error and power may need to be different depending on what these assumed costs are. Different fields may need to operate at different optimal ratios of false-positives to false-negatives."
If researchers can honor the original intent of the Neyman-Pearson approach, where "very precise quantitative theoretical predictions can be tested, hence, both power and effect size can be estimated well", NHST is a reasonable choice.
If researchers, like Fisher, use it merely as a heuristic look at the data, NHST is a reasonable choice.
In short, "NHST can only reject H0 and can accept neither a generic or specific H1. So, on its own NHST cannot provide evidence 'for' something even if findings are replicated."
The authors close with a list of suggestions for improving research quality:
If the theoretical grounding is weak, dig into the raw data: estimate effect sizes carefully and weigh the uncertainty in the data, rather than using NHST for a bare either/or verdict;
Pre-register study plans;
Publish all relevant analysis materials together;
Publish the raw data;
Report all analysis results and datasets in a study, significant or not;
Safeguard statistical power, and publish the details of pre-experimental power calculations;
Understand the statistical methods you use in depth; do not follow blindly, and think them through for yourself;
Learn more statistical tools and techniques, including the bootstrap and likelihood-based methods, as well as the Bayesian methods the authors especially recommend.
If anyone can truly do all of the above, I can only bow in admiration.