快速注册

《Causal Inference: What If》有关倾向值部分的笔记整理

井梧 2021-04-24 23:48:26

本学期学习因果推断，终于学到倾向值了，单独整理了一下教材的这一部分（其中代码部分就是把作者的截出来照搬的，正文部分挑了一些要点用中文组织了一下），存个稿。教材用的是 Miguel A. Hernán, James M. Robins (HR). Forthcoming. Causal Inference: What If. 书挺不错的，虽然没出版，但作者已经在网上放出了全书电子版，还附上了书中用到的数据和代码（作者真好），链接：https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

第15章结果回归和倾向值

结果回归和倾向值分析是最常用的做因果推断的参数方法，但它们只能在设定简单的时候有效，不能处理有关时异变量的因果推断，对于纵贯数据能做的有限

15.1结果回归

l 过去三章介绍了IP加权、标准化、G估计的方法来推断戒烟（A）对体重增加（Y）的平均因果效应，也介绍了推断总体中的子集的平均因果效应的方法

l 在第十四章中介绍了结构嵌套模型，结构嵌套模型中包含了A和L乘积项的参数，但L本身是没有参数的，因为关注的是在某个L的水平上A对Y的影响，而不关心L和Y的关系

l 如果我们在结构模型中考虑L和Y的关系，我们就可以写出以下式子：

（1）

其中，参数β3指的是L带来的主要的效应，但它不能被理解为因果效应

l 在每一层L的取值（也就是每一个l）中，如果没有数据的删失，并且满足可交换性、正值性、对干预的定义清晰的条件，反事实平均结果和相关平均结果是相等的：

l 因此，（1）式的结构模型就可以通过最小二乘法拟合成结果回归模型：

l 正如第3章中介绍的分层法，结果回归通过估计每一层L中处理变量带来的平均因果效应，校正了混淆偏误

l 对离散型的结果作结果回归仍然是适用的，比如对二分类的结果可以拟合一个逻辑斯蒂模型

细节点15.1冗余参数

l 指的是并非我们首要感兴趣的参数，但对于冗余参数建模的正确性会影响到我们最感兴趣的、用来做因果推断的参数估计值，例如在结果回归模型中，协变量L被用线性形式建模了，但实际上L并不能单纯用线性形式来预设，这样就会给模型带来设定的错误

15.1代码（R语言）

# Estimating the average causal effect within levels of confounders

#install.packages("readxl") # install package if required

library("readxl")

nhefs <- read_excel("F:/Homework/Textbooks/Hernan - Causal Inferences/nhefs.xls")

nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)

# regression on covariates, allowing for some effect modification

fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)

+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs

+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)

+ wt71 + I(wt71*wt71) + I(qsmk*smokeintensity), data=nhefs)

summary(fit)

# (step 1) build the contrast matrix with all zeros

# this function builds the blank matrix

# install.packages("multcomp") # install packages if necessary

library("multcomp")

makeContrastMatrix <- function(model, nrow, names) {

m <- matrix(0, nrow = nrow, ncol = length(coef(model)))

colnames(m) <- names(coef(model))

rownames(m) <- names

return(m)

}

K1 <- makeContrastMatrix(fit, 2, c('Effect of Quitting Smoking at Smokeintensity of 5',

'Effect of Quitting Smoking at Smokeintensity of 40'))

# (step 2) fill in the relevant non-zero elements

K1[1:2, 'qsmk'] <- 1

K1[1:2, 'I(qsmk * smokeintensity)'] <- c(5, 40)

# (step 3) check the contrast matrix

# (step 4) estimate the contrasts, get tests and confidence intervals for them

estimates1 <- glht(fit, K1)

summary(estimates1)

confint(estimates1)

# regression on covariates, not allowing for effect modification

fit2 <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)

+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs

+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)

+ wt71 + I(wt71*wt71), data=nhefs)

summary(fit2)

15.2倾向值

l Π（L）：在给定协变量L的情况下，个体接受干预的可能性

l 倾向值：在给定协变量L相关信息之下，个体接受干预的概率，即Π（L）

（广义倾向值：干预变量具有多种干预水平，进入某一种干预水平的概率）

l 在观测型研究中，不同个体接受干预的可能性不同（因为有混淆变量的存在），而此时，干预分配是不受调查者控制的，因此倾向值是未知的，需要从数据中估计得来

l 如果接受干预和没接受干预的群体中倾向值分布是一致的，则说明此时L并没有带来混淆，并不存在L指向A的通路，即倾向值平衡了协变量L，即通过控制倾向值，满足了可交换性假设

l 使用倾向值的方法同样需要满足可交换性、正值性、一致性假设

技术点15.1：平衡值和预后值

l 倾向值是一种最简单的平衡值，也就是在平衡值的每一种取值之下，协变量L在干预组和未干预组中的分布是一样的（倾向值Π（L）是L的代理变量，避免维度诅咒）

l 如果在变量L的条件下能证明可交换性和正值性，那么在平衡值b（L）条件下也能证明可交换性和正值性

l 如果能够充分地调整L就也能够充分地调整平衡值b（L）

l Π（L）可以被是为L和A之间的中间节点

l 平衡值的一个替代是预后值，预后值需要更强的假设以及不能轻易推广到时异变量

15.2代码（R语言）

# Estimating and plotting the propensity score

fit3 <- glm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)

+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs

+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)

+ wt71 + I(wt71*wt71), data=nhefs, family=binomial())

summary(fit3)

nhefs$ps <- predict(fit3, nhefs, type="response")

summary(nhefs$ps[nhefs$qsmk==0])

summary(nhefs$ps[nhefs$qsmk==1])

# # plotting the estimated propensity score

# install.packages("ggplot2") # install packages if necessary

# install.packages("dplyr")

library("ggplot2")

library("dplyr")

ggplot(nhefs, aes(x = ps, fill = qsmk)) + geom_density(alpha = 0.2) +

xlab('Probability of Quitting Smoking During Follow-up') +

ggtitle('Propensity Score Distribution by Treatment Group') +

scale_fill_discrete('') +

theme(legend.position = 'bottom', legend.direction = 'vertical')

# alternative plot with histograms

nhefs <- nhefs %>% mutate(qsmklabel = ifelse(qsmk == 1,

yes = 'Quit Smoking 1971-1982',

no = 'Did Not Quit Smoking 1971-1982'))

ggplot(nhefs, aes(x = ps, fill = as.factor(qsmk), color = as.factor(qsmk))) +

geom_histogram(alpha = 0.3, position = 'identity', bins=15) +

facet_grid(as.factor(qsmk) ~ .) +

xlab('Probability of Quitting Smoking During Follow-up') +

ggtitle('Propensity Score Distribution by Treatment Group') +

scale_fill_discrete('') +

scale_color_discrete('') +

theme(legend.position = 'bottom', legend.direction = 'vertical')

# attempt to reproduce plot from the book

nhefs %>%

mutate(ps.grp = round(ps/0.05) * 0.05) %>%

group_by(qsmk, ps.grp) %>%

summarize(n = n()) %>%

ungroup() %>%

mutate(n2 = ifelse(qsmk == 0, yes = n, no = -1*n)) %>%

ggplot(aes(x = ps.grp, y = n2, fill = as.factor(qsmk))) +

geom_bar(stat = 'identity', position = 'identity') +

geom_text(aes(label = n, x = ps.grp, y = n2 + ifelse(qsmk == 0, 8, -8))) +

xlab('Probability of Quitting Smoking During Follow-up') +

ylab('N') +

ggtitle('Propensity Score Distribution by Treatment Group') +

scale_fill_discrete('') +

scale_x_continuous(breaks = seq(0, 1, 0.05)) +

theme(legend.position = 'bottom', legend.direction = 'vertical',

axis.ticks.y = element_blank(),

axis.text.y = element_blank())

15.3倾向值分层和标准化

l 倾向值是大小处在0到1之间的连续变量，个体的倾向值几乎无法完全相同，因此难以在某个特定的倾向值之下比较干预组和未干预组，一个解决方法是创造出不同的层，每层中的个体倾向值相近

l 大多按照十分位来做分层处理，分成十层

l 我们可以使用结果回归的方法估计每一层中的平均因果效应，大多数使用这种方法时假设倾向值没有效应修饰

l 一个可能会影响效度的问题是，模型的线性设定是否是可靠的，因为用了Π（L）一维去综合原本多维的L，Π（L）和结果Y之间的关系有可能是高次的

l 如果要计算总体中的平均因果效应，我们需要按照倾向值的分布对条件均值进行标准化（具体过程如第13章中所说，只要把L替换成Π（L）即可）

15.3代码（R语言）

# Stratification on the propensity score

# calculation of deciles

nhefs$ps.dec <- cut(nhefs$ps,

breaks=c(quantile(nhefs$ps, probs=seq(0,1,0.1))),

labels=seq(1:10),

include.lowest=TRUE)

#install.packages("psych") # install package if required

library("psych")

describeBy(nhefs$ps, list(nhefs$ps.dec, nhefs$qsmk))

# function to create deciles easily

decile <- function(x) {

return(factor(quantcut(x, seq(0, 1, 0.1), labels = FALSE)))

}

# regression on PS deciles, allowing for effect modification

for (deciles in c(1:10)) {

print(t.test(wt82_71~qsmk, data=nhefs[which(nhefs$ps.dec==deciles),]))

}

# regression on PS deciles, not allowing for effect modification

fit.psdec <- glm(wt82_71 ~ qsmk + as.factor(ps.dec), data = nhefs)

summary(fit.psdec)

confint.lm(fit.psdec)

15.4倾向匹配

l 第4章介绍过匹配相关的流程，倾向匹配有许多种形式，目标是达成干预组和未干预组的可交换性，例如一个受干预的个体匹配一个或多个倾向值相近（±0.05）的未受干预的个体（何为相近是需要定义的，如果定义过于宽松则可能出现匹配后不满足可交换性的问题；如果定义太紧则会排除掉过多的个体）

l 倾向匹配将未成功匹配的受干预个体（倾向值过大或过小）排除在外，匹配不会考虑极端值的出现是随机的还是有结构非正值，这就是与拟合回归模型的不同，拟合模型要包含所有的个体就有可能受到极端值的影响

l 倾向值匹配留下的都是匹配成功的个体，这会使得效应估计是匹配群体中的效应；但是匹配成功的个体有什么特征我们往往不清楚，这就会影响效应估计的可推广性；不过在现实世界中，对排除的解释往往是群体的某些特征而非估计出的倾向值大小

15.4（R代码）

# Standardization using the propensity score

#install.packages("boot") # install package if required

library("boot")

# standardization by propensity score, agnostic regarding effect modification

std.ps <- function(data, indices) {

d <- data[indices,] # 1st copy: equal to original one`

# calculating propensity scores

ps.fit <- glm(qsmk ~ sex + race + age + I(age*age)

+ as.factor(education) + smokeintensity

+ I(smokeintensity*smokeintensity) + smokeyrs

+ I(smokeyrs*smokeyrs) + as.factor(exercise)

+ as.factor(active) + wt71 + I(wt71*wt71),

data=d, family=binomial())

d$pscore <- predict(ps.fit, d, type="response")

# create a dataset with 3 copies of each subject

d$interv <- -1 # 1st copy: equal to original one`

d0 <- d # 2nd copy: treatment set to 0, outcome to missing

d0$interv <- 0

d0$qsmk <- 0

d0$wt82_71 <- NA

d1 <- d # 3rd copy: treatment set to 1, outcome to missing

d1$interv <- 1

d1$qsmk <- 1

d1$wt82_71 <- NA

d.onesample <- rbind(d, d0, d1) # combining datasets

std.fit <- glm(wt82_71 ~ qsmk + pscore + I(qsmk*pscore), data=d.onesample)

d.onesample$predicted_meanY <- predict(std.fit, d.onesample)

# estimate mean outcome in each of the groups interv=-1, interv=0, and interv=1

return(c(mean(d.onesample$predicted_meanY[d.onesample$interv==-1]),

mean(d.onesample$predicted_meanY[d.onesample$interv==0]),

mean(d.onesample$predicted_meanY[d.onesample$interv==1]),

mean(d.onesample$predicted_meanY[d.onesample$interv==1])-

mean(d.onesample$predicted_meanY[d.onesample$interv==0])))

}

# bootstrap

results <- boot(data=nhefs, statistic=std.ps, R=5)

# generating confidence intervals

se <- c(sd(results$t[,1]), sd(results$t[,2]),

sd(results$t[,3]), sd(results$t[,4]))

mean <- results$t0

ll <- mean - qnorm(0.975)*se

ul <- mean + qnorm(0.975)*se

bootstrap <- data.frame(cbind(c("Observed", "No Treatment", "Treatment",

"Treatment - No Treatment"), mean, se, ll, ul))

bootstrap

15.5倾向值模型、结构模型、预测模型

l 倾向值模型中的参数是不能做出因果性解释的冗余参数，但倾向值模型对于做因果推断还是有用的，通常可以作为结构模型中参数估计的基础

l 结构模型中又包含边际结构模型和结构嵌套模型，边际结构模型中包含处理变量的参数、效应修饰变量（变量V）的参数、处理变量和效应修饰变量的乘积的参数，当然也有的可能不包含变量V；结构嵌套模型包含了处理变量的参数、处理变量和所有效应修饰变量（变量L）的乘积的参数

l 结果回归不仅可以用来做因果推断，也非常普遍地被用在纯粹的预测上，比如预估什么样的顾客更有可能做出购买行为，预测模型中的参数都不能做出因果性解释

l 结果回归的双重效用会带来一些误解，一个很重要的是和变量选择程序有关，如果是为了结果预测，调查者想要选择能够提高预测力的任何变量，而这些变量选择的程序也会应用到因果推断的模型中去，将这些预测算法放到因果推断的模型中甚至会带来虚增方差

l 倾向值模型并不需要很好地预测干预，而是只要保证可交换性

l 除了虚增方差之外还会带来自伤型偏误，例如把对撞点做为协变量会带来系统性偏误，这在之后的第18章会再做介绍

细节点15.2效应修饰和倾向值

l 倾向值匹配通常会使得干预组的个体得到匹配，而非干预组的个体没得到匹配并且从分析中被排除了出去，因此在匹配后群体中的效应修饰会更接近于干预组中的效应

l 效应修饰往往是决策者们知道自己在做什么的证明，例如医生倾向于治疗想从治疗中获益的病人，但是效应修饰也会使得对于估计的理解复杂化

l 除了效应修饰之外，还有一些原因使得匹配后的估计和总体效应估计之间有差异：未匹配的部分违反了正值性假设、匹配部分中未测量的混淆因素

回应转发赞收藏

加载中...

私密豆列或部分疑似不友善内容不展示

<前页后页>

井梧

边陲的反攻 “当这盏灯转红便会别离，凭运气决定我生死”

《Causal Inference: What If》有关倾向值部分的笔记整理

热门话题 · · · · · · ( 去话题广场 )