本文首发于公众号:医学和生信笔记,完美观看体验请至公众号查看本文。

医学和生信笔记,专注R语言在临床医学中的使用,R语言数据分析和可视化。

文章目录

二项logistic回归

R语言中的

factor()函数可以把变量变为因子类型,默认是没有等级之分的(可以理解为无序分类变量nominal)!当然也可以通过添加参数

ordered=T变成有序因子(等级资料,有序分类ordinal)。

二项logistic回归

因变量是二分类变量时,可以使用二项逻辑回归(binomial logistic regression),自变量可以是数值变量、无序多分类变量、有序多分类变量。

使用课本例16-2的数据,直接读取。

为了探讨冠心病发生的危险因素,对26例冠心病患者和28例对照者进行病例-对照研究,试用逻辑回归筛选危险因素。

df16_2 <- foreign::read.spss("例16-02.sav",

to.data.frame = T,

use.value.labels = F,

reencode = "utf-8")

## re-encoding from utf-8

str(df16_2)

## 'data.frame': 54 obs. of 11 variables:

## $ .... : num 1 2 3 4 5 6 7 8 9 10 ...

## $ x1 : num 3 2 2 2 3 3 2 3 2 1 ...

## $ x2 : num 1 0 1 0 0 0 0 0 0 0 ...

## $ x3 : num 0 1 0 0 0 1 1 1 0 0 ...

## $ x4 : num 1 1 1 1 1 1 0 1 0 1 ...

## $ x5 : num 0 0 0 0 0 0 0 1 0 0 ...

## $ x6 : num 0 0 0 0 1 0 0 0 0 0 ...

## $ x7 : num 1 1 1 1 1 2 1 1 1 1 ...

## $ x8 : num 1 0 0 0 1 1 0 0 1 0 ...

## $ y : num 0 0 0 0 0 0 0 0 0 0 ...

## $ PGR_1: num 1 0 0 0 1 1 0 0 0 0 ...

## - attr(*, "variable.labels")= Named chr [1:11] "" "" "" "" ...

## ..- attr(*, "names")= chr [1:11] "...." "x1" "x2" "x3" ...

数据一共11列,第1列是编号,第2-9列是自变量,第10列是因变量。

具体说明:

x1:年龄,小于45岁是1,45-55是2,55-65是3,65以上是4;x2:高血压病史,1代表有,0代表无;x3:高血压家族史,1代表有,0代表无;x4:吸烟,1代表吸烟,0代表不吸烟;x5:高血脂病史,1代表有,0代表无;x6:动物脂肪摄入,0表示低,1表示高x7:BMI,小于24是1,24-26是2,大于26是3;x8:A型性格,1代表是,0代表否;y:是否是冠心病,1代表是,0代表否

这里的x1~y虽然是数值型,但并不是真的代表数字大小,只是为了方便标识,进行了转换,因此在进行logistic回归之前,我们要把数值型变量变成无序分类或有序分类变量,在R语言中可以通过factor()函数变成因子型实现。

# 变成因子型

df16_2[,c(2:10)] <- lapply(df16_2[,c(2:10)], factor)

str(df16_2)

## 'data.frame': 54 obs. of 11 variables:

## $ .... : num 1 2 3 4 5 6 7 8 9 10 ...

## $ x1 : Factor w/ 4 levels "1","2","3","4": 3 2 2 2 3 3 2 3 2 1 ...

## $ x2 : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 1 1 ...

## $ x3 : Factor w/ 2 levels "0","1": 1 2 1 1 1 2 2 2 1 1 ...

## $ x4 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 1 2 ...

## $ x5 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...

## $ x6 : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...

## $ x7 : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 1 1 1 ...

## $ x8 : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 1 1 2 1 ...

## $ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

## $ PGR_1: num 1 0 0 0 1 1 0 0 0 0 ...

## - attr(*, "variable.labels")= Named chr [1:11] "" "" "" "" ...

## ..- attr(*, "names")= chr [1:11] "...." "x1" "x2" "x3" ...

需要注意的是自变量x1和x7,这两个应该是有序分类变量,这种自变量在进行逻辑回归时,可以进行哑变量设置,即给定一个参考,让其他所有组都和参考相比,比如这里,我们把x1变成因子型后,R语言在进行logistic回归时,会默认选择第一个为参考。

接下来进行二项逻辑回归,在R语言中,默认是以因子的第一个为参考的!而SPSS是以大的那个为参考。

f <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = df16_2, family = binomial())

summary(f)

##

## Call:

## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, family = binomial(),

## data = df16_2)

##

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.1727 -0.4719 -0.1409 0.5315 2.5914

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -5.46026 2.07370 -2.633 0.00846 **

## x12 0.85285 1.54399 0.552 0.58070

## x13 0.47754 1.59320 0.300 0.76438

## x14 3.44227 2.10985 1.632 0.10278

## x21 1.14905 0.93176 1.233 0.21750

## x31 1.66039 1.16857 1.421 0.15535

## x41 0.85994 1.32437 0.649 0.51613

## x51 0.73600 0.97088 0.758 0.44840

## x61 3.92067 1.57004 2.497 0.01252 *

## x72 -0.03467 1.13363 -0.031 0.97560

## x73 -0.38230 1.61710 -0.236 0.81311

## x81 2.46322 1.10484 2.229 0.02578 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 74.786 on 53 degrees of freedom

## Residual deviance: 40.028 on 42 degrees of freedom

## AIC: 64.028

##

## Number of Fisher Scoring iterations: 6

结果详解:

Deviance Residuals:表示偏差残差统计量。在理想情况下应服从正态分布,均值应为0。

在此例中,中位数的符号为负(-0.01406),表明整体向左偏移,中位数的大小表明偏移的程度。第一个四分位数(1Q)和第三个四分位数(3Q)为两侧“尾巴”分布的幅度。这里3Q大于1Q(绝对值),表明这个曲线是向右倾斜的。最大和最小残差可用来检验数据中的离群值。

结果中Estimate是回归系数和截距,Std. Error表示回归系数的标准误,z value是统计量值(z的平方就是Wald值),Pr(>|z|)是P值。

β值(这里就是Estimate)是指回归系数和截距(常数项),可以是负数(负相关时回归系数出现负值);

OR是比值比(odds ratio),其取值范围是0至正无穷,不可能是负数;

Wald是一个卡方值,等于β除以它的标准误(这里是Std. Error),然后取平方(也就是z值的平方),因此也不可能是负数。Wald用于对β值进行检验,考察β值是否等于0。若β值等于0,其对应的OR值,也就是Exp(β)为1,表明两组没有显著差异。OR等于β值的反自然对数。Wald值越大,β值越不可能等于0。

结果中出现了x12/x13/x14这种,这是因为R语言在做回归时,如果设置了哑变量,默认是以第一个为参考的,其余都是和第一个进行比较(SPSS中是以大的那个为参考),这也是R中自动进行哑变量编码的方式。

Null deviance:无效偏差(零偏差),Residual deviance:残差偏差,无效偏差和残差偏差之间的差异越大越好,用来评价模型与实际数据的吻合情况。

AIC:赤池信息准则,表示模型拟合程度的好坏,AIC越低表示模型拟合越好。

最后还有一个Fisher Scoring的迭代次数。

我们可以通过函数的方式分别获取模型信息。

# β值

coefficients(f)

## (Intercept) x12 x13 x14 x21 x31

## -5.46025547 0.85285212 0.47754497 3.44227124 1.14905003 1.66039360

## x41 x51 x61 x72 x73 x81

## 0.85994185 0.73600239 3.92067487 -0.03467497 -0.38230011 2.46321944

# β值

coef(f)

## (Intercept) x12 x13 x14 x21 x31

## -5.46025547 0.85285212 0.47754497 3.44227124 1.14905003 1.66039360

## x41 x51 x61 x72 x73 x81

## 0.85994185 0.73600239 3.92067487 -0.03467497 -0.38230011 2.46321944

# β值的95%可信区间

confint(f)

## Waiting for profiling to be done...

## 2.5 % 97.5 %

## (Intercept) -10.3696980 -1.983104

## x12 -2.0317236 4.405067

## x13 -2.5429244 4.085370

## x14 -0.2319302 8.343123

## x21 -0.6458070 3.099838

## x31 -0.5431686 4.205175

## x41 -1.6713365 3.801261

## x51 -1.1846658 2.725051

## x61 1.3290533 7.677657

## x72 -2.3618580 2.224863

## x73 -3.8303437 2.725470

## x81 0.5105394 4.997352

# OR值

exp(coef(f))

## (Intercept) x12 x13 x14 x21 x31

## 0.004252469 2.346329320 1.612111759 31.257871683 3.155194147 5.261381340

## x41 x51 x61 x72 x73 x81

## 2.363023282 2.087573511 50.434470096 0.965919321 0.682290259 11.742555242

# OR值的95%的可信区间

exp(confint(f))

## Waiting for profiling to be done...

## 2.5 % 97.5 %

## (Intercept) 3.136876e-05 0.1376413

## x12 1.311093e-01 81.8646261

## x13 7.863610e-02 59.4639513

## x14 7.930015e-01 4201.1887167

## x21 5.242393e-01 22.1943589

## x31 5.809047e-01 67.0323349

## x41 1.879956e-01 44.7576059

## x51 3.058484e-01 15.2571993

## x61 3.777465e+00 2159.5535363

## x72 9.424495e-02 9.2522177

## x73 2.170216e-02 15.2635868

## x81 1.666190e+00 148.0206875

这里x21的OR值是2.346329320,代表,45~55岁的人群患冠心病的风险是小于45岁人群的2.346329320倍,但是这个结果并没有统计学意义!

# Wald值

summary(f)$coefficients[,3]^2

## (Intercept) x12 x13 x14 x21 x31

## 6.933188870 0.305111544 0.089843733 2.661883233 1.520790277 2.018903576

## x41 x51 x61 x72 x73 x81

## 0.421615676 0.574682148 6.235929079 0.000935592 0.055890396 4.970577395

# P值

summary(f)$coefficients[,4]

## (Intercept) x12 x13 x14 x21 x31

## 0.00846107 0.58069555 0.76437591 0.10277898 0.21749994 0.15535128

## x41 x51 x61 x72 x73 x81

## 0.51613195 0.44840433 0.01251839 0.97559855 0.81311338 0.02578204

# 预测概率

fitted(f) # 或者 predict(f,type = "response")

## 1 2 3 4 5 6

## 0.375076515 0.110360122 0.069240725 0.023034427 0.905605901 0.491543165

## 7 8 9 10 11 12

## 0.049878030 0.151052146 0.104875967 0.009948712 0.208062753 0.046013662

## 13 14 15 16 17 18

## 0.009879122 0.497751927 0.500211100 0.074509703 0.023034427 0.105543397

## 19 20 21 22 23 24

## 0.359548891 0.441102099 0.048627400 0.734770361 0.366272916 0.009879122

## 25 26 27 28 29 30

## 0.049878030 0.366272916 0.009879122 0.101665098 0.995553588 0.950848767

## 31 32 33 34 35 36

## 0.712839656 0.995611072 0.216828996 0.984826081 0.543195397 0.905612594

## 37 38 39 40 41 42

## 0.868286980 0.993760333 0.868286980 0.034813473 0.902606657 0.966930037

## 43 44 45 46 47 48

## 0.375076515 0.964725296 0.840087511 0.818110300 0.881331876 0.676305952

## 49 50 51 52 53 54

## 0.780828686 0.555921773 0.986103872 0.816157300 0.466253375 0.655579178

# 偏差

deviance(f)

## [1] 40.02758

# 残差自由度

df.residual(f)

## [1] 42

# 伪R^2

DescTools::PseudoR2(f, which = c("McFadden", "CoxSnell", "Nagelkerke"))

## McFadden CoxSnell Nagelkerke

## 0.4647704 0.4746397 0.6331426

模型整体的假设检验:

# 先构建一个只有截距的模型

f0 <- glm(y ~ 1, data = df16_2, family = binomial())

anova(f0,f,test="Chisq")

## Analysis of Deviance Table

##

## Model 1: y ~ 1

## Model 2: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

## Resid. Df Resid. Dev Df Deviance Pr(>Chi)

## 1 53 74.786

## 2 42 40.028 11 34.758 0.0002716 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P<0.001,说明我们的模型是有意义的。

逐步回归法的logistic回归,可以使用step()函数:

# 向前

f1 <- step(f, direction = "forward")

## Start: AIC=64.03

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

summary(f1)

##

## Call:

## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, family = binomial(),

## data = df16_2)

##

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.1727 -0.4719 -0.1409 0.5315 2.5914

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -5.46026 2.07370 -2.633 0.00846 **

## x12 0.85285 1.54399 0.552 0.58070

## x13 0.47754 1.59320 0.300 0.76438

## x14 3.44227 2.10985 1.632 0.10278

## x21 1.14905 0.93176 1.233 0.21750

## x31 1.66039 1.16857 1.421 0.15535

## x41 0.85994 1.32437 0.649 0.51613

## x51 0.73600 0.97088 0.758 0.44840

## x61 3.92067 1.57004 2.497 0.01252 *

## x72 -0.03467 1.13363 -0.031 0.97560

## x73 -0.38230 1.61710 -0.236 0.81311

## x81 2.46322 1.10484 2.229 0.02578 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 74.786 on 53 degrees of freedom

## Residual deviance: 40.028 on 42 degrees of freedom

## AIC: 64.028

##

## Number of Fisher Scoring iterations: 6

# 向后

f2 <- step(f, direction = "backward")

## Start: AIC=64.03

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

##

## Df Deviance AIC

## - x7 2 40.086 60.086

## - x1 3 43.933 61.933

## - x4 1 40.466 62.466

## - x5 1 40.605 62.605

## - x2 1 41.600 63.600

## 40.028 64.028

## - x3 1 42.196 64.196

## - x8 1 46.365 68.365

## - x6 1 50.469 72.469

##

## Step: AIC=60.09

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x8

##

## Df Deviance AIC

## - x1 3 44.541 58.541

## - x5 1 40.611 58.611

## - x4 1 40.814 58.814

## - x2 1 41.616 59.616

## 40.086 60.086

## - x3 1 42.747 60.747

## - x8 1 47.255 65.255

## - x6 1 51.415 69.415

##

## Step: AIC=58.54

## y ~ x2 + x3 + x4 + x5 + x6 + x8

##

## Df Deviance AIC

## - x5 1 45.746 57.746

## - x4 1 45.779 57.779

## - x3 1 45.853 57.853

## 44.541 58.541

## - x2 1 46.763 58.763

## - x8 1 50.136 62.136

## - x6 1 54.588 66.588

##

## Step: AIC=57.75

## y ~ x2 + x3 + x4 + x6 + x8

##

## Df Deviance AIC

## - x4 1 47.537 57.537

## 45.746 57.746

## - x2 1 48.470 58.470

## - x3 1 49.083 59.083

## - x8 1 51.976 61.976

## - x6 1 56.634 66.634

##

## Step: AIC=57.54

## y ~ x2 + x3 + x6 + x8

##

## Df Deviance AIC

## 47.537 57.537

## - x3 1 50.276 58.276

## - x2 1 51.418 59.418

## - x8 1 53.869 61.869

## - x6 1 59.649 67.649

summary(f2)

##

## Call:

## glm(formula = y ~ x2 + x3 + x6 + x8, family = binomial(), data = df16_2)

##

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.2486 -0.7361 -0.3070 0.6264 1.9791

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -3.0314 0.8965 -3.381 0.000722 ***

## x21 1.4715 0.7656 1.922 0.054617 .

## x31 1.2251 0.7543 1.624 0.104359

## x61 3.6124 1.3391 2.698 0.006985 **

## x81 1.8639 0.8045 2.317 0.020505 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 74.786 on 53 degrees of freedom

## Residual deviance: 47.537 on 49 degrees of freedom

## AIC: 57.537

##

## Number of Fisher Scoring iterations: 5

# 步进法

f3 <- step(f, direction = "both")

## Start: AIC=64.03

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

##

## Df Deviance AIC

## - x7 2 40.086 60.086

## - x1 3 43.933 61.933

## - x4 1 40.466 62.466

## - x5 1 40.605 62.605

## - x2 1 41.600 63.600

## 40.028 64.028

## - x3 1 42.196 64.196

## - x8 1 46.365 68.365

## - x6 1 50.469 72.469

##

## Step: AIC=60.09

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x8

##

## Df Deviance AIC

## - x1 3 44.541 58.541

## - x5 1 40.611 58.611

## - x4 1 40.814 58.814

## - x2 1 41.616 59.616

## 40.086 60.086

## - x3 1 42.747 60.747

## + x7 2 40.028 64.028

## - x8 1 47.255 65.255

## - x6 1 51.415 69.415

##

## Step: AIC=58.54

## y ~ x2 + x3 + x4 + x5 + x6 + x8

##

## Df Deviance AIC

## - x5 1 45.746 57.746

## - x4 1 45.779 57.779

## - x3 1 45.853 57.853

## 44.541 58.541

## - x2 1 46.763 58.763

## + x1 3 40.086 60.086

## + x7 2 43.933 61.933

## - x8 1 50.136 62.136

## - x6 1 54.588 66.588

##

## Step: AIC=57.75

## y ~ x2 + x3 + x4 + x6 + x8

##

## Df Deviance AIC

## - x4 1 47.537 57.537

## 45.746 57.746

## - x2 1 48.470 58.470

## + x5 1 44.541 58.541

## + x1 3 40.611 58.611

## - x3 1 49.083 59.083

## + x7 2 44.697 60.697

## - x8 1 51.976 61.976

## - x6 1 56.634 66.634

##

## Step: AIC=57.54

## y ~ x2 + x3 + x6 + x8

##

## Df Deviance AIC

## 47.537 57.537

## + x1 3 41.625 57.625

## + x4 1 45.746 57.746

## + x5 1 45.779 57.779

## - x3 1 50.276 58.276

## - x2 1 51.418 59.418

## + x7 2 46.792 60.792

## - x8 1 53.869 61.869

## - x6 1 59.649 67.649

summary(f3)

##

## Call:

## glm(formula = y ~ x2 + x3 + x6 + x8, family = binomial(), data = df16_2)

##

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.2486 -0.7361 -0.3070 0.6264 1.9791

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -3.0314 0.8965 -3.381 0.000722 ***

## x21 1.4715 0.7656 1.922 0.054617 .

## x31 1.2251 0.7543 1.624 0.104359

## x61 3.6124 1.3391 2.698 0.006985 **

## x81 1.8639 0.8045 2.317 0.020505 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 74.786 on 53 degrees of freedom

## Residual deviance: 47.537 on 49 degrees of freedom

## AIC: 57.537

##

## Number of Fisher Scoring iterations: 5

按照步进法最终纳入的自变量是x2,x3,x6,x8。

本文首发于公众号:医学和生信笔记,完美观看体验请至公众号查看本文。

医学和生信笔记,专注R语言在临床医学中的使用,R语言数据分析和可视化。

好文推荐

评论可见,请评论后查看内容,谢谢!!!
 您阅读本篇文章共花了: