医学统计学 R语言logistic回归的细节解读

本文首发于公众号：医学和生信笔记，完美观看体验请至公众号查看本文。

医学和生信笔记，专注R语言在临床医学中的使用，R语言数据分析和可视化。

文章目录

二项logistic回归

R语言中的

factor()函数可以把变量变为因子类型，默认是没有等级之分的（可以理解为无序分类变量nominal）！当然也可以通过添加参数

ordered=T变成有序因子（等级资料，有序分类ordinal）。

二项logistic回归

因变量是二分类变量时，可以使用二项逻辑回归（binomial logistic regression），自变量可以是数值变量、无序多分类变量、有序多分类变量。

使用课本例16-2的数据，直接读取。

为了探讨冠心病发生的危险因素，对26例冠心病患者和28例对照者进行病例-对照研究，试用逻辑回归筛选危险因素。

df16_2 <- foreign::read.spss("例16-02.sav",

to.data.frame = T,

use.value.labels = F,

reencode = "utf-8")

## re-encoding from utf-8

str(df16_2)

## 'data.frame': 54 obs. of 11 variables:

## $ .... : num 1 2 3 4 5 6 7 8 9 10 ...

## $ x1 : num 3 2 2 2 3 3 2 3 2 1 ...

## $ x2 : num 1 0 1 0 0 0 0 0 0 0 ...

## $ x3 : num 0 1 0 0 0 1 1 1 0 0 ...

## $ x4 : num 1 1 1 1 1 1 0 1 0 1 ...

## $ x5 : num 0 0 0 0 0 0 0 1 0 0 ...

## $ x6 : num 0 0 0 0 1 0 0 0 0 0 ...

## $ x7 : num 1 1 1 1 1 2 1 1 1 1 ...

## $ x8 : num 1 0 0 0 1 1 0 0 1 0 ...

## $ y : num 0 0 0 0 0 0 0 0 0 0 ...

## $ PGR_1: num 1 0 0 0 1 1 0 0 0 0 ...

## - attr(*, "variable.labels")= Named chr [1:11] "" "" "" "" ...

## ..- attr(*, "names")= chr [1:11] "...." "x1" "x2" "x3" ...

数据一共11列，第1列是编号，第2-9列是自变量，第10列是因变量。

具体说明：

x1：年龄，小于45岁是1,45-55是2,55-65是3,65以上是4；x2：高血压病史，1代表有，0代表无；x3：高血压家族史，1代表有，0代表无；x4：吸烟，1代表吸烟，0代表不吸烟；x5：高血脂病史，1代表有，0代表无；x6：动物脂肪摄入，0表示低，1表示高x7：BMI，小于24是1,24-26是2，大于26是3；x8：A型性格，1代表是，0代表否；y：是否是冠心病，1代表是，0代表否

这里的x1~y虽然是数值型，但并不是真的代表数字大小，只是为了方便标识，进行了转换，因此在进行logistic回归之前，我们要把数值型变量变成无序分类或有序分类变量，在R语言中可以通过factor()函数变成因子型实现。

# 变成因子型

df16_2[,c(2:10)] <- lapply(df16_2[,c(2:10)], factor)

str(df16_2)

## 'data.frame': 54 obs. of 11 variables:

## $ .... : num 1 2 3 4 5 6 7 8 9 10 ...

## $ x1 : Factor w/ 4 levels "1","2","3","4": 3 2 2 2 3 3 2 3 2 1 ...

## $ x2 : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 1 1 ...

## $ x3 : Factor w/ 2 levels "0","1": 1 2 1 1 1 2 2 2 1 1 ...

## $ x4 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 1 2 ...

## $ x5 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...

## $ x6 : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...

## $ x7 : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 1 1 1 ...

## $ x8 : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 1 1 2 1 ...

## $ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

## $ PGR_1: num 1 0 0 0 1 1 0 0 0 0 ...

## - attr(*, "variable.labels")= Named chr [1:11] "" "" "" "" ...

## ..- attr(*, "names")= chr [1:11] "...." "x1" "x2" "x3" ...

需要注意的是自变量x1和x7，这两个应该是有序分类变量，这种自变量在进行逻辑回归时，可以进行哑变量设置，即给定一个参考，让其他所有组都和参考相比，比如这里，我们把x1变成因子型后，R语言在进行logistic回归时，会默认选择第一个为参考。

接下来进行二项逻辑回归，在R语言中，默认是以因子的第一个为参考的！而SPSS是以大的那个为参考。

f <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = df16_2, family = binomial())

summary(f)

## Call:

## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, family = binomial(),

## data = df16_2)

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.1727 -0.4719 -0.1409 0.5315 2.5914

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -5.46026 2.07370 -2.633 0.00846 **

## x12 0.85285 1.54399 0.552 0.58070

## x13 0.47754 1.59320 0.300 0.76438

## x14 3.44227 2.10985 1.632 0.10278

## x21 1.14905 0.93176 1.233 0.21750

## x31 1.66039 1.16857 1.421 0.15535

## x41 0.85994 1.32437 0.649 0.51613

## x51 0.73600 0.97088 0.758 0.44840

## x61 3.92067 1.57004 2.497 0.01252 *

## x72 -0.03467 1.13363 -0.031 0.97560

## x73 -0.38230 1.61710 -0.236 0.81311

## x81 2.46322 1.10484 2.229 0.02578 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## (Dispersion parameter for binomial family taken to be 1)

## Null deviance: 74.786 on 53 degrees of freedom

## Residual deviance: 40.028 on 42 degrees of freedom

## AIC: 64.028

## Number of Fisher Scoring iterations: 6

结果详解：

Deviance Residuals:表示偏差残差统计量。在理想情况下应服从正态分布，均值应为0。

在此例中，中位数的符号为负（-0.01406），表明整体向左偏移，中位数的大小表明偏移的程度。第一个四分位数（1Q）和第三个四分位数（3Q）为两侧“尾巴”分布的幅度。这里3Q大于1Q（绝对值），表明这个曲线是向右倾斜的。最大和最小残差可用来检验数据中的离群值。

结果中Estimate是回归系数和截距，Std. Error表示回归系数的标准误，z value是统计量值（z的平方就是Wald值），Pr(>|z|)是P值。

β值（这里就是Estimate）是指回归系数和截距（常数项），可以是负数（负相关时回归系数出现负值）；

OR是比值比（odds ratio），其取值范围是0至正无穷，不可能是负数；

Wald是一个卡方值，等于β除以它的标准误（这里是Std. Error），然后取平方（也就是z值的平方），因此也不可能是负数。Wald用于对β值进行检验，考察β值是否等于0。若β值等于0，其对应的OR值，也就是Exp(β)为1，表明两组没有显著差异。OR等于β值的反自然对数。Wald值越大，β值越不可能等于0。

结果中出现了x12/x13/x14这种，这是因为R语言在做回归时，如果设置了哑变量，默认是以第一个为参考的，其余都是和第一个进行比较（SPSS中是以大的那个为参考），这也是R中自动进行哑变量编码的方式。

Null deviance:无效偏差（零偏差），Residual deviance:残差偏差，无效偏差和残差偏差之间的差异越大越好，用来评价模型与实际数据的吻合情况。

AIC：赤池信息准则，表示模型拟合程度的好坏，AIC越低表示模型拟合越好。

最后还有一个Fisher Scoring的迭代次数。

我们可以通过函数的方式分别获取模型信息。

# β值

coefficients(f)

## (Intercept) x12 x13 x14 x21 x31

## -5.46025547 0.85285212 0.47754497 3.44227124 1.14905003 1.66039360

## x41 x51 x61 x72 x73 x81

## 0.85994185 0.73600239 3.92067487 -0.03467497 -0.38230011 2.46321944

# β值

coef(f)

## (Intercept) x12 x13 x14 x21 x31

## -5.46025547 0.85285212 0.47754497 3.44227124 1.14905003 1.66039360

## x41 x51 x61 x72 x73 x81

## 0.85994185 0.73600239 3.92067487 -0.03467497 -0.38230011 2.46321944

# β值的95%可信区间

confint(f)

## Waiting for profiling to be done...

## 2.5 % 97.5 %

## (Intercept) -10.3696980 -1.983104

## x12 -2.0317236 4.405067

## x13 -2.5429244 4.085370

## x14 -0.2319302 8.343123

## x21 -0.6458070 3.099838

## x31 -0.5431686 4.205175

## x41 -1.6713365 3.801261

## x51 -1.1846658 2.725051

## x61 1.3290533 7.677657

## x72 -2.3618580 2.224863

## x73 -3.8303437 2.725470

## x81 0.5105394 4.997352

# OR值

exp(coef(f))

## (Intercept) x12 x13 x14 x21 x31

## 0.004252469 2.346329320 1.612111759 31.257871683 3.155194147 5.261381340

## x41 x51 x61 x72 x73 x81

## 2.363023282 2.087573511 50.434470096 0.965919321 0.682290259 11.742555242

# OR值的95%的可信区间

exp(confint(f))

## Waiting for profiling to be done...

## 2.5 % 97.5 %

## (Intercept) 3.136876e-05 0.1376413

## x12 1.311093e-01 81.8646261

## x13 7.863610e-02 59.4639513

## x14 7.930015e-01 4201.1887167

## x21 5.242393e-01 22.1943589

## x31 5.809047e-01 67.0323349

## x41 1.879956e-01 44.7576059

## x51 3.058484e-01 15.2571993

## x61 3.777465e+00 2159.5535363

## x72 9.424495e-02 9.2522177

## x73 2.170216e-02 15.2635868

## x81 1.666190e+00 148.0206875

这里x21的OR值是2.346329320，代表，45~55岁的人群患冠心病的风险是小于45岁人群的2.346329320倍，但是这个结果并没有统计学意义！

# Wald值

summary(f)$coefficients[,3]^2

## (Intercept) x12 x13 x14 x21 x31

## 6.933188870 0.305111544 0.089843733 2.661883233 1.520790277 2.018903576

## x41 x51 x61 x72 x73 x81

## 0.421615676 0.574682148 6.235929079 0.000935592 0.055890396 4.970577395

# P值

summary(f)$coefficients[,4]

## (Intercept) x12 x13 x14 x21 x31

## 0.00846107 0.58069555 0.76437591 0.10277898 0.21749994 0.15535128

## x41 x51 x61 x72 x73 x81

## 0.51613195 0.44840433 0.01251839 0.97559855 0.81311338 0.02578204

# 预测概率

fitted(f) # 或者 predict(f,type = "response")

## 1 2 3 4 5 6

## 0.375076515 0.110360122 0.069240725 0.023034427 0.905605901 0.491543165

## 7 8 9 10 11 12

## 0.049878030 0.151052146 0.104875967 0.009948712 0.208062753 0.046013662

## 13 14 15 16 17 18

## 0.009879122 0.497751927 0.500211100 0.074509703 0.023034427 0.105543397

## 19 20 21 22 23 24

## 0.359548891 0.441102099 0.048627400 0.734770361 0.366272916 0.009879122

## 25 26 27 28 29 30

## 0.049878030 0.366272916 0.009879122 0.101665098 0.995553588 0.950848767

## 31 32 33 34 35 36

## 0.712839656 0.995611072 0.216828996 0.984826081 0.543195397 0.905612594

## 37 38 39 40 41 42

## 0.868286980 0.993760333 0.868286980 0.034813473 0.902606657 0.966930037

## 43 44 45 46 47 48

## 0.375076515 0.964725296 0.840087511 0.818110300 0.881331876 0.676305952

## 49 50 51 52 53 54

## 0.780828686 0.555921773 0.986103872 0.816157300 0.466253375 0.655579178

# 偏差

deviance(f)

## [1] 40.02758

# 残差自由度

df.residual(f)

## [1] 42

# 伪R^2

DescTools::PseudoR2(f, which = c("McFadden", "CoxSnell", "Nagelkerke"))

## McFadden CoxSnell Nagelkerke

## 0.4647704 0.4746397 0.6331426

模型整体的假设检验：

# 先构建一个只有截距的模型

f0 <- glm(y ~ 1, data = df16_2, family = binomial())

anova(f0,f,test="Chisq")

## Analysis of Deviance Table

## Model 1: y ~ 1

## Model 2: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

## Resid. Df Resid. Dev Df Deviance Pr(>Chi)

## 1 53 74.786

## 2 42 40.028 11 34.758 0.0002716 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P<0.001，说明我们的模型是有意义的。

逐步回归法的logistic回归，可以使用step()函数：

# 向前

f1 <- step(f, direction = "forward")

## Start: AIC=64.03

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

summary(f1)

## Call:

## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, family = binomial(),

## data = df16_2)

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.1727 -0.4719 -0.1409 0.5315 2.5914

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -5.46026 2.07370 -2.633 0.00846 **

## x12 0.85285 1.54399 0.552 0.58070

## x13 0.47754 1.59320 0.300 0.76438

## x14 3.44227 2.10985 1.632 0.10278

## x21 1.14905 0.93176 1.233 0.21750

## x31 1.66039 1.16857 1.421 0.15535

## x41 0.85994 1.32437 0.649 0.51613

## x51 0.73600 0.97088 0.758 0.44840

## x61 3.92067 1.57004 2.497 0.01252 *

## x72 -0.03467 1.13363 -0.031 0.97560

## x73 -0.38230 1.61710 -0.236 0.81311

## x81 2.46322 1.10484 2.229 0.02578 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## (Dispersion parameter for binomial family taken to be 1)

## Null deviance: 74.786 on 53 degrees of freedom

## Residual deviance: 40.028 on 42 degrees of freedom

## AIC: 64.028

## Number of Fisher Scoring iterations: 6

# 向后

f2 <- step(f, direction = "backward")

## Start: AIC=64.03

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

## Df Deviance AIC

## - x7 2 40.086 60.086

## - x1 3 43.933 61.933

## - x4 1 40.466 62.466

## - x5 1 40.605 62.605

## - x2 1 41.600 63.600

## 40.028 64.028

## - x3 1 42.196 64.196

## - x8 1 46.365 68.365

## - x6 1 50.469 72.469

## Step: AIC=60.09

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x8

## Df Deviance AIC

## - x1 3 44.541 58.541

## - x5 1 40.611 58.611

## - x4 1 40.814 58.814

## - x2 1 41.616 59.616

## 40.086 60.086

## - x3 1 42.747 60.747

## - x8 1 47.255 65.255

## - x6 1 51.415 69.415

## Step: AIC=58.54

## y ~ x2 + x3 + x4 + x5 + x6 + x8

## Df Deviance AIC

## - x5 1 45.746 57.746

## - x4 1 45.779 57.779

## - x3 1 45.853 57.853

## 44.541 58.541

## - x2 1 46.763 58.763

## - x8 1 50.136 62.136

## - x6 1 54.588 66.588

## Step: AIC=57.75

## y ~ x2 + x3 + x4 + x6 + x8

## Df Deviance AIC

## - x4 1 47.537 57.537

## 45.746 57.746

## - x2 1 48.470 58.470

## - x3 1 49.083 59.083

## - x8 1 51.976 61.976

## - x6 1 56.634 66.634

## Step: AIC=57.54

## y ~ x2 + x3 + x6 + x8

## Df Deviance AIC

## 47.537 57.537

## - x3 1 50.276 58.276

## - x2 1 51.418 59.418

## - x8 1 53.869 61.869

## - x6 1 59.649 67.649

summary(f2)

## Call:

## glm(formula = y ~ x2 + x3 + x6 + x8, family = binomial(), data = df16_2)

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.2486 -0.7361 -0.3070 0.6264 1.9791

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -3.0314 0.8965 -3.381 0.000722 ***

## x21 1.4715 0.7656 1.922 0.054617 .

## x31 1.2251 0.7543 1.624 0.104359

## x61 3.6124 1.3391 2.698 0.006985 **

## x81 1.8639 0.8045 2.317 0.020505 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## (Dispersion parameter for binomial family taken to be 1)

## Null deviance: 74.786 on 53 degrees of freedom

## Residual deviance: 47.537 on 49 degrees of freedom

## AIC: 57.537

## Number of Fisher Scoring iterations: 5

# 步进法

f3 <- step(f, direction = "both")

## Start: AIC=64.03

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

## Df Deviance AIC

## - x7 2 40.086 60.086

## - x1 3 43.933 61.933

## - x4 1 40.466 62.466

## - x5 1 40.605 62.605

## - x2 1 41.600 63.600

## 40.028 64.028

## - x3 1 42.196 64.196

## - x8 1 46.365 68.365

## - x6 1 50.469 72.469

## Step: AIC=60.09

## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x8

## Df Deviance AIC

## - x1 3 44.541 58.541

## - x5 1 40.611 58.611

## - x4 1 40.814 58.814

## - x2 1 41.616 59.616

## 40.086 60.086

## - x3 1 42.747 60.747

## + x7 2 40.028 64.028

## - x8 1 47.255 65.255

## - x6 1 51.415 69.415

## Step: AIC=58.54

## y ~ x2 + x3 + x4 + x5 + x6 + x8

## Df Deviance AIC

## - x5 1 45.746 57.746

## - x4 1 45.779 57.779

## - x3 1 45.853 57.853

## 44.541 58.541

## - x2 1 46.763 58.763

## + x1 3 40.086 60.086

## + x7 2 43.933 61.933

## - x8 1 50.136 62.136

## - x6 1 54.588 66.588

## Step: AIC=57.75

## y ~ x2 + x3 + x4 + x6 + x8

## Df Deviance AIC

## - x4 1 47.537 57.537

## 45.746 57.746

## - x2 1 48.470 58.470

## + x5 1 44.541 58.541

## + x1 3 40.611 58.611

## - x3 1 49.083 59.083

## + x7 2 44.697 60.697

## - x8 1 51.976 61.976

## - x6 1 56.634 66.634

## Step: AIC=57.54

## y ~ x2 + x3 + x6 + x8

## Df Deviance AIC

## 47.537 57.537

## + x1 3 41.625 57.625

## + x4 1 45.746 57.746

## + x5 1 45.779 57.779

## - x3 1 50.276 58.276

## - x2 1 51.418 59.418

## + x7 2 46.792 60.792

## - x8 1 53.869 61.869

## - x6 1 59.649 67.649

summary(f3)

## Call:

## glm(formula = y ~ x2 + x3 + x6 + x8, family = binomial(), data = df16_2)

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.2486 -0.7361 -0.3070 0.6264 1.9791

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -3.0314 0.8965 -3.381 0.000722 ***

## x21 1.4715 0.7656 1.922 0.054617 .

## x31 1.2251 0.7543 1.624 0.104359

## x61 3.6124 1.3391 2.698 0.006985 **

## x81 1.8639 0.8045 2.317 0.020505 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## (Dispersion parameter for binomial family taken to be 1)

## Null deviance: 74.786 on 53 degrees of freedom

## Residual deviance: 47.537 on 49 degrees of freedom

## AIC: 57.537

## Number of Fisher Scoring iterations: 5

按照步进法最终纳入的自变量是x2,x3,x6,x8。

本文首发于公众号：医学和生信笔记，完美观看体验请至公众号查看本文。

医学和生信笔记，专注R语言在临床医学中的使用，R语言数据分析和可视化。

好文推荐

评论可见，请评论后查看内容，谢谢！！！

您阅读本篇文章共花了：

金钥匙

医学统计学 R语言logistic回归的细节解读

神经网络人工智能线性回归经验分享深度学习基础15（softmax回归的基本实现）

自动化测试 python 职场和发展程序人生软件测试了解回归测试的作用

发表评论取消回复

金钥匙

医学统计学 R语言logistic回归的细节解读

神经网络 人工智能 线性回归 经验分享 深度学习基础15（softmax回归的基本实现）

自动化测试 python 职场和发展 程序人生 软件测试 了解回归测试的作用

相关文章

发表评论取消回复

神经网络人工智能线性回归经验分享深度学习基础15（softmax回归的基本实现）

自动化测试 python 职场和发展程序人生软件测试了解回归测试的作用