I'm working on an external validation project and have some doubts about testing the calibration of a model. My project concerns validating Community-Acquired Pneumonia scores in a retrospective cohort; for this question, however, I will create a simulated dataset.
Suppose I have 3 predictor variables (hemoglobin, creatinine, and sex), which are continuous, continuous, and binary respectively, and the outcome is death, which is also binary.
And suppose there is a published model that assigns points to hemoglobin, creatinine, and sex to predict death: 1 point if male, 2 points if hemoglobin is less than or equal to 12, and 3 points if creatinine is greater than or equal to 1. I then calculate the respective score in R, giving values from 0 up to a maximum of 6. I also have the outcome for those patients; however, I have a scale in points, not probabilities. How can I test the calibration of this model?
The simplest way that comes to mind is fitting a logistic model (e.g. glm(data = df, death ~ score)) and then using, for example, the calibrate function from the rms package to plot it. However, I'm concerned that fitting a logistic regression would 'train' and overfit the model to my data (the slope would be 1 and the intercept 0). A second way would be to go back to the original paper where they validated the score and use their reported death rates for each level of the score. However, I don't think people actually do that.
Here is the code.
To make things easier to understand: my real problem is validating the PSI (Pneumonia Severity Index) in a cohort of COVID-19 patients; however, my PSI score ranges from 1 to 5 only.
Thanks in advance.
library(tidyverse)
library(rms)
# note: no seed was set, so rerunning will give different values than shown below
df <- data.frame(id = 1:20,
                 hemoglobin = rnorm(20, mean = 13, sd = 2),
                 creatinine = rnorm(20, mean = 0.8, sd = 0.2),
                 sex = rbinom(20, 1, 0.5),
                 death = rbinom(20, 1, 0.2))
df
#> id hemoglobin creatinine sex death
#> 1 1 14.647678 0.9027220 0 0
#> 2 2 14.220308 1.0181192 0 1
#> 3 3 14.540249 0.7445206 1 0
#> 4 4 12.173647 0.6452908 0 0
#> 5 5 11.713601 0.8535807 0 0
#> 6 6 9.971964 0.4272673 0 1
#> 7 7 11.489067 0.6374971 0 1
#> 8 8 13.532228 0.7572170 1 0
#> 9 9 13.742081 0.6465518 0 0
#> 10 10 10.249797 0.4877277 1 1
#> 11 11 10.389741 0.6760747 0 1
#> 12 12 15.043165 0.7012690 0 0
#> 13 13 11.976079 1.1220017 1 0
#> 14 14 15.905599 1.1381758 1 0
#> 15 15 14.369145 1.0629459 1 0
#> 16 16 14.034443 0.7007272 1 0
#> 17 17 15.656715 0.6277480 1 1
#> 18 18 7.496555 0.8633633 1 1
#> 19 19 12.590283 0.6671725 1 0
#> 20 20 11.709377 0.9579166 1 0
df$score <- 1*(df$sex == 1) + 2*(df$hemoglobin <= 12) + 3*(df$creatinine >= 1)
df$score
#> [1] 0 3 1 0 2 2 2 1 0 3 2 0 6 4 4 1 1 3 1 3
model <- glm(data = df, death ~ score, family = 'binomial')
df$prob <- predict(model, df, type = 'response')
val.prob(df$prob, df$death)
#> Dxy C (ROC) R2 D D:Chi-sq
#> 3.186813e-01 6.593407e-01 3.340394e-02 -2.544734e-02 4.910532e-01
#> D:p U U:Chi-sq U:p Q
#> NA -1.000000e-01 3.552714e-15 1.000000e+00 7.455266e-02
#> Brier Intercept Slope Emax E90
#> 2.236665e-01 -6.026885e-11 1.000000e+00 7.162773e-01 4.008586e-01
#> Eavg S:z S:p
#> 2.505908e-01 5.303483e-02 9.577042e-01
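Regarding the second approach: it can be done without refitting anything on your own data by mapping each score level to the death rate reported in the original paper and then validating those fixed probabilities directly. A minimal sketch, assuming hypothetical published rates (substitute the real ones from the paper):
# hypothetical published death rates per score level -- replace with the
# rates reported in the original validation paper
published_rates <- c(`0` = 0.05, `1` = 0.10, `2` = 0.20, `3` = 0.35,
                     `4` = 0.50, `5` = 0.65, `6` = 0.80)
df$prob_published <- published_rates[as.character(df$score)]
# nothing is refit on the validation data, so the calibration intercept
# and slope are no longer 0 and 1 by construction
val.prob(df$prob_published, df$death)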
I am new to R, so this may seem like a basic question. I am trying to estimate a probit regression of being Employed given that the individual is Male. I think I have the probit model correct, but I am unable to plot it. Below are the first 10 rows from my dataset, which has about 60,000 rows in total. As you can see, I have created 5 dummy variables: 'Male', 'LeavingCert', 'Bachelors', 'Married', and 'Employed'. (Although the first 10 rows of the Employed column are 0, this is not the case for the full dataset; however, there are significantly more 0s than 1s, and perhaps this is my issue?)
Top10 <- head(data,10)
Top10
# A tibble: 10 × 10
...1 SEX MARSTAT MAINSTAT EDUCLEV4 Male LeavingCert Bachelors Married Employed
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 2 6 9 0 0 0 1 0
2 2 2 1 2 9 0 0 0 0 0
3 3 2 1 2 9 0 0 0 0 0
4 4 2 2 2 9 0 0 0 1 0
5 5 2 1 5 6 0 0 1 0 0
6 6 1 2 3 9 1 0 0 1 0
7 7 1 2 2 9 1 0 0 1 0
8 8 2 3 2 9 0 0 0 0 0
9 9 2 2 2 9 0 0 0 1 0
10 10 1 1 4 9 1 0 0 0 0
For my 'Probit1' model, my Y is 'Employed' and my X is 'Male'. My code is as follows:
Probit1 <- glm(Employed ~ Male,
family = binomial(link = "probit"),
data = data)
summary(Probit1)
I have tried to plot this probit regression as follows:
# plot data
plot(x = data$Male,
y = data$Employed,
main = "Probit Model of the Probability of Employed, Given Male",
xlab = "Male",
ylab = "Employed",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.85)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Employed")
text(2.5, -0.1, cex = 0.8, "Unemployed")
# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(Probit1, list(Male = x), type = "response")
lines(x, y, lwd = 1.5, col = "steelblue")
This is the plot I am getting, and it does not seem correct.
Are my data the problem?
Any help is very much appreciated, and if this is not possible, is there another plot I could do? Thanks in advance.
> summary(Probit1)
Call:
glm(formula = Employed ~ Male, family = binomial(link = "probit"),
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.2777 -0.2777 -0.2742 -0.2742 2.5689
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.78787 0.01299 -137.67 <2e-16 ***
Male 0.01141 0.01871 0.61 0.542
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 19747 on 61939 degrees of freedom
Residual deviance: 19747 on 61938 degrees of freedom
AIC: 19751
Number of Fisher Scoring iterations: 5
> table(data$Male, data$Employed)
0 1
0 31165 1194
1 28462 1119
Thank you for adding the regression summary and the table. From the regression summary we can see that the coefficient for Male is really small (and not significant):
Male 0.01141 0.01871 0.61 0.542
That small coefficient means that being male increases the chances of being employed only by the tiniest bit, so it makes sense that we cannot see it in the plot: the change is too small, the slope too shallow.
Looking at the table
> table(data$Male, data$Employed)
0 1
0 31165 1194
1 28462 1119
We can confirm that there is no numerically or visually impressive increase:
1194/(1194+31165) = 3.7%, 1119/(1119+28462) = 3.8%.
You cannot expect to see a 0.1 percentage-point increase in your plot.
Are my data the problem?
They are the reason. I don't know whether they are a problem.
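If you still want a plot, here is a minimal sketch (not from the original answer): with a single binary predictor the model produces only two fitted probabilities, so plotting those two points on a zoomed-in y-axis is more informative than drawing a curve over 0 to 3.
# fitted employment probability for females (Male = 0) and males (Male = 1)
p_hat <- predict(Probit1, newdata = data.frame(Male = c(0, 1)),
                 type = "response")
plot(c(0, 1), p_hat,
     xlim = c(-0.5, 1.5), ylim = c(0, 0.1), pch = 19,
     xlab = "Male", ylab = "P(Employed)",
     main = "Fitted probability of employment by sex")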
I am trying to build a logistic regression predicting cancer (1) or no cancer (0) from various categorical variables in the dataset.
The raw dataset is large, so to reduce the number of rows a "count" column was added that indicates how many times each combination of variable values occurred.
How do I incorporate this count column in the logistic regression?
my_model <- glm(cancer ~ age_grp + density + race + bmi, bcancer)
Dataset from BCSC: https://www.bcsc-research.org/data/rfdataset/dataset
You seem to have data like this.
head(dat)
# cancer age_grp race bmi count
# 1 0 1 1 18 561
# 2 1 1 1 18 997
# 3 0 2 1 18 321
# 4 1 2 1 18 153
# 5 0 3 1 18 74
# 6 1 3 1 18 228
You could fit a weighted regression with count as the weights.
summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), dat,
weights=count))$coef
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.364588477 0.041604639 8.763169 1.898369e-18
# age_grp 0.009726589 0.002182186 4.457269 8.301035e-06
# race 0.020779774 0.005636968 3.686339 2.275036e-04
# bmi -0.021827620 0.001754685 -12.439623 1.592543e-35
You could also try to "unpack" the data,
dat_unpack <- do.call(rbind.data.frame,
apply(dat, 1, \(x)
t(replicate(x['count'], x[setdiff(names(x), 'count')]))))
head(dat_unpack)
# cancer age_grp race bmi
# 1 0 1 1 18
# 2 0 1 1 18
# 3 0 1 1 18
# 4 0 1 1 18
# 5 0 1 1 18
# 6 0 1 1 18
but it's a wasted labor of love since, except for the usual rounding errors, the results are identical.
summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), dat_unpack))$coef
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.364588477 0.041604640 8.763169 1.898374e-18
# age_grp 0.009726589 0.002182186 4.457268 8.301070e-06
# race 0.020779774 0.005636970 3.686338 2.275043e-04
# bmi -0.021827620 0.001754685 -12.439621 1.592570e-35
Data
set.seed(42)
dat <- expand.grid(cancer=0:1, age_grp=1:7, race=1:3, bmi=18:26)
dat$count <- sample(1e3, nrow(dat), replace=TRUE)
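An equivalent alternative (a sketch, not part of the original answer) is to reshape to one row per covariate pattern and fit the aggregated binomial form cbind(events, non-events); the coefficients match the weighted fit.
library(tidyverse)
# one row per (age_grp, race, bmi) pattern, with event and non-event counts
agg <- dat %>%
  pivot_wider(names_from = cancer, values_from = count, names_prefix = "n_")
summary(glm(cbind(n_1, n_0) ~ age_grp + race + bmi,
            family = binomial(), data = agg))$coef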
When I use the anova_test() function (from the rstatix package) to do a two-way repeated-measures ANOVA, an error occurs:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases
I checked my data and there are no missing values.
BTW, in my data not everyone has all 8 outcome measurements; some people have at most 3, some have 8, and so on.
I followed this website for the two-way repeated-measures ANOVA:
https://www.datanovia.com/en/lessons/repeated-measures-anova-in-r/
I have uploaded my dataset to GitHub:
mydata: https://github.com/lizhiwei1994/testRepo/blob/master/mydata.csv
My code:
# load packages
library("tidyverse")
library("ggpubr")
library("rstatix")
# load data and check missing value
mydata <- read.csv(
url("https://raw.githubusercontent.com/lizhiwei1994/testRepo/master/mydata.csv")
) %>% convert_as_factor(id, time, treatment)
glimpse(mydata)
sum(is.na(mydata))
# error occurring
res.aov <- anova_test(
data = mydata, dv = outcome, wid = id,
within = c(treatment, time)
)
get_anova_table(res.aov)
For a repeated-measures ANOVA you need complete observations for each subject at every time point under each treatment. Before you run the ANOVA, it's always good to check the observations:
tab = table(mydata$time,mydata$treatment,mydata$id)
#subject = 1
tab[,,"1"]
control2 treat2
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 0 0
So this subject has only control2 observations and no treat2 observations. If something is wrong with the data entry for this subject, please correct it. Below is an example of a dataset for which it will work:
test = expand.grid(id=1:2,time=1:8,treatment=c("a","b"))
test$outcome=rnorm(nrow(test))
table(test$time,test$treatment,test$id)
, , = 1
a b
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
, , = 2
a b
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
anova_test(data=test,dv=outcome,wid=id,within=c("treatment","time"))
ANOVA Table (type III tests)
Effect DFn DFd F p p<.05 ges
1 treatment 1 1 424.283 0.031 * 0.078
2 time 7 7 1.596 0.276 0.422
3 treatment:time 7 7 1.571 0.283 0.422
According to the formula you provided in the comment, aov(outcome ~ time*treatment + Error(id), data = mydata), this would be a mixed ANOVA. But for that you need individuals who have undergone both treatment and control, so that the individual effect can be controlled for, and such individuals are absent from your dataset.
The only ANOVA you can perform in this situation is a two-way (between-subjects) ANOVA, demonstrated here on the example data:
anova_test(data=test,dv=outcome,between=c(time,treatment))
Coefficient covariances computed by hccm()
ANOVA Table (type II tests)
Effect DFn DFd F p p<.05 ges
1 time 7 16 1.668 0.187 0.422
2 treatment 1 16 1.350 0.262 0.078
3 time:treatment 7 16 1.668 0.187 0.422
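To check completeness for all subjects at once, rather than one id at a time, a quick sketch:
# flag every subject whose treatment-by-time grid has an empty cell
counts <- table(mydata$id, mydata$treatment, mydata$time)
incomplete <- apply(counts, 1, function(m) any(m == 0))
names(incomplete)[incomplete]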
Using data in the following form, in what ways can I calculate the (age-specific) mortality rate in the R programming language?
head(data)
## age gender zone Class misc bonus duration death cost
## 1 0 M 1 4 12 1 0.1753 0 0
## 2 4 M 3 6 9 1 0.0000 1 0
## 3 5 F 3 3 18 1 0.4548 0 0
## 4 5 F 4 1 25 1 0.1726 0 0
## 5 6 F 2 1 26 1 0.1808 0 0
## 6 9 F 3 3 8 1 0.5425 0 0
That is, for each age I want to calculate the number of deaths and divide by the total number of exposed individuals in that particular age. I tried the following:
n <- length(data$age)
rate <- c(1:n)
for (i in 1:n) {
  # this loops over row indices rather than the ages actually present,
  # and length() counts the columns of the subset, not its rows
  rate[i] <- sum(subset(data, age == i)$death) / length(subset(data, age == i))
}
But this was useless; obviously, not all ages from 1 to n are present in the dataset. I am looking for a program, in the spirit of the above, that will do the job.
Because the variable death only takes on the value of zero or one, you can calculate the age-specific mortality in one line of code.
tapply(data$death, data$age, mean)
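For example, with a toy reconstruction of the six rows shown in the question (dat_toy is a hypothetical stand-in for your data), this returns the death proportion at each observed age:
dat_toy <- data.frame(age = c(0, 4, 5, 5, 6, 9),
                      death = c(0, 1, 0, 0, 0, 0))
tapply(dat_toy$death, dat_toy$age, mean)
## 0 4 5 6 9
## 0 1 0 0 0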
You can get most of the way there with table(). If we assume that all those not dying are present for 100% of the time (a year, say), and that those dying are present for 1/2 of the time, then we have enough info to calculate exposure from these data. I'm not sure what your duration column is, but you haven't really described the data.
# cheap version of your data:
DF <- data.frame(age = c(0,4,5,5,6,9), death = c(0,1,0,0,0,0))
(DAT <- table(DF$death,DF$age))
0 4 5 6 9
0 1 0 2 1 1
1 0 1 0 0 0
# weight these two rows for components of exposure:
Exposure <- colSums(DAT * c(1,.5))
# rates are the ratio of death counts in each age to exposure to risk in each age:
Rates <- DAT["1",] / Exposure
If you then go on to calculate a lifetable, this is the so-called Mx or mx column.
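If the duration column records the fraction of the period each individual was actually observed (an assumption; the question doesn't describe it), you could use it as exposure directly instead of the 1 and 0.5 weighting:
# age-specific rate = deaths at age x / person-time observed at age x
rates <- with(data, tapply(death, age, sum) / tapply(duration, age, sum))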
So, my dataset consists of 15 variables, one of which (sex) has only 2 levels. I want to use it as a dummy variable, but the levels are 1 and 2. How do I change this? I want levels 0 and 1, but I don't know how to manage this in R!
With most of R's modelling tools that have a formula interface you don't need to create dummy variables; the underlying code that handles and interprets the formula does this for you. If you want a dummy variable for some other reason, there are several options. The easiest (IMHO) is to use model.matrix():
set.seed(1)
dat <- data.frame(sex = sample(c("male","female"), 10, replace = TRUE))
model.matrix( ~ sex - 1, data = dat)
which gives:
> dummy <- model.matrix( ~ sex - 1, data = dat)
> dummy
sexfemale sexmale
1 0 1
2 0 1
3 1 0
4 1 0
5 0 1
6 1 0
7 1 0
8 1 0
9 1 0
10 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$sex
[1] "contr.treatment"
> dummy[,1]
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
You can use either column of dummy as a numeric dummy variable; choose whichever column codes the class you want as 1: dummy[,1] uses 1 to represent the female class and dummy[,2] the male class.
Cast this as a factor if you want it to be interpreted as a categorical object:
> factor(dummy[, 1])
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
Levels: 0 1
But that is defeating the purpose of a factor; what does 0 stand for again?
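If you want to keep the dummy alongside the original data for later use, a small sketch:
# attach the 0/1 female indicator to the data frame
dat$female <- dummy[, "sexfemale"]
head(dat)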
Try this:
set.seed(001) # generating some data
sex <- factor(sample(1:2, 10, replace=TRUE)) # this is what you have
sex
[1] 1 1 2 2 1 2 2 2 2 1
Levels: 1 2
sex<-factor(ifelse(as.numeric(sex)==2, 1,0)) # this is what you want
sex
[1] 0 0 1 1 0 1 1 1 1 0
Levels: 0 1
If you want labels to be 0 = Male and 1 = Female, then...
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
sex # this is what you want
[1] M M F F M F F F F M
Levels: M F
Actually, you don't need to create a dummy variable in order to estimate a model using lm. Let's see an example:
set.seed(001) # Generating some data
N <- 100
x <- rnorm(N, 50, 20)
y <- 20 + 3.5*x + rnorm(N)
sex <- factor(sample(1:2, N, replace=TRUE))
# Estimating the linear model
lm(y ~ x + sex) # using the first category as the baseline (this means sex==1)
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sex2
19.97815 3.49994 -0.02719
# renaming the categories and labelling them
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
lm(y ~ x + sex) # the same results, baseline is 'Male'
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sexF
19.97815 3.49994 -0.02719
As you can see, R deals with dummies pretty well; you just pass them into the formula as a factor variable and R does the rest for you.
By the way, there's no need to change the categories from c(2,1) to c(0,1); the results will be the same, as you can see in the example above.
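One more convenience (a sketch, not part of the original answer): if you only want a different baseline category, relevel the factor instead of recoding it:
sex <- relevel(sex, ref = "F")  # make 'F' the reference level
lm(y ~ x + sex)                 # slope coefficient now reported as sexM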
As suggested by many above, turn it into a factor.
If you really want to dummy code the gender variable, consider this:
set.seed(100)
gender <- rbinom(100, 1, 0.5) + 1  # simulate a variable with levels 1 and 2
gender_dummy <- gender - 1         # shift to 0/1 coding