I am trying to build a logistic regression predicting cancer (1) or no cancer (0) from various categorical variables in the dataset.
The raw dataset is large, so to reduce the number of rows it was collapsed to unique combinations of the variables, with a "count" column indicating how many times each combination occurred.
How do I incorporate this count column in the logistic regression?
my_model <- glm(cancer ~ age_grp + density + race + bmi, bcancer)
Dataset from BCSC: https://www.bcsc-research.org/data/rfdataset/dataset
You seem to have data like this.
head(dat)
# cancer age_grp race bmi count
# 1 0 1 1 18 561
# 2 1 1 1 18 997
# 3 0 2 1 18 321
# 4 1 2 1 18 153
# 5 0 3 1 18 74
# 6 1 3 1 18 228
You could fit a weighted regression, with count as the weights.
summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), dat,
weights=count))$coef
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.364588477 0.041604639 8.763169 1.898369e-18
# age_grp 0.009726589 0.002182186 4.457269 8.301035e-06
# race 0.020779774 0.005636968 3.686339 2.275036e-04
# bmi -0.021827620 0.001754685 -12.439623 1.592543e-35
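Equivalently (a minimal sketch, assuming the same aggregated dat as above, and using dplyr for the aggregation), you could collapse each covariate combination into case/control counts and use the two-column response form of glm(); this should reproduce the estimates from the weighted fit:
library(dplyr)
agg <- dat %>%
  group_by(age_grp, race, bmi) %>%
  summarise(cases    = sum(count[cancer == 1]),
            controls = sum(count[cancer == 0]),
            .groups  = "drop")
summary(glm(cbind(cases, controls) ~ age_grp + race + bmi,
            family = binomial(), data = agg))$coef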
You could also try to "unpack" the data,
dat_unpack <- do.call(rbind.data.frame,
                      apply(dat, 1, \(x)
                        t(replicate(x['count'], x[setdiff(names(x), 'count')]))))
head(dat_unpack)
# cancer age_grp race bmi
# 1 0 1 1 18
# 2 0 1 1 18
# 3 0 1 1 18
# 4 0 1 1 18
# 5 0 1 1 18
# 6 0 1 1 18
but it's a wasted labor of love since, apart from the usual rounding errors, the results are identical.
summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), dat_unpack))$coef
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.364588477 0.041604640 8.763169 1.898374e-18
# age_grp 0.009726589 0.002182186 4.457268 8.301070e-06
# race 0.020779774 0.005636970 3.686338 2.275043e-04
# bmi -0.021827620 0.001754685 -12.439621 1.592570e-35
Data
set.seed(42)
dat <- expand.grid(cancer=0:1, age_grp=1:7, race=1:3, bmi=18:26)
dat$count <- sample(1e3, nrow(dat), replace=TRUE)
Related
I'm working on an external validation project and have some doubts about testing the calibration of a model. My project concerns validation of community-acquired pneumonia scores in a retrospective cohort; for this question, however, I will create a simulated dataset.
Suppose I have 3 predictor variables (hemoglobin, creatinine and sex), which are continuous, continuous and binary respectively, and the outcome is death, which is also binary.
And suppose there is a published model that assigns points for hemoglobin, creatinine and sex to predict death: 1 point if male, 2 points if hemoglobin is less than or equal to 12, and 3 points if creatinine is greater than or equal to 1. I then calculate the score in R, giving values from 0 up to a maximum of 6. I also have the observed outcome for those patients, but the model gives a scale in points, not probabilities. How can I test the calibration of this model?
The simplest way that comes to mind is to fit a logistic model (e.g. glm(death ~ score, data = df)) and then use, for example, the calibrate function from the rms package to plot it. However, I'm concerned that fitting a logistic regression would 'train' and overfit the model to my data (the slope would be 1 and the intercept 0). A second way would be to go back to the original paper where the score was validated and get their predicted death rates at each level of the score, but I don't think people actually do that.
Here goes the code.
To make things easier to understand: my actual problem is validating the PSI (Pneumonia Severity Index) in a cohort of COVID-19 patients; however, my PSI score ranges from 1 to 5 only.
Thanks in advance.
library(tidyverse)
library(rms)
df <- data.frame(id = 1:20,
hemoglobin = rnorm(20, mean = 13, sd = 2),
creatinine = rnorm(20, mean = 0.8, sd = 0.2),
sex = rbinom(20, 1, 0.5),
death = rbinom(20, 1, 0.2))
df
#> id hemoglobin creatinine sex death
#> 1 1 14.647678 0.9027220 0 0
#> 2 2 14.220308 1.0181192 0 1
#> 3 3 14.540249 0.7445206 1 0
#> 4 4 12.173647 0.6452908 0 0
#> 5 5 11.713601 0.8535807 0 0
#> 6 6 9.971964 0.4272673 0 1
#> 7 7 11.489067 0.6374971 0 1
#> 8 8 13.532228 0.7572170 1 0
#> 9 9 13.742081 0.6465518 0 0
#> 10 10 10.249797 0.4877277 1 1
#> 11 11 10.389741 0.6760747 0 1
#> 12 12 15.043165 0.7012690 0 0
#> 13 13 11.976079 1.1220017 1 0
#> 14 14 15.905599 1.1381758 1 0
#> 15 15 14.369145 1.0629459 1 0
#> 16 16 14.034443 0.7007272 1 0
#> 17 17 15.656715 0.6277480 1 1
#> 18 18 7.496555 0.8633633 1 1
#> 19 19 12.590283 0.6671725 1 0
#> 20 20 11.709377 0.9579166 1 0
df$score <- 1*(df$sex == 1) + 2*(df$hemoglobin <= 12) + 3*(df$creatinine >= 1)
df$score
#> [1] 0 3 1 0 2 2 2 1 0 3 2 0 6 4 4 1 1 3 1 3
model <- glm(data = df, death ~ score, family = 'binomial')
df$prob <- predict(model, df, type = 'response')
val.prob(df$prob, df$death)
#> Dxy C (ROC) R2 D D:Chi-sq
#> 3.186813e-01 6.593407e-01 3.340394e-02 -2.544734e-02 4.910532e-01
#> D:p U U:Chi-sq U:p Q
#> NA -1.000000e-01 3.552714e-15 1.000000e+00 7.455266e-02
#> Brier Intercept Slope Emax E90
#> 2.236665e-01 -6.026885e-11 1.000000e+00 7.162773e-01 4.008586e-01
#> Eavg S:z S:p
#> 2.505908e-01 5.303483e-02 9.577042e-01
I have data such as this. I am running glm on all the Q variables.
dat <- read_table2("condition school Q5_3 Q6 Q7_1 Q7_2 Q7_3 Q7_4 Q13_1 Q13_2 Q13_3
0 A 1 1 1 1 1 1 0 1 1
1 B 1 0 0 NA NA NA NA 1 1
1 C 1 0 1 1 1 1 0 1 1
1 A 0 0 0 NA NA NA NA 1 1
1 B 1 0 0 NA NA NA NA 1 1
0 C 1 1 1 1 1 0 0 0 0
0 A 0 0 0 NA NA NA NA NA NA
0 B 1 1 1 1 1 1 1 1 1
0 C 1 1 0 NA NA NA NA 1 0
0 A 1 0 0 NA NA NA NA 1 0
0 B 1 0 1 1 0 1 1 NA NA
0 C 1 0 1 1 1 1 1 1 0
1 A 1 1 1 1 0 1 0 1 1
1 B 0 0 0 NA NA NA NA 1 1
0 C 1 0 0 NA NA NA NA NA NA
")
This is the loop that I am using to pull out the coefficients that I want.
# We only need the condition and school
# Apply
models <- function(x) {
  model1 <- glmer(x ~ (1|school) + condition, data = dat, family = binomial, na.action = na.exclude)
  return(model1)
}
y <- apply(dat[, -c(1, 2)], 2, models)
#Extract results
extract <- function(x) {
  z <- as.data.frame(summary(x)$coefficient)
  z$id <- rownames(z)
  z <- z[, c(dim(z)[2], 1:(dim(z)[2] - 1))]
  rownames(z) <- NULL
  return(z)
}
#Extract summary with function
DF <- as.data.frame(do.call(rbind,lapply(y,extract)))
#Format variables
DF$var <- gsub("\\..*","",rownames(DF))
#Arrange columns
DF_glm <- DF[,c(dim(DF)[2],1:(dim(DF)[2]-1))]
rownames(DF)<-NULL
This loop works fine, but I need to convert the output (coefficients) from log odds to probabilities. Any suggestions on how to do this?
Bad news: there's not really any sensible way to convert coefficients of a logistic regression (which are on the log-odds-ratio or logit scale) to a probability scale. The conversion from log-odds to probabilities depends on the baseline level, so to get probabilities you would have to make predictions of probabilities for specific cases: see e.g. this CrossValidated question.
Good news: exponentiating the coefficients gives you odds ratios, which are generally well understood and arguably easier to understand than the log-odds-ratio.
library(broom.mixed)
dd <- dat[, -c(1, 2)]
## find (and drop) columns with no variation
uu <- apply(dd, 2, function(x) length(unique(na.omit(x))))
## generate the list of models
modList <- apply(dd[, uu > 1], 2, models)
purrr::map_dfr(modList, tidy,
               effects = "fixed",
               exponentiate = TRUE, .id = "Q")
This gives you a table (tibble) with estimates on the odds-ratio scale, standard errors, p-values, etc. There are other options such as conf.int=TRUE if you want confidence intervals in the table. You can operate on it with any of the tidyverse tools (e.g. %>% filter(term=="condition") if you're not interested in the intercepts).
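For example (a small sketch reusing the modList from above), with confidence intervals and keeping only the condition rows:
library(dplyr)
purrr::map_dfr(modList, broom.mixed::tidy,
               effects = "fixed", exponentiate = TRUE,
               conf.int = TRUE, .id = "Q") %>%
  filter(term == "condition")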
Many of the answers in this example are kind of bogus, but that's because your data set is too small ... I hope your real data set is bigger than this ...
Explaining why you can't generally convert odds ratios to probabilities (without specifying a baseline) is really more of a statistical/CrossValidated question, but I'll give a short example based on the UCLA stats site
Import data: scale the predictor variables for GRE and GPA to get more interpretable parameter values.
library(tidyverse)
dd <- (haven::read_dta("https://stats.idre.ucla.edu/stat/stata/dae/binary.dta")
%>% mutate_at(c("gre","gpa"), ~drop(scale(.)))
)
Fit the model and extract coefficients
m <- glm(admit~gre+gpa, family=binomial, dd)
cc <- coef(m)
## (Intercept) gre gpa
## -0.8097503 0.3108184 0.2872088
Transforming: plogis() is the built-in R function for the inverse-logit (logistic) transformation.
Transforming the intercept parameter does make sense: it gives the predicted probability for an individual with baseline characteristics; since we have centered the predictors, this corresponds to an individual with the population mean GPA and GRE.
int_prob <- plogis(cc["(Intercept)"]) ## 0.307
We could also predict the probability for an individual with the mean GRE and a GPA one standard deviation above the mean (the units of the GPA parameter are "per standard deviation" because we scaled the GPA variable by its standard deviation):
gre_prob <- with(as.list(cc), plogis(`(Intercept)`+gre)) ## 0.3777
We could calculate the difference between these predictions, which is one way of specifying the effect of GRE on the probability scale:
gre_prob-int_prob ## 0.0698
However, it only applies for this particular comparison (an individual with mean GPA and GRE 1 SD above the mean compared to an individual with the mean GPA and GRE). The change in probability per unit GRE would be different if we started from a different baseline or made the prediction for a different GRE change.
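To see this concretely (a quick sketch using the same fitted coefficients cc), the same 1-SD increase in GRE produces a different change in probability if the baseline is instead an individual with a GPA 1 SD above the mean:
gpa_prob  <- with(as.list(cc), plogis(`(Intercept)` + gpa))        ## mean GRE, GPA +1 SD
both_prob <- with(as.list(cc), plogis(`(Intercept)` + gpa + gre))  ## GRE also +1 SD
both_prob - gpa_prob ## ~0.075, not the 0.0698 computed above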
You can logistic-transform the GRE coefficient if you want:
plogis(cc["gre"]) ## 0.577
What does this mean, though? It is the probability of success for an individual with a baseline log-odds of zero (which is not the individual with the average GPA and GRE) if you were then to increase their GRE by 1 standard deviation. Not something that's easy to explain ...
There are other rules of thumb/approximations for understanding the meaning of log-odds-ratios, e.g. the divide by 4 rule, but they all depend in some way on specifying a baseline level.
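For instance, the divide-by-4 rule says the change in probability per unit change in a predictor is at most the coefficient divided by 4 (the maximum being reached at a predicted probability of 0.5); a one-line sketch with the coefficients above:
cc["gre"]/4  ## ~0.078, an upper bound on the change in probability per 1-SD change in GRE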
You can try this. You will get warnings because of your data:
library(lme4)
# Preprocess data
# If you drop the NAs, some variables become constant, so the model
# cannot be fitted and will produce an error.
# I set NA to zero in order to get the models working.
# Please check your data.
dat[is.na(dat)] <- 0
# We only need the condition and school
# Apply
models <- function(x) {
  model1 <- glmer(x ~ (1|school) + condition, data = dat, family = binomial, na.action = na.exclude)
  return(model1)
}
y <- apply(dat[, -c(1, 2)], 2, models)
# Extract the results (log-odds estimates) and add odds and probabilities
extract <- function(x) {
  z <- as.data.frame(summary(x)$coefficient)
  z$id <- rownames(z)
  z <- z[, c(dim(z)[2], 1:(dim(z)[2] - 1))]
  z$odds <- exp(z$Estimate)
  z$prob <- z$odds / (1 + z$odds)
  rownames(z) <- NULL
  return(z)
}
#Extract summary with function
DF <- as.data.frame(do.call(rbind,lapply(y,extract)))
#Format variables
DF$var <- gsub("\\..*","",rownames(DF))
#Arrange columns
DF_glm <- DF[,c(dim(DF)[2],1:(dim(DF)[2]-1))]
rownames(DF)<-NULL
You will get this:
id Estimate Std. Error z value Pr(>|z|) odds
1 (Intercept) 2.079442e+00 1.060660e+00 1.960517e+00 0.04993534 8.000000e+00
2 condition -1.386294e+00 1.369306e+00 -1.012407e+00 0.31134363 2.500000e-01
3 (Intercept) -2.231436e-01 6.708203e-01 -3.326428e-01 0.73940393 8.000000e-01
4 condition -1.386294e+00 1.284523e+00 -1.079229e+00 0.28048568 2.500000e-01
5 (Intercept) 2.231436e-01 6.708203e-01 3.326428e-01 0.73940394 1.250000e+00
6 condition -9.162907e-01 1.095445e+00 -8.364553e-01 0.40289882 4.000000e-01
7 (Intercept) 2.231436e-01 6.708203e-01 3.326428e-01 0.73940394 1.250000e+00
8 condition -9.162907e-01 1.095445e+00 -8.364553e-01 0.40289882 4.000000e-01
9 (Intercept) -2.231436e-01 6.708204e-01 -3.326428e-01 0.73940397 8.000000e-01
10 condition -1.386294e+00 1.284523e+00 -1.079229e+00 0.28048583 2.500000e-01
11 (Intercept) -2.231436e-01 6.708204e-01 -3.326428e-01 0.73940395 8.000000e-01
12 condition -4.700036e-01 1.095445e+00 -4.290527e-01 0.66788485 6.250000e-01
13 (Intercept) -7.440587e-01 1.454336e+00 -5.116141e-01 0.60892109 4.751814e-01
14 condition -5.938497e+04 2.739708e+07 -2.167566e-03 0.99827053 0.000000e+00
15 (Intercept) 2.231435e-01 6.708204e-01 3.326427e-01 0.73940398 1.250000e+00
16 condition 3.442056e+01 1.351269e+07 2.547276e-06 0.99999797 8.884999e+14
17 (Intercept) -1.252763e+00 8.017837e-01 -1.562470e+00 0.11817732 2.857143e-01
18 condition 3.800559e+01 2.739708e+07 1.387213e-06 0.99999889 3.203452e+16
prob var
1 0.8888889 Q5_3
2 0.2000000 Q5_3
3 0.4444444 Q6
4 0.2000000 Q6
5 0.5555556 Q7_1
6 0.2857143 Q7_1
7 0.5555556 Q7_2
8 0.2857143 Q7_2
9 0.4444444 Q7_3
10 0.2000000 Q7_3
11 0.4444444 Q7_4
12 0.3846154 Q7_4
13 0.3221173 Q13_1
14 0.0000000 Q13_1
15 0.5555556 Q13_2
16 1.0000000 Q13_2
17 0.2222222 Q13_3
18 1.0000000 Q13_3
When I use the anova_test() function (from the rstatix package) to do a two-way repeated measures ANOVA, an error occurs:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases
I checked my data and there are no missing values.
BTW, in my data not everyone has outcomes at all 8 time points; some people have at most 3, some 8, and so on.
I referred to this website for the two-way repeated measures ANOVA:
https://www.datanovia.com/en/lessons/repeated-measures-anova-in-r/
I have uploaded my dataset to GitHub.
mydata :https://github.com/lizhiwei1994/testRepo/blob/master/mydata.csv
My code:
# load packages
library("tidyverse")
library("ggpubr")
library("rstatix")
# load data and check missing value
mydata <- read.csv(
url("https://raw.githubusercontent.com/lizhiwei1994/testRepo/master/mydata.csv")
) %>% convert_as_factor(id, time, treatment)
glimpse(mydata)
sum(is.na(mydata))
# error occurring
res.aov <- anova_test(
data = mydata, dv = outcome, wid = id,
within = c(treatment, time)
)
get_anova_table(res.aov)
For a repeated measures ANOVA you need complete observations for each combination of time point and treatment. Before you do the ANOVA, it's always good to check the observations:
tab = table(mydata$time,mydata$treatment,mydata$id)
#subject = 1
tab[,,"1"]
control2 treat2
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 0 0
So this subject has only control2 observations and no treat2 observations. If there is something wrong with the data entry for this subject, please correct it. Below is an example of when it will work:
test = expand.grid(id=1:2,time=1:8,treatment=c("a","b"))
test$outcome=rnorm(nrow(test))
table(test$time,test$treatment,test$id)
, , = 1
a b
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
, , = 2
a b
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
anova_test(data=test,dv=outcome,wid=id,within=c("treatment","time"))
ANOVA Table (type III tests)
Effect DFn DFd F p p<.05 ges
1 treatment 1 1 424.283 0.031 * 0.078
2 time 7 7 1.596 0.276 0.422
3 treatment:time 7 7 1.571 0.283 0.422
Given your dataset, and according to the formula you provided in the comment, aov(outcome ~ time*treatment + Error(id), data = mydata), this is a mixed ANOVA; but for that you need individuals who have undergone both treatment and control (to control for the individual effect), and this is absent from your dataset.
The only ANOVA you can perform in this situation is a two-way ANOVA:
anova_test(data=test,dv=outcome,between=c(time,treatment))
Coefficient covariances computed by hccm()
ANOVA Table (type II tests)
Effect DFn DFd F p p<.05 ges
1 time 7 16 1.668 0.187 0.422
2 treatment 1 16 1.350 0.262 0.078
3 time:treatment 7 16 1.668 0.187 0.422
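As an aside, a quick way to list every subject that breaks the repeated-measures design (a sketch, assuming mydata as read above) is to look for empty cells in each subject's time x treatment table:
tab <- table(mydata$time, mydata$treatment, mydata$id)
incomplete <- apply(tab, 3, function(m) any(m == 0))
names(incomplete)[incomplete]  # ids lacking observations in some time x treatment cell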
I would like to run a logistic regression with a specific group (range of values) of a categorical variable. I did the following steps:
1. I cut the variable into groups:
cut_Var3 <- cut(dat$Var3,breaks=c(0,3,6,9))
The output of table(cut_Var3) was as follows (cut_Var3 was turned into a factor):
# (0,3] (3,6] (6,9]
# 5 4 4
I want to do a logistic regression with another variable, but separately for the (3,6] level only,
so that I can run the regression on the 4 observations of the second group.
2. I tried to write this line of code (and also other variations):
ff <- glm( TargetVar ~ relevel(cut_Var3,3:6), data = dat)
but with no luck.
What should I do in order to run it properly?
attached is an example data set:
dat <- read.table(text = " TargetVar Var1 Var2 Var3
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6
0 0 0 8
0 0 1 5
1 1 1 4
0 0 1 2
1 0 0 9
1 1 1 2 ", header = TRUE)
For relevel you need to specify the level label exactly as it appears in the factor:
glm( TargetVar ~ relevel(cut_Var3,"(3,6]"), data = dat)
Call: glm(formula = TargetVar ~ relevel(cut_Var3, "(3,6]"), data = dat)
Coefficients:
(Intercept) relevel(cut_Var3, "(3,6]")(0,3]
0.75 -0.35
relevel(cut_Var3, "(3,6]")(6,9]
-0.50
Degrees of Freedom: 12 Total (i.e. Null); 10 Residual
(1 observation deleted due to missingness)
Null Deviance: 3.231
Residual Deviance: 2.7 AIC: 24.46
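Note that glm() defaults to family = gaussian(), so the call above is really a linear model; for the logistic regression the question asks about, add family = binomial() (same data, releveled the same way):
glm(TargetVar ~ relevel(cut_Var3, "(3,6]"), family = binomial(), data = dat)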
So, my data set consists of 15 variables; one of them (sex) has only 2 levels. I want to use it as a dummy variable, but the levels are 1 and 2. How do I do this? I want levels 0 and 1, but I don't know how to manage this in R!
With most of R's modelling tools that have a formula interface you don't need to create dummy variables; the underlying code that handles and interprets the formula will do this for you. If you want a dummy variable for some other reason, there are several options. The easiest (IMHO) is to use model.matrix():
set.seed(1)
dat <- data.frame(sex = sample(c("male","female"), 10, replace = TRUE))
model.matrix( ~ sex - 1, data = dat)
which gives:
> dummy <- model.matrix( ~ sex - 1, data = dat)
> dummy
sexfemale sexmale
1 0 1
2 0 1
3 1 0
4 1 0
5 0 1
6 1 0
7 1 0
8 1 0
9 1 0
10 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$sex
[1] "contr.treatment"
> dummy[,1]
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
You can use either column of dummy as a numeric dummy variable; choose whichever column you want the 1s to represent. dummy[,1] uses 1 for the female class and dummy[,2] uses 1 for the male class.
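For example (continuing from the model.matrix() call above; the column name sex_female is just illustrative), you could keep one of the columns in the data frame as the dummy:
dat$sex_female <- dummy[, "sexfemale"]  # 1 = female, 0 = male
head(dat)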
Cast this as a factor if you want it to be interpreted as a categorical object:
> factor(dummy[, 1])
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
Levels: 0 1
But that rather defeats the object of using a factor; what is 0 again?
Try this:
set.seed(001) # generating some data
sex <- factor(sample(1:2, 10, replace=TRUE)) # this is what you have
[1] 1 1 2 2 1 2 2 2 2 1
Levels: 1 2
sex<-factor(ifelse(as.numeric(sex)==2, 1,0)) # this is what you want
sex
[1] 0 0 1 1 0 1 1 1 1 0
Levels: 0 1
If you want labels to be 0 = Male and 1 = Female, then...
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
sex # this is what you want
[1] M M F F M F F F F M
Levels: M F
Actually, you don't need to create a dummy variable in order to estimate a model using lm; let's look at this example:
set.seed(001) # Generating some data
N <- 100
x <- rnorm(N, 50, 20)
y <- 20 + 3.5*x + rnorm(N)
sex <- factor(sample(1:2, N, replace=TRUE))
# Estimating the linear model
lm(y ~ x + sex) # using the first category as the baseline (this means sex==1)
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sex2
19.97815 3.49994 -0.02719
# renaming the categories and labelling them
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
lm(y ~ x + sex) # the same results, baseline is 'Male'
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sexF
19.97815 3.49994 -0.02719
As you can see, R deals with the dummies pretty well: you just pass them into the formula as a factor variable and R does the rest for you.
By the way, there's no need to change the categories from c(1,2) to c(0,1); the results will be the same, as you can see in the example above.
As suggested by many above, turn it into a factor.
If you really want to dummy-code the gender variable, consider this:
set.seed(100)
gender <- rbinom(100, 1, 0.5) + 1  # simulated sex variable coded 1/2
gender_dummy <- gender - 1         # recoded to 0/1
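And, following the "turn it into a factor" suggestion above, a quick sketch that labels the recoded values:
gender_factor <- factor(gender_dummy, levels = c(0, 1), labels = c("M", "F"))
table(gender, gender_factor)  # check the recoding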