I have data which looks like this:
df <- data.frame (
time = rep(c("2010", "2011", "2012", "2013", "2014"),4),
age = rep(c("40-44", "45-49", "50-54", "55-59", "60-64"),4),
weight = rep(c(0.38, 0.23, 0.19, 0.12, 0.08),4),
ethgp = rep(c(rep("M",5),rep("NM",5)),2),
gender = c(rep("M",10), rep("F",10)),
pop = round((runif(10, min = 10000, max = 99999)), digits = 0),
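count = round((runif(10, min = 1000, max = 9999)), digits = 0),
stringsAsFactors = TRUE # ethgp must be a factor for relevel(); no longer the default since R 4.0
)
library(dplyr)
df <- df %>%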
mutate(rate = count / pop,
asr_rate = (rate * weight)*100000,
asr_round = round(asr_rate, digits = 0))
First, I remove all rows that contain a zero value from the data frame:
df <- df[apply(df != 0, 1, all), ]
Then I fit a separate Poisson regression model for each sub-group in the data (age, gender, and year), comparing ethnic groups (M / NM). I want to generate rate ratios and CIs comparing M with NM for all sub-groups.
Poisson_test <- df %>% group_by(time, gender, age) %>%
do({model = glm(asr_round ~ relevel(ethgp, ref = 2), family = "poisson", data = .);
data.frame(nlRR_MNM = coef(model)[[2]], SE_MNM = summary(model)$coefficients[,2][2])})
This code works fine for the sample above.
When I run this code on my actual dataset, however, I get the following error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Because I have only one explanatory variable, ethgp, I assume this is the source of the error?
I tested whether there are levels in my data (not in the sample data):
str(M_NM_NZ$ethgp)
R responds: Factor w/ 2 levels "M","NM": 1 1 1 1 1 1 1 1 1 1 ...
I checked if there were NA values in the ethgp
sum(is.na(M_NM_NZ$ethgp))
R responds [1] 0
Are there other reasons I might be getting this error message?
I have seen the question "Error in contrasts when defining a linear model in R", but in that example it sounds like the explanatory variable is not in the correct format or has NA values. This is not the case in my data.
I don't understand the underlying problem which causes this error when a factor does have more than one level.
In this instance I fixed the issue by converting the ethgp variable into a numeric variable.
df <- df %>%
mutate(ethnum = ifelse(ethgp == "M", 1, 0))
And then running the regressions using ethnum as the explanatory variable.
Poisson <- df %>% group_by(time, gender, age) %>%
do({model = glm(asr_round ~ ethnum, family = "poisson", data = .);
data.frame(nlRR_MNM = coef(model)[[2]], nlUCI = confint(model)[2,2], nlLCI = confint(model)[2,1])})
Poisson <- mutate(Poisson,
RR_MNM = round(exp(nlRR_MNM),digits = 3),
UCI = round(exp(nlUCI),digits = 3),
LCI = round(exp(nlLCI),digits = 3))
This code also computes the upper and lower 95% confidence intervals for each rate ratio.
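One possible cause, offered as an assumption since the real data are not shown: glm() drops unused factor levels when it builds the model frame, so a time/gender/age group that contains only one ethnic group leaves ethgp with a single level, even though the full column has two. A minimal sketch of how such groups could be found, using a made-up `toy` data frame (all names and values below are hypothetical):

```r
# Toy data (hypothetical): the 2011 group contains only ethgp == "M"
toy <- data.frame(
  time   = c("2010", "2010", "2011", "2011"),
  gender = "F",
  age    = c("40-44", "40-44", "45-49", "45-49"),
  ethgp  = factor(c("M", "NM", "M", "M"), levels = c("M", "NM")),
  y      = c(3, 5, 4, 6)
)

# Count the distinct ethnic groups present in each time/gender/age group
n_levels <- aggregate(
  ethgp ~ time + gender + age,
  data = toy,
  FUN = function(x) length(unique(x))
)
names(n_levels)[names(n_levels) == "ethgp"] <- "n_ethgp"

# Groups with fewer than two ethnic groups cannot support an M vs NM contrast
single_level <- n_levels[n_levels$n_ethgp < 2, ]
print(single_level)

# Fitting on such a group raises an error; the message is captured, not thrown
msg <- tryCatch(
  {
    glm(y ~ ethgp, family = "poisson", data = toy[toy$time == "2011", ])
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

If this is what happens in the real data, those groups could be filtered out before fitting. The numeric 0/1 recoding avoids the error, but in such a group the ethnicity coefficient would presumably not be estimable.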
I am trying to compare (with a statistical test) the probability of detection across a categorical covariate with 3 levels, say A, B and C, in a very simple occupancy model. I am interested in the 3 pairwise comparisons. Is there any way to get this result from the unmarked package?
I decided to circumvent this problem by changing the reference level in order to use the test provided by the summary output of the fitted occupancy model. However, even though my covariates are specified as factors, the function unmarkedFrameOccu does not recognize them as such and I get the following warning: Warning message:
obsCovs contains characters. Converting them to factors.
As a consequence, the conversion is done by unmarked and I have no way to control the reference level.
Do you have any idea what the issue might be and why unmarkedFrameOccu does not recognize my factor? I have added a reproducible example below. I found a way to deal with my main issue by recoding the categorical variable as 1/0, but I am still curious about this warning.
Thanks for your help and time and here is the example
Marie
library(tidyverse)
library(unmarked)
n_occas <- 6
n_sites <- 10
p_detect <- c(0.3, 0.8, 0.5)
cov <- matrix(factor( sample(c("A", "B", "C"), size = 60, replace = TRUE)), ncol = n_occas, nrow= n_sites)
sites <- sample(c(0,1), size = n_sites, replace = TRUE)
## generate actual detection according to presence state
y <- cov %>% as_tibble %>%
add_column(truth = sites) %>%
mutate(id = 1 : n()) %>%
pivot_longer(cols = c(-truth, -id), names_to = "occas", values_to = "cov") %>%
mutate(detection = case_when(cov == "A" ~ truth * (runif(n = n()) < p_detect[1]), # n() draws: one per row, not one shared draw
cov == "B" ~ truth * (runif(n = n()) < p_detect[2]),
cov == "C" ~ truth * (runif(n = n()) < p_detect[3])
)
) %>%
select(-cov) %>%
pivot_wider(names_from = occas, values_from = detection, values_fill = NA) %>%
select(-truth, -id)
unmarkedFrameOccu(y = y,
obsCovs = list(cov1 = cov))
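A likely reason for the warning is that matrix() cannot hold a factor: it keeps only the labels, so `cov` is already a character matrix before it reaches unmarked. The sketch below shows this in base R and builds a long-format alternative that keeps the chosen level order. The names `cov_fac`, `cov_mat` and `obs_long` are made up for illustration.

```r
set.seed(1)
n_sites <- 10
n_occas <- 6

# A factor with an explicit level order ("B" first, so "B" would be the reference)
cov_fac <- factor(
  sample(c("A", "B", "C"), size = n_sites * n_occas, replace = TRUE),
  levels = c("B", "A", "C")
)

# matrix() keeps only the labels: the result is a character matrix
cov_mat <- matrix(cov_fac, nrow = n_sites, ncol = n_occas)
print(is.factor(cov_mat))
print(typeof(cov_mat))

# Long format: one row per site-occasion, with sites varying slowest.
# t() is needed because matrices are stored column by column.
obs_long <- data.frame(
  cov1 = factor(as.vector(t(cov_mat)), levels = c("B", "A", "C"))
)
print(levels(obs_long$cov1))
```

Assuming unmarkedFrameOccu accepts obsCovs as a data frame with one row per site-occasion in site-major order, as its help page describes, passing `obsCovs = obs_long` may preserve the reference level. This has not been checked against a specific unmarked version.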
I am trying to simulate how replacement/reassignment of values in random samples affects predictions, as conveyed by AUC.
I have a tumor classification in a dataframe denoted df$who which has levels 1, 2, 3 corresponding to the severity of the tumor lesion.
Intro to the question
Let's say the baseline data looks like this:
set.seed(1)
df <- data.frame(
who = as.factor(sample(1:3, size = 6000, replace = TRUE, prob = c(0.8, 0.15, 0.05))),
age = round(runif(n = 6000, min = 18, max = 95), digits = 1),
gender = sample(c("m", "f"), size = 6000, replace = TRUE, prob = c(1/3, 2/3)),
event.time = runif(n = 6000, min = 8, max = 120),
event = as.factor(sample(0:2, size = 6000, replace = TRUE, prob = c(0.25, 0.2, 0.55)))
)
And a standard cause-specific Cox regression looks like:
library(survival)
a_baseline <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df, x = TRUE)
From this, the AUC can be obtained as a measure of predictive performance, here using leave-one-out bootstrap for the 5-year prediction of df$event == 1.
library(riskRegression)
u <- Score(list("baseline" = a_baseline),
Surv(event.time, event == 1) ~ 1,
data = df,
times = 60,
plots = "cal",
B = 50,
split.method = "loob",
metrics = c("auc", "brier")
)
# The AUC is then obtained
u$AUC$score$AUC[2]
Question
I want to simulate how re-classifying a random 5% of df$who == 1 to df$who == 2 affects the 5-year prediction of df$event == 1.
I want to create 10 separate simulated copies of the baseline data df, each containing a random re-allocation of 5% of df$who == 1 to df$who == 2. Then, I want to use each of these 10 simulated datasets to predict the 5-year risk of df$event == 1.
I have applied a for loop to this. The expected output is a dataframe that tells me which of the 10 simulated datasets yielded the highest and lowest u$AUC$score$AUC[2] (i.e., the best and worst prediction).
I am new to for loops, but here is my attempt (which obviously did not work).
all_auc <- data.frame() ## create a dataframe to fill in AUC from all 10 simulated sub-datasets
for(i in 1:10){ #1:10 represent the simulated datasets from 1 to 10
df[i] <- df #allocating baseline data to each of the 10 datasets
df[i]$who[sample(which(df[i]$who==1), round(0.05*length(which(df[i]$who==1))))]=2 #create the random 5% allocation of who==1 to who==2 in the i'th simulated dataset
ith_cox <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df[i], x = TRUE) #create the i'th Cox regression based on the i´th dataset
# create the predictions based on the i´th Cox
u[i] <- Score(list("baseline" = ith_cox),
Surv(event.time, event == 1) ~ 1,
data = df[i],
times = 60,
plots = "cal",
B = 50,
split.method = "loob",
metrics = c("auc", "brier")
)
# summarize all AUC from all 10 sub-datasets
all_auc <- u[i]$AUC$score$AUC[2]
}
(1) I could not get this for loop to work as described, and
(2) the final dataframe all_auc should provide only which of the 10 datasets yielded the worst and best predictions (I will then use these two data sets for further analysis).
A final note
This is only a reproducible example. The for loop will be applied to 10,000 simulated datasets in our analysis. I do not know if this could affect the answer, but it illustrates the importance of the result: a dataframe (or vector?) that simply tells me which simulated dataset yielded the best vs worst predictions, so that I can subsequently use these two dataframes for further analysis, e.g. df2930 and df8939.
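One way the loop could be restructured, as a sketch rather than a definitive answer: `df[i]` selects a column of `df`, not a new dataset, so the simulated data are better built inside a function and scored one at a time. The helper names `reclassify_who`, `run_simulations`, `rebuild` and `auc_5yr` are made up here. Each simulation gets its own seed, so with 10,000 runs only the AUC values need to be stored and the best and worst datasets can be rebuilt afterwards.

```r
# Hypothetical helper: move a random fraction of who == 1 to who == 2
reclassify_who <- function(data, frac = 0.05) {
  ones <- which(data$who == 1)
  flip <- sample(ones, size = round(frac * length(ones)))
  data$who[flip] <- "2"  # "2" is already a level of the factor
  data
}

# score_fun is any function(data) that returns one number (e.g. the AUC).
# Seeding each run makes every simulated dataset reproducible.
run_simulations <- function(data, n_sim, score_fun, frac = 0.05) {
  auc <- vapply(seq_len(n_sim), function(seed) {
    set.seed(seed)
    score_fun(reclassify_who(data, frac))
  }, numeric(1))
  data.frame(sim = seq_len(n_sim), auc = auc)
}

# Rebuild the simulated dataset that belongs to a given seed
rebuild <- function(data, seed, frac = 0.05) {
  set.seed(seed)
  reclassify_who(data, frac)
}

# Scoring function copied from the question; it needs library(survival) and
# library(riskRegression) attached, so it is defined here but not called.
auc_5yr <- function(data) {
  fit <- coxph(Surv(event.time, event == 1) ~ who + age + gender,
               data = data, x = TRUE)
  u <- Score(list("baseline" = fit),
             Surv(event.time, event == 1) ~ 1,
             data = data, times = 60, B = 50,
             split.method = "loob", metrics = c("auc", "brier"))
  u$AUC$score$AUC[2]  # same extraction as in the question
}

# Intended use (not run here):
# all_auc <- run_simulations(df, n_sim = 10, score_fun = auc_5yr)
# best    <- rebuild(df, all_auc$sim[which.max(all_auc$auc)])
# worst   <- rebuild(df, all_auc$sim[which.min(all_auc$auc)])
```

Because set.seed() is called immediately before the re-classification, `rebuild()` returns exactly the dataset that was scored for that seed.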
I'm using gtsummary::tbl_uvregression to construct a univariate regression table modeled with geepack::geeglm.
Some variables contain NA values that are breaking the GEE models in the table.
x There was an error constructing model geepack::geeglm(formula = tts_participant ~ omb_race, data = ., family = poisson, id = School Name, corstr = "independence", scale.fix = TRUE) See error below.
Error in mutate_cols():
! Problem with mutate() column model.
i model = map(...).
x Error in geese.fit(xx, yy, id, offset, soffset, w, waves = waves, zsca, : nrow(zsca) and length(y) not match
Caused by error:
! Error in geese.fit(xx, yy, id, offset, soffset, w, waves = waves, zsca, : nrow(zsca) and length(y) not match
You would generally set data = na.omit(data) for geepack::geeglm. However, I only want to remove NAs for the variables that I'm looking at in the table. na.action = na.omit does not work. I would like to avoid computing a gtsummary::tbl_uvregression table for each individual variable.
How do I instruct tbl_uvregression to remove the NAs only for the specific model being generated? Here is my attempt:
tbl_uvregression(
method = geepack::geeglm,
y = tts_participant,
include = -`School Name`,
method.args = list(
family = poisson,
data = na.omit(data),
#na.action = na.omit,
id = `School Name`,
corstr = "independence",
scale.fix = TRUE
),
exponentiate = TRUE,
add_estimate_to_reference_rows = FALSE
)
Thank you!
geepack::geeglm() is funny in that it doesn't handle the NA values for us. When I've come across this, I've written a small wrapper function for geeglm() that deletes the missing values before passing the data to geeglm(). Example below! Happy Programming!
library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.5.2'
my_geeglm <- function(formula, data, id, ...) {
# capture id input (since it's unquoted)
id <- rlang::enexpr(id)
# keep complete cases among the variables needed in the model
data <-
select(data, all_of(all.vars(formula)), !!id) %>%
dplyr::filter(complete.cases(.))
# build GEE model
rlang::inject(
geepack::geeglm(
formula = formula,
data = data,
id = !!id, # inserting unquoted id column name
...
)
)
}
data(dietox, package = "geepack")
dietox %>%
select(Pig, Weight, Cu, Feed) %>%
tbl_uvregression(
y = Weight,
method = my_geeglm,
method.args = list(id = Pig, family = poisson("identity"), corstr = "ar1"),
include = -Pig
) %>%
as_kable() # convert to kable to show in SO
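| Characteristic | N   | Beta  | 95% CI     | p-value |
|:---------------|:---:|:-----:|:----------:|:-------:|
| Cu             | 861 |       |            |         |
| Cu000          |     | —     | —          |         |
| Cu035          |     | -0.49 | -3.5, 2.5  | 0.7     |
| Cu175          |     | 1.8   | -1.9, 5.5  | 0.3     |
| Feed           | 789 | 0.43  | 0.42, 0.45 | <0.001  |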
Created on 2022-02-09 by the reprex package (v2.0.1)
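The filtering step inside the wrapper can be checked on its own. Below is a base R sketch of the same idea: rows are dropped only when a variable used by the current formula (or the id column) is missing, so each univariate model keeps as many rows as it can. The helper `complete_for_model` and the `toy` data are made up for illustration.

```r
# Hypothetical helper: keep rows that are complete for the model variables
# plus the id column, and leave all other columns out of the check
complete_for_model <- function(formula, data, id_col) {
  needed <- unique(c(all.vars(formula), id_col))
  data[complete.cases(data[, needed, drop = FALSE]), , drop = FALSE]
}

toy <- data.frame(
  y      = c(1, 2, 3, 4),
  x1     = c(NA, 1, 0, 1),
  x2     = c(5, NA, NA, 8),
  school = c("a", "a", "b", "b")
)

# Each model loses only the rows missing in its own covariate
n_x1 <- nrow(complete_for_model(y ~ x1, toy, "school"))
n_x2 <- nrow(complete_for_model(y ~ x2, toy, "school"))
print(c(x1 = n_x1, x2 = n_x2))
```

By contrast, na.omit(data) on the whole data frame would keep only the single row that is complete in every column.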
I have a data frame "customers" consisting of customer id, month, and total purchases that month.
I'm trying to calculate a running slope for a window of 12 months using robust regression.
I have tried the following:
Coef <- function(x) {return(rlm(cbind(x)~cbind(1:length(x)))$coefficients[2])}
customer_slope = customers %>% mutate(slope = runner(x=total_purchases,k=12,f=Coef))
I get the following error:
x 'x' is singular: singular fits are not implemented in 'rlm'
If I run a single example, the function returns what I've expected:
Coef(c(4,11,7,15,5,14,8,9,14,17,14,13))
cbind(1:length(x))
0.6888112
I ran into similar problems and finally arrived at the solution below using slider. It provides a rolling estimate over a 3-row window (of course you can change this as you see fit). It doesn't quite get to your answer (which you could probably get with loops), but most of the way there.
library(MASS)
library(dplyr)
library(slider)
dat <- tibble::tibble(customers = c(4,11,7,15,5,14,8,9,14,17,14,13)) %>%
mutate(t = 1:n() %>% as.numeric())
dat %>%
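mutate(results = slide_dbl(.x = .,
# k = 12 removed: rlm() appears to forward it to the Huber psi function as a tuning constant, not a window size
.f = ~rlm(customers ~ t, data = .x)$coefficients[2],
.before = 2, # current row plus the 2 rows before it
.complete = TRUE))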
It looks like that's the way to go, thanks!
It seems that what caused the singularity was that I didn't change the default .complete from FALSE to TRUE.
So, combined with your suggestion, this is how I made it work (it took about two hours for 3M rows; I did, however, have a more complex group_by involved, which is not shown below):
slope_rlm <- function(x) {
x=as.numeric(x)
prep = tibble(data=x)%>%mutate(t=1:n()%>%as.numeric())
return(rlm(data~t,data=prep)$coefficients[2])
}
customers_rlm = customers %>%
mutate(cust_rlm_12=slide_dbl(total_purchases,slope_rlm,.before=11,.complete=T))
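To see why incomplete windows matter, here is a base R sketch of the same rolling slope without slider or runner. It assumes the singular fits come from the first windows, which hold too few points to estimate both an intercept and a slope. The helpers `window_slope` and `rolling_slope` are hypothetical names, not part of any package.

```r
library(MASS)  # rlm()

# Hypothetical helper: robust slope over one window, NA if the window is too short
window_slope <- function(y, min_obs = 3) {
  if (length(y) < min_obs) return(NA_real_)
  idx <- seq_along(y)
  unname(coef(rlm(y ~ idx))[2])
}

# Trailing windows of up to k observations ending at each position.
# With complete = TRUE, positions with fewer than k observations give NA.
rolling_slope <- function(y, k = 12, complete = TRUE) {
  vapply(seq_along(y), function(i) {
    if (complete && i < k) return(NA_real_)
    window_slope(y[max(1, i - k + 1):i])
  }, numeric(1))
}

purchases <- c(4, 11, 7, 15, 5, 14, 8, 9, 14, 17, 14, 13)
slopes <- rolling_slope(purchases, k = 12)
print(slopes)
```

Only the last position has a full 12-month window here, so it is the only non-missing slope, and it should match the single-example result from the question.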
Consider data for two customers spanning 1000 days. total_purchases is accumulated per customer, and each purchase size is ~pois(5).
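library(MASS) # rlm(); loaded first so it does not mask dplyr::select
library(dplyr)
library(runner)
library(ggplot2)
set.seed(1)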
customers <- data.frame(
id = factor(rep(1:2, length.out = 100)),
date = seq(Sys.Date(), Sys.Date() + 1000, length.out = 100)
) %>%
group_by(id) %>%
mutate(
total_purchases = cumsum(rpois(n(), lambda = 5))
)
When calculating a regression in a rolling window, make sure that you handle errors coming from insufficient degrees of freedom, singularity, etc. That is why I've put tryCatch around the rlm call: if there is any error, the function returns NA for the failing window.
The data below is grouped by id, which means that the model is calculated per customer. The yearly rolling regression should converge to slope = 5 (+/- random error).
customers <- customers %>% # assign the result so that slope is available for the plot
group_by(id) %>%
mutate(
slope = runner(
x = .,
f = function(x) {
tryCatch(
rlm(x$total_purchases ~ seq_len(nrow(x)))$coefficients[2],
error = function(e) NA
)
},
idx = "date",
k = "year"
)
)
Plotting slope in time for customers
ggplot(customers, aes(x = date, y = slope, color = id, group = id)) +
geom_line() +
geom_hline(yintercept = 5, color = "red")
While using the package msm, I am currently getting the error:
*** caught segfault ***
address 0x7f875be5ff48, cause 'memory not mapped'
when I introduce a covariate to my model. Previously, I had resolved a similar error by converting my response variable from a factor to a numeric variable. This however does not resolve my current issue.
The data <- https://www.dropbox.com/s/wx6s4liofaxur0v/data_msm.txt?dl=0
library(msm)
#number of transitions between states
#1: healthy; 2: ill; 3: dead; 4: censor
statetable.msm(state_2, id, data=dat.long)
#setting initial values
q <- rbind(c(0, 0.25, 0.25), c(0.25, 0, 0.25), c(0, 0, 0))
crudeinits <- crudeinits.msm(state_2 ~ time, subject=id, data=dat.long, qmatrix=q, censor = 4, censor.states = c(1,2))
#running model without covariates
(fm1.msm <- msm(state_2 ~ time, subject = id, qmatrix = crudeinits, data = dat.long, death = 3, censor = 4, censor.states = c(1,2)))
#running model with covariates
(fm2.msm <- msm(state_2 ~ time, subject = id, qmatrix = crudeinits, data = dat.long, covariates = ~ gender, death = 3, censor = 4, censor.states = c(1,2)))
Alternatively, I can run the models with covariates if I set the state values dead and censor (3 & 4) to missing.
#set death and censor to missing
dat.long$state_2[dat.long$state_2 %in% c(3,4)] <- NA
statetable.msm(state_2, id, data=dat.long)
#setting initial values
q <- rbind(c(0, 0.5), c(0.5, 0))
crudeinits <- crudeinits.msm(state_2 ~ time, subject=id, data=dat.long, qmatrix=q)
#running models with covariates
(fm3.msm <- msm(state_2 ~ time, subject = id, qmatrix = crudeinits, data = dat.long, covariates = ~ gender))
(fm4.msm <- msm(state_2 ~ time, subject = id, qmatrix = crudeinits, data = dat.long, covariates = ~ covar))
Thanks for your help
In version 1.5 of msm, there's an error in the R code that detects and drops NAs in the data. This is triggered when there are covariates, and the state or time variable contains NAs. Those NAs can then be passed through to the C code that computes the likelihood, causing a crash. I'll fix this for the next version. In the meantime you can work around it by dropping NAs from the data before calling msm.
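The workaround can be sketched in base R as follows. The `toy` data frame stands in for dat.long and the column list is an assumption about which variables the model uses. Rows with a missing state, time or covariate are removed before the data reach msm.

```r
# Toy stand-in for dat.long (hypothetical values)
toy <- data.frame(
  id      = c(1, 1, 1, 2, 2, 2),
  time    = c(0, 1, 2, 0, 1, 2),
  state_2 = c(1, 2, NA, 1, NA, 2),
  gender  = c("f", "f", "f", "m", "m", "m")
)

# Columns used by the model: subject, time, state and covariates
model_cols <- c("id", "time", "state_2", "gender")

# Keep only rows with no NA in any of those columns
dat_clean <- toy[complete.cases(toy[, model_cols]), ]
print(dat_clean)

# The cleaned data would then be passed on, e.g.
# msm(state_2 ~ time, subject = id, data = dat_clean, covariates = ~ gender, ...)
```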