I've been working recently with CDISC data that is structured with pre-specified column names and certain expectations for the way survival data are coded.
I want to write a wrapper for survival::Surv() that uses the structured data of the CDISC format. I have a function that is working in most scenarios, but I can't get it to work with survival::coxph().
How can I get my Surv() wrapper function to use default values and work in coxph()? Below are examples using the visR::adtte data set (data set ships with the visR package...install with devtools::install_github("openpharma/visR")), which is in CDISC format. All examples run without issue except the last one.
Surv_CDISC <- function(AVAL, CNSR) {
# set default values if not passed by user -----------------------------------
if (missing(AVAL) && exists("AVAL", envir = rlang::caller_env()))
AVAL <- get("AVAL", envir = rlang::caller_env())
else if (missing(AVAL))
stop("Default 'AVAL' value not found. Specify argument in `Surv_CDISC(AVAL=)`.")
if (missing(CNSR) && exists("CNSR", envir = rlang::caller_env()))
CNSR <- get("CNSR", envir = rlang::caller_env())
else if (missing(CNSR))
stop("Default 'CNSR' value not found. Specify argument in `Surv_CDISC(CNSR=)`.")
# pass args to `survival::Surv()` --------------------------------------------
survival::Surv(time = AVAL, event = 1 - CNSR)
}
# passing the arguments, everything works
with(visR::adtte, Surv_CDISC(AVAL = AVAL, CNSR = CNSR)) |> head()
#> [1] 2 3 3 28+ 58 46+
# letting the arguments use default value, everything still works
with(visR::adtte, Surv_CDISC()) |> head()
#> [1] 2 3 3 28+ 58 46+
# using function in model.frame() and defining argument values, everything works
model.frame(Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte) |> head(n = 2)
#> Surv_CDISC(AVAL, CNSR) SEX
#> 1 2 F
#> 2 3 M
# using function in model.frame() with default arguments, everything works
model.frame(Surv_CDISC() ~ SEX, data = visR::adtte) |> head(n = 2)
#> Surv_CDISC() SEX
#> 1 2 F
#> 2 3 M
# using function in survfit() and defining argument values, everything works
survival::survfit(Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte)
#> Call: survfit(formula = Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte)
#>
#> n events median 0.95LCL 0.95UCL
#> SEX=F 143 80 64 47 96
#> SEX=M 111 72 41 30 57
# using function in survfit() with default arguments, everything works
survival::survfit(Surv_CDISC() ~ SEX, data = visR::adtte)
#> Call: survfit(formula = Surv_CDISC() ~ SEX, data = visR::adtte)
#>
#> n events median 0.95LCL 0.95UCL
#> SEX=F 143 80 64 47 96
#> SEX=M 111 72 41 30 57
# using function in coxph() and defining argument values, everything works
survival::coxph(Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte)
#> Call:
#> survival::coxph(formula = Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte)
#>
#> coef exp(coef) se(coef) z p
#> SEXM 0.3147 1.3699 0.1626 1.935 0.053
#>
#> Likelihood ratio test=3.71 on 1 df, p=0.05412
#> n= 254, number of events= 152
# DOES NOT WORK TRYING TO RELY ON DEFAULT VALUES
survival::coxph(Surv_CDISC() ~ SEX, data = visR::adtte)
#> Error in x[[2]]: subscript out of bounds
Created on 2022-06-05 by the reprex package (v2.0.1)
This looks like a bug in the survival package, or maybe a mis-use of it (I'm not so familiar with the internals).
EDITED TO ADD A COMMENT:
I think the analysis below is wrong. Reading the code more carefully, I think the current code in the survival package won't work reliably unless you use the explicit Surv(AVAL, CNSR) in the formula.
HERE'S THE ANALYSIS THAT APPEARED TO WORK, BUT I DON'T TRUST IT:
The problem is that survival:::terms.inner looks specifically for a function named Surv, here: https://github.com/therneau/survival/blob/b5238a42867a931954cf222b871a7b3a1c2fcd24/R/xtras.R#L65 . Since your function has a different name, it's not handled as if it is the same thing.
You could fix this by naming your function Surv as well. When I do that, things appear to work. Of course, this may cause problems elsewhere when you want the original Surv without the survival:: prefix, but I don't know a way to fix that.
I'd still worry about using caller_env(). Here's how I'd create your fake Surv:
make_surv_CDISC <- function(defaults) {
force(defaults)
function(AVAL = defaults$AVAL,
CNSR = defaults$CNSR) {
# pass args to `survival::Surv()` --------------------------------------------
survival::Surv(time = AVAL, event = 1 - CNSR)
}
}
Surv <- make_surv_CDISC(visR::adtte)
This is less general than yours, but I think it's safer.
Related
I am trying to run a post hoc analysis on an unbalanced two way anova using the anova_test funciton in the rstatix package. I need to run this post hoc test iteratively, as I have ~26 response (y) variables. My first step is to create models of all my y variables with relation to group and treatment. I have successfully managed to do this, creating a single list with 26 models:
models <- map(data[,y1:y26], ~(lm(.x ~data$group*data$treatment)))
Now comes the part I'm stuck on. Referring to these models iteratively. I would like to run the following code for every y variable I have:
group_by(group) %>%
anova_test(y ~ treatment, error = models(y), type = 3)
where my y changes every time and as it does, the "model" (referred to in the error = term) is updated accordingly. I'm struggling with this bit since first set of models I make is used to inform the second set of models.
However, if I run just one y variable through this whole bit of code at one time, I get the appropriate results.
model <- lm(y ~ group*treatment, data = data)
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
I have tried creating a for loop as well as using the map function in the purrr package but I have been unsuccessful. I am new to for loops and purrr so I am sure it's a simple fix I just can't see it.
Basically I want a way to run
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
iteratively for different y variables (y1, y2, ..., y26) while also referring to the approprite model (model$y1, model$y2, ..., model$26).
Thanks for your help!
Well you didn't give any data so let's use toothgrowth. You seem to like the model format, so let's build a list of models. You could do this in an automated fashion but to make it clear lets do it by hand. The call purrr::map with the anova_test function. You'll get a list back. Since you're in charge of naming the list elements go to town.
Updated answer May 18th. Now using map2 since you want two different models passed build a list for each...
library(rstatix)
library(purrr)
ToothGrowth$len2 <- ToothGrowth$len^2 # for variety
models <- list(model1 = lm(len ~ supp*dose, ToothGrowth),
model2 = lm(len ~ dose*supp, ToothGrowth),
model3 = lm(len2 ~ dose*supp, ToothGrowth),
model4 = lm(len2 ~ supp*dose, ToothGrowth))
models2 <- list(model1 = lm(len ~ supp, ToothGrowth),
model2 = lm(len ~ dose, ToothGrowth),
model3 = lm(len2 ~ dose, ToothGrowth),
model4 = lm(len2 ~ supp, ToothGrowth))
# one model
purrr::map(models, ~ anova_test(.x, type = 3))
# now with model for error term
purrr::map2(models, models2, ~ anova_test(.x, error = .y, type = 3))
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> $model1
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 4.058 0.049000 * 0.065
#> 2 dose 1 58 12.717 0.000734 * 0.180
#> 3 supp:dose 1 58 1.588 0.213000 0.027
#>
#> $model2
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 33.626 2.92e-07 * 0.367
#> 2 supp 1 58 10.729 2.00e-03 * 0.156
#> 3 dose:supp 1 58 4.200 4.50e-02 * 0.068
#>
#> $model3
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 36.028 1.35e-07 * 0.383
#> 2 supp 1 58 7.128 1.00e-02 * 0.109
#> 3 dose:supp 1 58 2.709 1.05e-01 0.045
#>
#> $model4
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 2.684 0.107000 0.044
#> 2 dose 1 58 13.566 0.000508 * 0.190
#> 3 supp:dose 1 58 1.020 0.317000 0.017
Thanks to Nirgrahamuk from the rstudio community forum for this answer:
map(names(models_1) ,
~ anova_test(data=group_by(df,edge),
formula = as.formula(paste0(.x,"~ trt")),
error = models_1[[.x]],
type = 3))
(see their full answer at: https://community.rstudio.com/t/trouble-using-group-by-and-map2-together/66730/8?u=mvula)
Created on 2020-05-20 by the reprex package (v0.3.0)
I'm writing a function for my (working) R script in order to clean up my code. I do not have experience with writing functions, but decided I should invest some time into this. The goal of my function is to perform multiple statistical tests while only passing the required dataframe, quantitative variable and grouping variable once. However, I cannot get this to work. For your reference, I'll use the ToothGrowth data frame to illustrate my problem.
Say I want to run a Kruskal-Wallis test and one-way ANOVA on len, to compare different groups named supp, for whatever reason. I can do this separately with
kruskal.test(len ~ supp, data = ToothGrowth)
aov(len ~ supp, data = ToothGrowth)
Now I want to write a function that performs both tests. This is what I had thought should work:
stat_test <- function(mydata, quantvar, groupvar) {
kruskal.test(quantvar ~ groupvar, data = mydata)
aov(quantvar ~ groupvar, data = mydata)
}
But if I then run stat_test(ToothGrowth, "len", "sup"), I get the error
Error in kruskal.test.default("len", "supp") :
all observations are in the same group
What am I doing wrong? Any help would be much appreciated!
You can use deparse(substitute(quantvar)) to get the quoted name of the column you are passing to the function, and this will allow you to build a formula using paste. This is a more idiomatic way of operating in R.
Here's a reproducible example:
stat_test <- function(mydata, quantvar, groupvar) {
A <- as.formula(paste(deparse(substitute(quantvar)), "~",
deparse(substitute(groupvar))))
print(kruskal.test(A, data = mydata))
cat("\n--------------------------------------\n\n")
aov(A, data = mydata)
}
stat_test(ToothGrowth, len, supp)
#>
#> Kruskal-Wallis rank sum test
#>
#> data: len by supp
#> Kruskal-Wallis chi-squared = 3.4454, df = 1, p-value = 0.06343
#>
#>
#> --------------------------------------
#> Call:
#> aov(formula = A, data = mydata)
#>
#> Terms:
#> supp Residuals
#> Sum of Squares 205.350 3246.859
#> Deg. of Freedom 1 58
#>
#> Residual standard error: 7.482001
#> Estimated effects may be unbalanced
Created on 2020-03-30 by the reprex package (v0.3.0)
It looks like you need to convert your variable arguments, given as text strings, into a formula. You can do this by concatenating the strings with paste(). Also, you will need to wrap print() around both of your statistical tests within the function, otherwise only the last one will display.
Here is the modified function:
stat_test <- function(mydata, quantvar, groupvar) {
model_formula <- formula(paste(quantvar, '~', groupvar))
print(kruskal.test(model_formula, data = mydata))
print(aov(model_formula, data = mydata))
}
For reference, if using rstatix (tidy version of statistical functions), you need to use sym and !!, while using formula() when needed.
make_kruskal_test <- function(data, quantvar, groupvar) {
library(rstatix, quietly = TRUE)
library(rlang, quietly = TRUE)
formula_expression <- formula(paste(quantvar, "~", groupvar))
quantvar_sym <- sym(quantvar)
shapiro <- shapiro_test(data, !!quantvar_sym) %>% print()
}
sample_data <- tibble::tibble(sample = letters[1:5], mean = 1:5)
make_kruskal_test(sample_data, "mean", "sample")
#> # A tibble: 1 x 3
#> variable statistic p
#> <chr> <dbl> <dbl>
#> 1 mean 0.987 0.967
I have the following code:
x <- c(
0.367141764080875, 0.250037975705769, 0.167204185003365, 0.299794433447383,
0.366885973041269, 0.300453205296379, 0.333686861081341, 0.33301168850398,
0.400142004893329, 0.399433677388411, 0.366077304765104, 0.166402979455671,
0.466624230750293, 0.433499934139897, 0.300017278751768, 0.333673696762895,
0.29973685692478
)
fn <- fitdistrplus::fitdist(x,"norm")
summary(fn)
#> Fitting of the distribution ' norm ' by maximum likelihood
#> Parameters :
#> estimate Std. Error
#> mean 0.32846024 0.01918923
#> sd 0.07911922 0.01355908
#> Loglikelihood: 19.00364 AIC: -34.00727 BIC: -32.34084
#> Correlation matrix:
#> mean sd
#> mean 1 0
#> sd 0 1
Basically, it takes a vector and tried to fit the distribution
using fitdistrplus package.
I tried looking at the broom package, but it doesn't have
a function that covers that.
When you call broom::tidy(fn) you receive an error that says:
Error: No tidy method for objects of class fitdist
This is because this function from broom only has a finite number objects that are "good to use", see methods(tidy) for the complete list. (Read more about S3 methods in R. More here).
So the function doesn't work for an object fitdist but works for a fitdistr object from MASS (more "famous").
We can then assign to fn that class, and then use broom:
class(fn) <- ("fitdist", "fitdistr")
# notice that I've kept the original class and added the other
# you shouldn't overwrite classes. ie: don't to this: class(fn) <- "fitdistr"
broom::tidy(fn)
# # A tibble: 2 x 3
# term estimate std.error
# <chr> <dbl> <dbl>
# 1 mean 0.328 0.0192
# 2 sd 0.0791 0.0136
Note that you can only see the parameters. If you wish to see more and organize everything as "tidy", you should tell us more about your expected output.
broom::tidy() gets you this far, if you want more I'd start by defining my own method function that works for a class fitdist object using as reference the tidy.fitdistr method, and adapting it.
Example of how I'd adapt from the original broom::tidy() code, using the S3 method for the class fitdist.
Define your own method (similar to how you define your own function):
# necessary libraries
library(dplyr)
library(broom)
# method definition:
tidy.fitdist <- function(x, ...) { # notice the use of .fitdist
# you decide what you want to keep from summary(fn)
# use fn$ecc... to see what you can harvest
e1 <- tibble(
term = names(x$estimate),
estimate = unname(x$estimate),
std.error = unname(x$sd)
)
e2 <- tibble(
term = c("loglik", "aic", "bic"),
value = c(unname(x$loglik), unname(x$aic), unname(x$bic))
)
e3 <- x$cor # I prefer this to: as_tibble(x$cor)
list(e1, e2, e3) # you can name each element for a nicer result
# example: list(params = e1, scores = e2, corrMatr = e3)
}
This is how you can call this new method now:
tidy(fn) # to be more clear this is calling your tidy.fitdist(fn) under the hood.
# [[1]]
# # A tibble: 2 x 3
# term estimate std.error
# <chr> <dbl> <dbl>
# 1 mean 0.328 0.0192
# 2 sd 0.0791 0.0136
#
# [[2]]
# # A tibble: 3 x 2
# term value
# <chr> <dbl>
# 1 loglik 19.0
# 2 aic -34.0
# 3 bic -32.3
#
# [[3]]
# mean sd
# mean 1 0
# sd 0 1
Notice that the class is:
class(fn)
[1] "fitdist"
So now you don't actually need to assign the fitdistr (from MASS) class as before.
Not sure exactly what you need, but you can try:
tidy_fn <- rbind(fn$estimate,fn$sd)
https://stats.stackexchange.com/questions/23539/use-fitdist-parameters-in-variables
I use stargazer to create a table for multiple models. They are actually the same model but the first is based on all observations, while the other drop different observations respectively. All variables are named the same, so what surprises me is that when I export the table to Latex, two lines, one for a dummy variable and another for an interaction term, are duplicated.
What is really strange is that I cannot replicate the results, but I will post a minimal working example nonetheless. Perhaps you can help me based on my description alone.
This is the code for my MWE:
library(tibble)
library(stargazer)
df <- as_tibble(data.frame(first = rnorm(100, 50), second = rnorm(100, 30), third = rnorm(100, 100), fourth = c(rep(0, 50), rep(1, 50))))
model.1 <- lm(first ~ second + third + fourth + third*fourth, data = df)
model.2 <- lm(first ~ second + third + fourth + third*fourth, data = df[!rownames(df) %in% "99",])
stargazer(model.1, model.2)
I will now post the Latex output includes the error that I am trying to fix (with this snippet it seems to work just fine).
What I would like to have, of course is the code as produced by this snippet (I feel very stupid for not being able to reproduce it):
you could take a look at the names of your model's coefficients using coefficients(). Mare sure they are identical, i.e. identical(names(model.1), names(model.2)) Then use stargazer's keep statement to make sure you get the coefficients you want,
Here with the example above keeping selected variables;
coefficients(model.1)
#> (Intercept) second third fourth third:fourth
#> 57.27352606 0.02674072 -0.08236250 20.23596216 -0.20288137
coefficients(model.2)
#> (Intercept) second third fourth third:fourth
#> 57.06149556 0.03305134 -0.08214812 20.85087288 -0.20885718
identical(names(model.1), names(model.2))
#> [1] TRUE
I'm using the type = "text" to make it more friendly to SO, but I guess it's the same with LaTeX,
stargazer(model.1, model.2, type = "text", keep=c("third","third:fourth"))
#>
#> =========================================================
#> Dependent variable:
#> -------------------------------------
#> first
#> (1) (2)
#> ---------------------------------------------------------
#> third -0.082 -0.082
#> (0.166) (0.167)
#>
#> third:fourth -0.203 -0.209
#> (0.222) (0.223)
#>
#> ---------------------------------------------------------
#> Observations 100 99
#> R2 0.043 0.044
#> Adjusted R2 0.002 0.004
#> Residual Std. Error 1.044 (df = 95) 1.047 (df = 94)
#> F Statistic 1.056 (df = 4; 95) 1.089 (df = 4; 94)
#> =========================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
but it might be hard to rule out that it's a local issue if we cannot find a way to reproduce your issue.
I have a dataframe called repay and I have created a vector for the variables names of the variables I am interested in called variables.
variables<-names(repay)[22:36]
I want to write a for loop that does some univariate analysis on each of the variables in variables. For example:
for (i in 1:length(variables))
{
model<-glm(Successful~ variables[i]
,data=repay
,family=binomial(link='logit'))
}
However it doesn't recognize variables[i] as a variable, giving the following error message:
Error in model.frame.default(formula = Successful ~ variables[i], data
= repay, : variable lengths differ (found for 'variables[i]')
Try using the formula function in R. It will allow correct interpretation of models as below:
for (i in 1:length(variables){
myglm <- glm(formula(paste("Successful", "~", variables[i])),
data = repay, family = binomial(link = 'logit'))
See my post here for more things you can do in this context.
Alternatively you can use assign yielding in as many models as the variables.
Let us consider
repay<-data.table(Successful=runif(10),a=sample(10),b=sample(10),c=runif(10))
variables<-names(repay)[2:4]
yielding:
>repay
Successful a b c
1: 0.8457686 7 9 0.2930537
2: 0.4050198 6 6 0.5948573
3: 0.1994583 2 8 0.4198423
4: 0.1471735 1 5 0.5906494
5: 0.7765083 8 10 0.7933327
6: 0.6503692 9 4 0.4262896
7: 0.2449512 4 1 0.7311928
8: 0.6754966 3 3 0.4723299
9: 0.7792951 10 7 0.9101495
10: 0.6281890 5 2 0.9215107
Then you can perform the loop
for (i in 1:length(variables)){
assign(paste0("model",i),eval(parse(text=paste("glm(Successful~",variables[i],",data=repay,family=binomial(link='logit'))"))))
}
resulting in 3 objects: model1,model2 and model3.
>model1
Call: glm(formula = Successful ~ a, family = binomial(link = "logit"),
data = repay)
Coefficients:
(Intercept) a
-0.36770 0.05501
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 5.752
Residual Deviance: 5.69 AIC: 17.66
Idem for model2, model3 et.c.
You could create a language object from a string,
var = "cyl"
lm(as.formula(sprintf("mpg ~ %s", var)), data=mtcars)
# alternative (see also substitute)
lm(bquote(mpg~.(as.name(var))), data=mtcars)
Small workaround that might help
for (i in 22:36)
{
ivar <- repay[i] #choose variable for running the model
repay2 <- data.frame(Successful= repay$Successful, ivar) #create new data frame with 2 variables only for running the model
#run model for new data frame repay2
model<-glm(Successful~ ivar
,data=repay2
,family=binomial(link='logit'))
}