I am trying to replicate the example code in Bender and Scheipl for piece-wise exponential additive mixed modelling tools (pammtools), specifically a survival exercise with time-varying effects.
https://arxiv.org/pdf/1806.01042.pdf
library(dplyr); library(tidyr); library(purrr); library(ggplot2)
library(survival); library(mgcv); library(pammtools)
data("pbc", package="survival")
# event time information
pbc <- pbc %>%
  filter(id <= 312) %>%
  mutate(status = ifelse(status == 0, 0, 1)) %>%
  select(id:status, trt:sex, bili, protime)
pbc %>% slice(1:6)
pbc_ped <- as_ped(
  data = list(pbc, pbcseq),
  formula = Surv(pbc$time, pbc$status) ~ sex | concurrent(bili, protime, tz_var = "day"),
  id = "id")
I always get the error
Error: .x is empty, and no .init supplied
I installed and checked Rtools, and I tried different (older) versions of purrr, which is sometimes associated with this error. I also tried running the code on https://rdrr.io/snippets/.
Any ideas? Thank you very much.
You have not used the code in that vignette, and you added pbc$ to the arguments in Surv(), a common mistake and generally not a productive strategy.
# Need to narrow the material from pbcseq
pbcseq <- pbcseq %>% select(id, day, bili, protime)
# I would have given it a different name
#------ Error when using "|" rather than "+"
pbc_ped <- as_ped(
  data = list(pbc, pbcseq),
  formula = Surv(time, status) ~ sex | concurrent(bili, protime, tz_var = "day"),
  id = "id")
# Error: `.x` is empty, and no `.init` supplied
#________________
pbc_ped <- as_ped(
  data = list(pbc, pbcseq),
  formula = Surv(time, status) ~ sex + concurrent(bili, protime, tz_var = "day"),
  id = "id")  # No error
I think there may be an error in the vignette. I don't see any examples using the construct
Surv(time, status) ~ covariates | special(.)
They all use a "+" sign to add the time-dependent covariates. If you go to https://adibender.github.io/pammtools//articles/data-transformation.html you will see them using a "+" rather than a "|". I think there is some sloppiness in that package's documentation, but your additions only made the problem worse.
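Once as_ped() runs with "+", the next step in the paper's workflow is fitting the PAMM with mgcv::gam(). A minimal sketch (ped_status, tend, and offset are the columns as_ped() creates by default; adjust the smooths to taste):

# Fit a piece-wise exponential additive mixed model on the PED data:
# ped_status is the per-interval event indicator, s(tend) the smooth
# log-baseline hazard, and offset accounts for interval lengths
pam <- mgcv::gam(ped_status ~ s(tend) + sex + bili + protime,
                 data = pbc_ped, family = poisson(), offset = offset)
summary(pam)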
This is my question:
Do developers that make more games charge higher prices?
My code:
dev_data <- steam_data_final %>%
  group_by(developer) %>%
  summarize(num_dev = n(), avg_price = mean(price, na.rm = TRUE)) %>%
  arrange(desc(num_dev))
dev_data
But this model isn't working; I get Warning: Dropping 3038 rows with missing values:
mod_dev <- lm(num_dev ~ avg_price, data = dev_data)
Check whether you have any NAs, using summary() or is.na() on each column. If you do, they are the reason lm() gives you the warning message.
Also, it seems you need lm(avg_price ~ num_dev, data = dev_data) instead of lm(num_dev ~ avg_price, data = dev_data): the dependent variable should be avg_price, not num_dev. (It depends on your research question.)
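A quick sketch of both checks, assuming dev_data as built above (dev_complete is just an illustrative name):

# Count NAs/NaNs per column; avg_price is NaN for developers whose
# prices were all missing, and lm() drops those rows
colSums(is.na(dev_data))

# Drop incomplete rows explicitly, then model price as a function
# of the number of games per developer
dev_complete <- dev_data[complete.cases(dev_data), ]
mod_dev <- lm(avg_price ~ num_dev, data = dev_complete)
summary(mod_dev)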
After designing a Tidymodels recipe-based workflow, which is tuned then fitted to some training data, I'm not clear what objects (fitted "workflow", "recipe", ..etc) should be saved to disk for use in predicting new data in production. I understand I can use saveRDS()/readRDS(), write_rds()/read_rds(), or other options to actually do the saving/loading of these objects, but which ones?
In a clean R environment I will have incoming new raw data which will need to be pre-processed using the "recipe" I used in training the model. I then want to make predictions on that data after it has been pre-processed. If I intend to use the prep() and bake() functions to pre-process the new data as I did the training data, then it seems I minimally need the recipe and the original training data to get prep() to work. Plus, I also need the fitted model/workflow to make predictions. So three objects, it seems.

If I save the workflow object to disk in SESSION 1, then in SESSION 2 I can extract the recipe and model from it with pull_workflow_prepped_recipe() and pull_workflow_fit() respectively. But prep() seems to require the original training data, which I can keep in the workflow with an earlier use of retain = TRUE... but then that gets stripped out of the workflow after a call to fit(). Hear my cries for help! :)
So, imagine two different R sessions, where the first session I am doing all the training and model building, and the second session is some running production app that uses what was learned from the first session. I need help at the arrows in the bottom of SESSION1, and in multiple places in SESSION 2. I used the Tidymodels Get Started as the base for this example.
SESSION 1
library(tidymodels)
library(nycflights13)
library(readr)
set.seed(123)
flight_data <-
  head(flights, 500) %>%
  mutate(
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    date = as.Date(time_hour)
  ) %>%
  inner_join(weather, by = c("origin", "time_hour")) %>%
  select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
  na.omit() %>%
  mutate_if(is.character, as.factor)
set.seed(555)
data_split <- initial_split(flight_data, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())
lr_mod <-
  logistic_reg() %>%
  set_engine("glm")
flights_wflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(flights_rec)
flights_fit <-
  flights_wflow %>%
  fit(data = train_data)
predict(flights_fit, test_data)
### SAVE ONE OR MORE OBJECTS HERE FOR NEXT SESSION <------------
# What to save? workflow (pre or post fit()?), recipe, training data...etc.
write_rds(flights_wflow, "flights_wflow.rds") # Not fitted workflow
write_rds(flights_fit, "flights_fit.rds") # Fitted workflow
SESSION 2
### READ ONE OR MORE OBJECTS HERE FROM PRIOR SESSION <------------
flights_wflow <- read_rds("flights_wflow.rds")
flights_fit <- read_rds("flights_fit.rds")
# Acquire new data, do some basic transforms as before
new_flight_data <-
  tail(flights, 500) %>%
  mutate(
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    date = as.Date(time_hour)
  ) %>%
  inner_join(weather, by = c("origin", "time_hour")) %>%
  select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
  na.omit() %>%
  mutate_if(is.character, as.factor)
# Something here to preprocess the data with recipe as in SESSION 1 <----------
# new_flight_data_prep <- prep(??)
# new_flight_data_preprocessed <- bake(??)
# Predict new data
predict(flights_fit, new_data = new_flight_data_preprocessed)
You have some flexibility in how you approach this, depending on your constraints, but generally I would recommend saving/serializing the fitted workflow, perhaps after using butcher to reduce its size. You can see an example model fitting script in this repo that shows at the end how I save the fitted workflow.
When you go to predict with this workflow, there are some things to keep in mind. I have an example Plumber API in the same repo that demonstrates what is needed to predict for that particular workflow. Notice how the packages needed for prediction are loaded/attached for this API. I didn't use all of tidymodels, but instead only the specific packages I need, for better performance and a smaller container.
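Here is a minimal sketch of that pattern, using the objects from the question (the butcher() step is optional, and the file name is just illustrative):

# SESSION 1: save only the fitted workflow; butcher() trims training
# data and environments the fit carried along, shrinking the file
library(butcher)
write_rds(butcher(flights_fit), "flights_fit.rds")

# SESSION 2: with tidymodels (or at least workflows) loaded, read the
# fitted workflow back and predict on the raw new data. The workflow
# contains the prepped recipe, so predict() applies the pre-processing
# itself; no separate prep()/bake() step is needed
flights_fit <- read_rds("flights_fit.rds")
predict(flights_fit, new_data = new_flight_data)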
Saving the fitted workflow did not work for me. When trying to predict with new data, it asks for the target variable (a churn model):
predict(churn_model, the_data)
Error: Problem with `mutate()` column `churn`.
i `churn = dplyr::if_else(churn == 1, "yes", "no")`.
x object 'churn' not found
I still don't get why it is asking for a column that should not be present in the data, as it is the variable I am trying to predict...
I get the error below when I try to use the add_p() function to get a p-value for differences between my by variable (with 10 levels) and a categorical variable with two levels (yes/no). I am not sure how to provide a reproducible example; from the trial data, imagine my by variable is the "T Stage" variable with 10 levels, and the categorical variables are (1) "Chemotherapy Treatment" with 2 levels and (2) "Chemotherapy Treatment2" with 4 levels. Here is the code I ran.
library(gtsummary)
library(tidyverse)
miro_def %>%
  select(mheim, age_dx, time_t1d_yrs, gender, collard, fhist_pandz) %>%
  tbl_summary(by = mheim, missing = "no",
              type = list(c(gender, collard, fhist_pandz, mheim) ~ "categorical"),
              label = list(gender ~ "Gender",
                           fhist_pandz ~ "Family history of PD",
                           age_dx ~ "Age at diagnosis",
                           time_t1d_yrs ~ "Follow-up(years)")) %>%
  add_p() %>%
  # style the output with custom header
  # modify_header(stat_by = "{level}") %>%
  # convert to kableExtra
  as_kable_extra(booktabs = TRUE) %>%
  # reduce font size to make the table fit; you may also use the
  # `latex_options = "scale_down"` argument here
  kable_styling(font_size = 7, latex_options = "scale_down")
However, I do get a p-value for this by variable (10 levels) with other variables (which are continuous/numeric).
How can I fix this error?
In the case where I have the mentioned multilevel by variable and a multilevel (>2 levels) categorical variable, is there anything special I should do to get a p-value?
There was an error in 'add_p()' for variable 'gender' and test 'fisher.test', p-value omitted:
Error in stats::fisher.test(data[[variable]], as.factor(data[[by]])): FEXACT error 7(location). LDSTP=18540 is too small for this problem,
(pastp=51.2364, ipn_0:=ipoin[itp=150]=215, stp[ipn_0]=40.6787).
Increase workspace or consider using 'simulate.p.value=TRUE'
There was an error in 'add_p()' for variable 'collard' and test 'fisher.test', p-value omitted:
Error in stats::fisher.test(data[[variable]], as.factor(data[[by]])): FEXACT error 7(location). LDSTP=18570 is too small for this problem,
(pastp=37.0199, ipn_0:=ipoin[itp=211]=823, stp[ipn_0]=23.0304).
Increase workspace or consider using 'simulate.p.value=TRUE'
There was an error in 'add_p()' for variable 'fhist_pandz' and test 'fisher.test', p-value omitted:
Error in stats::fisher.test(data[[variable]], as.factor(data[[by]])): FEXACT error 7(location). LDSTP=18570 is too small for this problem,
(pastp=36.4614, ipn_0:=ipoin[itp=58]=1, stp[ipn_0]=31.8106).
Increase workspace or consider using 'simulate.p.value=TRUE'
Since nobody posted an answer, here's what I used when I came across this. Following the examples given in the help file ?gtsummary::add_p.tbl_summary, I composed a custom function that runs fisher.test with the simulate.p.value = TRUE option:
## define custom test
fisher.test.simulate.p.values <- function(data, variable, by, ...) {
  result <- list()
  test_results <- stats::fisher.test(data[[variable]], data[[by]], simulate.p.value = TRUE)
  result$p <- test_results$p.value
  result$test <- test_results$method
  result
}
## add p-values to your gtsummary table, using custom test defined above
summary_table %>%
  add_p(
    # apply the custom test to all categorical variables
    test = list(all_categorical() ~ "fisher.test.simulate.p.values")
  )
You can also change the number of iterations used to compute the simulated p-values by passing something other than the default B = 2000 to fisher.test() above.
All this assumes, of course, that it's appropriate to use Fisher's test in the first place.
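For example, a variant of the helper above with more Monte Carlo replicates (the function name here is just illustrative; B is fisher.test()'s argument for the number of replicates):

fisher.test.simulate.10k <- function(data, variable, by, ...) {
  # same custom test, but with 10,000 simulated tables instead of 2,000
  test_results <- stats::fisher.test(data[[variable]], data[[by]],
                                     simulate.p.value = TRUE, B = 1e4)
  list(p = test_results$p.value, test = test_results$method)
}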
Since it fixed the issue for me, I would like to indicate that since version 1.3.6 of gtsummary there is an option in add_p() with which you can specify arguments to the test functions (i.e. test.args). Thank you to the developers for this!
From the NEWS:
Each add_p() method now has the test.args= argument. Use this argument to pass
additional arguments to the statistical method, e.g.

add_p(test = c(age, marker) ~ "t.test",
      test.args = c(age, marker) ~ list(var.equal = TRUE))
It is also explained in the add_p() help (i.e. ?add_p).
I had a similar problem. You have to increase your workspace with test.args within add_p().
miro_def %>%
  select(mheim, age_dx, time_t1d_yrs, gender, collard, fhist_pandz) %>%
  tbl_summary(by = mheim, missing = "no",
              type = list(c(gender, collard, fhist_pandz, mheim) ~ "categorical"),
              label = list(gender ~ "Gender",
                           fhist_pandz ~ "Family history of PD",
                           age_dx ~ "Age at diagnosis",
                           time_t1d_yrs ~ "Follow-up(years)")) %>%
  add_p(test.args = variable_with_no_pval ~ list(workspace = 2e9))
or
add_p(test.args = all_tests("fisher.test") ~ list(workspace = 2e9))
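The two answers can also be combined: instead of writing a custom test function, pass the simulation option itself through test.args. A sketch, assuming gtsummary >= 1.3.6 (the extra arguments are forwarded to stats::fisher.test()):

summary_table %>%
  add_p(test.args = all_tests("fisher.test") ~ list(simulate.p.value = TRUE))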
I am trying to replicate this example of time series analysis in R using Keras (see Here), and unfortunately I am receiving an error message while computing the first average RMSE:
coln <- colnames(compare_train)[4:ncol(compare_train)]
cols <- map(coln, quo(sym(.)))
rsme_train <-
  map_dbl(cols, function(col)
    rmse(
      compare_train,
      truth = value,
      estimate = !!col,
      na.rm = TRUE
    )) %>%
  mean()
rsme_train
Error message:
Error in is_symbol(x) : object '.' not found
There are some helpful comments at the bottom of the post, but a newer version of dplyr doesn't really help. Any suggestions for how to get around this?
I stumbled upon the same problem, so here's a solution that stays close to the original code.
The transformation of cols is not necessary, because !! already works with the character vector. You can change the code to
coln <- colnames(compare_train)[4:ncol(compare_train)]
rsme_train <-
  map_df(coln, function(col)
    rmse(
      compare_train,
      truth = value,
      estimate = !!col,
      na.rm = TRUE
    )) %>%
  pull(.estimate) %>%
  mean()
rsme_train
You might also want to check for updates of tidyverse, just to be sure.
I would like to run each variable in a dataset as a univariate glmer model using the lme4 package in R. I would like to prepare the data with the dplyr/tidyr packages and organize the results from each model with the broom package (i.e. do(glance(glmer(...)))). I would most appreciate help that stays within that framework. I'm not that great in R, but I was able to produce a dataset that throws an error and has the same structure as the data I'm using:
library(lme4)
library(dplyr)
library(tidyr)
library(broom)
Bird  <- rep(c(0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0), 10)
Stop  <- rep(seq(1, 10), 20)
Count <- rep(rep(c(1, 2), each = 10), each = 10)
Route <- rep(seq(1, 10), each = 20)
X1 <- rnorm(200, 50, 10)
X2 <- rnorm(200, 10, 1)
X3 <- rep(0, 200)  # trouble-maker variable
Data <- data.frame(cbind(Bird, Stop, Count, Route, X1, X2, X3))
Data %>%
  gather(Variable, Value, 5:7) %>%
  group_by(Variable) %>%
  do(glance(glmer(Bird ~ Value + Stop + (1 + Stop | Route/Count), data = ., family = binomial)))
The last variable produces an error, so there is no output. What I would like is for it to produce NA values in the output when this occurs, or just skip that variable. I've tried using try to get past the trouble-maker variable:
do(try(glance(glmer(Bird ~ Value + Stop + (1 + Stop | Route/Count), data = ., family = binomial))))
which it does, but an output is still not produced, because a 'try-error' cannot be coerced to a data.frame. Unfortunately there is no tryharder function. I've tried some if statements that make sense to me but not to the computer. I'm sure I'm not doing it right, but if, for example, I use:
mod <- try(glance(glmer(Bird ~ Value + Stop + (1 + Stop | Route/Count), data = ., family = binomial)))
if (is.data.frame(mod)) { do(mod) }
I get subscript out of bounds errors. Thanks very much for any input you can provide!
Use tryCatch before the call to glance:
zz <- Data %>%
  gather(Variable, Value, 5:7) %>%
  group_by(Variable) %>%
  do(aa = tryCatch(glmer(Bird ~ Value + Stop + (1 + Stop | Route/Count), data = .,
                         family = binomial),
                   error = function(e) data.frame(NA)))
zz %>%
glance(aa)
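If the final glance(aa) step also fails on the error rows, an alternative sketch in the same do() style is to wrap glance() itself, so every group returns a data frame and the rows bind together (with missing columns filled as NA). safe_glance is a hypothetical helper name; because R evaluates its argument lazily, the glmer() error is caught too:

safe_glance <- function(fit) {
  # glance() forces fit inside tryCatch, so errors from glmer() are caught
  tryCatch(glance(fit), error = function(e) data.frame(error = TRUE))
}

Data %>%
  gather(Variable, Value, 5:7) %>%
  group_by(Variable) %>%
  do(safe_glance(glmer(Bird ~ Value + Stop + (1 + Stop | Route/Count),
                       data = ., family = binomial)))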