dplyr's pipe does not pass the name of the object it sends down the chain. This is well known, but it leads to an unexpected complication after you fit a glm model: functions that work on glm objects expect the model's call to contain the actual name of the object holding the data.
#sample data (p_load comes from the pacman package)
library(pacman)
p_load(ISLR)
mydata = ISLR::Default
#fit glm
fitted=
mydata %>%
select(default, income) %>%
glm(default~.,data=.,family=binomial)
#the call contains a dot instead of the data name
fitted$call
#pscl's pR2 pseudo-R2 function therefore does not work
p_load(pscl)
pR2(fitted)
How can I fix this behavior?
I want to keep using pipes, including the select step. I also want fitted to hold a glm object that can be used with pR2 or any other function that needs a working call.
One can move the data preprocessing inside the glm call, but that takes away the elegance of the code.
fitted=
glm(default~.,
data=mydata %>%
select(default, income),
family=binomial)
1) Since you are explicitly writing out all the variables in the select anyway, you can just as easily write them out in the formula instead and get rid of the select -- you can keep the select if you like, but it does seem pointless when the variables are already given explicitly in the formula. Then this works:
library(dplyr)
library(magrittr)
library(pscl)
library(ISLR)
fitted <- Default %$% glm(default ~ income, family=binomial)
fitted %>% pR2
2) Another possibility is to invert it: instead of putting glm inside the pipe, put the pipe inside glm:
fitted <-
glm(default ~ ., data = Default %>% select(income, default), family = binomial)
fitted %>% pR2
3) A third approach is to generate the formula argument of glm rather than the data argument.
fitted <- Default %>%
select(starts_with("inc")) %>%
names %>%
reformulate("default") %>%
glm(data = Default, family = binomial)
fitted %>% pR2
Replace the glm line with this if it is important that the Call: line in the output look nice.
{ do.call("glm", list(., data = quote(Default), family = quote(binomial))) }
or using purrr's invoke together with expr from rlang:
{ invoke("glm", list(., data = rlang::expr(Default), family = rlang::expr(binomial))) }
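Either way, the stored call now names the data explicitly, which is what pR2 needs to re-evaluate it. A quick check (a sketch, assuming the pipeline from 3) was rerun with the replacement line):
fitted$call
# should print something like: glm(default ~ income, data = Default, family = binomial)
fitted %>% pR2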
Related
I'm trying to 'tidy' up a binary regression (using a log link rather than a logit link, so that I get RR estimates instead of ORs) with the broom function 'tidy' on a 'glm2' object. However, it gives me an error saying:
> tidy(model, conf.int=TRUE, exponentiate=TRUE)
Error: no valid set of coefficients has been found: please supply starting values
Here is a reproducible example of what I mean:
library(tidyverse)
library(glm2)
library(broom)
data(iris)
glimpse(iris)
table(iris$Species)
##create an outcome
df <- iris %>%
mutate(outcome = case_when(Petal.Width>2 ~1,
TRUE ~0))
#fit standard glm
glm(outcome ~ Sepal.Length+Sepal.Width, data=df,
family = binomial(link="log"))
# -> doesn't converge using a log link due to parameter space issues (common in fitting binary regression).
# go to glm2 to fit the model instead, but this needs starting values:
p0 <- sum(as.numeric(df$outcome))/length(as.numeric(df$outcome))
start.val <- c(log(p0),rep(0,2))
model <- glm2(outcome ~ Sepal.Length + Sepal.Width, data = df,
family = binomial(link="log"),
start = start.val)
##get warnings, but it converges
model$converged
##now tidy up and display the model
tidy(model, conf.int=TRUE, exponentiate=TRUE)
#error -> wants starting values again? also shows warnings from the earlier fit
# (which now say the model hasn't converged?)
tidy(model, conf.int=TRUE, exponentiate=TRUE, start=start.val)
# doesn't recognise the starting values?
Any ideas on how to get tidy to work, or do I just do it manually?
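One possible manual workaround (a sketch, not from the original thread): tidy(conf.int = TRUE) profiles the likelihood, refitting the model along the way, which appears to be where the starting values get lost; Wald-style intervals built by hand avoid the refit entirely. manual_tidy is a hypothetical helper, not part of broom or glm2:
manual_tidy <- function(model, level = 0.95) {
  z   <- qnorm(1 - (1 - level) / 2)
  est <- coef(model)
  se  <- sqrt(diag(vcov(model)))      # Wald standard errors
  data.frame(term      = names(est),
             estimate  = exp(est),    # RR scale, since the link is log
             conf.low  = exp(est - z * se),
             conf.high = exp(est + z * se))
}
manual_tidy(model)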
I am trying to create a custom function in R that lets the user perform linear regressions on a data set. I would like the user to be able to input variables for the data to be grouped by, so that multiple regressions are performed on the data set. The problem I am having is getting a user-defined list of variables into the custom function. Below I have tried using "...", but this does not work. If anyone has any idea how I should be approaching this, that would be great. For reference: lr.1 = the dataset, ddate = the x variable, value = the y variable, and ... = the variables that the data should be grouped by.
grouped.lr = function(lr.1, ddate, value, ...){
  test = lr.1 %>%
    group_by(...) %>%
    nest() %>%
    mutate(mod = map(data, fitmodel.test),   # fitmodel.test: the user's own fitting helper
           pars = map(mod, tidy),
           pred = map(mod, augment))
}
It seems like the use of a formula might be fitting here, as it allows the user to specify the predictor-response relations.
The formula object is also accepted by various modelling functions, so it can be passed directly to lm().
# function training a linear model and a random forest
library(randomForest)   # randomForest() lives here; load it before use
build_my_models <- function(formula, data) {
lm.fit <- lm(formula, data)
rf.fit <- randomForest(formula, data)
return(list(lm.fit, rf.fit))
}
# data frame with three continuous variables
a <- rnorm(100)
b <- rnorm(100, mean = 2, sd = 4)
c <- 2*a + b
my_data <- data.frame(a = a, b = b, c = c)
# build the models
my_models <- build_my_models(a ~ ., my_data)
# here the formula 'a ~ .' defines the relation between response and predictors
# (the dot indicates that 'a' depends on all other variables in the data frame)
If you want to implement a model yourself, it's never a bad idea to stick to R's syntax and conventions. You can check the documentation on how to parse the formula for your specific needs.
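For instance, a minimal sketch of pulling the response and predictor names out of a formula inside your own function (parse_formula_demo is a hypothetical name):
# model.frame() expands the '.' shorthand against the supplied data
parse_formula_demo <- function(formula, data) {
  mf <- model.frame(formula, data)
  list(response = names(mf)[1], predictors = names(mf)[-1])
}
parse_formula_demo(a ~ ., my_data)
# $response "a"; $predictors "b" "c"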
I would like to fit a quadratic to (Time, SkinTemp) for each id in the following data.frame df. Each id has a different number of (Time, SkinTemp) entries, so I'm stuck at the predict step:
df<-data.frame(Time=seq(65),
SkinTemp=rnorm(65,37,0.5),
id=rep(1:10,c(5,4,10,6,7,8,9,8,4,4)))
So far I have:
#Fit the quadratic model y = a*Time^2 + b*Time + c for each id
fitted_models = df %>% group_by(id) %>% do(model = lm(SkinTemp ~ Time+I(Time^2), data = .))
So far so good. Here's where I'm stuck. How do I pass the original Time data into the predict function below?
#Predict data points for each quadratic
predQ<-sapply(unique(df$id), function(x) predict(fitted_models$model[[x]]))
Use fitted:
lapply(fitted_models$model, fitted)
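If you also want the fitted values lined up against the original rows, a small sketch (assuming df is ordered by id, as in the example, so each model's fitted values match its block of rows):
# fitted() returns predictions for exactly the rows each model was fit on
df$fit <- unlist(lapply(fitted_models$model, fitted))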
I have a big dataset that I want to partition based on the values of a particular variable (in my case lifetime), and then run logistic regression on each partition. Following the answer of @tchakravarty in Fitting several regression models with dplyr, I wrote the following code:
lifetimemodels = data %>% group_by(lifetime) %>% sample_frac(0.7)%>%
do(lifeModel = glm(churn ~., x= TRUE, family=binomial(link='logit'), data = .))
My question now is how I can use the resulting logistic models to compute the AUC on the rest of the data (the 0.3 fraction that was not chosen), which should again be grouped by lifetime?
Thanks a lot in advance!
You could adapt your dplyr approach to use the tidyr and purrr framework. Look at grouping/nesting, and at the mutate and map functions, to create list columns that store the pieces of your workflow.
The test/training split you are looking for is part of modelr, a package built to assist modelling within the purrr framework. Specifically, the crossv_mc and crossv_kfold functions.
A toy example using mtcars (just to illustrate the framework).
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)
analysis <- mtcars %>%
nest(-cyl) %>%
unnest(map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%
mutate(model = map(train, ~lm(mpg ~ wt, data = .x))) %>%
mutate(pred = map2(model, train, predict)) %>%
mutate(error = map2_dbl(model, test, rmse))
This:
takes mtcars
nests it into a list column called data, grouped by cyl
splits each data element into training and test sets by mapping crossv_mc over each element, then uses unnest to make the test and train list columns
maps the lm model over each train set and stores the fit in model
maps the predict function over model and train and stores the result in pred
maps the rmse function over the model and test sets and stores the result in error
There are probably users out there more familiar than me with the workflow, so please correct/elaborate.
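To connect this back to the AUC in the question, a minimal sketch (auc_manual is a hypothetical rank-based helper, not part of modelr; it assumes each test split contains both classes):
# rank-based (Mann-Whitney) AUC: the probability that a random positive
# case scores higher than a random negative one
auc_manual <- function(labels, scores) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
cv  <- crossv_mc(mtcars, 1, test = 0.3)
fit <- glm(am ~ wt, family = binomial, data = as.data.frame(cv$train[[1]]))
d   <- as.data.frame(cv$test[[1]])   # a modelr resample coerces to a data frame
auc_manual(d$am, predict(fit, d, type = "response"))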
I have 100 replicates of a coxph model fitted in a loop. I am trying to extract the log-rank score test result, with p-values, for each replicate into a data frame or list. I am using the following, but it gives me only the log-rank score, not the p-value. Any help will be very appreciated.
I can share the dataset, but am not sure how to attach it here.
thanks,
Krina
Repl_List <- unique(dat3$Repl)
doLogRank = function(sel_name) {
dum <- dat3[dat3$Repl == sel_name,]
reg <- with(dum, coxph(Surv(TIME_day, STATUS) ~ Treatment, ties = "breslow"))
LogRank <- with(reg, reg$score)
}
LogRank <- t(as.data.frame(lapply(Repl_List, doLogRank)))
Here is a mock example that I took from the help page of the coxph function. I just replicated the dataset 100 times to recreate your scenario. I highly recommend starting to use the tidyverse packages for this kind of work. broom is a great addition along with dplyr and tidyr.
library(survival)
library(tidyverse)
library(broom)
test <- data.frame(time=c(4,3,1,1,2,2,3),
status=c(1,1,1,0,1,1,0),
x=c(0,2,1,1,1,0,0),
sex=c(0,0,0,0,1,1,1))
Below I am replicating the dataset 100 times using the replicate function.
r <- replicate(test,n = 100,simplify = FALSE) %>% bind_rows %>%
mutate(rep = rep(seq(1,100,1),each=7))
I set up the cox model as a small function that I can then apply to each replicate of the dataframe.
cxph_mod <- function(df) {
coxph(Surv(time, status) ~ x + strata(sex), df)
}
Below is the step-by-step process of fitting the model and extracting the values:
tidyr::nest the dataframe
purrr::map the model onto each nested data frame
nest is a function in library(tidyr)
map is a function similar to lapply, in library(purrr)
nested <- r %>%
group_by(rep) %>%
nest %>%
mutate(model = data %>% map(cxph_mod))
Look at the first rep to see the coxph output. You will see the model object stored in the cells of the dataframe, allowing easier access.
nested %>% filter(rep==1)
With each model object in place, now use broom to get the parameter estimates and the predictions from the model into the nested dataset:
nested <- nested %>%
mutate(
ests = model %>% map(broom::tidy)
)
tidyr::unnest to view the estimates from fitting each resampled dataset:
ests <- unnest(nested,ests,.drop=TRUE) %>% dplyr::select(rep,estimate:conf.high)
In this case, since I am repeating the same dataset 100 times, the p-value will be the same; but in your case you will have 100 different datasets and hence 100 different p-values.
ggplot(data=ests,aes(y=p.value,x=rep))+geom_point()
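Back to the original question of pulling out the p-value itself: summary.coxph exposes the log-rank (score) test as sctest, so it can be read straight off each stored model (a sketch using the nested frame above):
nested %>%
  mutate(score_p = map_dbl(model, ~ summary(.x)$sctest[["pvalue"]]))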
Vijay