R random forest aggregate vs individual prediction

Please consider this minimal reproducible example of a random forest regression estimate:
library(randomForest)
# fix missing data
airquality <- na.roughfix(airquality)
set.seed(123)
#fit the random forest model
rf_fit <- randomForest(formula = Ozone ~ ., data = airquality)
#define new observation
new <- data.frame(Solar.R=250, Wind=8, Temp=70, Month=5, Day=5)
set.seed(123)
#use predict all on new observation
rf_predict <- predict(rf_fit, newdata = new, predict.all = TRUE)
rf_predict$aggregate
library(tidyverse)
predict_mean <- rf_predict$individual %>%
  as_tibble() %>%
  rowwise() %>%
  transmute(avg = mean(V1:V500))
predict_mean
I was expecting rf_predict$aggregate and predict_mean to return the same value.
Where and why is my assumption wrong?
My final objective is to get a confidence interval for the predicted value.

I believe your code needs to include a c_across() call for the calculation to be performed correctly:
The ?c_across documentation tells us:
c_across() is designed to work with rowwise() to make it easy to
perform row-wise aggregations.
predict_mean <- rf_predict$individual %>%
  as_tibble() %>%
  rowwise() %>%
  transmute(avg = mean(c_across(V1:V500)))
> predict_mean
[1] 30.5
An answer to a previous question points out that mean() can't handle a data.frame. Here the more direct problem is that, inside a rowwise transmute, V1 and V500 are the current row's scalar values, so V1:V500 builds a numeric sequence from one value to the other rather than selecting columns. c_across(V1:V500) instead gathers the row's values across those 500 columns into a vector that mean() can average.
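Since the stated final objective is an interval around the prediction, one rough approach (a sketch, not a formal confidence interval for the mean) is to take quantiles of the 500 per-tree predictions that predict.all returns:
# spread of the per-tree predictions for the first (here, only) new observation;
# this approximates a prediction interval from the forest's individual trees
quantile(rf_predict$individual[1, ], probs = c(0.025, 0.975))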

Related

Fitting a quadratic curve for each data set that has different lengths

I would like to fit a quadratic to (Time, SkinTemp) for each id in the following data.frame df. Each id has a different number of Time, SkinTemp entries, so I'm stuck with predict().
df <- data.frame(Time = seq(65),
                 SkinTemp = rnorm(65, 37, 0.5),
                 id = rep(1:10, c(5, 4, 10, 6, 7, 8, 9, 8, 4, 4)))
So far I have:
# Fit the model y = x^2 + x + C
fitted_models = df %>% group_by(id) %>% do(model = lm(SkinTemp ~ Time + I(Time^2), data = .))
So far so good. Here's where I'm stuck. How do I pass the original Time data into the predict function below?
#Predict data points for each quadratic
predQ <- sapply(unique(df$id), function(x) predict(fitted_models$model[[x]]))
Use fitted:
lapply(fitted_models$model, fitted)
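If predictions are needed at specific Time values rather than just the fitted points, one approach (a sketch, assuming the fitted_models object above, whose rows are ordered by id) is to pair each model with its own group's data:
# split(df, df$id) yields one data frame per id, in the same id order as
# fitted_models, so Map pairs each model with the rows it was fitted on
predQ <- Map(function(m, d) predict(m, newdata = d),
             fitted_models$model, split(df, df$id))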

How can I pull slope and intercept variables produced by the segmented package and put them into a dataframe using R?

Can anyone walk me through how to get the slopes and intercepts produced by the segmented package out and placed in a data frame? This will ultimately be used to match slopes and intercepts back to their original values. See the data (which I took from another post) below.
#load packages
library(segmented)
library(tidyverse)
#set seed and develop data
set.seed(1)
Y <- c(13, 21, 12, 11, 16, 9, 7, 5, 8, 8)
X <- c(74, 81, 80, 79, 89, 96, 69, 88, 53, 72)
age <- c(50.45194, 54.89382, 46.52569, 44.84934, 53.25541, 60.16029, 50.33870,
         51.44643, 38.20279, 59.76469)
dat = data.frame(Y = Y, off.set.term = log(X), age = age)
#run initial GLM
glm.fit=glm(Y~age+off.set.term,data=dat,family=poisson)
summary(glm.fit)
#run segmented glm
glm.fitted.segmented <- segmented(glm.fit, seg.Z = ~age + off.set.term,
                                  psi = list(age = c(50, 53), off.set.term = c(4.369448)))
#Get summary, slopes and intercepts
summary(glm.fitted.segmented)
slope(glm.fitted.segmented)
intercept(glm.fitted.segmented)
library(broom)
library(dplyr)
library(tidyr)
library(stringr)
slopes <-
  bind_rows(lapply(slope(glm.fitted.segmented), tidy), .id = "variable") %>%
  mutate(type = str_extract(.rownames, "^[a-z]+"),
         model = str_extract(.rownames, "[0-9]+$")) %>%
  select(variable, model, type, estimate = "Est.")
intercepts <-
  bind_rows(lapply(intercept(glm.fitted.segmented), tidy), .id = "variable") %>%
  mutate(type = str_extract(.rownames, "^[a-z]+"),
         model = str_extract(.rownames, "[0-9]+$")) %>%
  select(variable, model, type, estimate = "Est.")
bind_rows(slopes, intercepts) %>%
  spread(type, estimate)
Using the tidy function, you can easily pull out a data.frame for each variable, then extract the model number and the type of term from the row names. Bind it all together and spread the type and estimate columns to end with variable, model, intercept, and slope.
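As a quick check that the reshaped table is usable, here is a sketch (assuming the combined table is saved as params) that evaluates each age segment's line on the link scale at age 52:
params <- bind_rows(slopes, intercepts) %>%
  spread(type, estimate)
# linear predictor of each age segment at age 52; this is on the log scale,
# since the model is Poisson with a log link
params %>%
  filter(variable == "age") %>%
  mutate(fit_at_52 = intercept + slope * 52)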

Tidy output from many single-variable models using purrr, broom

I have a dataframe that comprises a binary outcome column (y) and multiple independent predictor columns (x1, x2, x3, ...).
I would like to run many single-variable logistic regression models (e.g. y ~ x1, y ~ x2, y ~ x3), and extract the exponentiated coefficients (odds ratios), 95% confidence intervals and p-values for each model into rows of a dataframe/tibble. It seems to me that a solution should be possible using a combination of purrr and broom.
This question is similar, but I can't work out the next steps of:
extracting only the values I need and
tidying into a dataframe/tibble.
Working from the example in the referenced question:
library(tidyverse)
library(broom)
df <- mtcars
df %>%
  names() %>%
  paste('am~', .) %>%
  map(~glm(as.formula(.x), data = df, family = "binomial"))
After sleeping on it, the solution occurred to me: use map_df to run each model and tidy to extract the values from each model (with select(-am) first, so the outcome is not regressed on itself).
Hopefully this will be useful for others:
library(tidyverse)
library(broom)
df <- mtcars
output <- df %>%
  select(-am) %>%
  names() %>%
  paste('am~', .) %>%
  map_df(~tidy(glm(as.formula(.x),
                   data = df,
                   family = "binomial"),
               conf.int = TRUE,
               exponentiate = TRUE)) %>%
  filter(term != "(Intercept)")
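If you also want each model's formula carried along as its own column, here is a sketch of the same pipeline: naming the input vector with purrr::set_names() lets map_df() record the formula via its .id argument:
output <- df %>%
  select(-am) %>%
  names() %>%
  paste('am~', .) %>%
  set_names() %>%                       # names each formula string after itself
  map_df(~tidy(glm(as.formula(.x), data = df, family = "binomial"),
               conf.int = TRUE, exponentiate = TRUE),
         .id = "model") %>%             # .id copies the name into a model column
  filter(term != "(Intercept)")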

Fitting several regression models after group_by with dplyr and applying the resulting models into test sets

I have a big dataset that I want to partition based on the values of a particular variable (in my case lifetime), and then run logistic regression on each partition. Following the answer of @tchakravarty in Fitting several regression models with dplyr, I wrote the following code:
lifetimemodels = data %>% group_by(lifetime) %>% sample_frac(0.7) %>%
  do(lifeModel = glm(churn ~ ., x = TRUE, family = binomial(link = 'logit'), data = .))
My question now is how I can use the resulting logistic models to compute the AUC on the rest of the data (the 0.3 fraction that was not sampled), which should again be grouped by lifetime?
Thanks a lot in advance!
You could adapt your dplyr approach to use the tidyr and purrr framework: grouping/nesting plus the mutate and map functions create list columns that store the pieces of your workflow.
The test/training split you are looking for is part of modelr, a package built to assist modelling within the purrr framework; specifically, the crossv_mc and crossv_kfold functions.
A toy example using mtcars (just to illustrate the framework).
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)
analysis <- mtcars %>%
  nest(-cyl) %>%
  unnest(map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%
  mutate(model = map(train, ~lm(mpg ~ wt, data = .x))) %>%
  mutate(pred = map2(model, train, predict)) %>%
  mutate(error = map2_dbl(model, test, rmse))
This:
takes mtcars
nests the data into a list column called data, by cyl
splits each data element into training and test sets by mapping crossv_mc over it, then uses unnest to create the test and train list columns
maps the lm model over each train set and stores it in model
maps the predict function over model and train and stores the result in pred
maps the rmse function over model and the test sets and stores the result in error
There are probably users out there more familiar than me with the workflow, so please correct/elaborate.
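Since the original question asks for AUC from logistic models rather than rmse from lm, the same pattern applies after swapping in glm and an AUC helper. A sketch, assuming the pROC package (not used in the answer above) and treating am in mtcars as the binary outcome:
library(pROC)
# AUC of a fitted logistic model on a held-out modelr resample
auc_on_test <- function(model, test) {
  d <- as.data.frame(test)  # a resample object coerces to a data frame
  as.numeric(auc(d$am, predict(model, newdata = d, type = "response")))
}
analysis_glm <- mtcars %>%
  nest(-cyl) %>%
  unnest(map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%
  mutate(model = map(train, ~glm(am ~ wt, data = .x, family = binomial))) %>%
  mutate(auc = map2_dbl(model, test, auc_on_test))
On such a tiny toy split a test set can end up with a single class of am, in which case auc() will error; with a real dataset grouped by lifetime, the same map2_dbl call yields one AUC per group.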

Extract log-rank (score) test result with p-value for coxph model

I have 100 replicates of a coxph model fitted in a loop. I am trying to extract the log-rank score test result with its p-value for each replicate into a data frame or list. I am using the following, but it gives me only the log-rank score, not the p-value. Any help will be very appreciated.
I can share the dataset, but I am not sure how to attach it here.
Repl_List <- unique(dat3$Repl)
doLogRank = function(sel_name) {
  dum <- dat3[dat3$Repl == sel_name, ]
  reg <- with(dum, coxph(Surv(TIME_day, STATUS) ~ Treatment, ties = "breslow"))
  LogRank <- with(reg, reg$score)
}
LogRank <- t(as.data.frame(lapply(Repl_List, doLogRank)))
Here is a mock example that I took from the help page of the coxph function; I just replicated the dataset 100 times to recreate your scenario. I highly recommend starting to use the tidyverse packages for this kind of work. broom is a great addition along with dplyr and tidyr.
library(survival)
library(tidyverse)
library(broom)
test <- data.frame(time = c(4, 3, 1, 1, 2, 2, 3),
                   status = c(1, 1, 1, 0, 1, 1, 0),
                   x = c(0, 2, 1, 1, 1, 0, 0),
                   sex = c(0, 0, 0, 0, 1, 1, 1))
Below I am replicating the dataset 100 times using the replicate function.
r <- replicate(test, n = 100, simplify = FALSE) %>%
  bind_rows %>%
  mutate(rep = rep(seq(1, 100, 1), each = 7))
I set up the cox model as a small function that I can then pass on to each replicate of the dataframe.
cxph_mod <- function(df) {
  coxph(Surv(time, status) ~ x + strata(sex), df)
}
Below is the step-by-step process of fitting the model and extracting the values:
tidyr::nest the dataframe
purrr::map the model into each nest
nest is a function in library(tidyr)
map is a function similar to lapply, in library(purrr)
nested <- r %>%
  group_by(rep) %>%
  nest %>%
  mutate(model = data %>% map(cxph_mod))
Look into the first rep to see the coxph output. You will see the model object stored in the cells of the dataframe, allowing easier access.
nested %>% filter(rep==1)
With each model object in place, now use broom to get the parameter estimates from each model into the nested dataset:
nested <- nested %>%
  mutate(ests = model %>% map(broom::tidy))
Use tidyr::unnest to view the estimates from fitting each resampled dataset:
ests <- unnest(nested, ests, .drop = TRUE) %>% dplyr::select(rep, estimate:conf.high)
In this case, since I am repeating the same dataset 100 times, the p-value will be the same; but in your case you will have 100 different datasets and hence 100 different p-values.
ggplot(data=ests,aes(y=p.value,x=rep))+geom_point()
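The question specifically asks for the score (log-rank) test and its p-value; broom::glance() exposes those as model-level columns, so the nested frame above can be reused (a sketch relying on glance's statistic.sc and p.value.sc columns for coxph fits):
score_tests <- nested %>%
  mutate(gl = model %>% map(broom::glance)) %>%
  unnest(gl, .drop = TRUE) %>%
  dplyr::select(rep, statistic.sc, p.value.sc)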
