While reading this page, I saw this code (I generated fake data):
library(mgcv)  # provides gam() and plot.gam()
y = rbinom(n = 1000, size = 1, prob = .5)
x = abs(rnorm(1000))
mydata = data.frame(x, y)
z <- gam(y ~ s(x), data = mydata, family = binomial(link = "logit"),
         method = "GCV.Cp")
plot(z, se = 1, seWithMean = TRUE, rug = FALSE, shift = mean(predict(z)),
     trans = function(x){exp(x)/(1+exp(x))}) # binomial data
What is the role of shift = mean(predict(z))?
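For what it's worth, as far as I read the plot.gam help page, shift is a constant added to each smooth on the linear-predictor scale before plotting, and trans is applied after the shift. Here is a minimal sketch of what that means for this model, assuming the default sum-to-zero constraint on s(x). The seed, the object names and the use of plogis() are my additions, not part of the original code.

```r
library(mgcv)

set.seed(1)  # seed added for reproducibility; not in the original
y <- rbinom(n = 1000, size = 1, prob = 0.5)
x <- abs(rnorm(1000))
mydata <- data.frame(x, y)
z <- gam(y ~ s(x), data = mydata, family = binomial(link = "logit"))

# s(x) is centred over the observed x values, so averaging the linear
# predictor leaves only the intercept
shift_value <- mean(predict(z))
intercept <- unname(coef(z)["(Intercept)"])

# Centred smooth on the link scale, as plot.gam draws it without a shift
smooth_centred <- as.numeric(predict(z, type = "terms")[, "s(x)"])

# Adding the shift and back-transforming reproduces the fitted probabilities
prob_from_shift <- plogis(smooth_centred + shift_value)
prob_fitted <- as.numeric(predict(z, type = "response"))
```

So without the shift, trans would be applied to a curve centred on zero (probabilities scattered around 0.5); with it, the plotted curve sits at the level of the fitted probabilities.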


Logistic regression for non-linear data

I have data with a continuous independent variable and a binary dependent variable, so I tried to analyse it with logistic regression. However, in contrast to the classical case with a single S-shaped transition, my data have two transitions.
Here is an example of what I mean:
library(visreg)  # assumed: the plotting calls below lost their opening line in the source
classic.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
                          y = c(rep(0, times = 14), 1, 0, rep(1, times = 14)))
model.classic = glm(formula = y ~ x,
                    data = classic.data,
                    family = "binomial")
visreg(model.classic,  # opening line assumed from the surviving arguments
       partial = FALSE,
       scale = "response",
       alpha = 0)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
                     y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
model.my = glm(formula = y ~ x,
               data = my.data,
               family = "binomial")
visreg(model.my,  # opening line assumed from the surviving arguments
       partial = FALSE,
       scale = "response",
       alpha = 0)
The blue line on each plot is the glm fit, while the red line is what I would like to obtain.
Is there any way to apply logistic regression to such data? Or should I apply some other type of regression analysis?
In your second model, y is not a linear function of x. When you write y ~ x, you assume that y increases or decreases as x increases, depending on whether the coefficient is positive or negative. That is not the case here: y increases and then decreases, which makes the average effect of x zero (hence the straight line). You therefore need a non-linear function of x. You could fit one with gam from the mgcv package, where the effect of x is modelled as a smooth function:
library(mgcv)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
                     y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
m = gam(y ~ s(x), data = my.data, family = binomial)
That would lead to the following fit on the original scale:
my.data$prediction = predict(m, type = "response")
plot(my.data$x, my.data$y)
lines(my.data$x, my.data$prediction, col = "red")
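To put numbers on the difference, here is a small sketch that fits both models to the same data and compares the fitted probabilities. The object names (m_glm, m_gam, p_glm, p_gam) are mine, and gam may warn about fitted probabilities close to 0 or 1 because the example data are perfectly separated.

```r
library(mgcv)

my.data <- data.frame(x = seq(from = 0, by = 0.5, length = 30),
                      y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))

# Linear logit: at most one monotone transition
m_glm <- glm(y ~ x, data = my.data, family = binomial)
# Smooth logit: the shape of the effect of x is estimated from the data
m_gam <- gam(y ~ s(x), data = my.data, family = binomial)

p_glm <- predict(m_glm, type = "response")
p_gam <- as.numeric(predict(m_gam, type = "response"))

# The glm probabilities barely move from the overall proportion of ones,
# while the gam probabilities are high in the middle block and low at both ends
range(p_glm)
round(p_gam[c(1, 15, 30)], 3)
```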

Weighting using predict function

I have used predict to draw a fitted line for a linear model (lm) I created. Because the lm was built on only 2 data points and needs to have a positive slope, I forced it through the origin (0,0). I also weighted the model by the number of observations underlying each data point.
Question 1: (SOLVED, see comment by @Gregor)
Why does the predicted line lie so much closer to my second data point (B) than my first data point (A), when B has fewer underlying observations? Did I code something wrong here when weighting the model?
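One way to see what happens in Question 1 (a sketch of the weighted least-squares arithmetic, not a restatement of the comment that resolved it): for a weighted fit through the origin the slope is sum(w * x * y) / sum(w * x^2), so a point's pull grows with conc^2 as well as with its weight. B has half the observations of A but a concentration 20 times larger. The variable names below are mine; the numbers are those of the data frame M.

```r
# Values copied from the data frame M in the question
conc <- c(A = 300, B = 6000)
rate <- c(A = 0.4643, B = 0.2143)
tot_obsv <- c(A = 13 + 15, B = 3 + 11)  # nr_dead + nr_surv

# Closed-form slope of a weighted regression through the origin
slope_manual <- sum(tot_obsv * conc * rate) / sum(tot_obsv * conc^2)

# The same slope from lm()
fit <- lm(rate ~ 0 + conc, weights = tot_obsv)
slope_lm <- unname(coef(fit))

# Each point's share of the denominator: w * x^2 measures how strongly it pulls
pull <- tot_obsv * conc^2 / sum(tot_obsv * conc^2)

# Residuals at the two points: small at B, large at A
resid_at_points <- rate - slope_manual * conc
```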
Question 2:
I am now fitting a GLM (link = logit), but how can I still force it through (0,0)? I have tried adding formula = y ~ 0 + x in several places, none of which seems to work.
library(ggplot2)
library(dplyr)   # %>% and left_join()
library(tibble)  # tibble()
M <- data.frame("rate" = c(0.4643,0.2143), "conc" = c(300,6000), "nr_dead" = c(13,3), "nr_surv" = c(15,11), "region" = c("A","B"))
M$tot_obsv <- (M$nr_dead+M$nr_surv)
M_conc <- M$conc
M_rate <- M$rate
M_tot_obsv <- M$tot_obsv
# Linear model of the data, forced through (0,0), weighted by the number of observations behind each data point
M_lm <- lm(rate ~ 0 + conc, data = M, weights = tot_obsv)
# Plot the line using predict(); the weights are only needed when fitting
x_conc <- c(600, 6700)
y_rate <- predict(M_lm, newdata = data.frame(conc = x_conc))
plot(x = M$conc, y = M$rate, pch = 16, ylim = c(0, 0.5), xlim = c(0,7000), xlab = "conc", ylab = "death rate")
lines(x_conc, y_rate, col = "red", lwd = 2)
# EDIT 1: binomial GLM without an intercept
M_glm <- glm(cbind(nr_dead, nr_surv) ~ 0 + conc, data = M, family = "binomial")
# Plot using predict()
binomial_smooth <- function(formula = y ~ 0 + x, ...) {
  geom_smooth(method = "glm", method.args = list(family = "binomial"), formula = formula, ...)
}
tibble(x_conc = c(seq(300, 7000, 1), M$conc), y_rate = predict(M_glm, list(conc = x_conc), type = "response")) %>%
  left_join(M, by = c('x_conc' = 'conc')) %>%
  ggplot(aes(x = x_conc, y = y_rate)) + xlab("concentration") + ylab("death rate") +
geom_point(aes(y = rate, size = tot_obsv)) + binomial_smooth(formula = (y ~ 0+x)) + theme_bw()
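A note on Question 2, as a sketch rather than a full answer: with a logit link, dropping the intercept forces the linear predictor through 0 at conc = 0, and a logit of 0 is a probability of 0.5, not 0. The snippet below only checks this on the data from the question; the object names eta_at_zero and p_at_zero are mine.

```r
M <- data.frame(conc = c(300, 6000), nr_dead = c(13, 3), nr_surv = c(15, 11))

# Binomial GLM with the intercept removed, as in the question
M_glm <- glm(cbind(nr_dead, nr_surv) ~ 0 + conc, data = M, family = binomial)

# On the logit scale the fit passes through 0 at conc = 0 ...
eta_at_zero <- unname(predict(M_glm, newdata = data.frame(conc = 0), type = "link"))
# ... which corresponds to a probability of plogis(0) = 0.5, not 0
p_at_zero <- unname(predict(M_glm, newdata = data.frame(conc = 0), type = "response"))
```

A probability of exactly 0 corresponds to a logit of minus infinity, so no finite intercept reaches it. Whether a different link or a transformed predictor suits these data is a modelling decision I can't make from two points.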

ggsurvplot: unable to use survfit when called from a function

I am trying to draw a survival plot and have run into an issue when moving my survfit call into a function, so that I can call it for different data sets. When I run the code
library(survival)   # survfit(), Surv(), lung
library(survminer)  # ggsurvplot()
library(scales)     # percent
fit <- survfit(Surv(time, status) ~ sex, data = lung)
allsurv <- function(fit){
  ggsurvplot(fit,  # opening line missing in the source; assumed here
             pval = TRUE,
             pval.coord = c(200, 0.10),
             conf.int = TRUE,
             xlab = "Days",
             ggtheme = theme_light(),
             surv.median.line = "hv",
             legend.labs = c("Male","Female"),  # lung codes sex as 1 = male, 2 = female
             legend.title = "",
             palette = c("#8C3F4D","#3E606F")) +
    scale_y_continuous(expand = c(0.02, 0.02), breaks = seq(from = 0, to = 1, by = 0.1), labels = percent) +
    scale_x_continuous(expand = c(0.006, 0.006),
                       limits = c(0, 366*12), breaks = seq(0, 4392, 100))
}
The plot is drawn normally.
However, when I call survfit from inside a function:
fit_all <- function(x){
  survfit(Surv(time, status) ~ sex, data = x)
}
I receive an error: "Error in eval(fit$call$data) : object 'x' not found"
Any ideas as to what I am doing wrong?
survminer includes a function, surv_fit, that acts as a wrapper around survfit. If you use surv_fit instead of survfit, the call stored in the returned object includes the whole data frame instead of just data = x. That works better when calling ggsurvplot inside a function:
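library(survival)
library(survminer)  # ggsurvplot() and surv_fit()
allsurv <- function(fit){
  ggsurvplot(fit,  # opening line missing in the source; assumed here
             pval = TRUE,
             pval.coord = c(200, 0.10),
             conf.int = TRUE,
             xlab = "Days",
             ggtheme = theme_light(),
             surv.median.line = "hv",
             legend.labs = c("Male","Female"),  # lung codes sex as 1 = male, 2 = female
             legend.title = "",
             palette = c("#8C3F4D","#3E606F"))
}
fit_all <- function(x){
  surv_fit(Surv(time, status) ~ sex, data = x)
}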
I managed to get it to work. For anyone else with the same issue: the problem seems to come from the call stored in the fit, which differs depending on whether survfit is run from inside another function or on its own.
Running it through fit_all(lung) returns Call: survfit(formula = Surv(time, status) ~ sex, data = x)
whereas running survfit directly returns Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
It seems that ggsurvplot takes the data named in the call as the data for the plot, and when the fit comes from another function that name is x, which does not exist where ggsurvplot looks for it. The way around this seems to be to pass the data to ggsurvplot explicitly, alongside the fit:
allsurv <- function(fit, x){
  ggsurvplot(fit,  # opening line missing in the source; assumed here
             data = x,
             pval = TRUE,
             pval.coord = c(200, 0.10),
             conf.int = TRUE,
             xlab = "Days",
             ggtheme = theme_light(),
             surv.median.line = "hv",
             legend.labs = c("Male","Female"),  # lung codes sex as 1 = male, 2 = female
             legend.title = "",
             palette = c("#8C3F4D","#3E606F")) +
    scale_y_continuous(expand = c(0.02, 0.02), breaks = seq(from = 0, to = 1, by = 0.1), labels = percent) +
    scale_x_continuous(expand = c(0.006, 0.006),
                       limits = c(0, 366*12), breaks = seq(0, 4392, 100))
}
allsurv(fit_all(lung), lung)
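The difference between the two calls can be inspected without plotting anything. This is a minimal sketch using only the survival package; fit_direct, fit_wrapped and lookup are names I made up for illustration.

```r
library(survival)

fit_all <- function(x) {
  survfit(Surv(time, status) ~ sex, data = x)
}

fit_direct  <- survfit(Surv(time, status) ~ sex, data = lung)
fit_wrapped <- fit_all(lung)

# The stored call records the expression used for `data`, not the data itself
deparse(fit_direct$call$data)   # name of an object that exists globally
deparse(fit_wrapped$call$data)  # name of the wrapper's local argument

# Evaluating that expression outside the wrapper fails unless a global
# object called x happens to exist
lookup <- tryCatch(eval(fit_wrapped$call$data, envir = globalenv()),
                   error = function(e) conditionMessage(e))
```

Any code that later re-evaluates fit$call$data, as the error message suggests ggsurvplot does, runs into the same lookup problem, which is why handing over the data explicitly helps.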

Side effect of nodesize in R randomForest?

I'm currently working on a randomForest model. With my configuration, cross-validation shows that high nodesize values perform best.
But then I noticed something strange.
Here is a reproducible piece of code:
For nodesize = nrow(data) + 4:
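library(randomForest)
library(data.table)
n = 10
sigma = 0.4
X = runif(n)
Y = runif(n)
Z = X^2+X*Y+Y+sigma*rnorm(n)
Data = data.table(X,Y,Z)
# the argument is ntree, not tree; 500 is also its default, so the results are unaffected
model = randomForest(formula = as.formula('Z ~ X + Y'), data = Data, mtry = 1, ntree = 500, nodesize = n+4, do.trace = TRUE)
pred = predict(model,Data)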
1.041549 1.036075 1.266310 1.324197 1.308377 1.480041 1.691081 1.752463 1.203856 1.306943
For nodesize = nrow(data) + 5:
n = 10
sigma = 0.4
X = runif(n)
Y = runif(n)
Z = X^2+X*Y+Y+sigma*rnorm(n)
Data = data.table(X,Y,Z)
# the argument is ntree, not tree; 500 is also its default, so the results are unaffected
model = randomForest(formula = as.formula('Z ~ X + Y'), data = Data, mtry = 1, ntree = 500, nodesize = n+5, do.trace = TRUE)
pred = predict(model,Data)
1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427
The predictions are identical for every observation once nodesize >= n + 5.
This is the case for any value of n (including n = 20000).
As I understand it, nodesize is the minimum number of records a node must contain for a split to be performed. That would mean that, with n records, the trees are grown by sampling n + 4 observations with replacement, so if nodesize > n + 4 no split is performed and the trees return the global mean, which is why every observation gets the same prediction. Does that make sense? Is there a parameter that sets how many samples are drawn from the original dataset?
Thanks in advance
@Morgan is right that there was a misunderstanding about what nodesize does: it is the minimum size of any terminal node. Still, the behaviour of randomForest seems unexpected (a bug?). It does create branches if n <= nodesize < n + 5, which is what you found.
If nodesize is 10 (the size of the sample), there should not be any splits, but randomForest still makes a split that separates several observations. It also does that when nodesize is one of 11:14 (not shown here):
library(randomForest)
library(ggplot2)
n = 10
sigma = 0.4
X = runif(n)
Y = runif(n)
Z = X^2+X*Y+Y+sigma*rnorm(n)
Data = data.frame(X,Y,Z)
# mtry = p, replace = FALSE and sampsize = n to eliminate randomness
model = randomForest(formula = Z ~ X + Y, data = Data,
                     mtry = 2, ntree = 50, nodesize = 10, replace = FALSE, sampsize = n)
grid <- expand.grid(X = seq(from = min(Data$X), to = max(Data$X), length.out = 100),
                    Y = seq(from = min(Data$Y), to = max(Data$Y), length.out = 100))
grid$grid_preds <- predict(model, grid)
ggplot(grid, aes(x = X, y = Y)) + geom_point(aes(color = grid_preds)) +
  geom_point(data = Data, aes(x = X, y = Y), size = 4, color = "blue") +
  theme(legend.position = "none")
If you set nodesize to n + 5 or higher, randomForest no longer makes any splits, as expected:
model = randomForest(formula = Z ~ X + Y, data = Data,
                     mtry = 2, ntree = 50, nodesize = 15, replace = FALSE, sampsize = n)
grid <- expand.grid(X = seq(from = min(Data$X), to = max(Data$X), length.out = 100),
Y = seq(from = min(Data$Y), to = max(Data$Y), length.out = 100))
grid$grid_preds <- predict(model, grid)
ggplot(grid, aes(x = X, y = Y, color = grid_preds)) + geom_point()
As a comparison, ranger shows the expected behavior and does not attempt any splits if min.node.size >= n:
library(ranger)
rang = ranger(Z ~ X + Y, data = Data, write.forest = TRUE,
              replace = FALSE, sample.fraction = 1,
              mtry = 2, num.trees = 50, min.node.size = 10)
grid$grid_preds <- predict(rang, grid)$predictions
ggplot(grid, aes(x = X, y = Y, color = grid_preds)) + geom_point()
By the way, the n+5 rule in randomForest also holds for n other than 10. I'm wondering too what's going on there.

How to graph my multiple linear regression model (caret)?

I have created a multiple linear regression model and would now like to plot it, but I can't figure out how. Any help would be greatly appreciated! I used Boruta to select the features and then train() to fit the model. When I try to plot model_lm I get the error:
There are no tuning parameters with more than 1 value.
Here is my code and what I have attempted so far:
rt_train <- rttotal2
rt_train$year <- NULL
#rt_train$box_office <- NULL
#impute NAs and address multicollinearity
preproc <- preProcess(rt_train, method = c("knnImpute","center",
rt_proc <- predict(preproc, rt_train)
rt_proc$box_office <- rt_train$box_office
titles <- rt_proc$titles
rt_proc$titles <- NULL
#rt_train$interval <- as.factor(rt_train$interval)
dmy <- dummyVars(" ~ .", data = rt_proc,fullRank = T)
rt_transform <- data.frame(predict(dmy, newdata = rt_proc))
index <- createDataPartition(rt_transform$interval, p =.75, list = FALSE)
train_m <- rt_transform[index, ]
rt_test <- rt_transform[-index, ]
y_train <- train_m$box_office
y_test <-rt_test$box_office
train_m$box_office <- NULL
rt_test$box_office <- NULL
#selected feature attributes
boruta.train <- Boruta(interval~., train_m, doTrace =1)
#graph to see most important var to interval
names(lz) <- colnames(boruta.train$ImpHistory)
plot(boruta.train, xlab = "", xaxt = "n")
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
#get most important attributes
final.boruta <- TentativeRoughFix(boruta.train)
getSelectedAttributes(final.boruta, withTentative = F)
boruta.rt_df <- attStats(final.boruta)
boruta.rt_df <- setDT(boruta.rt_df, keep.rownames = TRUE)[]
predictors <- boruta.rt_df %>%
filter(., decision =="Confirmed") %>%
select(., rn)
predictors <- unlist(predictors)
control <- trainControl(method="repeatedcv",
#look at residuals
#p-value is very small so reject H0 that predictors have no effect so
#we can use rotten tomatoes to predict box_office ranges
train_m$interval <- NULL
model_lm <- train(train_m[,predictors],
y_train, method='lm',
trControl = control, tuneLength = 10)
model_lm #.568
z <- varImp(object=model_lm)
z <- setDT(z, keep.rownames = TRUE)
z$model <- NULL
z$calledFrom <- NULL
plot(varImp(object=model_lm),main="Linear Model Variable Importance")
#get coeff
interc <- coef(model_lm$finalModel)
slope <- coef(model_lm$finalModel)
ggplot(data = rt_train, aes(y = box_office)) +
geom_point() +
geom_abline(slope = slope, intercept = interc, color = 'red')
This is what some of my input looks like. Thank you!!
Here is an example using the built-in data set cars:
library(caret)
library(ggplot2)
library(cowplot)  # plot_grid(), used at the end
data(cars, package = "datasets")
Build the model:
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 6)
model_lm <- train(dist ~ speed, data = cars, method='lm',
trControl = control, tuneLength = 10)
I will assume you would like to plot the final model.
You can use the caret predict.train function to get the predictions from the model and plot them:
pred <- predict(model_lm, cars)
pred <- data.frame(pred = pred, speed = cars$speed)
Additionally, you can pass the cars data set to geom_point to plot the observations:
ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x=speed, y = dist))
If you would like the confidence or prediction interval, you can use the predict.lm function on model_lm$finalModel.
Here is an example for the prediction interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "prediction")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_int <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
Or the confidence interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "confidence")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_conf <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
Plotting them side by side:
plot_grid(pred_int, pred_conf)
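To make the difference between the two ribbons concrete, here is a small sketch with plain lm() on the same cars data, leaving the caret wrapper out. The three speeds and all object names are arbitrary choices of mine.

```r
# Plain lm() on the built-in cars data; the caret wrapper is left out here
fit <- lm(dist ~ speed, data = cars)
new_speed <- data.frame(speed = c(10, 15, 20))

conf <- predict(fit, new_speed, interval = "confidence")
pred <- predict(fit, new_speed, interval = "prediction")

# Both intervals share the same fitted values ...
fitted_equal <- isTRUE(all.equal(conf[, "fit"], pred[, "fit"]))

# ... but the prediction interval also carries the residual variance,
# so it is wider at every speed
width_conf <- conf[, "upr"] - conf[, "lwr"]
width_pred <- pred[, "upr"] - pred[, "lwr"]
cbind(new_speed, width_conf, width_pred)
```

The confidence interval describes uncertainty about the mean response at a given speed, while the prediction interval describes where a single new observation is expected to fall.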
To plot the linear dependence on two variables you can use a 3D plot; with more variables than that, plotting the fit directly becomes a problem.
