Using the effects package inside a function in R

I have a large data frame with many variables that I'm going to analyze in the same way. Specifically, I want to plot effect confidence intervals from a mixed-effects model. I want to write a function that makes a custom plot for one dependent variable. Direct application of the effect() function goes well, but the same code inside a function causes an error.
I tried two variants of the function; both cause errors.
Here is my reproducible example:
library(nlme)
library(effects)
df <- data.frame(y = rnorm(90), x = gl(3, 30), b = factor(rep(1:30, 3)))
fit <- lme(fixed = y ~ x, random = ~ 1 | b, data = df, method = "REML")
ef <- effect("x", fit)
bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
ylim = c(min(ef$lower), max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2 ))
arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
test1 <- function(y, x, b)
{
fit <- lme(fixed = y ~ x, random = ~ 1 | b, method = "REML")
ef <- effect("x", fit)
bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
ylim = c(min(ef$lower), max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2 ))
arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
}
test1(df$y, df$x, df$b)
# Error in eval(predvars, data, env) : object 'y' not found
test2 <- function(y, x, b)
{
frame <- data.frame(y, x, b)
fit <- lme(fixed = y ~ x, random = ~ 1 | b, frame, method = "REML")
ef <- effect("x", fit)
bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
ylim = c(min(ef$lower), max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2 ))
arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
}
test2(df$y, df$x, df$b)
# Error in as.data.frame.default(data, optional = TRUE) :
# cannot coerce class ‘"function"’ to a data.frame

Simpler:
function(df) {
fit <- lme(fixed = y ~ x, random = ~ 1 | b, data = df, method = "REML")
ef <- effect("x", fit)
bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
ylim = c(min(ef$lower),
max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2 ))
arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
}
You need to pass data to lme; the formula alone doesn't carry any data.
That said, your test2 should work. I can replicate your error, but it is really very weird: somehow the code works in the global environment but not inside the function. Very surprising.
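For what it's worth, the test2 error looks like a scoping issue: effect() re-evaluates the name given as the model's data argument (here frame) outside the function, where frame resolves to the base graphics function frame(), hence "cannot coerce class 'function' to a data.frame". A sketch of a workaround (my addition, not part of the original answer) is to build the model call with do.call(), so the data frame itself, not just its name, is stored in the call that effect() later inspects:
test3 <- function(y, x, b) {
  dat <- data.frame(y = y, x = x, b = b)
  # do.call() evaluates its arguments, so fit$call$data holds the actual data frame
  fit <- do.call(nlme::lme,
                 list(fixed = y ~ x, random = ~ 1 | b, data = dat, method = "REML"))
  ef <- effects::effect("x", fit)
  bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
                ylim = c(min(ef$lower),
                         max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2))
  arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
}
test3(df$y, df$x, df$b)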

Related

How to fit a GAM model to several pairs of (x,y) variables

I am trying to fit a GAM model to a dataset consisting of two pairs of (x, y) values, i.e. (x1, y1) and (x2, y2), by first fitting the first pair and then moving to the second. When I call the gam function inside the for loop, it gives the error "Not enough (non-NA) data to do anything meaningful".
I suspect this has something to do with the way I construct the x1, y1, x2 and y2 column labels, because outside the for loop the gam call works.
Thank you!
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-26. For overview type 'help("mgcv-package")'.
library(ggplot2)
library(tidyverse)
# create dataframe
x1 = seq(0, 50, by = 0.5)
y1 = dnorm(x1, mean = 22, sd = 5)
x2 = seq(0, 50, by = 0.5)
y2 = dnorm(x2, mean = 28, sd = 7)
df = cbind.data.frame(x1, y1, x2, y2)
# plot(c(x1,x2), c(y1,y2))
count = ncol(df)/2
for (i in 1:count) {
x<-noquote(paste("x", i, sep = ""))
y<-noquote(paste("y", i, sep = ""))
print(x) # test
gam(y ~ s(x), data = df, method = "REML") # this call doesn't work
}
gam(y1 ~ s(x1), data = df, method = "REML") # this call works
I have managed to figure out what the problem is. It turned out that my construction of the xi and yi variables was causing it: the strings built with paste()/noquote() are never substituted into y ~ s(x), so that formula still refers to literal columns named y and x. I had to construct the equation as a string outside the gam() call, convert it to type "formula" with as.formula(), and then use it in the gam() call.
library(mgcv)
library(ggplot2)
library(tidyverse)
# create test dataframe
x1 = seq(0, 50, by = 0.5)
y1 = dnorm(x1, mean = 25, sd = 5)
x2 = seq(0, 50, by = 0.5)
y2 = dnorm(x2, mean = 29, sd = 7)
df = cbind.data.frame(x1, y1, x2, y2)
plot(c(df$x1,df$x2), c(df$y1,df$y2))
(count = ncol(df)/2)
for (i in 1:count) {
# construct the formula to go into the "gam" function and convert it to type "formula" with the "as.formula" function
part1 <- noquote(paste0("y", i))
part2 <- paste0("~ s(")
frag1 <- paste(part1, part2)
part3 <- noquote(paste0("x", i))
frag2 <- paste0(frag1, part3)
frag3 <- paste0(frag2, ")")
fmla <- as.formula(frag3)
# fit the data
gam_mod <- gam(formula = fmla, data = df, method = "REML")
print(gam_mod)
}
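A more compact sketch of the same idea (my addition, not from the original answer): reformulate() builds the formula from character strings in one step, which avoids the chain of paste() calls.
for (i in 1:count) {
  # e.g. i = 1 gives y1 ~ s(x1)
  fmla <- reformulate(termlabels = sprintf("s(x%d)", i),
                      response   = sprintf("y%d", i))
  gam_mod <- gam(fmla, data = df, method = "REML")
  print(gam_mod)
}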

How to fit an exponential regression in R? (a.k.a. changing the base of the power)

I am fitting exponential regressions in R.
Specifically, I want to compare y = exp(ax + b) with y = 5^(ax + b).
# data
set.seed(1)
y <- c(3.5, 2.9, 2.97,4.58,6.18,7.11,9.50,9.81,10.17,10.53,
12.33,14.14,18, 22, 25, 39, 40, 55, 69, 72) + rnorm(20, 10, 1)
x <- 1:length(y)
df = data.frame(x = x, y = y)
predata = data.frame(x = 1:20)
# plot
plot(df, ylim = c(0,100), xlim = c(0,40))
# simple linear regression
fit_sr = lm(y~x, data = df)
pre_sr = predict(fit_sr, newdata = predata,
interval ='confidence',
level = 0.90)
lines(pre_sr[,1], col = "red")
# exponential regression 1
fit_er1 = lm(log(y, base = exp(1))~x, data = df)
pre_er1 = predict(fit_er1, newdata = predata,
interval ='confidence',
level = 0.90)
pre_er1 = exp(1)^pre_er1 # back-transform from the log scale
lines(pre_er1[,1], col = "dark green")
# exponential regression 2
fit_er2 = lm(log(y, base = 5) ~ x, data = df)
pre_er2 = predict(fit_er2, newdata = predata,
interval ='confidence',
level = 0.90)
pre_er2 = 5^pre_er2 # back-transform from the log (base 5) scale
lines(pre_er2[,1], col = "blue")
I expect something like this (plot1), but exponential regressions 1 and 2 turn out to be exactly the same (plot2).
The two regressions should be different because the response variable is different.
Also, I am looking for how to make y = exp(ax+b) + c fitting in R.
Your code is correct; your theory is where the problem is. The models should be the same.
The easiest way to see this is to think on the log scale, as you've done in your code. Starting with y = exp(ax + b) we get log(y) = ax + b, a linear model with log(y) as the response. With y = 5^(cx + d) we get log(y) = (cx + d) * log(5) = (c*log(5)) * x + (d*log(5)), also a linear model with log(y) as the response. The model fit and predictions will not differ with a different base; you can convert the base-5 coefficients to base-e coefficients by multiplying them by log(5): a = c*log(5) and b = d*log(5).
It's a bit like wanting to compare the linear models y = ax + b where x is measured in meters vs y = ax + b where x is measured in centimeters. The coefficients will change to accommodate the scale, but the fit isn't any different.
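A quick numerical check of that relationship, using the fits from the question (my addition):
coef(fit_er1)           # base-e coefficients a, b
coef(fit_er2) * log(5)  # base-5 coefficients c, d rescaled; identical to the line above
all.equal(unname(coef(fit_er1)), unname(coef(fit_er2) * log(5)))  # should be TRUE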
The first part is already answered by @Gregor; the second part, "...I am looking for how to make y = exp(ax+b) + c fitting in R", can be done with nls():
fit_er3 <- nls(y ~ exp(a*x+b) + c, data = df, start=list(a=1,b=0,c=0))
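Assuming the fit converges with those starting values, you can overlay it on the existing plot the same way as the other fits:
lines(predict(fit_er3, newdata = predata), col = "purple")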

Overlay decision boundary for random forests and boosting

I generate some random data and am trying to overlay a decision boundary based on fits from random forests and boosting. I can recreate the problem below. I generate the data, and using regression trees I can easily overlay the decision boundary with the following code:
library(tidyverse)
library(rsample)       # initial_split(), training(), testing()
library(tree)          # tree()
library(randomForest)  # randomForest()
library(gbm)           # gbm()
# set seed and generate some random data
set.seed(123)
Dat <- tibble(
x1 = rnorm(100),
x2 = rnorm(100)
) %>% mutate(y = as_factor(ifelse(x1^2 + x2^2 > 1.39, "A", "B")))
circlepts <- tibble(theta = seq(0, 2*pi, length = 100)) %>%
mutate(x = sqrt(1.39) * sin(theta), y = sqrt(1.39) * cos(theta))
# graph the data and draw the boundary
p <- ggplot(Dat) + geom_point(aes(x1, x2, color = y)) + coord_fixed() +
geom_polygon(data = circlepts, aes(x, y), color = "blue", fill = NA)
# convert character to binary inputs making classification easier
binVec = as.vector(Dat$y)
binVec[which(binVec =="A")] = 1
binVec[which(binVec == "B")] = 0
binVec = as.numeric(binVec)
Dat$y = binVec
# split the data up
datasplit <- initial_split(Dat, prop = 0.7)
training_set <- as_tibble(training(datasplit))
testing_set <- as_tibble(testing(datasplit))
tree_fit <- tree(y~ ., training_set)
grid <- crossing(x1 = modelr::seq_range(testing_set$x1, 50), x2 = modelr::seq_range(testing_set$x2, 50)) %>%
modelr::add_predictions(tree_fit)
# plot the data with the decision overlay of the tree fit
p + geom_contour(data = grid, aes(x2, x1, z = as.numeric(pred)), binwidth = 1)
Now if I try doing so with random forests or gradient boosting, add_predictions doesn't cooperate that well...
rf_fit <- randomForest(y ~ ., data=training_set, mtry = 2, ntree=500)
grid <- crossing(x1 = modelr::seq_range(testing_set$x1, 50), x2 = modelr::seq_range(testing_set$x2, 50)) %>%
modelr::add_predictions(rf_fit)
p + geom_contour(data = grid, aes(x2, x1, z = as.numeric(pred)), binwidth = 1)
##ERROR: Error in if (is.na(out.type)) stop("type must be one of 'response', 'prob', 'vote'") : argument is of length zero
And for gradient boosting:
fitBoost <- gbm(y ~ ., data= Dat, distribution = "gaussian",
n.trees = 1000)
pred <- predict(fitBoost, newdata=training_set, n.trees=1000)
grid <- crossing(x1 = modelr::seq_range(testing_set$x1, 50), x2 = modelr::seq_range(testing_set$x2, 50)) %>%
modelr::add_predictions(fitBoost)
### ERROR: Error in paste("Using", n.trees, "trees...\n") : argument "n.trees" is missing, with no default
It seems to be a very simple problem. Could someone help me out?
The following code works with your random forest:
training_set$y <- factor(training_set$y)
rf_fit <- randomForest(y ~ ., data=training_set, mtry=2, ntree=500)
grid <- crossing(x1 = modelr::seq_range(testing_set$x1, 50),
x2 = modelr::seq_range(testing_set$x2, 50)) %>%
modelr::add_predictions(rf_fit)
p + geom_contour(data = grid, aes(x2, x1, z = as.numeric(pred)), binwidth = 1)
And here is the code for the gradient boosting machine. modelr::add_predictions() calls predict() with no extra arguments, but this gbm model's predict method needs n.trees (that is what the error above complains about), so the two helpers below forward n.trees = 1000:
fitBoost <- gbm(y ~ ., data=Dat, distribution="gaussian", n.trees=1000)
pred <- predict(fitBoost, newdata=training_set, n.trees=1000)
add_predictions2 <- function (data, model, var = "pred", type = NULL)
{
data[[var]] <- predict2(model, data, type = type)
data
}
predict2 <- function (model, data, type = NULL)
{
if (is.null(type)) {
stats::predict(model, data, n.trees=1000)
} else {
stats::predict(model, data, type = type, n.trees=1000)
}
}
grid <- crossing(x1 = modelr::seq_range(testing_set$x1, 50),
x2 = modelr::seq_range(testing_set$x2, 50)) %>%
add_predictions2(fitBoost)
p + geom_contour(data = grid, aes(x2, x1, z = as.numeric(pred)), binwidth = 1)
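A simpler alternative (a sketch of my own, not part of the original answer) is to skip modelr here and call predict() on the grid directly, passing n.trees yourself:
grid <- crossing(x1 = modelr::seq_range(testing_set$x1, 50),
                 x2 = modelr::seq_range(testing_set$x2, 50))
grid$pred <- predict(fitBoost, newdata = grid, n.trees = 1000)
p + geom_contour(data = grid, aes(x2, x1, z = as.numeric(pred)), binwidth = 1)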

How to display different levels in a multilevel analysis with different colors

I am a beginner at multilevel analysis and am trying to understand how I can make graphs with the plot functions from base R. I understand the output of fit below, but I am struggling with the visualization. df is just some simple test data:
t <- seq(0, 10, 1)
df <- data.frame(t = t,
y = 1.5+0.5*(-1)^t + (1.5+0.5*(-1)^t) * t,
p1 = as.factor(rep(c("p1", "p2"), 10)[1:11]))
fit <- lm(y ~ t * p1, data = df)
# I am looking for an automated version of that:
plot(df$t, df$y)
lines(df$t[df$p1 == "p1"],
fit$coefficients[1] + fit$coefficients[2] * df$t[df$p1 == "p1"], col = "blue")
lines(df$t[df$p1 == "p2"],
fit$coefficients[1] + fit$coefficients[2] * df$t[df$p1 == "p2"] +
fit$coefficients[3] + fit$coefficients[4] * df$t[df$p1 == "p2"], col = "red")
It should know that it has to include p1 and that there are two lines.
The result should look like this:
Edit: est <- predict(fit, newx = t) gives the same result as fit, but I still don't know "how to cluster".
Edit 2 @Keith: The formula y ~ t * p1 reads y = (a + c * p1) + (b + d * p1) * t. For the "first blue line", c and d are both zero.
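As a small illustration (my addition, not from the post), the design matrix that lm() builds from this formula makes those four coefficients explicit:
head(model.matrix(y ~ t * p1, data = df))
# columns: (Intercept) = a, t = b, p1p2 = c, t:p1p2 = d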
This is how I would do it. I'm including a ggplot2 version of the plot as well, because I find it better suited to the way I think about plots.
This version will account for the number of levels in p1. If you want to include more model variables, you will just have to adjust the way you construct xy so that it contains all the relevant variables. I should point out that if you omit the newdata argument, predictions are made for the dataset originally passed to lm.
t <- seq(0, 10, 1)
df <- data.frame(t = t,
y = 1.5+0.5*(-1)^t + (1.5+0.5*(-1)^t) * t,
p1 = as.factor(rep(c("p1", "p2"), 10)[1:11]))
fit <- lm(y ~ t * p1, data = df)
xy <- data.frame(t = t, p1 = rep(levels(df$p1), each = length(t)))
xy$fitted <- predict(fit, newdata = xy)
library(RColorBrewer) # for colors, you can define your own
cols <- brewer.pal(n = length(levels(df$p1)), name = "Set1") # feel free to ignore the warning
plot(x = df$t, y = df$y)
for (i in 1:length(levels(xy$p1))) {
tmp <- xy[xy$p1 == levels(xy$p1)[i], ]
lines(x = tmp$t, y = tmp$fitted, col = cols[i])
}
library(ggplot2)
ggplot(xy, aes(x = t, y = fitted, color = p1)) +
theme_bw() +
geom_point(data = df, aes(x = t, y = y)) +
geom_line()
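A small sketch (my addition) of the point above about constructing xy with all relevant variables: expand.grid() builds the full prediction grid in one step and scales naturally to more predictors or more factor levels.
xy <- expand.grid(t = t, p1 = levels(df$p1))
xy$fitted <- predict(fit, newdata = xy)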

2 polynomial regressions in a ggplot() graph

This is my Dataset:
As you can see, there are two quantitative variables (X, Y) and one categorical variable (molar, with two levels: M1, M2).
I would like to represent in one single graph two polynomial regressions and their respective prediction intervals: one for the M1 level and one for the M2 level. Each polynomial regression has its own degree (degree 4 for M1 and degree 6 for M2).
I want to use the ggplot() function (from the ggplot2 package in R). I have actually produced this figure, but with all the data merged (that is, with no distinction between the levels). This is the code I used:
# Fit a linear model
m <- lm(Y ~ X+I(X^2)+I(X^3)+I(X^4), data = Dataset)
# cbind the predictions to Dataset
mpi <- cbind(Dataset, predict(m, interval = "prediction"))
ggplot(mpi, aes(x = X)) +
geom_ribbon(aes(ymin = lwr, ymax = upr),
fill = "blue", alpha = 0.2) +
geom_point(aes(y = Y)) +
geom_line(aes(y = fit), colour = "blue", size = 1)
With this result:
So, I would like to have two polynomial regressions of different degrees (one for M1 and one for M2), taking into account their respective prediction intervals. What would the exact code be?
UPDATE - New code! I ran this code with no success:
M1=subset(Dataset,Dataset$molar=="M1",select=X:Y)
M2=subset(Dataset,Dataset$molar=="M2",select=X:Y)
M1.R <- lm(Y ~ X +I(X^2)+I(X^3)+I(X^4),
data=subset(Dataset,Dataset$molar=="M1",select=X:Y))
M2.R <- lm(Y ~ X +I(X^2)+I(X^3)+I(X^4),
data=subset(Dataset,Dataset$molar=="M2",select=X:Y))
newdf <- data.frame(x = seq(0, 1, c(408,663)))
M1.P <- cbind(data=subset(Dataset,Dataset$molar=="M1",select=X:Y), predict(M1.R, interval = "prediction"))
M2.P <- cbind(data=subset(Dataset,Dataset$molar=="M2",select=X:Y), predict(M2.R, interval = "prediction"))
p = cbind(as.data.frame(rbind(M1.P, M2.P)), f = factor(rep(1:2, c(408,663)), x = rep(newdf$x, 2))
mdf = with(Dataset, data.frame(x = rep(x, 2), y = c(subset(Dataset,Dataset$molar=="M1",select=Y), subset(Dataset,Dataset$molar=="M2",select=Y),
f = factor(rep(1:2, c(408,663))))
ggplot(mdf, aes(x = x, y = y, colour = f)) + geom_point() +
geom_ribbon(data = p, aes(x = x, ymin = lwr, ymax = upr,
fill = f, y = NULL, colour = NULL),
alpha = 0.2) +
geom_line(data = p, aes(x = x, y = fit))
These are the messages I get now:
[98] WARNING: Warning in if (n < 0L) stop("wrong sign in 'by' argument") :
the condition has length > 1 and only the first element will be used
Warning in if (n > .Machine$integer.max) stop("'by' argument is much too small") :
the condition has length > 1 and only the first element will be used
Warning in 0L:n :
numerical expression has 2 elements: only the first used
Warning in if (by > 0) pmin(x, to) else pmax(x, to) :
the condition has length > 1 and only the first element will be used
[99] WARNING: Warning in predict.lm(M1.R, interval = "prediction") :
predictions on current data refer to _future_ responses
[100] WARNING: Warning in predict.lm(M2.R, interval = "prediction") :
predictions on current data refer to _future_ responses
[101] ERROR: <text>
I think I am closer but still can't see it. Help!
Here is one way. If you have more than two models/levels in the factor, you should look into code that loops over the levels of the factor and fits the models that way.
Anyway, first some dummy data:
set.seed(100)
x <- runif(100)
y1 <- 2 + (0.3 * x) + (2.4 * x^2) + (-2.5 * x^3) + (3.4 * x^4) + rnorm(100)
y2 <- -1 + (0.3 * x) + (2.4 * x^2) + (-2.5 * x^3) + (3.4 * x^4) +
(-0.3 * x^5) + (2.4 * x^6) + rnorm(100)
df <- data.frame(x, y1, y2)
Fit our two models:
m1 <- lm(y1 ~ poly(x, 4), data = df)
m2 <- lm(y2 ~ poly(x, 6), data = df)
Now predict at some new locations x and stick the predictions together with x and f, a factor indexing the model, in a tidy format:
newdf <- data.frame(x = seq(0, 1, length = 100))
p1 <- predict(m1, newdata = newdf, interval = "prediction")
p2 <- predict(m2, newdata = newdf, interval = "prediction")
p <- cbind(as.data.frame(rbind(p1, p2)), f = factor(rep(1:2, each = 100)),
x = rep(newdf$x, 2))
Melt the original data into tidy form
mdf <- with(df, data.frame(x = rep(x, 2), y = c(y1, y2),
f = factor(rep(1:2, each = 100))))
Draw the plot, using colour to distinguish the models/data
ggplot(mdf, aes(x = x, y = y, colour = f)) +
geom_point() +
geom_ribbon(data = p, aes(x = x, ymin = lwr, ymax = upr,
fill = f, y = NULL, colour = NULL),
alpha = 0.2) +
geom_line(data = p, aes(x = x, y = fit))
This gets us
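For completeness, a sketch (my own adaptation, not part of the original answer) of how the same recipe might map onto the poster's Dataset, assuming it has columns X and Y and a factor molar with levels "M1" and "M2":
M1.R <- lm(Y ~ poly(X, 4), data = subset(Dataset, molar == "M1"))
M2.R <- lm(Y ~ poly(X, 6), data = subset(Dataset, molar == "M2"))
newdf <- data.frame(X = seq(min(Dataset$X), max(Dataset$X), length = 200))
p <- rbind(
  cbind(newdf, as.data.frame(predict(M1.R, newdata = newdf, interval = "prediction")), molar = "M1"),
  cbind(newdf, as.data.frame(predict(M2.R, newdata = newdf, interval = "prediction")), molar = "M2"))
ggplot(Dataset, aes(x = X, y = Y, colour = molar)) +
  geom_point() +
  geom_ribbon(data = p, aes(x = X, ymin = lwr, ymax = upr, fill = molar,
                            y = NULL, colour = NULL), alpha = 0.2) +
  geom_line(data = p, aes(x = X, y = fit))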
