I know this has been asked multiple times, but I could not find an answer that solves the problem I am encountering.
I would like to generate a prediction curve and superimpose it on a ggplot. The model is a quadratic plateau nonlinear function.
The data is as follows:
dd_ <- data.frame(yield = c(2.07, 1.58, 2.01, 2.27, 3.28,
2.31, 2.49, 2.41, 3.90, 3.26,
3.37, 3.83, 4.06, 3.54, 3.75,
3.48, 4.51, 3.39, 4.09, 3.87,
4.31, 4.36, 4.66, 3.79, 4.17,
4.63, 3.99, 3.88, 4.73),
n_trt = c(0,0,0,0,25,25,25,25,
50,50,50,50,75,75,75,75,
100,100,100,100,125,125,125,125,
150,150,150,175,175))
The function is:
quadratic.plateau <- function(alpha, beta, gamma, D, x){
ifelse(x < D, alpha + beta*x + gamma*x*x, alpha + beta*D + gamma*D*D)
}
I use the minpack.lm package as it produces a better fit than nls.
library(minpack.lm)
library(ggiraphExtra)
q_model <- nlsLM(yield~quadratic.plateau(A,B,C, D, n_trt),
data = dd_, start=list(A=2.005904,
B=0.03158664,
C=-0.0001082836,
D = 145.8515 ))
ggPredict(q_model,interactive=TRUE,colorn=100,jitter=FALSE)
By doing this, I receive an error:
Error: $ operator is invalid for atomic vectors
I also used
ggplot(dd_, aes(n_trt, yield)) +
geom_point(size = 0.5) +
geom_smooth(method = "quadratic.plateau", data = dd_)
but no prediction curve was generated.
I appreciate your help. Thanks!
Almost identical to this question: the main point is that you have to set se=FALSE because predict.nls() doesn't return standard errors ...
ggplot(dd_, aes(n_trt, yield)) +
geom_point(size = 0.5) +
geom_smooth(method="nlsLM",
se=FALSE,
formula=y~quadratic.plateau(A,B,C, D, x),
method.args=list(start=list(A=2.005904,
B=0.03158664,
C=-0.0001082836,
D = 145.8515 )))
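If you would rather keep a fitted model object around, an alternative (a sketch using the q_model fitted above; the 200-point grid size is arbitrary) is to predict over a fine grid of n_trt values and draw the curve with geom_line():
# Sketch: evaluate the fitted nlsLM model on a grid and overlay the curve
newdat <- data.frame(n_trt = seq(min(dd_$n_trt), max(dd_$n_trt), length.out = 200))
newdat$yield <- predict(q_model, newdata = newdat)
ggplot(dd_, aes(n_trt, yield)) +
geom_point(size = 0.5) +
geom_line(data = newdat, colour = "blue")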
After a few attempts, this solved my problem:
eq1 = function(x){
ifelse(x< coef(q_model)[4], coef(q_model)[1]+coef(q_model)[2]*x+coef(q_model)[3]*x*x,
coef(q_model)[1]+coef(q_model)[2]*coef(q_model)[4]+coef(q_model)[3]*coef(q_model)[4]*coef(q_model)[4])
}
ggplot(dd_, aes(n_trt, yield)) +
geom_point(size = 0.5) +
stat_function(fun=eq1,geom="line",color=scales::hue_pal()(2)[1])
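A minor cleanup of the same idea (a sketch; it assumes the coefficient names A, B, C, D set by the start list): extract the coefficients once instead of calling coef(q_model) repeatedly.
# Same curve as eq1, with coefficients pulled out once for readability
cf <- as.list(coef(q_model))  # names A, B, C, D come from the start list
eq1 <- function(x) with(cf, ifelse(x < D, A + B*x + C*x^2, A + B*D + C*D^2))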
I am having trouble understanding how to implement a Gaussian kernel density estimation of the following dataset in R. I would appreciate it if you could help me understand the mechanism of how to do it. I am currently trying to get a formula for the bell-shaped curves at the bottom of the following picture. As you can see, there is one bell-shaped curve for each data point. (Note: the picture does not represent the data I am using.)
This is my data:
x<-c(4.09, 4.46, 4.61, 4.30, 4.03, 5.22, 4.21, 4.07, 4.02, 4.58, 4.66, 4.05, 4.23, 5.51, 4.03, 4.72, 4.47, 4.50, 5.80, 4.30, 4.09, 4.78, 4.18, 4.45, 4.40, 5.60, 4.37, 4.42, 4.88, 4.20, 4.45, 4.10, 4.43, 4.58, 4.40, 4.38)
(x has 36 elements)
This is the kernel density estimator (the formula was an image; it comes from this page: http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/tutorials/xlghtmlnode33.html):
fhat(x) = (1/(n*h)) * sum_{i=1}^{n} K((x_i - x)/h)
where K(u) = (1/sqrt(2*pi)) * exp(-u^2/2) is the Gaussian kernel function and h = 0.1516 is the bandwidth selected by Scott.
So, plugging in, we get fhat(x) = (1/(36*0.1516)) * (1/sqrt(2*pi)) * [exp(-(1/2)*((4.09-x)/0.1516)^2) + exp(-(1/2)*((4.46-x)/0.1516)^2) + ... + exp(-(1/2)*((4.38-x)/0.1516)^2)]
Ok. So we have a function of x. But how do we get the equation of each of the bell shaped curves in the above diagram? If we plug in, for example, 4.09, into f hat (x) we get a number, not a curve/function/distribution. Can someone help me understand the procedure to find the equation for the bell shaped curve/kernel density estimate?
Here's a function that will return your fhat function, given your x values and h value:
get_fhat <- function(x, h) {
Vectorize(function(z) 1/length(x)/h*sum(dnorm((x-z)/h)))
}
This returns a function that we can use to evaluate the estimate; we Vectorize() it so we can pass in multiple values at once.
We can get a single value or plot it with
fhat <- get_fhat(x, .1516)
fhat(4.09)
# [1] 0.9121099
curve(fhat, from=min(x), to=max(x))
(figure: the density estimate drawn by curve())
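To get the individual bell-shaped curves the question asks about: each data point x_i contributes its own scaled kernel f_i(z) = K((z - x_i)/h)/(n*h), and fhat is the sum of these n curves. A sketch for a single point (the plotting range is arbitrary):
# Bell curve attached to a single data point x_i: a Gaussian centred at
# x_i, scaled by 1/(n*h) so that the n individual curves sum to fhat
kernel_i <- function(z, xi, h, n) dnorm((z - xi) / h) / (n * h)
curve(kernel_i(x, xi = 4.09, h = 0.1516, n = 36), from = 3.5, to = 6.2)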
## Given data
x <- c(4.09, 4.46, 4.61, 4.30, 4.03, 5.22, 4.21, 4.07, 4.02, 4.58, 4.66, 4.05,
4.23, 5.51, 4.03, 4.72, 4.47, 4.50, 5.80, 4.30, 4.09, 4.78, 4.18, 4.45,
4.40, 5.60, 4.37, 4.42, 4.88, 4.20, 4.45, 4.10, 4.43, 4.58, 4.40, 4.38)
h <- 0.1516
# GaussianKernel
GK <- function(u) {(1/sqrt(2*pi))*exp(-(u^2)/2)} # or dnorm(u)
This function gives a similar plot.
DensityGraph <- function(x, h){
n <- length(x)
xi <- seq(min(x) - sd(x), max(x) + sd(x), length.out = 512)
# fhat without sum since we are interested in the bell shaped curves
fhat <- sapply(x, function(y){(1/(n*h))*GK((xi - y)/h)})
# histogram of x
hist (x, freq = FALSE, nclass = 15, main = "Kernel density with histogram",
xlab = paste("N = ", n, " ", "Bandwidth = ", h))
# add fhat with sum
lines(xi, rowSums(fhat), lwd = 2)
# add the bell shaped curves
apply(fhat, 2, function(j) lines(xi, j, col = 4))
# show data points
rug (x, lwd = 2, col = 2)
}
DensityGraph(x = x, h = 0.05)
Blue bell shaped curves represent each data point of x
DensityGraph(x = x, h = 0.1516)
Compare with the built-in density function in R:
lines(density(x = x, bw = 0.1516), col = 3, lwd = 2)
fhat for each x
This function gives the value of fhat at a specific x:
fhat <- function(x, h, specific_x){
n <- length(x)
xi <- seq(min(x) - sd(x), max(x) + sd(x), length.out = 512)
f <- rowSums(sapply(x, function(y){(1/(n*h))*GK((xi - y)/h)}))
kde <- data.frame(xi, fhat = f)
indx <- which.min(abs(xi - specific_x))
fx <- kde[indx, "fhat"]
list(fx = fx, kde = kde)
}
KernelDensity <- fhat(x = x, h = 0.1516, specific_x = 4.09)
KernelDensity$fx
# [1] 0.9114677
plot(KernelDensity$kde, type = "l", lwd = 2, xlab = "")
title(xlab = paste("N = ", length(x), " Bandwidth = ", h))
rug(x, lwd = 2, col = 2)
Compare with the built-in density function:
lines(density(x, bw = 0.1516), col = 5)
I would like to plot a threshold model with smooth 95% confidence interval lines between line segments. You would think this would be on the simple side but I have not been able to find an answer!
My threshold/breakpoints are known, it would be great if there were a way to visualize this data. I have tried the segmented package which produces the following plot:
The plot shows a threshold model with a breakpoint at 5.4. However, the confidence intervals are not smooth between regression lines.
If anyone knows of any way to produce smooth (i.e. without the jump between line segments) CI lines between segmented regression lines (ideally in ggplot) that would be amazing. Thank you so much.
I have included sample data and the code I have tried below:
x <- c(2.26, 1.95, 1.59, 1.81, 2.01, 1.63, 1.62, 1.19, 1.41, 1.35, 1.32, 1.52, 1.10, 1.12, 1.11, 1.14, 1.23, 1.05, 0.95, 1.30, 0.79,
0.81, 1.15, 1.10, 1.29, 0.97, 1.05, 1.05, 0.84, 0.64, 0.80, 0.81, 0.61, 0.71, 0.75, 0.30, 0.30, 0.49, 1.13, 0.55, 0.77, 0.51,
0.67, 0.43, 1.11, 0.29, 0.36, 0.57, 0.02, 0.22, 3.18, 3.79, 2.49, 2.44, 2.12, 2.45, 3.22, 3.44, 3.86, 3.53, 3.13)
y <- c(22.37, 18.93, 16.99, 15.65, 14.62, 13.79, 13.09, 12.49, 11.95, 11.48, 11.05, 10.66, 10.30, 9.96, 9.65, 9.35, 9.07, 8.81,
8.56, 8.32, 8.09, 7.87, 7.65, 7.45, 7.25, 7.05, 6.86, 6.68, 6.50, 6.32, 6.15, 5.97, 5.80, 5.63, 5.47, 5.30,
5.13, 4.96, 4.80, 4.63, 4.45, 4.28, 4.09, 3.90, 3.71, 3.50, 3.27, 3.01, 2.70, 2.28, 22.37, 16.99, 11.05, 8.81,
8.56, 8.32, 7.25, 7.05, 6.50, 6.15, 5.63)
library(segmented)
lin.mod <- lm(y ~ x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi = 2)
plot(x, y)
plot(segmented.mod, add=TRUE, conf.level = 0.95)
which produces the following plot (and associated jumps in 95% confidence intervals):
(figure: the segmented fit, with a visible jump in the confidence bands at the breakpoint)
Background: The non-smoothness in existing change-point packages is due to the fact that frequentist packages operate with a fixed change point value. But as with all inferred parameters, treating the location as fixed is wrong, because there is real uncertainty about where the change occurs.
Solution: AFAIK, only Bayesian methods can quantify that uncertainty, and the mcp package fills this gap.
library(mcp)
model = list(
y ~ 1 + x, # Segment 1: Intercept and slope
~ 0 + x # Segment 2: Joined slope (no intercept change)
)
fit = mcp(model, data = data.frame(x, y))
Default plot (plot.mcpfit() returns a ggplot object):
plot(fit) + ggtitle("Default plot")
Each line represents a possible model that generated the data. The posterior for the change point is shown as a blue density. You can add a credible interval on top using plot(fit, q_fit = TRUE) or plot it alone:
plot(fit, lines = 0, q_fit = c(0.025, 0.975), cp_dens = FALSE) + ggtitle("Credible interval only")
If your change point is indeed known and if you want to model different residual scales for each segment (i.e., quasi-emulate segmented), you can do:
model2 = list(
y ~ 1 + x,
~ 0 + x + sigma(1) # Add intercept change in residual scale
)
fit = mcp(model2, data = data.frame(x, y), prior = list(cp_1 = 1.9))  # Note: the prior here is a fixed value - not a distribution.
plot(fit, q_fit = TRUE, cp_dens = FALSE)
Notice that the CI does not "jump" around the change point as in segmented. I believe that this is the correct behavior. Disclosure: I am the author of mcp.
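If you want the numbers rather than the picture, the posterior summary should give point estimates and credible intervals (a sketch; it assumes the fit object from the first model above):
# Posterior means and credible intervals for all parameters,
# including the change point cp_1
summary(fit)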
I have seen some papers that handle residuals in a regression analysis in a smart way: they plot the residual distribution perpendicular to the fitted line.
An example image is Figure 2 or 5 (linear regression) in: https://www.nature.com/articles/nn.4538#results
My R example:
The example data is taken from: https://www.r-bloggers.com/simple-linear-regression-2/
Data example:
alligator = data.frame(
lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
Linear regression model:
reg <- lm(alligator$lnWeight ~ alligator$lnLength)
Scatter plot:
plot(alligator,
xlab = "Snout vent length (inches) on log scale",
ylab = "Weight (pounds) on log scale",
main = "Alligators in Central Florida"
)
Fitted line:
abline(reg,col = "black", lwd = 1)
Residual distribution (histogram):
hist(reg$residuals, 10, xaxt='n', yaxt='n', ann=FALSE)
I would like to insert the histogram at the top right of the linear regression plot, as in the example image from Figure 2 or 5 (linear regression) in: https://www.nature.com/articles/nn.4538#results
Thanks for the help.
This will work to get the residual histogram overlaid on the main plot. You'll have to do a bit of work to get it to be angled perpendicular, as in the examples you cite.
library("ggplot2")
theme_set(theme_minimal())
alligator = data.frame(
lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
reg <- lm(alligator$lnWeight ~ alligator$lnLength)
# make main plot, with best fit line (set se=TRUE to get error ribbon)
main_plot <- ggplot(alligator, aes(x=lnLength, y=lnWeight)) +
geom_point() + geom_smooth(method="lm", se=FALSE) +
scale_y_continuous(limits=c(0,7))
# create another plot, histogram of the residuals
added_plot <- ggplot(data.frame(resid=reg$residuals), aes(x=resid)) +
geom_histogram(bins=5) +
theme(panel.grid=element_blank(),
axis.text.y=element_blank(),
axis.text.x=element_text(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
axis.ticks.y=element_blank(),
axis.line.y=element_blank())
# turn the residual plot into a ggplotGrob() object
added_plot_grob <- ggplot2::ggplotGrob(added_plot)
# then add the residual plot to the main one as a custom annotation
main_plot + annotation_custom(grob=added_plot_grob,
xmin=4.0, xmax=4.35, ymin=1, ymax=5)
Then check out the docs for ggplot2:: and gridExtra:: to figure out the rotation. Hope this helps!
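As a starting point for that rotation, one option (a sketch, not tested against your exact layout) is to wrap the grob in a rotated grid viewport before handing it to annotation_custom():
library(grid)
# Wrap the residual-plot grob in a viewport rotated by 45 degrees; the
# angle that looks perpendicular depends on the slope and on the aspect
# ratio of the main plot, so expect to tweak it
rotated_grob <- grobTree(added_plot_grob, vp = viewport(angle = 45))
main_plot + annotation_custom(grob = rotated_grob,
xmin = 4.0, xmax = 4.35, ymin = 1, ymax = 5)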
Background: Species richness scales with body weight to the power of -0.75. However, when I fit my data, I get a slope of 0.57. A friend told me that summary(lm) prints the 'best fit' slope of the data. Nevertheless, I'm wondering if I can create a regression plot wherein I force the slope to be -0.75, as in the literature.
My code is:
The y values:
log.nspecies.dec = c(3.05, 2.95, 2.97, 2.98, 2.84, 2.85, 2.83, 2.71, 2.64, 2.62, 2.58, 2.37,
2.26, 2.17, 2.00, 1.88, 1.75, 1.62, 1.36, 1.30, 1.08, 1.20, 0.90, 0.30, 0.70,
0.30, 0.48, 0.00, 0.30, 0.00)
The x values:
logbio.dec = c(2.1, 2.3, 2.5, 2.7, 2.9, 3.1, 3.3, 3.7, 3.9, 4.1, 4.3, 4.5, 4.7, 4.9, 5.1,
5.3, 5.5, 5.7, 5.9, 6.1, 6.3, 6.5, 6.7, 6.9, 7.1, 7.3, 7.5, 7.7, 7.9)
Plot a barplot and superimpose a regression line. First, name the y values with the x values:
names(log.nspecies.dec) = logbio.dec
Order the y values:
log.nspecies.dec = log.nspecies.dec[order (as.numeric(names(log.nspecies.dec)))]
Do the barplot and fit the regression:
xpos = barplot(log.nspecies.dec, las = 2, space = 0)
lm.fit = lm(log.nspecies.dec ~ as.numeric(names(log.nspecies.dec)))
summary(lm.fit)
y.init = lm.fit$coefficients[2] * as.numeric(names(log.nspecies.dec))[1] +
lm.fit$coefficients[1]
y.end = lm.fit$coefficients[2] * as.numeric(names(log.nspecies.dec))[length(log.nspecies.dec)] +
lm.fit$coefficients[1]
segments(xpos[1], y.init, xpos[length(xpos)], y.end, lwd = 2, col = 'red')
title(main = 'ln Number of species ~ lm Weight')
coef(lm.fit)
gives a result wherein the slope is 0.57. How do I force the slope to -0.75?
You can use offset to fix a coefficient at a known value. For example, to fix the y-intercept at -2:
## Example data
x = 1:10
y = -2 + 2* x
# Fit the model
(m = lm(y ~ 0 + x, offset = rep(-2, length(y))))
#Call:
#lm(formula = y ~ 0 + x, offset = rep(-2, length(y)))
#Coefficients:
#x
#2
The output correctly identifies the gradient as 2.
The reason your code doesn't work is that you are using abline(). Looking at ?abline, it states that to draw the line it will in turn call coef(MODEL). When you use offset, the coef function doesn't return the y-intercept.
R> coef(m)
x
2
Hence abline draws the wrong line.
If the intercept is changed, the approach still works:
x = 1:10
y = 2 + 2*x
lm(y ~ 0 + x, offset = rep(2, length(y)))
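The same offset trick also answers the original question of forcing the slope to -0.75: move the known slope term into the offset and estimate only the intercept. A sketch using the question's variable names (xpos comes from the earlier barplot() call):
# Fix the slope at -0.75 and estimate only the intercept
xv <- as.numeric(names(log.nspecies.dec))
m2 <- lm(log.nspecies.dec ~ 1, offset = -0.75 * xv)
coef(m2)  # the fitted intercept; the slope is -0.75 by construction
# Draw the forced-slope line on the barplot
segments(xpos[1], coef(m2)[1] - 0.75 * xv[1],
xpos[length(xpos)], coef(m2)[1] - 0.75 * xv[length(xv)],
lwd = 2, col = 'blue')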
I am using auto.arima() from the forecast package and am getting some strange results with the prediction.
library(forecast)
x <- structure(c(1.92, 2.1, 1.73, 1.35, 1.29, 1.35, 1.42, 1.46, 1.6,
1.67, 1.98, 1.78, 1.77, 2.35, 1.93, 1.43, 1.29, 1.26, 1.93, 2.33,
2.22, 2.19, 2.15, 2.25, 3.12, 3.32, 2.72, 2.28, 2.28, 2.16, 2.81,
3.12, 2.85, 2.98, 3.3, 3.06, 3.56, 3.81, 3.48, 2.64, 2.91, 3.35,
3.73, 3.58, 4, 3.94, 3.79, 3.85), .Tsp = c(2012, 2015.91666666667,
12), class = "ts")
fit <- auto.arima(x)
plot(forecast(fit, 12)) #forecast and actual data
f2 <- fitted.values(fit)
lines(f2, col="red") #add predicted values during training
I don't understand how the fitted values (red line) can be very close to the observed values (black) while there is such a big jump in the first forecast.
Any ideas why we see this jump? I've seen other posts on Stack Exchange where the xreg option was used, but I am not using that here, so I haven't been able to track down a similar post.
Generally I tend to believe that auto.arima slightly overfits the data. Some quick exploratory analysis with the ACF shows that (0,1,2)(0,1,0)[12] is already a decent model. I will use arima0 from base R to fit this model:
fit0 <- arima0(x, order = c(0,1,2), seasonal = c(0,1,0))
Prediction / forecasting is done with predict.arima0:
pred <- predict(fit0, n.ahead = 12, se.fit = FALSE)
Let's plot the observed series and the forecast together:
ts.plot(x, pred, col = 1:2)
There is still a jump. But the variation is fairly reasonable, compared with the variability of the series.
Nothing is wrong. When we forecast x[49] from x[1:48], it will be different from x[48]. A (0,1,2)(0,1,0)[12] model implies a linear trend plus a seasonal effect. It helps to visualize your time series and the prediction season by season:
ts.plot(window(x, 2012, 2012 + 11/12),
window(x, 2013, 2013 + 11/12),
window(x, 2014, 2014 + 11/12),
window(x, 2015, 2015 + 11/12),
pred, col = 1:5)
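To see how much the two models actually disagree, you can overlay both one-year forecasts (a sketch; it assumes the fit and fit0 objects from above):
# Overlay the auto.arima forecast and the manual (0,1,2)(0,1,0)[12] forecast
fc_auto <- forecast(fit, 12)$mean
fc_manual <- predict(fit0, n.ahead = 12, se.fit = FALSE)
ts.plot(x, fc_auto, fc_manual, col = c(1, 2, 4), lty = c(1, 2, 2))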