Plotting cross validation of ridge regression's MSE - r

First of all, I apologize for my poor English. I want to reproduce the plot of the ridge regression's cross-validated MSE with ggplot2 instead of the plot() function that is included in R.
The object cv.out is defined by the following expression:
cv.out <- cv.glmnet(x_var[train,], y_var[train], alpha = 0)
When I print that object, these are its elements:
     Lambda   Measure       SE  Nonzero
min   439.8  32554969  1044541        5
1se  1343.1  33586547  1068662        5
This is the plot with plot(cv.out):
What I want is the same plot, but more elaborate, with ggplot, and I don't know which aesthetics to put in the function. These are the elements of cv.out that I can access with cv.out$:
lambda
cvm
cvsd
cvup
cvlo
nzero
call
name
lambda.min
lambda.1se
Finally, thanks for your help. I really appreciate it. :)

Using example dataset:
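library(glmnet)
# predictors: every column except mpg; response: mpg
X = as.matrix(mtcars[,-1])
y = as.matrix(mtcars[,1])
cv.out = cv.glmnet(X,y,alpha=0)  # alpha = 0 selects ridge regression
plot(cv.out)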
You just need to pull the values out into a data.frame and plot them using geom_point() and geom_errorbar():
library(ggplot2)
df = with(cv.out,
          data.frame(lambda = lambda,MSE = cvm,MSEhi=cvup,MSElow=cvlo))
ggplot(df,aes(x=lambda,y=MSE)) +
  geom_point(col="#f05454") +
  scale_x_log10("log(lambda)") +
  geom_errorbar(aes(ymin = MSElow,ymax=MSEhi),col="#30475e") +
  geom_vline(xintercept=c(cv.out$lambda.1se,cv.out$lambda.min),
             linetype="dashed")+
  theme_bw()
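As a rough sketch of how the components listed in the question line up with the printed summary, the snippet below builds a hand-made list with the same field names as cv.out. The list cv_mock is a hypothetical stand-in, not a real cv.glmnet fit. Its numbers are the two rows printed in the question, and cvup and cvlo are filled in as cvm plus and minus cvsd, which is how the glmnet documentation describes them.

```r
# Hypothetical stand-in for cv.out: same field names, but only the two
# lambda values shown in the question's printed summary.
cv_mock <- list(
  lambda     = c(1343.1, 439.8),
  cvm        = c(33586547, 32554969),  # "Measure" column (mean CV error)
  cvsd       = c(1068662, 1044541),    # "SE" column
  lambda.min = 439.8,
  lambda.1se = 1343.1
)
cv_mock$cvup <- cv_mock$cvm + cv_mock$cvsd  # upper end of the error bar
cv_mock$cvlo <- cv_mock$cvm - cv_mock$cvsd  # lower end of the error bar

# Same data.frame construction as in the answer above
df_mock <- with(cv_mock,
                data.frame(lambda = lambda, MSE = cvm,
                           MSEhi = cvup, MSElow = cvlo))

# Row for the lambda with the smallest cross-validated error
at_min <- df_mock[df_mock$lambda == cv_mock$lambda.min, ]

# One-standard-error rule: the error at lambda.1se should not exceed
# the upper error bar at lambda.min
within_1se <- df_mock$MSE[df_mock$lambda == cv_mock$lambda.1se] <= at_min$MSEhi

print(df_mock)
print(within_1se)
```

With a real fit, the same lookups work on cv.out directly, and the two dashed vertical lines in the plot mark exactly these two rows.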

Related

Plotting Linear Regression Line with Confidence Interval

I am trying to plot how 'Square feet' of a home affects 'Sales Price (in $1000)' of the same. Particularly, I want the coefficient line from Square ft vs. Sales price plotted with a hypothetical grey area around the line with the original datapoints superimposed.
I have tried to complete this a few different ways. One way I have tried is using the function effect_plot from library(jtools). I used the coding I found from https://cran.r-project.org/web/packages/jtools/vignettes/effect_plot.html.
But when I run this function, I don't get a plot, I just get an error: Error in FUN(X[[i]], ...) : object 'Sales Price (in $1000)' not found.
The second way I have attempted is through manually creating a new vector and attempting to plot the confidence interval. My code inspiration is from Plotting a 95% confidence interval for a lm object.
But with this one, I get an error in the conf_interval line: Error in eval(predvars, data, env) : object 'Square feet' not found. I cannot figure out how to correct this error.
And finally, I have tried to use library(ggplot2) to complete the problem with inspiration from https://rpubs.com/aaronsc32/regression-confidence-prediction-intervals.
But each time I run R, it creates a coordinate plane with a single point in the center of the plane; there is no line, no real points, no hypothetical confidence interval. There are no errors and I also cannot figure out the issue with the coding.
library("jtools")
LRA1 <- lm(`Sales Price (in $1000)` ~ `Square feet` + Rooms +
Bedrooms + Age,data=HomedataSRS) #LRA1 is the regression model
effect_plot(LRA1, pred = 'Square feet', inherit.aes = FALSE,
plot.points = TRUE) #function should create graph
newSF = seq(min(HomedataSRS$`Square feet`),
max(HomedataSRS$`Square feet`), by = 0.05)
conf_interval <- predict(LRA1, newdata=data.frame(x=newSF),
interval="confidence",level = 0.95)
plot(HomedataSRS$`Square feet`, HomedataSRS$`Sales Price (in $1000)`,
xlab="Square feet", ylab="Sales Price(in $1000)",
main="Regression")
abline(LRA1, col="lightblue")
matlines(newSF, conf_interval[,2:3], col = "blue", lty=2)
library(ggplot2)
SFHT <- HomedataSRS %>% select(1:2)
#This is to select the 2 variables I'm working with
ggplot(SFHT, aes(x='Square feet', inherit.aes = FALSE,
y='Sales Price (in $1000)')) +
geom_point(color='#2980B9', size = 4) +
geom_smooth(method=lm, color='#2C3E50')
Arguments to aes() should not be quoted. Try:
ggplot(SFHT, aes(x = `Square feet`, y = `Sales Price (in $1000)`)) +
geom_point(color='#2980B9', size = 4) +
geom_smooth(method=lm, color='#2C3E50')
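Alternatively, you could use aes_string(), which takes the mappings as strings. It parses each string as R code, so non-syntactic column names still need backticks inside the quotes, and it is soft-deprecated in ggplot2 3.0.0 and later:
ggplot(SFHT, aes_string(x='`Square feet`',y='`Sales Price (in $1000)`')) +
  geom_point(color='#2980B9', size = 4) +
  geom_smooth(method=lm, color='#2C3E50')
More info is available on the package site: https://ggplot2.tidyverse.org/reference/aes_.html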
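If the column names really do have to be passed as strings, one option in current ggplot2 is the .data pronoun inside aes(). The following is a minimal sketch, assuming ggplot2 3.0.0 or later; the homes data frame and its numbers are made up for illustration and only stand in for SFHT.

```r
library(ggplot2)

# Hypothetical stand-in for SFHT; the values are invented for illustration.
# check.names = FALSE keeps the spaces and parentheses in the column names.
homes <- data.frame(
  `Square feet` = c(1200, 1500, 1800, 2100, 2600),
  `Sales Price (in $1000)` = c(150, 185, 210, 260, 310),
  check.names = FALSE
)

# Column names held as plain strings
x_col <- "Square feet"
y_col <- "Sales Price (in $1000)"

# .data[[...]] looks the column up by name, so no backticks are needed
p <- ggplot(homes, aes(x = .data[[x_col]], y = .data[[y_col]])) +
  geom_point(color = "#2980B9", size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, color = "#2C3E50") +
  labs(x = x_col, y = y_col)

print(p)
```

The labs() call is there because the default axis titles would otherwise show the .data[[...]] expressions rather than the column names.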

Having problems with nonlinear regression in ggplot

attenuation = data.frame(km =
c(0,0,0.4,0.4,0.8,0.8,1.2,1.2,1.6,1.6,2,2,2.4,2.4,2.8,2.8,3.2,3.2,3.6,3.6,4,
4,4.4,4.4,4.8,4.8,5.2,5.2,5.6,5.6,6,6,6.4,6.4,6.8,6.8,7.2,7.2,7.6,7.6,8,8,
11.7,11.7,13,13), edna = c(76000,20000,0,0,6000,0,0,6880,10700,0,6000,
0,0,0,0,0,0,6000,0,0,0,0,0,0,0,0,6310,0,6000,6000,0,0,0,0,0,
0,0,0,0,0,0,6000,0,0,0,0))
#This worked great for a linear regression
ggplot(attenuation, aes(x = km, y = edna)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("Distance from Cage (km)") +
ylab("eDNA concentration (gene sequence/Liter)")
But the linear regression doesn't seem to be a good fit (r squared =0.09). So I'd like to try something else. I tried some other regressions also with poor fits, so I'd like to try a nonlinear regression.
I have researched this question on stack overflow and tried a number of different options, but nothing is working. The option I provide below makes the most sense-but I wonder if I have the formula wrong? Or if the start list needs to be modified?
For context I am trying to explore the relationship between river distance and concentration.
#This is not working for a nonlinear regression
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method = 'nls', formula = 'y~a*x^b', method.args=list (start =
list(a = 1,b=1), se=FALSE))
I get the following error from r when I run the code for nls above
Computation failed in stat_smooth():
variable lengths differ (found for '(se)')
You have 2 problems. First, a misplaced ")": se=FALSE is an argument to stat_smooth(), not to method.args:
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method='nls', formula='y~a*x^b', method.args=list(start =
list(a=1, b=1)), se=FALSE)
But this will not work either, because your model is impossible to fit to your data. Look at the equation. When x=0, y will equal 0. For values of x greater than 0, y will increase unless b is negative, but then y at x=0 is Inf, so the algorithm fails when it tries negative values. Since you have a decreasing relationship, you need to specify a function that is defined for x=0, along with plausible starting values. This one-parameter function fits your data better than a linear function (it could also be written as a*(x + 1)^-1, which is essentially your function with b fixed at -1 and 1 added to x so that it is defined at x=0):
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method = 'nls', formula = 'y~a/(x + 1)',
method.args=list(start=list(a=50000)), se=FALSE)
I picked 50000 by splitting the difference between 20,000 and 76,000. The final estimate is about 20,000. You can bend the curve more sharply by adding a second parameter, but you have so many 0 values it may be too much depending on what you are trying to communicate:
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method='nls', formula='y~a*(1+x)^b', method.args=list(start =
list(a=50000, b=-1)), se=FALSE)
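To inspect the estimate behind the one-parameter curve, the same model can be fit with nls() outside of ggplot. This is only a sketch for checking the numbers: it reuses the data from the question and the starting value a = 50000 from above, and the comparison against lm() via residual sums of squares is an added illustration, not part of the original answer.

```r
# Data copied from the question
attenuation <- data.frame(km =
  c(0,0,0.4,0.4,0.8,0.8,1.2,1.2,1.6,1.6,2,2,2.4,2.4,2.8,2.8,3.2,3.2,3.6,3.6,4,
    4,4.4,4.4,4.8,4.8,5.2,5.2,5.6,5.6,6,6,6.4,6.4,6.8,6.8,7.2,7.2,7.6,7.6,8,8,
    11.7,11.7,13,13), edna = c(76000,20000,0,0,6000,0,0,6880,10700,0,6000,
    0,0,0,0,0,0,6000,0,0,0,0,0,0,0,0,6310,0,6000,6000,0,0,0,0,0,
    0,0,0,0,0,0,6000,0,0,0,0))

# One-parameter model y = a / (x + 1), same start value as in the plot code
fit_nls <- nls(edna ~ a / (km + 1), data = attenuation,
               start = list(a = 50000))

# Straight-line fit for comparison
fit_lm <- lm(edna ~ km, data = attenuation)

# Estimated a, and residual sum of squares for both fits
a_hat   <- coef(fit_nls)[["a"]]
rss_nls <- sum(resid(fit_nls)^2)
rss_lm  <- sum(resid(fit_lm)^2)

print(a_hat)
print(c(nls = rss_nls, lm = rss_lm))
```

A smaller residual sum of squares for the nls fit is what "fits better than a linear function" means here, even though the nls model has one parameter fewer.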
I agree with @dcarlson's answer. You've got a pretty small data set here (a total of 11 non-zero data points, two of which fall on top of each other) so you probably shouldn't push any conclusions too hard. The first two points are definitely large, and there might be a mild declining trend after that, but beyond that you can't say too much.
If you want to do the power-law fit you have to displace the zero-km data point from the origin. I've done it by adding 0.1 to the x values. This is an arbitrary choice on my part and should be thought about carefully on your end ... (note that there's a large difference in the results if you add 0.1 as I did or 1 as @dcarlson did). I also had to put in more reasonable starting values, which I did by fitting a log-log linear regression (lm(log(edna) ~ log(km+0.1), data=attenuation)) and extracting the coefficients (which were approximately 4 and -1.5).
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method = 'nls', formula = 'y~a*(x+0.1)^b',
method.args=list (start = list(a = exp(4),b=-1.5)), se=FALSE)
You can also do this slightly more efficiently with a log-link Gaussian GLM as follows (you still need to displace the x-values from zero). I also added some code to disambiguate the repeated points.
ggplot(attenuation, aes(x = km, y = edna))+
stat_sum() +
geom_smooth(method="glm", formula=y~log(x+0.1),
method.args=list(family=gaussian(link="log"),
start=c(4,-1.5)))+
scale_size(breaks=c(1,2),range=c(1,3))

ggplot smooth pass aes variable to method.args

After many google searches I decided to ask for your help, guys.
I am plotting just some observations at different time points and I want to add a linear regression with stat_smooth. However, I want the linear model with the intercept at 100 (because data are percentage relative to time 0). To do that, I found that the easiest way is to use the offset parameter in lm. The problem is how to get the number of 'y' observations per group(col and facet groups) to pass it to offset parameter.
If I use data with the same number of observations per group (10 in my case), I can just write the number and it works great:
myplot <- ggplot(mydt2, aes(x=Time_point, y=GFP_rel, col=Gene, fill=Gene,group=Gene))
myplot <- myplot + stat_smooth(method='lm', formula = y ~ x + 0, method.args=list(offset=rep(100,10))) +
facet_wrap(~Cell_line)
However, this is not very elegant or flexible. My question is: how can I pass the number of observations to method.args? I tried offset(100,..count..), but I get the error: (list) object cannot be coerced to type 'integer'.
Any suggestions?
Thanks
You can use the I(y - 100) coding in the formula as shown here instead of using an offset.
However, the predicted values for stat_smooth will then be predictions for y - 100, not y. This line will go through 0. You can move the lines back to the position to display predictions of the original y variable using position_nudge.
So the stat_smooth code would look something like
stat_smooth(method = "lm", formula = I(y - 100) ~ x + 0,
position = position_nudge(y = 100))
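To see why the shifted response gives the same line as the offset, the two fits can be compared with plain lm(). This is a small sketch on made-up data; the toy data frame is hypothetical and only mimics percentages that start near 100.

```r
set.seed(1)

# Made-up data: percentages that start at 100 and decline over time
toy <- data.frame(x = rep(0:9, times = 2))
toy$y <- 100 - 4 * toy$x + rnorm(nrow(toy), sd = 2)

# Intercept fixed at 100 through an explicit offset vector
# (this is the version that needs the number of observations)
fit_offset <- lm(y ~ x + 0, data = toy, offset = rep(100, nrow(toy)))

# Same constraint through the shifted response (no group size needed)
fit_shift <- lm(I(y - 100) ~ x + 0, data = toy)

# Both models estimate the same slope
print(coef(fit_offset))
print(coef(fit_shift))

# Predictions from the shifted model are for y - 100, so add 100 back;
# this is the job position_nudge(y = 100) does in the plot
pred_y <- predict(fit_shift, newdata = data.frame(x = 0:9)) + 100
print(pred_y)
```

Because the shifted formula never refers to the number of rows, it works unchanged when groups or facets have different numbers of observations.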

add exponential function given mean and intercept to cdf plot

Considering the following random data:
set.seed(123456)
# generate random normal data
x <- rnorm(100, mean = 20, sd = 5)
weights <- 1:100
df1 <- data.frame(x, weights)
#
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
We can create a general cumulative distribution plot.
But, I want to compare my curve to that from data used 20 years ago. From the paper, I only know that the data is "best modeled by a shifted exponential distribution with an x intercept of 1.1 and a mean of 18"
How can I add such a function to my plot?
+ stat_function(fun=dexp, geom = "line", size=2, col="red", args = (mean=18.1))
but I am not sure how to deal with the shift (x intercept)
I think scenarios like this are best handled by making your function first outside of the ggplot call.
dexp doesn't take a mean parameter; it uses rate instead, which is the same as lambda. That means you want rate = 1/18.1, based on the properties of exponential distributions. Also, I don't think dexp makes much sense here, since it shows the density and I think you really want the cumulative probability, which is pexp.
Your code could look something like this:
library(ggplot2)
test <- function(x) {pexp(x, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red")
You could shift your pexp distribution like this:
test <- function(x) {pexp(x-10, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red") +
xlim(10,45)
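For the specific numbers quoted from the paper, one possible reading is sketched below. It assumes that the "x intercept of 1.1" is the shift and that the "mean of 18" is the mean of the shifted distribution, so the rate becomes 1/(18 - 1.1). If the paper instead means that the unshifted exponential has mean 18, the rate would be 1/18. The name shifted_pexp is made up for this example.

```r
library(ggplot2)

# Assumed interpretation of the paper's description (see the text above)
shift       <- 1.1
target_mean <- 18
rate        <- 1 / (target_mean - shift)  # mean of a shifted exponential = shift + 1/rate

# CDF of the shifted exponential; pexp() returns 0 for negative arguments,
# so the curve is 0 to the left of the shift
shifted_pexp <- function(x) pexp(x - shift, rate = rate)

# Same random data as in the question
set.seed(123456)
df1 <- data.frame(x = rnorm(100, mean = 20, sd = 5))

p <- ggplot(df1, aes(x)) +
  stat_ecdf() +
  stat_function(fun = shifted_pexp, colour = "red")

print(p)
```

Since stat_function() only draws over the range of the data, adding xlim() as in the example above extends the curve down to the x intercept.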
Just for fun, this is what using dexp produces:
I am not entirely sure I understand the concept of a mean for an exponential function. However, in general, when you pass a function as an argument (fun=dexp in your case), you can also pass your own modified function in the form fun = function(x) dexp(x)+1.1, for example.
Maybe experimenting with this feature will get you to the solution.

How can I overlay timeseries models for exponential decay into ggplot2 graphics?

I'm trying to plot an exponential decay line (with error bars) onto a scatterplot in ggplot of price information over time. I currently have this:
f2 <- ggplot(data, aes(x=date, y=cost) ) +
geom_point(aes(y = cost), colour="red", size=2) +
geom_smooth(se=T, method="lm", formula=y~x) +
# geom_smooth(se=T) +
theme_bw() +
xlab("Time") +
scale_y_log10("Price over time") +
opts(title="The Falling Price over time")
print(f2)
The key line is in the geom_smooth command: formula=y~x. Although this looks like a linear model, ggplot seems to automatically detect my scale_y_log10 and log it.
Now, my issue here is that date is a date data type. I think I need to convert it to seconds since t=0 to be able to apply an exponential decay model of the form y = Ae^-(bx).
I believe this because when I tried things like y = exp(x), I get a message that I think(?) is telling me I can't take exponents of dates. It reads:
Error in lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, :
NA/NaN/Inf in foreign function call (arg 1)
However, log(y) = x works correctly. (y is a numeric data type, x is a date.)
Is there a convenient way to fit exponential growth/decay time series models within ggplot plots in the geom_smooth(formula=formula) function call?
This appears to work, although I don't know how finicky it will be with real/messy data:
set.seed(101)
dat <- data.frame(d=seq.Date(as.Date("2010-01-01"),
as.Date("2010-12-31"),by="1 day"),
y=rnorm(365,mean=exp(5-(1:365)/100),sd=5))
library(ggplot2)
g1 <- ggplot(dat,aes(x=d,y=y))+geom_point()+expand_limits(y=0)
# current ggplot2 versions pass fitting arguments to glm() through method.args
g1+geom_smooth(method="glm",
               method.args=list(family=gaussian(link="log"),
                                start=c(5,0)))
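To read the decay parameters of y = A*exp(-b*x) off such a fit, the same log-link GLM can be run outside ggplot on a numeric time variable. The sketch below reuses the simulated data from above; converting the dates to days since the first observation (the column t) is an assumption made here so that the intercept refers to the start of the series.

```r
# Same simulated data as above
set.seed(101)
dat <- data.frame(d = seq.Date(as.Date("2010-01-01"),
                               as.Date("2010-12-31"), by = "1 day"),
                  y = rnorm(365, mean = exp(5 - (1:365) / 100), sd = 5))

# Days elapsed since the first observation (0, 1, 2, ...)
dat$t <- as.numeric(dat$d - min(dat$d))

# Gaussian GLM with log link: E[y] = exp(b0 + b1 * t)
# start is needed because some simulated y values are not positive
fit <- glm(y ~ t, family = gaussian(link = "log"),
           data = dat, start = c(5, 0))

# Translate the coefficients to the A * exp(-b * t) form
A <- exp(coef(fit)[["(Intercept)"]])
b <- -coef(fit)[["t"]]

print(c(A = A, b = b))
```

The data were simulated with a decay rate of 0.01 per day, so the estimate of b should land close to that value.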

Resources