I have two groups and an x and a y variable. I want to run a linear regression to see whether there is a significant relationship between x and y within each group, and I also want to test whether that relationship differs significantly between the groups. I would then like to plot the results and show the p-value, regression equation, and R^2 value on the graph. How would I go about accomplishing this?
I am able to plot the data on the same graph using this code:
ggplot(data_NeuroPsych, aes(x = Flanker_Ratio, y = Neuropsych_Delta, color = Group)) +
geom_point() +
geom_smooth(method = "lm", fill = NA)
Then using this open source code I was able to look at the results separately: https://github.com/kassambara/ggpubr/blob/master/R/stat_regline_equation.R#L7
The issue with the above is the data is not on the same plot and it does not look at the comparison between groups.
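One common way to get both the within-group results and the between-group comparison is to fit one regression per group and then a single model with an interaction term. This is a minimal sketch on simulated data, not a definitive analysis: the column names come from the question, but the group labels "A" and "B" and all values are made up for illustration.

```r
# Simulated stand-in for data_NeuroPsych (hypothetical groups "A" and "B")
set.seed(123)
n <- 40
data_NeuroPsych <- data.frame(
  Group = rep(c("A", "B"), each = n),
  Flanker_Ratio = runif(2 * n, 0.5, 1.5)
)
true_slope <- ifelse(data_NeuroPsych$Group == "A", 1, 3)
data_NeuroPsych$Neuropsych_Delta <-
  true_slope * data_NeuroPsych$Flanker_Ratio + rnorm(2 * n, sd = 0.5)

# Within each group: separate regressions, collecting equation terms, R^2 and p-value
fits <- lapply(split(data_NeuroPsych, data_NeuroPsych$Group),
               function(d) lm(Neuropsych_Delta ~ Flanker_Ratio, data = d))
within_group <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    r_squared = s$r.squared,
    p_slope   = s$coefficients[2, 4])
}))

# Between groups: the interaction term tests whether the slopes differ
fit_int <- lm(Neuropsych_Delta ~ Flanker_Ratio * Group, data = data_NeuroPsych)
p_interaction <- summary(fit_int)$coefficients["Flanker_Ratio:GroupB", "Pr(>|t|)"]

within_group
p_interaction
```

The rows of within_group hold what is needed for a per-group label (equation, R^2, p-value), which can be pasted into a string and placed on the combined plot with annotate("text", ...). The interaction coefficient equals the difference between the two within-group slopes, so its p-value is the test of whether the groups differ.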
INTRO: I am new to R and to Stack Overflow... I am doing a term paper and need to run some stats on how, or rather when, habits develop.
Ideally, habit formation follows Mitscherlich's law and looks like a non-linear regression curve approaching an asymptote. Once a participant reaches their plateau (defined as reaching 95% of the asymptote), one can speak of an established habit... Well, actually that is debatable... BUT we are replicating a study by Lally et al. 2010 (How habits are formed: Modelling habit formation in the real world), so we have to stick to certain criteria.
ACTUAL QUESTION: The first step is to obtain the R2 for the linear and non-linear regressions. This I managed.
But for some reason I cannot obtain the x-axis value of the intersection (orange point in picture) between the non-linear function and my 95% habit plateau line (purple line in picture)...
Here is an example of what an ideal graph looks like
But exactly this x value is crucial for group comparison and, later on, for checking for significant differences...
Of course I already googled, but I could not make sense of the solutions presented for other, similar questions... It seems one cannot do it in ggplot with geom_point() and therefore has to build a separate formula using the approx() function, right?
Maybe someone can help me out... Thanks in advance!
And here is the code of interest...
library(ggplot2)
library(patchwork)
library(stats)
days <- 0:14  # 15 days, one per score (0:15 gives 16 values and data.frame() would fail)
score <- c(14,17,16,22,24,27,30,31,32,35,40,43,42,43,43)
df <- data.frame(days,score)
# red curve in graph
#This way the R2 for the nonlinear regression is obtained for later analysis
nonlinreg1 <- nls(score ~ SSasymp(days, Asym, R0, lrc), data = df)
summary(nonlinreg1)
RSS <- sum(residuals(nonlinreg1)^2)
TSS <- sum((df$score - mean(df$score))^2)
R.square.nonlinreg1 <- 1 - (RSS/TSS)
R.square.nonlinreg1
# purple line in graph
#Definition of plateau at 95% of asymptote reached
Asymp95 <- summary(nonlinreg1)$parameters[1,1] * 0.95
# fitted values of the non-linear model at the observed days
nls_line <- predict(nonlinreg1)
# interpolate the day at which the fitted curve crosses the 95% plateau
HabitReach95 <- approx(nls_line, df$days, xout = Asymp95)$y
# Now in GGplot
ggplot(data=df,aes(x=days, y=score)) +
geom_point()+
#HERE now from this intersect below, I would like to obtain the exact X-value
geom_point(x = HabitReach95, y= Asymp95, aes(color="Intersect"), lwd=2) +
#this is the rest of ggplot code but I think it is not of interest for the described problem, but still just in case...
geom_smooth(method=lm, aes(color="Linear Reg"), se=F) +
geom_smooth(method="nls", formula=y~SSasymp(x, Asym, R0, lrc), aes(color="Non-Linear Reg"), se=F) +
geom_hline(aes(color="Asymptote for non-linear Reg", yintercept=summary(nonlinreg1)$parameters[1,1])) +
geom_hline(aes(color="Habit plateau at 95%", yintercept=Asymp95)) +
xlab("Days of Experiment") + ylab("Automaticity Score Habit") +
ggtitle("Test graph for participant") +
theme(plot.title = element_text(hjust = 0.5))+
#ylim(0,49)+ # Actual graph or scale for experiment
scale_color_manual(values = c("green", "purple", "orange", "blue", "red"), name="Legend")
OMg I am so stupid... I already calculate it with this line here!!!
HabitReach95 <- approx(nls_line, df$days, xout = Asymp95)$y
Haha can not believe it... well sometimes you don't see the forest for the trees!
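A note on the approx() approach: it interpolates between the fitted values at the observed days, so it returns NA if the fitted curve has not yet reached the 95% line within the observed range. Because SSasymp has the closed form y = Asym + (R0 - Asym) * exp(-exp(lrc) * x), the crossing day can also be solved for directly. A minimal sketch, where days_to_plateau() is a hypothetical helper name and the parameter values are made up for illustration:

```r
# Day at which an SSasymp curve reaches a fraction p of its asymptote.
# Model: y = Asym + (R0 - Asym) * exp(-exp(lrc) * x)
# Setting y = p * Asym and solving for x gives the expression below.
days_to_plateau <- function(Asym, R0, lrc, p = 0.95) {
  log((Asym - R0) / ((1 - p) * Asym)) / exp(lrc)
}

# Hypothetical parameter values, not estimates from the question's data
Asym <- 50
R0   <- 10
lrc  <- log(0.1)

x95 <- days_to_plateau(Asym, R0, lrc)

# Sanity check: plugging x95 back into the model gives 95% of the asymptote
y95 <- Asym + (R0 - Asym) * exp(-exp(lrc) * x95)
```

With a fitted model, the coefficient names match the arguments, so something like do.call(days_to_plateau, as.list(coef(nonlinreg1))) should work. The result is only meaningful when the asymptote is positive and above the starting value R0.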
I am trying to plot how 'Square feet' of a home affects 'Sales Price (in $1000)' of the same. Particularly, I want the coefficient line from Square ft vs. Sales price plotted with a hypothetical grey area around the line with the original datapoints superimposed.
I have tried to complete this a few different ways. One way I have tried is using the function effect_plot from library(jtools). I used the coding I found from https://cran.r-project.org/web/packages/jtools/vignettes/effect_plot.html.
But when I run this function, I don't get a plot, I just get an error: Error in FUN(X[[i]], ...) : object 'Sales Price (in $1000)' not found.
The second way I have attempted is through manually creating a new vector and attempting to plot the confidence interval. My code inspiration is from Plotting a 95% confidence interval for a lm object.
But with this one, I get an error in the conf_interval line: Error in eval(predvars, data, env) : object 'Square feet' not found. I cannot figure out how to correct this error.
And finally, I have tried to use library(ggplot2) to complete the problem with inspiration from https://rpubs.com/aaronsc32/regression-confidence-prediction-intervals.
But each time I run R, it creates a coordinate plane with a single point in the center of the plane; there is no line, no real points, no hypothetical confidence interval. There are no errors and I also cannot figure out the issue with the coding.
library("jtools")
LRA1 <- lm(`Sales Price (in $1000)` ~ `Square feet` + Rooms +
Bedrooms + Age,data=HomedataSRS) #LRA1 is the regression model
effect_plot(LRA1, pred = 'Square feet', inherit.aes = FALSE,
plot.points = TRUE) #function should create graph
newSF = seq(min(HomedataSRS$`Square feet`),
max(HomedataSRS$`Square feet`), by = 0.05)
conf_interval <- predict(LRA1, newdata=data.frame(x=newSF),
interval="confidence",level = 0.95)
plot(HomedataSRS$`Square feet`, HomedataSRS$`Sales Price (in $1000)`,
xlab="Square feet", ylab="Sales Price(in $1000)",
main="Regression")
abline(LRA1, col="lightblue")
matlines(newSF, conf_interval[,2:3], col = "blue", lty=2)
library(ggplot2)
SFHT <- HomedataSRS %>% select(1:2)
#This is to select the 2 variables I'm working with
ggplot(SFHT, aes(x='Square feet', inherit.aes = FALSE,
y='Sales Price (in $1000)')) +
geom_point(color='#2980B9', size = 4) +
geom_smooth(method=lm, color='#2C3E50')
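Column names passed to aes() should not be quoted; a quoted string is treated as a constant, which is why you get a single point. Names that contain spaces need backticks instead. Try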
ggplot(SFHT, aes(x = `Square feet`, y = `Sales Price (in $1000)`)) +
geom_point(color='#2980B9', size = 4) +
geom_smooth(method=lm, color='#2C3E50')
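Alternatively, if you need to pass the column names as strings, use the .data pronoun. (aes_string() used to serve this purpose, but it is deprecated, and it parses its strings as R code, so a name containing spaces would still need backticks inside the string.)
ggplot(SFHT, aes(x = .data[["Square feet"]], y = .data[["Sales Price (in $1000)"]])) +
geom_point(color='#2980B9', size = 4) +
geom_smooth(method=lm, color='#2C3E50')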
more info is available on the package site: https://ggplot2.tidyverse.org/reference/aes_.html
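As for the predict() error in the second attempt ("object 'Square feet' not found"): newdata has to use the same column names as the model formula and must contain every predictor, whereas data.frame(x=newSF) only has a column called x. A minimal sketch on simulated data; the homes data frame and its values are made up, and only two predictors are used to keep it short:

```r
# Simulated stand-in for HomedataSRS (hypothetical values, same style of names)
set.seed(1)
homes <- data.frame(
  `Square feet` = runif(50, 800, 3000),
  Rooms = sample(4:10, 50, replace = TRUE),
  check.names = FALSE  # keep the space in the column name
)
homes$`Sales Price (in $1000)` <-
  50 + 0.1 * homes$`Square feet` + 5 * homes$Rooms + rnorm(50, sd = 20)

fit <- lm(`Sales Price (in $1000)` ~ `Square feet` + Rooms, data = homes)

# newdata: same names as in the formula, every predictor present.
# The other predictor is held at its mean while Square feet varies.
new_sf <- seq(min(homes$`Square feet`), max(homes$`Square feet`),
              length.out = 100)
newdata <- data.frame(
  `Square feet` = new_sf,
  Rooms = mean(homes$Rooms),
  check.names = FALSE
)

conf_interval <- predict(fit, newdata = newdata,
                         interval = "confidence", level = 0.95)
head(conf_interval)
```

The result is a matrix with columns fit, lwr and upr, so matlines(new_sf, conf_interval[, 2:3], ...) can then draw the band as in the original attempt. Note that abline() on a model with several predictors only uses the first two coefficients, so plotting conf_interval[, "fit"] against new_sf is the safer way to draw the line.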
attenuation = data.frame(km =
c(0,0,0.4,0.4,0.8,0.8,1.2,1.2,1.6,1.6,2,2,2.4,2.4,2.8,2.8,3.2,3.2,3.6,3.6,4,
4,4.4,4.4,4.8,4.8,5.2,5.2,5.6,5.6,6,6,6.4,6.4,6.8,6.8,7.2,7.2,7.6,7.6,8,8,
11.7,11.7,13,13), edna = c(76000,20000,0,0,6000,0,0,6880,10700,0,6000,
0,0,0,0,0,0,6000,0,0,0,0,0,0,0,0,6310,0,6000,6000,0,0,0,0,0,
0,0,0,0,0,0,6000,0,0,0,0))
#This worked great for a linear regression
ggplot(attenuation, aes(x = km, y = edna)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("Distance from Cage (km)") +
ylab("eDNA concentration (gene sequence/Liter)")
But the linear regression doesn't seem to be a good fit (r squared =0.09). So I'd like to try something else. I tried some other regressions also with poor fits, so I'd like to try a nonlinear regression.
I have researched this question on stack overflow and tried a number of different options, but nothing is working. The option I provide below makes the most sense-but I wonder if I have the formula wrong? Or if the start list needs to be modified?
For context I am trying to explore the relationship between river distance and concentration.
#This is not working for a nonlinear regression
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method = 'nls', formula = 'y~a*x^b', method.args=list (start =
list(a = 1,b=1), se=FALSE))
I get the following error from R when I run the nls code above:
Computation failed in stat_smooth():
variable lengths differ (found for '(se)')
You have two problems. First, a misplaced ")": se=FALSE is an argument to stat_smooth(), not to method.args:
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method='nls', formula='y~a*x^b', method.args=list(start =
list(a=1, b=1)), se=FALSE)
But this will not work either, because your model cannot be fit to your data. Look at the equation: when x = 0, y equals 0 for any positive b. For x greater than 0, y increases with x unless b is negative, but then the value at x = 0 is Inf, so the algorithm fails as soon as it tries negative values. Since you have a decreasing relationship, you need to specify a function that is defined at x = 0, along with plausible starting values. This one-parameter function fits your data better than a linear function (it could also be written as a*(x + 1)^-1, which is essentially your function with 1 added to x so that it is defined at x = 0):
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method = 'nls', formula = 'y~a/(x + 1)',
method.args=list(start=list(a=50000)), se=FALSE)
I picked 50000 by splitting the difference between 20,000 and 76,000. The final estimate is about 20,000. You can bend the curve more sharply by adding a second parameter, but you have so many 0 values it may be too much depending on what you are trying to communicate:
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method='nls', formula='y~a*(1+x)^b', method.args=list(start =
list(a=50000, b=-1)), se=FALSE)
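To see the numbers behind the one-parameter curve, the same model can be fit with nls() outside ggplot. This is a minimal sketch using the data from the question; because the model is linear in a, the estimate can also be cross-checked against the ordinary least-squares formula, and the residual sum of squares compared with the straight-line fit.

```r
# Data copied from the question
attenuation <- data.frame(
  km = c(0,0,0.4,0.4,0.8,0.8,1.2,1.2,1.6,1.6,2,2,2.4,2.4,2.8,2.8,3.2,3.2,
         3.6,3.6,4,4,4.4,4.4,4.8,4.8,5.2,5.2,5.6,5.6,6,6,6.4,6.4,6.8,6.8,
         7.2,7.2,7.6,7.6,8,8,11.7,11.7,13,13),
  edna = c(76000,20000,0,0,6000,0,0,6880,10700,0,6000,0,0,0,0,0,0,6000,
           0,0,0,0,0,0,0,0,6310,0,6000,6000,0,0,0,0,0,0,0,0,0,0,0,6000,
           0,0,0,0)
)

# One-parameter decay curve, same formula and start value as in the plot
fit_nls <- nls(edna ~ a / (km + 1), data = attenuation,
               start = list(a = 50000))
a_hat <- coef(fit_nls)[["a"]]

# Cross-check: least squares through the origin on w = 1 / (km + 1)
w <- 1 / (attenuation$km + 1)
a_closed <- sum(attenuation$edna * w) / sum(w^2)

# Compare residual sums of squares with the straight-line fit
fit_lm  <- lm(edna ~ km, data = attenuation)
rss_nls <- sum(residuals(fit_nls)^2)
rss_lm  <- sum(residuals(fit_lm)^2)

c(a_hat = a_hat, a_closed = a_closed, rss_nls = rss_nls, rss_lm = rss_lm)
```

A lower rss_nls than rss_lm is what "fits better than a linear function" means here, and the nls fit achieves it with one parameter instead of two.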
I agree with @dcarlson's answer. You've got a pretty small data set here (a total of 11 non-zero data points, two of which fall on top of each other) so you probably shouldn't push any conclusions too hard. The first two points are definitely large, and there might be a mild declining trend after that, but beyond that you can't say too much.
If you want to do the power-law fit you have to displace the zero-km data point from the origin. I've done it by adding 0.1 to the x values. This is an arbitrary choice on my part and should be thought about carefully on your end ... (note that there's a large difference in the results if you add 0.1 as I did or 1 as @dcarlson did). I also had to put in more reasonable starting values, which I did by fitting a log-log linear regression (lm(log(edna) ~ log(km+0.1), data=attenuation)) and extracting the coefficients (which were approximately 4 and -1.5).
ggplot(attenuation, aes(x = km, y = edna))+
geom_point() +
stat_smooth(method = 'nls', formula = 'y~a*(x+0.1)^b',
method.args=list (start = list(a = exp(4),b=-1.5)), se=FALSE)
You can also do this slightly more efficiently with a log-link Gaussian GLM as follows (you still need to displace the x-values from zero). I also added some code to disambiguate the repeated points.
ggplot(attenuation, aes(x = km, y = edna))+
stat_sum() +
geom_smooth(method="glm", formula=y~log(x+0.1),
method.args=list(family=gaussian(link="log"),
start=c(4,-1.5)))+
scale_size(breaks=c(1,2),range=c(1,3))
I'm trying to display some frequencies convolved with a Gaussian kernel in ggplot2. I tried smoothing the lines with:
+ stat_smooth(se = F,method = "lm", formula = y ~ poly(x, 24))
Without success.
I read an article suggesting the frequencies should be convolved with a Gaussian kernel, which ggplot2's stat_density function (http://docs.ggplot2.org/current/stat_density.html) seems to be able to produce.
However, I can't seem to replace my geometry with stat_density. Is there anything wrong with my code?
require(reshape2)
library(ggplot2)
library(RColorBrewer)
fileName = "/1.csv" # downloadable there: https://www.dropbox.com/s/l5j7ckmm5s9lo8j/1.csv?dl=0
mydata = read.csv(fileName,sep=",", header=TRUE)
dataM = melt(mydata,c("bins"))
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
ggplot(data=dataM,
aes(x=bins, y=value, colour=variable)) +
geom_line() + scale_x_continuous(limits = c(0, 2))
This code produces the following plot:
I'm looking at smoothing the lines a little bit, so they look more like this:
(from http://journal.frontiersin.org/Journal/10.3389/fncom.2013.00189/full)
Since my comments solved your problem, I'll convert them to an answer:
The density function takes individual measurements and calculates a kernel density distribution by convolution (gaussian is the default kernel). For example, plot(density(rnorm(1000))). You can control the smoothness with the bw (bandwidth) parameter. For example, plot(density(rnorm(1000), bw=0.01)).
But your data frame is already a density distribution (analogous to the output of the density function). To generate a smoother density estimate, you need to start with the underlying data and run density on it, adjusting bw to get the smoothness where you want it.
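To make the effect of bw concrete, here is a small sketch comparing the default bandwidth with a very narrow one on the same simulated sample. The roughness() helper is a hypothetical name introduced only for this illustration; it sums the absolute point-to-point changes in the estimated curve.

```r
set.seed(42)
x <- rnorm(1000)  # simulated raw measurements

d_default <- density(x)             # Gaussian kernel, automatic bandwidth
d_narrow  <- density(x, bw = 0.01)  # same kernel, much smaller bandwidth

# Hypothetical helper: total up-and-down movement of the curve
roughness <- function(d) sum(abs(diff(d$y)))

c(bw_default = d_default$bw, bw_narrow = d_narrow$bw)
c(rough_default = roughness(d_default), rough_narrow = roughness(d_narrow))
```

The narrow bandwidth follows individual observations and gives a much more jagged curve, while a larger bw gives a smoother one. Both need the raw measurements as input, which is why this cannot be applied directly to values that are already binned.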
If you don't have access to the underlying data, you can smooth out your existing density distributions as follows:
ggplot(data=dataM, aes(x=bins, y=value, colour=variable)) +
geom_smooth(se=FALSE, span=0.3) +
scale_x_continuous(limits = c(0, 2))
Play around with the span parameter to get the smoothness you want.
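For small data sets geom_smooth() defaults to a loess fit, so the role of span can also be explored outside ggplot with stats::loess(). A minimal sketch on simulated binned values; the bins and value vectors are made-up stand-ins for one series from dataM, and roughness() is again a hypothetical helper:

```r
set.seed(1)
# Simulated stand-in for one already-binned series: a bump plus noise
bins  <- seq(0, 2, by = 0.02)
value <- dnorm(bins, mean = 1, sd = 0.3) + rnorm(length(bins), sd = 0.05)

# Loess smoothing with the same span as in the geom_smooth() call
fit      <- loess(value ~ bins, span = 0.3)
smoothed <- predict(fit)

# Hypothetical helper: total up-and-down movement of a series
roughness <- function(y) sum(abs(diff(y)))

c(raw = roughness(value), smoothed = roughness(smoothed))
```

A larger span averages over more neighbouring bins and gives a smoother line; a smaller span follows the original values more closely.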