Using ggplot2 in R to create multiple smoothed/fitted lines

I am having trouble producing a figure in R using ggplot2. No stats are needed; I just need a visual representation of my data. I have 7 participants, and I want to plot a line for each participant through a scatterplot. The slope and shape of the line differ for each participant, but on average the trend is somewhat exponential.
I have used the code below in R, but I am only getting linear models. When changing the method to loess, the lines are too wiggly. Can someone please help me make this more presentable? Essentially I'm after a line of best fit for each participant, but I still need to be able to use fullrange = FALSE.
Furthermore, should I be using stat_smooth or geom_smooth? Is there a difference?
ggplot(data, aes(x=x, y=y, group = athlete)) +
geom_point() +
stat_smooth(method = "lm", se=FALSE, fullrange = FALSE)
Thanks in advance for any help!

I don't have your data, so I'll just do this with the mpg dataset.
As you've noted, you can use geom_smooth() and specify a method such as "loess". You can also pass arguments on to the chosen method, just as if you were calling the underlying fitting function directly. (As for stat_smooth vs geom_smooth: in practice they are interchangeable; every geom has an associated stat and vice versa, and geom_smooth is simply the more common spelling.)
With loess, the smoothing parameter is span. You can play around with this until you're happy with the results.
library(ggplot2)

data(mpg)
g <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point()
g + geom_smooth(se = F, method = 'loess', span = .8) + ggtitle("span 0.8")
g + geom_smooth(se = F, method = 'loess', span = 1) + ggtitle("span 1")
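If, as in the question, the per-participant trend is roughly exponential, another option is to keep a parametric smoother but give it a log link, so each group gets a smooth exponential curve rather than a straight line or a wiggly loess. A sketch on the mpg data (the original athlete data isn't available); this assumes the response is strictly positive:

```r
library(ggplot2)

# A Gaussian GLM with a log link fits y = exp(b0 + b1*x) per group:
# smooth and monotone, unlike loess, but curved, unlike a plain lm fit.
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = gaussian(link = "log")),
              se = FALSE, fullrange = FALSE)
```

Because the fit is still parametric (two coefficients per group), fullrange = FALSE behaves exactly as with method = "lm".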

There is, to my knowledge, no built-in method for achieving this, but you can do it with some manual plotting. First, since you expect an exponential relationship, it might make sense to run a linear regression using log(y) as the response (I'll be using u and v, in order not to confuse them with the x and y aesthetics in the graph):
library(tibble)
library(dplyr)
library(ggplot2)

tb1 = tibble(
  u = rep(runif(100, 0, 5), 3),
  a = c(rep(-.5, 100), rep(-1, 100), rep(-2, 100)),
  v = exp(a*u + rnorm(3*100, 0, .1))
) %>% mutate(a = as.factor(a))
lm1 = lm(log(v) ~ a:u, tb1)
summary(lm1)
gives you:
Call:
lm(formula = log(v) ~ a:u, data = tb1)
Residuals:
Min 1Q Median 3Q Max
-0.263057 -0.069510 -0.001262 0.062407 0.301033
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.013696 0.012234 -1.12 0.264
a-2:u -1.996670 0.004979 -401.04 <2e-16 ***
a-1:u -1.001412 0.004979 -201.14 <2e-16 ***
a-0.5:u -0.495636 0.004979 -99.55 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1002 on 296 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9983
F-statistic: 6.025e+04 on 3 and 296 DF, p-value: < 2.2e-16
Under "Coefficients" you can find the intercept and the "slopes" for the curves (actually the exponential factors). You can see that they closely match the factors we used for generating the data.
To plot the fitted curves, you can use the predicted values from your linear model, produced with predict:
ggplot(tb1, aes(u, v, colour=a)) +
geom_point() +
geom_line(data=tb1 %>% mutate(v = exp(predict(lm1))))
If you want to have the standard error ribbons, it's a little more work, but still possible:
p1 = predict(lm1, se.fit=T)
tb2 = tibble(
u = tb1$u,
a = tb1$a,
v = exp(p1$fit),
vmin = exp(p1$fit - 1.96*p1$se.fit),
vmax = exp(p1$fit + 1.96*p1$se.fit)
)
ggplot(tb2, aes(u, v, colour=a)) +
geom_ribbon(aes(fill=a, ymin=vmin, ymax=vmax), colour=NA, alpha=.25) +
geom_line(size=.5) +
geom_point(data=tb1)
produces the combined plot of points, fitted curves, and confidence ribbons.
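A possible refinement, continuing with lm1 and tb1 from above: predict on an evenly spaced grid of u values rather than at the observed points, so the plotted curves stay smooth even if the raw data are sparse or unevenly spaced:

```r
library(ggplot2)

# Build a prediction grid: 100 evenly spaced u values for each level of a.
grid <- expand.grid(u = seq(0, 5, length.out = 100), a = levels(tb1$a))
grid$a <- factor(grid$a, levels = levels(tb1$a))  # match the model's factor levels
grid$v <- exp(predict(lm1, newdata = grid))       # back-transform from the log scale

ggplot(tb1, aes(u, v, colour = a)) +
  geom_point() +
  geom_line(data = grid)
```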

Related

Code for a scatterplot of linear regression residuals

I ran a linear regression
lm.fit <- lm(intp.trust~age+v225+age*v225+v240+v241+v242,data=intp.trust)
summary(lm.fit)
and got the following results:
Call:
lm(formula = intp.trust ~ age + v225 + age * v225 + v240 + v241 +
v242, data = intp.trust)
Residuals:
Min 1Q Median 3Q Max
-1.32050 -0.33299 -0.04437 0.30899 2.35520
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.461e+00 2.881e-02 85.418 < 2e-16 ***
age -2.416e-03 5.144e-04 -4.697 2.66e-06 ***
v225 5.794e-04 1.574e-02 0.037 0.971
v240 2.111e-02 2.729e-03 7.734 1.07e-14 ***
v241 -1.177e-03 1.958e-04 -6.014 1.83e-09 ***
v242 -1.473e-02 4.166e-04 -35.354 < 2e-16 ***
age:v225 4.214e-06 3.101e-04 0.014 0.989
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4833 on 34845 degrees of freedom
(21516 observations deleted due to missingness)
Multiple R-squared: 0.05789, Adjusted R-squared: 0.05773
F-statistic: 356.8 on 6 and 34845 DF, p-value: < 2.2e-16
"consider the residuals from the regression above. compare the residual distributions for females and males using an appropriate graph?"
Males and females are coded using the variable v225. How do I go about creating this graph?
At first I created:
lm.res <- resid(lm.fit)
but I'm not sure what the next step is.
The graph is supposed to be a scatterplot of residuals with different colour for females and males.
I tried this, but it was not working:
ggplot(intp.trust, aes(x = intp.trust, y = lm.res, color = v225)) + geom_point()
In this line:
ggplot(intp.trust, aes(x = intp.trust, y = lm.res, color = v225)) + geom_point()
You are saying: "go look in the data.frame intp.trust for a variable called lm.res, and plot that as y"
But you created lm.res as a standalone object, not as a column of intp.trust. Assign the residuals from your model to a new column in the data.frame like this:
intp.trust$lm.res <- resid(lm.fit)
And it should work. Example with dummy data:
library(ggplot2)
# generate data
true_function <- function(x, is_female) {
ifelse(is_female, 5, 2) +
ifelse(is_female, -1.5, 1.5) * x +
rnorm(length(x))
}
set.seed(123)
dat <- data.frame(x = runif(200, 1, 5), is_female = rbinom(200, 1, .5))
dat$y <- with(dat, true_function(x, is_female))
# regression
lm_fit <- lm(y ~ x + as.factor(is_female), data=dat)
# add residuals to data.frame
dat$resid <- resid(lm_fit)
# plot
ggplot(dat, aes(x=x, y=resid, color=as.factor(is_female))) +
geom_point()
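Since the exercise asks to compare the residual distributions of the two groups, a boxplot may be a more appropriate graph than a scatterplot. A sketch reusing the dummy dat built above:

```r
library(ggplot2)

# Side-by-side boxplots make the spread and centre of each group's
# residuals directly comparable.
ggplot(dat, aes(x = as.factor(is_female), y = resid)) +
  geom_boxplot() +
  labs(x = "is_female", y = "Residual")
```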
Here is a sample that you could follow to get what you want:
# Sample Data
x_1 <- rnorm(100)
x_2 <- runif(100, 10, 30)
x_3 <- rnorm(100) * runif(100)
y <- rnorm(100, mean = 10)
gender <- sample(c("F", "M"), 100, replace = TRUE)
df <- data.frame(x_1, x_2, x_3, y, gender)
# Fit model
lm.fit <- lm(y ~ x_1 + x_2 + x_1 * x_2 + x_3, data = df)
# Update data.frame
df$residuals <- lm.fit$residuals
# Scatter Residuals
ggplot(df) +
geom_point(aes(x = as.numeric(row.names(df)), y = residuals, color = gender)) +
labs(x = 'Index', y = 'Residual value', title = 'Residual scatter plot')

R: Why these two different results (fitted curves) from two different programs for the same points?

I have this equation problem. I want to plot and fit (a 2nd-degree polynomial) the points in the data frame df.1:
df.1
x y
1902 0.01
1930 0.1
1950 0.5
1980 1
2014 1.8
the code is:
lm(df.1[,2] ~ poly(df.1[,1],2))
the result is:
Call:
lm(formula = df.1[, 2] ~ poly(df.1[, 1], 2))
Coefficients:
(Intercept) poly(df.1[, 1], 2)1 poly(df.1[, 1], 2)2
0.6620 1.4660 0.3339
The equation plot is :
ggplot(df.1, aes(x=x,y=y))+
geom_point(size = 4)+
geom_smooth(aes(y=df.1[,2],x=df.1[,1]),show.legend = T,linetype="dashed",method = "lm", formula = y ~ poly(x, 2), size = 0.4,se=T)+
stat_poly_eq(aes(label = paste(..eq.label..,..rr.label..,sep = "~")),formula =y ~ poly(x, 2),parse = TRUE)+
theme(panel.background = element_rect(fill = "white", colour = "grey50"))
Now, if I use other software, like Excel or STATISTICA 10, the coefficients from the 2nd-degree polynomial fit are:
intercept 366.199
poly x 0.389864
poly x^2 0.000103743
The fitted y values computed from the Excel equation are correct, so the question is: why does R's fit return different values (moreover, only positive coefficient values)?
lm(df.1[,2] ~ poly(df.1[,1], 2, raw = TRUE)) will return the same values as Excel.
From the documentation of poly {stats}: raw: if true, use raw and not orthogonal polynomials.
The orthogonal polynomial is summarized by the coefficients, which can be used to evaluate it via the three-term recursion given in Kennedy & Gentle (1980, pp. 343-4), and used in the predict part of the code.
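A quick sanity check with made-up data: raw and orthogonal polynomials are different parameterisations of the same model, so the coefficients differ but the fitted values coincide:

```r
# Fit the same quadratic with orthogonal and with raw polynomials.
x <- 1:10
y <- 1 + 2*x + 0.5*x^2
f_orth <- lm(y ~ poly(x, 2))
f_raw  <- lm(y ~ poly(x, 2, raw = TRUE))

# Coefficients differ, but predictions are identical (up to rounding):
all.equal(unname(fitted(f_orth)), unname(fitted(f_raw)))  # TRUE
```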

Drawing the glm decision boundary with ggplot's stat_smooth() function returns wrong line

I want to plot the decision boundary after I fit a logistic regression model to my data. I use ggplot and stat_smooth() function to define the decision boundary line. However the plot returned is wrong. For a reproducible example, see below:
#-----------------------------------------------------------------------------------------------------
# CONSTRUCT THE DATA
#-----------------------------------------------------------------------------------------------------
library(data.table)

X.1_Y.1 <- rnorm(1000, mean = 1.5, sd = 0.3)
X.2_Y.1 <- rnorm(1000, mean = 1.5, sd = 5)
X.1_Y.0 <- rnorm(99000, mean = 0, sd = 1)
X.2_Y.0 <- rnorm(99000, mean = 0, sd = 1)
data <- data.table(X.1 = c(X.1_Y.1, X.1_Y.0),
                   X.2 = c(X.2_Y.1, X.2_Y.0),
                   Y = c(rep(1, 1000), rep(0, 99000)))
#-----------------------------------------------------------------------------------------------------
# FIT A LOGISTIC MODEL ON THE DATA
#-----------------------------------------------------------------------------------------------------
model <- glm(Y ~ X.1 + X.2, data, family = "binomial")
summary(model)
#Call:
# glm(formula = Y ~ ., family = "binomial", data = data)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.6603 -0.1194 -0.0679 -0.0384 4.6263
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.04055 0.06636 -91.02 <2e-16 ***
# X.1 1.60828 0.03854 41.73 <2e-16 ***
# X.2 0.43272 0.01673 25.87 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#(Dispersion parameter for binomial family taken to be 1)
#Null deviance: 11200.3 on 99999 degrees of freedom
#Residual deviance: 8218.5 on 99997 degrees of freedom
#AIC: 8224.5
#-------------------------------------------------------------------------------------------------------
# DEFINE AND DRAW THE DECISION BOUNDARY
#-------------------------------------------------------------------------------------------------------
# 0 = -6.04 + 1.61 * X.1 + 0.44 * X2 => X2 = 6.04/0.44 - 1.61/0.44 * X.1
setDT(data)
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
geom_point(alpha = 0.2) +
stat_smooth(formula = x.2 ~ 6.04/0.44 - (1.61/0.44) * X.1, color = "blue", size = 2) +
coord_equal() +
theme_economist()
This returns the following plot:
You can easily see that the line drawn is wrong. According to the formula X.2 should be 6.04/0.44 when X.1 = 0 which clearly is not the case in this plot.
Could you tell me where my code errs and how to correct it?
Your advice will be appreciated.
If you are trying to plot a line that you fit yourself, you should not be using stat_smooth; you should be using stat_function. For example:
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
geom_point(alpha = 0.2) +
stat_function(fun=function(x) {6.04/0.44 - (1.61/0.44) * x}, color = "blue", size = 2) +
coord_equal()
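Since the decision boundary here is a straight line, an alternative is geom_abline, computing the intercept and slope directly from the fitted coefficients rather than hand-copying rounded values (this sketch reuses model and data from above):

```r
# Boundary: 0 = b0 + b1*X.1 + b2*X.2  =>  X.2 = -b0/b2 - (b1/b2)*X.1
cf <- coef(model)  # (Intercept), X.1, X.2

ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = -cf[1]/cf[3], slope = -cf[2]/cf[3],
              colour = "blue", size = 2) +
  coord_equal()
```

This way the line updates automatically if the model is refit.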

How to add a logarithmic nonlinear fit to ggplot?

I'd like to fit a logarithmic curve through my data using nls.
library(dplyr)
library(ggplot2)
a <- 3
b <- 2
Y <- data_frame(x = c(0.2, 0.5, 1, 2, 5, 10),
y = a + b*log(x))
Y %>%
ggplot(aes(x = x, y = y)) +
geom_point(shape = 19, size = 2) +
geom_smooth(method = "nls",
formula = y ~ p1 + p2*log(x),
start = list(a = a, b = b),
se = FALSE,
control = list(maxiter = 100))
This gives me an error:
Error in method(formula, data = data, weights = weight, ...) :
number of iterations exceeded maximum of 100
What is going wrong?
Here's some text I copied and pasted after doing ?nls:
Warning
Do not use nls on artificial "zero-residual" data.
The nls function uses a relative-offset convergence criterion that compares the numerical imprecision at the current parameter estimates to the residual sum-of-squares. This performs well on data of the form
y = f(x, θ) + eps
(with var(eps) > 0). It fails to indicate convergence on data of the form
y = f(x, θ)
because the criterion amounts to comparing two components of the round-off error. If you wish to test nls on artificial data please add a noise component, as shown in the example below.
That inspired me to try this:
> library(dplyr)
> library(ggplot2)
> a <- 3
> b <- 2
> Y <- data_frame(x = c(0.2, 0.5, 1, 2, 5, 10),
+ y = a + b*log(x)*(1 + rnorm(length(x), sd=0.001)))
> Y %>%
+ ggplot(aes(x = x, y = y)) +
+ geom_point(shape = 19, size = 2) +
+ geom_smooth(method = "nls",
+ formula = y ~ p1 + p2*log(x),
+ start = list(p1 = a, p2 = b),
+ se = FALSE,
+ control = list(maxiter = 100))
Note: your code had start = list(a = a, b = b), which is a typo, because a and b are not parameters in your formula (it uses p1 and p2). Aside from that, adding the *(1 + rnorm(length(x), sd=0.001)) is the only thing I did.
The resulting graph made it seem like everything worked fine.
I'd generally recommend doing the fit separately, however, and then plotting it with predict. That way you can always check the quality of the fit to see if it worked before plotting.
> fit <- nls(data=Y, formula = y ~ p1 + p2*log(x), start = list(p1 = a, p2 = b))
> summary(fit)
Formula: y ~ p1 + p2 * log(x)
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 3.001926 0.001538 1952 4.14e-13 ***
p2 1.999604 0.001114 1795 5.78e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003619 on 4 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.623e-08
> new_x = data.frame(x=seq(from=0.2, to=10, length.out=100))
> ggplot(data=Y, aes(x=x, y=y)) +
geom_point() +
geom_line(data=new_x,
aes(x=x, y=predict(fit, newdata=new_x)),
color='blue')

Plot NLS in R, real data and estimated parameters

My dataset ICM_Color0 has the following structure, where columns are:
Lum Ruido Dist RT.ms Condicion
With 2599 rows.
There are three luminance levels = [13, 19, 25] and two types of noise = [1, 2] -> 3x2 = 6 conditions.
Condicion:
Lum Ruido Condicion
13 1 1
13 2 2
19 1 3
19 2 4
25 1 5
25 2 6
My model is:
Color0.nls <- nls(RT.ms ~ 312 + K[Condicion]/(Dist^1),
data = ICM_Color0, start = list(K = rep(1,6)))
> summary(Color0.nls)
Formula: RT.ms ~ RT0.0 + K[Condicion]/(Dist^n)
Parameters:
Estimate Std. Error t value Pr(>|t|)
K1 1.84108 0.03687 49.94 <2e-16 ***
K2 2.04468 0.03708 55.14 <2e-16 ***
K3 1.70841 0.03749 45.58 <2e-16 ***
K4 2.09915 0.03628 57.86 <2e-16 ***
K5 1.62961 0.03626 44.94 <2e-16 ***
K6 2.18235 0.03622 60.26 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 120.5 on 2593 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.711e-08
I need to plot the actual data together with the parameter estimates.
I have already reviewed the literature but found no examples with a model like mine, where the model depends on a condition variable.
Can anyone guide me?
Thanks a lot!
It's fairly straightforward to plot the fitted lines from a regression (non-linear or not). I most often do this by using predict to calculate the predicted values from the original data and then plotting those as lines on top of a scatterplot of the data.
You didn't give a reproducible example, so I made some nonlinear data following this answer.
# Create data to fit with non-linear regression
set.seed(16)
x = seq(100)
y = rnorm(200, 50 + 30 * x^(-0.2), 1)
site = rep(c("a", "b"), each = 100)
dat = data.frame(expl = c(x, x), resp = y, site = factor(site))  # factor, so a[site] indexes by level
Then I fit a nonlinear regression, allowing each parameter to vary by the grouping variable site.
fit1 = nls(resp ~ a[site] + b[site] * expl^(-c[site]), data = dat,
start = list(a = c(80, 80), b = c(20, 20), c = c(.2, .2)))
Now I just add the fitted values to the dataset using predict.nls
dat$pred = predict(fit1)
I plotted this using the ggplot2 package.
ggplot(data = dat, aes(x = expl, y = resp, color = site)) +
geom_point() +
geom_line(aes(y = pred))
In this case, where I'm allowing all parameters to vary by site, it looks like you can do all of this in ggplot through geom_smooth. I found a very nice example of this here.
Here is what it would look like with the toy dataset.
ggplot(data = dat, aes(x = expl, y = resp, color = site)) +
geom_point() +
geom_smooth(aes(group = site), method = "nls", formula = y ~ a + b*x^(-c),
            method.args = list(start = list(a = 80, b = 20, c = .2)),
            se = FALSE)
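If you also want the per-site parameter estimates in a tidy form, coef() on the fit above returns all six estimates in one named vector, which can be reshaped into a per-site table (the names a1, a2, b1, ... follow nls's convention for indexed parameters):

```r
# Continuing from fit1 above: one row of parameters per site.
cf <- coef(fit1)
data.frame(site = c("a", "b"),
           a = cf[c("a1", "a2")],
           b = cf[c("b1", "b2")],
           c = cf[c("c1", "c2")])
```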
