Code for a scatterplot of linear regression residuals by group - r

I ran a linear regression:
lm.fit <- lm(intp.trust ~ age + v225 + age*v225 + v240 + v241 + v242, data = intp.trust)
summary(lm.fit)
and got the following results:
Call:
lm(formula = intp.trust ~ age + v225 + age * v225 + v240 + v241 +
v242, data = intp.trust)
Residuals:
Min 1Q Median 3Q Max
-1.32050 -0.33299 -0.04437 0.30899 2.35520
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.461e+00 2.881e-02 85.418 < 2e-16 ***
age -2.416e-03 5.144e-04 -4.697 2.66e-06 ***
v225 5.794e-04 1.574e-02 0.037 0.971
v240 2.111e-02 2.729e-03 7.734 1.07e-14 ***
v241 -1.177e-03 1.958e-04 -6.014 1.83e-09 ***
v242 -1.473e-02 4.166e-04 -35.354 < 2e-16 ***
age:v225 4.214e-06 3.101e-04 0.014 0.989
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4833 on 34845 degrees of freedom
(21516 observations deleted due to missingness)
Multiple R-squared: 0.05789, Adjusted R-squared: 0.05773
F-statistic: 356.8 on 6 and 34845 DF, p-value: < 2.2e-16
"Consider the residuals from the regression above. Compare the residual distributions for females and males using an appropriate graph."
Males and females are coded using variable v225. How do I go about creating this graph?
At first I created:
lm.res <- resid(lm.fit)
but I'm not sure what the next step is.
The graph is supposed to be a scatterplot of residuals with different colour for females and males.
I tried this, but it did not work:
ggplot(intp.trust, aes(x = intp.trust, y = lm.res, color = v225)) + geom_point()

In this line:
ggplot(intp.trust, aes(x = intp.trust, y = lm.res, color = v225)) + geom_point()
You are saying: "go look in the data.frame intp.trust for a variable called lm.res, and plot that as y."
But you created lm.res as a standalone object, not as a column of intp.trust. Assign the residuals from your model to a new column in the data.frame like this:
intp.trust$lm.res <- resid(lm.fit)
And it should work. Example with dummy data:
library(ggplot2)
# generate data
true_function <- function(x, is_female) {
  ifelse(is_female, 5, 2) +
    ifelse(is_female, -1.5, 1.5) * x +
    rnorm(length(x))
}
set.seed(123)
dat <- data.frame(x = runif(200, 1, 5), is_female = rbinom(200, 1, .5))
dat$y <- with(dat, true_function(x, is_female))
# regression
lm_fit <- lm(y ~ x + as.factor(is_female), data=dat)
# add residuals to data.frame
dat$resid <- resid(lm_fit)
# plot
ggplot(dat, aes(x = x, y = resid, color = as.factor(is_female))) +
  geom_point()
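Since the exercise asks you to compare the residual *distributions*, a boxplot or density plot arguably answers it more directly than a scatterplot. Here is a sketch along the same lines as the dummy-data example above (same simulated variables, my own plotting choices):

```r
library(ggplot2)

# same style of dummy data as above
set.seed(123)
dat <- data.frame(x = runif(200, 1, 5), is_female = rbinom(200, 1, .5))
dat$y <- ifelse(dat$is_female, 5, 2) +
  ifelse(dat$is_female, -1.5, 1.5) * dat$x +
  rnorm(200)
dat$resid <- resid(lm(y ~ x + as.factor(is_female), data = dat))

# side-by-side boxplots of the residual distributions
ggplot(dat, aes(x = as.factor(is_female), y = resid)) +
  geom_boxplot()

# or overlaid densities
ggplot(dat, aes(x = resid, fill = as.factor(is_female))) +
  geom_density(alpha = 0.4)
```

The same idea works on the real data once intp.trust$lm.res exists: map v225 to x (boxplot) or fill (density) instead of color.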

Here is a sample that you could follow to get what you want:
# Sample Data
x_1 <- rnorm(100)
x_2 <- runif(100, 10, 30)
x_3 <- rnorm(100) * runif(100)
y <- rnorm(100, mean = 10)
gender <- sample(c("F", "M"), 100, replace = TRUE)
df <- data.frame(x_1, x_2, x_3, y, gender)
# Fit model
lm.fit <- lm(y ~ x_1 + x_2 + x_1 * x_2 + x_3, data = df)
# Update data.frame
df$residuals <- lm.fit$residuals
# Scatter Residuals
ggplot(df) +
  geom_point(aes(x = as.numeric(row.names(df)), y = residuals, color = gender)) +
  labs(x = 'Index', y = 'Residual value', title = 'Residual scatter plot')


ggplot exponential smooth with tuning parameter inside exp

ggplot provides various "smoothing methods" or "formulas" that determine the form of the trend line. However, it is unclear to me how the parameters of the formula are specified, and how I can get the exponential formula to fit my data. In other words, how do I tell ggplot that it should fit the parameter inside the exp?
df <- data.frame(x = c(65,53,41,32,28,26,23,19))
df$y <- c(4,3,2,8,12,8,20,15)
x y
1 65 4
2 53 3
3 41 2
4 32 8
5 28 12
6 26 8
7 23 20
8 19 15
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "glm", se = FALSE, color = "black", formula = y ~ exp(x)) +
  geom_point()
p
Problematic fit:
However if the parameter inside the exponential is fit then the form of the trend line becomes reasonable:
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "glm", se = FALSE, color = "black", formula = y ~ exp(-0.09 * x)) +
  geom_point()
p
Here is an approach using method nls instead of glm.
You can pass additional parameters to nls with a list supplied in method.args. Here we define starting values from which the a and r coefficients are fit.
library(ggplot2)
ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "nls", se = FALSE,
              formula = y ~ a * exp(r * x),
              method.args = list(start = c(a = 10, r = -0.01)),
              color = "black") +
  geom_point()
As discussed in the comments, the best way to get the coefficients on the graph is by fitting the model outside the ggplot call.
model.coeff <- coef(nls(y ~ a * exp(r * x), data = df, start = c(a = 50, r = -0.04)))
ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "nls", se = FALSE,
              formula = y ~ a * exp(r * x),
              method.args = list(start = c(a = 50, r = -0.04)),
              color = "black") +
  geom_point() +
  geom_text(x = 40, y = 15,
            label = as.expression(substitute(italic(y) == a %.% italic(e)^(r %.% x),
                                             list(a = format(unname(model.coeff["a"]), digits = 3),
                                                  r = format(unname(model.coeff["r"]), digits = 3)))),
            parse = TRUE)
Firstly, to pass additional parameters to the function passed to the method param of geom_smooth, you can pass a list of named parameters to method.args.
Secondly, the problem you're seeing is that glm is placing the coefficient in front of the whole term: y ~ coef * exp(x) instead of inside: y ~ exp(coef * x) like you want. You could use optimization to solve the latter outside of glm, but you can fit it into the GLM paradigm by a transformation: a log link. This works because it's like taking the equation you want to fit, y = exp(coef * x), and taking the log of both sides, so you're now fitting log(y) = coef * x, which is equivalent to what you want to fit and works with the GLM paradigm. (This ignores the intercept. It also ends up in transformed link units, but it's easy enough to convert back if you like.)
You can run this outside of ggplot to see what the models look like:
df <- data.frame(
  x = c(65, 53, 41, 32, 28, 26, 23, 19),
  y = c(4, 3, 2, 8, 12, 8, 20, 15)
)
bad_model <- glm(y ~ exp(x), family = gaussian(link = 'identity'), data = df)
good_model <- glm(y ~ x, family = gaussian(link = 'log'), data = df)
# this is bad
summary(bad_model)
#>
#> Call:
#> glm(formula = y ~ exp(x), family = gaussian(link = "identity"),
#> data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -7.7143 -2.9643 -0.8571 3.0357 10.2857
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.714e+00 2.437e+00 3.986 0.00723 **
#> exp(x) -3.372e-28 4.067e-28 -0.829 0.43881
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 41.57135)
#>
#> Null deviance: 278.00 on 7 degrees of freedom
#> Residual deviance: 249.43 on 6 degrees of freedom
#> AIC: 56.221
#>
#> Number of Fisher Scoring iterations: 2
# this is better
summary(good_model)
#>
#> Call:
#> glm(formula = y ~ x, family = gaussian(link = "log"), data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.745 -2.600 0.046 1.812 6.080
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.93579 0.51361 7.663 0.000258 ***
#> x -0.05663 0.02054 -2.757 0.032997 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 12.6906)
#>
#> Null deviance: 278.000 on 7 degrees of freedom
#> Residual deviance: 76.143 on 6 degrees of freedom
#> AIC: 46.728
#>
#> Number of Fisher Scoring iterations: 6
From here, you can reproduce what geom_smooth is going to do: make a sequence of x values across the domain and use the predictions as the y values for the line:
# new data is a sequence across the domain of the model
new_df <- data.frame(x = seq(min(df$x), max(df$x), length = 501))
# `type = 'response'` because we want values for y back in y units
new_df$bad_pred <- predict(bad_model, newdata = new_df, type = 'response')
new_df$good_pred <- predict(good_model, newdata = new_df, type = 'response')
library(tidyr)
library(ggplot2)
new_df %>%
  # reshape to long form for ggplot
  gather(model, y, contains('pred')) %>%
  ggplot(aes(x, y)) +
  geom_line(aes(color = model)) +
  # plot original points on top
  geom_point(data = df)
Of course, it's a lot easier to let ggplot handle all that for you:
ggplot(df, aes(x, y)) +
  geom_smooth(
    method = 'glm',
    formula = y ~ x,
    method.args = list(family = gaussian(link = 'log'))
  ) +
  geom_point()
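As a sanity check (this back-transformation snippet is my own sketch, not part of the answer above), the log-link coefficients can be mapped onto the a * exp(r * x) parameterisation used in the nls answer: exp of the intercept corresponds to a, and the slope of x corresponds to r:

```r
df <- data.frame(x = c(65, 53, 41, 32, 28, 26, 23, 19),
                 y = c(4, 3, 2, 8, 12, 8, 20, 15))

good_model <- glm(y ~ x, family = gaussian(link = 'log'), data = df)

# y = exp(b0 + b1*x) = exp(b0) * exp(b1*x), so:
a_hat <- exp(coef(good_model)[["(Intercept)"]])  # multiplier a
r_hat <- coef(good_model)[["x"]]                 # rate r
c(a = a_hat, r = r_hat)
```

These should land close to the a and r estimates from the nls approach, which is a useful cross-check that both fits describe the same curve.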

Using ggplot2 in R creating multiple smoothed/fitted lines

I am having trouble producing a figure in R using ggplot2. No stats are needed; I just need a visual representation of my data. I have 7 participants, and I want to plot a line for each participant through a scatterplot. The slope and shape of the line are different for each participant, but on average the relationship is somewhat exponential.
I have used the code below in R, but I am only getting linear models. When changing the method to loess, the lines are too wiggly. Can someone please help me make this more presentable? Essentially I'm after a line of best fit for each participant, yet I still need to be able to use fullrange = FALSE.
Furthermore, should I be using stat_smooth or geom_smooth? Is there a difference?
ggplot(data, aes(x = x, y = y, group = athlete)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, fullrange = FALSE)
Thanks in advance for any help!
I don't have your data, so I'll just do this with the mpg dataset.
As you've noted, you can use geom_smooth() and specify a method such as "loess". You can also pass arguments on to the method, just as you would when calling the underlying function directly.
With loess, the smoothing parameter is span. You can play around with this until you're happy with the results.
data(mpg)
g <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point()
g + geom_smooth(se = F, method = 'loess', span = .8) + ggtitle("span 0.8")
g + geom_smooth(se = F, method = 'loess', span = 1) + ggtitle("span 1")
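Behind the scenes, geom_smooth(method = 'loess') calls stats::loess(), so you can reproduce and inspect the fits directly. A sketch: a smaller span gives a more flexible curve, which shows up as a smaller residual sum of squares on the training data (not necessarily a better model):

```r
data(mpg, package = "ggplot2")

# span controls the fraction of points used in each local fit
fit_smooth <- loess(hwy ~ displ, data = mpg, span = 0.8)
fit_wiggly <- loess(hwy ~ displ, data = mpg, span = 0.3)

# the wigglier fit tracks the training data more closely
sum(resid(fit_smooth)^2)
sum(resid(fit_wiggly)^2)
```

Fitting outside ggplot like this also lets you compare candidate span values numerically before settling on one for the plot.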
There is, to my knowledge, no built-in method for achieving this, but you can do it with some manual plotting. First, since you expect an exponential relationship, it might make sense to run a linear regression using log(y) as the response (I'll be using u and v, in order not to confuse them with the x and y aesthetics in the graph):
library(dplyr)  # for tibble() and %>%
tb1 = tibble(
  u = rep(runif(100, 0, 5), 3),
  a = c(rep(-.5, 100), rep(-1, 100), rep(-2, 100)),
  v = exp(a*u + rnorm(3*100, 0, .1))
) %>% mutate(a = as.factor(a))
lm1 = lm(log(v) ~ a:u, tb1)
summary(lm1)
gives you:
Call:
lm(formula = log(v) ~ a:u, data = tb1)
Residuals:
Min 1Q Median 3Q Max
-0.263057 -0.069510 -0.001262 0.062407 0.301033
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.013696 0.012234 -1.12 0.264
a-2:u -1.996670 0.004979 -401.04 <2e-16 ***
a-1:u -1.001412 0.004979 -201.14 <2e-16 ***
a-0.5:u -0.495636 0.004979 -99.55 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1002 on 296 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9983
F-statistic: 6.025e+04 on 3 and 296 DF, p-value: < 2.2e-16
Under "Coefficients" you can find the intercept and the "slopes" for the curves (actually the exponential factors). You can see that they closely match the factors we used for generating the data.
To plot the fitting curves, you can use the "predicted" values, produced from your linear model using predict:
ggplot(tb1, aes(u, v, colour = a)) +
  geom_point() +
  geom_line(data = tb1 %>% mutate(v = exp(predict(lm1))))
If you want to have the standard error ribbons, it's a little more work, but still possible:
p1 = predict(lm1, se.fit = T)
tb2 = tibble(
  u = tb1$u,
  a = tb1$a,
  v = exp(p1$fit),
  vmin = exp(p1$fit - 1.96*p1$se.fit),
  vmax = exp(p1$fit + 1.96*p1$se.fit)
)
ggplot(tb2, aes(u, v, colour = a)) +
  geom_ribbon(aes(fill = a, ymin = vmin, ymax = vmax), colour = NA, alpha = .25) +
  geom_line(size = .5) +
  geom_point(data = tb1)
produces:

Drawing the glm decision boundary with ggplot's stat_smooth() function returns wrong line

I want to plot the decision boundary after I fit a logistic regression model to my data. I use ggplot and stat_smooth() function to define the decision boundary line. However the plot returned is wrong. For a reproducible example, see below:
#-----------------------------------------------------------------------------------------------------
# CONSTRUCT THE DATA
#-----------------------------------------------------------------------------------------------------
library(data.table)
X.1_Y.1 <- rnorm(1000, mean = 1.5, sd = 0.3)
X.2_Y.1 <- rnorm(1000, mean = 1.5, sd = 5)
X.1_Y.0 <- rnorm(99000, mean = 0, sd = 1)
X.2_Y.0 <- rnorm(99000, mean = 0, sd = 1)
data <- data.table(X.1 = c(X.1_Y.1, X.1_Y.0),
                   X.2 = c(X.2_Y.1, X.2_Y.0),
                   Y = c(rep(1, 1000), rep(0, 99000)))
#-----------------------------------------------------------------------------------------------------
# FIT A LOGISTIC MODEL ON THE DATA
#-----------------------------------------------------------------------------------------------------
model <- glm(Y ~ X.1 + X.2, data, family = "binomial")
summary(model)
#Call:
# glm(formula = Y ~ ., family = "binomial", data = data)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.6603 -0.1194 -0.0679 -0.0384 4.6263
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.04055 0.06636 -91.02 <2e-16 ***
# X.1 1.60828 0.03854 41.73 <2e-16 ***
# X.2 0.43272 0.01673 25.87 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#(Dispersion parameter for binomial family taken to be 1)
#Null deviance: 11200.3 on 99999 degrees of freedom
#Residual deviance: 8218.5 on 99997 degrees of freedom
#AIC: 8224.5
#-------------------------------------------------------------------------------------------------------
# DEFINE AND DRAW THE DECISION BOUNDARY
#-------------------------------------------------------------------------------------------------------
# 0 = -6.04 + 1.61 * X.1 + 0.44 * X2 => X2 = 6.04/0.44 - 1.61/0.44 * X.1
setDT(data)
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  stat_smooth(formula = x.2 ~ 6.04/0.44 - (1.61/0.44) * X.1, color = "blue", size = 2) +
  coord_equal() +
  theme_economist()
This returns the following plot:
You can easily see that the line drawn is wrong. According to the formula X.2 should be 6.04/0.44 when X.1 = 0 which clearly is not the case in this plot.
Could you tell me where my code errs and how to correct it?
Your advice will be appreciated.
If you are trying to plot a line on your graph that you fit yourself, you should not be using stat_smooth, you should be using stat_function. For example
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  stat_function(fun = function(x) {6.04/0.44 - (1.61/0.44) * x}, color = "blue", size = 2) +
  coord_equal()
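Alternatively, rather than hard-coding the rounded coefficients, you can derive the boundary directly from coef(model) and draw it with geom_abline. A self-contained sketch (with a smaller simulation than the original, to keep it quick); the boundary is where the linear predictor equals zero:

```r
library(ggplot2)

# smaller version of the question's simulated data
set.seed(1)
data <- data.frame(
  X.1 = c(rnorm(1000, 1.5, 0.3), rnorm(9000, 0, 1)),
  X.2 = c(rnorm(1000, 1.5, 5),   rnorm(9000, 0, 1)),
  Y   = c(rep(1, 1000), rep(0, 9000))
)
model <- glm(Y ~ X.1 + X.2, data, family = "binomial")

# p = 0.5 boundary: b0 + b1*X.1 + b2*X.2 = 0  =>  X.2 = -b0/b2 - (b1/b2)*X.1
b <- coef(model)
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = -b[["(Intercept)"]] / b[["X.2"]],
              slope     = -b[["X.1"]] / b[["X.2"]],
              color = "blue", size = 1) +
  coord_equal()
```

This keeps the plotted line in sync with the fitted model if you refit with different data.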

How to add a logarithmic nonlinear fit to ggplot?

I'd like to fit the logarithmic curve through my data using nls.
library(dplyr)
library(ggplot2)
a <- 3
b <- 2
Y <- data_frame(x = c(0.2, 0.5, 1, 2, 5, 10),
                y = a + b*log(x))
Y %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(shape = 19, size = 2) +
  geom_smooth(method = "nls",
              formula = y ~ p1 + p2*log(x),
              start = list(a = a, b = b),
              se = FALSE,
              control = list(maxiter = 100))
This gives me an error:
Error in method(formula, data = data, weights = weight, ...) :
number of iterations exceeded maximum of 100
What is going wrong?
Here's some text I copied and pasted after doing ?nls:
Warning
Do not use nls on artificial "zero-residual" data.
The nls function uses a relative-offset convergence criterion that compares the numerical imprecision at the current parameter estimates to the residual sum-of-squares. This performs well on data of the form
y = f(x, θ) + eps
(with var(eps) > 0). It fails to indicate convergence on data of the form
y = f(x, θ)
because the criterion amounts to comparing two components of the round-off error. If you wish to test nls on artificial data please add a noise component, as shown in the example below.
That inspired me to try this:
> library(dplyr)
> library(ggplot2)
> a <- 3
> b <- 2
> Y <- data_frame(x = c(0.2, 0.5, 1, 2, 5, 10),
+ y = a + b*log(x)*(1 + rnorm(length(x), sd=0.001)))
> Y %>%
+ ggplot(aes(x = x, y = y)) +
+ geom_point(shape = 19, size = 2) +
+ geom_smooth(method = "nls",
+ formula = y ~ p1 + p2*log(x),
+ start = list(p1 = a, p2 = b),
+ se = FALSE,
+ control = list(maxiter = 100))
Note: your code had start = list(a=a, b=b) which is a typo because a and b aren't defined in your formula. Aside from that, adding the *(1 + rnorm(length(x), sd=0.001)) is the only thing I did.
The resulting graph made it seem like everything worked fine.
I'd generally recommend doing the fit separately, however, and then plotting it with predict. That way you can always check the quality of the fit to see if it worked before plotting.
> fit <- nls(data=Y, formula = y ~ p1 + p2*log(x), start = list(p1 = a, p2 = b))
> summary(fit)
Formula: y ~ p1 + p2 * log(x)
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 3.001926 0.001538 1952 4.14e-13 ***
p2 1.999604 0.001114 1795 5.78e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003619 on 4 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.623e-08
> new_x = data.frame(x = seq(from = 0.2, to = 10, length.out = 100))
> ggplot(data = Y, aes(x = x, y = y)) +
    geom_point() +
    geom_line(data = new_x,
              aes(x = x, y = predict(fit, newdata = new_x)),
              color = 'blue')

Plot NLS in R, real data and estimated parameters

My dataset ICM_Color0 has the following structure, where columns are:
Lum Ruido Dist RT.ms Condicion
With 2599 rows.
There are three luminance levels [13, 19, 25] and two types of noise [1, 2] -> 3x2 = 6 conditions.
Condicion:
Lum Ruido Condicion
13 1 1
13 2 2
19 1 3
19 2 4
25 1 5
25 2 6
My model is:
Color0.nls <- nls(RT.ms ~ 312 + K[Condicion]/(Dist^1),
                  data = ICM_Color0, start = list(K = rep(1, 6)))
> summary(Color0.nls)
Formula: RT.ms ~ RT0.0 + K[Condicion]/(Dist^n)
Parameters:
Estimate Std. Error t value Pr(>|t|)
K1 1.84108 0.03687 49.94 <2e-16 ***
K2 2.04468 0.03708 55.14 <2e-16 ***
K3 1.70841 0.03749 45.58 <2e-16 ***
K4 2.09915 0.03628 57.86 <2e-16 ***
K5 1.62961 0.03626 44.94 <2e-16 ***
K6 2.18235 0.03622 60.26 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 120.5 on 2593 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.711e-08
I need to plot the actual data together with the parameter estimates.
I have made a general review of the literature but found no examples with a model like mine, where the model depends on the condition variable.
Can anyone guide me?
Thanks a lot.
It's fairly straightforward to plot the fitted lines from a regression (non-linear or not). I most often do this by using predict to calculate the predicted values from the original data and then plotting those as lines on top of a scatterplot of the data.
You didn't give a reproducible example, so I made some nonlinear data following this answer.
# Create data to fit with non-linear regression
set.seed(16)
x = seq(100)
y = rnorm(200, 50 + 30 * x^(-0.2), 1)
site = rep(c("a", "b"), each = 100)
dat = data.frame(expl = c(x, x), resp = y, site = factor(site))
Then I fit a nonlinear regression, allowing each parameter to vary by the grouping variable site.
fit1 = nls(resp ~ a[site] + b[site] * expl^(-c[site]), data = dat,
           start = list(a = c(80, 80), b = c(20, 20), c = c(.2, .2)))
Now I just add the fitted values to the dataset using predict.nls
dat$pred = predict(fit1)
I plotted this using the ggplot2 package.
ggplot(data = dat, aes(x = expl, y = resp, color = site)) +
  geom_point() +
  geom_line(aes(y = pred))
In this case, where I'm allowing all parameters to vary by site, it looks like you can do all of this in ggplot through geom_smooth. I found a very nice example of this here.
Here is what it would look like with the toy dataset.
ggplot(data = dat, aes(x = expl, y = resp, color = site)) +
  geom_point() +
  geom_smooth(aes(group = site), method = "nls", formula = y ~ a + b*x^(-c),
              method.args = list(start = list(a = 80, b = 20, c = .2)), se = FALSE)
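To mirror the indexed-parameter model from the question more closely, here is a sketch with simulated data (the real ICM_Color0 data isn't available, so the fixed RT0 of 312 and the K_true values are made up) that fits one K per condition and overlays the predictions:

```r
library(ggplot2)

# simulate RT = 312 + K[condition]/Dist + noise for six conditions
set.seed(42)
K_true <- c(184, 204, 171, 210, 163, 218)
sim <- expand.grid(Dist = seq(0.5, 5, by = 0.1), Condicion = 1:6)
sim$RT.ms <- 312 + K_true[sim$Condicion] / sim$Dist + rnorm(nrow(sim), sd = 20)
sim$Condicion <- factor(sim$Condicion)

# one K per condition, as in the question's model
fit <- nls(RT.ms ~ 312 + K[Condicion] / Dist, data = sim,
           start = list(K = rep(100, 6)))

# overlay fitted curves on the raw data
sim$pred <- predict(fit)
ggplot(sim, aes(Dist, RT.ms, color = Condicion)) +
  geom_point(alpha = 0.3) +
  geom_line(aes(y = pred))
```

Because the model is linear in K once RT0 and the exponent are fixed, nls converges in a single iteration, just as in the question's output.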
