Fitting more than one regression line to a scatterplot in R

I'm trying to fit regression lines to this relation angc~ext. Variable pch divides the data into two sets to each of which I want to fit a regression line
with its confidence intervals. Here's my data frame (C):
"ext" "angc" "pch"
25 3.76288002820208 0
29 4.44255895177431 0
21 2.45214044383301 0
35 4.01334352881766 0
35 9.86225452423762 0
28 19.9304126868056 1
32 25.6984064030981 1
20 5.10582966112880 0
36 5.75603291081328 0
11 4.62311785943305 0
33 4.94401591414043 0
27 8.10039123328465 0
29 16.3882499757369 1
30 29.3492784626796 1
29 3.85960848290140 0
32 5.35857680326963 0
26 4.86451443776053 0
16 8.22008387344697 0
30 10.2212259432413 0
32 17.2519440101067 1
29 27.5011256290209 1
My code:
c0 <- C[C$pch == 0, ]
c1 <- C[C$pch == 1, ]
prd0 <- as.data.frame( predict( lm(c0$angc ~ c0$ext), interval = c("confidence") ) )
prd1 <- as.data.frame( predict( lm(c1$angc ~ c1$ext), interval = c("confidence") ) )
dev.new()
plot( C$angc ~ C$ext, type = 'n' )
points( c0$angc ~ c0$ext, pch = 17 ) # triangles
abline(lm(c0$angc ~ c0$ext)) # regression line
lines(prd0$lwr) # lower CI
lines(prd0$upr) # upper CI
points( c1$angc ~ c1$ext, pch = 1 ) # circles
abline(lm(c1$angc ~ c1$ext))
lines(prd1$lwr, type = 'l', lty = 3 )
lines(prd1$upr, type = 'l', lty = 3 )
I have two problems:
How can I get the desired regression line for the circles? It should be an almost vertical line (check c1)
I don't get correct confidence intervals
Thank you for your help,
Santi

In ggplot2 you can do this rather efficiently. Because pch is stored as a number, convert it to a factor so that it can be mapped to point shape and used for grouping:
library(ggplot2)
ggplot(C, aes(x = ext, y = angc, shape = factor(pch))) +
  geom_point() +
  geom_smooth(method = "lm")
This creates a scatterplot (geom_point()) of angc against ext in which the point shape depends on pch. geom_smooth(method = "lm") then fits and draws a separate regression line, with its 95% confidence band, for each level of pch. The name geom_smooth() reflects that it draws a smoothed version of the data, here a linear regression.
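For the base-graphics code in the question, the confidence limits look wrong because lines(prd0$lwr) receives only y values, so R draws them against the index 1, 2, 3, ... in data order rather than against ext. Below is a minimal sketch of one fix, assuming the data frame C from the question (angc rounded to two decimals here) and a hypothetical helper band_for(): predict on a sorted grid of ext values and pass both x and y to the drawing call. Note also that lm(angc ~ ext) minimises vertical distances, so the fitted line for the circles will not come out near-vertical even though those points span a narrow range of ext.

```r
# Data from the question; angc rounded to two decimals for brevity
C <- data.frame(
  ext  = c(25, 29, 21, 35, 35, 28, 32, 20, 36, 11, 33,
           27, 29, 30, 29, 32, 26, 16, 30, 32, 29),
  angc = c(3.76, 4.44, 2.45, 4.01, 9.86, 19.93, 25.70, 5.11, 5.76, 4.62, 4.94,
           8.10, 16.39, 29.35, 3.86, 5.36, 4.86, 8.22, 10.22, 17.25, 27.50),
  pch  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1)
)

# Hypothetical helper: fit angc ~ ext on one subset and return the fitted
# line plus its confidence limits on an increasing grid of ext values
band_for <- function(d, n = 50) {
  fit  <- lm(angc ~ ext, data = d)
  grid <- data.frame(ext = seq(min(d$ext), max(d$ext), length.out = n))
  cbind(grid, predict(fit, newdata = grid, interval = "confidence"))
}

b0 <- band_for(C[C$pch == 0, ])
b1 <- band_for(C[C$pch == 1, ])

plot(angc ~ ext, data = C, type = "n")
points(angc ~ ext, data = C[C$pch == 0, ], pch = 17)  # triangles
points(angc ~ ext, data = C[C$pch == 1, ], pch = 1)   # circles
# Each call draws the fit (first column) and the lower/upper limits
matlines(b0$ext, as.matrix(b0[, c("fit", "lwr", "upr")]),
         lty = c(1, 2, 2), col = "black")
matlines(b1$ext, as.matrix(b1[, c("fit", "lwr", "upr")]),
         lty = c(1, 3, 3), col = "black")
```

Using newdata keeps the x positions explicit, so the bands line up with the fitted lines regardless of the row order in C.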

Related

How to create multiple plots (plot means) on the same graph?

TL;DR: Trying to create multiple plots in one graph (image attached), using loop function. Currently manually creating codes for each boxplot, then using par() function to plot them together. It works, but looking for a less repetitive way.
I was wondering if it's possible to create multiple plots; specifically to plot "plot means". You can find the exact output in image form here (the second example on plot means): How to create multiple ggboxplots on the same graph using the loop function?
My data looks something like this:
# A tibble: 62 x 4
offer payoff partner_transfer round_type
<dbl> <dbl> <dbl> <chr>
1 40 126 66 actual
2 100 273 273 actual
3 0 100 0 actual
4 100 6 6 actual
5 25 99 24 actual
6 80 29 9 practice
7 100 45 45 practice
8 0 100 0 practice
9 25 99 24 practice
10 100 183 183 practice
# ... with 52 more rows
I'm trying to get it to look like this:
[Image: sample plot means]
Currently, my code to get this output is:
par(mfrow = c(2,2))
plot_offer <- plotmeans( offer ~ round_type, data = tg_proposer_split,
xlab = "Round Type", ylab = "Offer (by A)",
main="Mean Plot with 95% CI")
plot_partner_transfer <- plotmeans( partner_transfer ~ round_type, data = tg_proposer_split,
xlab = "Round Type", ylab = "Amount Transferred by Partner (Bot)",
main="Mean Plot with 95% CI")
plot_payoff <- plotmeans( payoff ~ round_type, data = tg_proposer_split,
xlab = "Round Type", ylab = "Payoff (for A)",
main="Mean Plot with 95% CI")
Is there a way I can shorten this code?
Biggest apologies, for some reason I'm unable to attach images because I haven't collated enough reputation points so I have no choice but to try it this way. Hope it is still clear.
Many thanks!
Here is a way to simplify the code with Map.
Define a general purpose function to take care of the plot, fun_plot;
Get the column names of the y axis variables;
Create a vector of y axis labels;
Plot in a Map loop.
The code becomes
library(gplots)  # provides plotmeans()

fun_plot <- function(ycol, ylab){
  fmla <- paste(ycol, "round_type", sep = "~")
  fmla <- as.formula(fmla)
  plotmeans(fmla, data = tg_proposer_split,
            xlab = "Round Type", ylab = ylab,
            main = "Mean Plot with 95% CI")
}
y_cols <- setdiff(names(tg_proposer_split), "round_type")
# labels must follow the column order: offer, payoff, partner_transfer
y_lab <- c("Offer (by A)", "Payoff (for A)", "Amount Transferred by Partner (Bot)")
old_par <- par(mfrow = c(2,2))
Map(fun_plot, y_cols, y_lab)
par(old_par)
Edit.
Following the error reported in comment, here is a more general function, allowing for xcol and the data set to take any values, not just "round_type" and tg_proposer_split, respectively. This solution now uses mapply, not Map, in order for those two arguments to be passed in a MoreArgs list.
fun_plot2 <- function(ycol, ylab, xcol, data){
  fmla <- paste(ycol, xcol, sep = "~")
  fmla <- as.formula(fmla)
  plotmeans(fmla, data = data,
            xlab = "Round Type", ylab = ylab,
            main = "Mean Plot with 95% CI")
}
old_par <- par(mfrow = c(2,2))
mapply(fun_plot2, y_cols, y_lab,
       MoreArgs = list(
         xcol = "round_type",
         data = tg_proposer_split
       )
)
par(old_par)
Data
tg_proposer_split <- read.table(text = "
offer payoff partner_transfer round_type
1 40 126 66 actual
2 100 273 273 actual
3 0 100 0 actual
4 100 6 6 actual
5 25 99 24 actual
6 80 29 9 practice
7 100 45 45 practice
8 0 100 0 practice
9 25 99 24 practice
10 100 183 183 practice
", header = TRUE)

How to divide 2D data into two groups

I have a data frame, Test, shown below:
Test
x y
1 4324.3329 484.6496
3 3258.4572 499.9621
4 4462.8230 562.7703
7 5173.4353 572.9492
8 4188.0244 530.8349
9 3557.5385 494.6672
10 2353.1382 517.5235
11 4944.2605 537.7489
15 3335.6628 488.4479
16 4059.0555 534.5479
17 4694.1778 531.7709
18 3213.8639 496.0062
19 4119.5348 516.3399
20 4267.7457 537.1041
22 4284.2706 503.8527
23 3019.6271 498.8519
35 2549.8743 503.5473
36 4976.5386 566.5985
37 2717.9942 513.2320
38 3545.2092 448.4752
40 3352.3206 457.7265
41 3198.0481 560.4075
42 1387.7531 395.7657
43 957.6421 296.1419
44 3168.8167 489.5333
45 2717.1015 478.6760
46 3694.8913 455.2763
47 4131.9760 519.9161
48 4366.2339 502.5977
49 4314.1003 486.7103
50 3818.1977 461.5844
52 3745.0532 467.7885
I drew a scatter plot as follows:
gg <- ggplot(Test, aes(x = x, y = y))+
geom_point()+
stat_ellipse()
ggMarginal(
gg,
type = "boxplot",
margins = "both",
size = 5
)
print(gg)
It seems like there are two groups;
(1) at right-top with large number of points
(2) at left-bottom with two points.
In this case, how can I divide the data into two groups?
I have tried k-mean clustering as follows;
#k-mean
km <- kmeans(Test,2)
library(cluster)
clusplot(Test, km$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
But, this changes x-y coordinates into PC1 & PC2, which is not what I want in this case.
For example,
set.seed(42)
km <- kmeans(Test,2)
ggplot(Test, aes(x = x, y = y,colour = factor(km$cluster)))+
geom_point()+
stat_ellipse(type = "norm", linetype = 2)
gives a scatterplot in the original x and y coordinates, with the points coloured by cluster.
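One thing to keep in mind with kmeans(Test, 2): k-means uses Euclidean distance, and here x is in the thousands while y is in the hundreds, so x dominates the clustering. A minimal sketch of one common option, standardising both columns with scale() before clustering, is shown below using the Test data from the question. The object name km_scaled and the choice nstart = 25 are assumptions made for illustration. Whether this isolates the two bottom-left points depends on the data, so inspect the resulting table and plot rather than taking the split for granted.

```r
# Test data from the question
Test <- data.frame(
  x = c(4324.3329, 3258.4572, 4462.8230, 5173.4353, 4188.0244, 3557.5385,
        2353.1382, 4944.2605, 3335.6628, 4059.0555, 4694.1778, 3213.8639,
        4119.5348, 4267.7457, 4284.2706, 3019.6271, 2549.8743, 4976.5386,
        2717.9942, 3545.2092, 3352.3206, 3198.0481, 1387.7531, 957.6421,
        3168.8167, 2717.1015, 3694.8913, 4131.9760, 4366.2339, 4314.1003,
        3818.1977, 3745.0532),
  y = c(484.6496, 499.9621, 562.7703, 572.9492, 530.8349, 494.6672,
        517.5235, 537.7489, 488.4479, 534.5479, 531.7709, 496.0062,
        516.3399, 537.1041, 503.8527, 498.8519, 503.5473, 566.5985,
        513.2320, 448.4752, 457.7265, 560.4075, 395.7657, 296.1419,
        489.5333, 478.6760, 455.2763, 519.9161, 502.5977, 486.7103,
        461.5844, 467.7885)
)

set.seed(42)
# Cluster on standardised columns so x and y carry equal weight;
# nstart = 25 tries several random starts and keeps the best one
km_scaled <- kmeans(scale(Test), centers = 2, nstart = 25)

# Plot in the original x-y coordinates, coloured by cluster
plot(Test$x, Test$y, col = km_scaled$cluster, pch = 19, xlab = "x", ylab = "y")
table(km_scaled$cluster)  # cluster sizes
```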

Comparing Multiple lm() Results within ggplot2 [duplicate]

Is there a way to extract the values of the fitted line returned from stat_smooth?
The code I am using looks like this:
p <- ggplot(df1, aes(x=Days, y= Qty,group=Category,color=Category))
p <- p + stat_smooth(method = glm, fullrange = TRUE) + geom_point()
This new R user would greatly appreciate any guidance.
Riffing off of @James's example:
p <- qplot(hp,wt,data=mtcars) + stat_smooth()
You can use the intermediate stages of the ggplot building process to pull out the plotted data. The result of ggplot_build() is a list, one component of which is data, itself a list of data frames containing the computed values to be plotted. In this case the list holds two data frames: the original qplot creates one for the points and stat_smooth creates one for the smoother.
> ggplot_build(p)$data[[2]]
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
x y ymin ymax se PANEL group
1 52.00000 1.993594 1.149150 2.838038 0.4111133 1 1
2 55.58228 2.039986 1.303264 2.776709 0.3586695 1 1
3 59.16456 2.087067 1.443076 2.731058 0.3135236 1 1
4 62.74684 2.134889 1.567662 2.702115 0.2761514 1 1
5 66.32911 2.183533 1.677017 2.690049 0.2465948 1 1
6 69.91139 2.232867 1.771739 2.693995 0.2244980 1 1
7 73.49367 2.282897 1.853241 2.712552 0.2091756 1 1
8 77.07595 2.333626 1.923599 2.743652 0.1996193 1 1
9 80.65823 2.385059 1.985378 2.784740 0.1945828 1 1
10 84.24051 2.437200 2.041282 2.833117 0.1927505 1 1
11 87.82278 2.490053 2.093808 2.886297 0.1929096 1 1
12 91.40506 2.543622 2.145018 2.942225 0.1940582 1 1
13 94.98734 2.597911 2.196466 2.999355 0.1954412 1 1
14 98.56962 2.652852 2.249260 3.056444 0.1964867 1 1
15 102.15190 2.708104 2.303465 3.112744 0.1969967 1 1
16 105.73418 2.764156 2.357927 3.170385 0.1977705 1 1
17 109.31646 2.821771 2.414230 3.229311 0.1984091 1 1
18 112.89873 2.888224 2.478136 3.298312 0.1996493 1 1
19 116.48101 2.968745 2.531045 3.406444 0.2130917 1 1
20 120.06329 3.049545 2.552102 3.546987 0.2421773 1 1
21 123.64557 3.115893 2.573577 3.658208 0.2640235 1 1
22 127.22785 3.156368 2.601664 3.711072 0.2700548 1 1
23 130.81013 3.175495 2.625951 3.725039 0.2675429 1 1
24 134.39241 3.181411 2.645191 3.717631 0.2610560 1 1
25 137.97468 3.182252 2.658993 3.705511 0.2547460 1 1
26 141.55696 3.186155 2.670350 3.701961 0.2511175 1 1
27 145.13924 3.201258 2.687208 3.715308 0.2502626 1 1
28 148.72152 3.235698 2.721744 3.749652 0.2502159 1 1
29 152.30380 3.291766 2.782767 3.800765 0.2478037 1 1
30 155.88608 3.353259 2.857911 3.848607 0.2411575 1 1
31 159.46835 3.418409 2.938257 3.898561 0.2337596 1 1
32 163.05063 3.487074 3.017321 3.956828 0.2286972 1 1
33 166.63291 3.559111 3.092367 4.025855 0.2272319 1 1
34 170.21519 3.634377 3.165426 4.103328 0.2283065 1 1
35 173.79747 3.712729 3.242093 4.183364 0.2291263 1 1
36 177.37975 3.813399 3.347232 4.279565 0.2269509 1 1
37 180.96203 3.910849 3.447572 4.374127 0.2255441 1 1
38 184.54430 3.977051 3.517784 4.436318 0.2235917 1 1
39 188.12658 4.037302 3.583959 4.490645 0.2207076 1 1
40 191.70886 4.091635 3.645111 4.538160 0.2173882 1 1
41 195.29114 4.140082 3.700184 4.579981 0.2141624 1 1
42 198.87342 4.182676 3.748159 4.617192 0.2115424 1 1
43 202.45570 4.219447 3.788162 4.650732 0.2099688 1 1
44 206.03797 4.250429 3.819579 4.681280 0.2097573 1 1
45 209.62025 4.275654 3.842137 4.709171 0.2110556 1 1
46 213.20253 4.295154 3.855951 4.734357 0.2138238 1 1
47 216.78481 4.308961 3.861497 4.756425 0.2178456 1 1
48 220.36709 4.317108 3.859541 4.774675 0.2227644 1 1
49 223.94937 4.319626 3.851025 4.788227 0.2281358 1 1
50 227.53165 4.316548 3.836964 4.796132 0.2334829 1 1
51 231.11392 4.308435 3.818728 4.798143 0.2384117 1 1
52 234.69620 4.302276 3.802201 4.802351 0.2434590 1 1
53 238.27848 4.297902 3.787395 4.808409 0.2485379 1 1
54 241.86076 4.292303 3.772103 4.812503 0.2532567 1 1
55 245.44304 4.282505 3.754087 4.810923 0.2572576 1 1
56 249.02532 4.269040 3.733184 4.804896 0.2608786 1 1
57 252.60759 4.253361 3.710042 4.796680 0.2645121 1 1
58 256.18987 4.235474 3.684476 4.786473 0.2682509 1 1
59 259.77215 4.215385 3.656265 4.774504 0.2722044 1 1
60 263.35443 4.193098 3.625161 4.761036 0.2764974 1 1
61 266.93671 4.168621 3.590884 4.746357 0.2812681 1 1
62 270.51899 4.141957 3.553134 4.730781 0.2866658 1 1
63 274.10127 4.113114 3.511593 4.714635 0.2928472 1 1
64 277.68354 4.082096 3.465939 4.698253 0.2999729 1 1
65 281.26582 4.048910 3.415849 4.681971 0.3082025 1 1
66 284.84810 4.013560 3.361010 4.666109 0.3176905 1 1
67 288.43038 3.976052 3.301132 4.650972 0.3285813 1 1
68 292.01266 3.936392 3.235952 4.636833 0.3410058 1 1
69 295.59494 3.894586 3.165240 4.623932 0.3550782 1 1
70 299.17722 3.850639 3.088806 4.612473 0.3708948 1 1
71 302.75949 3.804557 3.006494 4.602619 0.3885326 1 1
72 306.34177 3.756345 2.918191 4.594499 0.4080510 1 1
73 309.92405 3.706009 2.823813 4.588205 0.4294926 1 1
74 313.50633 3.653554 2.723308 4.583801 0.4528856 1 1
75 317.08861 3.598987 2.616650 4.581325 0.4782460 1 1
76 320.67089 3.542313 2.503829 4.580796 0.5055805 1 1
77 324.25316 3.483536 2.384853 4.582220 0.5348886 1 1
78 327.83544 3.422664 2.259739 4.585589 0.5661643 1 1
79 331.41772 3.359701 2.128512 4.590891 0.5993985 1 1
80 335.00000 3.294654 1.991200 4.598107 0.6345798 1 1
Knowing a priori where the one you want is in the list isn't easy, but if nothing else you can look at the column names.
It is still better to do the smoothing outside the ggplot call, though.
EDIT:
It turns out replicating what ggplot2 does to make the loess is not as straightforward as I thought, but this will work. I copied it out of some internal functions in ggplot2.
library(ggplot2)
model <- loess(wt ~ hp, data = mtcars)
xrange <- range(mtcars$hp)
xseq <- seq(from = xrange[1], to = xrange[2], length.out = 80)
pred <- predict(model, newdata = data.frame(hp = xseq), se = TRUE)
y <- pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)  # half-width of the 95% interval
ymin <- y - ci
ymax <- y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
ggplot(mtcars, aes(x = hp, y = wt)) +
  geom_point() +
  # aes_auto() is deprecated in current ggplot2, so map the columns explicitly
  geom_smooth(aes(x = x, y = y, ymin = ymin, ymax = ymax), data = loess.DF, stat = "identity")
That gives a plot that looks identical to
ggplot(mtcars, aes(x=hp, y=wt)) +
geom_point() +
geom_smooth()
(which is the expanded form of the original p).
stat_smooth does produce output that you can use elsewhere, and in a slightly hacky way you can put it into a variable in the global environment.
You enclose the output variable in .. on either side to use it. So if you add an aes in the stat_smooth call and use the global assignment operator, <<-, to assign the output to a variable in the global environment, you can get the fitted values, or others (see below).
qplot(hp,wt,data=mtcars) + stat_smooth(aes(outfit=fit<<-..y..))
fit
[1] 1.993594 2.039986 2.087067 2.134889 2.183533 2.232867 2.282897 2.333626
[9] 2.385059 2.437200 2.490053 2.543622 2.597911 2.652852 2.708104 2.764156
[17] 2.821771 2.888224 2.968745 3.049545 3.115893 3.156368 3.175495 3.181411
[25] 3.182252 3.186155 3.201258 3.235698 3.291766 3.353259 3.418409 3.487074
[33] 3.559111 3.634377 3.712729 3.813399 3.910849 3.977051 4.037302 4.091635
[41] 4.140082 4.182676 4.219447 4.250429 4.275654 4.295154 4.308961 4.317108
[49] 4.319626 4.316548 4.308435 4.302276 4.297902 4.292303 4.282505 4.269040
[57] 4.253361 4.235474 4.215385 4.193098 4.168621 4.141957 4.113114 4.082096
[65] 4.048910 4.013560 3.976052 3.936392 3.894586 3.850639 3.804557 3.756345
[73] 3.706009 3.653554 3.598987 3.542313 3.483536 3.422664 3.359701 3.294654
The outputs you can obtain are:
y, predicted value
ymin, lower pointwise confidence interval around the mean
ymax, upper pointwise confidence interval around the mean
se, standard error
Note that by default it predicts on 80 data points, which may not be aligned with your original data.
A more general approach could be to simply use the predict() function to predict any range of values that are interesting.
# define the model
model <- loess(wt ~ hp, data = mtcars)
# predict fitted values for each observation in the original dataset
modelFit <- data.frame(predict(model, se = TRUE))
# define data frame for ggplot
df <- data.frame(cbind(hp = mtcars$hp
, wt = mtcars$wt
, fit = modelFit$fit
, upperBound = modelFit$fit + 2 * modelFit$se.fit
, lowerBound = modelFit$fit - 2 * modelFit$se.fit
))
# build the plot using the fitted values from the predict() function
# geom_linerange() and the second geom_point() in the code are built using the values from the predict() function
# for comparison ggplot's geom_smooth() is also shown
g <- ggplot(df, aes(hp, wt))
g <- g + geom_point()
g <- g + geom_linerange(aes(ymin = lowerBound, ymax = upperBound))
g <- g + geom_point(aes(hp, fit, size = 1))
g <- g + geom_smooth(method = "loess")
g
# Predict any range of values and include the standard error in the output
predict(model, newdata = 100:300, se = TRUE)
If you want to bring in the power of the tidyverse, you can use the broom package to add the predicted values from the loess function to your original dataset. This builds on @phillyooo's solution.
library(tidyverse)
library(broom)
# original graph with smoother
ggplot(data=mtcars, aes(hp,wt)) +
stat_smooth(method = "loess", span = 0.75)
# Create model that will do the same thing as under the hood in ggplot2
model <- loess(wt ~ hp, data = mtcars, span = 0.75)
# Add predicted values from model to original dataset using broom library
mtcars2 <- augment(model, mtcars)
# Plot both lines
ggplot(data=mtcars2, aes(hp,wt)) +
geom_line(aes(hp, .fitted), color = "red") +
stat_smooth(method = "loess", span = 0.75)
Save the graph object and use ggplot_build() or layer_data() to obtain the computed values for each layer, e.g.
pp <- ggplot(mtcars, aes(x = hp, y = wt)) + geom_point() + geom_smooth()
ggplot_build(pp)
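To make the layer_data() route concrete for the case in the thread title (several lm() fits in one plot), here is a minimal sketch. It assumes mtcars with cyl as a stand-in grouping variable, since the question's df1 is not shown. The names pp_lm, smooth_df and fits are illustrative. Layer 2 is the smoother because geom_point() was added first.

```r
library(ggplot2)

# One lm() fit per cylinder group, drawn by geom_smooth()
pp_lm <- ggplot(mtcars, aes(x = hp, y = wt, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Computed values of the second layer (the smoother): one block of rows per group
smooth_df <- layer_data(pp_lm, 2)
head(smooth_df[, c("x", "y", "ymin", "ymax", "se", "group")])

# The same models fitted directly, which gives coefficients to compare
fits <- lapply(split(mtcars, mtcars$cyl),
               function(d) coef(lm(wt ~ hp, data = d)))
fits
```

Fitting the models directly is usually the easier way to compare slopes and intercepts. The layer data is mainly useful when you need exactly the values that were drawn.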

How to perform Broken-line regression analysis in R?

I have the following data:
Treatment Dose Value
FeSo4 200 104.17
TQ1 6 98.17
TQ2 9 92
TQ3 12 86.67
TQ4 15 77.33
TQ5 18 71.33
TQ6 21 74.83
TQ7 24 82.17
How can I do Broken-line regression analysis of this data in R to get the graph as below:
The best way to fit linear models by segments in R is to use CRAN package segmented.
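In what follows, I create a new column, Num, holding the integer codes of Treatment treated as a factor. (From R 4.0.0 onward read.table() no longer converts strings to factors by default, so the conversion is made explicit.)
library(segmented)
df1$Num <- as.integer(factor(df1$Treatment))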
fit <- lm(Value ~ Num, df1)
summary(fit)
seg <- segmented(fit, seg.Z = ~Num, psi = 6)
plot(Value ~ Num, df1) # plot the points
plot(seg, add = TRUE) # plot the broken line
abline(v = seg$psi[2]) # plot the vertical at the breakpoint
Data.
df1 <- read.table(text = "
Treatment Dose Value
FeSo4 200 104.17
TQ1 6 98.17
TQ2 9 92
TQ3 12 86.67
TQ4 15 77.33
TQ5 18 71.33
TQ6 21 74.83
TQ7 24 82.17
", header = TRUE)
A different approach is to first find the threshold and then fit a regular lm() model:
library(SiZer)
df <- read.table(text = "
Treatment Dose Value
FeSo4 200 104.17
TQ1 6 98.17
TQ2 9 92
TQ3 12 86.67
TQ4 15 77.33
TQ5 18 71.33
TQ6 21 74.83
TQ7 24 82.17
", header = TRUE)
df$Num <- as.integer(factor(df$Treatment))  # explicit factor(): Treatment is read as character in R >= 4.0.0
thr.pwl = piecewise.linear(df$Num, df$Value,
middle = 1, CI = FALSE,
bootstrap.samples = 1000, sig.level = 0.001)
thr.pwl
[1] "Threshold alpha: 6.30159931424453" #This is the threshold you need
[1] ""
[1] "Model coefficients: Beta[0], Beta[1], Beta[2]" #The estimates here are the same as in model.pwl, however, with lm() you can include also other independent variables
(Intercept) x w
111.48333 -6.63000 13.97001
model.pwl <- lm(Value ~ Num*(Num >= 6.30) + Num*(Num < 6.30),
data = df)
summary(model.pwl)
And you can plot it as:
plot(thr.pwl)
abline(v = thr.pwl$change.point)
However, piecewise.linear() handles only one threshold, whereas segmented() can fit several.
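To see what a single-breakpoint fit is doing, the model can be written as Value = b0 + b1 * Num + b2 * max(Num - psi, 0), where psi is the breakpoint and b2 is the change in slope after it. The sketch below is a plain grid search in base R, not the algorithm used by segmented or SiZer. It assumes the same data with Num coded 1 to 8 in row order, and rss_at(), candidates and best are hypothetical names. For each candidate psi it fits the hinge model with lm() and keeps the one with the smallest residual sum of squares, which for these data should land near the threshold of about 6.30 reported above.

```r
# Data from the question, with Num = 1..8 as the integer code of Treatment
num   <- 1:8
value <- c(104.17, 98.17, 92, 86.67, 77.33, 71.33, 74.83, 82.17)

# Residual sum of squares of a continuous two-segment fit with break at brk
rss_at <- function(brk) {
  hinge <- pmax(num - brk, 0)  # zero before the break, linear after it
  sum(resid(lm(value ~ num + hinge))^2)
}

# Search a grid of candidate breakpoints strictly inside the data range
candidates <- seq(2, 7, by = 0.01)
rss  <- vapply(candidates, rss_at, numeric(1))
best <- candidates[which.min(rss)]
best

# Refit at the chosen breakpoint; the third coefficient is the slope change
fit <- lm(value ~ num + pmax(num - best, 0))
coef(fit)

# Draw the points and the broken line
plot(num, value)
xs <- seq(1, 8, by = 0.05)
lines(xs, predict(fit, newdata = data.frame(num = xs)))
abline(v = best, lty = 2)
```

A grid search like this gives no standard error for the breakpoint, which is one reason to prefer segmented() for real analyses.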

join axes in barplot

I would like to eliminate the gap between the x and y axes in barplot and extend the predicted line back to intersect the y axis, preferably in base R. Is this possible? Thank you for any advice or suggestions.
my.data <- read.table(text = '
band mid.point count
1 0.5 74
2 1.5 73
3 2.5 79
4 3.5 70
5 4.5 78
6 5.5 63
7 6.5 59
8 7.5 60
', header = TRUE)
my.data
x <- my.data$mid.point^2
my.model <- lm(count ~ x, data = my.data)
my.plot <- barplot(my.data$count, ylim=c(0,100), space=0, col=NA)
axis(1, at=my.plot+0.5, labels=my.data$band)
lines(predict(my.model, data.frame(x=x), type="resp"), col="black", lwd = 1.5)
EDIT November 26, 2014
I just realized the two plots are not the same (the plot in the original post and the plot in my answer below). Compare the two curved lines closely, particularly at the right-side of the plot. Clearly the two curved lines intersect the top of the 8th bar in different locations. However, I have not yet had time to figure out why the plots differ.
Here is one way to extrapolate the predicted line back to the y axis. I incorporate rawr's suggestion regarding eliminating the gap between the y axis and the x axis.
setwd('c:/users/markm/simple R programs/')
jpeg(filename = "barplot_and_line.jpeg")
my.data <- read.table(text = '
band mid.point count
1 0.5 74
2 1.5 73
3 2.5 79
4 3.5 70
5 4.5 78
6 5.5 63
7 6.5 59
8 7.5 60
', header = TRUE)
x <- my.data$mid.point^2
my.model <- lm(count ~ x, data = my.data)
z <- seq(0,8,0.01)
y <- my.model$coef[1] + my.model$coef[2] * z^2
barplot(my.data$count, ylim=c(0,100), space=0, col=NA, xaxs = 'i')
points(z, y, type='l', col=1)
dev.off()
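A likely reason the two curves differ (offered as a reading of the two code listings, not something confirmed by the author): in the question, lines() is given only the predicted values, so they are drawn at x = 1, 2, ..., 8, whereas with space = 0 the bar midpoints returned by barplot() are 0.5, 1.5, ..., 7.5. The question's curve is therefore shifted half a bar to the right relative to the curve in the answer, which is drawn against z on the mid.point scale. The sketch below overlays both so the offset is visible. The names mids and pred are illustrative, and x is stored as a column of my.data rather than as a separate vector.

```r
my.data <- data.frame(band      = 1:8,
                      mid.point = seq(0.5, 7.5, by = 1),
                      count     = c(74, 73, 79, 70, 78, 63, 59, 60))
my.data$x <- my.data$mid.point^2
my.model  <- lm(count ~ x, data = my.data)

# barplot() returns the bar midpoints: 0.5, 1.5, ..., 7.5 when space = 0
mids <- barplot(my.data$count, ylim = c(0, 100), space = 0, col = NA, xaxs = "i")

# Curve on the mid.point scale, extended back to the y axis (solid line)
z <- seq(0, 8, by = 0.01)
lines(z, predict(my.model, newdata = data.frame(x = z^2)))

# Predictions at the observed mid-points
pred <- predict(my.model, newdata = data.frame(x = my.data$x))
points(mids, pred, pch = 16)            # placed at the bar midpoints
lines(seq_along(pred), pred, lty = 2)   # placed at 1..8, as lines(pred) would do
```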
