array of correlation plots - r

From a mouse experiment, I have data for about fifty mice across about 15 different metrics. I generated a list of correlation plots of every metric against every other metric to identify which measurements correlate with each other and which ones don't.
library(ggplot2)
df <- structure(list(mouse_ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 22L, 23L, 24L, 25L,
26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L,
39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L,
52L, 53L, 54L, 55L), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L
), .Label = c("not challenged", "vehicle control", "high",
"medium", "low", "reference"
), class = "factor"), value.x = c(0.003725, 0.0208, 0.004475,
0, 0.00895, 1.00625, 1.0125, 1.014, 1.1025, 0.925, 0.897, 0.99,
1.1495, 1.0125, 1.08, 0.88425, 1.001, 0.864, 0.89175, 0.9425,
0.943, 1.07325, 0.73575, 0.606, 0.682, 0.79925, 0.87, 0.60225,
0.756, 0.891, 0.6555, 0.572, 0.253, 0.255, 0.396, 0.4495, 0.299,
0.39, 0.3, 0.5365, 0.378, 0.475, 0.73575, 0.4895, 0.468, 0.90625,
0.3905, 0.4995, 0.60375, 0.744, 0.75, 0.5535), value.y = c(0,
0, 0, 0, 0, 5.775, 4.6875, 4.992, 7.245, 6.0125, 3.795, 4.99125,
7.26275, 4.35375, 4.3875, 3.6025, 4.389, 3.852, 3.444, 4.205,
5.207, 4.77, 3.052, 2.65125, 2.024, 3.6835, 2.9, 1.5695, 2.7,
2.619, 2.964, 1.936, 0.539, 0.408, 1.056, 1.085, 0.897, 0.795,
0.5, 1.0915, 0.5355, 0.575, 2.8885, 2.0915, 1.755, 3.40625, 1.42,
1.6095, 2.835, 2.3715, 2.7, 1.927)), row.names = c(NA, -52L),
class = c("tbl_df", "tbl", "data.frame"))
ggplot(data = df, aes(x = value.x, y = value.y)) +
geom_point(aes(color = treatment)) +
geom_smooth(method = lm, se = TRUE)
#> `geom_smooth()` using formula 'y ~ x'
It turns out that a long list of over 100 plots is really hard to take in, and on each plot there is relatively little information. I would like to arrange these linear plots in a grid of the 15 x 15 measurements and visualize the correlation coefficient for the linear models by background color and overlay the linear model and data points.
Is this somehow feasible to do in ggplot? Is there another tool I could use? And if so, how should I arrange the data structure? I am comfortable dealing with purrr and nested lists for such models, but I guess in this case a long list does not seem ideal -- a matrix-style arrangement would fit the output much better.
Any thoughts or suggestions on how to approach this?
Created on 2021-01-20 by the reprex package (v0.3.0)
Sorry, my explanation wasn't clear. The data I am showing above is only a fraction of the data available. Here I am plotting the linear correlation of two readouts, but I have over a dozen readouts that I used for pairwise comparisons. I am looking for something like this:
Each tile should be colored by a metric of the linear model (e.g. correlation coefficient or p-value), but it should also show the graphed data and an overlay of the linear model.

GGally is absolutely what I was looking for. It's simple to use and has a number of useful plotting options I will need to explore.
It turns out there are potentially some issues when the grid gets larger, but right now it's not clear to me if this is a data issue or a limitation in the plotting function. Lots of stuff to explore, but the simplicity of getting the first plots done is awesome.
Now to figure out how to scale the background color of each mini-plot by the overall correlation coefficient!
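One way the background scaling might be approached is a custom panel function passed to ggpairs. The sketch below is only an illustration under assumptions: cor_fill and cor_panel are hypothetical helper names, the palette is arbitrary, and mtcars stands in for the mouse data. It relies on GGally's eval_data_col to pull the x and y columns out of the panel mapping.

```r
library(ggplot2)
library(GGally)

# Hypothetical helper: map a correlation in [-1, 1] to a fill colour
# (blue for negative, white for zero, red for positive).
cor_fill <- function(r, n = 101) {
  pal <- colorRampPalette(c("#3B9AB2", "white", "#F21A00"))(n)
  r <- max(-1, min(1, r))                 # clamp against rounding error
  pal[round((r + 1) / 2 * (n - 1)) + 1]   # index between 1 and n
}

# Hypothetical panel function: points, linear fit, and a background
# coloured by the Pearson correlation of the two columns in this panel.
cor_panel <- function(data, mapping, ...) {
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  r <- cor(x, y, use = "complete.obs")
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", formula = y ~ x, se = TRUE, colour = "black") +
    theme(panel.background = element_rect(fill = cor_fill(r)))
}

# mtcars is used here only as stand-in data
p <- ggpairs(mtcars[, c("mpg", "hp", "wt", "qsec")],
             lower = list(continuous = cor_panel))
print(p)
```

If a colour aesthetic is added to the ggpairs call, geom_smooth would fit one line per group, so the overall fit may need its own aes(group = 1).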

Are you looking for faceting?
library(ggplot2)
ggplot(df, aes(x = value.x, y = value.y)) +
geom_point(aes(color = treatment)) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~treatment, labeller = label_both)
If you want to compare combinations of grouping variables, try facet_grid. I'm using the builtin mtcars data for this example, since your sample data only has one categorical variable.
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
facet_grid(cyl ~ am, labeller = label_both)
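If the goal is a metric-by-metric grid rather than a grid over grouping variables, faceting can still be used after reshaping the data so that each row carries one (x metric, y metric) pair. The following is a rough sketch under assumptions: mtcars and the three chosen columns are stand-ins, and long and tiles are hypothetical names. A geom_rect layer with infinite limits colours each panel by its correlation coefficient.

```r
library(ggplot2)

metrics <- c("mpg", "hp", "wt")   # stand-in metrics from mtcars
combos <- expand.grid(xvar = metrics, yvar = metrics, stringsAsFactors = FALSE)

# One block of rows per (x metric, y metric) pair, with that pair's correlation
long <- do.call(rbind, lapply(seq_len(nrow(combos)), function(i) {
  xv <- combos$xvar[i]
  yv <- combos$yvar[i]
  data.frame(
    xvar = factor(xv, levels = metrics),
    yvar = factor(yv, levels = metrics),
    x = mtcars[[xv]],
    y = mtcars[[yv]],
    r = cor(mtcars[[xv]], mtcars[[yv]])
  )
}))

# One row per panel, used only for the background colour
tiles <- unique(long[c("xvar", "yvar", "r")])

p <- ggplot(long, aes(x, y)) +
  geom_rect(data = tiles, aes(fill = r), inherit.aes = FALSE,
            xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf, alpha = 0.5) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, colour = "black") +
  facet_grid(yvar ~ xvar, scales = "free") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1))
print(p)
```

With 15 metrics this produces 225 panels, so the long table grows quickly; restricting combos to the lower triangle would roughly halve it.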

Related

ggpairs formatting for points only

I'm looking to increase the size of the points AND outline them in black while keeping the line weight the same across the remaining plots.
library(ggplot2)
library(GGally)
pp <- ggpairs(pp.sed, columns = c(1,2), aes(color=pond.id, alpha = 0.5)) +
theme_bw()
print(pp)
Which gives me the following figure:
Data for reproducibility, and TIA!
> dput(pp.sed)
structure(list(Fe.259.941 = c(905.2628883, 825.7883359, 6846.128702,
1032.932924, 997.8037721, 588.9599882, 6107.641947, 798.4493611,
1046.38376, 685.2485692, 6452.273486, 730.8656684, 902.8585447,
1039.886406, 7408.801001, 2512.089991, 911.2101809, 941.3712067,
659.1069185, 1070.090445, 1017.666402, 925.3221586, 645.0500668,
954.0009756, 1022.594904, 803.5865352, 7653.184537, 1082.714082,
1048.51115, 773.9070604, 6889.060748, 973.0971769, 1002.091143,
798.9670583, 5089.035978, 2361.713222, 970.8258109, 748.3574529,
3942.04816, 889.1760124), Mn.257.611 = c(17.24667962, 14.90488024,
14.39265671, 20.51133433, 19.92596564, 11.76690074, 19.76386229,
14.29779164, 20.23646264, 13.55374658, 16.8847698, 13.11784439,
15.91777975, 20.64068844, 16.78681661, 28.61732162, 15.88328987,
19.59750367, 13.09735943, 21.59458118, 17.680152, 19.87127449,
12.8082581, 20.12050221, 17.57143193, 18.72196029, 16.21525793,
22.0518966, 18.39642397, 18.32238508, 16.17696923, 20.69668404,
17.96018218, 18.71945309, 16.50162126, 30.60719123, 17.69058768,
14.99048753, 16.28302375, 18.32277507), pond.id = structure(c(6L,
5L, 2L, 1L, 3L, 5L, 2L, 1L, 3L, 5L, 2L, 1L, 6L, 3L, 2L, 4L, 6L,
3L, 4L, 4L, 6L, 3L, 4L, 1L, 6L, 3L, 2L, 1L, 6L, 3L, 2L, 1L, 6L,
3L, 2L, 1L, 6L, 5L, 2L, 1L), .Label = c("LIL", "RHM", "SCS",
"STN", "STS", "TS"), class = "factor")), class = "data.frame", row.names = c(11L,
12L, 13L, 15L, 26L, 27L, 28L, 30L, 36L, 37L, 38L, 40L, 101L,
102L, 103L, 105L, 127L, 128L, 129L, 131L, 142L, 143L, 144L, 146L,
157L, 158L, 159L, 161L, 172L, 173L, 174L, 176L, 184L, 185L, 186L,
188L, 199L, 200L, 201L, 203L))
The GGally package already offers a family of wrap_xxx functions which could be used to set parameters to override default behaviour, e.g. using wrap you could override the default size of points using wrap(ggally_points, size = 5).
To use the wrapped function instead of the default you have to call
ggpairs(..., lower = list(continuous = wrap(ggally_points, size = 5))).
Switching the outline is a bit more tricky. Using wrap we could switch the shape of the points to 21 and set the outline color to "black". However, doing so the points are no longer colored. Unfortunately I have found no way to override the mapping. While it is possible to add a global fill aes, a drawback of doing so is that we lose the black outline for the densities.
One option to fix that is to write a wrapper for ggally_points which adjusts the mapping so that the fill aes is used instead of color.
library(ggplot2)
library(GGally)
ggally_points_filled <- function(data, mapping, ...) {
names(mapping)[grepl("^colour", names(mapping))] <- "fill"
ggally_points(data, mapping, ..., shape = 21)
}
w_ggally_points_filled <- wrap(ggally_points_filled, size = 5, color = "black")
ggpairs(pp.sed, columns = c(1, 2), aes(color = pond.id, alpha = 0.5),
lower = list(continuous = w_ggally_points_filled)) +
theme_bw()

What is the best way to use agricolae to do ANOVAs on a split plot design?

I'm trying to run some ANOVAs on data from a split plot experiment, ideally using the agricolae package. It's been a while since I've taken a stats class and I wanted to be sure I'm analyzing this data correctly, so I did some searching online and couldn't really find consistency in the way people were analyzing their split plot experiments. What is the best way for me to do this?
Here's the head of my data:
dput(head(rawData))
structure(list(ï..Plot = 2111:2116, Variety = structure(c(5L,
4L, 3L, 6L, 1L, 2L), .Label = c("Burbank", "Hodag", "Lamoka",
"Norkotah", "Silverton", "Snowden"), class = "factor"), Rate = c(4L,
4L, 4L, 4L, 4L, 4L), Rep = c(1L, 1L, 1L, 1L, 1L, 1L), totalTubers = c(594L,
605L, 656L, 729L, 694L, 548L), totalOzNoCulls = c(2544.18, 2382.07,
2140.69, 2401.56, 2440.56, 2503.5), totalCWTacNoCulls = c(461.76867,
432.345705, 388.535235, 435.88314, 442.96164, 454.38525), avgLWratio = c(1.260615419,
1.287949374, 1.111981583, 1.08647584, 1.350686661, 1.107173509
), Hollow = c(14L, 15L, 22L, 25L, 14L, 13L), Double = c(10L,
13L, 15L, 22L, 11L, 9L), Knob = c(86L, 80L, 139L, 156L, 77L,
126L), Researcher = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Wang", class = "factor"),
CullsPounds = c(1.75, 1.15, 4.7, 1.85, 0.8, 5.55), CullsOz = c(28,
18.4, 75.2, 29.6, 12.8, 88.8), totalOz = c(2572.18, 2400.47,
2215.89, 2431.16, 2453.36, 2592.3), totalCWTacCulls = c(466.85067,
435.685305, 402.184035, 441.25554, 445.28484, 470.50245)), row.names = c(NA,
6L), class = "data.frame")
For these data, the whole plot is Rate, the split plot is Variety, the block is Rep, and for discussion's sake here, we can look at totalCWTacNoCulls as the response.
Any help would be very much appreciated! I am still getting the hang of Stack Overflow, so if I have made any mistakes or shared my data wrong, please let me know and I'll change it. Thank you!
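You can do this using the agricolae package as follows:
library(agricolae)
# Convert the design variables to factors in the data frame itself
rawData$Rate <- factor(rawData$Rate)
rawData$Variety <- factor(rawData$Variety)
rawData$Rep <- factor(rawData$Rep)
# Arguments are: block, main plot, subplot, response
with(rawData, sp.plot(Rep, Rate, Variety, totalCWTacNoCulls))
The usage according to the agricolae package is
sp.plot(block, pplot, splot, Y)
where block is the replications, pplot is the main-plot factor, splot is the sub-plot factor, and Y is the response variable.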
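For cross-checking, the same layout can also be written with base R's aov() and an Error() term, which is the usual textbook formulation of a split plot in randomized blocks. This is only a sketch: the data frame d below is simulated, its factor levels are made up, and only the column names follow the question.

```r
set.seed(1)

# Simulated stand-in data: 4 blocks x 3 rates (whole plots) x 3 varieties (subplots)
d <- expand.grid(
  Rep     = factor(1:4),
  Rate    = factor(c(2, 4, 6)),
  Variety = factor(c("Burbank", "Hodag", "Lamoka"))
)
d$totalCWTacNoCulls <- 430 + rnorm(nrow(d), sd = 20)

# Rate is tested against the Rep:Rate (whole-plot) error;
# Variety and Rate:Variety are tested against the within (subplot) error.
fit <- aov(totalCWTacNoCulls ~ Rate * Variety + Error(Rep / Rate), data = d)
summary(fit)
```

The summary prints one ANOVA table per error stratum, which makes it easier to see which error term each F test uses.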

Drawing SE in xyplot with errorbars

I am trying to construct a simple XY-Graph with the milk production (called FCM) of two different groups of cows (from the output I got from the mixed model, using the lsmeans and SE).
I was able to construct the plot displaying the lsmeans using the xyplot function in lattice:
library(lattice)
xyplot(lsmean~Time, type="b", group=Group, data=lsmeans2[order(lsmeans2$Time),],
pch=16, ylim=c(10,35), col=c("darkorange","darkgreen"),
ylab="FCM (kg/day)", xlab="Week", lwd=2,
key=list(space="top",
lines=list(col=c("darkorange","darkgreen"),lty=c(1,1),lwd=2),
text=list(c("Confinement Group","Pasture Group"), cex=0.8)))
I now want to add the error bars. I tried some things with the panel.arrows function, just copying and pasting from other examples, but didn't get any further.
I would really appreciate some help!
My lsmeans2 dataset:
Group Time lsmean SE df lower.CL upper.CL
Stall wk1 26.23299 0.6460481 59 24.19243 28.27356
Weide wk1 25.12652 0.6701080 58 23.00834 27.24471
Stall wk10 21.89950 0.6460589 59 19.85890 23.94010
Weide wk10 18.45845 0.6679617 58 16.34705 20.56986
Stall wk2 25.38004 0.6460168 59 23.33957 27.42050
Weide wk2 22.90409 0.6679617 58 20.79269 25.01549
Stall wk3 25.02474 0.6459262 59 22.98455 27.06492
Weide wk3 24.05886 0.6679436 58 21.94751 26.17020
Stall wk4 23.91630 0.6456643 59 21.87694 25.95565
Weide wk4 22.23608 0.6678912 58 20.12490 24.34726
Stall wk5 23.97382 0.6493483 59 21.92283 26.02481
Weide wk5 18.14550 0.6677398 58 16.03480 20.25620
Stall wk6 24.48899 0.6456643 59 22.44963 26.52834
Weide wk6 19.40022 0.6697394 58 17.28319 21.51724
Stall wk7 24.98107 0.6459262 59 22.94089 27.02126
Weide wk7 19.71200 0.6677398 58 17.60129 21.82270
Stall wk8 22.65167 0.6460168 59 20.61120 24.69214
Weide wk8 19.35759 0.6678912 58 17.24641 21.46877
Stall wk9 22.64381 0.6460481 59 20.60324 24.68438
Weide wk9 19.26869 0.6679436 58 17.15735 21.38004
For completeness, here is a solution using xyplot:
# Reproducible data
lsmeans2 = structure(list(Group = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Stall",
"Weide"), class = "factor"), Time = structure(c(1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L,
10L), .Label = c("wk1", "wk10", "wk2", "wk3", "wk4", "wk5", "wk6",
"wk7", "wk8", "wk9"), class = "factor"), lsmean = c(26.23299,
25.12652, 21.8995, 18.45845, 25.38004, 22.90409, 25.02474, 24.05886,
23.9163, 22.23608, 23.97382, 18.1455, 24.48899, 19.40022, 24.98107,
19.712, 22.65167, 19.35759, 22.64381, 19.26869), SE = c(0.6460481,
0.670108, 0.6460589, 0.6679617, 0.6460168, 0.6679617, 0.6459262,
0.6679436, 0.6456643, 0.6678912, 0.6493483, 0.6677398, 0.6456643,
0.6697394, 0.6459262, 0.6677398, 0.6460168, 0.6678912, 0.6460481,
0.6679436), df = c(59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L,
58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L), lower.CL = c(24.19243,
23.00834, 19.8589, 16.34705, 23.33957, 20.79269, 22.98455, 21.94751,
21.87694, 20.1249, 21.92283, 16.0348, 22.44963, 17.28319, 22.94089,
17.60129, 20.6112, 17.24641, 20.60324, 17.15735), upper.CL = c(28.27356,
27.24471, 23.9401, 20.56986, 27.4205, 25.01549, 27.06492, 26.1702,
25.95565, 24.34726, 26.02481, 20.2562, 26.52834, 21.51724, 27.02126,
21.8227, 24.69214, 21.46877, 24.68438, 21.38004)), .Names = c("Group",
"Time", "lsmean", "SE", "df", "lower.CL", "upper.CL"), class = "data.frame", row.names = c(NA,
-20L))
xyplot(lsmean~Time, type="b", group=Group, data=lsmeans2[order(lsmeans2$Time),],
panel = function(x, y, ...){
panel.arrows(x, y, x, lsmeans2$upper.CL, length = 0.15,
angle = 90, col=c("darkorange","darkgreen"))
panel.arrows(x, y, x, lsmeans2$lower.CL, length = 0.15,
angle = 90, col=c("darkorange","darkgreen"))
panel.xyplot(x,y, ...)
},
pch=16, ylim=c(10,35), col=c("darkorange","darkgreen"),
ylab="FCM (kg/day)", xlab="Week", lwd=2,
key=list(space="top",
lines=list(col=c("darkorange","darkgreen"),lty=c(1,1),lwd=2),
text=list(c("Confinement Group","Pasture Group"), cex=0.8)))
The length argument in panel.arrows changes the width of the error heads. You can fiddle around with this parameter to get a width you like.
Notice that even though the data = argument is given lsmeans2[order(lsmeans2$Time),], the ordering of Time is still wrong. This is because Time is a factor whose levels are sorted alphabetically, and R doesn't know you want them ordered by the numerical suffix of wk. As a result, wk10 is sorted before wk2, because the character 1 sorts before 2. You can use this little trick to order the levels correctly:
# Order first by the character length, then by Time
Timelevels = levels(lsmeans2$Time)
Timelevels = Timelevels[order(nchar(Timelevels), Timelevels)]
# Reorder the levels
lsmeans2$Time = factor(lsmeans2$Time, levels = Timelevels)
# Create Subset
lsmeansSub = lsmeans2[order(lsmeans2$Time),]
xyplot(lsmean~Time, type="b", group=Group, data=lsmeansSub,
panel = function(x, y, ...){
panel.arrows(x, y, x, lsmeansSub$upper.CL, length = 0.15,
angle = 90, col=c("darkorange","darkgreen"))
panel.arrows(x, y, x, lsmeansSub$lower.CL, length = 0.15,
angle = 90, col=c("darkorange","darkgreen"))
panel.xyplot(x, y, ...)
},
pch=16, ylim=c(10,35), col=c("darkorange","darkgreen"),
ylab="FCM (kg/day)", xlab="Week", lwd=2,
key=list(space="top",
lines=list(col=c("darkorange","darkgreen"),lty=c(1,1),lwd=2),
text=list(c("Confinement Group","Pasture Group"), cex=0.8)))
Note that even after reordering the levels of "Time", I still need to use the sorted data for the data = argument. This is because xyplot plots the points in the order in which they appear in the dataset, not in the order of the factor levels.
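The length-then-alphabet trick works for labels such as wk1 to wk10. A slightly more explicit variant, sketched here on a made-up toy factor, orders the levels by the number extracted from each label:

```r
# Toy factor with the same labelling scheme as Time (made-up values)
time <- factor(c("wk1", "wk10", "wk2", "wk3"))

# Strip the "wk" prefix and order the levels by the remaining number
lv <- levels(time)
week_no <- as.integer(sub("^wk", "", lv))
time <- factor(time, levels = lv[order(week_no)])

levels(time)
```

This assumes every level has the form wk followed by digits; any other label would become NA in week_no.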
Is there a particular reason you want to use xyplot? In my opinion, ggplot2 is easier to work with and gives prettier output. Here's an example of what I think you want.
#load ggplot2
library(ggplot2)
#load data
d = structure(list(Group = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Stall",
"Weide"), class = "factor"), Time = structure(c(1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L,
10L), .Label = c("wk1", "wk10", "wk2", "wk3", "wk4", "wk5", "wk6",
"wk7", "wk8", "wk9"), class = "factor"), lsmean = c(26.23299,
25.12652, 21.8995, 18.45845, 25.38004, 22.90409, 25.02474, 24.05886,
23.9163, 22.23608, 23.97382, 18.1455, 24.48899, 19.40022, 24.98107,
19.712, 22.65167, 19.35759, 22.64381, 19.26869), SE = c(0.6460481,
0.670108, 0.6460589, 0.6679617, 0.6460168, 0.6679617, 0.6459262,
0.6679436, 0.6456643, 0.6678912, 0.6493483, 0.6677398, 0.6456643,
0.6697394, 0.6459262, 0.6677398, 0.6460168, 0.6678912, 0.6460481,
0.6679436), df = c(59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L,
58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L), lower.CL = c(24.19243,
23.00834, 19.8589, 16.34705, 23.33957, 20.79269, 22.98455, 21.94751,
21.87694, 20.1249, 21.92283, 16.0348, 22.44963, 17.28319, 22.94089,
17.60129, 20.6112, 17.24641, 20.60324, 17.15735), upper.CL = c(28.27356,
27.24471, 23.9401, 20.56986, 27.4205, 25.01549, 27.06492, 26.1702,
25.95565, 24.34726, 26.02481, 20.2562, 26.52834, 21.51724, 27.02126,
21.8227, 24.69214, 21.46877, 24.68438, 21.38004)), .Names = c("Group",
"Time", "lsmean", "SE", "df", "lower.CL", "upper.CL"), class = "data.frame", row.names = c(NA,
-20L))
#fix week
library(stringr)
library(magrittr)
d$Time %<>% as.character() %>% str_replace(pattern = "wk", replacement = "") %>% as.numeric()
#plot
ggplot(d, aes(Time, lsmean, color = Group, group = Group)) +
geom_point() +
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = .2) +
geom_line() +
ylim(10, 35) +
scale_x_continuous(name = "Week", breaks = 1:10) +
ylab("FCM (kg/day)") +
scale_color_discrete(labels = c("Confinement Group","Pasture Group"))

How to add multiple data series to a scatterplot and how to format numbers to appear in standard form on y axis

My data set:
structure(list(Site = c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 6L), Average.worm.weight..g. = c(0.1934,
0.249, 0.263, 0.262, 0.4186, 0.204, 0.311, 0.481, 0.326, 0.657,
0.347, 0.311, 0.239, 0.4156, 0.31, 0.3136, 0.4033, 0.302, 0.277
), Average.total.immune.cell.count = structure(c(8L, 16L, 11L,
12L, 10L, 1L, 4L, 15L, 4L, 3L, 17L, 13L, 18L, 7L, 5L, 6L, 9L,
14L, 2L), .Label = c("0", "168750", "18650000", "200,000", "21,600,000",
"226666.6", "22683333.33", "2533333.33", "283333.333", "291666.6",
"335833.3", "435800", "474816666.7", "500000", "6450000", "729166.667",
"7433333.3", "9916667"), class = "factor"), Average.eleocyte.number = structure(c(2L,
5L, 14L, 10L, 1L, 1L, 6L, 1L, 6L, 7L, 1L, 9L, 15L, 8L, 12L, 3L,
11L, 13L, 4L), .Label = c("0", "1266666.67", "153333.3", "168740",
"17", "200,000", "2266666.667", "22683333.33", "23116666.67",
"264000", "283333.333", "442", "500000", "7.3", "9916667"), class = "factor")), .Names = c("Site",
"Average.worm.weight..g.", "Average.total.immune.cell.count",
"Average.eleocyte.number"), class = "data.frame", row.names = c(NA,
-19L))
This is my R script so far:
# Plotting multiple data series on a graph
y1<-dframe1$"Average.total.immune.cell.count"
y2<-dframe1$"Average.eleocyte.number"
x<-dframe1$"Average.worm.weight..g."
plot.default(y1~x,type="p" )
points(y2~x)
I am trying to add two y series to the same scatterplot and am struggling to do so. I want different symbols for the points so that the two data series can be told apart. I would also like the axes to meet at the bottom left-hand corner; how can I do that? Finally, I would like the y axis to be in standard form, but I do not know how to get R to do that.
Best regards.
K.
So this is an object lesson in getting your data into the correct format to begin with. Some of your numbers contain commas, which R does not parse as numeric. Hence the columns get read as character and imported as factors (which your structure(...) clearly shows). You need to fix that, or better yet get rid of the commas prior to exporting.
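To see why the factor import matters, here is a small demonstration on a made-up factor (raw is a hypothetical name, with values picked from the question's levels). Calling as.numeric directly on a factor returns the internal level codes, not the numbers:

```r
# A factor like the ones in the question: numbers stored as labels, some with commas
raw <- factor(c("200,000", "21,600,000", "226666.6", "0"))

# Wrong: this returns the integer level codes, not the values
codes <- as.numeric(raw)

# Convert to character first, drop the commas, then convert to numeric
values <- as.numeric(gsub(",", "", as.character(raw), fixed = TRUE))
values
```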
Something like this will work
colnames(dframe) <- c("Site","x","y1","y2")
dframe$y1 <- as.numeric(as.character(gsub(",","",dframe$y1,fixed=TRUE)))
dframe$y2 <- as.numeric(as.character(gsub(",","",dframe$y2,fixed=TRUE)))
plot(y1~x,dframe, col="red", pch=20)
points(y2~x,dframe, col="blue", pch=20)
But there are additional problems. One of the numbers (in row 12) is more than an order of magnitude larger than all the others, so the plot above is not very informative. It's hard to know if this is a data input error, or a genuine outlier in your data.
EDIT: Response to OP's comment
dframe <- dframe[-12,] # remove row 12
dframe <- dframe[order(dframe$x),] # order by increasing x
plot(y1~x,dframe, col="red", pch=20, type="b")
points(y2~x,dframe, col="blue", pch=20, type="b")
legend("topleft",legend=c("y1","y2"),col=c("red","blue"),pch=20)
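The remaining two requests (axes meeting at the bottom left, and a y axis in standard form) can be handled in base graphics. The sketch below uses made-up illustrative values rather than the real data: xaxs = "i" and yaxs = "i" remove the default padding so both axes start exactly at the limits, bty = "l" draws an L-shaped frame, and the y axis is drawn by hand with labels formatted in scientific notation.

```r
# Made-up illustrative values, not the real measurements
x  <- c(0.19, 0.25, 0.31, 0.42, 0.48)
y1 <- c(2533333, 729167, 200000, 291667, 6450000)
y2 <- c(1266667, 170000, 200000, 150000, 500000)

# Axes meet at the origin: no padding ("i") and an L-shaped box
plot(x, y1, pch = 20, col = "red",
     xlim = c(0, 0.6), ylim = c(0, 7e6),
     xaxs = "i", yaxs = "i", bty = "l",
     yaxt = "n", xlab = "Average worm weight (g)", ylab = "Cell count")
points(x, y2, pch = 17, col = "blue")   # different symbol for the second series

# Draw the y axis manually with labels in standard (scientific) form
ticks <- axTicks(2)
labs <- format(ticks, scientific = TRUE)
axis(2, at = ticks, labels = labs, las = 1)
legend("topleft", legend = c("y1", "y2"), col = c("red", "blue"), pch = c(20, 17))
```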

factor order when subsetting within ggplot

I have factors on x-axis and order those factor levels in a way that's intuitive to plot with ggplot. It works fine. However, when I use the subset command within ggplot, it re-orders my original sequence of factors.
Is it possible to do subsetting within ggplot and preserve the order of factor levels?
Here is the data and code:
library(ggplot2)
library(plyr)
dat <- structure(list(SubjectID = structure(c(12L, 4L, 6L, 7L, 12L,
7L, 5L, 8L, 14L, 1L, 15L, 1L, 7L, 1L, 7L, 5L, 4L, 2L, 9L, 6L,
7L, 13L, 12L, 2L, 15L, 3L, 5L, 13L, 13L, 10L, 7L, 8L, 10L, 10L,
1L, 10L, 12L, 7L, 6L, 10L), .Label = c("s001", "s002", "s003",
"s004", "s005", "s006", "s007", "s008", "s009", "s010", "s011",
"s012", "s013", "s014", "s015"), class = "factor"), Parameter = structure(c(7L,
3L, 5L, 3L, 6L, 4L, 6L, 7L, 7L, 4L, 7L, 12L, 8L, 11L, 1L, 4L,
3L, 4L, 6L, 4L, 6L, 6L, 12L, 5L, 12L, 1L, 7L, 13L, 11L, 1L, 4L,
1L, 6L, 13L, 10L, 10L, 10L, 13L, 5L, 8L), .Label = c("(Intercept)",
"c0.008", "c0.01", "c0.015", "c0.02", "c0.03", "PrevCorr1", "PrevFail1",
"c0.025", "c0.004", "c0.006", "c0.009", "c0.012", "c0.005"), class = "factor"),
Weight = c(0.0352725634087837, 1.45546697427904, 2.29457594510248,
0.479548914792514, 6.39680995359234, 1.48829600339586, 2.69253113220079,
-0.171219812386926, -0.453625394224277, 1.43732884325816,
0.742416863226952, 0.256935761466245, -0.29401087047524,
0.34653127811481, 0.33120592543102, 2.79213318878505, 2.47047299128637,
1.022450287681, 6.92891513416868, 0.648982326396105, 6.58336282626389,
6.40600461501379, 1.80062359655524, 3.86658202530889, 1.23833324887194,
-0.026560261876089, 0.121670468861011, 0.9290824087063, 0.349104382483186,
0.24722583823016, 1.82473621255801, -0.712668411699556, 6.51789901685784,
0.74682257127003, 0.0755807984938072, 0.131705709322157,
0.246465073382095, 0.876279316248929, 1.83442709571662, -0.579086982613267
)), .Names = c("SubjectID", "Parameter", "Weight"), row.names = c(2924L,
784L, 1537L, 1663L, 3138L, 1744L, 1266L, 1996L, 3548L, 86L, 3692L,
230L, 1613L, 213L, 1627L, 1024L, 832L, 384L, 2418L, 1568L, 1714L,
3362L, 3200L, 497L, 3632L, 683L, 1020L, 3281L, 3263L, 2779L,
1632L, 1995L, 2674L, 2753L, 312L, 2638L, 3198L, 1809L, 1569L,
2589L), class = "data.frame")
## Sort factors in the order that will make it intuitive to read the plot
## It goes, "(Intercept), "PrevCorr1", "PrevFail1", "c0.004", "c0.006", etc.
paramNames <- levels(dat$Parameter)
contrastNames <- sort(paramNames[grep("c0",paramNames)])
biasNames <- paramNames[!paramNames %in% contrastNames]
dat$Parameter <- factor(dat$Parameter, levels=c(biasNames, contrastNames))
## Add grouping parameter that will be used to plot different weights in different colors
dat$plotColor <-"Contrast"
dat$plotColor[dat$Parameter=="(Intercept)"] <- "Intercept"
dat$plotColor[grep("PrevCorr", dat$Parameter)] <- "PrevSuccess"
dat$plotColor[grep("PrevFail", dat$Parameter)] <- "PrevFail"
p <- ggplot(dat, aes(x=Parameter, y=Weight)) +
# The following command, which adds geom_line to data points of the graph, changes the order of levels
# If I uncomment the next line, the factor level order goes wrong.
#geom_line(subset=.(plotColor=="Contrast"), aes(group=1), stat="summary", fun.y="mean", color="grey50", size=1) +
geom_point(aes(group=Parameter, color=plotColor), size=5, stat="summary", fun.y="mean") +
geom_point(aes(group=Parameter), size=2.5, color="white", stat="summary", fun.y="mean") +
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))
print(p)
Here is the plot when geom line is commented
And here is what happens when geom_line is uncommented
If you switch the order in which you plot the objects, the problem disappears:
p <- ggplot(dat, aes(x=Parameter, y=Weight)) +
# Draw the points layer that uses the full data first, then add the subsetted geom_line.
# With this ordering the factor level order on the x axis is preserved.
geom_point(aes(group=Parameter, color=plotColor), size=5, stat="summary", fun.y="mean") +
geom_line(subset = .(plotColor == "Contrast"), aes(group=1), stat="summary", fun.y="mean", color="grey50", size=1) +
geom_point(aes(group=Parameter), size=2.5, color="white", stat="summary", fun.y="mean") +
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))
print(p)
I think the problem is that when the subsetted data is plotted first, the x scale is built from the levels present in the subset only, and when the points are added afterwards the remaining levels are appended in a different order. When the layer that uses the full data comes first, the original level order is kept. I'm not certain of this, though, so treat it as a best guess.
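A way to make the result independent of layer order is to pin the x scale to the factor levels explicitly. The sketch below is written under assumptions: it uses made-up data in the same shape as dat, passes the subset through the layer's data argument, and uses fun = mean, which needs ggplot2 3.3.0 or later (older versions used fun.y).

```r
library(ggplot2)

# Made-up data in the same shape as dat; level order is deliberately not alphabetical
lv <- c("(Intercept)", "PrevCorr1", "c0.004", "c0.01", "c0.02")
d <- data.frame(
  Parameter = factor(rep(lv, each = 3), levels = lv),
  Weight = c(0.3, 0.2, 0.4,  0.1, 0.0, 0.2,  0.2, 0.1, 0.3,
             1.5, 1.4, 2.5,  2.3, 1.8, 3.9)
)
d$plotColor <- ifelse(grepl("^c0", d$Parameter), "Contrast", "Bias")

# subset() keeps all factor levels, including those with no remaining rows
contrast_only <- subset(d, plotColor == "Contrast")

p <- ggplot(d, aes(x = Parameter, y = Weight)) +
  # Line through the contrast means only, drawn first on purpose
  stat_summary(data = contrast_only, aes(group = 1),
               fun = mean, geom = "line", colour = "grey50") +
  stat_summary(aes(colour = plotColor), fun = mean, geom = "point", size = 5) +
  # Pin the axis to the intended level order regardless of layer order
  scale_x_discrete(limits = levels(d$Parameter)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(p)
```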
