connecting lines between means of factors in ggplot2 - r

I was trying to create a simple line graph of means and interactions. I have a DV (reading times) on the y-axis, one factor (Length) on the x-axis, and another as a grouping variable (position).
The syntax I used is below. The data plotted as single points on a line for each of the two Length conditions, but did not connect with lines between the two Length conditions. What am I missing in terms of syntax?
I am using R i386 2.15.2, and updated ggplot2 last week.
Here is a reproducible example
SubjectID <- c(101,101,101,101,101,101,101,101,102,102,102,102,102,102,102,102,
201,201,201,201,201,201,201,201,202,202,202,202,202,202,202,202)
Group <- c("PWA","PWA","PWA","PWA","PWA","PWA","PWA","PWA","PWA","PWA","PWA",
"PWA","PWA","PWA","PWA","PWA","Control","Control","Control",
"Control","Control","Control","Control","Control","Control",
"Control","Control","Control","Control","Control","Control",
"Control")
Length <- c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2)
Pos <- c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2)
ReadT <- c(6.7,7.6,6.4,7.9,5.4,6.4,6.3,7.4,6.9,7.2,6.7,7.4,5.7,6.1,6.5,7.8,
6.1,5.7,4.9,6.1,4.7,6.5,6.1,6.2,6.9,5.9,4.8,6.5,4.6,6.3,6.7,6.6)
data <- data.frame (SubjectID, Group,Length,Pos,ReadT)
data$Length <- factor(data$Length, order = TRUE,
levels = c(1,2),
labels = c("Length 1", "Length 2"))
data$Pos <- factor(data$Pos, order = TRUE,
levels = c(1,2),
labels = c("Position 1", "Position 2"))
qplot(Length, data=data, ReadT, geom=c("point", "line"),
stat="summary", fun.y=mean, group=Pos, colour=Pos,
facets = ~Group)

I don't think you have reproduced any inconsistency, but your issues are clouded in part by trying to condense everything into a single qplot call.
Your x variable Length is a factor, therefore ggplot is sensibly considering Length 1 and Length 2 to be independent, and won't connect the lines.
Secondly, you won't be able to use stat_summary to summarize by your x values without forcing these to be a factor (and hence independent).
I find it easiest to presummarize the data and not rely on ggplot.
For example:
library(plyr)
data.means <- ddply(data, .(Group, Pos, Length), summarize, ReadT = mean(ReadT))
Then construct the plot using ggplot rather than qplot, which gives you the flexibility (and transparency) required.
The trick to get the lines connected is to treat x as numeric within the call to geom_line:
ggplot(data.means, aes(x= Length, y= ReadT, colour = Pos)) +
geom_point() +
geom_line(aes(x=as.numeric(Length))) +
facet_grid(~Group)
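The same presummarize-then-plot idea can be sketched without plyr. This is only a sketch under a couple of assumptions of mine: base R's aggregate stands in for ddply, and an explicit group = Pos aesthetic is used to connect the points across the discrete x axis instead of converting Length to numeric. The data frame is rebuilt compactly here so the block runs on its own.

```r
library(ggplot2)

# Same values as the question's example, built compactly
data <- data.frame(
  Group  = rep(c("PWA", "Control"), each = 16),
  Length = factor(rep(1:2, times = 16), labels = c("Length 1", "Length 2")),
  Pos    = factor(rep(rep(1:2, each = 2), times = 8),
                  labels = c("Position 1", "Position 2")),
  ReadT  = c(6.7, 7.6, 6.4, 7.9, 5.4, 6.4, 6.3, 7.4,
             6.9, 7.2, 6.7, 7.4, 5.7, 6.1, 6.5, 7.8,
             6.1, 5.7, 4.9, 6.1, 4.7, 6.5, 6.1, 6.2,
             6.9, 5.9, 4.8, 6.5, 4.6, 6.3, 6.7, 6.6)
)

# One mean per Group x Pos x Length cell (base R in place of plyr::ddply)
data.means <- aggregate(ReadT ~ Group + Pos + Length, data = data, FUN = mean)

# group = Pos requests one line per position across the discrete x axis
p <- ggplot(data.means, aes(x = Length, y = ReadT, colour = Pos, group = Pos)) +
  geom_point() +
  geom_line() +
  facet_grid(~Group)
print(p)
```

Whether you prefer this or the as.numeric version is mostly a matter of taste; the grouping variant keeps the factor labels on the axis without any extra work.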
If you insist on using the raw data and the stat_xxxx functions, you could also replicate this using stat_smooth to estimate the means (which keeps x numeric):
ggplot(data, aes(x = Length, y= ReadT, colour = Pos)) +
stat_summary(fun.y = 'mean', geom = 'point')+
stat_smooth(method = 'lm', aes(x=as.numeric(Length)), se = FALSE) +
facet_grid(~Group)

Related

Looping over variables in ggplot to create a grid of density distributions for each variable

I want to create a grid of density distribution plots, with a dashed vertical line at the mean, for multiple variables I have in a dataset. Using mtcars dataset as an example, the code for a single variable plot would be:
ggplot(mtcars, aes(x = mpg)) + geom_density() + geom_vline(aes(xintercept =
mean(mpg)), linetype = "dashed", size = 0.6)
I am unclear about how I alter this to make it loop over specified variables in my dataset and produce a grid with the plots of each one. It seems like it would involve some combination of adding facet_grid and the "vars" argument but I have tried a number of combinations with no success.
It seems like in all the examples I can find online, facet_grid splits the plots by subsets of a variable, while keeping the same x and y for each plot, but I want to have the plot of x vary in each graph and the y is the density of values.
In trying to solve this, it is also my understanding that the new release of ggplot includes something involving "quasiquotation" which may help solve my problem (https://www.tidyverse.org/articles/2018/07/ggplot2-tidy-evaluation/) but again, I couldn't quite figure out how to apply the examples provided here to my own issue.
Consider reshaping the data into long format, then plotting with facets. Here both the x and y scales are set to free, since the plots differ in magnitude across the columns.
rdf <- reshape(mtcars, varying = names(mtcars), v.names = "value",
times = names(mtcars), timevar = "variable",
new.row.names = 1:1000, direction = "long")
ggplot(rdf, aes(x = value)) + geom_density() +
geom_vline(aes(xintercept = mean(value)), linetype = "dashed", size = 0.6) +
facet_grid(~variable, scales="free")
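One caveat, as far as I can tell: aes(xintercept = mean(value)) is evaluated on the whole long data frame rather than per facet, so every panel would get a line at the same overall mean. A minimal sketch that precomputes one mean per variable and hands it to geom_vline as its own data follows. The names long and means are mine, stack() is used in place of reshape() only for brevity, and facet_wrap is swapped in for facet_grid so that each panel can also get its own y-axis.

```r
library(ggplot2)

# Long format: one row per (value, variable) pair
long <- stack(mtcars)                      # columns: values, ind
names(long) <- c("value", "variable")

# One mean per variable, computed before plotting
means <- aggregate(value ~ variable, data = long, FUN = mean)

p <- ggplot(long, aes(x = value)) +
  geom_density() +
  # the vline layer gets its own data, so each facet shows its own mean
  geom_vline(data = means, aes(xintercept = value), linetype = "dashed") +
  facet_wrap(~variable, scales = "free")
print(p)
```

To restrict the grid to specified variables, subset the columns before stacking, for example stack(mtcars[c("mpg", "hp", "wt")]).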

Contour plot or heatmap from three continuous variables

I have a model which has told me there is an interaction between two variables: a and b, which is significantly influencing my response variable: c. All three are continuous numeric variables. For detail c is the rate in change my response variable, b is the rate of change in my predictor and a is mean annual rainfall. The unit of analysis is pixels in a raster. So my model is telling me mean annual rainfall modifies how my predictor affects my response.
To visualise this interaction I would like to use a contour plot/heat map/level plot with a and b on the x and y axes and c providing the colour, to show me how my response variable changes within the space described by a and b. I can do this with a scatter plot but it's not very pretty or easy to interpret:
qplot(b, a, colour = c) +
scale_colour_gradient(low="green", high="red")
When I try to plot a contour plot/heat map/level plot though all I get is errors, blank plots or ugly plots.
geom_contour gives me an error:
ggplot(data = Mod, aes(x = Rain, y = Bomas, z = Fire)) +
geom_contour()
Warning message:
Not possible to generate contour data
geom_raster initially gives me Error: cannot allocate vector of size 81567.2 Gb but when I round my data it produces:
ggplot(data = df, aes(x = a, y = b, z = c)) +
geom_raster(aes(fill = c))
Adding interpolate = TRUE to the geom_raster code just makes the lines a little blurry.
geom_tile produces a blank graph but with a scale bar for c:
ggplot(data = df, aes(x = a, y = b, z = c)) +
geom_tile(aes(color = c))
I've also tried using stat_density2d and setting the fill and/or the colour to c, but just got an error, and I've tried using levelplot in the lattice package as well but that produces this:
levelplot(c ~ a * b, data = df,
aspect = "asp", contour = TRUE,
xlab = "a",
ylab = "b")
I suspect the problems I'm encountering are because the functions are not set up to deal with continuous x and y variables; all the examples seem to use factors. I would have thought I could compensate for that by changing bin widths, but that doesn't seem to work either. Is there a function that allows you to make a heat map with 3 continuous variables? Or do I need to treat my a and b variables as factors and manually make a data frame with bins appropriate for my data?
If you want to experiment for yourself then you get similar problems to what I'm having with:
df<- as.data.frame(rnorm(1:1068))
df[,2] <- rnorm(1:1068)
df[,3] <- rnorm(1:1068)
names(df) <- c("a", "b", "c")
You can get automatic bins, and for example calculate the means by using stat_summary_2d:
ggplot(df, aes(a, b, z = c)) +
stat_summary_2d() +
geom_point(shape = 1, col = 'white') +
viridis::scale_fill_viridis()
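For comparison, the manual route the question asks about (binning a and b yourself and averaging c per cell) can be sketched roughly like this. The choice of 10 bins per axis and the names a_bin, b_bin and binned are assumptions for illustration, and the bin edges produced by cut() will not match stat_summary_2d's exactly.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(a = rnorm(1068), b = rnorm(1068), c = rnorm(1068))

# Cut each continuous axis into 10 equal-width intervals
df$a_bin <- cut(df$a, breaks = 10)
df$b_bin <- cut(df$b, breaks = 10)

# Mean of c within each occupied (a_bin, b_bin) cell
binned <- aggregate(c ~ a_bin + b_bin, data = df, FUN = mean)

p <- ggplot(binned, aes(x = a_bin, y = b_bin, fill = c)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
```

Note that fill, not colour, is what paints the inside of a tile, which would explain the blank geom_tile plot in the question.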
Another good option is to slice your data by the third variable, and plot small multiples. This doesn't really show very well for random data though:
library(ggplot2)
ggplot(df, aes(a, b)) +
geom_point() +
facet_wrap(~cut_number(c, 4))

how to combine in ggplot line / points with special values?

I'm quite new to ggplot but I like the systematic way you build your plots. Still, I'm struggling to achieve the desired results. I can replicate plots where you have categorical data. However, for my use I often need to fit a model to certain observations and then highlight them in a combined plot. With the usual plot function I would do:
library(splines)
set.seed(10)
x <- seq(-1,1,0.01)
y <- x^2
s <- interpSpline(x,y)
y <- y+rnorm(length(y),mean=0,sd=0.1)
plot(x,predict(s,x)$y,type="l",col="black",xlab="x",ylab="y")
points(x,y,col="red",pch=4)
points(0,0,col="blue",pch=1)
legend("top",legend=c("True Values","Model values","Special Value"),text.col=c("red","black","blue"),lty=c(NA,1,NA),pch=c(4,NA,1),col=c("red","black","blue"),cex = 0.7)
My biggest problem is how to build the data frame for ggplot so that it automatically draws the legend. In this example, how would I translate this into ggplot to get a similar plot? Or is ggplot not made for this kind of plot?
Note this is just a toy example. Usually the model values are derived from a more complex model, just in case you want to use a stat in ggplot.
The key part here is that you can map colors in aes by giving a string, which will produce a legend. In this case, there is no need to include the special value in the data.frame.
df <- data.frame(x = x, y = y, fit = predict(s, x)$y)
ggplot(df, aes(x, y)) +
geom_line(aes(y = fit, col = 'Model values')) +
geom_point(aes(col = 'True values')) +
geom_point(aes(col = 'Special value'), x = 0, y = 0) +
scale_color_manual(values = c('True values' = "red",
'Special value' = "blue",
'Model values' = "black"))
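A small variation, offered only as a sketch: as far as I understand, the special-value layer above inherits all rows of df, so the point at (0, 0) is drawn once per row. Passing a one-row data frame to that layer (special is a name I made up) draws it a single time, and the shapes from the base-graphics version can be set per layer.

```r
library(ggplot2)
library(splines)

set.seed(10)
x <- seq(-1, 1, 0.01)
s <- interpSpline(x, x^2)
df <- data.frame(x = x,
                 y = x^2 + rnorm(length(x), mean = 0, sd = 0.1),
                 fit = predict(s, x)$y)

# One-row data frame holding the highlighted observation
special <- data.frame(x = 0, y = 0)

p <- ggplot(df, aes(x, y)) +
  geom_line(aes(y = fit, colour = "Model values")) +
  geom_point(aes(colour = "True values"), shape = 4) +
  geom_point(data = special, aes(colour = "Special value"), shape = 1) +
  scale_colour_manual(name = NULL,
                      values = c("True values" = "red",
                                 "Special value" = "blue",
                                 "Model values" = "black"))
print(p)
```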

How to specify ggplot2 boxplot fill colour for continuous data?

I want to plot a ggplot2 boxplot using all columns of a data.frame, and I want to reorder the columns by the median for each column, rotate the x-axis labels, and fill each box with the colour corresponding to the same median. I can't figure out how to do the last part. There are plenty of examples where the fill colour corresponds to a factor variable, but I haven't seen a clear example of using a continuous variable to control fill colour. (The reason I'm trying to do this is that the resultant plot will provide context for a force-directed network graph with nodes that will be colour-coded in the same way as the boxplot -- the colour will then provide a mapping between the two plots.) It would be nice if I could re-use the value-to-colour mapping for later plots so that colours are consistent between plots. So, for example, the box corresponding to the column variable with a high median value will have a colour that denotes this mapping and matches perfectly the colour for the same column variable in other plots (such as the corresponding node in a force-directed network graph).
So far, I have something like this:
# Melt the data.frame:
DT.m <- melt(results, id.vars = NULL) # using reshape2
# I can now make a boxplot for every column in the data.frame:
g <- ggplot(DT.m, aes(x = reorder(variable, value, FUN=median), y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
stat_summary(fun.y=mean, colour="darkred", geom="point") +
geom_boxplot(???, alpha=0.5)
The colour fill information is what I'm stuck on. "value" is a continuous variable in the range [0,1] and there are 55 columns in my data.frame. Various approaches I've tried seem to result in the boxes being split vertically down the middle, and I haven't got any further. Any ideas?
You can do this by adding the median-by-group to your data frame and then mapping the new median variable to the fill aesthetic. Here's an example with the built-in mtcars data frame. By using this same mapping across different plots, you should get the same colors:
library(ggplot2)
library(dplyr)
ggplot(mtcars %>% group_by(carb) %>%
mutate(medMPG = median(mpg)),
aes(x = reorder(carb, mpg, FUN=median), y = mpg)) +
geom_boxplot(aes(fill=medMPG)) +
stat_summary(fun.y=mean, colour="darkred", geom="point") +
scale_fill_gradient(low=hcl(15,100,75), high=hcl(195,100,75))
If you have various data frames with different ranges of medians, you can still use the method above, but to get a consistent mapping of color to median across all your plots, you'll need to also set the same limits for scale_fill_gradient in each plot. In this example, the median of mpg (by carb grouping) varies from 15.0 to 22.8. But let's say across all my data sets, it varies from 13.3 to 39.8. Then I could add this to all my plots:
scale_fill_gradient(limits=c(13.3, 39.8),
low=hcl(15,100,75), high=hcl(195,100,75))
This is just for illustration. For ease of maintenance if your data might change, you'll want to set the actual limits programmatically.
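A minimal sketch of setting the limits programmatically, assuming the medians of interest can be collected up front. The helper group_medians and the second grouping by cyl are hypothetical stand-ins for your other data sets.

```r
library(ggplot2)

# Hypothetical helper: median of one column within each level of another
group_medians <- function(d, value, group) {
  tapply(d[[value]], d[[group]], median)
}

# Pool the medians from every plot that should share the colour mapping
all_medians <- c(group_medians(mtcars, "mpg", "carb"),
                 group_medians(mtcars, "mpg", "cyl"))
shared_limits <- range(all_medians)

# Build the scale once and add the same object to each plot
shared_fill <- scale_fill_gradient(limits = shared_limits,
                                   low = hcl(15, 100, 75),
                                   high = hcl(195, 100, 75))
```

Adding shared_fill to each plot (for example p + shared_fill) should then keep the value-to-colour mapping identical across them.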
I built on eipi10's solution and obtained the following code which does what I want:
# "results" is a 55-column data.frame containing
# bootstrapped estimates of the Gini impurity for each column variable
# (But can synthesize fake data for testing with a bunch of rnorms)
DT.m <- melt(results, id.vars = NULL) # using reshape2
g <- ggplot(DT.m %>% group_by(variable) %>%
mutate(median.gini = median(value)),
aes(x = reorder(variable, value, FUN=median), y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_boxplot(aes(fill=median.gini)) +
stat_summary(fun.y=mean, colour="darkred", geom="point") +
scale_fill_gradientn(colours = heat.colors(9)) +
ylab("Gini impurity") +
xlab("Feature") +
guides(fill=guide_colourbar(title="Median\nGini\nimpurity"))
plot(g)
Later, for the second plot:
medians <- lapply(results, median)
color <- colorRampPalette(colors =
heat.colors(9))(1000)[cut(unlist(medians),1000,labels = F)]
color is then a character vector containing the colours of the nodes in my subsequent network graph, and these colours match those in the boxplot. Job done!

ggplot not showing data

I am trying to make a nice plot with ggplot. However, I do not know why it is not showing data.
Here is some minimum code
dummylabels <- c("A","B","C")
dummynumbers <- c(1,2,3)
dummy_frame <- data.frame(dummylabels,dummynumbers)
p= ggplot(data=dummy_frame, aes(x =dummylabels , y = dummynumbers)) + geom_bar(fill = "blue")
p + coord_flip() + labs(title = "Title")
I get the following error message, which I cannot make sense of
Error : Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Defunct; last used in version 0.9.2)
Why do I get this error?
From the error message you got:
If you want y to represent values in the data, use stat="identity".
geom_bar expects to be used like a histogram, where it bins the data itself and calculates heights based on frequency. This is the stat="bin" behaviour, which is the default in this version of ggplot2. It throws an error because you gave it a y value too. To fix it, use stat="identity":
p <- ggplot(data = dummy_frame, aes(x = dummylabels, y = dummynumbers)) +
geom_bar(fill = "blue", stat = "identity") +
coord_flip() +
labs(title = "Title")
p
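In more recent ggplot2 releases the default stat for geom_bar is "count" rather than "bin", and geom_col is available as a shorthand for bars whose heights come straight from the data. A short sketch of the same plot, assuming a ggplot2 version that includes geom_col:

```r
library(ggplot2)

dummy_frame <- data.frame(dummylabels = c("A", "B", "C"),
                          dummynumbers = c(1, 2, 3))

# geom_col uses the y values as bar heights, like geom_bar(stat = "identity")
p <- ggplot(dummy_frame, aes(x = dummylabels, y = dummynumbers)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(title = "Title")
print(p)
```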

Resources