Summary: When I use a "for" loop to add layers to a violin plot (in ggplot), the only layer that is added is the one created by the final loop iteration. Yet in explicit code that mimics the code that the loop would produce, all the layers are added.
Details: I am trying to create violin graphs with overlapping layers, to show the extent to which estimate distributions do or do not overlap for several survey question responses, stratified by place. I want to be able to include any number of places, so I have one column in my dataframe for each place, and am trying to use a "for" loop to generate one ggplot layer per place. But the loop only adds the layer from the loop's final iteration.
This code illustrates the problem, and some suggested approaches that failed:
library(ggplot2)
# Create a dataframe with 500 random normal values for responses to 3 survey questions from two cities
topic <- c("Poverty %","Mean Age","% Smokers")
place <- c("Chicago","Miami")
n <- 500
mean <- c(35, 40,58, 50, 25,20)
var <- c( 7, 1.5, 3, .25, .5, 1)
df <- data.frame( topic=rep(topic,rep(n,length(topic)))
,c(rnorm(n,mean[1],var[1]),rnorm(n,mean[3],var[3]),rnorm(n,mean[5],var[5]))
,c(rnorm(n,mean[2],var[2]),rnorm(n,mean[4],var[4]),rnorm(n,mean[6],var[6]))
)
names(df)[2:dim(df)[2]] <- place # Name those last two columns with the corresponding place name.
head(df)
# This "for" loop seems to only execute the final loop (i.e., where p=3)
g <- ggplot(df, aes(factor(topic), df[,2]))
for (p in 2:dim(df)[2]) {
g <- g + geom_violin(aes(y = df[,p], colour = place[p-1]), alpha = 0.3)
}
g
# But mimicking what the for loop does in explicit code works fine, resulting in both "place"s being displayed in the graph.
g <- ggplot(df, aes(factor(topic), df[,2]))
g <- g + geom_violin(aes(y = df[,2], colour = place[2-1]), alpha = 0.3)
g <- g + geom_violin(aes(y = df[,3], colour = place[3-1]), alpha = 0.3)
g
## per http://stackoverflow.com/questions/18444620/set-layers-in-ggplot2-via-loop , I tried
g <- ggplot(df, aes(factor(topic), df[,2]))
for (p in 2:dim(df)[2]) {
df1 <- df[,c(1,p)]
g <- g + geom_violin(aes(y = df1[,2], colour = place[p-1]), alpha = 0.3)
}
g
# but got the same undesired result
# per http://stackoverflow.com/questions/15987367/how-to-add-layers-in-ggplot-using-a-for-loop , I tried
g <- ggplot(df, aes(factor(topic), df[,2]))
for (p in names(df)[-1]) {
cat(p,"\n")
g <- g + geom_violin(aes_string(y = p, colour = p), alpha = 0.3) # produced this error: Error in unit(tic_pos.c, "mm") : 'x' and 'units' must have length > 0
# g <- g + geom_violin(aes_string(y = p ), alpha = 0.3) # produced this error: Error: stat_ydensity requires the following missing aesthetics: y
}
g
# but that failed to produce any graphic, per the errors noted in the "for" loop above
This happens because of ggplot's "lazy evaluation". It is a common problem when ggplot is used this way (making the layers separately in a loop, rather than having ggplot do it for you, as in @hrbrmstr's solution).
ggplot stores the arguments to aes(...) as expressions, and only evaluates them when the plot is rendered. So, in your loops, something like
aes(y = df[,p], colour = place[p-1])
gets stored as is, and evaluated when you render the plot, after the loop completes. At this point, p=3 so all the plots are rendered with p=3.
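The effect can be seen without drawing anything. The following is a minimal sketch, assuming ggplot2 3.0.0 or later (where aes() stores each mapping as an rlang quosure); the small data frame and the names lazy and lazy_vals are made up for illustration.

```r
library(ggplot2)

# Tiny stand-in for the wide data frame in the question (values are made up)
df <- data.frame(topic = c("a", "b"), Chicago = c(1, 2), Miami = c(10, 20))

# Collect one mapping per loop iteration without plotting anything
lazy <- list()
for (p in 2:3) {
  lazy[[p - 1]] <- aes(y = df[, p])
}

# Both mappings hold the unevaluated expression df[, p]
labels <- vapply(lazy, function(m) rlang::as_label(m$y), character(1))

# Evaluating them after the loop uses the final value of p (3) in both cases
lazy_vals <- lapply(lazy, function(m) rlang::eval_tidy(m$y))
```

Both elements of lazy_vals contain the Miami column, which is the same thing that happens when the plot is finally rendered.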
So the "right" way to do this is to use melt(...) in the reshape2 package to convert your data from wide to long format, and let ggplot manage the layers for you. I put "right" in quotes because in this particular case there is a subtlety. When calculating the distributions for the violins using the melted data frame, ggplot uses the grand total (for both Chicago and Miami) as the scale. If you want violins based on frequency scaled individually, you need to use loops (sadly).
The way around the lazy evaluation problem is to put any reference to the loop index in the data=... definition. This is not stored as an expression, the actual data is stored in the plot definition. So you could do this:
g <- ggplot(df,aes(x=topic))
for (p in 2:length(df)) {
gg.data <- data.frame(topic=df$topic,value=df[,p],city=names(df)[p])
g <- g + geom_violin(data=gg.data,aes(y=value, color=city))
}
g
which gives the same result as yours. Note that the index p does not show up in aes(...).
Update: A note about scale="width" (mentioned in a comment). This causes all the violins to have the same width (see below), which is not the same scaling as in OP's original code. IMO this is not a great way to visualize the data, as it suggests there is much more data in the Chicago group.
ggplot(gg) +geom_violin(aes(x=topic,y=value,color=variable),
alpha=0.3,position="identity",scale="width")
You can do it w/o a loop:
library(reshape2) # provides melt()
df.2 <- melt(df)
gg <- ggplot(df.2, aes(x=topic, y=value))
gg <- gg + geom_violin(position="identity", aes(color=variable), alpha=0.3)
gg
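For readers who do not have reshape2 installed, the same wide-to-long reshaping can be sketched with tidyr::pivot_longer(). This is an alternative to melt(), not what the answer used; the data frame wide and the column names place and value are assumptions chosen for the example.

```r
library(ggplot2)
library(tidyr)

set.seed(1)
# Made-up wide data: one column per place, as in the question
wide <- data.frame(
  topic   = rep(c("Poverty %", "Mean Age"), each = 50),
  Chicago = rnorm(100, 35, 7),
  Miami   = rnorm(100, 40, 1.5)
)

# One row per (topic, place, value) instead of one column per place
long <- pivot_longer(wide, cols = -topic, names_to = "place", values_to = "value")

# A single geom_violin() call now draws one violin per place
p <- ggplot(long, aes(x = topic, y = value, colour = place)) +
  geom_violin(position = "identity", alpha = 0.3)
```

Because place is an ordinary column, adding a third city only means adding a column to wide; the plotting code stays the same.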
You can use aes_() rather than aes(), which appears to stop the lazy evaluation. Answer found on a closed question that links here (Update a ggplot using a for loop (R)), but thought it should be here since the other question was closed.
While reshaping the data is generally the preferred approach, with newer versions of ggplot2 (3.0.0 and later) you can use !! to inject values into aes(). For example:
g <- ggplot(df, aes(factor(topic), df[,2]))
for (p in 2:dim(df)[2]) {
g <- g + geom_violin(aes(y = df[,!!p], colour = place[!!p-1]), alpha = 0.3)
}
g
This gives the desired result. The !! forces immediate evaluation of p, rather than leaving it lazy as is the default.
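What !! changes can be checked by inspecting the stored mappings directly. This is a small sketch assuming ggplot2 3.0.0 or later; the data frame and the names forced and forced_vals are hypothetical.

```r
library(ggplot2)

# Tiny stand-in for the wide data frame (values are made up)
df <- data.frame(topic = c("a", "b"), Chicago = c(1, 2), Miami = c(10, 20))

forced <- list()
for (p in 2:3) {
  # !!p splices the current value of p into the stored expression
  forced[[p - 1]] <- aes(y = df[, !!p])
}

# Each mapping now refers to its own column index, not to the symbol p
forced_vals <- lapply(forced, function(m) rlang::eval_tidy(m$y))
```

Here the first mapping evaluates to the Chicago column and the second to the Miami column, even though the evaluation happens after the loop has finished.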
I am currently trying to implement a graphing library where I need a bit more flexibility than what is currently provided by ggplot. I am interested in going in a functional programming kind of way.
Currently, I have a barchart which is defined as
make_bar <- function(data, x, n_cols)
{
#Data: Dataframe or tibble
#x: Factor singular column
#output: ggplot object
n_colors = nrow(distinct(data[x]))
if (n_colors != length(n_cols)) {
difference <- abs(n_colors - length(colors))
colors <- head(colors, difference)
}
plot <- ggplot(data, aes(x = .data[[x]],
tooltip = .data[[x]],
data_id = .data[[x]])) +
geom_bar_interactive(fill=custom_colour_palette(colors))
}
Which very nicely returns a bar chart. Now I want the functionality to write a function called "add_line" which should then be applied to the barchart if one wishes to do so. The line function as is right now is:
add_line <- function(data, x) {
data %>%
count(.data[[x]]) %>%
ggplot(aes(.data[[x]], n)) +
geom_line(group=1)
}
So now I have two lists, but is there any easy - or best practice - way to add two such lists to create one combined plot with the line overlaid on the barchart?
Code for reproducibility can be called with:
data <- mpg
h <- add_line(data, 'manufacturer')
x <- make_bar(data, 'manufacturer', 15)
# x + h ? does not work and shouldn't but such a functionality would be nice
Adding to what @MrFlick has said, here's how you return a geom object in add_line that can be added onto the base bar chart:
add_line <- function(data, x) {
geom_line(
aes_string(x = x, y = "n"),
data = count(data, .data[[x]]),
group = 1
)
}
Then the following should work:
x <- make_bar(mpg, "manufacturer", 15)
h <- add_line(mpg, "manufacturer")
x + h
The aes_string allows for using character strings rather than expressions, really useful for dynamic column choices.
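Since the question already uses the .data pronoun, the same layer-returning idea can also be written without aes_string(). The following is a sketch under assumptions: add_line_tidy is a hypothetical name, plain geom_bar() stands in for geom_bar_interactive() so the example needs no extra package, and dplyr 1.0.0 or later is assumed for across().

```r
library(ggplot2)
library(dplyr)

# Hypothetical variant of add_line(): returns a layer, not a whole plot
add_line_tidy <- function(data, x) {
  # across(all_of(x)) keeps the counted column named after the string in x
  counts <- count(data, across(all_of(x)))
  geom_line(
    aes(x = .data[[x]], y = n),
    data = counts,
    group = 1
  )
}

# Base bar chart (non-interactive stand-in for make_bar())
base <- ggplot(mpg, aes(x = .data[["manufacturer"]])) + geom_bar()

# Layers compose with +, so the line is drawn over the bars
combined <- base + add_line_tidy(mpg, "manufacturer")
```

The key design point is the same as in the answer above: a function that returns a geom (or a list of geoms) can be added to any existing plot, whereas a function that returns a complete ggplot object cannot.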
I want to plot over a timecourse x with y values that are often repeated (integer scores 1-4) and I want to visualize many subjects at once.
Because there is so much overlap, a vertical position dodge would be ideal, such as position_dodgev from ggstance package. However, when I try to connect the dots with geom_line, the order of the connection gets messed up and is connected based on y values and not x values.
I tried a coordinate flip work-around which was not successful. And replacing geom_line with geom_path (making sure it was ordered on the x scale) also did not work.
Here is a reproducible example:
#data
df<-tibble(x=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
y=c(1,2,3,7,7,1,2,3,7,7,2,1,6,7,7),
group=c('a','a','a','a','a','b','b','b','b','b','c','c','c','c','c'))
#horizontal dodge masks groups
ggplot(df, aes(x=x, y=y,col=group,group=group)) +
geom_point(position=position_dodge(width=0.3))+
geom_line(position=position_dodge(width=0.3))
#line connection error with vertical dodge
library(ggstance)
ggplot(df, aes(x=x, y=y,col=group,group=group)) +
geom_point(position=position_dodgev(height=0.3))+
geom_line(position=position_dodgev(height=0.3))
Horizontal dodge works as expected but does not allow visualization of all the overlapped groups in a stretch of repeated y values. Vertical dodge from ggstance connected the dots in group c in the wrong order.
I am not sure what exactly causes the issue. Knowing that position_dodge is not intended to be used with geoms and it's been called a bug, I am surprised and not at the same time about this issue.
But in any case, I found a workaround by disassembling the plot using ggplot_build, rearranging the points for geom_line within that object and then reassembling the plot again; look below:
g <- ggplot(df, aes(x=x, y=y,col=group,group=group)) +
geom_point(position=position_dodgev(height=0.3)) +
geom_line(position=position_dodgev(height=0.3))
gg <- ggplot_build(g)
# -- look at gg$data to understand following lines --
#gg$data[[2]]: data associated with geom_line as it is the 2nd geom
#c(1,2) & c(2,1): I have $group==3 ...
# ... so just need to flip 1st and 2nd datapoints within that group
gg$data[[2]][gg$data[[2]]$group==3,][c(1,2),] <-
gg$data[[2]][gg$data[[2]]$group==3,][c(2,1),]
gt <- ggplot_gtable(gg)
plot(gt)
I suspect the problem occurs due to PositionDodgev's compute_panel function, which takes in a dataset sorted by x values, & returns a dataset sorted instead by y values (within each group) after making the necessary transformations to dodge positions vertically.
The following workaround defines a new ggproto object that inherits from PositionDodgev, but reorders the dataset in compute_panel before returning it:
# new ggproto based on PositionDodgev
PositionDodgeNew <- ggproto(
"PositionDodgeNew",
PositionDodgev,
compute_panel = function (data, params, scales){
result <- ggstance:::collidev(data, params$height,
name = "position_dodgev",
strategy = ggstance:::pos_dodgev,
n = params$n,
check.height = FALSE)
result <- result[order(result$group, result$x), ] # reordering by group & x
result
})
# position function that uses PositionDodgeNew instead of PositionDodgev
position_dodgenew <- function (height = NULL, preserve = c("total", "single")) {
ggproto(NULL, PositionDodgeNew, height = height, preserve = match.arg(preserve))
}
Usage:
po <- position_dodgenew(height = 0.3)
ggplot(df,
aes(x = x, y = y, col = group)) +
geom_point(position = po) +
geom_line(position = po)
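If depending on ggstance internals is a concern, a cruder alternative is to compute the vertical offsets by hand and skip the position adjustment entirely. This is only a sketch, not what either answer does: it shifts every point of a group by a fixed amount (whether or not it overlaps), and the column name y_dodged is made up.

```r
library(ggplot2)

df <- data.frame(
  x     = rep(1:5, times = 3),
  y     = c(1, 2, 3, 7, 7, 1, 2, 3, 7, 7, 2, 1, 6, 7, 7),
  group = rep(c("a", "b", "c"), each = 5)
)

height   <- 0.3                          # total vertical spread, like position_dodgev(height = 0.3)
idx      <- as.integer(factor(df$group)) # 1, 2, 3 for groups a, b, c
n_groups <- length(unique(idx))

# Centre the offsets around zero: -0.1, 0, +0.1 for three groups
df$y_dodged <- df$y + (idx - (n_groups + 1) / 2) * height / n_groups

# No position adjustment is involved, so geom_line() connects points in x order
p <- ggplot(df, aes(x = x, y = y_dodged, colour = group, group = group)) +
  geom_point() +
  geom_line() +
  labs(y = "y")
```

Because the offsets live in the data, the line order no longer depends on how a position object sorts its output.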
I have 2 datasets, called A and B.
I want to compare the distribution of one common variable, called k, which shows up in both datasets but with different lengths (A contains 2000 values of k, while B has 1000; both have some N/A). So I would like to plot the distributions of A$k and B$k in the same plot.
I have tried:
g1 <- ggplot(A, aes(x=A$k)) + geom_density()
g2 <- ggplot(B, aes(x=B$k)) + geom_density()
g <- g1 + g2
But then comes the error:
Don't know how to add o to a plot.
How can I overcome this problem?
Since we don't have any data, it is hard to provide a specific solution for your scenario. But below is the general principle of what I think you are trying to do.
The trick is to put your data together and add another column that identifies group A and group B. This is then used in the aes() argument in ggplot. Bear in mind that combining your data frames might not be as simple as what I have done, since you might have some extra columns etc.
# generating some pseudo data from a poisson distribution
A <- data.frame(k = rpois(2000, 4))
B <- data.frame(k = rpois(1000, 7))
# Create identifier
A$id <- "A"
B$id <- "B"
A_B <- rbind(A, B)
g <- ggplot(data = A_B, aes(x = k,
group = id, colour = id, fill = id)) + # fill/colour aes is not required
geom_density(alpha = 0.6) # alpha for some special effects
g
I can't tell you exactly what to do without knowing what the data sets actually look like. But merging the data sets into one and then calling ggplot() with group or 'colour' specified would be one way to compare them.
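If the two data frames are awkward to merge, another option is to keep them separate and give each geom_density() layer its own data argument. This is a sketch with made-up Poisson data; the legend labels "A" and "B" and the single NA per data frame are assumptions for the example.

```r
library(ggplot2)

set.seed(42)
# Made-up data of different lengths, each with a missing value
A <- data.frame(k = c(rpois(2000, 4), NA))
B <- data.frame(k = c(rpois(1000, 7), NA))

# Start from an empty plot and add one layer per data set;
# a constant string inside aes() creates the legend entry
g <- ggplot() +
  geom_density(data = A, aes(x = k, colour = "A"), na.rm = TRUE) +
  geom_density(data = B, aes(x = k, colour = "B"), na.rm = TRUE) +
  labs(colour = "dataset")
```

This also shows why g1 + g2 fails in the question: + adds layers to a plot, it does not combine two complete plots.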
Another way is to use grid.arrange() from gridExtra package.
gridExtra::grid.arrange(g1, g2)
This is a really easy and convenient function. If you want to know more about the gridExtra package, see its official documentation.
I am trying to replicate figure 6.11 from Hadley Wickham's ggplot2 book, which plots R colors in Luv space; the colors of points represent themselves, and no legend is necessary.
Here are two attempts:
library(colorspace)
myColors <- data.frame("L"=runif(10000, 0,100),"a"=runif(10000, -100, 100),"b"=runif(10000, -100, 100))
myColors <- within(myColors, Luv <- hex(LUV(L, a, b)))
myColors <- na.omit(myColors)
g <- ggplot(myColors, aes(a, b, color=Luv), size=2)
g + geom_point() + ggtitle ("mycolors")
Second attempt:
other <- data.frame("L"=runif(10000),"a"=runif(10000),"b"=runif(10000))
other <- within(other, Luv <- hex(LUV(L, a, b)))
other <- na.omit(other)
g <- ggplot(other, aes(a, b, color=Luv), size=2)
g + geom_point() + ggtitle("other")
There are a couple of obvious problems:
1. These graphs don't look anything like the figure. Any suggestions on the code needed?
2. The first attempt generates a lot of NA fields in the Luv column (only ~3100 named colors out of 10,000 runs, versus ~9950 in the second run). If L is supposed to be between 0-100 and u and v between -100 and 100, why do I have so many NAs in the first run? I have tried rounding, it doesn't help.
3. Why do I have a legend?
Many thanks.
You're getting strange colors because aes(color = Luv) says "assign a color to each unique string in column Luv". If you assign color outside of aes, as below, it means "use these explicit colors". I think something like this should be close to the figure you presented.
require(colorspace)
x <- sRGB(t(col2rgb(colors())))
storage.mode(x@coords) <- "numeric" # as(..., "LUV") doesn't like integers for some reason
y <- as(x, "LUV")
DF <- as.data.frame(y@coords)
DF$col <- colors()
ggplot(DF, aes( x = U, y = V)) + geom_point(colour = DF$col)
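A related option is to keep the colour inside aes() and add scale_colour_identity(), which tells ggplot to use the column values as literal colours and draws no legend by default. The sketch below uses plain RGB coordinates instead of Luv so that it needs only ggplot2; the data frame rgb_df is made up for illustration.

```r
library(ggplot2)

# One row per built-in R colour, with its red/green/blue components
rgb_df <- as.data.frame(t(col2rgb(colors())))
rgb_df$col <- colors()

# The identity scale uses the strings in `col` as the actual colours
p <- ggplot(rgb_df, aes(x = red, y = green, colour = col)) +
  geom_point() +
  scale_colour_identity()
```

This keeps the mapping declarative, which can be handy if the plot is later faceted or the data are subset.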
I'd like to remove a layer (in this case the results of geom_ribbon) from a ggplot2 created grid object. Is there a way I can remove it once it's already part of the object?
library(ggplot2)
dat <- data.frame(x=1:3, y=1:3, ymin=0:2, ymax=2:4)
p <- ggplot(dat, aes(x=x, y=y)) + geom_ribbon(aes(ymin=ymin, ymax=ymax), alpha=0.3) +
geom_line()
# This has the geom_ribbon
p
# This overlays another ribbon on top
p + geom_ribbon(aes(ymin=ymin, ymax=ymax, fill=NA))
I'd like this functionality to allow me to build more complicated plots on top of less complicated ones. I am using functions that return a grid object and then printing out the final plot once it is fully assembled. The base plot has a single line with a corresponding error bar (geom_ribbon) surrounding it. The more complicated plot will have several lines and the multiple overlapping geom_ribbon objects are distracting. I'd like to remove them from the plots with multiple lines. Additionally, I'll be able to quickly create alternative versions using facets or other ggplot2 functionality.
Edit: Accepting @mnel's answer as it works. Now I need to determine how to dynamically access the geom_ribbon layer, which is captured in the SO question here.
Edit 2: For completeness, this is the function I created to solve this problem:
remove_geom <- function(ggplot2_object, geom_type) {
layers <- lapply(ggplot2_object$layers, function(x) if(x$geom$objname == geom_type) NULL else x)
layers <- layers[!sapply(layers, is.null)]
ggplot2_object$layers <- layers
ggplot2_object
}
Edit 3: See the accepted answer below for the latest versions of ggplot (>=2.x.y)
For ggplot2 version 2.2.1, I had to modify the proposed remove_geom function like this:
remove_geom <- function(ggplot2_object, geom_type) {
# Delete layers that match the requested type.
layers <- lapply(ggplot2_object$layers, function(x) {
if (class(x$geom)[1] == geom_type) {
NULL
} else {
x
}
})
# Delete the unwanted layers.
layers <- layers[!sapply(layers, is.null)]
ggplot2_object$layers <- layers
ggplot2_object
}
Here's an example of how to use it:
library(ggplot2)
set.seed(3000)
d <- data.frame(
x = runif(10),
y = runif(10),
label = sprintf("label%s", 1:10)
)
p <- ggplot(d, aes(x, y, label = label)) + geom_point() + geom_text()
Let's show the original plot:
p
Now let's remove the labels and show the plot again:
p <- remove_geom(p, "GeomText")
p
If you look at
p$layers
[[1]]
mapping: ymin = ymin, ymax = ymax
geom_ribbon: na.rm = FALSE, alpha = 0.3
stat_identity:
position_identity: (width = NULL, height = NULL)
[[2]]
geom_line:
stat_identity:
position_identity: (width = NULL, height = NULL)
You will see that you want to remove the first layer
You can do this by redefining the layers as just the second component in the list.
p$layers <- p$layers[2]
Now build and plot p
p
Note that p$layers[[1]] <- NULL would work as well. I agree with @Andrie and @Joran's comments regarding in what cases this might be useful, and would not expect this to be necessarily reliable.
As this problem looked interesting, I have expanded my 'ggpmisc' package with functions to manipulate the layers in a ggplot object (currently in package 'gginnards'). The functions are more polished versions of the example in my earlier answer to this same question. However, be aware that in most cases this is not the best way of working, as it violates the Grammar of Graphics. In most cases one can assemble different variations of the same figure in the normal way with operator +, possibly "packaging" groups of layers into lists to have combined building blocks that can simplify the assembly of complex figures. Exceptionally, we may want to edit an existing plot or a plot output by a higher level function whose definition we cannot modify. In such cases these layer manipulation functions can be useful. The example above becomes:
library(gginnards)
p1 <- delete_layers(p, match_type = "GeomText")
See the documentation of the package for other examples, and for information on the companion functions useful for modifying the ordering of layers, and for inserting new layers at arbitrary positions.
@Kamil Slowikowski Thanks! Very useful. However I could not stop myself from creating a new variation on the same theme... hopefully easier to understand than that in the original post or the updated version by Kamil, also avoiding some assignments.
remove_geoms <- function(x, geom_type) {
# Find layers that match the requested type.
selector <- sapply(x$layers,
function(y) {
class(y$geom)[1] == geom_type
})
# Delete the layers.
x$layers[selector] <- NULL
x
}
This version is functionally identical to Kamil's function, so the usage example above does not need to be repeated here.
As an aside, this function can be easily adapted to select the layers based on the class of the stat instead of the class of the geom.
remove_stats <- function(x, stat_type) {
# Find layers that match the requested type.
selector <- sapply(x$layers,
function(y) {
class(y$stat)[1] == stat_type
})
# Delete the layers.
x$layers[selector] <- NULL
x
}
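A further variation, offered only as a sketch, matches layers with inherits() rather than comparing the first class name. The function name drop_layers is hypothetical, and the code assumes a ggplot2 version in which plot layers are reachable through p$layers, as in the answers above.

```r
library(ggplot2)

# Hypothetical helper: drop every layer whose geom inherits from geom_class
drop_layers <- function(plot, geom_class) {
  keep <- !vapply(plot$layers,
                  function(layer) inherits(layer$geom, geom_class),
                  logical(1))
  plot$layers <- plot$layers[keep]
  plot
}

# The plot from the question: a ribbon with a line on top
dat <- data.frame(x = 1:3, y = 1:3, ymin = 0:2, ymax = 2:4)
p <- ggplot(dat, aes(x = x, y = y)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
  geom_line()

p_no_ribbon <- drop_layers(p, "GeomRibbon")
```

Note the difference in behaviour: because inherits() follows the class hierarchy, asking for "GeomPath" would also drop geom_line() layers (GeomLine inherits from GeomPath), whereas the class(...)[1] comparison matches only the exact geom.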
@Kamil and @Pedro Thanks a lot! For those interested, one can also augment Pedro's function to select only specific layers, as shown here with a last_only argument:
remove_geoms <- function(x, geom_type, last_only = TRUE) {
# Find layers that match the requested type.
selector <- sapply(x$layers,
function(y) {
class(y$geom)[1] == geom_type
})
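# Keep only the last match; skip this when nothing matched, since max() of an empty set is -Inf
if (last_only && any(selector))
selector <- max(which(selector))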
# Delete the layers.
x$layers[selector] <- NULL
x
}
Coming back to @Kamil's example plot:
library(ggplot2)
library(magrittr) # provides %>%
set.seed(3000)
d <- data.frame(
x = runif(10),
y = runif(10),
label = sprintf("label%s", 1:10)
)
p <- ggplot(d, aes(x, y, label = label)) + geom_point() + geom_point(color = "green") + geom_point(size = 5, color = "red")
p
p %>% remove_geoms("GeomPoint")
p %>% remove_geoms("GeomPoint") %>% remove_geoms("GeomPoint")