Create a grouped plot by more than one factor - r

I'm currently investigating the advantages of dplyr and ggplot. In one of my plots I have a data.frame containing two factors, namely F1 and F2. I want to create a line plot for each possible combination of F1 and F2 (the Cartesian product, here 2 x 2 = 4 outcomes).
It seems that ggplot only accepts one factor for grouping, here F1. As a result, the lines in my plot are connected, whereas I expect four separate lines.
library(dplyr)
library(ggplot2)
df<-data.frame("X"=c(1,2,4,5,7,8,10,11),"Y"=c(1,2,3,4,5,6,7,8),"F1"=c(1,1,1,1,2,2,2,2),"F2"=c(1,1,2,2,1,1,2,2))
df$F1<-as.factor(df$F1)
df$F2<-as.factor(df$F2)
ggplot(df, aes(x=X, y=Y, group=F1, color=F2))+geom_line()
A workaround might be to add a new column to my dataframe combining F1 and F2, but I have no idea how to do this.

You're right that you can only specify one factor to control the grouping. If you want to use more than one, you can use the interaction function to work out the combination:
# you can either calculate the interaction beforehand and supply that...
df$F1F2 <- interaction(df$F1, df$F2)
ggplot(df, aes(x = X, y = Y, group = F1F2, color = F1F2)) + geom_line()
# or you can just drop it straight in!
ggplot(df,
aes(x = X, y = Y, group = interaction(F1, F2), color = interaction(F1, F2))) +
geom_line()
You can also use the arguments to interaction to control how the combined levels are formatted for plot presentation (for example, sep sets the separator placed between the level names).
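As a minimal sketch of that formatting option, the snippet below rebuilds the question's data and passes sep to interaction(). The separator " / " and the legend title are arbitrary choices made for illustration, not something from the original answer.

```r
library(ggplot2)

# Rebuild the data from the question so the sketch is self-contained.
df <- data.frame(
  X  = c(1, 2, 4, 5, 7, 8, 10, 11),
  Y  = c(1, 2, 3, 4, 5, 6, 7, 8),
  F1 = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  F2 = factor(c(1, 1, 2, 2, 1, 1, 2, 2))
)

# sep controls the text placed between the two level names (the default is ".").
df$F1F2 <- interaction(df$F1, df$F2, sep = " / ")
levels(df$F1F2)

# One line per F1/F2 combination; the legend title is an illustrative choice.
p <- ggplot(df, aes(x = X, y = Y, group = F1F2, color = F1F2)) +
  geom_line() +
  labs(color = "F1 / F2")
print(p)
```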
Alternatively, you can find other ways to visualise multiple factors. For example, you might use one factor for grouping, and then use facetting to display levels of another factor (or two more, in fact).
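A rough sketch of the faceting alternative, assuming the same data as in the question: F1 controls grouping and colour, while F2 is spread across panels with facet_wrap(). This is one possible layout, not the only one.

```r
library(ggplot2)

# Same data as in the question, rebuilt so the sketch runs on its own.
df <- data.frame(
  X  = c(1, 2, 4, 5, 7, 8, 10, 11),
  Y  = c(1, 2, 3, 4, 5, 6, 7, 8),
  F1 = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  F2 = factor(c(1, 1, 2, 2, 1, 1, 2, 2))
)

# F1 separates the lines within a panel; each level of F2 gets its own panel,
# so the four F1/F2 combinations end up as four separate lines.
p <- ggplot(df, aes(x = X, y = Y, group = F1, color = F1)) +
  geom_line() +
  facet_wrap(vars(F2))
print(p)
```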

Related

Why can facet_wrap() in ggplot2 be expressed with either a tilde (~) or vars()?

A tilde (~) in R generally denotes an anonymous function or formula, if I understand correctly. In ggplot2, you can use facet_wrap() to split your plot into facets based on a factor variable with multiple levels. There are two different ways to express this, and they both produce similar results:
# load starwars and tidyverse
library(tidyverse)
data(starwars)
With a ~:
ggplot(data = starwars, mapping = aes(x = mass)) +
geom_histogram(fill = "blue", alpha = .2) +
theme_minimal() +
facet_wrap( ~ gender, nrow = 1)
With vars():
ggplot(data = starwars, mapping = aes(x = mass)) +
geom_histogram(fill = "blue", alpha = .2) +
theme_minimal() +
facet_wrap( vars(gender), nrow = 1)
How are vars() and ~ equivalent in ggplot2? How is ~ being used in a manner that is analogous, or equivalent to, its typical usage as an anonymous function or formula in R? It doesn't seem like it's a function here? Can someone help clarify how vars() and ~ for facet_wrap() denote the same thing?
The two plots should be identical.
In ggplot2, vars() is just a quoting function that takes inputs to be evaluated, which in this case is the variable name used to form the faceting groups. In other words, the column you supplied, usually a variable with more than one level, will be automatically quoted, then evaluated in the context of the data to form small panels of plots. I recommend using vars() inputs when you want to create a function to wrap around facet_wrap(); it’s a lot easier.
The ~, on the other hand, creates a formula, and facet_wrap() accepts a one-sided formula purely as a way of naming the faceting variable(s). facet_wrap(~ variable_name) does not estimate anything: the formula is never fitted, and ggplot2 only reads the variable names it contains and looks them up in the data. It’s confusing because we usually use the ~ to denote a relationship between y and x. The closest analogue is facet_grid(rows ~ cols), where the variables to the left of the ~ define the rows of panels and the variables to the right define the columns. With facet_wrap() there are no separate row and column dimensions, because the panels are simply wrapped into a grid, so the left-hand side is normally left empty. Note that the faceting formula is unrelated to the x and y mapped inside the aes() call; layering on facet_wrap(~ ...) is just a quick way to partition the data into one panel per level of the faceting variable.
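To make the distinction concrete, here is a small sketch. The first part shows that ~ gender is an ordinary formula object whose variable names can be read without evaluating anything. The second part wraps facet_wrap() in a function using vars(); the helper name facet_hist and the use of mtcars are assumptions made for illustration.

```r
library(ggplot2)

# A one-sided formula is just an unevaluated object that carries variable names.
f <- ~ gender
class(f)     # "formula"
all.vars(f)  # "gender"

# Hypothetical helper: vars() together with {{ }} lets callers pass
# bare column names through to aes() and facet_wrap().
facet_hist <- function(data, x, facet) {
  ggplot(data, aes(x = {{ x }})) +
    geom_histogram(bins = 30) +
    facet_wrap(vars({{ facet }}), nrow = 1)
}

p <- facet_hist(mtcars, mpg, cyl)
print(p)
```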

Apply ggplot2 across columns

I am working with a dataframe with many columns and would like to produce certain plots of the data using ggplot2, namely, boxplots, histograms, density plots. I would like to do this by writing a single function that applies across all attributes (columns), producing one boxplot (or histogram etc) and then storing that as a given element of a list into which all the boxplots will be chained, so I could later index it by number (or by column name) in order to return the plot for a given attribute.
The issue I have is that, if I try to apply across columns with something like apply(df,2,boxPlot), I have to define boxPlot as a function that takes just a vector x. And when I do so, the attribute/column name and index are no longer retained. So e.g. in the code for producing a boxplot, like
bp <- ggplot(df, aes(x=Group, y=Attr, fill=Group)) +
geom_boxplot() +
labs(title="Plot of length per dose", x="Group", y =paste(Attr)) +
theme_classic()
the function has no idea how to extract the info necessary for Attr from just vector x (as this is just the column data and doesn't carry the column name or index).
(Note the x-axis is a factor variable called 'Group', which has 6 levels A,B,C,D,E,F, within X.)
Can anyone help with a good way of automating this procedure? (Ideally it should work for all types of ggplots; the problem here seems to simply be how to refer to the attribute name, within the ggplot function, in a way that can be applied / automatically replicated across the columns.) A for-loop would be acceptable, I guess, but if there's a more efficient/better way to do it in R then I'd prefer that!
Edit: something like what would be achieved by the top answer to this question: apply box plots to multiple variables. Except that in that answer, with his code you would still need a for-loop to change the indices on y=y[2] in the ggplot code and get all the boxplots. He's also used expand.grid to include different x possibilities (I have only one, the Group factor), but it would be easy to simplify down if the looping problem could be handled.
I'd also prefer just base R if possible--dplyr if absolutely necessary.
Here's an example of iterating over all columns of a data frame to produce a list of plots, while retaining the column name in the ggplot axis label:
library(tidyverse)
plots <-
imap(select(mtcars, -cyl), ~ {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
})
plots$mpg
You can also do this without purrr and dplyr:
to_plot <- setdiff(names(mtcars), 'cyl')
plots <-
Map(function(.x, .y) {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
}, mtcars[to_plot], to_plot)
plots$mpg
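Another option, sketched here as an alternative to the two versions above, is to loop over the column names and look each column up with ggplot2's .data pronoun. That way the name is available for both the mapping and the label. The choice of geom_boxplot() and of factor(cyl) on the x-axis is an assumption meant to mirror the question's Group factor.

```r
library(ggplot2)

# Columns to plot: everything except the grouping variable.
to_plot <- setdiff(names(mtcars), "cyl")

# Loop over column *names*; .data[[col]] fetches the column inside aes().
plots <- lapply(to_plot, function(col) {
  ggplot(mtcars, aes(x = factor(cyl), y = .data[[col]], fill = factor(cyl))) +
    geom_boxplot() +
    labs(x = "cyl", y = col, fill = "cyl") +
    theme_classic()
})
names(plots) <- to_plot

# Index by name or by position.
print(plots$mpg)
print(plots[[2]])
```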

How to scatter plot with colours assigned to specific factor

I would like to plot(x,y) but associated with it are two other factors z and t. There are three levels in z and two levels in t. How do I do a scatter plot with assigned colours to each different factors and levels? ... which would mean a total of six different colours.
I'm considering creating multiple .csv file and using par but I think there should be an easier way to do this.
I'm not sure if you want a single plot or multiple plots. Since you mentioned par, I'm guessing multiple plots. Regardless, to make two factors work together to make the correct number of colors, an easy way is to combine them into a new factor by concatenating them together with paste(). Here's an example with ggplot2 and data.table:
library(data.table)
library(ggplot2)
DT <- as.data.table(mtcars)
DT[, combinedFactor := as.factor(paste(cyl, am))]
ggplot(data = DT, aes(x = mpg, y = disp, color = combinedFactor)) +
geom_point() +
facet_wrap(facets = "am")
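If six distinct colours turn out to be hard to tell apart, a different encoding is possible: colour for one factor and point shape for the other. The sketch below uses only ggplot2 and mtcars; treating cyl and am as the two factors is the same assumption the answer above makes.

```r
library(ggplot2)

# Colour encodes cyl (three levels), shape encodes am (two levels),
# so all six combinations stay distinguishable in a single panel.
p <- ggplot(mtcars, aes(x = mpg, y = disp,
                        color = factor(cyl), shape = factor(am))) +
  geom_point(size = 3) +
  labs(color = "cyl", shape = "am")
print(p)
```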

Dot Priority in ggplot2 jittered scatterplot [duplicate]

I'm plotting a dense scatter plot in ggplot2 where each point might be labeled by a different color:
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size))
When I do this, the scatter point labeled "point" (green) is plotted on top of the red points which have the label "a". What controls this z ordering in ggplot, i.e. what controls which point is on top of which?
For example, what if I wanted all the "a" points to be on top of all the points labeled "point" (meaning they would sometimes partially or fully hide that point)? Does this depend on alphanumerical ordering of labels?
I'd like to find a solution that can be translated easily to rpy2.
2016 Update:
The order aesthetic has been deprecated, so at this point the easiest approach is to sort the data.frame so that the green point is at the bottom, and is plotted last. If you don't want to alter the original data.frame, you can sort it during the ggplot call - here's an example that uses %>% and arrange from the dplyr package to do the on-the-fly sorting:
library(dplyr)
ggplot(df %>%
arrange(label),
aes(x = x, y = y, color = label, size = size)) +
geom_point()
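The same sorting can be sketched in base R, which may be easier to carry over to rpy2 since it needs no dplyr. Ordering on the logical test label == "point" is an illustrative choice: FALSE sorts before TRUE, so the highlighted row moves to the end and is drawn last.

```r
library(ggplot2)

# Recreate the question's data (the seed is added for reproducibility).
set.seed(1)
df <- data.frame(x = rnorm(500))
df$y <- rnorm(500) * 0.1 + df$x
df$label <- "a"
df$label[50] <- "point"
df$size <- 2

# FALSE < TRUE, so rows labelled "point" are moved to the bottom.
df_sorted <- df[order(df$label == "point"), ]
tail(df_sorted$label, 1)

p <- ggplot(df_sorted, aes(x = x, y = y, color = label, size = size)) +
  geom_point()
print(p)
```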
Original 2015 answer for ggplot2 versions < 2.0.0
In ggplot2, you can use the order aesthetic to specify the order in which points are plotted. The last ones plotted will appear on top. To apply this, you can create a variable holding the order in which you'd like points to be drawn.
To put the green dot on top by plotting it after the others:
df$order <- ifelse(df$label=="a", 1, 2)
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=order))
Or to plot the green dot first and bury it, plot the points in the opposite order:
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=-order))
For this simple example, you can skip creating a new sorting variable and just coerce the label variable to a factor and then a numeric:
ggplot(df) +
geom_point(aes(x=x, y=y, color=label, size=size, order=as.numeric(factor(df$label))))
ggplot2 will create plots layer-by-layer and within each layer, the plotting order is defined by the geom type. The default is to plot in the order that they appear in the data.
Where this is different, it is noted. For example
geom_line
Connect observations, ordered by x value.
and
geom_path
Connect observations in data order
There are also known issues regarding the ordering of factors, and it is interesting to note the response of the package author Hadley
The display of a plot should be invariant to the order of the data frame - anything else is a bug.
This quote in mind, a layer is drawn in the specified order, so overplotting can be an issue, especially when creating dense scatter plots. So if you want a consistent plot (and not one that relies on the order in the data frame) you need to think a bit more.
Create a second layer
If you want certain values to appear above other values, you can use the subset argument to create a second layer to definitely be drawn afterwards. You will need to explicitly load the plyr package so .() will work.
set.seed(1234)
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
library(plyr)
ggplot(df) + geom_point(aes(x = x, y = y, color = label, size = size)) +
geom_point(aes(x = x, y = y, color = label, size = size),
subset = .(label == 'point'))
Update
In ggplot2_2.0.0, the subset argument is deprecated. Use e.g. base::subset to select relevant data specified in the data argument. And no need to load plyr:
ggplot(df) +
geom_point(aes(x = x, y = y, color = label, size = size)) +
geom_point(data = subset(df, label == 'point'),
aes(x = x, y = y, color = label, size = size))
Or use alpha
Another approach to avoid the problem of overplotting would be to set the alpha (transparency) of the points. This will not be as effective as the explicit second layer approach above; however, with judicious use of scale_alpha_manual you should be able to get something to work.
For example:
# set alpha = 1 (no transparency) for your point(s) of interest
# and a low value otherwise
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size,alpha = label)) +
scale_alpha_manual(guide='none', values = c(a = 0.2, point = 1))
The fundamental question here can be rephrased like this:
How do I control the layers of my plot?
In the 'ggplot2' package, you can do this quickly by splitting each different layer into a different command. Thinking in terms of layers takes a little bit of practice, but it essentially comes down to what you want plotted on top of other things. You build from the background upwards.
Prep: Prepare the sample data. This step is only necessary for this example, because we don't have real data to work with.
# Establish random seed to make data reproducible.
set.seed(1)
# Generate sample data.
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
# Initialize 'label' and 'size' default values.
df$label <- "a"
df$size <- 2
# Label and size our "special" point.
df$label[50] <- "point"
df$size[50] <- 4
You may notice that I've added a different size to the example just to make the layer difference clearer.
Step 1: Separate your data into layers. Always do this BEFORE you use the 'ggplot' function. Too many people get stuck by trying to do data manipulation from within the 'ggplot' functions. Here, we want to create two layers: one with the "a" labels and one with the "point" labels.
df_layer_1 <- df[df$label=="a",]
df_layer_2 <- df[df$label=="point",]
You could do this with other functions, but I'm just quickly using the data frame matching logic to pull the data.
Step 2: Plot the data as layers. We want to plot all of the "a" data first and then plot all the "point" data.
ggplot() +
geom_point(
data=df_layer_1,
aes(x=x, y=y),
colour="orange",
size=df_layer_1$size) +
geom_point(
data=df_layer_2,
aes(x=x, y=y),
colour="blue",
size=df_layer_2$size)
Notice that the base plot layer ggplot() has no data assigned. This is important, because we are going to override the data for each layer. Then, we have two separate point geometry layers geom_point(...) that use their own specifications. The x and y axis will be shared, but we will use different data, colors, and sizes.
It is important to move the colour and size specifications outside of the aes(...) function, so we can specify these values literally. Otherwise, the 'ggplot' function will usually assign colors and sizes according to the levels found in the data. For instance, if you have size values of 2 and 5 in the data, it will assign a default size to any occurrences of the value 2 and will assign some larger size to any occurrences of the value 5. An 'aes' function specification will not use the values 2 and 5 for the sizes. The same goes for colors. I have exact sizes and colors that I want to use, so I move those arguments into the 'geom_point' function itself. Also, any specifications in the 'aes' function will be put into the legend, which can be really useless.
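If you would rather keep everything in one data frame, ggplot2's identity scales are another way to have literal values used as-is. This is a sketch of that alternative, not the approach described above; the colour column and the row ordering are assumptions added for illustration.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(500))
df$y <- rnorm(500) * 0.1 + df$x

# Store the literal size and colour for every row (illustrative columns).
df$size <- 2
df$colour <- "orange"
df$size[50] <- 4
df$colour[50] <- "blue"

# Draw the special point last by moving it to the end of the data.
df <- df[order(df$colour == "blue"), ]

# Identity scales use the column values directly instead of rescaling them,
# and they draw no legend by default.
p <- ggplot(df, aes(x = x, y = y, colour = colour, size = size)) +
  geom_point() +
  scale_colour_identity() +
  scale_size_identity()
print(p)
```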
Final note: In this example, you could achieve the wanted result in many ways, but it is important to understand how 'ggplot2' layers work in order to get the most out of your 'ggplot' charts. As long as you separate your data into different layers before you call the 'ggplot' functions, you have a lot of control over how things will be graphed on the screen.
It's plotted in order of the rows in the data.frame. Try this:
df2 <- rbind(df[-50,],df[50,])
ggplot(df2) + geom_point(aes(x=x, y=y, color=label, size=size))
As you see the green point is drawn last, since it represents the last row of the data.frame.
Here is a way to order the data.frame to have the green point drawn first:
df2 <- df[order(-as.numeric(factor(df$label))),]

How to subset a data frame by a factor and repeat a plot for each subset?

I am new to R. Forgive me if this question has an obvious answer, but I've not been able to find a solution. I have experience with SAS and may just be thinking of this problem in the wrong way.
I have a dataset with repeated measures from hundreds of subjects with each subject having multiple measurements across different ages. Each subject is identified by an ID variable. I'd like to plot each measurement (let's say body WEIGHT) by AGE for each individual subject (ID).
I've used ggplot2 to do something like this:
ggplot(data = dataset, aes(x = AGE, y = WEIGHT )) + geom_line() + facet_wrap(~ID)
This works well for a small number of subjects but won't work for the entire dataset.
I've also tried something like this:
ggplot(data=data, aes(x = AGE,y = BW, group = ID, colour = ID)) + geom_line()
This also works for a small number of subjects but is unreadable with hundreds of subjects.
I've tried to subset using code like this:
temp <- split(dataset,dataset$ID)
but I'm not sure how to work with the resulting dataset. Or perhaps there is a way to simply adjust the facet_wrap so that individual plots are created?
Thanks!
Because you want to split up the dataset and make a plot for each level of a factor, I would approach this with one of the split-apply-return tools from the plyr package.
Here is a toy example using the mtcars dataset. I first create the plot and name it p, then use dlply to split the dataset by a factor and return a plot for each level. I'm taking advantage of %+% from ggplot2 to replace the data.frame in a plot.
p = ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_line()
require(plyr)
dlply(mtcars, .(cyl), function(x) p %+% x)
This returns all the plots, one after another. If you name the resulting list object you can also call one plot at a time.
plots = dlply(mtcars, .(cyl), function(x) p %+% x)
plots[1]
Edit
I started thinking about putting a title on each plot based on the factor, which seems like it would be useful.
dlply(mtcars, .(cyl), function(x) p %+% x + facet_wrap(~cyl))
Edit 2
Here is one way to save these in a single document, one plot per page. This works with the list of plots named plots. I didn't change any of the defaults in pdf, but you can certainly explore the changes you can make.
pdf()
plots
dev.off()
Updated to use package dplyr instead of plyr. This is done in do, and the output will have a named column that contains all the plots as a list.
library(dplyr)
plots = mtcars %>%
group_by(cyl) %>%
do(plots = p %+% . + facet_wrap(~cyl))
Source: local data frame [3 x 2]
Groups: <by row>
cyl plots
1 4 <S3:gg, ggplot>
2 6 <S3:gg, ggplot>
3 8 <S3:gg, ggplot>
To see the plots in R, just ask for the column that contains the plots.
plots$plots
And to save as a pdf
pdf()
plots$plots
dev.off()
A few years ago, I wanted to do something similar - plot individual trajectories for ~2500 participants with 1-7 measurements each. I did it like this, using plyr and ggplot2:
library(plyr)
library(ggplot2)
d_ply(dat, .variables = "participant_id", .fun = function(x) {
# Generate the desired plot
p <- ggplot(x, aes(x = phase, y = result)) +
geom_point() +
geom_line()
# Save it to a file named after the participant
# (x holds all rows for one participant, so take the first ID only)
# Putting it in a subdirectory is prudent; "plots" must already exist
ggsave(file.path("plots", paste0(x$participant_id[1], ".png")), plot = p)
})
A little slow, but it worked. If you want to get a sense of all participants' trajectories in one plot (like your second example - aka the spaghetti plot), you can tweak the transparency of the lines (forget coloring them, though):
ggplot(data = dat, aes(x = phase, y = result, group = participant_id)) +
geom_line(alpha = 0.3)
lapply(temp, function(X) ggplot(X, ...))
Where X is your subsetted data
Keep in mind you may have to explicitly print the ggplot object (print(ggplot(X, ..)))
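Spelled out in full, the split-then-lapply idea might look like the sketch below. It uses mtcars with cyl standing in for the ID variable, and writes to a temporary PDF; the file name and the plot titles are assumptions made for illustration.

```r
library(ggplot2)

# Split the data into one data frame per level of the grouping variable.
pieces <- split(mtcars, mtcars$cyl)

# Build one plot per piece, using the level name as the title.
plots <- lapply(names(pieces), function(id) {
  ggplot(pieces[[id]], aes(x = wt, y = mpg)) +
    geom_line() +
    ggtitle(paste("cyl =", id))
})
names(plots) <- names(pieces)

# Inside a loop, ggplot objects must be printed explicitly to be drawn.
out_file <- file.path(tempdir(), "plots_by_cyl.pdf")
pdf(out_file)
for (p in plots) print(p)
dev.off()
```

Each plot lands on its own page of the PDF, and plots[["4"]] retrieves a single plot by level name.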

Resources