How to scatter plot with colours assigned to specific factor - r

I would like to plot(x,y) but associated with it are two other factors z and t. There are three levels in z and two levels in t. How do I do a scatter plot with assigned colours to each different factors and levels? ... which would mean a total of six different colours.
I'm considering creating multiple .csv file and using par but I think there should be an easier way to do this.

I'm not sure if you want a single plot or multiple plots. Since you mentioned par, I'm guessing multiple plots. Regardless, to make two factors work together to make the correct number of colors, an easy way is to combine them into a new factor by concatenating them together with paste(). Here's an example with ggplot2 and data.table:
library(data.table)
library(ggplot2)
DT <- as.data.table(mtcars)
DT[, combinedFactor := as.factor(paste(cyl, am))]
ggplot(data = DT, aes(x = mpg, y = disp, color = combinedFactor)) +
geom_point() +
facet_wrap(facets = "am")

Related

Determine order of several boxplots in one plot in R qqplot

I tried to create a relatively simple boxplot plot in R's ggplot2: One value on the x axis and several variables on the y axis. I'm using a code similar to this one:
ggplot() +
# Boxplot 1
geom_boxplot(df[which(df$Xvalue=="Boxplot1"),],
mapping = aes(X, "Y")) +
# Boxplot 2
geom_boxplot(df[which(df$Xvalue=="Boxplot2"),],
mapping = aes(X, "Y")) +
# Boxplot 3
geom_boxplot(df[which(df$Xvalue=="Boxplot3"),],
mapping = aes(X, "Y")) +
The boxplots in my real code are ordered alphabetically, however, I need them to be in a customized, categorial order.
I'm aware I could restructure my data frame so that I don't use a subset and a new geom_boxplot command for each boxplot, but I've structured the data that way for other reasons and that's not the solution I'm looking for right now.
Maybe there is an easy way using the scale_Y_manual or else? Any help is appreciated!

Apply ggplot2 across columns

I am working with a dataframe with many columns and would like to produce certain plots of the data using ggplot2, namely, boxplots, histograms, density plots. I would like to do this by writing a single function that applies across all attributes (columns), producing one boxplot (or histogram etc) and then storing that as a given element of a list into which all the boxplots will be chained, so I could later index it by number (or by column name) in order to return the plot for a given attribute.
The issue I have is that, if I try to apply across columns with something like apply(df,2,boxPlot), I have to define boxPlot as a function that takes just a vector x. And when I do so, the attribute/column name and index are no longer retained. So e.g. in the code for producing a boxplot, like
bp <- ggplot(df, aes(x=Group, y=Attr, fill=Group)) +
geom_boxplot() +
labs(title="Plot of length per dose", x="Group", y =paste(Attr)) +
theme_classic()
the function has no idea how to extract the info necessary for Attr from just vector x (as this is just the column data and doesn't carry the column name or index).
(Note the x-axis is a factor variable called 'Group', which has 6 levels A,B,C,D,E,F, within X.)
Can anyone help with a good way of automating this procedure? (Ideally it should work for all types of ggplots; the problem here seems to simply be how to refer to the attribute name, within the ggplot function, in a way that can be applied / automatically replicated across the columns.) A for-loop would be acceptable, I guess, but if there's a more efficient/better way to do it in R then I'd prefer that!
Edit: something like what would be achieved by the top answer to this question: apply box plots to multiple variables. Except that in that answer, with his code you would still need a for-loop to change the indices on y=y[2] in the ggplot code and get all the boxplots. He's also expanded-grid to include different ````x``` possibilities (I have only one, the Group factor), but it would be easy to simplify down if the looping problem could be handled.
I'd also prefer just base R if possible--dplyr if absolutely necessary.
Here's an example of iterating over all columns of a data frame to produce a list of plots, while retaining the column name in the ggplot axis label
library(tidyverse)
plots <-
imap(select(mtcars, -cyl), ~ {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
})
plots$mpg
You can also do this without purrr and dplyr
to_plot <- setdiff(names(mtcars), 'cyl')
plots <-
Map(function(.x, .y) {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
}, mtcars[to_plot], to_plot)
plots$mpg

Create a grouped plot by more than one factor

I'm currently investigating the advantages of dplyr and ggplot. In one of my plots I have a data.frame containing two factors, namely F1 and F2. I wanted to create a line plot for each possible combinations of F1 and F2 (Cartesian product, here 4=2 x 2 outcomes).
It seems, that ggplot only accepts one factor here F1. So that the lines in my plots are connected, whereas I expect four separate lines.
library(dplyr)
library(ggplot2)
df<-data.frame("X"=c(1,2,4,5,7,8,10,11),"Y"=c(1,2,3,4,5,6,7,8),"F1"=c(1,1,1,1,2,2,2,2),"F2"=c(1,1,2,2,1,1,2,2))
df$F1<-as.factor(df$F1)
df$F2<-as.factor(df$F2)
ggplot(df, aes(x=X, y=Y, group=F1, color=F2))+geom_line()
A workaround might be to add a new column to my dataframe combining F1 and F2, but I have no idea how to do this.
You're right that you can only specify one factor to control the grouping. If you want to use more than one, you can use the interaction function to work out the combination:
# you can either calculate the interaction beforehand and supply that...
df$F1F2 <- interaction(df$F1, df$F2)
ggplot(df, aes(x = X, y = Y, group = F1F2, color = F1F2)) + geom_line()
# or you can just drop it straight in!
ggplot(df,
aes(x = X, y = Y, group = interaction(F1, F2), color = interaction(F1, F2))) +
geom_line()
You can also use the arguments to interaction to control how it's formatted (for plot presentation).
Alternatively, you can find other ways to visualise multiple factors. For example, you might use one factor for grouping, and then use facetting to display levels of another factor (or two more, in fact).

Plot one numeric variable against n numeric variables in n plots

I have a huge data frame and I would like to make some plots to get an idea of the associations among different variables. I cannot use
pairs(data)
, because that would give me 400+ plots. However, there's one response variable y I'm particularly interested in. Thus, I'd like to plot y against all variables, which would reduce the number of plots from n^2 to n. How can I do it?
EDIT: I add an example for the sake of clarity. Let's say I have the dataframe
foo=data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
and my response variable is x3. Then I'd like to generate four plots arranged in a row, respectively x1 vs x3, x2 vs x3, an histogram of x3 and finally x4 vs x3. I know how to make each plot
plot(foo$x1,foo$x3)
plot(foo$x2,foo$x3)
hist(foo$x3)
plot(foo$x4,foo$x3)
However I have no idea how to arrange them in a row. Also, it would be great if there was a way to automatically make all the n plots, without having to call the command plot (or hist) each time. When n=4, it's not that big of an issue, but I usually deal with n=20+ variables, so it can be a drag.
Could do reshape2/ggplot2/gridExtra packages combination. This way you don't need to specify the number of plots. This code will work on any number of explaining variables without any modifications
foo <- data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
library(reshape2)
foo2 <- melt(foo, "x3")
library(ggplot2)
p1 <- ggplot(foo2, aes(value, x3)) + geom_point() + facet_grid(.~variable)
p2 <- ggplot(foo, aes(x = x3)) + geom_histogram()
library(gridExtra)
grid.arrange(p1, p2, ncol=2)
The package tidyr helps doing this efficiently. please refer here for more options
data %>%
gather(-y_value, key = "some_var_name", value = "some_value_name") %>%
ggplot(aes(x = some_value_name, y = y_value)) +
geom_point() +
facet_wrap(~ some_var_name, scales = "free")
you would get something like this
If your goal is only to get an idea of the associations among different variables, you can also use:
plot(y~., data = foo)
It is not as nice as using ggplot and it doesn't automatically put all the graphs in one window (although you can change that using par(mfrow = c(a, b)), but it is a quick way to get what you want.
I faced the same problem, and I don't have any experience of ggplot2, so I created a function using plot which takes the data frame, and the variables to be plotted as arguments and generate graphs.
dfplot <- function(data.frame, xvar, yvars=NULL)
{
df <- data.frame
if (is.null(yvars)) {
yvars = names(data.frame[which(names(data.frame)!=xvar)])
}
if (length(yvars) > 25) {
print("Warning: number of variables to be plotted exceeds 25, only first 25 will be plotted")
yvars = yvars[1:25]
}
#choose a format to display charts
ncharts <- length(yvars)
nrows = ceiling(sqrt(ncharts))
ncols = ceiling(ncharts/nrows)
par(mfrow = c(nrows,ncols))
for(i in 1:ncharts){
plot(df[,xvar],df[,yvars[i]],main=yvars[i], xlab = xvar, ylab = "")
}
}
Notes:
You can provide the list of variables to be plotted as yvars,
otherwise it will plot all (or first 25, whichever is less) the variables in the data frame against xvar.
Margins were going out of bounds if the number of plots exceeds 25,
so I kept a limit to plot 25 charts only. Any suggestions to nicely
handle this are welcome.
Also the y axis labels are removed as titles of the graphs take care
of it. x axis label is set to xvar.

How subset a data frame by a factor and repeat a plot for each subset?

I am new to R. Forgive me if this if this question has an obvious answer but I've not been able to find a solution. I have experience with SAS and may just be thinking of this problem in the wrong way.
I have a dataset with repeated measures from hundreds of subjects with each subject having multiple measurements across different ages. Each subject is identified by an ID variable. I'd like to plot each measurement (let's say body WEIGHT) by AGE for each individual subject (ID).
I've used ggplot2 to do something like this:
ggplot(data = dataset, aes(x = AGE, y = WEIGHT )) + geom_line() + facet_wrap(~ID)
This works well for a small number of subjects but won't work for the entire dataset.
I've also tried something like this:
ggplot(data=data, aes(x = AGE,y = BW, group = ID, colour = ID)) + geom_line()
This also works for a small number of subjects but is unreadable with hundreds of subjects.
I've tried to subset using code like this:
temp <- split(dataset,dataset$ID)
but I'm not sure how to work with the resulting dataset. Or perhaps there is a way to simply adjust the facet_wrap so that individual plots are created?
Thanks!
Because you want to split up the dataset and make a plot for each level of a factor, I would approach this with one of the split-apply-return tools from the plyr package.
Here is a toy example using the mtcars dataset. I first create the plot and name it p, then use dlply to split the dataset by a factor and return a plot for each level. I'm taking advantage of %+% from ggplot2 to replace the data.frame in a plot.
p = ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_line()
require(plyr)
dlply(mtcars, .(cyl), function(x) p %+% x)
This returns all the plots, one after another. If you name the resulting list object you can also call one plot at a time.
plots = dlply(mtcars, .(cyl), function(x) p %+% x)
plots[1]
Edit
I started thinking about putting a title on each plot based on the factor, which seems like it would be useful.
dlply(mtcars, .(cyl), function(x) p %+% x + facet_wrap(~cyl))
Edit 2
Here is one way to save these in a single document, one plot per page. This is working with the list of plots named plots. It saves them all to one document, one plot per page. I didn't change any of the defaults in pdf, but you can certainly explore the changes you can make.
pdf()
plots
dev.off()
Updated to use package dplyr instead of plyr. This is done in do, and the output will have a named column that contains all the plots as a list.
library(dplyr)
plots = mtcars %>%
group_by(cyl) %>%
do(plots = p %+% . + facet_wrap(~cyl))
Source: local data frame [3 x 2]
Groups: <by row>
cyl plots
1 4 <S3:gg, ggplot>
2 6 <S3:gg, ggplot>
3 8 <S3:gg, ggplot>
To see the plots in R, just ask for the column that contains the plots.
plots$plots
And to save as a pdf
pdf()
plots$plots
dev.off()
A few years ago, I wanted to do something similar - plot individual trajectories for ~2500 participants with 1-7 measurements each. I did it like this, using plyr and ggplot2:
library(plyr)
library(ggplot2)
d_ply(dat, .var = "participant_id", .fun = function(x) {
# Generate the desired plot
ggplot(x, aes(x = phase, y = result)) +
geom_point() +
geom_line()
# Save it to a file named after the participant
# Putting it in a subdirectory is prudent
ggsave(file.path("plots", paste0(x$participant_id, ".png")))
})
A little slow, but it worked. If you want to get a sense of all participants' trajectories in one plot (like your second example - aka the spaghetti plot), you can tweak the transparency of the lines (forget coloring them, though):
ggplot(data = dat, aes(x = phase, y = result, group = participant_id)) +
geom_line(alpha = 0.3)
lapply(temp, function(X) ggplot(X, ...))
Where X is your subsetted data
Keep in mind you may have to explicitly print the ggplot object (print(ggplot(X, ..)))

Resources