I have a data frame representing a benchmark and I would like to produce all possible comparison plots. Here is a small example data frame that illustrates my problem.
df = data.frame("A"=c(1,2,3,1,2,3,1,2,3,1,2,3), "B"=c(1,1,1,2,2,2,1,1,1,2,2,2), "C"=c(1,1,1,1,1,1,2,2,2,2,2,2), "D"=c(4,5,6,7,8,9,10,11,12,13,14,15))
I want to produce the following plots.
D as a function of A, when B and C are fixed. This would produce four (4) different lines, one for each pair (B,C).
D as a function of B, when A and C are fixed. This would produce six (6) different lines, one for each pair (A,C).
D as a function of C, when A and B are fixed. Again, six (6) different lines.
Is there a simple way to do this in R?
For now, I don't mind that they are in different plots or not. Any representation would be ok at this point. I only need all plots to be produced, since I don't know how we want to display our results.
Edit
I forgot to specify in my example that the columns of the data frame do not have the same factor levels. Here is a more complete example.
df = data.frame("A"=c(1,2,3,1,2,3,1,2,3,1,2,3),
"B"=c("[0,1]","[0,1]","[0,1]","[1,3]","[1,3]","[1,3]","[0,1]","[0,1]","[0,1]","[1,3]","[1,3]","[1,3]"),
"C"=c(1,1,1,1,1,1,2,2,2,2,2,2),
"D"=c(4,5,6,7,8,9,10,11,12,13,14,15))
Using @mattek's solution, I have the following plots.
This is great. If I could remove the extra values from the x-axis and keep only the corresponding factors for each column, that would be perfect.
library(ggplot2)
library(reshape2)
First, we melt your table:
df.plot = melt(df,
measure.vars = c('A', 'B', 'C'),
id.vars = 'D',
variable.name = 'var.name',
value.name = 'val.abc')
Then, we add a grouping column:
df.plot$grouping = rep(1:4, 3, each = 3)
And we are ready to plot:
ggplot(df.plot, aes(x = val.abc, y = D, group = as.factor(grouping))) +
facet_wrap(~ var.name) +
geom_line(aes(colour = var.name)) +
geom_point(aes(colour = var.name))
Using facet_wrap(~ var.name, scales = "free_x") instead would get rid of the non-existent factor levels in every facet.
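The grouping column above is hard-coded with rep(), which relies on the row order of the example data and reuses the same four groups in every facet. As a rough sketch (base R only, using the first example data frame from the question; the names vars and groups are made up for illustration), the groups could instead be derived from the data by crossing the two columns that are held fixed:

```r
df <- data.frame(A = c(1,2,3,1,2,3,1,2,3,1,2,3),
                 B = c(1,1,1,2,2,2,1,1,1,2,2,2),
                 C = c(1,1,1,1,1,1,2,2,2,2,2,2),
                 D = c(4,5,6,7,8,9,10,11,12,13,14,15))

vars <- c("A", "B", "C")

# For each x variable, the line a row belongs to is defined by the
# combination of the two remaining (fixed) variables.
groups <- lapply(setNames(vars, vars), function(v) {
  interaction(df[setdiff(vars, v)], drop = TRUE)
})

# Number of lines per plot
n_lines <- sapply(groups, nlevels)
n_lines
```

This matches the counts in the question: four lines when A is on the x-axis, six for B and six for C. Each element of groups could then be used as the group aesthetic for the corresponding plot.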
Possible answer for exploratory analysis that will show correlation between variables and also a smoothing line:
df = data.frame("A"=c(1,2,3,1,2,3,1,2,3,1,2,3), "B"=c(1,1,1,2,2,2,1,1,1,2,2,2), "C"=c(1,1,1,1,1,1,2,2,2,2,2,2), "D"=c(4,5,6,7,8,9,10,11,12,13,14,15))
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- cor(x, y)
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(df, lower.panel = panel.smooth, upper.panel = panel.cor)
Another option comes from ggplot2 using the GGally package:
library(ggplot2)
library(GGally)
This helps a lot if some of your data is a factor. Using your data, let's assume that A is a factor variable:
df = data.frame("A"=as.factor(c(1,2,3,1,2,3,1,2,3,1,2,3)), "B"=c(1,1,1,2,2,2,1,1,1,2,2,2), "C"=c(1,1,1,1,1,1,2,2,2,2,2,2), "D"=c(4,5,6,7,8,9,10,11,12,13,14,15))
Then ggpairs will draw boxplots instead of points for the factor column; you can choose which plot types to use:
ggpairs(df)
Here's what I would do: create three new variables which capture the different combinations of A, B, and C being held fixed:
library(dplyr)
library(ggplot2)
dat <- data.frame("A"=c(1,2,3,1,2,3,1,2,3,1,2,3),
"B"=c(1,1,1,2,2,2,1,1,1,2,2,2),
"C"=c(1,1,1,1,1,1,2,2,2,2,2,2),
"D"=c(4,5,6,7,8,9,10,11,12,13,14,15))
# add variables for A-B, A-C, B-C
dat <- dat %>%
mutate('A - B' = paste(A, '-', B),
'A - C' = paste(A, '-', C),
'B - C' = paste(B, '-', C))
Then we make the plots:
ggplot(dat, aes(y = D))+
geom_line(aes(x = C, colour = `A - B`))
ggplot(dat, aes(y = D))+
geom_line(aes(x = B, colour = `A - C`))
ggplot(dat, aes(y = D))+
geom_line(aes(x = A, colour = `B - C`))
I'm trying to reorder a factor from a subset of my data frame, defined by another factor using forcats::fct_reorder().
Consider the following data frame df:
set.seed(12)
df <- data.frame(fct1 = as.factor(rep(c("A", "B", 'C'), each = 200)),
fct2 = as.factor(rep(c("j", "k"), each = 100)),
val = c(rnorm(100, 2), # A - j
rnorm(100, 1), # A - k
rnorm(100, 1), # B - j
rnorm(100, 6), # B - k
rnorm(100, 8), # C - j
rnorm(100, 4)))# C - k
I want to plot facetted group densities using the ggridges package. For example:
ggplot(data = df, aes(y = fct2, x = val)) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = T,
quantile_fun = median,
quantile_lines = T) +
facet_wrap(~fct1, ncol = 1)
I would now like to order fct1 by the median (default in fct_reorder()) of the values of the upper density in each facet, i.e. where fct2 == "k". The goal in this example would therefore be that the facets appear in the order B - C - A.
This seems very similar to this question here, with the difference that I do not want to summarize the data first because I need the raw data to plot the densities.
I've tried to adapt the code in the answer of the linked question:
df <- df %>% mutate(fct1 = forcats::fct_reorder(fct1, filter(., fct2 == 'k') %>% pull(val)))
But it returns the following error:
Error in forcats::fct_reorder(fct1, filter(., fct2 == "k") %>% pull(val)) :
length(f) == length(.x) is not TRUE
It's obvious that they are not the same length, but I don't quite get why this error is necessary. My guess is that it's generally not guaranteed that all levels of fct1 are present in the subset, which would certainly be problematic. Yet, this isn't the case in my example. Is there a way to work around this error or am I doing something wrong more generally?
I'm aware that I can work around this with a couple of lines of extra code, e.g. create a helper variable of the subsetted data, reorder that and then take the level order to my factor in the original data set. I would still like a prettier solution, because I regularly face that very same task.
You can do this with a little helper function:
f <- function(i) -median(df$val[df$fct2 == "k" & df$fct1 == df$fct1[i]])
Which allows you to reorder like this:
df$fct1 <- forcats::fct_reorder(df$fct1, sapply(seq(nrow(df)), f))
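If the row-by-row helper feels heavy, here is a base R sketch of the same idea: compute the medians once on the fct2 == "k" subset and set the level order from them. It assumes the example data from the question; k_rows, k_medians and new_order are names I made up.

```r
set.seed(12)
df <- data.frame(fct1 = as.factor(rep(c("A", "B", "C"), each = 200)),
                 fct2 = as.factor(rep(c("j", "k"), each = 100)),
                 val = c(rnorm(100, 2), rnorm(100, 1),   # A - j, A - k
                         rnorm(100, 1), rnorm(100, 6),   # B - j, B - k
                         rnorm(100, 8), rnorm(100, 4)))  # C - j, C - k

# Median of val per level of fct1, using only the rows where fct2 == "k"
k_rows <- df$fct2 == "k"
k_medians <- tapply(df$val[k_rows], df$fct1[k_rows], median)

# Largest median first, then rebuild the factor with that level order
new_order <- names(sort(k_medians, decreasing = TRUE))
df$fct1 <- factor(df$fct1, levels = new_order)
levels(df$fct1)
```

The raw data stays untouched, so the density plot code works as before; only the level order (and therefore the facet order) changes.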
Which gives you this plot:
ggplot(data = df, aes(y = fct2, x = val)) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = T,
quantile_fun = median,
quantile_lines = T) +
facet_wrap(~fct1, ncol = 1)
I have a dataset I'm plotting, with facets by variables (in the toy dataset - densities of 2 species). I need to use the actual variable names to do 2 things: 1) italicize species names, and 2) have the 2 in n/m2 properly superscripted (or ASCII-ed, whichever easier).
It's similar to this, but I can't seem to make it work for my case.
toy data
library(ggplot2)
df <- data.frame(x = 1:10, y = 1:10,
z = rep(c("Species1 density (n/m2)", "Species2 density (m/m2)"), each = 5),
z1 = rep(c("Area1", "Area2", "Area3", "Area4", "Area5"), each = 2))
ggplot(df) + geom_point(aes(x = x, y = y)) + facet_grid(z1 ~ z)
I get an error (variable z not found) when I try to use the code in the answer naively. How do I get around having 2 variables in the facetting?
A small modification gets the code from your link to work. I've changed the code to use tibble() (data_frame() is deprecated) so the character vector isn't converted to a factor, and moved the common information out of the labels so it can be added via the labeller (otherwise it would be a pain to make only half the text italic).
library(tidyverse)
df <- tibble(
x = 1:10,
y = 1:10,
z = rep(c("Species1", "Species2"), each = 5),
z1 = rep(c("Area1", "Area2", "Area3", "Area4", "Area5"), each = 2)
)
ggplot(df) +
geom_point(aes(x = x, y = y)) +
facet_grid(z1 ~ z, labeller = label_bquote(cols = italic(.(z))~density~m^2))
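To see what the labeller expression is doing, here is a small base R sketch of the same plotmath construction outside of ggplot2. The variable sp is just a placeholder, and I've written the unit as n/m^2 to match the question rather than the m^2 used above.

```r
sp <- "Species1"  # placeholder for one facet value

# .(sp) is replaced by the value of sp; italic() and ^ are plotmath
# markup, and ~ inserts a space between the pieces.
lab <- bquote(italic(.(sp)) ~ density ~ (n/m^2))
lab
```

An expression like this can be passed to main = or xlab = in base graphics, and the part inside bquote() should carry over to label_bquote(cols = ...) with .(z) in place of .(sp).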
I have a data frame mydataAll with columns DESWC, journal, and highlight. To calculate the average and standard deviation of DESWC for each journal, I do
avg <- aggregate(DESWC ~ journal, data = mydataAll, mean)
stddev <- aggregate(DESWC ~ journal, data = mydataAll, sd)
Now I plot a horizontal stripchart with the values of DESWC along the x-axis and each journal along the y-axis. But for each journal, I want to indicate the standard deviation and average with a simple line. Here is my current code and the results.
stripchart2 <-
ggplot(data=mydataAll, aes(x=mydataAll$DESWC, y=mydataAll$journal, color=highlight)) +
geom_segment(aes(x=avg[1,2] - stddev[1,2],
y = avg[1,1],
xend=avg[1,2] + stddev[1,2],
yend = avg[1,1]), color="gray78") +
geom_segment(aes(x=avg[2,2] - stddev[2,2],
y = avg[2,1],
xend=avg[2,2] + stddev[2,2],
yend = avg[2,1]), color="gray78") +
geom_segment(aes(x=avg[3,2] - stddev[3,2],
y = avg[3,1],
xend=avg[3,2] + stddev[3,2],
yend = avg[3,1]), color="gray78") +
geom_point(size=3, aes(alpha=highlight)) +
scale_x_continuous(limit=x_axis_range) +
scale_y_discrete(limits=mydataAll$journal) +
scale_alpha_discrete(range = c(1.0, 0.5), guide='none')
show(stripchart2)
See the three horizontal geom_segments at the bottom of the image indicating the spread? I want to do that for all journals, but without handcrafting each one. I tried using the solution from this question, but when I put everything in a loop and remove the aes(), it gives me an error that says:
Error in x - from[1] : non-numeric argument to binary operator
Can anyone help me condense the geom_segment() statements?
I generated some dummy data to demonstrate. First, we use aggregate like you have done, then we combine those results to create a data.frame in which we create upper and lower columns. Then, we pass these to the geom_segment specifying our new dataset. Also, I specify x as the character variable and y as the numeric variable, and then use coord_flip():
library(ggplot2)
set.seed(123)
df <- data.frame(lets = sample(letters[1:8], 100, replace = T),
vals = rnorm(100),
stringsAsFactors = F)
means <- aggregate(vals~lets, data = df, FUN = mean)
sds <- aggregate(vals~lets, data = df, FUN = sd)
df2 <- data.frame(means, sds)
df2$upper = df2$vals + df2$vals.1
df2$lower = df2$vals - df2$vals.1
ggplot(df, aes(x = lets, y = vals))+geom_point()+
geom_segment(data = df2, aes(x = lets, xend = lets, y = lower, yend = upper))+
coord_flip()+theme_bw()
Here, the lets column would resemble your character variable.
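One caveat: data.frame(means, sds) pairs the two summaries by position. A slightly more defensive sketch, assuming the same dummy data, is to merge the two aggregates on the grouping column so the pairing is explicit (the suffixes .mean and .sd are my own choice):

```r
set.seed(123)
df <- data.frame(lets = sample(letters[1:8], 100, replace = TRUE),
                 vals = rnorm(100),
                 stringsAsFactors = FALSE)

means <- aggregate(vals ~ lets, data = df, FUN = mean)
sds   <- aggregate(vals ~ lets, data = df, FUN = sd)

# Join on the grouping column instead of relying on row order
df2 <- merge(means, sds, by = "lets", suffixes = c(".mean", ".sd"))
df2$lower <- df2$vals.mean - df2$vals.sd
df2$upper <- df2$vals.mean + df2$vals.sd
head(df2)
```

df2 can be passed to geom_segment() exactly as above, since it still has the lets, lower and upper columns.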
Here is some workable example of data I wish to plot:
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- round(x, digits = 0)
x <- sort(x, decreasing = FALSE)
y <- c(rep(0.1, times = 500),rep(0.25, times = 500),rep(0.4, times = 500),rep(0.85, times = 500))
z <- rbinom(n=2000, size=1, prob=y)
df1 <- data.frame(x,z)
I want to plot the overall frequency of z across x.
Unlike a typical CDF, the function should not reach 1.0, but instead top out at
sum(df1$z)/length(df1$z)
i.e. a ymax of 0.36 (721/2000).
Using ggplot2 we can create a CDF of x with the following command:
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
But I want to extend this plot to show the cumulative percentage of z (as a function of 'x').
The end result should look like
EDIT
With some very poor data manipulation I am able to generate something similar to a CDF plot, but there must be a more elegant and easier method using ggplot and other packages.
mytable <- table(df1$x, df1$z)
mydf <- as.data.frame.matrix(mytable)
colnames(mydf) <- c("z_no", "z_yes")
mydf$A <- 1:length(mydf$z_no)
mydf$sum <- cumsum(mydf$z_yes)
mydf$dis <- mydf$sum/length(z)
plot(mydf$A, mydf$dis)
You can use the package dplyr to process the data as follows:
library(dplyr)
plot_data <- group_by(df1, x) %>%
summarise(z_num = sum(z)) %>%
mutate(cum_perc_z = cumsum(z_num)/nrow(df1))
This gives the same result as the data processing that you describe in your edit. Note, however, that I get sum(df1$z) = 796 and the maximal y value is thus 796/2000 = 0.398.
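For anyone who wants to check the numbers without dplyr, here is a base R sketch of the same computation. It rebuilds df1 as in the question; per_x is a name I made up. The last cumulative value should equal the overall share of z == 1, whatever that turns out to be on your machine.

```r
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- sort(round(x, digits = 0), decreasing = FALSE)
y <- rep(c(0.1, 0.25, 0.4, 0.85), each = 500)
z <- rbinom(n = 2000, size = 1, prob = y)
df1 <- data.frame(x, z)

# Number of z == 1 per distinct x, then the running total as a share
# of all rows (not of all z == 1, so the curve stops below 1)
per_x <- aggregate(z ~ x, data = df1, FUN = sum)
per_x$cum_perc_z <- cumsum(per_x$z) / nrow(df1)

max(per_x$cum_perc_z)
mean(df1$z)
```

Both printed values should agree, which is a quick sanity check on the ymax of the plot.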
For the plot, you can use geom_step() to have a step function and add the horizontal line with geom_hline():
ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
geom_step(colour = "red", size = 0.8) +
geom_hline(yintercept = max(plot_data$cum_perc_z))
So I have two sets of data (of different length) that I am trying to group up and display the density plots for:
dat <- data.frame(dens = c(nEXP,nCNT),lines = rep(c("Exp","Cont")))
ggplot(dat, aes(x = dens, group=lines, fill = lines)) + geom_density(alpha = .5)
When I run the code it throws an error about the different lengths, i.e.
"arguments imply different num of rows: x, y"
I then augment the code to:
dat <- data.frame(dens = c(nEXP,nCNT),lines = rep(c("Exp","Cont"),X))
Where X is the length of the longer argument so the lengths of "lines" will match that of dens.
Now the issue is that when I go to plot the data I am only getting ONE density plot. I know there should be two, since plotting the densities with plot/lines clearly shows two non-equal overlapping distributions, so I am assuming the error is with the grouping.
Hope that makes sense.
So I am not sure why but basically I simply had to do the rep() function manually:
A<-data.frame(ExpN, key = "exp")
B<-data.frame(ConN,key = "con")
colnames(A) <- c("a","key")
colnames(B) <- c("a","key")
dat <- rbind(A,B)
ggplot(dat, aes(x = a, fill = key)) + geom_density(alpha = .5)
You need to tell rep how many times to repeat each element to get it to line up:
dat <- data.frame(dens = c(nEXP,nCNT),
lines = rep(c("Exp","Cont"), c(length(nEXP),length(nCNT))))
That should give you a dat you can use with your ggplot call.
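Here is a small runnable sketch of that, with made-up vectors standing in for nEXP and nCNT (the lengths 30 and 50 and the means are arbitrary assumptions, since the real data isn't shown):

```r
set.seed(1)
nEXP <- rnorm(30, mean = 0)  # stand-in for the experimental values
nCNT <- rnorm(50, mean = 2)  # stand-in for the control values

# times = c(30, 50) repeats "Exp" 30 times and then "Cont" 50 times,
# so every label lines up with the value it belongs to.
dat <- data.frame(dens  = c(nEXP, nCNT),
                  lines = rep(c("Exp", "Cont"),
                              times = c(length(nEXP), length(nCNT))))

table(dat$lines)
```

By contrast, rep(c("Exp","Cont"), X) alternates the two labels row by row, so each group ends up with a mix of both samples, which would explain seeing what looks like a single density.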