I have a data frame, called mouse.data, with 3 columns: Eigenvalues, DualEigenvalues and Experiment. This question does not concern the DualEigenvalues data, so that can be forgotten.
We ran 5 experiments and used the data from each experiment to calculate 14 eigenvalues. So the first 14 rows of this data frame are the 14 eigenvalues of the first experiment, with the experiment entry having value 1, the second 14 rows are the 14 eigenvalues of the second experiment with the experiment entry having value 2 etc.
I am then plotting the eigenvalues of each pair of experiments against each other. Here is an example of the code:
eigen.1 <- mouse.data$Eigenvalues[mouse.data$Experiment == 1]
eigen.2 <- mouse.data$Eigenvalues[mouse.data$Experiment == 2]
p.data <- data.frame(x = eigen.1, y = eigen.2)
ggplot(p.data, aes(x,y)) + geom_abline(slope = 1, colour = "red") + geom_point()
This gives me a graph like this one:
This is precisely what I want this graph to look like.
What I would like to do, but can't work out, is to plot a facet_grid so that the plot in the ith row and jth column plots the eigenvalues from the ith experiment on the y-axis and the eigenvalues from the jth experiment on the x-axis.
This is the closest I have got so far; I hope it makes clearer what I mean.
This is tricky without a reproducible example of your data, but it sounds like we can roughly approximate the structure of your data frame like this:
library(ggplot2)
set.seed(1)
Eigen <- as.vector(sapply(runif(5, .5, 1.5),
function(x) sort(rgamma(14, 2, 0.02*x))))
mouse.data <- data.frame(Experiment = rep(seq(5), each = 14), Eigenvalue = Eigen)
head(mouse.data)
#> Experiment Eigenvalue
#> 1 1 39.61451
#> 2 1 44.48163
#> 3 1 54.57964
#> 4 1 75.06725
#> 5 1 75.50014
#> 6 1 94.41255
The key to getting the plot to work is to reshape your data into a long-format data frame that contains each combination of experiments. One way to do this is to split the data frame by Experiment, then use simple indexing of the resultant list (using rep) to get all 25 ordered pairs of data frames (including each experiment paired with itself). Each pair is stuck together column-wise, then the resultant 25 data frames are all joined row-wise into the plotting data frame.
experiments <- split(mouse.data, mouse.data$Experiment)
experiments <- mapply(cbind,
experiments[rep(1:5, 5)],
experiments[rep(1:5, each = 5)],
SIMPLIFY = FALSE)
p.data <- do.call(rbind, lapply(experiments, setNames,
nm = c("Experiment1", "x",
"Experiment2", "y")))
Once we have done this, we can use your plot code, with the addition of a facet_grid call:
ggplot(p.data, aes(x,y)) +
geom_abline(slope = 1, colour = "red") +
geom_point() +
# y comes from Experiment2 and x from Experiment1, so Experiment2 indexes the rows
facet_grid(Experiment2~Experiment1)
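If it helps to see the pairing spelled out, here is a minimal sketch of another way to build the same kind of plotting data. It assumes the question's column names (Eigenvalues, Experiment); the simulated mouse.data and the names row_exp and col_exp are made up for illustration. Naming the facet variables after their role makes it easier to check that the experiment in row i ends up on the y-axis and the experiment in column j on the x-axis.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for mouse.data: 5 experiments x 14 sorted values each
mouse.data <- data.frame(
  Experiment  = rep(1:5, each = 14),
  Eigenvalues = as.vector(replicate(5, sort(rgamma(14, 2, 0.02))))
)

# Every (row, column) combination of experiments, 25 in total
pairs <- expand.grid(row_exp = 1:5, col_exp = 1:5)

# For each combination, put the row experiment on y and the column experiment on x
p.data <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(k) {
  data.frame(
    row_exp = pairs$row_exp[k],
    col_exp = pairs$col_exp[k],
    x = mouse.data$Eigenvalues[mouse.data$Experiment == pairs$col_exp[k]],
    y = mouse.data$Eigenvalues[mouse.data$Experiment == pairs$row_exp[k]]
  )
}))

p <- ggplot(p.data, aes(x, y)) +
  geom_abline(slope = 1, colour = "red") +
  geom_point() +
  facet_grid(row_exp ~ col_exp)
print(p)
```

The diagonal panels plot each experiment against itself, so their points should sit exactly on the red line, which is a quick sanity check on the layout.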
I am working with ggplot to plot bivariate data in groups along with standard ellipses of these data using a separate set of tools. These return n=100 x,y coordinates that define each ellipse, and then for each group, I would like to plot about 10-25 ellipses.
Conceptually, how can this be achieved? I can plot a single ellipse easily using geom_polygon, but I am confused how to get the data organized to make it work so multiple ellipses are plotted and guides (color, fills, linetypes, etc.) are applied per group.
In the traditional R plotting, I could just keep adding lines using a for loop.
Thanks!
UPDATE: Here is a CSV containing 100 coordinates for a single ellipse.
Data
Let's say I have three groups of bivariate data to which the ellipse fitting has been applied: Green, Red, Blue. For each group, I'd like to plot several ellipses.
I don't know how I would organize the data in such a way to work in the long format preferred by ggplot and preserve the group affiliations. Would a list work?
UPDATE2:
Here is a csv of raw x and y data organized into two groups: river and lake
Data
The data plot like this:
test.data <- read.csv("ellipse_test_data.csv")
ggplot(test.data) +
geom_point(aes(x, y, color = group)) +
theme_classic()
I am using a package called SIBER, which will fit Bayesian ellipses to the data for comparing groups by ellipse area, etc. The output of the following is a list with one element per group of data, and each element is an n x 6 matrix (n = number of posterior draws) with one row per fitted ellipse: the first four columns are the covariance matrix Sigma in vector format and the last two are the bivariate means:
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^5 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 100 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.test <- siberMVN(siber.test, parms, priors)
First few rows of the first element in the list:
$`1.river`
Sigma2[1,1] Sigma2[2,1] Sigma2[1,2] Sigma2[2,2] mu[1] mu[2]
[1,] 1.2882740 2.407070e-01 2.407070e-01 1.922637 -15.52846 12.14774
[2,] 1.0677979 -3.997169e-02 -3.997169e-02 2.448872 -15.49182 12.37709
[3,] 1.1440816 7.257331e-01 7.257331e-01 4.040416 -15.30151 12.14947
I would like to be able to extract a random number of these ellipses and plot them with ggplot using alpha transparency.
The package SIBER has a function (addEllipse) to convert each posterior draw (a row of six values) to a set number of x and y points that define an ellipse, but I don't know how to organize that output for ggplot. I thought there might be an elegant way to do this all internally with ggplot.
The ideal output would be something like this, but in ggplot so the ellipses could match the aesthetics of the levels of data:
Here is some code to do this on the bundled demo dataset from SIBER.
In this example we try to create some plots of the multiple samples of the posterior ellipses using ggplot2.
library(SIBER)
library(ggplot2)
library(dplyr)
library(ellipse)
Fit a basic SIBER model to the example data bundled with the package.
# load in the included demonstration dataset
data("demo.siber.data")
#
# create the siber object
siber.example <- createSiberObject(demo.siber.data)
# Calculate summary statistics for each group: TA, SEA and SEAc
group.ML <- groupMetricsML(siber.example)
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^4 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 10 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.posterior <- siberMVN(siber.example, parms, priors)
# The posterior estimates of the ellipses for each group can be used to
# calculate the SEA.B for each group.
SEA.B <- siberEllipses(ellipses.posterior)
siberDensityPlot(SEA.B, xticklabels = colnames(group.ML),
xlab = c("Community | Group"),
ylab = expression("Standard Ellipse Area " ('\u2030' ^2) ),
bty = "L",
las = 1,
main = "SIBER ellipses on each group"
)
Now we want to create some plots of some sample ellipses from these distributions. We need to create a data.frame object of all the ellipses for each group. In this example we simply take the first 10 posterior draws, assuming them to be independent of one another, but you could take a random sample if you prefer.
# how many of the posterior draws do you want?
n.posts <- 10
# decide how big an ellipse you want to draw
p.ell <- 0.95
# for a standard ellipse use
# p.ell <- pchisq(1,2)
# a list to store the results
all_ellipses <- list()
# loop over groups
for (i in 1:length(ellipses.posterior)){
# a dummy variable to build in the loop
ell <- NULL
post.id <- NULL
for ( j in 1:n.posts){
# covariance matrix
Sigma <- matrix(ellipses.posterior[[i]][j,1:4], 2, 2)
# mean
mu <- ellipses.posterior[[i]][j,5:6]
# ellipse points
out <- ellipse::ellipse(Sigma, centre = mu , level = p.ell)
ell <- rbind(ell, out)
post.id <- c(post.id, rep(j, nrow(out)))
}
ell <- as.data.frame(ell)
ell$rep <- post.id
all_ellipses[[i]] <- ell
}
ellipse_df <- bind_rows(all_ellipses, .id = "id")
# now we need the group and community names
# extract them from the ellipses.posterior list
group_comm_names <- names(ellipses.posterior)[as.numeric(ellipse_df$id)]
# split them and convert to a matrix, NB byrow = T
split_group_comm <- matrix(unlist(strsplit(group_comm_names, "[.]")),
nrow(ellipse_df), 2, byrow = TRUE)
ellipse_df$community <- split_group_comm[,1]
ellipse_df$group <- split_group_comm[,2]
ellipse_df <- dplyr::rename(ellipse_df, iso1 = x, iso2 = y)
Now to create the plots. First plot all the raw data as we want.
first.plot <- ggplot(data = demo.siber.data, aes(iso1, iso2)) +
geom_point(aes(color = factor(group):factor(community)), size = 2)+
ylab(expression(paste(delta^{15}, "N (\u2030)")))+
xlab(expression(paste(delta^{13}, "C (\u2030)"))) +
theme(text = element_text(size=15))
print(first.plot)
Now we can try to add the posterior ellipses on top and facet by group
second.plot <- first.plot + facet_wrap(~factor(group):factor(community))
print(second.plot)
# rename columns of ellipse_df to match the aesthetics
third.plot <- second.plot +
geom_polygon(data = ellipse_df,
mapping = aes(iso1, iso2,
group = rep,
color = factor(group):factor(community),
fill = NULL),
fill = NA,
alpha = 0.2)
print(third.plot)
Facet-wrapped plot of sample of posterior ellipses by group
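To answer the conceptual question more directly (how to organize many ellipses for ggplot), here is a minimal sketch that does not need SIBER or JAGS. The helper ellipse_points is hypothetical, not part of any package; the river rows are rounded from the output shown in the question, and the lake rows are invented purely for illustration. The idea is one long data frame with a column for the group and a column for the draw, so that each ellipse can be drawn as its own path while colour follows the group.

```r
library(ggplot2)

# Hypothetical helper: n points on the ellipse for covariance Sigma and centre mu
ellipse_points <- function(Sigma, mu, level = 0.95, n = 100) {
  theta  <- seq(0, 2 * pi, length.out = n)
  circle <- rbind(cos(theta), sin(theta))      # unit circle, 2 x n
  r      <- sqrt(qchisq(level, df = 2))        # radius for the chosen coverage
  pts    <- t(mu + r * (t(chol(Sigma)) %*% circle))
  data.frame(x = pts[, 1], y = pts[, 2])
}

# Same layout as the siberMVN output: one row per draw, Sigma (4 values) then mu (2)
posterior <- list(
  river = rbind(c(1.29,  0.24,  0.24, 1.92, -15.5, 12.1),
                c(1.07, -0.04, -0.04, 2.45, -15.5, 12.4),
                c(1.14,  0.73,  0.73, 4.04, -15.3, 12.1)),
  lake  = rbind(c(0.90,  0.10,  0.10, 1.50, -12.0,  9.0),   # invented values
                c(1.10,  0.20,  0.20, 1.30, -12.2,  9.3),
                c(1.00, -0.10, -0.10, 1.70, -11.9,  9.1))
)

# Long format: every ellipse point carries its group and its draw number
ellipse_df <- do.call(rbind, lapply(names(posterior), function(g) {
  do.call(rbind, lapply(seq_len(nrow(posterior[[g]])), function(j) {
    draw <- posterior[[g]][j, ]
    cbind(ellipse_points(matrix(draw[1:4], 2, 2), draw[5:6]), group = g, draw = j)
  }))
}))

p <- ggplot(ellipse_df,
            aes(x, y, colour = group, group = interaction(group, draw))) +
  geom_path(alpha = 0.5)
print(p)
```

The group aesthetic is what keeps the ellipses separate; without it all points of one colour would be joined into a single path. geom_path is used here so that alpha affects the outline; geom_polygon would work too if you want a translucent fill.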
I have 36 different data frames that contain dX and dY variables. I have stored them in a list and want to display them all on the same graph with x = dX and y = dY.
The 36 data frames do not share the same dX values. They roughly cover the same range but don't have the exact same values, so using a merge creates a ton of NA values. The number of rows are however identical.
I tried something ugly that almost works:
g <- ggplot()
for (i in 1:36) {
g <- g + geom_line(data = df.list[[i]], aes(dX, dY, colour = i))
}
print(g)
This displays the curves correctly, but the colours are not applied (and I don't have an appropriate legend). OK, 36 lines in the legend might not be practical. In that case I would reduce the number of lines to draw.
Second approach: I tried melting the data frames as follows.
df <- melt(df.list, id.vars = "dX")
ggplot(df, aes(x = dX, y = value, colour = L1)) + geom_line()
But this creates a 4-variable data frame with columns: dX, variable (always equal to dY), value (here are the dY values) and L1, which contains the index of the data frame in the list.
Here are the first lines of the melted data frame:
dX variable value L1
1 4.952296 dY 6.211485e-05 1
2 6.766889 dY 7.661041e-05 1
3 8.581481 dY 9.550221e-05 1
4 10.396074 dY 1.192053e-04 1
5 12.210666 dY 1.498834e-04 1
6 14.025259 dY 1.883612e-04 1
7 15.839851 dY 2.365646e-04 1
8 17.654444 dY 2.956796e-04 1
9 19.469036 dY 3.662252e-04 1
10 21.283629 dY 4.470143e-04 1
There are several problems here:
"variable" is always equal to dY. What I was expecting was the index of the data frame in the list (which is stored in L1), or even better, the result of a function name(i)
The curve uses a continuous scale, ranging from 1 to 36 while I wanted a discrete scale
Finally, using the geom_line() does not seem to draw the data frames curves individually, but links the points of different data sets together
Any idea how to solve my problem?
I would combine the data.frame into one large data.frame, add an id column, and then plot with ggplot. Lots of ways to do this, here is one:
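library(ggplot2)
# stack all data frames and tag each row with the index of its source data frame
newDF <- do.call(rbind, df.list)
newDF$id <- factor(rep(seq_along(df.list), times = sapply(df.list, nrow)))
g <- ggplot(newDF, aes(x = dX, y = dY, colour = id))
g <- g + geom_line()
print(g)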
It seems like the most straightforward option would be to create a single data frame (as suggested by one of the commenters) and use the index of the source data frame for the colour aesthetic:
library(dplyr) # For bind_rows() function
ggplot(bind_rows(df.list, .id="id"), aes(dX, dY, colour=id)) +
geom_line()
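In the code above, .id="id" causes bind_rows to include a column called id containing the names of the list elements that hold each of the data frames (or their positions in the list, if the list is unnamed). Because id is a character column, ggplot treats it as discrete, so each data frame gets its own line and legend entry.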
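For a quick check of what bind_rows produces, here is a small self-contained sketch. The two toy data frames and the list names "run_a" and "run_b" are made up; only the dX and dY column names come from the question.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for df.list: same number of rows, different dX values
df.list <- list(
  run_a = data.frame(dX = c(1.0, 2.0, 3.0), dY = c(1, 4, 9)),
  run_b = data.frame(dX = c(1.5, 2.5, 3.5), dY = c(2, 3, 5))
)

# One long data frame; "id" records which list element each row came from
combined <- bind_rows(df.list, .id = "id")
print(combined)

p <- ggplot(combined, aes(dX, dY, colour = id)) +
  geom_line()
print(p)
```

Since no merge on dX is involved, the differing dX values never produce NA rows.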
Let's say I have a histogram with two overlapping groups. Here's a possible command from ggplot2 and a pretend output graph.
ggplot(data, aes(x=Variable1, fill=BinaryVariable)) + geom_histogram(position="identity")
So what I have is the frequency or count of each event. What I'd like to do instead is to get the difference between the two events in each bin. Is this possible? How?
For example, if we do RED minus BLUE:
Value at x=2 would be ~ -10
Value at x=4 would be ~ 40 - 200 = -160
Value at x=6 would be ~ 190 - 25 = 155
Value at x=8 would be ~ 10
I'd prefer to do this using ggplot2, but another way would be fine. My dataframe is set up with items like this toy example (dimensions are actually 25000 rows x 30 columns) EDITED: Here is example data to work with GIST Example
ID Variable1 BinaryVariable
1 50 T
2 55 T
3 51 N
.. .. ..
1000 1001 T
1001 1944 T
1002 1042 N
As you can see from my example, I'm interested in a histogram to plot Variable1 (a continuous variable) separately for each BinaryVariable (T or N). But what I really want is the difference between their frequencies.
So, in order to do this we need to make sure that the "bins" we use for the histograms are the same for both levels of your indicator variable. Here's a somewhat naive solution (in base R):
df = data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
x = rep(c(0,1), each = 50))
#full hist
fullhist = hist(df$y, breaks = 20) #specify more breaks than probably necessary
#create histograms for 0 & 1 using breaks from full histogram
zerohist = with(subset(df, x == 0), hist(y, breaks = fullhist$breaks))
oneshist = with(subset(df, x == 1), hist(y, breaks = fullhist$breaks))
#combine the hists
combhist = fullhist
combhist$counts = zerohist$counts - oneshist$counts
plot(combhist)
So we specify how many breaks should be used (based on values from the histogram on the full data), and then we compute the differences in the counts at each of those breaks.
PS It might be helpful to examine what the non-graphical output of hist() is.
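As a rough illustration of that last point, the sketch below calls hist() with plot = FALSE so that only the computed object is returned, then feeds the per-bin difference to ggplot. The simulated data and the variable names are assumptions made for the example.

```r
library(ggplot2)

set.seed(42)
y <- c(rnorm(50), rnorm(50, mean = 1))
g <- rep(c(0, 1), each = 50)

# Shared breaks computed from the full data, without drawing anything
breaks <- hist(y, breaks = 20, plot = FALSE)$breaks

h0 <- hist(y[g == 0], breaks = breaks, plot = FALSE)
h1 <- hist(y[g == 1], breaks = breaks, plot = FALSE)

# The object is a list; counts, breaks and mids are the useful parts here
print(names(h0))

diff_df <- data.frame(mid = h0$mids, diff = h0$counts - h1$counts)

p <- ggplot(diff_df, aes(mid, diff)) +
  geom_col()
print(p)
```

Because both groups have 50 observations and share the same bins, the differences sum to zero, which is a handy check that no values fell outside the breaks.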
Here's a solution that uses ggplot as requested.
The key idea is to use ggplot_build to get the rectangles computed by stat_bin (the stat behind geom_histogram). From that you can compute the differences in each bin and then create a new plot using geom_rect.
setup and create a mock dataset with lognormal data
library(ggplot2)
library(data.table)
theme_set(theme_bw())
n1<-500
n2<-500
k1 <- exp(rnorm(n1,8,0.7))
k2 <- exp(rnorm(n2,10,1))
df <- data.table(k=c(k1,k2),label=c(rep('k1',n1),rep('k2',n2)))
Create the first plot
p <- ggplot(df, aes(x=k,group=label,color=label)) + geom_histogram(bins=40) + scale_x_log10()
Get the rectangles using ggplot_build
p_data <- as.data.table(ggplot_build(p)$data[[1]])[,.(count,xmin,xmax,group)]
p1_data <- p_data[group==1]
p2_data <- p_data[group==2]
Join on the x-coordinates to compute the differences. Note that the y-values aren't the counts, but the y-coordinates of the first plot.
newplot_data <- merge(p1_data, p2_data, by=c('xmin','xmax'), suffixes = c('.p1','.p2'))
newplot_data <- newplot_data[,diff:=count.p1 - count.p2]
setnames(newplot_data, old=c('count.p1','count.p2'), new=c('k1','k2'))
df2 <- melt(newplot_data,id.vars =c('xmin','xmax'),measure.vars=c('k1','diff','k2'))
make the final plot
ggplot(df2, aes(xmin=xmin,xmax=xmax,ymax=value,ymin=0,group=variable,color=variable)) + geom_rect()
Of course the scales and legends still need to be fixed, but that's a different topic.
I'm a bit out of my depth with this one here. I have the following code that generates two equally sized matrices:
MAX<-100
m<-5
n<-40
success<-matrix(runif(m*n,0,1),m,n)
samples<-floor(MAX*matrix(runif(m*n),m))+1
The success matrix is the probability of success and the samples matrix is the corresponding number of samples that was observed in each case. I'd like to make a bar graph that groups each column together with the height being determined by the success matrix. The color of each bar needs to be a color (scaled from 1 to MAX) that corresponds to the number of observations (i.e., small samples would be more red, for instance, whereas high samples would be green perhaps).
Any ideas?
Here is an example with ggplot. First, get data into long format with melt:
library(reshape2)
data.long <- cbind(melt(success), melt(samples)[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
# group x success count
# 1 1 1 0.48513473 8
# 2 2 1 0.56583802 58
# 3 3 1 0.34541582 40
# 4 4 1 0.55829073 64
# 5 5 1 0.06455401 37
# 6 1 2 0.88928606 78
Note melt will iterate through the row/column combinations of both matrices the same way, so we can just cbind the resulting molten data frames. The [3] after the second melt is so we don't end up with repeated group and x values (we only need the counts from the second melt). Now let ggplot do its thing:
library(ggplot2)
ggplot(data.long, aes(x=x, y=success, group=group, fill=count)) +
geom_bar(position="stack", stat="identity") +
scale_fill_gradient2(
low="red", mid="yellow", high="green",
midpoint=mean(data.long$count)
)
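The claim above that melt walks both matrices in the same order is easy to check on a small case. This is just a sanity-check sketch with two made-up 2 x 2 matrices; Var1 and Var2 are the default column names reshape2 gives to the row and column indices of an unnamed matrix.

```r
library(reshape2)

# Two tiny matrices with the same shape (values are arbitrary)
success <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2)
samples <- matrix(c(10, 20, 30, 40), nrow = 2)

m.success <- melt(success)
m.samples <- melt(samples)

# The index columns agree row for row, so cbind lines the values up correctly
same_order <- identical(m.success[, c("Var1", "Var2")],
                        m.samples[, c("Var1", "Var2")])
print(same_order)

data.long <- cbind(m.success, m.samples[3])
names(data.long) <- c("group", "x", "success", "count")
print(data.long)
```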
Using @BrodieG's data.long, this plot might be a little easier to interpret.
library(ggplot2)
library(RColorBrewer) # for brewer.pal(...)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=count),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)
Note that actual values are probably different because you use random numbers in your sample. In future, consider using set.seed(n) to generate reproducible random samples.
Edit [Response to OP's comment]
You get numbers for x-axis and facet labels because you start with matrices instead of data.frames. So convert success and samples to data.frames, set the column names to whatever your test names are, and prepend a group column with the "list of factors". Converting to long format is a little different now because the first column has the group names.
library(reshape2)
set.seed(1)
success <- data.frame(matrix(runif(m*n,0,1),m,n))
success <- cbind(group=rep(paste("Factor",1:nrow(success),sep=".")),success)
samples <- data.frame(floor(MAX*matrix(runif(m*n),m))+1)
samples <- cbind(group=success$group,samples)
data.long <- cbind(melt(success,id=1), melt(samples, id=1)[3])
names(data.long) <- c("group", "x", "success", "count")
One way to set a threshold color is to add a column to data.long and use that for fill:
threshold <- 25
data.long$fill <- with(data.long,ifelse(count>threshold,max(count),count))
Putting it all together:
library(ggplot2)
library(RColorBrewer)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=fill),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)+
theme(axis.text.x=element_text(angle=-90,hjust=0,vjust=0.4))
Finally, when you have names for the x-axis labels they tend to get jammed together, so I rotated the names -90°.
I'm running a monte-carlo simulation and the output is in the form:
> d = data.frame(iter=seq(1, 2), k1 = c(0.2, 0.6), k2=c(0.3, 0.4))
> d
iter k1 k2
1 0.2 0.3
2 0.6 0.4
The plots I want to generate are:
plot(d$iter, d$k1)
plot(density(d$k1))
I know how to make the equivalent plots using ggplot2 by first converting to a long-format data frame:
new_d = data.frame(iter=rep(d$iter, 2),
k = c(d$k1, d$k2),
label = rep(c('k1', 'k2'), each=2))
then plotting is easy. However the number of iterations can be very large and the number of k's can also be large. This means messing about with a very large data frame.
Is there any way I can avoid creating this new data frame?
Thanks
Short answer is "no," you can't avoid creating a data frame. ggplot requires the data to be in a data frame. If you use qplot, you can give it separate vectors for x and y, but internally, it's still creating a data frame out of the parameters you pass in.
I agree with juba's suggestion -- learn to use the reshape function, or better yet the reshape package with melt/cast functions. Once you get fast with putting your data in long format, creating amazing ggplot graphs becomes one step closer!
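For what it's worth, the long-format conversion is a one-liner with melt, however many k columns there are. A minimal sketch, assuming the reshape2 package (the successor to reshape) and reusing the question's d; the column names label and k are chosen to match the question's new_d.

```r
library(reshape2)
library(ggplot2)

d <- data.frame(iter = seq(1, 2), k1 = c(0.2, 0.6), k2 = c(0.3, 0.4))

# Every column except iter is stacked into (label, k) pairs
long <- melt(d, id.vars = "iter", variable.name = "label", value.name = "k")
print(long)

p <- ggplot(long, aes(iter, k, colour = label)) +
  geom_point()
print(p)
```

The same call works unchanged if d gains more k columns, since melt treats everything not listed in id.vars as a measured variable.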
Yes, it is possible for you to avoid creating a data frame: just give an empty argument list to the base layer, ggplot(). Here is a complete example based on your code:
library(ggplot2)
d = data.frame(iter=seq(1, 2), k1 = c(0.2, 0.6), k2=c(0.3, 0.4))
# desired plots:
# plot(d$iter, d$k1)
# plot(density(d$k1))
ggplot() + geom_point(aes(x = d$iter, y = d$k1))
# there is not enough data for a good density plot,
# but this is how you would do it:
ggplot() + geom_density(aes(d$k1))
Note that although this allows for you not to create a data frame, a data frame might still be created internally. See, e.g., the following extract from ?geom_point:
All objects will be fortified to produce a data frame.
You can use the reshape function to transform your data frame to "long" format. Maybe it is a bit faster than your code?
R> reshape(d, direction="long",varying=list(c("k1","k2")),v.names="k",times=c("k1","k2"))
iter time k id
1.k1 1 k1 0.2 1
2.k1 2 k1 0.6 2
1.k2 1 k2 0.3 1
2.k2 2 k2 0.4 2
So just to add to the previous answers. With qplot you could do
p <- qplot(y=d$k2, x=d$k1)
and then from there building it further, e.g. with
p + theme_bw()
But I agree - melt/cast is generally the way forward.
Just pass NULL as the data frame, and define the necessary aesthetics using the data vectors. Quick example:
library(MASS)
library(tidyverse)
library(ranger)
rf <- ranger(medv ~ ., data = Boston, importance = "impurity")
rf$variable.importance
ggplot(NULL, aes(x = fct_reorder(names(rf$variable.importance), rf$variable.importance),
y = rf$variable.importance)) +
geom_col(fill = "navy blue", alpha = 0.7) +
coord_flip() +
labs(x = "Predictor", y = "Importance", title = "Random Forest") +
theme_bw()