Can I avoid using data frames in ggplot2? - r

I'm running a monte-carlo simulation and the output is in the form:
> d = data.frame(iter=seq(1, 2), k1 = c(0.2, 0.6), k2=c(0.3, 0.4))
> d
iter k1 k2
1 0.2 0.3
2 0.6 0.4
The plots I want to generate are:
plot(d$iter, d$k1)
plot(density(d$k1))
I know how to do the equivalent plots using ggplot2: first convert to a long-format data frame,
new_d = data.frame(iter=rep(d$iter, 2),
k = c(d$k1, d$k2),
label = rep(c('k1', 'k2'), each=2))
then plotting is easy. However the number of iterations can be very large and the number of k's can also be large. This means messing about with a very large data frame.
Is there any way I can avoid creating this new data frame?
Thanks

Short answer is "no," you can't avoid creating a data frame. ggplot requires the data to be in a data frame. If you use qplot, you can give it separate vectors for x and y, but internally, it's still creating a data frame out of the parameters you pass in.
I agree with juba's suggestion -- learn to use the reshape function, or better yet the reshape package with melt/cast functions. Once you get fast with putting your data in long format, creating amazing ggplot graphs becomes one step closer!
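As a minimal sketch of the long-format approach recommended here, assuming the reshape2 package is installed and using the question's small d as stand-in data (the names d_long, p_points and p_density are only illustrative):

```r
library(reshape2)
library(ggplot2)

# Example data from the question (wide format: one column per k)
d <- data.frame(iter = seq(1, 2), k1 = c(0.2, 0.6), k2 = c(0.3, 0.4))

# melt() stacks every non-id column, so this works for any number of k columns
d_long <- melt(d, id.vars = "iter", variable.name = "label", value.name = "k")

# One scatter plot and one density plot, with each k series shown in its own colour
p_points  <- ggplot(d_long, aes(x = iter, y = k, colour = label)) + geom_point()
p_density <- ggplot(d_long, aes(x = k, colour = label)) + geom_density()
```

Printing p_points or p_density draws the plot. The long data frame has one row per (iter, k) combination, so it does grow with both the number of iterations and the number of k's.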

Yes, it is possible for you to avoid creating a data frame: just give an empty argument list to the base layer, ggplot(). Here is a complete example based on your code:
library(ggplot2)
d = data.frame(iter=seq(1, 2), k1 = c(0.2, 0.6), k2=c(0.3, 0.4))
# desired plots:
# plot(d$iter, d$k1)
# plot(density(d$k1))
ggplot() + geom_point(aes(x = d$iter, y = d$k1))
# there is not enough data for a good density plot,
# but this is how you would do it:
ggplot() + geom_density(aes(d$k1))
Note that although this lets you avoid creating a data frame yourself, a data frame might still be created internally. See, e.g., the following extract from ?geom_point:
All objects will be fortified to produce a data frame.

You can use the reshape function to transform your data frame to "long" format. Maybe it is a bit faster than your code?
R> reshape(d, direction="long",varying=list(c("k1","k2")),v.names="k",times=c("k1","k2"))
iter time k id
1.k1 1 k1 0.2 1
2.k1 2 k1 0.6 2
1.k2 1 k2 0.3 1
2.k2 2 k2 0.4 2

So just to add to the previous answers. With qplot you could do
p <- qplot(y=d$k2, x=d$k1)
and then from there building it further, e.g. with
p + theme_bw()
But I agree - melt/cast is generally the way forward.

Just pass NULL as the data frame, and define the necessary aesthetics using the data vectors. Quick example:
library(MASS)
library(tidyverse)
library(ranger)
rf <- ranger(medv ~ ., data = Boston, importance = "impurity")
rf$variable.importance
ggplot(NULL, aes(x = fct_reorder(names(rf$variable.importance), rf$variable.importance),
y = rf$variable.importance)) +
geom_col(fill = "navy blue", alpha = 0.7) +
coord_flip() +
labs(x = "Predictor", y = "Importance", title = "Random Forest") +
theme_bw()

Related

Plotting a facet grid in R using ggplot2 with only one variable

I have a data frame, called mouse.data, with 3 columns: Eigenvalues, DualEigenvalues and Experiment. This question does not concern the DualEigenvalues data, so that can be forgotten.
We ran 5 experiments and used the data from each experiment to calculate 14 eigenvalues. So the first 14 rows of this data frame are the 14 eigenvalues of the first experiment, with the experiment entry having value 1, the second 14 rows are the 14 eigenvalues of the second experiment with the experiment entry having value 2 etc.
I am then plotting the eigenvalues of each pairwise experiment against each other, here is an example of this code:
eigen.1 <- mouse.data$Eigenvalues[mouse.data$Experiment == 1]
eigen.2 <- mouse.data$Eigenvalues[mouse.data$Experiment == 2]
p.data <- data.frame(x = eigen.1, y = eigen.2)
ggplot(p.data, aes(x,y)) + geom_abline(slope = 1, colour = "red") + geom_point()
This gives me a graph like this one:
This is precisely what I want this graph to look like.
What I would like to do, but can't work out, is to plot a facet_grid so that the plot in the ith row and jth column plots the eigenvalues from the ith experiment on the y-axis and the eigenvalues from the jth experiment on the x-axis.
This is the closest I have got so far, I hope this makes it clearer what I mean.
This is tricky without a reproducible example of your data, but it sounds like we can roughly approximate the structure of your data frame like this:
library(ggplot2)
set.seed(1)
Eigen <- as.vector(sapply(runif(5, .5, 1.5),
function(x) sort(rgamma(14, 2, 0.02*x))))
mouse.data <- data.frame(Experiment = rep(seq(5), each = 14), Eigenvalue = Eigen)
head(mouse.data)
#> Experiment Eigenvalue
#> 1 1 39.61451
#> 2 1 44.48163
#> 3 1 54.57964
#> 4 1 75.06725
#> 5 1 75.50014
#> 6 1 94.41255
The key to getting the plot to work is to reshape your data into a long-format data frame that contains each combination of experiments. One way to do this is to split the data frame by Experiment, then use simple indexing of the resultant list (using rep) to get every ordered pair of data frames (5 x 5 = 25 pairs, including each experiment paired with itself). Each pair is stuck together column-wise, then the resultant 25 data frames are all joined row-wise into the plotting data frame.
experiments <- split(mouse.data, mouse.data$Experiment)
experiments <- mapply(cbind,
experiments[rep(1:5, 5)],
experiments[rep(1:5, each = 5)],
SIMPLIFY = FALSE)
p.data <- do.call(rbind, lapply(experiments, setNames,
nm = c("Experiment1", "x",
"Experiment2", "y")))
Once we have done this, we can use your plot code, with the addition of a facet_grid call:
ggplot(p.data, aes(x,y)) +
geom_abline(slope = 1, colour = "red") +
geom_point() +
facet_grid(Experiment1~Experiment2)
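An alternative way to build the same kind of pairwise data frame is a self-merge on the position of each eigenvalue within its experiment. This is only a sketch under assumptions: the idx column is a helper introduced here, and the data are random stand-ins with the same shape as above (5 experiments with 14 values each).

```r
library(ggplot2)

set.seed(1)
# Stand-in data: 5 experiments x 14 eigenvalues (random values, for illustration only)
mouse.data <- data.frame(Experiment = rep(1:5, each = 14),
                         Eigenvalue = rexp(70))

# Position of each eigenvalue within its experiment (1..14); hypothetical helper column
mouse.data$idx <- ave(seq_along(mouse.data$Experiment),
                      mouse.data$Experiment, FUN = seq_along)

# Merging the data with itself on idx pairs the k-th eigenvalue of every
# experiment with the k-th eigenvalue of every other experiment (and itself)
p.data <- merge(mouse.data, mouse.data, by = "idx", suffixes = c("1", "2"))

# Row i shows experiment i on the y-axis, column j shows experiment j on the x-axis
p <- ggplot(p.data, aes(x = Eigenvalue2, y = Eigenvalue1)) +
  geom_abline(slope = 1, colour = "red") +
  geom_point() +
  facet_grid(Experiment1 ~ Experiment2)
```

The merged data frame has 14 x 25 = 350 rows, one per point in the 5 x 5 grid of panels.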

How to create surface plot in R

I'm currently trying to develop a surface plot that examines the results of the below data frame. I want to plot the increasing values of noise on the x-axis and the increasing values of mu on the y-axis, with the point estimate values on the z-axis. After looking at ggplot2 and ggplotly, it's not clear how I would plot each of these columns in surface or 3D plot.
df <- "mu noise0 noise1 noise2 noise3 noise4 noise5
1 1 0.000000 0.9549526 0.8908646 0.919630 1.034607
2 2 1.952901 1.9622004 2.0317115 1.919011 1.645479
3 3 2.997467 0.5292921 2.8592976 3.034377 3.014647
4 4 3.998339 4.0042379 3.9938346 4.013196 3.977212
5 5 5.001337 4.9939060 4.9917115 4.997186 5.009082
6 6 6.001987 5.9929932 5.9882173 6.015318 6.007156
7 7 6.997924 6.9962483 7.0118066 6.182577 7.009172
8 8 8.000022 7.9981131 8.0010066 8.005220 8.024569
9 9 9.004437 9.0066182 8.9667536 8.978415 8.988935
10 10 10.006595 9.9987245 9.9949733 9.993018 10.000646"
Thanks in advance.
Here's one way using geom_tile(). First, you will want to get your data frame into more of a Tidy format, where the goal is to have columns:
mu: nothing changes here
noise: need to combine your "noise0", "noise1", ... columns together, and
z: serves as the value of the noise and we will apply the fill= aesthetic using this column.
To do that, I'm using gather() and separate() from tidyr together with dplyr, but there are other ways (melt(), or pivot_longer() gets you that too). I'm also adding some code to pull out just the number portion of the "noise" columns and then reformatting that as an integer to ensure that you have x and y axes as numeric/integers:
library(dplyr)
library(tidyr)
# assumes that df is your data as data.frame
df <- df %>% gather(key="noise", value="z", -mu)
# split e.g. "noise0" after the 5th character into "noise" and "0", keep only the number
df <- df %>% separate(col = "noise", into=c('x', "noise"), sep=5) %>% select(-x)
df$noise <- as.integer(df$noise)
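For the pivot_longer() route mentioned above, a minimal sketch might look like this, assuming tidyr 1.0 or later. Here df_wide is a hypothetical stand-in that holds only the first three rows and first two noise columns of the question's data:

```r
library(tidyr)

# Small stand-in for the wide data (first three rows, two noise columns)
df_wide <- data.frame(mu     = 1:3,
                      noise0 = c(1, 2, 3),
                      noise1 = c(0.000000, 1.952901, 2.997467))

# Stack the noise columns; names_prefix strips the "noise" text from the names
df_long <- pivot_longer(df_wide,
                        cols = starts_with("noise"),
                        names_to = "noise",
                        names_prefix = "noise",
                        values_to = "z")

# The stripped names are still character, so convert them for a numeric axis
df_long$noise <- as.integer(df_long$noise)
```

This does the gather, separate and select steps in a single call and leaves the same three columns: mu, noise and z.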
Here's an example of how you could plot it, but aesthetics are up to you. I decided to also include geom_text() to show the actual values of df$z so that we can see better what's going on. Also, I'm using rainbow because "it's pretty" - you may want to choose a more appropriate quantitative comparison scale from the RColorBrewer package.
ggplot(df, aes(x=noise, y=mu, fill=z)) + theme_bw() +
geom_tile() +
geom_text(aes(label=round(z, 2))) +
scale_fill_gradientn(colors = rainbow(5))
EDIT: To answer OP's follow up, yes, you can also showcase this via plotly. Here's a direct transition:
library(plotly)
p <- plot_ly(
df, x= ~noise, y= ~mu, z= ~z,
type='mesh3d', intensity = ~z,
colors= colorRamp(rainbow(5))
)
p
Static image here:
A much more informative way to show this particular set of information is to see the variation of df$z as it relates to df$mu by creating df$delta_z and then using that to plot. (you can also plot via ggplot() + geom_tile() as above):
df$delta_z <- df$z - df$mu
p1 <- plot_ly(
df, x= ~noise, y= ~mu, z= ~delta_z,
type='mesh3d', intensity = ~delta_z,
colors= colorRamp(rainbow(5))
)
Giving you this (static image here):
ggplot accepts data in the long format, which means that you need to melt your dataset using, for example, a function from the reshape2 package:
dfLong = melt(df,
id.vars = "mu",
variable.name = "noise",
value.name = "meas")
The resulting column noise contains entries such as noise0, noise1, etc. You can extract the numbers and convert to a numeric column:
dfLong$noise = with(dfLong, as.numeric(gsub("noise", "", noise)))
This converts your data to:
mu noise meas
1 1 0 1.0000000
2 2 0 2.0000000
3 3 0 3.0000000
...
As per ggplot documentation:
ggplot2 can not draw true 3D surfaces, but you can use geom_contour(), geom_contour_filled(), and geom_tile() to visualise 3D surfaces in 2D.
So, for example:
ggplot(dfLong,
aes(x = noise,
y = mu,
fill = meas)) +
geom_tile() +
scale_fill_gradientn(colours = terrain.colors(10))
Produces:

How to make a single plot from two dataframes with ggplot2

I have 2 datasets, called A and B.
I want to compare the distribution of one common variable, called k, which shows up in both datasets but with different lengths (A contains 2000 values of k, while B has 1000; both have some N/A). So I would like to plot the distributions of A$k and B$k in the same plot.
I have tried:
g1 <- ggplot(A, aes(x=A$k)) + geom_density()
g2 <- ggplot(B, aes(x=B$k)) + geom_density()
g <- g1 + g2
But then comes the error:
Don't know how to add o to a plot.
How can I overcome this problem?
Since we don't have any data, it is hard to provide a specific solution for your scenario, but below is the general principle of what I think you are trying to do.
The trick is to put your data together and add another column that identifies group A and group B. This column is then used in the aes() argument in ggplot. Bear in mind that combining your data frames might not be as simple as what I have done here, since you might have some extra columns, etc.
# generating some pseudo data from a poisson distribution
A <- data.frame(k = rpois(2000, 4))
B <- data.frame(k = rpois(1000, 7))
# Create identifier
A$id <- "A"
B$id <- "B"
A_B <- rbind(A, B)
g <- ggplot(data = A_B, aes(x = k,
group = id, colour = id, fill = id)) + # fill/colour aes is not required
geom_density(alpha = 0.6) # alpha for some special effects
g
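If the two data frames do not share exactly the same columns, one possible variation is dplyr's bind_rows(), which adds the identifier column for you and fills columns missing from one input with NA. A minimal sketch, assuming dplyr is installed and reusing the same pseudo data (the seed is arbitrary):

```r
library(dplyr)
library(ggplot2)

set.seed(42)
# Pseudo data from Poisson distributions, as in the example above
A <- data.frame(k = rpois(2000, 4))
B <- data.frame(k = rpois(1000, 7))

# The argument names ("A", "B") become the values of the new id column
A_B <- bind_rows(A = A, B = B, .id = "id")

# na.rm = TRUE drops missing k values without a warning
g <- ggplot(A_B, aes(x = k, colour = id, fill = id)) +
  geom_density(alpha = 0.6, na.rm = TRUE)
```

Because each density is normalised separately, the different lengths of A and B do not distort the comparison.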
I can't tell you exactly what to do without knowing what your data sets actually look like, but merging the data sets into one and then calling ggplot() with the group or colour aesthetic is one way to compare them.
Another way is to use grid.arrange() from the gridExtra package, which draws the two plots as separate panels rather than overlaying them:
gridExtra::grid.arrange(g1, g2)
This function is easy to use and quite convenient. If you want to know more about the gridExtra package, see its official documentation.

Reshaping data in R with time values in column names

I have a data frame which looks like this (simplified):
data1.time1 data1.time2 data2.time1 data2.time2 data3.time1 group
1 1.53 2.01 6.49 5.22 3.46 A
...
24 2.12 3.14 4.96 4.89 3.81 C
where there are actually dataK.timeT for K in 1..27 and T in some (but maybe not all) of 1..8.
I would like to rearrange the data into K data frames so that I can plot, for each K, the summary data (for now let's say mean and mean ± standard deviation) for each of the three groups A, B, and C. That is, I want 27 graphs with three lines per graph, and also marks for the deviations.
Once I rearrange the data it should be easy enough to collapse by group, compute summary statistics, etc. But I'm not really sure how to get the data into this form. I looked at the reshape package, which suggests melting it into a key-value store format and rearranging from there, but it doesn't seem to support the columns containing the T values as I have here.
Is there a good way to do this? I'm quite willing to use something other than R to do this, since I can just import the results into R after transforming.
After creating fake data with a structure similar to yours, we convert from wide to long format, making a "tidy" data frame that is ready for plotting with ggplot2.
library(reshape2)
library(ggplot2)
library(dplyr)
Create fake data
set.seed(194)
dat = data.frame(replicate(27*8, cumsum(rnorm(24*3))))
names(dat) = paste0(rep(paste0("data",1:27), each=8), ".", rep(paste0("time",1:8), 27))
dat$group = rep(LETTERS[1:3], each=24)
Remove some columns so that number of time points will be different for different data sources:
dat = dat[ , -c(2,4,9,43,56,78,100:103,115:116,134:136,202,205)]
Reshape from wide to long format
datl = melt(dat, id.var="group")
Split data source and time point into separate columns:
datl$source = gsub("(.*)\\..*","\\1", datl$variable)
datl$time = as.numeric(gsub(".*time(.*)","\\1", datl$variable))
# Order data frame names by number (rather than alphabetically)
datl$source = factor(datl$source, levels=paste0("data",1:length(unique(datl$source))))
Plot the data using ggplot2
# Helper function for plotting standard deviation
sdFnc = function(x) {
vals = c(mean(x) - sd(x), mean(x) + sd(x))
names(vals) = c("ymin", "ymax")
vals
}
pd = position_dodge(0.7)
ggplot(datl, aes(time, value, group=group, color=group)) +
stat_summary(fun=mean, geom="line", position=pd) + # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
stat_summary(fun.data=sdFnc, geom="errorbar", width=0.4, position=pd) +
stat_summary(fun=mean, geom="point", position=pd) +
facet_wrap(~source, ncol=3) +
theme_bw()
Original (unnecessarily complicated) reshaping code. (Note, this code will no longer work with the updated (fake) data set, because the number of time columns is no longer uniform):
# Convert data source from wide to long
datl = data.frame()
for (i in seq(1,27*8,8)) {
tmp.dat = dat[, c(i:(i+7),grep("group",names(dat)))]
tmp.dat$source = gsub("(.*)\\..*", "\\1", names(tmp.dat)[1])
names(tmp.dat)[1:8] = 1:8
#datl = rbind(datl, tmp.dat)
datl = bind_rows(datl, tmp.dat) # Updated based on comment
}
datl$source = factor(datl$source, levels=paste0("data",1:27))
# Convert time from wide to long
datl = melt(datl, id.var = c("source","group"), variable.name="time")
Could do something like this with dplyr (here data is your wide data frame):
library(dplyr)
K <- 27
summaries <- vector("list", K)
for (i in 1:K) {
  # regex: columns named "data<i>.time<t>", plus the group column
  my.data.ind <- paste0("^data", i, "\\.|^group$")
  summaries[[i]] <- data %>%
    select(matches(my.data.ind)) %>% ## grab cols that match
    group_by(group) %>% ## group by your group
    summarise(across(everything(), list(mean = mean, sd = sd))) ## mean and sd of each col within each group
}
That should leave you, for each K, with a data frame of three rows (one per group) holding the mean and standard deviation at each time point T.

Graphing a Multi-Series Bar/Dot Plot with R

So I'm having trouble creating a dot plot/bar graph of this data set I have. My data set looks like this. I want an output that looks like this. However, geom_bar() through ggplot will only give me counts, and won't take the individual decimal values from the table. I've tried using Plotly as well, but it doesn't seem to scale well to plots with multiple players.
I've already set up a larger data frame with 200+ variables. I'm trying to make something that can search for specific players in that data frame, and then create a plot from it. Consequently, I'm ideally looking for something that can easily handle 5-10 different series.
Any help would be greatly appreciated.
Thanks!
This is pretty straightforward. The key is to get your data from its current wide format into the long format that is more useful for plotting in R, and to use geom_point rather than geom_bar.
First, some reproducible example data (you should include something like this if you post another question here; it makes it much easier for others to help you):
library(ggplot2)
library(reshape2)
dataset <- data.frame(
PlayerName = letters[1:6],
IsolationPossG = runif(6),
HandoffPossG = runif(6),
OffScreenPossG = runif(6)
)
This is your current data, in the wide format:
dataset
PlayerName IsolationPossG HandoffPossG OffScreenPossG
1 a 0.78184751 0.939183520 0.74461784
2 b 0.06557433 0.745699149 0.96540299
3 c 0.21105745 0.753534811 0.02977973
4 d 0.41271918 0.555475622 0.18317886
5 e 0.38153149 0.246292074 0.74862310
6 f 0.89946318 0.008412111 0.53195933
Now we convert to the long format:
molten <- melt(
dataset,
id.vars = "PlayerName",
measure.vars = c("IsolationPossG", "HandoffPossG", "OffScreenPossG")
)
Here is the long format, much more useful for plotting in R:
head(molten)
PlayerName variable value
1 a IsolationPossG 0.78184751
2 b IsolationPossG 0.06557433
3 c IsolationPossG 0.21105745
4 d IsolationPossG 0.41271918
5 e IsolationPossG 0.38153149
6 f IsolationPossG 0.89946318
Here's how to plot it:
ggplot(molten, aes(x = variable, y = value, colour = PlayerName)) +
geom_point(size = 4) +
theme_bw() +
theme(legend.position="bottom",legend.direction="horizontal")
Which gives:
h/t how to have multple labels in ggplot2 for bubble plot
If you want the shape of the data point to vary by name, as your example image shows (but it seems rather excessive to have the player name variable on two of the plot's aesthetics):
ggplot(molten, aes(x = variable, y = value, shape = PlayerName, colour = PlayerName)) +
geom_point(size = 4) +
theme_bw() +
theme(legend.position="bottom",legend.direction="horizontal")
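If bars are preferred after all, the counting behaviour described in the question can be avoided with geom_col(), which plots the values themselves instead of counting rows. A minimal sketch, using a tiny made-up long-format data frame in place of molten (the numbers are illustrative only):

```r
library(ggplot2)

# Tiny made-up stand-in for the long-format data (values are illustrative)
molten_small <- data.frame(
  PlayerName = rep(c("a", "b"), times = 2),
  variable   = rep(c("IsolationPossG", "HandoffPossG"), each = 2),
  value      = c(0.78, 0.07, 0.94, 0.75)
)

# geom_col() uses y as the bar height; position = "dodge" puts players side by side
p_bars <- ggplot(molten_small, aes(x = variable, y = value, fill = PlayerName)) +
  geom_col(position = "dodge") +
  theme_bw()
```

With 5-10 players the dodged bars get narrow, so the dot plot above may stay easier to read.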
