Summarising data before plotting with geom_tile() renders different results - r

I have noticed that when plotting with ggplot2's geom_tile(), summarising the data before plotting renders a completely different result than when it is not pre-summarised. I don't understand why.
For a dataframe with three columns, year (character), state (character) and profit (numeric), consider the following examples:
# Plot straight away
data %>%
ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit))
# Summarise before plotting
data %>% group_by(year, state) %>% summarize(profit_mean = mean(profit)) %>%
ungroup() %>%
ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit_mean))
These two examples render two different tile plots - the values are quite different. I thought that these two methods of plotting would be analogous and that ggplot2 would take a mean automatically - is that not so?
I tried reproducing this error on a smaller subset of data, but it didn't appear. What could be going on here?

OP, this was a very interesting question.
First, let's get this out of the way. It is clear what plotting the summary of your data is plotting just that: the summary. You are summarizing via mean, so what is plotted equals the mean of the values for each tile.
The actual question here is: If you have a dataset containing more than one value per tile, what is the result of plotting the "non-summarized" dataset?
User #akrun is correct: the default stat used for geom_tile is stat="identity", but it might not be clear what that exactly means. It says it "leaves the data unchanged"... but that's not clear what that means here.
Illustrative Example Dataset
For purposes of demonstration, I'll create an illustrative dataset, which will answer the question very clearly. I'm creating two individual datasets df1 and df2, which each contain 4 "tiles" of data. The difference between these is that the values themselves for the tiles are different. I've include text labels on each tile for more clarity.
library(ggplot2)
library(cowplot)
df1 <- data.frame(
x=rep(paste("Test",1:2), 2),
y=rep(c("A", "B"), each=2),
value=c(5,15,20,25)
)
df2 <- data.frame(
x=rep(paste("Test",1:2), 2),
y=rep(c("A", "B"), each=2),
value=c(10,5,25,15)
)
tile1 <- ggplot(df1, aes(x,y, fill=value, label=value)) +
geom_tile() + geom_text() + labs(title="df1")
tile2 <- ggplot(df2, aes(x,y, fill=value, label=value)) +
geom_tile() + geom_text() + labs(title="df2")
plot_grid(tile1, tile2)
Plotting the Combined Data Frame
Each of the data frames df1 and df2 contain only one value per tile, so in order to see how that changes when we have more than one value per tile, we need to combine them into one so that each tile will contain 2 values. In this example, we are going to combine them in two ways: first df1 then df2, and the other way is df2 first, then df1.
df12 <- rbind(df1, df2)
df21 <- rbind(df2, df1)
Now, if we plot each of those as before and compare, the reason for the discrepancy the OP posted should be quite obvious. I'm including the value for each tile for each originating dataset to make things super-clear.
tile12 <- ggplot(df12, aes(x,y, fill=value, label=value)) +
geom_tile() + labs(title="df1, then df2") +
geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
tile21 <- ggplot(df21, aes(x,y, fill=value, label=value)) +
geom_tile() + labs(title="df2, then df1") +
geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
plot_grid(tile12, tile21)
Note that the legend colorbar value does not change, so it's not doing an addition. Plus, since we know it's stat="identity", we know this should not be the case. When we use the dataset that contains first observations from df1, then observations from df2, the value plotted is the one from df2. When we use the dataset that contains observations first from df2, then from df1, the value plotted is the one from df1.
Given this piece of information, it can be clear that the value shown in geom_tile() when using stat="identity" (default argument) corresponds to the last observation for that particular tile represented in the data frame.
So, that's the reason why your plot looks odd OP. You can either summarize beforehand as you have done, or use stat_summary(geom="tile"... to do the transformation in one go within ggplot.

Related

R: How to ggplot two time series dataframes, one of which contains non-numerical labels and slightly different timeStamps

I have 2 data frames:
df1 <- setNames(data.frame(c(as.POSIXct("2022-07-29 00:00:00","2022-07-29 00:05:00","2022-07-29 00:10:00","2022-07-29 00:15:00","2022-07-29 00:20:00")), c(1,2,3,4,5)), c("timeStamp", "value"))
df2 <- setNames(data.frame(c(as.POSIXct("2022-07-29 00:00:05","2022-07-29 00:05:05","2022-07-29 00:20:05")), c("a","b","c")), c("timeStamp", "text"))
I want to plot them, so as to to have the main graph be a numerical y scale geom_point, and then collate in the second dataframe with the labels (a,b,c) at the correct timeStamps on the continuous time series x axis.
ggplot() +
geom_point(data=df1, aes(x=timeStamp, y= value)) +
geom_text(data=df2, aes(x=timeStamp, y= text))
The difficulty I think lies in the fact that the timeStamps do not perfectly match up, and I keep getting returned with "Error: Discrete value supplied to continuous scale". Can anybody please offer some advice here?
The end result should look something like this (this an example from a much larger dataframe)
labeled time series using labels from different time series dataframe
Thank you
The issue is not the timeStamp but that for the geom_point you are mapping a numeric or continuous variable on y while for the geom_text you map a discrete one on y. Hence you get the error
Error: Discrete value supplied to continuous scale
To fix that map your text on the label aes (which BTW is required for geom_text) and use the y aes to specify the position where you want to add the labels:
library(ggplot2)
ggplot() +
geom_point(data=df1, aes(x=timeStamp, y= value)) +
geom_text(data=df2, aes(x=timeStamp, label = text, y = 6))
DATA
df1 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:00","2022-07-29 00:05:00","2022-07-29 00:10:00","2022-07-29 00:15:00","2022-07-29 00:20:00")), c(1,2,3,4,5)), c("timeStamp", "value"))
df2 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:05","2022-07-29 00:05:05","2022-07-29 00:20:05")), c("a","b","c")), c("timeStamp", "text"))
Update: removed 1. answer:
I am still not sure. Also #stefan's answer seems more correct, but maybe you think of something like this:
If you want to position the labels from df2 on top of the points from df1 conditional to the nearest time points between df1 and df2 then we would need to use roll from data.table. This answer was adapted from here Merging two sets of data by data.table roll='nearest' function
library(data.table)
library(tidyverse)
setDT(df1)
setDT(df2)
# Create time column by which to do a rolling join
df1[, time := timeStamp]
df2[, time := timeStamp]
setkey(df1, time)
setkey(df2, time)
set_merged <- df2[df1, roll = "nearest"]
set_merged %>%
as_tibble() %>%
ggplot(aes(x = time, y=value, group=1)) +
geom_point() +
geom_line()+
geom_text(aes(x=time, y=max(value)+0.1, label=text))+
theme_minimal()

Create multiple stacked violin plots with ggplot

I have a model which runs on different landscapes, once on both together and once on each separately.
I would like to plot the results in violin plots, but I'd like to have both runs side by side in the same plot, and each landscape to have its own violin (so a collective 4 violins in 2 stacks).
Example data:
df1 <- data.frame('means' = 1:6, 'landscape' = rep(c('forest', 'desert', 3)))
df2 <- data.frame('means' = rep(c(1,2), 3), 'landscape' = rep(c('forest', 'desert', 3)))
How I'd like the final product to look like (illustration in MS Paint and I'm a terrible artist):
where green is for forests and gold is got deserts.
Note that this post implictly asks "how to have violin geoms printed above each other?". This is of course answered through the position argument which for violins defaults to 'dodge' - changing it to 'identity' does the trick. Little side remark - asking for "stacking" is actually somewhat misleading as position='stack' will vertically stack them.
This approach in particular avoids the clunky second data set and hence easily works with multiple colors.
library(tidyverse)
mpg %>%
filter(class %in% c("compact","midsize")) %>%
mutate(coloringArgument = paste(drv,class)) %>%
ggplot(aes(as.factor(drv), cty, color=coloringArgument)) +
geom_violin(position = "identity")
Using ggplot - you can add two geom_violin() and in the second one, use new data.
I used mtcars as sample data.
library(tidyverse)
df1 <- mtcars[1:15, ]
df2 <- mtcars[16:31, ]
df1 %>%
ggplot(aes(factor(vs), disp)) +
geom_violin() +
geom_violin(data = df2,
aes(factor(vs), disp))
EDIT
If possible, I think that the easier way would be to combine the data frames, into one and create a key for each of one them for later use.
To combine the data.frames I used bind_rows which binds the data.frames on top of the others. The argument .id = enables me to add a new column with the data.frame's name. Next, inside ggplot you can set the aes color to the data.frame id you gave. This will give different colord for each data.frame. In addition, adding position = "identity" inside geom_violin lets you stack them over each other.
bind_rows(list(df1 = df1, df2 = df2),
.id = "dfName") %>%
ggplot(aes(factor(vs), disp, color = dfName)) +
geom_violin(position = "identity")

GGplot2: plotting values per year of specific columns in a dataframe

I am trying to plot values on the y axis against years on the x axis with ggplot2.
This is the dataset: https://drive.google.com/file/d/1nJYtXPrxD0xvq6rBz2NXlm4Epi52rceM/view?usp=sharing
I want to plot the values of specific countries.
It won't work by just specifying year as the x axis and a country's values on the y axis. I'm reading I need to melt the data frame, so I did that, but it's now in a format that doesn't seem convenient to get the job done.
I'm assuming I haven't correctly melted, but I have a hard time finding what I need to specifically do.
What I did beforehand is manually transpose the data and make the years a column, as well as all the countries.
This is the dataset transposed:
https://drive.google.com/file/d/131wNlubMqVEG9tID7qp-Wr8TLli9KO2Q/view?usp=sharing
Here's how I melted:
inv_melt.data <- melt(investments_t.data, id.vars="Year")
ggplot() +
geom_line(aes(x=Year, y=value), data = inv_melt.data)
The plot shows the aggregated values of all countries per year, but I want them per country in such a manner that I can also select to plot certain countries only.
How do I utilize melt in such a manner? Could someone walk me through this?
There are no columns named "Year" in the linked to data set, there are columns per year. So it need to be melted by "country" and then the "variable" edited with sub.
inv_melt.data <- reshape2::melt(investments_t.data, id.vars="country")
inv_melt.data$variable <- as.integer(sub("^X", "", inv_melt.data$variable))
ggplot(inv_melt.data, aes(variable, value, color = country)) +
geom_line(show.legend = FALSE)
Edit.
The following code keeps only some countries, filtering out the ones with more missing values.
i <- sapply(investments_t.data[-1], function(x) sum(is.na(x)) == 0)
i <- c(1, which(i))
inv_melt.data <- reshape2::melt(investments_t.data[i], id.vars = "Year")
ggplot(inv_melt.data, aes(Year, value, color = variable)) +
geom_line(show.legend = FALSE)

Multiple columns on x-axis in R

I'm quite new to R, and there has been a question similar to mine asked before, however it doesn't quite get to what I need.
I have a table as follows:
I wish to plot the Value, and Threshold alongside each other on the X-axis for each metric, so effectively, I will have three pairs of plots on the X-axis. I have attempted to use reshape2 and ggplot2 for this as follows:
library(reshape2)
df <- melt(msi, id.vars="Average Metric Value (Abbr)")
# I get an error message, but the output seems ok.
library(ggplot2)
ggplot(df, aes(x="Average Metric Value (Abbr)", y=value, fill=variable)) + geom_bar(stat='identity', position='dodge')
The output graph is as follows:
I'm sure I can work out how to separate each of the three pairs later, but as you can see, I don't have the metric names for each of the three pairs along the x-axis, and I am missing the first "Value" bar, presumably because it equals the same as the second and I am only getting unique values plotted.
How do I get around that and have the names of each metric beneath each pairs of values?
We can do this by placing inside the aes_string or use backquotes in the aes for those columns that have spaces in its names
library(dplyr)
library(tidyr)
gather(msi, variable, value, Value:Threshold) %>%
ggplot(., aes(x= `Average Metric Value (Abbr)`,
y=value,
fill=variable)) +
geom_bar(stat='identity', position='dodge')

Ordering bar plots with ggplot2 according to their size, i.e. numerical value

This question asks about ordering a bar graph according to an unsummarized table. I have a slightly different situation. Here's part of my original data:
experiment,pvs_id,src,hrc,mqs,mcs,dmqs,imcs
dna-wm,0,7,9,4.454545454545454,1.4545454545454546,1.4545454545454541,4.3939393939393945
dna-wm,1,7,4,2.909090909090909,1.8181818181818181,0.09090909090909083,3.9090909090909087
dna-wm,2,7,1,4.818181818181818,1.4545454545454546,1.8181818181818183,4.3939393939393945
dna-wm,3,7,8,3.4545454545454546,1.5454545454545454,0.4545454545454546,4.272727272727273
dna-wm,4,7,10,3.8181818181818183,1.9090909090909092,0.8181818181818183,3.7878787878787876
dna-wm,5,7,7,3.909090909090909,1.9090909090909092,0.9090909090909092,3.7878787878787876
dna-wm,6,7,0,4.909090909090909,1.3636363636363635,1.9090909090909092,4.515151515151516
dna-wm,7,7,3,3.909090909090909,1.7272727272727273,0.9090909090909092,4.030303030303029
dna-wm,8,7,11,3.6363636363636362,1.5454545454545454,0.6363636363636362,4.272727272727273
I only need a few variables from this, namely mqs and imcs, grouped by their pvs_id, so I create a new table:
m = melt(t, id.var="pvs_id", measure.var=c("mqs","imcs"))
I can plot this as a bar graph where one can see the correlation between MQS and IMCS.
ggplot(m, aes(x=pvs_id, y=value))
+ geom_bar(aes(fill=variable), position="dodge", stat="identity")
However, I'd like the resulting bars to be ordered by the MQS value, from left to right, in decreasing order. The IMCS values should be ordered with those, of course.
How can I accomplish that? Generally, given any molten dataframe — which seems useful for graphing in ggplot2 and today's the first time I've stumbled over it — how do I specify the order for one variable?
It's all in making
pvs_id a factor and supplying the appropriate levels to it:
dat$pvs_id <- factor(dat$pvs_id, levels = dat[order(-dat$mqs), 2])
m = melt(dat, id.var="pvs_id", measure.var=c("mqs","imcs"))
ggplot(m, aes(x=pvs_id, y=value))+
geom_bar(aes(fill=variable), position="dodge", stat="identity")
This produces the following plot:
EDIT:
Well since pvs_id was numeric it is treated in an ordered fashion. Where as if you have a factor no order is assumed. So even though you have numeric labels pvs_id is actually a factor (nominal). And as far as dat[order(-dat$mqs), 2] is concerned the order function with a negative sign orders the data frame from largest to smallest along the variable mqs. But you're interested in that order for the pvs_id variable so you index that column which is the second column. If you tear that apart you'll see it gives you:
> dat[order(-dat$mqs), 2]
[1] 6 2 0 5 7 4 8 3 1
Now you supply that to the levels argument of factor and this orders the factor as you want it.
With newer tidyverse functions, this becomes much more straightforward (or at least, easy to read for me):
library(tidyverse)
d %>%
mutate_at("pvs_id", as.factor) %>%
mutate(pvs_id = fct_reorder(pvs_id, mqs)) %>%
gather(variable, value, c(mqs, imcs)) %>%
ggplot(aes(x = pvs_id, y = value)) +
geom_col(aes(fill = variable), position = position_dodge())
What it does is:
create a factor if not already
reorder it according to mqs (you may use desc(mqs) for reverse-sorting)
gather into individual rows (same as melt)
plot as geom_col (same as geom_bar with stat="identity")

Resources