stat_sum and stat_identity give weird results - r

I have the following code, including randomly generated demo data:
n <- 10
group <- rep(1:4, n)
mass.means <- c(10, 20, 15, 30)
mass.sigma <- 4
score.means <- c(5, 5, 7, 4)
score.sigma <- 3
mass <- as.vector(model.matrix(~0+factor(group)) %*% mass.means) +
rnorm(n*4, 0, mass.sigma)
score <- as.vector(model.matrix(~0+factor(group)) %*% score.means) +
rnorm(n*4, 0, score.sigma)
data <- data.frame(id = 1:(n*4), group, mass, score)
head(data)
Which gives:
id group mass score
1 1 1 12.643603 5.015746
2 2 2 21.458750 5.590619
3 3 3 15.757938 8.777318
4 4 4 32.658551 6.365853
5 5 1 6.636169 5.885747
6 6 2 13.467437 6.390785
And then I want to plot the sum of "score", grouped by "group", in a bar chart:
plot <- ggplot(data = data, aes(x = group, y = score)) +
geom_bar(stat="sum")
plot
This gives me:
Weirdly, using stat_identity seems to give the result I am looking for:
plot <- ggplot(data = data, aes(x = group, y = score)) +
geom_bar(stat="identity")
plot
Is this a bug? Using ggplot2 1.0.0 on R
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 1.2
year 2014
month 10
day 31
svn rev 66913
language R
version.string R version 3.1.2 (2014-10-31)
nickname Pumpkin Helmet
Or what am I doing wrong?

plot <- ggplot(data = data, aes(x = group, y = score)) +
stat_summary(fun.y = "sum", geom = "bar", position = "identity")
plot
aggregate(score ~ group, data=data, FUN=sum)
# group score
#1 1 51.71279
#2 2 58.94611
#3 3 67.52100
#4 4 39.24484
Edit:
stat_sum does not work here because it does not compute a sum. It returns the "number of observations at position" and the "percent of points in that panel at that position"; it was designed for a different purpose. The docs say "Useful for overplotting on scatterplots."
stat_identity (kind of) works because geom_bar by default stacks the bars. You have many bars on top of each other in contrast to my solution that gives you just one bar per group. Look at this:
plot <- ggplot(data = data, aes(x = group, y = score)) +
geom_bar(stat="identity", color = "red")
plot
Also consider the warning:
Warning message:
Stacking not well defined when ymin != 0
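An alternative that avoids stacking altogether is to compute the sums first and then plot one row per group. The following is only a minimal sketch: the data frame demo and its values are hypothetical stand-ins for the question's data.

```r
library(ggplot2)

# Hypothetical demo data with the same columns the question plots
set.seed(1)
demo <- data.frame(group = rep(1:4, 10),
                   score = rnorm(40, mean = 5, sd = 3))

# One row per group, holding the summed score
sums <- aggregate(score ~ group, data = demo, FUN = sum)

# With pre-aggregated data, stat = "identity" draws exactly one bar per group
p <- ggplot(sums, aes(x = factor(group), y = score)) +
  geom_bar(stat = "identity")
p
```

Because each group has a single row, there is nothing to stack, so negative and positive scores cannot pile up on top of each other.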


how to add significance letters from emmeans to a plot with fitted values

I have a dataset that looks like this with 3 more levels for scarification. Germination is my response variable.
scarification  time  germination
Water          0     0
Water          2     0
Water          4     8
Water          8     23
Ethanol        0     0
Ethanol        2     18
Ethanol        4     19
Ethanol        8     22
I have fitted a glm to the data, plotted the fitted values, and run pairwise contrasts using emmeans. I'd like to add significance letters to my bar chart, but I am having trouble extracting the cld data: the cld function does not work with emmGrid objects, and the variable names used in emmeans differ from those used in the plot. Renaming the variables did not help. I have also tried geom_signif, but that does not work either:
geom_signif(comparisons = em,
test = "emmeans",
map_signif_level = TRUE)
Warning message:
Computation failed in `stat_signif()`
Caused by error in `mapped_discrete()`:
! Can't convert `x` <list> to <double>.
Here is the code I have so far
#make a glm
summary(mod_8 <- glm(cbind(germination, total - germination) ~ scarification*time, data = df, family = binomial))
# make a new df with the predicted values from the model, specifying for stratification to just do 0, 2, 4, and 8 from the continuous variable
mydf <- ggpredict(mod_8, terms = c("time [0,2,4,8]", "scarification"))
#add time as a factor to the new df
mydf$x_fact <- as.factor(mydf$x)
#get contrast values
em <- emmeans(mod_8, ~scarification + time,
at = list(time = c(0, 2, 4, 8)),
trans = "response") %>%
contrast(interaction = c("pairwise", "pairwise"),
by = "time")
#make a grouped bar chart with scarification (group) on the x axis, predicted on the y axis, and grouped by the factor version of time (x_fact)
ggplot(mydf, aes(x = group, y = predicted, fill = x_fact)) +
geom_col(position = "dodge") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), position = position_dodge(width = 0.9)) +
labs(x = "Scarification", y = "Predicted Germination Proportion", fill = "Time") +
ggtitle("Grouped Bar Chart of Germination by Scarification and Time")
If anyone has any ideas I would appreciate it.

Present correlation in plot between two time series for a multiline time series

I have many graphics with two time series plotted on them.
That is to say, I have one plot of y_1 and y_2 against a common set of dates.
For each plot, I would like to present the correlation on the plot between each pair of series. That is to say I would like to compute: cor(y_1,y_2) and include the resulting number on each plot.
This is surprisingly difficult to do in a principled way in ggplot2. I've found no simple way to do it using stat_cor so far.
I have already looked at other functions recommended for this task, but they are all designed for reporting the correlation of y_1 and y_2 when y_1 is plotted against y_2, rather than when both y_1 and y_2 are plotted against time.
I would prefer a ggplot2-ish way to do this but I'm open to using any graphics software within R. Here is code for a minimal working example and what I have tried.
library(reprex); library(ggplot2); library(ggpubr)
n <- 6;
Q=sample(18:30, n, replace=TRUE)
# make sample data
dat <- data.frame(id=1:n,
date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
group=rep(LETTERS[1:2], n/2),
quantity= Q,
price= 100 - 2*Q + rnorm(n))
dat
#> id date group quantity price
#> 1 1 2020-12-26 A 19 63.02628
#> 2 2 2020-12-27 B 26 49.66597
#> 3 3 2020-12-28 A 27 44.98031
#> 4 4 2020-12-29 B 24 51.11224
#> 5 5 2020-12-30 A 29 41.11129
#> 6 6 2020-12-31 B 28 43.04494
tseriesplot <- ggplot(dat, aes(x = date)) + ggtitle("Oil: Daily Quantity and Price") +
geom_line(aes(y = quantity, color = "Quantity (thousands of barrels)")) +
geom_line(aes(y = price, color = "Price"))
tseriesplot
# naive attempt fails
tseriesplot + stat_cor(data = dat, aes(x=quantity, y=price),method="pearson")
#> Error: Invalid input: date_trans works with objects of class Date only
Created on 2021-01-05 by the reprex package (v0.3.0)
I thought this would be a good question because it is similar to more complex questions elsewhere, e.g. https://stat.ethz.ch/pipermail/r-help/2020-July/467805.html but much more basic.
1) annotate Create the text txt you want to plot and then use annotate:
txt <- with(dat, sprintf("cor: %.2f", cor(quantity, price)))
tseriesplot +
annotate("text", label = txt, x = min(dat$date), y = max(dat$quantity, dat$price),
hjust = -0.1)
2) grid.text Another approach is to use grid graphics which allows one to specify the location independently of the data. Using txt from above:
library(grid)
tseriesplot
grid.text(txt, 0.1, 0.9)
3a) zoo This would also work:
library(zoo)
z <- read.zoo(dat[c("date", "price", "quantity")])
txt <- sprintf("cor: %.2f", cor(z)[2])
autoplot(z, facet = NULL) +
annotate("text", label = txt, x = start(z), y = max(z), hjust = -0.1)
3b) scale
or you could scale the variables as that does not affect the correlation:
z <- scale(z)
autoplot(z, facet = NULL) +
annotate("text", label = txt, x = start(z), y = max(z), hjust = -0.1)
Discussion
Overall, combining parts of the different solutions, this seems the most compact:
library(zoo)
library(grid)
z <- read.zoo(dat[c("date", "price", "quantity")])
autoplot(z, facet = NULL)
grid.text(sprintf("cor: %.2f", cor(z)[2]), 0.1, 0.9)
Instead of trying to figure out how to do this with ggpubr::stat_cor you could simply compute the correlation coefficient and add it as an annotation to your plot using e.g. annotate:
library(ggplot2)
library(ggpubr)
set.seed(42)
n <- 6;
Q=sample(18:30, n, replace=TRUE)
# make sample data
dat <- data.frame(id=1:n,
date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
group=rep(LETTERS[1:2], n/2),
quantity= Q,
price= 100 - 2*Q + rnorm(n))
dat
#> id date group quantity price
#> 1 1 2020-12-26 A 18 64.63286
#> 2 2 2020-12-27 B 22 56.40427
#> 3 3 2020-12-28 A 18 63.89388
#> 4 4 2020-12-29 B 26 49.51152
#> 5 5 2020-12-30 A 27 45.90534
#> 6 6 2020-12-31 B 21 60.01842
tseriesplot <- ggplot(dat, aes(x = date)) + ggtitle("Oil: Daily Quantity and Price") +
geom_line(aes(y = quantity, color = "Quantity (thousands of barrels)")) +
geom_line(aes(y = price, color = "Price"))
tseriesplot +
annotate("text",
x = min(dat$date),
y = 70,
label = paste0("r = ", scales::number(cor(dat$quantity, dat$price, method = "pearson"), accuracy = .01)),
hjust = 0)
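Since the question mentions many such plots, the label construction can be wrapped in a small function and reused. This is only a sketch: cor_label is a hypothetical helper name, and the six data rows are copied from the question's printed output.

```r
library(ggplot2)

# Hypothetical helper: formats the correlation of two series as a plot label
cor_label <- function(x, y, method = "pearson") {
  sprintf("r = %.2f", cor(x, y, method = method))
}

# The six rows printed in the question
dat <- data.frame(
  date = seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
  quantity = c(19, 26, 27, 24, 29, 28),
  price = c(63.02628, 49.66597, 44.98031, 51.11224, 41.11129, 43.04494)
)

# Both series against time, with the correlation placed in the top-left corner
p <- ggplot(dat, aes(x = date)) +
  geom_line(aes(y = quantity, color = "Quantity")) +
  geom_line(aes(y = price, color = "Price")) +
  annotate("text", x = min(dat$date), y = max(dat$quantity, dat$price),
           label = cor_label(dat$quantity, dat$price), hjust = 0)
p
```

The label is computed from the raw columns, so it does not depend on how the two series are drawn or scaled on the plot.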

Need to make a 2x2 ggplot in R

I have the following data:
unigrams Freq
1 the 236133
2 to 154296
3 and 128165
4 a 127434
5 i 124599
6 of 103380
7 in 81985
8 you 69504
9 is 65243
10 for 62425
11 it 60298
12 that 58605
13 on 45935
14 my 45424
15 with 38270
16 this 34799
17 was 33009
18 be 32725
19 have 31728
20 at 30255
and this set of data:
bigrams Freq
1 of the 20707
2 in the 19443
3 for the 11090
4 to the 10939
5 on the 10280
6 to be 9555
7 at the 7184
8 i have 6408
9 and the 6387
10 i was 6143
11 is a 6114
12 and i 5993
13 i am 5843
14 in a 5770
15 it was 5644
16 for a 5343
17 if you 5326
18 it is 5196
19 with the 5092
20 have a 4936
I would like to place two qplots together side-by-side, ncol = 2. I tried the gridExtra library, but it is generating errors that I can't seem to figure out how to correct. Any ideas on how to do this, please?
library(gridExtra)
# The 20 most frequent unigrams in the dataset
ugrams <- as.data.frame(unigrams)
graph.data <- ugrams[order(ugrams$Freq, decreasing = T), ]
graph.data <- graph.data[1:20, ]
p1 <- qplot(unigrams,Freq, data=graph.data,fill=unigrams,geom=c("histogram"))
# The 20 most frequent bigrams in the dataset
bgrams <- as.data.frame(bigrams)
graph.data <- bgrams[order(bgrams$Freq, decreasing = T), ]
graph.data <- graph.data[1:20, ]
p2 <- qplot(bigrams,Freq, data=graph.data,fill=bigrams,geom=c("histogram"))
grid.arrange(p1,p2,ncol=2)
This is the error that is generated:
<error/rlang_error>
stat_bin() can only have an x or y aesthetic.
Backtrace:
1. (function (x, ...) ...
2. ggplot2:::print.ggplot(x)
4. ggplot2:::ggplot_build.ggplot(x)
5. ggplot2:::by_layer(function(l, d) l$compute_statistic(d, layout))
6. ggplot2:::f(l = layers[[i]], d = data[[i]])
7. l$compute_statistic(d, layout)
8. ggplot2:::f(..., self = self)
9. self$stat$setup_params(data, self$stat_params)
10. ggplot2:::f(...)
I would like to have the graphs resemble this one:
Which was accomplished by the following code:
# The 20 most quadgrams in the dataset
qgrams <- as.data.frame(quadgrams)
graph.data <- qgrams[order(qgrams$Freq, decreasing = T), ]
graph.data <- graph.data[1:20, ]
ggplot(data=graph.data, aes(x=quadgrams, y=Freq, fill=quadgrams)) + geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 40, hjust = 1))
Is that possible
Edited for your shift from histograms to bar plots. Assuming that graph.data is actually your ugrams dataset, the working single plot is
Putting them side-by-side can be done with facets:
library(dplyr)   # select(), arrange(), mutate(), %>%
library(ggplot2)
dplyr::bind_rows(
unigrams = select(ugrams, grams = unigrams, Freq),
bigrams = select(bgrams, grams = bigrams, Freq),
.id = "id") %>%
arrange(-Freq) %>%
mutate(
id = factor(id, levels = c("unigrams", "bigrams")),
grams = factor(grams, levels = grams)
) %>%
ggplot(aes(x = grams, y = Freq, fill = grams)) +
facet_wrap(~ id, ncol = 2, scales = "free_x") +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 40, hjust = 1))
(Obviously, these are "too small" to hold all of the legend, but that depends on where you are using it. I wonder if the legend shouldn't be included, since it is somewhat redundant with the x-axis labels.)
The y-axis on the left is harder to see because it is dwarfed by the unigrams on the right. While it does bias the plot (it might be natural to compare the vertical levels of the plot on the left with those on the right), you can alleviate that by freeing both the "x" (already free) and "y" axes with scales="free":
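A minimal sketch of that variant follows. To keep it self-contained it uses only the first three rows of each table from the question, stacked with base R; grams_df and p_free are names made up for this illustration.

```r
library(ggplot2)

# First three rows of each table from the question, stacked into one data frame
grams_df <- rbind(
  data.frame(id = "unigrams", grams = c("the", "to", "and"),
             Freq = c(236133, 154296, 128165)),
  data.frame(id = "bigrams", grams = c("of the", "in the", "for the"),
             Freq = c(20707, 19443, 11090))
)

# Fix the facet order and keep the bars in the order of the rows
grams_df$id <- factor(grams_df$id, levels = c("unigrams", "bigrams"))
grams_df$grams <- factor(grams_df$grams, levels = grams_df$grams)

# scales = "free" gives each facet its own x and y axis
p_free <- ggplot(grams_df, aes(x = grams, y = Freq, fill = grams)) +
  facet_wrap(~ id, ncol = 2, scales = "free") +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1))
p_free
```

With free y axes the two panels no longer share a scale, so bar heights should only be compared within a panel.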

How to draw the following graph in R? And what are these types of graphs called?

I am trying to present the following data
x <- factor(c(1,2,3,4,5))
x
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
value <- c(10,5,7,4,12)
value
[1] 10 5 7 4 12
y <- data.frame(x, value)
y
x value
1 1 10
2 2 5
3 3 7
4 4 4
5 5 12
I want to convert the above information into the following graphical representation
What is this type of graph called? I checked out dot plots, but those only stack vertically.
This solution plots sets of three bar graphs faceted by x. The height of the bars within each set is determined by the integer quotient and the remainder of dividing value by 3. Horizontal spacing is provided by natural geom spacing. Vertical spacing is created using white gridlines.
library(ggplot2)
library(reshape2)
Data
dataset <- data.frame('x' = 1:5, 'value' = c(10, 5, 7, 4, 12))
Since every value is supposed to be represented by three bars, we will add 3 columns to the dataset and distribute the magnitude of the value among them using integer division:
dataset[, c('col1', 'col2', 'col3')] <- floor(dataset$value / 3)
r <- dataset$value %% 3
dataset[r == 1, 'col1'] <- dataset[r == 1, 'col1'] + 1
dataset[r == 2, c('col1', 'col2')] <- dataset[r == 2, c('col1', 'col2')] + 1
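As a quick sanity check of this distribution step, the same arithmetic can be written in vectorised form and verified to add back up to value. This is a sketch in base R; the names base and cols are made up for the illustration.

```r
value <- c(10, 5, 7, 4, 12)

base <- value %/% 3   # integer quotient: height every bar gets
r    <- value %% 3    # remainder: 0, 1 or 2 extra units to hand out

# The first bar gets an extra unit when r >= 1, the second only when r == 2
cols <- cbind(col1 = base + (r >= 1),
              col2 = base + (r == 2),
              col3 = base)
cols

# The three bar heights of every set must sum to the original value
stopifnot(all(rowSums(cols) == value))
```

For example, value 10 splits into bars of height 4, 3 and 3, and value 5 into 2, 2 and 1.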
Now, we will melt the dataframe for the purposes of plotting:
dataset <- melt(dataset, id.vars = c('x', 'value'))
colnames(dataset)[4] <- 'magnitude' # avoiding colnames conflict
dataset$variable <- as.character(dataset$variable) # column ordering within a facet
Plot
First, we will make a regular bar graph. We can move facet labels to the bottom of the plot area using the switch parameter.
plt <- ggplot(data = dataset)
plt <- plt + geom_col(aes(x=variable, y = magnitude), fill = 'black')
plt <- plt + facet_grid(.~x, switch="both")
Then we will use theme_minimal() and add a few tweaks to the parameters that govern the appearance of gridlines. Specifically, we will make sure that minor XY gridlines and major X gridlines are blank, whereas major Y gridlines are white and plotted on top of the data.
plt <- plt + theme_minimal()
plt <- plt + theme(panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(colour = "white", size = 1.5),
panel.grid.minor = element_blank(),
panel.ontop = TRUE)
We can add value labels using geom_text(). We will only use x values from col2 records such that we're not plotting the value over each bar within each set (col2 happens to be the middle bar).
plt <- plt + geom_text(data = dataset[dataset$variable == 'col2', ],
aes(label = value, x = variable, y = magnitude + 0.5))
plt <- plt + theme(axis.text.x=element_blank()) # removing the 'col' labels
plt + xlab('x') + ylab('value')
The following code will do a graph similar to the one in the question.
I had to change the data.frame, yours was not fit to graph with geom_dotplot. The new variable z$value is a vector of the values 1:5 each repeated as many times as value.
library(ggplot2)
value <- c(10, 5, 7, 4, 12)
z <- sapply(value, function(v) c(1, rep(0, v - 1)))
z <- cumsum(unlist(z))
z <- data.frame(value = z)
ggplot(z, aes(x = jitter(value))) +
geom_dotplot() +
xlab("value")

ggplot2: plotting line behind boxplot

I want to plot a line using geom_line behind my boxplot. I finally managed to combine a line with a boxplot. This is the dataset I used to create the boxplot:
>head(MdataNa)
1 2 3 4 5 6 7
1 -0.02798634 -0.05740014 -0.02643664 0.02203644 0.02366325 -0.02868668 -0.01278713
2 0.20278229 0.19960302 0.10896017 0.24215229 0.31925211 0.29928739 0.15911725
3 0.06570653 0.08658396 -0.06019098 0.01437147 0.02078022 0.13814853 0.11369999
4 -0.42805441 -0.91945721 -1.05555731 -0.90877542 -0.77493682 -0.90620917 -1.00535742
5 0.39922939 0.12347996 0.06712451 0.07419287 -0.09517628 -0.12056720 -0.40863078
6 0.52821596 0.30827515 0.29733794 0.30555717 0.31636676 0.11592717 0.16957927
I have glucose concentration which should be plotted in a line behind this boxplot:
# glucose curve values
library(scales)   # rescale()
library(reshape2) # melt()
offconc <- c(0,0.4,0.8,1.8,3.5,6.9,7.3)
offtime <- c(9,11.4,12.9,14.9,16.7,18.3,20.5)
# now we have to scale them so they fit in the (boxplot)plot
time <- rescale(offtime, to=c(1,7))
conc <- rescale(offconc, to=c(-1,1))
glucoseConc <- data.frame(time,conc)
glucoseConc2 <- melt(glucoseConc, id = "time")
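For readers unfamiliar with rescale, the scaling step is a plain linear map from the observed range onto the target range. The following base R sketch reproduces it for the two vectors above; rescale_linear is a hypothetical name used only here, not a function from the scales package.

```r
# Hypothetical helper: linear map of x from its own range onto the range `to`
rescale_linear <- function(x, to = c(0, 1)) {
  (x - min(x)) / (max(x) - min(x)) * (to[2] - to[1]) + to[1]
}

offconc <- c(0, 0.4, 0.8, 1.8, 3.5, 6.9, 7.3)
offtime <- c(9, 11.4, 12.9, 14.9, 16.7, 18.3, 20.5)

time <- rescale_linear(offtime, to = c(1, 7))    # boxplot positions 1..7
conc <- rescale_linear(offconc, to = c(-1, 1))   # visible y range of the boxes

range(time)
range(conc)
```

After this transformation the smallest time lands on the first box and the largest on the seventh, while the spacing in between stays proportional to the original times.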
Then I plotted this data, but I was only able to plot the glucose curve in FRONT of the boxplot instead of behind it. I used this code:
boxNa <- ggplot(stack(MdataNa), aes(x = ind, y = values)) +
geom_boxplot() +
coord_cartesian(y = c(-1.5,1.5)) +
labs(list(title = "After Loess", x = "Timepoint", y = "M")) +
geom_line(data=glucoseConc2,aes(x=time,y=value),group=1)
output of the code above:
EDIT as suggested by the comments(NOT WORKING)
boxNa <- ggplot(stack(MdataNa), aes(x = ind, y = values)) +
geom_line(data=glucoseConc2,aes(x=time,y=value),group=1) +
geom_boxplot(data=stack(MdataNa), aes(x = ind, y = values)) +
coord_cartesian(y = c(-1.5,1.5)) +
labs(list(title = "After Loess", x = "Timepoint", y = "M"))
this will give the following error:
Error: Discrete value supplied to continuous scale
probably I'm doing something wrong then?
Here's a solution.
The idea is to convert the x axis to continuous values:
ggplot() +
geom_line(data=glucoseConc2,aes(x=time,y=value),group=1)+
geom_boxplot(data=stack(MdataNa), aes(x = as.numeric(ind), y = values, group=ind)) +
coord_cartesian(ylim = c(-1.5,1.5)) +
labs(title = "After Loess", x = "Timepoint", y = "M")+
scale_x_continuous(breaks=1:7)
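The same idea on a self-contained toy example: ggplot2 draws layers in the order they are added, so the line goes first and the boxes are drawn over it. This is only a sketch; box_df and line_df are made-up stand-ins for the stacked MdataNa and glucoseConc2.

```r
library(ggplot2)

# Hypothetical stand-ins for the question's data
set.seed(1)
box_df  <- data.frame(ind = factor(rep(1:7, each = 5)),
                      values = rnorm(35, sd = 0.4))
line_df <- data.frame(time = 1:7,
                      value = seq(-1, 1, length.out = 7))

p <- ggplot() +
  # first layer: drawn first, so it ends up behind
  geom_line(data = line_df, aes(x = time, y = value)) +
  # second layer: numeric x keeps the scale continuous, group keeps one box per level
  geom_boxplot(data = box_df,
               aes(x = as.numeric(ind), y = values, group = ind)) +
  scale_x_continuous(breaks = 1:7)
p
```

Converting the factor with as.numeric is safe here because the levels are already in the order 1 to 7.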
