Combined barplots with R ggplot2: dodged and stacked

I have a table of data that already contains several values to be plotted on a barplot with the ggplot2 package (the data are already cumulative).
The data in the data frame "reserves" has the form (simplified):
period,amount,a1,a2,b1,b2,h1,h2,h3,h4
J,18.1,30,60,40,60,15,50,30,5
K,29,65,35,75,25,5,50,40,5
P,13.3,94,6,85,15,10,55,20,15
N,21.6,95,5,80,20,10,55,20,15
The first column (period) is the geological epoch. It will be on the x axis, and I needed it to keep its original order, so I prepared the appropriate factor levels with the command
reserves$period <- factor(reserves$period, levels = reserves$period)
The column "amount" is the main column to be plotted on the y axis (it is the percentage of hydrocarbons in each epoch, but it could be in absolute values as well, say, millions of tons or whatever). So the basic plot is produced by the command:
ggplot(reserves,aes(x=period,y=amount)) + geom_bar(stat="identity")
But here is the question. I need to plot the other values, that is a1-a2, b1-b2, and h1-h4, on the same bar graph. These are percentage values for each letter (for example, if a1=60, then a2=40; the same holds for b1-b2), and h1-h4 likewise sum to 100. So I need the values a1-a2 shown in some color, proportionally dividing the "amount" bar for each value of x (a stacked barplot), and then the same for the values b1-b2, so that each period has two adjacent columns (grouped barplots), each of them stacked. Next, I need a third column for the values h1-h4, perhaps also as a stacked barplot, either as a third column or as a staggered barplot above the first one.
So the layout looks like this:
I learned that I first need to reshape the data with the reshape2 package and then use position="dodge" or position="fill" in geom_bar(), but here I need a combination of the two. And the third barplot (for the values h1-h4) seems to need a "stacked percent" representation with fixed height.
Are there packages that handle the data for plotting in a more intuitive way? Let's say we just declare that we want the variables ai, bi, hi to be plotted.

First you should reshape your data from wide to long, then scale your proportions to their raw values. Then split your old column names (now the values of "lett") into their letters and numbers for labeling. If your real data aren't formatted like this (a1...h4), there are ways to handle that as well.
library(dplyr)
library(tidyr)
library(ggplot2)
reserves <- read.csv(text = "period,amount,a1,a2,b1,b2,h1,h2,h3,h4
J,18.1,30,60,40,60,15,50,30,5
K,29,65,35,75,25,5,50,40,5
P,13.3,94,6,85,15,10,55,20,15
N,21.6,95,5,80,20,10,55,20,15")
reserves.tidied <- reserves %>%
  gather(key = lett, value = prop, -period, -amount) %>%
  mutate(rawvalue = prop * amount/100,
         lett1 = substr(lett, 1, 1),
         num = substr(lett, 2, 2))
reserves.tidied
period amount lett prop rawvalue lett1 num
1 J 18.1 a1 30 5.430 a 1
2 K 29.0 a1 65 18.850 a 1
3 P 13.3 a1 94 12.502 a 1
4 N 21.6 a1 95 20.520 a 1
5 J 18.1 a2 60 10.860 a 2
6 K 29.0 a2 35 10.150 a 2
7 P 13.3 a2 6 0.798 a 2
8 N 21.6 a2 5 1.080 a 2
9 J 18.1 b1 40 7.240 b 1
10 K 29.0 b1 75 21.750 b 1
11 P 13.3 b1 85 11.305 b 1
12 N 21.6 b1 80 17.280 b 1
13 J 18.1 b2 60 10.860 b 2
14 K 29.0 b2 25 7.250 b 2
15 P 13.3 b2 15 1.995 b 2
16 N 21.6 b2 20 4.320 b 2
17 J 18.1 h1 15 2.715 h 1
18 K 29.0 h1 5 1.450 h 1
19 P 13.3 h1 10 1.330 h 1
20 N 21.6 h1 10 2.160 h 1
21 J 18.1 h2 50 9.050 h 2
22 K 29.0 h2 50 14.500 h 2
23 P 13.3 h2 55 7.315 h 2
24 N 21.6 h2 55 11.880 h 2
25 J 18.1 h3 30 5.430 h 3
26 K 29.0 h3 40 11.600 h 3
27 P 13.3 h3 20 2.660 h 3
28 N 21.6 h3 20 4.320 h 3
29 J 18.1 h4 5 0.905 h 4
30 K 29.0 h4 5 1.450 h 4
31 P 13.3 h4 15 1.995 h 4
32 N 21.6 h4 15 3.240 h 4
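In current tidyr, gather() is superseded by pivot_longer(). A minimal sketch of the same reshaping step, assuming tidyr 1.0.0 or later; the name reserves_long is chosen only for this example, and the rows come out ordered by period rather than by column:

```r
library(dplyr)
library(tidyr)

reserves <- read.csv(text = "period,amount,a1,a2,b1,b2,h1,h2,h3,h4
J,18.1,30,60,40,60,15,50,30,5
K,29,65,35,75,25,5,50,40,5
P,13.3,94,6,85,15,10,55,20,15
N,21.6,95,5,80,20,10,55,20,15")

# Same wide-to-long reshape as gather(), written with pivot_longer()
reserves_long <- reserves %>%
  pivot_longer(cols = -c(period, amount),
               names_to = "lett", values_to = "prop") %>%
  mutate(rawvalue = prop * amount / 100,   # share of "amount" for this segment
         lett1 = substr(lett, 1, 1),       # group letter: a, b or h
         num = substr(lett, 2, 2))         # position within the group

head(reserves_long)
```

The resulting columns match reserves.tidied above, so the plotting code that follows should work with either object.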
Then to plot your tidied data, you want the letters across the x axis, and the rawvalue we just calculated with amount*proportion on the y axis. We stack the geom_col up from 1 to 2 or 1 to 4 (the reverse=T argument overrides the default, which would have 2 or 4 at the bottom of the stack). alpha and fill let us distinguish between groups in the same bar and between bars.
Then the geom_text labels each stacked segment with the name, a newline, and the original percentage, centered on each segment. The scale reverses the default behavior again, making 1 the darkest and 2 or 4 the lightest in each bar. Then you facet across, making one group of bars for each period.
ggplot(reserves.tidied,
       aes(x = lett1, y = rawvalue, alpha = num, fill = lett1)) +
  geom_col(position = position_stack(reverse = T), colour = "black") +
  geom_text(position = position_stack(reverse = T, vjust = .5),
            aes(label = paste0(lett, ":\n", prop, "%")), alpha = 1) +
  scale_alpha_discrete(range = c(1, .1)) +
  facet_grid(~period) +
  # "none" replaces the deprecated FALSE for switching legends off
  guides(fill = "none", alpha = "none")
Rearranging it so that the "h" bars are different from the "a" and "b" bars is a bit more complex, and you'd have to think about how you want it presented, but it's totally doable.
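One possible arrangement for the "h" values, sketched under the assumption that they should appear as a fixed-height 100% bar per period (the "stacked percent" representation mentioned in the question). The names h_only and p are made up for this example, and this is only one of several layouts that could work:

```r
library(ggplot2)

reserves <- read.csv(text = "period,amount,a1,a2,b1,b2,h1,h2,h3,h4
J,18.1,30,60,40,60,15,50,30,5
K,29,65,35,75,25,5,50,40,5
P,13.3,94,6,85,15,10,55,20,15
N,21.6,95,5,80,20,10,55,20,15")

# Long table holding only the h shares: one row per period and h column
h_only <- data.frame(
  period = factor(rep(reserves$period, times = 4), levels = reserves$period),
  num    = rep(as.character(1:4), each = nrow(reserves)),
  prop   = c(reserves$h1, reserves$h2, reserves$h3, reserves$h4)
)

# position_fill() rescales every bar to the same total height
p <- ggplot(h_only, aes(x = period, y = prop, fill = num)) +
  geom_col(position = position_fill(reverse = TRUE), colour = "black") +
  scale_y_continuous(labels = scales::percent)
p
```

Such a panel could then be placed next to or above the faceted plot, for example with a plot-composition package.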

Related

Create heatmap with range of colors in a single cell in R

# dataframe
df1 <- df %>%
mutate(valuesrange=cut(values, breaks=c(0,0.05,10,100,1000,2000,3000, max(values, na.rm=T)),
labels=c("0-0.05", "0.05-10", "10-100", "100-1000", "1000-2000", "2000-3000", ">3000"))) %>%
mutate(valuesrange=factor(as.character(valuesrange), levels=rev(levels(valuesrange))))
#Order for X and Y axis labels
df1$objx <- factor(df1$objx, levels=unique(df1$objx))
df1$objy <- factor(df1$objy, levels=unique(df1$objy))
ggplot(data = df1, aes(x=objx, y=objy, fill = valuesrange)) +
geom_tile()+
scale_fill_manual(values=rev(brewer.pal(7, "YlGnBu")), na.value="grey90")
The df1 data looks like this
objy objx values valuesrange
1 1 15 1219 1000-2000
2 1 15 3911 >3000
3 1 15 3224 >3000
4 1 15 14708 >3000
5 1 15 5054 >3000
6 1 15 31499 >3000
7 1 15 1131 1000-2000
8 1 15 4368 >3000
9 1 15 2749 2000-3000
10 1 15 666. 100-1000
11 1 15 1982 1000-2000
I would like to create a heatmap of the df1 data with single tick values on the x axis and y axis, using the value ranges defined above. I need one color for every value range; however, with the code above I only see one single color, as in the image.
Could you please help me generate multiple colors within a single cell?

R: Find out which observations are located in each "bar" of the histogram

I am working with the R programming language. Suppose I have the following data:
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
index <- 1:1400
my_data = data.frame(index,d)
I can make the following histograms of the same data by adjusting the "bin" length (via the "breaks" option):
hist(my_data$d, breaks = 10, main = "Histogram #1, Breaks = 10")
hist(my_data$d, breaks = 100, main = "Histogram #2, Breaks = 100")
hist(my_data$d, breaks = 5, main = "Histogram #3, Breaks = 5")
My Question: In each one of these histograms there are a different number of "bars" (i.e. bins). For example, in the first histogram there are 8 bars and in the third histogram there are 4 bars. For each one of these histograms, is there a way to find out which observations (from the original file "d") are located in each bar?
Right now, I am trying to manually do this, e.g. (for histogram #3)
histogram3_bar1 <- my_data[which(my_data$d < 5 & my_data$d > 0), ]
histogram3_bar2 <- my_data[which(my_data$d < 10 & my_data$d > 5), ]
histogram3_bar3 <- my_data[which(my_data$d < 15 & my_data$d > 10), ]
histogram3_bar4 <- my_data[which(my_data$d < 20 & my_data$d > 15), ]
head(histogram3_bar1)
index d
1001 1001 4.156393
1002 1002 3.358958
1003 1003 1.605904
1004 1004 3.603535
1006 1006 2.943456
1007 1007 1.586542
But is there a more "efficient" way to do this?
Thanks!
hist itself provides the solution to the question's problem of finding out which data points are in which intervals. hist returns a list whose first member is breaks.
First, make the problem reproducible by setting the RNG seed.
set.seed(2021)
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
Now, save the return value of hist and have findInterval report which bin each data point is in.
h1 <- hist(d, breaks = 10)
f1 <- findInterval(d, h1$breaks)
h1$breaks
# [1] -2 0 2 4 6 8 10 12 14 16
head(f1)
#[1] 6 7 7 7 7 6
The first six observations fall in intervals 6 and 7, which have end points 8, 10 and 12. This can be checked by comparing the values themselves with h1$breaks (indexing d by f1 would not work as a check, because f1 holds bin numbers, not positions in d):
head(d)
As for whether the intervals given by end points 8, 10 and 12 are left- or right-closed, see help("findInterval").
As a final check, table the values returned by findInterval and see if they match the histogram's counts.
table(f1)
#f1
# 1 2 3 4 5 6 7 8 9
# 2 34 130 34 17 478 512 169 24
h1$counts
#[1] 2 34 130 34 17 478 512 169 24
To get the interval end points for each data point, use the following:
bins <- data.frame(bin = f1, min = h1$breaks[f1], max = h1$breaks[f1 + 1L])
head(bins)
# bin min max
#1 6 8 10
#2 7 10 12
#3 7 10 12
#4 7 10 12
#5 7 10 12
#6 6 8 10
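As an alternative sketch, cut() can label each observation with its bin and split() can then return the rows of each bar. This assumes hist's defaults (right = TRUE, include.lowest = TRUE), which are passed to cut() explicitly so that both functions treat the interval end points the same way. The names bin and by_bar are made up for this example:

```r
set.seed(2021)
d <- c(rnorm(1000, 10, 1), rnorm(200, 3, 1), rnorm(200, 13, 1))
my_data <- data.frame(index = seq_along(d), d = d)

# Compute the histogram without drawing it, to get the break points
h <- hist(my_data$d, breaks = 10, plot = FALSE)

# Label every observation with its bin, using the same closure rules as hist
my_data$bin <- cut(my_data$d, breaks = h$breaks,
                   right = TRUE, include.lowest = TRUE)

# One data frame per bar, named by interval
by_bar <- split(my_data, my_data$bin)

# Row counts per bar should agree with the histogram's counts
sapply(by_bar, nrow)
h$counts
```

Each element of by_bar then plays the role of the manually built histogram3_bar1, histogram3_bar2, and so on.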

Plotting missing data

I'm trying to plot the following dataset, imputed with the LOCF method, according to this procedure:
> dati
# A tibble: 27 x 6
id sex d8 d10 d12 d14
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5
# ... with 17 more rows
dati_locf <- dati %>% mutate(across(everything(),na.locf)) %>%
mutate(across(everything(),na.locf,fromLast = TRUE))
apply(dati_locf[which(dati_locf$sex=="F"),1:4], 1, function(x) lines(x, col = "green"))
However, when I run the last line to plot the dataset, it returns the following warning and error messages:
Warning in xy.coords(x, y) : a NA has been produced by coercion
Error in plot.xy(xy.coords(x, y), type = type, ...) :
plot.new has not been called yet
Called from: plot.xy(xy.coords(x, y), type = type, ...)
Can you explain why this happens and how I can fix it? I have attached the page I was directed to after running it.
If you just want to plot the LOCF imputation for one variable to see how good the fit for the imputations looks for this one variable, you can use the following:
library(imputeTS)
# Example 1: Visualize imputation by LOCF
imp_locf <- na_locf(tsAirgap)
ggplot_na_imputations(tsAirgap, imp_locf)
tsAirgap is an example time series that comes with the imputeTS package. You would have to replace it with the time series / variable you want to plot. Imputed values are shown in red. As you can see, for this series last observation carried forward is kind of OK, but there are algorithms in the imputeTS package that give a better result (e.g. na_kalman or na_seadec). Here is also an example of next observation carried backward, since you also used NOCB.
library(imputeTS)
# Example 2: Visualize imputation by NOCB
imp_locf <- na_locf(tsAirgap, option = "nocb")
ggplot_na_imputations(tsAirgap, imp_locf)
There are several problems here:
apply converts its first argument to a matrix, and since the second column is character, this gives a character matrix. Clearly one can't plot that with lines.
Presumably we want to plot columns 3:6, not 1:4.
na.locf produces repeated identical values wherever there is an NA, but what we really want is to connect the non-NA points. Use na.approx instead.
lines can only be used after plot, but there is no plot command. Use matplot instead.
Making these changes we have the following.
library(zoo)
# see Note below for dati in reproducible form
matplot(na.approx(dati[3:6]), type = "l", ylab = "")
legend("topright", names(dati)[3:6], col = 1:4, lty = 1:4)
We could alternately use ggplot2 graphics. First convert to zoo and then use na.approx and autoplot. Omit facet=NULL if you want separate panels.
library(ggplot2)
autoplot(na.approx(zoo(dati[3:6])), facet = NULL)
Note
We provide dati in reproducible form below. Note that the sex column only contains NA and F so in the absence of direction it will assume those are a logical NA and FALSE. Instead we specify that the sex column is character in the read.table line.
Lines <- "
id sex d8 d10 d12 d14
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5"
dati <- read.table(text = Lines, colClasses = list(sex = "character"))
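The first problem listed above (apply coercing a mixed data frame to a character matrix) can be checked directly. A minimal sketch on a small made-up data frame, toy, with the same column types as the first columns of dati:

```r
# Toy data with one character column, mimicking id/sex/d8/d10 in dati
toy <- data.frame(id  = 1:2,
                  sex = c("F", "F"),
                  d8  = c(21, 21),
                  d10 = c(20, 21.5))

# apply() calls as.matrix() on a data frame; one character column
# is enough to turn the whole matrix into character
m_mixed <- as.matrix(toy[, 1:4])
typeof(m_mixed)

# Selecting only the numeric measurement columns keeps the matrix numeric
m_num <- as.matrix(toy[, 3:4])
typeof(m_num)
```

This is why the column selection has to exclude sex before anything is passed to lines or matplot.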

How to prevent R from rounding in frequency function?

I used the freq function of the frequency package to get frequency percentages for dataset$MoriskyAdherence, but R returns the percent values rounded. I need more decimal places.
MoriskyAdherence=dataset$MoriskyAdherence
freq(MoriskyAdherence)
The result is:
The Percent values are 35.0, 41.3, 23.8. The sum of them is 100.1.
The exact amounts should be 35.0, 41.25, 23.75.
What should I do?
I tried sprintf, as.data.frame, formatC, and some other functions to deal with it, but...
The function freq returns a character data frame, and has no option to adjust the number of decimal places. However, it is easy to recreate the table however you want it. For example, I have written this function, which will give you the same result but with two decimal places instead of one:
freq2 <- function(data_frame)
{
  df <- frequency::freq(data_frame)
  lapply(df, function(x)
  {
    n <- suppressWarnings(as.numeric(x$Freq))
    sum_all <- as.numeric(x$Freq[nrow(x)])
    raw_percent <- suppressWarnings(100 * n / sum_all)
    t_row <- grep("Total", x[,2])[1]
    valid_percent <- suppressWarnings(100*n / as.numeric(x$Freq[t_row]))
    x$Percent <- format(round(raw_percent, 2), nsmall = 2)
    x$'Valid Percent' <- format(round(valid_percent, 2), nsmall = 2)
    x$'Cumulative Percent' <- format(round(cumsum(valid_percent), 2), nsmall = 2)
    x$'Cumulative Percent'[t_row:nrow(x)] <- ""
    x$'Valid Percent'[(t_row + 1):nrow(x)] <- ""
    return(x)
  })
}
Now instead of
freq(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.0 35.0 35.0
#> 3 Low Adherence 66 41.3 41.3 76.3
#> 4 Medium Adherence 38 23.8 23.8 100.0
#> 41 Total 160 100.0 100.0
#> 1 Missing <blank> 0 0.0
#> 5 <NA> 0 0.0
#> 7 Total 160 100.0
you can do
freq2(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.00 35.00 35.00
#> 3 Low Adherence 66 41.25 41.25 76.25
#> 4 Medium Adherence 38 23.75 23.75 100.00
#> 41 Total 160 100.00 100.00
#> 1 Missing <blank> 0 0.00
#> 5 <NA> 0 0.00
#> 7 Total 160 100.00
which is exactly what you were looking for.
Two (potential) solutions:
Solution #1:
Make changes inside the function freq. This can be done by retrieving the function's code with the command freq (without round brackets), or by retrieving the code, with comments, from https://rdrr.io/github/wilcoxa/frequencies/src/R/freq.R.
My hunch is that to obtain more decimals, changes must be implemented at this point in the code:
# create a list of frequencies
message("Building tables")
all_freqs <- lapply_pb(names(x), function(y, x1 = as.data.frame(x), maxrow1 = maxrow, trim1 = trim){
makefreqs(x1, y, maxrow1, trim1)
})
Solution #2:
If you're only after percentages with more decimals, you can use aggregate. Let's suppose your data has this structure: a dataframe with two variables, one numeric, one a factor by which you want to group:
set.seed(123)
Var1 <- sample(LETTERS[1:4], 10, replace = T)
Var2 <- sample(10:100, 10, replace = T)
df <- data.frame(Var1, Var2)
Var1 Var2
1 B 97
2 D 51
3 B 71
4 D 62
5 D 19
6 A 91
7 C 32
8 D 13
9 C 39
10 B 96
Then to obtain your percentages by factor, you would use aggregate thus:
aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(Var2)*100)
Var1 Var2
1 A 15.93695
2 B 46.23468
3 C 12.43433
4 D 25.39405
You can control the number of decimals by using round:
aggregate(Var2 ~ Var1, data = df, function(x) round(sum(x)/sum(Var2)*100,3))
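If only the category percentages are needed, base R's table() and prop.table() also give full control over rounding. A minimal sketch: the vector adherence is a stand-in for dataset$MoriskyAdherence, rebuilt here from the counts shown in the freq output above (56, 66, 38), since the original data are not available:

```r
# Stand-in for dataset$MoriskyAdherence, rebuilt from the reported counts
adherence <- c(rep("High Adherence", 56),
               rep("Low Adherence", 66),
               rep("Medium Adherence", 38))

counts <- table(adherence)

# Percentages rounded to two decimals instead of one
pct <- round(100 * prop.table(counts), 2)

data.frame(label   = names(counts),
           Freq    = as.vector(counts),
           Percent = as.vector(pct))
```

Changing the second argument of round() sets the number of decimal places.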

Grouped barplot side by side

I'm trying to plot the table below using a grouped barplot with ggplot2.
How do I plot it so that the scheduled audits and noofemails are plotted side by side for each day?
Email Type Sent Month Sent Day Scheduled Audits Noofemails
27 A 1 30 7 581
29 A 1 31 0 9
1 A 2 1 2 8
26 B 1 29 1048 25312
28 B 1 30 23 170
30 B 1 31 18 109
2 B 2 1 6 93
3 B 2 2 9 86
4 B 2 4 3 21
ggplot(joined, aes(x=`Sent Day`, y=`Scheduled Audits`, fill = Noofemails )) +
geom_bar(stat="identity", position = position_dodge()) +
scale_x_continuous(breaks = c(1:29)) +
ggtitle("Number of emails sent in February") +
theme_classic()
Does not achieve the plot I hope to see.
I'm using this data format, with slightly different column names so that back-ticks are no longer needed. read.table(text = "") is a nice way to share little datasets on Stack Overflow.
joined <- read.table(text =
"ID Email_Type Sent_Month Sent_Day Scheduled_Audits Noofemails
27 A 1 30 7 581
29 A 1 31 0 9
1 A 2 1 2 8
26 B 1 29 1048 25312
28 B 1 30 23 170
30 B 1 31 18 109
2 B 2 1 6 93
3 B 2 2 9 86
4 B 2 4 3 21",
header = TRUE)
This is why ggplot2 really likes long data instead of wide data: it needs column names to create the aesthetics.
So you can use the function tidyr::gather() to rearrange the two columns of interest into one column of labels and one of values. This increases the number of rows in the data frame, which is why it's called long.
long <- tidyr::gather(joined,"key", "value", Scheduled_Audits, Noofemails)
ggplot(long, aes(Sent_Day, value, fill = key)) +
geom_col(position = "dodge")
Alternatively, you can use the melt() function from the reshape2 package. See the example below.
library("ggplot2")
library(reshape2)
joined2 <- melt(joined[,c("Sent_Day", "Noofemails", "Scheduled_Audits")], id="Sent_Day")
ggplot(joined2, aes(x=`Sent_Day`, y= value, group = variable, fill= variable)) +
geom_bar(stat="identity", position = position_dodge()) +
scale_x_continuous(breaks = c(1:29)) +
ggtitle("Number of emails sent in February") +
theme_classic()
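With current tidyr the same reshape can be written with pivot_longer(), which supersedes gather(). A minimal sketch, assuming tidyr 1.0.0 or later. Treating Sent_Day as a factor is a choice made for this example so that only the days present in the data get a slot on the x axis; the name long2 is made up here:

```r
library(tidyr)
library(ggplot2)

joined <- read.table(text =
"ID Email_Type Sent_Month Sent_Day Scheduled_Audits Noofemails
27 A 1 30 7 581
29 A 1 31 0 9
1 A 2 1 2 8
26 B 1 29 1048 25312
28 B 1 30 23 170
30 B 1 31 18 109
2 B 2 1 6 93
3 B 2 2 9 86
4 B 2 4 3 21",
header = TRUE)

# Stack the two measurement columns into key/value pairs
long2 <- pivot_longer(joined,
                      cols = c(Scheduled_Audits, Noofemails),
                      names_to = "key", values_to = "value")

# One dodged pair of bars per day
p <- ggplot(long2, aes(x = factor(Sent_Day), y = value, fill = key)) +
  geom_col(position = "dodge") +
  labs(x = "Sent day", y = "Count", fill = NULL)
p
```

Note that some days occur for both email types and in both months, so the rows may need to be aggregated or faceted (for example by Email_Type) before the bars are compared.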
