How to group data and then draw bar chart in ggplot2 - r

I have a data frame (df) with 3 columns, e.g.
NUMERIC1: NUMERIC2: GROUP(CHARACTER):
100 1 A
200 2 B
300 3 C
400 4 A
I want to group NUMERIC1 by GROUP(CHARACTER), and then calculate the mean for each group.
Something like that:
mean(NUMERIC1): GROUP(CHARACTER):
250 A
200 B
300 C
Finally, I'd like to draw a bar chart using ggplot2 with GROUP(CHARACTER) on the x axis and mean(NUMERIC1) on the y axis.
It should look like:
I used
mean <- tapply(df$NUMERIC1, df$GROUP(CHARACTER), FUN=mean)
but I'm not sure if it's OK, and even if it is, I don't know what I'm supposed to do next.

This is what stat_summary(...) is designed for:
colnames(df) <- c("N1","N2","GROUP")
library(ggplot2)
# fun.y was renamed to fun in ggplot2 3.3.0
ggplot(df) + stat_summary(aes(x=GROUP,y=N1),fun=mean,geom="bar",
fill="lightblue",col="grey50")

Try something like:
res <- aggregate(NUMERIC1 ~ GROUP, data = df, FUN = mean)
ggplot(res, aes(x = GROUP, y = NUMERIC1)) + geom_bar(stat = "identity")
data
df <- structure(list(NUMERIC1 = c(100L, 200L, 300L, 400L), NUMERIC2 = 1:4,
GROUP = structure(c(1L, 2L, 3L, 1L), .Label = c("A", "B",
"C"), class = "factor")), .Names = c("NUMERIC1", "NUMERIC2",
"GROUP"), class = "data.frame", row.names = c(NA, -4L))

I'd suggest something like:
# Load data.table, which makes it convenient to apply a function to each
# part of a data frame by unique value, and ggplot2
library(data.table)
library(ggplot2)
# Convert df to a data.table. It remains a data.frame, so any function that works
# on a data.frame can still work here.
data <- as.data.table(df)
# For each unique value in GROUP, subset and calculate the mean of the
# NUMERIC1 values within that subset. You end up with a data.frame/data.table
# with the columns GROUP and mean_value
data <- data[, j = list(mean_value = mean(NUMERIC1)), by = "GROUP"]
# And now we play the plotting game (the plotting game is boring, let's
# play Hungry Hungry Hippos!). The means are already computed, so the bars
# need stat = "identity" instead of the default count.
plot <- ggplot(data, aes(GROUP, mean_value)) + geom_bar(stat = "identity")
# And that should do it.

Here's a solution using dplyr to create the summary. In this case, the summary is created on the fly within ggplot, but you can also create a separate summary data frame first and then feed that to ggplot.
library(dplyr)
library(ggplot2)
ggplot(df %>% group_by(GROUP) %>%
summarise(`Mean NUMERIC1`=mean(NUMERIC1)),
aes(GROUP, `Mean NUMERIC1`)) +
geom_bar(stat="identity", fill=hcl(195,100,65))
Since you're plotting means, rather than counts, it might make more sense to use points, rather than bars. For example:
ggplot(df %>% group_by(GROUP) %>%
summarise(`Mean NUMERIC1`=mean(NUMERIC1)),
aes(GROUP, `Mean NUMERIC1`)) +
geom_point(pch=21, size=5, fill="blue") +
coord_cartesian(ylim=c(0,310))
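The answer above mentions that the summary can also be built as a separate data frame first. A minimal sketch of that variant, assuming the example data from the question; the names means and mean_numeric1 are made up for illustration, and geom_col() is used as shorthand for geom_bar(stat = "identity"):

```r
library(dplyr)
library(ggplot2)

# Example data from the question
df <- data.frame(NUMERIC1 = c(100, 200, 300, 400),
                 NUMERIC2 = 1:4,
                 GROUP = c("A", "B", "C", "A"))

# Step 1: one row per group with its mean (hypothetical column name)
means <- df %>%
  group_by(GROUP) %>%
  summarise(mean_numeric1 = mean(NUMERIC1))

# Step 2: plot the precomputed values as they are
p <- ggplot(means, aes(GROUP, mean_numeric1)) +
  geom_col(fill = hcl(195, 100, 65))
# print(p) draws the chart
```

Keeping the summary as its own object makes it easy to inspect the numbers before plotting them.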

Why ggplot when you could do the same with your own code and barplot:
barplot(tapply(df$NUMERIC1, df$GROUP, FUN=mean))
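As a middle ground between the base barplot() call and the ggplot2 answers, here is a minimal sketch that computes the group means two ways and plots the result with geom_col(). It assumes the example data from the question; means_vec and means_df are illustrative names, not from the thread:

```r
library(ggplot2)

# Example data from the question
df <- data.frame(NUMERIC1 = c(100, 200, 300, 400),
                 NUMERIC2 = 1:4,
                 GROUP = c("A", "B", "C", "A"))

# Two equivalent ways to get the per-group means
means_vec <- tapply(df$NUMERIC1, df$GROUP, FUN = mean)           # named vector
means_df  <- aggregate(NUMERIC1 ~ GROUP, data = df, FUN = mean)  # data frame

# ggplot2 wants a data frame, so the aggregate() result is the one to plot.
# geom_col() uses the y values as given, like geom_bar(stat = "identity").
p <- ggplot(means_df, aes(x = GROUP, y = NUMERIC1)) + geom_col()
# print(p) draws the chart
```

So the tapply() call from the question is fine for computing the means; it just returns a named vector, which suits barplot() better than ggplot().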

Related

Plot only specific dataframe rows that matches a criteria in R

I have a data frame built like this:
Id Client data
1 5 25
2 8 63
3 13 42
4 5 87
5 8 35
and an array: clients <- c(5,8)
I need to plot a separate histogram (of the data column) for each client in the "clients" array. In this example I would plot a histogram for client 5 with two bars (25, 87) and one for client 8, also with two bars (63, 35). I think I need to use the facet_wrap function to plot a histogram for each client. I also tried plotting in a for loop over the clients, but it didn't work. I'm not sure how to do this, so any help would be great!
Seems like you just didn't do enough data wrangling. Also, from your description, you need a bar plot, not a histogram (which would report counts of particular values in data, not the values themselves).
This is a solution in base.
dt = data.frame("id" = 1:5, "client" = c(5,8,13,5,8), "data"=c(25,63,42,87,35))
clients = c(5,8,13) # for particular clients, or unique(dt$client) for all clients
# get data for every client
lst = lapply(clients, function(x){dt[dt$client == x, "data"]})
# unify length and transform into a matrix
len = max(sapply(lst, length))
mat = do.call(cbind, lapply(lst, "[", seq_len(len)))
# Put some nice legend
colnames(mat) = paste("Client", clients)
# plot this matrix with barplot
barplot(mat, beside=TRUE, las=1)
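The "unify length" step in the base solution relies on the fact that indexing a vector past its end returns NA, so shorter clients are padded to the length of the longest one. A minimal sketch of just that step, with the per-client values written out by hand (the names lst, len and mat follow the answer):

```r
# Data values per client, as the lapply() call in the answer would produce them
lst <- list(c(25, 87), c(63, 35), 42)   # clients 5, 8 and 13

# The longest vector decides the number of rows
len <- max(sapply(lst, length))

# x[seq_len(len)] pads with NA when x is shorter than len
mat <- do.call(cbind, lapply(lst, "[", seq_len(len)))
colnames(mat) <- paste("Client", c(5, 8, 13))
print(mat)
```

Client 13 has a single observation, so its second row is NA, and barplot() should simply draw no second bar for it.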
You can plot on the same graph if there is a limited number of clients.
library(dplyr)
library(ggplot2)
df %>%
filter(Client %in% clients) %>%
group_by(Client) %>%
mutate(Id = factor(row_number())) %>%
ggplot() + aes(Client, data, fill = Id) +
geom_bar(stat = 'identity', position = 'dodge')
With facets :
df %>%
filter(Client %in% clients) %>%
group_by(Client) %>%
mutate(Id = factor(row_number())) %>%
ggplot() + aes(Client, data, fill = Id) +
geom_bar(stat = 'identity', position = 'dodge') +
facet_wrap(~Client, scales = 'free_x')
data
df <- structure(list(Id = 1:5, Client = c(5L, 8L, 13L, 5L, 8L), data = c(25L,
63L, 42L, 87L, 35L)), class = "data.frame", row.names = c(NA, -5L))
clients <- c(5,8)

Plot the Share of one Category of a Categorical Variable with Respect to all Categories of a Second Variable

I have a dataframe like this:
df <- data.frame(Reason = sample(rep(c("R1", "R2", "R3", "R4"), each = 100)),
Answer = sample(rep(c("yes", "no", "no", "no"), 100)))
head(df)
I want ggplot to do a bar plot that shows the share of "yes"-answers (y-axis) for every reason (x-axis).
I tried this:
ggplot(data = df, aes(x = interaction(Reason, Answer))) +
geom_bar(aes(y = ..count../sum(..count..)))
This leads to the following outcome:
how it looks like
The problem is that the bars sum up to 1 (in total). I want them to sum up to one within each Reason-category. (R1.no and R1.yes should sum up to 1, R2.no and R2.yes should sum up to one and so on).
When this is done, I want to discard all bars bearing information about the "no"-answers. So basically, I just want the shares of the "yes"-answers within each Reason-category. This should look something like that:
how it should look like
I obtained the desired result doing this:
a <- prop.table(table(df$Reason, df$Answer),1)
df2 <- data.frame(Reason = rownames(as.matrix(a)),
share = as.matrix(a)[,2])
ggplot(data = df2, aes(x = reorder(Reason, share), y = share)) +
geom_bar(stat = "identity") +
ylab("share of yes-answers")
Can I avoid this work-around and directly get the desired result from ggplot? This would have some major advantages for me.
Thanks a lot,
Andi
The solution by Yuriy only works when each Reason has the same number of observations (100 here). I think you have to calculate the proportion somehow, otherwise you cannot sort beforehand. So in the first part, I manipulate the data by adding a column p, which is 1 if the answer is yes and 0 if it is no.
library(dplyr)
library(ggplot2)
set.seed(99)
df <- data.frame(
Reason = sample(rep(c("R1", "R2", "R3", "R4"), each = 100)),
Answer = sample(rep(c("yes", "no", "no", "no"), 100)))
head(df %>% mutate(p=as.numeric(Answer=="yes")),3)
Reason Answer p
1 R3 no 0
2 R3 yes 1
3 R1 no 0
Then we plot with this data frame. The y axis is simply the mean of p within each group on the x axis, so we can use stat_summary with fun=mean (the argument was called fun.y before ggplot2 3.3.0). reorder works well here because it calculates the average of each category and reorders according to that:
ggplot(df %>% mutate(p=as.numeric(Answer=="yes")),
aes(x=reorder(Reason,p),y=p)) +
stat_summary(fun="mean",geom="bar",fill="orchid4")
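The reason the mean works here is that the mean of a 0/1 indicator is the share of 1s. A small base R check of that claim, assuming the same simulated data; share_mean and share_tab are illustrative names:

```r
set.seed(99)
df <- data.frame(
  Reason = sample(rep(c("R1", "R2", "R3", "R4"), each = 100)),
  Answer = sample(rep(c("yes", "no", "no", "no"), 100)))

# Indicator column: 1 for "yes", 0 for "no"
df$p <- as.numeric(df$Answer == "yes")

# Share of "yes" per Reason, computed as a group mean ...
share_mean <- tapply(df$p, df$Reason, mean)

# ... and via the row-wise proportion table from the question's work-around
share_tab <- prop.table(table(df$Reason, df$Answer), 1)[, "yes"]

print(cbind(share_mean, share_tab))
```

Both columns should agree, which is why stat_summary with a mean can replace the prop.table work-around.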
And this will work for situations where you have a different number of observations for different categories:
set.seed(100)
df <- data.frame(
Reason = rep(c("R1", "R2", "R3", "R4"),times=seq(50,200,length.out=4)),
Answer = sample(c("yes","no"),500,prob=c(0.5,0.5),replace=TRUE)
)
# we expect
sort(tapply(df$Answer=="yes",df$Reason,mean))
R2 R4 R3 R1
0.460 0.505 0.520 0.540
ggplot(df %>% mutate(p=as.numeric(Answer=="yes")),
aes(x=reorder(Reason,p),y=p)) +
stat_summary(fun="mean",geom="bar",fill="orange")
ggplot(df[df$Answer == "yes", ]) +
geom_bar(aes(x = Reason, y = sort(..prop..), group = 1))

Plot difference over time in R

Consider the following data:
set.seed(4235)
dates <- c("2016-01-01", "2015-01-01", "2014-01-01", "2013-01-01")
small <- data.frame(group = "small", n1 = rnorm(4), dates = as.Date(dates))
medium <- data.frame(group = "medium", n1 = rnorm(4), dates = as.Date(dates))
large <- data.frame(group = "large", n1 = rnorm(4), dates = as.Date(dates))
data <- rbind(small, medium, large)
This is pretty basic data that can be plotted like this:
ggplot(data, aes(dates, col = group)) +
geom_line(aes(y = n1))
However, imagine that I want to plot the small and medium groups against the large group, that is, the difference between each of them and the large group. The large group should then be represented by a straight line at zero, and the other groups should show their difference from it. Something like an autocovariance plot.
Any idea on how to do this with ggplot?
You probably can't do it with ggplot directly, though it is relatively straightforward to calculate the differences first, then pass them back to ggplot.
Here, I am using tidyr and dplyr to do the manipulations. First, I spread the data to get the groups in their own columns (with one row per date) to allow comparison. Then, I mutate to create the difference variables of interest. Finally, I gather the comparisons back into long form (be warned that this duplicates the entries in small, medium, and large; however, those can be dropped with select if needed). Then, simply pass the result to ggplot and plot however you desire (here, simple lines again).
library(dplyr)
library(tidyr)
library(ggplot2)
data %>%
spread(group, n1) %>%
mutate(large - medium
, large - small) %>%
gather(Comparison, Difference, `large - medium`, `large - small`) %>%
ggplot(aes(x = dates
, y = Difference
, col = Comparison)) +
geom_line()
gives:
This should also work:
library(reshape2)
df <- dcast(data, dates~group, value.var = 'n1')
df$diff.small <- df$small - df$large
df$diff.medium <- df$medium - df$large
df$large <- 0
data <- subset(melt(df, id='dates'), variable %in% c('diff.small', 'diff.medium', 'large'))
ggplot(data, aes(dates, fill = variable, col = variable)) +
geom_ribbon(aes(ymax=value, ymin=0), alpha=0.2) +
geom_line(aes(y = value))
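The reshape, subtract, reshape-back idea from the answers can also be sketched with pivot_wider() and pivot_longer(), the newer tidyr counterparts of spread() and gather(). This assumes tidyr 1.0.0 or later; the names diffs and difference are made up here, and the differences are taken relative to the large group so that large becomes the zero line the question describes:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(4235)
dates <- as.Date(c("2016-01-01", "2015-01-01", "2014-01-01", "2013-01-01"))
data <- rbind(
  data.frame(group = "small",  n1 = rnorm(4), dates = dates),
  data.frame(group = "medium", n1 = rnorm(4), dates = dates),
  data.frame(group = "large",  n1 = rnorm(4), dates = dates)
)

diffs <- data %>%
  # one row per date, one column per group
  pivot_wider(names_from = group, values_from = n1) %>%
  # subtract the large group; large itself is set to 0 last
  mutate(small = small - large,
         medium = medium - large,
         large = 0) %>%
  # back to long form for ggplot
  pivot_longer(c(small, medium, large),
               names_to = "group", values_to = "difference")

p <- ggplot(diffs, aes(dates, difference, col = group)) + geom_line()
# print(p) draws the chart
```

The order inside mutate() matters: large is overwritten only after both differences have been computed.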

Multiplot of multiplots in ggplot2

I recently discovered the multiplot function from the Rmisc package to produce stacked plots using ggplot2 plots/objects. What I am trying to do now is to create a multiplot of multiplots. Unfortunately, unlike the ggplot function, multiplot does not produce objects, so my issue cannot be resolved by simply nesting multiplot.
I will create a dataframe to make my point clear. In my dataframe named df, I have 3 columns: period, group and value. A certain value is recorded for each of 3 groups over 10 periods. (Note: I don't use a seed number below despite the use of the sample function because the focus is not numerical, it is graphical)
# Create a data frame for illustration purposes
df <- data.frame(period = rep(1:10, 3),
group = rep(LETTERS[1:3], each = 10),
value = sample(100, 30, replace = TRUE))
I then add a fourth column to df, which is the exponential transformation of the value column.
df$exp.value = exp(df$value)
I would like to create stacked plots allowing me to compare the values in each group to their exponential counterparts.
# Split dataframe by group
df_split <- split(df, df$group)
# Plots of values in each group
plots <- lapply(df_split, function(i){
ggplot(data = i, aes(x = period, y = value)) + geom_line()
})
# Plots of exponentiated values in each group
plots_exp <- lapply(df_split, function(i){
ggplot(data = i, aes(x = period, y = exp.value)) + geom_line()
})
plots and plots_exp are both lists of 3 elements each containing ggplot objects. The first element of each list corresponds to group A, the second element corresponds to group B and the third element corresponds to group C.
In order to compare each group's values to the exponential values, I can use the multiplot function. Following is an example with group A:
multiplot(plots[[1]], plots_exp[[1]], cols = 1)
How can I create a grid which will include the multiplot above as well as the ones for groups B and C? As if the code included ... + facet_grid(. ~ group)?
We can use cowplot package:
library(cowplot)
plot_grid(plots[[1]], plots_exp[[1]],
plots[[2]], plots_exp[[2]],
plots[[3]], plots_exp[[3]],
labels = c("A", "A", "B", "B", "C", "C"),
ncol = 1, align = "v")
We can output to a pdf looping through plots and plots_exp list objects. Every page will contain 2 plots. This is a better option when we have a lot of groups:
pdf("myPlots.pdf")
# print() each page explicitly so this also works inside source() or a function
for (i in seq_along(plots)) {
print(plot_grid(plots[[i]], plots_exp[[i]], ncol = 1, align = "v"))
}
dev.off()
Another option is to prepare the data for ggplot and use facet as usual:
library(dplyr)
library(tidyr)
library(ggplot2)
gather(df, valueType, value, -c(group, period)) %>%
mutate(myGroup = paste(group, valueType)) %>%
ggplot(aes(period, value)) +
geom_line() +
facet_grid(myGroup ~ ., scales = "free_y")
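To get the layout the question describes, one column per group as with facet_grid(. ~ group), the two plot lists can be concatenated and passed to plot_grid() through its plotlist argument. A minimal sketch under that assumption; all_plots and grid are illustrative names, and the data is random as in the question:

```r
library(ggplot2)
library(cowplot)

df <- data.frame(period = rep(1:10, 3),
                 group = rep(LETTERS[1:3], each = 10),
                 value = sample(100, 30, replace = TRUE))
df$exp.value <- exp(df$value)
df_split <- split(df, df$group)

plots <- lapply(df_split, function(d) {
  ggplot(d, aes(period, value)) + geom_line()
})
plots_exp <- lapply(df_split, function(d) {
  ggplot(d, aes(period, exp.value)) + geom_line()
})

# plot_grid() fills row by row, so with ncol = 3:
# row 1 = value plots for A, B, C; row 2 = exp.value plots for A, B, C
all_plots <- c(plots, plots_exp)
grid <- plot_grid(plotlist = all_plots, ncol = 3,
                  labels = c(names(plots), rep("", 3)))
# print(grid) draws the combined figure
```

Each column then holds one group, with its values on top and the exponentiated values below.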

barplot of percentages per category, per variable

Given the following example data:
df<-data.frame(cbind(cntry<- c("BE","ES","IN","GE","BE","ES","GE",NA,"IN","IN"),
gndr<- c(NA,1,2,2,2,2,1,1,1,2),
plcvcrcR<-c(0,1,NA,0,0,1,1,1,0,0),
plcpvcrR<-c(0,1,1,1,NA,0,0,0,0,0),
plccbrgR<- c(0,1,0,NA,0,1,0,1,1,0),
plcarcrR<-c(1,0,0,NA,1,0,1,0,0,0),
plcrspcR<-c(1,1,0,0,0,0,0,1,1,NA)))
colnames(df)<- c("cntry", "gndr", "plcvcrcR", "plcpvcrR", "plccbrgR", "plcarcrR", "plcrspcR")
df
How could I make barplots showing, for example, for each gender (gndr) the percentage of 1-values on the variables plcpvcrR, plccbrgR, plcarcrR? Preferably the bars for each gender are grouped, with a different colour for each variable.
Something like this image, where one colour refers to the question, and the group to the gender (without the confidence interval):
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSsAlUJsqdhxXHiY35FxFmVx3BREVji_ca24w9ub_OYEfZ3O50X5Q
I have experimented with the following function, of which I am aware it contains many flaws:
barplot(((colSums(df[c(3:5)], na.rm=TRUE)/nrow(df[c(3:5)]))*100)~gndr)
I'd do something like this:
require(ggplot2)
require(reshape2)
require(scales)
require(plyr)
# remove NA from gndr
df <- df[!is.na(df$gndr), ]
# now get percentages
df.o <- ddply(df, .(gndr), summarise,
plcpvcrR = sum(plcpvcrR == 1, na.rm = T)/sum(!is.na(plcpvcrR)),
plccbrgR = sum(plccbrgR == 1, na.rm = T)/sum(!is.na(plccbrgR)),
plcrspcR = sum(plcrspcR == 1, na.rm = T)/sum(!is.na(plcrspcR)))
# melt it:
df.m <- melt(df.o, id.var = "gndr")
# plot it:
# the aesthetic is called weight (not weights)
ggplot(data = df.m, aes(x=gndr)) + geom_bar(aes(weight=value, fill=variable),
position = "dodge") + scale_y_continuous(labels=percent)
There may be an easier or more straightforward way to get the percentages. Here's the plot:
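One possibly simpler route to the percentages, sketched here as an assumption and not as the answerer's method: since the variables are coded 0/1, the mean with NA removed is the share of 1s. The sketch builds the columns without cbind() so they stay numeric, uses the three variables named in the question, and the names vars, shares and long are made up:

```r
library(ggplot2)

# Same values as the question, built directly so columns keep their types
df <- data.frame(
  gndr     = c(NA, 1, 2, 2, 2, 2, 1, 1, 1, 2),
  plcpvcrR = c(0, 1, 1, 1, NA, 0, 0, 0, 0, 0),
  plccbrgR = c(0, 1, 0, NA, 0, 1, 0, 1, 1, 0),
  plcarcrR = c(1, 0, 0, NA, 1, 0, 1, 0, 0, 0)
)
df <- df[!is.na(df$gndr), ]   # drop rows with unknown gender

vars <- c("plcpvcrR", "plccbrgR", "plcarcrR")

# Share of 1s per gender: mean of a 0/1 column, ignoring NA
shares <- aggregate(df[vars], by = list(gndr = df$gndr),
                    FUN = mean, na.rm = TRUE)

# Long form: one row per gender and variable
long <- data.frame(gndr = rep(shares$gndr, times = length(vars)),
                   stack(shares[vars]))

p <- ggplot(long, aes(factor(gndr), values, fill = ind)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent)
# print(p) draws the chart
```

stack() names its columns values and ind, which is why those appear in aes().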

Resources