R - aggregating data into a dataframe

I was recently working with some output and I can't seem to plot it informatively. The output looks like the following:
180,A,71
180,C,61
180,G,68
180,U,78
182,A,70
182,C,34
182,G,123
182,U,51
I would like to plot this data with the first column on the x axis and, on the y axis, bars filled according to the four different types (column 2) and their frequencies (column 3). So for each value in the first column, the bar height would be the total frequency of all types, and the bar would be divided according to the frequency of each type.
I hope the question was clear and thanks for any help.

How's this?
df <- data.frame(X=rep(c(180,182), each=4), Group=rep(c("A","C","G","U"),2),
Y=c(71,61,68,78,70,34,123,51))
# Calculating percentages within each X value (just using base)
groupSum <- tapply(df$Y, df$X, sum)
df$Label <- paste0(round(100 * df$Y / groupSum[as.character(df$X)], 1), "%")
# Go for the plot
library(ggplot2)
ggplot(data=df, aes(x=X, y=Y,fill=Group)) +
geom_bar(position="dodge", stat="identity") +
scale_x_continuous(breaks=unique(df$X))
The last line (scale_x_continuous) puts axis labels only on the x values actually used.
And this is what @Haroka's plot would look like (with percentages now added as per request -- also see here):
ggplot(data=df, aes(x=X, y=Y,fill=Group)) +
geom_bar(position="stack", stat="identity") +
scale_x_continuous(breaks=unique(df$X)) +
geom_text(aes(label = Label), size=12, hjust=0.5, vjust=3, position="stack")
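As a hedged sketch of the percentage step on its own, the shares within each bar can also be computed with base R's ave(), which returns the per-X total on every row. The column name Pct and the check variable bar_totals are introduced here for illustration only and are not part of the original answer:

```r
# Same data as in the answer above
df <- data.frame(X = rep(c(180, 182), each = 4),
                 Group = rep(c("A", "C", "G", "U"), 2),
                 Y = c(71, 61, 68, 78, 70, 34, 123, 51))

# ave() gives, for every row, the sum of Y over all rows sharing that row's X
df$Pct <- 100 * df$Y / ave(df$Y, df$X, FUN = sum)
df$Label <- paste0(round(df$Pct, 1), "%")

# Sanity check: the shares within each bar should add up to 100
bar_totals <- tapply(df$Pct, df$X, sum)
print(df)
print(bar_totals)
```

If the labels are meant to describe each segment's share of its bar, a check like bar_totals is a quick way to confirm the grouping variable is the right one.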

Related

Change plotting order of bars in ggplot2

I'm preparing an appendix plot for a revised manuscript where I need to give information of the within-year ranges (variability) of several variables between years and sites.
I figured the tidiest way to do this (I have 7 sites, 21 years, and 5 variables...) would be to use a rose plot using coord_polar. However, I stumbled upon something that has always frustrated me about ggplot - the default ordering assumptions. While factors are easily reordered based on some value, this seems to only work in a fixed fashion: as far as I've understood, the order needs to apply throughout the data frame.
In this plot, the ordering needs to depend on a value which changes between years, and therefore the colour and fill values need to change in plotting order within the panel.
To demonstrate, I've created a reproducible example coded below (pictured in the way it should not work)
Basically, I always need the Site with the minimum value within a given Year to be plotted first (in the centre), followed outwards by the increase in value of the other sites, in order of the original value (see order and diff columns of the data frame). In other words, some years Site a will be at the centre, some years Site c will be in the centre, etc.
Any help would be massively appreciated.
library('ggplot2')
library('reshape2')
library("plyr")
## reproducible example of problem: create dummy data
madeup <- data.frame(Year = rep(2000:2015, each=20), Site=rep(c("a","b","c","d"), each=5, times=16),
var1 = rnorm(n=16*20, mean=20, sd=5), var2= rnorm(n=16*20, mean=50, sd=1))
## create ranges of the data by Year and Site
myRange <- function(dat) {range=max(dat, na.rm=TRUE)-min(dat,na.rm = TRUE)}
vardf <- ddply(madeup, .(Site, Year), summarise, var1=myRange(var1),
var2=myRange(var2))
varmelt <- melt(vardf, id.vars = c("Site","Year"))
varmelt$Site <- as.character(varmelt$Site) # this to preserve the new order when rbind called
varmelt <- by(varmelt, list(varmelt$Year, varmelt$variable), function(x) {x <- x[order(x$value),]
x$order <- 1:nrow(x)
return(x)})
varmelt <- do.call(rbind, varmelt)
## create difference between these values so that each site gets plotted cumulatively on the rose plot
## (otherwise areas close to the centre become uninterpretable)
vartest <- by(varmelt, list(varmelt$Year, varmelt$variable), function(x) {
x$diff <- c(x$value[1], diff(x$value))
return(x)
})
vartest <- do.call(rbind,vartest)
## plot rose plot to display how ranges in variables vary by year and between sites
## for this test example we'll just take one variable, but the idea is to facet by variable
max1 <- max(vartest$value[vartest$variable=='var1'])
yearlength <- length(2000:2015)
ggplot(vartest[vartest$variable=="var1",], aes(x=factor(Year), y=diff)) +
theme_bw() +
geom_hline(yintercept = seq(0,max1, by=1), size=0.3, col="grey60",lty=3) +
geom_vline(xintercept=seq(1,yearlength,1), size=0.3, col='grey30', lty=2) +
geom_bar(stat='identity', width=1, size=0.5, aes(col=Site, fill=Site)) +
scale_x_discrete() +
coord_polar() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
As long as you don't use stacked bars (position = "stack", which is the default for geom_bar), ggplot2 will actually use the order of the rows in your data as the plotting order. So all you need to do is use the original values for the y-axis (rather than the cumulatively differenced ones) along with position = "identity", and order your data from largest to smallest value before plotting, so that the smaller bars are drawn last and end up on top:
ordered_data <- vartest[order(-vartest$value), ]
ggplot(ordered_data, aes(factor(Year), value)) +
geom_col(aes(fill = Site), position = "identity", width = 1) +
coord_polar() +
facet_wrap(~ variable)
Created on 2018-02-17 by the reprex package (v0.2.0).
PS. When generating random data for an example, consider using set.seed so that your results can be reproduced exactly.
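A minimal illustration of the set.seed suggestion, assuming nothing beyond base R (the seed value 42 and the variable names are arbitrary choices for this sketch):

```r
# Setting the same seed before each draw makes the random numbers repeatable
set.seed(42)
first_draw <- rnorm(n = 5, mean = 20, sd = 5)

set.seed(42)
second_draw <- rnorm(n = 5, mean = 20, sd = 5)

# TRUE: both draws are exactly the same values
print(identical(first_draw, second_draw))
```

Placing a single set.seed() call before the rnorm() calls that build madeup should make the example data identical on every run.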
You can start with a single plot of the largest site, and then layer smaller sites on top like so:
a <- ggplot(vartest[vartest$variable=="var1"& vartest$order==4,], aes(x=factor(Year), y=value,group=order)) +
theme_bw() +
geom_hline(yintercept = seq(0,max1, by=1), size=0.3, col="grey60",lty=3) +
geom_vline(xintercept=seq(1,yearlength,1), size=0.3, col='grey30', lty=2) +
geom_bar(stat='identity', width=1, size=0.5, aes(col=Site, fill=Site)) +
scale_x_discrete() +
coord_polar() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
b <- a + geom_bar(data = vartest[vartest$variable=="var1"& vartest$order==3,],
stat='identity', width=1, size=0.5, aes(x=factor(Year), y=value,col=Site, fill=Site))
c <- b + geom_bar(data = vartest[vartest$variable=="var1"& vartest$order==2,],
stat='identity', width=1, size=0.5, aes(x=factor(Year), y=value,col=Site, fill=Site))
c + geom_bar(data = vartest[vartest$variable=="var1"& vartest$order==1,],
stat='identity', width=1, size=0.5, aes(x=factor(Year), y=value,col=Site, fill=Site))
This produces the following:
Is that what you wanted?

Merge two stacked bar graphs into one plot R (ggplot2) [duplicate]

I'm having quite the time understanding geom_bar() and position="dodge". I was trying to make some bar graphs illustrating two groups. Originally the data was from two separate data frames. Per this question, I put my data in long format. My example:
test <- data.frame(names=rep(c("A","B","C"), 5), values=1:15)
test2 <- data.frame(names=c("A","B","C"), values=5:7)
df <- data.frame(names=c(paste(test$names), paste(test2$names)), num=c(rep(1,
nrow(test)), rep(2, nrow(test2))), values=c(test$values, test2$values))
I use that example as it's similar to the spend vs. budget example. Spending has many rows per names factor level whereas the budget only has one (one budget amount per category).
For a stacked bar plot, this works great:
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity")
In particular, note the y value maxes. They are the sums of the data from test, with the values of test2 shown in blue on top.
Based on other questions I've read, I simply need to add position="dodge" to make it a side-by-side plot vs. a stacked one:
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", position="dodge")
It looks great, but note the new max y values. It seems like it's just taking the max y value from each names factor level from test for the y value. It's no longer summing them.
Per some other questions (like this one and this one), I also tried adding the group= option without success (it produces the same dodged plot as above):
ggplot(df, aes(x=factor(names), y=values, fill=factor(num), group=factor(num))) +
geom_bar(stat="identity", position="dodge")
I don't understand why the stacked works great and the dodged doesn't just put them side by side instead of on top.
ETA: I found a recent question about this on the ggplot google group with the suggestion to add alpha=0.5 to see what's going on. It isn't that ggplot is taking the max value from each grouping; it's actually over-plotting bars on top of one another for each value.
It seems that when using position="dodge", ggplot expects only one y per x. I contacted Winston Chang, a ggplot developer, to confirm this and to ask whether it can be changed, as I don't see an advantage.
It seems that stat="identity" should tell ggplot to tally the y=val passed inside aes() instead of individual counts which happens without stat="identity" and when passing no y value.
For now, the workaround seems to be (for the original df above) to aggregate so there's only one y per x:
df2 <- aggregate(df$values, by=list(df$names, df$num), FUN=sum)
p <- ggplot(df2, aes(x=Group.1, y=x, fill=factor(Group.2)))
p <- p + geom_bar(stat="identity", position="dodge")
p
I think the problem is that you want to stack within values of the num group, and dodge between values of num.
It might help to look at what happens when you add an outline to the bars.
library(ggplot2)
set.seed(123)
df <- data.frame(
id = 1:18,
names = rep(LETTERS[1:3], 6),
num = c(rep(1, 15), rep(2, 3)),
values = sample(1:10, 18, replace=TRUE)
)
By default, there are a lot of bars stacked - you just don't see that they're separate unless you have an outline:
# Stacked bars
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", colour="black")
If you dodge, you get bars that are dodged between values of num, but there may be multiple bars within each value of num:
# Dodged on 'num', but some overplotted bars
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", colour="black", position="dodge", alpha=0.1)
If you also add id as a grouping var, it'll dodge all of them:
# Dodging with unique 'id' as the grouping var
ggplot(df, aes(x=factor(names), y=values, fill=factor(num), group=factor(id))) +
geom_bar(stat="identity", colour="black", position="dodge", alpha=0.1)
I think what you want is to both dodge and stack, but you can't do both.
So the best thing is to summarize the data yourself.
library(plyr)
df2 <- ddply(df, c("names", "num"), summarise, values = sum(values))
ggplot(df2, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", colour="black", position="dodge")
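As a rough sketch of the same summarising step without plyr, assuming the data built from test and test2 in the question, base R's aggregate() with a formula gives one row per names/num combination, which is the shape that dodging needs. The rbind/cbind construction of df is just a shorter way to build the same long-format data:

```r
# Rebuild the question's long-format data: many rows for num = 1, one per name for num = 2
test  <- data.frame(names = rep(c("A", "B", "C"), 5), values = 1:15)
test2 <- data.frame(names = c("A", "B", "C"), values = 5:7)
df <- rbind(cbind(test, num = 1), cbind(test2, num = 2))

# Collapse to a single summed value per names/num pair
df2 <- aggregate(values ~ names + num, data = df, FUN = sum)
print(df2)
```

With df2 there is exactly one bar per fill level at each x position, so the dodged bars no longer overplot one another.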

How to change origin line position in ggplot bar graph?

Say I'm measuring 10 personality traits and I know the population baseline. I would like to create a chart for individual test-takers to show them their individual percentile ranking on each trait. Thus, the numbers go from 1 (percentile) to 99 (percentile). Given that a 50 is perfectly average, I'd like the graph to show bars going to the left or right from 50 as the origin line. In bar graphs in ggplot, it seems that the origin line defaults to 0. Is there a way to change the origin line to be at 50?
Here's some fake data and default graphing:
df <- data.frame(
names = LETTERS[1:10],
factor = round(rnorm(10, mean = 50, sd = 20), 1)
)
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor)) +
geom_bar(stat="identity") +
coord_flip()
Picking up on @nongkrong's comment, here's some code that will do what I think you want: it plots the values shifted by 50, relabels the ticks to match the original range, and relabels the axis to avoid showing the math:
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(breaks=seq(-50,50,10), labels=seq(0,100,10)) + ylab("Percentile") +
coord_flip()
This post was really helpful for me - thanks @ulfelder and @nongkrong. However, I wanted to re-use the code on different data without having to manually adjust the tick labels to fit the new data. To do this in a way that retained ggplot's tick placement, I defined a tiny function and passed it to the labels argument:
fix.labels <- function(x){
x + 50
}
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = fix.labels) + ylab("Percentile") +
coord_flip()
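If the origin is not always 50, one possible generalisation is a small function factory that builds the labelling function for any baseline. This is only a sketch; shift_labels and baseline are hypothetical names, not something from the answers above:

```r
# Returns a labelling function that adds the baseline back onto the shifted tick values
shift_labels <- function(baseline) {
  function(x) x + baseline
}

# Tick positions on the shifted scale map back to the original percentile scale
shifted_ticks <- c(-50, -25, 0, 25, 50)
original_ticks <- shift_labels(50)(shifted_ticks)
print(original_ticks)
```

It would then be used as scale_y_continuous(labels = shift_labels(50)) together with y = factor - 50 in aes(), keeping the two occurrences of the baseline easy to change together.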

R/ggplot2 - Overlapping labels on facet_grid

Folks,
I am plotting histograms using geom_histogram and I would like to label each histogram with the mean value (I am using mean for the sake of this example). The issue is that I am drawing multiple histograms in one facet and I get labels overlapping. This is an example:
library(ggplot2)
df <- data.frame (type=rep(1:2, each=1000), subtype=rep(c("a","b"), each=500), value=rnorm(4000, 0,1))
plt <- ggplot(df, aes(x=value, fill=subtype)) + geom_histogram(position="identity", alpha=0.4)
plt <- plt + facet_grid(. ~ type)
plt + geom_text(aes(label = paste("mean=", mean(value)), colour=subtype, x=-Inf, y=Inf), data = df, size = 4, hjust=-0.1, vjust=2)
Result is:
The problem is that the labels for Subtypes a and b are overlapping. I would like to solve this.
I have tried the position, both dodge and stack, for example:
plt + geom_text(aes(label = paste("mean=", mean(value)), colour=subtype, x=-Inf, y=Inf), position="stack", data = df, size = 4, hjust=-0.1, vjust=2)
This did not help. In fact, it issued warning about the width.
Would you pls help ?
Thx,
Riad.
I think you could precalculate the mean values in a new data frame before plotting.
library(plyr)
df.text<-ddply(df,.(type,subtype),summarise,mean.value=mean(value))
df.text
  type subtype   mean.value
1    1       a -0.003138127
2    1       b  0.023252169
3    2       a  0.030831337
4    2       b -0.059001888
Then use this new data frame in geom_text(). To ensure that values do not overlap you can provide two values in vjust= (as there are two values in each facet).
ggplot(df, aes(x=value, fill=subtype)) +
geom_histogram(position="identity", alpha=0.4)+
facet_grid(. ~ type)+
geom_text(data=df.text,aes(label=paste("mean=",mean.value),
colour=subtype,x=-Inf,y=Inf), size = 4, hjust=-0.1, vjust=c(2,4))
Just to expand on @Didzis's answer:
You actually have two problems here. First, the text overlaps, but more importantly, when you use aggregating functions in aes(...), as in:
geom_text(aes(label = paste("mean=", mean(value)), ...
ggplot does not respect the subsetting implied in the facets (or in the groups for that matter). So mean(value) is based on the full dataset regardless of faceting or grouping. As a result, you have to use an auxiliary table, as @Didzis shows.
BTW:
df.text <- aggregate(df$value,by=list(type=df$type,subtype=df$subtype),mean)
gets you the means and does not require plyr.
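To make the difference concrete, here is a small base R sketch contrasting the single overall mean with the per-facet, per-subtype means. The seed, the rounding to three digits and the label column are choices made for this illustration, not part of the original answers:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(type = rep(1:2, each = 1000),
                 subtype = rep(c("a", "b"), each = 500),
                 value = rnorm(4000, 0, 1))

# One number for the whole dataset: this is what every label would show
overall_mean <- mean(df$value)

# One mean per type/subtype combination, with a shorter rounded label
df.text <- aggregate(value ~ type + subtype, data = df, FUN = mean)
df.text$label <- paste("mean =", round(df.text$value, 3))

print(overall_mean)
print(df.text)
```

Passing df.text as the data argument of geom_text() with aes(label = label) would then give each facet its own two labels, and the rounding keeps them short enough to read.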
