R stacked bar charts including "other" (using ggplot2) - r

I want to make a stacked barchart that describes abundances of taxa at two locations in three different seasons. I'm using ggplot2. Making the plot is ok, but I have 48 taxa so I end up with a lot of different colours in the bar. There are only eight taxa that occur frequently and abundantly, so I'd like to group the others into "Other" for the plot.
My data looks like this:
SampleID TransectID SampleYear Season Location Taxa1 Taxa2 Taxa3 .... Taxa48
BW15001 1 2015 fall SiteA 25 0 0 0
BW15001 2 2015 fall SiteA 32 0 0 2
BW15001 2 2015 fall SiteA 6 0 45 0
BW15001 3 2015 fall SiteA 78 1 2 0
This is what I have tried (modified from here):
y <- rowSums(invert[6:54])
x<-invert[6:54]/y
x<-x[,order(-colSums(x))]
#Extract list of top N Taxa
N<-8
taxa_list<-colnames(x)[1:N]
#remove "__Unknown__" and add it to others
taxa_list<-taxa_list[!grepl("Unknown",taxa_list)]
N<-length(taxa_list)
#Generate a new table with everything added to Others
new_x<-data.frame(x[,colnames(x) %in% taxa_list],
Others=rowSums(x[,!colnames(x) %in% taxa_list]))
df<-NULL
for (i in 1:dim(new_x)[2]){
tmp<-data.frame(row.names=NULL,Sample=rownames(new_x),
Taxa=rep(colnames(new_x)[i],dim(new_x) [1]),Value=new_x[,i],Type=grouping_info[,1])
if(i==1){df<-tmp} else {df<-rbind(df,tmp)}
}
To plot the graph:
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00");
library(ggplot2)
p<-ggplot(df,aes(Sample,Value,fill=Taxa))+
geom_bar(stat="identity")+
facet_grid(. ~ Type, drop=TRUE,scale="free",space="free_x")
p<-p+scale_fill_manual(values=colours[1:(N+1)])
p<-p+theme_bw()+ylab("Proportions")
p<-p+ scale_y_continuous(expand = c(0,0))+
theme(strip.background = element_rect(fill="gray85"))+
theme(panel.spacing = unit(0.3, "lines"))
p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p
The main problem that I would like help with today is pulling out the main taxa and lumping the rest as "Other". I think I can figure out how to group the graph by Season and Location using facet_grid() later...
Thanks!

Expanding on my comment. Take a look at the forcats package. Without a full example, it's hard to say, but the following should work:
library(tidyverse)
library(forcats)
temp <- df %>%
gather(taxa, amount, -c(1:5))
# Reshape the data so that there is one record per unit of amount
tidy_df <- temp[rep(rownames(temp), times = temp$amount), ]
tidy_df %>%
select(-amount) %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>% # Check out this line
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
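You can change fct_lump(taxa, n = 2) to fct_lump(taxa, n = 8) to keep the top 8 categories and group everything else. Alternatively, you can use the prop argument, e.g. fct_lump(taxa, prop = 0.05), to lump by proportion: every level that accounts for less than that share of the records is merged into "Other".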
If you are simply going after the "presence" of the taxa in a sample (and not the value or amount), things are a bit simpler and can likely be handled in one pipe:
df %>%
gather(taxa, amount, -c(1:5)) %>%
mutate(amount = na_if(amount, 0)) %>%
na.omit() %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>%
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
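If replicating rows feels wasteful, fct_lump_n() also accepts a weight vector, so the lumping can be driven by summed abundance directly. This is only a minimal sketch, assuming forcats 0.5.0 or later; the four taxa and their totals below are made up for illustration and are not the asker's data:

```r
library(forcats)

# Hypothetical long-format data: one entry per taxon with its total abundance
taxa   <- factor(c("Taxa1", "Taxa2", "Taxa3", "Taxa4"))
amount <- c(141, 1, 47, 2)

# Keep the 2 taxa with the largest summed weight; the rest become "Other"
lumped <- fct_lump_n(taxa, n = 2, w = amount)
levels(lumped)
```

The lumped factor could then be mapped to fill, with the amount on the y axis and geom_col() instead of geom_bar().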

One way of doing it:
library(plyr)
library(reshape2) # provides melt()
library(ggplot2)
d=data.frame(SampleID=rep('BW15001',4),
TransectID=c(1,2,2,3),
SampleYear=rep(2015,4),
Taxa1=c(25,32,6,78),
Taxa2=c(0,0,0,1),
Taxa3=c(0,0,45,3))
#Reshape the df so that all taxa columns are melted into two
d=melt(d,id=colnames(d[,1:3]))
d$variable=as.character(d$variable)
# rename all uninteresting taxa as 'other'
`%ni%` <- Negate(`%in%`) # Here I decided to select the ones to keep, but the other way around is fine as well of course
d[d$variable %ni% c('Taxa1','Taxa2'),'variable']='Other' #here you could add a function to automatically determine which taxa you want to keep, as you already did
# aggregate all data for 'other'
d=ddply(d,colnames(d[,1:4]),summarise,value=sum(value))
#make your plot, this one is just a bad example
ggplot(d,aes(SampleID,value,fill=variable))+
geom_bar(stat="identity") # no facet_grid() here: this example data has no Type column
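The "automatically determine which taxa you want to keep" step mentioned in the comment above could look roughly like this in base R. It is a sketch on a small made-up table; the object names (counts, keep, lumped) and the fourth taxon are assumptions for illustration only:

```r
# Hypothetical wide table of counts, one column per taxon
counts <- data.frame(Taxa1 = c(25, 32, 6, 78),
                     Taxa2 = c(0, 0, 0, 1),
                     Taxa3 = c(0, 0, 45, 2),
                     Taxa4 = c(0, 2, 0, 0))

N <- 2  # number of taxa to keep

# Rank taxa by total abundance and keep the names of the top N
totals <- colSums(counts)
keep   <- names(sort(totals, decreasing = TRUE))[seq_len(N)]

# Everything that is not in 'keep' is summed row-wise into a single column
lumped <- data.frame(counts[, keep, drop = FALSE],
                     Other = rowSums(counts[, !(names(counts) %in% keep), drop = FALSE]))
lumped
```

The vector keep can then replace the hard-coded c('Taxa1','Taxa2') in the renaming step.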

Related

Ggplot: how to show boxplots in a given order?

I have a peculiar problem with arranging boxplots given a certain order of the x-axis, as I am adding two boxplots from different dataframe in the same plot and each time I add the second geom_boxplot, R reorders my x axis alphabetically instead of following ordered levels of factor(x).
So, I have two dataframes of different lengths looking something like this:
df1:
id value
1 A 1
2 A 2
3 A 3
4 A 5
5 B 10
6 B 8
7 B 1
8 C 3
9 C 7
df2:
id value
1 A 4
2 A 5
3 B 6
4 B 8
There are always more observations per id in df1 than in df2, and some ids in df1 are not available in df2.
I'd like df1 to be sorted by the median(value) (ascending) and to first plot boxplots for each id in that order.
Then I add a second layer with boxplots for all other measurements per id from df2, which should maintain the same order on the x-axis.
Here's how I approached that:
vec <- df1 %>%
group_by(id) %>%
summarize(m = median(value)) %>%
arrange(m) %>%
pull(id)
p1 <- df1 %>%
ggplot(aes(x = factor(id, levels = vec), y = value)) +
geom_boxplot()
p1
p2 <- p1 +
geom_boxplot(data = df2, aes(x = factor(id, levels = vec), y = value))
p2
p1 shows the right order (ids are ordered based on ascending medians), p2 always throws my order off and goes back to plotting ids alphabetically (my id is a character column with names actually). I tried with sample dataframes and the above code achieves what is required. Hence, I am not sure what could be specifically wrong about my data so that the code fails when applied to the specific data and not the above mock data.
Any ideas?
Thanks a lot in advance!
If I understood correctly, this should work.
library(tidyverse)
# Sample data
df1 <-
tibble(
id = c("A","A","A","A","B","B","B","C","C"),
value = c(1,2,3,5,10,8,1,3,7),
type = "df1"
)
df2 <-
tibble(
id = c("A","A","B","B"),
value = c(4,5,6,8),
type = "df2"
)
df <-
# Create single data.frame
df1 %>%
bind_rows(df2) %>%
# Reorder id by median(value)
mutate(id = fct_reorder(id,value,median))
df %>%
ggplot(aes(id, y = value, fill = type)) +
geom_boxplot()
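If the two data frames need to stay separate, another option is to fix the order on the scale rather than in each layer's factor. This is a sketch under the assumption that the order should come from the medians in df1 only; scale_x_discrete(limits = ...) is a standard ggplot2 argument, the rest of the names are just for illustration:

```r
library(ggplot2)

df1 <- data.frame(id = c("A","A","A","A","B","B","B","C","C"),
                  value = c(1, 2, 3, 5, 10, 8, 1, 3, 7))
df2 <- data.frame(id = c("A","A","B","B"),
                  value = c(4, 5, 6, 8))

# Median per id in df1, then the ids sorted by ascending median
med <- sapply(split(df1$value, df1$id), median)
vec <- names(sort(med))

# The limits of the discrete scale set the axis order for every layer
p <- ggplot(df1, aes(x = id, y = value)) +
  geom_boxplot() +
  geom_boxplot(data = df2, colour = "red") +
  scale_x_discrete(limits = vec)
p
```

Because the order lives on the scale, adding further layers from other data frames should not reset it.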

ggplot facet grid within a factor

Consider data that looks like this
fitem<-rep(rep(1:16,each=3),2)
fsubs<-factor(rep(rep(paste('sub',1:3,sep=''),16),2))
ftime<-factor(as.character(rep(c('a','b'),each=48)))
fcounts<-as.numeric(round(runif(96,1,10)))
fdf<-data.frame(fsubs,fitem,fcounts,ftime)
head(fdf)
fsubs fitem fcounts ftime
1 sub1 1 8 a
2 sub2 1 10 a
3 sub3 1 4 a
4 sub1 2 4 a
5 sub2 2 1 a
6 sub3 2 6 a
I would like to plot a facet grid that shows the counts for the two time points ('a','b'), subject-wise, but I can't figure out how to do this in ggplot.
Here is my ugly attempt:
fdf_counts<-data.frame()
for (i in unique(fdf$fsubs)){
fdf_counts<-append(fdf_counts,cbind(fdf%>%filter(fsubs==i,ftime=='a')%>%dplyr::select(fcounts),
fdf%>%filter(fsubs==i,ftime=='b')%>%dplyr::select(fcounts)))
fdf_counts<-data.frame(fdf_counts)
}
s1<-ggplot(fdf_counts,aes(x=fcounts,y=fcounts.1))+geom_point()+geom_smooth(method='lm')+labs(x='a',y='b',title='sub1')
s2<-ggplot(fdf_counts,aes(x=fcounts.2,y=fcounts.3))+geom_point()+geom_smooth(method='lm')+labs(x='a',y='b',title='sub2')
s3<-ggplot(fdf_counts,aes(x=fcounts.4,y=fcounts.5))+geom_point()+geom_smooth(method='lm')+labs(x='a',y='b',title='sub3')
plot_grid(s1,s2,s3)#from 'cowplot' package
How can I do this using the original fdf data.frame, especially as the number of subs increases?
Or, for example, what if I wanted one scatter plot across all of the subs, with the fcounts at ftime 'a' on the x axis plotted against the fcounts at ftime 'b' on the y axis?
Consider merging the data frame with itself on fsubs and fitem (fitem being the sequential item number within each fsubs and ftime grouping). This puts the 'a' and 'b' counts side by side while keeping one row per subject and item, which suits ggplot: you can then use facet_grid on fsubs without any iteration.
mdf <- merge(subset(fdf, ftime=="a"),
subset(fdf, ftime=="b"),
by=c("fsubs", "fitem"),
suffixes=c("", "_"))
ggplot(mdf, aes(x=fcounts, y=fcounts_)) +
geom_point() +
geom_smooth(method='lm') +
labs(x='a', y='b') +
facet_grid(~fsubs)
This should get you close:
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
fitem<-rep(rep(1:16,each=3),2)
fsubs<-factor(rep(rep(paste('sub',1:3,sep=''),16),2))
ftime<-factor(as.character(rep(c('a','b'),each=48)))
fcounts<-as.numeric(round(runif(96,1,10)))
fdf<-tibble(fsubs,fitem,fcounts,ftime)
fdf <- fdf %>%
group_by(ftime) %>%
mutate(row_id = row_number()) %>%
pivot_wider(values_from = fcounts,
names_from = ftime)
ggplot(data = fdf, aes(x = a, y = b)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(fsubs ~ ., ncol = 1)
The tidyr function pivot_wider allows us to create the shape of the data we need without explicit loops: create new columns a and b with values from fcounts. We do need to create a unique row id to make this work.
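In this particular data set the pair (fsubs, fitem) already appears to identify each observation within a time point, so a possible variant is to name those columns as id_cols instead of building a row id. This is only a sketch on data simulated the same way as in the question (set.seed() is added purely for reproducibility), not a correction of the code above:

```r
library(tidyr)

set.seed(1)  # only so the simulated counts are reproducible
fdf <- data.frame(
  fsubs   = rep(rep(paste0("sub", 1:3), 16), 2),
  fitem   = rep(rep(1:16, each = 3), 2),
  fcounts = round(runif(96, 1, 10)),
  ftime   = rep(c("a", "b"), each = 48)
)

# One row per subject and item, with the counts for 'a' and 'b' side by side
wide <- pivot_wider(fdf,
                    id_cols = c(fsubs, fitem),
                    names_from = ftime,
                    values_from = fcounts)
dim(wide)
```

If the real data has repeated (fsubs, fitem) pairs within a time point, this would produce list-columns, and the explicit row id is the safer route.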
By the way, when I run your code the plots look different from what you posted in the question.
With this output:
Just tried to create a visualization that would analyze all 4 variables. Got a geom_histogram
```{r}
fitem<-rep(rep(1:16,each=3),2)
fsubs<-factor(rep(rep(paste('sub',1:3,sep=''),16),2))
ftime<-factor(as.character(rep(c('a','b'),each=48)))
fcounts<-as.numeric(round(runif(96,1,10)))
fdf<-data.frame(fsubs,fitem,fcounts,ftime)
fdf_counts<-data.frame()
for (i in unique(fdf$fsubs)){
fdf_counts<-append(fdf_counts,cbind(fdf%>%filter(fsubs==i,ftime=='a')%>%dplyr::select(fcounts),
fdf%>%filter(fsubs==i,ftime=='b')%>%dplyr::select(fcounts)))
fdf_counts<-data.frame(fdf_counts)
}
# Refer to columns by bare name inside aes(); fdf$... bypasses the per-facet subsetting
ggplot(data = fdf, mapping = aes(x = fsubs, y = fcounts, fill = fitem)) + geom_bar(stat = "identity", position = "dodge") + facet_grid(cols = vars(ftime))
```

Using R to create a log-log scatter plot of the mean of two sets replicates

I have some RNA seq data for two different cell types (naive and Th1).
How would I find the average of the two replicates of each cell type for each gene (the grouping variable) and then plot a log-log scatter graph of these data?
My table looks like this:
df <- data.frame(Gene_name=c("Mrpl15", "Lypla1", "Tcea1"),
naive_A=c(5.21212, 6.62643, 5.74654),
naive_B=c(4.52376, 5.64459, 4.52153),
Th1_A=c(15.50650, 14.46030, 11.57770),
Th1_B=c(5.876490, 5.193010, 2.107200), stringsAsFactors=F)
df
Gene_name naive_A naive_B Th1_A Th1_B
1 Mrpl15 5.21212 4.52376 15.5065 5.87649
2 Lypla1 6.62643 5.64459 14.4603 5.19301
3 Tcea1 5.74654 4.52153 11.5777 2.10720
If anyone can help, I'd be much obliged!! Thanks
You could do something like this
library(tidyverse)
df %>%
gather(sample, expr, -Gene_name) %>%
mutate(condition = gsub("_\\w$", "", sample)) %>%
group_by(Gene_name, condition) %>%
summarise(expr.mean = mean(expr)) %>%
spread(condition, expr.mean) %>%
ggplot(aes(x = log10(naive), y = log10(Th1), label = Gene_name)) +
geom_point() +
geom_text()
Explanation: Reshape data from wide to long, extract a condition column and average expression values for every gene for every condition. Then reshape data from long to wide, and plot the log10-transformed mean expression from naive vs. mean expression from Th1 for every gene.
I've added gene labels just for illustration purposes. You can remove them by removing + geom_text().
Sample data
df <- read.table(text =
"Gene_name naive_A naive_B Th1_A Th1_B
1 Mrpl15 5.21212 4.52376 15.50650 5.876490
2 Lypla1 6.62643 5.64459 14.46030 5.193010
3 Tcea1 5.74654 4.52153 11.57770 2.107200", header = T)
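The same averaging can also be sketched without reshaping, using rowMeans() on the replicate columns and letting ggplot2 log-scale the axes with scale_x_log10() and scale_y_log10(). This is an alternative sketch, not the method above; the name means is made up for illustration:

```r
library(ggplot2)

df <- data.frame(Gene_name = c("Mrpl15", "Lypla1", "Tcea1"),
                 naive_A = c(5.21212, 6.62643, 5.74654),
                 naive_B = c(4.52376, 5.64459, 4.52153),
                 Th1_A = c(15.50650, 14.46030, 11.57770),
                 Th1_B = c(5.876490, 5.193010, 2.107200))

# Average the two replicates of each cell type, gene by gene
means <- data.frame(Gene_name = df$Gene_name,
                    naive = rowMeans(df[, c("naive_A", "naive_B")]),
                    Th1 = rowMeans(df[, c("Th1_A", "Th1_B")]))

# Log-scaled axes keep the tick labels in the original units
p <- ggplot(means, aes(x = naive, y = Th1)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
p
```

The difference from wrapping the variables in log10() is only in the axis labelling: the points land in the same positions.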

Stacked bar graph with only top/bottom results

I have an Excel file and am trying to create a bar chart that groups categories and shows the average rating of the category. Because there are a lot of categories, I'd also like to only show either the top 10 or bottom 10 in the resulting horizontal bar chart.
category rating
A 10
A 8
A 9
B 1
B 4
B 9
C 6
C 7
D 9
Something like this (representative bar instead of the numbers):
A 9
D 9
...
C 6.5
B 4.66
I know this seems super simple to do, but I can't seem to be able to get anything working after trying various answers around here. Using ggplot2 seems to be the most promising so far. Closest I've gotten is showing the number of ratings for each category...
Edit: didn't save the work I did earlier as it wasn't the result I wanted, but it was something like this (didn't use ggplot)
dat[,c(1,12)]
category = dat[,1] # selecting column from sheet
rating = dat[,12] # selecting column from sheet
rating<-as.numeric(unlist(dat[,12]))
dat<-table(dat$rating,dat$category)
barplot(dat, main="Overall Ratings",
xlab="Ratings", col=c("skyblue","red"), horiz=TRUE,
legend = rownames(dat))
Here's a chaining solution using dplyr and tidyr. First, we need to load the data.
library(dplyr)
library(tidyr)
library(ggplot2)
df <- read.table(text="category,rating
A,10
A,8
A,9
B,1
B,4
B,9
C,6
C,7
D,9
", sep=",", header=TRUE)
Now to the solution. After grouping the data by category, we calculate each category's mean rating.
means.df <-
df %>%
group_by(category) %>%
summarise(mean = mean(rating))
top_n selects the top (positive number) or bottom (negative number) n rows from a dataset. We apply this to our dataset with means. In your real data, adjust the 2 to 10 for the top and to -10 for the bottom 10 categories.
means.df %>%
top_n(2, mean) %>%
ggplot(aes(x = category, y = mean)) +
geom_bar(stat = 'identity')
The following code plots the top/bottom cutoff_number categories into one plot. Adjust the variable cutoff_number as needed.
cutoff_number <- 2
means.df %>%
arrange(-mean) %>%
mutate(
topbottom = ifelse(row_number() <= cutoff_number, "top", NA),
topbottom = ifelse(row_number() > nrow(.) - cutoff_number, "bottom", topbottom)
) %>%
ggplot(aes(x = category, y = mean)) +
geom_bar(stat = 'identity') +
facet_wrap(~topbottom, scales = 'free_x')
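For the horizontal, sorted layout the question asks for, something along these lines might work. It is a sketch that assumes dplyr 1.0.0 or later, where slice_max() and slice_min() are available as alternatives to top_n(); the names top2 and bottom2 are made up:

```r
library(dplyr)
library(ggplot2)

df <- data.frame(category = c("A","A","A","B","B","B","C","C","D"),
                 rating = c(10, 8, 9, 1, 4, 9, 6, 7, 9))

means_df <- df %>%
  group_by(category) %>%
  summarise(mean = mean(rating))

# Top and bottom 2 categories by mean rating (use n = 10 on the real data)
top2    <- means_df %>% slice_max(mean, n = 2)
bottom2 <- means_df %>% slice_min(mean, n = 2)

# reorder() sorts the bars by mean; coord_flip() makes them horizontal
p <- ggplot(top2, aes(x = reorder(category, mean), y = mean)) +
  geom_col() +
  coord_flip() +
  labs(x = "category", y = "mean rating")
p
```

Note that slice_max() keeps ties by default, so more than n rows can come back when several categories share the cutoff value.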
This solution uses data.table to summarize the data, then delivers the result to ggplot:
library(data.table);library(ggplot2)
category=c("A","A","A","B","B","B","C","C","D")
rating=c(10,9,8,1,4,9,6,7,9)
dt=as.data.table(cbind(category,rating))
ggplot(dt[,mean(as.numeric(rating)),by=category],aes(category,V1))+geom_col()+ylab("Mean")

cumsum data over time by factor

I'm using the campaign contributions data from Oregon and I'm trying to make a graph that displays the cumulative amount of contributions per candidate over time. Here's what I have so far:
ggplot(aes(x = as.Date(contb_receipt_dt, "%d-%b-%y"),
y = cumsum(contb_receipt_amt)),
data = subset(oregon_data,
table(oregon_data$cand_nm)[oregon_data$cand_nm] > 1000
& as.Date(contb_receipt_dt, "%d-%b-%y") > as.Date("2015-01-01"))) +
geom_line(aes(color = cand_nm))
This is what it looks like:
What I would like to see is a line for each candidate that starts off at 0 and slowly goes up with each additional contribution. What should I do?
I would use dplyr to calculate the cumsum column per candidate before sending it on to ggplot. This should give you enough to get started; you will still need to pretty it up and filter the data to get the results you are looking for:
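library(dplyr)
library(ggplot2)
WashingtonData <- read.csv("P00000001-WA.csv")
# Parse the dates first (format taken from the question) so arrange() sorts chronologically, not alphabetically
WashingtonData <- WashingtonData %>%
mutate(contb_receipt_dt = as.Date(contb_receipt_dt, "%d-%b-%y")) %>%
arrange(contb_receipt_dt)
MyGraphData <- WashingtonData %>% group_by(cand_nm) %>% mutate(cumsum = cumsum(contb_receipt_amt))
g <- ggplot(data=MyGraphData, aes(y=cumsum, x=contb_receipt_dt, color=cand_nm)) + geom_line()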
g
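To see what the grouped cumulative sum does, here is a tiny base R sketch on made-up contributions. The candidate names, dates and amounts are hypothetical, and ISO dates are used so that as.Date() needs no format string:

```r
# Hypothetical contributions for two candidates, deliberately out of date order
contrib <- data.frame(
  cand_nm = c("X", "Y", "X", "Y", "X"),
  contb_receipt_dt = as.Date(c("2015-03-05", "2015-02-01", "2015-01-10",
                               "2015-04-20", "2015-02-15")),
  contb_receipt_amt = c(100, 50, 25, 75, 10)
)

# Sort chronologically, then accumulate within each candidate
contrib <- contrib[order(contrib$contb_receipt_dt), ]
contrib$running_total <- ave(contrib$contb_receipt_amt, contrib$cand_nm, FUN = cumsum)
contrib
```

Each candidate's running_total starts at their first contribution and only ever grows, which is the shape the line for each candidate should have.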
