Plotting the top 5 values from a table in R

I'm very new to R so this may be a simple question. I have a table of data that contains frequency counts of species like this:
Acidobacteria 47
Actinobacteria 497
Apicomplexa 7
Aquificae 16
Arthropoda 26
Ascomycota 101
Bacillariophyta 1
Bacteroidetes 50279
...
There are about 50 species in the table. As you can see some of the values are a lot larger than the others. I would like to have a stacked barplot with the top 5 species by percentage and one category of 'other' that has the sum of all the other percentages. So my barplot would have 6 categories total (top 5 and other).
I have 3 additional datasets (sample sites) that I would like to treat the same way, except that each of them should highlight the first dataset's top 5 species, and I would like to put them all on the same graph. The final graph would have 4 stacked bars showing how the top species in the first dataset change in each additional dataset.
I made a sample plot by hand (tabulated the data outside of R and just fed in the final table of percentages) to give you an idea of what I'm looking for: http://dl.dropbox.com/u/1938620/phylumSum2.jpg
I would like to put these steps into an R script so I can create these plots for many datasets.
Thanks!

Say your data is in the data.frame DF
DF <- read.table(textConnection(
"Acidobacteria 47
Actinobacteria 497
Apicomplexa 7
Aquificae 16
Arthropoda 26
Ascomycota 101
Bacillariophyta 1
Bacteroidetes 50279"), stringsAsFactors=FALSE)
names(DF) <- c("Species","Count")
Then you can determine which species are in the top 5 by
top5Species <- DF[rev(order(DF$Count)),"Species"][1:5]
Each of the data sets can then be converted to these 5 and "Other" by
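library(plyr)  # provides ddply() and summarise()
# Label each species as one of the top 5 or "Other"
DF$Group <- ifelse(DF$Species %in% top5Species, DF$Species, "Other")
DF$Group <- factor(DF$Group, levels=c(top5Species, "Other"))
# Total count and proportion per group
DF.summary <- ddply(DF, .(Group), summarise, total=sum(Count))
DF.summary$prop <- DF.summary$total / sum(DF.summary$total)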
Making Group a factor keeps them all in the same order in DF.summary (largest to smallest per the first data set).
Then you just put them together and plot them as you did in your example.
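The "put them together" step could look roughly like the base-R sketch below. The two sites and their counts are made up for illustration, and the helper name group_props is hypothetical; the only assumption is that every data set has Species and Count columns.

```r
# Hypothetical sample sites: each is a data.frame with Species and Count.
site1 <- data.frame(Species = c("A", "B", "C", "D", "E", "F", "G"),
                    Count   = c(500, 300, 100, 50, 30, 15, 5),
                    stringsAsFactors = FALSE)
site2 <- data.frame(Species = c("A", "B", "C", "D", "E", "F", "G"),
                    Count   = c(100, 400, 200, 10, 90, 150, 50),
                    stringsAsFactors = FALSE)
sites <- list(site1 = site1, site2 = site2)

# The top 5 species are taken from the first data set only.
top5 <- site1$Species[order(site1$Count, decreasing = TRUE)][1:5]

# Proportion of each group within one site, always in the same order.
group_props <- function(df, top5) {
  grp <- factor(ifelse(df$Species %in% top5, df$Species, "Other"),
                levels = c(top5, "Other"))
  totals <- tapply(df$Count, grp, sum)
  totals[is.na(totals)] <- 0  # a group may be absent from a site
  setNames(as.vector(totals) / sum(totals), levels(grp))
}

# One column per site, one row per group (top 5 plus "Other").
prop_matrix <- sapply(sites, group_props, top5 = top5)
```

Calling barplot(prop_matrix, legend.text = TRUE) on the result should then draw one stacked bar per site, with the groups stacked in the order of the first data set.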

We should make it a habit to use data.table wherever possible:
library(data.table)
DT<-data.table(DF,key="Count")
DT[order(-rank(Count), Species)[6:nrow(DT)],Species:="Other"]
DT<-DT[, list(Count=sum(Count),Pcnt=sum(Count)/DT[,sum(Count)]),by="Species"]

Related

Plotting mean values of groups in a dataframe in R

I have conducted a study with triplicates (SampleID) for each sample (Sample) on different time points.
Now, I want to plot the means of the triplicates for the characteristic "Aerobic".
I want to plot for example the development of amount of aerobic bacteria over time. Therefore, I need to calculate the means (and the standard deviation) of the triplicates and then plot these means in the graph. Here, I could imagine to use a geom_line or geom_point diagram.
SampleID Sample Aerobic Anaerobic Day
[Factor] [Factor] [num] [num] [num]
1 V1.1.K1 V1.1.K 0.610063430 0.05146154 1
2 V1.1.K2 V1.1.K 0.740887757 0.02115290 1
3 V1.1.K3 V1.1.K 0.683726217 0.04270182 1
4 V1.1.N1 V1.1.N 0.432019752 0.35722350 1
5 V1.1.N2 V1.1.N 0.515792694 0.41357935 1
6 V1.14.K16 V1.14.K 0.038141335 0.84496088 14
7 V1.14.K17 V1.14.K 0.042078682 0.76523093 14
8 V1.14.K18 V1.14.K 0.009594763 0.90767637 14
9 V1.14.N0 V1.14.N 0.513100502 0.10618731 14
10 V1.14.W16 V1.14.W 0.483710571 0.32765968 14
How should I do this?
I tried it with the following code
plot <- mydata %>%
group_by(Sample) %>%
mutate(Mean=mean(Aerobic)) %>%
ggplot(aes(x=Day, y=Aerobic)) +
geom_point()
If I google the question, I only find information on how to calculate the mean alone, not on how to set up a new table with the means for the different variables.
Is there something like
calc_mean_by_group ??
You would help me a lot :)
Simple base-R solution for calculating the means:
tapply(X = foo$Aerobic, INDEX = foo$Sample, FUN = mean)
("foo" being the name of your data.frame)
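If a new table with both the mean and the standard deviation per group is wanted, one option is aggregate() from base R. This is only a sketch: the data below are made up to mimic the columns in the question, and summary_df is a hypothetical name.

```r
# Small made-up data set with the same columns as in the question.
foo <- data.frame(
  Sample  = rep(c("V1.1.K", "V1.14.K"), each = 3),
  Day     = rep(c(1, 14), each = 3),
  Aerobic = c(0.6, 0.7, 0.8, 0.03, 0.04, 0.05)
)

# Mean and standard deviation of the triplicates per Sample and Day.
means <- aggregate(Aerobic ~ Sample + Day, data = foo, FUN = mean)
sds   <- aggregate(Aerobic ~ Sample + Day, data = foo, FUN = sd)
names(means)[3] <- "Mean"
names(sds)[3]   <- "SD"

# One row per Sample/Day combination with both statistics.
summary_df <- merge(means, sds, by = c("Sample", "Day"))
```

The resulting table can be handed to ggplot2, for example ggplot(summary_df, aes(Day, Mean)) + geom_point() + geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD)).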

How to plot profiles in R with ggplot2

I have a large data set with protein IDs and corresponding abundance profiles across a number of gel fractions. I want to plot these profiles of abundances across the fractions.
The data looks like this
IDs<- c("prot1", "prot2", "prot3", "prot4")
fraction1 <- c(3,4,2,4)
fraction2<- c(1,2,4,1)
fraction3<- c(6,4,6,2)
plotdata<-data.frame(IDs, fraction1, fraction2, fraction3)
> plotdata
IDs fraction1 fraction2 fraction3
1 prot1 3 1 6
2 prot2 4 2 4
3 prot3 2 4 6
4 prot4 4 1 2
I want it to look like this:
Every protein has a profile. Every fraction has a corresponding abundance value per protein. I want to have multiple proteins per plot.
I tried figuring out ggplot2 using the cheat sheet and failed. I don't know what the input df should look like and what method I should use to get these profiles.
I would use excel, but a bug draws the wrong profile of my data depending on order of data, so I can't trust it to do what I want.
First, you'll have to reorganize your data.frame into long format for ggplot2. You can do it in one step with reshape2::melt, which also lets you change the 'variable' and 'value' column names.
library(reshape2)
library(dplyr)
library(ggplot2)
data2 <- melt(plotdata, id.vars = "IDs")
Then, we'll group the data by protein:
data2 <- group_by(data2, IDs)
Finally, you can plot it quite simply:
ggplot(data2) +
geom_line(aes(variable, value, group = IDs,
color = IDs))
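To show what the long input format looks like without any extra packages, here is a base-R sketch of the same reshaping. The column names fraction and abundance are chosen for illustration and are not what melt() produces by default.

```r
plotdata <- data.frame(
  IDs       = c("prot1", "prot2", "prot3", "prot4"),
  fraction1 = c(3, 4, 2, 4),
  fraction2 = c(1, 2, 4, 1),
  fraction3 = c(6, 4, 6, 2)
)

# All columns except the ID column hold abundance values.
fraction_cols <- setdiff(names(plotdata), "IDs")

# Long format: one row per protein and fraction.
long <- data.frame(
  IDs       = rep(plotdata$IDs, times = length(fraction_cols)),
  fraction  = rep(fraction_cols, each = nrow(plotdata)),
  abundance = unlist(plotdata[fraction_cols], use.names = FALSE)
)
```

Each protein now has three rows, one per fraction, which is the shape geom_line() needs when group and color are both mapped to IDs.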

Creating stacked barplots in R using different variables

I am a novice R user, hence the question. I refer to the solution on creating stacked barplots from R programming: creating a stacked bar graph, with variable colors for each stacked bar.
My issue is slightly different. I have data with 4 columns. The last column is the summed total of the first 3 columns. I want to plot a bar chart where 1) the height of each bar is the summed total (i.e. the 4th column) and 2) each bar is split by the relative contributions of the three columns.
I was hoping someone could help.
Regards,
Bernard
If I understood correctly, this may do the trick.
The following code works for this example df data frame:
df <- read.table(text = "a b c sum
1 9 8 18
3 6 2 11
1 5 4 10
23 4 5 32
5 12 3 20
2 24 1 27
1 2 4 7", header = TRUE)
Since you don't want to plot counts of observations but the actual values in your data frame, you need to use geom_bar(stat="identity") in ggplot2. Some data manipulation is necessary too. You don't need the sum column; ggplot stacks the values for you.
library(reshape2)  # melt()
library(ggplot2)
df <- df[,-ncol(df)] # drop the last column (assumed to be the sum)
df$event <- seq.int(nrow(df)) # row index, so that values from the same row end up in the same bar
df <- melt(df, id='event') # reshape to long format so ggplot can read it
px = ggplot(df, aes(x = event, y = value, fill = variable)) + geom_bar(stat = "identity")
print(px)
This code generates the stacked bar plot (image not shown).

Why does stacked bar plot change when add facet in r ggplot2

I have data that are in several groups, and I want to display them in a faceted stacked bar chart. The data show responses to a survey question. When I look at them in the dataframe, they make sense, and when I plot them (without faceting), they make sense.
However the data appear to change when they are faceted. I have never had this problem before. I was able to re-create a change (not the exact same change) with some dummy data.
myDF <- data.frame(rep(c('aa','ab','ac'), each = 9),
rep(c('x','y','z'),times = 9),
rep(c("yes", "no", "maybe"), each=3, times=3),
sample(50:200, 27, replace=FALSE))
colnames(myDF) <- c('place','program','response','number')
library(dplyr)
myDF2 <- myDF %>%
group_by(place,program) %>%
mutate(pct=(100*number)/sum(number))
The data in myDF are basically a count of responses to a question. The myDF2 only creates a percent of respondents with any particular response within each place and program.
library(ggplot2)
my.plot <-ggplot(myDF2,
aes(x=place, y=pct)) +
geom_bar(aes(fill=myDF$response),stat="identity")
my.plot.facet <-ggplot(myDF2,
aes(x=place, y=pct)) +
geom_bar(aes(fill=myDF$response),stat="identity")+
facet_wrap(~program)
I am hoping to see a plot that shows the proper "pct" for each "response" within each "program" and "place". However, my.plot.facet shows only one "response" per place.
The data are not like that. For example, head(myDF2) shows that program 'x' in place 'aa' has both yes and no.
> head(myDF2)
Source: local data frame [6 x 5]
Groups: place, program
place program response number pct
1 aa x yes 69 18.35106
2 aa y yes 95 25.81522
3 aa z yes 192 41.64859
4 aa x no 129 34.30851
5 aa y no 188 51.08696
6 aa z no 162 35.14100
It turns out that ORDER matters here. myDF2 is no longer a plain data frame; it is a grouped dplyr object (note the "Groups: place, program" line in the output above), and ggplot2 seems to struggle with that.
If the data need to be faceted by program, 'program' needs to come first in the group_by() call.
You can see this by faceting the other way round, by place:
my.plot.facet2 <-ggplot(myDF2,
aes(x=program, y=pct)) +
geom_bar(aes(fill=myDF2$response),stat="identity")+
facet_wrap(~place)
This produces the plot faceted by place (image not shown).
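Another factor that may be at play, stated here as an assumption rather than a confirmed diagnosis: the plots above map fill to myDF$response, a vector that lives outside the data being plotted. When the data are split into facets, such an external vector is not split along with them, so the fill values can end up misaligned. A minimal sketch that refers to the column by name instead (the set.seed() call and the base-R ave() computation are additions for reproducibility, not part of the original code):

```r
set.seed(1)  # only so that the made-up numbers are reproducible
myDF <- data.frame(
  place    = rep(c("aa", "ab", "ac"), each = 9),
  program  = rep(c("x", "y", "z"), times = 9),
  response = rep(c("yes", "no", "maybe"), each = 3, times = 3),
  number   = sample(50:200, 27, replace = FALSE)
)

# Percentage of each response within a place/program combination.
myDF$pct <- 100 * myDF$number /
  ave(myDF$number, myDF$place, myDF$program, FUN = sum)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # fill refers to the column by name, so it is subset together with each facet.
  my.plot.facet <- ggplot(myDF, aes(x = place, y = pct, fill = response)) +
    geom_bar(stat = "identity") +
    facet_wrap(~program)
}
```

With fill = response inside aes(), each panel should show all three responses for every place, regardless of the order used in group_by().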

Create a barplot of two tables of differing length

I cannot figure out how to get a nice barplot that contains the data from two tables with different numbers of columns.
The tables in question are something like (snipped some data from the end):
> tab1
1 2 3 6 8 31
5872 1525 831 521 299 4
> tab2
1 2 3 4 22
7874 422 2 5 1
Note the column names and sizes are different. When I just do barplot() on one of these tables it comes out with the plot I'd like (showing the column names as the X-axis, frequencies on Y-axis). But, I would like these two side by side.
I've gotten as far as creating a data frame containing both variables as columns and the different row names in the first column (with data.frame() and merge()), but when I plot this the X-axis seems to be all wrong. Attempting to reorder the columns gives me an error about differing lengths.
Code:
combined <- merge(data.frame(tab1), data.frame(tab2), by = c('Var1'), all=T)
barplot(t(combined[,2:3]), names.arg = combined[,1], beside=T)
This shows a plot, but not all labels are present and the value for position 26 is plotted after 33.
Is there any simple way to get this plot working? A ggplot2 solution would be nice.
You can put all your data in one long data frame, as in this example:
df<-data.frame(group=rep(c("A","B"),times=c(2,3)),
values=c(23,56,345,6,7),xval=c(1,2,1,2,8))
group values xval
1 A 23 1
2 A 56 2
3 B 345 1
4 B 6 2
5 B 7 8
Then ggplot() with geom_bar() can be used to plot the data.
ggplot(df,aes(xval,values,fill=group))+
geom_bar(stat="identity",position="dodge")
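Getting from two tables of differing length to such a data frame could look like the sketch below. The two small tables are made up for illustration (they are not the counts from the question), and the column names simply follow the example above.

```r
# Two made-up frequency tables with different sets of categories.
tab1 <- table(c(rep(1, 5), rep(2, 3), 6))
tab2 <- table(c(rep(1, 7), 4, 22))

# Turn each table into a data frame and remember where it came from.
d1 <- as.data.frame(tab1, stringsAsFactors = FALSE)
d1$group <- "tab1"
d2 <- as.data.frame(tab2, stringsAsFactors = FALSE)
d2$group <- "tab2"

# Stack them: one row per table and category.
combined <- rbind(d1, d2)
names(combined) <- c("xval", "values", "group")

# Order the categories numerically rather than alphabetically.
combined$xval <- factor(combined$xval,
                        levels = sort(unique(as.numeric(combined$xval))))
```

The combined data frame can then be plotted with the same ggplot(combined, aes(xval, values, fill = group)) + geom_bar(stat = "identity", position = "dodge") call as above; categories missing from one table simply have no bar for that group.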
