I need to build a barplot of my data, showing bacterial relative abundance in different samples (each column should sum to 1 in the complete dataset).
A subset of my data:
> mydata
Taxon CD6 CD1 CD12
Actinomycetaceae;g__Actinomyces 0.031960309 0.066683743 0.045638509
Coriobacteriaceae;g__Atopobium 0.018691589 0.003244536 0.00447774
Corynebacteriaceae;g__Corynebacterium 0.001846083 0.006403689 0.000516662
Micrococcaceae;g__Rothia 0.001730703 0.000426913 0.001894429
Porphyromonadaceae;g__Porphyromonas 0.073497173 0.065915301 0.175406872
What I'd like to have is a bar for each sample (CD6, CD1, CD12), where the y values are the relative abundance of bacterial species (the Taxon column).
I think (but I'm not sure) my data format is not right to do the plot, since I don't have a variable to group by like in the examples I found...
ggplot(data) + geom_bar(aes(x=revision, y=added), stat="identity", fill="white", colour="black")
Is there a way to order my data making them right as input to this code?
Or how can I modify it?
Thanks!
Do you want something like this?
# sample data
df <- read.table(header=T, sep=" ", text="
Taxon CD6 CD1 CD12
Actinomycetaceae;g__Actinomyces 0.031960309 0.066683743 0.045638509
Coriobacteriaceae;g__Atopobium 0.018691589 0.003244536 0.00447774
Corynebacteriaceae;g__Corynebacterium 0.001846083 0.006403689 0.000516662
Micrococcaceae;g__Rothia 0.001730703 0.000426913 0.001894429
Porphyromonadaceae;g__Porphyromonas 0.073497173 0.065915301 0.175406872")
# convert wide data format to long format
require(reshape2)
df.long <- melt(df, id.vars="Taxon",
measure.vars=grep("CD\\d+", names(df), val=T),
variable.name="sample",
value.name="value")
# calculate proportions
require(plyr)
df.long <- ddply(df.long, .(sample), transform, value=value/sum(value))
# order samples by id
df.long$sample <- reorder(df.long$sample, as.numeric(sub("CD", "", df.long$sample)))
# plot using ggplot
require(ggplot2)
ggplot(df.long, aes(x=sample, y=value, fill=Taxon)) +
geom_bar(stat="identity") +
scale_fill_manual(values=scales::hue_pal(h = c(0, 360) + 15, # add manual colors
c = 100,
l = 65,
h.start = 0,
direction = 1)(length(unique(df$Taxon)))) # unique() also works when Taxon is a character column (the default since R 4.0)
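For what it's worth, the reshaping and rescaling steps can also be done in base R, without reshape2 or plyr. This is only a sketch on the subset from the question; the names sample_cols, st, long and sums are mine, not from the answer above.

```r
# Subset from the question (wide format: one column per sample)
df <- data.frame(
  Taxon = c("Actinomycetaceae;g__Actinomyces",
            "Coriobacteriaceae;g__Atopobium",
            "Corynebacteriaceae;g__Corynebacterium",
            "Micrococcaceae;g__Rothia",
            "Porphyromonadaceae;g__Porphyromonas"),
  CD6  = c(0.031960309, 0.018691589, 0.001846083, 0.001730703, 0.073497173),
  CD1  = c(0.066683743, 0.003244536, 0.006403689, 0.000426913, 0.065915301),
  CD12 = c(0.045638509, 0.004477740, 0.000516662, 0.001894429, 0.175406872)
)

sample_cols <- setdiff(names(df), "Taxon")

# Wide -> long: stack() puts all values in one column ("values")
# and the originating column name in another ("ind")
st <- stack(df[sample_cols])
long <- data.frame(
  Taxon  = rep(df$Taxon, times = length(sample_cols)),
  sample = st$ind,
  value  = st$values
)

# Rescale within each sample so the bars of this subset reach 1
long$value <- ave(long$value, long$sample, FUN = function(v) v / sum(v))

# Sanity check: every sample should now sum to 1
sums <- tapply(long$value, long$sample, sum)
print(sums)
```

On the complete dataset, where each column already sums to 1, the rescaling step changes nothing and can be dropped. The resulting long data frame can be passed to the same ggplot call as df.long.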
I'm a beginner in r and I've been trying to find how I can plot this graphic.
I have 4 variables (% of gravel, % of sand, % of silt in five places). I'm trying to plot the percentages of these 3 types of sediment (y) in each station (x). So it's five groups in x axis and 3 bars per group.
Station % gravel % sand % silt
1 PRA1 28.430000 70.06000 1.507000
2 PRA3 19.515000 78.07667 2.406000
3 PRA4 19.771000 78.63333 1.598333
4 PRB1 7.010667 91.38333 1.607333
5 PRB2 18.613333 79.62000 1.762000
I tried plotting a grouped barchart with
grao <- read_excel("~/Desktop/Masters/Data/grao.xlsx")
colors <- c('#999999','#E69F00','#56B4E9','#94A813','#718200')
barplot(table(grao$Station, grao$`% gravel`, grao$`% sand`, grao$`% silt`), beside = TRUE, col = colors)
But this error message keeps happening:
'height' must be a vector or matrix
I also tried
ggplot(grao, aes(Station, color=as.factor(`% gravel`), shape=as.factor(`% sand`))) +
geom_bar() + scale_color_manual(values=c('#999999','#E69F00','#56B4E9','#94A813','#718200')+ theme(legend.position="top")
But it's creating a crazy graphic.
Could someone help me, please? I've been stuck for weeks now in this one.
Cheers
I think this may be what you are looking for:
#install.packages("tidyverse")
library(tidyverse)
df <- data.frame(
station = c("PRA1", "PRA3", "PRA4", "PRB1", "PRB2"),
gravel = c(28.4, 19.5, 19.7, 7.01, 18.6),
sand = c(70.06, 78.07, 78.63, 91, 79),
silt = c(1.5, 2.4, 1.6, 1.7, 1.66)
)
df2 <- df %>%
pivot_longer(cols = c("gravel", "sand", "silt"), names_to = "Sediment_Type", values_to = "Percentage")
ggplot(df2) +
geom_bar(aes(x = station, y = Percentage, fill = Sediment_Type ), stat = "identity", position = "dodge") +
theme_minimal() # theme_minimal() ships with ggplot2 itself
You need to "pivot" your data set "longer". Part of the tidy way is ensuring that each column represents a single variable. In your initial dataframe, the column names gravel, sand and silt are really values of one variable ("Sediment_Type"), and the cells underneath hold the values of another ("Percentage"). The function pivot_longer() gathers those columns and turns them into just two: one holding the former column names and one holding the values.
Once you've done this, ggplot lets you specify your x axis and then a grouping variable via "fill". You can swap these two. If you end up with lots of data and grouping variables, faceting is also an option worth looking into!
Hope this helps,
Brennan
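To make the last two points a bit more concrete, here is a rough sketch of the swapped and faceted variants, assuming tidyr and ggplot2 are installed. The object names p_swapped and p_facet are just placeholders I picked.

```r
library(tidyr)
library(ggplot2)

df <- data.frame(
  station = c("PRA1", "PRA3", "PRA4", "PRB1", "PRB2"),
  gravel  = c(28.4, 19.5, 19.7, 7.01, 18.6),
  sand    = c(70.06, 78.07, 78.63, 91, 79),
  silt    = c(1.5, 2.4, 1.6, 1.7, 1.66)
)

# Wide -> long: three sediment columns become a name column and a value column
df2 <- pivot_longer(df, cols = c("gravel", "sand", "silt"),
                    names_to = "Sediment_Type", values_to = "Percentage")

# 5 stations x 3 sediment types gives 15 rows and 3 columns
dim(df2)

# x and fill swapped: sediment types on the x axis, one bar per station
p_swapped <- ggplot(df2, aes(x = Sediment_Type, y = Percentage, fill = station)) +
  geom_col(position = "dodge")

# Faceting: one small panel per station
p_facet <- ggplot(df2, aes(x = Sediment_Type, y = Percentage)) +
  geom_col() +
  facet_wrap(~ station)

print(p_swapped)
print(p_facet)
```

geom_col() is used here as shorthand for geom_bar(stat = "identity"); both draw the bar heights straight from the data.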
barplot wants a "matrix", ideally with both dimension names. You could transform your data like this (remove first column while using it for row names):
dat <- `rownames<-`(as.matrix(grao[,-1]), grao[,1])
You will see that barplot already does the tabulation for you. However, you could also use xtabs (table might not be the right function for your approach).
# dat <- xtabs(cbind(X..gravel, X..sand, X..silt) ~ Station, grao) ## alternatively
I would advise you to use proper variable names, since special characters are not the best idea.
colnames(dat) <- c("gravel", "sand", "silt")
dat
# gravel sand silt
# PRA1 28.430000 70.06000 1.507000
# PRA3 19.515000 78.07667 2.406000
# PRA4 19.771000 78.63333 1.598333
# PRB1 7.010667 91.38333 1.607333
# PRB2 18.613333 79.62000 1.762000
Then barplot knows what's going on.
.col <- c('#E69F00','#56B4E9','#94A813') ## pre-define colors
barplot(t(dat), beside=T, col=.col, ylim=c(0, 100), ## barplot
main="Here could be your title", xlab="sample", ylab="perc.")
legend("topleft", colnames(dat), pch=15, col=.col, cex=.9, horiz=T, bty="n") ## legend
box() ## put it in a box
Data:
grao <- read.table(text=" Station '% gravel' '% sand' '% silt'
1 PRA1 28.430000 70.06000 1.507000
2 PRA3 19.515000 78.07667 2.406000
3 PRA4 19.771000 78.63333 1.598333
4 PRB1 7.010667 91.38333 1.607333
5 PRB2 18.613333 79.62000 1.762000 ", header=TRUE)
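As a side note on why the original call failed: table() applied to four columns builds a four-dimensional contingency table of counts, which is neither a vector nor a matrix, hence the 'height' error. The sketch below only illustrates how the orientation of the matrix decides the grouping; the names stations, m and mids are mine.

```r
stations <- c("PRA1", "PRA3", "PRA4", "PRB1", "PRB2")

# Values from the question, one row per station
dat <- cbind(
  gravel = c(28.430000, 19.515000, 19.771000, 7.010667, 18.613333),
  sand   = c(70.06000, 78.07667, 78.63333, 91.38333, 79.62000),
  silt   = c(1.507000, 2.406000, 1.598333, 1.607333, 1.762000)
)
rownames(dat) <- stations

# With beside = TRUE every column of the matrix becomes one group of bars,
# so transpose to get one group per station
m <- t(dat)

# barplot() invisibly returns the bar midpoints, here one per bar
mids <- barplot(m, beside = TRUE, ylim = c(0, 100),
                legend.text = rownames(m))

dim(m)     # 3 sediment types x 5 stations
dim(mids)  # same shape: 3 bars in each of 5 groups
```

Passing dat without t() would instead give three groups (gravel, sand, silt) with five bars each.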
user_a - 3
user_b - 4
user_c - 1
user_d - 4
I want to show the distribution of the number of tweets per author in R using a histogram. The original file has 1048575 such rows.
I did hist(df$twitter_count, nrow(df)) but I don't think it's correct.
It seems I have misunderstood the question. I think following could be
what the OP is looking for.
library(ggplot2)
df <- data.frame(user = letters,
twitter_count = sample.int(200, 26))
ggplot(df, aes(user, twitter_count)) +
geom_col()
Assuming you are looking for multiple histograms.
Replace user with respective variable name in your data.frame.
# Example data
df <- data.frame(user = iris$Species,
twitter_count= round(iris[, 1]*10))
# Histograms using ggplot2 package
library(ggplot2)
ggplot(df, aes(x = twitter_count)) +
geom_histogram() + facet_grid(.~user)
Best to use an alternative method to see the distributions of twitter counts if your data contain many twitter users.
If each row of the data.frame represents a user -
set.seed(1)
df <- data.frame(user = letters, twitter_count = rpois(26, lambda = 4) + 1)
hist(df$twitter_count)
Since you said distribution for 'each user', I think it should be a bar plot:
require(data.table)
dat <- fread("
user_a - 3
user_b - 4
user_c - 1
user_d - 4"
)
barplot( names.arg = dat$V1, as.numeric(dat$V3) )
or if you are looking for histograms, then:
hist(as.numeric(dat$V3), xlab = "", main="Histogram")
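One more remark on the original attempt: the second argument of hist() is breaks, so hist(df$twitter_count, nrow(df)) asks for roughly as many bins as there are rows. For integer counts, one bin per value is usually easier to read. A small sketch on simulated data (the data frame below is a made-up stand-in, not the real file; count_freq, brks and h are my own names):

```r
set.seed(1)

# Simulated stand-in: one row per author with that author's tweet count
df <- data.frame(user = paste0("user_", 1:1000),
                 twitter_count = rpois(1000, lambda = 4) + 1)

# How many authors have 1 tweet, 2 tweets, ...
count_freq <- table(df$twitter_count)

# One bin per integer value: put the breaks at the half-integers
brks <- seq(min(df$twitter_count) - 0.5, max(df$twitter_count) + 0.5, by = 1)
h <- hist(df$twitter_count, breaks = brks,
          xlab = "Tweets per author", main = "Distribution of tweet counts")
```

With heavily skewed real data, plotting count_freq with barplot() or using a log scale might be more readable, but that depends on what the counts look like.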
I have a dataframe that contains time (H:M:S), thetaX (degrees), thetaY (degrees), and thetaZ (degrees). I want to plot the degrees vs time using ggplot as mentioned here.
This is the original state of my dataframe:
> head(df)
time thetaX thetaY thetaZ
1 08:27:27 0.01539380 -0.001609785 -0.03271715
2 08:27:27 0.03079389 -0.003863202 -0.06512209
3 08:27:27 0.04588598 -0.006668402 -0.09720450
4 08:27:28 0.06008822 -0.008774166 -0.12872514
5 08:27:28 0.07400642 -0.008951306 -0.15985775
6 08:27:28 0.08823425 -0.012280650 -0.19023676
I run these lines to plot each column of df over time:
df = data.frame(time, thetaX,thetaY,thetaZ)
> df.m = melt(df,id="time")
> ggplot(data = df.m, aes(x = x, y = value)) + geom_point() + facet_grid(variable ~ .)
But, this is what comes out:
Question: Why does my data start plotting from what looks like the tail end of my df (around 1 pm), then jump back to the beginning (around 8 am) and continue through the rest?
I have imported data in this form:
Sample1 Sample2 Identity
1 2 chr11-50-T
3 4 chr11-200-A
v <- read.table("myfile", header = TRUE)
I have a vector that looks like this:
x <- c(50,100)
And without some other aesthetic stuff I am plotting column 1 vs column 2 labeled with column 3.
p <- ggplot(v, aes(x=sample1, y=sample2, alpha=0.5, label=identity)) +
geom_point() +
geom_text_repel(aes(label=ifelse(sample2>0.007 |sample1>0.007 ,as.character(identity),''))) +
I would like to somehow indicate those points that contain a number in their ID, found within the vector x. I was thinking this could be done with color, but it doesn't really matter to me as long as there is a difference between the two types of points.
So for instance if the points containing a number in x were to be colored red, the first point would be red because it has 50 in the ID and the second point would not be, because 200 is not a value in x.
You could add in a TRUE/FALSE value as a column and use that as a color. I had to remove your label = ... aes since that's not an aes in ggplot2. Also everything is transparent because you use aes(alpha = 0.5):
library(ggrepel)
library(ggplot2)
# vafs is the data frame read from the question's file; x is the vector c(50, 100)
vafs$col <- grepl(paste0(x,collapse = "|"), vafs$Identity)
p <- ggplot(vafs, aes(x=Sample1, y=Sample2, alpha=0.5, color = col)) +
geom_point() +
geom_text_repel(aes(label=ifelse(Sample2>0.007 |Sample1>0.007 ,as.character(Identity),'')))
I came up with the following solution:
vafs<-read.table(text="Sample1 Sample2 Identity
1 2 chr11-50-T
3 4 chr11-200-A", header=T)
vec <- c(50,100)
vafs$vec<- sapply(vafs$Identity, FUN=function(x)
ifelse(length(grep(pattern=paste(vec,collapse="|"), x))>0,1,0))
vafs$vec <- as.factor(vafs$vec)
ggplot(vafs, aes(x=Sample1, y=Sample2, label=Identity, col=vec),alpha=0.5)+geom_point()
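A caveat with the pattern-based approaches: a regular expression like "50|100" also matches longer numbers that merely contain those digits, such as 500. If the IDs always look like chr11-50-T, a stricter option might be to pull out the middle field and compare it with %in%. This is a sketch under that assumption; the third row and the names pos and in_x are hypothetical additions for illustration.

```r
vafs <- read.table(text = "Sample1 Sample2 Identity
1 2 chr11-50-T
3 4 chr11-200-A
5 6 chr11-500-G", header = TRUE)  # third row is made up to show the difference
x <- c(50, 100)

# Split each ID on "-" and keep the second piece (assumed to be the number)
pos <- as.numeric(sapply(strsplit(as.character(vafs$Identity), "-", fixed = TRUE),
                         `[`, 2))

# TRUE only when the number is exactly one of the values in x
vafs$in_x <- pos %in% x

# For comparison: the regex version also flags chr11-500-G
vafs$regex_hit <- grepl(paste0(x, collapse = "|"), vafs$Identity)

print(vafs)
```

The in_x column can then be mapped to color in aes() the same way as col or vec above.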
After successfully performing a cast (using the reshape package) on a small data set, I obtain the following data frame (e_disp), which is what I am looking for.
Date Code 200g
1 2010/06/01 cg4j 0.519880141
2 2010/09/19 7gv2 0.158999682
3 2011/04/14 zl94 0.294174203
4 2011/05/27 a13t 0.140232549
My problem is that I wish to create a barplot which has the values under the column 200g plotted in bars with the x-axis being the date and each bar having the code associated with value. (This could also be on the x-axis above or below the date)
My problem is that I get the following error
"Error in barplot.default(e_disp) : 'height' must be a vector or a matrix"
So my questions are
1) Can what I am trying to do be done after using 'cast'
2) If so any suggestions as to how to accomplish this
Any help would be appreciated
This is quite easily done with ggplot2. Here is an example
# generate dummy data
mydf = data.frame(date = 1:5, code = letters[1:5], value = rpois(5, 40))
# plot it using ggplot2
library(ggplot2)
pl = ggplot(mydf, aes(x = date, y = value)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = code), vjust = -1)
print(pl)
Is this what you are after:
dat <- read.table(textConnection("Date Code x200g
1 2010/06/01 cg4j 0.519880141
2 2010/09/19 7gv2 0.158999682
3 2011/04/14 zl94 0.294174203
4 2011/05/27 a13t 0.14023254"), header=TRUE, as.is=TRUE)
dat$Date <- as.Date(dat$Date)
Pasting the Date and Code columns separated by a linefeed ("\n") to make labels:
barplot(dat$x200g, names.arg=paste(dat$Date,"\n", dat$Code, sep=""), ylab=" ")