I have a set of data:
COL1 COL2
1 3.45
2 8.48
1 2.53
2 9.42
2 2.56
etc.
COL1 specifies a category, whereas COL2 is data. For each distinct value in COL1, I'd like to generate the mean, stddev, min and max values, so that in the end I have something like (not real numbers):
COL1VAL MEAN STDDEV
1 4.59 1.24
2 4.75 1.20
I'd also then like to generate a bar chart with error bars, with X axis being the COL1VAL and bar height being the mean.
Can one do this in R, and if so, how?
Here's how you could do those things using packages dplyr and ggplot2, assuming your data frame is called df.
library(dplyr)
dfsummary <- df %>%
group_by(COL1) %>%
summarise(mean = mean(COL2), sd = sd(COL2),
min = min(COL2), max = max(COL2)) # summarise_each() and funs() are deprecated in current dplyr
dfsummary
#Source: local data frame [2 x 5]
#
# COL1 mean sd min max
#1 1 2.99 0.6505382 2.53 3.45
#2 2 6.82 3.7190859 2.56 9.42
library(ggplot2)
ggplot(dfsummary, aes(x = factor(COL1), y = mean)) +
geom_bar(stat = "identity", fill = "lightblue") +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd))
If you prefer to stay in base R, you could use tapply and arrows:
head(chickwts, 15) # chicken growth depending on feed
means <- tapply(X=chickwts$weight, INDEX=chickwts$feed, FUN=mean)
sds <- tapply(X=chickwts$weight, INDEX=chickwts$feed, FUN=sd )
or <- order(means)
bp <- barplot(means[or], ylim=c(0, 390), las=2)
arrows(x0=bp, y0=(means+sds)[or], y1=(means-sds)[or],
code=3, angle=90, length=0.1)
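The same tapply/arrows idea can be applied directly to the COL1/COL2 layout from the question. The following is only a sketch: the data frame name df and its five rows are copied from the question's sample, and aggregate() is swapped in for the two tapply calls so that all four statistics end up in one table.

```r
# The five sample rows shown in the question ('df' is an assumed name)
df <- data.frame(COL1 = c(1, 2, 1, 2, 2),
                 COL2 = c(3.45, 8.48, 2.53, 9.42, 2.56))

# aggregate() returns a matrix column; do.call(data.frame, ...) flattens it
# into the columns COL2.mean, COL2.sd, COL2.min and COL2.max
stats <- do.call(data.frame,
                 aggregate(COL2 ~ COL1, data = df,
                           FUN = function(v) c(mean = mean(v), sd = sd(v),
                                               min = min(v), max = max(v))))

# Bar heights are the means; barplot() returns the bar midpoints
upper <- stats$COL2.mean + stats$COL2.sd
lower <- stats$COL2.mean - stats$COL2.sd
bp <- barplot(stats$COL2.mean, names.arg = stats$COL1,
              ylim = c(0, max(upper) * 1.1))

# Error bars of one standard deviation above and below each mean
arrows(x0 = bp, y0 = lower, y1 = upper, code = 3, angle = 90, length = 0.1)
```

Swapping sd for min and max in the arrows() call would draw the range instead of one standard deviation.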
Regards,
Berry
I'm having a hard time binding rows of two random samples of 500 each to get one file with 1000 rows.
Then I'm trying to plot a histogram of this combined sample and geom_density().
For my bind_rows line, the error I get is
"Argument 1 must have names"
Does anyone have an idea what is wrong? Thank you,
x <- 1:500
rand1 <- rnorm(length(x), -1, 0.6)
rand2 <- rnorm(length(x), 1, 1.2)
combined <- bind_rows(rand1, rand2)
ggplot(combined, aes(x=y, y=..density..)) +
geom_histogram(fill = "red", alpha = 0.5, color="darkred") +
geom_density()
Change the offending line to:
combined <- data.frame(y= c(rand1, rand2))
Two issues prevented the original code from working: a) the vectors passed in were unnamed, and b) they were not packaged in a form that could be coerced to a data frame. The combined object could also have been a named list.
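If it matters later which sample each value came from, the one-line fix can carry a label column as well. This is a minimal sketch in base R; the column name source and the seed are assumptions made for illustration, not part of the original question.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
x <- 1:500
rand1 <- rnorm(length(x), -1, 0.6)
rand2 <- rnorm(length(x), 1, 1.2)

# One column with all 1000 values, one column recording the origin
combined <- data.frame(
  y      = c(rand1, rand2),
  source = rep(c("rand1", "rand2"), each = length(x))
)

str(combined)
```

With this layout, mapping fill = source in ggplot would give one histogram per sample.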
To bind your two samples, you need to convert them to data frames, and their column names need to match.
Something like this should work:
library(dplyr)
combined <- bind_rows(data.frame(x =rand1),
data.frame(x =rand2))
x
1 -1.1979747
2 -0.7819008
3 -2.0965976
4 -0.4637334
5 -1.4314750
6 -0.4356943
However, you can't differentiate rand1 and rand2 anymore.
So, an alternative solution is to start by binding your two random samples as columns and then pivot the data frame into a longer format using pivot_longer from the tidyr package:
df <- data.frame(rand1, rand2)
library(tidyverse)
df <- df %>% pivot_longer(everything(), names_to = "rand", values_to = "value")
rand value
<chr> <dbl>
1 rand1 -1.20
2 rand2 2.45
3 rand1 -0.782
4 rand2 1.35
5 rand1 -2.10
6 rand2 1.98
7 rand1 -0.464
8 rand2 0.733
9 rand1 -1.43
10 rand2 2.72
# … with 990 more rows
For plotting the histogram and density, I used stat(ndensity) and ..scaled.. so that both random samples are scaled to a maximum of 1:
library(ggplot2)
ggplot(df, aes(x = value, fill = rand))+
geom_density(aes(y = ..scaled..), alpha = 0.4)+
geom_histogram(aes(x = value, stat(ndensity)), color = "black", alpha =0.2)
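To make "scaled to a maximum of 1" concrete, the same rescaling can be sketched by hand in base R. This is an illustration under assumptions: the seed is arbitrary, and hist() picks its own default bins, which will not match the bins geom_histogram chooses.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
rand1 <- rnorm(500, -1, 0.6)

# Kernel density estimate, divided by its peak so the curve tops out at 1
d <- density(rand1)
scaled <- d$y / max(d$y)

# Histogram bin densities, divided by the tallest bin so it also tops out at 1
h <- hist(rand1, plot = FALSE)
ndensity <- h$density / max(h$density)

range(scaled)
range(ndensity)
```

Because both curves are divided by their own maximum, samples with very different spreads can share one y axis.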
I have a question on how to plot my data using a boxplot and integrating 3 different information types. In particular, I have a data frame that looks like this:
Exp_number Condition Cell_Type Gene1 Gene2 Gene3
1 2 Cancer 0.33 0.2 1.2
1 2 Cancer 0.12 1.12 2.5
1 4 Fibro 3.4 2.2 0.8
2 4 Cancer 0.12 0.4 0.11
2 4 Normal 0.001 0.01 0.001
3 1 Cancer 0.22 1.2 3.2
2 1 Normal 0.001 0.00003 0.00045
for a total of 20,000 columns and 110 rows (rows are samples).
I would like to plot a boxplot in which data are grouped first by a condition. Then, in each condition, I would like to highlight, for example using different colors, the exp_number and finally, I don't know how but I would like to highlight the cell type. The aim is to highlight the differences between exp_number between conditions in terms of gene expression and also differences of cell types between Exp_numbers.
Is there a simple way to integrate all this information in a single plot?
Thank you in advance
What about this approach?
dat <- data.frame(Exp_number=factor(sample(1:3,100,replace = T)),
condition=factor(sample(1:4,100,T)),
Cell_type=factor(sample(c("Normal", "Cancer", "Fibro"), 100, replace=T)),
Gene1=abs(rnorm(100, 5, 1)),
Gene2=abs(rnorm(100, 6, 0.5)),
Gene3=abs(rnorm(100, 4, 3)))
library(reshape2)
library(ggplot2)
dat2 <- melt(dat, id=c("Exp_number", "condition", "Cell_type"))
ggplot(dat2, aes(x=Exp_number, y=value, col=Cell_type)) +
geom_boxplot() +
facet_grid(~ condition) +
theme_bw() +
ylab("Expression")
That gives the following result
Similar to @storaged's answer, but leveraging the two dimensions of facet_grid to represent 2 of your variables:
ggplot(dat2, aes(x=Cell_type, y=Expression)) +
geom_boxplot() +
facet_grid(Exp_number ~ condition) +
theme_bw()
The data:
library(reshape2)
dat <- data.frame(Exp_number=factor(sample(1:3,100,replace = T)),
condition=factor(sample(1:4,100,T)),
Cell_type=factor(sample(c("Normal", "Cancer", "Fibro"), 100, replace=T)),
Gene1=abs(rnorm(100, 5, 1)),
Gene2=abs(rnorm(100, 6, 0.5)),
Gene3=abs(rnorm(100, 4, 3)))
dat2 <- melt(dat, id=c("Exp_number", "condition", "Cell_type"), value.name = 'Expression')
dat2$Exp_number <- paste('Exp.', dat2$Exp_number)
dat2$condition <- paste('Condition', dat2$condition)
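With around 20,000 gene columns, as in the question, melting everything at once gives a very long table, so it may be more practical to reshape only the genes you want to plot. The following base R sketch assumes a hypothetical vector genes of column names; the simulated dat mirrors the one above, and the seed is arbitrary.

```r
set.seed(123)                     # arbitrary seed, only for reproducibility
dat <- data.frame(Exp_number = factor(sample(1:3, 100, replace = TRUE)),
                  condition  = factor(sample(1:4, 100, replace = TRUE)),
                  Cell_type  = factor(sample(c("Normal", "Cancer", "Fibro"),
                                             100, replace = TRUE)),
                  Gene1 = abs(rnorm(100, 5, 1)),
                  Gene2 = abs(rnorm(100, 6, 0.5)),
                  Gene3 = abs(rnorm(100, 4, 3)))

id_cols <- c("Exp_number", "condition", "Cell_type")
genes   <- c("Gene1", "Gene2")    # hypothetical selection of genes of interest

# stack() puts the chosen gene columns under each other (values, ind);
# the id columns are repeated once per selected gene to line up with them
long <- cbind(dat[rep(seq_len(nrow(dat)), times = length(genes)), id_cols],
              stack(dat[genes]))
names(long)[names(long) == "values"] <- "Expression"
names(long)[names(long) == "ind"]    <- "Gene"

head(long)
```

The resulting long table has the same shape as dat2, so either of the ggplot calls above should accept it after adjusting the column names.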
I have multiple .csv files, and every one of them has a column (called Data) that I want to compare across files. But first, I have to group the values in a column of each file. In the end I want one graph with a differently colored line per file, each showing the mean value of every group. I describe below the process I use to get the graph I want. This works for a single file, but I don't know how to add lines for multiple files to one graph using ggplot.
This is what I got so far:
data = read.csv(file="my01data.csv",header=FALSE, sep=",")
A single .csv file looks like the following, but without the header line:
ID Data Range
1,63,5.01
2,61,5.02
3,65,5.00
4,62,4.99
5,62,4.98
6,64,5.01
7,71,4.90
8,72,4.93
9,82,4.89
10,82,4.80
11,83,4.82
10,85,4.79
11,81,4.80
After getting the data I group it with the following lines:
data["Group"] <- NA
data[(data$Range>4.95), "Group"] <- 5.0
data[(data$Range>4.85 & data$Range<4.95), "Group"] <- 4.9
data[(data$Range>4.75 & data$Range<4.85), "Group"] <- 4.8
The final data looks like this:
myTable <- "ID Data Range Group
1 63 5.01 5.00
2 61 5.02 5.00
3 65 5.00 5.00
4 62 4.99 5.00
5 62 4.98 5.00
6 64 5.01 5.00
7 71 4.90 4.90
8 72 4.93 4.90
9 72 4.89 4.90
10 82 4.80 4.80
11 83 4.82 4.80
10 85 4.79 4.80
11 81 4.80 4.80"
myData <- read.table(text=myTable, header = TRUE)
To plot this dataframe I use the following lines:
( pplot <- ggplot(data=myData, aes(x=Group, y=Data))
+ stat_summary(fun.y = mean, geom = "line", color='red')
+ xlab("Group")
+ ylab("Data")
)
Which results in a graph like this:
I assume you have the names of your .csv-files stored in a vector named file_names. Then you can run the following code and should get a different line for each file:
library(ggplot2)
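# with header=FALSE the columns would be called V1, V2, V3, so name them explicitly
data_list <- lapply(file_names, read.csv, header=FALSE, sep=",",
col.names=c("ID", "Data", "Range"))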
data_list <- lapply(seq_along(data_list), function(i){
df <- data_list[[i]]
df$Group <- round(df$Range, 1)
df$DataNumber <- i
df
})
finalTable <- do.call(rbind, data_list)
finalTable$DataNumber <- factor(finalTable$DataNumber)
ggplot(finalTable, aes(x=Group, y=Data, group = DataNumber, color = DataNumber)) +
stat_summary(fun.y = mean, geom = "line") +
xlab("Group") +
ylab("Data")
How it works
First the different datasets are read with read.csv into a list data_list. Then each data.frame in that list is assigned a Group.
I used round here with digits = 1, which means it rounds to one decimal place (I figured that's what you are doing).
Then also a unique number (in this case simply the index of the list) is assigned to each data.frame. After that the list is combined to one data.frame with rbind and then DataNumber is turned into a factor (prettier for plotting). Finally I added DataNumber as a group and color variable to the plot.
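The steps described here can be tried without any files on disk. In the sketch below, file1 and file2 are made-up stand-ins for what read.csv would return, and aggregate() is used only to show the per-file group means that stat_summary would draw.

```r
# Two small made-up data frames standing in for the contents of two files
file1 <- data.frame(ID = 1:4, Data = c(63, 61, 71, 82),
                    Range = c(5.01, 5.02, 4.90, 4.80))
file2 <- data.frame(ID = 1:4, Data = c(60, 66, 75, 80),
                    Range = c(4.99, 5.00, 4.93, 4.82))
data_list <- list(file1, file2)

# Assign a Group (Range rounded to one decimal) and a per-file number
data_list <- lapply(seq_along(data_list), function(i) {
  d <- data_list[[i]]
  d$Group <- round(d$Range, 1)
  d$DataNumber <- i
  d
})

# Stack everything into one data frame and mark the file number as a factor
finalTable <- do.call(rbind, data_list)
finalTable$DataNumber <- factor(finalTable$DataNumber)

# Mean of Data per Group and file: the values each line would pass through
group_means <- aggregate(Data ~ Group + DataNumber, data = finalTable, FUN = mean)
group_means
```

Each row of group_means corresponds to one point on one of the colored lines.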
You can add another line by using stat_summary again; you can define the data and aes argument to any other dataset:
#some pseudo data for testing
my_other_data <- myData
my_other_data$Data <- my_other_data$Data * 0.5
pplot <- ggplot(data=myData, aes(x=Group, y=Data)) +
stat_summary(fun.y = mean, geom = "line", color='red') +
stat_summary(data=my_other_data, aes(x=Group, y=Data),
fun.y = mean, geom = "line", color='green') +
xlab("Group") +
ylab("Data")
pplot
Why not create a classifying column ("Class")?
myTable1$Class <- "table1"
myTable1
"ID Data Range Group Class
1 63 5.01 5.00 table1
2 61 5.02 5.00 table1
3 65 5.00 5.00 table1"
myTable2$Class <- "table2"
myTable2
"ID Data Range Group Class
1 63 5.01 5.00 table2
2 61 5.02 5.00 table2
3 65 5.00 5.00 table2"
And then merge the data frames:
dfBIND <- rbind(myTable1, myTable2)
So that you can plot with a grouping or coloring variable:
pplot <- ggplot(data=dfBIND, aes(x=Group, y=Data, group=Class, color=Class)) +
stat_summary(fun.y = mean, geom = "line") +
xlab("Group") +
ylab("Data")
I am new to R and ggplot2. Any help is much appreciated! I have here a data set, I am trying to graph
weight band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42
With this data set, I am trying to plot a bar graph of mean 1 and mean 2 side by side for each weight_band (1-4), with error bars running from min to max (1 and 2 respectively). A "." denotes missing data.
I have browsed through Stack Overflow and other websites, but haven't found the solution I am looking for.
The code I have is as follows:
sk1 <- read.csv(file="analysis.csv")
library(reshape2)
sk2 <- melt(sk1,id.vars = "Weight_band")
c <- ggplot(sk2, aes(x = Weight_band, y = value, fill = variable))
c + geom_bar(stat = "identity", position="dodge")
However, this method does not limit the graph to plotting only the means. Is there a way to do so? Furthermore, is there a way to apply min and max as error bars to their respective means? Thank you in advance. This would help me greatly in advancing my understanding of R's ggplot2 package.
This should get you close, we need to do a little more data cleaning and reshaping to make ggplot happy :)
library(reshape2)
library(ggplot2)
df <- read.table(text = "weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42", header = T)
sk2 <- melt(df,id.vars = "weight_band")
## Clean
sk2$group <- gsub(".*_(\\d)", "\\1", sk2$variable)
# new column used for color or fill aes and position dodging
sk2$variable <- gsub("_.*", "", sk2$variable)
# make these variables universal not group specific
## Reshape again
sk3 <- dcast(sk2, weight_band + group ~ variable)
# spread it back to kinda wide format
sk3 <- dplyr::mutate_if(sk3, is.character, as.numeric)
# convert every column to numeric if character now
# plot values seem a little wonky but the plot is coming together
ggplot(sk3, aes(x = as.factor(weight_band), y = mean, color = as.factor(group))) +
geom_bar(position = "dodge", stat = "identity") +
geom_errorbar(aes(ymax = max, ymin = min), position = "dodge")
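Part of the cleaning can be avoided by telling read.table that "." means missing, so every column arrives as numeric. The sketch below is a base R alternative to the melt/dcast round trip, not the method used above; it builds the same kind of table by stacking the group 1 and group 2 columns by hand.

```r
txt <- "weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42"

# na.strings = "." turns the placeholder dots into NA, keeping columns numeric
df <- read.table(text = txt, header = TRUE, na.strings = ".")

# One block of rows per group, with group-independent column names
sk3 <- rbind(
  data.frame(weight_band = df$weight_band, group = 1,
             mean = df$mean_1, min = df$min_1, max = df$max_1),
  data.frame(weight_band = df$weight_band, group = 2,
             mean = df$mean_2, min = df$min_2, max = df$max_2)
)

# Drop the combinations that have no data at all
sk3 <- sk3[!is.na(sk3$mean), ]
sk3
```

The resulting sk3 has the weight_band, group, mean, min and max columns that the ggplot call above expects.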
Here is my simplified data :
company <-c(rep(c(rep("company1",4),rep("company2",4),rep("company3",4)),3))
product<-c(rep(c(rep(c("product1","product2","product3","product4"),3)),3))
week<-c( c(rep("w1",12),rep("w2",12),rep("w3",12)))
mydata<-data.frame(company=company,product=product,week=week)
mydata$rank<-c(rep(c(1,3,2,3,2,1,3,2,3,2,1,1),3))
mydata=mydata[mydata$company=="company1",]
And, R code I used :
ggplot(mydata,aes(x = week,fill = as.factor(rank))) +
geom_bar(position = "fill")+
scale_y_continuous(labels = scales::percent_format()) # percent_format() comes from the scales package
In the bar plot, I want to label the percentage of each rank within each week.
The problem is that the data doesn't contain the percentage of each rank, and its structure is not suited to holding one.
(Of course, the original data has many more observations than the example.)
Can anyone show me how to label the percentages in this graph?
I'm not sure I understand why geom_text is not suitable. Here is an answer using it, but if you specify why is it not suitable, perhaps someone might come up with an answer you are looking for.
library(ggplot2)
library(plyr)
library(reshape2) # provides melt()
mydata = mydata[,c(3,4)] #drop unnecessary variables
data.m = melt(table(mydata)) #get counts and melt it
#calculate percentage:
m1 = ddply(data.m, .(week), summarize, ratio=value/sum(value))
#order data frame (needed to comply with percentage column):
m2 = data.m[order(data.m$week),]
#combine them:
mydf = data.frame(m2,ratio=m1$ratio)
Which gives us the following data structure. The ratio column contains the relative frequency of given rank within specified week (so one can see that rank == 3 is twice as abundant as the other two).
> mydf
week rank value ratio
1 w1 1 1 0.25
4 w1 2 1 0.25
7 w1 3 2 0.50
2 w2 1 1 0.25
5 w2 2 1 0.25
8 w2 3 2 0.50
3 w3 1 1 0.25
6 w3 2 1 0.25
9 w3 3 2 0.50
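The ratio column can be cross-checked in base R with table() and prop.table(). This is only a sketch: the small mydata below rebuilds the company1 subset from the question by hand (four products per week with ranks 1, 3, 2, 3).

```r
# company1 rows only: ranks 1, 3, 2, 3 in each of the three weeks
mydata <- data.frame(week = rep(c("w1", "w2", "w3"), each = 4),
                     rank = rep(c(1, 3, 2, 3), times = 3))

# Counts per week and rank, then row-wise proportions (margin = 1 means per week)
counts <- table(mydata$week, mydata$rank)
ratios <- prop.table(counts, margin = 1)
ratios

# Long format, one row per week/rank combination, as needed for labelling
ratio_df <- as.data.frame(ratios)
names(ratio_df) <- c("week", "rank", "ratio")
```

Each row of ratios sums to 1, matching the 0.25 / 0.25 / 0.50 split shown in mydf.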
Next, we have to position the percentage labels and plot them.
#make plot
p =
ggplot(mydf,aes(x = week, y = value, fill = as.factor(rank))) +
geom_bar(stat = "identity")
#add percentage labels centred in each bar segment; position_stack(vjust = 0.5)
#follows the stacking order ggplot2 (>= 2.2.0) actually uses, so manual
#cumsum() positions are not needed and cannot end up in the wrong segment
p + geom_text(aes(label = sprintf("%1.2f%%", 100*ratio)),
position = position_stack(vjust = 0.5))
Is this what you wanted?