Plot data from different csv files in one graph using R + ggplot - r

I have multiple .csv files, each of which has a column (called Data) that I want to compare across files. But first, I have to group the values in a column of each file. In the end I want one graph with multiple colored "lines", each showing the mean value per group for one file. Below I describe the process I use to get the graph I want. This works for a single file, but I don't know how to add multiple "lines" from multiple files to one graph using ggplot.
This is what I got so far:
data = read.csv(file="my01data.csv", header=FALSE, sep=",", col.names=c("ID", "Data", "Range"))
A single .csv file looks like the following, but without the header line:
ID Data Range
1,63,5.01
2,61,5.02
3,65,5.00
4,62,4.99
5,62,4.98
6,64,5.01
7,71,4.90
8,72,4.93
9,82,4.89
10,82,4.80
11,83,4.82
10,85,4.79
11,81,4.80
After getting the data I group it with the following lines:
data["Group"] <- NA
data[(data$Range>4.95), "Group"] <- 5.0
data[(data$Range>4.85 & data$Range<4.95), "Group"] <- 4.9
data[(data$Range>4.75 & data$Range<4.85), "Group"] <- 4.8
The final data looks like this:
myTable <- "ID Data Range Group
1 63 5.01 5.00
2 61 5.02 5.00
3 65 5.00 5.00
4 62 4.99 5.00
5 62 4.98 5.00
6 64 5.01 5.00
7 71 4.90 4.90
8 72 4.93 4.90
9 72 4.89 4.90
10 82 4.80 4.80
11 83 4.82 4.80
10 85 4.79 4.80
11 81 4.80 4.80"
myData <- read.table(text=myTable, header = TRUE)
To plot this dataframe I use the following lines:
( pplot <- ggplot(data=myData, aes(x=Group, y=Data))
+ stat_summary(fun = mean, geom = "line", color='red')
+ xlab("Group")
+ ylab("Data")
)
Which results in a graph like this:

I assume you have the names of your .csv-files stored in a vector named file_names. Then you can run the following code and should get a different line for each file:
library(ggplot2)
# the files have no header line, so the column names are supplied here
data_list <- lapply(file_names, read.csv, header=FALSE, sep=",",
                    col.names=c("ID", "Data", "Range"))
data_list <- lapply(seq_along(data_list), function(i){
  df <- data_list[[i]]
  df$Group <- round(df$Range, 1)
  df$DataNumber <- i
  df
})
finalTable <- do.call(rbind, data_list)
finalTable$DataNumber <- factor(finalTable$DataNumber)
ggplot(finalTable, aes(x=Group, y=Data, group = DataNumber, color = DataNumber)) +
  stat_summary(fun = mean, geom = "line") +
  xlab("Group") +
  ylab("Data")
How it works
First the different datasets are read with read.csv into a list data_list. Then each data.frame in that list is assigned a Group.
I used round here with digits = 1, which means it rounds to one decimal place (I figured that's what you are doing).
Then also a unique number (in this case simply the index of the list) is assigned to each data.frame. After that the list is combined to one data.frame with rbind and then DataNumber is turned into a factor (prettier for plotting). Finally I added DataNumber as a group and color variable to the plot.
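As a quick sanity check of the rounding step, here is a minimal base-R sketch. It uses the sample rows from the question and compares round(Range, digits = 1) with the manual range-based grouping, then computes the per-group means that stat_summary would draw. The names sample_df, manual, same_groups and group_means are only for illustration.

```r
# Sample rows from the question (IDs replaced by a running index; they are not used here)
sample_df <- data.frame(
  ID    = 1:13,
  Data  = c(63, 61, 65, 62, 62, 64, 71, 72, 82, 82, 83, 85, 81),
  Range = c(5.01, 5.02, 5.00, 4.99, 4.98, 5.01, 4.90, 4.93, 4.89,
            4.80, 4.82, 4.79, 4.80)
)

# Manual grouping as in the question
manual <- rep(NA_real_, nrow(sample_df))
manual[sample_df$Range > 4.95] <- 5.0
manual[sample_df$Range > 4.85 & sample_df$Range < 4.95] <- 4.9
manual[sample_df$Range > 4.75 & sample_df$Range < 4.85] <- 4.8

# Grouping by rounding to one decimal place
sample_df$Group <- round(sample_df$Range, digits = 1)

# TRUE when both approaches agree for these rows
same_groups <- isTRUE(all.equal(manual, sample_df$Group))

# Mean of Data per group: the values the plotted line passes through
group_means <- aggregate(Data ~ Group, data = sample_df, FUN = mean)
print(same_groups)
print(group_means)
```

The two approaches only differ at the boundaries: a Range of exactly 4.95 or 4.85 stays NA with the strict inequalities in the question, whereas round always assigns a group.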

You can add another line by using stat_summary again; you can set its data and aes arguments to any other dataset:
#some pseudo data for testing
my_other_data <- myData
my_other_data$Data <- my_other_data$Data * 0.5
pplot <- ggplot(data=myData, aes(x=Group, y=Data)) +
  stat_summary(fun = mean, geom = "line", color='red') +
  stat_summary(data=my_other_data, aes(x=Group, y=Data),
               fun = mean, geom = "line", color='green') +
  xlab("Group") +
  ylab("Data")
pplot

Why not create a classifying column ("Class") in each table?
myTable1$Class <- "table1"
myTable1
"ID Data Range Group Class
1 63 5.01 5.00 table1
2 61 5.02 5.00 table1
3 65 5.00 5.00 table1"
myTable2$Class <- "table2"
myTable2
"ID Data Range Group Class
1 63 5.01 5.00 table2
2 61 5.02 5.00 table2
3 65 5.00 5.00 table2"
Then merge the data frames:
dfBIND <- rbind(myTable1, myTable2)
so that you can plot with ggplot using a grouping or coloring variable:
pplot <- ggplot(data=dfBIND, aes(x=Group, y=Data, group=Class, color=Class)) +
  stat_summary(fun = mean, geom = "line") +
  xlab("Group") +
  ylab("Data")
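With more than two tables, adding the Class column by hand gets repetitive. A minimal base-R sketch of one way to automate it is to keep the tables in a named list and use each list name as the Class value. The two small tables here are hypothetical stand-ins, and the names tables and labelled are only for illustration.

```r
# Hypothetical stand-ins for the two tables
myTable1 <- data.frame(ID = 1:3, Data = c(63, 61, 65), Group = 5.0)
myTable2 <- data.frame(ID = 1:3, Data = c(53, 51, 55), Group = 5.0)

tables <- list(table1 = myTable1, table2 = myTable2)

# Add the list name as the Class column of each table
labelled <- Map(function(df, cls) {
  df$Class <- cls
  df
}, tables, names(tables))

# Stack the labelled tables into one data frame
dfBIND <- do.call(rbind, labelled)
rownames(dfBIND) <- NULL
print(table(dfBIND$Class))
```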

Related

Improve the autoBinning Histogram

I'm making an auto-binning histogram for the second time, but it looks elementary. I'm seeking help to improve it.
What I have tried is:
DAta <- read.table(text="Species DNA LINE LTR SINE Helitron Unclassified Unmasked
darius 2.68 10.37 18.00 1.52 3.64 0.03 63.79
Derian 2.74 10.59 16.61 1.56 4.24 0.03 64.23
rats 2.77 10.97 15.20 1.57 4.69 0.03 64.77
Mouos 2.53 10.42 17.33 1.42 3.68 0.02 64.6", header=TRUE)
library(reshape2)
DF1 <- melt(DAta, id.var="Species")
library(ggplot2)
ggplot(DF1, aes(x = Species, y = value, fill = variable)) +
  geom_bar(stat = "identity")
Output:
How can I make the species names italic?
How can I keep the bars in the same order as the input, from left to right (darius, Derian, rats and Mouos)?
How can I make the colours and style look better and more reasonable?
There are 3 questions here:
To change the axis labels to italics, adjust the axis.text.x theme element; see the question/answers referenced at the bottom.
To change the ordering of the axis labels, convert the variable Species to a factor whose levels are in the desired order.
Finally, to change the color scheme, use one of the scale_fill_* functions. I like the ColorBrewer palettes (scale_fill_brewer), which offer several good color schemes. A few other predefined scale_fill_* options are also available.
Note: this is a bar chart and not a histogram.
See the comments in the code for additional details:
DAta <- read.table(text="Species DNA LINE LTR SINE Helitron Unclassified Unmasked
darius 2.68 10.37 18.00 1.52 3.64 0.03 63.79
Derian 2.74 10.59 16.61 1.56 4.24 0.03 64.23
rats 2.77 10.97 15.20 1.57 4.69 0.03 64.77
Mouos 2.53 10.42 17.33 1.42 3.68 0.02 64.6", header=TRUE)
#updated method to reshape data. tidyr is replacement for reshape2
library(tidyr)
DF1 <- pivot_longer(DAta, cols=-1, names_to = "Classification", values_to = "Value" )
#Set Species as factors defining the order of the labels
DF1$Species<-factor(DF1$Species, levels=c("darius", "Derian", "rats", "Mouos"))
library(ggplot2)
ggplot(DF1, aes(x = Species, y = Value, fill = Classification)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Pastel1") +
theme(axis.text.x = element_text(face="italic"))
Option: If the species names can change, here is a potential way to keep the input ordering of the Species names without hard-coding them:
#takes the levels from the Species column of the original dataframe, in input order
# assumes each species appears only once in the original dataframe
DF1$Species<-factor(DF1$Species, levels= unique(DAta$Species))
To adjust the axis labels here is a good reference:
Changing font size and direction of axes text in ggplot2
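To see why the factor conversion in the answer above changes the bar order, here is a small base-R sketch (the variable names are only for illustration). By default factor() sorts its levels, and ggplot2 draws a discrete axis in level order. Passing the levels explicitly keeps the order in which the names first appear.

```r
species <- c("darius", "Derian", "rats", "Mouos")

# Default: levels are sorted, so the input order is lost
default_levels <- levels(factor(species))

# Explicit levels in order of first appearance keep the input order
input_levels <- levels(factor(species, levels = unique(species)))

print(default_levels)
print(input_levels)
```

The exact sorted order depends on the locale, which is one more reason to set the levels explicitly.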

How can I bind_rows of two random sets and then plot them as a histogram in dplyr?

I'm having a hard time binding rows of two random samples of 500 each to get one file with 1000 rows.
Then I'm trying to plot a histogram of this combined sample and geom_density().
For my bind_rows line, the error I get is
"Argument 1 must have names"
Does anyone have an idea what is wrong? Thank you,
x <- 1:500
rand1 <- rnorm(length(x), -1, 0.6)
rand2 <- rnorm(length(x), 1, 1.2)
combined <- bind_rows(rand1, rand2)
ggplot(combined, aes(x=y, y=..density..)) +
geom_histogram(fill = "red", alpha = 0.5, color="darkred") +
geom_density()
Change the offending line to:
combined <- data.frame(y= c(rand1, rand2))
There were two issues that prevent the original code from completing the task: a) no name for the data argument, and b) lack of packaging in a form that could be coerced to a dataframe. The combined could also have been a named list.
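A minimal base-R sketch of that fix follows. As an assumption beyond the original answer, it adds a label column (here called sample) so the two samples stay distinguishable after combining. The seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
rand1 <- rnorm(500, mean = -1, sd = 0.6)
rand2 <- rnorm(500, mean = 1, sd = 1.2)

# A bare numeric vector is not a data frame and carries no column names
print(is.data.frame(rand1))

# One named column for the values, one (hypothetical) label column for the source
combined <- data.frame(
  y      = c(rand1, rand2),
  sample = rep(c("rand1", "rand2"), each = 500)
)
str(combined)
```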
To be able to bind your two samples, you need to convert them to data frames. Their column names also need to match.
So something like this should work:
library(dplyr)
combined <- bind_rows(data.frame(x =rand1),
data.frame(x =rand2))
x
1 -1.1979747
2 -0.7819008
3 -2.0965976
4 -0.4637334
5 -1.4314750
6 -0.4356943
However, you can't differentiate rand1 and rand2 anymore.
So, an alternative solution is to start by binding your two random samples as columns and then pivot the dataframe into a longer format using pivot_longer from tidyr package:
df <- data.frame(rand1, rand2)
library(tidyverse)
df <- df %>% pivot_longer(everything(), names_to = "rand", values_to = "value")
rand value
<chr> <dbl>
1 rand1 -1.20
2 rand2 2.45
3 rand1 -0.782
4 rand2 1.35
5 rand1 -2.10
6 rand2 1.98
7 rand1 -0.464
8 rand2 0.733
9 rand1 -1.43
10 rand2 2.72
# … with 990 more rows
To plot the histogram and the density, I used stat(ndensity) and ..scaled.. so that both random samples are scaled to a maximum of 1:
library(ggplot2)
ggplot(df, aes(x = value, fill = rand))+
geom_density(aes(y = ..scaled..), alpha = 0.4)+
geom_histogram(aes(x = value, stat(ndensity)), color = "black", alpha =0.2)
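As far as I know, recent ggplot2 versions (3.4.0 and later) deprecate stat() and the ..name.. notation in favour of after_stat(). A sketch of the same plot in that style follows, assuming ggplot2 3.3.0 or later is installed. The long data frame is built in base R here, and position = "identity" is my own choice so the two histograms overlap instead of stacking.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility
df_long <- data.frame(
  rand  = rep(c("rand1", "rand2"), each = 500),
  value = c(rnorm(500, -1, 0.6), rnorm(500, 1, 1.2))
)

# Both layers are rescaled so that their maximum is 1
p <- ggplot(df_long, aes(x = value, fill = rand)) +
  geom_histogram(aes(y = after_stat(ndensity)), bins = 30,
                 position = "identity", color = "black", alpha = 0.2) +
  geom_density(aes(y = after_stat(scaled)), alpha = 0.4)
p
```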

How to maintain the order of elements of a row when using by and rbind function in r?

I have written a function which takes a subset of data based on the value of the name column. It computes the outliers for the column "mark" and replaces all of them.
However, when I try to combine these different subsets, the order of my elements changes. Is there any way I can maintain the order of the elements in the column "mark"?
My data set is:
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0
The function which I have written is :
data.frame(do.call("rbind", as.list(by(data, data$name,
function(x){apply(x[, .(mark)],2,
function(y) {y[y > (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))]
<- (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark));y})}))))
The result of the above function is the first column below (I've manually added back name for illustrative purposes):
mark NAME
100.000 ----- A
50.000 ----- A
210.000 ----- A
0.500 ----- B
90.000 ----- B
839.625 ----- B
100.000 ----- C
1200.000 ----- C
4875.000 ----- C
In the above result, the order of the values for mark column are changed. Is there any way by which I can maintain the order of the elements ?
Are you sure that code is doing what you think it is?
It looks like you're replacing any value greater than the median (third returned value of quantile) with the median + 1.5*IQR. Maybe that's what you intend, I don't know. The bigger problem is that you're doing that in an apply function, so it's going to re-calculate that median and IQR each iteration, updated with the previous rows already being changed. I'd wager that's not what you intend, but I suppose I've seen stranger.
A better option might be to create an external function to do the work, which takes in all of the data, does the calculation, then outputs all the data. I like dplyr for this simply because it's clean.
Reading your data in (why the "----"?)
scores <- read.table(text="
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0", header=TRUE)
and creating a function that does something a little more sensible; replaces any value greater than the 75% quantile (referenced by name so you know what it is) or less than the 25% quantile with that limiting value
scale_outliers <- function(data) {
lim <- quantile(data, na.rm = TRUE)
data[data > lim["75%"]] <- lim["75%"]
data[data < lim["25%"]] <- lim["25%"]
return(data)
}
Chaining this processing into dplyr::mutate is neat, and can then be passed on to ggplot. Here's the original data
library(dplyr)
library(ggplot2)
gg1 <- scores %>% ggplot(aes(x=name, y=mark))
gg1 <- gg1 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg1
And if we alter it with the new function we get the data back without rows changed around
scores %>% mutate(new_mark = scale_outliers(mark))
#> name mark new_mark
#> 1 A 100.0 100
#> 2 B 0.5 90
#> 3 C 100.0 100
#> 4 A 50.0 90
#> 5 B 90.0 90
#> 6 B 1000.0 1000
#> 7 C 1200.0 1000
#> 8 C 5000.0 1000
#> 9 A 210.0 210
and we can plot that,
gg2 <- scores %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg2 <- gg2 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg2
Best of all, if you now want to do that quantile comparison group-wise (say, by the name column), it's as easy as using dplyr::group_by(name),
gg3 <- scores %>% group_by(name) %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg3 <- gg3 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg3
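If you would rather not load dplyr, the same group-wise step can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. This is only a sketch: it re-creates scores and the scale_outliers() function from above so that it runs on its own.

```r
scores <- data.frame(
  name = c("A", "B", "C", "A", "B", "B", "C", "C", "A"),
  mark = c(100, 0.5, 100, 50, 90, 1000, 1200, 5000, 210)
)

# Same clamping rule as scale_outliers() above
scale_outliers <- function(data) {
  lim <- quantile(data, na.rm = TRUE)
  data[data > lim["75%"]] <- lim["75%"]
  data[data < lim["25%"]] <- lim["25%"]
  data
}

# ave() applies the function within each name and puts the results back in row order
scores$new_mark <- ave(scores$mark, scores$name, FUN = scale_outliers)
print(scores)
```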
A slightly refactored version of Hack-R's answer -- you can add an index to your data.table:
library(data.table)
data <- data.table(name = c("A", "B","C", "A","B","B","C","C","A"),mark = c(100,0.5,100,50,90,1000,1200,5000,210))
data[,i:=.I]
Then you perform your calculation but you keep the name and i:
df <- data.frame(do.call("rbind", as.list(
by(data, data$name,
function(x) cbind(i=x$i,
name=x$name,
apply(x[, .(mark)], 2,function(y) {y[y > (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <- (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark));y})
)))))
And finally you order using the index:
df[order(df$i),]
i name mark
1 1 A 100
4 2 B 0.5
7 3 C 100
2 4 A 50
5 5 B 90
6 6 B 839.625
8 7 C 1200
9 8 C 4875
3 9 A 210
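Since the data is already a data.table, another option is to skip by() and rbind entirely. The sketch below assumes the data.table package is installed and applies the same cap (median + 1.5 * IQR within each name) using := with by, which adds the column in place without reordering rows. The column name capped is only for illustration.

```r
library(data.table)

data <- data.table(
  name = c("A", "B", "C", "A", "B", "B", "C", "C", "A"),
  mark = c(100, 0.5, 100, 50, 90, 1000, 1200, 5000, 210)
)

# Cap each mark at median + 1.5 * IQR of its own name group.
# := adds the column by reference, so the rows keep their original order.
data[, capped := pmin(mark, median(mark) + 1.5 * IQR(mark)), by = name]
print(data)
```

For this data the capped column matches the mark column of the ordered result above.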

Adding Legend in R using row names

I have a data frame, and I want to pass the values of its first two columns plus the variable names to the legend.
Inside df I have groups of data, labelled with letters from a to h.
What I want to achieve is for something like 78_256_DQ0_a,
78_256_DQ1_a and 78_256_DQ2_a to appear in the legend for group a, and so on for the other groups.
I don't know how to pass this format to ggplot.
Any help will be appreciated.
Let's say I have a data frame like this:
df <- do.call(rbind,lapply(1,function(x){
AC <- as.character(rep(rep(c(78,110),each=10),times=3))
AR <- as.character(rep(rep(c(256,320,384),each=20),times=1))
state <- rep(rep(c("Group 1","Group 2"),each=5),times=6)
V <- rep(c(seq(2,40,length.out=5),seq(-2,-40,length.out=5)),times=2)
DQ0 = sort(replicate(6, runif(10,0.001:1)))
DQ1 = sort(replicate(6, runif(10,0.001:1)))
DQ2 = sort(replicate(6, runif(10,0.001:1)))
No = c(replicate(1,rep(letters[1:6],each=10)))
data.frame(AC,AR,V,DQ0,DQ1,DQ2,No)
}))
head(df)
AC AR V DQ0 DQ1 DQ2 No
1 78 256 2.0 0.003944916 0.00902776 0.00228837 a
2 78 256 11.5 0.006629239 0.01739512 0.01649540 a
3 78 256 21.0 0.048515226 0.02034436 0.04525160 a
4 78 256 30.5 0.079483625 0.04346118 0.04778420 a
5 78 256 40.0 0.099462310 0.04430493 0.05086738 a
6 78 256 -2.0 0.103686255 0.04440260 0.09931459 a
This is the code for plotting df:
library(reshape2)
df_new <- melt(df,id=c("V","No"),measure=c("DQ0","DQ1","DQ2"))
library(ggplot2)
ggplot(df_new,aes(y=value,x=V,group=No,colour=No))+
geom_point()+
geom_line()
Adding lty = variable to your aesthetics, like so:
ggplot(df_new, aes(y = value, x = V, lty = variable, colour = No)) +
geom_point() +
geom_line()
will give you separate lines for DQ0, DQ1, and DQ2.
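If the goal is legend entries such as 78_256_DQ0_a, one possible approach is to paste the identifying columns into a single label column and map that to colour. This is only a sketch of that idea. The small data frame long is a hypothetical stand-in for the melted data, and it assumes AC and AR were kept as id columns when melting, e.g. id = c("AC", "AR", "V", "No").

```r
# Small hypothetical stand-in for the long-format data
long <- data.frame(
  AC       = "78",
  AR       = "256",
  No       = "a",
  variable = rep(c("DQ0", "DQ1", "DQ2"), each = 2),
  V        = rep(c(2, 11.5), times = 3),
  value    = c(0.004, 0.007, 0.009, 0.017, 0.002, 0.016)
)

# Combine the identifying columns into one label per line
long$label <- paste(long$AC, long$AR, long$variable, long$No, sep = "_")
print(unique(long$label))

# The label can then be mapped in the plot, e.g.
# ggplot(long, aes(x = V, y = value, colour = label)) + geom_point() + geom_line()
```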

How to generate summary information and error bars in R

I have a set of data:
COL1 COL2
1 3.45
2 8.48
1 2.53
2 9.42
2 2.56
etc.
COL1 specifies a category, whereas COL2 is data. I'd like to, for each distinct value in COL1 generate mean, stddev, min & max values. So in the end have something like (not real numbers):
COL1VAL MEAN STDDEV
1 4.59 1.24
2 4.75 1.20
I'd also then like to generate a bar chart with error bars, with X axis being the COL1VAL and bar height being the mean.
Can one do this in R, and if so, how?
Here's how you could do those things using packages dplyr and ggplot2, assuming your data frame is called df.
library(dplyr)
dfsummary <- df %>%
  group_by(COL1) %>%
  summarise(mean = mean(COL2), sd = sd(COL2), min = min(COL2), max = max(COL2))
dfsummary
#Source: local data frame [2 x 5]
#
# COL1 mean sd min max
#1 1 2.99 0.6505382 2.53 3.45
#2 2 6.82 3.7190859 2.56 9.42
library(ggplot2)
ggplot(dfsummary, aes(x = factor(COL1), y = mean)) +
geom_bar(stat = "identity", fill = "lightblue") +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd))
If you prefer to stay in base R, you could use tapply and arrows:
head(chickwts, 15) # chicken growth depending on food
means <- tapply(X=chickwts$weight, INDEX=chickwts$feed, FUN=mean)
sds <- tapply(X=chickwts$weight, INDEX=chickwts$feed, FUN=sd )
or <- order(means)
bp <- barplot(means[or], ylim=c(0, 390), las=2)
arrows(x0=bp, y0=(means+sds)[or], y1=(means-sds)[or],
code=3, angle=90, length=0.1)
Regards,
Berry
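For completeness, here is a sketch of the base-R route applied to the COL1/COL2 layout from the question instead of chickwts. It uses only the five sample rows shown there, and the names groups, dfsummary and bp are only for illustration.

```r
df <- data.frame(
  COL1 = c(1, 2, 1, 2, 2),
  COL2 = c(3.45, 8.48, 2.53, 9.42, 2.56)
)

# Split the values by category, then compute each statistic per group
groups <- split(df$COL2, df$COL1)
dfsummary <- data.frame(
  COL1 = names(groups),
  mean = sapply(groups, mean),
  sd   = sapply(groups, sd),
  min  = sapply(groups, min),
  max  = sapply(groups, max),
  row.names = NULL
)
print(dfsummary)

# Bar chart of the means with error bars of +/- one standard deviation
bp <- barplot(dfsummary$mean, names.arg = dfsummary$COL1,
              ylim = c(0, max(dfsummary$mean + dfsummary$sd) * 1.1))
arrows(x0 = bp, y0 = dfsummary$mean - dfsummary$sd,
       y1 = dfsummary$mean + dfsummary$sd,
       code = 3, angle = 90, length = 0.1)
```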
