geom_tile() in R: Sort Order - r

I have a dataframe that I want to plot a heatmap of:
dput(df)
structure(list(`0` = c(6.08, 7.91, 5.14, 2.23, 0.72, 0.19, 0.04,
0.01, 0, 0, 0), `1` = c(9.12, 11.86, 7.71, 3.34, 1.09, 0.28,
0.06, 0.01, 0, 0, 0), `2` = c(6.84, 8.89, 5.78, 2.5, 0.81, 0.21,
0.05, 0.01, 0, 0, 0), `3` = c(3.42, 4.45, 2.89, 1.25, 0.41, 0.11,
0.02, 0, 0, 0, 0), `4` = c(1.28, 1.67, 1.08, 0.47, 0.15, 0.04,
0.01, 0, 0, 0, 0), `5` = c(0.38, 0.5, 0.33, 0.14, 0.05, 0.01,
0, 0, 0, 0, 0), `6` = c(0.1, 0.13, 0.08, 0.04, 0.01, 0, 0, 0,
0, 0, 0), `7` = c(0.02, 0.03, 0.02, 0.01, 0, 0, 0, 0, 0, 0, 0
), `8` = c(0, 0.01, 0, 0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10 or more` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)), row.names = c("0", "1", "2", "3", "4", "5",
"6", "7", "8", "9", "10 or more"), class = "data.frame")
Now to plot the heatmap using ggplot2 this is how I approach the solution:
df %>%
as_tibble(rownames = "homeScore") %>%
pivot_longer(cols = -homeScore, names_to = "awayScore", values_to = "level") %>%
ggplot() +geom_tile(aes(x=homeScore, y=awayScore, fill = level))
The problem I face is that the columns and rows are being sorted on (0,1,10+,2,..) instead of (0,1,2,...10+). Here's the example:
How do I sort the values such that 10+ is the last for row and column, instead of the third?

As #Nate already mentioned you have to convert your vars to factors and put the levels in the right order. Instead of converting via factor(as.numeric(.)) (which converts "10 or more" to NA) I would recommend to make use of forcats::fct_relevel which allows you to change the order of the levels, e.g. forcats::fct_relevel(homeScore, "10 or more", after = 10) will change the order of the levels such that 10 or more becomes the last level. Try this:
library(ggplot2)
library(tidyr)
library(dplyr)
library(forcats)
df %>%
as_tibble(rownames = "homeScore") %>%
pivot_longer(cols = -homeScore, names_to = "awayScore", values_to = "level") %>%
mutate_at(vars(homeScore, awayScore), ~forcats::fct_relevel(.x, "10 or more", after = 10)) %>%
ggplot() +
geom_tile(aes(x=homeScore, y=awayScore, fill = level))

Related

can I plot the dataframe cells as they are with colors?

I have the following dataframe (df). I want to make a graph that shows the same positions of the cells in the dataframe and having each cell with a color depending on the value (-1:1). if the value is zero then the color is blue, if the value is 1 or -1 then the value is red.
df<- structure(list(`0` = 1:8, `1` = c(0.885, 0.695, 0.828, 0.888,
0.823, 0.231, 0.599, 0.153), `2` = c(0.834, 0.68, 0.857, 0.802,
0.734, 0.205, 0.62, 0.044), `3` = c(0.854, 0.66, 0.83, 0.829,
0.729, 0.159, 0.559, 0.081), `4` = c(0.87, 0.583, 0.778, 0.853,
0.75, 0.087, 0.515, -0.011), `5` = c(0.922, 0.739, 0.787, 0.805,
0.635, -0.017, 0.498, -0.204), `6` = c(0.815, 0.535, 0.833, 0.784,
0.803, 0.092, 0.502, -0.419), `7` = c(0.859, 0.517, 0.8, 0.829,
0.557, 0.22, 0.368, -0.42), `8` = c(0.86, 0.701, 0.701, 0.786,
0.567, 0.414, 0.324, -0.396), `9` = c(0.774, 0.781, 0.805, 0.862,
0.405, 0.852, 0.1, -0.448), `10` = c(0.869, 0.788, 0.837, 0.838,
0.481, 0, -0.072, -0.48), `11` = c(0.816, 0.795, 0.807, 0.744,
0.217, 0, 0.096, -0.346), `12` = c(0.829, 0.792, 0.774, 0.778,
0.003, 0, 0, 0), `13` = c(0.799, 0.84, 0.775, 0.66, -0.024, 0,
0, 0), `14` = c(0.842, 0.765, 0.852, 0.679, 0, 0, 0, 0), `15` = c(0.804,
0.811, 0.818, 0.468, 0, 0, 0, 0), `16` = c(0.801, 0.757, 0.715,
0.091, 0, 0, 0, 0), `17` = c(0.807, 0.786, 0.799, -0.042, 0,
0, 0, 0), `18` = c(0.595, 0.795, 0.73, 0, 0, 0, 0, 0), `19` = c(0.822,
0.789, 0.623, 0, 0, 0, 0, 0), `20` = c(0.829, 0.822, 0.048, 0,
0, 0, 0, 0), `21` = c(0.805, 0.788, -0.205, 0, 0, 0, 0, 0), `22` = c(0.788,
0.791, -0.065, 0, 0, 0, 0, 0), `23` = c(0.839, 0.786, -0.217,
0, 0, 0, 0, 0), `24` = c(0.804, 0.815, 0, 0, 0, 0, 0, 0), `25` = c(0.789,
0.784, 0, 0, 0, 0, 0, 0), `26` = c(0.754, 0.787, 0, 0, 0, 0,
0, 0), `27` = c(0.832, 0.741, 0, 0, 0, 0, 0, 0), `28` = c(0.846,
0.778, 0, 0, 0, 0, 0, 0), `29` = c(0.797, 0.69, 0, 0, 0, 0, 0,
0), `30` = c(0.843, 0.644, 0, 0, 0, 0, 0, 0), `31` = c(0.825,
0.622, 0, 0, 0, 0, 0, 0), `32` = c(0.824, 0.726, 0, 0, 0, 0,
0, 0), `33` = c(0.749, 0.493, 0, 0, 0, 0, 0, 0), `34` = c(0.774,
-0.082, 0, 0, 0, 0, 0, 0), `35` = c(0.652, -0.255, 0, 0, 0, 0,
0, 0), `36` = c(0.833, 0, 0, 0, 0, 0, 0, 0), `37` = c(0.795,
0, 0, 0, 0, 0, 0, 0), `38` = c(0.864, 0, 0, 0, 0, 0, 0, 0), `39` = c(0.226,
0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -8L), class = "data.frame")
You don't have any cells that are -1 or 1, so there aren't any that would be red. You didn't mention what color the cells should be if they are neither -1, 0 or 1, so I have left these white:
library(ggplot2)
library(tidyr)
library(dplyr)
pivot_longer(df, -1) %>%
mutate(name = factor(as.numeric(name),
sort(unique(as.numeric(name))))) %>%
ggplot(aes(name, `0`, fill = ifelse(value == 0, "blue",
ifelse(abs(value) == 1, "red", "white")))) +
geom_tile(color = "black") +
scale_y_reverse(breaks = seq(nrow(df)), expand = c(0, 0)) +
scale_x_discrete(position = "top") +
scale_fill_identity() +
coord_equal() +
theme_classic() +
theme(axis.line = element_blank(),
axis.title = element_blank(),
legend.position = "none")
Edit
Update based on comment from OP:
library(ggplot2)
library(tidyr)
library(dplyr)
pivot_longer(df, -1) %>%
mutate(name = factor(as.numeric(name),
sort(unique(as.numeric(name))))) %>%
ggplot(aes(name, `0`, fill = value)) +
geom_tile(color = "black") +
scale_y_reverse(breaks = seq(nrow(df)), expand = c(0, 0)) +
scale_x_discrete(position = "top") +
scale_fill_gradient2(low = "red", mid = "blue", high = "red",
breaks = c(-1, 0, 1), limits = c(-1, 1)) +
coord_equal() +
theme_classic() +
theme(axis.line = element_blank(),
axis.title = element_blank())
legend.position = "none")

Make Heatmap Look More Professional in R

I can build very basic plots in R, but I'm trying to make my heatmap look more professional and I'm not sure how to do it. I have a data frame df of 11 observations of 11 variables:
> dput(df)
structure(list(`0` = c(6.08, 7.91, 5.14, 2.23, 0.72, 0.19, 0.04,
0.01, 0, 0, 0), `1` = c(9.12, 11.86, 7.71, 3.34, 1.09, 0.28,
0.06, 0.01, 0, 0, 0), `2` = c(6.84, 8.89, 5.78, 2.5, 0.81, 0.21,
0.05, 0.01, 0, 0, 0), `3` = c(3.42, 4.45, 2.89, 1.25, 0.41, 0.11,
0.02, 0, 0, 0, 0), `4` = c(1.28, 1.67, 1.08, 0.47, 0.15, 0.04,
0.01, 0, 0, 0, 0), `5` = c(0.38, 0.5, 0.33, 0.14, 0.05, 0.01,
0, 0, 0, 0, 0), `6` = c(0.1, 0.13, 0.08, 0.04, 0.01, 0, 0, 0,
0, 0, 0), `7` = c(0.02, 0.03, 0.02, 0.01, 0, 0, 0, 0, 0, 0, 0
), `8` = c(0, 0.01, 0, 0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10+` = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0)), row.names = c("0", "1", "2", "3", "4", "5", "6", "7",
"8", "9", "10+"), class = "data.frame")
So I transform df into a matrix to get a heatmap:
heatmap(data.matrix(df), Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
This is what the plot looks like:
I'm not sure how to:
Change the location of the keys for the row and columns. I want them both to start from 0 in the top left corner, and both row and column to continue ascending until 10+
I'd also like more granularity in the colour. Right now you can't event tell the difference in values by looking at the colour...
Is heatmap from base R even the right library for this? I looked up a few examples and I wasn't sure if there's a better library to achieve what I want.
There are several libraries that offer heatmap functionalities. IMO base heatmap and gplots::heatmap.2 did not age well and are not the best options anymore. 3 good possibilities are with ggplot2::geom_tile, pheatmap and ComplexHeatmap.
Example data
Let's assume we have a matrix
dta <- matrix(rnorm(25), nrow=5)
rownames(dta) <- letters[1:5]
colnames(dta) <- LETTERS[1:5]
ggplot2::geom_tile
The ggplot2 version requires your data to be a tidy dataframe, so we can transform our matrix with tidyr::pivot_longer().
dta %>%
as_tibble(rownames = "myrows") %>%
pivot_longer(cols = -myrows, names_to = "mycols", values_to = "level") %>%
ggplot() +
geom_tile(aes(x=myrows, y=mycols, fill = level))
pheatmap
The pheatmap package is quite good at generating modern heatmaps. It takes a matrix as input. It can cluster the rows and columns and make a dendrogram, which is often a desired feature. It can also scale rows and columns (effectively plotting a Z-score).
pheatmap::pheatmap(dta,
scale = "none",
cluster_rows = FALSE,
cluster_cols = FALSE)
Note that the positions of rows and columns are not the same as with ggplot. You can look at the options that allow some useful customization. For example, if our rows have classes defined elsewhere.
ann_df <- data.frame(row.names = rownames(dta),
classification = rep(c("first", "second"), times = c(2,3)))
pheatmap::pheatmap(dta,
scale = "none",
cluster_rows = FALSE,
cluster_cols = FALSE,
annotation_row = ann_df,
gaps_row = c(2))
Color scale
One of the big aspects that make your heatmap look professional is the color scale. On ggplot, you should check out scale_fill_gradient2().
On pheatmap, you can try these settings for color as a starting point (see the documentation of these functions):
color = scales::div_gradient_pal(low = "navy",
mid = "green",
high="yellow")(seq(0,1,
length.out = max(dta))),
color = colorRampPalette(RColorBrewer::brewer.pal(n = 9,
name = "Blues"))(max(dta)),
color = viridisLite::plasma(max(dta)),
ComplexHeatmap
Finally, a package that has gained success recently is ComplexHeatmap. It is based on pheatmap but offers many additional options. See the link in zx8754's comment for a detailed book full of examples.

R Heatmap - Change Axes to Variable From Function

I'm trying to create a function in R that will take in two names (characters) and two doubles. It will then output a ggplot2 heatmap. Here's the data:
> dput(df)
structure(list(`0` = c(0.0608, 0.0791, 0.0514, 0.0223, 0.0072,
0.0019, 4e-04, 1e-04, 0, 0, 0), `1` = c(0.0912, 0.1186, 0.0771,
0.0334, 0.0109, 0.0028, 6e-04, 1e-04, 0, 0, 0), `2` = c(0.0684,
0.0889, 0.0578, 0.025, 0.0081, 0.0021, 5e-04, 1e-04, 0, 0, 0),
`3` = c(0.0342, 0.0445, 0.0289, 0.0125, 0.0041, 0.0011, 2e-04,
0, 0, 0, 0), `4` = c(0.0128, 0.0167, 0.0108, 0.0047, 0.0015,
4e-04, 1e-04, 0, 0, 0, 0), `5` = c(0.0038, 0.005, 0.0033,
0.0014, 5e-04, 1e-04, 0, 0, 0, 0, 0), `6` = c(0.001, 0.0013,
8e-04, 4e-04, 1e-04, 0, 0, 0, 0, 0, 0), `7` = c(2e-04, 3e-04,
2e-04, 1e-04, 0, 0, 0, 0, 0, 0, 0), `8` = c(0, 1e-04, 0,
0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0), `10+` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c("0",
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10+"))
Now if I hardcode the names on the axes (homeScore and awayScore) it works:
df %>%
as_tibble(rownames = "awayScore") %>%
pivot_longer(cols = -awayScore, names_to = "homeScore", values_to = "probability") %>%
mutate_at(vars(awayScore, homeScore), ~forcats::fct_relevel(.x, "10+", after = 10)) %>%
ggplot() +
geom_tile(aes(x=awayScore, y=homeScore, fill = probability)) +
scale_fill_gradient2(low = "red", mid = "white", high = muted("blue"))+
theme(plot.margin = unit(c(2,2,2,2),"cm"))
This is the result:
But now, I want homeScore and awayScore to be variables in a function. So this is my new function with the same df:
TestFunction<-function(home,away){
df %>%
as_tibble(rownames = away) %>%
pivot_longer(cols = -away, names_to = "home", values_to = "probability") %>%
mutate_at(vars(away, home), ~forcats::fct_relevel(.x, "10+", after = 10)) %>%
ggplot() +
geom_tile(aes(x=away, y=home, fill = probability)) +
scale_fill_gradient2(low = "red", mid = "white", high = muted("blue"))+
theme(plot.margin = unit(c(2,2,2,2),"cm"))
}
But this plot is not what's expected:
What do I have to make the homeScore and awayScore on either axes become variables from a function?
When you want to use strings in aes instead of variables names of the dataframe you need to use the function aes_string instead.
Here is your updated code:
TestFunction<-function(home,away){
df %>%
as_tibble(rownames = away) %>%
pivot_longer(cols = -away, names_to = home, values_to = "probability") %>%
mutate_at(vars(away, home), ~forcats::fct_relevel(.x, "10+", after = 10)) %>%
ggplot() +
geom_tile(aes_string(x=away, y=home, fill = "probability")) +
scale_fill_gradient2(low = "red", mid = "white", high = muted("blue"))+
theme(plot.margin = unit(c(2,2,2,2),"cm"))
}
TestFunction("homeScore",'awayScore')
Note that I changed the line pivot_longer to include name_to=home (you had it hardcoded) and also in the aes_string I put quotes in the fill="probability
Output:

How to get multiple sets of subset from a matrix for multiple conditions using loop?

I would like to ask a data manipulation question by R.
all_matrix<-structure(list(V1 = c(0.012, 0, 0, 0, 0.037, 0, 0, 0, 0.007, 0, 0.104, 0.149, 0.164, 0.258, 3.986, 0, 0.002, 0, 0, 0), V2 = c(0, 0.07, 0, 0, 0.017, 0, 0, 0, 0.025, 0, 2.322, 0.327, 0.134, 1.035, 2.732, 0.01, 1.097, 0.388, 0, 0), V3 = c(0, 0, 0, 0.005, 0, 0, 0, 0, 0, 0, 0.007, 0, 0, 0, 1.777, 0, 0.241, 0, 0, 0), V4 = c(0, 0, 0, 0.001, 0.003, 0, 0, 0, 0, 0, 0.207, 0.002, 0.003, 0.015, 0.032, 0, 0.007, 0, 0, 0), V5 = c(0, 0, 0, 0, 0.026, 0, 0, 0, 0.001, 0, 0.101, 0, 0, 0.005, 0.01, 0, 0, 0, 0, 0), V6 = c(0, 0, 0, 0.003, 0.009, 0, 0, 0, 0.076, 0, 0.01, 0.006, 0, 0.091, 0.829, 0, 0.002, 0, 0, 0), V7 = c(0, 0, 0, 0, 0.026, 0, 0, 0, 0.351, 0, 1.849, 0.003, 0, 0.005, 0.998, 0.009, 0.18, 0, 0, 0), V8 = c(0, 0, 0.002, 0.047, 0.01, 0.003, 0, 0, 0.021, 0, 0.848, 0.007, 0.005, 0.206, 0.023, 0, 0.025, 0, 0, 0), V9 = c(0, 0, 0, 0.02, 0.013, 0, 0, 0, 0, 0, 0.008, 0, 0, 0, 0, 0, 0, 0, 0, 0), V10 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.012, 9.895, 4.362, 0, 0, 0, 0, 0), P_diff = c(-4.051, -3.769, -3.602, -7.563, -6.398, -1.816, 0.84, -1.91, 3.095, -1.544, 6.068, 16.469, 6.403, 9.9, 9.089, 18.977, 14.123, 3.103, 1.527, -2.902), PH_fold = c(0.144, 0.511, 0.686, 0.372, 0.582, 0.325, 1.312, 0.436, 1.061, 0.371, 1.119, 1.298, 1.134, 1.146, 1.123, 1.484, 1.204, 1.263, 1.843, 0.423), PNH_fold = c(6.933, 1.955, 1.459, 2.69, 1.718, 3.081, 0.762, 2.291, 0.943, 2.696, 0.894, 0.77, 0.882, 0.872, 0.891, 0.674, 0.831, 0.792, 0.543, 2.366)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "P_diff", "PH_fold", "PNH_fold"), class = "data.frame", row.names = c("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10", "S11", "S12", "S13", "S14", "S15", "S16", "S17", "S18", "S19", "S20"))
sig1<-data.frame(subset(all_matrix, all_matrix$PH_fold >= 1.1 & all_matrix$P_diff >=1))##conditions are in range like all_matrix$PH_fold >= 1.1,1.2,1.3,....,2 & all_matrix$P_diff >= 1,2,3,4......, upto 20###
sig2<-data.frame(subset(all_matrix, all_matrix$PNH_fold >= 1.1 & all_matrix$P_diff <= -1))##conditions are in range like all_matrix$PNH_fold >= 1.1,1.2,1.3,....,2 & all_matrix$P_diff >= -1,-2,-3,-4......, upto -20###
d <- function(x){sum((log(x[x>0]))*(x[x>0]))*(-1)}
sh1<-apply((sig1[,-c(5:7)]/100), 2, d)
sh2<-apply((sig2[,-c(5:7)]/100), 2, d)
count1<-apply(sig1[,-c(5:7)], 2, function(i) (sum(i > 0)))
count2<-apply(sig2[,-c(5:7)], 2, function(i) (sum(i > 0)))
G1<-(sig1*sh1)
G2<-(sig2*sh2)
G<-data.frame(G1/G2)
I would like to set a R code to calculate "G" for all subsets for all_matrix based on each condition as mentioned above.
So how can I get the subsets of a matrix using different conditions for each subset using loop for further process to calculate "G":
I want to use subset() function in loop:
Can anyone help me out?
Thank you!
I appreciate any replies!
If you have large dataframes, don't use subset, it's slower than any other option. The fastest is generally the indexing way.
I first create two indices with the conditions given. Then a list to hold the sub-dfs.
i1 <- data1$age < 15
i2 <- data1$age > 20
sub_data1 <- list()
sub_data1[['less_15']] <- data1[i1, ]
sub_data1[['over_20']] <- data1[i2, ]
Note that you can skip the indices creation code and simply do
sub_data1 <- list()
sub_data1[['less_15']] <- data1[data1$age < 15, ]
sub_data1[['over_20']] <- data1[data1$age > 20, ]

Using aggregate on list in R

I have a list (lst3, subset below) and would like to do some calculations on it, e.g.:
lst4 <-lapply(lst3, function(x) aggregate(x[,5:ncol(x)], x[c(4)], FUN = mean)) #column means
lst5<-lapply(lst4,function(x) apply(x[,-c(1)],1,mean)) # get row mean
However, I am unable to get row mean without ignoring "Site".
I would like my final list to look like this:
lst5<-
[[1]]
Site x
G116 1.864233
[[2]]
Site x
GG16 2.064567
The essence is that the final list should have the above structure so that I can write my data to working directory using:
lapply(lst5,function(x)write.table(x,file=paste(getwd(),"summer",paste0(unique(x$Site),".csv"),
sep="/"),row.names=FALSE,quote=FALSE)) ### create a folder called "summer" and write files to directory###
Thanks,
AZ.
list(structure(list(Year = c(2005L, 2005L, 2005L), Month = c(8L,
8L, 8L), Day = 29:31, Site = structure(c(1L, 1L, 1L), .Label = "G116", class = "factor"),
Sim001 = c(8.4, 17.72, 6.03), Sim002 = c(0.27, 0, 0), Sim003 = c(2.83,
0.14, 0.1), Sim004 = c(0, 0, 0), Sim005 = c(0, 0.77, 0.28
), Sim006 = c(0, 0, 0), Sim007 = c(0, 0, 0), Sim008 = c(10.94,
4.77, 0), Sim009 = c(0, 0, 0), Sim010 = c(3.43, 2.74, 0.65
), Sim011 = c(0.36, 0, 2.75), Sim012 = c(26.91, 0, 2.16),
Sim013 = c(0.88, 1.33, 0.87), Sim014 = c(0, 0.86, 9.42),
Sim015 = c(0, 0.17, 1.15), Sim016 = c(0, 0, 0), Sim017 = c(0.13,
0, 0), Sim018 = c(0, 0, 6.72), Sim019 = c(8.45, 12.99, 23.72
), Sim020 = c(1.76, 0, 0), Sim021 = c(0, 0, 2.34), Sim022 = c(0,
0, 0), Sim023 = c(1.2, 0, 0.26), Sim024 = c(0.85, 0, 0),
Sim025 = c(0, 0, 0), Sim026 = c(2.05, 0.76, 5.03), Sim027 = c(0.78,
0, 0), Sim028 = c(1.2, 0, 0), Sim029 = c(22, 0.19, 0), Sim030 = c(0.12,
0, 0), Sim031 = c(3.1, 13.67, 0), Sim032 = c(0, 0, 17.88),
Sim033 = c(0, 0, 0), Sim034 = c(1.11, 0, 0), Sim035 = c(1.17,
1.41, 23.35), Sim036 = c(0, 0.48, 1.71), Sim037 = c(1.51,
11.1, 7.98), Sim038 = c(0, 0, 0), Sim039 = c(0, 0, 5.46),
Sim040 = c(5.21, 0, 0), Sim041 = c(0.1, 0.11, 0), Sim042 = c(0,
0.15, 5.23), Sim043 = c(0, 0, 0), Sim044 = c(0, 0.1, 0),
Sim045 = c(0, 0, 0), Sim046 = c(0, 0, 0), Sim047 = c(0, 0,
0.11), Sim048 = c(0, 0, 0), Sim049 = c(0, 0, 4.05), Sim050 = c(0,
0, 0), Sim051 = c(0, 0.12, 0), Sim052 = c(0.24, 2.58, 0),
Sim053 = c(3.63, 0, 0.17), Sim054 = c(10.94, 2.69, 0), Sim055 = c(0,
0, 0), Sim056 = c(0.24, 0.44, 8.27), Sim057 = c(0, 0, 0),
Sim058 = c(0, 0, 3.75), Sim059 = c(0.19, 11.06, 0), Sim060 = c(0,
0, 1.65), Sim061 = c(0, 4.95, 0), Sim062 = c(0.15, 0, 4.73
), Sim063 = c(2.99, 0.12, 1.28), Sim064 = c(0, 0, 0), Sim065 = c(0,
0, 0), Sim066 = c(0, 0, 0), Sim067 = c(0.11, 0.62, 0.56),
Sim068 = c(2.84, 0, 0), Sim069 = c(0, 0, 0), Sim070 = c(17.91,
0.11, 4.78), Sim071 = c(0, 0, 1.68), Sim072 = c(0, 0, 1.38
), Sim073 = c(1.68, 0, 0), Sim074 = c(0.53, 0, 2.87), Sim075 = c(0,
0, 0), Sim076 = c(2.58, 0.27, 0.11), Sim077 = c(0, 0, 0),
Sim078 = c(9.07, 3.13, 8.62), Sim079 = c(0.98, 0, 2.38),
Sim080 = c(3.4, 0, 0), Sim081 = c(0, 0, 4.57), Sim082 = c(1.87,
2.86, 0), Sim083 = c(21.76, 2.24, 0), Sim084 = c(0.45, 4.03,
0.39), Sim085 = c(0, 0, 0), Sim086 = c(0, 0, 0), Sim087 = c(0,
0, 17.12), Sim088 = c(5.05, 0, 0), Sim089 = c(0, 0, 1.4),
Sim090 = c(0.1, 0, 0), Sim091 = c(1.96, 0, 1.38), Sim092 = c(0,
0, 0), Sim093 = c(0, 0, 0), Sim094 = c(0, 0, 1.81), Sim095 = c(2.72,
7.16, 1.7), Sim096 = c(6.37, 0, 0), Sim097 = c(0, 1.12, 25.7
), Sim098 = c(0, 0, 0), Sim099 = c(0, 0, 0), Sim100 = c(6.77,
10.87, 2.6)), .Names = c("Year", "Month", "Day", "Site",
"Sim001", "Sim002", "Sim003", "Sim004", "Sim005", "Sim006", "Sim007",
"Sim008", "Sim009", "Sim010", "Sim011", "Sim012", "Sim013", "Sim014",
"Sim015", "Sim016", "Sim017", "Sim018", "Sim019", "Sim020", "Sim021",
"Sim022", "Sim023", "Sim024", "Sim025", "Sim026", "Sim027", "Sim028",
"Sim029", "Sim030", "Sim031", "Sim032", "Sim033", "Sim034", "Sim035",
"Sim036", "Sim037", "Sim038", "Sim039", "Sim040", "Sim041", "Sim042",
"Sim043", "Sim044", "Sim045", "Sim046", "Sim047", "Sim048", "Sim049",
"Sim050", "Sim051", "Sim052", "Sim053", "Sim054", "Sim055", "Sim056",
"Sim057", "Sim058", "Sim059", "Sim060", "Sim061", "Sim062", "Sim063",
"Sim064", "Sim065", "Sim066", "Sim067", "Sim068", "Sim069", "Sim070",
"Sim071", "Sim072", "Sim073", "Sim074", "Sim075", "Sim076", "Sim077",
"Sim078", "Sim079", "Sim080", "Sim081", "Sim082", "Sim083", "Sim084",
"Sim085", "Sim086", "Sim087", "Sim088", "Sim089", "Sim090", "Sim091",
"Sim092", "Sim093", "Sim094", "Sim095", "Sim096", "Sim097", "Sim098",
"Sim099", "Sim100"), row.names = 15947:15949, class = "data.frame"),
structure(list(Year = c(2005L, 2005L, 2005L), Month = c(8L,
8L, 8L), Day = 29:31, Site = structure(c(1L, 1L, 1L), .Label = "GG16", class = "factor"),
Sim001 = c(18.36, 0.33, 0.14), Sim002 = c(0, 10.92, 0
), Sim003 = c(0, 0, 0), Sim004 = c(0, 0, 1.7), Sim005 = c(0,
0, 0), Sim006 = c(0.91, 4.24, 0), Sim007 = c(0, 0, 0.22
), Sim008 = c(0.63, 2.9, 2.24), Sim009 = c(0, 0, 0),
Sim010 = c(0, 0, 6.91), Sim011 = c(0, 3.28, 10.18), Sim012 = c(8.39,
14.58, 45.62), Sim013 = c(2.87, 0.53, 0.11), Sim014 = c(9.15,
21.1, 0.66), Sim015 = c(0, 1.75, 2.2), Sim016 = c(0,
7.86, 0), Sim017 = c(0, 0, 0), Sim018 = c(0, 0, 0), Sim019 = c(0,
0, 0), Sim020 = c(0.39, 0, 0), Sim021 = c(0.13, 0, 1.05
), Sim022 = c(0, 0, 10.91), Sim023 = c(0.23, 0, 0), Sim024 = c(0.12,
0.83, 5.35), Sim025 = c(0, 0, 0), Sim026 = c(7.75, 0,
4.82), Sim027 = c(20.04, 0, 0), Sim028 = c(12.41, 0,
5.3), Sim029 = c(0, 0, 0), Sim030 = c(0, 0, 0), Sim031 = c(0,
8.06, 0), Sim032 = c(0, 0, 0), Sim033 = c(0, 0, 0), Sim034 = c(0.1,
0, 3.34), Sim035 = c(0, 4.34, 3.53), Sim036 = c(2.89,
0.27, 0), Sim037 = c(0, 0, 0), Sim038 = c(0, 0, 0), Sim039 = c(0,
0.11, 0), Sim040 = c(9.83, 1.55, 9.09), Sim041 = c(3.6,
0, 0), Sim042 = c(0, 0, 1.37), Sim043 = c(0, 0, 0), Sim044 = c(0,
0, 0), Sim045 = c(0, 0, 0), Sim046 = c(0, 0, 0), Sim047 = c(0,
20.52, 0.65), Sim048 = c(1.77, 0.67, 0), Sim049 = c(0,
0, 0), Sim050 = c(0, 0, 0), Sim051 = c(0, 4.9, 0), Sim052 = c(0.71,
11.34, 0), Sim053 = c(3.46, 2.59, 1.5), Sim054 = c(0,
23.63, 0), Sim055 = c(0, 16.48, 4.99), Sim056 = c(0,
0, 0), Sim057 = c(0, 0, 0), Sim058 = c(0, 0, 0), Sim059 = c(0,
0, 0), Sim060 = c(16.87, 0, 0), Sim061 = c(0, 3.43, 0
), Sim062 = c(0.45, 0, 0), Sim063 = c(0, 11.14, 7.22),
Sim064 = c(0, 0, 0), Sim065 = c(0, 0, 0), Sim066 = c(0,
16.08, 1.87), Sim067 = c(0, 0, 0), Sim068 = c(5.16, 0.88,
0.1), Sim069 = c(0, 0, 3.91), Sim070 = c(0, 0, 0), Sim071 = c(0.17,
0, 5.22), Sim072 = c(0, 0, 6.95), Sim073 = c(0, 0, 0),
Sim074 = c(0.14, 0, 0), Sim075 = c(0, 0, 0), Sim076 = c(0,
9.62, 0), Sim077 = c(0, 0, 0), Sim078 = c(1.65, 0, 0),
Sim079 = c(0.23, 8.41, 0.28), Sim080 = c(0.78, 0, 0),
Sim081 = c(0, 0, 0), Sim082 = c(0.11, 2.75, 0), Sim083 = c(0.26,
7.34, 5.92), Sim084 = c(0, 0, 4.27), Sim085 = c(0, 0,
0), Sim086 = c(0, 0, 0.1), Sim087 = c(27.18, 0.72, 28.29
), Sim088 = c(0, 0, 4.2), Sim089 = c(0, 9.37, 6.59),
Sim090 = c(0.21, 2.57, 0), Sim091 = c(0.45, 0, 0), Sim092 = c(0,
4.97, 0), Sim093 = c(1.43, 0, 0), Sim094 = c(0, 0, 2.15
), Sim095 = c(6, 0, 1.63), Sim096 = c(7.21, 0, 0), Sim097 = c(0,
0.39, 1.92), Sim098 = c(0, 0, 0), Sim099 = c(4.38, 0,
0), Sim100 = c(0, 0, 0)), .Names = c("Year", "Month",
"Day", "Site", "Sim001", "Sim002", "Sim003", "Sim004", "Sim005",
"Sim006", "Sim007", "Sim008", "Sim009", "Sim010", "Sim011",
"Sim012", "Sim013", "Sim014", "Sim015", "Sim016", "Sim017",
"Sim018", "Sim019", "Sim020", "Sim021", "Sim022", "Sim023",
"Sim024", "Sim025", "Sim026", "Sim027", "Sim028", "Sim029",
"Sim030", "Sim031", "Sim032", "Sim033", "Sim034", "Sim035",
"Sim036", "Sim037", "Sim038", "Sim039", "Sim040", "Sim041",
"Sim042", "Sim043", "Sim044", "Sim045", "Sim046", "Sim047",
"Sim048", "Sim049", "Sim050", "Sim051", "Sim052", "Sim053",
"Sim054", "Sim055", "Sim056", "Sim057", "Sim058", "Sim059",
"Sim060", "Sim061", "Sim062", "Sim063", "Sim064", "Sim065",
"Sim066", "Sim067", "Sim068", "Sim069", "Sim070", "Sim071",
"Sim072", "Sim073", "Sim074", "Sim075", "Sim076", "Sim077",
"Sim078", "Sim079", "Sim080", "Sim081", "Sim082", "Sim083",
"Sim084", "Sim085", "Sim086", "Sim087", "Sim088", "Sim089",
"Sim090", "Sim091", "Sim092", "Sim093", "Sim094", "Sim095",
"Sim096", "Sim097", "Sim098", "Sim099", "Sim100"), row.names = 15947:15949, class = "data.frame"))
You can go from lst3 directly to lst5 without the intermediate aggregate step:
lapply(lst3, function(df){
data.frame(Site = df$Site[1], x = mean(unlist(df[-c(1:4)])))
})
#[[1]]
# Site x
#1 G116 1.864233
#
#[[2]]
# Site x
#1 GG16 2.064567
Since you're calculating the mean of all columns except the first 4 columns and over all the rows of the other columns, it's quite easy to unlist the data, creating a single vector, and then using standard mean on it. Also, by skipping the lst4 step, this most likely be noticeably faster.
Or, as commented by Richard, a variation could be:
lapply(lst3, function(df){
data.frame(Site = df$Site[1], x = mean(colMeans(df[-c(1:4)])))
})
Benchmark:
library(microbenchmark)
microbenchmark(
f1 = {lapply(lst3, function(df){
data.frame(Site = df$Site[1], x = mean(unlist(df[-c(1:4)])))
})},
f2 = {lapply(lst3, function(df){
data.frame(Site = df$Site[1], x = mean(colMeans(df[-c(1:4)])))
})},
unit = "relative"
)
Unit: relative
expr min lq median uq max neval
f1 1.00000 1.000000 1.000000 1.000000 1.000000 100
f2 2.91545 2.937272 2.927799 2.894704 3.486007 100
Here's another option for your consideration:
library(reshape2)
x <- melt(lst3)
aggregate(value ~ Site, x[grepl("^Sim.*", x$variable),], FUN = mean)
# Site value
#1 G116 1.864233
#2 GG16 2.064567
Or the same concept but using dplyr:
library(dplyr)
filter(x, grepl("^Sim.*", variable)) %>% group_by(Site) %>% summarise(x = mean(value))
#Source: local data frame [2 x 2]
#
# Site x
#1 G116 1.864233
#2 GG16 2.064567
Of course, this could also be done using data.table, for example like this (there are probably several even slightly more efficient ways to do this in data.table):
library(data.table)
setDT(x)[grepl("^Sim.*", variable), list(x = mean(value)), by = Site]
# Site x
#1: G116 1.864233
#2: GG16 2.064567

Resources