I have a data frame of gene expression scores (cells x genes). The cluster that each cell belongs to is stored as a column.
I want to calculate the mean expression value per cluster for a group of genes (columns), but I only want to include values > 0 in these calculations.
My attempt at this is as follows:
test <- gene_scores_df2 %>%
  select(all_of(gene_list), Clusters) %>%
  group_by(Clusters) %>%
  summarize(across(c(1:13), ~ mean(. > 0)))
This produces the following tibble:
# A tibble: 16 x 14
Clusters SLC17A7 GAD1 GAD2 SLC32A1 GLI3 TNC PROX1 SCGN LHX6 NXPH1 MEIS2 ZFHX3 C3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1 0.611 0.605 0.817 0.850 0.979 0.590 0.725 0.434 0.275 0.728 0.949 0.886 0.332
2 C10 0.484 0.401 0.434 0.401 0.791 0.387 0.431 0.362 0.204 0.652 0.715 0.580 0.186
3 C11 0.495 0.5 0.538 0.412 0.847 0.437 0.516 0.453 0.187 0.764 0.804 0.640 0.160
4 C12 0.807 0.626 0.559 0.703 0.942 0.448 0.644 0.366 0.403 0.702 0.917 0.859 0.228
5 C13 0.489 0.578 0.709 0.719 0.796 0.409 0.565 0.371 0.367 0.773 0.716 0.776 0.169
6 C14 0.541 0.347 0.330 0.388 0.731 0.281 0.438 0.279 0.198 0.577 0.777 0.633 0.128
7 C15 0.152 0.306 0.337 0.198 0.629 0.304 0.331 0.179 0.132 0.496 0.509 0.405 0.0556
8 C16 0.402 0.422 0.542 0.418 0.813 0.514 0.614 0.287 0.267 0.729 0.574 0.737 0.279
9 C2 0.152 0.480 0.458 0.297 0.883 0.423 0.511 0.195 0.152 0.722 0.692 0.598 0.0632
10 C3 0.585 0.679 0.659 0.711 0.996 0.886 0.801 0.297 0.305 0.789 0.992 0.963 0.346
11 C4 0.567 0.756 0.893 0.940 0.892 0.334 0.797 0.750 0.376 0.686 0.897 0.885 0.240
12 C5 0.220 0.516 0.560 0.625 0.673 0.250 0.466 0.275 0.358 0.590 0.571 0.641 0.112
13 C6 0.558 0.908 0.836 0.973 0.725 0.280 0.830 0.642 0.871 0.927 0.830 0.916 0.202
14 C7 0.380 0.743 0.749 0.772 0.825 0.415 0.480 0.211 0.199 0.614 0.860 0.901 0.135
15 C8 0.616 0.348 0.312 0.334 0.749 0.271 0.451 0.520 0.129 0.542 0.743 0.735 0.147
16 C9 0.406 0.381 0.400 0.265 0.679 0.266 0.465 0.233 0.0820 0.648 0.565 0.557 0.119
However, when I check this against (what I assume is) a similar procedure on a single column I get different mean values.
Here is the code for SLC17A7:
gene_scores_df2 %>%
  select(SLC17A7, Clusters) %>%
  group_by(Clusters) %>%
  filter(SLC17A7 > 0) %>%
  summarize(SLC17A7 = mean(SLC17A7))
And the result:
# A tibble: 16 x 2
Clusters SLC17A7
<chr> <dbl>
1 C1 0.780
2 C10 1.42
3 C11 1.21
4 C12 1.64
5 C13 1.09
6 C14 1.83
7 C15 1.61
8 C16 0.968
9 C2 1.09
10 C3 0.512
11 C4 0.920
12 C5 1.53
13 C6 0.814
14 C7 1.22
15 C8 2.24
16 C9 1.72
I'm unsure what exactly is wrong with the first attempt above.
Any suggestions would be greatly appreciated.
Here is a dput snippet of the original data frame, i.e. the first 20 rows of:
gene_scores_df2 %>%
  select(all_of(gene_list), Clusters) %>%
  group_by(Clusters)
structure(list(SLC17A7 = c(0.273, 0.722, 0.699, 0.71, 0.333,
0.674, 0.63, 0.481, 0.274, 0.981, 0.586, 0.401, 0.325, 0.583,
0, 0.348, 0.287, 0, 0.295, 0.351), GAD1 = c(0.355, 0.392, 0.455,
0.34, 0.108, 1.169, 0, 0.426, 2.219, 0.099, 1.16, 0.332, 0.404,
0.284, 0, 5.297, 0.518, 0.027, 1.19, 0.346), GAD2 = c(0.12, 0.562,
0.337, 0.49, 0.095, 0.958, 0.09, 1.518, 1.464, 0.175, 0.419,
0.536, 0.501, 1.103, 0.343, 0, 0.247, 0, 0.635, 0.906), SLC32A1 = c(0,
0.97, 0.067, 0.999, 0.224, 1.04, 0, 2.569, 1.544, 0.059, 2.177,
3.227, 3.603, 1.229, 0.102, 2.421, 0.055, 0.826, 2.646, 0.228
), GLI3 = c(1.527, 0.487, 0.341, 3.352, 0.346, 0.694, 1.395,
0.767, 1.334, 1.373, 1.7, 2.216, 0.394, 1.029, 1.235, 0.55, 2.043,
4.469, 2.901, 4.139), TNC = c(0, 0, 0.448, 0.03, 1.377, 0.045,
0, 0.169, 0.123, 0, 0.188, 0.075, 0, 1.074, 0, 1.272, 0.124,
0.505, 0.173, 0.889), PROX1 = c(0, 0.075, 0.167, 0.782, 0.802,
0.561, 0.098, 0.734, 0.448, 1.645, 0.735, 0.795, 0.102, 0.317,
0.124, 0.324, 0.352, 0.236, 0.826, 0.308), SCGN = c(0.696, 0.234,
0, 0.202, 0.059, 0.162, 0, 0.653, 0.383, 0.42, 0.094, 0.779,
0.228, 0.248, 0.171, 0.089, 0.081, 0.026, 0.159, 0), LHX6 = c(0,
0, 0.134, 0.1, 0.829, 1.489, 0, 0.38, 0.526, 0.117, 0, 0.205,
0.299, 2.235, 0, 1.335, 0, 0.115, 0.454, 0.108), NXPH1 = c(0.792,
0.143, 0.175, 0.658, 0, 1.034, 1.798, 0.219, 0.896, 0.249, 1.336,
1.507, 0.26, 0.242, 1.235, 2.16, 0.235, 0.349, 1.297, 2.234),
MEIS2 = c(4.337, 0.559, 0.978, 1.972, 0.964, 0.657, 0.162,
0.827, 0.882, 0.157, 1.494, 1.171, 2.524, 2.458, 0.205, 0.448,
2.027, 4.767, 1.514, 2.077), ZFHX3 = c(1.48, 1.38, 2.323,
1.039, 1.343, 1.354, 0.238, 1.224, 1.676, 0.811, 0.316, 2.012,
2.298, 1.869, 0.201, 0.176, 1.829, 1.081, 0.522, 0.959),
C3 = c(0.52, 0.527, 0, 0.073, 0, 0.15, 0.094, 0.315, 0.174,
0, 0, 0.17, 0.165, 0, 0.237, 0, 0.091, 0.095, 0, 0.081),
Clusters = c("C12", "C5", "C13", "C4", "C12", "C13", "C13",
"C4", "C6", "C8", "C4", "C4", "C4", "C12", "C5", "C6", "C1",
"C3", "C4", "C3")), row.names = c(NA, -20L), groups = structure(list(
Clusters = c("C1", "C12", "C13", "C3", "C4", "C5", "C6",
"C8"), .rows = structure(list(17L, c(1L, 5L, 14L), c(3L,
6L, 7L), c(18L, 20L), c(4L, 8L, 11L, 12L, 13L, 19L), c(2L,
15L), c(9L, 16L), 10L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
What you want is:
library(tidyverse)
df %>%
  group_by(Clusters) %>%
  summarize(across(everything(), ~ mean(.[. > 0])))
~ mean(. > 0) checks whether each element is greater than 0 and thus returns TRUE/FALSE, so you get the mean of the underlying 0/1s, i.e. the proportion of positive values, not their mean. Instead you want to subset each column to its positive values first, which can be achieved with the usual [] approach: ~ mean(.[. > 0]).
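To make the difference concrete, here is a minimal sketch on a made-up vector (the numbers are purely illustrative):
x <- c(0, 0, 2, 4)
mean(x > 0)     # proportion of values above 0 -> 0.5
mean(x[x > 0])  # mean of only the positive values -> 3
That is exactly the mismatch you observed: your first tibble holds proportions (everything between 0 and 1), while the single-column check holds the conditional means.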
I have this rather complicated data in a data frame called ulyDataLefs60_12:
Year Day Hour Min Sec. E1.S1 E1.S2 E1.S3 E1.S4 E1.S5 E1.S6 E1.S7 E1.S8 E2.S1 E2.S2 E2.S3 E2.S4
1 2000 122 0 1 38.01 3.31 0.662 0.662 2.65 1.32 0.000 3.310 1.32 1.980 1.980 0.662 0.000
2 2000 122 0 1 50.10 1.98 3.310 1.980 1.98 1.98 1.320 4.630 1.32 1.320 0.662 0.000 3.310
3 2000 122 0 2 2.19 1.98 1.320 3.970 1.98 1.32 0.662 0.662 3.97 1.320 0.662 1.320 0.662
4 2000 122 0 2 14.28 2.65 1.320 2.650 3.31 2.65 1.320 3.970 2.65 2.650 0.000 0.662 2.650
5 2000 122 0 2 26.38 3.97 6.620 0.662 3.31 3.31 4.630 5.290 1.98 0.000 0.000 1.980 0.662
6 2000 122 0 2 38.47 2.65 0.662 3.310 1.98 1.32 1.980 1.980 2.65 0.662 1.320 1.980 1.320
E2.S5 E2.S6 E2.S7 E2.S8 E3.S1 E3.S2 E3.S3 E3.S4 E3.S5 E3.S6 E3.S7 E3.S8 E4.S1 E4.S2 E4.S3 E4.S4
1 1.320 1.32 2.65 2.650 0.662 0.000 1.320 2.650 1.320 0.000 1.320 1.320 0.000 0.000 0.662 0.662
2 0.000 0.00 1.98 0.662 0.000 0.662 0.000 0.662 1.980 1.980 0.662 1.320 0.000 0.000 0.000 0.662
3 0.662 1.98 2.65 1.980 0.000 0.662 0.662 1.320 0.662 0.000 1.320 3.310 0.662 0.000 1.980 0.662
4 0.662 1.32 1.32 0.662 0.000 0.662 0.662 0.662 0.662 0.662 0.662 0.000 0.000 0.662 0.000 0.000
5 0.000 1.32 1.32 0.662 0.662 0.000 0.000 0.662 0.000 0.662 1.320 0.662 0.000 0.000 0.000 0.662
6 1.320 1.32 1.32 0.000 1.320 0.000 0.000 0.662 1.320 0.000 0.662 0.662 0.662 1.320 0.000 0.000
E4.S5 E4.S6 E4.S7 E4.S8 FP5.S1 FP5.S2 FP5.S3 FP5.S4 FP5.S5 FP5.S6 FP5.S7 FP5.S8 FP6.S1 FP6.S2
1 0.000 0.662 0.662 0.000 0.331 0 0.662 0.000 0.662 0 0.000 0.331 0 0.331
2 0.000 0.000 0.662 0.662 0.331 0 0.662 0.000 0.662 0 0.000 0.331 0 0.331
3 0.662 0.000 0.662 1.320 0.000 0 0.662 0.000 0.331 0 0.000 0.000 0 0.000
4 0.662 0.662 0.000 0.662 0.000 0 0.662 0.000 0.331 0 0.000 0.000 0 0.000
5 0.000 0.000 0.662 0.000 0.331 0 0.000 0.331 0.331 0 0.331 0.000 0 0.000
6 0.000 0.000 0.662 0.662 0.331 0 0.000 0.331 0.331 0 0.331 0.000 0 0.000
FP6.S3 FP6.S4 FP6.S5 FP6.S6 FP6.S7 FP6.S8 FP7.S1 FP7.S2 FP7.S3 FP7.S4 FP7.S5 FP7.S6 FP7.S7 FP7.S8
1 0.331 0.000 0.000 0.000 0 0.000 0 0.331 0.331 0.662 0 0.000 0.331 0.000
2 0.331 0.000 0.000 0.000 0 0.000 0 0.331 0.331 0.662 0 0.000 0.331 0.000
3 0.662 0.000 0.662 0.000 0 0.331 0 0.000 0.000 0.331 0 0.000 0.000 0.000
4 0.662 0.000 0.662 0.000 0 0.331 0 0.000 0.000 0.331 0 0.000 0.000 0.000
5 0.000 0.662 0.000 0.992 0 0.000 0 0.000 0.000 0.000 0 0.331 0.000 0.331
6 0.000 0.662 0.000 0.992 0 0.000 0 0.000 0.000 0.000 0 0.331 0.000 0.331
PA.LEFS60S1 PA.LEFS60S2 PA.LEFS60S3 PA.LEFS60S4 PA.LEFS60S5 PA.LEFS60S6 PA.LEFS60S7 PA.LEFS60S8
1 64.2 52.0 70.9 105.0 144 170 134 96.2
2 62.6 49.5 68.8 104.0 142 168 134 95.4
3 62.7 47.7 66.2 101.0 140 167 135 96.5
4 62.4 46.3 64.4 99.3 138 166 135 96.7
5 59.9 43.7 63.2 98.8 138 164 133 94.8
6 62.3 45.7 63.7 98.7 137 166 136 96.9
BX BY BZ Bmag....nT. X datetime
1 2.64 4.98 2.25 6.07 NA 2000-05-01 00:01:38
2 2.67 5.16 2.03 6.15 NA 2000-05-01 00:01:50
3 2.52 5.35 1.88 6.21 NA 2000-05-01 00:02:02
4 2.43 5.45 1.74 6.22 NA 2000-05-01 00:02:14
5 2.53 5.46 1.46 6.19 NA 2000-05-01 00:02:26
6 2.29 5.26 1.61 5.96 NA 2000-05-01 00:02:38
dput(head(ulyDataLefs60_12))
structure(list(Year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2000L
), Day = c(122L, 122L, 122L, 122L, 122L, 122L), Hour = c(0L,
0L, 0L, 0L, 0L, 0L), Min = c(1L, 1L, 2L, 2L, 2L, 2L), Sec. = c(38.01,
50.1, 2.19, 14.28, 26.38, 38.47), E1.S1 = c(3.31, 1.98, 1.98,
2.65, 3.97, 2.65), E1.S2 = c(0.662, 3.31, 1.32, 1.32, 6.62, 0.662
), E1.S3 = c(0.662, 1.98, 3.97, 2.65, 0.662, 3.31), E1.S4 = c(2.65,
1.98, 1.98, 3.31, 3.31, 1.98), E1.S5 = c(1.32, 1.98, 1.32, 2.65,
3.31, 1.32), E1.S6 = c(0, 1.32, 0.662, 1.32, 4.63, 1.98), E1.S7 = c(3.31,
4.63, 0.662, 3.97, 5.29, 1.98), E1.S8 = c(1.32, 1.32, 3.97, 2.65,
1.98, 2.65), E2.S1 = c(1.98, 1.32, 1.32, 2.65, 0, 0.662), E2.S2 = c(1.98,
0.662, 0.662, 0, 0, 1.32), E2.S3 = c(0.662, 0, 1.32, 0.662, 1.98,
1.98), E2.S4 = c(0, 3.31, 0.662, 2.65, 0.662, 1.32), E2.S5 = c(1.32,
0, 0.662, 0.662, 0, 1.32), E2.S6 = c(1.32, 0, 1.98, 1.32, 1.32,
1.32), E2.S7 = c(2.65, 1.98, 2.65, 1.32, 1.32, 1.32), E2.S8 = c(2.65,
0.662, 1.98, 0.662, 0.662, 0), E3.S1 = c(0.662, 0, 0, 0, 0.662,
1.32), E3.S2 = c(0, 0.662, 0.662, 0.662, 0, 0), E3.S3 = c(1.32,
0, 0.662, 0.662, 0, 0), E3.S4 = c(2.65, 0.662, 1.32, 0.662, 0.662,
0.662), E3.S5 = c(1.32, 1.98, 0.662, 0.662, 0, 1.32), E3.S6 = c(0,
1.98, 0, 0.662, 0.662, 0), E3.S7 = c(1.32, 0.662, 1.32, 0.662,
1.32, 0.662), E3.S8 = c(1.32, 1.32, 3.31, 0, 0.662, 0.662), E4.S1 = c(0,
0, 0.662, 0, 0, 0.662), E4.S2 = c(0, 0, 0, 0.662, 0, 1.32), E4.S3 = c(0.662,
0, 1.98, 0, 0, 0), E4.S4 = c(0.662, 0.662, 0.662, 0, 0.662, 0
), E4.S5 = c(0, 0, 0.662, 0.662, 0, 0), E4.S6 = c(0.662, 0, 0,
0.662, 0, 0), E4.S7 = c(0.662, 0.662, 0.662, 0, 0.662, 0.662),
E4.S8 = c(0, 0.662, 1.32, 0.662, 0, 0.662), FP5.S1 = c(0.331,
0.331, 0, 0, 0.331, 0.331), FP5.S2 = c(0, 0, 0, 0, 0, 0),
FP5.S3 = c(0.662, 0.662, 0.662, 0.662, 0, 0), FP5.S4 = c(0,
0, 0, 0, 0.331, 0.331), FP5.S5 = c(0.662, 0.662, 0.331, 0.331,
0.331, 0.331), FP5.S6 = c(0, 0, 0, 0, 0, 0), FP5.S7 = c(0,
0, 0, 0, 0.331, 0.331), FP5.S8 = c(0.331, 0.331, 0, 0, 0,
0), FP6.S1 = c(0, 0, 0, 0, 0, 0), FP6.S2 = c(0.331, 0.331,
0, 0, 0, 0), FP6.S3 = c(0.331, 0.331, 0.662, 0.662, 0, 0),
FP6.S4 = c(0, 0, 0, 0, 0.662, 0.662), FP6.S5 = c(0, 0, 0.662,
0.662, 0, 0), FP6.S6 = c(0, 0, 0, 0, 0.992, 0.992), FP6.S7 = c(0,
0, 0, 0, 0, 0), FP6.S8 = c(0, 0, 0.331, 0.331, 0, 0), FP7.S1 = c(0,
0, 0, 0, 0, 0), FP7.S2 = c(0.331, 0.331, 0, 0, 0, 0), FP7.S3 = c(0.331,
0.331, 0, 0, 0, 0), FP7.S4 = c(0.662, 0.662, 0.331, 0.331,
0, 0), FP7.S5 = c(0, 0, 0, 0, 0, 0), FP7.S6 = c(0, 0, 0,
0, 0.331, 0.331), FP7.S7 = c(0.331, 0.331, 0, 0, 0, 0), FP7.S8 = c(0,
0, 0, 0, 0.331, 0.331), PA.LEFS60S1 = c(64.2, 62.6, 62.7,
62.4, 59.9, 62.3), PA.LEFS60S2 = c(52, 49.5, 47.7, 46.3,
43.7, 45.7), PA.LEFS60S3 = c(70.9, 68.8, 66.2, 64.4, 63.2,
63.7), PA.LEFS60S4 = c(105, 104, 101, 99.3, 98.8, 98.7),
PA.LEFS60S5 = c(144, 142, 140, 138, 138, 137), PA.LEFS60S6 = c(170,
168, 167, 166, 164, 166), PA.LEFS60S7 = c(134, 134, 135,
135, 133, 136), PA.LEFS60S8 = c(96.2, 95.4, 96.5, 96.7, 94.8,
96.9), BX = c(2.64, 2.67, 2.52, 2.43, 2.53, 2.29), BY = c(4.98,
5.16, 5.35, 5.45, 5.46, 5.26), BZ = c(2.25, 2.03, 1.88, 1.74,
1.46, 1.61), Bmag....nT. = c(6.07, 6.15, 6.21, 6.22, 6.19,
5.96), X = c(NA, NA, NA, NA, NA, NA), datetime = structure(list(
sec = c(38, 50, 2, 14, 26, 38), min = c(1L, 1L, 2L, 2L,
2L, 2L), hour = c(0L, 0L, 0L, 0L, 0L, 0L), mday = c(1L,
1L, 1L, 1L, 1L, 1L), mon = c(4L, 4L, 4L, 4L, 4L, 4L),
year = c(100L, 100L, 100L, 100L, 100L, 100L), wday = c(1L,
1L, 1L, 1L, 1L, 1L), yday = c(121L, 121L, 121L, 121L,
121L, 121L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"))), .Names = c("Year", "Day",
"Hour", "Min", "Sec.", "E1.S1", "E1.S2", "E1.S3", "E1.S4", "E1.S5",
"E1.S6", "E1.S7", "E1.S8", "E2.S1", "E2.S2", "E2.S3", "E2.S4",
"E2.S5", "E2.S6", "E2.S7", "E2.S8", "E3.S1", "E3.S2", "E3.S3",
"E3.S4", "E3.S5", "E3.S6", "E3.S7", "E3.S8", "E4.S1", "E4.S2",
"E4.S3", "E4.S4", "E4.S5", "E4.S6", "E4.S7", "E4.S8", "FP5.S1",
"FP5.S2", "FP5.S3", "FP5.S4", "FP5.S5", "FP5.S6", "FP5.S7", "FP5.S8",
"FP6.S1", "FP6.S2", "FP6.S3", "FP6.S4", "FP6.S5", "FP6.S6", "FP6.S7",
"FP6.S8", "FP7.S1", "FP7.S2", "FP7.S3", "FP7.S4", "FP7.S5", "FP7.S6",
"FP7.S7", "FP7.S8", "PA.LEFS60S1", "PA.LEFS60S2", "PA.LEFS60S3",
"PA.LEFS60S4", "PA.LEFS60S5", "PA.LEFS60S6", "PA.LEFS60S7", "PA.LEFS60S8",
"BX", "BY", "BZ", "Bmag....nT.", "X", "datetime"), row.names = c(NA,
6L), class = "data.frame")
What I want is to get the average and median of a certain number of rows. Let's say I want a new data frame that, instead of all these values, has the average or median of every 5 rows in all columns (or at least in all columns starting at the E1.S1 column).
I started by looking at the example Calculate means of rows, and it got me as far as computing the average of every N rows for a single column of my data frame.
ulyDataLefs60_12_avg = colSums(matrix(ulyDataLefs60_12$E1.S1, nrow=5))
The problem is that the R function I wanted to use, colSums, doesn't work with some fields, namely the datetime field (for obvious reasons), so I can't apply it across all the columns and get a nice averaged dataframe.
ulyDataLefs60_12_avg = colSums(matrix(ulyDataLefs60_12, nrow=5))
Error in colSums(matrix(ulyDataLefs60_12, nrow = 5)) :
'x' must be numeric
I'm happy to keep the datetime of the first row of each 5-row block I average (it would be even better to take the datetime at the centre of the 5-row interval, though), but so far I haven't come up with an answer that does both things at the same time.
Perhaps it's a really easy thing to do, but it has me stumped.
For this data:
> dput(df)
df <- structure(list(Year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2000L
), Day = c(122L, 122L, 122L, 122L, 122L, 122L), Hour = c(0L,
0L, 0L, 0L, 0L, 0L), Min = c(1L, 1L, 2L, 2L, 2L, 2L), Sec. = c(38.01,
50.1, 2.19, 14.28, 26.38, 38.47), E1.S1 = c(3.31, 1.98, 1.98,
2.65, 3.97, 2.65), E1.S2 = c(0.662, 3.31, 1.32, 1.32, 6.62, 0.662
), E1.S3 = c(0.662, 1.98, 3.97, 2.65, 0.662, 3.31), E1.S4 = c(2.65,
1.98, 1.98, 3.31, 3.31, 1.98), E1.S5 = c(1.32, 1.98, 1.32, 2.65,
3.31, 1.32), E1.S6 = c(0, 1.32, 0.662, 1.32, 4.63, 1.98), E1.S7 = c(3.31,
4.63, 0.662, 3.97, 5.29, 1.98), E1.S8 = c(1.32, 1.32, 3.97, 2.65,
1.98, 2.65), E2.S1 = c(1.98, 1.32, 1.32, 2.65, 0, 0.662), E2.S2 = c(1.98,
0.662, 0.662, 0, 0, 1.32), E2.S3 = c(0.662, 0, 1.32, 0.662, 1.98,
1.98), E2.S4 = c(0, 3.31, 0.662, 2.65, 0.662, 1.32)), .Names = c("Year",
"Day", "Hour", "Min", "Sec.", "E1.S1", "E1.S2", "E1.S3", "E1.S4",
"E1.S5", "E1.S6", "E1.S7", "E1.S8", "E2.S1", "E2.S2", "E2.S3",
"E2.S4"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))
This works:
lapply(split(df, ceiling(seq_len(nrow(df)) / 5)), colMeans)
# $`1`
# Year Day Hour Min Sec. E1.S1 E1.S2 E1.S3 E1.S4 E1.S5 E1.S6 E1.S7
# 2000.0000 122.0000 0.0000 1.6000 26.1920 2.7780 2.6464 1.9848 2.6460 2.1160 1.5864 3.5724
# E1.S8 E2.S1 E2.S2 E2.S3 E2.S4
# 2.2480 1.4540 0.6608 0.9248 1.4568
#
# $`2`
# Year Day Hour Min Sec. E1.S1 E1.S2 E1.S3 E1.S4 E1.S5 E1.S6 E1.S7 E1.S8
# 2000.000 122.000 0.000 2.000 38.470 2.650 0.662 3.310 1.980 1.320 1.980 1.980 2.650
# E2.S1 E2.S2 E2.S3 E2.S4
# 0.662 1.320 1.980 1.320
#
Then you can just bind them:
do.call(rbind, lapply(split(df, ceiling(seq_len(nrow(df)) / 5)), colMeans))
# Year Day Hour Min Sec. E1.S1 E1.S2 E1.S3 E1.S4 E1.S5 E1.S6 E1.S7 E1.S8 E2.S1 E2.S2 E2.S3 E2.S4
# 1 2000 122 0 1.6 26.192 2.778 2.6464 1.9848 2.646 2.116 1.5864 3.5724 2.248 1.454 0.6608 0.9248 1.4568
# 2 2000 122 0 2.0 38.470 2.650 0.6620 3.3100 1.980 1.320 1.9800 1.9800 2.650 0.662 1.3200 1.9800 1.3200
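For reference, the grouping index ceiling(seq_len(nrow(df)) / 5) simply labels rows 1-5 as group 1, rows 6-10 as group 2, and so on; on the 6-row example it evaluates to:
ceiling(seq_len(6) / 5)
# [1] 1 1 1 1 1 2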
Note: it also helps to check that all the columns you want to take the mean of are either integer or numeric:
> sapply(df, class)
# Year Day Hour Min Sec. E1.S1 E1.S2 E1.S3 E1.S4 E1.S5 E1.S6 E1.S7
# "integer" "integer" "integer" "integer" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
# E1.S8 E2.S1 E2.S2 E2.S3 E2.S4
# "numeric" "numeric" "numeric" "numeric" "numeric"
Edit: Following OP's comment:
idx <- ceiling(seq_len(nrow(dd)) / 5)
# do colMeans on all columns except last one.
res <- lapply(split(dd[-(ncol(dd))], idx), colMeans, na.rm = TRUE)
# assign first value of "datetime" in each 5-er group as names to list
names(res) <- dd$datetime[seq(1, nrow(dd), by = 5)]
# bind them to give a matrix
res <- do.call(rbind, res)
Alternatively, if you want a data.frame and datetime as a column:
idx <- ceiling(seq_len(nrow(dd)) / 5)
res <- as.data.frame(do.call(rbind, lapply(split(dd[-(ncol(dd))], idx),
colMeans, na.rm = TRUE)))
res$datetime <- dd$datetime[seq(1, nrow(dd), by=5)]
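Since you also asked for the median, the same split/apply pattern works with a per-column median in place of colMeans. A minimal sketch under the same assumptions (dd has datetime as its last column, all other columns are numeric, and res_med is just an illustrative name):
idx <- ceiling(seq_len(nrow(dd)) / 5)
res_med <- as.data.frame(do.call(rbind, lapply(split(dd[-(ncol(dd))], idx),
                                               function(g) sapply(g, median, na.rm = TRUE))))
res_med$datetime <- dd$datetime[seq(1, nrow(dd), by = 5)]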
You can see your data as a time series. Then, using the xts package, you can use the period.apply function:
library(xts)
dat.xts <- xts(dat[, -ncol(dat)], dat$datetime)
## here I take every minute because I don't have enough data
## I think in your case 5 rows is equal to 5*12 minutes = 1 hour
pts <- endpoints(dat.xts, on = 'mins')
period.apply(dat.xts, pts, mean)
Year Day Hour Min Sec. E1.S1 E1.S2 E1.S3 E1.S4 E1.S5 E1.S6 E1.S7 E1.S8 E2.S1 E2.S2 E2.S3 E2.S4 E2.S5 E2.S6
2000-05-01 00:01:50 2000 122 0 1 44.055 2.6450 1.9860 1.321 2.315 1.65 0.660 3.9700 1.3200 1.650 1.3210 0.3310 1.6550 0.660 0.660
2000-05-01 00:02:38 2000 122 0 2 20.330 2.8125 2.4805 2.648 2.645 2.15 2.148 2.9755 2.8125 1.158 0.4955 1.4855 1.3235 0.661 1.485
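If you want blocks of exactly 5 rows rather than calendar periods, you can also pass your own breakpoints as the INDEX argument of period.apply. A minimal sketch, assuming nrow(dat.xts) is a multiple of 5 (otherwise append nrow(dat.xts) so the last partial block is covered):
pts5 <- c(0, seq(5, nrow(dat.xts), by = 5))
period.apply(dat.xts, pts5, mean)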
EDIT: how to transform the xts object to a data.frame.
To plot your data with ggplot2 you need to coerce your xts object to a data.frame. For example, you can do this:
dat <- data.frame(date = index(dat.xts), coredata(dat.xts))
Then to plot E1.S1 vs date:
library(ggplot2)
ggplot(data = dat) +
  geom_line(aes(x = date, y = E1.S1))