I have a data frame with 1,000,000 rows. I would like to calculate the mean and variance of Tor over time for each SID, to see if I can predict when Tor is starting to go out of limits. The low limit is 0.4 and the high limit is 0.7. Below is a small example of my data.
dat <- structure(list(timestamp = c("29-06-2021-06:00", "29-06-2021-06:01",
"29-06-2021-06:02", "29-06-2021-06:03", "29-06-2021-06:04", "29-06-2021-06:05",
"29-06-2021-06:06", "29-06-2021-06:07", "29-06-2021-06:08", "29-06-2021-06:09",
"29-06-2021-06:10", "29-06-2021-06:11", "29-06-2021-06:12", "29-06-2021-06:13",
"29-06-2021-06:14", "29-06-2021-06:15", "29-06-2021-06:16", "29-06-2021-06:17",
"29-06-2021-06:18", "29-06-2021-06:19", "29-06-2021-06:20", "29-06-2021-06:21",
"29-06-2021-06:22", "29-06-2021-06:23", "29-06-2021-06:24", "29-06-2021-06:25",
"29-06-2021-06:26"), SID = c(301L, 351L, 304L, 357L, 358L, 302L,
303L, 309L, 356L, 304L, 308L, 351L, 304L, 357L, 358L, 302L, 303L,
352L, 307L, 353L, 304L, 308L, 352L, 307L, 304L, 354L, 356L),
Tor = c(0.70161919, 0.639416295, 0.288282073, 0.932362166,
0.368616626, 0.42175565, 0.409735918, 0.942170196, 0.381396521,
0.818102394, 0.659391671, 0.246387978, 0.196001777, 0.632630259,
0.66618385, 0.440625167, 0.639759498, 0.050001835, 0.775660271,
0.762934189, 0.516830196, 0.244674975, 0.38620466, 0.970792903,
0.752674581, 0.190366737, 0.56596405), Lowt = c(0L, 0L, 1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L), Hit = c(1L, 0L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-27L))
head(dat)
# timestamp SID Tor Lowt Hit
#1 29-06-2021-06:00 301 0.7016192 0 1
#2 29-06-2021-06:01 351 0.6394163 0 0
#3 29-06-2021-06:02 304 0.2882821 1 0
#4 29-06-2021-06:03 357 0.9323622 0 1
#5 29-06-2021-06:04 358 0.3686166 1 0
#6 29-06-2021-06:05 302 0.4217556 0 0
timestamp is when the sample is recorded.
SID is the ID of the part taking the reading. These values can be 301 to 310 and 351 to 360.
Tor is the actual reading, and its data type is <dbl>.
Lowt is a binary variable indicating that the Tor reading is below the lower limit.
Hit is a binary variable indicating that the Tor reading is above the upper limit.
I have read up about variance but I can't seem to get my head around it. Any help would be great.
This is a very good question. You want to compute cumulative mean and cumulative variance of Tor (over time) per SID. Given the volume of your actual dataset, it is appropriate to use online algorithms. See my answer and Benjamin's answer on this topic back in 2018 for algorithmic details. In brief, my contribution is:
cummean <- function (x) cumsum(x) / seq_along(x)

cumvar <- function (x, sd = FALSE) {
  ## shift x by a randomly chosen element for numerical stability;
  ## variance is invariant to a constant shift
  x <- x - x[sample.int(length(x), 1)]
  n <- seq_along(x)
  ## one-pass formula: (sum of squares - (sum)^2 / n) / (n - 1)
  v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
  if (sd) v <- sqrt(v)
  v
}
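As a quick sanity check (my own illustrative addition, not part of the linked 2018 answers), the one-pass results can be compared with naive per-prefix computations:
## verify against mean() and var() computed on each prefix of x
set.seed(1)
x <- runif(10)
all.equal(cummean(x), sapply(seq_along(x), function(i) mean(x[1:i])))  # TRUE
all.equal(cumvar(x)[-1], sapply(2:10, function(i) var(x[1:i])))        # TRUE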
The extra work required here is to apply these functions per SID.
## sort data entries (note: timestamp is a character in "DD-MM-YYYY-HH:MM" format;
## if your real data span multiple days, convert to POSIXct first so it sorts chronologically)
sorted_dat <- dat[order(dat$SID, dat$timestamp), ]
## split Tor by SID
lst <- split(sorted_dat$Tor, sorted_dat$SID)
## apply cummean() and cumvar()
runmean <- unlist(lapply(lst, cummean), use.names = FALSE)
runvar <- unlist(lapply(lst, cumvar), use.names = FALSE)
## add back
sorted_dat$runmean <- runmean
sorted_dat$runvar <- runvar
Here is the result. Don't be surprised by the NaN in the variance column: the first value is always NaN within each SID. This is normal, as variance can only be computed once there are at least two data points.
## inspection
sorted_dat
# timestamp SID Tor Lowt Hit runmean runvar
#1 29-06-2021-06:00 301 0.70161919 0 1 0.70161919 NaN
#6 29-06-2021-06:05 302 0.42175565 0 0 0.42175565 NaN
#16 29-06-2021-06:15 302 0.44062517 0 0 0.43119041 0.0001780293
#7 29-06-2021-06:06 303 0.40973592 1 0 0.40973592 NaN
#17 29-06-2021-06:16 303 0.63975950 0 0 0.52474771 0.0264554237
#3 29-06-2021-06:02 304 0.28828207 1 0 0.28828207 NaN
#10 29-06-2021-06:09 304 0.81810239 0 1 0.55319223 0.1403547863
#13 29-06-2021-06:12 304 0.19600178 1 0 0.43412875 0.1127057339
#21 29-06-2021-06:20 304 0.51683020 0 0 0.45480411 0.0768470383
#25 29-06-2021-06:24 304 0.75267458 0 1 0.51437820 0.0753806422
#19 29-06-2021-06:18 307 0.77566027 0 1 0.77566027 NaN
#24 29-06-2021-06:23 307 0.97079290 0 1 0.87322659 0.0190383720
#11 29-06-2021-06:10 308 0.65939167 0 0 0.65939167 NaN
#22 29-06-2021-06:21 308 0.24467497 1 0 0.45203332 0.0859949690
#8 29-06-2021-06:07 309 0.94217020 0 1 0.94217020 NaN
#2 29-06-2021-06:01 351 0.63941629 0 0 0.63941629 NaN
#12 29-06-2021-06:11 351 0.24638798 1 0 0.44290214 0.0772356290
#18 29-06-2021-06:17 352 0.05000184 1 0 0.05000184 NaN
#23 29-06-2021-06:22 352 0.38620466 1 0 0.21810325 0.0565161698
#20 29-06-2021-06:19 353 0.76293419 0 1 0.76293419 NaN
#26 29-06-2021-06:25 354 0.19036674 1 0 0.19036674 NaN
#9 29-06-2021-06:08 356 0.38139652 1 0 0.38139652 NaN
#27 29-06-2021-06:26 356 0.56596405 0 0 0.47368029 0.0170325864
#4 29-06-2021-06:03 357 0.93236217 0 1 0.93236217 NaN
#14 29-06-2021-06:13 357 0.63263026 0 0 0.78249621 0.0449196080
#5 29-06-2021-06:04 358 0.36861663 1 0 0.36861663 NaN
#15 29-06-2021-06:14 358 0.66618385 0 0 0.51740024 0.0442731264
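Finally, to connect this back to your limits (low = 0.4, high = 0.7), here is a minimal sketch, my own illustrative addition, that flags rows whose running mean has left the band:
## flag rows whose running mean has drifted outside the [0.4, 0.7] band
low <- 0.4; high <- 0.7
sorted_dat$drift <- sorted_dat$runmean < low | sorted_dat$runmean > high
sorted_dat[sorted_dat$drift, c("timestamp", "SID", "Tor", "runmean")]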
I have the following social network dataset where participants (ego) were asked who provided social, work, and care support in their lives. Those who provided support (alter) were classified according to their relationship with ego (circle) resulting in the following dataset:
ego alter circle social work care
3400 3403 1 0 0 1
3400 3402 1 0 1 0
3400 3401 1 1 0 0
3500 3504 1 0 0 0
3500 3503 1 0 0 0
3500 3502 1 0 1 1
3500 3501 2 1 0 0
3600 3604 1 0 0 0
3600 3603 3 0 0 1
3600 3602 3 0 1 0
3600 3601 2 1 0 0
3700 3702 1 0 1 1
3700 3703 1 0 0 1
3700 3701 2 1 0 0
…
So, for example, in row 1, alter 3403 of social circle 1 did not provide social or work support but provided care support for ego 3400.
My question for you all is: how can I cross-tabulate the variable circle with each of the support variables (social, work, and care) and then calculate the averages taking each ego into account?
Below is the resulting cross-tabulation with totals and percentages, but I need the averages taking into account each ego.
[Crosstab result (image)]
First, reproducible data using dput():
social <- structure(list(ego = c(3400L, 3400L, 3400L, 3500L, 3500L, 3500L,
3500L, 3600L, 3600L, 3600L, 3600L, 3700L, 3700L, 3700L), alter = c(3403L,
3402L, 3401L, 3504L, 3503L, 3502L, 3501L, 3604L, 3603L, 3602L,
3601L, 3702L, 3703L, 3701L), circle = c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 3L, 3L, 2L, 1L, 1L, 2L), social = c(0L, 0L, 1L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L), work = c(0L, 1L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L), care = c(1L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-14L))
Now, counts,
(tbl.count <- aggregate(cbind(social, work, care)~circle, social, sum))
# circle social work care
# 1 1 1 3 4
# 2 2 3 0 0
# 3 3 0 1 1
and means,
(tbl.mean <- aggregate(cbind(social, work, care)~circle, social, mean))
# circle social work care
# 1 1 0.1111111 0.3333333 0.4444444
# 2 2 1.0000000 0.0000000 0.0000000
# 3 3 0.0000000 0.5000000 0.5000000
and percentages,
(tbl.pct <- aggregate(cbind(social, work, care)~circle, social, function(x) mean(x)*100))
# circle social work care
# 1 1 11.11111 33.33333 44.44444
# 2 2 100.00000 0.00000 0.00000
# 3 3 0.00000 50.00000 50.00000
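These tables average over all alters within each circle. If "the averages taking into account each ego" means first averaging within each ego, then averaging those ego-level means per circle, a possible two-step sketch (my reading of the question, not a confirmed requirement) is:
## step 1: mean per ego within each circle
ego.mean <- aggregate(cbind(social, work, care) ~ ego + circle, social, mean)
## step 2: average the ego-level means per circle
aggregate(cbind(social, work, care) ~ circle, ego.mean, mean)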
This question already has an answer here: Sample without replacement, or duplicates, in R (closed as a duplicate).
There's a very similar question to mine on Stack Overflow (linked above), but it doesn't directly answer my question.
I have abundance data for 250 species across 1000 sites. Species are columns, sites are rows. My abundance data look something like the data in the linked post above.
0 0 3 0 0 201 0 0 0 82
0 23 5 0 0 0 0 0 0 0
9 0 0 0 0 12 0 0 0 913
0 7 91 0 8 0 0 92 9 0
131 12 0 410 0 0 0 3 0 0
If I wanted to sample 50 individuals from each site without replacement, how could I do this?
Focusing on code for a single site for now.
This code:
samples <- sample(1:ncol(abundances), 50, rep = FALSE, prob = abundances[1, ])
doesn't work unless I change to rep = TRUE. However, I need sampling WITHOUT replacement.
I don't want to use sample(abundances[1, ], 50, rep = FALSE), because then instead of sampling individuals it samples species and reports the whole value in that row (e.g. species 6 may occur 201 times at site 1, so it reports 201 rather than 1 individual from that species, resulting in more than 50 individuals in the final subsample).
I essentially want an output identical to what user Dinre answered in the post above, but without it being for bootstrapping. I just want to sample without replacement. This process will ultimately be integrated into a for loop to take a subsample from each site.
Here is a way to sample vector elements from each row with the sum of each sampled row approximately equal to a chosen integer, the size (because the counts are rounded, a row total can be off by one, as in row 4 of the output below). In the code below, n <- 5 is passed as the function argument size. The call to runif adds an element of randomness to the sampling function.
fun <- function(x, size){
  x <- x*runif(length(x))   # random weights proportional to abundance
  y <- size*x/sum(x)        # rescale the weights so they sum to 'size'
  round(y)                  # round to whole individuals
}
set.seed(2021)
n <- 5
t(apply(df1, 1, fun, size = n))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#[1,] 0 0 0 0 0 3 0 0 0 2
#[2,] 0 4 1 0 0 0 0 0 0 0
#[3,] 0 0 0 0 0 0 0 0 0 5
#[4,] 0 0 3 0 0 0 0 1 0 0
#[5,] 0 0 0 5 0 0 0 0 0 0
Data
Here is the question's data in dput format.
df1 <-
structure(c(0L, 0L, 9L, 0L, 131L, 0L, 23L, 0L, 7L, 12L, 3L, 5L,
0L, 91L, 0L, 0L, 0L, 0L, 0L, 410L, 0L, 0L, 0L, 8L, 0L, 201L,
0L, 12L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 92L, 3L, 0L,
0L, 0L, 9L, 0L, 82L, 0L, 913L, 0L, 0L), .Dim = c(5L, 10L), .Dimnames = list(
NULL, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10")))
I have the following data:
mydat=structure(list(group = c(111L, 111L, 111L, 111L, 111L, 111L,
111L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 555L, 555L, 555L,
555L, 555L, 555L, 555L), group2 = c(222L, 222L, 222L, 222L, 222L,
222L, 222L, 444L, 444L, 444L, 444L, 444L, 444L, 444L, 666L, 666L,
666L, 666L, 666L, 666L, 666L), action = c(0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L
), x1 = c(1L, 2L, 3L, 0L, 0L, 1L, 2L, 1L, 2L, 3L, 0L, 0L, 1L,
2L, 1L, 2L, 3L, 10L, 20L, 1L, 2L)), .Names = c("group", "group2",
"action", "x1"), class = "data.frame", row.names = c(NA, -21L
))
There are two group variables here (group and group2).
There are three groups:
111 222
333 444
555 666
The action column can take only the values 0 and 1.
I need to find the groups where, for the 1 category of action, x1 has only zero values.
In our case these are
111 222
333 444
because for every action == 1 row they have zeros in x1.
So I can work only with the 555/666 group, because it has at least one non-zero x1 value for the 1 category of action.
The desired output:
mydat1: groups with at least one non-zero x1 value for the 1 category of action.
group group2 action x1
555 666 0 1
555 666 0 2
555 666 0 3
555 666 1 **10**
555 666 1 **20**
555 666 0 1
555 666 0 2
mydat2: groups where x1 is zero for every action == 1 row.
group group2 action x1
111 222 0 1
111 222 0 2
111 222 0 3
111 222 1 **0**
111 222 1 **0**
111 222 0 1
111 222 0 2
333 444 0 1
333 444 0 2
333 444 0 3
333 444 1 **0**
333 444 1 **0**
333 444 0 1
333 444 0 2
If I understand you correctly, your question is:
I need to find the groups where, for the 1 category of action, x1 has only zero values.
So here is the answer:
library(tidyverse)
mydat %>%
  group_by( action ) %>%
  filter( action==1 & x1==0 )
and the result is:
group group2 action x1
<int> <int> <int> <int>
1 111 222 1 0
2 111 222 1 0
3 333 444 1 0
4 333 444 1 0
What does this code do?
It looks at the action column, which has two categories (0 and 1) across all rows. Then it keeps the observations that satisfy action == 1 & x1 == 0; that is, among the rows with action == 1, those where x1 == 0 also holds.
Can the script return all values of the 555/666 group?
No, it does not return that group, and it should not. Let's write code which filters for 555 and 666:
library(tidyverse)
mydat %>%
  group_by( action ) %>%
  filter( group==555 | group2==666 )
and the result is:
group group2 action x1
<int> <int> <int> <int>
1 555 666 0 1
2 555 666 0 2
3 555 666 0 3
4 555 666 1 10
5 555 666 1 20
6 555 666 0 1
7 555 666 0 2
So, as you can see, none of these observations fulfills the condition action == 1 & x1 == 0. Therefore, they are not part of the result above.
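To produce the two desired tables themselves, here is a sketch of one way (my addition; it assumes every group has at least one action == 1 row):
library(dplyr)
## mydat1: groups with at least one non-zero x1 among their action == 1 rows
mydat1 <- mydat %>%
  group_by( group, group2 ) %>%
  filter( any(x1[action == 1] != 0) ) %>%
  ungroup()
## mydat2: groups where x1 is zero for every action == 1 row
mydat2 <- mydat %>%
  group_by( group, group2 ) %>%
  filter( all(x1[action == 1] == 0) ) %>%
  ungroup()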
Please find here a very small subset of a long data.table I am working with:
dput(dt)
structure(list(id = 1:15, pnum = c(4298390L, 4298390L, 4298390L,
4298558L, 4298558L, 4298559L, 4298559L, 4299026L, 4299026L, 4299026L,
4299026L, 4300436L, 4300436L, 4303566L, 4303566L), invid = c(15L,
101L, 102L, 103L, 104L, 103L, 104L, 106L, 107L, 108L, 109L, 87L,
111L, 2L, 60L), fid = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L), .Label = c("CORN", "DowCor",
"KIM", "Texas"), class = "factor"), dom_kn = c(1L, 0L, 0L, 0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), prim_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), pat_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), net_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), age_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), legclaims = c(5L,
0L, 0L, 2L, 5L, 2L, 5L, 0L, 0L, 0L, 0L, 5L, 0L, 5L, 2L), n_inv = c(3L,
3L, 3L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), .Names = c("id",
"pnum", "invid", "fid", "dom_kn", "prim_kn", "pat_kn", "net_kn",
"age_kn", "legclaims", "n_inv"), class = "data.frame", row.names = c(NA,
-15L))
I am looking to apply a tweaked greater than comparison in 5 different columns.
Within each pnum (patent), there are multiple invid (inventors). I want to compare the values of the columns dom_kn, prim_kn, pat_kn, net_kn, and age_kn per row, to the values in the other rows with the same pnum. The comparison is simply > and if the value is indeed bigger than the other, one "point" should be attributed.
So for the first row pnum == 4298390 and invid == 15, you can see the values in the five columns are all 1, while the values for invid == 101 | 102 are all zero. This means that if we individually compare (is greater than?) each value in the first row to each cell in the second and third row, the total sum would be 10 points. In every single comparison, the value in the first row is bigger and there are 10 comparisons.
The number of comparisons is by design 5 * (n_inv -1).
The result I am looking for for row 1 should then be 10 / 10 = 1.
For pnum == 4298558 the columns net_kn and age_kn both have the value 1 in the two rows (for invid 103 and 104), so each should get 0.5 points (if there were three inventors with value 1, everyone should get 0.33 points). The same goes for pnum == 4298559.
For the next pnum == 4299026 all values are zero, so every comparison should result in 0 points.
Thus note the difference: there are three different dyadic comparisons:
1 > 0 --> assign 1
1 = 1 --> assign 1 / number of positive values in column subset
0 = 0 --> assign 0
Desired result
An extra column result in the data.table with values 1 0 0 0.2 0.8 0.2 0.8 0 0 0 0 1 0 0.8 0.2
Any suggestions on how to compute this efficiently?
Thanks!
library(data.table)
setDT(dt)  # the dput above yields a data.frame; convert it to a data.table

vars = grep('_kn', names(dt), value = T)

# all you need to do is simply assign the correct weight and sum the numbers up
dt[, res := 0]
for (var in vars)
  dt[, res := res + get(var) / .N, by = c('pnum', var)]

# normalize
dt[, res := res/sum(res), by = pnum]
# id pnum invid fid dom_kn prim_kn pat_kn net_kn age_kn legclaims n_inv res
# 1: 1 4298390 15 CORN 1 1 1 1 1 5 3 1.0
# 2: 2 4298390 101 CORN 0 0 0 0 0 0 3 0.0
# 3: 3 4298390 102 CORN 0 0 0 0 0 0 3 0.0
# 4: 4 4298558 103 DowCor 0 0 0 1 1 2 2 0.2
# 5: 5 4298558 104 DowCor 1 1 1 1 1 5 2 0.8
# 6: 6 4298559 103 DowCor 0 0 0 1 1 2 2 0.2
# 7: 7 4298559 104 DowCor 1 1 1 1 1 5 2 0.8
# 8: 8 4299026 106 Texas 0 0 0 0 0 0 4 NaN
# 9: 9 4299026 107 Texas 0 0 0 0 0 0 4 NaN
#10: 10 4299026 108 Texas 0 0 0 0 0 0 4 NaN
#11: 11 4299026 109 Texas 0 0 0 0 0 0 4 NaN
#12: 12 4300436 87 KIM 1 1 1 1 1 5 2 1.0
#13: 13 4300436 111 KIM 0 0 0 0 0 0 2 0.0
#14: 14 4303566 2 DowCor 1 1 1 1 1 5 2 0.8
#15: 15 4303566 60 DowCor 1 0 0 1 0 2 2 0.2
Dealing with the above NaN case (arguably the correct answer) is left to the reader.
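For instance, if treating those all-zero groups as 0 is acceptable, one option (my suggestion, not part of the original answer) is:
dt[is.nan(res), res := 0]  # replace NaN for groups where every comparison value is zero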
Here's a fastish solution using dplyr (note: mutate_each() and funs() come from older versions of dplyr; current releases use across() instead):
library(dplyr)
dt %>%
  group_by(pnum) %>% # group by pnum
  mutate_each(funs(. == max(.) & max(.) != 0), ends_with('kn')) %>%
  # give a 1 if the value is the max and not 0, only for the columns ending in 'kn'
  mutate_each(funs(. / sum(.)), ends_with('kn')) %>%
  # correct for multiple maximums
  select(ends_with('kn')) %>%
  # remove all non-kn columns
  do(data.frame(x = rowSums(.[-1]), y = sum(.[-1]))) %>%
  # make a new data frame with x = the row sums for each individual
  # and y the column sums
  mutate(out = x/y)
  # divide by y (we could just use /5 if we always have five columns)
giving your desired output in the column out:
Source: local data frame [15 x 4]
Groups: pnum [6]
pnum x y out
(int) (dbl) (dbl) (dbl)
1 4298390 5 5 1.0
2 4298390 0 5 0.0
3 4298390 0 5 0.0
4 4298558 1 5 0.2
5 4298558 4 5 0.8
6 4298559 1 5 0.2
7 4298559 4 5 0.8
8 4299026 NaN NaN NaN
9 4299026 NaN NaN NaN
10 4299026 NaN NaN NaN
11 4299026 NaN NaN NaN
12 4300436 5 5 1.0
13 4300436 0 5 0.0
14 4303566 4 5 0.8
15 4303566 1 5 0.2
The NaNs come from the groups with no winners; convert them back using, e.g.:
x[is.na(x)] <- 0
If I have a data set laid out like:
Cohort  Food1  Food2  Food3  Food4
-----------------------------------
Group       1      1      2      3
A           1      1      0      1
B           0      0      1      0
C           1      1      0      1
D           0      0      0      1
I want to sum within each row, grouping the food columns into the categories defined by the Group row. That is, Food1 and Food2 are in group 1, Food3 is in group 2, and Food4 is in group 3.
Ideal output something like:
Cohort Group1 Group2 Group3
A 2 0 1
B 0 1 0
C 2 0 1
D 0 0 1
I tried using rowsum()-based functions but had no luck; do I need to use ddply() instead?
Example data from comment:
dat <-
structure(list(species = c("group", "princeps", "bougainvillei",
"hombroni", "lindsayi", "concretus", "galatea", "ellioti", "carolinae",
"hydrocharis"), locust = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L), grasshopper = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
snake = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), fish = c(2L,
1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), frog = c(2L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L), toad = c(2L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L), fruit = c(3L, 0L, 0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L), seed = c(3L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L)), .Names = c("species", "locust", "grasshopper", "snake",
"fish", "frog", "toad", "fruit", "seed"), class = "data.frame", row.names = c(NA,
-10L))
There are most likely more direct approaches, but here is one you can try:
First, create a copy of your data minus the second header row.
dat2 <- dat[-1, ]
melt() and dcast() and so on from the "reshape2" package don't work nicely with duplicated column names, so let's make the column names more "reshape2 appropriate".
## within-group counter so duplicated group labels get unique suffixes
Seq <- ave(as.vector(unlist(dat[1, -1])),
           as.vector(unlist(dat[1, -1])),
           FUN = seq_along)
names(dat2)[-1] <- paste("group", dat[1, 2:ncol(dat)],
                         ".", Seq, sep = "")
melt() the dataset:
library(reshape2)
m.dat2 <- melt(dat2, id.vars="species")
Use the colsplit() function to split the columns correctly.
m.dat2 <- cbind(m.dat2[-2],
                colsplit(m.dat2$variable, "\\.",
                         c("group", "time")))
head(m.dat2)
# species value group time
# 1 princeps 0 group1 1
# 2 bougainvillei 0 group1 1
# 3 hombroni 1 group1 1
# 4 lindsayi 0 group1 1
# 5 concretus 0 group1 1
# 6 galatea 0 group1 1
Proceed with dcast() as usual
dcast(m.dat2, species ~ group, sum)
# species group1 group2 group3
# 1 bougainvillei 0 0 0
# 2 carolinae 1 1 0
# 3 concretus 0 2 2
# 4 ellioti 0 1 0
# 5 galatea 1 1 1
# 6 hombroni 2 1 0
# 7 hydrocharis 0 0 0
# 8 lindsayi 0 1 0
# 9 princeps 0 1 0
Note: Edited because original answer was incorrect.
Update: An easier way in base R
This problem is much more easily solved if you start by transposing your data.
dat3 <- t(dat[-1, -1])        # transpose: foods in rows, species in columns
dat3 <- as.data.frame(dat3)
names(dat3) <- dat[[1]][-1]   # species names as column names
## split the food rows by their group label, sum within each group, transpose back
t(do.call(rbind, lapply(split(dat3, as.numeric(dat[1, -1])), colSums)))
# 1 2 3
# princeps 0 1 0
# bougainvillei 0 0 0
# hombroni 2 1 0
# lindsayi 0 1 0
# concretus 0 2 2
# galatea 1 1 1
# ellioti 0 1 0
# carolinae 1 1 0
# hydrocharis 0 0 0
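Since the question mentioned trying rowsum(), here is one more sketch built directly around it (my addition; it gives the same result as above):
m <- t(as.matrix(dat[-1, -1]))                        # foods in rows, species in columns
grouped <- rowsum(m, group = as.numeric(dat[1, -1]))  # sum the food rows within each group
res <- t(grouped)                                     # species back in rows, one column per group
rownames(res) <- dat$species[-1]
res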
You can do this using base R fairly easily. Here's an example.
First, figure out which animals belong in which group:
groupings <- as.data.frame(table(as.numeric(dat[1, 2:9]), names(dat)[2:9]))
grp1 <- groupings[groupings$Freq == 1 & groupings$Var1 == 1, 2]
grp2 <- groupings[groupings$Freq == 1 & groupings$Var1 == 2, 2]
grp3 <- groupings[groupings$Freq == 1 & groupings$Var1 == 3, 2]
Then, use the groups to do a rowSums() on the correct columns.
dat <- cbind(dat,rowSums(dat[as.character(grp1)]))
dat <- cbind(dat,rowSums(dat[as.character(grp2)]))
dat <- cbind(dat,rowSums(dat[as.character(grp3)]))
Delete the initial row and the intermediate columns:
dat <- dat[-1,-c(2:9)]
Then, just rename things correctly:
row.names(dat) <- NULL
names(dat) <- c("species","group_1","group_2","group_3")
And you ultimately get:
species group_1 group_2 group_3
bougainvillei 0 0 0
carolinae 1 1 0
concretus 0 2 2
ellioti 0 1 0
galatea 1 1 1
hombroni 2 1 0
hydrocharis 0 0 0
lindsayi 0 1 0
princeps 0 1 0
EDITED: Changed sort order to alphabetical, like the other answer.