Using sample() to sample from nested lists in R

I'm looking for a way to use sample() to sample values from different lists based on a value in another column of a data.table. At the moment I'm getting a "recursive indexing failed" error - code and more explanation below:
First set up example data:
library(stats)
library(data.table)
# list of three different nest survival rates
survival <- list(0.91, 0.95, 0.99)
# incubation period
inc.period<-28
# then set up function to use the geometric distribution to generate 3 lists of incubation outcomes based on the nest survivals and incubation period above.
# e.g. less than 28 is a nest failure, 28 is a successful nest.
create.sample <- function(survival) {
  outcome <- rgeom(100, 1 - survival)
  fifelse(outcome > inc.period, inc.period, outcome)
}
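(As an aside, the cap at inc.period can equivalently be written with pmin; a minimal sketch, with the name create.sample2 made up here:)
# same distribution of outcomes, capped at the incubation period via pmin
create.sample2 <- function(survival) {
  pmin(rgeom(100, 1 - survival), inc.period)
}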
# then create list of 100 nest outcomes with 3 different survival values using lapply
inc.outcomes <- lapply(survival,create.sample)
# set up a data.table - each row of data will be a nest.
index <- 1:3
iteration <- 1:20
dt <- CJ(index, iteration)
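A quick sanity check of the table's shape (a sketch): CJ() crosses the 3 index values with the 20 iterations.
nrow(dt)
#[1] 60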
Then I want to make a new column 'inc.period' which samples from the 'inc.outcomes' list using the index column of the dt to select which of the three 'inc.outcomes' lists to sample from (with a different sample for each row of data).
So e.g. when index = 1, the sampled value comes from inc.outcomes[[1]] - which is the low nest survival list, when index = 2 I sample from inc.outcomes[[2]] etc.
The code would look something like this but this doesn't work (I get a recursive indexing failed error):
dt[,inc.period:= sample(inc.outcomes[[index]],nrow(dt),replace = TRUE)]
Any help or advice gratefully received, as well as suggestions for different approaches to this problem. This is for an update to code that runs in a shiny simulation, so quicker options are preferred!

Two problems:
inc.outcomes[[index]] is a problem because index here is a vector of length 60, so you are ultimately trying inc.outcomes[[ c(1,1,...,2,2,...,3,3) ]], which is incorrect: [[-indexing takes either a single index (the common case) or a vector for recursive indexing, one element per level of nesting. For example, list(list(1,2), list(3,4))[[ c(1,2) ]] works with a length-2 index because the list is nested two deep (it means [[1]][[2]]). Since inc.outcomes is only one level deep, the [[ index must have length 1.
This means we need to do this by index, with by=. (And as a consequence, we need to change nrow(dt) to .N, though frankly we should be using .N anyway, even without by=.)
dt[, inc.period := sample(inc.outcomes[[ index[1] ]], .N, replace = TRUE), by = index]
# index iteration inc.period
# <int> <int> <num>
# 1: 1 1 17
# 2: 1 2 17
# 3: 1 3 21
# 4: 1 4 24
# 5: 1 5 3
# 6: 1 6 1
# 7: 1 7 17
# 8: 1 8 0
# 9: 1 9 1
# 10: 1 10 0
# ---
# 51: 3 11 0
# 52: 3 12 0
# 53: 3 13 28
# 54: 3 14 28
# 55: 3 15 9
# 56: 3 16 28
# 57: 3 17 7
# 58: 3 18 28
# 59: 3 19 28
# 60: 3 20 28
My data:
dt <- setDT(structure(list(index = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), iteration = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L)), row.names = c(NA, -60L), class = c("data.table", "data.frame"), sorted = c("index", "iteration")))
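Since the question mentions speed, here is a fully vectorized alternative that avoids by= entirely (a sketch, not part of the original answer; it assumes every element of inc.outcomes has the same length, 100 here):
vals <- unlist(inc.outcomes)  # 300 values: 3 blocks of 100
dt[, inc.period := vals[(index - 1L) * 100L + sample.int(100L, .N, replace = TRUE)]]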

Create unique ID based on repeated IDs [duplicate]

I received some data from a colleague who is working with animal observations recorded in several transects. However, my colleague reused the same four ID codes for identifying each transect: 1, 7, 13 and 19. I would like to replace the repeated IDs with unique IDs. The example data below shows what I want to do: ID_Transect holds the repeated codes and transect_id the desired unique IDs.
Here's the corresponding code:
example_data<-structure(list(ID_Transect = c(1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L,
7L, 7L, 7L, 7L, 13L, 13L, 13L, 13L, 13L, 13L, 19L, 19L, 19L,
19L, 19L, 19L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L, 7L, 7L, 7L, 7L),
transect_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L,
5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L)), class = "data.frame", row.names = c(NA,
-36L))
We can also do
library(data.table)
setDT(example_data)[, transect_id := rleid(ID_Transect)]
You can use data.table rleid -
example_data$transect_id <- data.table::rleid(example_data$ID_Transect)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6
In base R we can use rle -
with(rle(example_data$ID_Transect), rep(seq_along(values), lengths))
Or diff + cumsum -
cumsum(c(TRUE, diff(example_data$ID_Transect) != 0))
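A quick check (a sketch, not part of the original answers) that all three approaches agree on this data:
x1 <- data.table::rleid(example_data$ID_Transect)
x2 <- with(rle(example_data$ID_Transect), rep(seq_along(values), lengths))
x3 <- cumsum(c(TRUE, diff(example_data$ID_Transect) != 0))
identical(x1, x2) && identical(x2, x3)
#[1] TRUE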

Collapsing levels of a variable by another variable with certain conditions

I have the following example data table:
library(data.table)
exdt <- structure(list(domain = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
L1 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 7L, 8L, 8L, 9L, 9L, 10L,
10L, 11L, 12L, 12L, 13L, 13L, 14L, 15L, 15L, 16L, 16L, 17L, 17L,
18L, 18L, 19L, 19L, 20L, 21L, 22L, 22L, 23L, 23L, 23L, 24L, 25L,
25L, 25L, 25L, 26L, 26L, 26L),
L2 = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L,
7L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L,
11L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L,
14L, 14L)),
row.names = c(NA, -51L), class = c("data.table", "data.frame"))
I'd like to create a new variable L2, which is a grouping of two consecutive, unique levels of L1 within levels of domain. However, when I get to the end of a domain, I sometimes have a level of L1 that is stand-alone. In that case, I'd like to merge it with the two unique levels before it. This means that at the end of a domain, I may have merged together 3 consecutive, unique levels of L1 instead of 2 unique levels. The desired output is shown below.
exdt_L2_desired <- structure(list(domain = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
L1 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 7L, 8L, 8L, 9L, 9L, 10L,
10L, 11L, 12L, 12L, 13L, 13L, 14L, 15L, 15L, 16L, 16L, 17L, 17L,
18L, 18L, 19L, 19L, 20L, 21L, 22L, 22L, 23L, 23L, 23L, 24L, 25L,
25L, 25L, 25L, 26L, 26L, 26L),
L2 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L,
6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 11L, 11L,
11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L)), row.names = c(NA, -51L),
class = c("data.table","data.frame"))
domain L1 L2
1 1 1
1 1 1
1 2 1
1 2 1
1 3 1
1 3 1
2 4 2
2 4 2
2 5 2
2 5 2
2 5 2
2 5 2
2 6 3
2 7 3
2 8 4
2 8 4
2 9 4
2 9 4
2 10 5
2 10 5
2 11 5
2 12 6
2 12 6
2 13 6
2 13 6
2 14 7
2 15 7
2 15 7
2 16 8
2 16 8
2 17 8
2 17 8
2 18 9
2 18 9
2 19 9
2 19 9
2 20 10
2 21 10
2 22 11
2 22 11
2 23 11
2 23 11
2 23 11
2 24 12
2 25 12
2 25 12
2 25 12
2 25 12
2 26 12
2 26 12
2 26 12
You can check that this gives the right L2 grouping with:
#Check
exdt_L2_desired[, .(numL1_lev = uniqueN(L1)), by = list(domain,L2)]
domain L2 numL1_lev
1: 1 1 3
2: 2 2 2
3: 2 3 2
4: 2 4 2
5: 2 5 2
6: 2 6 2
7: 2 7 2
8: 2 8 2
9: 2 9 2
10: 2 10 2
11: 2 11 2
12: 2 12 3
As you can see, each level of L2 has 2 or 3 levels of L1. For domain = 1, numL1_lev = 3 because there were only 3 unique L1 values, which were lumped into a single group. For domain = 2, only the last level of L2 has numL1_lev = 3.
Attempt
I tried the following, but I seem to still have trouble getting the stand-alone levels of L1 within a given domain:
exdt_L2 <- exdt[, L2 :=
  exdt[, {
    x <- ceiling(L1/2)  # group 2 consecutive, unique L1 levels by domain
    # if the last set of L1 levels is stand-alone, replace it with the previous group
    if (length(unique(L1[x == x[.N]])) == 1) x[x == x[.N]] <- x[.N] - 1
    x
  }, domain][, rleid(domain, V1)]
]
domain L1 L2
1 1 1
1 1 1
1 2 1
1 2 1
1 3 1
1 3 1
2 4 2
2 4 2
2 5 3
2 5 3
2 5 3
2 5 3
2 6 3
2 7 4
2 8 4
2 8 4
2 9 5
2 9 5
2 10 5
2 10 5
2 11 6
2 12 6
2 12 6
2 13 7
2 13 7
2 14 7
2 15 8
2 15 8
2 16 8
2 16 8
2 17 9
2 17 9
2 18 9
2 18 9
2 19 10
2 19 10
2 20 10
2 21 11
2 22 11
2 22 11
2 23 12
2 23 12
2 23 12
2 24 12
2 25 13
2 25 13
2 25 13
2 25 13
2 26 13
2 26 13
2 26 13
Using just ceiling(L1 / 2) will not work, as it assigns e.g. L1 = 4 and L1 = 5 to different bins even though they should fall into the same L2 bin. Below is an updated version in the same spirit as the OP's attempt, instead using ceiling(rleid(L1) / 2):
library(data.table)
exdt[, L2 := {
  ## modify rleid values
  x <- ceiling(rleid(L1) / 2)
  n <- length(unique(L1))
  ## if n is odd, update the last bin's values
  if (n > 1 && n %% 2 == 1) {
    x[x == x[.N]] <- x[.N] - 1
  }
  x
}, by = "domain"][, L2 := rleid(domain, L2)]
all.equal(exdt, exdt_L2_desired)
#> [1] TRUE
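To see why rleid() matters here, a small illustration (a sketch, not from the original answer): raw L1 values can pair up wrongly under ceiling(L1 / 2), while run IDs always pair consecutive levels:
L1 <- c(4, 4, 5, 5, 6, 7)
ceiling(L1 / 2)                      # 2 2 3 3 3 4: splits the (4, 5) pair
ceiling(data.table::rleid(L1) / 2)   # 1 1 1 1 2 2: pairs (4, 5) and (6, 7)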

`ddply` fails to apply logistic regression (GLM) by group to my dataset

I'm working out the LD50 (lethal dosage) for multiple populations from different experiments using the MASS package. It's simple enough when I subset the data and do one at a time, but I'm getting an error when I use ddply. Essentially I need an LD50 for each population at each temperature.
My data looks somewhat like this:
# dput(d)
d <- structure(list(Pop = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("a", "b", "c"), class = "factor"), Temp = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("high", "low"), class = "factor"),
Dose = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), Dead = c(0L,
11L, 12L, 14L, 2L, 16L, 17L, 7L, 5L, 3L, 17L, 15L, 9L, 20L,
8L, 19L, 7L, 2L, 20L, 14L, 9L, 15L, 1L, 15L), Alive = c(20L,
9L, 8L, 6L, 18L, 4L, 3L, 13L, 15L, 17L, 3L, 5L, 11L, 0L,
12L, 1L, 13L, 18L, 0L, 6L, 11L, 5L, 19L, 5L)), .Names = c("Pop",
"Temp", "Dose", "Dead", "Alive"), class = "data.frame", row.names = c(NA,
-24L))
The following works fine:
d$Mortality <- cbind(d$Alive, d$Dead)
a <- d[d$Pop=="a" & d$Temp=="high",]
library(MASS)
dose.p(glm(Mortality ~ Dose, family="binomial", data=a), p=0.5)[1]
But when I put this into ddply I get the following error:
library(plyr)
d$index <- paste(d$Pop, d$Temp, sep="_")
ddply(d, 'index', function(x) dose.p(glm(Mortality~Dose, family="binomial", data=x), p=0.5)[1])
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
I can get the right LD50 when I use a proportion but can't figure out where I've gone wrong with my approach (and had already written this question).
Perhaps this will amaze you, but if you choose to use the formula
cbind(Alive, Dead) ~ Dose
instead of
Mortality ~ Dose
the problem will be gone.
library(MASS)
library(plyr)
## `d` is as your `dput` result
## a function to apply
f <- function(x) {
  fit <- glm(cbind(Alive, Dead) ~ Dose, family = "binomial", data = x)
  dose.p(fit, p = 0.5)[[1]]
}
## call `ddply`
ddply(d, .(Pop, Temp), f)
# Pop Temp V1
#1 a high 2.6946257
#2 a low 2.1834099
#3 b high 2.5000000
#4 b low 0.4830998
#5 c high 2.2899553
#6 c low 2.5000000
So what happened with Mortality ~ Dose? Let's set .inform = TRUE when calling ddply:
## `d` is as your `dput` result
d$Mortality <- cbind(d$Alive, d$Dead)
## a function to apply
g <- function(x) {
  fit <- glm(Mortality ~ Dose, family = "binomial", data = x)
  dose.p(fit, p = 0.5)[[1]]
}
## call `ddply`
ddply(d, .(Pop, Temp), g, .inform = TRUE)
#Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
#Error: with piece 1:
# Pop Temp Dose Dead Alive Mortality
#1 a high 1 0 20 20
#2 a high 2 11 9 9
#3 a high 3 12 8 8
#4 a high 4 14 6 6
Now we see that the variable Mortality has lost its matrix dimensions, and only the first column (Alive) is retained. For a glm with a binomial family, a single-vector response must be 0/1 binary or a factor with two levels. Since we instead have integers 20, 9, 8, 6, ..., glm will complain
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
There is really no way to fix this issue. I have tried protecting the matrix with I():
d$Mortality <- I(cbind(d$Alive, d$Dead))
but it still ends up with the same failure.
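If you want to side-step plyr's handling of the pieces altogether, a base R split()/sapply() version works with the cbind() formula (a sketch, not part of the original answer; assumes d as in the dput above):
library(MASS)
sapply(split(d, list(d$Pop, d$Temp)), function(x) {
  fit <- glm(cbind(Alive, Dead) ~ Dose, family = "binomial", data = x)
  dose.p(fit, p = 0.5)[[1]]
})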

array manipulation: calculate odds ratios for a layer in a 3-way table

This is a question about array and data frame manipulation and calculation, in the
context of models for log odds in contingency tables. The closest question I've found to this is How can i calculate odds ratio in many table, but mine is more general.
I have a data frame representing a 3-way frequency table, of size 5 (litter) x 2 (treatment) x 3 (deaths).
"Freq" is the frequency in each cell, and deaths is the response variable.
Mice <-
structure(list(litter = c(7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L,
11L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 7L, 7L, 8L,
8L, 9L, 9L, 10L, 10L, 11L, 11L), treatment = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), deaths = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("0", "1",
"2+"), class = "factor"), Freq = c(58L, 75L, 49L, 58L, 33L, 45L,
15L, 39L, 4L, 5L, 11L, 19L, 14L, 17L, 18L, 22L, 13L, 22L, 12L,
15L, 5L, 7L, 10L, 8L, 15L, 10L, 15L, 18L, 17L, 8L)), .Names = c("litter",
"treatment", "deaths", "Freq"), row.names = c(NA, 30L), class = "data.frame")
From this, I want to calculate the log odds for adjacent categories of the last variable (deaths)
and have this value in a data frame with factors litter (5), treatment (2), and contrast (2), as detailed below.
The data can be seen in xtabs() form:
mice.tab <- xtabs(Freq ~ litter + treatment + deaths, data=Mice)
ftable(mice.tab)
deaths 0 1 2+
litter treatment
7 A 58 11 5
B 75 19 7
8 A 49 14 10
B 58 17 8
9 A 33 18 15
B 45 22 10
10 A 15 13 15
B 39 22 18
11 A 4 12 17
B 5 15 8
From this, I want to calculate the (adjacent) log odds of 0 vs. 1 and 1 vs. 2+ deaths, which is easy in
array format,
odds1 <- log(mice.tab[,,1]/mice.tab[,,2]) # contrast 0:1
odds2 <- log(mice.tab[,,2]/mice.tab[,,3]) # contrast 1:2+
odds1
treatment
litter A B
7 1.6625477 1.3730491
8 1.2527630 1.2272297
9 0.6061358 0.7156200
10 0.1431008 0.5725192
11 -1.0986123 -1.0986123
But, for analysis, I want to have these in a data frame, with factors litter, treatment and contrast
and a column, 'logodds' containing the entries in the odds1 and odds2 tables, suitably strung out.
More generally, for an I x J x K table, where the last factor is the response, my desired result
is a data frame of IJ(K-1) rows, with adjacent log odds in a 'logodds' column, and ideally, I'd like
to have a general function to do this.
Note that if T is the 10 x 3 matrix of frequencies shown by ftable(), the calculation is essentially
log(T) %*% matrix(c( 1,  0,
                    -1,  1,
                     0, -1), nrow = 3, byrow = TRUE)
followed by reshaping and labeling. (The nrow = 3 matters: without it, matrix() returns a 6 x 1 matrix and the product is not conformable.)
Can anyone help with this?
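Not a full answer, but one possible shape for such a general function (a sketch; the name adjacent_logodds and its internals are made up here, assuming the response is the last dimension of the table and all cells are positive):
adjacent_logodds <- function(tab) {
  d  <- dim(tab)
  K  <- d[length(d)]
  dn <- dimnames(tab)
  M  <- matrix(log(tab), ncol = K)   # IJ x K, response categories in columns
  C  <- t(-diff(diag(K)))            # K x (K-1) adjacent-category contrasts
  lo <- M %*% C                      # IJ x (K-1) adjacent log odds
  out <- expand.grid(c(dn[-length(dn)],
                       list(contrast = paste(dn[[length(dn)]][-K],
                                             dn[[length(dn)]][-1], sep = ":"))))
  out$logodds <- as.vector(lo)
  out
}
# e.g. adjacent_logodds(mice.tab) gives an IJ(K-1) = 20-row data frame with
# columns litter, treatment, contrast ("0:1", "1:2+") and logodds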

For() loop to ID dates that are between others and calculate a mean value

This is a re-post of "R: For() loop checking if date is between two dates in separate object", changed to incorporate a minimal mock/test data set following the suggestions of Henrik and Metrics. Thanks to them.
I have two large datasets, both containing columns of date/time fields. My first dataset has a single date, the second has two dates. In short, I am trying to find all dates from the first data set that fall between the pair of dates in the second, and then find an average value. In order to provide clarity, I have created a mock minimal data set using plain numbers rather than dates.
The head() of my first mock data set is below – as well as the dput() output. The data is specific to an individual noted by the IndID column.
IndID MockDate RandNumber
1 1 5 1.862084
2 1 3 1.103154
3 1 5 1.373760
4 1 1 1.497397
5 1 1 1.319488
6 1 3 2.120354
actData <- structure(list(IndID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), MockDate = c(5L, 3L, 5L, 1L, 1L, 3L, 4L,
2L, 2L, 5L, 2L, 1L, 5L, 3L, 5L, 3L, 5L, 3L, 5L, 1L, 5L, 3L, 5L,
5L, 2L, 3L, 1L, 4L, 3L, 3L), RandNumber = c(1.862083679, 1.103154127,
1.37376001, 1.497397482, 1.319487885, 2.120353884, 1.895660195,
1.150411874, 2.61036961, 1.99354158, 1.547706758, 1.941501873,
1.739226419, 2.455590044, 2.907382515, 2.110502618, 2.076187012,
2.507527308, 2.167657681, 1.662405916, 2.428807116, 2.04699653,
1.937335768, 1.456518889, 1.948952907, 2.104325112, 2.311519732,
2.092650229, 2.109051215, 2.089144475)), .Names = c("IndID",
"MockDate", "RandNumber"), class = "data.frame", row.names = c(NA,
-30L))
The head() of my 2nd mock data set is below – as well as the dput() output.
IndID StartTime EndTime
1 1 4 5
2 1 7 11
3 1 6 9
4 1 7 9
5 1 6 10
6 1 2 12
clstrData <- structure(list(IndID.1 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), StartTime = c(4L, 7L,
6L, 7L, 6L, 2L, 6L, 4L, 3L, 5L, 2L, 5L, 7L, 3L, 4L, 3L, 2L, 5L,
5L), EndTime = c(5L, 11L, 9L, 9L, 10L, 12L, 8L, 13L, 5L, 13L,
9L, 9L, 17L, 6L, 8L, 6L, 9L, 15L, 7L)), .Names = c("IndID",
"StartTime", "EndTime"), row.names = c(NA, 19L), class = "data.frame")
The second dataset has two number fields representing a start and an end time. As above, these data are also specific to an individual, noted by the IndID column.
I need to average the ‘RandNumber’ from dataset one for all the instances when ‘MockDate’ is between ‘StartTime’ and ‘EndTime’ of the second dataset for each unique IndID. Thus, ‘RandNumber’ values should only be averaged if 1) they are within the ‘StartTime’ and ‘EndTime’ and 2) the IndID for both rows are the same.
I started by creating a function to ID if MockDate is between StartTime and EndTime
is.between <- function(x, a, b) {
  x > a & x < b
}
Testing that function works for a single value
is.between(actData[1,3], clstrData[,2], clstrData[,3])
But I cannot figure out how to loop this over all rows and then find the mean. The beginnings of my for() loop are below.
YesNo <- list()
for (i in 1:nrow(actData)) {
  YesNo[[i]] <- is.between(actData[1,3], clstrData[,2], clstrData[,3])
}
YesNo[[3]]
This for() gives the same result for all rows (actData[1,3] is hard-coded, so i is never used)…
Hope to create...
clstrData$NEWcolum <- mean RandNum for each row.
Thanks, and as always any suggestions are greatly appreciated!
Assuming your machine can handle the data size, you can:
merge the two data frames on the ID, then
group accordingly (i.e., by IndID, StartTime and EndTime), and
compute the mean for those rows where the mock date falls between the start and end times.
Here is some code using data.table
library(data.table)
DT.clstr <- data.table(clstrData, key="IndID")
DT.act <- data.table(actData, key="IndID")
# Adjust to `<=` if needed
ComputedDT <- merge(DT.clstr, DT.act, allow.cartesian = TRUE)[
  MockDate > StartTime & MockDate < EndTime,
  list(Mean = mean(RandNumber)),
  by = list(IndID, StartTime, EndTime)
]
Results
ComputedDT
IndID StartTime EndTime Mean
1: 1 2 12 1.671002
2: 2 4 13 2.176799
3: 2 2 9 2.244702
4: 3 3 6 1.978828
5: 3 4 8 1.940887
6: 3 2 9 2.033104
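With newer data.table versions, a non-equi join avoids the cartesian merge entirely (a sketch, not part of the original answer; assumes data.table >= 1.9.8 and the same strict inequalities):
library(data.table)
act  <- as.data.table(actData)
clst <- as.data.table(clstrData)
clst[, meanAct := act[clst,
                      on = .(IndID, MockDate > StartTime, MockDate < EndTime),
                      mean(RandNumber), by = .EACHI]$V1]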
Thanks to Ricardo Saporta for earlier thoughts.
However, constructing a long conditional in my for() loop was the best option for me, although not as fast as data.table.
Using the data above, the code below is what I ended up constructing.
clstrData$meanAct <- rep(NA, nrow(clstrData))
for (i in 1:nrow(clstrData)) {
  # average RandNumber where the individual matches and MockDate falls in the window
  clstrData$meanAct[i] <- mean(actData$RandNumber[
    actData$IndID == clstrData$IndID[i] &
      is.between(actData$MockDate, clstrData$StartTime[i], clstrData$EndTime[i])
  ])
}
head(clstrData)
tail(clstrData)
Where there is no corresponding value between the start and end times, NaN is produced.
