R function works with some data frames but not others?

I have a data frame that summarises the number of missing and non-missing observations in another data frame, produced using the function described in [1]. I have been asked to test for differences between two treatment arms in the data I have (personally I disagree with the need for, or utility of, doing so, but it's what I've been asked to do). So I've written a small function to do this...
quick.test <- function(x, y){
    chisq <- chisq.test(x = x, y = y)
    fisher <- fisher.test(x = x, y = y)
    results <- cbind(chisq  = chisq$statistic,
                     df     = chisq$parameter,
                     p      = chisq$p.value,
                     fisher = fisher$p.value)
    results
}
I then use apply() to pass the relevant columns to this function as follows...
apply(miss.t1, 1, function(x) quick.test(x[2:3], x[4:5]))
This is fine for the miss.t1 data frame specified above. However, I'm working with time-series data and have three time points I wish to summarise, so I also have miss.t2 and miss.t3 (each summarising the number of present/missing observations at one time point, created in the same manner using the function described in [1]).
miss.t2 fails with the following error...
apply(miss.t2, 1, function(x) quick.test(x[2:3], x[4:5]))
Error in chisq.test(x = x, y = y) :
'x' and 'y' must have at least 2 levels
My initial thought was that one of the columns had a missing value for some reason, but this doesn't appear to be the case...
> describe(miss.t2)
miss.t2
5 Variables 171 Observations
--------------------------------------------------------------------------------
variable
n missing unique
171 0 171
lowest : Abtotal Abyn agg_ment agg_phys All.score
highest: z_pf z_re z_rp z_sf z_vt
--------------------------------------------------------------------------------
nmiss.1
n missing unique Mean
171 0 4 8.649
0 (6, 4%), 8 (9, 5%), 9 (153, 89%), 10 (3, 2%)
--------------------------------------------------------------------------------
npresent.1
n missing unique Mean
171 0 4 9.351
8 (3, 2%), 9 (153, 89%), 10 (9, 5%), 18 (6, 4%)
--------------------------------------------------------------------------------
nmiss.2
n missing unique Mean
171 0 4 10.65
0 (6, 4%), 11 (160, 94%), 12 (4, 2%), 13 (1, 1%)
--------------------------------------------------------------------------------
npresent.2
n missing unique Mean
171 0 4 14.35
12 (1, 1%), 13 (4, 2%), 14 (160, 94%), 25 (6, 4%)
--------------------------------------------------------------------------------
Next I tried subsets of miss.t2 by taking head(miss.t2, n=XX), and this works fine up to row 53 but fails once row 54 is included...
> apply(head(miss.t2, n=53), 1, function(x) quick.test(x[2:3], x[4:5]))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[4,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[4,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
There were 50 or more warnings (use warnings() to see the first 50)
> apply(head(miss.t2, n=54), 1, function(x) quick.test(x[2:3], x[4:5]))
Error in chisq.test(x = x, y = y) :
'x' and 'y' must have at least 2 levels
> miss.t2[54,]
variable nmiss.1 npresent.1 nmiss.2 npresent.2
54 psq 10 8 11 14
> traceback()
5: stop("'x' and 'y' must have at least 2 levels") at #2
4: chisq.test(x = x, y = y) at #2
3: quick.test(x[2:3], x[4:5])
2: FUN(newX[, i], ...)
1: apply(head(miss.t2, n = 54), 1, function(x) quick.test(x[2:3],
x[4:5]))
Similarly, working from the 'bottom' of the data frame, the last 26 rows are processed fine, but not the 27th from last...
> apply(tail(miss.t2, n=26), 1, function(x) quick.test(x[2:3], x[4:5]))
146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[4,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
164 165 166 167 168 169 170 171
[1,] 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1
[4,] 1 1 1 1 1 1 1 1
There were 26 warnings (use warnings() to see them)
> apply(tail(miss.t2, n=27), 1, function(x) quick.test(x[2:3], x[4:5]))
Error in chisq.test(x = x, y = y) :
'x' and 'y' must have at least 2 levels
In addition: Warning message:
In chisq.test(x = x, y = y) : Chi-squared approximation may be incorrect
> miss.t2[118,]
variable nmiss.1 npresent.1 nmiss.2 npresent.2
118 sf16 9 9 11 14
I can't see anything wrong with these two rows that would make them fail, and the traceback() shown above doesn't reveal anything useful (to my mind).
Can anyone offer any suggestions as to why or where things are going wrong?
Many thanks in advance,
Neil
EDIT : Formatted reply to Vincent Zoonekynd ...
I opted for the chisq.test(x = x, y = y) form described in ?chisq.test; using cbind() as you suggest to produce a matrix results in
Error in sum(x) : invalid 'type' (character) of argument.
Adding print statements to show the values and lengths of x and y results in the same error, but shows them as...
> miss.t2.res <- data.frame(t(apply(miss.t2, 1, function(x) quick.test(x[2:3], x[4:5]))))
[1] "Your x is : 9" "Your x is : 9"
[1] 2 ### < Length of x
[1] "Your y is : 11" "Your y is : 14"
[1] 2 ### < Length of y
Error in chisq.test(x = x, y = y) : 'x' and 'y' must have at least 2 levels
EDIT 2: Thanks to Vincent Zoonekynd's pointers, the problem was that chisq.test(x = x, y = y) coerces its two arguments to factors, so when the two counts in x were identical they collapsed into a single factor level (hence the 'must have at least 2 levels' error). The invalid 'type' (character) error above arose because apply() coerces each row of the data frame to a character vector (the first column, variable, is character). The solution was to modify the quick.test() function to coerce the arguments being passed into a numeric matrix, so the function that works is now...
quick.test <- function(x, y){
    chisq <- chisq.test(rbind(as.numeric(x), as.numeric(y)))
    fisher <- fisher.test(rbind(as.numeric(x), as.numeric(y)))
    results <- cbind(chisq  = chisq$statistic,
                     df     = chisq$parameter,
                     p      = chisq$p.value,
                     fisher = fisher$p.value)
    results
}
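To see why this fixes things, here is a minimal reproduction (using the counts from row 118 shown above): with two vectors, chisq.test() coerces each to a factor, and a pair of identical counts collapses to a single level; a single matrix is instead treated as a 2x2 contingency table.
x <- c(9, 9)    # nmiss.1 and npresent.1 from row 118 (sf16)
y <- c(11, 14)
# two-vector form: factor(c(9, 9)) has only one level, hence the error
try(chisq.test(x = x, y = y))
# Error in chisq.test(x = x, y = y) : 'x' and 'y' must have at least 2 levels
# single-matrix form: the two rows form a 2x2 contingency table
chisq.test(rbind(x, y))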
Many thanks for the help & pointers Vincent, very much appreciated.
[1] http://gettinggeneticsdone.blogspot.co.uk/2011/02/summarize-missing-data-for-all.html

The solution, suggested by Vincent Zoonekynd in the comments above, was to modify the quick.test() function and coerce the arguments being passed into a numeric matrix, so the function that works is now...
quick.test <- function(x, y){
    chisq <- chisq.test(rbind(as.numeric(x), as.numeric(y)))
    fisher <- fisher.test(rbind(as.numeric(x), as.numeric(y)))
    results <- cbind(chisq  = chisq$statistic,
                     df     = chisq$parameter,
                     p      = chisq$p.value,
                     fisher = fisher$p.value)
    results
}

Related

Create an index variable for blocks of values

I have a dataframe Data with a grouping variable grp and a binary classification variable classif. For each group in grp, I want to create a result variable indexing the separate blocks of 0s in the classif variable. So far I don't know how to reset the count for each level of the grouping variable, and I can't find a way to create the index only for blocks of 0s (ignoring the 1s).
Example data:
grp <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
classif <- c(0,1,0,0,1,0,0,1,1,0,0,0,0,1,0,1,1,1,0,0,1,1,0,0,0,1,0,1,0)
result <- c(1,0,2,2,0,3,3,0,0,1,1,1,1,0,2,0,0,0,3,3,0,0,1,1,1,0,2,0,3)
wrong_result <- c(1,2,3,3,4,5,5,1,1,2,2,2,2,3,4,5,5,5,6,6,1,1,2,2,2,3,4,5,6)
Data <- data.frame(grp,classif,result, wrong_result)
I have tried rleid() from data.table, but the following commands produce wrong_result, which is not what I'm after.
data[, wrong_result:= rleid(classif)]
data[, wrong_result:= rleid(classif), by=grp]
With dplyr, use cumsum() and lag() to find the blocks of zeroes per group via the .by argument (make sure you're on dplyr >= 1.1.0, which introduced .by).
library(dplyr)
Data %>%
  mutate(
    result2 = ifelse(
      classif == 0,
      # count up at the first 0 of each block; cumsum() then carries
      # that block number across the remaining 0s of the block
      cumsum(classif == 0 & lag(classif, default = 1) == 1),
      0
    ),
    .by = grp
  )
grp classif result result2
1 1 0 1 1
2 1 1 0 0
3 1 0 2 2
4 1 0 2 2
5 1 1 0 0
6 1 0 3 3
7 1 0 3 3
8 2 1 0 0
9 2 1 0 0
10 2 0 1 1
11 2 0 1 1
12 2 0 1 1
13 2 0 1 1
14 2 1 0 0
15 2 0 2 2
16 2 1 0 0
17 2 1 0 0
18 2 1 0 0
19 2 0 3 3
20 2 0 3 3
21 3 1 0 0
22 3 1 0 0
23 3 0 1 1
24 3 0 1 1
25 3 0 1 1
26 3 1 0 0
27 3 0 2 2
28 3 1 0 0
29 3 0 3 3
Use rle(), sequentially number the runs of 0s it produces, then convert back with inverse.rle() and zero out the runs of 1s. No packages are used.
seq0 <- function(x) {
  r <- rle(x)
  is0 <- r$values == 0
  # number the zero-runs 1, 2, 3, ...
  r$values[is0] <- seq_len(sum(is0))
  # expand back; multiplying by !x zeroes out the original 1s
  inverse.rle(r) * !x
}
transform(Data, result2 = ave(classif, grp, FUN = seq0))
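As a quick sanity check (my own addition, using the example Data above), the base R version reproduces the expected result column:
out <- transform(Data, result2 = ave(classif, grp, FUN = seq0))
identical(out$result, out$result2)
# [1] TRUE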

difference between 2 datasets with map2_df not giving the desired result within a function, but works properly when run alone

I have a function that calculates the difference between two datasets. It works perfectly on a small dataset, but when I tried it on a large dataset it gives an error. What I cannot understand is that when I run every step alone it works, but running the whole function gives an error.
# function
soustraction.j = function(D, R, i, threshold){
  D = as.data.frame(D)
  R = as.data.frame(R)
  dif = purrr::map2_df(D[-1], R[i, -1], `-`)
  dif[dif < 0] = 0
  dif$mismatch = rowSums(dif)
  dif = cbind(ID = D[1], R[1], dif)
  dif = dif[which(dif$mismatch <= threshold), ]
  return(dif)
}
# trying it on a small dataset
# small data sets
#####################################
# data frame for recipients
IDr= c(seq(1,4))
BTR=c("A","B","AB","O")
data_R=data.frame(IDr,BTR,A=rep(0,4),B=c(rep(0,3),1),C=c(rep(1,3),0),D=rep(1,4),E=c(rep(0,2),rep(1,1),0),stringsAsFactors=FALSE)
data_R
IDr BTR A B C D E
1 1 A 0 0 1 1 0
2 2 B 0 0 1 1 0
3 3 AB 0 0 1 1 1
4 4 O 0 1 0 1 0
# data frame for donors
IDd= c(seq(1,8))
BTD= c(rep("A", each=2),rep("B", each=2),rep("AB", each=2),rep("O", each=2))
WD= c(rep(0.25, each=2),rep(0.125, each=2),rep(0.125, each=2),rep(0.5, each=2))
data_D=data.frame(IDd,BTD,A=c(rep(0,6),1,1),B=c(rep(0,6),1,1),C=c(rep(1,7),0),D=rep(1,8),E=c(rep(0,6),rep(1,1),0),WD,stringsAsFactors=FALSE)
data_D
IDd BTD A B C D E WD
1 1 A 0 0 1 1 0 0.250
2 2 A 0 0 1 1 0 0.250
3 3 B 0 0 1 1 0 0.125
4 4 B 0 0 1 1 0 0.125
5 5 AB 0 0 1 1 0 0.125
6 6 AB 0 0 1 1 0 0.125
7 7 O 1 1 1 1 1 0.500
8 8 O 1 1 0 1 0 0.500
# Applying function
soustraction.j(data_D[,c(1, 3:7)],data_R[,c(1, 3:7)],1,3)
IDd IDr A B C D E mismatch
1 1 1 0 0 0 0 0 0
2 2 2 0 0 0 0 0 0
3 3 3 0 0 0 0 0 0
4 4 4 0 0 0 0 0 0
5 5 1 0 0 0 0 0 0
6 6 2 0 0 0 0 0 0
7 7 3 1 1 0 0 1 3
8 8 4 1 1 0 0 0 2
###############################################
###### different datasets #################
######### generating a pool of donor#########
set.seed(1023)
x=t(replicate(1000,rbinom(400, 1, 0.5)))
colnames(x)=paste0("epi", sprintf("%02d", 1:400))
pool1=as.data.frame(x)
duptimes <- c(5,rep(1,10),rep(0,298),rep(2,200),rep(3,100),rep(0,200),rep(1,100),rep(4,50),rep(0,40),10)
# Create an index of the rows you want with duplication
idx <- rep(1:nrow(pool1), duptimes)
# Use that index to generate the new data frame
dupdf <- pool1[idx,]
pool=rbind(pool1,dupdf)
y=runif(2025)
freq.g=y/sum(y)
BTD=replicate(2025,sample(c("A","B","AB","O"), 1, prob = c(0.42, 0.09, 0.03, 0.46)))
pooldup=as.data.frame(cbind(IDd=seq(1:2025),BTD,pool,freq.g))
pooldup[1:5,1:6]
IDd BTD epi01 epi02 epi03 epi04
1 1 A 0 0 1 0
2 2 O 0 1 1 1
3 3 O 1 1 1 1
4 4 AB 1 0 0 0
5 5 A 1 1 1 0
######### generating recipient data#########
set.seed(1024)
x1=t(replicate(20,rbinom(400, 1, 0.5)))
colnames(x1)=paste0("epi", sprintf("%02d", 1:400))
x1r=as.data.frame(x1)
BTR=replicate(20,sample(c("A","B","AB","O"), 1, prob = c(0.42, 0.09, 0.03, 0.46)))
rdata=as.data.frame(cbind(IDr=seq(1:20),BTR,x1r))
rdata[1:5,1:6]
IDr BTR epi01 epi02 epi03 epi04
1 1 B 0 1 0 0
2 2 B 1 1 0 0
3 3 O 1 1 1 1
4 4 A 0 0 0 0
5 5 O 1 1 0 0
# Applying the function
soustraction.j(pooldup[,c(1,3:402)],rdata[,c(1,3:402)],1,75)
# Error in data.frame(..., check.names = FALSE) :
# arguments imply differing number of rows: 2025, 20
When I run the step purrr::map2_df(D[-1], R[i,-1], `-`) separately it works, but not within the function. I apologise if my code is long; I just do not understand where this error is coming from. Thank you for your help.
The error is in the cbind(): dif is a data frame with 2025 rows, while R has only 20 rows.
cbind() will normally recycle values:
cbind(1:2, 1:3)
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
#> [3,] 1 3
#> Warning message:
#> In cbind(1:2, 1:3) :
#> number of rows of result is not a multiple of vector length (arg 1)
However, for data frames, it will error on fractional recycling
x <- data.frame(a = 1:2)
y <- data.frame(a = 1:3)
cbind(x, y)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 2, 3
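For comparison, data frames do recycle when one length is an exact multiple of the other, which is presumably why the small example (8 rows of dif against 4 recipient IDs) ran without complaint:
x <- data.frame(a = 1:2)
z <- data.frame(b = 1:4)
cbind(x, z)  # 2 divides 4, so x$a recycles cleanly
#>   a b
#> 1 1 1
#> 2 2 2
#> 3 1 3
#> 4 2 4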
If you want fractional recycling to work with cbind(), convert to a matrix and set the number of rows:
cbind(matrix(unlist(x), ncol = 1, nrow = 3), matrix(unlist(y), ncol = 1, nrow = 3))
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    1    3
#> Warning message:
#> In matrix(unlist(x), ncol = 1, nrow = 3) :
#>   data length [2] is not a sub-multiple or multiple of the number of rows [3]
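Building on that diagnosis, one possible fix (a sketch of my own, not from the original answer) is to bind the single recipient ID R[i, 1] rather than the whole ID column R[1]: a length-one value recycles cleanly against the 2025 rows of dif, and it matches the fact that every row of dif was computed against recipient i.
soustraction.j = function(D, R, i, threshold){
  D = as.data.frame(D)
  R = as.data.frame(R)
  dif = purrr::map2_df(D[-1], R[i, -1], `-`)
  dif[dif < 0] = 0
  dif$mismatch = rowSums(dif)
  # R[i, 1] has length one, so it recycles to nrow(dif) rows
  dif = cbind(IDd = D[[1]], IDr = R[i, 1], dif)
  dif[dif$mismatch <= threshold, ]
}
Note this also changes the IDr column of the small example, which previously recycled the whole ID vector rather than repeating recipient i.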

Ranges surrounding values in data frame R dplyr

I have a data frame that looks something like this :
test <- data.frame(chunk = c(rep("a",27),rep("b",27)), x = c(1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1))
There is a column by which I would like to group the data using group_by() in dplyr, which in this example is called chunk
I want to add another column to each chunk of test called x1 so the resulting data frame looks like this :
test1 <- data.frame(test, x1 = c(0,0,0,0,0,0,0,1,1,1,1,1,2,2,2,2,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,2,2,2,2,1,1,1,1,1,0,0,0,0,0,0))
x1 identifies all occurrences of 0 in x and also takes a range of +-5 rows in each direction from the ends of the 0s, adding an identifier to each. What the identifier is doesn't matter, but in this example x1 is 1 for the surrounding range and 2 for the occurrences of 0 in x.
Thanks for any and all help!
Here's an option to do it in dplyr:
Shorter version:
n <- 1:5
test %>%
  group_by(chunk) %>%
  mutate(x1 = ifelse((row_number() - min(which(x == 0))) %in% -n |
                       (row_number() - max(which(x == 0))) %in% n,
                     1, ifelse(x == 0, 2, 0)))
Longer (first) version:
test %>%
  group_by(chunk) %>%
  mutate(start = (row_number() - min(which(x == 0))) %in% -5:-1,
         end = (row_number() - max(which(x == 0))) %in% 1:5,
         x1 = ifelse(start | end, 1, ifelse(x == 0, 2, 0))) %>%
  select(-c(start, end))
Source: local data frame [54 x 3]
Groups: chunk
chunk x x1
1 a 1 0
2 a 1 0
3 a 1 0
4 a 1 0
5 a 1 0
6 a 1 0
7 a 1 0
8 a 1 1
9 a 1 1
10 a 1 1
11 a 1 1
12 a 1 1
13 a 0 2
14 a 0 2
15 a 0 2
16 a 0 2
17 a 1 1
18 a 1 1
19 a 1 1
20 a 1 1
21 a 1 1
22 a 1 0
23 a 1 0
24 a 1 0
25 a 1 0
26 a 1 0
27 a 1 0
28 b 1 0
29 b 1 0
30 b 1 0
31 b 1 0
32 b 1 0
33 b 1 0
34 b 1 0
35 b 1 1
36 b 1 1
37 b 1 1
38 b 1 1
39 b 1 1
40 b 0 2
41 b 0 2
42 b 0 2
43 b 0 2
44 b 1 1
45 b 1 1
46 b 1 1
47 b 1 1
48 b 1 1
49 b 1 0
50 b 1 0
51 b 1 0
52 b 1 0
53 b 1 0
54 b 1 0
The assumption in this approach is that each group of chunk contains only one sequence of 0s (as in the sample data). Let me know if that's not the case in your actual data.
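If a chunk could contain more than one separate run of 0s, a sketch along the same lines (my own extension, not part of the original answer) would flag every row within 5 positions of any 0:
library(dplyr)
flag_zero_range <- function(x, pad = 5) {
  out <- ifelse(x == 0, 2, 0)  # the 0s themselves get identifier 2
  zero_pos <- which(x == 0)
  # all positions within `pad` rows of any 0, clipped to the chunk bounds
  near <- unique(unlist(lapply(zero_pos, function(p)
    max(1, p - pad):min(length(x), p + pad))))
  out[setdiff(near, zero_pos)] <- 1  # the surrounding range gets 1
  out
}
test %>%
  group_by(chunk) %>%
  mutate(x1 = flag_zero_range(x)) %>%
  ungroup()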

Creating a matrix with all combinations within a budget

I am attempting to create a matrix that includes all combinations of numbers within a range such that each row sums to a specific value. I am not sure if there is a function for this or if I would need to write one manually. I have tried the combn() function, but it does not constrain the sum, so the matrix gets large very quickly.
Example: rows of three numbers that sum to 5
5,0,0
4,1,0
4,0,1
3,2,0
3,0,2
3,1,1
2,3,0
2,0,3
2,2,1
2,1,2
etc..
These combinatorial objects are called partitions, and their enumeration is implemented by the partitions package.
Depending on what you really want, use one of the following:
library(partitions)
## The first argument says you want to enumerate all partitions in which the
## second argument (5) is broken into three summands, each of which can take a
## maximum value of 5.
blockparts(rep(5,3),5) ## Equiv: blockparts(c(5,5,5), 5)
#
# [1,] 5 4 3 2 1 0 4 3 2 1 0 3 2 1 0 2 1 0 1 0 0
# [2,] 0 1 2 3 4 5 0 1 2 3 4 0 1 2 3 0 1 2 0 1 0
# [3,] 0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
restrictedparts(5,3)
#
# [1,] 5 4 3 3 2
# [2,] 0 1 2 1 2
# [3,] 0 0 0 1 1
Perhaps this does what you want:
x <- expand.grid(replicate(3, list(0:5)))
x[rowSums(x) == 5, ]
# Var1 Var2 Var3
# 6 5 0 0
# 11 4 1 0
# 16 3 2 0
# 21 2 3 0
# 26 1 4 0
# 31 0 5 0
# 41 4 0 1
# 46 3 1 1
# 51 2 2 1
# 56 1 3 1
# 61 0 4 1
# 76 3 0 2
# 81 2 1 2
# 86 1 2 2
# 91 0 3 2
# 111 2 0 3
# 116 1 1 3
# 121 0 2 3
# 146 1 0 4
# 151 0 1 4
# 181 0 0 5
expand.grid and combn are somewhat related, but I find expand.grid to be more applicable to these types of problems.
There is also the permutations function from the "gtools" package:
library(gtools)
x <- permutations(6, 3, v = 0:5, set = FALSE, repeats.allowed=TRUE)
x[rowSums(x) == 5, ]
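As a quick cross-check (my own addition), all three approaches enumerate the same choose(7, 2) = 21 rows, just in different orders:
library(partitions)
library(gtools)
a <- t(blockparts(rep(5, 3), 5))
b <- expand.grid(replicate(3, list(0:5)))
b <- b[rowSums(b) == 5, ]
p <- permutations(6, 3, v = 0:5, set = FALSE, repeats.allowed = TRUE)
p <- p[rowSums(p) == 5, ]
c(nrow(a), nrow(b), nrow(p))
# [1] 21 21 21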

assigning new values based on the location in the sequence

Working in R.
The data tracks changes in brain activity over time. The column mark contains information about when a particular treatment begins and ends. For example, the first condition (mark == 1) begins in row 3 and ends in row 6. The second experimental condition (mark == 2) starts in row 9 and ends in row 12. Another batch of treatment one is repeated between rows 15 and 18.
ob.id <- c(1:20)
mark <- c(0,0,1,0,0,1,0,0,2,0,0,2,0,0,1,0,0,1,0,0)
condition <- c(0,0,1,1,1,1,0,0,2,2,2,2,0,0,1,1,1,1,0,0)
start <- data.frame(ob.id, mark)
result <- data.frame(ob.id, mark, condition)
print(start)
> print (start)
ob.id mark
1 1 0
2 2 0
3 3 1
4 4 0
5 5 0
6 6 1
7 7 0
8 8 0
9 9 2
10 10 0
11 11 0
12 12 2
13 13 0
14 14 0
15 15 1
16 16 0
17 17 0
18 18 1
19 19 0
20 20 0
I need to create a column that would have a dummy variable indicating the membership of an observation in corresponding experimental condition, like this:
> print(result)
ob.id mark condition
1 1 0 0
2 2 0 0
3 3 1 1
4 4 0 1
5 5 0 1
6 6 1 1
7 7 0 0
8 8 0 0
9 9 2 2
10 10 0 2
11 11 0 2
12 12 2 2
13 13 0 0
14 14 0 0
15 15 1 1
16 16 0 1
17 17 0 1
18 18 1 1
19 19 0 0
20 20 0 0
Thanks for your help!
This is a fun little problem. The trick I use below is to first calculate the rle of the mark vector, which makes the problem simpler, as the resulting values vector will always have just one 0 that may or may not need to be replaced (depending on the surrounding values).
# example vector with some edge cases
v = c(0,0,1,0,0,0,1,2,0,0,2,0,0,1,0,0,0,0,1,2,0,2)
v.rle = rle(v)
v.rle
#Run Length Encoding
# lengths: int [1:14] 2 1 3 1 1 2 1 2 1 4 ...
# values : num [1:14] 0 1 0 1 2 0 2 0 1 0 ...
vals = rle(v)$values
# find the 0's that need to be replaced and replace by the previous value
idx = which(tail(head(vals,-1),-1) == 0 & (head(vals,-2) == tail(vals,-2)))
vals[idx + 1] <- vals[idx]
# finally go back to the original vector
v.rle$values = vals
inverse.rle(v.rle)
# [1] 0 0 1 1 1 1 1 2 2 2 2 0 0 1 1 1 1 1 1 2 2 2
Probably the least cumbersome thing to do is to put the above in a function and then apply that to your data.frame vector (as opposed to manipulating the vector explicitly).
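For instance, wrapping those steps into a function (my own packaging of the above) and applying it to start$mark reproduces the desired condition column:
fill_runs <- function(v) {
  r <- rle(v)
  vals <- r$values
  # interior zero-runs whose two neighbours carry the same value
  idx <- which(tail(head(vals, -1), -1) == 0 &
                 head(vals, -2) == tail(vals, -2))
  vals[idx + 1] <- vals[idx]
  r$values <- vals
  inverse.rle(r)
}
fill_runs(start$mark)
# [1] 0 0 1 1 1 1 0 0 2 2 2 2 0 0 1 1 1 1 0 0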
Another approach, based on #SimonO101's observation, involves constructing the right groups from the starting data (run the by part separately, piece by piece, to see how it works):
library(data.table)
dt = data.table(start)
dt[, result := mark[1],
by = {tmp = rep(0, length(mark));
tmp[which(mark != 0)[c(F,T)]] = 1;
cumsum(mark != 0) - tmp}]
dt
# ob.id mark result
# 1: 1 0 0
# 2: 2 0 0
# 3: 3 1 1
# 4: 4 0 1
# 5: 5 0 1
# 6: 6 1 1
# 7: 7 0 0
# 8: 8 0 0
# 9: 9 2 2
#10: 10 0 2
#11: 11 0 2
#12: 12 2 2
#13: 13 0 0
#14: 14 0 0
#15: 15 1 1
#16: 16 0 1
#17: 17 0 1
#18: 18 1 1
#19: 19 0 0
#20: 20 0 0
The latter approach will probably be more flexible.
Here is one way I could think of doing it:
# Find where experiments stop and start
ind <- which( result$mark != 0 )
[1] 3 6 9 12 15 18
# Make a matrix of the start and stop indices taking odd and even elements of the vector
idx <- cbind( head( ind, -1 )[ 1:length(ind) %% 2 == 1 ],
              tail( ind, -1 )[ 1:length(ind) %% 2 == 1 ] )
[,1] [,2]
[1,] 3 6
[2,] 9 12
[3,] 15 18
edit
I realised making the above index matrix would be easier with just taking odd and even elements:
idx <- cbind( ind[ 1:length(ind) %% 2 == 1 ] , ind[ 1:length(ind) %% 2 != 1 ] )
# Make vector of row indices to turn to 1's
ones <- as.vector( apply( idx , 1 , function(x) c( x[1]:x[2] ) ) )
# Make your new column and turn appropriate rows to 1
result$condition <- 0
result$condition[ ones ] <- 1
result
# ob.id mark condition
#1 1 0 0
#2 2 0 0
#3 3 1 1
#4 4 1 1
#5 5 1 1
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 1 1
#10 10 1 1
#11 11 1 1
#12 12 1 1
#13 13 0 0
#14 14 0 0
#15 15 1 1
#16 16 1 1
#17 17 1 1
#18 18 1 1
#19 19 0 0
#20 20 0 0
Edit
#eddi pointed out I needed to put the value of the experiment in, not just a 1. So this is another strategy which uses (gasp!) a for loop. This will only be really detrimental if you have many thousands of experiments (remember to pre-allocate your results vector):
ind <- matrix( which( start$mark != 0 ) , ncol = 2 , byrow = TRUE )
ind <- cbind( ind , start$mark[ ind[ , 1 ] ] )
# [,1] [,2] [,3]
#[1,] 3 6 1
#[2,] 9 12 2
#[3,] 15 18 1
res <- integer( nrow( start ) )
for( i in 1:nrow(ind) ){
  res[ ind[i,1]:ind[i,2] ] <- ind[i,3]
}
res
[1] 0 0 1 1 1 1 0 0 2 2 2 2 0 0 1 1 1 1 0 0
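If the loop ever did become a bottleneck, a loop-free variant (again a sketch of my own) expands each start:end range with sequence() and fills in the values with rep():
lens <- ind[, 2] - ind[, 1] + 1
pos <- sequence(lens) + rep(ind[, 1] - 1, lens)
res2 <- integer(nrow(start))
res2[pos] <- rep(ind[, 3], lens)
res2
# [1] 0 0 1 1 1 1 0 0 2 2 2 2 0 0 1 1 1 1 0 0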
