Assigning new values based on the location in the sequence - R

Working in R.
The data tracks changes in brain activity over time. Column "mark" indicates when a particular treatment begins and ends. For example, the first condition (mark==1) begins in row 3 and ends in row 6. The second experimental condition (mark==2) starts in row 9 and ends in row 12. Another batch of treatment one is repeated between rows 15 and 18.
ob.id <- c(1:20)
mark <- c(0,0,1,0,0,1,0,0,2,0,0,2,0,0,1,0,0,1,0,0)
condition <- c(0,0,1,1,1,1,0,0,2,2,2,2,0,0,1,1,1,1,0,0)
start <- data.frame(ob.id, mark)
result <- data.frame(ob.id, mark, condition)
> print(start)
   ob.id mark
1      1    0
2      2    0
3      3    1
4      4    0
5      5    0
6      6    1
7      7    0
8      8    0
9      9    2
10    10    0
11    11    0
12    12    2
13    13    0
14    14    0
15    15    1
16    16    0
17    17    0
18    18    1
19    19    0
20    20    0
I need to create a column holding a dummy variable that indicates the membership of each observation in the corresponding experimental condition, like this:
> print(result)
   ob.id mark condition
1      1    0         0
2      2    0         0
3      3    1         1
4      4    0         1
5      5    0         1
6      6    1         1
7      7    0         0
8      8    0         0
9      9    2         2
10    10    0         2
11    11    0         2
12    12    2         2
13    13    0         0
14    14    0         0
15    15    1         1
16    16    0         1
17    17    0         1
18    18    1         1
19    19    0         0
20    20    0         0
Thanks for your help!

This is a fun little problem. The trick I use below is to first compute the rle of the mark vector, which simplifies the problem: in the resulting values vector, each gap collapses to a single 0 that may or may not need to be replaced (depending on the values surrounding it).
# example vector with some edge cases
v = c(0,0,1,0,0,0,1,2,0,0,2,0,0,1,0,0,0,0,1,2,0,2)
v.rle = rle(v)
v.rle
#Run Length Encoding
# lengths: int [1:14] 2 1 3 1 1 2 1 2 1 4 ...
# values : num [1:14] 0 1 0 1 2 0 2 0 1 0 ...
vals = v.rle$values
# find the 0's that need to be replaced (those flanked by two equal values)
# and replace each one by the preceding value
idx = which(tail(head(vals, -1), -1) == 0 & (head(vals, -2) == tail(vals, -2)))
vals[idx + 1] <- vals[idx]
# finally go back to the original vector
v.rle$values = vals
inverse.rle(v.rle)
# [1] 0 0 1 1 1 1 1 2 2 2 2 0 0 1 1 1 1 1 1 2 2 2
Probably the least cumbersome thing to do is to put the above in a function and then apply that to your data.frame column (as opposed to manipulating the vector explicitly).
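A minimal sketch of such a function (the name fill_condition is mine, not part of the original answer):
fill_condition <- function(v) {
  v.rle = rle(v)
  vals = v.rle$values
  # zeros flanked by two equal values take on the flanking value
  idx = which(tail(head(vals, -1), -1) == 0 & (head(vals, -2) == tail(vals, -2)))
  vals[idx + 1] <- vals[idx]
  v.rle$values = vals
  inverse.rle(v.rle)
}
start$condition <- fill_condition(start$mark)   # reproduces the desired result column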
Another approach, based on @SimonO101's observation, involves constructing the right groups from the starting data (run the by part separately, piece by piece, to see how it works; a sketch of this follows the output below):
library(data.table)
dt = data.table(start)
dt[, result := mark[1],
   by = {tmp = rep(0, length(mark));
         tmp[which(mark != 0)[c(F, T)]] = 1;
         cumsum(mark != 0) - tmp}]
dt
#     ob.id mark result
#  1:     1    0      0
#  2:     2    0      0
#  3:     3    1      1
#  4:     4    0      1
#  5:     5    0      1
#  6:     6    1      1
#  7:     7    0      0
#  8:     8    0      0
#  9:     9    2      2
# 10:    10    0      2
# 11:    11    0      2
# 12:    12    2      2
# 13:    13    0      0
# 14:    14    0      0
# 15:    15    1      1
# 16:    16    0      1
# 17:    17    0      1
# 18:    18    1      1
# 19:    19    0      0
# 20:    20    0      0
The latter approach will probably be more flexible.
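To see how the by expression builds its groups, it can be run piece by piece on the question's data (the printed vector below is my annotation):
tmp = rep(0, length(start$mark))
tmp[which(start$mark != 0)[c(F, T)]] = 1   # flag every closing mark (rows 6, 12, 18)
cumsum(start$mark != 0) - tmp
# [1] 0 0 1 1 1 1 2 2 3 3 3 3 4 4 5 5 5 5 6 6
Each treatment block (and each gap between blocks) gets its own group id, so mark[1] within each group is the condition value for the whole block.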

Here is one way I could think of doing it:
# Find where experiments stop and start
ind <- which(result$mark != 0)
ind
# [1]  3  6  9 12 15 18
# Make a matrix of the start and stop indices by taking odd and even elements of the vector
idx <- cbind(head(ind, -1)[1:length(ind) %% 2 == 1],
             tail(ind, -1)[1:length(ind) %% 2 == 1])
#      [,1] [,2]
# [1,]    3    6
# [2,]    9   12
# [3,]   15   18
Edit
I realised the above index matrix would be easier to make by just taking the odd and even elements directly:
idx <- cbind(ind[1:length(ind) %% 2 == 1], ind[1:length(ind) %% 2 != 1])
# Make a vector of row indices to turn to 1's
# (apply() simplifies to a vector here only because every start/stop range has
# the same length; with unequal lengths it would return a list)
ones <- as.vector(apply(idx, 1, function(x) c(x[1]:x[2])))
# Make your new column and turn the appropriate rows to 1
result$condition <- 0
result$condition[ones] <- 1
result
#    ob.id mark condition
# 1      1    0         0
# 2      2    0         0
# 3      3    1         1
# 4      4    0         1
# 5      5    0         1
# 6      6    1         1
# 7      7    0         0
# 8      8    0         0
# 9      9    2         1
# 10    10    0         1
# 11    11    0         1
# 12    12    2         1
# 13    13    0         0
# 14    14    0         0
# 15    15    1         1
# 16    16    0         1
# 17    17    0         1
# 18    18    1         1
# 19    19    0         0
# 20    20    0         0
Edit
@eddi pointed out that I needed to put in the value of the experiment, not just a 1. So here is another strategy which uses (gasp!) a for loop. This will only be really detrimental if you have many thousands of experiments (remember to pre-allocate your results vector):
ind <- matrix(which(start$mark != 0), ncol = 2, byrow = TRUE)
ind <- cbind(ind, start$mark[ind[, 1]])
#      [,1] [,2] [,3]
# [1,]    3    6    1
# [2,]    9   12    2
# [3,]   15   18    1
res <- integer(nrow(start))
for (i in 1:nrow(ind)) {
  res[ind[i, 1]:ind[i, 2]] <- ind[i, 3]
}
res
# [1] 0 0 1 1 1 1 0 0 2 2 2 2 0 0 1 1 1 1 0 0
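For reuse across data sets, the same loop can be wrapped in a function (a sketch; the name mark_to_condition is mine):
mark_to_condition <- function(mark) {
  ind <- matrix(which(mark != 0), ncol = 2, byrow = TRUE)
  res <- integer(length(mark))   # pre-allocated result vector
  for (i in 1:nrow(ind)) {
    res[ind[i, 1]:ind[i, 2]] <- mark[ind[i, 1]]
  }
  res
}
start$condition <- mark_to_condition(start$mark)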

Related

Recoding by an order in R

I have a data recoding puzzle. Here is what my sample data looks like:
df <- data.frame(
  id = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
  scores = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
  position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
  cat = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1))
   id scores position cat
1   1      0        1   1
2   1      1        2   1
3   1      1        3   1
4   1      0        4   1
5   1      0        5   1
6   1     -1        6   0
7   1     -1        7   0
8   2      0        1   1
9   2      0        2   1
10  2      1        3   1
11  2     -1        4   0
12  2     -1        5   0
13  2     -1        6   0
14  3      0        1   1
15  3      1        2   1
16  3      0        3   1
17  3      1        4   1
18  3      1        5   1
19  3      0        6   1
20  3      1        7   1
There are three ids in the dataset, and rows are ordered by a position variable. For each id, in the first row where scores becomes -1, scores needs to be recoded to 0 and the cat variable to 1. For example, for id=1 the first such row is at position 6; in that row, scores should be 0 and cat needs to be 1. For those ids that do not have scores=-1, I keep them as they are.
The desired output should look like below:
   id scores position cat
1   1      0        1   1
2   1      1        2   1
3   1      1        3   1
4   1      0        4   1
5   1      0        5   1
6   1      0        6   1
7   1     -1        7   0
8   2      0        1   1
9   2      0        2   1
10  2      1        3   1
11  2      0        4   1
12  2     -1        5   0
13  2     -1        6   0
14  3      0        1   1
15  3      1        2   1
16  3      0        3   1
17  3      1        4   1
18  3      1        5   1
19  3      0        6   1
20  3      1        7   1
Any recommendations? Thanks!
This may be what you are after:
library(dplyr)

df %>%
  group_by(id) %>%
  mutate(i = which(scores == -1)[1]) %>%        # find the first row with scores == -1
  mutate(scores = case_when(position == i & scores != 0 ~ 0,
                            TRUE ~ scores),     # update the score using position & i
         cat = ifelse(scores == -1, 0, 1)) %>%  # then update cat
  select(-i)                                    # remove i
After trying a few things and getting ideas from @Ricky and @e.matt, I came up with a solution.
df %>%
  filter(scores == -1) %>%                # keep cases where scores == -1
  distinct(id, .keep_all = TRUE) %>%      # keep one distinct case per id
  mutate(first = 1) %>%                   # create first column
  right_join(df, by = c("id", "scores", "position", "cat")) %>%  # join back to the original dataset
  mutate(first = coalesce(first, 0)) %>%  # replace NAs with 0
  mutate(scores = case_when(first == 1 ~ 0,
                            TRUE ~ scores)) %>%
  mutate(cat = case_when(first == 1 ~ 1,
                         TRUE ~ cat))
This provides my desired output.
   id scores position cat first
1   1      0        1   1     0
2   1      1        2   1     0
3   1      1        3   1     0
4   1      0        4   1     0
5   1      0        5   1     0
6   1      0        6   1     1
7   1     -1        7   0     0
8   2      0        1   1     0
9   2      0        2   1     0
10  2      1        3   1     0
11  2      0        4   1     1
12  2     -1        5   0     0
13  2     -1        6   0     0
14  3      0        1   1     0
15  3      1        2   1     0
16  3      0        3   1     0
17  3      1        4   1     0
18  3      1        5   1     0
19  3      0        6   1     0
20  3      1        7   1     0
Here is a data.table one-liner:
library(data.table)
setDT(df)
df[df[, .(cumsum(scores == -1) == 1), by = .(id)]$V1, `:=`(scores = 0, cat = 1)]
#     id scores position cat
#  1:  1      0        1   1
#  2:  1      1        2   1
#  3:  1      1        3   1
#  4:  1      0        4   1
#  5:  1      0        5   1
#  6:  1      0        6   1
#  7:  1     -1        7   0
#  8:  2      0        1   1
#  9:  2      0        2   1
# 10:  2      1        3   1
# 11:  2      0        4   1
# 12:  2     -1        5   0
# 13:  2     -1        6   0
# 14:  3      0        1   1
# 15:  3      1        2   1
# 16:  3      0        3   1
# 17:  3      1        4   1
# 18:  3      1        5   1
# 19:  3      0        6   1
# 20:  3      1        7   1
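To see why the one-liner works, the inner grouped expression can be evaluated on its own against the original df (before the update has modified scores); it is TRUE exactly where the running count of -1 scores equals 1, which here, because the -1s are consecutive within each id, is the first -1 row of each id:
df[, .(cumsum(scores == -1) == 1), by = .(id)]$V1
#  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
# [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE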
You could do something along these lines using the dplyr package:
library(dplyr)
df = mutate(df, cat = ifelse(scores == -1, 1, cat),
            scores = ifelse(scores == -1, 0, scores))
Using the mutate() function, I re-assign the values of the scores and cat fields according to ifelse() conditional statements. For scores, if the score is -1 the value is replaced by 0; otherwise the score is kept as is. For cat, the statement also checks whether scores equals -1, but assigns 1 when the condition is met and the existing value of cat when it is not. Note that cat must be updated before scores within the same mutate() call: the arguments are evaluated in order, and updating scores first would overwrite the -1 values the cat expression tests for.
EDIT
After our discussion in the comments, I think something along these lines should be helpful (you may have to modify the logic since I don't exactly follow what the desired output is here):
# loop over all but the last row, so that i + 1 never runs past the data
for (i in 1:(nrow(df) - 1)) {
  # Check if score is -1
  if (df[i, 'scores'] == -1) {
    # Update values for the next row
    df[i + 1, 'scores'] <- 0
    df[i + 1, 'cat'] <- 1
  }
}
Sorry that I don't really follow the desired output, hopefully this is helpful in getting you to your answer!

R: Include starting point in cumsum function

I have this data.frame:
       a b
 [1,]  1 0
 [2,]  2 0
 [3,]  3 0
 [4,]  4 0
 [5,]  5 0
 [6,]  6 1
 [7,]  7 2
 [8,]  8 3
 [9,]  9 4
[10,] 10 5
I want to apply cumsum on column a only where the corresponding value in column b is different from 0.
I tried the following, but it doesn't include a starting condition on the cumsum:
A <- cbind(c(1:10), c(0,0,0,0,0,1,2,3,4,5))
df_cumsum <- cbind(c(1:10), c(0,0,0,0,0,1,2,3,4,5),
                   as.data.frame(ave(A[, 1], A[, 2] != 0, FUN = cumsum)))
Unfortunately, I obtain a cumsum within the b == 0 rows as well:
    a b  c
1   1 0  1
2   2 0  3
3   3 0  6
4   4 0 10
5   5 0 15
6   6 1  6
7   7 2 13
8   8 3 21
9   9 4 30
10 10 5 40
I would like to obtain:
    a b  c
1   1 0  0
2   2 0  0
3   3 0  0
4   4 0  0
5   5 0  0
6   6 1  6
7   7 2 13
8   8 3 21
9   9 4 30
10 10 5 40
Thanks for help!
Assuming the input is df as shown reproducibly in the Note at the end, try this. It zeros out any a value for which b is 0.
transform(df, cum = cumsum((b > 0) * a))
giving:
    a b cum
1   1 0   0
2   2 0   0
3   3 0   0
4   4 0   0
5   5 0   0
6   6 1   6
7   7 2  13
8   8 3  21
9   9 4  30
10 10 5  40
Note
We assume this input shown in reproducible form:
Lines <- "
a b
1 0
2 0
3 0
4 0
5 0
6 1
7 2
8 3
9 4
10 5"
df <- read.table(text = Lines, header = TRUE)
Update
a and b had been reversed. Have fixed.
It would be better to create an index and update:
i1 <- df1$b > 0
df1$c <- 0   # initialise so the rows with b == 0 end up 0 rather than NA
df1$c[i1] <- with(df1, cumsum(a[i1]))
Or in a single line
df1$c <- with(df1, cumsum(a * (b > 0)))
df1$c
#[1] 0 0 0 0 0 6 13 21 30 40
I really like how clean the other answers are using a * (b > 0), but that can sometimes be a bit confusing for newer programmers. As an alternative to that syntax you can use a vectorized ifelse() function.
df <- data.frame(a = c(1:10), b = c(0,0,0,0,0,1,2,3,4,5))
# One way
df$c <- cumsum(ifelse(df$b > 0, df$a, 0))
# Another way
df$d <- with(df, cumsum(ifelse(b > 0, a, 0)))
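As a quick check (my addition), both columns match the desired output:
df$c
# [1]  0  0  0  0  0  6 13 21 30 40
identical(df$c, df$d)
# [1] TRUE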

R data.table update errors

Some more problems I'm having with old data.table code related to this:
R: number rows that match >= other row within group
The data look like this, with a different ID for each person; IDSEQ is the sequence number of each admission for that person, and TAG marks diabetes medication (TAG=1 hypoglycaemic agent, TAG=2 insulin).
   ID IDSEQ TAG
1   1     1   0
2   1     2   0
3   1     3   0
4   1     4   0
5   1     5   0
6   1     6   0
7   1     7   0
8   1     8   1
9   1     9   0
10  1    10   0
11  2     1   0
12  2     2   0
13  2     3   0
14  2     4   1
15  2     5   0
16  2     6   0
17  2     7   0
18  2     8   2
19  2     9   0
20  2    10   0
# recreate this data with
df <- data.frame(ID = c(rep(1,10), rep(2,10)),
                 IDSEQ = c(1:10, 1:10),
                 TAG = c(rep(0,7),1,0,0, 0,0,0,1,0,0,0,2,0,0))
Exercise: Create two new index sequence variables, COND1 using TAG=1 as the index record and COND2 using TAG=2 as the index record. Write your syntax so that records in a block prior to the index record are numbered with a '0'.
a) TAG=1 (seems to still work)
DT <- data.table(df)
setkey(DT, ID)
# counter for condition 1
tmp <- df[which(df$TAG == 1), 1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ = min(IDSEQ)), by = ID]
DT[, COND1 := 0L]
DT[DT[DT1, .I[IDSEQ >= i.IDSEQ]], COND1 := 1:.N, by = ID]
# previously
# DT[DT[DT1, .I[IDSEQ >= i.IDSEQ]]$V1, COND1 := 1:.N, by = ID]
b) TAG=2 no longer gives the correct result; it is not linked on both ID and IDSEQ.
tmp <- df[which(df$TAG == 2), 1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ = min(IDSEQ)), by = ID]
DT[, COND2 := 0L]
DT[DT[DT1, .I[IDSEQ >= i.IDSEQ]], COND2 := 1:.N, by = ID]
# previously worked with
# DT[DT[DT1, .I[IDSEQ >= i.IDSEQ]]$V1, COND2 := 1:.N, by = ID]
The overall result should look like this
   ID IDSEQ TAG COND1 COND2
1   1     1   0     0     0
2   1     2   0     0     0
3   1     3   0     0     0
4   1     4   0     0     0
5   1     5   0     0     0
6   1     6   0     0     0
7   1     7   0     0     0
8   1     8   1     1     0
9   1     9   0     2     0
10  1    10   0     3     0
11  2     1   0     0     0
12  2     2   0     0     0
13  2     3   0     0     0
14  2     4   1     1     0
15  2     5   0     2     0
16  2     6   0     3     0
17  2     7   0     4     0
18  2     8   2     5     1
19  2     9   0     6     2
20  2    10   0     7     3
# recreate this data with
data.frame(ID = c(rep(1,10), rep(2,10)),
           IDSEQ = c(1:10, 1:10),
           TAG = c(rep(0,7),1,0,0, 0,0,0,1,0,0,0,2,0,0),
           COND1 = c(rep(0,7),1,2,3, 0,0,0,1,2,3,4,5,6,7),
           COND2 = c(rep(0,17), 1,2,3))
data.table Version 1.9.4, R version 3.1.1
Here's one way using data.table. With a single index record per ID, the inner cumsum acts as a 0/1 flag for every row at or after that record, and the outer cumsum then numbers the flagged rows sequentially:
dt[, `:=`(count1 = cumsum(cumsum(TAG == 1L)),
          count2 = cumsum(cumsum(TAG == 2L))
          ), by = ID]
#     ID IDSEQ TAG count1 count2
#  1:  1     1   0      0      0
#  2:  1     2   0      0      0
#  3:  1     3   0      0      0
#  4:  1     4   0      0      0
#  5:  1     5   0      0      0
#  6:  1     6   0      0      0
#  7:  1     7   0      0      0
#  8:  1     8   1      1      0
#  9:  1     9   0      2      0
# 10:  1    10   0      3      0
# 11:  2     1   0      0      0
# 12:  2     2   0      0      0
# 13:  2     3   0      0      0
# 14:  2     4   1      1      0
# 15:  2     5   0      2      0
# 16:  2     6   0      3      0
# 17:  2     7   0      4      0
# 18:  2     8   2      5      1
# 19:  2     9   0      6      2
# 20:  2    10   0      7      3
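The double cumsum may look odd at first; the two steps are easier to see on a small vector (my illustration, not from the original answer):
TAG <- c(0, 0, 0, 1, 0, 0)
cumsum(TAG == 1L)            # 0 0 0 1 1 1 -- flags each row at/after the index record
cumsum(cumsum(TAG == 1L))    # 0 0 0 1 2 3 -- numbers the flagged rows sequentially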
The corrected last line of your example:
DT[DT[DT1, .I[IDSEQ >= i.IDSEQ], by=.EACHI]$V1, COND2:=1:.N, by=ID]
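With the TAG==2 DT1 built above, the corrected join now returns row numbers relative to DT itself rather than to the join subset (a sketch of the check):
DT[DT1, .I[IDSEQ >= i.IDSEQ], by = .EACHI]$V1
# [1] 18 19 20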
Alternatively, you could also alter the default behavior, although I wouldn't recommend it due to compatibility issues.
options(datatable.old.bywithoutby=TRUE)
Some background information: the selection criterion for TAG==2 results in
DT[DT1, .I[IDSEQ >= i.IDSEQ]]
# [1] 8 9 10
which are the row numbers for ID==2 relative to the subset created by the join, not relative to DT as a whole, so the subsequent update lands on the wrong rows.
You would experience the same problem if there were, for example, an ID==0 without any TAG==1:
df <- data.frame(ID = c(0, rep(1,10), rep(2,10)),
                 IDSEQ = c(1, 1:10, 1:10),
                 TAG = c(0, rep(0,7),1,0,0, 0,0,0,1,0,0,0,2,0,0))
DT <- data.table(df)
setkey(DT, ID)
# counter for condition 1
tmp <- df[which(df$TAG == 1), 1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ = min(IDSEQ)), by = ID]
DT[, COND1 := 0L]
DT[DT[DT1, .I[IDSEQ >= i.IDSEQ]], COND1 := 1:.N, by = ID]
DT[c(1, 2, 7:10), ]
   ID IDSEQ TAG COND1
1:  0     1   0     0
2:  1     1   0     0
3:  1     6   0     0
4:  1     7   0     1
5:  1     8   1     2
6:  1     9   0     3
data.table handles this situation (by-without-by) differently since version 1.9.4. When loading data.table it states
> library(data.table)
data.table 1.9.4 For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.
Therefore you need to state explicitly that you want your result j (the second part of the statement) computed not just once over the current subset (defined in the first part), but once for each row of the join table i.
More information can be found in the data.table FAQ, 1.13 and 1.14, on pages 5 and 6.

Removing the unordered pairs repeated twice in a file in R

I have a file like this in R.
**0 1**
0 2
**0 3**
0 4
0 5
0 6
0 7
0 8
0 9
0 10
**1 0**
1 11
1 12
1 13
1 14
1 15
1 16
1 17
1 18
1 19
**3 0**
As we can see, there are repeated unordered pairs in this file (the marked pairs), like
1 0
and
0 1
I wish to remove these duplicate pairs. I also want to count how many times each pair occurs and append the count to the row that is kept. If a pair is not repeated, 1 should be written in the third column.
For example (a sample of the output file):
0 1 2
0 2 1
0 3 2
0 4 1
0 5 1
0 6 1
0 7 1
0 8 1
0 9 1
0 10 1
1 11 1
1 12 1
1 13 1
1 14 1
1 15 1
1 16 1
1 17 1
1 18 1
1 19 1
How can I achieve it in R?
Here is a way using transform with pmin and pmax to reorder each pair within its row, and then aggregate to provide a count. (Note that transform evaluates all of its arguments against the original x, so b = pmax(a, b) still sees the original a even though a is redefined in the same call.)
# data
x <- data.frame(a = c(rep(0,10), rep(1,10), 3), b = c(1:10, 0, 11:19, 0))
# logic
aggregate(count ~ a + b, transform(x, a = pmin(a, b), b = pmax(a, b), count = 1), sum)
    a  b count
1   0  1     2
2   0  2     1
3   0  3     2
4   0  4     1
5   0  5     1
6   0  6     1
7   0  7     1
8   0  8     1
9   0  9     1
10  0 10     1
11  1 11     1
12  1 12     1
13  1 13     1
14  1 14     1
15  1 15     1
16  1 16     1
17  1 17     1
18  1 18     1
19  1 19     1
Here's one approach:
First, create a vector of the columns sorted and then pasted together.
x <- apply(mydf, 1, function(x) paste(sort(x), collapse = " "))
Then, use ave to create the counts you are looking for.
mydf$count <- ave(x, x, FUN = length)
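One small caveat (my observation): because x is a character vector here, ave() returns the counts as character strings; wrap the call in as.numeric() if a numeric count column is needed:
mydf$count <- as.numeric(ave(x, x, FUN = length))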
Finally, you can use the "x" vector again, this time to detect and remove duplicated values.
mydf[!duplicated(x), ]
#    V1 V2 count
# 1   0  1     2
# 2   0  2     1
# 3   0  3     2
# 4   0  4     1
# 5   0  5     1
# 6   0  6     1
# 7   0  7     1
# 8   0  8     1
# 9   0  9     1
# 10  0 10     1
# 12  1 11     1
# 13  1 12     1
# 14  1 13     1
# 15  1 14     1
# 16  1 15     1
# 17  1 16     1
# 18  1 17     1
# 19  1 18     1
# 20  1 19     1

Cumulative counter in dataframe R

I have a dataframe with many rows, but the structure looks like this:
year factor
   1      0
   2      0
   3      0
   4      0
   5      0
   6      0
   7      0
   8      0
   9      1
  10      0
  11      0
  12      0
  13      0
  14      0
  15      0
  16      0
  17      1
  18      0
  19      0
  20      0
I need to add a counter as a third column. It should count the cumulative number of cells containing zero, resetting to zero each time the value 1 is encountered. The result should look like this:
year factor count
   1      0     0
   2      0     1
   3      0     2
   4      0     3
   5      0     4
   6      0     5
   7      0     6
   8      0     7
   9      1     0
  10      0     1
  11      0     2
  12      0     3
  13      0     4
  14      0     5
  15      0     6
  16      0     7
  17      1     0
  18      0     1
  19      0     2
  20      0     3
I would be glad to do it in a quick way, avoiding loops, since I have to perform the operation on hundreds of files.
You can copy my dataframe by pasting it in place of the "..." here:
dt <- read.table(text = "...", header = TRUE)
Perhaps a solution like this with ave would work for you:
A <- cumsum(dt$factor)
ave(A, A, FUN = seq_along) - 1
# [1] 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3
Original answer:
(Missed that the first value was supposed to be "0". Oops.)
x <- rle(dt$factor == 1)
y <- sequence(x$lengths)
y[dt$factor == 1] <- 0
y
# [1] 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 0 1 2 3
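Since the question mentions running this over hundreds of files, one option is to wrap the counter in a function and lapply it over the file paths (a sketch; the directory, file pattern, and the column name factor are assumptions based on the question):
count_zeros <- function(dt) {
  A <- cumsum(dt$factor)
  dt$count <- ave(A, A, FUN = seq_along) - 1
  dt
}
files <- list.files("data", pattern = "\\.txt$", full.names = TRUE)
results <- lapply(files, function(f) count_zeros(read.table(f, header = TRUE)))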
