Resetting the TIME column when AMT > 0 in R

I have a data frame that looks like this:
ID TIME AMT
1 0 50
1 1 0
1 2 0
1 3 0
1 4 0
1 4 50
1 5 0
1 7 0
1 9 0
1 10 0
1 10 50
The TIME column in the above data frame is continuous. I want to add another time column that resets time from zero when AMT>0. So, my output data frame should look like this:
ID TIME AMT TIME2
1 0 50 0
1 1 0 1
1 2 0 2
1 3 0 3
1 4 0 4
1 4 50 0
1 5 0 1
1 7 0 3
1 9 0 5
1 10 0 6
1 10 50 0
This is basically achieved by subtracting TIME from a "fixed" reference TIME, namely the TIME of the most recent row with AMT > 0. For example, the reference time for the second AMT > 0 is 4, so TIME2 is calculated as 5 - 4 = 1, 7 - 4 = 3, 9 - 4 = 5, and so on. How can I do this automatically in R?
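For reference, a sketch reconstructing the example data (assuming the data frame is called df; the data.table answer below refers to it as DT):
df <- data.frame(ID = rep(1, 11),
                 TIME = c(0, 1, 2, 3, 4, 4, 5, 7, 9, 10, 10),
                 AMT = c(50, 0, 0, 0, 0, 50, 0, 0, 0, 0, 50))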

A data.table solution:
library(data.table)
setDT(DT)[, TIME2 := TIME - TIME[1], by = cumsum(AMT > 0)]
# ID TIME AMT TIME2
# 1: 1 0 50 0
# 2: 1 1 0 1
# 3: 1 2 0 2
# 4: 1 3 0 3
# 5: 1 4 0 4
# 6: 1 4 50 0
# 7: 1 5 0 1
# 8: 1 7 0 3
# 9: 1 9 0 5
# 10: 1 10 0 6
# 11: 1 10 50 0
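The trick is the grouping expression: cumsum(AMT > 0) increments at every dosing row, so each dosing interval forms its own group and TIME[1] is the dose time within that group. For the example AMT column:
AMT <- c(50, 0, 0, 0, 0, 50, 0, 0, 0, 0, 50)
cumsum(AMT > 0)
# [1] 1 1 1 1 1 2 2 2 2 2 3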

I was originally going to post the same answer as @agstudy, so here is, alternatively, a possible base R solution:
with(df, ave(TIME, cumsum(AMT > 0L), ID, FUN = function(x) x - x[1L]))
## [1] 0 1 2 3 4 0 1 3 5 6 0
Or
library(dplyr)
df %>%
  group_by(cumsum(AMT > 0), ID) %>%
  mutate(TIME2 = TIME - first(TIME))
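A side note on the dplyr variant: grouping by a bare expression materialises it as an extra column in the result. Naming the group (g below is just a hypothetical helper name) makes it easy to drop afterwards:
library(dplyr)
df %>%
  group_by(g = cumsum(AMT > 0), ID) %>%
  mutate(TIME2 = TIME - first(TIME)) %>%
  ungroup() %>%
  select(-g)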

Related

Recoding by an order in r

I have a data recoding puzzle. Here is what my sample data looks like:
df <- data.frame(
  id = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
  scores = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
  position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
  cat = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1))
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 -1 6 0
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 -1 4 0
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
There are three ids in the dataset, and rows are ordered by a position variable. For each id, in the first row where scores becomes -1, the score needs to be 0 and the cat variable needs to be 1. For example, for id=1 that is the row at position 6: its score should become 0 and its cat should become 1. For those ids that do not have scores=-1, I keep them as they are.
The desired output should look like below:
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 0 6 1
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 0 4 1
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
Any recommendations??
Thanks
This may be what you are after
df %>%
  group_by(id) %>%
  mutate(i = which(scores == -1)[1]) %>%            # find the first row where scores == -1
  mutate(scores = case_when(position == i & scores != 0 ~ 0,
                            TRUE ~ scores),         # update the score using position & i
         cat = ifelse(scores == -1, 0, 1)) %>%      # then update cat
  select(-i)                                        # remove the helper column i
(For ids without any -1, such as id 3, which(scores == -1)[1] is NA, so the case_when condition is NA on every row and the default branch leaves scores unchanged.)
After trying a few things and getting ideas from @Ricky and @e.matt, I came up with a solution.
df %>%
  filter(scores == -1) %>%                          # keep rows where scores == -1
  distinct(id, .keep_all = TRUE) %>%                # keep the first such row per id
  mutate(first = 1) %>%                             # flag it in a new column
  right_join(df, by = c("id", "scores", "position", "cat")) %>%  # join back to the original dataset
  mutate(first = coalesce(first, 0)) %>%            # replace NAs with 0
  mutate(scores = case_when(
    first == 1 ~ 0,
    TRUE ~ scores)) %>%
  mutate(cat = case_when(
    first == 1 ~ 1,
    TRUE ~ cat))
This provides my desired output.
id scores position cat first
1 1 0 1 1 0
2 1 1 2 1 0
3 1 1 3 1 0
4 1 0 4 1 0
5 1 0 5 1 0
6 1 0 6 1 1
7 1 -1 7 0 0
8 2 0 1 1 0
9 2 0 2 1 0
10 2 1 3 1 0
11 2 0 4 1 1
12 2 -1 5 0 0
13 2 -1 6 0 0
14 3 0 1 1 0
15 3 1 2 1 0
16 3 0 3 1 0
17 3 1 4 1 0
18 3 1 5 1 0
19 3 0 6 1 0
20 3 1 7 1 0
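One detail: the result above still carries the helper column first. Appending select(-first) as a final step (mirroring the select(-i) in the earlier answer) drops it, so the output matches the desired four columns:
df %>%
  filter(scores == -1) %>%
  distinct(id, .keep_all = TRUE) %>%
  mutate(first = 1) %>%
  right_join(df, by = c("id", "scores", "position", "cat")) %>%
  mutate(first = coalesce(first, 0),
         scores = case_when(first == 1 ~ 0, TRUE ~ scores),
         cat = case_when(first == 1 ~ 1, TRUE ~ cat)) %>%
  select(-first)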
Here is a data.table one-liner
library(data.table)
setDT(df)
df[df[, .(cumsum(scores == -1) == 1), by = .(id)]$V1, `:=`(scores = 0, cat = 1)]
# id scores position cat
# 1: 1 0 1 1
# 2: 1 1 2 1
# 3: 1 1 3 1
# 4: 1 0 4 1
# 5: 1 0 5 1
# 6: 1 0 6 1
# 7: 1 -1 7 0
# 8: 2 0 1 1
# 9: 2 0 2 1
# 10: 2 1 3 1
# 11: 2 0 4 1
# 12: 2 -1 5 0
# 13: 2 -1 6 0
# 14: 3 0 1 1
# 15: 3 1 2 1
# 16: 3 0 3 1
# 17: 3 1 4 1
# 18: 3 1 5 1
# 19: 3 0 6 1
# 20: 3 1 7 1
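The index expression marks every row where the running count of -1 scores is exactly one, i.e. from the first -1 up to (but not including) the second; with this data, where the -1s within an id are contiguous, that is exactly the first -1 row of each id:
scores <- c(0, 1, 1, 0, 0, -1, -1) # id 1
cumsum(scores == -1) == 1
# [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE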
You could do something along these lines using the dplyr package:
library(dplyr)
df <- mutate(df, cat = ifelse(scores == -1, 1, cat),
             scores = ifelse(scores == -1, 0, scores))
Using the mutate() function, I am re-assigning the values for the scores and cat fields according to ifelse() conditional statements. For scores, if the score is -1, the value is replaced by 0, otherwise it keeps the score as is. For cat, it also checks if scores is equal to -1, but would assign a value of 1 when the condition is met, or the already existing value of cat when the condition is not met.
EDIT
After our discussion in the comments, I think something along these lines should be helpful (you may have to modify the logic since I don't exactly follow what the desired output is here):
for (i in seq_len(nrow(df) - 1)) {
  # Check if the score in row i is -1
  if (df[i, 'scores'] == -1) {
    # Update values for the next row (the loop stops before the last row,
    # so df[i + 1, ...] never creates a new row past the end)
    df[i + 1, 'scores'] <- 0
    df[i + 1, 'cat'] <- 1
  }
}
Sorry that I don't really follow the desired output, hopefully this is helpful in getting you to your answer!

Row-wise operation by group over time R

Problem:
I am trying to create a variable x2 that equals 1 for all rows within each ID group where, over time, x1 switches from 1 to 0. Additionally, after the switch, x2 is set to 1 for every consecutive 0 in the run.
I tried to do this with library(dplyr), but could not work out how to look at previous records within the group.
Input Data:
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<-c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1")
df<-data.frame(ID,time,x1)
Required Output:
ID time x1 x2
1 1 0 0
1 2 1 0
1 3 1 0
1 4 1 0
1 5 1 0
2 1 0 0
2 2 0 0
2 3 0 0
2 4 0 0
3 1 1 0
3 2 0 1
3 3 0 1
4 1 1 0
4 2 1 0
5 1 1 0
5 2 0 1
5 3 1 0
It is better to have 'x1' as a numeric column
library(data.table)
setDT(df)[, x2 := (cumsum(x1) < 2)*cumsum(c(FALSE, diff(x1) < 0)), ID]
df
# ID time x1 x2
# 1: 1 1 0 0
# 2: 1 2 1 0
# 3: 1 3 1 0
# 4: 1 4 1 0
# 5: 1 5 1 0
# 6: 2 1 0 0
# 7: 2 2 0 0
# 8: 2 3 0 0
# 9: 2 4 0 0
#10: 3 1 1 0
#11: 3 2 0 1
#12: 3 3 0 1
#13: 4 1 1 0
#14: 4 2 1 0
#15: 5 1 1 0
#16: 5 2 0 1
#17: 5 3 1 0
data
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<- as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
df<-data.frame(ID,time,x1)
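To see why the expression works, evaluate its two factors for id 5 (x1 = 1, 0, 1), which both switches from 1 to 0 and later returns to 1:
x1 <- c(1, 0, 1) # id 5
cumsum(c(FALSE, diff(x1) < 0)) # 0 1 1: counts from the first 1 -> 0 switch onward
cumsum(x1) < 2 # TRUE TRUE FALSE: zeroes rows once x1 returns to 1
(cumsum(x1) < 2) * cumsum(c(FALSE, diff(x1) < 0))
# [1] 0 1 0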
If you want a dplyr answer, you can use @akrun's code in mutate after grouping by ID
library(dplyr)
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<- as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
df<-data.frame(ID,time,x1)
df <- df %>%
  group_by(ID) %>%
  mutate(x2 = (cumsum(x1) < 2) * cumsum(c(FALSE, diff(x1) < 0)))
df
# ID time x1 x2
# 1 1 0 0
# 1 2 1 0
# 1 3 1 0
# 1 4 1 0
# 1 5 1 0
# 2 1 0 0
# 2 2 0 0
# 2 3 0 0
# 2 4 0 0
# 3 1 1 0
# 3 2 0 1
# 3 3 0 1
# 4 1 1 0
# 4 2 1 0
# 5 1 1 0
# 5 2 0 1
# 5 3 1 0

R how to generate a descending sequence by subject measuring the distance from the next uninterrupted series of a given value

I have spent a lot of time trying to figure out how to create a descending sequence that is subject-specific and measures the distance to the next uninterrupted series of a given value in another column. Do you have any suggestions?
Here is an example of the problem:
Given the following data, where the "id" column is the subject unique identifier and the column "dummy" is an attribute
mydata<-data.frame(id=rep(seq(1,3),each=5), dummy=c(0,0,0,1,1,0,0,1,0,1,0,0,0,0,0))
id dummy
1 1 0
2 1 0
3 1 0
4 1 1
5 1 1
6 2 0
7 2 0
8 2 1
9 2 0
10 2 1
11 3 0
12 3 0
13 3 0
14 3 0
15 3 0
Generate a new column measuring the distance to the next uninterrupted series of the value 1 in the "dummy" column (notice: I am treating an individual occurrence of the value 1 as an uninterrupted series). Here is an example of the output:
id dummy output
1 1 0 3
2 1 0 2
3 1 0 1
4 1 1 0
5 1 1 0
6 2 0 2
7 2 0 1
8 2 1 0
9 2 0 1
10 2 1 0
11 3 0 0
12 3 0 0
13 3 0 0
14 3 0 0
15 3 0 0
Thanks,
H
Here's an attempt using the data.table package in two steps.
The first step is to shift the dummy column one position ahead, so that we can afterwards check whether each run of zeros is followed by a one.
The second step is to generate the descending sequences, on the condition that they are runs of zeros followed by a one.
I'm using the shift function from the latest data.table version (v 1.9.6+) for this task, but you can just use indx := c(dummy[-1L], 0L) instead.
library(data.table) # V1.9.6+
setDT(mydata)[, indx := shift(dummy, type = "lead", fill = 0L)]
mydata[, output := .N:1L*(dummy == 0L)*(indx[.N] == 1L), by = .(id, cumsum(dummy == 1L))]
# id dummy indx output
# 1: 1 0 0 3
# 2: 1 0 0 2
# 3: 1 0 1 1
# 4: 1 1 1 0
# 5: 1 1 0 0
# 6: 2 0 0 2
# 7: 2 0 1 1
# 8: 2 1 0 0
# 9: 2 0 1 1
# 10: 2 1 0 0
# 11: 3 0 0 0
# 12: 3 0 0 0
# 13: 3 0 0 0
# 14: 3 0 0 0
# 15: 3 0 0 0
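To make the grouping visible, here is the helper expression for id 2 (dummy = 0, 0, 1, 0, 1). Each chunk between ones becomes its own group, within which .N:1L produces the countdown and (indx[.N] == 1L) zeroes out groups that are not followed by a one:
dummy <- c(0, 0, 1, 0, 1) # id 2
cumsum(dummy == 1L)
# [1] 0 0 1 1 2 -> three groups: rows 1-2, rows 3-4, row 5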
Here is an option with base R. First, for each run of consecutive identical entries in the dummy column (found with rle), we generate a descending sequence of its length:
mydata$output<- unlist(sapply(rle(mydata$dummy)$lengths,function(x) rev(seq(x))))
Then we set the values of the output column to zero for all rows in which dummy is not equal to zero:
mydata$output[mydata$dummy!=0] <- 0
In a last step, we identify the ids whose dummy values are all zero and set their entries in the output column to zero, too (using %in% so this also works if several ids qualify):
mydata[mydata$id %in% which(aggregate(dummy ~ id, mydata, sum)$dummy == 0), ]$output <- 0
#> mydata
# id dummy output
#1 1 0 3
#2 1 0 2
#3 1 0 1
#4 1 1 0
#5 1 1 0
#6 2 0 2
#7 2 0 1
#8 2 1 0
#9 2 0 1
#10 2 1 0
#11 3 0 0
#12 3 0 0
#13 3 0 0
#14 3 0 0
#15 3 0 0
This solution assumes that there are no negative values in the dummy column.
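A quick check of the rle step for id 1 (dummy = 0, 0, 0, 1, 1):
dummy <- c(0, 0, 0, 1, 1) # id 1
rle(dummy)$lengths # 3 2: a run of three 0s, then a run of two 1s
unlist(sapply(rle(dummy)$lengths, function(x) rev(seq(x))))
# [1] 3 2 1 2 1 -> zeroing the dummy != 0 rows then gives 3 2 1 0 0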

Sum rows in a group, starting when a specific value occurs

I want to accumulate the values of a column till the end of the group, though starting the addition when a specific value occurs in another column. I am only interested in the first instance of the specific value within a group. So if that value occurs again within the group, the addition column should continue to add the values. I know this sounds like a rather strange problem, so hopefully the example table makes sense.
The following data frame is what I have now:
> df = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))
> df
group numToAdd occurs
1 1 1 0
2 1 1 0
3 1 3 1
4 1 2 0
5 2 4 0
6 2 2 1
7 2 1 0
8 2 3 0
9 2 2 0
10 3 1 0
11 3 2 1
12 3 1 1
13 4 2 0
14 4 3 0
15 4 2 0
Thus, whenever a 1 occurs within a group, I want a cumulative sum of the values from the column numToAdd, until a new group starts. This would look like the following:
> finalDF = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0),added = c(0,0,3,5,0,2,3,6,8,0,2,3,0,0,0))
> finalDF
group numToAdd occurs added
1 1 1 0 0
2 1 1 0 0
3 1 3 1 3
4 1 2 0 5
5 2 4 0 0
6 2 2 1 2
7 2 1 0 3
8 2 3 0 6
9 2 2 0 8
10 3 1 0 0
11 3 2 1 2
12 3 1 1 3
13 4 2 0 0
14 4 3 0 0
15 4 2 0 0
Thus, the added column is 0 until a 1 occurs within the group, then accumulates the values from numToAdd until it moves to a new group, turning the added column back to 0. In group three, a value of 1 is found a second time, yet the cumulated sum continues. Additionally, in group 4, a value of 1 is never found, thus the value within the added column remains 0.
I've played around with dplyr, but can't get it to work. The following attempt only outputs the total sum, not the growing cumulative sum at each row.
library(dplyr)
df <- df %>%
  mutate(added = ifelse(occurs == 1, cumsum(numToAdd), 0)) %>%
  group_by(group)
Try
df %>%
  group_by(group) %>%
  mutate(added = cumsum(numToAdd * cummax(occurs)))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
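The key is cummax(occurs): it latches to 1 at the first occurrence and stays there for the rest of the group, so numToAdd values before the trigger are multiplied by 0 and excluded from the running sum. For group 1:
occurs <- c(0, 0, 1, 0) # group 1
numToAdd <- c(1, 1, 3, 2)
cummax(occurs) # 0 0 1 1
cumsum(numToAdd * cummax(occurs)) # 0 0 3 5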
Or using data.table
library(data.table) # v1.9.5+
i1 <- setDT(df)[, .I[(rleid(occurs) + (occurs > 0)) > 1], group]$V1
df[, added:=0][i1, added:=cumsum(numToAdd), by = group]
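The index expression above can be unpacked for group 1 (occurs = 0, 0, 1, 0): rleid numbers the runs of occurs, and adding (occurs > 0) pushes the run containing the trigger, and every run after it, above 1:
library(data.table)
occurs <- c(0, 0, 1, 0) # group 1
rleid(occurs) + (occurs > 0)
# [1] 1 1 3 3 -> rows 3 and 4 satisfy > 1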
Or a similar option as in dplyr
setDT(df)[, added := cumsum(numToAdd * cummax(occurs)), by = group]
You can use split-apply-combine in base R with something like:
df$added <- unlist(lapply(split(df, df$group), function(x) {
  y <- rep(0, nrow(x))
  pos <- cumsum(x$occurs) > 0
  y[pos] <- cumsum(x$numToAdd[pos])
  y
}))
df
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
To add another base R approach:
df$added <- unlist(lapply(split(df, df$group), function(x) {
  c(x[, 'occurs'][cumsum(x[, 'occurs']) == 0L],
    cumsum(x[, 'numToAdd'][cumsum(x[, 'occurs']) != 0L]))
}))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Another base R:
df$added <- unlist(lapply(split(df, df$group), function(x) {
  cumsum((cumsum(x$occurs) > 0) * x$numToAdd)
}))

R data.table update errors

Some more problems I'm having with old data.table code related to this:
R: number rows that match >= other row within group
The data look like this: ID is a unique identifier for each person, IDSEQ is the sequence number of each admission for that person, and TAG flags diabetes medication (TAG=1 for a hypoglycaemic agent, TAG=2 for insulin).
ID IDSEQ TAG
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 1
9 1 9 0
10 1 10 0
11 2 1 0
12 2 2 0
13 2 3 0
14 2 4 1
15 2 5 0
16 2 6 0
17 2 7 0
18 2 8 2
19 2 9 0
20 2 10 0
# recreate this data with
df <- data.frame(ID = c(rep(1,10), rep(2,10)),
                 IDSEQ = c(1:10, 1:10),
                 TAG = c(rep(0,7),1,0,0, 0,0,0,1,0,0,0,2,0,0))
Exercise: Create two new index sequence variables: COND1, using TAG=1 as the index record, and COND2, using TAG=2 as the index record. Write your syntax so that records in a block prior to the index record are numbered with a '0'.
a) TAG=1 (seems to still work)
DT <- data.table(df)
setkey(DT, ID)
# counter for condition 1
tmp <- df[which(df$TAG == 1),1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ=min(IDSEQ)), by=ID]
DT[, COND1:=0L]
DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]],COND1:=1:.N,by=ID]
# previously
# DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]]$V1,COND1:=1:.N,by=ID]
b) TAG=2 no longer produces the correct result; it is not joined on both ID and IDSEQ.
tmp <- df[which(df$TAG == 2),1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ=min(IDSEQ)), by=ID]
DT[, COND2:=0L]
DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]],COND2:=1:.N,by=ID]
# previously worked with
# DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]]$V1,COND2:=1:.N,by=ID]
The overall result should look like this
ID IDSEQ TAG COND1 COND2
1 1 1 0 0 0
2 1 2 0 0 0
3 1 3 0 0 0
4 1 4 0 0 0
5 1 5 0 0 0
6 1 6 0 0 0
7 1 7 0 0 0
8 1 8 1 1 0
9 1 9 0 2 0
10 1 10 0 3 0
11 2 1 0 0 0
12 2 2 0 0 0
13 2 3 0 0 0
14 2 4 1 1 0
15 2 5 0 2 0
16 2 6 0 3 0
17 2 7 0 4 0
18 2 8 2 5 1
19 2 9 0 6 2
20 2 10 0 7 3
# recreate this data with
data.frame(ID = c(rep(1,10), rep(2,10)),
           IDSEQ = c(1:10, 1:10),
           TAG = c(rep(0,7),1,0,0, 0,0,0,1,0,0,0,2,0,0),
           COND1 = c(rep(0,7),1,2,3, 0,0,0,1,2,3,4,5,6,7),
           COND2 = c(rep(0,17),1,2,3))
data.table Version 1.9.4, R version 3.1.1
Here's one way using data.table:
# dt is the example data as a data.table, e.g. dt <- as.data.table(df)
dt[, `:=`(count1 = cumsum(cumsum(TAG == 1L)),
          count2 = cumsum(cumsum(TAG == 2L))),
   by = ID]
# ID IDSEQ TAG count1 count2
# 1: 1 1 0 0 0
# 2: 1 2 0 0 0
# 3: 1 3 0 0 0
# 4: 1 4 0 0 0
# 5: 1 5 0 0 0
# 6: 1 6 0 0 0
# 7: 1 7 0 0 0
# 8: 1 8 1 1 0
# 9: 1 9 0 2 0
# 10: 1 10 0 3 0
# 11: 2 1 0 0 0
# 12: 2 2 0 0 0
# 13: 2 3 0 0 0
# 14: 2 4 1 1 0
# 15: 2 5 0 2 0
# 16: 2 6 0 3 0
# 17: 2 7 0 4 0
# 18: 2 8 2 5 1
# 19: 2 9 0 6 2
# 20: 2 10 0 7 3
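The double cumsum works because the inner cumsum(TAG == 1L) is 0 before the index record and 1 from it onward (assuming, as in this example, at most one TAG==1 per ID), so the outer cumsum numbers those rows 1, 2, 3, ...:
TAG <- c(0, 0, 1, 0, 0)
cumsum(TAG == 1L) # 0 0 1 1 1
cumsum(cumsum(TAG == 1L)) # 0 0 1 2 3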
The corrected last line of your example:
DT[DT[DT1, .I[IDSEQ >= i.IDSEQ], by=.EACHI]$V1, COND2:=1:.N, by=ID]
Alternatively, you could also alter the default behavior, although I wouldn't recommend it due to compatibility issues.
options(datatable.old.bywithoutby=TRUE)
Some background information:
The selection criterion for TAG==2 results in
DT[DT1,.I[IDSEQ >= i.IDSEQ]]
[1] 8 9 10
which are the correct indices (line numbers) for ID==2 after the subset/join, but they are relative to the joined subset rather than to the whole table.
You would run into the same problem/result if there were, for example, an ID==0 without any TAG==1.
df <- data.frame(ID = c(0, rep(1,10), rep(2,10)),
                 IDSEQ = c(1, 1:10, 1:10),
                 TAG = c(0, rep(0,7),1,0,0, 0,0,0,1,0,0,0,2,0,0))
DT <- data.table(df)
setkey(DT, ID)
# counter for condition 1
tmp <- df[which(df$TAG == 1),1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ=min(IDSEQ)), by=ID]
DT[, COND1:=0L]
DT[DT[DT1, .I[IDSEQ >= i.IDSEQ]], COND1:=1:.N, by=ID]
DT[c(1,2, 7:10),]
ID IDSEQ TAG COND1
1: 0 1 0 0
2: 1 1 0 0
3: 1 6 0 0
4: 1 7 0 1
5: 1 8 1 2
6: 1 9 0 3
data.table has handled this situation (by-without-by) differently since version 1.9.4. When loading data.table, it states:
> library(data.table)
data.table 1.9.4 For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.
Therefore you need to state explicitly that you want j (the second part of the statement) evaluated not just once for the joined subset (defined in the first part), but grouped by each row of the join, i.e. by every key value.
More information can be found in the data.table FAQ, sections 1.13 and 1.14 on pages 5 and 6.
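For completeness, a minimal self-contained sketch (toy data and names of my own, not from the question) of how by = .EACHI changes the evaluation of j in a join:
library(data.table)
DT <- data.table(ID = rep(1:2, each = 3), IDSEQ = rep(1:3, 2), key = "ID")
DT1 <- data.table(ID = 1:2, IDSEQ = c(2L, 3L))
# j evaluated once over the whole join result:
DT[DT1, .I[IDSEQ >= i.IDSEQ]]
# j evaluated once per row of DT1, i.e. per joined ID:
DT[DT1, .I[IDSEQ >= i.IDSEQ], by = .EACHI]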
