Problem:
I am trying to create a variable x2 that equals 1 for the rows within each ID group where, over time, x1 switches from 1 to 0.
Additionally, x2 stays 1 for every consecutive 0 in the run that follows the switch.
I tried to do this with library(dplyr), but could not figure out how to look at previous records within the group.
Input Data:
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<-c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1")
df<-data.frame(ID,time,x1)
Required Output:
ID time x1 x2
1 1 0 0
1 2 1 0
1 3 1 0
1 4 1 0
1 5 1 0
2 1 0 0
2 2 0 0
2 3 0 0
2 4 0 0
3 1 1 0
3 2 0 1
3 3 0 1
4 1 1 0
4 2 1 0
5 1 1 0
5 2 0 1
5 3 1 0
It is better to have 'x1' as a numeric column:
library(data.table)
setDT(df)[, x2 := (cumsum(x1) < 2)*cumsum(c(FALSE, diff(x1) < 0)), ID]
df
# ID time x1 x2
# 1: 1 1 0 0
# 2: 1 2 1 0
# 3: 1 3 1 0
# 4: 1 4 1 0
# 5: 1 5 1 0
# 6: 2 1 0 0
# 7: 2 2 0 0
# 8: 2 3 0 0
# 9: 2 4 0 0
#10: 3 1 1 0
#11: 3 2 0 1
#12: 3 3 0 1
#13: 4 1 1 0
#14: 4 2 1 0
#15: 5 1 1 0
#16: 5 2 0 1
#17: 5 3 1 0
data
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<- as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
df<-data.frame(ID,time,x1)
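The expression above relies on cumsum(x1) < 2, so it appears to assume that at most one 1 precedes the switch (a group such as 1,1,0 would get x2 = 0 throughout). A minimal base R sketch of a variant without that assumption follows; it takes the question to mean "flag every 0 that comes after a 1 within the same ID", and the names df_alt and flag_after_one are hypothetical, introduced only for illustration.

```r
# Same ID and x1 values as in the question, stored as numbers.
ID <- c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,5,5,5)
x1 <- c(0,1,1,1,1,0,0,0,0,1,0,0,1,1,1,0,1)
df_alt <- data.frame(ID, x1)  # hypothetical name, to avoid overwriting df

# cummax(v) turns 1 once a 1 has been seen in the group,
# so a 0 is flagged only if some earlier row in the group was 1.
flag_after_one <- function(v) as.integer(v == 0 & cummax(v) == 1)

# ave() applies the function per ID and keeps the original row order.
df_alt$x2 <- ave(df_alt$x1, df_alt$ID, FUN = flag_after_one)
df_alt
```

On the sample data this gives the same x2 as the required output; it differs from the one-liner only for groups with more than one 1 before a 0.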
If you want a dplyr answer, you can use @akrun's code in mutate after grouping by ID:
library(dplyr)
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<- as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
df<-data.frame(ID,time,x1)
df <- df %>%
group_by(ID) %>%
mutate(x2 = (cumsum(x1) < 2)*cumsum(c(FALSE, diff(x1) < 0)))
df
# ID time x1 x2
# 1 1 0 0
# 1 2 1 0
# 1 3 1 0
# 1 4 1 0
# 1 5 1 0
# 2 1 0 0
# 2 2 0 0
# 2 3 0 0
# 2 4 0 0
# 3 1 1 0
# 3 2 0 1
# 3 3 0 1
# 4 1 1 0
# 4 2 1 0
# 5 1 1 0
# 5 2 0 1
# 5 3 1 0
Related
My data set contains three variables:
id <- c(1,1,1,1,1,1,2,2,2,2,5,5,5,5,5,5)
ind <- c(0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
price <- c(1,2,3,4,5,6,1,2,3,4,1,2,3,4,5,6)
mdata <- data.frame(id,ind,price)
I need to create a new variable (ind2): if ind=0, then ind2=0.
If ind=1, then ind2=0 as well, unless price is at its maximum within the id, in which case ind2=1.
The new data looks like:
id ind ind2 price
1 0 0 1
1 0 0 2
1 0 0 3
1 0 0 4
1 0 0 5
1 0 0 6
2 1 0 1
2 1 0 2
2 1 0 3
2 1 1 4
5 1 0 1
5 1 0 2
5 1 0 3
5 1 0 4
5 1 0 5
5 1 1 6
library(dplyr)
mdata %>%
group_by(id) %>%
mutate(ind2 = +(ind == 1L & price == max(price)))
# id ind price ind2
# 1 1 0 1 0
# 2 1 0 2 0
# 3 1 0 3 0
# 4 1 0 4 0
# 5 1 0 5 0
# 6 1 0 6 0
# 7 2 1 1 0
# 8 2 1 2 0
# 9 2 1 3 0
# 10 2 1 4 1
# 11 5 1 1 0
# 12 5 1 2 0
# 13 5 1 3 0
# 14 5 1 4 0
# 15 5 1 5 0
# 16 5 1 6 1
Or, if you prefer data.table:
library(data.table)
setDT(mdata)[, ind2 := +(ind == 1L & price == max(price)), by = id]
Or with base R
mdata$ind2 <- unlist(lapply(split(mdata,mdata$id),
function(x) +(x$ind == 1L & x$price == max(x$price))))
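The split/lapply/unlist version returns values in group order, so it lines up with the rows only when mdata is already sorted by id (as it is here). A small sketch of a base R variant that does not depend on row order uses ave(); group_max is a hypothetical helper name.

```r
# Data from the question.
id    <- c(1,1,1,1,1,1,2,2,2,2,5,5,5,5,5,5)
ind   <- c(0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
price <- c(1,2,3,4,5,6,1,2,3,4,1,2,3,4,5,6)
mdata <- data.frame(id, ind, price)

# ave() returns the per-id maximum, aligned with the original rows.
group_max <- ave(mdata$price, mdata$id, FUN = max)

# Unary + converts the logical result to 0/1.
mdata$ind2 <- +(mdata$ind == 1 & mdata$price == group_max)
mdata
```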
I have the following data frame:
i39<-c(5,3,5,4,4,3)
i38<-c(5,3,5,3,4,1)
i37<-c(5,3,5,3,4,3)
i36<-c(5,4,5,5,4,2)
ndat1<-as.data.frame(cbind(i39,i38,i37,i36))
> ndat1
i39 i38 i37 i36
1 5 5 5 5
2 3 3 3 4
3 5 5 5 5
4 4 3 3 5
5 4 4 4 4
6 3 1 3 2
My goal is to convert any value that is a 4 or a 5 into a 1, and anything else into a 0 to yield the following:
> ndat1
i39 i38 i37 i36
1 1 1 1 1
2 0 0 0 1
3 1 1 1 1
4 1 0 0 1
5 1 1 1 1
6 0 0 0 0
With your data set I would just do
ndat1[] <- +(ndat1 >= 4)
# i39 i38 i37 i36
# 1 1 1 1 1
# 2 0 0 0 1
# 3 1 1 1 1
# 4 1 0 0 1
# 5 1 1 1 1
# 6 0 0 0 0
Though a more general solution will be
ndat1[] <- +(ndat1 == 4 | ndat1 == 5)
# i39 i38 i37 i36
# 1 1 1 1 1
# 2 0 0 0 1
# 3 1 1 1 1
# 4 1 0 0 1
# 5 1 1 1 1
# 6 0 0 0 0
A data.table alternative:
library(data.table)
setDT(ndat1)[, names(ndat1) := lapply(.SD, function(x) +(x %in% 4:5))]
And I'll let the dplyr guys have fun with mutate_each
I used the following to solve this issue:
recode<-function(ndat1){
ifelse((as.data.frame(ndat1)==4|as.data.frame(ndat1)==5),1,0)
}
sum_dc1<-as.data.frame(sapply(as.data.frame(ndat1),recode),drop=FALSE)
> sum_dc1
i39 i38 i37 i36
1 1 1 1 1
2 0 0 0 1
3 1 1 1 1
4 1 0 0 1
5 1 1 1 1
6 0 0 0 0
I was just wondering if anyone else had any thoughts, but overall I am satisfied with this way of solving the issue. Thank you.
I have a data frame that looks like this:
ID TIME AMT
1 0 50
1 1 0
1 2 0
1 3 0
1 4 0
1 4 50
1 5 0
1 7 0
1 9 0
1 10 0
1 10 50
The TIME column in the above data frame is continuous. I want to add another time column that resets time from zero when AMT>0. So, my output data frame should look like this:
ID TIME AMT TIME2
1 0 50 0
1 1 0 1
1 2 0 2
1 3 0 3
1 4 0 4
1 4 50 0
1 5 0 1
1 7 0 3
1 9 0 5
1 10 0 6
1 10 50 0
This is basically achieved by subtracting a "fixed" reference TIME (the TIME of the most recent row with AMT>0) from TIME. For example, the reference time for the second AMT>0 is 4, so TIME2 is calculated as 5-4=1, 7-4=3, 9-4=5, etc. How can I do this automatically in R?
A data.table solution :
library(data.table)
setDT(DT)[,TIME2 := TIME-TIME[1],cumsum(AMT>0)]
# ID TIME AMT TIME2
# 1: 1 0 50 0
# 2: 1 1 0 1
# 3: 1 2 0 2
# 4: 1 3 0 3
# 5: 1 4 0 4
# 6: 1 4 50 0
# 7: 1 5 0 1
# 8: 1 7 0 3
# 9: 1 9 0 5
# 10: 1 10 0 6
# 11: 1 10 50 0
I was originally going to post the same answer as @agstudy, so here is a possible base R solution instead:
with(df, ave(TIME, cumsum(AMT > 0L), ID, FUN = function(x) x - x[1L]))
## [1] 0 1 2 3 4 0 1 3 5 6 0
Or
library(dplyr)
df %>%
group_by(cumsum(AMT > 0), ID) %>%
mutate(TIME2 = TIME - first(TIME))
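The answers above assume the data frame already exists as DT or df. For reference, a self-contained base R sketch that rebuilds the sample data from the question and applies the same cumsum(AMT > 0) idea could look like this; the intermediate name interval is hypothetical.

```r
# Sample data typed in from the question.
df <- data.frame(
  ID   = rep(1, 11),
  TIME = c(0, 1, 2, 3, 4, 4, 5, 7, 9, 10, 10),
  AMT  = c(50, 0, 0, 0, 0, 50, 0, 0, 0, 0, 50)
)

# cumsum(AMT > 0) increases by 1 at every row with AMT > 0,
# so it labels the block of rows that follows each such row.
interval <- cumsum(df$AMT > 0)

# Within each ID and block, subtract the first TIME of the block.
df$TIME2 <- ave(df$TIME, df$ID, interval, FUN = function(x) x - x[1])
df
```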
I have data that looks a bit like this:
df <- data.frame(ID=c(rep(1,4),rep(2,2),rep(3,2),4), TYPE=c(1,3,2,4,1,2,2,3,2),
SEQUENCE=c(seq(1,4),1,2,1,2,1))
ID TYPE SEQUENCE
1 1 1
1 3 2
1 2 3
1 4 4
2 1 1
2 2 2
3 2 1
3 3 2
4 2 1
I now need to check whether a certain type is present in each ID block (binary), but only record the answer in the first record per block (SEQUENCE == 1).
The best I came up with so far is coding them in the row they are present in, e.g.
library(data.table)
DT <- data.table(df)
DT$A[DT$TYPE==1] <- 1
DT$B[DT$TYPE==2] <- 1
DT$C[DT$TYPE==3] <- 1
DT$D[DT$TYPE==4] <- 1
DT[is.na(DT)] <- 0
RESULT:
ID TYPE SEQUENCE A B C D
1 1 1 1 0 0 0
1 3 2 0 0 1 0
1 2 3 0 1 0 0
1 4 4 0 0 0 1
2 1 1 1 0 0 0
2 2 2 0 1 0 0
3 2 1 0 1 0 0
3 3 2 0 0 1 0
4 2 1 0 1 0 0
However, the result should look like this:
ID TYPE SEQUENCE A B C D
1 1 1 1 1 1 1
1 3 2 0 0 0 0
1 2 3 0 0 0 0
1 4 4 0 0 0 0
2 1 1 1 1 0 0
2 2 2 0 0 0 0
3 2 1 0 1 1 0
3 3 2 0 0 0 0
4 2 1 0 1 0 0
I assume this can be done with data.table, but I haven't quite found the correct syntax.
This makes one copy of the data.table:
DT[, FAC := factor(TYPE, labels=LETTERS[1:4])]
DT <- dcast.data.table(DT, ID+TYPE+SEQUENCE~FAC, fun.aggregate=length)
DT[,LETTERS[1:4] := lapply(.SD,
function(x) c(any(as.logical(x)), rep(0L, length(x)-1))),
.SDcols=LETTERS[1:4], by=ID]
# ID TYPE SEQUENCE A B C D
#1: 1 1 1 1 1 1 1
#2: 1 2 3 0 0 0 0
#3: 1 3 2 0 0 0 0
#4: 1 4 4 0 0 0 0
#5: 2 1 1 1 1 0 0
#6: 2 2 2 0 0 0 0
#7: 3 2 1 0 1 1 0
#8: 3 3 2 0 0 0 0
#9: 4 2 1 0 1 0 0
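Note that the dcast result is sorted by ID, TYPE and SEQUENCE, so the rows come back in a different order than in the input. As a rough alternative sketch in base R that keeps the original row order and writes the flags into the SEQUENCE == 1 row explicitly, one could loop over the types; the lookup vector cols is a hypothetical helper that maps column names to TYPE values.

```r
# Data from the question.
df <- data.frame(ID = c(rep(1,4), rep(2,2), rep(3,2), 4),
                 TYPE = c(1,3,2,4,1,2,2,3,2),
                 SEQUENCE = c(seq(1,4),1,2,1,2,1))

# Hypothetical lookup: new column name -> TYPE value it stands for.
cols <- c(A = 1, B = 2, C = 3, D = 4)

for (nm in names(cols)) {
  # TRUE on every row of an ID block if the type occurs anywhere in that block.
  present <- ave(df$TYPE == cols[[nm]], df$ID, FUN = any)
  # Keep the flag only on the first record of the block.
  df[[nm]] <- as.integer(present & df$SEQUENCE == 1)
}
df
```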
I have a sample dataframe sample.data as follows:
x y z
1 0 1
1 0 1
1 0 1
1 0 1
1 0 2
1 0 2
1 0 2
1 0 2
1 0 2
0 1 2
I need to find the max and sum of x and y for each category of z (z is like 1,2,...600). I use ddply from plyr for this:
library(plyr)
z.group<-ddply (sample.data,.(z),summarize,max_x=max(x), max_y=max(y), sum_x=sum(x), sum_y=sum(y))
z.group
z max_x max_y sum_x sum_y
1 1 0 4 0
2 1 1 5 1
Now, I need to insert these sum_x, sum_y, max_x, and max_y as the columns of sample.data under the related rows. For example, if max_x is 1 for z=1, then I insert max_x is 1 for all rows with z=1. The expected output is
x y z max_x max_y sum_x sum_y
1 0 1 1 0 4 0
1 0 1 1 0 4 0
1 0 1 1 0 4 0
1 0 1 1 0 4 0
1 0 2 1 1 5 1
1 0 2 1 1 5 1
1 0 2 1 1 5 1
1 0 2 1 1 5 1
1 0 2 1 1 5 1
0 1 2 1 1 5 1
How do I get the expected output?
You can do it directly in one step, using transform:
z.group<-ddply (sample.data,.(z),transform,max_x=max(x), max_y=max(y), sum_x=sum(x), sum_y=sum(y))
> z.group
x y z max_x max_y sum_x sum_y
1 1 0 1 1 0 4 0
2 1 0 1 1 0 4 0
3 1 0 1 1 0 4 0
4 1 0 1 1 0 4 0
5 1 0 2 1 1 5 1
6 1 0 2 1 1 5 1
7 1 0 2 1 1 5 1
8 1 0 2 1 1 5 1
9 1 0 2 1 1 5 1
10 0 1 2 1 1 5 1
I think you can do this with merge:
merge(sample.data, z.group, by="z")
# z x y max_x max_y sum_x sum_y
# 1 1 1 0 1 0 4 0
# 2 1 1 0 1 0 4 0
# 3 1 1 0 1 0 4 0
# 4 1 1 0 1 0 4 0
# 5 2 1 0 1 1 5 1
# 6 2 1 0 1 1 5 1
# 7 2 1 0 1 1 5 1
# 8 2 1 0 1 1 5 1
# 9 2 1 0 1 1 5 1
# 10 2 0 1 1 1 5 1
A data.table alternative:
require(data.table)
dt <- data.table(sample.data, key="z")
dt[, list(x=x, y=y, max_x=max(x), max_y=max(y), sum_x=sum(x), sum_y=sum(y)), by=z]
An even shorter solution (as @agstudy suggested) adds the columns by reference:
dt[, `:=`(max_x=max(x), max_y=max(y), sum_x=sum(x), sum_y=sum(y)), by=z]
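If neither plyr nor data.table is available, the same per-group columns can probably be added with base R's ave(), which returns one value per original row. This is a sketch under the assumption that sample.data holds exactly the ten rows shown in the question.

```r
# Sample data typed in from the question.
sample.data <- data.frame(x = c(rep(1, 9), 0),
                          y = c(rep(0, 9), 1),
                          z = c(rep(1, 4), rep(2, 6)))

# ave() computes the statistic within each z group and repeats it on every row.
sample.data$max_x <- ave(sample.data$x, sample.data$z, FUN = max)
sample.data$max_y <- ave(sample.data$y, sample.data$z, FUN = max)
sample.data$sum_x <- ave(sample.data$x, sample.data$z, FUN = sum)
sample.data$sum_y <- ave(sample.data$y, sample.data$z, FUN = sum)
sample.data
```

Unlike merge, this keeps the original row and column order.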